# NLTK Workshop 
Updated March, 2016

Introductory code for practicing with Python's Natural Language Toolkit (NLTK), much of which is taken from the excellent [NLTK Book](http://www.nltk.org/book/).


Begin by importing NLTK along with all the resources used in the NLTK book. The second import assumes that you have already downloaded the book resources; if you haven't, first run `nltk.download()` and select the "book" collection.

In [2]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Basic Text Data

In [7]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [8]:
len(text1)

260819

In [13]:
len(set(text1))

19317

In [15]:
len(text1) / len(set(text1))

13
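The ratio above is the average number of times each distinct token occurs. The `13` shown is the result of Python 2 integer division; with true division the value is about 13.5. A minimal sketch on a toy token list, written for Python 3 (the helper name `mean_token_frequency` and the `tokens` list are illustrative and not part of NLTK):

```python
def mean_token_frequency(tokens):
    """Average number of occurrences per distinct token (tokens / types)."""
    return len(tokens) / len(set(tokens))  # "/" is true division in Python 3


# Toy data standing in for text1; any list of strings behaves the same way.
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

print(len(tokens), len(set(tokens)))   # number of tokens, number of distinct tokens
print(mean_token_frequency(tokens))
```

The same function should accept `text1` directly, since an NLTK `Text` supports `len()` and iteration.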

In [12]:
text1.tokens[:25]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher']

## Collocations
Find "collocations", that is, word combinations that occur more often than would be expected by chance.

In [5]:
text1.collocations()

Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand


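For more control, use the `nltk.collocations` module directly. It lets you choose the n-gram size (trigrams here), the association measure used for ranking (pointwise mutual information, `pmi`), and a minimum-frequency filter.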

In [91]:
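from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
trigram_measures = TrigramAssocMeasures()
# Count every trigram in the English (web) translation of Genesis
finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
# Rank trigrams by pointwise mutual information and keep the top 10
finder.nbest(trigram_measures.pmi, 10)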

[(u'olive', u'leaf', u'plucked'),
 (u'rider', u'falls', u'backward'),
 (u'sewed', u'fig', u'leaves'),
 (u'yield', u'royal', u'dainties'),
 (u'during', u'mating', u'season'),
 (u'Salt', u'Sea', u').'),
 (u'Sea', u').', u'Twelve'),
 (u'Their', u'hearts', u'failed'),
 (u'Valley', u').', u'Melchizedek'),
 (u'doing', u'forced', u'labor')]

In [92]:
# Ignore trigrams that occur fewer than 3 times, then rank again
finder.apply_freq_filter(3)
finder.nbest(trigram_measures.pmi, 10)

[(u'Beer', u'Lahai', u'Roi'),
 (u'seven', u'ewe', u'lambs'),
 (u'God', u'Most', u'High'),
 (u'built', u'an', u'altar'),
 (u'every', u'living', u'creature'),
 (u'an', u'everlasting', u'covenant'),
 (u'every', u'creeping', u'thing'),
 (u'sixty', u'-', u'five'),
 (u'soul', u'may', u'bless'),
 (u'after', u'its', u'kind')]
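The two listings differ because PMI rewards word combinations whose parts rarely appear apart, so sequences seen only once or twice (including tokenization artifacts such as `').'`) score very high until a frequency filter removes them. The following is a simplified sketch of that effect for bigrams rather than trigrams, using only the standard library and Python 3. It is not NLTK's implementation; `bigram_pmi`, `min_freq` and the toy sentence are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=1):
    """Rank adjacent word pairs by a simple PMI estimate.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts.
    Pairs seen fewer than min_freq times are skipped.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_freq:
            continue
        scores[(w1, w2)] = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


tokens = ("the white whale and the old man saw the white whale "
          "near the old ship").split()

print(bigram_pmi(tokens)[0])              # a pair seen only once ranks first
print(bigram_pmi(tokens, min_freq=2)[0])  # after filtering, a repeated pair wins
```

Without the filter, the top pair is one that occurs a single time; with `min_freq=2` the repeated pair `('white', 'whale')` comes first, which mirrors the change between the two Genesis listings.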

## Concordances
Show each occurrence of a word in its surrounding context with `concordance`, then count occurrences.

In [87]:
text1.concordance("whale")

Displaying 25 of 1226 matches:
s , and to teach them by what name a whale - fish is to be called in our tongue
t which is not true ." -- HACKLUYT " WHALE . ... Sw . and Dan . HVAL . This ani
ulted ." -- WEBSTER ' S DICTIONARY " WHALE . ... It is more immediately from th
ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE
HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE
least , take the higgledy - piggledy whale statements , however authentic , in 
 dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a
 patient Job ." -- RABELAIS . " This whale ' s liver was two cartloads ." -- ST
 Touching that monstrous bulk of the whale or ork we have received nothing cert
 of oil will be extracted out of one whale ." -- IBID . " HISTORY OF LIFE AND D
ise ." -- KING HENRY . " Very like a whale ." -- HAMLET . " Which to secure , n
restless paine , Like as the wounded whale to shore flies thro ' the maine ." -
. OF SPER

In [8]:
text1.count("whale")

906
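The concordance reports 1226 matches while `count` returns 906 because the concordance lookup ignores case, whereas `count("whale")` matches only the exact token `whale`. This is consistent with the lowercased frequency list later in this workshop, where `whale` also has 1226 occurrences. A small keyword-in-context sketch in plain Python 3 shows the difference; `kwic` and the toy tokens are illustrative and not NLTK's own code.

```python
def kwic(tokens, word, width=2):
    """Return (left, hit, right) context windows, matching case-insensitively."""
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


tokens = ["The", "Whale", "rose", ";", "the", "whale", "sank", ".", "WHALE", "!"]

exact = tokens.count("whale")          # case-sensitive, like text1.count("whale")
any_case = len(kwic(tokens, "whale"))  # case-insensitive, like the concordance

print(exact, any_case)
for left, hit, right in kwic(tokens, "whale"):
    print("{:>12} [{}] {}".format(left, hit, right))
```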

In [11]:
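# Python 2: make "/" perform true division instead of integer division
from __future__ import division
# Share of all tokens that are exactly "whale", as a percentage
text1.count("whale") / len(text1) * 100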

0.3473673313677301

In [12]:
fdist = FreqDist(text1)

In [93]:
fdist

<FreqDist with 19317 samples and 260819 outcomes>

In [95]:
fdist.items()[:50]

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]

In [13]:
fdist.most_common(50)

AttributeError: 'FreqDist' object has no attribute 'most_common'
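The `AttributeError` above is what an NLTK 2 installation produces. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so `most_common` is available there, while `items()` is no longer sorted by frequency (and cannot be sliced under Python 3). A sketch using only the standard library on toy data, written for Python 3; the same two expressions should work on an NLTK 3 `FreqDist`:

```python
from collections import Counter

# Toy tokens standing in for text1
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship"]
counts = Counter(tokens)

# Preferred where available: the n most frequent items, highest count first
top = counts.most_common(2)
print(top)

# Fallback that relies only on dict-style items(): sort by count explicitly
top_sorted = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_sorted)
```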

### Filtering Word Lists

In [30]:
all_words = [w.lower() for w in text1 if w.isalpha()]
fdist_words = FreqDist(all_words)

In [31]:
fdist_words.items()[:50]

[('the', 14431),
 ('of', 6609),
 ('and', 6430),
 ('a', 4736),
 ('to', 4625),
 ('in', 4172),
 ('that', 3085),
 ('his', 2530),
 ('it', 2522),
 ('i', 2127),
 ('he', 1896),
 ('but', 1818),
 ('s', 1802),
 ('as', 1741),
 ('is', 1725),
 ('with', 1722),
 ('was', 1644),
 ('for', 1617),
 ('all', 1526),
 ('this', 1394),
 ('at', 1319),
 ('whale', 1226),
 ('by', 1204),
 ('not', 1151),
 ('from', 1088),
 ('him', 1067),
 ('so', 1065),
 ('on', 1062),
 ('be', 1045),
 ('one', 921),
 ('you', 894),
 ('there', 869),
 ('now', 785),
 ('had', 779),
 ('have', 768),
 ('or', 713),
 ('were', 684),
 ('they', 667),
 ('which', 648),
 ('like', 647),
 ('me', 633),
 ('then', 630),
 ('their', 620),
 ('some', 618),
 ('what', 618),
 ('when', 606),
 ('are', 598),
 ('an', 596),
 ('my', 589),
 ('no', 586)]

In [97]:
from nltk.corpus import stopwords
# Build the stop-word set once; calling stopwords.words() for every token is slow
stops = set(stopwords.words('english'))
filtered_words = [w for w in all_words if w not in stops]
fdist_filtered_words = FreqDist(filtered_words)
fdist_filtered_words.items()[:50]

[('whale', 1226),
 ('one', 921),
 ('like', 647),
 ('upon', 566),
 ('man', 527),
 ('ship', 518),
 ('ahab', 511),
 ('ye', 472),
 ('sea', 455),
 ('old', 450),
 ('would', 432),
 ('though', 384),
 ('head', 345),
 ('yet', 345),
 ('boat', 336),
 ('time', 334),
 ('long', 333),
 ('captain', 329),
 ('still', 312),
 ('great', 306),
 ('said', 304),
 ('two', 298),
 ('must', 283),
 ('seemed', 283),
 ('white', 281),
 ('last', 277),
 ('see', 272),
 ('thou', 271),
 ('way', 271),
 ('whales', 268),
 ('stubb', 257),
 ('queequeg', 252),
 ('little', 249),
 ('round', 247),
 ('three', 245),
 ('men', 244),
 ('say', 244),
 ('sperm', 244),
 ('may', 240),
 ('first', 235),
 ('every', 232),
 ('well', 230),
 ('us', 228),
 ('much', 223),
 ('could', 216),
 ('good', 216),
 ('hand', 214),
 ('side', 208),
 ('ever', 206),
 ('never', 206)]
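The filtering step is a pipeline: keep alphabetic tokens, lowercase them, drop stop words, then count. A self-contained sketch of that pipeline in Python 3 follows. The tiny `STOPWORDS` set and the helper `content_word_counts` are hypothetical stand-ins; the real list comes from `stopwords.words('english')` and is much longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def content_word_counts(tokens, stopwords=STOPWORDS):
    """Count lowercased alphabetic tokens that are not stop words."""
    words = [t.lower() for t in tokens if t.isalpha()]  # drops punctuation and numbers
    return Counter(w for w in words if w not in stopwords)


tokens = ["The", "whale", "and", "the", "ship", ",", "a", "Whale", "!"]
counts = content_word_counts(tokens)
print(counts.most_common())
```

Keeping the stop words in a `set` makes each membership test constant-time, which matters when the token list has hundreds of thousands of entries.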

In [63]:
[word for word in set(filtered_words) if len(word) > 15]

['hermaphroditical',
 'subterraneousness',
 'apprehensiveness',
 'uninterpenetratingly',
 'irresistibleness',
 'responsibilities',
 'comprehensiveness',
 'uncompromisedness',
 'superstitiousness',
 'uncomfortableness',
 'preternaturalness',
 'circumnavigating',
 'cannibalistically',
 'supernaturalness',
 'circumnavigations',
 'indispensableness',
 'simultaneousness',
 'undiscriminating',
 'characteristically',
 'physiognomically',
 'indiscriminately',
 'circumnavigation']

In [78]:
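import re
# Words starting with "un" and ending with "ly"; look counts up in the FreqDist
# instead of rescanning the whole list with filtered_words.count() for every word
[(word, fdist_filtered_words[word]) for word in set(filtered_words) if re.search(r'^un.*ly$', word)]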

[('unfriendly', 1),
 ('universally', 4),
 ('unexpectedly', 1),
 ('unblinkingly', 1),
 ('ungentlemanly', 1),
 ('uninvitedly', 1),
 ('unrustlingly', 1),
 ('unmannerly', 2),
 ('uninterpenetratingly', 1),
 ('unmanageably', 1),
 ('unhesitatingly', 1),
 ('unwittingly', 3),
 ('unappeasedly', 1),
 ('unerringly', 3),
 ('unceasingly', 3),
 ('unaccountably', 2),
 ('unspeakably', 2),
 ('unmistakably', 1),
 ('unusually', 5),
 ('unsightly', 1),
 ('unearthly', 12),
 ('unreasonably', 2),
 ('ungainly', 1),
 ('untraditionally', 1),
 ('unfathomably', 2),
 ('untimely', 2),
 ('unfrequently', 3),
 ('unsweetly', 1),
 ('unrestingly', 3),
 ('unmurmuringly', 1),
 ('unconditionally', 1),
 ('unholy', 2),
 ('unconsciously', 7),
 ('unduly', 2),
 ('undoubtedly', 1),
 ('unthinkingly', 2),
 ('ungodly', 3),
 ('unceremoniously', 1),
 ('unmethodically', 2),
 ('unwarrantably', 2),
 ('untrackably', 1),
 ('unofficially', 1),
 ('unlikely', 1),
 ('uniformly', 1),
 ('unavoidably', 1),
 ('unprecedentedly', 1),
 ('unmeaningly', 

## Classifying Text

Assign categories to text algorithmically. Here, each token is tagged with its part of speech.

In [108]:
nltk.pos_tag(sent2)

[('The', 'DT'),
 ('family', 'NN'),
 ('of', 'IN'),
 ('Dashwood', 'NNP'),
 ('had', 'VBD'),
 ('long', 'RB'),
 ('been', 'VBN'),
 ('settled', 'VBN'),
 ('in', 'IN'),
 ('Sussex', 'NNP'),
 ('.', '.')]

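[Reference](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for the Penn Treebank part-of-speech tags returned by `nltk.pos_tag`. The Brown corpus shown next uses its own, different tagset (for example, `AT` for articles).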
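To make the idea of tagging concrete, here is a most-frequent-tag baseline in plain Python 3: each word receives the tag it carried most often in some tagged training sentences, and unseen words get a default tag. This is not how `nltk.pos_tag` works internally; the function `train_baseline_tagger`, the `default_tag` parameter and the hand-written training sentences (with Penn Treebank style tags) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_baseline_tagger(tagged_sents, default_tag="NN"):
    """Return a function that tags each token with its most frequent training tag."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word.lower()][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in tag_counts.items()}

    def tag(tokens):
        # Unknown words fall back to default_tag
        return [(t, best.get(t.lower(), default_tag)) for t in tokens]

    return tag


# Hand-written toy training data: a list of sentences of (word, tag) pairs
training = [
    [("The", "DT"), ("family", "NN"), ("had", "VBD"), ("settled", "VBN"), (".", ".")],
    [("The", "DT"), ("long", "JJ"), ("road", "NN"), (".", ".")],
    [("They", "PRP"), ("had", "VBD"), ("long", "RB"), ("been", "VBN"), ("settled", "VBN"), (".", ".")],
    [("A", "DT"), ("long", "RB"), ("settled", "VBN"), ("family", "NN"), (".", ".")],
]

tag = train_baseline_tagger(training)
result = tag(["The", "family", "settled", "in", "Sussex", "."])
print(result)
```

Words the baseline has never seen (`in`, `Sussex`) receive the default tag, which is exactly the kind of error that context-aware taggers are designed to reduce.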

In [85]:
nltk.corpus.brown.tagged_paras()

[[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]], [[('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atl