In [4]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


The materials below are extracted from the NLTK book:
<br>http://www.nltk.org/book/

# 1 Language Processing and Python
## 1.1.3 Searching Text
- concordance: shows every occurrence of a given word, together with some context.
<br>in: word
<br>out: contexts in which the word occurs

In [7]:
text1.concordance("happy")

Displaying 8 of 8 matches:
 , was called a Cape - Cod - man . A happy - go - lucky ; neither craven nor va
 rivers ; through sun and shade ; by happy hearts or broken ; through all the w
 most sea - terms , this one is very happy and significant . For the whale is i
ng by way of getting a living . Oh ! happy that the world is such an excellent 
e says , Monsieur , that he ' s very happy to have been of any service to us ."
irst love ; we marry and think to be happy for aye , when pop comes Libra , or 
 , a desperate burglar slid into his happy home , and robbed them all of everyt
rous thing in his soul . That glad , happy air , that winsome sky , did at last


- similar: finds words that appear in a similar range of contexts to a given word.
<br>
in: word <br> out: similar words

In [16]:
text2.similar("happy")

much that impossible well long soon and young civil before sorry far
glad so she rich given sure unwilling great


- common_contexts: contexts shared by two or more words
<br>in: two or more words, enclosed by square brackets
<br>out: common contexts

In [33]:
text1.common_contexts(["man","happy"])

very_to


- dispersion_plot: displays where each word occurs in a text, measured in words from the beginning
<br>in: one or more words, enclosed by square brackets
<br>out: dispersion plot
<br>note: can plot the frequency of word usage through time using ngrams

In [34]:
text1.dispersion_plot(["happy","quick","freedom"])

- generate: produces random text in the style of a given text
<br>in: text name before calling the function <br>out: random text
<br>note: not available in NLTK 3.0

## 1.1.4 Counting Vocabulary 

- set:obtain the vocabulary items
    <br>in:text<br>out: vocabulary
    <br>note: lexical diversity = len(set(text))/len(text)
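
The lexical diversity formula in the note can be tried on a toy token list. This is a minimal sketch: the helper name `lexical_diversity` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical sample text; any list of tokens (e.g. text1) works the same way.
tokens = "the cat sat on the mat and the dog sat too".split()
vocabulary = sorted(set(tokens))          # the distinct vocabulary items
diversity = lexical_diversity(tokens)

print(len(tokens), len(vocabulary), round(diversity, 3))
```

A value close to 1 means almost every token is a new word; a low value means heavy repetition.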

In [39]:
#sorted(set(text2))

In [41]:
text1.count("what")

442

- FreqDist: frequency distribution of the words
<br>in: text<br>out: FreqDist (a dictionary-like mapping from each word to its count)

In [6]:
fdist1=FreqDist(text2)
print(fdist1.most_common(10))

[(',', 9397), ('to', 4063), ('.', 3975), ('the', 3861), ('of', 3565), ('and', 3350), ('her', 2436), ('a', 2043), ('I', 2004), ('in', 1904)]


## 1.3.2 Fine-grained Selection of Words
- extract words by length and frequency counts

In [44]:
#long_words = [w for w in V if len(w) > 15]
#sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

## 1.3.3 Collocations and Bigrams
- collocations: sequences of words that occur together unusually often
<br>bigrams: pairs of adjacent words

In [45]:
text2.collocations()

Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning
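
As a rough illustration of the idea behind collocations, the sketch below counts bigrams in a made-up sentence and keeps the pairs that repeat. It uses raw counts only and does not reproduce the scoring that `text2.collocations()` applies; the sample sentence is an assumption.

```python
from nltk import FreqDist, bigrams

# Hypothetical sample text, already split into tokens.
tokens = "red wine is good and red wine is cheap".split()

pairs = list(bigrams(tokens))      # adjacent word pairs
pair_counts = FreqDist(pairs)      # how often each pair occurs

# Pairs seen more than once are candidate collocations in this toy setting.
repeated = [pair for pair, count in pair_counts.items() if count > 1]
print(repeated)
```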


## 1.3.4 NLTK's Frequency Distributions
fdist = FreqDist(samples)   #create a frequency distribution containing the given samples
<br>fdist[sample] += 1          #increment the count for this sample
<br>fdist['monstrous']           #count of the number of times a given sample occurred
<br>fdist.freq('monstrous')           #frequency of a given sample
<br>fdist.N()                      #total number of samples
<br>fdist.most_common(n)           #the n most common samples and their frequencies
<br>for sample in fdist:           #iterate over the samples
<br>fdist.max()                     #sample with the greatest count
<br>fdist.tabulate()           #tabulate the frequency distribution
<br>fdist.plot()                      #graphical plot of the frequency distribution
<br>fdist.plot(cumulative=True)           #cumulative plot of the frequency distribution
<br>fdist1 |= fdist2                      #update fdist1 with counts from fdist2
<br>fdist1 < fdist2            #test if samples in fdist1 occur less frequently than in fdist2
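
A few of the operations listed above, run on a small hand-made token list so that no corpus download is needed. The sample sentence is an assumption made for illustration.

```python
from nltk import FreqDist

# Hypothetical sample tokens.
tokens = "the cat sat on the mat and the dog sat too".split()
fdist = FreqDist(tokens)

the_count = fdist["the"]           # count of one sample
total = fdist.N()                  # total number of samples
the_freq = fdist.freq("the")       # relative frequency: count / total
top_two = fdist.most_common(2)     # the two most common samples with counts
most_frequent = fdist.max()        # sample with the greatest count

fdist["cat"] += 1                  # increment the count for one sample

print(the_count, total, round(the_freq, 3), top_two, most_frequent)
```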

## 1.3.5 Word Comparison Operators
s.startswith(t)	test if s starts with t
<br>s.endswith(t)	test if s ends with t
<br>t in s	    test if t is a substring of s
<br>s.islower()	test if s contains cased characters and all are lowercase
<br>s.isupper()	test if s contains cased characters and all are uppercase
<br>s.isalpha()	test if s is non-empty and all characters in s are alphabetic
<br>s.isalnum()	test if s is non-empty and all characters in s are alphanumeric
<br>s.isdigit()	test if s is non-empty and all characters in s are digits
<br>s.istitle()	test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)
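
These string tests are plain Python and are typically used inside list comprehensions to select words. A small sketch on a made-up word list:

```python
# Hypothetical word list mixing capitalised, numeric and hyphenated tokens.
words = ["Moby", "Dick", "whale", "1851", "sea-terms", "CHAPTER", "ableness"]

titlecased = [w for w in words if w.istitle()]
numbers = [w for w in words if w.isdigit()]
alphabetic = [w for w in words if w.isalpha()]
uppercase = [w for w in words if w.isupper()]
ness_words = [w for w in words if w.endswith("ness")]
with_hyphen = [w for w in words if "-" in w]

print(titlecased, numbers, uppercase, ness_words, with_hyphen)
```

Note that `"1851".istitle()` and `"1851".isupper()` are both False because the string contains no cased characters.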

# 2. Accessing Text Corpora and Lexical Resources
## 2.1.1 Gutenberg Corpus 
http://www.gutenberg.org/  25000 free electronic books
## 2.1.2 Web and Chat text
content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews
- refer to the book for more corpora

## 2.4.1 Unusual words
computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or mis-spelt words.
<br>in: corpus
<br>out: list of unusual words

In [7]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

print(unusual_words(text1)[:10])

['abated', 'abating', 'abednego', 'abhorred', 'abided', 'abjectus', 'ablutions', 'abominated', 'aboriginalness', 'abortions']


## 2.4.1 Stopwords 
high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing

In [55]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

## 2.4.4 Shoebox and Toolbox Lexicons
consists of a collection of entries, where each entry is made up of one or more fields

In [8]:
from nltk.corpus import toolbox
print(toolbox.entries('rotokas.dic')[3])

('kaakaaro', [('ps', 'N'), ('pt', 'NT'), ('ge', 'mixture'), ('tkp', '???'), ('eng', 'mixtures'), ('eng', 'charm used to keep married men and women youthful and attractive'), ('cmt', 'Check vowel length. Is it kaakaaro or kaakaro? Does lexeme have suffix, -aro or -ro?'), ('dt', '20/Nov/2006'), ('ex', 'Kaakaroto ira purapaiveira aue iava opita, voeao-pa airepa oraouirara, ra va aiopaive.'), ('xp', 'Kokonas ol i save wokim long ol kain samting bilong ol nupela marit, bai ol i ken kaikai.'), ('xe', 'Mixtures are made from coconut for newlyweds, who eat them.')])


## 2.5 Wordnets
find synonyms:
<br>in: word <br> out: synsets (sets of synonymous words)

In [66]:
from nltk.corpus import wordnet as wn
wn.synsets('happy')

[Synset('happy.a.01'),
 Synset('felicitous.s.02'),
 Synset('glad.s.02'),
 Synset('happy.s.04')]

In [69]:
#definitions
print(wn.synset('happy.a.01').definition())
#examples
print(wn.synset('happy.a.01').examples())

enjoying or showing or marked by joy or pleasure
['a happy smile', 'spent many happy days on the beach', 'a happy marriage']


- Other lexical relations
<br>hyponyms: more specific
<br>hypernyms:navigate up the hierarchy
<br>meronyms: components
<br>holonyms: things the words are contained in 
<br>synset1.path_similarity(synset2): calculate semantic similarity based on the shortest path between the two synsets in the hypernym hierarchy

# 3 Processing Raw Text
## 3.1 Tokenize word
word_tokenize
<br>in:texts<br>out: list of words
<br><br>Text:create a text from tokens
<br>in:tokens<br>out:text

## 3.6 Normalizing Text
Stemmers:strip off any affixes
- PorterStemmer
- LancasterStemmer

<br>Lemmatization:removes affixes if the resulting word is in its dictionary
- WordNetLemmatizer

In [9]:
from nltk import WordNetLemmatizer
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in text1][:10])

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.']
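
For comparison with the lemmatizer output above, the two stemmers can be run side by side. They need no corpus data. The word list is an illustrative assumption, and the printed stems are best inspected rather than taken from this note.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hypothetical sample words.
words = ["running", "cats", "women", "lying"]

# Each stemmer strips affixes with its own rule set, so results can differ.
porter_stems = [porter.stem(w) for w in words]
lancaster_stems = [lancaster.stem(w) for w in words]

for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(word, p_stem, l_stem)
```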


## 3.7 NLTK's regular expression tokenizer
nltk.regexp_tokenize: tokenizer using regular expression 
<br>in:text, pattern<br>out:tokens
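
A minimal sketch of `regexp_tokenize` on a made-up sentence. The pattern below is an assumption written for this example (it is simpler than the one in the book), and it uses only non-capturing groups so that whole tokens are returned.

```python
from nltk.tokenize import regexp_tokenize

# Hypothetical sample sentence.
text = "The poster-print costs $12.40, not $15."

# Alternatives are tried left to right at each position:
#   1. numbers with optional currency sign, decimals and percent sign
#   2. words with optional internal hyphens
#   3. any single character that is neither a word character nor whitespace
pattern = r"\$?\d+(?:\.\d+)?%?|\w+(?:-\w+)*|[^\w\s]"

tokens = regexp_tokenize(text, pattern)
print(tokens)
```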

## 3.8 Segmentation
- sentence segmentation
<br>nltk.sent_tokenize:<br>in:text<br>out:sentences

# 5 Categorizing and Tagging words
## 5.1 Using a Tagger
- POS tagger: processes a sequence of words, and attaches a part-of-speech tag to each word

In [86]:
tokens=nltk.word_tokenize("From this day till the end of all days")
nltk.pos_tag(tokens)
#some nltk packages have tagged words

[('From', 'IN'),
 ('this', 'DT'),
 ('day', 'NN'),
 ('till', 'VB'),
 ('the', 'DT'),
 ('end', 'NN'),
 ('of', 'IN'),
 ('all', 'DT'),
 ('days', 'NNS')]

## 5.5 N-Gram Tagging
- Unigram Tagging
<br>nltk.UnigramTagger: most frequent tag for each word

In [10]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]


- n-gram tagger: a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
<br>n=2: nltk.BigramTagger
<br>note: accuracy drops when many test contexts were never seen in the training data; the tagger assigns None to those words
- combined taggers: chain default, unigram and n-gram taggers through the backoff parameter to address the trade-off between accuracy and coverage
- Brill tagger: guess the tag of each word, then go back and fix the mistakes using learned transformation rules.

In [99]:
#t0 = nltk.DefaultTagger('NN')
#t1 = nltk.UnigramTagger(train_sents, backoff=t0)
#t2 = nltk.BigramTagger(train_sents, backoff=t1)
#t2.evaluate(test_sents)
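
The commented cell above needs `train_sents` and `test_sents` from a tagged corpus. The sketch below shows the same backoff chain on two hand-made tagged sentences (an assumption for illustration), including the None tag that an n-gram tagger without backoff gives to unseen words.

```python
from nltk import BigramTagger, DefaultTagger, UnigramTagger

# Hypothetical, tiny training set; real use would take sentences from a tagged corpus.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "DT"), ("cat", "NN"), ("slept", "VBD")],
]

# Without backoff, a word missing from the training data is tagged None.
plain_unigram = UnigramTagger(train_sents)
unseen = plain_unigram.tag(["bird"])

# With backoff: bigram -> unigram -> default tag 'NN'.
t0 = DefaultTagger("NN")
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
tagged = t2.tag(["the", "bird", "barked"])

print(unseen)
print(tagged)
```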

# 6. Learning to Classify Text
## 6.1 Supervised Classification
- NaiveBayesClassifier
<br>train,classify,prob_classify,show_most_informative_features
- DecisionTreeClassifier

## 6.3 Evaluation
- classify.accuracy 
<br>in: classifier, labeled test set <br>out: accuracy
- ConfusionMatrix
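
A minimal sketch of training and evaluating a NaiveBayesClassifier. The feature name `last_letter` and the toy data are assumptions made for this example; they stand in for features extracted from a real corpus.

```python
import nltk
from nltk.metrics import ConfusionMatrix

# Hypothetical labeled feature sets: (feature dict, label).
train_set = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "o"}, "male"),
]
test_set = [({"last_letter": "a"}, "female"), ({"last_letter": "k"}, "male")]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# classify() takes a single feature dict and returns a label.
predictions = [classifier.classify(features) for features, _ in test_set]
gold = [label for _, label in test_set]

# accuracy() takes the classifier and a labeled test set.
accuracy = nltk.classify.accuracy(classifier, test_set)

# ConfusionMatrix compares reference labels with predicted labels.
cm = ConfusionMatrix(gold, predictions)

print(predictions, accuracy)
print(cm)
```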

# 7. Extracting information from Text
## 7.2.1 Noun Phrase Chunking
RegexpParser(grammar):
use a grammar to parse the sentences

In [3]:
#this example shows noun Chunking
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}" 
cp = nltk.RegexpParser(grammar) 
result = cp.parse(sentence)
print(result)
result.draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


## 7.3.2 Simple Evaluation and Baselines: IOB tagging
- use RegexpParser.evaluate to score a chunker against gold-standard chunked sentences
<br>I: in the chunk<br>O: not in a chunk <br> B:the tag is the beginning of a chunk

In [3]:
grammar = r"NP: {<[CDJNP].*>+}"
from nltk.corpus import conll2000
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%
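
The IOB scheme itself can be inspected without the conll2000 corpus by converting a chunk tree to (word, tag, IOB-tag) triples. A sketch using the tagged sentence from section 7.2.1 and `tree2conlltags`:

```python
from nltk import RegexpParser
from nltk.chunk import tree2conlltags

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Chunk noun phrases, then flatten the tree into IOB-tagged triples.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(sentence)
iob_tags = tree2conlltags(tree)

for word, pos, iob in iob_tags:
    print(word, pos, iob)
```

Each chunk starts with a `B-NP` token, continues with `I-NP`, and tokens outside any chunk are tagged `O`.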


## 7.4.2 Trees
create a tree by giving a node label and a list of children:

In [7]:
tree1 = nltk.Tree('NP', ['Alice'])
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
#incorporate into larger trees
tree4 = nltk.Tree('S', [tree1, tree2])
print(tree4)
print(tree4[1].label())
print(tree4.leaves())

(S (NP Alice) (NP the rabbit))
NP
['Alice', 'the', 'rabbit']


## 7.5 Named Entity Recognition (NER)
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, times, money, percentages, facilities, and geo-political entities like cities, states and countries.

In [17]:
from nltk import chunk
sent = nltk.corpus.treebank.tagged_sents()[1]
print(nltk.ne_chunk(sent))  # with binary=True, entities are only tagged as NE, without a type label

(S
  (PERSON Mr./NNP)
  (PERSON Vinken/NNP)
  is/VBZ
  chairman/NN
  of/IN
  (ORGANIZATION Elsevier/NNP)
  N.V./NNP
  ,/,
  the/DT
  (GPE Dutch/NNP)
  publishing/VBG
  group/NN
  ./.)


# 8. Analyzing Sentence Structure
## 8.4.1 Recursive Descent Parsing (Top down)
With the initial goal (find an S), the S root node is created. As the above process recursively expands its goals using the productions of the grammar, the parse tree is extended downwards (hence the name recursive descent)

In [23]:
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
rd_parser = nltk.RecursiveDescentParser(grammar1)
sent = 'Mary saw a dog'.split()
for tree in rd_parser.parse(sent):
    print(tree)

(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))


- shortcomings of Recursive Descent Parsing: <br>
First, left-recursive productions like NP -> NP PP send it into an infinite loop. 
<br>Second, the parser wastes a lot of time considering words and structures that do not correspond to the input sentence. 
<br>Third, the backtracking process may discard parsed constituents that will need to be rebuilt again later.

## 8.4.2 Shift-Reduce Parsing (Bottom up)
tries to find sequences of words and phrases that correspond to the right hand side of a grammar production, and replace them with the left-hand side, until the whole sentence is reduced to an S

In [25]:
sr_parser = nltk.ShiftReduceParser(grammar1)
sent = 'Mary saw a dog'.split()
for tree in sr_parser.parse(sent):
     print(tree)

(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))


- shortcoming of Shift-Reduce Parsing: <br>
A shift-reduce parser can reach a dead end and fail to find any parse, even if the input sentence is well-formed according to the grammar
- advantage over recursive descent parsers: <br>
shift-reduce parsers only build structure that corresponds to the words in the input. Furthermore, they only build each sub-structure once
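
The dead-end behaviour can be reproduced with the same grammar and a longer sentence. This is a sketch: the sentence is an assumption chosen so that the shift-reduce parser reduces `saw a dog` to a complete S before it reaches the prepositional phrase, while the recursive descent parser still finds the parses.

```python
import nltk

# Same productions as grammar1 above, repeated so the block is self-contained.
grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

sent = "Mary saw a dog in the park".split()

# Bottom-up without backtracking: may get stuck with leftover constituents.
sr_trees = list(nltk.ShiftReduceParser(grammar).parse(sent))

# Top-down with backtracking: explores every expansion of the grammar.
rd_trees = list(nltk.RecursiveDescentParser(grammar).parse(sent))

print(len(sr_trees), len(rd_trees))
for tree in rd_trees:
    print(tree)
```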

## 8.4.3 The Left-Corner Parser (hybrid)
Top-down parser with bottom-up filtering 

## 8.5 Dependencies and Dependency Grammar
dependency grammar focusses on how words relate to other words. Dependency is a binary asymmetric relation that holds between a head and its dependents

# Other Important Modules from NLTK package
https://www.nltk.org/py-modindex.html#
## Chat: different chatbots
examples: eliza, iesha, rude, suntsu, zen (util provides the shared Chat class)
## Combinatory Categorial Grammar
- ccg.api 
- ccg.chart:
<br>standard English rules: chart.DefaultRuleSet
<br>construct parser by calling: parser=chart.CCGChartParser(<lexicon>,<ruleset>)
<br>
nltk.ccg.chart.compute_semantics(children, edge)
<br>
nltk.ccg.chart.printCCGDerivation(tree)
<br>nltk.ccg.chart.printCCGTree(lwidth, tree)
- ccg.combinator
- ccg.lexicon: create lexicon for CCG 

## Chunk
- nltk.chunk.regexp.ChinkRule(tag_pattern, descr)
<br>find any substring that matches the tag pattern and that is contained in a chunk, and remove it from that chunk, thus creating two new chunks
- nltk.chunk.regexp.ChunkRule(tag_pattern, descr)
<br>find any substring that matches this tag pattern and that is not already part of a chunk, and create a new chunk containing that substring
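- nltk.chunk.regexp.ChunkRuleWithContext(left_context_tag_pattern, chunk_tag_pattern, right_context_tag_pattern, descr)
<br>find any substring that matches the chunk tag pattern, is surrounded by substrings that match the two context patterns, and is not already part of a chunk, and create a new chunk containing that substring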

## Classify
in: features, labels <br>out: classification model<br>
- nltk.classify.positivenaivebayes
<br>A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets: labeled examples exist for only one class (the positive class), and the rest of the training set is unlabeled.
- nltk.classify.scikitlearn
<br>implement a wrapper around scikit-learn classifiers

In [22]:
import sklearn
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
classif = SklearnClassifier(LinearSVC())
#A scikit-learn classifier may include preprocessing steps when it’s wrapped in a Pipeline object

- nltk.classify.textcat
<br>A module for language identification using the TextCat algorithm

## cluster
- contains a number of basic clustering algorithms: k-means, E-M clusterer and group average agglomerative clusterer.

## corpus
- used to read corpus files in a variety of formats. The functions are named based on the type of information they return:
<br>words(): list of str
<br>sents(): list of (list of str)
<br>paras(): list of (list of (list of str))
<br>tagged_words(): list of (str,str) tuple
<br>tagged_sents(): list of (list of (str,str))
<br>tagged_paras(): list of (list of (list of (str,str)))
<br>chunked_sents(): list of (Tree w/ (str,str) leaves)
<br>parsed_sents(): list of (Tree with str leaves)
<br>parsed_paras(): list of (list of (Tree with str leaves))
<br>xml(): A single xml ElementTree
<br>raw(): unprocessed corpus contents

In [23]:
from nltk.corpus import brown
print(", ".join(brown.words()[:10]))

The, Fulton, County, Grand, Jury, said, Friday, an, investigation, of


## Download
download corpora, models, and other data packages that can be used with NLTK
- download()
<br>display an interactive interface
- download("package name")
- Downloader.default_download_dir()
<br>see download directory

## Draw
displaying cfg, lexical dispersion, table, tree

## Feature Structures
classes for representing feature structures and for performing basic operations on them
- FeatDict: 
<br> feature dictionaries, Feature identifiers may be strings or instances of the Feature class
- FeatList:
<br>Feature lists, feature identifiers are integers
- Feature structure variables are encoded using the nltk.sem.Variable class
- it is possible to track the bindings of variables if you choose to, by supplying your own initial bindings dictionary to the unify() function

In [24]:
from nltk.featstruct import unify
unify(dict(x=1, y=dict()), dict(a='a', y=dict(b='b')))  
# expected result (key order may differ): {'y': {'b': 'b'}, 'x': 1, 'a': 'a'}

{'a': 'a', 'x': 1, 'y': {'b': 'b'}}

## Grammar
Basic data classes for representing context free grammars. A “grammar” specifies which trees can represent the structure of a given text. Each of these trees is called a “parse tree” for the text (or simply a “parse”). In this context, the leaves of a parse tree are word tokens; and the node values are phrasal categories, such as NP and VP.

## Metrics
association, confusionmatrix, distance, scores, segmentation, spearman correlation coefficient
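
Two of these modules in use, as a small sketch: `edit_distance` from the distance module and the set-based `precision`, `recall` and `f_measure` from the scores module. The strings and label sets are made up for illustration.

```python
from nltk.metrics import edit_distance, f_measure, precision, recall

# Levenshtein distance between two strings.
distance = edit_distance("kitten", "sitting")

# Hypothetical sets of items: what is correct vs. what a system returned.
reference = {"NP1", "NP2", "NP3"}
predicted = {"NP2", "NP3", "NP4"}

p = precision(reference, predicted)   # share of predicted items that are correct
r = recall(reference, predicted)      # share of reference items that were found
f = f_measure(reference, predicted)   # harmonic mean of the two (default weighting)

print(distance, round(p, 3), round(r, 3), round(f, 3))
```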

## Parsers
Classes and interfaces for producing tree structures that represent the internal organization of a text

## probability 
Classes for representing and processing probabilistic information.

- FreqDist
<br>class is used to encode “frequency distributions”
- ProbDistI
<br>defines a standard interface for “probability distributions”
- ConditionalFreqDist and ConditionalProbDistI
<br>used to encode conditional distributions.
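
A minimal sketch of a conditional frequency distribution built from (condition, sample) pairs, with a maximum-likelihood probability distribution derived from one of its conditions. The genre labels and words are toy data assumed for this example.

```python
from nltk.probability import ConditionalFreqDist, MLEProbDist

# Hypothetical (condition, sample) pairs, e.g. (genre, word).
pairs = [
    ("news", "the"), ("news", "said"), ("news", "the"),
    ("romance", "love"), ("romance", "the"),
]

cfd = ConditionalFreqDist(pairs)

conditions = sorted(cfd.conditions())   # the distinct conditions
news_the = cfd["news"]["the"]           # count of 'the' under condition 'news'
news_total = cfd["news"].N()            # each condition maps to a FreqDist

# Turn the counts for one condition into probabilities (relative frequencies).
news_probs = MLEProbDist(cfd["news"])
p_the = news_probs.prob("the")

print(conditions, news_the, news_total, round(p_the, 3))
```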

## Semantic interpretation
- logic 
<br>provides support for analyzing expressions of First Order Logic (FOL).
- evaluate 
<br>allows users to recursively determine truth in a model for formulas of FOL.

## Stem
remove morphological affixes from words, leaving only the word stem.

## Tag
- nltk.tag.pos_tag(tokens, tagset=None, lang='eng')
<br>tagset: the tagset to be used
<br>lang: language
<br>out:tagged tokens
<br>return type:list(tuple(str, str))
- nltk.tag.pos_tag_sents(sentences, tagset=None, lang='eng')
<br>return type:list(list(tuple(str, str)))

## Tokenize
Tokenizers divide strings into lists of substrings

## tree
Class for representing hierarchical language structures, such as syntax trees and morphological trees.