# Chapter 5 Categorizing and Tagging Words
## Jonathan Stewart

## Section 1

In [3]:
import nltk



In [6]:
text = nltk.word_tokenize("I want to color the book with the color purple")

nltk.pos_tag(text)


[('I', 'PRP'),
 ('want', 'VBP'),
 ('to', 'TO'),
 ('color', 'VB'),
 ('the', 'DT'),
 ('book', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('color', 'NN'),
 ('purple', 'NN')]

 The POS-tagger got both forms of color correct.
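To check this by program rather than by eye, the tagger output can be grouped by word. This is a minimal sketch that uses the tagged list printed above as literal data, so it runs without NLTK; the name `tags_by_word` is only illustrative.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
tagged = [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('color', 'VB'),
          ('the', 'DT'), ('book', 'NN'), ('with', 'IN'), ('the', 'DT'),
          ('color', 'NN'), ('purple', 'NN')]

# Collect every tag assigned to each word, in order of appearance.
tags_by_word = defaultdict(list)
for word, tag in tagged:
    tags_by_word[word].append(tag)

print(tags_by_word['color'])  # ['VB', 'NN']
```

The first 'color' follows 'to' and is tagged as a verb; the second follows 'the' and is tagged as a noun.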

## 2)

 Given the list of past participles produced by list(cfd2['VBN']), try to collect a list of all the word-tag pairs that immediately precede items in that list.

In [10]:
import nltk
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
participles = list(cfd2['VBN'])
prior_list = []  # word-tag pairs that immediately precede each participle
for w in participles:
    # index() returns the position of the FIRST (w, 'VBN') occurrence only
    idx1 = wsj.index((w, 'VBN'))
    if idx1 > 0:  # a participle at position 0 would have no predecessor
        prior_list.append(wsj[idx1 - 1])
print(prior_list)

[('has', 'VBZ'), ('is', 'VBZ'), ('was', 'VBD'), ('were', 'VBD'), ('*-2', '-NONE-'), ('has', 'VBZ'), ('*-2', '-NONE-'), ('was', 'VBD'), ('temporarily', 'RB'), ('has', 'VBZ'), ('is', 'VBZ'), ('were', 'VBD'), ('has', 'VBZ'), ('has', 'VBZ'), ('highly', 'RB'), ('for', 'IN'), ('has', 'VBZ'), ('*-2', '-NONE-'), ('rates', 'NNS'), ('*-2', '-NONE-'), ('are', 'VBP'), ('*-2', '-NONE-'), ('have', 'VBP'), ('has', 'VBZ'), ('of', 'IN'), ('of', 'IN'), ('have', 'VBP'), ('*-2', '-NONE-'), ('had', 'VBD'), ('has', 'VBZ'), ('has', 'VBZ'), ('of', 'IN'), ('is', 'VBZ'), ('year', 'NN'), ('companies', 'NNS'), ('have', 'VBP'), ('other', 'JJ'), ('sharply', 'RB'), ('was', 'VBD'), ('greatly', 'RB'), (',', ','), ('galvanized', 'JJ'), ('workers', 'NNS'), ('and', 'CC'), (':', ':'), ('to', 'TO'), ('be', 'VB'), ('was', 'VBD'), ('has', 'VBZ'), ('be', 'VB'), ('are', 'VBP'), ('be', 'VB'), ('be', 'VB'), ('watches', 'NNS'), ('through', 'IN'), ('has', 'VBZ'), ('RULING', 'NN'), ('*-1', '-NONE-'), ('have', 'VBP'), ('poorly', 'RB

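 The loop above finds the word-tag pair that precedes the first occurrence of each distinct word tagged 'VBN'. Later occurrences of the same participle are not examined, because index() returns only the first match.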
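To collect the pair before every 'VBN' token, not just the first occurrence of each participle, each token can be paired with its predecessor. The sketch below uses a small hand-written list (made-up data, not taken from the treebank) so that it runs without the corpus. The same pairing idea should carry over to `wsj`, for example via `nltk.bigrams(wsj)`.

```python
# Toy tagged corpus (made up for illustration, not taken from the treebank).
tagged_words = [('The', 'DT'), ('bill', 'NN'), ('was', 'VBD'),
                ('passed', 'VBN'), ('and', 'CC'), ('has', 'VBZ'),
                ('been', 'VBN'), ('passed', 'VBN'), ('again', 'RB')]

# Pair each token with the one before it, then keep the predecessor
# whenever the current token is tagged VBN.
preceding = [prev for prev, (word, tag) in zip(tagged_words, tagged_words[1:])
             if tag == 'VBN']

print(preceding)
# [('was', 'VBD'), ('has', 'VBZ'), ('been', 'VBN')]
```

Both occurrences of 'passed' contribute a predecessor here, whereas a lookup with index() would report only ('was', 'VBD') for that word.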

## 3) Mapping Words to Properties Using Python Dictionaries

In [15]:
d1 = {}
d1['I'] = 'PRP'
d1['want'] = 'VBP'
d1['to'] = 'TO'
d1['color'] = 'VB'

d2 = {}
d2['color'] = 'VB'
d2['the'] = 'DT'
d2['book'] = 'NN'
d2['with'] = 'IN'
d2['purple'] = 'NN'

d1.update(d2)

print(d1)

{'to': 'TO', 'with': 'IN', 'the': 'DT', 'book': 'NN', 'purple': 'NN', 'color': 'VB', 'I': 'PRP', 'want': 'VBP'}


d1.update(d2) copies every key-value pair from d2 into d1. Keys that are new to d1 are added, and keys that already exist in d1 (here 'color') have their values overwritten by the values from d2. In this example both dictionaries map 'color' to 'VB', so the overwrite is not visible. This is an easy way to merge dictionaries that are built up separately.
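One detail worth checking is what happens when a key exists in both dictionaries with different values. A minimal sketch with made-up entries (the tags here are chosen only to make the conflict visible):

```python
d1 = {'color': 'VB', 'want': 'VBP'}
d2 = {'color': 'NN', 'book': 'NN'}

d1.update(d2)   # 'color' is overwritten, 'book' is added, 'want' is untouched
print(d1)       # {'color': 'NN', 'want': 'VBP', 'book': 'NN'}

# To keep the values already present, add only the missing keys instead.
d3 = {'color': 'VB', 'want': 'VBP'}
for key, value in d2.items():
    d3.setdefault(key, value)
print(d3)       # {'color': 'VB', 'want': 'VBP', 'book': 'NN'}
```

Since a plain dictionary holds one tag per word, whichever value is written last wins. That matters for ambiguous words such as 'color'.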

## 4) Automatic Tagging

Come up with at least two patterns to improve the performance of the regular expression tagger presented in this chapter (and duplicated below). 

In [18]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*es$', 'VBZ'),                # 3rd singular present
     (r'.*ould$', 'MD'),               # modals
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'^pre.*', 'JJ'),                # adjectives (words beginning with 'pre')
     (r'.*ly$', 'RB'),                 # adverbs
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
     (r'.*', 'NN')                     # nouns (default)
]

regexp_tagger = nltk.RegexpTagger(patterns)


I added two patterns. The first tags words that begin with 'pre' as adjectives. A few such words are not adjectives, such as 'pretend' or 'preen', but many, such as 'preexisting' or 'preceding', are. Because the tagger uses the first pattern that matches, words ending in 'ing' (including these two examples) are still caught by the earlier gerund rule, so the new pattern only affects words that no earlier pattern matches. The second pattern tags words that end in 'ly' as adverbs. There are exceptions, such as 'lily', but most such words, like 'quickly' or 'silently', are adverbs.
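The placement of `*` and `.` in a prefix pattern matters: `pre*.$` and `^pre.*` look similar but match very different words. The sketch below assumes that `nltk.RegexpTagger` tries the patterns in order with `re.match` and returns the first hit. `first_match_tag` is a hypothetical stand-in written for this example, not an NLTK function, and the two short pattern lists are illustrative only.

```python
import re

def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches from the start of word."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return None

# Hypothetical two-rule pattern lists, each ending in a noun default.
loose = [(r'pre*.$', 'JJ'), (r'.*', 'NN')]      # 'pr', optional 'e's, one char, end
anchored = [(r'^pre.*', 'JJ'), (r'.*', 'NN')]   # anything beginning with 'pre'

for word in ['premature', 'prepaid', 'pry', 'book']:
    print(word, first_match_tag(word, loose), first_match_tag(word, anchored))
# premature NN JJ
# prepaid NN JJ
# pry JJ NN
# book NN NN
```

Only the anchored form captures the "begins with 'pre'" idea. The loose form matches short strings such as 'pry' and misses the longer words.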

## Section 5 - N-Gram Tagging
 
    Exercise: Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?



In [26]:
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents)
print('t1 evaluation: ', t1.evaluate(test_sents))
#t1.tag(brown_tagged_sents[size:])

t1 evaluation:  0.8118209907305891


Some words are left untagged (NLTK assigns them the tag None) because they never appeared in the training data. A unigram tagger is essentially a lookup table that maps each word seen in training to its most frequent tag; it uses no context from neighbouring words. A word that is absent from the table has no entry to look up, so without a backoff tagger it receives no tag.
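The lookup-table view of a unigram tagger can be sketched in a few lines of plain Python. This is a simplified illustration on a tiny made-up training set, not NLTK's implementation; `model` and `unigram_tag` are hypothetical names.

```python
from collections import Counter, defaultdict

# Tiny made-up training set of tagged sentences.
train = [[('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ')],
         [('the', 'AT'), ('run', 'NN'), ('ends', 'VBZ')],
         [('dogs', 'NNS'), ('run', 'VB')],
         [('they', 'PPSS'), ('run', 'VB')]]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# Keep only the most frequent tag per word: this table is the whole model.
model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(words):
    # Words missing from the table get None, like the untagged words above.
    return [(w, model.get(w)) for w in words]

print(unigram_tag(['the', 'cat', 'run']))
# [('the', 'AT'), ('cat', None), ('run', 'VB')]
```

'cat' never occurs in the training sentences, so it has no table entry and comes back as None. 'run' gets 'VB' because that tag was seen twice and 'NN' only once.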

Your Turn: Extend the example below by defining a TrigramTagger called t3, which backs off to t2.

In [20]:
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff = t2)
print('t3 evaluation: ', t3.evaluate(test_sents))
print('t2 evaluation: ', t2.evaluate(test_sents))
print('t1 evaluation: ', t1.evaluate(test_sents))

t3 evaluation:  0.8436160669789694
t2 evaluation:  0.84570915977275
t1 evaluation:  0.835841722316356


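t3 did slightly worse than t2 (0.8436 vs. 0.8457), but both beat t1 (0.8358). The likely cause is data sparsity: a trigram context (the two previous tags plus the current word) is seen far less often in training than a bigram context, so the extra context rarely helps on new text. Backoff also helps keep models small. The number of possible contexts grows roughly by a factor of the tagset size with each increase in n, but, as the NLTK book describes, a tagger trained with a backoff discards any context for which its backoff tagger would already assign the same tag. So if the trigram and bigram taggers would give the same tag in some context, only the bigram entry is kept.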
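The backoff chain can be illustrated with plain dictionaries. This is a simplified sketch: the tables, `DEFAULT_TAG`, and `tag_with_backoff` are hypothetical stand-ins for trained taggers, not NLTK objects. The bigram table deliberately stores only the one context where the unigram table would give the wrong tag, which mirrors the pruning idea.

```python
# Hypothetical lookup tables standing in for trained taggers.
bigram_table = {('TO', 'color'): 'VB'}   # (previous tag, word) -> tag
unigram_table = {'to': 'TO', 'the': 'DT', 'color': 'NN'}
DEFAULT_TAG = 'NN'

def tag_with_backoff(words):
    tagged = []
    prev_tag = None
    for word in words:
        # Try the most specific model first, then fall back step by step.
        tag = bigram_table.get((prev_tag, word))
        if tag is None:
            tag = unigram_table.get(word)
        if tag is None:
            tag = DEFAULT_TAG
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(tag_with_backoff(['to', 'color', 'the', 'color', 'purple']))
# [('to', 'TO'), ('color', 'VB'), ('the', 'DT'), ('color', 'NN'), ('purple', 'NN')]
```

The first 'color' is resolved by the bigram entry, the second by the unigram table, and the unseen word 'purple' falls through to the default tag.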

## Section 6 - Transformation-Based Tagging

Run the Brill Tagger demo using the code below. Select 5 of the useful rules and rewrite them using an English sentence. For example, if the rule is NN->VB if Pos:TO@[-1] the related sentence would be something like, change a noun to a verb if the preceding word is tagged TO.

In [27]:
from nltk.tbl import demo as brill_tagger
brill_tagger.demo()

Loading tagged data from treebank... 
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
    Accuracy on test set: 0.8355
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
    Found 12772 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  16  17   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  15   1   0  | VBD->VBN if Pos:VBZ@[-1]
  12  12   0   0  | VBP->VB if Pos:MD@[-2,-1]


RB->JJ if Pos:DT@[-1] & Pos:NN@[1] - Change an adverb to an adjective if the preceding word is tagged as a determiner and the following word is tagged as a singular or mass noun.

NN->VBP if Pos:NNS@[-2] & Pos:RB@[-1] - Change a singular noun to a non-third-person present-tense verb if the word two positions back is tagged as a plural noun and the immediately preceding word is tagged as an adverb.

IN->WDT if Pos:-NONE-@[1] & Pos:MD@[2] - Change a preposition to a wh-determiner if the next token is an empty element (a trace, tagged -NONE-) and the token after that is tagged as a modal (can, should, etc.).

JJS->RBS if Word:most@[0] & Word:the@[-1] & Pos:DT@[-1] - Change a superlative adjective to a superlative adverb if the word is 'most' and the preceding word is 'the', tagged as a determiner.

VB->NN if Pos:POS@[-1] - Change a verb to a noun if the preceding word is tagged as a possessive ending.
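To make the rule notation concrete, here is a minimal sketch of applying one single-condition rule of the form "A->B if Pos:C@[-1]" to a tagged sentence. `apply_rule` is a hypothetical helper written for illustration, not part of `nltk.tbl`, and the baseline tagging is made up.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Retag from_tag as to_tag wherever the preceding token is tagged prev_tag."""
    # Find all matching positions first, then rewrite, so that one change
    # cannot influence whether the rule fires at another position.
    positions = [i for i in range(1, len(tagged))
                 if tagged[i][1] == from_tag and tagged[i - 1][1] == prev_tag]
    result = list(tagged)
    for i in positions:
        result[i] = (result[i][0], to_tag)
    return result

# Made-up baseline tagging in which 'color' was wrongly tagged as a noun.
baseline = [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('color', 'NN'),
            ('the', 'DT'), ('book', 'NN')]

# NN->VB if Pos:TO@[-1]
print(apply_rule(baseline, 'NN', 'VB', 'TO'))
# [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('color', 'VB'), ('the', 'DT'), ('book', 'NN')]
```

Only 'color' changes, because it is the only noun directly preceded by a token tagged TO; 'book' follows a determiner and is left alone.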