### Word Sense Disambiguation

Given the following sentences:

    The agent will book the to the show for the entire family.
    But you can generally book tickets online.
    When you book tickets online they provide you with a book of stamps
    
In the sentences above, the word *book* is used in different contexts. In the first two sentences, and in the first part of the third, *book* (verb) means 'reserve', while in the second part of the third sentence *book* (noun) refers to a physical object.

## Part - 1

    Use the Lesk module to find the sense of the word *book* in each of the above sentences. Record your observations.
    
## Part - 2

Tag the sentences using the Brill tagger.

### Brill Tagger

The BrillTagger class is a **transformation-based tagger**. It uses a series
of rules to correct the results of an initial tagger. Each rule is scored by the number of
errors it corrects minus the number of new errors it introduces.

The idea is simple: the Brill tagger tries to correct the mistakes made by the initial tagger. It takes an initial tagger and a set of templates, which tell the trainer how to generate new rules automatically from the training set.
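
To make the scoring concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: it applies one hypothetical rule of the form "change tag A to B when the previous tag is C" and computes the score as fixed minus broken against hand-corrected tags. The function names and the toy tag sequences are assumptions made for illustration.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Return a copy of tags where from_tag becomes to_tag if preceded by prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the original tags, not the rewritten ones.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def score_rule(initial, gold, from_tag, to_tag, prev_tag):
    """Score = errors fixed minus new errors introduced."""
    after = apply_rule(initial, from_tag, to_tag, prev_tag)
    fixed = sum(1 for a, b, g in zip(initial, after, gold) if a != g and b == g)
    broken = sum(1 for a, b, g in zip(initial, after, gold) if a == g and b != g)
    return fixed - broken


# Toy tags for: "He will book tickets . the book"
initial = ["PRP", "MD", "NN", "NNS", ".", "DT", "NN"]  # initial tagger output
gold = ["PRP", "MD", "VB", "NNS", ".", "DT", "NN"]     # hand-corrected tags

good = score_rule(initial, gold, "NN", "VB", "MD")  # fixes 'book' after 'will'
bad = score_rule(initial, gold, "NN", "VB", "DT")   # breaks 'book' after 'the'
print(good, bad)  # 1 -1
```

A trainer would keep the first rule and discard the second, since only the first has a positive score.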

**Recommended Steps:**

1. First tag the sentences using the POS tagger, then observe the POS tags assigned to the word *book* in different contexts.
2. Create a tagged sentence from the POS tagger's output, correcting the mistakes it made.
3. Create a Brill tagger from an initial tagger (the POS tagger) and pass templates (rule patterns) to it.
4. Train the Brill tagger using the tagged sentence.
5. Test the Brill tagger on the following sentences:
       > "I bought this book from Kerala"
       > "He will book tickets to Kerala"
       
## Part - 3

    Perform Part-1 again, but this time pass the POS tags produced by the Brill tagger.
    

#### Import the required modules.

In [15]:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import brown
from nltk.tag import pos_tag, map_tag

#### Tokenize each sentence.

In [17]:
sent1 = 'The agent will book the to the show for the entire family.'
sent1 = word_tokenize(sent1)
sent2 = 'But you can generally book tickets online.'
sent2 = word_tokenize(sent2)
sent3 = 'When you book tickets online they provide you with a book of stamps'
sent3 = word_tokenize(sent3)

#### Use the Lesk module to find the sense of the word "book"

In [3]:
print(lesk(sent1, 'book'))

print(lesk(sent2, 'book'))

print(lesk(sent3, 'book'))

Synset('koran.n.01')
Synset('script.n.01')
Synset('book.n.11')
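
All three results are noun senses, even though *book* is a verb in the first two sentences. A likely reason is how simplified Lesk works: it picks the sense whose dictionary definition shares the most words with the context, and it ignores part of speech unless one is supplied. The sketch below reproduces that idea in plain Python with a tiny hand-written sense inventory. The sense names and glosses are invented for illustration and are not WordNet's.

```python
# Hypothetical mini sense inventory: (sense name, part of speech, gloss).
SENSES = [
    ("book_reserve", "v", "arrange for and reserve a seat or room in advance"),
    ("book_bound_sheets", "n", "a number of sheets of stamps or tickets bound together"),
]


def simple_lesk(context_tokens, senses, pos=None):
    """Return the name of the sense whose gloss overlaps most with the context."""
    context = set(context_tokens)
    if pos is not None:
        # Optionally restrict the candidates to one part of speech.
        senses = [s for s in senses if s[1] == pos]
    best = max(senses, key=lambda s: len(context & set(s[2].split())))
    return best[0]


ctx = "When you book tickets online they provide you with a book of stamps".split()
print(simple_lesk(ctx, SENSES))           # book_bound_sheets
print(simple_lesk(ctx, SENSES, pos="v"))  # book_reserve
```

Function words such as *a* and *of* count toward the overlap in this sketch, which is one reason the result can be dominated by an unintended sense when no part of speech is given.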


# Part 2

### POS Tagging

In [4]:
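def pos_tagger(sen):
    """Tag each token of `sen` separately; returns a list of one-item lists."""
    tagged = []
    tokenized = wordpunct_tokenize(sen)
    for i in tokenized:
        # Each token is tagged on its own, so pos_tag sees no surrounding
        # context. This may explain why 'book' is tagged NN below.
        words = word_tokenize(i)
        tagged.append((pos_tag(words)))
    return tagged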

In [18]:
pos_tagger("The agent will book the to the show for the entire family.")

[[('The', 'DT')],
 [('agent', 'NN')],
 [('will', 'MD')],
 [('book', 'NN')],
 [('the', 'DT')],
 [('to', 'TO')],
 [('the', 'DT')],
 [('show', 'NN')],
 [('for', 'IN')],
 [('the', 'DT')],
 [('entire', 'JJ')],
 [('family', 'NN')],
 [('.', '.')]]

#### Corrected sentences

In [19]:
corrected = [[('The', 'DT')],
 [('agent', 'NN')],
 [('will', 'MD')],
 [('book', 'VB')],
 [('the', 'DT')],
 [('to', 'TO')],
 [('the', 'DT')],
 [('show', 'NN')],
 [('for', 'IN')],
 [('the', 'DT')],
 [('entire', 'JJ')],
 [('family', 'NN')],
 [('.', '.')]]
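
Note that `corrected` is a list of one-element lists, whereas taggers are trained on sentences that are flat lists of `(word, tag)` pairs. If it were to be used as training data, a small helper along these lines could convert one form to the other. The name `flatten_tagged` is hypothetical.

```python
def flatten_tagged(nested):
    """Turn [[(word, tag)], [(word, tag)], ...] into [(word, tag), ...]."""
    return [pair for group in nested for pair in group]


nested = [[('will', 'MD')], [('book', 'VB')], [('the', 'DT')]]
flat = flatten_tagged(nested)
print(flat)  # [('will', 'MD'), ('book', 'VB'), ('the', 'DT')]
```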

In [5]:
sen = "He will book a ticket today. The Amazon Books Store offers you millions of titles. I have a book to finish. How to book a cab?"

In [6]:
f = "He\\PRP will\\VB book\\VB a\\DT ticket\\NN today\\NN .\\PUNCT The\\DT Amazon\\NN Books\\NNS Store\\NN offers\\NNS you\\PRP millions\\NNS of\\IN titles\\NNS .\\PUNCT I\\PRP have\\VB a\\DT book\\NN to\\TO finish\\VB .\\PUNCT How\\WRB to\\TO book\\VB a\\DT cab\\NN ?\\PUNCT "
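
The string above uses a `word\TAG` notation; the backslash is doubled only because it is written as a Python string literal. As a rough sketch of how such text maps to `(word, tag)` pairs, the hypothetical helper below splits on whitespace and then on the last separator in each token. NLTK's corpus readers do this parsing themselves, so the helper is only meant for inspecting the data.

```python
def parse_tagged(text, sep="\\"):
    """Parse 'word<sep>TAG word<sep>TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last separator, so a word that itself
        # contains the separator keeps everything before the final one.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


sample = "How\\WRB to\\TO book\\VB a\\DT cab\\NN ?\\PUNCT"
parsed = parse_tagged(sample)
print(parsed)
```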

#### Import modules for brill tagger

In [7]:
import nltk
from nltk.tag import RegexpTagger, UnigramTagger, pos_tag
from nltk.tag import brill, brill_trainer



#### Import the train file.

In [27]:
from nltk.corpus.reader import TaggedCorpusReader
# t001.txt holds word\TAG tokens; each line of the file is read as one sentence.
train_corpus = TaggedCorpusReader('.', 't001.txt', sep="\\")
p = list(train_corpus.tagged_sents())

#### Training Data

In [30]:
print(p)

[[('I', 'PRP'), ('will', 'VB'), ('book', 'VB'), ('a', 'DT'), ('ticket', 'NN'), ('today', 'NN'), ('.', 'PUNCT'), ('The', 'DT'), ('Amazon', 'NN'), ('Books', 'NNS'), ('Store', 'NN'), ('offers', 'NNS'), ('you', 'PRP'), ('millions', 'NNS'), ('of', 'IN'), ('titles', 'NNS'), ('.', 'PUNCT'), ('I', 'PRP'), ('have', 'VB'), ('a', 'DT'), ('book', 'NN'), ('to', 'TO'), ('finish', 'VB'), ('He', 'PRP'), ('will', 'VB'), ('book', 'VB'), ('tickets', 'NNS'), ('today', 'NN'), ('.', 'PUNCT'), ('.', 'PUNCT'), ('How', 'WRB'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('cab', 'NN'), ('?', 'PUNCT'), ('I', 'PRP'), ('will', 'VB'), ('book', 'VB'), ('a', 'DET'), ('ticket', 'NN'), ('for', 'ADP'), ('today', 'NN'), ('evening', 'NN'), ('.', 'PUNCT'), ('I', 'PRP'), ('have', 'VB'), ('booked', 'VB'), ('a', 'DET'), ('room', 'NN'), ('for', 'ADP'), ('entire', 'ADJ'), ('family', 'NN'), ('at', 'ADP'), ('the', 'DET'), ('Park', 'NN'), ('Hotel', 'NN'), ('.', 'PUNCT')]]
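
Looking closely at the training data, two tag sets appear to be mixed: for example, `a` is tagged both `DT` and `DET`. A quick check such as the following sketch lists every word that carries more than one tag, which helps separate genuine ambiguity (such as *book*) from annotation inconsistency. The helper name and the shortened sample are assumptions for illustration.

```python
from collections import defaultdict


def words_with_multiple_tags(tagged_sents):
    """Map each lowercased word to its sorted tags, keeping only words with 2+ tags."""
    tags_by_word = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}


# Shortened sample in the same shape as the training data above.
sample = [[('I', 'PRP'), ('will', 'VB'), ('book', 'VB'), ('a', 'DT'),
           ('ticket', 'NN'), ('.', 'PUNCT'), ('I', 'PRP'), ('have', 'VB'),
           ('a', 'DET'), ('book', 'NN'), ('.', 'PUNCT')]]
ambiguous = words_with_multiple_tags(sample)
print(ambiguous)
```

Calling `words_with_multiple_tags(p)` on the full training data would give the complete list.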


#### Load pos tagger.

In [14]:
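from nltk.data import load
# Despite its name, tagdict holds the pre-trained tagger that serves as the
# Brill tagger's initial tagger.
tagdict = load('taggers/maxent_treebank_pos_tagger/english.pickle')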

#### Train and test brill tagger.

In [31]:
def train_brill_tagger(train_data):
    """Train a Brill tagger on train_data, starting from the pre-trained tagger."""
    # Templates describe which context a rule may condition on: tags (Pos)
    # or words (Word) at the given offsets from the current token.
    templates = [
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Pos([1])),
        brill.Template(brill.Pos([-2])),
        brill.Template(brill.Pos([2])),
        brill.Template(brill.Pos([-2, -1])),
        brill.Template(brill.Pos([1, 2])),
        brill.Template(brill.Pos([-3, -2, -1])),
        brill.Template(brill.Pos([1, 2, 3])),
        brill.Template(brill.Pos([-1]), brill.Pos([1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([1])),
        brill.Template(brill.Word([-2])),
        brill.Template(brill.Word([2])),
        brill.Template(brill.Word([-2, -1])),
        brill.Template(brill.Word([1, 2])),
        brill.Template(brill.Word([-3, -2, -1])),
        brill.Template(brill.Word([1, 2, 3])),
        brill.Template(brill.Word([-1]), brill.Word([1]))]
    # trace=3 makes the trainer print its progress; at most 10 rules are kept.
    trainer = brill_trainer.BrillTaggerTrainer(tagdict, templates=templates, trace=3)
    brill_tagger = trainer.train(train_data, max_rules=10)
    return brill_tagger

mt = train_brill_tagger(p)

TBL train (fast) (seqs: 1; tokens: 58; tpls: 18; min score: 2; min acc: None)
Finding initial useful rules...
    Found 565 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
   5   5   0   0  | .->PUNCT if Pos:NN@[-2,-1]
   4   4   0   1  | NNP->NN if Pos:DT@[-3,-2,-1]
   3   3   0   0  | IN->ADP if Pos:NN@[-1]
   3   3   0   0  | MD->VB if Pos:PRP@[-1]
   2   2   0   0  | VBP->VB if Pos:PRP@[-1]
   2   2   0   0  | DT->DET if Pos:ADP@[2]


#### Use Brill Tagger for tagging

In [21]:
print(mt.tag(sent1))

[('The', 'DT'), ('agent', 'NN'), ('will', 'MD'), ('book', 'VB'), ('the', 'DT'), ('to', 'TO'), ('the', 'DET'), ('show', 'NN'), ('for', 'ADP'), ('the', 'DT'), ('entire', 'JJ'), ('family', 'NN'), ('.', 'PUNCT')]


In [22]:
print(mt.tag(sent2))

[('But', 'CC'), ('you', 'PRP'), ('can', 'VB'), ('generally', 'RB'), ('book', 'VB'), ('tickets', 'NNS'), ('online', 'NN'), ('.', 'PUNCT')]


In [23]:
print(mt.tag(sent3))

[('When', 'WRB'), ('you', 'PRP'), ('book', 'VB'), ('tickets', 'NNS'), ('online', 'VBP'), ('they', 'PRP'), ('provide', 'VB'), ('you', 'PRP'), ('with', 'IN'), ('a', 'DET'), ('book', 'NN'), ('of', 'ADP'), ('stamps', 'NNS')]


In [25]:
print(mt.tag(word_tokenize('I bought this book from Kerala')))

[('I', 'PRP'), ('bought', 'VBD'), ('this', 'DET'), ('book', 'NN'), ('from', 'ADP'), ('Kerala', 'NN')]


In [26]:
print(mt.tag(word_tokenize("He will book tickets to Kerala")))

[('He', 'PRP'), ('will', 'VB'), ('book', 'VB'), ('tickets', 'NNS'), ('to', 'TO'), ('Kerala', 'NNP')]
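
Part 3 is not carried out above. One way it might be approached, sketched here under assumptions, is to convert the Brill tagger's tag for *book* into a WordNet part of speech and pass it to `lesk` through its `pos` argument. Both `to_wordnet_pos` and `lesk_with_tags` are hypothetical helpers, and the mapping also accepts the `ADJ` tag that appears in the training data.

```python
def to_wordnet_pos(tag):
    """Map a Penn Treebank style tag to a WordNet POS letter, or None."""
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('N'):
        return 'n'
    if tag.startswith('J') or tag == 'ADJ':
        return 'a'
    if tag.startswith('RB'):
        return 'r'
    return None


def lesk_with_tags(tagged_sent, word):
    """Run Lesk on `word`, restricted to the POS implied by its tag."""
    # Imported here so the mapping above can be used without NLTK;
    # this call needs NLTK and its WordNet data to be installed.
    from nltk.wsd import lesk

    tokens = [w for w, _ in tagged_sent]
    # Uses the tag of the first occurrence of `word` only.
    tag = next(t for w, t in tagged_sent if w == word)
    return lesk(tokens, word, pos=to_wordnet_pos(tag))


print(to_wordnet_pos('VB'), to_wordnet_pos('NN'))  # v n
# Possible usage with the objects defined earlier:
# print(lesk_with_tags(mt.tag(sent1), 'book'))
```

Because only the first occurrence is considered, a sentence such as the third one, where *book* appears as both verb and noun, would need each occurrence handled separately.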
