## POS Tagging, HMMs, Viterbi

Let's learn how to do POS tagging with the Viterbi heuristic, using the tagged Treebank corpus. Before going through the code, let's first look at an outline of the steps.

1. Tagged Treebank corpus is available (sample data split into training and test data sets)
   - Basic text and structure exploration
2. Creating HMM model on the tagged data set.
   - Calculating Emission Probability: P(observation|state)
   - Calculating Transition Probability: P(state2|state1)
3. Developing algorithm for Viterbi Heuristic
4. Checking accuracy on the test data set


## 1. Exploring Treebank Tagged Corpus

In [1]:
#Importing libraries
import nltk, re, pprint
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize

In [2]:
# reading the Treebank tagged sentences
# (run nltk.download('treebank') once if the corpus is not installed locally)
wsj = list(nltk.corpus.treebank.tagged_sents())

In [3]:
# first 40 tagged sentences
print(wsj[:40])

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ','), ('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')], [('A', 'DT'), ('f

In [4]:
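# Splitting into train and test
# Note: random.seed() seeds Python's random module only; scikit-learn's splitter
# does not use it, so pass random_state=<int> to train_test_split for a repeatable split.
random.seed(1234)
train_set, test_set = train_test_split(wsj,test_size=0.3)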

print(len(train_set))
print(len(test_set))
print(train_set[:40])

2739
1175
[[('Three', 'CD'), ('divisions', 'NNS'), ('at', 'IN'), ('American', 'NNP'), ('Express', 'NNP'), ('*ICH*-1', '-NONE-'), ('are', 'VBP'), ('working', 'VBG'), ('with', 'IN'), ('Buick', 'NNP'), ('on', 'IN'), ('the', 'DT'), ('promotion', 'NN'), (':', ':'), ('the', 'DT'), ('establishment', 'NN'), ('services', 'NNS'), ('division', 'NN'), (',', ','), ('which', 'WDT'), ('*T*-57', '-NONE-'), ('is', 'VBZ'), ('responsible', 'JJ'), ('for', 'IN'), ('all', 'DT'), ('merchants', 'NNS'), ('and', 'CC'), ('companies', 'NNS'), ('that', 'WDT'), ('*T*-2', '-NONE-'), ('accept', 'VBP'), ('the', 'DT'), ('card', 'NN'), (';', ':'), ('the', 'DT'), ('travel', 'NN'), ('division', 'NN'), (';', ':'), ('and', 'CC'), ('the', 'DT'), ('merchandise', 'NN'), ('sales', 'NNS'), ('division', 'NN'), ('.', '.')], [('Adds', 'VBZ'), ('*ICH*-1', '-NONE-'), ('Mitsui', 'NNP'), ("'s", 'POS'), ('Mr.', 'NNP'), ('Klauser', 'NNP'), (':', ':'), ('``', '``'), ('Unlike', 'IN'), ('corporations', 'NNS'), ('in', 'IN'), ('this', 'DT'), 

In [5]:
# Getting list of tagged words
train_tagged_words=[tup for sent in train_set for tup in sent]
print(len(train_tagged_words))

70514


In [6]:
# tokens 
tokens = [pair[0] for pair in train_tagged_words]
tokens[:10]

['Three',
 'divisions',
 'at',
 'American',
 'Express',
 '*ICH*-1',
 'are',
 'working',
 'with',
 'Buick']

In [7]:
# vocabulary
V = set(tokens)
print(len(V))

10231


In [8]:
# number of tags
T = set([pair[1] for pair in train_tagged_words])
len(T)

46

In [9]:
print(T)

{':', 'NNP', 'MD', 'JJR', 'POS', 'WP$', 'TO', '-LRB-', 'PRP$', 'RBR', 'PDT', 'VBP', 'WP', 'FW', '-RRB-', 'RBS', 'VB', 'LS', 'EX', 'NNS', 'SYM', 'VBD', 'UH', 'VBN', '$', 'VBG', 'JJ', 'NNPS', 'PRP', 'VBZ', 'RB', ',', 'NN', 'IN', '#', 'WRB', 'DT', 'RP', 'WDT', '.', "''", '-NONE-', 'CC', 'JJS', '``', 'CD'}


## 2. POS Tagging Algorithm - HMM

We'll use an HMM to tag the words. Given a sequence of words to be tagged, the task is to assign the most probable tag to each word.

In other words, to every word w, assign the tag t that maximises the posterior probability P(t|w). By Bayes' rule, P(t|w) = P(w|t) * P(t) / P(w). Since P(w) is the same for every candidate tag, it can be ignored, which leaves P(w|t) and P(t) to compute.
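The argmax above can be sketched in a few lines. This is a minimal illustration, not the notebook's tagger: the tables `toy_emission` and `toy_prior` and the helper `best_tag` are hypothetical, and their numbers are made up for the example.

```python
# Hypothetical probabilities, for illustration only (not computed from the corpus).
toy_emission = {            # P(w|t)
    ("will", "MD"): 0.30,
    ("will", "NN"): 0.001,
    ("book", "NN"): 0.0004,
    ("book", "VB"): 0.0001,
}
toy_prior = {"MD": 0.01, "NN": 0.13, "VB": 0.03}   # P(t)

def best_tag(word, tags=toy_prior):
    """Return the tag t that maximises P(w|t) * P(t); P(w) is constant and dropped."""
    return max(tags, key=lambda t: toy_emission.get((word, t), 0.0) * toy_prior[t])

print(best_tag("will"))
print(best_tag("book"))
```

Pairs missing from `toy_emission` are treated as probability zero, which is the same behaviour a count-based estimate gives for unseen (word, tag) pairs.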


P(w|t) is the probability that a token carrying a given tag t (say NN) is the word w (say 'building'). It can be estimated as the fraction of all NN tokens that are equal to w, i.e.

P(w|t) = count(w, t) / count(t)
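As a quick check of the formula, here is a minimal sketch on a made-up corpus. The list `toy_tagged_words` and the function `emission_prob` are hypothetical names introduced for this example only.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs, for illustration only.
toy_tagged_words = [
    ("the", "DT"), ("building", "NN"), ("is", "VBZ"), ("tall", "JJ"),
    ("the", "DT"), ("coat", "NN"), ("is", "VBZ"), ("blue", "JJ"),
    ("building", "VBG"), ("a", "DT"), ("building", "NN"),
]

pair_counts = Counter(toy_tagged_words)                    # count(w, t)
tag_counts = Counter(tag for _, tag in toy_tagged_words)   # count(t)

def emission_prob(word, tag):
    """P(w|t) = count(w, t) / count(t); 0.0 if the tag was never seen."""
    if tag_counts[tag] == 0:
        return 0.0
    return pair_counts[(word, tag)] / tag_counts[tag]

print(emission_prob("building", "NN"))   # 2 of the 3 NN tokens are 'building'
print(emission_prob("building", "JJ"))   # 'building' never appears with JJ
```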


The term P(t) is the probability of tag t. In a tagging task, we assume that a tag depends only on the previous tag, so P(t) is modelled as the transition probability P(t(n)|t(n-1)). For example, if t(n-1) is JJ, then t(n) is likely to be NN, since adjectives often precede nouns (blue coat, tall building, etc.).
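Under this assumption, the probability of a whole tag sequence is a product of one-step transition probabilities. The sketch below illustrates that product. The table `toy_transitions`, the start marker `<s>`, and the function `tag_sequence_prob` are hypothetical and the numbers are invented.

```python
# Hypothetical transition probabilities P(t(n) | t(n-1)), for illustration only.
# "<s>" is an assumed start-of-sentence marker, not a Treebank tag.
toy_transitions = {
    ("<s>", "DT"): 0.5,
    ("DT", "JJ"): 0.2,
    ("DT", "NN"): 0.5,
    ("JJ", "NN"): 0.5,
}

def tag_sequence_prob(tags, transitions, start="<s>"):
    """Multiply P(t(n) | t(n-1)) along the sequence; unseen pairs count as 0."""
    prob = 1.0
    prev = start
    for tag in tags:
        prob *= transitions.get((prev, tag), 0.0)
        prev = tag
    return prob

print(tag_sequence_prob(["DT", "JJ", "NN"], toy_transitions))  # 0.5 * 0.2 * 0.5
print(tag_sequence_prob(["DT", "NN"], toy_transitions))        # 0.5 * 0.5
```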


Given the Penn Treebank tagged dataset, we can compute the two terms P(w|t) and P(t) and store them in two large matrices. The matrix of P(w|t) will be sparse, since most words are never seen with most tags, and those entries are zero.
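With the 46 tags and 10231 vocabulary words counted earlier, a dense emission matrix has 46 x 10231 = 470,626 cells. One possible alternative, sketched here on a made-up corpus, is to store only the non-zero counts in nested dictionaries. The names `toy_tagged_words`, `emission_counts`, and `lookup_emission` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only.
toy_tagged_words = [
    ("the", "DT"), ("book", "NN"), ("will", "MD"),
    ("book", "VB"), ("the", "DT"), ("will", "NN"),
]

# tag -> Counter of words seen with that tag; unseen pairs are simply absent.
emission_counts = defaultdict(Counter)
for word, tag in toy_tagged_words:
    emission_counts[tag][word] += 1

def lookup_emission(word, tag):
    """Read P(w|t) from the sparse store; missing entries are 0.0."""
    counts = emission_counts.get(tag)
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())

stored_cells = sum(len(c) for c in emission_counts.values())
dense_cells = len(emission_counts) * len({w for w, _ in toy_tagged_words})
print(stored_cells, dense_cells)   # non-zero entries vs. full tag x word grid
print(lookup_emission("book", "NN"))
```

Even on this tiny example fewer than half of the cells are populated. The dense `np.zeros` matrix remains simpler to index, so which layout to prefer is a design choice.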


### Emission Probabilities

In [10]:
# allocating a T x V matrix (tags x vocabulary) to hold P(w|t)
t = len(T)
v = len(V)
w_given_t = np.zeros((t, v))

In [11]:
# compute word given tag: Emission Probability
def word_given_tag(word, tag, train_bag = train_tagged_words):
    """Return (count(word, tag), count(tag)); their ratio is P(word|tag)."""
    tag_list = [pair for pair in train_bag if pair[1]==tag]
    count_tag = len(tag_list)  # total number of tokens carrying this tag
    w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
    count_w_given_tag = len(w_given_tag_list)  # how many of those tokens are `word`

    return (count_w_given_tag, count_tag)

In [12]:
# examples

# large
print("\n", "large")
print(word_given_tag('large', 'JJ'))
print(word_given_tag('large', 'VB'))
print(word_given_tag('large', 'NN'), "\n")

# will
print("\n", "will")
print(word_given_tag('will', 'MD'))
print(word_given_tag('will', 'NN'))
print(word_given_tag('will', 'VB'))

# book
print("\n", "book")
print(word_given_tag('book', 'NN'))
print(word_given_tag('book', 'VB'))


 large
(22, 4087)
(0, 1778)
(0, 9262) 


 will
(197, 655)
(0, 9262)
(0, 1778)

 book
(4, 9262)
(0, 1778)
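Note that `word_given_tag` returns a pair of counts rather than a probability. A minimal sketch of the conversion, using count pairs copied from the output above (the helper name `to_probability` is hypothetical):

```python
def to_probability(count_pair):
    """Turn a (numerator, denominator) count pair into a probability."""
    numerator, denominator = count_pair
    # Guard against a tag that never occurs in the training data.
    return numerator / denominator if denominator else 0.0

# Count pairs copied from the output above.
print(round(to_probability((22, 4087)), 5))    # P('large' | JJ)
print(round(to_probability((197, 655)), 5))    # P('will' | MD)
print(round(to_probability((4, 9262)), 5))     # P('book' | NN)
```

The zero counts for ('will', 'NN') and ('book', 'VB') mean these word and tag pairs were not observed in this training split, so their estimated emission probability is 0.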


### Transition Probabilities

In [13]:
# compute tag given tag: tag2 (t2) given tag1 (t1), i.e. Transition Probability

def t2_given_t1(t2, t1, train_bag = train_tagged_words):
    """Return (count(t1 followed by t2), count(t1)); their ratio is P(t2|t1)."""
    tags = [pair[1] for pair in train_bag]
    count_t1 = len([t for t in tags if t==t1])
    count_t2_t1 = 0
    # scan adjacent tag pairs in the flattened training sequence
    for index in range(len(tags)-1):
        if tags[index]==t1 and tags[index+1] == t2:
            count_t2_t1 += 1
    return (count_t2_t1, count_t1)

In [14]:
# examples
print(t2_given_t1(t2='NNP', t1='JJ'))
print(t2_given_t1('NN', 'JJ'))
print(t2_given_t1('NN', 'DT'))
print(t2_given_t1('NNP', 'VB'))
print(t2_given_t1(',', 'NNP'))
print(t2_given_t1('PRP', 'PRP'))
print(t2_given_t1('VBG', 'NNP'))

(135, 4087)
(1816, 4087)
(2723, 5756)
(67, 1778)
(976, 6504)
(2, 1200)
(6, 6504)
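Each call to `t2_given_t1` rescans the whole tag list, which adds up when all 46 x 46 tag pairs are needed. One possible alternative is to count every bigram once and look the results up afterwards. The sketch below runs on a made-up tag list; `build_transition_counter` and `t2_given_t1_fast` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def build_transition_counter(tags):
    """Count every tag and every (t1, t2) bigram in one pass over `tags`."""
    tag_counts = Counter(tags)
    bigram_counts = Counter(zip(tags, tags[1:]))

    def t2_given_t1_fast(t2, t1):
        # Same (count_t2_t1, count_t1) shape that t2_given_t1 returns.
        return (bigram_counts[(t1, t2)], tag_counts[t1])

    return t2_given_t1_fast

# Hypothetical toy tag sequence, for illustration only.
toy_tags = ["DT", "JJ", "NN", ".", "DT", "NN", "VBZ", "JJ", "."]
lookup = build_transition_counter(toy_tags)
print(lookup("NN", "JJ"))
print(lookup("NN", "DT"))
```

On the real data the tag list would be `[pair[1] for pair in train_tagged_words]`, built once and passed to the builder.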


In [15]:
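# Please note: P(tag|start) is approximated by P(tag|'.'), since in the flattened
# training sequence a sentence usually begins right after a '.'-tagged token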
print(t2_given_t1('DT', '.'))
print(t2_given_t1('VBG', '.'))
print(t2_given_t1('NN', '.'))
print(t2_given_t1('NNP', '.'))


(563, 2710)
(10, 2710)
(115, 2710)
(515, 2710)
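Using '.' as a stand-in for the sentence start is an approximation: it misses the very first sentence and any sentence that does not end in a '.'-tagged token. A possible alternative, sketched here on made-up sentences, is to count the first tag of each sentence directly. The names `toy_sents`, `start_counts`, and `start_prob` are hypothetical.

```python
from collections import Counter

# Hypothetical toy sentences in the same (word, tag) format as train_set.
toy_sents = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
    [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")],
]

# Tag of the first token of every non-empty sentence.
start_counts = Counter(sent[0][1] for sent in toy_sents if sent)
total_sents = sum(start_counts.values())

def start_prob(tag):
    """P(tag | start) estimated from sentence-initial tags only."""
    return start_counts[tag] / total_sents if total_sents else 0.0

print(start_prob("DT"))    # 2 of the 3 toy sentences start with DT
print(start_prob("NN"))    # no toy sentence starts with NN
```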
