# AnTeDe Lab D: Your own HMM PoS tagger

## Session goal
The goal of this session is to build your own PoS tagger. This notebook is based on a programming assignment written by Konstantin Taranov 
and Ondrej Skopek for a Natural Language Understanding course offered at ETH Zurich in spring 2019.

In [53]:
import nltk
nltk.download('brown')
from nltk.corpus import brown as corpus

[nltk_data] Downloading package brown to /Users/Daniele/nltk_data...
[nltk_data]   Package brown is already up-to-date!


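We place START and END tags between the individual sentences in the corpus, so that tag bigrams never span a sentence boundary. In the same loop we count how often each tag opens a sentence, which gives us the starting probabilities.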

In [54]:
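tagged_words = []   # (tag, word) pairs: the tag is the condition, the word the sample
all_tags = []       # tag sequence, with START/END markers around every sentence

starting_p = {}     # relative frequency of each tag in sentence-initial position
count = 0           # number of sentence-initial tokens, i.e. sentences

sentences = corpus.tagged_sents(tagset='universal')

for sent in sentences:
    tagged_words.append(("START", "START"))
    all_tags.append("START")
    start = True
    for word, tag in sent:
        # register every tag, so that tags that never open a sentence get 0
        starting_p.setdefault(tag, 0)
        if start:
            starting_p[tag] += 1
            count += 1
            start = False

        all_tags.append(tag)
        tagged_words.append((tag, word))
    tagged_words.append(("END", "END"))
    all_tags.append("END")

# turn the sentence-initial counts into relative frequencies
for tag in starting_p:
    starting_p[tag] = starting_p[tag] / count

print(starting_p)
print(count)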

{'DET': 0.21342867108475758, 'NOUN': 0.1411405650505755, 'ADJ': 0.034339030345308684, 'VERB': 0.04513428671084758, 'ADP': 0.1228461806766655, '.': 0.08892570631321939, 'ADV': 0.0913498430415068, 'CONJ': 0.049128008371119636, 'PRT': 0.036675967910708054, 'PRON': 0.15969654691314963, 'NUM': 0.016811998604813395, 'X': 0.0005231949773282176}
57340


In [55]:
print (tagged_words[0:10])  
print (all_tags[0:10])  

[('START', 'START'), ('DET', 'The'), ('NOUN', 'Fulton'), ('NOUN', 'County'), ('ADJ', 'Grand'), ('NOUN', 'Jury'), ('VERB', 'said'), ('NOUN', 'Friday'), ('DET', 'an'), ('NOUN', 'investigation')]
['START', 'DET', 'NOUN', 'NOUN', 'ADJ', 'NOUN', 'VERB', 'NOUN', 'DET', 'NOUN']


We use NLTK to estimate the transition probabilities and the emission probabilities as conditional probabilities.

Compute $P(t_{i} | t_{i-1})= \frac{C(t_{i-1}, t_{i})}{C(t_{i-1})}$

In [56]:
cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(all_tags))
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)

In [57]:
print("('ADV', 'ADJ') appears",
      cfd_tags['ADV']['ADJ'], " times" )

print ("ADV appears "+str(cfd_tags['ADV'].N())+" times")


# P('ADJ' | 'ADV')
print("The probability of P('ADJ'|'ADV') is roughly",
      round(cpd_tags['ADV'].prob('ADJ'), 4))

('ADV', 'ADJ') appears 7666  times
ADV appears 56239 times
The probability of P('ADJ'|'ADV') is roughly 0.1363
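
The printed figures can be checked by hand: 7666 / 56239 is roughly 0.1363. As a minimal sketch of what the maximum-likelihood estimate does, the same computation can be written in plain Python on a toy tag sequence. The names `toy_tags`, `bigram_counts`, `context_counts` and `transition_mle` are made up for this illustration and are not part of the notebook or of NLTK.

```python
from collections import Counter

# Toy tag sequence (made-up data): two short sentences with START/END markers.
toy_tags = ["START", "DET", "NOUN", "VERB", "END",
            "START", "DET", "ADJ", "NOUN", "END"]

# C(t_{i-1}, t_i): how often each pair of adjacent tags occurs.
bigram_counts = Counter(zip(toy_tags, toy_tags[1:]))

# C(t_{i-1}): how often each tag occurs as the first element of a bigram.
context_counts = Counter(toy_tags[:-1])

def transition_mle(previous_tag, tag):
    """Maximum-likelihood estimate of P(tag | previous_tag)."""
    return bigram_counts[(previous_tag, tag)] / context_counts[previous_tag]

p_noun_given_det = transition_mle("DET", "NOUN")   # DET is followed by NOUN once out of twice
p_det_given_start = transition_mle("START", "DET")  # every toy sentence starts with DET

# The same ratio with the counts printed by the notebook.
p_adj_given_adv = 7666 / 56239

print(p_noun_given_det, p_det_given_start, round(p_adj_given_adv, 4))
```

Note that the denominator counts a tag only when something follows it, which is why the last element of the sequence is left out of `context_counts`.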


Compute $P(w_{i} | t_{i}) =  \frac{C(t_{i}, w_{i})}{C(t_{i})}$

In [58]:
cfd_tw = nltk.ConditionalFreqDist(tagged_words)
cpd_tw = nltk.ConditionalProbDist(cfd_tw, nltk.MLEProbDist)

In [59]:
# C('NOUN', 'dog')
print("Frequency of C('NOUN', 'dog') is",
      cfd_tw['NOUN']['dog'])

# P('dog' | 'NOUN')
print("Probability of P('dog' | 'NOUN') is",
      cpd_tw['NOUN'].prob('dog') )

Frequency of C('NOUN', 'dog') is 70
Probability of P('dog' | 'NOUN') is 0.0002540300045725401
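
The emission estimate follows the same pattern, with the tag as the condition and the word as the sample. The sketch below mirrors that on made-up data, using hypothetical names (`toy_tagged_words`, `emission_counts`, `emission_mle`). It also shows a property of the plain maximum-likelihood estimate that is worth keeping in mind: a word never seen with a tag gets probability 0.

```python
from collections import Counter, defaultdict

# Toy (tag, word) pairs (made-up data), in the same order as tagged_words.
toy_tagged_words = [("DET", "the"), ("NOUN", "dog"), ("VERB", "barks"),
                    ("DET", "a"), ("NOUN", "dog"), ("NOUN", "cat")]

# C(t_i, w_i): one word counter per tag.
emission_counts = defaultdict(Counter)
for tag, word in toy_tagged_words:
    emission_counts[tag][word] += 1

def emission_mle(tag, word):
    """Maximum-likelihood estimate of P(word | tag)."""
    return emission_counts[tag][word] / sum(emission_counts[tag].values())

p_dog = emission_mle("NOUN", "dog")        # 2 of the 3 NOUN tokens are "dog"
p_unseen = emission_mle("NOUN", "Gondor")  # never seen as a NOUN in the toy data

print(p_dog, p_unseen)
```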


Below is the code for the Viterbi algorithm, adapted from notebook 8c.

In [60]:
def viterbi(observations, states, starting_p, transition_p, emission_p):
    """Fill in and return the Viterbi trellis for a sequence of observations.

    transition_p[previous_state][state] must hold P(state | previous_state),
    emission_p[state][word] must hold P(word | state).
    """
    # the trellis is a list of dictionaries, one per observation (column)
    trellis = [{}]

    # first column of the trellis:
    # how likely we are to start in each state, multiplied by
    # how likely that state is to generate the first observation
    for state in states:
        trellis[0][state] = {
            "probability": starting_p[state] * emission_p[state][observations[0]],
            "previous": None,
        }

    # loop over the remaining trellis columns, left to right
    for k in range(1, len(observations)):

        # add a column
        new_column = {}

        # for each row (state) in the column
        for state in states:

            max_path_p = 0

            # keep the most probable path leading into this state;
            # note that nothing is selected if every incoming path has
            # probability 0 (for example, for a word never seen in training)
            for previous_state in states:

                up_to_here_p = (trellis[k-1][previous_state]["probability"]
                                * transition_p[previous_state][state])

                if up_to_here_p > max_path_p:
                    max_path_p = up_to_here_p
                    prev_st_selected = previous_state

            max_p = max_path_p * emission_p[state][observations[k]]

            new_column[state] = {"probability": max_p,
                                 "previous": prev_st_selected}

        trellis.append(new_column)

    return trellis
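
In formulas, the trellis filled in by this function corresponds to the usual Viterbi recurrence. The notation is introduced here for illustration only: $v_k(s)$ is the value stored under "probability" for state $s$ in column $k$, $\pi(s)$ is the starting probability of $s$, and $w_k$ is the $k$-th observation.

$v_0(s) = \pi(s) \, P(w_0 | s)$

$v_k(s) = P(w_k | s) \, \max_{s'} \left[ v_{k-1}(s') \, P(s | s') \right]$

The state $s'$ that attains the maximum is what the code stores as the back-pointer of $s$ in column $k$.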

Here we get our transition probabilities in the same format we had them in notebook 8c. Let's check that we still get the same P('ADJ'|'ADV') as before.

In [61]:
tagset = list(starting_p.keys())
print (tagset)

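# transition_p[previous_tag][tag] = P(tag | previous_tag),
# which is how viterbi() indexes it
transition_p = {}
for previous_tag in tagset:
    transition_p[previous_tag] = {}
    for tag in tagset:
        transition_p[previous_tag][tag] = cpd_tags[previous_tag].prob(tag)

# P('ADJ' | 'ADV')
print(transition_p['ADV']['ADJ'])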

['DET', 'NOUN', 'ADJ', 'VERB', 'ADP', '.', 'ADV', 'CONJ', 'PRT', 'PRON', 'NUM', 'X']
0.13631110083749712


Here you can use your code from 8c to get the PoS tag sequence chosen by your tagger.

In [62]:
def get_pos_tag_sequence(trellis, observations):
    # pick the most probable state in the last column of the trellis
    last_column = trellis[-1]
    best_state = max(last_column,
                     key=lambda state: last_column[state]["probability"])

    # follow the back-pointers from the last column to the first;
    # the back-pointer stored in column t + 1 is the state chosen for column t
    opt = [best_state]
    previous = best_state
    for t in range(len(trellis) - 2, -1, -1):
        previous = trellis[t + 1][previous]["previous"]
        opt.insert(0, previous)

    # pair each word with the tag chosen for it
    return list(zip(observations, opt))
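
As a quick sanity check of the back-pointer logic, here is a small sketch that walks a hand-written trellis from the last column back to the first. The trellis values and the helper name `backtrace` are invented for this illustration; only the dictionary layout ("probability" and "previous" keys) follows the notebook.

```python
# Hand-written trellis with two states and three columns (made-up numbers).
toy_trellis = [
    {"DET":  {"probability": 0.5,  "previous": None},
     "NOUN": {"probability": 0.1,  "previous": None}},
    {"DET":  {"probability": 0.01, "previous": "NOUN"},
     "NOUN": {"probability": 0.2,  "previous": "DET"}},
    {"DET":  {"probability": 0.05, "previous": "NOUN"},
     "NOUN": {"probability": 0.02, "previous": "NOUN"}},
]

def backtrace(trellis):
    """Return the most probable state sequence encoded in the trellis."""
    last = trellis[-1]
    # start from the most probable state in the last column
    state = max(last, key=lambda s: last[s]["probability"])
    path = [state]
    # the back-pointer in each column names the state of the column before it
    for column in reversed(trellis[1:]):
        state = column[state]["previous"]
        path.insert(0, state)
    return path

best_path = backtrace(toy_trellis)
print(best_path)
```

The path has one state per column: the last state is read off the final column, and each earlier state comes from the back-pointer of the state that follows it.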

Here's the function that we will call to run our PoS tagger.

In [63]:
import nltk.tokenize as tokenize

def hmm_postagger (sentence):
    observations = tokenize.word_tokenize(sentence)
    

    emission_p={}
    for tag in tagset:
        emission_p[tag]={}
        for word in observations:
            emission_p[tag][word]=cpd_tw[tag].prob(word)

    my_trellis = viterbi (observations, tagset, starting_p, transition_p, emission_p)
    return get_pos_tag_sequence(my_trellis, observations)

Let's test our PoS tagger. Come up with other test sentences to see how well it works.

In [64]:
sentence = "My dog is a really good dog."
result = hmm_postagger (sentence)
print (result)

[('My', 'DET'), ('dog', 'NOUN'), ('is', 'VERB'), ('a', 'DET'), ('really', 'ADV'), ('good', 'ADJ'), ('dog', 'NOUN'), ('.', '.')]


In [65]:
sentence = "Gondor has a very high income tax rate for married couples."
result = hmm_postagger (sentence)
print (result)

UnboundLocalError: local variable 'prev_st_selected' referenced before assignment
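
A likely explanation for this error is that "Gondor" never occurs in the Brown corpus, so its maximum-likelihood emission probability is 0 for every tag. Every path through the trellis then has probability 0, the test `up_to_here_p > max_path_p` never succeeds, and `prev_st_selected` is never assigned. One common remedy is to smooth the emission probabilities so that unseen words keep a small non-zero probability. The sketch below shows add-gamma (Lidstone) smoothing on made-up counts. The function name `smoothed_emission_p` and the parameters `vocab_size` and `gamma` are assumptions for this illustration, not part of the notebook.

```python
from collections import Counter

def smoothed_emission_p(word_counts, word, vocab_size, gamma=1.0):
    """Lidstone (add-gamma) estimate of P(word | tag).

    word_counts: Counter of the words observed with one tag.
    vocab_size:  number of distinct outcomes that share the probability mass,
                 including a slot for unknown words (an assumption of this sketch).
    """
    total = sum(word_counts.values())
    return (word_counts[word] + gamma) / (total + gamma * vocab_size)

# Toy counts for a single tag (made-up data).
noun_counts = Counter({"dog": 3, "cat": 1})
vocab_size = 3  # "dog", "cat", plus one slot for any unknown word

p_dog = smoothed_emission_p(noun_counts, "dog", vocab_size)         # (3 + 1) / (4 + 3)
p_cat = smoothed_emission_p(noun_counts, "cat", vocab_size)         # (1 + 1) / (4 + 3)
p_unknown = smoothed_emission_p(noun_counts, "Gondor", vocab_size)  # (0 + 1) / (4 + 3)

print(p_dog, p_cat, p_unknown)
```

NLTK also provides `nltk.LidstoneProbDist`, which could presumably be passed to `nltk.ConditionalProbDist` in place of `nltk.MLEProbDist`. Whether that alone gives good tags for unknown words on this corpus has not been checked here.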