#POS Tagging Algorithm - HMM

A Hidden Markov Model (HMM) based algorithm is used to tag the words. Given a sequence of words to be tagged, the task is to assign the most probable tag to each word.

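In other words, to every word w, assign the tag t that maximises the posterior probability P(t/w), i.e. the probability of tag t given word w. By Bayes' rule, P(t/w) = P(w/t) . P(t) / P(w). Since P(w) is the same for every candidate tag, it can be ignored, so we only have to compute P(w/t) and P(t).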
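Written out as a derivation, the step above looks like this. The hat notation for the chosen tag is an assumption of this sketch and does not appear in the text.

```latex
\begin{align*}
\hat{t} &= \arg\max_{t} P(t \mid w) \\
        &= \arg\max_{t} \frac{P(w \mid t)\, P(t)}{P(w)} \\
        &= \arg\max_{t} P(w \mid t)\, P(t)
\end{align*}
```

The last line follows because P(w) does not depend on t, so dropping it does not change which tag attains the maximum.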

P(w/t) is the emission probability: given a tag (say NN), the probability that the word is w (say 'building'). It can be estimated as the fraction of all NN-tagged tokens that are equal to w, i.e.

P(w/t) = count(w, t) / count(t).
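As a minimal sketch of this count-based estimate, the snippet below computes P(w/t) on a tiny hand-made corpus. The corpus and the helper name `emission_prob` are hypothetical and exist only for illustration; they are not taken from the Penn Treebank or from NLTK.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs, for illustration only.
tagged_words = [
    ("the", "DT"), ("tall", "JJ"), ("building", "NN"), (".", "."),
    ("they", "PRP"), ("are", "VBP"), ("building", "VBG"),
    ("a", "DT"), ("house", "NN"), (".", "."),
]

pair_counts = Counter(tagged_words)                    # count(w, t)
tag_counts = Counter(tag for _, tag in tagged_words)   # count(t)


def emission_prob(word, tag):
    """Estimate P(w/t) = count(w, t) / count(t); 0.0 if the tag was never seen."""
    if tag_counts[tag] == 0:
        return 0.0
    return pair_counts[(word, tag)] / tag_counts[tag]


# 'building' is one of the two NN tokens and the only VBG token.
print(emission_prob("building", "NN"))
print(emission_prob("building", "VBG"))
```

Because the estimate is a plain relative frequency, any (word, tag) pair that never occurs in the training data gets probability zero.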

The term P(t) is the probability of tag t. In a tagging task we assume that a tag depends only on the previous tag (the first-order Markov assumption), so P(t) is taken to be the transition probability P(t(n)/t(n-1)), which can be estimated as count(t(n-1), t(n)) / count(t(n-1)). For example, if t(n-1) is JJ, then t(n) is likely to be NN, since adjectives often precede nouns (blue coat, tall building, etc.).
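The dependence on the previous tag can be estimated from tag bigram counts. The following is a small sketch under that assumption; the toy tag sequence and the helper name `transition_prob` are hypothetical.

```python
from collections import Counter

# Hypothetical tag sequence for two toy sentences; '.' closes each sentence.
tags = ["DT", "JJ", "NN", ".", "DT", "JJ", "JJ", "NN", "."]

bigram_counts = Counter(zip(tags, tags[1:]))   # count(t(n-1), t(n))
prev_counts = Counter(tags[:-1])               # how often each tag has a successor


def transition_prob(prev_tag, tag):
    """Estimate P(t(n)/t(n-1)) = count(t(n-1), t(n)) / count(t(n-1))."""
    if prev_counts[prev_tag] == 0:
        return 0.0
    return bigram_counts[(prev_tag, tag)] / prev_counts[prev_tag]


# In this toy data, JJ is followed by NN twice and by JJ once.
print(transition_prob("JJ", "NN"))
print(transition_prob("JJ", "JJ"))
```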

Given the Penn Treebank tagged dataset, we can compute the two terms P(w/t) and P(t) and store them in two large matrices. The matrix of P(w/t) will be sparse, since most words are never seen with most tags, and those entries will be zero.
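To illustrate the sparsity, the sketch below stores only the non-zero P(w/t) entries in a nested dictionary and compares their number with the size of the full word-by-tag matrix. The toy corpus is hypothetical, and the nested-dictionary layout is one possible storage choice, not something prescribed by the text.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs, for illustration only.
tagged_words = [
    ("the", "DT"), ("tall", "JJ"), ("building", "NN"), (".", "."),
    ("they", "PRP"), ("are", "VBP"), ("building", "VBG"),
    ("a", "DT"), ("house", "NN"), (".", "."),
]

tag_counts = Counter(tag for _, tag in tagged_words)

# emission[tag][word] = P(w/t); only (word, tag) pairs seen in training are stored.
emission = defaultdict(dict)
for (word, tag), n in Counter(tagged_words).items():
    emission[tag][word] = n / tag_counts[tag]

vocab = {word for word, _ in tagged_words}
total_cells = len(vocab) * len(tag_counts)                 # size of the dense matrix
stored_cells = sum(len(row) for row in emission.values())  # non-zero entries

print(f"{stored_cells} of {total_cells} cells are non-zero")
```

A missing entry is read as zero, for example with `emission["NN"].get("the", 0.0)`.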

#Viterbi Algorithm
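Let's now use the computed probabilities P(w/tag) and P(t2/t1) to assign tags to each word in the document. We'll run through each word w, compute P(tag/w), which is proportional to P(w/tag) . P(tag/previous tag), for each tag in the tag set, and then assign the tag with the maximum value. Note that choosing one tag at a time like this is a greedy simplification; the full Viterbi algorithm keeps the best-scoring path into every tag at each position and picks the best complete sequence at the end.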

We'll store the assigned tags in a list of tuples, similar to the list 'train_tagged_words'. Each tuple will be a (token, assigned_tag). As we progress further in the list, each tag to be assigned will use the tag of the previous token.
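The word-by-word procedure described above can be sketched as follows. The toy training sentences and the helper names (`emission_prob`, `transition_prob`, `greedy_tag`) are hypothetical. Treating the start of a sentence as if it followed '.' is an assumption that matches the note on P(tag|start).

```python
from collections import Counter

# Hypothetical toy training data: one list of (word, tag) pairs per sentence.
train_sents = [
    [("the", "DT"), ("tall", "JJ"), ("building", "NN"), (".", ".")],
    [("a", "DT"), ("blue", "JJ"), ("coat", "NN"), (".", ".")],
    [("they", "PRP"), ("are", "VBP"), ("building", "VBG"),
     ("a", "DT"), ("house", "NN"), (".", ".")],
]

train_tagged_words = [pair for sent in train_sents for pair in sent]
tag_seq = [tag for _, tag in train_tagged_words]
tagset = sorted(set(tag_seq))

pair_counts = Counter(train_tagged_words)            # count(w, t)
tag_counts = Counter(tag_seq)                        # count(t)
bigram_counts = Counter(zip(tag_seq, tag_seq[1:]))   # count(t1, t2)
prev_counts = Counter(tag_seq[:-1])                  # count(t1) as a predecessor


def emission_prob(word, tag):
    return pair_counts[(word, tag)] / tag_counts[tag]


def transition_prob(prev_tag, tag):
    if prev_counts[prev_tag] == 0:
        return 0.0
    return bigram_counts[(prev_tag, tag)] / prev_counts[prev_tag]


def greedy_tag(words):
    """Assign each word the tag maximising P(w/tag) * P(tag/previous tag)."""
    tagged = []
    prev_tag = "."  # assumption: sentence start behaves like the tag after '.'
    for word in words:
        best_tag = max(
            tagset,
            key=lambda t: emission_prob(word, t) * transition_prob(prev_tag, t),
        )
        tagged.append((word, best_tag))  # (token, assigned_tag)
        prev_tag = best_tag
    return tagged


print(greedy_tag(["the", "blue", "building", "."]))
```

In this sketch a word never seen in training scores zero for every tag, so `max` simply returns the first tag in `tagset`; a practical tagger would need smoothing or a fallback rule for unknown words.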

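Note: the first word of a sentence has no previous tag, so P(tag|start) is taken to be P(tag|'.'), the probability of a tag following a sentence-ending period.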
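For comparison with assigning one tag at a time, here is a minimal sketch of Viterbi decoding as dynamic programming: at every position it keeps the best-scoring path into each tag and only commits to a tag sequence at the end. The toy data and all helper names are hypothetical, and the sentence start is again treated as following '.'. This is an illustration under those assumptions, not NLTK's implementation.

```python
from collections import Counter

# Hypothetical toy training data: one list of (word, tag) pairs per sentence.
train_sents = [
    [("the", "DT"), ("tall", "JJ"), ("building", "NN"), (".", ".")],
    [("a", "DT"), ("blue", "JJ"), ("coat", "NN"), (".", ".")],
    [("they", "PRP"), ("are", "VBP"), ("building", "VBG"),
     ("a", "DT"), ("house", "NN"), (".", ".")],
]

tagged_words = [pair for sent in train_sents for pair in sent]
tag_seq = [tag for _, tag in tagged_words]
tagset = sorted(set(tag_seq))

pair_counts = Counter(tagged_words)
tag_counts = Counter(tag_seq)
bigram_counts = Counter(zip(tag_seq, tag_seq[1:]))
prev_counts = Counter(tag_seq[:-1])


def emission_prob(word, tag):
    return pair_counts[(word, tag)] / tag_counts[tag]


def transition_prob(prev_tag, tag):
    if prev_counts[prev_tag] == 0:
        return 0.0
    return bigram_counts[(prev_tag, tag)] / prev_counts[prev_tag]


def viterbi_tag(words):
    """Return the most probable tag sequence for words as (word, tag) pairs."""
    if not words:
        return []
    # best[i][t]: probability of the best path for words[:i + 1] that ends in tag t
    # back[i][t]: the tag preceding t on that best path
    best = [{t: transition_prob(".", t) * emission_prob(words[0], t) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            prev = max(tagset, key=lambda p: best[i - 1][p] * transition_prob(p, t))
            best[i][t] = (
                best[i - 1][prev] * transition_prob(prev, t) * emission_prob(words[i], t)
            )
            back[i][t] = prev
    # Pick the best final tag, then follow the back-pointers to the start.
    tags = [max(tagset, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    tags.reverse()
    return list(zip(words, tags))


print(viterbi_tag(["they", "are", "building", "a", "house", "."]))
```

Multiplying raw probabilities underflows on long sentences, so practical implementations usually sum log-probabilities instead.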

In [2]:
#method-1
import nltk
from nltk.corpus import treebank
from sklearn.metrics import accuracy_score
nltk.download('treebank')

# Load the Penn Treebank dataset
dataset = treebank.tagged_sents()

# Split the dataset into training and testing sets
train_data = dataset[:3500]
test_data = dataset[3500:]

# Define the HMM-based POS tagger using the built-in module in NLTK
tagger = nltk.tag.HiddenMarkovModelTagger.train(train_data)

# Apply the POS tagger on the test set using the Viterbi algorithm built-in module in NLTK
predicted_tags = []
actual_tags = []

for sent in test_data:
    words = [word for word, tag in sent]
    tags = [tag for word, tag in sent]
    predicted_tags += [tag for word, tag in tagger.tag(words)]
    actual_tags += tags

# Evaluate the accuracy of the POS tagger
accuracy = accuracy_score(actual_tags, predicted_tags)
print(f"Accuracy: {accuracy:.2%}")


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


Accuracy: 89.89%


In [3]:
#method-2

import nltk
nltk.download('treebank')
from nltk.corpus import treebank


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [4]:
# Split the corpus into training (90%) and testing (10%) sets
tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:split]
test_sents = tagged_sents[split:]


In [5]:
from nltk.tag import HiddenMarkovModelTagger

# Train the HMM model
hmm_tagger = HiddenMarkovModelTagger.train(train_sents)


In [7]:
# Tag the test set using the Viterbi algorithm; tag_sents() expects untagged
# sentences, so strip the gold tags first
hmm_pred = hmm_tagger.tag_sents([[word for word, _ in sent] for sent in test_sents])

# Evaluate the accuracy of the model (accuracy() replaces the deprecated evaluate())
hmm_acc = hmm_tagger.accuracy(test_sents)
print(f"HMM Accuracy: {hmm_acc:.2%}")


HMM Accuracy: 89.93%
