# Intro

<img width=600 src="images/ner.jpg">
<p>This notebook demonstrates part-of-speech (POS) tagging with the Hidden Markov model. The data was taken from the <a href="https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus">Annotated Corpus for Named Entity Recognition</a>.</p>

# Import dependencies

In [1]:
import numpy as np
import pandas as pd

import sys
sys.path.insert(0, "..")
import mymllib
import mymllib.metrics.classification as metrics

# Load and preprocess data

<p>Before proceeding, please download the dataset using the link above and extract it to the <i>ner_dataset</i> directory like this:</p>
<p><i>
./ner_dataset/<br/>
└── ner_dataset.csv<br/>
</i></p>


Let's load the dataset: 

In [2]:
dataset = pd.read_csv("./ner_dataset/ner_dataset.csv")
dataset.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


The dataset contains both POS and named entity tags. Either can be predicted with the Hidden Markov model, but this notebook focuses only on the POS tags:

In [3]:
dataset = dataset.drop(columns="Tag")
dataset

Unnamed: 0,Sentence #,Word,POS
0,Sentence: 1,Thousands,NNS
1,,of,IN
2,,demonstrators,NNS
3,,have,VBP
4,,marched,VBN
...,...,...,...
1048570,,they,PRP
1048571,,responded,VBD
1048572,,to,TO
1048573,,the,DT


Rows tagged as punctuation are removed, since such tags neither need a probabilistic model for prediction nor seem very helpful for predicting other tags. All letters are converted to lowercase. No stemming is applied, since it might prevent the model from, for instance, distinguishing verbs in different tenses.

In [4]:
punctuation = dataset.loc[dataset.POS.isin(['.', ',', '``', ':', '$', ';'])]
dataset = dataset.drop(punctuation.index).reset_index(drop=True)
dataset.Word = dataset.Word.str.lower()
dataset

Unnamed: 0,Sentence #,Word,POS
0,Sentence: 1,thousands,NNS
1,,of,IN
2,,demonstrators,NNS
3,,have,VBP
4,,marched,VBN
...,...,...,...
962096,,they,PRP
962097,,responded,VBD
962098,,to,TO
962099,,the,DT


POS tags are encoded with numeric labels:

In [5]:
idx_to_pos = dataset.POS.astype("category").cat.categories.tolist()
pos_to_idx = {idx_to_pos[i]: i for i in range(len(idx_to_pos))}
dataset.POS = dataset.POS.map(pos_to_idx)
dataset.head()

Unnamed: 0,Sentence #,Word,POS
0,Sentence: 1,thousands,14
1,,of,5
2,,demonstrators,14
3,,have,30
4,,marched,29
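
As a small illustration of the encoding step, the sketch below applies the same category-based mapping to a toy tag column and decodes it back. The toy tags and all `toy_` names are made up for illustration; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Toy tag column (hypothetical data, not taken from the corpus)
toy_pos = pd.Series(["NNS", "IN", "NNS", "VBP", "VBN"])

# Categories are sorted, so the label of a tag is its position in sorted order
toy_idx_to_pos = toy_pos.astype("category").cat.categories.tolist()
toy_pos_to_idx = {pos: idx for idx, pos in enumerate(toy_idx_to_pos)}

# Encode tags as integers, then decode them back to check the round trip
encoded = toy_pos.map(toy_pos_to_idx)
decoded = encoded.map(lambda idx: toy_idx_to_pos[idx])

print(toy_idx_to_pos)    # ['IN', 'NNS', 'VBN', 'VBP']
print(encoded.tolist())  # [1, 0, 1, 3, 2]
```

Keeping `idx_to_pos` around is what makes it possible to turn predicted labels back into readable tags later.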


The strings and NaNs in the `Sentence #` column are replaced with numeric sentence indices:

In [6]:
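# Rows that still carry a "Sentence: N" marker are the start-of-sentence (SOS) rows
sos_idx = dataset.loc[~dataset["Sentence #"].isnull()].index
sentence_idx = np.empty(dataset.shape[0], np.uint)
# Rows from one SOS row up to (not including) the next belong to the same sentence
for i in range(1, len(sos_idx)):
    sentence_idx[sos_idx[i - 1]:sos_idx[i]] = i - 1
# The last sentence runs to the end of the dataset
sentence_idx[sos_idx[-1]:] = len(sos_idx) - 1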
dataset["Sentence #"] = sentence_idx
dataset

Unnamed: 0,Sentence #,Word,POS
0,0,thousands,14
1,0,of,5
2,0,demonstrators,14
3,0,have,30
4,0,marched,29
...,...,...,...
962096,47841,they,17
962097,47841,responded,27
962098,47841,to,24
962099,47841,the,2
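
The same numbering can also be obtained without an explicit loop. The sketch below is an alternative formulation, not the code used in this notebook. It runs on a made-up toy column and assumes that the very first row carries a sentence marker.

```python
import numpy as np
import pandas as pd

# Toy column shaped like "Sentence #": a marker on the first word, NaN elsewhere
toy = pd.DataFrame({"Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan]})

# Each marker increments the running count; subtracting 1 makes indices zero-based
toy_sentence_idx = toy["Sentence #"].notna().cumsum() - 1

print(toy_sentence_idx.tolist())  # [0, 0, 0, 1, 1]
```

Both versions rely on the marker row still being present. If the first token of a sentence was dropped during punctuation removal, that sentence would be merged into the preceding one.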


The train/test split is performed by randomly selecting 80% of the sentences for training and leaving the remaining sentences for testing:

In [7]:
num_sentences = int(sentence_idx[-1] + 1)
rng = np.random.default_rng(seed=42)
train_sentences = rng.choice(num_sentences, size=round(num_sentences * 0.8), replace=False)

train_mask = dataset["Sentence #"].isin(train_sentences)
train_dataset = dataset.loc[train_mask]
test_dataset = dataset.loc[~train_mask]

print("Train dataset size:", train_dataset.shape[0])
print("Test dataset size:", test_dataset.shape[0])

Train dataset size: 769379
Test dataset size: 192722


A vocabulary is required to replace words with numeric tokens. It is built using only the train subset and contains only those words that occurred more than once. All other words are replaced with an \<UNK\> token to teach the model how to handle out-of-vocabulary words.

As the non-random sample of the vocabulary shows, it contains certain "words" that would be better handled by a rule-based approach than by a probabilistic model (such as numbers, dates, or punctuation characters with non-punctuation tags). However, we will treat them like ordinary words for the sake of simplicity.

In [8]:
unique_words = train_dataset.Word.value_counts()
idx_to_word = sorted(unique_words.loc[unique_words > 1].index)
idx_to_word.insert(0, "<UNK>")
word_to_idx = {word: idx for idx, word in enumerate(idx_to_word)}

print("Vocabulary size:", len(idx_to_word), end="\n\n")
print("Vocabulary sample:", idx_to_word[:30], sep="\n", end="\n\n")
print("Random vocabulary sample:", rng.choice(idx_to_word, size=30, replace=False).tolist(),
      sep="\n")

Vocabulary size: 16920

Vocabulary sample:
['<UNK>', '%', '&', "'", "'d", "'ll", "'m", "'re", "'s", "'ve", '(', ')', '-', '--', '/', '0.08', '0.1', '0.2', '0.3', '0.4', '0.5', '0.7', '0.8', '0.9', '01-jan', '02-feb', '02-jan', '02-jun', '03-apr', '03-feb']

Random vocabulary sample:
['powers', 'warren', 'finally', 'undersea', 'dc', 'ruiz', 'haram', 'activities', 'crown', 'federline', 'kerik', 'beheaded', 'compatriot', 'shifted', 'pressures', 'onboard', 'skepticism', 'speculate', 'pohamba', 'doubt', 'without', 'readiness', 'monitor', 'farouk', 'anti-war', 'bombay', 'sukhumi', 'seventh', 'abrupt', 'portable']
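
To make the effect of the frequency threshold concrete, here is a minimal sketch on made-up words. The `toy_` names and the `encode` helper are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy training words (hypothetical); only "the" and "cat" occur more than once
toy_words = pd.Series(["the", "cat", "saw", "the", "cat"])

counts = toy_words.value_counts()
toy_idx_to_word = ["<UNK>"] + sorted(counts.loc[counts > 1].index)
toy_word_to_idx = {word: idx for idx, word in enumerate(toy_idx_to_word)}

def encode(words):
    """Map words to token ids, sending anything outside the vocabulary to <UNK> (id 0)."""
    return [toy_word_to_idx.get(word, toy_word_to_idx["<UNK>"]) for word in words]

print(toy_idx_to_word)                # ['<UNK>', 'cat', 'the']
print(encode(["the", "dog", "saw"]))  # [2, 0, 0]
```

Note that "saw" maps to \<UNK\> even though it appears in the training words: because singletons are excluded, the model already sees \<UNK\> tokens during training and can estimate probabilities for them.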


Extract observable states (words) and hidden states (parts of speech) from the train and test datasets, replacing words with the corresponding numeric tokens:

In [9]:
def extract_states(dataset):
    """Return per-sentence lists of word tokens and POS labels."""
    words = []
    pos = []
    unk_idx = word_to_idx["<UNK>"]
    sentence_nums = dataset["Sentence #"].unique().tolist()
    for sentence_num in sentence_nums:
        idx = dataset["Sentence #"] == sentence_num
        # Out-of-vocabulary words fall back to the <UNK> token
        words.append([word_to_idx.get(word, unk_idx)
                      for word in dataset["Word"][idx].tolist()])
        pos.append(dataset["POS"][idx].tolist())
    return words, pos

X_train, Y_train = extract_states(train_dataset)
X_test, Y_test = extract_states(test_dataset)
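
`extract_states` builds a boolean mask over the whole frame once per sentence, which can be slow with tens of thousands of sentences. A possible alternative is a single `groupby` pass, sketched below. The function name `extract_states_grouped`, its explicit `word_to_idx` parameter, and the toy data are assumptions made for this example.

```python
import pandas as pd

def extract_states_grouped(dataset, word_to_idx):
    """Return per-sentence lists of word ids and POS labels using one groupby pass."""
    unk = word_to_idx["<UNK>"]
    word_ids = dataset["Word"].map(lambda word: word_to_idx.get(word, unk))
    # sort=False keeps sentences in order of first appearance, like unique()
    grouped_words = word_ids.groupby(dataset["Sentence #"], sort=False)
    grouped_pos = dataset["POS"].groupby(dataset["Sentence #"], sort=False)
    words = [group.tolist() for _, group in grouped_words]
    pos = [group.tolist() for _, group in grouped_pos]
    return words, pos

# Toy frame with two sentences (hypothetical data)
toy = pd.DataFrame({
    "Sentence #": [0, 0, 1, 1, 1],
    "Word": ["the", "cat", "the", "dog", "ran"],
    "POS": [2, 11, 2, 11, 27],
})
toy_vocab = {"<UNK>": 0, "cat": 1, "the": 2}

toy_words, toy_tags = extract_states_grouped(toy, toy_vocab)
print(toy_words)  # [[2, 1], [2, 0, 0]]
print(toy_tags)   # [[2, 11], [2, 11, 27]]
```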

# Train and test the model

Finally, we can train the Hidden Markov model:

In [10]:
hidden_markov = mymllib.nlp.HiddenMarkov()
hidden_markov.fit(X_train, Y_train)
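
For readers unfamiliar with the model, the sketch below shows one common way to train and decode a supervised Hidden Markov model: start, transition, and emission probabilities are estimated by counting with add-one smoothing, and the most likely tag sequence is found with the Viterbi algorithm in log space. This is an illustration under those assumptions and not necessarily how `mymllib.nlp.HiddenMarkov` is implemented. The names `fit_hmm`, `viterbi`, and `alpha` and the toy data are hypothetical.

```python
import numpy as np

def fit_hmm(X, Y, num_words, num_tags, alpha=1.0):
    """Estimate log start, transition and emission probabilities by smoothed counting."""
    start = np.full(num_tags, alpha)
    trans = np.full((num_tags, num_tags), alpha)
    emit = np.full((num_tags, num_words), alpha)
    for words, tags in zip(X, Y):
        start[tags[0]] += 1
        for prev_tag, tag in zip(tags[:-1], tags[1:]):
            trans[prev_tag, tag] += 1
        for word, tag in zip(words, tags):
            emit[tag, word] += 1
    return (np.log(start / start.sum()),
            np.log(trans / trans.sum(axis=1, keepdims=True)),
            np.log(emit / emit.sum(axis=1, keepdims=True)))

def viterbi(words, log_start, log_trans, log_emit):
    """Return the most likely tag sequence for one non-empty sentence of word ids."""
    num_steps, num_tags = len(words), len(log_start)
    score = np.empty((num_steps, num_tags))
    backptr = np.zeros((num_steps, num_tags), dtype=int)
    score[0] = log_start + log_emit[:, words[0]]
    for t in range(1, num_steps):
        # candidates[i, j]: best path ending in tag i at step t-1, then moving to tag j
        candidates = score[t - 1][:, None] + log_trans
        backptr[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_emit[:, words[t]]
    # Follow the back-pointers from the best final tag to recover the full path
    tags = [int(score[-1].argmax())]
    for t in range(num_steps - 1, 0, -1):
        tags.append(int(backptr[t, tags[-1]]))
    return tags[::-1]

# Toy data: word ids 0..2 and tag ids 0..1 (hypothetical)
X_toy = [[0, 1, 2], [0, 2], [1, 2]]
Y_toy = [[0, 1, 0], [0, 0], [1, 0]]
params = fit_hmm(X_toy, Y_toy, num_words=3, num_tags=2)
print(viterbi([0, 1, 2], *params))  # [0, 1, 0]
```

Working in log space avoids numerical underflow on long sentences, and the smoothing keeps unseen word/tag pairs from receiving zero probability.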

Let's measure the model's accuracy (predictions are flattened to be compatible with generic metrics):

In [11]:
Y_train_pred = hidden_markov.predict(X_train)
Y_test_pred = hidden_markov.predict(X_test)

y_train = [val for y in Y_train for val in y]
y_train_pred = [val for y in Y_train_pred for val in y]
print("Train accuracy:", metrics.accuracy(y_train, y_train_pred))
print("Train balanced accuracy:", metrics.balanced_accuracy(y_train, y_train_pred))

print()

y_test = [val for y in Y_test for val in y]
y_test_pred = [val for y in Y_test_pred for val in y]
print("Test accuracy:", metrics.accuracy(y_test, y_test_pred))
print("Test balanced accuracy:", metrics.balanced_accuracy(y_test, y_test_pred))

Train accuracy: 0.9542553150007993
Train balanced accuracy: 0.9520806530840403

Test accuracy: 0.9470740237232905
Test balanced accuracy: 0.9110006713841714
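
The two metrics differ in how they weight classes. The sketch below assumes the common definitions: accuracy is the fraction of correctly labeled tokens, and balanced accuracy is the mean of per-class recalls. Whether `mymllib.metrics.classification` computes them exactly this way has not been verified here. The functions and toy labels are written only for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of frequency."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels: class 0 is frequent, class 1 is rare (hypothetical)
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 0]

print(accuracy(y_true_toy, y_pred_toy))           # 5 of 6 correct
print(balanced_accuracy(y_true_toy, y_pred_toy))  # (1.0 + 0.5) / 2 = 0.75
```

Under these definitions, a gap between the two numbers on the test set suggests that some rare tags are predicted less reliably than the frequent ones.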


# Conclusion

It is a pleasant surprise that even a basic approach with the Hidden Markov model achieves 94.7% accuracy and 91.1% balanced accuracy when predicting POS tags on the test dataset. Handling some kinds of tokens, such as dates, in a specific way or using a more sophisticated model could lead to even better results, but what we achieved in this notebook is still a great baseline for further improvements.