# Part of Speech tagging

Part-of-speech (PoS) tagging means assigning the correct part-of-speech label to each word in a sentence. In itself this might look like a toy problem, but the labels can be important inputs for downstream models, such as entity extraction or sentiment analysis.

It's also an interesting problem because you're tagging all the words in a sentence at once, so it's more than plain classification or hand-written rules. Specialized machine learning models that take this sequence structure into account are therefore commonly used.

In this notebook we take a brief look at some off-the-shelf approaches, and we also attempt to roll our own model, just to further our understanding of the topic.


## Data overview

Let's start by loading the data and seeing what it looks like.

In [1]:
from sklearn.model_selection import train_test_split
import nltk

sentences = nltk.corpus.treebank.tagged_sents()

train_sentences, test_sentences = train_test_split(sentences, test_size=0.25)

In [2]:
sentences[1]

[('Mr.', 'NNP'),
 ('Vinken', 'NNP'),
 ('is', 'VBZ'),
 ('chairman', 'NN'),
 ('of', 'IN'),
 ('Elsevier', 'NNP'),
 ('N.V.', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('Dutch', 'NNP'),
 ('publishing', 'VBG'),
 ('group', 'NN'),
 ('.', '.')]

So the data consists of a list of sentences, where each sentence is a list of (word, tag) pairs.

## First: Ngram Tagger

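The first model we'll have a go at is a UnigramTagger. This is a naive model that assigns each word the PoS tag it was most frequently observed with in the training data. Words it has never seen fall back to the DefaultTagger, which labels everything 'NN'.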

In [4]:
from nltk.tag.sequential import DefaultTagger, UnigramTagger, BigramTagger

model = UnigramTagger(train_sentences, backoff=DefaultTagger('NN'))

print("Model score = ", model.evaluate(test_sentences))

Model score =  0.8875074955026984
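To make the idea concrete, here is a minimal pure-Python sketch of what a unigram tagger with a default fallback does. It is not NLTK's implementation: the names `train_unigram`, `tag_unigram` and the toy sentences are made up for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_unigram(words, table, default_tag="NN"):
    """Look up each word; unseen words get default_tag."""
    return [(word, table.get(word, default_tag)) for word in words]

# Hypothetical toy data: "run" appears once as NN and twice as VBP
toy_train = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP"), ("often", "RB")],
]
table = train_unigram(toy_train)
print(tag_unigram(["the", "run", "started"], table))
# [('the', 'DT'), ('run', 'VBP'), ('started', 'NN')]
```

Note how "run" is always tagged VBP, even after "the", because the model never looks at context.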


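One step better is a BigramTagger, which also takes the previous word's part-of-speech tag into account. When it hasn't seen a particular combination during training, it backs off to the unigram tagger.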

In [5]:
model = BigramTagger(train_sentences, backoff=UnigramTagger(train_sentences, backoff=DefaultTagger('NN')))

print("Model score = ", model.evaluate(test_sentences))

Model score =  0.896662002798321
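The backoff chain can be sketched in plain Python as well. This is a simplified illustration under the assumption that the context is the pair (previous tag, current word); the function names and toy data are hypothetical and this is not how NLTK stores its tables internally.

```python
from collections import Counter, defaultdict

def most_frequent(counts):
    """Keep only the most frequent tag for every context."""
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}

def train_tables(tagged_sentences):
    bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag counts
    unigram_counts = defaultdict(Counter)  # word -> tag counts
    for sentence in tagged_sentences:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sentence:
            bigram_counts[(prev_tag, word)][tag] += 1
            unigram_counts[word][tag] += 1
            prev_tag = tag
    return most_frequent(bigram_counts), most_frequent(unigram_counts)

def tag_with_backoff(words, bigram_table, unigram_table, default_tag="NN"):
    tagged, prev_tag = [], None
    for word in words:
        tag = bigram_table.get((prev_tag, word))
        if tag is None:
            # Unseen context: back off to the unigram table, then the default
            tag = unigram_table.get(word, default_tag)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

toy_train = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP"), ("often", "RB")],
]
bigram_table, unigram_table = train_tables(toy_train)
print(tag_with_backoff(["the", "run"], bigram_table, unigram_table))
# [('the', 'DT'), ('run', 'NN')]
print(tag_with_backoff(["a", "run"], bigram_table, unigram_table))
# [('a', 'NN'), ('run', 'VBP')]
```

In the first call the context (DT, "run") was seen in training, so "run" becomes a noun. In the second call the context is new, so the tagger falls back to the most frequent tag for "run".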


## HMM Tagger

The models in the previous section are quite limited, and their performance leaves something to be desired.

Another popular approach is the hidden Markov model (HMM). These models explicitly model the relation between sequential observations and seem quite suitable for the problem.

Let's see how this model does.

In [7]:
from nltk.tag.hmm import HiddenMarkovModelTagger

model = HiddenMarkovModelTagger.train(train_sentences)

print("Model score = ", model.evaluate(test_sentences))

Model score =  0.907615430741555
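An HMM tagger picks the tag sequence that is most probable for the whole sentence, which is typically done with the Viterbi algorithm. Below is a minimal sketch of that decoding step. The probabilities are invented by hand for illustration (a real tagger estimates them from the training corpus), and the function is not taken from NLTK.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under the given HMM."""
    # best[i][tag] = (probability of the best path ending in tag at word i,
    #                 previous tag on that path)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(word, 0.0), prev)
                for prev in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Hypothetical, hand-picked probabilities
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.4, "barks": 0.1},
    "VB": {"barks": 0.5, "dog": 0.05},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VB']
```

The sketch multiplies raw probabilities to stay readable. On real sentences one would normally sum log-probabilities instead to avoid numerical underflow.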


## CRF Tagger

The last model we tried was an improvement, but we're still not where we'd like to be. Another option offered by the NLTK library is a conditional random field (CRF) model. These models can use multiple features per word and also explicitly model the sequential relation in a sentence.

Let's see!

In [8]:
from nltk.tag.crf import CRFTagger

model = CRFTagger()
model.train(train_sentences, 'model.bin')

print("Model score = ", model.evaluate(test_sentences))

Model score =  0.9491505096941835


## Perceptron Tagger

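The last built-in model we'll try out is a perceptron tagger. NLTK's implementation is an averaged perceptron, a linear model rather than a deep neural network, with various features (including previously predicted tags) that capture the sequential relation. Supposedly this is the "gold standard" among the built-in taggers.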

In [9]:
from nltk.tag.perceptron import PerceptronTagger

model = PerceptronTagger()
model.train(train_sentences)

print("Model score = ", model.evaluate(test_sentences))

Model score =  0.972856286228263
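For intuition, here is a heavily simplified sketch of the perceptron learning rule on sparse string features. It is an assumption-laden toy: it leaves out the weight averaging and the previous-tag features, and `ToyPerceptron`, `toy_feats` and the training sentences are made up for this example.

```python
from collections import defaultdict

class ToyPerceptron:
    """A plain (non-averaged) multiclass perceptron over string features."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        # weights[feature][label] -> float
        self.weights = defaultdict(lambda: defaultdict(float))

    def predict(self, feats):
        scores = {label: 0.0 for label in self.classes}
        for feat in feats:
            for label, weight in self.weights[feat].items():
                scores[label] += weight
        # Ties are broken deterministically by label name
        return max(self.classes, key=lambda label: (scores[label], label))

    def update(self, feats, truth):
        guess = self.predict(feats)
        if guess != truth:
            # Reward the correct label, penalize the wrong guess
            for feat in feats:
                self.weights[feat][truth] += 1.0
                self.weights[feat][guess] -= 1.0

def toy_feats(words, i):
    prev_word = words[i - 1] if i > 0 else "<s>"
    return {"bias", "word=" + words[i], "suffix=" + words[i][-2:], "prev=" + prev_word}

toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
model = ToyPerceptron({"DT", "NN", "VBZ"})
for epoch in range(5):
    for sentence in toy_train:
        words = [word for word, tag in sentence]
        for i, (word, tag) in enumerate(sentence):
            model.update(toy_feats(words, i), tag)

test_words = ["the", "dog", "sleeps"]
print([model.predict(toy_feats(test_words, i)) for i in range(len(test_words))])
# ['DT', 'NN', 'VBZ']
```

The weights only change when the model makes a mistake, which is why training settles quickly on such a small dataset.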


## Roll our own tagger

It's certainly interesting to experiment with the built-in taggers NLTK offers, but as curious engineers we won't stop there of course. How does such a model work internally and what features do we need to make good predictions?

In the section below we'll attempt to make a tagger using a scikit-learn classifier and an appropriate list of features.

In [10]:
from sklearn.feature_selection import SelectKBest
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

The first thing we do is define a feature extraction function. It takes a sentence (a list of words) and a position as input and returns a dictionary of features. If you look in the NLTK source, the CRFTagger and the PerceptronTagger use very similar feature functions.

So let's hope we are close to their accuracy levels!

In [11]:
def features(sentence, index):
    return {
        'word': sentence[index].lower(),
        #'stem': stemmer.stem(sentence[index]).lower(),
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_title': sentence[index][0].istitle(),
        'is_upper': sentence[index].isupper(),
        'is_lower': sentence[index].islower(),
        'is_digit': sentence[index].isdigit(),
        'has_upper_inside': sentence[index][1:].lower() != sentence[index][1:],
        'has_hyphen': '-' in sentence[index],
        'prefix-1': sentence[index][0].lower(),
        'prefix-2': sentence[index][:2].lower(),
        'prefix-3': sentence[index][:3].lower(),
        'suffix-1': sentence[index][-1].lower(),
        'suffix-2': sentence[index][-2:].lower(),
        'suffix-3': sentence[index][-3:].lower(),
        'prev_word': '' if index == 0 else sentence[index - 1].lower(),
        #'prev_stem': '' if index == 0 else stemmer.stem(sentence[index - 1]).lower(),
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1].lower(),
        #'next_stem': '' if index == len(sentence) - 1 else stemmer.stem(sentence[index + 1]).lower(),
    }

Next up, we need to transform the dataset. scikit-learn expects a single flat X and y with one entry per token, while NLTK uses lists of tagged sentences. So we add a small function to translate.

In [12]:
def transform_to_dataset(tagged_sentences):
    X, y = [], []
    for tagged in tagged_sentences:
        # Strip the tags once per sentence instead of once per token
        words = [w for w, t in tagged]
        for index in range(len(tagged)):
            X.append(features(words, index))
            y.append(tagged[index][1])
    return X, y
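To see what the flattened dataset looks like, here is a small self-contained sketch. It uses a reduced, hypothetical stand-in (`toy_features`) for the full feature function so that it runs on its own; `to_dataset` is likewise just an illustrative name.

```python
def toy_features(words, index):
    # Hypothetical, reduced stand-in for the full feature function
    return {
        "word": words[index].lower(),
        "is_first": index == 0,
        "prev_word": "" if index == 0 else words[index - 1].lower(),
    }

def to_dataset(tagged_sentences, feature_fn):
    X, y = [], []
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        for index, (_, tag) in enumerate(tagged):
            X.append(feature_fn(words, index))
            y.append(tag)
    return X, y

X, y = to_dataset([[("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ")]], toy_features)
print(X[1])
# {'word': 'vinken', 'is_first': False, 'prev_word': 'mr.'}
print(y)
# ['NNP', 'NNP', 'VBZ']
```

Each token becomes one feature dictionary in X and one tag in y, so the sentence boundaries are only visible through features such as `is_first`.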

Finally we define the model. Here I've chosen a small neural network, though other classifiers work too. A neural network seemed to work particularly well here, I presume because there are more than two labels and because it can pick up interactions between features.

In [15]:
clf = make_pipeline(
    DictVectorizer(),
    SelectKBest(k=5000),
    MLPClassifier(hidden_layer_sizes=(12,), early_stopping=True)
)

So, how did we do?!

In [16]:
clf.fit(*transform_to_dataset(train_sentences))

print("Accuracy:", clf.score(*transform_to_dataset(test_sentences)))

Accuracy: 0.9516290225864481
