## POS Tagger
<p>This is a small exercise in building a part-of-speech (POS) tagger, since not all POS taggers are alike. The goal is to build it from scratch as far as possible; however, we will rely on <strong>sklearn</strong> for the classification algorithm.</p>
<p>Let's write a function that returns a dictionary of linguistic features for one word in a sentence, called ... you guessed it, <em>features</em>.</p>

In [1]:
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }


In [2]:
# test it out
print(features(['This', 'is', 'a', 'sentence'], 2))

{'word': 'a', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'prefix-1': 'a', 'prefix-2': 'a', 'prefix-3': 'a', 'suffix-1': 'a', 'suffix-2': 'a', 'suffix-3': 'a', 'prev_word': 'is', 'next_word': 'sentence', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}
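
<p>One quirk worth knowing: the <strong>is_capitalized</strong> check above compares the first character with its uppercase form, so digits and punctuation such as '61' or ',' also count as capitalized, because <strong>upper()</strong> leaves them unchanged. The sketch below shows a stricter alternative based on <strong>str.isupper()</strong>. The helper name <strong>starts_with_capital</strong> is made up for illustration and is not part of the original tagger. Swapping it in would change the feature values, so the accuracy reported later would no longer apply as-is.</p>

```python
def original_check(word):
    # The comparison used by features(): true for any first character
    # that upper() leaves unchanged, including digits and punctuation.
    return word[0].upper() == word[0]


def starts_with_capital(word):
    # Hypothetical stricter check: true only for an uppercase letter.
    # Slicing with [:1] also makes it safe for an empty string.
    return word[:1].isupper()


for token in ['Pierre', 'years', '61', ',']:
    print(token, original_check(token), starts_with_capital(token))
```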


<p>The point is to create a POS tagger without relying on NLTK's own tagger, but we will cheat a little and use its tagged corpus as training data; otherwise, we would spend a long, long time tagging sentences by hand.</p>

In [3]:
import nltk
# nltk.download('treebank')  # uncomment on first run if the corpus is not installed locally
 
tagged_sentences = nltk.corpus.treebank.tagged_sents() # let's borrow NLTK's tagged corpus
 
print(tagged_sentences[0]) # just print the first sentence from the corpus
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))
 

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words: 100676


<p>To build features we need the bare words, so here is a function that strips the tags from a tagged sentence taken from NLTK's corpus.</p>

In [4]:
def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]
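
<p>As a quick check, here is <strong>untag</strong> applied to the first few (word, tag) pairs of the corpus sentence printed above. The function is repeated so the snippet runs on its own.</p>

```python
def untag(tagged_sentence):
    # Keep the word from each (word, tag) pair and drop the tag.
    return [w for w, t in tagged_sentence]


sample = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD')]
print(untag(sample))  # ['Pierre', 'Vinken', ',', '61']
```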

<p>Now we must split the data into a training set and a test set before any training.</p>

In [5]:
# Split the dataset for training and testing
cutoff = int(.75 * len(tagged_sentences))
training_sentences = tagged_sentences[:cutoff]
test_sentences = tagged_sentences[cutoff:]
 
print(len(training_sentences))   # 2935
print(len(test_sentences))         # 979
 
def transform_to_dataset(tagged_sentences):
    X, y = [], []
 
    for tagged in tagged_sentences:
        words = untag(tagged)  # strip the tags once per sentence, not once per word
        for index in range(len(tagged)):
            X.append(features(words, index))
            y.append(tagged[index][1])
 
    return X, y
 
X, y = transform_to_dataset(training_sentences)

2935
979


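<p>We will use <strong>sklearn</strong> to train our POS tagger with a <strong>decision tree classifier</strong>. The <strong>DictVectorizer</strong> step converts each feature dictionary into a numeric vector before it reaches the classifier.</p>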

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
 
clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])
 
clf.fit(X[:10000], y[:10000])   # Train on only the first 10K samples to keep repeated runs fast.
 
print('Training completed')
 
X_test, y_test = transform_to_dataset(test_sentences)
 
print("Accuracy:", clf.score(X_test, y_test))

Training completed
Accuracy: 0.8966334565322192
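
<p>To make the vectorizer step less of a black box, here is a simplified pure-Python sketch of the kind of encoding <strong>DictVectorizer</strong> performs: each string-valued feature becomes a one-hot column named feature=value, while numeric and boolean features keep a single column. This is an illustration only, not sklearn's implementation, and the function names <strong>column_name</strong>, <strong>build_vocabulary</strong> and <strong>vectorize</strong> are made up for the example.</p>

```python
def column_name(name, value):
    # String values get their own "name=value" column (one-hot);
    # booleans and numbers share a single column per feature name.
    return f"{name}={value}" if isinstance(value, str) else name


def build_vocabulary(samples):
    # Collect every column seen in the data and give each a fixed index.
    columns = {column_name(k, v) for sample in samples for k, v in sample.items()}
    return {col: i for i, col in enumerate(sorted(columns))}


def vectorize(sample, vocabulary):
    # Columns never seen while building the vocabulary are ignored.
    row = [0.0] * len(vocabulary)
    for name, value in sample.items():
        col = column_name(name, value)
        if col in vocabulary:
            row[vocabulary[col]] = 1.0 if isinstance(value, str) else float(value)
    return row


samples = [
    {'word': 'a', 'is_first': False, 'suffix-1': 'a'},
    {'word': 'This', 'is_first': True, 'suffix-1': 's'},
]
vocab = build_vocabulary(samples)
print(sorted(vocab))
for s in samples:
    print(vectorize(s, vocab))
```

<p>Because every distinct word, prefix and suffix gets its own column, the matrix becomes very wide very quickly. With <strong>sparse=False</strong> that is probably one reason to train on a limited number of samples, as the cell above does.</p>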


<p>We could use NLTK's ready-made <strong>word_tokenize</strong> function, but I would rather write a tokenizer from scratch, since that is the point of this exercise.</p>

In [7]:
import re # let's use regex
def word_tokenize(sentence):
    sentence = re.sub(r'[^\w\s]','',sentence) # use regex to strip punctuation
    sentence = sentence.lower() # drop all to lower case
    sentence = sentence.split() # list the words from the input
    return sentence

In [8]:
print(word_tokenize('This is my friend, John.'))

['this', 'is', 'my', 'friend', 'john']


<p>Let's see how our homemade POS tagger works.</p>

In [9]:
def pos_tag(sentence):
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return zip(sentence, tags)
 
print(list(pos_tag(word_tokenize('This is my friend, John.'))))

[('this', 'DT'), ('is', 'VBZ'), ('my', 'NN'), ('friend', 'NN'), ('john', 'NN')]
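
<p>Note that the tokenizer above lowercases the text and removes punctuation, so the capitalization features never fire, and punctuation tokens, which the treebank training data does contain, never reach the tagger. That may be part of why 'john' comes out as NN rather than NNP, though this has not been verified here. Below is a hedged sketch of an alternative that keeps case and splits punctuation into separate tokens. <strong>simple_tokenize</strong> is a hypothetical name, and the regex is a rough approximation that, unlike the treebank, splits abbreviations such as 'Nov.' into two tokens.</p>

```python
import re


def simple_tokenize(sentence):
    # Words (optionally joined by internal hyphens or apostrophes) stay whole;
    # every other non-space character becomes its own token. Case is preserved.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


print(simple_tokenize('This is my friend, John.'))
# ['This', 'is', 'my', 'friend', ',', 'John', '.']
```

<p>The resulting list can be passed to <strong>pos_tag</strong> in place of the output of <strong>word_tokenize</strong>.</p>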
