# Task: Part of speech tagging

In this task we try to recreate a very rudimentary POS tagger "from scratch" using spaCy and CRF models.

(For demonstration purposes, we disregard for the moment the fact that spaCy has a built-in POS tagger.)

The input is a tokenized English sentence. The task is to label each word with a part-of-speech (POS) tag. The tag set is the 12-tag universal tag set shipped with NLTK, which is closely related to the [Universal Dependencies project's](https://universaldependencies.org/) basic tag set:

- NOUN: noun
- VERB: verb
- DET: determiner
- ADJ: adjective
- ADP: adposition (e.g., prepositions)
- ADV: adverb
- CONJ: conjunction
- NUM: numeral
- PRT: particle (function word that cannot be inflected, has no meaning in
  itself and doesn't fit elsewhere, e.g., "to")
- PRON: pronoun
- .: punctuation
- X: other

The code in this task is an adaptation of the NER code in the sklearn-crfsuite documentation.

# The data set

__Brown__ corpus: "The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. **It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.**" (Wikipedia: Brown Corpus)

Let's download and inspect the data!

In [1]:
%%capture
!pip install nltk

In [2]:
import nltk

from nltk.corpus import brown
nltk.download('brown')

brown.words()

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [3]:
nltk.download('universal_tagset')
brown.tagged_words(tagset='universal')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('The', 'DET'), ('Fulton', 'NOUN'), ...]

In [4]:
brown.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [5]:
len(brown.words())

1161192

From the `brown` object provided by NLTK, we will work with the list of tagged sentences:

In [6]:
# Each sentence is a list of (word, POS_tag) tuples.
sents = brown.tagged_sents(tagset="universal")

sents[:2]

[[('The', 'DET'),
  ('Fulton', 'NOUN'),
  ('County', 'NOUN'),
  ('Grand', 'ADJ'),
  ('Jury', 'NOUN'),
  ('said', 'VERB'),
  ('Friday', 'NOUN'),
  ('an', 'DET'),
  ('investigation', 'NOUN'),
  ('of', 'ADP'),
  ("Atlanta's", 'NOUN'),
  ('recent', 'ADJ'),
  ('primary', 'NOUN'),
  ('election', 'NOUN'),
  ('produced', 'VERB'),
  ('``', '.'),
  ('no', 'DET'),
  ('evidence', 'NOUN'),
  ("''", '.'),
  ('that', 'ADP'),
  ('any', 'DET'),
  ('irregularities', 'NOUN'),
  ('took', 'VERB'),
  ('place', 'NOUN'),
  ('.', '.')],
 [('The', 'DET'),
  ('jury', 'NOUN'),
  ('further', 'ADV'),
  ('said', 'VERB'),
  ('in', 'ADP'),
  ('term-end', 'NOUN'),
  ('presentments', 'NOUN'),
  ('that', 'ADP'),
  ('the', 'DET'),
  ('City', 'NOUN'),
  ('Executive', 'ADJ'),
  ('Committee', 'NOUN'),
  (',', '.'),
  ('which', 'DET'),
  ('had', 'VERB'),
  ('over-all', 'ADJ'),
  ('charge', 'NOUN'),
  ('of', 'ADP'),
  ('the', 'DET'),
  ('election', 'NOUN'),
  (',', '.'),
  ('``', '.'),
  ('deserves', 'VERB'),
  ('the', 'DET'),

In [7]:
# How many sentences are there in the corpus?
len(sents)

57340

We divide our data set into a training and a validation part:

In [8]:
valid_sents = sents[:5734] # First 5734 sentences are validation sentences.
train_sents = sents[5734:] # Remaining (57340 - 5734 = 51606 sents) are part of Train set.
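
The hard-coded indices above amount to holding out the first 10% of the sentences. As a minimal sketch, the same split can be written as a small helper; the name `split_corpus` and its `valid_fraction` parameter are hypothetical and not part of NLTK:

```python
def split_corpus(sentences, valid_fraction=0.1):
    """Return (train, valid), using the first part of the list for validation.

    Hypothetical helper: it mirrors the slicing above and does not shuffle.
    """
    n_valid = int(len(sentences) * valid_fraction)
    return sentences[n_valid:], sentences[:n_valid]


# Stand-in for the 57340 tagged sentences of the corpus.
toy_sents = list(range(57340))
toy_train, toy_valid = split_corpus(toy_sents)
print(len(toy_train), len(toy_valid))
```

Because nothing is shuffled, the validation part consists of whatever sentences come first in the corpus, exactly as in the cell above.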

In [9]:
# Remove this afterwards.
train_sents

[[('Viewed', 'VERB'), ('from', 'ADP'), ('afar', 'ADV'), (',', '.'), ('the', 'DET'), ('CDC', 'NOUN'), ('looks', 'VERB'), ('like', 'ADP'), ('a', 'DET'), ('rather', 'ADV'), ('stalwart', 'ADJ'), ('political', 'ADJ'), ('pyramid', 'NOUN'), (':', '.'), ('its', 'DET'), ('elected', 'VERB'), ('directorate', 'NOUN'), ('fans', 'VERB'), ('out', 'PRT'), ('into', 'ADP'), ('an', 'DET'), ('array', 'NOUN'), ('of', 'ADP'), ('district', 'NOUN'), ('leaders', 'NOUN'), ('and', 'CONJ'), ('standing', 'VERB'), ('committees', 'NOUN'), (',', '.'), ('and', 'CONJ'), ('thence', 'ADV'), ('into', 'ADP'), ('its', 'DET'), ('component', 'NOUN'), ('clubs', 'NOUN'), ('and', 'CONJ'), ('affiliated', 'VERB'), ('groups', 'NOUN'), ('--', '.'), ('500', 'NUM'), ('or', 'CONJ'), ('so', 'ADV'), ('.', '.')], [('Much', 'ADJ'), ('of', 'ADP'), ('its', 'DET'), ('strength', 'NOUN'), ('stems', 'VERB'), ('from', 'ADP'), ('the', 'DET'), ('comfortable', 'ADJ'), ('knowledge', 'NOUN'), ('that', 'ADP'), ('every', 'DET'), ('``', '.'), ('volunteer

# Feature template

Since the plan is to build a CRF model, we need a __feature template__, which ***generates features for a word in a sentence*** (our sequence in the sequence tagging task). We use spaCy for feature extraction.

In [10]:
# Download the English language model for spaCy if it is not installed yet:
#!python -m spacy download en_core_web_sm

import spacy
import en_core_web_sm
from spacy.tokens import Doc

# When loading the model, disable the pipeline components we do not need.
en = spacy.load('en_core_web_sm', disable=["ner"])

In [11]:
en.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f79ba04da58>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f79b2914a68>)]

We write a function which generates features for a token in a sentence, which is already a spaCy document. The feature vector is represented as a `dict` mapping feature names to their values.

The desired **feature set for a token is**:

- `bias`: a constant value of 1 as an input,
- `token.lower`: the lowercased textual form of the token,
- `token.suffix`: the textual form of the token's suffix as defined by spaCy,
- `token.prefix`: the textual form of the token's prefix as defined by spaCy,
- `token.is_upper`: boolean value indicating if the token is uppercase,
- `token.is_title`: boolean value indicating if the token is in title case,
- `token.is_digit`: boolean value indicating if the token consists of digits.

These are only the `Token`'s own properties; they carry no information about its context.

We would like to include information about the previous and next words, and to indicate whether the `Token` is at the beginning or the end of the sentence.

The **contextual features** should be:
 
- `-1:token.lower`: What is the lowercase textual form of the previous token?
- `-1:token.is_title`: Is the previous token in title case?
- `-1:token.is_upper`: Is the previous token uppercase?
- `+1:token.lower`: What is the lowercase textual form of the next token?
- `+1:token.is_title`: Is the next token in title case?
- `+1:token.is_upper`: Is the next token uppercase?
- `BOS`: Boolean value indicating if the token is the beginning of a sentence,
- `EOS`: Boolean value indicating if the token is the end of a sentence.

In [44]:
# We pass in a sentence and an index i that selects a word in it;
# the function builds the feature dict for that i-th word.
def token2features(sent, i):
    """Return a feature dict for a token.
    sent is a spaCy Doc containing a sentence, i is the token's index in it.
    """
    features = {}

    features['bias'] = 1.0
    features['token.lower'] = sent[i].lower_
    features['token.suffix'] = sent[i].suffix_
    features['token.prefix'] = sent[i].prefix_
    features['token.is_upper'] = sent[i].is_upper
    features['token.is_title'] = sent[i].is_title
    features['token.is_digit'] = sent[i].is_digit

    # There is no token before the first one (index 0),
    # so the previous-token features are only created for i > 0.
    if i > 0:
        features['-1:token.lower'] = sent[i-1].lower_
        features['-1:token.is_title'] = sent[i-1].is_title
        features['-1:token.is_upper'] = sent[i-1].is_upper

    # There is no token after the last one (index len(sent) - 1),
    # so the next-token features are only created for i < len(sent) - 1.
    if i < len(sent) - 1:
        features['+1:token.lower'] = sent[i+1].lower_
        features['+1:token.is_title'] = sent[i+1].is_title
        # Note: this key has a trailing colon, unlike `+1:token.is_upper` in the
        # feature list. The feature works either way, and the outputs in this
        # notebook show the key exactly as written here.
        features['+1:token.is_upper:'] = sent[i+1].is_upper

    # The first token (i == 0) is the beginning of the sentence.
    if i == 0:
        features['BOS'] = True

    # The last token (i == len(sent) - 1, as indexing starts at 0)
    # is the end of the sentence.
    if i == len(sent) - 1:
        features['EOS'] = True
    return features

For training, we also need functions that generate the list of feature dicts and the list of labels for each sentence in our training corpus:

In [45]:
def sent2features(sent):
    "Return a list of feature dicts for a sentence in the data set."
    # Brown stores (token, POS tag) pairs; only the tokens are needed here.
    tokens = [token for token, _ in sent]
    # Build a spaCy Doc directly from the pre-tokenized words.
    doc = Doc(en.vocab, tokens)

    # Apply token2features to every token to get the features of the whole sentence.
    return [token2features(doc, i) for i in range(len(doc))]

def sent2labels(sent):
    "Return the list of POS labels for a sentence in the data set."
    return [label for _, label in sent]

Sanity check: let's see the values for the first 2 tokens in the corpus:

In [48]:
# Only the first 2 tokens of the first sentence
print(sent2features(sents[0])[:2])
print(sent2labels(sents[0])[:2])

[{'bias': 1.0, 'token.lower': 'the', 'token.suffix': 'The', 'token.prefix': 'T', 'token.is_upper': False, 'token.is_title': True, 'token.is_digit': False, '+1:token.lower': 'fulton', '+1:token.is_title': True, '+1:token.is_upper:': False, 'BOS': True}, {'bias': 1.0, 'token.lower': 'fulton', 'token.suffix': 'ton', 'token.prefix': 'F', 'token.is_upper': False, 'token.is_title': True, 'token.is_digit': False, '-1:token.lower': 'the', '-1:token.is_title': True, '-1:token.is_upper': False, '+1:token.lower': 'county', '+1:token.is_title': True, '+1:token.is_upper:': False}]
['DET', 'NOUN']
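
To experiment with the feature template without loading a spaCy model, it can be approximated with plain Python string methods. This is a sketch under assumptions, not the notebook's implementation: `word2features` is a hypothetical name, the suffix is taken to be the last three characters and the prefix the first character (which matches the values printed above), and `str.istitle`, `str.isupper` and `str.isdigit` may differ from spaCy's flags on unusual tokens.

```python
def word2features(words, i):
    """Return a feature dict for words[i], where words is a list of strings."""
    word = words[i]
    features = {
        "bias": 1.0,
        "token.lower": word.lower(),
        "token.suffix": word[-3:],  # assumption: last three characters
        "token.prefix": word[:1],   # assumption: first character
        "token.is_upper": word.isupper(),
        "token.is_title": word.istitle(),
        "token.is_digit": word.isdigit(),
    }
    if i > 0:
        # Context features from the previous word.
        prev_word = words[i - 1]
        features["-1:token.lower"] = prev_word.lower()
        features["-1:token.is_title"] = prev_word.istitle()
        features["-1:token.is_upper"] = prev_word.isupper()
    else:
        features["BOS"] = True
    if i < len(words) - 1:
        # Context features from the next word.
        next_word = words[i + 1]
        features["+1:token.lower"] = next_word.lower()
        features["+1:token.is_title"] = next_word.istitle()
        features["+1:token.is_upper"] = next_word.isupper()
    else:
        features["EOS"] = True
    return features


example_words = ["The", "Fulton", "County"]
first_features = word2features(example_words, 0)
last_features = word2features(example_words, 2)
print(first_features)
```

A one-word sentence receives both `BOS` and `EOS`, just as with `token2features`.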


# Putting the data into final form

Everything is ready to generate the training data in a form usable by CRFsuite. Note that our inputs and labels are two-level representations (lists of lists), because we deal with token sequences (sentences).

In [49]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_valid = [sent2features(s) for s in valid_sents]
y_valid = [sent2labels(s) for s in valid_sents]

CPU times: user 20 s, sys: 2.97 s, total: 23 s
Wall time: 23 s


In [52]:
print("Feature dict for the first token in the first validation sentence:")
print(X_valid[0][0])
print("Its label:")
print(y_valid[0][0])

Feature dict for the first token in the first validation sentence:
{'bias': 1.0, 'token.lower': 'the', 'token.suffix': 'The', 'token.prefix': 'T', 'token.is_upper': False, 'token.is_title': True, 'token.is_digit': False, '+1:token.lower': 'fulton', '+1:token.is_title': True, '+1:token.is_upper:': False, 'BOS': True}
Its label:
DET
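
Since both inputs and labels are lists of lists, a cheap sanity check before training is to confirm that every sentence has exactly one label per feature dict. A minimal sketch follows; `check_alignment` is a hypothetical helper and the toy data merely stands in for `X_train` and `y_train`:

```python
def check_alignment(X, y):
    """Return the total token count, or raise ValueError if X and y are misaligned."""
    if len(X) != len(y):
        raise ValueError(f"{len(X)} feature sequences but {len(y)} label sequences")
    for index, (features, labels) in enumerate(zip(X, y)):
        if len(features) != len(labels):
            raise ValueError(
                f"sentence {index}: {len(features)} feature dicts, {len(labels)} labels"
            )
    return sum(len(labels) for labels in y)


# Toy stand-ins: two sentences with two tokens and one token.
toy_X = [[{"bias": 1.0}, {"bias": 1.0}], [{"bias": 1.0}]]
toy_y = [["DET", "NOUN"], ["VERB"]]
print(check_alignment(toy_X, toy_y))
```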


# Training and evaluation

We use the super-optimized [CRFsuite](http://www.chokkan.org/software/crfsuite/) via the scikit-learn compatible [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io) wrapper to train a CRF model on the data.

In [58]:
%%capture
# %%capture only hides the noisy printouts during install
!pip install sklearn_crfsuite


In [59]:
# Import sklearn-crfsuite; we train an averaged perceptron model with it and
# use its custom metrics (the flat, token-level scores) to evaluate the model.
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn_crfsuite import scorers

**Define the Model and then train**

In [179]:
%%time
# Create the model
crf = sklearn_crfsuite.CRF(algorithm='ap',max_iterations=23,verbose=True)
# Train the model
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 51606/51606 [00:13<00:00, 3914.93it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 276114
Seconds required: 2.533

Averaged perceptron
max_iterations: 23
epsilon: 0.000000

Iter 1   time=1.72  loss=3419.61  feature_norm=479.36
Iter 2   time=1.59  loss=2119.91  feature_norm=605.32
Iter 3   time=1.54  loss=1803.85  feature_norm=696.98
Iter 4   time=1.52  loss=1613.51  feature_norm=771.06
Iter 5   time=1.51  loss=1479.13  feature_norm=834.37
Iter 6   time=1.48  loss=1373.25  feature_norm=890.05
Iter 7   time=1.47  loss=1306.41  feature_norm=940.22
Iter 8   time=1.46  loss=1234.49  feature_norm=985.92
Iter 9   time=1.47  loss=1164.34  feature_norm=1028.09
Iter 10  time=1.45  loss=1131.19  feature_norm=1067.25
Iter 11  time=1.44  loss=1085.22  feature_norm=1103.89
Iter 12  time=1.52  loss=1060.18  feature_norm=1138.42
Iter 13  time=1.39  loss=1006.77  feature_norm=1170.95
Iter 14  time

**Predict on X_valid**

In [180]:
y_pred = crf.predict(X_valid)

**Check the performance on Validation set**

In [181]:
print(metrics.flat_classification_report(y_valid, y_pred, labels=None, digits=3))

              precision    recall  f1-score   support

           .      1.000     1.000     1.000     14377
         ADJ      0.913     0.925     0.919      8525
         ADP      0.982     0.984     0.983     15138
         ADV      0.933     0.940     0.936      4438
        CONJ      0.993     0.998     0.996      3435
         DET      0.997     0.996     0.996     14128
        NOUN      0.980     0.974     0.977     36341
         NUM      0.985     0.984     0.984      2386
        PRON      0.988     0.995     0.992      3427
         PRT      0.945     0.940     0.943      2877
        VERB      0.976     0.978     0.977     18229
           X      0.851     0.570     0.683       100

    accuracy                          0.977    123401
   macro avg      0.962     0.940     0.949    123401
weighted avg      0.977     0.977     0.977    123401



Precision: out of all the tokens for which we predicted a given tag, the fraction that actually have that tag.

Recall: out of all the tokens that actually have a given tag, the fraction for which we predicted that tag.

F1-Score: the harmonic mean of Precision and Recall.
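
As a small worked example, F1 can be computed as the harmonic mean of precision and recall, here using the X row of the report above (precision 0.851, recall 0.570). The function name `f1_from` is hypothetical:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


x_f1 = f1_from(0.851, 0.570)
print(round(x_f1, 3))  # 0.683, the f1-score reported for X
```

The harmonic mean is pulled towards the smaller of the two values, so it lies below their arithmetic mean, which would be about 0.71 here.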

The F1-Score for the X tag is low. To see why, we check how skewed the label distribution in the data set is.
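
The weak X score also accounts for the gap between the `macro avg` and `weighted avg` rows of the report. The sketch below recomputes both averages from the per-tag f1-scores and supports printed above; because it starts from values rounded to three decimals, it only reproduces the report up to that rounding.

```python
# (f1-score, support) per tag, copied from the classification report above.
report = {
    ".": (1.000, 14377), "ADJ": (0.919, 8525), "ADP": (0.983, 15138),
    "ADV": (0.936, 4438), "CONJ": (0.996, 3435), "DET": (0.996, 14128),
    "NOUN": (0.977, 36341), "NUM": (0.984, 2386), "PRON": (0.992, 3427),
    "PRT": (0.943, 2877), "VERB": (0.977, 18229), "X": (0.683, 100),
}

# Macro average: every tag counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every tag counts in proportion to its support.
total_support = sum(support for _, support in report.values())
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(round(macro_f1, 3), round(weighted_f1, 3))
```

With only 100 of the 123401 validation tokens tagged X, the tag barely moves the weighted average but lowers the macro average noticeably.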

In [143]:
Verb=dot=ADJ=ADP=ADV=CONJ=DET=NOUN=NUM=PRON=PRT=X=0
counter=0
for i in range(len(train_sents)):
  POS_tags = [POS[1] for POS in train_sents[i]]
  counter += len(POS_tags)
  Verb += POS_tags.count("VERB")
  dot += POS_tags.count(".")
  ADJ += POS_tags.count("ADJ")
  ADP += POS_tags.count("ADP")
  ADV += POS_tags.count("ADV")
  CONJ += POS_tags.count("CONJ")
  DET += POS_tags.count("DET")
  NOUN += POS_tags.count("NOUN")
  NUM += POS_tags.count("NUM")
  PRON += POS_tags.count("PRON")
  PRT += POS_tags.count("PRT")
  X += POS_tags.count("X")

In [182]:
print("Percentage of POS Tags in Train set")
print("Verbs:      "+str(round(Verb/counter*100, 2)), 
      "\nPunctuation: "+str(round(dot/counter*100, 2)),
      "\nADJs:       "+str(round(ADJ/counter*100, 2)), 
      "\nADPs:       "+str(round(ADP/counter*100, 2)),
      "\nADVs:       "+str(round(ADV/counter*100, 2)), 
      "\nCONJs:      "+str(round(CONJ/counter*100, 2)), 
      "\nDETs:       "+str(round(DET/counter*100, 2)),
      "\nNOUNs:      "+str(round(NOUN/counter*100, 2)), 
      "\nNUMs:       "+str(round(NUM/counter*100, 2)),
      "\nPRONs:      "+str(round(PRON/counter*100, 2)), 
      "\nPRTs:       "+str(round(PRT/counter*100, 2)),
      "\nXs:         "+str(round(X/counter*100, 2))
      )

Percentage of POS Tags in Train set
Verbs:      15.85 
Punctuation: 12.83 
ADJs:       7.25 
ADPs:       12.49 
ADVs:       4.99 
CONJs:      3.35 
DETs:       11.84 
NOUNs:      23.05 
NUMs:       1.2 
PRONs:      4.42 
PRTs:       2.6 
Xs:         0.12
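
The same distribution can be computed more compactly with `collections.Counter`, which avoids one counter variable per tag. This is a sketch on toy data; `tag_percentages` is a hypothetical helper, and passing `train_sents` instead of the toy list would be expected to give the percentages above.

```python
from collections import Counter


def tag_percentages(tagged_sents):
    """Map each tag to its share of all tokens, in percent, most frequent first."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    total = sum(counts.values())
    return {tag: round(100 * n / total, 2) for tag, n in counts.most_common()}


toy_tagged = [
    [("The", "DET"), ("jury", "NOUN"), ("said", "VERB"), (".", ".")],
    [("The", "DET"), ("election", "NOUN"), ("was", "VERB"),
     ("conducted", "VERB"), (".", ".")],
]
toy_percentages = tag_percentages(toy_tagged)
for tag, share in toy_percentages.items():
    print(f"{tag:5s} {share}")
```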


Only 0.12% of the tokens in the training set carry the X tag, so the model sees very few examples of it, which is the likely reason for the poor F1-Score for this particular tag.

Overall, the model performed really well, except for the X tag.

CRFsuite implements several learning methods; the model above was trained with "ap", i.e., averaged perceptron.
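
To give a rough idea of what "averaged perceptron" means, here is a deliberately simplified sketch that classifies each token independently. It is not CRFsuite's implementation, which scores and decodes whole tag sequences; all function names are hypothetical, and the averaging is done naively for readability rather than speed.

```python
from collections import defaultdict


def encode(feats):
    """Yield (name, value) pairs: strings become indicators, bools/numbers keep their value."""
    for key, value in feats.items():
        if isinstance(value, str):
            yield f"{key}={value}", 1.0
        else:
            yield key, float(value)


def predict(weights, feats, classes):
    """Return the highest-scoring class (ties go to the first class in `classes`)."""
    def score(label):
        return sum(weights[label].get(name, 0.0) * value for name, value in encode(feats))
    return max(classes, key=score)


def train_averaged_perceptron(samples, labels, n_iter=5):
    """Return (averaged weights, classes) for a per-token multiclass perceptron."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    totals = {c: defaultdict(float) for c in classes}
    steps = 0
    for _ in range(n_iter):
        for feats, gold in zip(samples, labels):
            guess = predict(weights, feats, classes)
            if guess != gold:
                # Perceptron update: reward the gold class, penalize the wrong guess.
                for name, value in encode(feats):
                    weights[gold][name] += value
                    weights[guess][name] -= value
            # Add the current weights to the running totals after every sample.
            for c in classes:
                for name, weight in weights[c].items():
                    totals[c][name] += weight
            steps += 1
    averaged = {c: {name: total / steps for name, total in totals[c].items()}
                for c in classes}
    return averaged, classes


toy_samples = [{"token.lower": "the", "BOS": True},
               {"token.lower": "jury"},
               {"token.lower": "said"}]
toy_labels = ["DET", "NOUN", "VERB"]
avg_weights, tag_classes = train_averaged_perceptron(toy_samples, toy_labels)
toy_predictions = [predict(avg_weights, feats, tag_classes) for feats in toy_samples]
print(toy_predictions)
```

Returning the average of the weights over all training steps, rather than the final weights, makes the result less dependent on the last few updates.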

# Demonstration

Just for fun, we can try out the model.

In [95]:
def predict_tags(sent):
    """Predict tags for a sentence.
    sent is a string.
    """
    doc = en(sent)
    return crf.predict([[token2features(doc, i) for i in range(len(doc))]])
    

In [96]:
while True:
    sent = input("\nEnter a sentence to tag or press return to quit:\n")
    if sent:
        print(predict_tags(sent))
    else:
        print("\nEmpty input received -- bye!")
        break


Enter a sentence to tag or press return to quit:
My name is Asjad. I love Data Science.
[['DET', 'NOUN', 'VERB', 'NOUN', '.', 'PRON', 'VERB', 'NOUN', 'NOUN', '.']]

Enter a sentence to tag or press return to quit:
?
[['.']]
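
The predictions come back as a list containing one tag list per input sequence, which is hard to read next to the sentence. A minimal sketch for pairing tokens with their tags follows; `pair_tokens_with_tags` is a hypothetical helper, the tags are copied from the first run above, and the token list is an assumption about how spaCy tokenized that input (it does yield ten tokens for ten tags).

```python
def pair_tokens_with_tags(tokens, predicted):
    """Pair each token with its tag; `predicted` holds one tag list for one sentence."""
    tags = predicted[0]
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags differ in length")
    return list(zip(tokens, tags))


# Assumed tokenization of "My name is Asjad. I love Data Science."
demo_tokens = ["My", "name", "is", "Asjad", ".", "I", "love", "Data", "Science", "."]
demo_predicted = [["DET", "NOUN", "VERB", "NOUN", ".", "PRON", "VERB", "NOUN", "NOUN", "."]]

demo_pairs = pair_tokens_with_tags(demo_tokens, demo_predicted)
for token, tag in demo_pairs:
    print(f"{token}\t{tag}")
```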

