# Task: Part of speech tagging

In this task we build a very rudimentary POS tagger "from scratch" using spaCy and a CRF model.

(For demonstration purposes, we disregard for the moment the fact that spaCy has a built-in POS tagger.)

The input is a tokenized English sentence. The task is to label each word with a part-of-speech (POS) tag. The tag set is the coarse 12-tag universal tag set shipped with NLTK, a precursor of the [Universal Dependencies project's](https://universaldependencies.org/) tag set:

- NOUN: noun
- VERB: verb
- DET: determiner
- ADJ: adjective
- ADP: adposition (e.g., prepositions)
- ADV: adverb
- CONJ: conjunction
- NUM: numeral
- PRT: particle (function word that cannot be inflected, has no meaning in
  itself and doesn't fit elsewhere, e.g., "to")
- PRON: pronoun
- .: punctuation
- X: other

The code in this task is an adaptation of the NER code in the sklearn-crfsuite documentation.

# The data set

__Brown__ corpus: "The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961" (Wikipedia: Brown Corpus)

Let's download and inspect the data!

In [0]:
%%capture
!pip install nltk

In [0]:
import nltk

from nltk.corpus import brown
nltk.download('brown')

brown.words()

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [0]:
nltk.download('universal_tagset')
brown.tagged_words(tagset='universal')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('The', 'DET'), ('Fulton', 'NOUN'), ...]

In [0]:
brown.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [0]:
len(brown.words())

1161192

From the `brown` object provided by NLTK we will work with the list of tagged sentences:

In [0]:
sents = brown.tagged_sents(tagset="universal")

sents[:2]

[[('The', 'DET'),
  ('Fulton', 'NOUN'),
  ('County', 'NOUN'),
  ('Grand', 'ADJ'),
  ('Jury', 'NOUN'),
  ('said', 'VERB'),
  ('Friday', 'NOUN'),
  ('an', 'DET'),
  ('investigation', 'NOUN'),
  ('of', 'ADP'),
  ("Atlanta's", 'NOUN'),
  ('recent', 'ADJ'),
  ('primary', 'NOUN'),
  ('election', 'NOUN'),
  ('produced', 'VERB'),
  ('``', '.'),
  ('no', 'DET'),
  ('evidence', 'NOUN'),
  ("''", '.'),
  ('that', 'ADP'),
  ('any', 'DET'),
  ('irregularities', 'NOUN'),
  ('took', 'VERB'),
  ('place', 'NOUN'),
  ('.', '.')],
 [('The', 'DET'),
  ('jury', 'NOUN'),
  ('further', 'ADV'),
  ('said', 'VERB'),
  ('in', 'ADP'),
  ('term-end', 'NOUN'),
  ('presentments', 'NOUN'),
  ('that', 'ADP'),
  ('the', 'DET'),
  ('City', 'NOUN'),
  ('Executive', 'ADJ'),
  ('Committee', 'NOUN'),
  (',', '.'),
  ('which', 'DET'),
  ('had', 'VERB'),
  ('over-all', 'ADJ'),
  ('charge', 'NOUN'),
  ('of', 'ADP'),
  ('the', 'DET'),
  ('election', 'NOUN'),
  (',', '.'),
  ('``', '.'),
  ('deserves', 'VERB'),
  ('the', 'DET'),

In [0]:
len(sents)

57340

We divide our data set into a training and a validation part:

In [0]:
valid_sents = sents[:5734]
train_sents = sents[5734:]
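
The slicing above is positional: the first 5,734 of the 57,340 sentences (one tenth) become the validation set, and no shuffling takes place. A minimal sketch of the same idea as a helper, where `split_head`, `parts`, and the toy list are hypothetical names introduced only for illustration:

```python
def split_head(items, parts=10):
    """Hold out the leading 1/parts of a sequence; return (valid, train)."""
    n_valid = len(items) // parts
    return items[:n_valid], items[n_valid:]


# Toy stand-in for the list of tagged sentences.
toy_sents = [f"sentence {i}" for i in range(20)]
valid, train = split_head(toy_sents)
print(len(valid), len(train))  # 2 18
```

Applied to the 57,340 Brown sentences, the same integer division yields the 5,734 validation sentences used above.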

# Feature template

Since the plan is to build a CRF model, we need a __feature template__, which generates features for a word in a sentence (our sequence in the sequence tagging task). We use spaCy for feature extraction.

In [0]:
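# spaCy setup: import the library and load the small English model.
import spacy

# Only the tokenizer and lexical attributes are used below, so the
# unnecessary pipeline components could be disabled when loading the model.
nlp = spacy.load("en_core_web_sm")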

We write a function which generates features for a token in a sentence, which is already a spaCy document. The feature vector is represented as a `dict` mapping feature names to their values.

The desired **feature set for a token is**:

- `bias`: a constant value of 1,
- `token.lower`: the lowercased textual form of the token,
- `token.suffix`: the textual form of the token's suffix as defined by spaCy,
- `token.prefix`: the textual form of the token's prefix as defined by spaCy,
- `token.is_upper`: boolean value indicating if the token is all uppercase,
- `token.is_title`: boolean value indicating if the token is in title case,
- `token.is_digit`: boolean value indicating if the token consists of digits.

These features describe only the `Token` itself; they carry no information about its context.

We would also like to include information about the previous and next words, and to indicate whether the `Token` is at the beginning or the end of the sentence.

The **contextual features** should be:
 
- `-1:token.lower`: the lowercased textual form of the previous token,
- `-1:token.is_title`: boolean value indicating if the previous token is in title case,
- `-1:token.is_upper`: boolean value indicating if the previous token is all uppercase,
- `+1:token.lower`: the lowercased textual form of the next token,
- `+1:token.is_title`: boolean value indicating if the next token is in title case,
- `+1:token.is_upper`: boolean value indicating if the next token is all uppercase,
- `BOS`: boolean value indicating if the token is at the beginning of the sentence,
- `EOS`: boolean value indicating if the token is at the end of the sentence.

In [0]:
def token2features(sent, i):
    """Return a feature dict for a token. 
    sent is a spaCy Doc containing a sentence, i is the token's index in it.
    """

    token = sent[i]

    features = {
        'bias': 1.0,
        'token.lower()': token.lower_,
        'token.suffix': token.suffix_,
        'token.prefix': token.prefix_,
        'token.isupper()': token.is_upper,
        'token.istitle()': token.is_title,
        'token.isdigit()': token.is_digit,
    }
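    if i > 0:
        token1 = sent[i-1]
        # Note: '-1:token.isupper()' from the template above is not emitted here;
        # the outputs shown below were produced without it.
        features.update({
            '-1:token.lower()': token1.lower_,
            '-1:token.istitle()': token1.is_title,
        })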
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        token1 = sent[i+1]
        features.update({
            '+1:token.lower()': token1.lower_,
            '+1:token.istitle()': token1.is_title,
            '+1:token.isupper()': token1.is_upper,
        })
    else:
        features['EOS'] = True

    return features

For training, we will also need functions to generate feature dict and label lists for sentences in our training corpus:

In [0]:
def sent2features(sent):
    """Return a list of feature dicts for a sentence in the data set."""
    # Brown sentences are lists of (token, tag) pairs; keep only the tokens.
    words = [word for word, _ in sent]
    # Build a spaCy Doc directly from the pre-tokenized words, so that the
    # corpus token boundaries are kept as they are.
    doc = spacy.tokens.Doc(nlp.vocab, words=words)
    return [token2features(doc, i) for i in range(len(doc))]

def sent2labels(sent):
    """Return the list of POS tags for a sentence in the data set."""
    return [tag for _, tag in sent]

Sanity check: let's see the values for the first 2 tokens in the corpus:

In [0]:
print(sent2features(sents[0])[:2])
print(sent2labels(sents[0])[:2])

[{'bias': 1.0, 'token.lower()': 'the', 'token.suffix': 'The', 'token.prefix': 'T', 'token.isupper()': False, 'token.istitle()': True, 'token.isdigit()': False, 'BOS': True, '+1:token.lower()': 'fulton', '+1:token.istitle()': True, '+1:token.isupper()': False}, {'bias': 1.0, 'token.lower()': 'fulton', 'token.suffix': 'ton', 'token.prefix': 'F', 'token.isupper()': False, 'token.istitle()': True, 'token.isdigit()': False, '-1:token.lower()': 'the', '-1:token.istitle()': True, '+1:token.lower()': 'county', '+1:token.istitle()': True, '+1:token.isupper()': False}]
['DET', 'NOUN']
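
As a cross-check on the output above, the template can be approximated without spaCy. The sketch below is a stand-in, not the notebook's implementation: `plain_token2features` is a hypothetical name, and it assumes that spaCy's default `suffix_` is the last three characters, that `prefix_` is the first character, and that `is_upper`, `is_title`, and `is_digit` behave like Python's `str.isupper`, `str.istitle`, and `str.isdigit`.

```python
def plain_token2features(words, i):
    """spaCy-free stand-in for token2features; words is a list of strings."""
    word = words[i]
    features = {
        'bias': 1.0,
        'token.lower()': word.lower(),
        'token.suffix': word[-3:],  # assumption: suffix = last three characters
        'token.prefix': word[0],    # assumption: prefix = first character
        'token.isupper()': word.isupper(),
        'token.istitle()': word.istitle(),
        'token.isdigit()': word.isdigit(),
    }
    if i > 0:
        prev_word = words[i - 1]
        features.update({
            '-1:token.lower()': prev_word.lower(),
            '-1:token.istitle()': prev_word.istitle(),
        })
    else:
        features['BOS'] = True  # no previous token: beginning of sentence

    if i < len(words) - 1:
        next_word = words[i + 1]
        features.update({
            '+1:token.lower()': next_word.lower(),
            '+1:token.istitle()': next_word.istitle(),
            '+1:token.isupper()': next_word.isupper(),
        })
    else:
        features['EOS'] = True  # no next token: end of sentence

    return features


words = ['The', 'Fulton', 'County']
first = plain_token2features(words, 0)
second = plain_token2features(words, 1)
print(first)
print(second)
```

Under these assumptions, the two dicts contain the same keys and values as the ones printed above for "The" and "Fulton": the first token has `BOS` but no `-1:` features, and the second has both neighbours.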


# Putting the data into final form

Everything is ready to generate the training data in a form that CRFsuite can use. Note that our inputs and labels will be two-level representations (lists of lists), because we deal with token sequences (sentences).
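
To make the two-level layout concrete, here is a toy sketch with hand-written data. `X_toy`, `y_toy`, and `is_aligned` are hypothetical names, and the feature dicts are abbreviated to two keys:

```python
# Outer list: sentences. Inner list: one feature dict per token.
X_toy = [
    [{'bias': 1.0, 'token.lower()': 'the'}, {'bias': 1.0, 'token.lower()': 'jury'}],
    [{'bias': 1.0, 'token.lower()': 'stop'}, {'bias': 1.0, 'token.lower()': '.'}],
]
# Outer list: sentences. Inner list: one tag per token.
y_toy = [['DET', 'NOUN'], ['VERB', '.']]


def is_aligned(X, y):
    """Check that every sentence has exactly one label per feature dict."""
    return len(X) == len(y) and all(len(xs) == len(ys) for xs, ys in zip(X, y))


print(is_aligned(X_toy, y_toy))                     # True
print(is_aligned(X_toy, [['DET'], ['VERB', '.']]))  # False
```

A check of this kind can catch misaligned inputs and labels before training starts.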

In [0]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_valid = [sent2features(s) for s in valid_sents]
y_valid = [sent2labels(s) for s in valid_sents]

CPU times: user 36.3 s, sys: 819 ms, total: 37.1 s
Wall time: 37.2 s


In [0]:
print("Feature dict for the first token in the first validation sentence:")
print(X_valid[0][0])
print("Its label:")
print(y_valid[0][0])

Feature dict for the first token in the first validation sentence:
{'bias': 1.0, 'token.lower()': 'the', 'token.suffix': 'The', 'token.prefix': 'T', 'token.isupper()': False, 'token.istitle()': True, 'token.isdigit()': False, 'BOS': True, '+1:token.lower()': 'fulton', '+1:token.istitle()': True, '+1:token.isupper()': False}
Its label:
DET


# Training and evaluation

We use the fast [CRFsuite](http://www.chokkan.org/software/crfsuite/) implementation via the scikit-learn-compatible [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io) wrapper to train a CRF model on the data.

In [0]:
%%capture 
# only to avoid ugly printouts during install
!pip install sklearn-crfsuite

In [0]:
# Import sklearn-crfsuite and its metrics, then configure a CRF model
# that is trained with the averaged perceptron algorithm.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

crf = sklearn_crfsuite.CRF(
    algorithm='ap',  # averaged perceptron
    max_iterations=500,
    all_possible_transitions=False,
    epsilon=1e-5
)

In [0]:
# Train the model, then compute the token-level weighted F1 score
# on the validation set.
crf.fit(X_train, y_train)

labels = list(crf.classes_)

y_pred = crf.predict(X_valid)
metrics.flat_f1_score(y_valid, y_pred, average='weighted', labels=labels)

0.9724865413483401
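
The F1 score above is a token-level metric. To judge whether the model is "good enough", it helps to compare it with a sentence-level one: the share of sentences in which every tag is correct. The sketch below computes both on toy data; the helper names and the toy labels are hypothetical. For the real data, `sklearn_crfsuite.metrics` provides `flat_accuracy_score` and `sequence_accuracy_score`, which take `y_valid` and `y_pred` in the same list-of-lists form.

```python
def token_accuracy(y_true, y_pred):
    """Share of tokens whose predicted tag equals the gold tag."""
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


def sentence_accuracy(y_true, y_pred):
    """Share of sentences in which all predicted tags are correct."""
    return sum(ts == ps for ts, ps in zip(y_true, y_pred)) / len(y_true)


# Toy gold and predicted tags: one wrong token in the second sentence.
y_true_toy = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB'], ['NOUN']]
y_pred_toy = [['DET', 'NOUN', 'VERB'], ['PRON', 'NOUN'], ['NOUN']]

print(round(token_accuracy(y_true_toy, y_pred_toy), 3))     # 0.833
print(round(sentence_accuracy(y_true_toy, y_pred_toy), 3))  # 0.667
```

A single wrong token makes the whole sentence count as wrong, so the sentence-level score is never higher than the token-level one and is usually noticeably lower for long sentences.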

In [0]:
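# The sort key is inherited from the NER example, where it groups B- and I- tags;
# here it only determines the row order of the report.
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)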
print(metrics.flat_classification_report(
    y_valid, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

           .      1.000     1.000     1.000     14377
           X      0.761     0.540     0.632       100
         ADJ      0.904     0.915     0.909      8525
         ADP      0.977     0.981     0.979     15138
         ADV      0.917     0.934     0.926      4438
        VERB      0.969     0.975     0.972     18229
         DET      0.995     0.995     0.995     14128
        CONJ      0.994     0.996     0.995      3435
        NOUN      0.976     0.968     0.972     36341
        PRON      0.985     0.989     0.987      3427
         PRT      0.938     0.924     0.931      2877
         NUM      0.975     0.973     0.974      2386

    accuracy                          0.972    123401
   macro avg      0.949     0.932     0.939    123401
weighted avg      0.973     0.972     0.972    123401
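
The two averages at the bottom of the report differ because the macro average treats every tag equally, while the weighted average weights each tag by its support. The sketch below recomputes both F1 averages from the rounded per-tag values in the report; the variable names are hypothetical, and because the inputs are already rounded, the results can differ from the report in the last digit.

```python
# (f1-score, support) per tag, copied from the report above.
report = {
    '.': (1.000, 14377),
    'X': (0.632, 100),
    'ADJ': (0.909, 8525),
    'ADP': (0.979, 15138),
    'ADV': (0.926, 4438),
    'VERB': (0.972, 18229),
    'DET': (0.995, 14128),
    'CONJ': (0.995, 3435),
    'NOUN': (0.972, 36341),
    'PRON': (0.987, 3427),
    'PRT': (0.931, 2877),
    'NUM': (0.974, 2386),
}

total_support = sum(support for _, support in report.values())

# Macro average: plain mean over tags, so the rare X tag counts as much as NOUN.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each tag contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support)       # 123401
print(round(macro_f1, 3))  # 0.939
print(round(weighted_f1, 3))
```

The low F1 of the rare `X` tag pulls the macro average down to about 0.94, while it barely affects the weighted average, since `X` accounts for only 100 of the 123,401 tokens.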



The model above was trained with "ap", i.e., averaged perceptron, which is one of several learning methods that CRFsuite implements.

# Demonstration

Just for fun, we can try out the model.

In [0]:
def predict_tags(sent):
    """Predict tags for a sentence.
    sent is a string.
    """
    doc = nlp(sent)
    return crf.predict([[token2features(doc, i) for i in range(len(doc))]])
    

In [0]:
while True:
    sent = input("\nEnter a sentence to tag or press return to quit:\n")
    if sent:
        print(predict_tags(sent))
    else:
        print("\nEmpty input received -- bye!")
        break


Enter a sentence to tag or press return to quit:
This is an English sentence!
[['DET', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']]

Enter a sentence to tag or press return to quit:
I am a Data Science enthusiast..
[['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'NOUN', 'NUM']]

Enter a sentence to tag or press return to quit:


Empty input received -- bye!
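
The demo prints only the tags, which makes mistakes hard to spot. A minimal sketch of pairing each token with its tag is shown below; `pair_tokens_with_tags` is a hypothetical helper, and the token list is the tokenization that spaCy is assumed to produce for the first example sentence (six tokens, matching the six predicted tags).

```python
def pair_tokens_with_tags(tokens, tags):
    """Return (token, tag) pairs; the two lists must have equal length."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


# Assumed tokenization of "This is an English sentence!" and the tags
# printed by the demo above.
tokens = ['This', 'is', 'an', 'English', 'sentence', '!']
tags = ['DET', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']

pairs = pair_tokens_with_tags(tokens, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

With the real model, the tokens could be taken from the same `Doc` that `predict_tags` builds, for example `[token.text for token in nlp(sent)]`. Paired this way, the output shows that the final "!" was tagged `NOUN` rather than `.`.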
