<a href="https://colab.research.google.com/github/ELehmann91/FS1/blob/master/Copy_of_POS_tagging_with_classical_models_handout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task: Part of speech tagging

In this task, we build a very rudimentary POS tagger "from scratch" using spaCy and a CRF model.

(For demonstration purposes, we disregard for the moment the fact that spaCy has a built-in POS tagger.)

The input is a tokenized English sentence. The task is to label each word with a part of speech (POS) tag. The tag set is NLTK's 12-tag universal tag set, a coarse-grained predecessor of the [Universal Dependencies project's](https://universaldependencies.org/) tag set:

- NOUN: noun
- VERB: verb
- DET: determiner
- ADJ: adjective
- ADP: adposition (e.g., prepositions)
- ADV: adverb
- CONJ: conjunction
- NUM: numeral
- PRT: particle (function word that cannot be inflected, has no meaning in
  itself and doesn't fit elsewhere, e.g., "to")
- PRON: pronoun
- .: punctuation
- X: other

The code in this task is an adaptation of the NER code in the sklearn-crfsuite documentation.

# The data set

__Brown__ corpus: "The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961" (Wikipedia: Brown Corpus)

Let's download and inspect the data!

In [0]:
%%capture
!pip install nltk

In [2]:
import nltk

from nltk.corpus import brown
nltk.download('brown')

brown.words()

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [3]:
nltk.download('universal_tagset')
brown.tagged_words(tagset='universal')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('The', 'DET'), ('Fulton', 'NOUN'), ...]

In [4]:
brown.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [5]:
len(brown.words())

1161192

From the `brown` object provided by NLTK, we will work with the tagged sentence list:

In [6]:
sents = brown.tagged_sents(tagset="universal")

sents[:1]

[[('The', 'DET'),
  ('Fulton', 'NOUN'),
  ('County', 'NOUN'),
  ('Grand', 'ADJ'),
  ('Jury', 'NOUN'),
  ('said', 'VERB'),
  ('Friday', 'NOUN'),
  ('an', 'DET'),
  ('investigation', 'NOUN'),
  ('of', 'ADP'),
  ("Atlanta's", 'NOUN'),
  ('recent', 'ADJ'),
  ('primary', 'NOUN'),
  ('election', 'NOUN'),
  ('produced', 'VERB'),
  ('``', '.'),
  ('no', 'DET'),
  ('evidence', 'NOUN'),
  ("''", '.'),
  ('that', 'ADP'),
  ('any', 'DET'),
  ('irregularities', 'NOUN'),
  ('took', 'VERB'),
  ('place', 'NOUN'),
  ('.', '.')]]

In [7]:
len(sents)

57340

We divide our data set into a training part and a validation part, using the first 5,734 sentences (10%) for validation:

In [0]:
valid_sents = sents[:5734]
train_sents = sents[5734:]
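
The hard-coded cut at 5734 corresponds to 10% of the 57,340 sentences. A minimal sketch of deriving that index from a percentage instead, assuming a hypothetical helper `split_index` that is not part of the original notebook:

```python
def split_index(n_items, valid_percent=10):
    """Return the index that separates the validation items (first) from the training items."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return n_items * valid_percent // 100

# 57340 is the sentence count reported by len(sents) above.
cut = split_index(57340)
print(cut)

# Toy stand-in for the tagged sentence list, to show the slicing pattern.
toy_sents = list(range(20))
toy_cut = split_index(len(toy_sents))
valid_part, train_part = toy_sents[:toy_cut], toy_sents[toy_cut:]
print(len(valid_part), len(train_part))
```

Because the Brown corpus is ordered by genre, taking the first 10% is a simple but not a random split; shuffling before slicing would be one alternative.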

# Feature template

Since the plan is to build a CRF model, we need a __feature template__, which generates features for a word in a sentence (our sequence in the sequence tagging task). We use spaCy for feature extraction.

In [9]:
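!pip install spacy
# Uncomment to download the model if it is not installed yet:
#!python -m spacy download en_core_web_lg

import spacy

# Feature extraction only needs the tokenizer and lexical attributes,
# so the unused parser and NER pipeline components are disabled on load.
en = spacy.load('en_core_web_lg', disable=['parser', 'ner'])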



We write a function which generates features for a token in a sentence, which is already a spaCy document. The feature vector is represented as a `dict` mapping feature names to their values.

The desired **feature set for a token is**:

- `bias`: a constant value of 1,
- `token.lower`: the lowercased text of the token,
- `token.suffix`: the token's suffix as defined by spaCy,
- `token.prefix`: the token's prefix as defined by spaCy,
- `token.is_upper`: boolean value indicating whether the token is all uppercase,
- `token.is_title`: boolean value indicating whether the token is titlecased,
- `token.is_digit`: boolean value indicating whether the token consists of digits.

These are only the `Token`'s own properties, but they represent no context.
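
For reference, these own-token features can be approximated without spaCy. The sketch below assumes spaCy's default behaviour, in which `prefix_` is the first character and `suffix_` is the last three characters of the token. The helper name `plain_token_features` is hypothetical, and Python's string methods only approximate spaCy's flags.

```python
def plain_token_features(word):
    """Approximate the own-token features with plain string operations."""
    return {
        'bias': 1,
        'word': word.lower(),
        'suffix': word[-3:],       # assumption: spaCy's default suffix is the last 3 characters
        'prefix': word[:1],        # assumption: spaCy's default prefix is the first character
        'upper': word.isupper(),
        'title': word.istitle(),
        'digit': word.isdigit(),
    }

print(plain_token_features('says'))
print(plain_token_features('It'))
```

Short suffixes such as the last three characters are useful for POS tagging because English inflectional endings ("-ing", "-ed", "-ly") often indicate the word class.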

We would also like to include information about the previous and next words, and to indicate whether the `Token` is at the beginning or the end of the sentence.

The **contextual features** should be:
 
- `-1:token.lower`: the lowercased text of the previous token,
- `-1:token.is_title`: whether the previous token is titlecased,
- `-1:token.is_upper`: whether the previous token is all uppercase,
- `+1:token.lower`: the lowercased text of the next token,
- `+1:token.is_title`: whether the next token is titlecased,
- `+1:token.is_upper`: whether the next token is all uppercase,
- `BOS`: boolean value indicating whether the token is at the beginning of a sentence,
- `EOS`: boolean value indicating whether the token is at the end of a sentence.

In [0]:
def token2features(sent, i):
    """Return a feature dict for a token.

    sent is a spaCy Doc containing a sentence, i is the token's index in it.
    """
    token = sent[i]
    # Features describing the token itself
    features = {
        'bias': 1,
        'word': token.lower_,
        'suffix': token.suffix_,
        'prefix': token.prefix_,
        'upper': token.is_upper,
        'title': token.is_title,
        'digit': token.is_digit,
    }

    if i > 0:
        # Features describing the previous token
        prev = sent[i - 1]
        features.update({
            'prev_word': prev.lower_,
            'prev_title': prev.is_title,
            'prev_upper': prev.is_upper,
        })
    else:
        # First token: mark the beginning of the sentence
        features['BOS'] = True

    if i < len(sent) - 1:
        # Features describing the next token
        nxt = sent[i + 1]
        features.update({
            'next_word': nxt.lower_,
            'next_title': nxt.is_title,
            'next_upper': nxt.is_upper,
        })
    else:
        # Last token: mark the end of the sentence
        features['EOS'] = True

    return features


In [11]:
sentence = en('I have a immutable large car')
sentence

I have a immutable large car

In [12]:
token2features(sentence,0)

{'BOS': True,
 'bias': 1,
 'digit': False,
 'next_title': False,
 'next_upper': False,
 'next_word': 'have',
 'prefix': 'I',
 'suffix': 'I',
 'title': True,
 'upper': True,
 'word': 'i'}

For training, we will also need functions that generate the list of feature dicts and the list of labels for each sentence in our training corpus:

In [0]:
from spacy.tokens import Doc

def sent2features(sent):
    """Return a list of feature dicts for a sentence in the data set."""
    # Brown sentences are lists of (token, POS) pairs; only the tokens are needed here.
    words = [word for word, tag in sent]

    # Build a Doc directly from the pre-tokenized words so that spaCy
    # does not re-tokenize the sentence.
    doc = Doc(en.vocab, words=words)

    # Apply token2features to every token of the sentence.
    return [token2features(doc, i) for i in range(len(doc))]

def sent2labels(sent):
    """Return the list of POS labels for a sentence in the data set."""
    return [tag for word, tag in sent]

Sanity check: let's see the features and labels of the first two tokens for two sentences in the corpus:

In [19]:
for i in range(80,82):
  print(sents[i])
  print(sent2features(sents[i])[:2])
  print(sent2labels(sents[i])[:2])

[('It', 'PRON'), ('says', 'VERB'), ('that', 'ADP'), ('``', '.'), ('in', 'ADP'), ('the', 'DET'), ('event', 'NOUN'), ('Congress', 'NOUN'), ('does', 'VERB'), ('provide', 'VERB'), ('this', 'DET'), ('increase', 'NOUN'), ('in', 'ADP'), ('federal', 'ADJ'), ('funds', 'NOUN'), ("''", '.'), (',', '.'), ('the', 'DET'), ('State', 'NOUN'), ('Board', 'NOUN'), ('of', 'ADP'), ('Education', 'NOUN'), ('should', 'VERB'), ('be', 'VERB'), ('directed', 'VERB'), ('to', 'PRT'), ('``', '.'), ('give', 'VERB'), ('priority', 'NOUN'), ("''", '.'), ('to', 'ADP'), ('teacher', 'NOUN'), ('pay', 'NOUN'), ('raises', 'NOUN'), ('.', '.')]
[{'bias': 1, 'word': 'it', 'suffix': 'It', 'prefix': 'I', 'upper': False, 'title': True, 'digit': False, 'BOS': True, 'next_word': 'says', 'next_title': False, 'next_upper': False}, {'bias': 1, 'word': 'says', 'suffix': 'ays', 'prefix': 's', 'upper': False, 'title': False, 'digit': False, 'prev_word': 'it', 'prev_title': True, 'prev_upper': False, 'next_word': 'that', 'next_title': False

# Putting the data into final form

Everything is ready to generate the training data in the form that CRFsuite expects. Note that our inputs and labels are 2-level representations, i.e., lists of lists, because we deal with token sequences (sentences).

In [20]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_valid = [sent2features(s) for s in valid_sents]
y_valid = [sent2labels(s) for s in valid_sents]

CPU times: user 37.2 s, sys: 777 ms, total: 37.9 s
Wall time: 38 s


In [21]:
print("Feature dict for the first token in the first validation sentence:")
print(X_valid[0][0])
print("Its label:")
print(y_valid[0][0])

Feature dict for the first token in the first validation sentence:
{'bias': 1, 'word': 'the', 'suffix': 'The', 'prefix': 'T', 'upper': False, 'title': True, 'digit': False, 'BOS': True, 'next_word': 'fulton', 'next_title': True, 'next_upper': False}
Its label:
DET
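
The nested layout can be illustrated on toy data without spaCy or NLTK. In this sketch the sentences and the helpers `toy_features` and `toy_labels` are made up for illustration; the only point is that the outer lists run over sentences, the inner lists over tokens, and that features and labels must stay aligned.

```python
# Two toy tagged sentences in the same (token, tag) format as the Brown data.
toy_sents = [
    [('The', 'DET'), ('jury', 'NOUN'), ('said', 'VERB')],
    [('It', 'PRON'), ('says', 'VERB')],
]

def toy_features(sent):
    """One feature dict per token (a single toy feature here)."""
    return [{'word': word.lower()} for word, _ in sent]

def toy_labels(sent):
    """One label per token."""
    return [tag for _, tag in sent]

X = [toy_features(s) for s in toy_sents]  # list of sentences, each a list of dicts
y = [toy_labels(s) for s in toy_sents]    # list of sentences, each a list of tags

# Every sentence must have exactly one label per feature dict.
for xs, ys in zip(X, y):
    assert len(xs) == len(ys)

print(len(X), [len(xs) for xs in X])
print(y)
```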


# Training and evaluation

We use the fast [CRFsuite](http://www.chokkan.org/software/crfsuite/) implementation via the scikit-learn-compatible [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io) wrapper to train a CRF model on the data.

In [0]:
%%capture 
# only to avoid ugly printouts during install
!pip install sklearn_crfsuite

In [0]:
# Import sklearn-crfsuite together with its custom metrics. The handout asks for an
# averaged perceptron model evaluated with both token-level and sentence-level accuracy.
import sklearn_crfsuite 
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

In [24]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CPU times: user 3min 5s, sys: 372 ms, total: 3min 5s
Wall time: 3min 5s


In [25]:
# Inspect the label set learned by the model. The token-level and sentence-level
# metrics below help to judge whether the model is "good enough".

labels = list(crf.classes_)
labels

['VERB',
 'ADP',
 'ADV',
 '.',
 'DET',
 'NOUN',
 'ADJ',
 'PRT',
 'CONJ',
 'NUM',
 'PRON',
 'X']

In [26]:
y_pred = crf.predict(X_valid)
metrics.flat_f1_score(y_valid, y_pred,
                      average='weighted', labels=labels)

0.9767205077886142

Yes, it is good enough at the token level: the weighted F1 score is about 0.977.

In [29]:
metrics.sequence_accuracy_score(y_valid, y_pred)

0.6510289501220788

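Oops: at the sentence level the accuracy is only about 65%, which is not that good. A sentence counts as correct only if every one of its tokens is tagged correctly, so even rare token errors add up over a whole sentence.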
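
The gap between the two scores can be made concrete with a small sketch. The functions `token_accuracy` and `sentence_accuracy` below are simplified stand-ins written for illustration, not the sklearn-crfsuite implementations, and the toy labels are invented. The final estimate rests on two loud assumptions: the weighted F1 of about 0.977 is used as a proxy for per-token accuracy, and tagging errors are treated as independent across tokens.

```python
def token_accuracy(y_true, y_pred):
    """Share of correctly tagged tokens over all sentences."""
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

def sentence_accuracy(y_true, y_pred):
    """Share of sentences in which every token is tagged correctly."""
    return sum(ts == ps for ts, ps in zip(y_true, y_pred)) / len(y_true)

# Invented toy data: 8 tokens in 3 sentences, with one error in each of two sentences.
y_true_toy = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB'], ['NOUN', 'VERB', 'ADV']]
y_pred_toy = [['DET', 'NOUN', 'VERB'], ['PRON', 'NOUN'], ['NOUN', 'VERB', 'ADJ']]

print(token_accuracy(y_true_toy, y_pred_toy))     # 6 of 8 tokens correct
print(sentence_accuracy(y_true_toy, y_pred_toy))  # 1 of 3 sentences correct

# Back-of-the-envelope estimate for the Brown corpus, using the word and
# sentence counts printed earlier in the notebook.
avg_len = 1161192 / 57340       # average sentence length in tokens
estimate = 0.9767 ** avg_len    # assumption: independent errors, F1 as accuracy proxy
print(round(avg_len, 2), round(estimate, 2))
```

Under these assumptions the estimate lands in the same region as the measured sentence-level accuracy, which suggests the 65% is largely a consequence of sentence length rather than a sign of a separate problem.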

In [0]:
# Sort the tags alphabetically so that the report rows appear in a stable order
sorted_labels = sorted(labels)
print(metrics.flat_classification_report(
    y_valid, y_pred, labels=sorted_labels, digits=3
))

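Note that CRFsuite implements several learning methods. The handout suggests "ap", i.e., averaged perceptron, whereas the model above was trained with `algorithm='lbfgs'`, i.e., L-BFGS with the L1 and L2 regularization coefficients `c1` and `c2`.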

# Demonstration

Just for fun, we can try out the model.

In [0]:
def predict_tags(sent):
    """Predict tags for a sentence.
    sent is a string.
    """
    doc = en(sent)
    return crf.predict([[token2features(doc, i) for i in range(len(doc))]])
    

In [31]:
while True:
    sent = input("\nEnter a sentence to tag or press return to quit:\n")
    if sent:
        print(predict_tags(sent))
    else:
        print("\nEmpty input received -- bye!")
        break


Enter a sentence to tag or press return to quit:
Hello is it me you are looking for?
[['PRT', 'VERB', 'PRON', 'PRON', 'PRON', 'VERB', 'VERB', 'ADP', '.']]

Enter a sentence to tag or press return to quit:


Empty input received -- bye!
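
The raw tag list is easier to read when each tag is shown next to its token. The sketch below uses the tags from the session above and assumes that the tokenizer split the sentence into the nine tokens listed, with the question mark as its own token, which is consistent with the nine predicted tags. The helper `pair_tokens_with_tags` is hypothetical.

```python
def pair_tokens_with_tags(tokens, tags):
    """Return (token, tag) pairs, checking that both sequences are aligned."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))

# Assumed tokenization of the example sentence and the tags predicted above.
tokens = ['Hello', 'is', 'it', 'me', 'you', 'are', 'looking', 'for', '?']
tags = ['PRT', 'VERB', 'PRON', 'PRON', 'PRON', 'VERB', 'VERB', 'ADP', '.']

pairs = pair_tokens_with_tags(tokens, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Seen this way, individual mistakes stand out: for example, "Hello" is tagged as a particle (`PRT`).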
