# CRF for Entity Extraction on CoNLL2003

In this notebook we build a CRF model for Named Entity Recognition on the CoNLL-2003 English dataset.
We use the `sklearn-crfsuite` package to implement the model and `seqeval` to compute the entity-level F1-score.

---

In [1]:
import os
from utils import dataio, modelutils
from pprint import pprint
from seqeval.metrics import classification_report
from sklearn_crfsuite import CRF

## Load Dataset
We load the CoNLL-2003 dataset from [this GitHub repo](https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003).
For each token it reports the Part-of-Speech tag, the syntactic chunk tag (called the "Dependency tag", or `dep`, in the rest of this notebook) and the Entity tag in BIO notation. There is one token per line, with fields separated by whitespace, and sentences are separated by an empty line.
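
The loader `dataio.load_conll_data` lives in a local `utils` module that is not shown here. As a rough illustration of the file format described above, here is a minimal sketch of such a reader. The function name `read_conll`, its return shape, and the skipping of `-DOCSTART-` lines are assumptions for illustration, not the actual helper.

```python
import io


def read_conll(lines):
    """Parse CoNLL-style lines into (sentences, labels, label_set).

    Hypothetical stand-in for dataio.load_conll_data: each sentence is a
    list of (word, pos, chunk) tuples, each label list holds the BIO tags.
    """
    sentences, labels, label_set = [], [], set()
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        # Assumption: document separator lines carry no sentence content.
        if line.startswith('-DOCSTART-'):
            continue
        if not line:
            # An empty line closes the current sentence, if any.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        word, pos, chunk, entity = line.split()
        tokens.append((word, pos, chunk))
        tags.append(entity)
        label_set.add(entity)
    # Flush the last sentence when the file has no trailing empty line.
    if tokens:
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels, label_set


sample = io.StringIO(
    "German JJ B-NP B-MISC\n"
    "call NN I-NP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)
raw, Y, label_set = read_conll(sample)
print(raw[0])
print(Y[1])
```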

In [2]:
data_dir = os.path.join('data', 'conll03')
raw_train, Y_train, output_labels = dataio.load_conll_data('train.txt', dir_path=data_dir)
raw_valid, Y_valid, _ = dataio.load_conll_data('valid.txt', dir_path=data_dir)
raw_test, Y_test, _ = dataio.load_conll_data('test.txt', dir_path=data_dir)

Reading file data\conll03\train.txt
Read 14027 sentences
Reading file data\conll03\valid.txt
Read 3249 sentences
Reading file data\conll03\test.txt
Read 3452 sentences


In [3]:
print("Labels:", output_labels)

Labels: {'I-ORG', 'I-PER', 'B-MISC', 'B-PER', 'I-MISC', 'B-ORG', 'B-LOC', 'O', 'I-LOC'}


In [4]:
print("Sentence Example:")
pprint(raw_train[0])
print("="*30)
print(Y_train[0])

Sentence Example:
[('German', 'JJ', 'B-NP'),
 ('call', 'NN', 'I-NP'),
 ('to', 'TO', 'B-VP'),
 ('boycott', 'VB', 'I-VP'),
 ('British', 'JJ', 'B-NP'),
 ('lamb', 'NN', 'I-NP'),
 ('.', '.', 'O')]
['B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


---

## Feature Functions

In this section we define the features extracted from each token. Each token is represented by a feature dictionary that contains:
* The lowercase token string*;
* The token suffixes (last two and last three characters);
* Whether the token is capitalized*;
* Whether the token is uppercase*;
* Whether the token consists only of digits;
* The complete Part-of-Speech tag of the token*;
* A more general Part-of-Speech tag (its first two characters)*;
* The complete Dependency tag of the token*;
* A more general Dependency tag (its last two characters)*;
* Whether the token is the first of the sentence;
* Whether the token is the last of the sentence.

\* also extracted for the previous and next tokens, when they exist.  

> Note: categorical features are one-hot encoded.
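
To make the note on one-hot encoding concrete, the sketch below shows how a feature dictionary can be flattened into indicator attributes: a string value becomes its own `name:value` attribute with weight 1.0, while booleans and numbers keep their name and become numeric weights. This only illustrates the idea and is not the library's code; `to_indicator_attributes` is a hypothetical helper.

```python
def to_indicator_attributes(features):
    """Flatten a feature dict into {attribute_name: weight}.

    Hypothetical helper that illustrates one-hot encoding of
    categorical (string) features.
    """
    attributes = {}
    for name, value in features.items():
        # bool must be checked before int, since bool is a subclass of int.
        if isinstance(value, bool):
            attributes[name] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            attributes[name] = float(value)
        elif isinstance(value, str):
            # One indicator per distinct (name, value) pair.
            attributes[f'{name}:{value}'] = 1.0
        else:
            raise TypeError(f'Unsupported feature value for {name!r}')
    return attributes


example = {'bias': 1.0, 'postag': 'NNP', 'word.istitle()': True}
encoded = to_indicator_attributes(example)
print(encoded)
```

With this encoding, two tokens tagged `NNP` and `VB` activate two different attributes (`postag:NNP` and `postag:VB`), each with its own learned weight.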

In [5]:
def word_features(sentence, idx):
    """Extract features related to a word and its neighbours"""
    word, pos, dep = sentence[idx]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': pos,
        'postag[:2]': pos[:2],
        'deptag': dep,
        'deptag[-2:]': dep[-2:]
    }
    if idx > 0:
        word1, pos1, dep1 = sentence[idx-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': pos1,
            '-1:postag[:2]': pos1[:2],
            '-1:deptag': dep1,
            '-1:deptag[-2:]': dep1[-2:],
        })
    else:
        features['BOS'] = True
        
    if idx < len(sentence)-1:
        word1, pos1, dep1 = sentence[idx+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': pos1,
            '+1:postag[:2]': pos1[:2],
            '+1:deptag': dep1,
            '+1:deptag[-2:]': dep1[-2:],
        })
    else:
        features['EOS'] = True
                
    return features


def sentence_features(sentence):
    return tuple(word_features(sentence, index) for index in range(len(sentence)))

X_train = [sentence_features(sentence) for sentence in raw_train]
X_valid = [sentence_features(sentence) for sentence in raw_valid]
X_test = [sentence_features(sentence) for sentence in raw_test]

In [6]:
print("Token features example:")
pprint(X_train[1][1])
print("="*30)
print(Y_train[1][1])

Token features example:
{'-1:deptag': 'B-NP',
 '-1:deptag[-2:]': 'NP',
 '-1:postag': 'NNP',
 '-1:postag[:2]': 'NN',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:word.lower()': 'peter',
 'EOS': True,
 'bias': 1.0,
 'deptag': 'I-NP',
 'deptag[-2:]': 'NP',
 'postag': 'NNP',
 'postag[:2]': 'NN',
 'word.isdigit()': False,
 'word.istitle()': True,
 'word.isupper()': False,
 'word.lower()': 'blackburn',
 'word[-2:]': 'rn',
 'word[-3:]': 'urn'}
I-PER


---

## Training

In [7]:
%%time

crf = CRF(
    algorithm = 'lbfgs',
    c1 = 0.1,
    c2 = 0.5,
    max_iterations = 800,
    all_possible_transitions = True,
    verbose = False
)

crf.fit(X_train, Y_train, X_dev=X_valid, y_dev=Y_valid)

Wall time: 4min 23s




CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.5,
    keep_tempfiles=None, max_iterations=800)

---

## Evaluation

We evaluate:
* **Memory consumption** using the attribute `crf.size_` (model size in bytes);
* **Latency in prediction** using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set using `seqeval`.
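
Scoring "on entities" means a prediction counts as correct only when both the span boundaries and the entity type match the gold annotation. The sketch below illustrates that idea in plain Python. It approximates what `seqeval` computes and is not its implementation; `extract_entities` and `entity_f1` are hypothetical helpers.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    # A trailing 'O' sentinel closes a span that runs to the sentence end.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, label = tag.partition('-')
        # An open span ends on 'O', on a new 'B-', or on a type change.
        if etype is not None and (prefix != 'I' or label != etype):
            entities.add((etype, start, i))
            start, etype = None, None
        # 'B-' (or an 'I-' with no open span) starts a new span.
        if prefix in ('B', 'I') and etype is None:
            start, etype = i, label
    return entities


def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O', 'B-LOC']]
precision, recall, f1 = entity_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy example three of the four tokens are tagged correctly, but only one of the two entities is fully recovered, so the entity-level scores are lower than the token-level accuracy.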

In [8]:
print(f'Model size: {crf.size_ / 1000000:0.2f}M')

Model size: 1.91M


In [9]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, crf):.3} s')

Model latency in prediction: 0.00019 s
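
The helper `modelutils.compute_prediction_latency` is defined outside this notebook. A minimal sketch of what such a helper might look like follows, assuming it reports the average `time.process_time()` spent per sentence. The averaging scheme, the `n_runs` parameter and the `DummyModel` class are assumptions for illustration only.

```python
import time


def compute_prediction_latency(X, model, n_runs=3):
    """Average CPU time, in seconds, needed to predict one sentence.

    Hypothetical re-implementation: the real helper may measure
    latency differently.
    """
    if not X:
        raise ValueError('X must contain at least one sentence')
    per_sentence = []
    for _ in range(n_runs):
        start = time.process_time()
        for sentence in X:
            # Predict one sentence at a time to mimic online usage.
            model.predict([sentence])
        elapsed = time.process_time() - start
        per_sentence.append(elapsed / len(X))
    return sum(per_sentence) / len(per_sentence)


class DummyModel:
    """Stand-in with the same predict() shape as a fitted CRF."""

    def predict(self, X):
        return [['O'] * len(sentence) for sentence in X]


X_demo = [[{'bias': 1.0}] * 5 for _ in range(100)]
latency = compute_prediction_latency(X_demo, DummyModel())
print(f'{latency:.3} s per sentence')
```

Note that `time.process_time()` measures CPU time of the current process rather than wall-clock time, so time spent sleeping or waiting on I/O is not included.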


In [10]:
datasets = [('Training Set', X_train, Y_train), 
            ('Test Set', X_test, Y_test), 
            ('Validation Set', X_valid, Y_valid)]

for title, X, Y in datasets:
    Y_pred = crf.predict(X)
    print(title)
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      ORG      0.961     0.943     0.952      6318
      PER      0.977     0.974     0.975      6600
      LOC      0.971     0.971     0.971      7140
     MISC      0.964     0.916     0.940      3438

micro avg      0.969     0.956     0.963     23496
macro avg      0.969     0.956     0.963     23496



Test Set
           precision    recall  f1-score   support

      PER      0.809     0.850     0.829      1616
      ORG      0.737     0.697     0.717      1660
      LOC      0.848     0.809     0.828      1667
     MISC      0.801     0.728     0.762       701

micro avg      0.799     0.778     0.788      5644
macro avg      0.798     0.778     0.787      5644



Validation Set
           precision    recall  f1-score   support

      ORG      0.818     0.772     0.794      1340
      LOC      0.903     0.868     0.885      1837
     MISC      0.907     0.808     0.855       922
      PER      0.888     0.894    

---