# CRF for Entity Extraction on WikiNER (Italian)
WikiNER is a dataset of annotated sentences for entity extraction, taken from Wikipedia. In this notebook we train and evaluate a CRF model on the Italian data to recognize entities such as persons, locations and organizations in text.

We use the `sklearn-crfsuite` package to implement our model and `seqeval` to compute the F1-score.

---

In [11]:
import os
from utils import dataio, modelutils
from pprint import pprint
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from seqeval.metrics import classification_report

## Data Preparation

We load the dataset from the `data/` directory. For each token, the dataset reports the word, its Part-of-Speech tag and its entity tag.

In [2]:
file_path = os.path.join('data', 'wikiner-it-wp3-raw.txt')
sentences, tags, output_labels = dataio.load_wikiner(file_path)

Read 127940 sentences.
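
The `dataio.load_wikiner` helper is not shown in this notebook. As a minimal sketch, assuming the raw WikiNER file stores one sentence per line as space-separated `word|POS|tag` tokens, a single line could be parsed roughly as follows. The name `parse_wikiner_line` is hypothetical and is not the actual helper.

```python
def parse_wikiner_line(line):
    """Split one raw line into (word, POS) pairs and a list of entity tags.

    Assumption: tokens are space-separated and shaped like word|POS|tag.
    """
    pairs, labels = [], []
    for token in line.split():
        # rsplit with maxsplit=2 keeps any '|' inside the word itself intact
        word, pos, tag = token.rsplit('|', 2)
        pairs.append((word, pos))
        labels.append(tag)
    return pairs, labels


example_line = "Paul|NPR|I-PER Broca|NOM|I-PER con|PRE|O"
example_sentence, example_tags = parse_wikiner_line(example_line)
print(example_sentence)
print(example_tags)
```

The two returned lists have the same shape as one entry of `sentences` and `tags` in the cells that follow.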


In [3]:
print("Labels:", output_labels)

Labels: {'I-LOC', 'I-ORG', 'O', 'B-PER', 'I-MISC', 'B-LOC', 'B-MISC', 'I-PER', 'B-ORG'}


In [4]:
print("Sentence Example:")
pprint(sentences[1])
print("="*30)
print(tags[1])

Sentence Example:
[('Seguirono', 'VER:remo'),
 ('Lamarck', 'NOM'),
 ('(', 'PON'),
 ('1744', 'NUM'),
 ('--', 'NOM'),
 ('1829', 'NUM'),
 (')', 'PON'),
 (',', 'PON'),
 ('Blumenbach', 'NOM'),
 ('(', 'PON'),
 ('1752', 'NUM'),
 ('--', 'NOM'),
 ('1840', 'NUM'),
 (')', 'PON'),
 (',', 'PON'),
 ('con', 'PRE'),
 ('le', 'DET:def'),
 ('sue', 'PRO:poss'),
 ('norme', 'NOM'),
 ('descrittive', 'ADJ'),
 ('del', 'PRE:det'),
 ('cranio', 'NOM'),
 (',', 'PON'),
 ('Paul', 'NPR'),
 ('Broca', 'NOM'),
 ('con', 'PRE'),
 ('la', 'DET:def'),
 ('focalizzazione', 'NOM'),
 ('dei', 'PRE:det'),
 ('rapporti', 'NOM'),
 ('tra', 'PRE'),
 ('morfologia', 'NOM'),
 ('e', 'CON'),
 ('funzionalità', 'NOM'),
 ('.', 'SENT')]
['O', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


---

## Feature Engineering

In this section, we build the feature set for each token. It consists of:
* The lowercase token string\*;
* The token suffixes (its last two and last three characters);
* Whether the token is capitalized\*;
* Whether the token is uppercase\*;
* Whether the token is a number;
* The complete Part-of-Speech tag of the token\*;
* A coarser Part-of-Speech tag (the first two characters of the complete tag)\*;
* Whether the token is the first in the sentence;
* Whether the token is the last in the sentence.

\* also computed for the previous and next tokens, when they exist.

> Note: categorical features are one-hot encoded.
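
To make the note above concrete, here is a rough sketch of what this encoding amounts to. It assumes the convention documented for python-crfsuite, where a string-valued feature becomes a binary `key:value` attribute, a boolean becomes 1.0 or 0.0, and a number is kept as a weight. The function `to_crfsuite_attributes` is a hypothetical illustration, not part of any library.

```python
def to_crfsuite_attributes(features):
    """Flatten a feature dict into {attribute_name: weight} pairs.

    Illustrative only: mimics how string values end up one-hot encoded.
    """
    attributes = {}
    for key, value in features.items():
        if isinstance(value, bool):
            # Booleans keep their key and become a 0/1 weight
            attributes[key] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            # Numeric features are used directly as weights
            attributes[key] = float(value)
        else:
            # Each distinct string value gets its own binary attribute
            attributes[f'{key}:{value}'] = 1.0
    return attributes


toy_features = {'bias': 1.0, 'postag': 'NOM', 'word.istitle()': True}
print(to_crfsuite_attributes(toy_features))
```

Under this view, `'postag': 'NOM'` and `'postag': 'NPR'` are two separate binary attributes, which is what one-hot encoding means here.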

In [5]:
def word_features(sentence, idx):
    """Extract features related to a word and its neighbours"""
    word, pos = sentence[idx]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': pos,
        'postag[:2]': pos[:2],
    }
    if idx > 0:
        word1, pos1 = sentence[idx-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': pos1,
            '-1:postag[:2]': pos1[:2],
        })
    else:
        features['BOS'] = True
        
    if idx < len(sentence)-1:
        word1, pos1 = sentence[idx+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': pos1,
            '+1:postag[:2]': pos1[:2],
        })
    else:
        features['EOS'] = True
                
    return features


def sentence_features(sentence):
    return tuple(word_features(sentence, index) for index in range(len(sentence)))

In [6]:
X = [sentence_features(sentence) for sentence in sentences]

In [7]:
print("Token features example:")
pprint(X[1][1])
print("="*30)
print(tags[1][1])

Token features example:
{'+1:postag': 'PON',
 '+1:postag[:2]': 'PO',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:word.lower()': '(',
 '-1:postag': 'VER:remo',
 '-1:postag[:2]': 'VE',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:word.lower()': 'seguirono',
 'bias': 1.0,
 'postag': 'NOM',
 'postag[:2]': 'NO',
 'word.isdigit()': False,
 'word.istitle()': True,
 'word.isupper()': False,
 'word.lower()': 'lamarck',
 'word[-2:]': 'ck',
 'word[-3:]': 'rck'}
I-PER


## Training

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, tags, test_size=0.2, 
                                                    random_state=3791)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, 
                                                      test_size=0.2, 
                                                      random_state=3791)

In [12]:
%%time

crf = CRF(
    algorithm = 'lbfgs',
    c1 = 0.1,
    c2 = 0.5,
    max_iterations = 800,
    all_possible_transitions = True,
    verbose = False
)

crf.fit(X_train, y_train, X_dev=X_valid, y_dev=y_valid)

Wall time: 32min 14s




CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.5,
    keep_tempfiles=None, max_iterations=800)

---

## Evaluation

We evaluate:
* **Memory consumption** using the attribute `crf.size_`;
* **Latency in prediction** using the function `time.process_time()`;
* **F1-score** _on entities_ using `seqeval`, on both the training and test sets.
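
The `modelutils.compute_prediction_latency` helper is not shown here. A minimal sketch of one way to measure latency with `time.process_time()` is given below, assuming that latency means the average CPU time needed to predict a single sentence. The function name, the `n_samples` parameter and the per-sentence averaging are all assumptions.

```python
import time


def mean_prediction_latency(model, X, n_samples=100):
    """Average CPU time, in seconds, spent predicting one sentence.

    Hypothetical helper: predicts the first n_samples sentences one at a time.
    """
    sample = X[:n_samples]
    if not sample:
        raise ValueError("X must contain at least one sentence")
    start = time.process_time()
    for sentence in sample:
        # Predict a batch of one to measure per-sentence cost
        model.predict([sentence])
    elapsed = time.process_time() - start
    return elapsed / len(sample)
```

With the fitted model it could be called as `mean_prediction_latency(crf, X_test)`. Note that `time.process_time()` counts CPU time only, so time spent sleeping or waiting on I/O is excluded.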

In [13]:
print('Model size: {:0.2f}M'.format(crf.size_ / 1000000))

Model size: 7.54M


In [14]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, crf):.3} s')

Model latency in prediction: 0.000309 s


In [15]:
datasets = [('Training Set', X_train, y_train), ('Test Set', X_test, y_test)]

# Use distinct loop names so the global feature list X is not overwritten
for title, X_split, y_split in datasets:
    y_pred = crf.predict(X_split)
    print(title)
    print(classification_report(y_split, y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      LOC      0.898     0.933     0.915     82830
      ORG      0.923     0.804     0.859     13708
     MISC      0.893     0.762     0.822     24386
      PER      0.940     0.931     0.935     46049

micro avg      0.911     0.897     0.904    166973
macro avg      0.911     0.897     0.902    166973



Test Set
           precision    recall  f1-score   support

      LOC      0.865     0.899     0.882     25889
      ORG      0.861     0.729     0.789      4224
      PER      0.903     0.896     0.900     14193
     MISC      0.795     0.656     0.719      7402

micro avg      0.867     0.850     0.858     51708
macro avg      0.865     0.850     0.856     51708
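
The scores above are computed on entities, not on individual tokens: a predicted entity counts as correct only if both its span and its type match the gold annotation. The sketch below illustrates the idea in simplified form. It is not `seqeval`'s implementation, the function names are hypothetical, and it assumes that an `I-` tag following `O` or a different entity type opens a new entity, as in the tags shown earlier.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB tag sequence."""
    entities = set()
    start, current = None, None
    # A trailing 'O' sentinel closes any entity still open at the end
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, etype = tag.partition('-')
        begins = prefix == 'B' or (prefix == 'I' and etype != current)
        if current is not None and (prefix == 'O' or begins):
            entities.add((current, start, i - 1))
            start, current = None, None
        if begins:
            start, current = i, etype
    return entities


def entity_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    gold, pred = set(), set()
    for sent_id, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
        gold |= {(sent_id, *span) for span in extract_entities(true_tags)}
        pred |= {(sent_id, *span) for span in extract_entities(pred_tags)}
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_true = [['O', 'I-PER', 'O', 'I-PER', 'I-PER', 'O']]
toy_pred = [['O', 'I-PER', 'O', 'I-PER', 'O', 'O']]
print(entity_prf(toy_true, toy_pred))
```

In this toy case the model finds "Lamarck"-style single-token entities correctly but truncates the two-token one, so only one of two gold entities matches exactly and precision, recall and F1 are all 0.5, even though five of six tokens are tagged correctly.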





---