# CRF for Entity Extraction on Annotated Corpus for Named Entity Recognition

In this notebook we build a Conditional Random Field (CRF) model for Named Entity Recognition on the [ACNER](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) dataset from Kaggle.

We use the `sklearn-crfsuite` package to implement the model and `seqeval` to compute entity-level F1-scores.

---

In [1]:
import os
from utils import dataio, modelutils
from pprint import pprint
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from seqeval.metrics import classification_report

## Data Preparation

Load the dataset from the `data/` directory and keep only the selected features.
For each token, the feature vector is composed of:
* The string of the token*;
* Token lemma*;
* Token Part-of-Speech tag*;
* Token shape (uppercase, lowercase, capitalized, punctuation, ...)*;
* Sentence Index.

\* Also included for the previous and next tokens. If the token is the first or the last in a sentence, the values for the previous or next token are replaced with a special value (`__start__` or `__end__`, respectively).

> Note: categorical features are one-hot encoded.
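The feature layout described above can be sketched as follows. This is only an illustration, not the code in `dataio`: the helper names `token_shape` and `sentence_features`, the exact shape categories, and the assumption that every previous/next feature receives the padding value at sentence boundaries are all hypothetical.

```python
def token_shape(word):
    """Coarse shape category for a token (hypothetical category names)."""
    if word.isupper():
        return 'uppercase'
    if word.islower():
        return 'lowercase'
    if word.istitle():
        return 'capitalized'
    if all(not ch.isalnum() for ch in word):
        return 'punctuation'
    return 'other'


def sentence_features(tokens, sentence_idx):
    """Build one feature dict per token.

    tokens: list of (word, lemma, pos) triples for a single sentence.
    """
    features = []
    for i, (word, lemma, pos) in enumerate(tokens):
        feats = {
            'word': word,
            'lemma': lemma,
            'pos': pos,
            'shape': token_shape(word),
            'sentence_idx': sentence_idx,
        }
        # Context features: pad with a special value at sentence boundaries.
        for prefix, j, pad in (('prev', i - 1, '__start__'),
                               ('next', i + 1, '__end__')):
            if 0 <= j < len(tokens):
                w, l, p = tokens[j]
                s = token_shape(w)
            else:
                w = l = p = s = pad
            feats.update({
                f'{prefix}-word': w,
                f'{prefix}-lemma': l,
                f'{prefix}-pos': p,
                f'{prefix}-shape': s,
            })
        features.append(feats)
    return features


example = sentence_features(
    [('Turkish', 'turkish', 'JJ'),
     ('officials', 'offici', 'NNS'),
     ('announced', 'announc', 'VBD')],
    sentence_idx=6742,
)
```

Here `example[1]` has the same keys as the token features printed later in the notebook, while `example[0]` carries `__start__` in its `prev-*` features and `example[2]` carries `__end__` in its `next-*` features.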

In [2]:
X, y, tags = dataio.load_anerd_data(os.path.join('data', 'annotated-ner-dataset', 'ner.csv'))

b'Skipping line 281837: expected 25 fields, saw 34\n'


Filter level: default
Features: Index(['lemma', 'next-lemma', 'next-pos', 'next-shape', 'next-word', 'pos',
       'prev-lemma', 'prev-pos', 'prev-shape', 'prev-word', 'sentence_idx',
       'shape', 'word', 'tag'],
      dtype='object')
Dataset dimension: 35177 sentences
Data read successfully!


In [3]:
print("Labels:")
pprint(tags)

Labels:
{'B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O',
 'unk'}


## Train-Test Split

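Split the data into a training set and a test set (80/20), then split the training portion again (80/20) to obtain a validation set, which is passed to the CRF during training. We set a fixed random state so that the results are easy to reproduce.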

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
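
To make the effect of the two consecutive splits concrete, the sketch below computes the expected size of each subset. It assumes that `X` holds one entry per sentence (35177, as reported when loading the data) and that `train_test_split` rounds the test portion up, which is how we understand scikit-learn to behave for a fractional `test_size`; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test portion is rounded up and the remainder goes
    to the training portion.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 35177  # sentences reported by the loader
n_train_full, n_test = split_sizes(n_total, 0.2)
n_train, n_valid = split_sizes(n_train_full, 0.2)

for name, n in (('train', n_train), ('valid', n_valid), ('test', n_test)):
    print(f'{name}: {n} sentences ({n / n_total:.1%})')
```

Under these assumptions the final proportions are roughly 64% training, 16% validation and 20% test.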

In [5]:
print("Token features example:")
pprint(X_train[0][1])
print("="*30)
print(y_train[0][1])

Token features example:
{'lemma': 'offici',
 'next-lemma': 'announc',
 'next-pos': 'VBD',
 'next-shape': 'lowercase',
 'next-word': 'announced',
 'pos': 'NNS',
 'prev-lemma': 'turkish',
 'prev-pos': 'JJ',
 'prev-shape': 'capitalized',
 'prev-word': 'Turkish',
 'sentence_idx': 6742.0,
 'shape': 'lowercase',
 'word': 'officials'}
O


---


## Training

In [6]:
%%time

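crf = CRF(
    algorithm='lbfgs',              # gradient-based training (L-BFGS)
    c1=0.1,                         # L1 regularization coefficient
    c2=0.5,                         # L2 regularization coefficient
    max_iterations=800,
    all_possible_transitions=True,  # also model transitions unseen in training
    verbose=False
)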

crf.fit(X_train, y_train, X_dev=X_valid, y_dev=y_valid)

Wall time: 15min 46s




CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.5,
    keep_tempfiles=None, max_iterations=800)

---

## Evaluation

We evaluate:
* **Model size**, as a proxy for memory consumption, using the attribute `crf.size_`;
* **Prediction latency**, measured with the function `time.process_time()`;
* **F1-score** _on entities_ on the test set (with the training set reported for comparison) using `seqeval`.
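
To illustrate what scoring _on entities_ means, here is a simplified, standard-library-only sketch of entity-level micro F1 over BIO tags. It is not `seqeval`'s implementation: the helpers `extract_entities` and `entity_f1` are hypothetical, and edge cases (for example an `I-` tag that follows `O`, which this sketch treats as the start of a new entity) may be handled differently by the library.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing 'O' sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, label = tag.partition('-')
        continues = prefix == 'I' and label == etype
        if not continues:
            if start is not None:
                entities.add((etype, start, i - 1))
                start, etype = None, None
            if prefix in ('B', 'I') and label:
                start, etype = i, label
    return entities


def entity_f1(y_true, y_pred):
    """Micro-averaged F1: an entity counts only if type and span both match."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true_demo = [['B-per', 'I-per', 'O', 'B-geo']]
y_pred_demo = [['B-per', 'O', 'O', 'B-geo']]
print(entity_f1(y_true_demo, y_pred_demo))
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities is recovered with the exact span, so the entity-level score is lower than token accuracy would suggest.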

In [7]:
print('Model size: {:0.2f}M'.format(crf.size_ / 1000000))

Model size: 4.04M


In [8]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, crf):.3} s')

Model latency in prediction: 0.00026 s
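
The implementation of `modelutils.compute_prediction_latency` is not shown in this notebook. A minimal sketch of one way to measure latency with `time.process_time()` is given below, assuming the figure of interest is CPU time per sentence. The function `mean_prediction_latency`, its `repeats` parameter and the `EchoModel` stand-in are hypothetical and exist only to keep the example self-contained.

```python
import time


def mean_prediction_latency(model, X, repeats=3):
    """CPU seconds per sequence, taking the fastest of `repeats` runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.process_time()
        model.predict(X)
        elapsed = time.process_time() - start
        best = min(best, elapsed)
    return best / len(X)


class EchoModel:
    """Stand-in for a fitted model: tags every token as 'O'."""

    def predict(self, X):
        return [['O'] * len(sentence) for sentence in X]


demo_X = [[{'word': 'a'}, {'word': 'b'}, {'word': 'c'}] for _ in range(1000)]
latency = mean_prediction_latency(EchoModel(), demo_X)
print(f'Latency per sentence: {latency:.3} s')
```

Note that `time.process_time()` counts CPU time of the current process and excludes time spent sleeping, so it can differ from wall-clock latency.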


In [9]:
datasets = [('Training Set', X_train, y_train), ('Test Set', X_test, y_test)]

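# Use distinct loop names so the full dataset `X` loaded earlier is not overwritten.
for title, X_split, y_split in datasets:
    y_pred = crf.predict(X_split)
    print(title)
    print(classification_report(y_split, y_pred, digits=3))
    print('\n')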

Training Set
           precision    recall  f1-score   support

      org      0.905     0.843     0.873     12900
      geo      0.902     0.952     0.926     23737
      gpe      0.982     0.940     0.960     10450
      tim      0.958     0.910     0.933     12878
      per      0.911     0.886     0.898     10949
      art      0.973     0.602     0.744       304
      nat      0.924     0.660     0.770       147
      eve      0.913     0.796     0.851       211

micro avg      0.925     0.910     0.918     71576
macro avg      0.926     0.910     0.917     71576



Test Set
           precision    recall  f1-score   support

      geo      0.842     0.901     0.871      7715
      gpe      0.966     0.925     0.945      3305
      per      0.782     0.761     0.771      3289
      org      0.762     0.698     0.729      3983
      tim      0.901     0.843     0.871      4053
      art      0.000     0.000     0.000        75
      eve      0.610     0.357     0.450        70
   

---