<a id=contents></a>

# Baseline Model building - Conditional Random Field model



[1. ETL and Train Test Split](#ETL)

[2. Modelling with Conditional Random Field](#CRF)

[3. Choice of model architectures](#selection)


[7. Conclusions and model comparison table](#conc)

In [408]:

import pandas as pd
import numpy as np

import functions as fn
import pickle

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("darkgrid")

from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from matplotlib import cm
import numpy as np

from sklearn_crfsuite import CRF, scorers, metrics
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report, flat_accuracy_score, flat_f1_score

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import re
import string
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
stop_words += list(string.punctuation)

import warnings
warnings.filterwarnings('ignore')

from scipy import stats as ss
import eli5
#baseline sequential evaluation metrics
from seqeval.metrics import accuracy_score as seq_acc
from seqeval.metrics import classification_report as seq_cr
from seqeval.metrics import f1_score as seq_f1_score

#python package for evaluation in line with 
import nereval


%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a id=ETL ><a/> 

## 1. ETL of data and Train-Test Split
    
[LINK to table of contents](#contents)

In [116]:
with open('clean_data/crf_train_data.pkl', 'rb') as f:
    crf_features_train = pickle.load(f)
    
with open('clean_data/crf_test_data.pkl', 'rb') as f:
    crf_features_test = pickle.load(f)
    
with open('clean_data/crf_valid_data.pkl', 'rb') as f:
    crf_features_valid = pickle.load(f)
    
with open('clean_data/crf_valid_targets.pkl', 'rb') as f:
    crf_targets_valid = pickle.load(f)
    
with open('clean_data/crf_train_targets.pkl', 'rb') as f:
    crf_targets_train = pickle.load(f)
    
with open('clean_data/crf_test_targets.pkl', 'rb') as f:
    crf_targets_test = pickle.load(f)
    

In [123]:
# a reminder of how our feature data is structured - 7th word of 1st sentence
crf_features_train[0][6]

{'word.lower()': 'london',
 'word.istitle()': 1,
 'len(word)': 6,
 'word.isupper()': 0,
 'word.isdigit()': 0,
 'word.prefix_2': 'Lo',
 'word.suffix_2': 'on',
 'word.prefix_3': 'Lon',
 'word.suffix_3': 'don',
 'word.frequency': 0,
 'word.+1_POS': 'TO',
 'word.-1_POS': 'IN',
 'word.-2_POS': 'VBN',
 'word.BOS': 0,
 'word.same_POS_-1': 0}

In [163]:
# and our target data
crf_targets_train[0][6]

'B-geo'

Before modelling, here's a quick reminder of the meaning of the target variable Tags:
* B - beginning of NE chunk
* I - inside NE chunk
* O - not an NE


* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon



In [195]:
print(f'We have {len(crf_features_train)} sentences in our training data.')
print(f'We have {len(crf_features_valid)} sentences in our validation data.')
print(f'We have {len(crf_features_test)} sentences in our test data.')

We have 1799 sentences in our training data.
We have 600 sentences in our validation data.
We have 600 sentences in our test data.


<a id = 'CRF'></a>

## 2. Baseline modelling with a Conditional Random Field

[LINK to table of contents](#contents)

We will be using a Conditional Random Field (CRF) model for our baseline and MVP models. A [CRF](#Charles Sutton and Andrew McCallum (2012), "An Introduction to Conditional Random Fields", Foundations and Trends® in Machine Learning: Vol. 4: No. 4, pp 267-373.http://dx.doi.org/10.1561/2200000013) ---

Sklearn's CRF requires the input data to be a list of lists of dicts. I stored these as pickle files in notebook and loaded them above

In [24]:
crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

In [128]:
#baseline
%time
crf.fit(crf_features_train, crf_targets_train)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 16.9 µs


CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [90]:
# crf object stores our Tag labels
labels = list(crf.classes_)
labels

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']

In [164]:
crf_y_pred_train = crf.predict(crf_features_train)
metrics.flat_f1_score(crf_targets_train, crf_y_pred_train,
                      average='weighted')

0.9863591333898722

### Classification Report Interpretation for train and validation data:

The main reported figure is the weighted average F1 Score:

$$    F_{1} = 2* \frac{Precision * Recall}{Precision + Recall}         $$

The support column refers to how many instances there are of each class. As we've seen before, this distribution is dominated by 'O' (non-NE) and there are some, such as nat ('national phenomena') that are almost zero (there were only 17 instances in the training data). 

The table below gives us the relevant metrics for our baseline model's performance. I would draw your attention to the near bottom right, were we see the macro-average F1 score across all the NE categories. The support column indicates how many instances of each NE there are across the data. 

In [140]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    crf_targets_train, crf_y_pred_train, labels=sorted_labels, digits=3))

             precision    recall  f1-score   support

        geo      0.849     0.936     0.890      1174
        gpe      0.906     0.858     0.881       808
        org      0.914     0.810     0.859       784
        tim      0.973     0.907     0.939       680
        per      0.972     0.970     0.971       633
        nat      1.000     0.941     0.970        17
        eve      0.958     0.958     0.958        24
        art      0.953     0.891     0.921        46

avg / total      0.913     0.897     0.904      4166



We have plenty of evidence of overfitting, our average F1 score dropping down from 0.9 to 0.7, so we will be fitting a new model using crossvalidation and GridSearchCV. 

In [145]:
crf_y_pred_valid = crf.predict(crf_features_valid)
metrics.flat_f1_score(crf_targets_valid, crf_y_pred_valid,
                      average='weighted', labels=labels)

0.7003058103975536

In [143]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    crf_targets_valid, crf_y_pred_valid, labels=sorted_labels, digits=3))

             precision    recall  f1-score   support

        geo      0.704     0.744     0.724       422
        gpe      0.762     0.794     0.778       218
        per      0.758     0.684     0.719       206
        tim      0.862     0.720     0.785       243
        art      0.000     0.000     0.000         5
        org      0.532     0.482     0.506       226
        eve      0.500     0.267     0.348        15
        nat      0.000     0.000     0.000         1

avg / total      0.716     0.686     0.699      1336



## 3. Optimisation with GridSearchCV and fine-tuning

In [222]:
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

f1_scorer = make_scorer(flat_f1_score,
                        average='weighted', labels=labels)


crf_params = {'c1': [ 0.1, 1.0, 10, 50, 100],
              'c2': [0.05, 0.1,  0.5, 1.0, 1.5]}

grid =  GridSearchCV(crf_optim, 
                    crf_params, 
                    seqeval_scorer, 
                    -1, cv=5, 
                    return_train_score=True, 
                    verbose=True)

grid.fit(crf_features_train, crf_targets_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed: 10.9min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=CRF(algorithm='lbfgs', all_possible_states=None,
                           all_possible_transitions=True, averaging=None,
                           c=None, c1=None, c2=None,
                           calibration_candidates=None, calibration_eta=None,
                           calibration_max_trials=None, calibration_rate=None,
                           calibration_samples=None, delta=None, epsilon=None,
                           error_sensitive=None, gamma=None,
                           keep_tempfi...
                           max_iterations=100, max_linesearch=None,
                           min_freq=None, model_filename=None,
                           num_memories=None, pa_type=None, period=None,
                           trainer_cls=None, variance=None, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'c1': [0.1, 1.0, 10, 50, 100],
                         'c2': [0.05, 0.1, 0.5, 

In [225]:
best_crf = grid.best_estimator_

In [226]:
grid.best_params_

{'c1': 0.1, 'c2': 0.1}

In [186]:
# further fine tuning around the c1 penalty term
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

f1_scorer = make_scorer(flat_f1_score,
                        average='weighted', labels=labels)

crf_params = {'c1': [0.095, 0.0975, 0.1, 0.1025, 0.105],
              'c2': [0.095, 0.0975, 0.1, 0.1025, 0.105]}

grid_2 =  GridSearchCV(crf_optim, 
                    crf_params, 
                    f1_scorer, 
                    -1, cv=5, 
                    return_train_score=True, 
                    verbose=True)

grid_2.fit(crf_features_train, crf_targets_train)

best_crf_2 = grid.best_estimator_

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed: 12.5min finished


I am going to use the python library seqeval, which has been specifically designed to work with BIO labels and to help with measuring performance on tasks "[such as named-entity recognition, part-of-speech tagging, semantic role labeling](#https://pypi.org/project/seqeval/)".
The `seqeval` package collapses the B- and I- type of tags into fewer NE tags, as seen in the classification report below. 

In [187]:
grid_2.best_params_


{'c1': 0.0975, 'c2': 0.1025}

In [189]:
flat_f1_score(crf_targets_train, best_crf_2.predict(crf_features_train), average='weighted')


0.9811511208222455

Now let's see how this performs on validation data -- we can see that we have overfitted significantly. 


In [190]:
flat_f1_score(crf_targets_valid, best_crf_2.predict(crf_features_valid), average='weighted')


0.9496762857123459

In [191]:
# further fine tuning around the c1 penalty term
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

crf_params = {'c1': [0.097, 0.0975, 0.098,],
              'c2': [ 0.103, 0.1025, 0.102]}

grid_3 =  GridSearchCV(crf_optim, 
                    crf_params, 
                    f1_scorer, 
                    -1, cv=10, 
                    return_train_score=True, 
                    verbose=True)

grid_3.fit(crf_features_train, crf_targets_train)

best_crf_3 = grid_3.best_estimator_


Fitting 10 folds for each of 9 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  8.2min finished


In [198]:
crf_y_train_pred = best_crf_3.predict(crf_features_train)

print(flat_f1_score(crf_targets_train, crf_y_train_pred, average='macro'))


0.937762045091888


In [201]:
crf_y_valid_pred = best_crf_3.predict(crf_features_valid)

sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))


print(flat_classification_report(crf_targets_valid, crf_y_valid_pred, digits=3, labels=sorted_labels))


              precision    recall  f1-score   support

           O      0.984     0.990     0.987     11034
       B-art      0.000     0.000     0.000         5
       I-art      0.000     0.000     0.000         7
       B-eve      0.500     0.267     0.348        15
       I-eve      0.333     0.231     0.273        13
       B-geo      0.710     0.744     0.727       422
       I-geo      0.693     0.604     0.646       101
       B-gpe      0.760     0.798     0.779       218
       I-gpe      0.111     0.250     0.154         4
       B-nat      0.000     0.000     0.000         1
       I-nat      0.000     0.000     0.000         0
       B-org      0.576     0.518     0.545       226
       I-org      0.633     0.654     0.643       153
       B-per      0.749     0.694     0.720       206
       I-per      0.845     0.914     0.879       257
       B-tim      0.912     0.765     0.832       243
       I-tim      0.771     0.561     0.649        66

   micro avg      0.948   

So we have been able to achieve a weighted average F1 score of **0.947 on the validation data** (0.937 on train data respectively).



### Evaluation with seqeval

Now it's important to note that our previous evaluation using the crf-suite package treated the instances of each NE tag separately. While it's useful to see how the model performs on those individual, single word tags, this paints an artificially positive picture of our model's performance, if we're interested in whole-NE chunk identification. Therefore, going forward, I'll be using the `seqeval` library which collapses these categories. The $F_1$ will drop but it'll be more true to actual performance. 

In [208]:
print("Our overall macro-average F1 score on validation is", round(seq_f1_score(crf_targets_valid, crf_y_valid_pred, average='macro'),3))

Our overall macro-average F1 score on validation is 0.7


In [207]:
print("Our overall accuracy on validation data is", round(seq_acc(crf_targets_valid, crf_y_valid_pred),4),)

Our overall accuracy on validation data is 0.9483


As you'd expect, the results are very different compared to sklearn's estimation, however this is a much more realistic picture of how well our model is performing. The model is pulled down considerably by the low-frequency classes of 'artefacts' and 'natural phenomena'. 


In [209]:
print(seq_cr(crf_targets_valid, crf_y_valid_pred))

             precision    recall  f1-score   support

        geo       0.71      0.74      0.72       422
        gpe       0.76      0.79      0.77       218
        per       0.74      0.69      0.72       206
        tim       0.86      0.72      0.79       243
        art       0.00      0.00      0.00         5
        org       0.53      0.48      0.51       226
        eve       0.50      0.27      0.35        15
        nat       0.00      0.00      0.00         1

avg / total       0.71      0.69      0.70      1336



In [210]:
# And if we check our training classification report:
print(seq_cr(crf_targets_train, crf_y_train_pred))

             precision    recall  f1-score   support

        geo       0.85      0.94      0.89      1174
        gpe       0.91      0.86      0.88       808
        org       0.92      0.81      0.86       784
        tim       0.98      0.91      0.94       680
        per       0.97      0.97      0.97       633
        nat       1.00      0.94      0.97        17
        eve       0.96      0.96      0.96        24
        art       0.95      0.91      0.93        46

avg / total       0.91      0.90      0.91      4166



In [212]:
# further fine tuning around the c1 penalty term
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

seqeval_scorer = make_scorer(seq_f1_score, average='macro')

crf_params = {'c1': [0.01, 0.0975, 0.5, 1.5, 2.0],
              'c2': [0.01, 0.1025, 0.5, 1.5, 2.0]}

grid_seqeval =  GridSearchCV(crf_optim, 
                    crf_params, 
                    seqeval_scorer, 
                    -1, cv=5, 
                    return_train_score=True, 
                    verbose=True)

grid_seqeval.fit(crf_features_train, crf_targets_train)

best_crf_seqeval = grid_seqeval.best_estimator_


Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed: 11.9min finished


In [302]:
# further fine tuning around the c1 penalty term
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=False)

seqeval_scorer = make_scorer(seq_f1_score, average='weighted')

crf_params_2 = {'c1': [0.01, 0.1, 50, 100],
              'c2': [0.1025]}

grid_seqeval_2 =  GridSearchCV(crf_optim, 
                    crf_params_2, 
                    seqeval_scorer, 
                    -1, cv=5, 
                    return_train_score=True, 
                    verbose=True)

grid_seqeval_2.fit(crf_features_train, crf_targets_train)

best_crf_seqeval_2 = grid_seqeval.best_estimator_


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.5min finished


In [303]:
grid_seqeval_2.best_params_

{'c1': 0.1, 'c2': 0.1025}

In [304]:
crf_y_train_seqeval_preds = best_crf_seqeval_2.predict(crf_features_train)

print(seq_cr(crf_targets_train, crf_y_train_seqeval_preds))

             precision    recall  f1-score   support

        geo       0.85      0.94      0.89      1174
        gpe       0.90      0.86      0.88       808
        org       0.91      0.81      0.86       784
        tim       0.98      0.92      0.95       680
        per       0.97      0.97      0.97       633
        nat       1.00      0.94      0.97        17
        eve       0.96      0.96      0.96        24
        art       0.95      0.91      0.93        46

avg / total       0.91      0.90      0.91      4166



In [305]:
crf_y_valid_seqeval_preds = best_crf_seqeval_2.predict(crf_features_valid)

print(seq_cr(crf_targets_valid, crf_y_valid_seqeval_preds))

             precision    recall  f1-score   support

        geo       0.69      0.73      0.71       422
        gpe       0.77      0.79      0.78       218
        per       0.75      0.69      0.72       206
        tim       0.86      0.73      0.79       243
        art       0.00      0.00      0.00         5
        org       0.52      0.48      0.50       226
        eve       0.50      0.27      0.35        15
        nat       0.00      0.00      0.00         1

avg / total       0.71      0.68      0.69      1336



In [306]:
print("Model has a macro-average F1 score of", round(seq_f1_score(crf_targets_valid, crf_y_valid_seqeval_preds, average='macro'),3))

Model has a macro-average F1 score of 0.696


In [307]:
print("Our overall accuracy is", round(seq_acc(crf_targets_valid, crf_y_valid_seqeval_preds),3),)


Our overall accuracy is 0.947


<a id=ttsplit ><a/> 

## 3. Investigating our best model's weights
   
[LINK to table of contents](#contents)

The transition features weights shown in Fig 1 correspond to the probability of a Tag changing from one category (shown in rows) to another (shown in the columns). So the top left number (3.522) indicates that there is a really strong chance that a 'O' (a non-NE word) will be followed by another 'O' tagged word. This is perfectly sensible given that most of the corpus consists of 'O's. 

You'll notice that there are certain banded patterns going from left to right across the weights table, e.g. we have two very positive weights from B-eve (event, beginning of NE chunk) to I-eve (inside event NE chunk), but negative or zero weights for any other entity that is I- (inside a chunk). This is completely understandable since it would run counter to the actual logic behind the BIO tagging system. 

Now the weights that should really interest us, however, are the ones of relatively high magnitude away from the diagonal. Some notable ones include:

* B-geo and I-geo to B-tim are highly positive. So we'd have a significant number of occasions where a geographical entity (e.g. 'Shanghai') might be followed by a date, month or temporal entity (\[in\] November)
* B-gpe to B-org are also quite positive, indicating that geopolitical entities are often succeeded by organisations (e.g. 'France's Navy'

In [472]:
fig = eli5.show_weights(best_crf_seqeval_2, top=30, show=['transition_features'])
print('Fig 1 - Transition features \n')
fig

Fig 1 - Transition features 



From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,3.522,0.873,-1.95,1.863,-1.569,1.336,-3.459,0.698,-1.182,0.561,-1.628,1.247,-4.559,1.604,-3.344,1.632,-4.337
B-art,-0.136,0.0,4.295,0.0,0.0,-0.383,-0.567,-1.01,0.0,0.0,0.0,-0.499,-0.705,-0.771,-0.934,0.085,-0.568
I-art,-0.728,-0.014,4.433,0.0,0.0,-0.118,-0.45,-0.281,0.0,0.0,0.0,-0.258,-0.314,-0.425,-0.868,-0.042,0.0
B-eve,-0.847,0.0,0.0,0.0,4.917,-0.332,-0.314,-0.591,0.0,0.0,0.0,-0.383,-0.968,-0.432,-0.667,-0.311,-0.295
I-eve,-0.079,0.0,0.0,-0.623,3.366,0.0,-0.036,0.0,0.0,0.0,0.0,0.0,-0.098,0.0,-0.568,0.0,0.0
B-geo,0.512,-0.449,-1.303,-0.066,-0.729,-2.124,3.807,-0.335,-1.09,0.0,-0.399,-2.304,-2.547,-2.722,-2.993,0.985,-1.975
I-geo,0.197,0.0,0.0,0.0,0.0,-1.192,3.563,-1.332,-0.172,0.0,0.0,-0.654,-1.499,-1.47,-1.113,0.666,-0.888
B-gpe,0.759,-0.443,-0.791,-0.06,-0.918,-1.92,-2.166,-2.7,3.659,0.0,-0.441,0.996,-3.055,0.204,-2.274,-1.997,-1.786
I-gpe,-0.133,0.0,0.0,0.0,0.0,0.595,-0.001,-0.337,4.283,0.0,0.0,-0.217,-0.37,-0.447,-0.472,0.0,0.0
B-nat,-0.183,0.0,0.0,0.0,0.0,-0.093,0.0,-0.277,0.0,-0.001,3.324,-0.07,-0.255,-0.296,-0.457,0.0,0.0


In [470]:
print('Fig 2 - Top 30 feature-to-target weights \n')
eli5.show_weights(best_crf_seqeval, top=30, show=['targets'])

Fig 2 - Top 30 feature-to-target weights 



Weight?,Feature,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,Unnamed: 9_level_0,Unnamed: 10_level_0,Unnamed: 11_level_0,Unnamed: 12_level_0,Unnamed: 13_level_0,Unnamed: 14_level_0,Unnamed: 15_level_0,Unnamed: 16_level_0
Weight?,Feature,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Weight?,Feature,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Weight?,Feature,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3
Weight?,Feature,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4
Weight?,Feature,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5,Unnamed: 6_level_5,Unnamed: 7_level_5,Unnamed: 8_level_5,Unnamed: 9_level_5,Unnamed: 10_level_5,Unnamed: 11_level_5,Unnamed: 12_level_5,Unnamed: 13_level_5,Unnamed: 14_level_5,Unnamed: 15_level_5,Unnamed: 16_level_5
Weight?,Feature,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6,Unnamed: 6_level_6,Unnamed: 7_level_6,Unnamed: 8_level_6,Unnamed: 9_level_6,Unnamed: 10_level_6,Unnamed: 11_level_6,Unnamed: 12_level_6,Unnamed: 13_level_6,Unnamed: 14_level_6,Unnamed: 15_level_6,Unnamed: 16_level_6
Weight?,Feature,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7,Unnamed: 6_level_7,Unnamed: 7_level_7,Unnamed: 8_level_7,Unnamed: 9_level_7,Unnamed: 10_level_7,Unnamed: 11_level_7,Unnamed: 12_level_7,Unnamed: 13_level_7,Unnamed: 14_level_7,Unnamed: 15_level_7,Unnamed: 16_level_7
Weight?,Feature,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8,Unnamed: 6_level_8,Unnamed: 7_level_8,Unnamed: 8_level_8,Unnamed: 9_level_8,Unnamed: 10_level_8,Unnamed: 11_level_8,Unnamed: 12_level_8,Unnamed: 13_level_8,Unnamed: 14_level_8,Unnamed: 15_level_8,Unnamed: 16_level_8
Weight?,Feature,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9,Unnamed: 5_level_9,Unnamed: 6_level_9,Unnamed: 7_level_9,Unnamed: 8_level_9,Unnamed: 9_level_9,Unnamed: 10_level_9,Unnamed: 11_level_9,Unnamed: 12_level_9,Unnamed: 13_level_9,Unnamed: 14_level_9,Unnamed: 15_level_9,Unnamed: 16_level_9
Weight?,Feature,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10,Unnamed: 5_level_10,Unnamed: 6_level_10,Unnamed: 7_level_10,Unnamed: 8_level_10,Unnamed: 9_level_10,Unnamed: 10_level_10,Unnamed: 11_level_10,Unnamed: 12_level_10,Unnamed: 13_level_10,Unnamed: 14_level_10,Unnamed: 15_level_10,Unnamed: 16_level_10
Weight?,Feature,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11,Unnamed: 5_level_11,Unnamed: 6_level_11,Unnamed: 7_level_11,Unnamed: 8_level_11,Unnamed: 9_level_11,Unnamed: 10_level_11,Unnamed: 11_level_11,Unnamed: 12_level_11,Unnamed: 13_level_11,Unnamed: 14_level_11,Unnamed: 15_level_11,Unnamed: 16_level_11
Weight?,Feature,Unnamed: 2_level_12,Unnamed: 3_level_12,Unnamed: 4_level_12,Unnamed: 5_level_12,Unnamed: 6_level_12,Unnamed: 7_level_12,Unnamed: 8_level_12,Unnamed: 9_level_12,Unnamed: 10_level_12,Unnamed: 11_level_12,Unnamed: 12_level_12,Unnamed: 13_level_12,Unnamed: 14_level_12,Unnamed: 15_level_12,Unnamed: 16_level_12
Weight?,Feature,Unnamed: 2_level_13,Unnamed: 3_level_13,Unnamed: 4_level_13,Unnamed: 5_level_13,Unnamed: 6_level_13,Unnamed: 7_level_13,Unnamed: 8_level_13,Unnamed: 9_level_13,Unnamed: 10_level_13,Unnamed: 11_level_13,Unnamed: 12_level_13,Unnamed: 13_level_13,Unnamed: 14_level_13,Unnamed: 15_level_13,Unnamed: 16_level_13
Weight?,Feature,Unnamed: 2_level_14,Unnamed: 3_level_14,Unnamed: 4_level_14,Unnamed: 5_level_14,Unnamed: 6_level_14,Unnamed: 7_level_14,Unnamed: 8_level_14,Unnamed: 9_level_14,Unnamed: 10_level_14,Unnamed: 11_level_14,Unnamed: 12_level_14,Unnamed: 13_level_14,Unnamed: 14_level_14,Unnamed: 15_level_14,Unnamed: 16_level_14
Weight?,Feature,Unnamed: 2_level_15,Unnamed: 3_level_15,Unnamed: 4_level_15,Unnamed: 5_level_15,Unnamed: 6_level_15,Unnamed: 7_level_15,Unnamed: 8_level_15,Unnamed: 9_level_15,Unnamed: 10_level_15,Unnamed: 11_level_15,Unnamed: 12_level_15,Unnamed: 13_level_15,Unnamed: 14_level_15,Unnamed: 15_level_15,Unnamed: 16_level_15
Weight?,Feature,Unnamed: 2_level_16,Unnamed: 3_level_16,Unnamed: 4_level_16,Unnamed: 5_level_16,Unnamed: 6_level_16,Unnamed: 7_level_16,Unnamed: 8_level_16,Unnamed: 9_level_16,Unnamed: 10_level_16,Unnamed: 11_level_16,Unnamed: 12_level_16,Unnamed: 13_level_16,Unnamed: 14_level_16,Unnamed: 15_level_16,Unnamed: 16_level_16
+3.341,word.lower():israeli-palestinian,,,,,,,,,,,,,,,
+3.150,word.-1_POS:JJS,,,,,,,,,,,,,,,
+2.885,word.lower():a,,,,,,,,,,,,,,,
+2.643,word.+1_POS:VB,,,,,,,,,,,,,,,
+2.540,word.prefix_3:Pri,,,,,,,,,,,,,,,
+2.508,word.lower():chairman,,,,,,,,,,,,,,,
+2.366,word.+1_POS:JJ,,,,,,,,,,,,,,,
+2.364,word.lower():minister,,,,,,,,,,,,,,,
+2.286,word.prefix_3:06-,,,,,,,,,,,,,,,
+2.286,word.prefix_2:06,,,,,,,,,,,,,,,

Weight?,Feature
+3.341,word.lower():israeli-palestinian
+3.150,word.-1_POS:JJS
+2.885,word.lower():a
+2.643,word.+1_POS:VB
+2.540,word.prefix_3:Pri
+2.508,word.lower():chairman
+2.366,word.+1_POS:JJ
+2.364,word.lower():minister
+2.286,word.prefix_3:06-
+2.286,word.prefix_2:06

Weight?,Feature
+2.105,word.prefix_3:Top
+1.831,word.+1_POS:VB
+1.812,word.prefix_3:Nob
+1.740,word.prefix_3:Soy
+1.740,word.lower():soyuzcapsule
+1.721,word.prefix_3:eng
+1.683,word.suffix_3:ule
+1.628,word.lower():alhurra
+1.628,word.prefix_3:alH
+1.627,word.prefix_2:Do

Weight?,Feature
+1.018,word.prefix_2:No
+0.952,word.suffix_3:ror
+0.938,word.prefix_3:Non
+0.938,word.lower():non-proliferation
+0.916,word.+1_POS:NN
+0.910,word.lower():3
+0.910,word.suffix_3:3
+0.910,word.suffix_2:3
+0.910,word.prefix_3:3
+0.910,word.prefix_2:3

Weight?,Feature
+1.492,word.prefix_2:Ko
+1.362,word.suffix_3:pic
+1.362,word.lower():olympic
+1.360,word.prefix_3:Oly
+1.318,word.prefix_3:Ash
+1.300,word.lower():ashura
+1.244,word.prefix_3:Gam
+1.243,word.lower():games
+1.168,word.suffix_3:II
+1.168,word.prefix_3:II

Weight?,Feature
+1.168,word.prefix_3:War
+1.031,word.prefix_2:Wa
+0.936,word.-1_POS:NNP
+0.920,word.suffix_3:nal
+0.915,word.prefix_3:Oly
+0.913,word.lower():international
+0.895,word.prefix_3:Int
+0.893,word.prefix_2:Ol
+0.887,word.suffix_2:I
+0.887,word.prefix_3:I

Weight?,Feature
+2.163,word.suffix_2:ta
+2.046,word.lower():paris
+2.020,word.istitle()
+1.949,word.suffix_3:kan
+1.949,word.lower():lankan
+1.940,word.suffix_3:ris
+1.938,word.suffix_3:and
+1.922,word.suffix_2:ai
+1.907,word.lower():second-in-command
+1.869,word.suffix_3:the

Weight?,Feature
+2.271,word.suffix_3:tan
+2.013,word.lower():homeland
+1.558,word.prefix_3:Mus
+1.466,word.suffix_3:ica
+1.460,word.suffix_2:ca
+1.430,word.prefix_3:Riv
+1.289,word.-1_POS:JJ
+1.264,word.prefix_3:hom
+1.247,word.prefix_3:Ara
+1.236,word.lower():river

Weight?,Feature
+3.837,word.suffix_3:ese
+2.623,word.prefix_2:Sw
+2.579,word.suffix_3:ans
+2.348,word.suffix_3:ish
+2.304,word.lower():afghan
+2.201,word.prefix_3:Chi
+2.161,word.suffix_2:li
+2.126,word.prefix_3:Kor
+1.972,word.suffix_2:an
+1.971,word.suffix_3:ian

Weight?,Feature
+2.453,word.+1_POS:POS
+2.098,word.suffix_3:can
+1.388,word.-2_POS:CC
+1.230,word.lower():republic
+1.223,word.prefix_3:Rep
+1.150,word.lower():american
+1.082,word.prefix_3:Sta
+1.004,word.lower():arab
+0.999,word.-2_POS:WDT
+0.994,word.suffix_3:rab

Weight?,Feature
+1.605,word.isupper()
+1.511,word.prefix_2:H5
+1.511,word.prefix_3:H5N
+1.438,word.lower():katrina
+1.400,word.prefix_3:Hur
+1.393,word.suffix_3:ane
+1.377,word.prefix_2:Hu
+1.358,word.lower():hurricane
+1.239,word.-2_POS:JJS
+1.226,word.prefix_3:Kat

Weight?,Feature
+1.205,word.lower():katrina
+1.145,word.prefix_3:Kat
+1.116,word.prefix_2:Ka
+1.089,word.suffix_3:ina
+1.058,word.lower():jing
+1.058,word.prefix_3:Jin
+1.042,word.prefix_2:Ji
+0.989,word.lower():respiratory
+0.987,word.prefix_3:Res
+0.963,word.lower():syndrome

Weight?,Feature
+2.683,word.lower():hamas
+2.680,word.isupper()
+2.562,word.lower():singapore
+2.555,word.lower():kindhearts
+2.413,word.lower():al-qaida
+2.406,word.suffix_3:uay
+2.349,word.lower():guardian
+2.221,word.prefix_3:Ham
+2.201,word.lower():latgalians
+2.178,word.lower():government-funded

Weight?,Feature
+3.366,word.lower():committee-chairman
+2.281,word.lower():ministry
+1.927,word.lower():pakistan
+1.817,word.suffix_3:try
+1.626,word.suffix_3:ons
+1.616,word.lower():union
+1.610,word.lower():nations
+1.608,word.lower():charlotte
+1.606,word.suffix_3:tte
+1.598,word.prefix_3:Pak

Weight?,Feature
+2.582,word.lower():prime
+2.432,word.lower():sperling
+2.344,word.prefix_3:Jac
+2.225,word.lower():secretary
+2.145,word.prefix_3:pri
+2.088,word.prefix_2:Ob
+2.000,word.prefix_3:al-
+1.995,word.prefix_2:pr
+1.956,word.lower():senator
+1.918,word.lower():bush

Weight?,Feature
+2.154,word.-1_POS:NNP
+1.731,word.prefix_2:Mu
+1.438,word.lower():condoleezza
+1.261,word.suffix_2:ei
+1.155,word.-2_POS:POS
+1.087,word.prefix_2:Ha
+1.073,word.prefix_2:Al
+1.068,word.suffix_3:ron
+1.067,word.suffix_2:ie
+1.063,word.suffix_2:ik

Weight?,Feature
+3.763,word.suffix_3:day
+3.413,word.lower():day-long
+3.215,word.suffix_2:ay
+3.160,word.suffix_3:ber
+2.810,word.prefix_2:19
+2.592,word.lower():later
+2.582,word.-1_POS:RBR
+2.516,word.lower():two-year
+2.363,word.lower():recent
+2.325,word.lower():by-election

Weight?,Feature
+3.003,word.suffix_3:day
+2.960,word.isdigit()
+2.712,word.suffix_2:ay
+1.962,word.lower():infected
+1.926,word.lower():quarter
+1.839,word.+1_POS:TO
+1.747,word.prefix_3:inf
+1.732,word.prefix_2:de
+1.679,word.prefix_2:Ju
+1.608,word.lower():decades


## Quick inspection by example

To sanity check our model, we can use the example_output (included in `functions.py`) to check one sentence at a time. 
In the first example we can see that the model successfully distinguished between 'Afghanistan' the geographical entity and 'afghan' (forces), the *geo-political* entity. Given the news article focus of the dataset, the NATO example probably has quite a positive weight attached to it. 

In [412]:
fn.example_output(23, crf_features_valid, best_crf_seqeval_2, crf_targets_valid)

Unnamed: 0,True,Pred,Word
0,O,O,police
1,O,O,in
2,B-geo,B-geo,afghanistan
3,O,O,say
4,B-gpe,B-gpe,afghan
5,O,O,and
6,B-org,B-org,nato
7,O,O,forces
8,O,O,have
9,O,O,killed


In the next example (which refers to Polish history), our best model has successfully identified a reference to a time period as well as the limits of the named entity. 

In [415]:
fn.example_output(42, crf_features_valid, best_crf_seqeval_2, crf_targets_valid)

Unnamed: 0,True,Pred,Word
0,O,O,its
1,O,O,golden
2,O,O,age
3,O,O,occurred
4,O,O,in
5,O,O,the
6,B-tim,B-tim,16th
7,I-tim,I-tim,century
8,O,O,.


Below we see an example where the model overestimated the boundaries of one NE at the expense of another, labelling 'U.S. Vice' as an organisation, when a. the 'U.S.' should've been a geographical entity and 'Vice President Dick Cheney' should have been entirely labelled a person. 

In [421]:
fn.example_output(527, crf_features_valid, best_crf_seqeval_2, crf_targets_valid)

Unnamed: 0,True,Pred,Word
0,O,O,the
1,O,O,event
2,O,O,will
3,O,O,be
4,O,O,attended
5,O,O,by
6,O,O,many
7,O,O,foreign
8,O,O,dignitaries
9,O,O,","


Now below is an example of a *somewhat* understandable misclassification of "Tennis Masters Cup" as an organisation rather than an event. 

In [423]:
fn.example_output(255, crf_features_valid, best_crf_seqeval_2, crf_targets_valid)

Unnamed: 0,True,Pred,Word
0,O,O,the
1,O,O,injury
2,O,O,initially
3,O,O,forced
4,O,O,his
5,O,O,withdrawal
6,O,O,from
7,O,O,the
8,B-eve,B-org,tennis
9,I-eve,I-org,masters


This final example is a partial misclassification of the named entity of the "Interim Prime Minister Ali Mohamed Gedi". Notice that the model managed to determine almost all of the NE correctly, except for "interim", mistaking that for an artefact. This would have naturally been counted as a misclassification, but, all things considered, if you were reading a text highlighted automatically and it was only "interim" absent from the highlighted `B-per` text, you'd probably gloss over that mistake. 

In [430]:
fn.example_output(301, crf_features_valid, best_crf_seqeval_2, crf_targets_valid)

Unnamed: 0,True,Pred,Word
0,O,O,last
1,B-tim,B-tim,thursday
2,O,O,","
3,O,O,the
4,O,O,convoy
5,O,O,of
6,B-per,B-art,interim
7,I-per,B-per,prime
8,I-per,I-per,minister
9,I-per,I-per,ali


<a id=conc ><a/> 

## 7. Conclusions and saving our model
    
[LINK to table of contents](#contents)

In [476]:
saved_model = pickle.dumps(best_crf_seqeval_2)