<a id=contents></a>

# Model building
## What are we trying to predict?


[1. ETL and Train Test Split](#ETL)

[2. Modelling with Random Forest](#RF)

[3. Modelling with Conditional Random Field](#CRF)

[4. Choice of model architectures](#selection)

[4.1 Model 1](#one)

[4.2 Model 2](#two)

[4.3 Model 3](#three)

[4.4 Model 4](#four)

[4.5 Model 5](#five)

[7. Conclusions and model comparison table](#conc)

In [35]:

import pandas as pd
import numpy as np

import pickle

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("darkgrid")

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from matplotlib import cm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.metrics import f1_score

# `metrics` refers to sklearn_crfsuite.metrics throughout this notebook
from sklearn_crfsuite import CRF, scorers, metrics
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.metrics import classification_report, make_scorer

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import re
import string
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
stop_words += list(string.punctuation)

import warnings
warnings.filterwarnings('ignore')

from scipy import stats as ss

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a id=ETL></a>

## 1. ETL of data and Train-Test Split
    
[LINK to table of contents](#contents)

In [11]:
feature_df = pd.read_pickle('feature_based_data/final_feature_data.pkl')
feature_y = feature_df.Tag
feature_X = feature_df.drop(columns=['Tag'])

x_train, x_test, y_train, y_test = train_test_split(feature_X, feature_y, test_size=.2, random_state=12345)

In [58]:
x_train.head(2)

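| | is_title | length | is_upper | is_digit | is_prev_NE | prev_-1_POS_NNP | prev_-1_POS_NN | is_prev_pos_same_as_current | POS_NNP | POS_NN | POS_IN | POS_DT | POS_. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42758 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 56828 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |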


<a id = 'RF'></a>

## 2. Modelling with Random Forest

[LINK to table of contents](#contents)

In [46]:
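%%time
# %%time (cell magic) times the whole cell; %time alone on a line times only an empty statement, hence microsecond readings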
RF = RandomForestClassifier(n_estimators=50, max_depth=20, min_samples_split=.01, n_jobs=-1)

preds = cross_val_predict(estimator=RF, X=x_train.values, y=y_train.values, cv=5)


CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 12.2 µs


In [50]:
preds.shape == y_train.shape

True

In [53]:
report = classification_report(y_pred=preds, y_true=y_train)
print(report)

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        37
       B-eve       0.00      0.00      0.00        34
       B-geo       0.44      0.90      0.59      1668
       B-gpe       0.59      0.57      0.58       973
       B-nat       0.00      0.00      0.00        17
       B-org       0.59      0.21      0.31       987
       B-per       0.52      0.34      0.41       907
       B-tim       0.83      0.20      0.32       926
       I-art       0.00      0.00      0.00        26
       I-eve       0.00      0.00      0.00        33
       I-geo       0.49      0.12      0.20       331
       I-gpe       0.00      0.00      0.00        27
       I-nat       0.00      0.00      0.00         9
       I-org       0.42      0.21      0.28       744
       I-per       0.50      0.92      0.65       961
       I-tim       0.98      0.36      0.52       261
           O       0.98      0.99      0.99     44987

    accuracy              

In [83]:
y_train_bin = np.zeros(y_train.shape)

In [84]:
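# Binary target: 1 for any named-entity tag, 0 for 'O'.
# Loop over positions and tags; looping over y_train_bin itself only yields its values (all 0.0).
for i, tag in enumerate(y_train.values):
    if tag != 'O':
        y_train_bin[i] = 1
y_train_bin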

array([0., 0., 0., ..., 0., 0., 0.])

In [15]:
x_train.iloc[:5].to_dict()

{'is_title': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'length': {42758: 7, 56828: 4, 18522: 1, 15552: 2, 20990: 6},
 'is_upper': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_digit': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_prev_NE': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 1},
 'prev_-1_POS_NNP': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 1},
 'prev_-1_POS_NN': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_prev_pos_same_as_current': {42758: 0,
  56828: 0,
  18522: 0,
  15552: 0,
  20990: 0},
 'POS_NNP': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_NN': {42758: 1, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_IN': {42758: 0, 56828: 0, 18522: 0, 15552: 1, 20990: 0},
 'POS_DT': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_.': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 0}}

<a id = 'CRF'></a>

## 3. Modelling with a Conditional Random Field

[LINK to table of contents](#contents)

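The sklearn-crfsuite `CRF` requires the input data to be a list of lists of dicts: one list per sentence, holding one feature dict per token. The targets are a matching list of lists of tag strings. I stored these as pickle files in notebook 3.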
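
As a rough illustration of that input shape, the sketch below builds features for two toy sentences. The function `token_features`, the toy sentences and the `BOS` flag are hypothetical; the feature names only mirror those that show up in the weight tables further down, and the actual feature extraction lives in notebook 3.

```python
def token_features(sentence, i):
    """Build a feature dict for the i-th (word, POS) pair of a sentence."""
    word, pos = sentence[i]
    features = {
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "word.isupper()": word.isupper(),
        "word.isdigit()": word.isdigit(),
        "word.suffix_3": word[-3:],
        "POS": pos,
    }
    if i > 0:
        # Context feature: POS tag of the previous token
        features["word.-1_POS"] = sentence[i - 1][1]
    else:
        # Hypothetical beginning-of-sentence marker
        features["BOS"] = True
    return features


# Toy data: each sentence is a list of (word, POS) pairs
sentences = [
    [("Paris", "NNP"), ("is", "VBZ"), ("sunny", "JJ")],
    [("Hamas", "NNP"), ("responded", "VBD")],
]

# X: list (sentences) of lists (tokens) of dicts (features)
X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
# y: list (sentences) of lists (tokens) of tag strings
y = [["B-geo", "O", "O"], ["B-org", "O"]]
```

Each inner list of `X` must have the same length as the matching inner list of `y`, since the CRF assigns exactly one tag per token.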

In [23]:
with open('clean_data/crf_train_data.pkl', 'rb') as f:
    crf_features_train = pickle.load(f)
    
with open('clean_data/crf_test_data.pkl', 'rb') as f:
    crf_features_test = pickle.load(f)
    
with open('clean_data/crf_train_targets.pkl', 'rb') as f:
    crf_targets_train = pickle.load(f)
    
with open('clean_data/crf_test_targets.pkl', 'rb') as f:
    crf_targets_test = pickle.load(f)
    

In [24]:
crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

In [25]:
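%%time
# %%time (cell magic) times the whole cell; %time alone on a line times only an empty statement, hence microsecond readings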
crf.fit(crf_features_train, crf_targets_train)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs


CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [26]:
labels = list(crf.classes_)
labels

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']

In [39]:
crf_y_pred_train = crf.predict(crf_features_train)
metrics.flat_f1_score(crf_targets_train, crf_y_pred_train,
                      average='weighted', labels=labels)

0.9852280959671684

In [33]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    crf_targets_train, crf_y_pred_train, labels=sorted_labels, digits=3))

              precision    recall  f1-score   support

           O      0.996     0.998     0.997     45078
       B-art      0.932     0.804     0.863        51
       I-art      0.941     1.000     0.970        32
       B-eve      0.974     0.949     0.961        39
       I-eve      1.000     0.939     0.969        33
       B-geo      0.849     0.936     0.890      1596
       I-geo      0.891     0.920     0.906       339
       B-gpe      0.915     0.853     0.883      1026
       I-gpe      0.895     0.531     0.667        32
       B-nat      1.000     0.944     0.971        18
       I-nat      1.000     1.000     1.000         9
       B-org      0.921     0.816     0.865      1010
       I-org      0.949     0.951     0.950       721
       B-per      0.964     0.967     0.965       839
       I-per      0.973     0.988     0.981       995
       B-tim      0.980     0.897     0.937       923
       I-tim      0.948     0.890     0.918       246

    accuracy              

In [37]:
crf_y_pred_test = crf.predict(crf_features_test)
metrics.flat_f1_score(crf_targets_test, crf_y_pred_test,
                      average='weighted', labels=labels)

0.9471352834210373

In [38]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    crf_targets_test, crf_y_pred_test, labels=sorted_labels, digits=3))

              precision    recall  f1-score   support

           O      0.984     0.987     0.986     11139
       B-art      0.000     0.000     0.000         2
       I-art      0.000     0.000     0.000         2
       B-eve      0.400     0.333     0.364         6
       I-eve      0.400     0.500     0.444         4
       B-geo      0.769     0.738     0.753       474
       I-geo      0.600     0.600     0.600        75
       B-gpe      0.788     0.819     0.803       204
       I-gpe      0.000     0.000     0.000         2
       B-nat      0.000     0.000     0.000         2
       I-nat      0.000     0.000     0.000         0
       B-org      0.628     0.626     0.627       227
       I-org      0.728     0.639     0.681       205
       B-per      0.805     0.709     0.754       268
       I-per      0.689     0.908     0.783       239
       B-tim      0.887     0.793     0.837       237
       I-tim      0.754     0.591     0.662        88

   micro avg      0.948   

In [41]:
import eli5

In [42]:
eli5.show_weights(crf, top=30)
# repeat this for the hyperparam optimised version

| From \ To | O | B-art | I-art | B-eve | I-eve | B-geo | I-geo | B-gpe | I-gpe | B-nat | I-nat | B-org | I-org | B-per | I-per | B-tim | I-tim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O | 3.34 | 0.796 | -2.118 | 2.335 | -1.988 | 1.232 | -4.234 | 0.638 | -1.858 | 0.629 | -1.87 | 1.313 | -4.505 | 1.884 | -3.625 | 1.916 | -4.072 |
| B-art | -0.321 | 0.0 | 4.978 | 0.0 | 0.0 | -0.658 | -0.708 | -1.079 | -0.048 | 0.0 | 0.0 | -0.588 | -0.943 | -0.94 | -1.402 | 0.181 | -0.277 |
| I-art | -1.181 | -0.001 | 4.767 | 0.0 | 0.0 | 0.359 | -0.537 | -0.557 | 0.0 | 0.0 | 0.0 | -0.309 | -0.73 | -0.857 | -1.104 | 0.0 | -0.065 |
| B-eve | -1.053 | 0.0 | 0.0 | 0.0 | 4.506 | -0.668 | -0.615 | -0.941 | -0.303 | 0.0 | 0.0 | -0.6 | -1.242 | -0.698 | -0.861 | 0.213 | -0.384 |
| I-eve | -0.084 | 0.0 | 0.0 | -0.609 | 3.942 | -0.094 | -0.152 | -0.313 | 0.0 | 0.0 | 0.0 | -0.252 | -0.267 | -0.221 | -1.065 | -0.136 | 0.0 |
| B-geo | 0.505 | 0.404 | -0.887 | -0.59 | -1.238 | -2.332 | 3.828 | -0.614 | -2.201 | 0.0 | -0.442 | -1.706 | -2.941 | -3.045 | -2.945 | 1.298 | -1.725 |
| I-geo | 0.142 | -0.003 | -0.187 | -0.155 | -0.012 | -2.038 | 3.402 | -2.009 | -0.362 | 0.0 | -0.061 | -0.85 | -1.96 | -1.274 | -1.487 | 0.321 | -0.815 |
| B-gpe | 0.385 | -0.824 | -1.182 | -0.084 | -1.448 | -2.31 | -2.814 | -3.249 | 2.958 | -0.126 | -0.494 | 0.715 | -3.356 | 0.135 | -2.429 | -2.775 | -1.81 |
| I-gpe | -0.024 | 0.0 | 0.0 | 0.0 | 0.0 | 0.492 | -0.029 | -0.175 | 4.281 | 0.0 | 0.0 | -0.155 | -0.379 | -0.307 | -0.49 | 0.0 | 0.0 |
| B-nat | -0.17 | 0.0 | 0.0 | 0.0 | 0.0 | -0.335 | 0.0 | -0.366 | 0.0 | 0.0 | 3.752 | -0.145 | -0.354 | -0.162 | -0.321 | 0.0 | 0.0 |

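The tables below list the highest-weighted state features shown for each label, one table per label, in the same order as the columns of the transition table (O, B-art, I-art, B-eve, I-eve, B-geo, I-geo, B-gpe, I-gpe, B-nat, I-nat, B-org, I-org, B-per, I-per, B-tim, I-tim).

Weight?,Feature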
+3.660,word.lower():israeli-palestinian
+2.970,word.lower():war
+2.862,word.prefix_3:Pri
+2.783,word.lower():a
+2.668,word.lower():chairman
+2.565,word.+1_POS:VB
+2.557,word.-1_POS:JJS
+2.535,word.lower():unification
+2.521,word.lower():after
+2.516,word.lower():and

Weight?,Feature
+2.192,word.+1_POS:VB
+2.185,word.prefix_3:Nob
+2.120,word.prefix_3:Top
+1.925,word.prefix_2:Do
+1.876,word.lower():alhurra
+1.876,word.prefix_3:alH
+1.855,word.lower():soyuzcapsule
+1.855,word.prefix_3:Soy
+1.825,word.prefix_3:eng
+1.705,word.suffix_3:ule

Weight?,Feature
+1.541,word.lower():constitution
+1.315,word.lower():declaration
+1.308,word.prefix_3:Dec
+1.113,word.prefix_2:Ga
+1.003,word.prefix_2:3
+1.003,word.suffix_3:3
+1.003,word.prefix_3:3
+1.003,word.lower():3
+1.003,word.suffix_2:3
+0.994,word.+1_POS:MD

Weight?,Feature
+1.677,word.lower():olympic
+1.677,word.suffix_3:pic
+1.548,word.prefix_2:Ko
+1.538,word.prefix_3:Gam
+1.533,word.lower():games
+1.521,word.prefix_3:Oly
+1.452,word.lower():christmas
+1.428,word.lower():ashura
+1.427,word.prefix_3:Ash
+1.281,word.suffix_3:ura

Weight?,Feature
+1.384,word.suffix_2:ic
+1.312,word.prefix_3:War
+1.126,word.+1_POS:TO
+1.066,word.suffix_2:om
+1.047,word.-1_POS:NNP
+1.029,word.prefix_3:Oly
+1.023,word.suffix_2:rs
+1.020,word.prefix_2:Wa
+1.013,word.lower():international
+0.959,word.lower():open

Weight?,Feature
+2.588,word.lower():lankan
+2.559,word.istitle()
+2.369,word.lower():second-in-command
+2.123,word.lower():paris
+2.122,word.suffix_2:ta
+2.117,word.lower():gaza
+2.063,word.lower():khost
+1.965,word.suffix_3:ris
+1.957,word.lower():thailand
+1.909,word.suffix_3:est

Weight?,Feature
+2.781,word.lower():shaikan
+1.969,word.suffix_3:nds
+1.965,word.lower():homeland
+1.847,word.suffix_3:tan
+1.603,word.suffix_3:ica
+1.596,word.suffix_2:ca
+1.562,word.-1_POS:DT
+1.480,word.prefix_2:Ku
+1.461,word.lower():kurdish
+1.454,word.lower():netherlands

Weight?,Feature
+4.195,word.suffix_3:ese
+2.918,word.lower():turkish
+2.869,word.prefix_3:Kor
+2.709,word.lower():afghan
+2.708,word.suffix_3:ans
+2.329,word.-1_POS:WRB
+2.321,word.prefix_2:Sw
+2.286,word.suffix_3:ian
+2.093,word.lower():thailand
+2.024,word.suffix_2:bs

Weight?,Feature
+2.595,word.suffix_3:can
+1.744,word.+1_POS:POS
+1.387,word.prefix_3:Sta
+1.356,word.prefix_2:St
+1.317,word.-2_POS:CC
+1.213,word.lower():african
+1.201,word.prefix_3:Cit
+1.187,word.+1_POS:VBP
+1.131,word.+1_POS:IN
+1.065,word.prefix_2:Ci

Weight?,Feature
+1.658,word.isupper()
+1.499,word.lower():katrina
+1.484,word.prefix_3:H5N
+1.484,word.prefix_2:H5
+1.453,word.prefix_3:Kat
+1.394,word.lower():hurricane
+1.393,word.prefix_3:Hur
+1.382,word.-2_POS:VBN
+1.329,word.suffix_3:ane
+1.281,word.prefix_2:Hu

Weight?,Feature
+1.324,word.lower():katrina
+1.180,word.prefix_2:Ka
+1.144,word.prefix_3:Kat
+1.065,word.prefix_3:Syn
+1.065,word.lower():syndrome
+1.014,word.lower():respiratory
+1.014,word.prefix_3:Res
+0.946,word.lower():jing
+0.934,word.prefix_2:Ji
+0.922,word.prefix_3:Jin

Weight?,Feature
+2.917,word.lower():kindhearts
+2.740,word.lower():hamas
+2.660,word.isupper()
+2.583,word.lower():singapore
+2.503,word.lower():secretary-general
+2.388,word.prefix_3:Ham
+2.301,word.lower():guardian
+2.283,word.lower():latgalians
+2.279,word.lower():government-funded
+2.268,word.lower():parliament

Weight?,Feature
+3.005,word.lower():committee-chairman
+2.256,word.lower():ministry
+1.874,word.-1_POS:CC
+1.860,word.suffix_3:try
+1.770,word.prefix_3:Com
+1.563,word.prefix_3:Cou
+1.560,word.lower():anatolia
+1.528,word.suffix_3:tte
+1.511,word.lower():resistance
+1.499,word.prefix_3:Mot

Weight?,Feature
+3.135,word.lower():prime
+2.422,word.lower():sperling
+2.209,word.lower():senator
+2.105,word.prefix_2:pr
+2.065,word.prefix_3:Rog
+2.060,word.lower():somalians
+2.035,word.prefix_2:Ob
+2.035,word.lower():lion
+2.035,word.prefix_3:Lio
+2.032,word.lower():secretary

Weight?,Feature
+1.779,word.-1_POS:NNP
+1.563,word.prefix_2:Mu
+1.535,word.lower():condoleezza
+1.524,word.suffix_3:son
+1.491,word.prefix_3:al-
+1.447,word.prefix_2:Al
+1.176,word.suffix_3:med
+1.092,word.prefix_3:McC
+1.092,word.prefix_2:Mc
+1.081,word.suffix_3:les

Weight?,Feature
+4.391,word.suffix_3:day
+3.288,word.suffix_3:ber
+3.128,word.lower():day-long
+3.049,word.lower():later
+2.782,word.lower():two-year
+2.701,word.prefix_2:19
+2.502,word.suffix_2:ay
+2.421,word.lower():four-year
+2.420,word.lower():january
+2.389,word.lower():recent

Weight?,Feature
+4.187,word.suffix_3:day
+2.897,word.isdigit()
+2.157,word.suffix_2:ry
+1.882,word.suffix_2:ay
+1.873,word.lower():quarter
+1.839,word.lower():decades
+1.795,word.prefix_2:de
+1.780,word.prefix_3:mor
+1.757,word.prefix_2:Ju
+1.746,word.lower():infected


### Classification Report Interpretation for test data:

The main reported figure is the weighted-average F1 score. For each class, F1 is the harmonic mean of precision and recall:

$$ F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} $$

The weighted average is then the mean of the per-class F1 scores, with each class weighted by its support.

The support column gives the number of instances of each class. As we've seen before, this distribution is dominated by 'O' (non-NE), and some classes are very rare: B-nat ('natural phenomenon'), for example, had only 18 instances in the training data.
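
To make the difference between the averages concrete, here is a small sketch that recomputes F1 by hand for three rows taken from the baseline CRF test report above (O, B-geo and B-art). The helper `f1` and the restriction to three classes are illustrative only, so the averages it produces will not match the full 17-class report.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from the baseline test report
rows = {
    "O": (0.984, 0.987, 11139),
    "B-geo": (0.769, 0.738, 474),
    "B-art": (0.000, 0.000, 2),
}

per_class = {label: f1(p, r) for label, (p, r, _) in rows.items()}
total_support = sum(support for _, _, support in rows.values())

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(per_class[label] * rows[label][2] for label in rows) / total_support

# Macro average: every class counts equally, however rare it is
macro_f1 = sum(per_class.values()) / len(per_class)

print(f"weighted F1: {weighted_f1:.3f}")
print(f"macro F1:    {macro_f1:.3f}")
```

Because 'O' carries almost all of the support, the weighted figure sits close to the 'O' score, while the macro figure is dragged down by a single rare class with an F1 of zero.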



In [47]:
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

crf_params = {'c1': [0.05, 0.1, 0.5, 1.0, 1.5,  2.0],
              'c2': [0.05, 0.1,  0.5, 1.0, 1.5,  2.0]}

grid = GridSearchCV(crf_optim,
                    crf_params,
                    scoring=f1_scorer,
                    n_jobs=-1, cv=5,
                    return_train_score=True,
                    verbose=True)

grid.fit(crf_features_train, crf_targets_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 58.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=CRF(algorithm='lbfgs', all_possible_states=None,
                           all_possible_transitions=True, averaging=None,
                           c=None, c1=None, c2=None,
                           calibration_candidates=None, calibration_eta=None,
                           calibration_max_trials=None, calibration_rate=None,
                           calibration_samples=None, delta=None, epsilon=None,
                           error_sensitive=None, gamma=None,
                           keep_tempfi...
             iid='deprecated', n_jobs=-1,
             param_grid={'c1': [0.05, 0.1, 0.5, 1.0, 1.5, 2.0],
                         'c2': [0.05, 0.1, 0.5, 1.0, 1.5, 2.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=make_scorer(flat_f1_score, average=weighted, labels=['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I

In [48]:
best_crf = grid.best_estimator_

In [49]:
best_crf.get_params()

{'algorithm': 'lbfgs',
 'all_possible_states': None,
 'all_possible_transitions': True,
 'averaging': None,
 'c': None,
 'c1': 0.05,
 'c2': 0.5,
 'calibration_candidates': None,
 'calibration_eta': None,
 'calibration_max_trials': None,
 'calibration_rate': None,
 'calibration_samples': None,
 'delta': None,
 'epsilon': None,
 'error_sensitive': None,
 'gamma': None,
 'keep_tempfiles': None,
 'linesearch': None,
 'max_iterations': 100,
 'max_linesearch': None,
 'min_freq': None,
 'model_filename': None,
 'num_memories': None,
 'pa_type': None,
 'period': None,
 'trainer_cls': None,
 'variance': None,
 'verbose': False}

In [51]:
# further fine tuning around the c1 penalty term
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

crf_params = {'c1': [0.01, 0.02, 0.03, 0.04, 0.05,],
              'c2': [ 0.45, 0.5, 0.55]}

grid = GridSearchCV(crf_optim,
                    crf_params,
                    scoring=f1_scorer,
                    n_jobs=-1, cv=5,
                    return_train_score=True,
                    verbose=True)

grid.fit(crf_features_train, crf_targets_train)

best_crf = grid.best_estimator_

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  7.8min finished


In [53]:
best_crf.get_params()

{'algorithm': 'lbfgs',
 'all_possible_states': None,
 'all_possible_transitions': True,
 'averaging': None,
 'c': None,
 'c1': 0.04,
 'c2': 0.45,
 'calibration_candidates': None,
 'calibration_eta': None,
 'calibration_max_trials': None,
 'calibration_rate': None,
 'calibration_samples': None,
 'delta': None,
 'epsilon': None,
 'error_sensitive': None,
 'gamma': None,
 'keep_tempfiles': None,
 'linesearch': None,
 'max_iterations': 100,
 'max_linesearch': None,
 'min_freq': None,
 'model_filename': None,
 'num_memories': None,
 'pa_type': None,
 'period': None,
 'trainer_cls': None,
 'variance': None,
 'verbose': False}

In [54]:
# final round of fine tuning around the best c1 and c2 penalty terms found so far
crf_optim = CRF(algorithm='lbfgs',
          max_iterations=100,
          all_possible_transitions=True)

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

crf_params = {'c1': [0.0375, 0.04, 0.0425,],
              'c2': [ 0.425, 0.45, 0.475]}

grid = GridSearchCV(crf_optim,
                    crf_params,
                    scoring=f1_scorer,
                    n_jobs=-1, cv=5,
                    return_train_score=True,
                    verbose=True)

grid.fit(crf_features_train, crf_targets_train)

best_crf = grid.best_estimator_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  4.5min finished


In [60]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

crf_y_test_pred = best_crf.predict(crf_features_test)

print(metrics.flat_classification_report(
    crf_targets_test, crf_y_test_pred, labels=sorted_labels, digits=3))

              precision    recall  f1-score   support

           O      0.983     0.989     0.986     11139
       B-art      0.000     0.000     0.000         2
       I-art      0.000     0.000     0.000         2
       B-eve      0.500     0.333     0.400         6
       I-eve      0.333     0.250     0.286         4
       B-geo      0.771     0.768     0.770       474
       I-geo      0.568     0.613     0.590        75
       B-gpe      0.822     0.814     0.818       204
       I-gpe      0.000     0.000     0.000         2
       B-nat      0.000     0.000     0.000         2
       I-nat      0.000     0.000     0.000         0
       B-org      0.668     0.612     0.639       227
       I-org      0.747     0.634     0.686       205
       B-per      0.809     0.709     0.755       268
       I-per      0.687     0.908     0.782       239
       B-tim      0.917     0.797     0.853       237
       I-tim      0.828     0.545     0.658        88

   micro avg      0.950   

In [57]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

crf_y_train_pred = best_crf.predict(crf_features_train)

print(metrics.flat_classification_report(
    crf_targets_train, crf_y_train_pred, labels=sorted_labels, digits=3))

              precision    recall  f1-score   support

           O      0.993     0.997     0.995     45078
       B-art      0.970     0.627     0.762        51
       I-art      1.000     0.844     0.915        32
       B-eve      0.941     0.821     0.877        39
       I-eve      0.939     0.939     0.939        33
       B-geo      0.820     0.934     0.873      1596
       I-geo      0.858     0.906     0.881       339
       B-gpe      0.911     0.822     0.864      1026
       I-gpe      0.909     0.312     0.465        32
       B-nat      1.000     0.500     0.667        18
       I-nat      1.000     0.667     0.800         9
       B-org      0.909     0.768     0.833      1010
       I-org      0.922     0.936     0.929       721
       B-per      0.946     0.933     0.939       839
       I-per      0.942     0.987     0.964       995
       B-tim      0.971     0.862     0.913       923
       I-tim      0.912     0.760     0.829       246

    accuracy              

So the tuned CRF achieves a weighted-average F1 score of **0.949 on the test data**, against 0.980 on the training data. The train-to-test drop looks small on this measure because the weighted average is dominated by the 'O' class; it is much larger on the macro-average F1 (test: 0.484; train: 0.850), which counts every class equally.

### Assessing results with seqeval

So far I've been relying on scikit-learn-style token-level metrics, but this evaluation leaves a great deal of room for error. For one thing, it treats the B- and I- tags as wholly distinct classes and does not pair them into single, whole entities the way they should be paired. So I am going to use the Python library seqeval, which has been specifically designed to work with BIO labels and to help with measuring performance on tasks "[such as named-entity recognition, part-of-speech tagging, semantic role labeling](https://pypi.org/project/seqeval/)".
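
The idea behind entity-level scoring can be sketched in plain Python. This is a simplified illustration, not seqeval's implementation: the helpers `extract_entities` and `entity_f1` are hypothetical, and I assume that a stray I- tag with no matching open entity starts a new entity. A predicted entity only counts as correct if its type and both boundaries match the true entity exactly.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    # A trailing 'O' sentinel closes an entity that runs to the end of the sentence
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = start is not None and prefix == "I" and label == ent_type
        if start is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        # Assumption: a stray I- tag with no open entity opens a new one
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return entities


def entity_f1(y_true, y_pred):
    """Micro-averaged F1 over exact-match entity spans."""
    true_spans, pred_spans = set(), set()
    for s, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(s, *e) for e in extract_entities(true_tags)}
        pred_spans |= {(s, *e) for e in extract_entities(pred_tags)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the model finds the start of the person but misses the second token
y_true = [["B-per", "I-per", "O", "B-geo"]]
y_pred = [["B-per", "O", "O", "B-geo"]]

token_accuracy = sum(
    t == p for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)
) / sum(len(ts) for ts in y_true)

print("token accuracy:", token_accuracy)
print("entity-level F1:", entity_f1(y_true, y_pred))
```

In the toy example three of the four tokens are tagged correctly, yet only one of the two entities is recovered in full, so the entity-level score is noticeably lower than the token-level one.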

In [63]:
from seqeval.metrics import accuracy_score as seq_acc
from seqeval.metrics import classification_report as seq_cr
from seqeval.metrics import f1_score as seq_f1_score


In [73]:
print("Our overall micro-average F1 score is", round(seq_f1_score(crf_targets_test, crf_y_test_pred, average='micro'),3))

Our overall micro-average F1 score is 0.729


In [76]:
print("Our overall accuracy is", round(seq_acc(crf_targets_test, crf_y_test_pred),4),)

Our overall accuracy is 0.9497


As you'd expect, the results are very different from the token-level estimate (an entity-level F1 of 0.729 against a weighted token-level F1 of 0.949), but this is a much more realistic picture of how well our model is performing. The model is pulled down considerably by the low-frequency classes (eve, art and nat), as the per-entity report shows:

In [69]:
print(seq_cr(crf_targets_test, crf_y_test_pred))

             precision    recall  f1-score   support

        geo       0.76      0.76      0.76       474
        gpe       0.82      0.81      0.82       204
        tim       0.85      0.74      0.79       237
        org       0.65      0.59      0.62       227
        per       0.69      0.61      0.65       268
        eve       0.50      0.33      0.40         6
        art       0.00      0.00      0.00         2
        nat       0.00      0.00      0.00         2

avg / total       0.75      0.71      0.73      1420



In [56]:
# And our token-level accuracy score on the training data
best_crf.score(crf_features_train, crf_targets_train)

0.9802593088870855

<a id=ttsplit></a>

## 3. Investigating our best model's weights
   
[LINK to table of contents](#contents)

In [59]:
eli5.show_weights(best_crf, top=30)


| From \ To | O | B-art | I-art | B-eve | I-eve | B-geo | I-geo | B-gpe | I-gpe | B-nat | I-nat | B-org | I-org | B-per | I-per | B-tim | I-tim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O | 3.374 | 1.116 | -1.561 | 2.038 | -1.253 | 1.351 | -2.744 | 1.076 | -1.165 | 0.965 | -1.092 | 1.475 | -2.792 | 2.373 | -2.162 | 2.223 | -2.721 |
| B-art | -0.373 | -0.043 | 3.926 | -0.023 | -0.112 | -0.524 | -0.384 | -0.681 | -0.12 | -0.032 | -0.059 | -0.47 | -0.58 | -0.56 | -0.536 | 0.164 | -0.329 |
| I-art | -0.986 | -0.039 | 3.672 | -0.006 | -0.075 | 0.122 | -0.277 | -0.42 | -0.059 | 0.0 | -0.019 | -0.319 | -0.392 | -0.409 | -0.659 | 0.0 | -0.207 |
| B-eve | -0.929 | -0.027 | -0.109 | -0.062 | 3.892 | -0.523 | -0.401 | -0.553 | -0.119 | 0.0 | -0.065 | -0.436 | -0.62 | -0.467 | -0.484 | 0.162 | -0.238 |
| I-eve | -0.153 | 0.0 | -0.041 | -0.356 | 2.923 | -0.181 | -0.188 | -0.187 | -0.037 | 0.0 | 0.0 | -0.21 | -0.234 | -0.132 | -0.41 | -0.136 | -0.112 |
| B-geo | 0.517 | 0.175 | -0.742 | -0.353 | -0.795 | -1.523 | 3.769 | -0.482 | -0.874 | -0.12 | -0.466 | -1.171 | -1.649 | -1.747 | -1.545 | 1.3 | -1.025 |
| I-geo | 0.088 | -0.162 | -0.362 | -0.126 | -0.211 | -1.168 | 3.103 | -0.916 | -0.315 | -0.019 | -0.123 | -0.645 | -0.909 | -0.849 | -1.084 | 0.242 | -0.526 |
| B-gpe | 0.477 | -0.439 | -0.713 | -0.191 | -0.795 | -1.39 | -1.581 | -1.923 | 2.989 | -0.153 | -0.422 | 0.9 | -1.856 | 0.891 | -1.203 | -1.453 | -1.01 |
| I-gpe | -0.327 | -0.003 | -0.022 | 0.0 | -0.038 | 0.205 | -0.129 | -0.206 | 2.617 | 0.0 | 0.0 | -0.232 | -0.287 | -0.29 | -0.302 | -0.105 | -0.119 |
| B-nat | -0.474 | 0.0 | -0.022 | 0.0 | -0.068 | -0.269 | -0.157 | -0.368 | -0.015 | -0.023 | 2.562 | -0.265 | -0.24 | -0.265 | -0.314 | -0.109 | -0.101 |
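The transition table above can also be inspected programmatically. The sketch below assumes `best_crf` is a fitted `sklearn_crfsuite.CRF`, whose `transition_features_` attribute is a dict mapping `(from_tag, to_tag)` pairs to weights. The helper `top_transitions` is a hypothetical name, and the sample dict holds a few values copied from the table so the block runs without the model.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, str]  # (from_tag, to_tag)


def top_transitions(
    weights: Dict[Transition, float], n: int = 3
) -> List[Tuple[Transition, float]]:
    """Return the n transitions with the largest learned weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]


# A handful of weights copied from the table above.
sample = {
    ("O", "I-geo"): -2.744,
    ("O", "B-per"): 2.373,
    ("B-geo", "I-geo"): 3.769,
    ("B-geo", "B-geo"): -1.523,
    ("B-gpe", "I-gpe"): 2.989,
    ("B-art", "I-art"): 3.926,
}

for (src, dst), weight in top_transitions(sample):
    print(f"{src:6} -> {dst:6} {weight:+.3f}")

# With the real model (assumption: best_crf is a fitted sklearn_crfsuite.CRF):
# top_transitions(best_crf.transition_features_, n=10)
```

In this sample the strongest transitions are `B-x -> I-x` pairs, while `O -> I-geo` is strongly negative. That is the BIO structure one would hope the model has learned: an `I-` tag should follow a tag of the same type and should not follow `O`.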

y=O top features

| Weight? | Feature |
|---|---|
| +2.161 | `word.lower():a` |
| +2.050 | `word.lower():israeli-palestinian` |
| +2.027 | `word.+1_POS:JJ` |
| +2.025 | `word.lower():and` |
| +1.891 | `word.BOS` |
| +1.891 | `word.-1_POS:` |
| +1.891 | `word.-2_POS:` |
| +1.768 | `word.+1_POS:` |
| +1.752 | `word.suffix_2:ty` |
| +1.727 | `word.+1_POS:VB` |

y=B-art top features

| Weight? | Feature |
|---|---|
| +1.195 | `word.prefix_3:Nob` |
| +1.186 | `word.prefix_3:Top` |
| +1.072 | `word.prefix_2:Do` |
| +1.042 | `word.suffix_3:oxx` |
| +1.042 | `word.suffix_2:xx` |
| +1.042 | `word.lower():vioxx` |
| +1.008 | `word.prefix_3:Vio` |
| +0.954 | `word.lower():alhurra` |
| +0.954 | `word.prefix_3:alH` |
| +0.921 | `word.prefix_3:eng` |

y=I-art top features

| Weight? | Feature |
|---|---|
| +0.678 | `word.prefix_2:Ga` |
| +0.668 | `word.suffix_3:ion` |
| +0.663 | `word.lower():constitution` |
| +0.662 | `word.suffix_2:on` |
| +0.640 | `word.-2_POS:DT` |
| +0.633 | `word.suffix_2:le` |
| +0.633 | `word.prefix_3:3` |
| +0.633 | `word.prefix_2:3` |
| +0.633 | `word.suffix_2:3` |
| +0.633 | `word.lower():3` |

y=B-eve top features

| Weight? | Feature |
|---|---|
| +1.166 | `word.lower():olympic` |
| +1.166 | `word.suffix_3:pic` |
| +1.147 | `word.prefix_3:Oly` |
| +0.953 | `word.suffix_3:II` |
| +0.953 | `word.prefix_3:II` |
| +0.953 | `word.lower():ii` |
| +0.941 | `word.prefix_2:Ol` |
| +0.929 | `word.prefix_2:II` |
| +0.929 | `word.suffix_2:II` |
| +0.916 | `word.prefix_3:Gam` |

y=I-eve top features

| Weight? | Feature |
|---|---|
| +1.072 | `word.istitle()` |
| +0.882 | `word.-1_POS:NNP` |
| +0.839 | `word.suffix_2:ic` |
| +0.821 | `word.prefix_3:War` |
| +0.790 | `word.prefix_2:Wa` |
| +0.728 | `word.lower():open` |
| +0.728 | `word.suffix_3:pen` |
| +0.698 | `word.prefix_2:Op` |
| +0.698 | `word.prefix_3:Ope` |
| +0.666 | `word.prefix_3:Oly` |

y=B-geo top features

| Weight? | Feature |
|---|---|
| +2.372 | `word.istitle()` |
| +1.574 | `word.prefix_2:Ba` |
| +1.383 | `word.suffix_2:ta` |
| +1.371 | `word.suffix_3:the` |
| +1.345 | `word.lower():paris` |
| +1.327 | `word.suffix_3:ris` |
| +1.278 | `word.suffix_3:est` |
| +1.244 | `word.lower():gaza` |
| +1.236 | `word.prefix_3:wes` |
| +1.222 | `word.suffix_3:and` |

y=I-geo top features

| Weight? | Feature |
|---|---|
| +1.777 | `word.istitle()` |
| +1.422 | `word.suffix_3:tan` |
| +1.290 | `word.suffix_3:nds` |
| +1.117 | `word.suffix_3:ica` |
| +1.111 | `word.suffix_2:ca` |
| +1.086 | `word.prefix_2:Ku` |
| +1.064 | `word.-1_POS:DT` |
| +1.048 | `word.lower():kurdish` |
| +1.016 | `word.lower():netherlands` |
| +0.996 | `word.lower():shaikan` |

y=B-gpe top features

| Weight? | Feature |
|---|---|
| +2.831 | `word.suffix_3:ese` |
| +2.000 | `word.suffix_3:ans` |
| +1.911 | `word.istitle()` |
| +1.837 | `word.suffix_3:ian` |
| +1.714 | `word.suffix_2:li` |
| +1.695 | `word.suffix_2:an` |
| +1.633 | `word.suffix_3:ish` |
| +1.507 | `word.prefix_3:Kor` |
| +1.476 | `word.lower():turkish` |
| +1.453 | `word.suffix_3:eli` |

y=I-gpe top features

| Weight? | Feature |
|---|---|
| +1.508 | `word.suffix_3:can` |
| +1.030 | `word.+1_POS:POS` |
| +1.029 | `word.-2_POS:CC` |
| +0.943 | `word.lower():african` |
| +0.922 | `word.prefix_3:Sta` |
| +0.871 | `word.prefix_3:Afr` |
| +0.800 | `word.prefix_2:Af` |
| +0.767 | `word.-1_POS:NNP` |
| +0.763 | `word.+1_POS:IN` |
| +0.737 | `word.prefix_2:St` |

y=B-nat top features

| Weight? | Feature |
|---|---|
| +1.752 | `word.isupper()` |
| +1.083 | `word.prefix_2:H5` |
| +1.083 | `word.prefix_3:H5N` |
| +1.014 | `word.lower():hurricane` |
| +1.013 | `word.prefix_3:Hur` |
| +1.005 | `word.suffix_3:ane` |
| +0.981 | `word.prefix_2:Hu` |
| +0.841 | `word.prefix_3:Kat` |
| +0.830 | `word.lower():katrina` |
| +0.799 | `word.-2_POS:VBN` |

y=I-nat top features

| Weight? | Feature |
|---|---|
| +0.926 | `word.lower():katrina` |
| +0.900 | `word.prefix_3:Kat` |
| +0.872 | `word.prefix_2:Ka` |
| +0.844 | `word.suffix_3:ina` |
| +0.798 | `word.suffix_2:na` |
| +0.758 | `word.-1_POS:NNP` |
| +0.724 | `word.lower():jing` |
| +0.722 | `word.lower():syndrome` |
| +0.722 | `word.prefix_3:Syn` |
| +0.707 | `word.prefix_3:Jin` |

y=B-org top features

| Weight? | Feature |
|---|---|
| +2.626 | `word.isupper()` |
| +1.769 | `word.lower():hamas` |
| +1.563 | `word.lower():taleban` |
| +1.531 | `word.prefix_3:Ham` |
| +1.493 | `word.suffix_3:ban` |
| +1.474 | `word.lower():kindhearts` |
| +1.409 | `word.lower():al-qaida` |
| +1.330 | `word.lower():singapore` |
| +1.318 | `word.prefix_3:al-` |
| +1.302 | `word.prefix_3:Kin` |

y=I-org top features

| Weight? | Feature |
|---|---|
| +1.346 | `word.lower():ministry` |
| +1.259 | `word.-1_POS:CC` |
| +1.186 | `word.suffix_3:try` |
| +1.174 | `word.prefix_3:Com` |
| +1.092 | `word.prefix_3:Chi` |
| +1.051 | `word.prefix_3:Cou` |
| +1.024 | `word.-1_POS:IN` |
| +1.014 | `word.lower():union` |
| +1.001 | `word.lower():committee-chairman` |
| +0.996 | `word.suffix_3:rce` |

y=B-per top features

| Weight? | Feature |
|---|---|
| +1.716 | `word.lower():prime` |
| +1.634 | `word.prefix_2:pr` |
| +1.598 | `word.prefix_3:al-` |
| +1.536 | `word.lower():president` |
| +1.463 | `word.istitle()` |
| +1.444 | `word.prefix_3:pri` |
| +1.403 | `word.prefix_2:Ob` |
| +1.373 | `word.lower():sperling` |
| +1.365 | `word.lower():bush` |
| +1.299 | `word.prefix_2:Ra` |

y=I-per top features

| Weight? | Feature |
|---|---|
| +1.784 | `word.-1_POS:NNP` |
| +1.222 | `word.prefix_2:Mu` |
| +1.038 | `word.prefix_3:al-` |
| +0.887 | `word.prefix_2:Al` |
| +0.845 | `word.lower():condoleezza` |
| +0.813 | `word.istitle()` |
| +0.740 | `word.suffix_3:son` |
| +0.723 | `word.-2_POS:NN` |
| +0.711 | `word.prefix_2:Mc` |
| +0.711 | `word.prefix_3:McC` |

y=B-tim top features

| Weight? | Feature |
|---|---|
| +3.567 | `word.suffix_3:day` |
| +2.424 | `word.suffix_3:ber` |
| +2.375 | `word.suffix_2:ay` |
| +2.107 | `word.isdigit()` |
| +2.024 | `word.prefix_2:19` |
| +1.836 | `word.lower():later` |
| +1.619 | `word.lower():recent` |
| +1.535 | `word.suffix_2:0s` |
| +1.484 | `word.lower():january` |
| +1.418 | `word.suffix_3:0th` |

y=I-tim top features

| Weight? | Feature |
|---|---|
| +2.388 | `word.suffix_3:day` |
| +2.123 | `word.isdigit()` |
| +2.019 | `word.suffix_2:ay` |
| +1.180 | `word.lower():decades` |
| +1.138 | `word.suffix_2:ne` |
| +1.110 | `word.prefix_3:dec` |
| +1.069 | `word.prefix_2:de` |
| +1.053 | `word.lower():quarter` |
| +1.044 | `word.prefix_2:Ju` |
| +1.042 | `word.+1_POS:.` |
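To help read the feature names in these tables, here is a hypothetical reconstruction of the kind of per-token feature function that would produce them. The actual function used to build `crf_features_train` is not shown in this section, so the name `word2features`, the `(word, POS)` input format, and the use of an empty string for out-of-sentence POS tags are all assumptions inferred from the feature names.

```python
from typing import Dict, List, Tuple, Union

Token = Tuple[str, str]  # (word, POS tag), an assumed input format


def word2features(sent: List[Token], i: int) -> Dict[str, Union[str, bool]]:
    """Build a feature dict for the i-th token of a sentence."""
    word = sent[i][0]

    def pos_at(j: int) -> str:
        # Assumption: positions outside the sentence get an empty POS string,
        # which would explain features such as "word.-1_POS:" in the tables.
        return sent[j][1] if 0 <= j < len(sent) else ""

    features: Dict[str, Union[str, bool]] = {
        "word.lower()": word.lower(),
        "word.prefix_2": word[:2],
        "word.prefix_3": word[:3],
        "word.suffix_2": word[-2:],
        "word.suffix_3": word[-3:],
        "word.istitle()": word.istitle(),
        "word.isupper()": word.isupper(),
        "word.isdigit()": word.isdigit(),
        "word.-1_POS": pos_at(i - 1),
        "word.-2_POS": pos_at(i - 2),
        "word.+1_POS": pos_at(i + 1),
    }
    if i == 0:
        features["word.BOS"] = True  # beginning-of-sentence marker
    return features


# Toy sentence (hypothetical data)
sent = [("Hurricane", "NNP"), ("Katrina", "NNP"), ("hit", "VBD")]
for name, value in word2features(sent, 1).items():
    print(f"{name}: {value}")
```

Under these assumptions, a row such as `word.suffix_3:day` means "the last three characters of the token are `day`", and a row such as `word.-1_POS:NNP` means "the previous token is tagged as a proper noun".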


<a id=selection></a>

## 4. Choice of model architectures

[LINK to table of contents](#contents)

<a id=one ></a> 

## 4.1 Model 1
    
[LINK to table of contents](#contents)

<a id=two ></a> 

## 4.2 Model 2
    
[LINK to table of contents](#contents)

<a id=three ></a> 

## 4.3 Model 3
    
[LINK to table of contents](#contents)

<a id=four ></a> 

## 4.4 Model 4
    
[LINK to table of contents](#contents)

<a id=five ></a> 

## 4.5 Model 5
    
[LINK to table of contents](#contents)

<a id=conc ></a> 

## 7. Conclusions and model comparison table
    
[LINK to table of contents](#contents)