<a id=contents></a>

# Model building
## What are we trying to predict?


[1. ETL and Train Test Split](#ETL)

[2. Modelling with Random Forest](#RF)

[3. Modelling with Conditional Random Field](#CRF)

[4. Choice of model architectures](#selection)

[4.1 Model 1](#one)

[4.2 Model 2](#two)

[4.2 Model 3](#three)

[4.2 Model 4](#four)

[4.2 Model 5](#five)

[7. Conclusions and model comparison table](#conc)

In [17]:

import pandas as pd
import numpy as np

import pickle

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("darkgrid")

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from matplotlib import cm
import numpy as np
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.metrics import f1_score

from sklearn_crfsuite import CRF, scorers, metrics
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.metrics import classification_report, make_scorer

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import re
import string
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
stop_words += list(string.punctuation)

import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a id=ETL ><a/> 

## 1. ETL of data and Train-Test Split
    
[LINK to table of contents](#contents)

In [11]:
feature_df = pd.read_pickle('feature_based_data/final_feature_data.pkl')
feature_y = feature_df.Tag
feature_X = feature_df.drop(columns=['Tag'])

x_train, x_test, y_train, y_test = train_test_split(feature_X, feature_y, test_size=.2, random_state=12345)

In [12]:
x_train.head(2)

Unnamed: 0,is_title,length,is_upper,is_digit,is_prev_NE,prev_-1_POS_NNP,prev_-1_POS_NN,is_prev_pos_same_as_current,POS_NNP,POS_NN,POS_IN,POS_DT,POS_.
42758,0,7,0,0,0,0,0,0,0,1,0,0,0
56828,0,4,0,0,0,0,0,0,0,0,0,0,0


<a id = 'RF'></a>

## 2. Modelling with Random Forest

[LINK to table of contents](#contents)

In [46]:
%time
RF = RandomForestClassifier(n_estimators=50, max_depth=20, min_samples_split=.01, n_jobs=-1)

preds = cross_val_predict(estimator=RF, X=x_train.values, y=y_train.values, cv=5)


CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 12.2 µs


In [50]:
preds.shape == y_train.shape

True

In [53]:
report = classification_report(y_pred=preds, y_true=y_train)
print(report)

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        37
       B-eve       0.00      0.00      0.00        34
       B-geo       0.44      0.90      0.59      1668
       B-gpe       0.59      0.57      0.58       973
       B-nat       0.00      0.00      0.00        17
       B-org       0.59      0.21      0.31       987
       B-per       0.52      0.34      0.41       907
       B-tim       0.83      0.20      0.32       926
       I-art       0.00      0.00      0.00        26
       I-eve       0.00      0.00      0.00        33
       I-geo       0.49      0.12      0.20       331
       I-gpe       0.00      0.00      0.00        27
       I-nat       0.00      0.00      0.00         9
       I-org       0.42      0.21      0.28       744
       I-per       0.50      0.92      0.65       961
       I-tim       0.98      0.36      0.52       261
           O       0.98      0.99      0.99     44987

    accuracy              

In [83]:
y_train_bin = np.zeros(y_train.shape)

In [84]:
for i in y_train_bin:
    if y_train[i]!='O':
        y_train_bin[i]=1
y_train_bin

array([0., 0., 0., ..., 0., 0., 0.])

In [15]:
x_train.iloc[:5].to_dict()

{'is_title': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'length': {42758: 7, 56828: 4, 18522: 1, 15552: 2, 20990: 6},
 'is_upper': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_digit': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_prev_NE': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 1},
 'prev_-1_POS_NNP': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 1},
 'prev_-1_POS_NN': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'is_prev_pos_same_as_current': {42758: 0,
  56828: 0,
  18522: 0,
  15552: 0,
  20990: 0},
 'POS_NNP': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_NN': {42758: 1, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_IN': {42758: 0, 56828: 0, 18522: 0, 15552: 1, 20990: 0},
 'POS_DT': {42758: 0, 56828: 0, 18522: 0, 15552: 0, 20990: 0},
 'POS_.': {42758: 0, 56828: 0, 18522: 1, 15552: 0, 20990: 0}}

<a id = 'CRF'></a>

## 3. Modelling with a Conditional Random Field

[LINK to table of contents](#contents)

Sklearn's CRF requires the input data to be a list of lists of dicts. I stored these as pickle files in notebook 3.

In [23]:
with open('clean_data/crf_train_data.pkl', 'rb') as f:
    crf_features_train = pickle.load(f)
    
with open('clean_data/crf_test_data.pkl', 'rb') as f:
    crf_features_test = pickle.load(f)
    
with open('clean_data/crf_train_targets.pkl', 'rb') as f:
    crf_targets_train = pickle.load(f)
    
with open('clean_data/crf_test_targets.pkl', 'rb') as f:
    crf_targets_test = pickle.load(f)
    

In [24]:
crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

In [25]:
%time
crf.fit(crf_features_train, crf_targets_train)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs


CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [26]:
labels = list(crf.classes_)
labels

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']

In [27]:
crf_y_pred = crf.predict(crf_features_test)
metrics.flat_f1_score(crf_targets_test, crf_y_pred,
                      average='weighted', labels=labels)

0.9471352834210373

In [28]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(
    crf_targets_test, crf_y_pred, labels=sorted_labels, digits=3))

              precision    recall  f1-score   support

           O      0.984     0.987     0.986     11139
       B-art      0.000     0.000     0.000         2
       I-art      0.000     0.000     0.000         2
       B-eve      0.400     0.333     0.364         6
       I-eve      0.400     0.500     0.444         4
       B-geo      0.769     0.738     0.753       474
       I-geo      0.600     0.600     0.600        75
       B-gpe      0.788     0.819     0.803       204
       I-gpe      0.000     0.000     0.000         2
       B-nat      0.000     0.000     0.000         2
       I-nat      0.000     0.000     0.000         0
       B-org      0.628     0.626     0.627       227
       I-org      0.728     0.639     0.681       205
       B-per      0.805     0.709     0.754       268
       I-per      0.689     0.908     0.783       239
       B-tim      0.887     0.793     0.837       237
       I-tim      0.754     0.591     0.662        88

   micro avg      0.948   

<a id=ttsplit ><a/> 

## 3. Train-test split and model transformation
   
[LINK to table of contents](#contents)

<a id=selection></a>

## 4. Choice of model architectures

[LINK to table of contents](#contents)

<a id=one ><a/> 

## 4.1 Model 1
    
[LINK to table of contents](#contents)

<a id=two ><a/> 

## 4.2 Model 2
    
[LINK to table of contents](#contents)

<a id=three ><a/> 

## 4.3 Model 3
    
[LINK to table of contents](#contents)

<a id=four ><a/> 

## 4.4 Model 4
    
[LINK to table of contents](#contents)

<a id=five ><a/> 

## 4.5 Model 5
    
[LINK to table of contents](#contents)

<a id=conc ><a/> 

## 7. Conclusions and model comparison table
    
[LINK to table of contents](#contents)