<a id=contents></a>

# Model building
## What are we trying to predict?


[1. ETL of data and Train-Test Split](#target)

[2. Class imbalance / sampling solution / distribution](#samp)

[3. Train-test split and model transformation](#ttsplit)

[4. Choice of model architectures](#selection)

[4.1 Model 1](#one)

[4.2 Model 2](#two)

[4.3 Model 3](#three)

[4.4 Model 4](#four)

[4.5 Model 5](#five)

[7. Conclusions and model comparison table](#conc)

In [25]:

import string
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm

%matplotlib inline
sns.set_style("darkgrid")

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, make_scorer, multilabel_confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_predict, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB

# `metrics` here is sklearn_crfsuite.metrics (it provides flat_f1_score);
# sklearn's own `metrics` module imported under the same name would be shadowed.
from sklearn_crfsuite import CRF, scorers, metrics
from sklearn_crfsuite.metrics import flat_classification_report

# Keep tokens of three or more word characters; collect English stop words and punctuation
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
stop_words += list(string.punctuation)

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [82]:

# An attempt at using keras-contrib's CRF layer, since sklearn_crfsuite's CRF does not accept this flat, token-level data
from keras.models import *
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input
import importlib
# The importable module is named keras_contrib; "keras-contrib" is only the package (distribution) name
keras_contrib = importlib.import_module("keras_contrib")


<a id=target></a>

## 1. ETL of data and Train-Test Split
    
[LINK to table of contents](#contents)

In [3]:
feature_df = pd.read_pickle('feature_based_data/final_feature_data.pkl')
feature_y = feature_df.Tag
feature_X = feature_df.drop(columns=['Tag'])

x_train, x_test, y_train, y_test = train_test_split(feature_X, feature_y, test_size=.2, random_state=12345)

In [39]:
x_train.head(2)

Unnamed: 0,is_title,length,is_upper,is_digit,is_prev_NE,prev_-1_POS_NNP,prev_-1_POS_NN,is_prev_pos_same_as_current,POS_NNP,POS_NN,POS_IN,POS_DT,POS_.
42758,0,7,0,0,0,0,0,0,0,1,0,0,0
56828,0,4,0,0,0,0,0,0,0,0,0,0,0


In [46]:
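%%time
# `%%time` times the whole cell; the timing recorded below came from a bare `%time` line, which times an empty statement.
RF = RandomForestClassifier(n_estimators=50, max_depth=20, min_samples_split=.01, n_jobs=-1)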

preds = cross_val_predict(estimator=RF, X=x_train.values, y=y_train.values, cv=5)


CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 12.2 µs


In [50]:
preds.shape == y_train.shape

True

In [53]:
report = classification_report(y_pred=preds, y_true=y_train)
print(report)

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        37
       B-eve       0.00      0.00      0.00        34
       B-geo       0.44      0.90      0.59      1668
       B-gpe       0.59      0.57      0.58       973
       B-nat       0.00      0.00      0.00        17
       B-org       0.59      0.21      0.31       987
       B-per       0.52      0.34      0.41       907
       B-tim       0.83      0.20      0.32       926
       I-art       0.00      0.00      0.00        26
       I-eve       0.00      0.00      0.00        33
       I-geo       0.49      0.12      0.20       331
       I-gpe       0.00      0.00      0.00        27
       I-nat       0.00      0.00      0.00         9
       I-org       0.42      0.21      0.28       744
       I-per       0.50      0.92      0.65       961
       I-tim       0.98      0.36      0.52       261
           O       0.98      0.99      0.99     44987

    accuracy              
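Because the `O` tag accounts for most of the support in the report above, an aggregate score can look strong even when the entity tags are mostly missed. A minimal sketch of a macro F1 restricted to the entity tags follows; the helper name `entity_macro_f1` and the toy labels are illustrative assumptions, not part of the original notebook.

```python
from sklearn.metrics import f1_score


def entity_macro_f1(y_true, y_pred, outside_label="O"):
    """Macro-averaged F1 over every tag except the outside label (hypothetical helper)."""
    # Score only the tags that actually occur in y_true, minus the outside tag
    labels = sorted(set(y_true) - {outside_label})
    # zero_division=0 scores a tag that is never predicted as 0 instead of warning
    return f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)


# Toy labels for illustration only
toy_true = ["O", "O", "B-geo", "B-per"]
toy_pred = ["O", "O", "B-geo", "O"]
toy_score = entity_macro_f1(toy_true, toy_pred)
print(toy_score)
```

In the notebook this could be called as `entity_macro_f1(y_train, preds)` on the cross-validated predictions.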

In [83]:
y_train_bin = np.zeros(y_train.shape)

In [84]:
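# Mark every non-'O' tag as 1 (named entity) and 'O' as 0.
# Looping over y_train_bin itself yields its values (all 0.0), not positions,
# so the mask is built directly from the labels instead.
y_train_bin = (y_train.values != 'O').astype(float)
y_train_bin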

array([0., 0., 0., ..., 0., 0., 0.])

In [93]:
y_train_bin.shape

(52928,)

In [85]:
crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)

## What the problem was

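The `CRF` estimator in `sklearn_crfsuite` expects `X` as a list of sentences, each a list of per-token feature dicts (a "list of lists of dicts"), and `y` as a matching list of lists of label strings. The data here is a flat table with one row per token, so the `fit` and `cross_val_predict` calls below fail.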
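One way to bridge the gap is to group the token rows back into sentences before fitting. The sketch below assumes a `sentence_id` column that identifies which sentence each token belongs to. No such column appears in the feature table shown earlier, so both that column and the helper `to_crf_sequences` are hypothetical.

```python
import pandas as pd


def to_crf_sequences(df, sentence_col="sentence_id", label_col="Tag"):
    """Turn a flat token-level frame into per-sentence feature dicts and label lists."""
    feature_cols = [c for c in df.columns if c not in (sentence_col, label_col)]
    X_seqs, y_seqs = [], []
    # sort=False keeps sentences in their order of first appearance
    for _, sent in df.groupby(sentence_col, sort=False):
        # One dict of features per token, in token order
        X_seqs.append(sent[feature_cols].to_dict(orient="records"))
        # Labels as strings, one per token
        y_seqs.append(sent[label_col].astype(str).tolist())
    return X_seqs, y_seqs


# Toy frame for illustration; `sentence_id` is an assumed column
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "is_title": [1, 0, 0, 1, 0],
    "length": [6, 2, 5, 5, 5],
    "Tag": ["B-geo", "O", "O", "B-per", "O"],
})
X_seqs, y_seqs = to_crf_sequences(toy)
print(len(X_seqs), y_seqs)
```

With data in this shape, `crf.fit(X_seqs, y_seqs)` receives the nested structure it expects. A train-test split would then need to be made per sentence rather than per token, so that no sentence is divided between the two sets.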

In [91]:
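%%time
# Fails: CRF.fit needs sequences of per-token feature dicts and string labels, not a flat DataFrame and a float array.
crf.fit(x_train, y_train_bin)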

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 20 µs


TypeError: 'numpy.float64' object is not iterable

In [23]:
f1_scorer = make_scorer(metrics.flat_f1_score, average='macro') 

In [30]:
pred = cross_val_predict(estimator=crf, X=x_train.values, y=y_train.values, cv=5)


ValueError: The numbers of items and labels differ: |x| = 13, |y| = 1

In [28]:
pred

{'fit_time': array([0.01170826, 0.00236893, 0.00223017, 0.00199819, 0.00233698]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([nan, nan, nan, nan, nan])}

<a id=samp></a>

## 2.  Class imbalance / sampling solution / distribution
    
[LINK to table of contents](#contents)

<a id=ttsplit></a>

## 3. Train-test split and model transformation
   
[LINK to table of contents](#contents)

<a id=selection></a>

## 4. Choice of model architectures

[LINK to table of contents](#contents)

<a id=one></a>

## 4.1 Model 1
    
[LINK to table of contents](#contents)

<a id=two></a>

## 4.2 Model 2
    
[LINK to table of contents](#contents)

<a id=three></a>

## 4.3 Model 3
    
[LINK to table of contents](#contents)

<a id=four></a>

## 4.4 Model 4
    
[LINK to table of contents](#contents)

<a id=five></a>

## 4.5 Model 5
    
[LINK to table of contents](#contents)

<a id=conc></a>

## 7. Conclusions and model comparison table
    
[LINK to table of contents](#contents)