# Week 8 NER
This notebook has the following goals:
- To test the accuracy of spaCy's entity predictions
- To test the impact of lemmatizing words before vectorization, as a preprocessing choice
- To test AutoML with TPOT, and compare it to previous models

## Basic imports and setup
### Imports

In [8]:
import pandas as pd
import spacy
from spacy.tokens import Doc
from nltk import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from tpot import TPOTRegressor

from sklearn.metrics import (
    confusion_matrix,
    accuracy_score as accuracy,
    recall_score as recall,
    precision_score as precision,
    f1_score
)


### Read in the DataFrame built during previous EDA

In [9]:
ner_df = pd.read_csv('../datasets/extended_df.csv')
ner_df.drop(columns=['Unnamed: 0'], inplace=True)
ner_df.head()

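|   | Sentence # | Word | POS | Tag | WordLength | Capital | Non-Punctuation | StopWord | IsNER |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O | 9 | True | True | False | 0 |
| 1 | NaN | of | IN | O | 2 | False | True | True | 0 |
| 2 | NaN | demonstrators | NNS | O | 13 | False | True | False | 0 |
| 3 | NaN | have | VBP | O | 4 | False | True | True | 0 |
| 4 | NaN | marched | VBN | O | 7 | False | True | False | 0 |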


### Make some updates to the "Sentence #" column
Forward-fill the sentence number so that every row has one, then convert the column to an integer so it can be used as a numerical feature later.

In [10]:
ner_df['Sentence #'] = (
    ner_df['Sentence #']
    .str.replace('Sentence: ', '', regex=False)  # keep only the number
    .ffill()                                     # carry it down to the sentence's other rows
    .astype('int64')
)
ner_df.head()

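|   | Sentence # | Word | POS | Tag | WordLength | Capital | Non-Punctuation | StopWord | IsNER |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Thousands | NNS | O | 9 | True | True | False | 0 |
| 1 | 1 | of | IN | O | 2 | False | True | True | 0 |
| 2 | 1 | demonstrators | NNS | O | 13 | False | True | False | 0 |
| 3 | 1 | have | VBP | O | 4 | False | True | True | 0 |
| 4 | 1 | marched | VBN | O | 7 | False | True | False | 0 |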
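To make the forward-fill step concrete, here is a minimal plain-Python sketch of the same logic on a toy column. The helper name `forward_fill_sentence_ids` and the sample values are illustrative assumptions, not part of the notebook, which relies on pandas for this step.

```python
def forward_fill_sentence_ids(labels):
    """Strip the 'Sentence: ' prefix and carry the last seen id forward.

    `labels` mimics the raw 'Sentence #' column: a label on the first
    token of each sentence and None on every following token.
    """
    filled = []
    current = None
    for label in labels:
        if label is not None:
            current = int(label.replace('Sentence: ', ''))
        if current is None:
            raise ValueError('the first row has no sentence label')
        filled.append(current)
    return filled


# Toy input (hypothetical values): two sentences of three and two tokens.
raw = ['Sentence: 1', None, None, 'Sentence: 2', None]
sentence_ids = forward_fill_sentence_ids(raw)
print(sentence_ids)  # [1, 1, 1, 2, 2]
```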


### Initial splits
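Establish the feature DataFrame `X` and the target Series `y`, and split before any feature engineering to avoid leakage. The split is positional: the first 839,270 rows form the training set and the remaining rows form the test set.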

In [11]:
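X = ner_df.drop(columns=['Tag', 'IsNER'])
y = ner_df['IsNER']

split_at = 839270  # positional split point: rows before it train, rows from it on test
X_train, X_test = X.iloc[:split_at], X.iloc[split_at:]
y_train, y_test = y.iloc[:split_at], y.iloc[split_at:]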
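Because later cells group rows by `Sentence #`, it can be worth checking that a positional split does not cut a sentence in half. The sketch below shows one way to test a split index and, if needed, move it to the next sentence boundary. The helper names and the toy ids are hypothetical, and the notebook does not show whether row 839270 falls on a boundary.

```python
def is_sentence_boundary(sentence_ids, split_at):
    """Return True if splitting before index `split_at` keeps every sentence whole."""
    if split_at <= 0 or split_at >= len(sentence_ids):
        return True
    return sentence_ids[split_at - 1] != sentence_ids[split_at]


def next_sentence_boundary(sentence_ids, split_at):
    """Move `split_at` forward until it sits between two sentences."""
    while not is_sentence_boundary(sentence_ids, split_at):
        split_at += 1
    return split_at


# Toy sentence ids (hypothetical): three sentences of 3, 2 and 3 tokens.
ids = [1, 1, 1, 2, 2, 3, 3, 3]
print(is_sentence_boundary(ids, 3))    # True: index 3 starts sentence 2
print(is_sentence_boundary(ids, 4))    # False: index 4 is inside sentence 2
print(next_sentence_boundary(ids, 4))  # 5: index 5 starts sentence 3
```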

### Establishing metrics
Based on the Electronics Purchase Prediction notebook from class.

In [12]:
def display_metrics(y_true, y_pred):
    """Print the confusion matrix and the binary classification metrics."""
    print(f'Confusion Matrix: \n{confusion_matrix(y_true, y_pred)}')
    print(f'Accuracy: {accuracy(y_true, y_pred):.3f}')
    print(f'Recall: {recall(y_true, y_pred):.3f}')
    print(f'Precision: {precision(y_true, y_pred):.3f}')
    print(f'F1 Score: {f1_score(y_true, y_pred):.3f}')

## Test 1
Testing how accurately spaCy's `en_core_web_sm` model flags named-entity tokens on the test set.

In [13]:
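def spacy_model(df):
    """Return a boolean Series: True where a token's text equals the text of a spaCy entity in its sentence."""
    nlp = spacy.load('en_core_web_sm')
    return_list = []
    # Process one sentence at a time, in ascending sentence order.
    for _, sentence_df in df.groupby('Sentence #'):
        # Build a pre-tokenized Doc so spaCy keeps the dataset's tokenization.
        words = nlp(Doc(nlp.vocab, words=sentence_df['Word'].tolist()))
        # Entity texts for this sentence, built once rather than once per token.
        ent_texts = {ent.text for ent in words.ents}
        for word in words:
            # Exact-text match: tokens inside multi-word entities are generally not flagged.
            return_list.append(word.text in ent_texts)
    return pd.Series(return_list)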
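One detail of `spacy_model` is worth illustrating: it flags a token only when the token's text equals the full text of some entity, so tokens inside multi-word entities are generally not flagged. The plain-Python sketch below contrasts that text-membership test with marking every token covered by an entity span. The token list and the `(start, end)` spans are made-up stand-ins for a spaCy `Doc` and its `ents`; in spaCy itself, the token attribute `ent_type_` carries comparable per-token information.

```python
tokens = ['She', 'moved', 'to', 'New', 'York', 'from', 'Paris']

# Hypothetical entity spans as (start, end) token offsets, end exclusive.
entity_spans = [(3, 5), (6, 7)]
entity_texts = {' '.join(tokens[start:end]) for start, end in entity_spans}

# Approach used in the notebook: is the token's text itself an entity text?
by_text = [token in entity_texts for token in tokens]

# Alternative: flag every token position that an entity span covers.
covered = set()
for start, end in entity_spans:
    covered.update(range(start, end))
by_span = [index in covered for index in range(len(tokens))]

print(by_text)  # [False, False, False, False, False, False, True]
print(by_span)  # [False, False, False, True, True, False, True]
```

Under the text-membership test, "New" and "York" are both missed because neither equals "New York". Misses of this kind count as false negatives and lower recall.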

In [14]:
preds = spacy_model(X_test)
display_metrics(y_test, preds)

Confusion Matrix: 
[[182272   4559]
 [  8553  13921]]
Accuracy: 0.937
Recall: 0.619
Precision: 0.753
F1 Score: 0.680
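
The figures printed above can be reproduced by hand from the confusion matrix, which is a useful sanity check on how each metric is defined. This sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses only the four counts reported above.

```python
# Counts from the spaCy confusion matrix: [[TN, FP], [FN, TP]].
tn, fp = 182272, 4559
fn, tp = 8553, 13921

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all tokens classified correctly
recall = tp / (tp + fn)                             # share of true entity tokens that were found
precision = tp / (tp + fp)                          # share of flagged tokens that are true entities
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f'{accuracy:.3f} {recall:.3f} {precision:.3f} {f1:.3f}')  # 0.937 0.619 0.753 0.680
```

Accuracy is high mainly because about 89% of the test tokens (186,831 of 209,305) are non-entities, so recall, precision and F1 are the more informative figures here.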


## Test 2
Testing impacts of lemmatization before vectorization.

Model from baseline, for comparison:

In [15]:
xgb_model = XGBClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42)
models = [xgb_model, lr_model]
model_names = ['XGB', 'Logistic Regression']

categorical_cols = ['Word', 'POS']

numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64', 'bool']]

numerical_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

for model, model_name in zip(models, model_names):


    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('model', model)
                                ])

    pipeline.fit(X_train, y_train)  

    preds = pipeline.predict(X_test)

    print('Display metrics for {} with one-hot encoding:'.format(model_name))
    display_metrics(y_test, preds)



Display metrics for XGB with one-hot encoding:
Confusion Matrix: 
[[180446   6385]
 [  2111  20363]]
Accuracy: 0.959
Recall: 0.906
Precision: 0.761
F1 Score: 0.827


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Display metrics for Logistic Regression with one-hot encoding:
Confusion Matrix: 
[[178538   8293]
 [  3668  18806]]
Accuracy: 0.943
Recall: 0.837
Precision: 0.694
F1 Score: 0.759
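
The pipeline above one-hot encodes `Word` and `POS` with `handle_unknown='ignore'`, so a category that appears only at prediction time is encoded as all zeros instead of raising an error. A minimal plain-Python sketch of that behaviour follows; the helper names `fit_vocab` and `encode` are hypothetical and stand in for `OneHotEncoder.fit` and `OneHotEncoder.transform`.

```python
def fit_vocab(values):
    """Map each distinct training category to a column index, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode(value, vocab):
    """One-hot encode `value`; unknown categories produce an all-zero row."""
    row = [0] * len(vocab)
    if value in vocab:
        row[vocab[value]] = 1
    return row


vocab = fit_vocab(['NNS', 'IN', 'VBP', 'NNS'])  # categories seen in training
seen = encode('NNS', vocab)                     # category present in training
unseen = encode('JJ', vocab)                    # category absent from training

print(vocab)   # {'IN': 0, 'NNS': 1, 'VBP': 2}
print(seen)    # [0, 1, 0]
print(unseen)  # [0, 0, 0]
```

For the `Word` column this means that any test word never seen in training contributes no word-level signal, and the model has to rely on `POS` and the numerical features for that row.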


Retesting logistic regression with balanced class weights:

In [16]:
lr_model = LogisticRegression(random_state=42, class_weight='balanced')
models = [lr_model]
model_names = ['Logistic Regression']

categorical_cols = ['Word', 'POS']

numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64', 'bool']]

numerical_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

for model, model_name in zip(models, model_names):


    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('model', model)
                                ])

    pipeline.fit(X_train, y_train)  

    preds = pipeline.predict(X_test)

    print('Display metrics for {} with one-hot encoding:'.format(model_name))
    display_metrics(y_test, preds)

Display metrics for Logistic Regression with one-hot encoding:
Confusion Matrix: 
[[175263  11568]
 [  2001  20473]]
Accuracy: 0.935
Recall: 0.911
Precision: 0.639
F1 Score: 0.751
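
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. The sketch below applies that formula in plain Python to a toy label list. The helper name `balanced_class_weights` and the 9:1 toy labels are assumptions for illustration; the real weights are derived from `y_train`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): nine non-entity tokens and one entity token.
toy_labels = [0] * 9 + [1] * 1
weights = balanced_class_weights(toy_labels)
print(weights)  # class 0 gets 10 / 18, class 1 gets 10 / 2
```

Each class ends up with the same total weight (9 × 10/18 = 1 × 10/2 = 5), which pushes the model toward the minority class. That is consistent with the run above, where recall rose from 0.837 to 0.911 while precision fell from 0.694 to 0.639.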


Function for lemmatizing each word with spaCy:

In [17]:
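def lemmatizer(df):
    """Return a Series of spaCy lemmas, one per row, with a fresh 0-based index."""
    nlp = spacy.load('en_core_web_sm')
    return_list = []
    # Process one sentence at a time, in ascending sentence order.
    for _, sentence_df in df.groupby('Sentence #'):
        # Build a pre-tokenized Doc so spaCy keeps the dataset's tokenization.
        words = nlp(Doc(nlp.vocab, words=sentence_df['Word'].tolist()))
        for word in words:
            return_list.append(word.lemma_)
    return pd.Series(return_list)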
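
The reason lemmatization might matter for a one-hot encoded `Word` column is that inflected forms collapse into a single category, which shrinks the vocabulary. The sketch below shows the effect with a hand-written lemma table; `toy_lemmas` and the word list are hypothetical stand-ins for spaCy's lemmatizer, not its actual output.

```python
# Hypothetical lemma lookup standing in for spaCy's `lemma_` attribute.
toy_lemmas = {
    'marched': 'march',
    'marches': 'march',
    'marching': 'march',
    'demonstrators': 'demonstrator',
}

words = ['marched', 'marches', 'marching', 'demonstrators', 'London']
lemmatized = [toy_lemmas.get(word, word) for word in words]

vocab_before = sorted(set(words))
vocab_after = sorted(set(lemmatized))

print(len(vocab_before), len(vocab_after))  # 5 3
print(vocab_after)                          # ['London', 'demonstrator', 'march']
```

A smaller vocabulary means fewer one-hot columns and fewer unseen words at test time, at the cost of discarding inflection that may itself be informative.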

Model from baseline with lemmatization before vectorization:

In [18]:
xgb_model = XGBClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42, class_weight='balanced')
models = [xgb_model, lr_model]
model_names = ['XGB', 'Logistic Regression']
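X_train_lemma = X_train.copy()
X_test_lemma = X_test.copy()

# Caution: lemmatizer() returns a Series with a fresh 0-based index, and pandas
# aligns on index labels when a Series is assigned to a column. X_test keeps its
# original labels (starting at 839270), so no labels match and the test 'Word'
# column becomes NaN, which the 'most_frequent' imputer then fills.
# Appending .to_numpy() to the calls below would assign by position instead.
X_train_lemma['Word'] = lemmatizer(X_train_lemma)
X_test_lemma['Word'] = lemmatizer(X_test_lemma)
categorical_cols = ['Word', 'POS']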

numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64', 'bool']]

numerical_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

for model, model_name in zip(models, model_names):


    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('model', model)
                                ])

    pipeline.fit(X_train_lemma, y_train)  

    preds = pipeline.predict(X_test_lemma)

    print('Display metrics for {} with one-hot encoding:'.format(model_name))
    display_metrics(y_test, preds)



Display metrics for XGB with one-hot encoding:
Confusion Matrix: 
[[177469   9362]
 [  2112  20362]]
Accuracy: 0.945
Recall: 0.906
Precision: 0.685
F1 Score: 0.780
Display metrics for Logistic Regression with one-hot encoding:
Confusion Matrix: 
[[175626  11205]
 [  2232  20242]]
Accuracy: 0.936
Recall: 0.901
Precision: 0.644
F1 Score: 0.751


## Test 3
Testing TPOT for AutoML. The cell is left commented out.

In [19]:
# %%time
# from tpot import TPOTClassifier  # IsNER is a binary label, so use the classifier
# ohe = OneHotEncoder(handle_unknown='ignore')
# X_train_auto = ohe.fit_transform(X_train)
# X_test_auto = ohe.transform(X_test)  # reuse the encoder fitted on the training data
# tpot = TPOTClassifier(generations=10,
#                       population_size=40,
#                       scoring='accuracy',
#                       verbosity=2,
#                       random_state=42,
#                       config_dict='TPOT sparse')
# tpot.fit(X_train_auto, y_train)
# print(f"TPOT score on test data: {tpot.score(X_test_auto, y_test):.2f}")
# tpot.export('tpot_mpg_pipeline.py')