# Model Evaluation 1

---

__This Notebook__

The original goal was to build a classifier that detected `spam` well (high sensitivity) but also detected `ham` well, possibly even better (high specificity), since it would be bad to send a legitimate message to the spam folder.

Since specificity did not appear to be a problem, the modeling phase focused entirely on sensitivity. This was partly due to confusion with the original tutorial, which treated `ham` as the positive class (it was a ham detector, not a spam detector), so increasing sensitivity made sense in that scenario.
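
To make the positive-class confusion concrete, here is a minimal sketch on made-up toy labels (not the project data). The helper `sensitivity_specificity` is a hypothetical name introduced for illustration. It shows that swapping the positive class simply swaps sensitivity and specificity.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) treating `positive` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive                # actual positives
    neg = ~pos                              # actual negatives
    tpr = np.mean(y_pred[pos] == positive)  # share of positives that were caught
    tnr = np.mean(y_pred[neg] != positive)  # share of negatives left alone
    return tpr, tnr

# Toy labels (assumption, for illustration only)
y_true_toy = np.array(['ham', 'ham', 'ham', 'ham', 'spam', 'spam'])
y_pred_toy = np.array(['ham', 'ham', 'ham', 'spam', 'spam', 'ham'])

spam_tpr, spam_tnr = sensitivity_specificity(y_true_toy, y_pred_toy, 'spam')
ham_tpr, ham_tnr = sensitivity_specificity(y_true_toy, y_pred_toy, 'ham')

print(f"spam as positive: sensitivity={spam_tpr:.2f}, specificity={spam_tnr:.2f}")
print(f"ham as positive:  sensitivity={ham_tpr:.2f}, specificity={ham_tnr:.2f}")
```

In this notebook `spam` is encoded as 1, so sensitivity (`tpr`) measures spam detection and specificity (`tnr`) measures how well `ham` is left alone.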


__Results__

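A quick evaluation using a single confusion matrix per classifier suggests that, at first glance, all six models generalize reasonably well, as expected: test accuracy ranges from 0.9892 to 0.9904. The weakest is perhaps the AdaBoost classifier, which has the lowest sensitivity (0.9652), also as expected.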

__Next__

Plotting learning curves on the entire data set (including the test set) and comparing them with the training learning curves might give a better picture of how well the models generalize.
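
As a rough sketch of how such curves could be computed, the block below uses scikit-learn's `learning_curve` with recall (sensitivity) as the score. The data is synthetic and the small forest is a stand-in, both assumptions made so the example runs on its own. The real features, targets and candidate models would replace them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic, imbalanced stand-in data (assumption): replace with the real X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.85, 0.15], random_state=42)

# Small stand-in model (assumption), not one of the tuned candidates.
clf_demo = RandomForestClassifier(
    n_estimators=25, max_depth=4, random_state=42)

train_sizes, train_scores, valid_scores = learning_curve(
    clf_demo, X_demo, y_demo,
    train_sizes=np.linspace(0.25, 1.0, 4),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="recall",  # sensitivity for the positive class (label 1)
    shuffle=True, random_state=42)

# Average over the folds: one training and one validation score per size.
for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"n={n:3d}  train recall={tr:.3f}  validation recall={va:.3f}")
```

A persistent gap between the two columns as `n` grows would point to overfitting. Curves that converge suggest that more data is unlikely to help much.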

Examining which texts the classifiers misclassify in the training and test sets, and comparing the two, might also help clarify what needs to be done. This should have been done during training and modeling, but it only occurred to me now.
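
One possible way to pull out the misclassified messages is sketched below. The helper `misclassified_indices` and the toy texts are hypothetical. The sketch assumes the raw texts are available in the same row order as the targets, which this notebook does not show.

```python
import numpy as np

def misclassified_indices(y_true, y_pred):
    """Return (false_positive_idx, false_negative_idx) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_pos = np.flatnonzero((y_true == 0) & (y_pred == 1))  # ham flagged as spam
    false_neg = np.flatnonzero((y_true == 1) & (y_pred == 0))  # spam that got through
    return false_pos, false_neg

# Toy data (assumption, for illustration only)
texts_toy = np.array([
    "see you at lunch", "call me now, it's urgent", "WIN a free prize",
    "your account statement", "running late", "claim your reward today"])
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 1, 0, 0, 1])

fp_idx, fn_idx = misclassified_indices(y_true_toy, y_pred_toy)
print("ham sent to spam:", texts_toy[fp_idx].tolist())
print("spam missed:     ", texts_toy[fn_idx].tolist())
```

Running the same helper on the training and test predictions would give two lists per classifier that can be compared side by side.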


## Setup

In [1]:
import re
import os
import sys
import time
import joblib 

import numpy as np
import pandas as pd
import scipy.sparse as sp
import matplotlib.pyplot as plt

from datetime import datetime
from sklearn.model_selection import train_test_split, \
    ShuffleSplit, StratifiedKFold, learning_curve
from sklearn.metrics import make_scorer, accuracy_score, \
    recall_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, \
    RandomForestClassifier, GradientBoostingClassifier, \
    VotingClassifier

import custom.evaluate_models as E

np.set_printoptions(threshold=sys.maxsize)
# record the date of the last revision (YYYY-MM-DD)
day = datetime.now().strftime('%Y-%m-%d')
print('Revised on: ' + day)

Revised on: 2021-02-21


## Load Target Values

In [2]:
def load_target(data):
    raw_path = os.path.join("data","1_raw")
    filename = ''.join([data, ".csv"])
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    # first column holds the labels; Series.ravel() is deprecated in recent pandas
    out_arr = out_dfm.iloc[:, 0].to_numpy()
    return out_arr

y_train_array = load_target("y_train")
y_test_array = load_target("y_test") 

def make_int(y_array):
    y = y_array.copy()
    y[y=='ham'] = 0
    y[y=='spam'] = 1
    y = y.astype('int')
    return y

y_train = make_int(y_train_array)
y_test = make_int(y_test_array)
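
`make_int` assumes every label is either `'ham'` or `'spam'`. A slightly more defensive variant is sketched below. The name `encode_labels` is hypothetical, and this is only one way to add the check.

```python
import numpy as np

def encode_labels(y_array):
    """Map 'ham' -> 0 and 'spam' -> 1, rejecting any other label."""
    y = np.asarray(y_array, dtype=object)
    known = (y == 'ham') | (y == 'spam')
    if not known.all():
        raise ValueError(f"unexpected labels: {set(y[~known])}")
    return (y == 'spam').astype(int)

print(encode_labels(['ham', 'spam', 'ham']))
```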

## Load Preprocessed Data

In [3]:
def load_X(filename):
    proc_dir = os.path.join("data", "2_processed")
    filename = ''.join([filename, '.npz'])
    X = sp.load_npz(os.path.join(proc_dir, filename))
    return X

X_train = load_X('X_train_processed')
X_test = load_X('X_test_processed')

In [4]:
# sanity checks
X_train, X_test, y_train.shape, y_test.shape

(<3900x801 sparse matrix of type '<class 'numpy.float64'>'
 	with 3123099 stored elements in COOrdinate format>,
 <1672x801 sparse matrix of type '<class 'numpy.float64'>'
 	with 1338471 stored elements in COOrdinate format>,
 (3900,),
 (1672,))
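
The output above reports 3123099 stored elements for a 3900x801 matrix, so nearly every cell is stored. A quick density check along these lines makes that explicit. The helper `density` is a hypothetical name. Stored elements can include explicit zeros, so the figure is an upper bound on the share of non-zero values.

```python
import numpy as np
import scipy.sparse as sp

def density(X):
    """Fraction of cells that are explicitly stored in a sparse matrix."""
    n_rows, n_cols = X.shape
    return X.nnz / (n_rows * n_cols)

# Tiny stand-in matrix (assumption, for illustration only)
X_toy = sp.coo_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 3.0]]))
print(f"toy density: {density(X_toy):.2f}")

# Same ratio for the training matrix, using the figures printed above
train_density = 3123099 / (3900 * 801)
print(f"X_train density: {train_density:.4f}")

# COO matrices do not support row indexing; convert to CSR first if needed
first_row = X_toy.tocsr()[0]
```

With a density this close to 1, the sparse format saves little memory compared with a dense array.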

## Instantiate Candidate Models

In [5]:
# remember to use warm_start=True for learning curves

rnd_clf1 = RandomForestClassifier(
    random_state=42, n_estimators=100, max_features=150, 
    max_depth=8, min_samples_split=3, n_jobs=1) 

rnd_clf2 = RandomForestClassifier(
    random_state=42, n_estimators=100, max_features=300, 
    max_depth=8, min_samples_split=3, n_jobs=1)
    
ada_clf = AdaBoostClassifier(
    random_state=42, n_estimators=10,
    learning_rate=0.001)

gbc1a = GradientBoostingClassifier(
    random_state=42, n_estimators=50, max_features=None, 
    max_depth=1, min_samples_split=2)

gbc2a = GradientBoostingClassifier(
    random_state=42, n_estimators=100, max_features=300, 
    max_depth=8, min_samples_split=5)

gbc2c = GradientBoostingClassifier(
    random_state=42, n_estimators=50, max_features=300, 
    max_depth=3, min_samples_split=5)

In [6]:
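def eval_classifier(clf, sets):
    """Fit clf on the training split and report metrics on the test split."""
    X_train, y_train, X_test, y_test = sets
    # custom helper; judging by the output below, it reports the elapsed fit time
    E.fit_clf(clf, X_train, y_train)
    y_pred = clf.predict(X_test)
    # custom helper; prints the confusion matrix plus acc, tpr and tnr
    E.eval_clf(y_test, y_pred)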

In [7]:
sets = X_train, y_train, X_test, y_test

In [8]:
eval_classifier(rnd_clf1, sets)

Elapsed: 1m 1s
          pred_neg  pred_pos
cond_neg      1433         9
cond_pos         7       223
acc: 0.9904
tpr: 0.9696
tnr: 0.9938
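
The `custom.evaluate_models` module is not shown here. The sketch below assumes the three reported metrics follow the usual definitions and recomputes them from the confusion matrix above. `summarize_confusion` is a hypothetical name, not the module's API.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Accuracy, sensitivity (tpr) and specificity (tnr) from the four counts."""
    total = tn + fp + fn + tp
    return {
        "acc": (tp + tn) / total,  # share of all messages classified correctly
        "tpr": tp / (tp + fn),     # share of spam that was caught
        "tnr": tn / (tn + fp),     # share of ham that was left alone
    }

# Counts taken from the rnd_clf1 confusion matrix above
metrics = summarize_confusion(tn=1433, fp=9, fn=7, tp=223)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These values agree with the printed `acc`, `tpr` and `tnr`, which suggests that `spam` (label 1) is the positive class in the helper's output.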


In [9]:
eval_classifier(rnd_clf2, sets)

Elapsed: 0m 56s
          pred_neg  pred_pos
cond_neg      1432        10
cond_pos         7       223
acc: 0.9898
tpr: 0.9696
tnr: 0.9931


In [10]:
eval_classifier(ada_clf, sets)

Elapsed: 0m 3s
          pred_neg  pred_pos
cond_neg      1432        10
cond_pos         8       222
acc: 0.9892
tpr: 0.9652
tnr: 0.9931


In [11]:
eval_classifier(gbc1a, sets)

Elapsed: 0m 14s
          pred_neg  pred_pos
cond_neg      1431        11
cond_pos         7       223
acc: 0.9892
tpr: 0.9696
tnr: 0.9924


In [12]:
eval_classifier(gbc2a, sets)

Elapsed: 1m 18s
          pred_neg  pred_pos
cond_neg      1431        11
cond_pos         6       224
acc: 0.9898
tpr: 0.9739
tnr: 0.9924


In [13]:
eval_classifier(gbc2c, sets)

Elapsed: 0m 15s
          pred_neg  pred_pos
cond_neg      1433         9
cond_pos         7       223
acc: 0.9904
tpr: 0.9696
tnr: 0.9938
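
To compare the six runs at a glance, the confusion-matrix counts printed above can be collected into one table. This is only a convenience sketch. The counts are copied from the outputs, and the metric formulas are the standard definitions.

```python
import pandas as pd

# (tn, fp, fn, tp) copied from the confusion matrices printed above
counts = {
    "rnd_clf1": (1433, 9, 7, 223),
    "rnd_clf2": (1432, 10, 7, 223),
    "ada_clf": (1432, 10, 8, 222),
    "gbc1a": (1431, 11, 7, 223),
    "gbc2a": (1431, 11, 6, 224),
    "gbc2c": (1433, 9, 7, 223),
}
summary = pd.DataFrame.from_dict(
    counts, orient="index", columns=["tn", "fp", "fn", "tp"])

total = summary[["tn", "fp", "fn", "tp"]].sum(axis=1)
summary["acc"] = (summary["tn"] + summary["tp"]) / total
summary["tpr"] = summary["tp"] / (summary["tp"] + summary["fn"])
summary["tnr"] = summary["tn"] / (summary["tn"] + summary["fp"])

# Rank by sensitivity, the metric the modeling phase focused on
print(summary.sort_values("tpr", ascending=False).round(4))
```

The differences amount to one or two messages out of 1672, so a ranking based on a single split should be read with caution.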


---