# Random Forests 4

---

__This Notebook__


- *original goal:*
    - evaluate the results of the last batch of random forest grid searches
    - narrow down search space and conduct final and more robust search

- *results:*
    - unexpectedly low sensitivity; my early attempts achieved 93% sensitivity, while the later, more comprehensive searches maxed out at 89%

Since something clearly went wrong, this notebook turned into a troubleshooting and debugging session.

__The Issue__

- *MinMaxScaler()*:
    - in __Notebook 6: Dimensionality Reduction__ I created a custom `performSVD` function that scaled results to remove negative values from the $V$ matrix so I could run a logistic classifier and "test the waters" with some quick modeling
    - I had no idea that scaling the $V$ matrix would have such a large impact on the performance of random forests, in both runtime and sensitivity

Below I compare hyperparameters and representations and find that scaling is the culprit after all.
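
As a rough illustration of one way rescaling can change anything built on cosine similarity, here is a sketch on made-up vectors, not the notebook's data. The helpers `cosine` and `min_max_scale_columns` are hypothetical and only mimic column-wise min-max scaling in plain Python. Shifting every column into [0, 1] moves the origin, so the angle between two rows changes. This is only one possible route for the effect; it is not a diagnosis of the pipeline used here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def min_max_scale_columns(rows):
    """Rescale each column to [0, 1] (assumes no constant columns)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(x - lo) / span for x, lo, span in zip(row, lows, spans)]
            for row in rows]

# toy "document embeddings" with negative entries (made-up values)
V = [[ 1.0,  0.5],
     [-1.0,  0.5],
     [ 0.0, -0.5]]

V_scaled = min_max_scale_columns(V)

before = cosine(V[0], V[1])               # rows point in nearly opposite directions
after = cosine(V_scaled[0], V_scaled[1])  # same rows after scaling
print(before, after)
```

In this toy case the first two rows go from a negative similarity to a clearly positive one, even though no individual column changed its ordering.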

__Results__ 

- the newer grid search yields a model that achieves 96% sensitivity; I can now narrow down and push the envelope a bit


---

## Setup & Load

These are the results of the grid searches in the previous 3 notebooks (__12_RandomForests3.1__, __3.2__, and __3.3__).

In [1]:
import re
import os
import time
import json
import joblib 

import numpy as np
import pandas as pd

from datetime import datetime

# print today's date as YYYY-MM-DD
day = datetime.now().strftime('%Y-%m-%d')
print('Revised on: ' + day)

Revised on: 2021-01-17


In [2]:
%%capture

# load grid searches
mod_path = os.path.join("data","3_modeling")

gridsearch_names = ['01052021_rf_gridsearches.joblib',
                    '01062021_rf_gridsearches_1.joblib',
                    '01062021_rf_gridsearches_2.joblib',
                    '01062021_rf_gridsearches_3.joblib']

gridsearches = []
for name in gridsearch_names:
    filepath = os.path.join(mod_path, name)
    gridsearches.append(joblib.load(filepath))
        
gridsearches = [item for sublist in gridsearches for item in sublist]        

In [3]:
def extract_df(dic):
    df = pd.concat([
                    pd.DataFrame({'representation':[dic['representation']] \
                                * len(dic['gridsearch_res'].cv_results_["params"])}),
                    pd.DataFrame(dic['gridsearch_res'].cv_results_["params"]),
                    pd.DataFrame(dic['gridsearch_res'].cv_results_["mean_test_acc"], 
                                 columns=["mean_val_acc"]),
                    pd.DataFrame(dic['gridsearch_res'].cv_results_["mean_test_tpr"], 
                                 columns=["mean_val_tpr"]),
                    pd.DataFrame(dic['gridsearch_res'].cv_results_["mean_test_tnr"], 
                                 columns=["mean_val_tnr"])
                    ], axis=1)
    return df

# create list of dfs
df_list = []
for ix, dic in enumerate(gridsearches):
    df_list.append(extract_df(dic))

# flatten and reindex
dfm = pd.concat(df_list)
dfm.index = range(len(dfm))

# sort by top mean validation sensitivity
top_tpr = dfm.sort_values(by=['mean_val_tpr'], ascending=False).iloc[:6,:].copy()
top_tpr

| | representation | max_depth | max_features | min_samples_split | n_estimators | mean_val_acc | mean_val_tpr | mean_val_tnr |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 858 | X_bot | 20 | 500 | 5 | 100 | 0.981544 | 0.894534 | 0.994876 |
| 853 | X_bot | 20 | 250 | 5 | 200 | 0.984277 | 0.894467 | 0.998027 |
| 852 | X_bot | 20 | 250 | 5 | 100 | 0.982911 | 0.891835 | 0.996846 |
| 860 | X_bot | 20 | 500 | 10 | 100 | 0.980517 | 0.889339 | 0.994479 |
| 859 | X_bot | 20 | 500 | 5 | 200 | 0.981543 | 0.889339 | 0.995663 |
| 862 | X_bot | 20 | 500 | 15 | 100 | 0.980859 | 0.886775 | 0.995269 |
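
For reference, the three validation columns are accuracy, sensitivity (true positive rate, with spam as the positive class) and specificity (true negative rate). A minimal sketch of how they follow from confusion counts, using a hypothetical helper `binary_rates` and made-up labels rather than the scorers used in the grid searches:

```python
def binary_rates(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels, 1 = spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# made-up labels: 4 spam, 6 ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

acc, tpr, tnr = binary_rates(y_true, y_pred)
print(acc, tpr, tnr)
```

Because ham dominates the data, accuracy can stay high while sensitivity lags, which is the pattern in the table above (accuracy near 0.98, sensitivity near 0.89).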


## Rerunning old grid search

Re-run `scikitlearn_cv`, `collect_cvs`, and `build_random_forests` (my first "gridsearch_wrapper") in an attempt to identify why the mean validation true positive rates from my new grid searches are worse than those from my first attempts at building random forests.

The results below confirm that the old random forest hyperparameters were not what made the difference; the issue is with the representations.

In [4]:
import scipy.sparse as sp

# load target
raw_path = os.path.join("data","1_raw")
filename = "y_train.csv"
y = pd.read_csv(os.path.join(raw_path, filename))
y = np.array(y.iloc[:,0].ravel())
y[y=='ham'] = 0
y[y=='spam'] = 1
y = y.astype('int')

# load the two scaled SVD + cosine similarity matrices
proc_dir = os.path.join("data","2_processed")
Xnames = ['X_bot_svd_cos.npz', 'X_bot_tfidf_svd_cos.npz']
Xs = []
for name in Xnames:
    path_ = os.path.join(proc_dir, name)
    Xs.append(sp.load_npz(path_))

In [5]:
# takes 2h 45m
#import custom.old_gridsearch as og
#gridsearch = og.build_random_forests(Xs, 
#                                     Xnames,
#                                     y,
#                                     cv_seed=423, 
#                                     rf_seed=514,
#                                     mtry_=[50, 100, 250],
#                                     trees=500, 
#                                     max_leaf_nodes=99,
#                                     cv=10)

| representation | mean accuracy | mean sensitivity | mean specificity | mtry | elapsed (secs.)|
|:--------------------------|:---------:|:---------:|:---------:|:-----:|:---------:|
| X_bot_svd_cos.npz     	| 0.9746	| 0.8451	| 0.9944	| 50	| 707.2		|
| X_bot_tfidf_svd_cos.npz	| 0.9787	| 0.8643	| 0.9962	| 50	| 553.3		|
| X_bot_svd_cos.npz			| 0.9736	| 0.8490	| 0.9926	| 100	| 1374.2	|
| X_bot_tfidf_svd_cos.npz	| 0.9797	| 0.8798	| 0.9950	| 100	| 1063.6	|
| X_bot_svd_cos.npz			| 0.9713	| 0.8432	| 0.9908	| 250	| 3399.1	|
| X_bot_tfidf_svd_cos.npz	| 0.9797	| 0.8855	| 0.9941	| 250	| 2772.7	|

---

The old grid search run above confirms that it is not the random forest hyperparameters but something else driving the lower mean validation sensitivity.


## Rerunning original representations 

Here I step back to earlier in the project, before the mistake of scaling the SVD output in Notebook 9.

In [6]:
import urlextract
from nltk.stem import WordNetLemmatizer

def load_data(data):
    raw_path = os.path.join("data","1_raw")
    filename = ''.join([data, ".csv"])
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    out_arr = np.array(out_dfm.iloc[:,0].ravel())
    return out_arr

X_train = load_data("X_train")
y_train = load_data("y_train")

y = y_train.copy()

# transform y_array into int type
y[y=='ham'] = 0
y[y=='spam'] = 1
y = y.astype('int')

# load contractions map for custom cleanup
with open("contractions_map.json") as f:
    contractions_map = json.load(f)

In [7]:
import custom.clean_preprocess as cp

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('counter', cp.DocumentToNgramCounterTransformer(n_grams=3)),
                 ('bot', cp.WordCounterToVectorTransformer(vocabulary_size=2000)),
                 ('tfidf', TfidfTransformer(sublinear_tf=True))
                ])

X_counter = pipe['counter'].fit_transform(X_train)
X_bot = pipe['bot'].fit_transform(X_counter)
X_tfidf = pipe.fit_transform(X_train)

In [8]:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip

def perform_SVD(X, n_components=300):
    
    X_array = X.astype(np.float64)  # svds requires a floating-point matrix
    U, Sigma, VT = svds(X_array.T, # term-document matrix
                        k=n_components)
    # reverse outputs
    Sigma = Sigma[::-1]
    U, VT = svd_flip(U[:, ::-1], VT[::-1])
    
    # return V 
    V = VT.T
    return V # do not scale

X_svd_bot = perform_SVD(X_bot)
X_svd_tfidf = perform_SVD(X_tfidf)

In [9]:
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

X_cossim_svd_bot = cosine_similarity(X_svd_bot)
X_cossim_svd_tfidf = cosine_similarity(X_svd_tfidf)

train_df = pd.DataFrame({'sms':X_train, 'target':y_train})

# get spam indexes
spam_ix = train_df.loc[train_df['target']=='spam'].index

# calculate average spam similarity on SVD
mean_spam_sims_bot, mean_spam_sims_tfidf = [], []

for ix in range(X_cossim_svd_bot.shape[0]):
    mean_spam_sims_bot.append(np.mean(X_cossim_svd_bot[ix, spam_ix]))
    mean_spam_sims_tfidf.append(np.mean(X_cossim_svd_tfidf[ix, spam_ix]))

X_bot_cossim_bot = sp.hstack((csr_matrix(mean_spam_sims_bot).T, X_svd_bot)) 
X_tfidf_cossim_tfidf = sp.hstack((csr_matrix(mean_spam_sims_tfidf).T, X_svd_tfidf)) 
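
The extra column built above is each message's mean cosine similarity to the spam messages in the training set. A small pure-Python sketch of that feature on made-up vectors (the helper names `cosine` and `mean_spam_similarity` are illustrative and not part of the notebook):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mean_spam_similarity(rows, spam_ix):
    """For each row, average its cosine similarity to the rows at spam_ix."""
    return [sum(cosine(row, rows[j]) for j in spam_ix) / len(spam_ix)
            for row in rows]

# made-up 2-d "embeddings"; rows 0 and 2 play the role of spam
rows = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 1.0]]
spam_ix = [0, 2]

feature = mean_spam_similarity(rows, spam_ix)

# prepend the feature as a new first column, mirroring the hstack step
augmented = [[f] + row for f, row in zip(feature, rows)]
print(augmented)
```

As in the loop above, a spam row's similarity to itself (1.0) is part of its own average, and the feature depends on the training labels.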

In [10]:
Xs = [X_bot_cossim_bot, X_tfidf_cossim_tfidf]
Xnames = ['X_bot_cossim_bot', 'X_tfidf_cossim_tfidf']

In [11]:
# takes 42m (compared to 2h 45m, but cv=5)
#gridsearch = og.build_random_forests(Xs, 
#                                     Xnames,
#                                     y,
#                                     cv_seed=423,
#                                     rf_seed=514,
#                                     mtry_=[50, 100, 250],
#                                     trees=500, 
#                                     max_leaf_nodes=99, 
#                                     cv=5)


| representation | mean accuracy | mean sensitivity | mean specificity | mtry | elapsed (secs.)|
|:--------------------------|:---------:|:---------:|:---------:|:-----:|:---------:|
| X_bot_cossim_bot		    | 0.9874	| 0.9265	| 0.9967	| 50    | 295.3	    |
| X_tfidf_cossim_tfidf	    | 0.9872	| 0.9342	| 0.9953	| 50    | 180.7	    |
| X_bot_cossim_bot		    | 0.9872	| 0.9362	| 0.9950	| 100   | 340.3	    |
| X_tfidf_cossim_tfidf	    | 0.9874	| 0.9361	| 0.9953	| 100   | 309.7	    |
| X_bot_cossim_bot		    | 0.9846	| 0.9226	| 0.9941	| 250   | 716.7	    |
| X_tfidf_cossim_tfidf	    | 0.9872	| 0.9380	| 0.9947	| 250   | 699.0	    |

 - Mean validation sensitivity is a lot higher: roughly 0.92 to 0.94 here, versus 0.84 to 0.89 with the scaled representations.
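
To put a number on the gap, the sensitivities from the two tables can be subtracted row by row. The values below are copied from the tables in this notebook. Pairing the scaled and unscaled runs under the short keys `"bot"` and `"tfidf"` is my own labeling, and the scaled run used cv=10 while the unscaled run used cv=5, so the comparison is indicative only.

```python
# mean validation sensitivity keyed by (representation family, mtry)
scaled = {
    ("bot", 50): 0.8451, ("tfidf", 50): 0.8643,
    ("bot", 100): 0.8490, ("tfidf", 100): 0.8798,
    ("bot", 250): 0.8432, ("tfidf", 250): 0.8855,
}
unscaled = {
    ("bot", 50): 0.9265, ("tfidf", 50): 0.9342,
    ("bot", 100): 0.9362, ("tfidf", 100): 0.9361,
    ("bot", 250): 0.9226, ("tfidf", 250): 0.9380,
}

# gain in sensitivity from dropping the scaling step
gains = {key: round(unscaled[key] - scaled[key], 4) for key in scaled}
for key in sorted(gains):
    print(key, gains[key])
```

Every pairing gains more than five percentage points of sensitivity once the scaling step is removed.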

## Try new grid search wrapper with non-scaled representations

Finally, I check that the newer grid search performs at least equally well and confirm that the problem was the scaled SVD after all.

In [12]:
import custom.new_gridsearch as ng

In [16]:
old_params = {
    'max_features':[50, 100, 250],
    'n_estimators':[500], 
    'max_leaf_nodes':[99]
}

# takes around X min
gridsearch_old = ng.gridsearch_wrapper(Xs, Xnames, y, old_params, k=10)

In [None]:
new_params = {
    'n_estimators' : [100, 250, 500],   # trees
    'max_features': [50, 75, 100, 125], # mtry
    'max_depth': [10, 20, 30],
    'min_samples_split': [5, 10, 15]
}

# takes around Y min
gridsearch_new = ng.gridsearch_wrapper(Xs, Xnames, y, new_params, k=10)

96% sensitivity?

In [None]:
# check save path
save_path = os.path.join(mod_path, "".join(["01092021_RERUN", "_rf_gridsearches.joblib"]))
save_path

In [None]:
# persist the new grid search results
joblib.dump(gridsearch_new, save_path)
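
For a sense of how much work each search implies, the number of hyperparameter combinations is the product of the list lengths in the grid. A rough sketch with `itertools.product`, assuming every combination is tried on each of the two representations with k folds (how `gridsearch_wrapper` actually schedules its fits is not shown in this notebook, and `count_candidates` is a hypothetical helper):

```python
from itertools import product

def count_candidates(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return sum(1 for _ in product(*param_grid.values()))

# grids copied from the cells above
old_params = {
    'max_features': [50, 100, 250],
    'n_estimators': [500],
    'max_leaf_nodes': [99],
}
new_params = {
    'n_estimators': [100, 250, 500],
    'max_features': [50, 75, 100, 125],
    'max_depth': [10, 20, 30],
    'min_samples_split': [5, 10, 15],
}

k = 10                 # folds
n_representations = 2  # assumed: one search per representation

for name, grid in [("old", old_params), ("new", new_params)]:
    n = count_candidates(grid)
    print(name, n, "candidates,", n * k * n_representations, "fits")
```

Under these assumptions the new grid has 36 times as many candidates as the old one, although each forest is smaller on average because `n_estimators` is no longer fixed at 500.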

---