# Cross-Validation and Hyperparameter Tuning

## Goal: Improve AUC on precision-emphasized models


- format:
    - build parameter grid
    - build model
    - search
    - show metrics of solution
    
    
- Models:
    - DT
    - RF
        - w/ and w/out bagging
    - LR
    - SVM
        - lin and poly


### Model - Parameter Breakdown
    
#### Decision Tree
    - max_depth
    - min_samples_split
    - min_samples_leaf
    - max_features

#### LogReg
    - penalty (l1, l2)
    - C
    - solver (liblinear, saga)
    
    
#### Random Forest (less runs)
    - n_estimators
    - min_samples_split
    - min_samples_leaf
    - max_depth
    - bootstrap (bool)
    

#### KNN (not tuned)
    - n_neighbors
    - weights (uniform (default), distance)


In [112]:
#import modules

#SKLearn
import sklearn.model_selection as sk
import sklearn.metrics as m
from sklearn import ensemble as e
from sklearn import svm, linear_model, cluster
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#Else
import sys
import pandas as pd
import math
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('default')

#full dfs
%store -r df 
%store -r scaled_df
%store -r fe_df
%store -r df_upsampled

%store -r x_tr
%store -r y_tr
%store -r x_te
%store -r y_te

%store -r xs_tr
%store -r ys_tr
%store -r xs_te
%store -r ys_te

%store -r x_fetr
%store -r y_fetr
%store -r x_fete
%store -r y_fete
%store -r xs_fetr
%store -r ys_fetr
%store -r xs_fete
%store -r ys_fete

%store -r x_uptr
%store -r y_uptr
%store -r x_upte
%store -r y_upte
%store -r xs_uptr
%store -r ys_uptr
%store -r xs_upte
%store -r ys_upte

%store -r cols
%store -r fe_cols
%store -r testsizepercent
%store -r rs


#Tuning Parameters
crossval = 3 #Stratified by default when target is binary
permsz = 100 #count of permutations to test
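
The note on `crossval = 3` can be checked directly. A minimal sketch, assuming a synthetic imbalanced target (the `y` and `X` below are made up for illustration and are not the notebook's data); `check_cv` is the scikit-learn helper that resolves an integer `cv` for a classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Synthetic binary target (assumption, illustration only): 90 negatives, 30 positives.
y = np.array([0] * 90 + [1] * 30)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are built

# Resolve an integer cv for a classifier with a binary target.
splitter = check_cv(cv=3, y=y, classifier=True)
print(type(splitter).__name__)

fold_counts = []
for train_idx, val_idx in splitter.split(X, y):
    # Class counts in each validation fold; stratification preserves the 3:1 ratio.
    fold_counts.append(np.bincount(y[val_idx]).tolist())
print(fold_counts)
```

With stratified folds, every validation split keeps the same class ratio as the full target, which matters when the positive class is rare.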

# Decision Tree

In [113]:
######## ---------- Decision Tree

# Build Grid

# Number of features to consider at every split
# ('auto' is an alias for 'sqrt' in classifiers; deprecated in scikit-learn 1.1, removed in 1.3)
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 9)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split =  [25, 40, 60, 90, 120, 200]

# Minimum number of samples required at each leaf node
min_samples_leaf = [25, 40, 60, 90, 120, 175]



dt_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
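
The size of this grid determines how many fits an exhaustive search performs. A minimal sketch of the arithmetic, with the values copied from the cell above and written without NumPy (the names `n_candidates`, `n_folds`, and `n_fits` are illustrative):

```python
from itertools import product
from math import prod

# np.linspace(10, 110, num=9) steps by 12.5; int() truncates each value.
max_depth = [int(10 + 12.5 * i) for i in range(9)] + [None]

dt_grid = {
    "max_features": ["auto", "sqrt"],
    "max_depth": max_depth,
    "min_samples_split": [25, 40, 60, 90, 120, 200],
    "min_samples_leaf": [25, 40, 60, 90, 120, 175],
}

# Number of candidates is the product of the list lengths.
n_candidates = prod(len(values) for values in dt_grid.values())
n_folds = 3  # crossval
n_fits = n_candidates * n_folds

# Enumerating the combinations gives the same count.
assert n_candidates == sum(1 for _ in product(*dt_grid.values()))

print(max_depth)
print(n_candidates, n_fits)
```

The result lines up with the "Fitting 3 folds for each of 720 candidates, totalling 2160 fits" message in the search output.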


In [114]:
#Decision Tree
dt_s = DecisionTreeClassifier(criterion='entropy').fit(x_tr, y_tr)
dt_fe = DecisionTreeClassifier(criterion='entropy').fit(x_fetr, y_fetr)
dt_up = DecisionTreeClassifier(criterion='entropy').fit(x_uptr, y_uptr)

In [115]:
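# Exhaustive grid search (GridSearchCV tries every combination in dt_grid)
# No `scoring` is passed, so candidates are ranked by accuracy, the classifier default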
dt_s_cv = sk.GridSearchCV(estimator = dt_s, 
                                param_grid= dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_tr, y_tr)

dt_fe_cv = sk.GridSearchCV(estimator = dt_fe, 
                                param_grid= dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_fetr, y_fetr)

dt_up_cv = sk.GridSearchCV(estimator = dt_up, 
                                param_grid = dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_uptr, y_uptr)

Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 171 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 577 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 1143 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done 1873 tasks      | elapsed:   29.9s
[Parallel(n_jobs=-1)]: Done 2145 out of 2160 | elapsed:   35.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:   35.3s finished


Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:   22.8s
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:   32.0s
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:   42.4s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:   47.7s finished


Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:   28.5s
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:   41.1s
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:   56.2s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:  1.0min finished


In [116]:
print("DT - s:", dt_s_cv.best_params_)
print("DT - fe:", dt_fe_cv.best_params_)
print("DT - up:", dt_up_cv.best_params_)

DT - s: {'max_depth': 110, 'max_features': 'auto', 'min_samples_leaf': 25, 'min_samples_split': 60}
DT - fe: {'max_depth': 22, 'max_features': 'auto', 'min_samples_leaf': 25, 'min_samples_split': 40}
DT - up: {'max_depth': 60, 'max_features': 'auto', 'min_samples_leaf': 25, 'min_samples_split': 25}
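
The "show metrics of solution" step from the format list could look like the sketch below. It is an illustration only: the data is synthetic (`make_classification` stands in for `x_tr` / `y_tr`), `demo_grid` is a small hypothetical grid, and `scoring="roc_auc"` is an assumption that ties the search to the AUC goal rather than something the cells above do.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the notebook's stored splits are not available here.
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small hypothetical grid, not the grid used above.
demo_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [25, 60]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid=demo_grid,
    cv=3,
    scoring="roc_auc",  # rank candidates by AUC instead of the default accuracy
    n_jobs=1,
).fit(X_train, y_train)

# best_estimator_ is refit on the whole training split (refit=True by default).
best = search.best_estimator_
test_auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
test_precision = precision_score(y_test, best.predict(X_test), zero_division=0)

print("best params:", search.best_params_)
print("CV AUC:", round(search.best_score_, 3))
print("test AUC:", round(test_auc, 3))
print("test precision:", round(test_precision, 3))
```

Reporting the held-out AUC and precision next to `best_params_` makes it possible to compare the tuned model against the untuned baseline on the metrics named in the goal.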


# Logistic Regression Tuning

In [117]:
######## ---------- Logistic Regression

# Build Grid

# Inverse of regularization strength (smaller C = stronger regularization)
c = [.0001, .001, .005, .01, .05, .1, 1, 10, 20]

# Regularization penalty
pen = ['l1', 'l2']

# Solvers that support both the l1 and l2 penalties
solve=['liblinear', 'saga']


lr_grid = {'C': c,
               'penalty': pen,
               'solver': solve}
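
For reference, the role of `C` and `penalty` in this grid can be written out. The block below is a sketch of the binary objective as scikit-learn documents it, assuming labels $y_i \in \{-1, +1\}$ and an intercept written here as $b$ (notation chosen for this note):

```latex
\min_{w,\,b}\; r(w) + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^{\top} w + b)\right)\right),
\qquad
r(w) =
\begin{cases}
\lVert w \rVert_1 & \text{penalty = l1} \\
\tfrac{1}{2}\lVert w \rVert_2^2 & \text{penalty = l2}
\end{cases}
```

Because `C` multiplies the data-fit term, small values such as 0.0001 let the penalty dominate (strong regularization), while large values such as 20 let the fit dominate.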


In [118]:
lr_s = linear_model.LogisticRegression(max_iter=500).fit(xs_tr, ys_tr)

lr_fe = linear_model.LogisticRegression(max_iter=500).fit(xs_fetr, ys_fetr)

lr_up = linear_model.LogisticRegression(max_iter=500).fit(xs_uptr, ys_uptr)



In [121]:
# Exhaustive grid search (GridSearchCV tries every combination in lr_grid)
lr_s_cv = sk.GridSearchCV(estimator = lr_s, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_tr, ys_tr)

lr_fe_cv = sk.GridSearchCV(estimator = lr_fe, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_fetr, ys_fetr)

lr_up_cv = sk.GridSearchCV(estimator = lr_up, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_uptr, ys_uptr)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:    2.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   24.3s finished


Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:    2.9s finished


In [122]:
print("LR - s:", lr_s_cv.best_params_)
print("LR - fe:", lr_fe_cv.best_params_)
print("LR - up:", lr_up_cv.best_params_)

LR - s: {'C': 0.05, 'penalty': 'l1', 'solver': 'liblinear'}
LR - fe: {'C': 20, 'penalty': 'l2', 'solver': 'liblinear'}
LR - up: {'C': 0.005, 'penalty': 'l2', 'solver': 'saga'}


# Random Forest Tuning

-- super slow, leaving out pas

In [123]:
######## ---------- Random Forest

# Build Grid

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 25, stop = 300, num = 8)]

# Method of selecting samples for training each tree
bootstrap = [True, False]


rf_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [124]:
# Build Model
rf_s = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(x_tr, y_tr)

rf_fe = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(x_fetr, y_fetr)

rf_up = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(x_uptr, y_uptr)

In [126]:
# Randomized search: samples permsz candidates from rf_grid instead of trying them all
rf_s_cv = sk.RandomizedSearchCV(estimator = rf_s, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_tr, y_tr)

print("RF - s:", rf_s_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   35.7s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.4min finished


RF - s: {'n_estimators': 64, 'min_samples_split': 25, 'min_samples_leaf': 25, 'max_features': 'auto', 'max_depth': 110, 'bootstrap': False}


In [127]:
rf_fe_cv = sk.RandomizedSearchCV(estimator = rf_fe, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_fetr, y_fetr)

print("RF - fe:", rf_fe_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  6.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 11.3min finished


RF - fe: {'n_estimators': 221, 'min_samples_split': 60, 'min_samples_leaf': 25, 'max_features': 'sqrt', 'max_depth': 85, 'bootstrap': False}


In [128]:
rf_up_cv = sk.RandomizedSearchCV(estimator = rf_up, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(x_uptr, y_uptr)

print("RF - up:", rf_up_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 10.5min finished


RF - up: {'n_estimators': 221, 'min_samples_split': 25, 'min_samples_leaf': 25, 'max_features': 'auto', 'max_depth': 72, 'bootstrap': False}
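
How much of `rf_grid` do 100 random candidates cover? A minimal sketch of the arithmetic, with the grid values written out by hand (the `n_estimators` list is the truncated `np.linspace(25, 300, num=8)` result). The sampling step uses Python's `random` module as a stand-in and is not scikit-learn's implementation; `grid_size`, `coverage`, and `candidates` are illustrative names.

```python
import random
from itertools import product
from math import prod

max_depth = [int(10 + 12.5 * i) for i in range(9)] + [None]

rf_grid = {
    "n_estimators": [25, 64, 103, 142, 182, 221, 260, 300],
    "max_features": ["auto", "sqrt"],
    "max_depth": max_depth,
    "min_samples_split": [25, 40, 60, 90, 120, 200],
    "min_samples_leaf": [25, 40, 60, 90, 120, 175],
    "bootstrap": [True, False],
}

grid_size = prod(len(values) for values in rf_grid.values())
n_iter = 100  # permsz
coverage = n_iter / grid_size

# Stand-in for the sampling step: draw n_iter distinct combinations with a fixed seed.
rng = random.Random(0)
all_combos = list(product(*rf_grid.values()))
sampled = rng.sample(all_combos, n_iter)
candidates = [dict(zip(rf_grid, combo)) for combo in sampled]

print(grid_size, f"{coverage:.2%}")
print(candidates[0])
```

Each run therefore explores under 1% of the grid, so the reported best parameters can change between runs; passing a fixed `random_state` to `RandomizedSearchCV` would make the sampled candidates repeatable.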
