# Cross-Validation and Hyperparameter Tuning

## Goal: Improve AUC on precision-emphasized models


- format:
    - build parameter grid
    - build model
    - search
    - show metrics of solution
    
    
- Models:
    - DT
    - RF
        - w/ and w/out bagging
    - LR
    - SVM
        - lin and poly
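
The four steps in the format list can be sketched end to end as below. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's stored splits. The names `demo_grid`, `demo_model`, and `demo_search`, and the tiny grid, are assumptions chosen so the example runs in a few seconds.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the stored train/test splits.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 1. Build parameter grid (deliberately tiny).
demo_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [5, 25]}

# 2. Build model.
demo_model = DecisionTreeClassifier(criterion="entropy", random_state=0)

# 3. Search.
demo_search = GridSearchCV(demo_model, demo_grid, cv=3).fit(X_tr, y_tr)

# 4. Show metrics of the solution on the held-out split.
demo_auc = roc_auc_score(y_te, demo_search.predict_proba(X_te)[:, 1])
demo_precision = precision_score(y_te, demo_search.predict(X_te))
print(demo_search.best_params_)
print("test AUC:", round(demo_auc, 3), "test precision:", round(demo_precision, 3))
```

Because `refit=True` is the default, the search object itself can be used for `predict` and `predict_proba` after fitting.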


### Model - Parameter Breakdown
    
#### Decision Tree
- max_depth
- min_samples_split
- min_samples_leaf
- max_features

#### LogReg
- penalty (l1, l2)
- C
- solver (liblinear, saga)
    
    
#### Random Forest (fewer runs)
- n_estimators
- min_samples_split
- min_samples_leaf
- max_depth
- bootstrap (bool)
    

#### KNN (not tuned)
- n_neighbors
- weights (uniform (default), distance)


In [7]:
#import modules

#SKLearn
import sklearn.model_selection as sk
import sklearn.metrics as m
from sklearn import ensemble as e
from sklearn import svm, linear_model, cluster
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#Else
import sys
import pandas as pd
import math
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('default')

#full dfs
%store -r df 
%store -r scaled_df
%store -r fe_df
%store -r df_upsampled

%store -r x_tr
%store -r y_tr
%store -r x_te
%store -r y_te

%store -r xs_tr
%store -r ys_tr
%store -r xs_te
%store -r ys_te

%store -r x_fetr
%store -r y_fetr
%store -r x_fete
%store -r y_fete
%store -r xs_fetr
%store -r ys_fetr
%store -r xs_fete
%store -r ys_fete

%store -r x_uptr
%store -r y_uptr
%store -r x_upte
%store -r y_upte
%store -r xs_uptr
%store -r ys_uptr
%store -r xs_upte
%store -r ys_upte

%store -r cols
%store -r fe_cols
%store -r testsizepercent
%store -r rs


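#Tuning Parameters
crossval = 3 #number of CV folds; an integer cv is stratified when the estimator is a classifier and the target is binary or multiclass
permsz = 100 #number of parameter combinations sampled by RandomizedSearchCV (n_iter)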
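
The stratification behavior behind `crossval = 3` can be checked with `check_cv`, the scikit-learn helper that turns an integer `cv` into a splitter object. The sketch below uses a made-up target with a 3:1 class ratio; `y_demo` and `X_demo` are illustrative names, not notebook variables.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Made-up binary target: 90 negatives, 30 positives.
y_demo = np.array([0] * 90 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))

# classifier=True mirrors what the search does when given a classifier.
splitter = check_cv(3, y_demo, classifier=True)
print(type(splitter).__name__)

# Each validation fold should keep roughly the overall positive rate.
fold_rates = [y_demo[val_idx].mean() for _, val_idx in splitter.split(X_demo, y_demo)]
print(fold_rates)
```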

# Decision Tree

In [8]:
######## ---------- Decision Tree

# Build Grid

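# Number of features to consider at every split
# ('auto' means the same as 'sqrt' for classifiers; it was deprecated in scikit-learn 1.1 and removed in 1.3)
max_features = ['auto', 'sqrt']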

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 9)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split =  [25, 40, 60, 90, 120, 200]

# Minimum number of samples required at each leaf node
min_samples_leaf = [25, 40, 60, 90, 120, 175]



dt_grid = {'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}


In [9]:
max_depth

[10, 22, 35, 47, 60, 72, 85, 97, 110, None]
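
The grid size follows directly from the list lengths. The arithmetic below uses only the standard library and values copied from the cells above. It also shows why the `max_depth` values are uneven: `np.linspace(10, 110, num=9)` steps by 12.5, and `int()` truncates the halves.

```python
from math import prod

# Nine evenly spaced depths from 10 to 110, truncated the way int() does.
depths = [int(10 + i * 12.5) for i in range(9)]
print(depths)

# Number of options per parameter in dt_grid.
dt_grid_sizes = {
    "max_features": 2,
    "max_depth": len(depths) + 1,  # plus None
    "min_samples_split": 6,
    "min_samples_leaf": 6,
}

n_candidates = prod(dt_grid_sizes.values())
n_fits = n_candidates * 3  # crossval = 3
print(n_candidates, n_fits)
```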

In [10]:
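#Decision Tree
# Base estimators for the searches. The search clones its estimator before fitting each candidate,
# so the .fit() calls here are not required by the search itself.
dt_s = DecisionTreeClassifier(criterion='entropy').fit(xs_tr, ys_tr)
dt_fe = DecisionTreeClassifier(criterion='entropy').fit(xs_fetr, ys_fetr)
dt_up = DecisionTreeClassifier(criterion='entropy').fit(xs_uptr, ys_uptr)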
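
A small check of what the search objects inherit from these pre-fitted estimators: scikit-learn's `clone` copies constructor parameters only, so the fitted tree is dropped while `criterion='entropy'` carries over. The data here is synthetic and the names `fitted` and `fresh` are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

fitted = DecisionTreeClassifier(criterion="entropy").fit(X_demo, y_demo)
fresh = clone(fitted)  # same hyperparameters, no learned state

print(hasattr(fitted, "tree_"), hasattr(fresh, "tree_"))
print(fresh.get_params()["criterion"])
```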

In [11]:
# Grid Search
dt_s_cv = sk.GridSearchCV(estimator = dt_s, 
                                param_grid= dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_tr, ys_tr)

dt_fe_cv = sk.GridSearchCV(estimator = dt_fe, 
                                param_grid= dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_fetr, ys_fetr)

dt_up_cv = sk.GridSearchCV(estimator = dt_up, 
                                param_grid = dt_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_uptr, ys_uptr)

Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 240 tasks      | elapsed:    5.3s
[Parallel(n_jobs=-1)]: Done 646 tasks      | elapsed:    9.5s
[Parallel(n_jobs=-1)]: Done 1212 tasks      | elapsed:   16.5s
[Parallel(n_jobs=-1)]: Done 1942 tasks      | elapsed:   25.2s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:   27.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 682 tasks      | elapsed:   10.7s
[Parallel(n_jobs=-1)]: Done 1248 tasks      | elapsed:   20.7s
[Parallel(n_jobs=-1)]: Done 1978 tasks      | elapsed:   37.2s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:   42.6s finished


Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 682 tasks      | elapsed:   16.0s
[Parallel(n_jobs=-1)]: Done 1248 tasks      | elapsed:   29.2s
[Parallel(n_jobs=-1)]: Done 1978 tasks      | elapsed:   46.1s
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:   49.9s finished


In [12]:
print("DT - s:", dt_s_cv.best_params_)
print("DT - fe:", dt_fe_cv.best_params_)
print("DT - up:", dt_up_cv.best_params_)

DT - s: {'max_depth': 72, 'max_features': 'auto', 'min_samples_leaf': 25, 'min_samples_split': 120}
DT - fe: {'max_depth': 47, 'max_features': 'auto', 'min_samples_leaf': 25, 'min_samples_split': 60}
DT - up: {'max_depth': 110, 'max_features': 'sqrt', 'min_samples_leaf': 25, 'min_samples_split': 25}
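
None of the searches pass `scoring`, so candidates are ranked by the estimator's own `score` method, which is mean accuracy for classifiers. Given the stated goal of improving AUC on precision-emphasized models, one option is to rank by ROC AUC and track precision alongside it. The sketch below assumes synthetic, mildly imbalanced data and a reduced grid; it is an alternative setup, not what the cells above ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the notebook's splits).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=0
)

scored_search = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [5, 25]},
    scoring={"auc": "roc_auc", "precision": "precision"},  # metrics to record
    refit="auc",  # metric used to pick and refit the best candidate
    cv=3,
).fit(X_demo, y_demo)

best = scored_search.best_index_
print(scored_search.best_params_)
print("mean CV AUC:", scored_search.cv_results_["mean_test_auc"][best])
print("mean CV precision:", scored_search.cv_results_["mean_test_precision"][best])
```

With several metrics, `refit` must name the one that decides `best_params_`; the others are still available per candidate in `cv_results_`.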


# Logistic Regression Tuning

In [13]:
######## ---------- Logistic Regression

# Build Grid


c = [.0001, .001, .005, .01, .05, .1, 1, 10, 20]
pen = ['l1', 'l2']
solve=['liblinear', 'saga']


lr_grid = {'C': c,
               'penalty': pen,
               'solver': solve}
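
Both `liblinear` and `saga` accept `l1` and `l2`, so the full cross product is valid here: 9 x 2 x 2 = 36 candidates. If a solver that supports fewer penalties were added, a list of dicts would keep invalid pairs out of the search. The `lbfgs` entry below is a hypothetical extension for illustration; it is not part of this notebook's grid.

```python
from sklearn.model_selection import ParameterGrid

c = [.0001, .001, .005, .01, .05, .1, 1, 10, 20]

# The grid as defined in the notebook.
full_grid = ParameterGrid(
    {"C": c, "penalty": ["l1", "l2"], "solver": ["liblinear", "saga"]}
)

# Hypothetical extension: lbfgs handles l2 but not l1, so it gets its own dict.
split_grid = ParameterGrid([
    {"C": c, "penalty": ["l1", "l2"], "solver": ["liblinear", "saga"]},
    {"C": c, "penalty": ["l2"], "solver": ["lbfgs"]},
])

print(len(full_grid), len(split_grid))
```

`GridSearchCV` accepts the same list-of-dicts form for `param_grid`.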


In [14]:
lr_s = linear_model.LogisticRegression(max_iter=500).fit(xs_tr, ys_tr)

lr_fe = linear_model.LogisticRegression(max_iter=500).fit(xs_fetr, ys_fetr)

lr_up = linear_model.LogisticRegression(max_iter=500).fit(xs_uptr, ys_uptr)



In [15]:
# Grid Search
lr_s_cv = sk.GridSearchCV(estimator = lr_s, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_tr, ys_tr)

lr_fe_cv = sk.GridSearchCV(estimator = lr_fe, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_fetr, ys_fetr)

lr_up_cv = sk.GridSearchCV(estimator = lr_up, 
                                param_grid = lr_grid, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_uptr, ys_uptr)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:    1.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   36.2s finished


Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:    2.2s finished


In [16]:
print("LR - s:", lr_s_cv.best_params_)
print("LR - fe:", lr_fe_cv.best_params_)
print("LR - up:", lr_up_cv.best_params_)

LR - s: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
LR - fe: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
LR - up: {'C': 0.05, 'penalty': 'l1', 'solver': 'liblinear'}
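
`best_params_` reports only the winner. To see how close the runners-up were, `cv_results_` can be loaded into a DataFrame and sorted by rank. The sketch below runs on synthetic data with a reduced grid; `demo_cv` and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

demo_cv = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=500),
    {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    cv=3,
).fit(X_demo, y_demo)

# One row per candidate, best first.
columns = ["param_C", "param_penalty", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(demo_cv.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results.head())
```

If the top few rows differ by less than their `std_test_score`, the choice among them is probably within fold-to-fold noise.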


# Random Forest Tuning

-- super slow, leaving out pas

In [19]:
######## ---------- Random Forest

# Build Grid

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 25, stop = 300, num = 8)]

# Method of selecting samples for training each tree
bootstrap = [True, False]


rf_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [20]:
n_estimators

[25, 64, 103, 142, 182, 221, 260, 300]
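
With six parameters the full random forest grid is large, which is why a randomized search with `permsz = 100` samples is used instead of an exhaustive one. The sketch below counts the full grid and draws 100 combinations with `ParameterSampler`, the sampler that `RandomizedSearchCV` uses internally. The `random_state=0` seed is an assumption added for reproducibility; the notebook's searches do not set one.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Values copied from the notebook's grid definitions and printed outputs.
rf_grid_demo = {
    "n_estimators": [25, 64, 103, 142, 182, 221, 260, 300],
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 22, 35, 47, 60, 72, 85, 97, 110, None],
    "min_samples_split": [25, 40, 60, 90, 120, 200],
    "min_samples_leaf": [25, 40, 60, 90, 120, 175],
    "bootstrap": [True, False],
}

n_total = len(ParameterGrid(rf_grid_demo))
sampled = list(ParameterSampler(rf_grid_demo, n_iter=100, random_state=0))

print(n_total, len(sampled), f"{len(sampled) / n_total:.2%}")
```

Passing a fixed `random_state` to `RandomizedSearchCV` in the same way would make the sampled candidates repeatable between runs.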

In [21]:
# Build Model
rf_s = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(xs_tr, ys_tr)

rf_fe = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(xs_fetr, ys_fetr)

rf_up = e.RandomForestClassifier(criterion='entropy', 
                                n_estimators=100).fit(xs_uptr, ys_uptr)

In [22]:
# Randomized Search
rf_s_cv = sk.RandomizedSearchCV(estimator = rf_s, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_tr, ys_tr)

print("RF - s:", rf_s_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   26.6s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.3min finished


RF - s: {'n_estimators': 182, 'min_samples_split': 25, 'min_samples_leaf': 25, 'max_features': 'auto', 'max_depth': 110, 'bootstrap': False}


In [23]:
rf_fe_cv = sk.RandomizedSearchCV(estimator = rf_fe, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_fetr, ys_fetr)

print("RF - fe:", rf_fe_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 13.7min finished


RF - fe: {'n_estimators': 182, 'min_samples_split': 25, 'min_samples_leaf': 25, 'max_features': 'auto', 'max_depth': 72, 'bootstrap': False}


In [24]:
rf_up_cv = sk.RandomizedSearchCV(estimator = rf_up, 
                                param_distributions = rf_grid, 
                                n_iter = permsz, 
                                cv = crossval, 
                                verbose=2, n_jobs=-1).fit(xs_uptr, ys_uptr)

print("RF - up:", rf_up_cv.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   49.6s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 10.7min finished


RF - up: {'n_estimators': 300, 'min_samples_split': 25, 'min_samples_leaf': 25, 'max_features': 'auto', 'max_depth': 60, 'bootstrap': False}
