# Random Search and Grid Search

**for project: "miRNA Biomarker for Lung Cancer Diagnostics - Selecting a test panel for patient classification -"**

## Import

### Packages & Modules

In [1]:
# Import all Packages & Modules

# IPython
from IPython.display import Image

# mlxtend
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.classifier import StackingCVClassifier

# SciPy
from scipy.stats import normaltest

# sklearn
import sklearn

from sklearn.dummy import DummyClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier 

from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression 
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split 

from sklearn.naive_bayes import GaussianNB

from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz  

# subprocess
from subprocess import call

# xgboost
from xgboost import XGBClassifier

# yellowbrick
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.classifier import ROCAUC

# matplotlib
import matplotlib
import matplotlib.pyplot as plt

# missingno
import missingno

# numpy
import numpy as np

# pandas
import pandas as pd

# pickle (for saving and loading data)
import pickle

# seaborn
import seaborn as sns

# sys (allows stopping code execution early)
import sys



### Functions

In [2]:
# Import all written Functions

from functions import feature_selection
from functions import rfe_selection
from functions import dataframe_selection
from functions import model_evaluation
from functions import top_model
from functions import multibar_plot
from functions import random_searching
from functions import grid_searching
from functions import viz_summary
from functions import feature_reduce
from functions import score_eval
from functions import heatmap
from functions import TOP_n_from

### Prerequisites

In [3]:
# Set the random_state seed so that results are reproducible across runs
seed=1

In [4]:
# Load all necessary dataframes from main notebook
# for Random Search and Grid Search to be used in this notebook
X_tbc_top20 = pd.read_pickle("data/X_tbc_top20")
y_train = pd.read_pickle("data/y_train")
X_tbc_top20_test = pd.read_pickle("data/X_tbc_top20_test")
y_test = pd.read_pickle("data/y_test")

## Model Optimization

Now that I know which combination of feature selection and classification model performs best, I try to **optimize for the chosen metric, precision** (described in the main notebook). Because the ranking might change during tuning, I carry the top three models (RFC, ETC, XGBC) forward, each combined with the TOP20 feature set from the combination of tree-based classifiers (TOP20 TBC).
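As a minimal sketch of what optimizing for precision can look like in scikit-learn, a scorer can be built with `make_scorer`. The `average="micro"` setting and the toy labels are assumptions made for illustration; the main notebook defines the metric actually used.

```python
from sklearn.metrics import accuracy_score, make_scorer, precision_score

# Assumption: micro-averaged precision; the main notebook defines the actual metric.
precision_micro = make_scorer(precision_score, average="micro")

# Hypothetical three-class labels, for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

micro_precision = precision_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

# For single-label multiclass problems, micro-averaged precision equals accuracy.
print(micro_precision, accuracy)
```

Micro-averaging would be consistent with the identical accuracy, precision and F1 values reported in the outputs of this notebook, but that reading is an inference, not something the notebook states.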

### Optimization of parameters with Random Search & Grid Search

#### Random Search Grid Parameters

Because the whole Random Search and Grid Search process takes approximately 8 hours to compute, I moved this part of the code into this separate notebook.

I always include each model's default values in the hyperparameter grid, in case they are already the best setting for my dataset. This also makes it easier to compare against the unoptimized model.
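One way to verify that a grid really contains the defaults is to read them from a freshly constructed estimator with `get_params()`. This is a small sketch; the grid excerpt is hypothetical and only illustrates the check.

```python
from sklearn.ensemble import RandomForestClassifier

# Read the defaults from a freshly constructed estimator.
defaults = RandomForestClassifier().get_params()

# Hypothetical excerpt of a search grid, for illustration only.
grid = {
    "n_estimators": [100, 400, 700],
    "min_samples_split": [2, 5, 10],
    "max_depth": [None, 10, 20],
}

# List every grid entry that does not contain the estimator's default value.
missing = [name for name, values in grid.items() if defaults[name] not in values]
print(missing)
```

An empty list means every listed parameter includes its default. Defaults can differ between library versions, so the check is worth rerunning after an upgrade.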

First I perform a Randomized Search to check a broad range of model parameters. This method is quick because it evaluates only a random sample of parameter combinations rather than the full grid: here 100 combinations, each scored with 5-fold cross-validation, for a total of 500 fits. In this way I can narrow down the parameters for a more detailed optimization.
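The `random_searching` helper lives in the project's own `functions` module and is not shown here. The following is only a sketch of the underlying idea using scikit-learn's `RandomizedSearchCV` directly; the synthetic data, the tiny grid, the reduced `n_iter` and `cv` values and the `precision_micro` scoring are all assumptions chosen so that the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Synthetic stand-in data; the notebook uses X_tbc_top20 / y_train instead.
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)

# Deliberately small hypothetical grid so the sketch runs in seconds.
param_distributions = {
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 25, 50],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,                   # the notebook's logs show 100 candidates
    cv=3,                       # the notebook's logs show 5 folds
    scoring="precision_micro",  # assumption: the averaging mode is not stated here
    random_state=1,
)
search.fit(X, y)

n_total = len(ParameterGrid(param_distributions))  # size of the full grid
n_sampled = len(search.cv_results_["params"])      # combinations actually tried
print(n_sampled, "of", n_total, "combinations tried")
print(search.best_params_)
```

For scale, the first random forest grid in this notebook spans 2 · 2 · 7 · 2 · 3 · 4 · 6 = 4,032 combinations, so 100 sampled candidates cover roughly 2.5% of it.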

In [5]:
# Initialize Classifier Models
RFC2 = RandomForestClassifier(random_state=seed, n_jobs=-1)
ETC2 = ExtraTreesClassifier(random_state=seed, n_jobs=-1)
XGBC2 = XGBClassifier(random_state=seed, n_jobs=-1)

##### Random Search Random Forest Classifier (RFC)

In [6]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_RFC = {
    'bootstrap': [True, False], # default=True
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 5, 10, 15, 20, 25, 30], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 3, 5, 10], # default=2
    'n_estimators': [100, 400, 700, 1000, 1300, 1600], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(RFC2, parameters_RFC, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   46.4s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  6.7min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.8333333333333334
Roc_AUC score Unoptimized: 0.9801128428579409
Precision score Unoptimized: 0.8333333333333334
F1 score Unoptimized: 0.8333333333333334

Optimized Model
------
Accuracy score Optimized: 0.8333333333333334
Roc_AUC score Optimized: 0.9816909669850847
Precision score Optimized: 0.8333333333333334
F1 score Optimized: 0.8333333333333334
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=700,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)


In [7]:
# Run a second random search to further narrow down the parameter ranges

# Set a new Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_RFC_new = {
    'bootstrap': [True, False], # default=True
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 6, 8, 10, 12, 14], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9], # default=2
    'n_estimators': [100, 1100, 1200, 1300, 1400, 1500], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(RFC2, parameters_RFC_new, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  8.2min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.8333333333333334
Roc_AUC score Unoptimized: 0.9801128428579409
Precision score Unoptimized: 0.8333333333333334
F1 score Unoptimized: 0.8333333333333334

Optimized Model
------
Accuracy score Optimized: 0.8333333333333334
Roc_AUC score Optimized: 0.9801477571085414
Precision score Optimized: 0.8333333333333334
F1 score Optimized: 0.8333333333333334
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='log2',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=1400,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)


##### Random Search Extra Trees Classifier (ETC)

In [8]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_ETC = {
    'bootstrap': [False, True], # default=False
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 5, 10, 15, 20, 25, 30], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 5, 10], # default=2
    'n_estimators': [100, 400, 800, 1200, 1600, 2000], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(ETC2, parameters_ETC, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   51.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  6.0min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.8666666666666667
Roc_AUC score Unoptimized: 0.9847424724875705
Precision score Unoptimized: 0.8666666666666667
F1 score Unoptimized: 0.8666666666666667

Optimized Model
------
Accuracy score Optimized: 0.8666666666666667
Roc_AUC score Optimized: 0.984777386738171
Precision score Optimized: 0.8666666666666667
F1 score Optimized: 0.8666666666666667
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=10, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=1600, n_jobs=-1,
                     oob_score=False, random_state=1, verbose=0,
                     warm_start=False)


In [9]:
# Run a second random search to further narrow down the parameter ranges

# Set a new Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_ETC_new = {
    'bootstrap': [False, True], # default=False
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 26, 28, 30, 32, 34], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 3, 4], # default=2
    'n_estimators': [100, 200, 300, 400, 500, 600], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(ETC2, parameters_ETC_new, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   16.7s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   46.1s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  2.5min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.8666666666666667
Roc_AUC score Unoptimized: 0.9847424724875705
Precision score Unoptimized: 0.8666666666666667
F1 score Unoptimized: 0.8666666666666667

Optimized Model
------
Accuracy score Optimized: 0.8666666666666667
Roc_AUC score Optimized: 0.9832341768616278
Precision score Optimized: 0.8666666666666667
F1 score Optimized: 0.8666666666666667
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=28, max_features='log2',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=4,
                     min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
                     oob_score=False, random_state=1, verbose=0,
                     warm_start=False)


##### Random Search XGBoost Classifier (XGBC)

In [10]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_XGBC = {
    'max_depth': [3, 9, 15, 21, 27, 33, None], # default=3
    'learning_rate': [0.1, 0.01, 0.001, 0.0001], # default=0.1
    'min_child_weight': [1, 2, 3], # default=1
    'subsample': [1.0, 0.9, 0.8, 0.7, 0.6], # default=1.0
    'colsample_bytree': [1.0, 0.75, 0.5], # default=1.0
    'n_estimators': [100, 200, 300, 400, 500], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(XGBC2, parameters_XGBC, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done 178 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 304 tasks      | elapsed:   53.1s
[Parallel(n_jobs=-1)]: Done 474 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  1.5min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.9
Roc_AUC score Unoptimized: 0.9878288922406568
Precision score Unoptimized: 0.9
F1 score Unoptimized: 0.9

Optimized Model
------
Accuracy score Optimized: 0.8666666666666667
Roc_AUC score Optimized: 0.9907756549913413
Precision score Optimized: 0.8666666666666667
F1 score Optimized: 0.8666666666666667
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.75, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=21,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.7, verbosity=1)


In [11]:
# Run a second random search to further narrow down the parameter ranges

# Set a new Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_XGBC_new = {
    'max_depth': [3, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], # default=3
    'learning_rate': [0.1, 0.005, 0.0075, 0.01, 0.025, 0.05], # default=0.1
    'min_child_weight': [1, 2, 3], # default=1
    'subsample': [1.0, 0.85, 0.875, 0.9, 0.925, 0.95], # default=1.0
    'colsample_bytree': [1.0, 0.45, 0.475, 0.5, 0.525, 0.55], # default=1.0
    'n_estimators': [100, 225, 250, 300, 325, 350], # default=100
}

# Use random_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
random_searching(XGBC2, parameters_XGBC_new, X_tbc_top20, y_train, X_tbc_top20_test, y_test, seed=seed)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   25.3s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:   50.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  1.5min finished


Unoptimized model
------
Accuracy score Unoptimized: 0.9
Roc_AUC score Unoptimized: 0.9878288922406568
Precision score Unoptimized: 0.9
F1 score Unoptimized: 0.9

Optimized Model
------
Accuracy score Optimized: 0.8
Roc_AUC score Optimized: 0.9876892352382548
Precision score Optimized: 0.8
F1 score Optimized: 0.8000000000000002
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.45, gamma=0,
              learning_rate=0.005, max_delta_step=0, max_depth=36,
              min_child_weight=1, missing=None, n_estimators=250, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.95, verbosity=1)


In [12]:
# Stop switch: uncomment the line below to stop here and skip the Grid Search
# sys.exit("Stop without Grid Search!")

#### Optimization of parameters with Grid Search

After narrowing down the model parameters with Randomized Search, I can pass the reduced set to the more exhaustive GridSearchCV. This method evaluates every possible combination of the listed parameter values with cross-validation, with no cap on the number of fits. In this way the parameters are optimized for precision.
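Because every combination is evaluated, the cost of a grid search is the product of the lengths of all value lists times the number of folds. A short sketch using scikit-learn's `ParameterGrid` on the random forest grid from the grid search cell below reproduces the candidate and fit counts reported in its log:

```python
from sklearn.model_selection import ParameterGrid

# Random forest grid copied from the grid search cell of this notebook.
parameters_RFC = {
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 9, 10, 11],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3, 4],
    "n_estimators": [100, 1350, 1400, 1425, 1450],
}

n_candidates = len(ParameterGrid(parameters_RFC))  # product of all list lengths
n_folds = 5                                        # fold count shown in the logs
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 1440 7200
```

Running this count before starting a search is a cheap way to estimate the runtime: adding a single extra value to one list multiplies the total accordingly.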

In [13]:
# Initialize Classifier Models
RFC3 = RandomForestClassifier(random_state=seed, n_jobs=-1)
ETC3 = ExtraTreesClassifier(random_state=seed, n_jobs=-1)
XGBC3 = XGBClassifier(random_state=seed, n_jobs=-1)

# Create Dictionary for Optimization results
results_opt = {}

# Lists with keys for optimized and unoptimized models
name_model_RFC = ["RFC_UnOpt", "RFC_Opt"]
name_model_ETC = ["ETC_UnOpt", "ETC_Opt"]
name_model_XGBC = ["XGBC_UnOpt", "XGBC_Opt"] 

##### Grid Search Random Forest Classifier (RFC)

In [14]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_RFC = {
    'bootstrap': [True, False], # default=True
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 9, 10, 11], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 3, 4], # default=2
    'n_estimators': [100, 1350, 1400, 1425, 1450], # default=100
}

# Use grid_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
grid_searching(results_opt, RFC3, name_model_RFC, parameters_RFC, X_tbc_top20, y_train, X_tbc_top20_test, y_test)

Fitting 5 folds for each of 1440 candidates, totalling 7200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed: 13.7min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed: 17.8min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed: 22.2min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 27.1min
[Parallel(n_jobs=-1)]: Done 2170 tasks      | elapsed: 32.9min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 39.2min
[Parallel(n_jobs=-1)]: Done 3034 tasks      | elapsed: 46.0min
[Parallel(n_jobs=-1)]: Done 3520 tasks      | elapsed: 53.5min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | ela

Unoptimized model
------
Accuracy score Unoptimized: 0.8333333333333334
Roc_AUC score Unoptimized: 0.9801128428579409
Precision score Unoptimized: 0.8333333333333334
F1 score Unoptimized: 0.8333333333333334

Optimized Model
------
Accuracy score Optimized: 0.8333333333333334
Roc_AUC score Optimized: 0.9862158538629128
Precision score Optimized: 0.8333333333333334
F1 score Optimized: 0.8333333333333334
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=9, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=1350,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)


##### Grid Search Extra Trees Classifier (ETC)

In [15]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_ETC = {
    'bootstrap': [False, True], # default=False
    'criterion': ['gini', 'entropy'], # default=gini
    'max_depth': [None, 25, 26, 27], # default=None
    'max_features': ['sqrt', 'log2'], # default=sqrt
    'min_samples_leaf': [1, 2, 3], # default=1
    'min_samples_split': [2, 3, 4], # default=2
    'n_estimators': [100, 325, 350, 400, 425, 450], # default=100
}

# Use grid_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
grid_searching(results_opt, ETC3, name_model_ETC, parameters_ETC, X_tbc_top20, y_train, X_tbc_top20_test, y_test)

Fitting 5 folds for each of 1728 candidates, totalling 8640 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   36.4s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed: 40.4min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 42.7min
[Parallel(n_jobs=-1)]: Done 2170 tasks      | elapsed: 45.2min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 48.0min
[Parallel(n_jobs=-1)]: Done 3034 tasks      | elapsed: 50.9min
[Parallel(n_jobs=-1)]: Done 3520 tasks      | elapsed: 54.0min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | ela

Unoptimized model
------
Accuracy score Unoptimized: 0.8666666666666667
Roc_AUC score Unoptimized: 0.9847424724875705
Precision score Unoptimized: 0.8666666666666667
F1 score Unoptimized: 0.8666666666666667

Optimized Model
------
Accuracy score Optimized: 0.8666666666666667
Roc_AUC score Optimized: 0.9817607954862857
Precision score Optimized: 0.8666666666666667
F1 score Optimized: 0.8666666666666667
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=4,
                     min_weight_fraction_leaf=0.0, n_estimators=350, n_jobs=-1,
                     oob_score=False, random_state=1, verbose=0,
                     warm_start=False)


##### Grid Search XGBoost Classifier (XGBC)

In [16]:
# Set Hyperparameter grid for Model with alternatives to default values (included in grid)
parameters_XGBC = {
    'max_depth': [3, 31, 32, 33], # default=3
    'learning_rate': [0.1, 0.0065, 0.007, 0.0075, 0.008, 0.0085], # default=0.1
    'min_child_weight': [1, 2, 3], # default=1
    'subsample': [1.0, 0.88, 0.89, 0.9, 0.91, 0.92], # default=1.0
    'colsample_bytree': [1.0, 0.43, 0.44, 0.45, 0.46, 0.47], # default=1.0
    'n_estimators': [100, 215, 220, 225, 230, 335], # default=100
}

# Use grid_searching function with the TOP20 TBC feature dataset (X_tbc_top20, y_train, X_tbc_top20_test, y_test)
grid_searching(results_opt, XGBC3, name_model_XGBC, parameters_XGBC, X_tbc_top20, y_train, X_tbc_top20_test, y_test)

Fitting 5 folds for each of 15552 candidates, totalling 77760 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   14.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   41.8s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 2170 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 3034 tasks      | elapsed: 16.7min
[Parallel(n_jobs=-1)]: Done 3520 tasks      | elapsed: 21.0min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | ela

Unoptimized model
------
Accuracy score Unoptimized: 0.9
Roc_AUC score Unoptimized: 0.9878288922406568
Precision score Unoptimized: 0.9
F1 score Unoptimized: 0.9

Optimized Model
------
Accuracy score Optimized: 0.8333333333333334
Roc_AUC score Optimized: 0.9876543209876543
Precision score Optimized: 0.8333333333333334
F1 score Optimized: 0.8333333333333334
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.45, gamma=0,
              learning_rate=0.0075, max_delta_step=0, max_depth=31,
              min_child_weight=1, missing=None, n_estimators=225, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.88, verbosity=1)


In [17]:
# Create a dataframe from the results_opt dictionary
results_optimization = pd.DataFrame(results_opt)
results_optimization

| | RFC_UnOpt | RFC_Opt | ETC_UnOpt | ETC_Opt | XGBC_UnOpt | XGBC_Opt |
|---|---|---|---|---|---|---|
| Accuracy | 0.833333 | 0.833333 | 0.866667 | 0.866667 | 0.9 | 0.833333 |
| Roc_AUC | 0.980113 | 0.986216 | 0.984742 | 0.981761 | 0.987829 | 0.987654 |
| Precision | 0.833333 | 0.833333 | 0.866667 | 0.866667 | 0.9 | 0.833333 |
| F1 | 0.833333 | 0.833333 | 0.866667 | 0.866667 | 0.9 | 0.833333 |


In [18]:
# save results_opt dictionary to results_opt.p file to be used in main notebook
with open('data/results_opt.p', 'wb') as fp:
    pickle.dump(results_opt, fp, protocol=pickle.HIGHEST_PROTOCOL)
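For reading the file back in the main notebook, a matching `pickle.load` call is enough. The sketch below shows the full round trip; the dictionary contents are hypothetical and a temporary directory stands in for the `data/` folder.

```python
import os
import pickle
import tempfile

# Hypothetical results dictionary, for illustration only.
results = {"model_a": {"Precision": 0.5}, "model_b": {"Precision": 0.75}}

# A temporary directory stands in for the notebook's data/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_opt.p")

    # Write the dictionary in binary mode, as in the cell above.
    with open(path, "wb") as fp:
        pickle.dump(results, fp, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back; the file must also be opened in binary mode.
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == results)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.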

#### Summary

After optimizing all models for precision, the best way to compare their performance is by their ROC AUC score.

The ROC (receiver operating characteristic) curve of a model plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, so it measures the overall performance of a model independent of any single threshold. For easier comparison, the area under the curve (AUC) is used: the larger the area, the better the performance.
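To make the definition concrete, here is a small binary sketch with hypothetical labels and scores. It computes the ROC points with `roc_curve`, integrates them with `auc`, and checks the result against `roc_auc_score`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical binary labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve via the trapezoidal rule, and the direct shortcut.
area = auc(fpr, tpr)
direct = roc_auc_score(y_true, y_score)
print(area, direct)  # 0.75 0.75
```

The value 0.75 means that three of the four positive/negative pairs are ranked correctly. The `multi:softprob` objective in the XGBoost outputs suggests a multiclass target; in that case `roc_auc_score` needs class probabilities and a `multi_class` setting such as `"ovr"`. How the project's helper functions handle this is not shown here.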

The best Roc_AUC is achieved by the unoptimized XGBoost Classifier (XGBC_UnOpt; Roc_AUC: 0.987829, Rank = 6.0), followed by the optimized XGBoost Classifier (XGBC_Opt; Roc_AUC: 0.987654, Rank = 5.0).

Even though the unoptimized model XGBC_UnOpt scores higher than the optimized model XGBC_Opt, I consider XGBC_Opt the "true top model". The reason is that GridSearchCV cross-validates every candidate during optimization, which should guard against overfitting. XGBC_UnOpt was not evaluated by cross-validation and might score better only because it overfits the data. In addition, the default parameters used by the unoptimized model were included in the GridSearchCV grid; they achieved a worse cross-validated score, since the algorithm did not select them as the best model parameters.
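The claim about the default parameters can be checked directly, because a fitted `GridSearchCV` keeps the cross-validated score of every candidate in `cv_results_`. The sketch below illustrates the lookup on synthetic data with a tiny hypothetical grid; the data, the grid and the `precision_micro` scoring are assumptions, and the `grid_searching` helper used in this notebook may expose its results differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook uses X_tbc_top20 / y_train instead.
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)

# Small hypothetical grid that contains the defaults for both parameters.
default_params = {"max_depth": None, "min_samples_split": 2}
param_grid = {"max_depth": [None, 3], "min_samples_split": [2, 6]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=1),
    param_grid=param_grid,
    cv=3,
    scoring="precision_micro",  # assumption: averaging mode is not stated here
)
search.fit(X, y)

# Look up the cross-validated score of the default combination.
index = search.cv_results_["params"].index(default_params)
default_cv_score = search.cv_results_["mean_test_score"][index]

print("default combination:", default_cv_score)
print("best combination:   ", search.best_score_, search.best_params_)
```

Comparing these two cross-validated scores is a like-for-like comparison, whereas the test-set scores in the results table come from a single split.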