# Grid Search & Cross Validation
___

#### Introduction
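* [1.1 K-Fold Cross Validation](#1.1)
* 1.2 Grid Search with Early Stopping
* 1.3 Grid Search by Person Split, base features
* 1.4 Grid Search by Person Split, all features
* 1.5 Grid Search by Time Series Split, base features
* 1.6 Grid Search by Time Series Split, all features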

#### Results
* 2.1. Person Split, basic features
* 2.2. Person Split, All Features
* 2.3. Time Series Split, basic features
* 2.4. Time Series Split, All Features

## Introduction
___

After training the default model, it was difficult to tell which features should stay in the model and which have little predictive power.
In my initial analysis, the model with all features included was still more accurate than any model with groups of features experimentally removed.

In addition, we need to tune two parameters, learning rate and depth, so we will perform a grid search across 5 cross validation folds.

Aims:
* Find optimal learning rate and depth parameters
* Across 5 folds with cross validation, determine which features do not contribute to the model and can be dropped.

We will then remove these features and check whether accuracy improves or log loss falls.
* Grid search will be run using the two train test split strategies, by person and by last offer.
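To give a sense of the cost of the search, the sketch below enumerates the learning rate and depth grid used in the search cells later in this notebook and counts the model fits needed across 5 folds. The variable names `n_folds`, `combinations` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid values copied from the search cells later in this notebook.
grid = {"learning_rate": [0.1, 0.07, 0.03, 0.01, 0.005],
        "max_depth": [5, 6, 7, 8, 9, 10]}
n_folds = 5

# ParameterGrid yields one dict per combination of parameter values.
combinations = list(ParameterGrid(grid))
n_fits = len(combinations) * n_folds

print(len(combinations), "parameter combinations")
print(n_fits, "model fits across", n_folds, "folds")
print("first combination:", combinations[0])
```

Every combination is fitted once per fold, so shrinking the grid or the list of folds is the quickest way to shorten a test run.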

In [8]:
# mount google drive if running in colab
import os
import sys

if os.path.exists('/usr/lib/python3.6/'):
    from google.colab import drive
    drive.mount('/content/drive/')
    sys.path.append('/content/drive/My Drive/Colab Notebooks/Starbucks_Udacity')
    %cd /content/drive/My Drive/Colab Notebooks/Starbucks_Udacity/notebooks/exploratory
else:
    sys.path.append('../../')

In [None]:
import timeit

import catboost
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import progressbar
import seaborn as sns
import shap
from catboost import CatBoostClassifier, MetricVisualizer, Pool
from catboost.utils import get_fnr_curve, get_fpr_curve, get_roc_curve, select_threshold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, log_loss, recall_score)
from sklearn.model_selection import (GridSearchCV, GroupKFold, GroupShuffleSplit,
                                     ParameterGrid, TimeSeriesSplit, train_test_split)
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from xgboost.sklearn import XGBClassifier, XGBRegressor

shap.initjs()

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

%load_ext autoreload
%autoreload 2
%aimport src.models.train_model
%aimport src.data.make_dataset

from src.data import make_dataset
from src.models.train_model import gridsearch_early_stopping, generate_folds

In [171]:
df = joblib.load('../../data/interim/transcript_final.joblib')
df = src.models.train_model.drop_completion_features(df)

In [8]:
df.head(10)

Unnamed: 0,person,age,income,signed_up,gender,id,rewarded,difficulty,reward,duration,mobile,web,social,bogo,discount,informational,time_days,day,weekday,month,year,t_1,t_3,t_7,t_14,t_21,t_30,t_1c,t_3c,t_7c,t_14c,t_21c,t_30c,last_amount,last_transaction_days,hist_reward_completed,hist_reward_possible,hist_difficulty_completed,hist_difficulty_possible,hist_previous_completed,hist_previous_offers,hist_viewed_and_completed,hist_complete_not_viewed,hist_failed_complete,hist_viewed,hist_received_spend,hist_viewed_spend,completed
0,78afa995795e4d85b5d9ceeca43f5fef,75.0,100000.0,-443,0,6,0.0,5.0,5.0,7.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,9,1,5,2017,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,0.0,1
1,78afa995795e4d85b5d9ceeca43f5fef,75.0,100000.0,-443,0,9,0.0,0.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,1.0,7.0,16,1,5,2017,17.78,37.67,37.67,37.67,37.67,37.67,1,2,2,2,2,2,17.78,1.0,5.0,5.0,5.0,5.0,1,1,1,0,0,1,0.0,37.67,0
2,78afa995795e4d85b5d9ceeca43f5fef,75.0,100000.0,-443,0,1,0.0,10.0,10.0,7.0,1.0,0.0,1.0,1.0,0.0,0.0,17.0,26,4,5,2017,0.0,23.93,53.65,110.99,110.99,110.99,0,1,2,5,5,5,23.93,2.0,5.0,5.0,5.0,5.0,1,2,1,0,1,2,0.0,87.06,1
3,78afa995795e4d85b5d9ceeca43f5fef,75.0,100000.0,-443,0,7,0.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,0.0,0.0,21.0,30,1,5,2017,0.0,0.0,23.93,73.32,110.99,110.99,0,0,1,3,5,5,23.93,6.0,15.0,15.0,15.0,15.0,2,3,2,0,1,3,0.0,135.34,1
4,a03223e636434f42ac4c3df47e8bac43,,,-356,0,0,0.0,20.0,5.0,10.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,4,4,8,2017,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,0.0,0
5,a03223e636434f42ac4c3df47e8bac43,,,-356,0,8,0.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,0.0,1.0,14.0,18,4,8,2017,0.0,3.5,4.59,4.59,4.59,4.59,0,1,2,2,2,2,3.5,3.0,0.0,5.0,0.0,20.0,0,1,0,0,1,1,0.0,1.09,0
6,a03223e636434f42ac4c3df47e8bac43,,,-356,0,9,0.0,0.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,1.0,17.0,21,0,8,2017,0.0,0.0,3.5,4.59,4.59,4.59,0,0,1,2,2,2,3.5,6.0,0.0,5.0,0.0,20.0,0,2,0,0,2,2,0.0,1.09,0
7,a03223e636434f42ac4c3df47e8bac43,,,-356,0,0,0.0,20.0,5.0,10.0,0.0,1.0,0.0,0.0,1.0,0.0,21.0,25,4,8,2017,0.0,0.0,0.0,4.59,4.59,4.59,0,0,0,2,2,2,3.5,10.0,0.0,5.0,0.0,20.0,0,3,0,0,3,2,0.0,1.09,0
8,a03223e636434f42ac4c3df47e8bac43,,,-356,0,0,0.0,20.0,5.0,10.0,0.0,1.0,0.0,0.0,1.0,0.0,24.0,28,0,8,2017,0.0,0.0,0.0,3.5,4.59,4.59,0,0,0,1,2,2,3.5,13.0,0.0,10.0,0.0,40.0,0,4,0,0,4,3,0.06,1.09,0
9,e2127556f4f64592b11af22de27a7932,68.0,70000.0,-91,1,4,0.0,10.0,2.0,7.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,26,3,4,2018,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0.0,0.0,0


### 1.1. K-fold cross validation

**By person**
* After investigation, splitting persons by sign-up date turned out to be a poor strategy, because the latest sign-ups had a lower completion rate: closer to 29% for the latest 28% of sign-up dates.

* We will instead use scikit-learn's group-aware splitting functions and check that the positive class ratio stays similar across the splits.

In [129]:
X = df.drop('completed', axis=1)
y = df.completed

In [96]:
# splitting train and test by person
cv = GroupShuffleSplit(n_splits=1, test_size=.2, train_size=.8, random_state=42).split(X , y, groups=X.person)
X_train, y_train, X_test, y_test = generate_folds(cv, X_train=X, y_train=y)
X_train=X_train[0]
y_train=y_train[0]
X_test=X_test[0]
y_test=y_test[0]

positive classification % per fold and length
train[0] 0.43904556025023744 (61062,)
test[0] 0.4449556358856392 (15215,)


When performing K-fold cross validation, we want the train customers to differ from the test customers in every fold, so we use GroupKFold to split by 'person'.

In [98]:
cv = GroupKFold(n_splits=5).split(X_train, y_train, groups=X_train.person)
train_X, train_y, test_X, test_y = generate_folds(cv, X_train=X_train, y_train=y_train)

positive classification % per fold and length
train[0] 0.44015230608610206 (48849,)
test[0] 0.43461884876770657 (12213,)
train[1] 0.4400090073491781 (48849,)
test[1] 0.43519200851551626 (12213,)
train[2] 0.43930399181166835 (48850,)
test[2] 0.43801179168031446 (12212,)
train[3] 0.4385670419651996 (48850,)
test[3] 0.4409597117589256 (12212,)
train[4] 0.4371954964176049 (48850,)
test[4] 0.4464461185718965 (12212,)


Testing that we do not have any intersection of customers between train and test in each fold and that we have approximately the same percentage of the positive class across folds.

In [105]:
def cv_verification(X, y):
    '''
    Check group cross validation fold function 
    Confirms whether folds are independent and ratio of positives per fold (stratification)
    
    Parameters
    -----------
    X: DataFrame of train features, must contain a 'person' column
    y: Series of train labels
    '''
    total = []
    intersect = []
    positive_ratio = []
    test_lists = []
    test_overlap = []
    
    # split the data passed in, rather than the global X_train / y_train
    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=X.person):
        
        # persons in the train and test part of this fold
        train_persons = X.iloc[train_idx].person
        test_persons = X.iloc[test_idx].person
        
        # add total persons in train and test for each fold
        total.append(train_persons.nunique() + test_persons.nunique())
        
        # check intersection of person between train and test of each fold
        intersect.append(np.intersect1d(train_persons, test_persons))
        
        # checking ratio of completes for each test set
        positive_ratio.append(round(y.iloc[test_idx].sum() / y.iloc[test_idx].count(), 3))
                
        test_lists.append(test_persons)
        
    # check overlap between the first test set and every other test set
    for i in range(1, len(test_lists)):
        test_overlap.append(np.intersect1d(test_lists[0], test_lists[i]))
                            
    print('Total unique persons across train and test: ', total)
    print('Intersection of persons across train and test: ', intersect)
    print('Percentage of positive class per split: ', positive_ratio)
    print('Test overlap with first fold: ', test_overlap)

In [106]:
cv_verification(X_train, y_train)

Total unique persons across train and test:  [13595, 13595, 13595, 13595, 13595]
Intersection of persons across train and test:  [array([], dtype=object), array([], dtype=object), array([], dtype=object), array([], dtype=object), array([], dtype=object)]
Percentage of positive class per split:  [0.435, 0.435, 0.438, 0.441, 0.446]
Test overlap with first fold:  [array([], dtype=object), array([], dtype=object), array([], dtype=object), array([], dtype=object)]


### 1.2. Grid Search with Early Stopping

Early stopping needs each fold's held-out data passed to `fit` as an evaluation set, which scikit-learn's GridSearchCV does not do for us. We will therefore use our own custom grid search function instead.
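As a reminder of what early stopping does, here is a minimal pure-Python sketch of a patience rule: training stops once the validation loss has not improved for `patience` consecutive iterations, and the iteration with the lowest loss is kept. This is a simplified illustration of the idea behind `early_stopping_rounds`, not CatBoost's internal implementation; the function name and the loss values are made up.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, stop_iteration) for per-iteration validation losses."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # new best validation loss: remember where it happened
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            # no improvement for `patience` iterations: stop here
            return best_iter, i
    # ran out of iterations without triggering the rule
    return best_iter, len(val_losses) - 1


# Made-up validation losses: improve until iteration 3, then get worse.
losses = [0.69, 0.60, 0.55, 0.52, 0.53, 0.54, 0.56, 0.58]
best_iter, stopped_at = best_iteration_with_patience(losses, patience=3)
print("best iteration:", best_iter, "stopped at:", stopped_at)
```

Because the best iteration depends on the validation data, it has to be found separately for every fold and every parameter combination, which is what the custom function below does.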

In [7]:
def generate_folds(cv, X_train, y_train):
    '''
    Iterate through cv folds and split into list of folds
    Prints the % of positive class and the length of each fold
    
    Parameters
    -----------
    cv: cross validation generator
    X_train: DataFrame of features
    y_train: Series of labels
               
    Returns
    -------
    train_X, train_y, test_X, test_y: lists of DataFrames / Series, one entry per fold
    '''
    train_X, train_y, test_X, test_y = [], [], [], []
    
    for i in cv:
        train_X.append(X_train.iloc[i[0]])
        train_y.append(y_train.iloc[i[0]])

        test_X.append(X_train.iloc[i[1]])
        test_y.append(y_train.iloc[i[1]])
      
    print('positive classification % per fold and length')
    for i in range(len(train_X)):
        print('train[' + str(i) + ']' , round(train_y[i].sum() / train_y[i].count(), 4), train_y[i].shape)
        print('test[' + str(i) + '] ' , round(test_y[i].sum() / test_y[i].count(), 4), test_y[i].shape)
           
    return train_X, train_y, test_X, test_y

In [107]:
cat_features = [0, 4, 5]

In [277]:
def gridsearch_early_stopping(cv, X, y, folds, grid, cat_features=cat_features, save=None):
    '''
    Perform grid search with early stopping across folds specified by index 
    
    Parameters
    -----------
    cv: cross validation generator
    X: DataFrame
    y: Series
    folds: list of fold indexes
    grid: parameter grid
    cat_features: list of categorical column indexes
    save:   string, excluding file extension (default=None)
            saves results_df for each fold to folder '../../models'
    '''
        
    # generate data folds 
    train_X, train_y, test_X, test_y = generate_folds(cv, X, y)
    
    # iterate through specified folds
    for fold in folds:
        # assign train and test pools
        test_pool = Pool(data=test_X[fold], label=test_y[fold], cat_features=cat_features)
        train_pool = Pool(data=train_X[fold], label=train_y[fold], cat_features=cat_features)

        # column names carry the fold index so folds can be joined side by side later
        columns = ['params' + str(fold), 'logloss' + str(fold), 'AUC' + str(fold), 'iteration' + str(fold)]
        rows = []

        best_score = float('inf')
        best_grid = None

        # iterate through parameter grid
        for params in ParameterGrid(grid):

            # create catboost classifier with parameter params
            model = CatBoostClassifier(cat_features=cat_features,
                                        early_stopping_rounds=50,
                                        task_type='GPU',
                                        custom_loss=['AUC'],
                                        iterations=3000,
                                        **params)

            # fit model
            model.fit(train_pool, eval_set=test_pool, verbose=400)

            # collect results for this parameter combination
            scores = model.get_best_score()['validation']
            print(scores)
            rows.append([params, scores['Logloss'], scores['AUC'], model.get_best_iteration()])

            # save best score and parameters
            if scores['Logloss'] < best_score:
                best_score = scores['Logloss']
                best_grid = params

        # build the results once per fold (DataFrame.append was removed in pandas 2.0)
        results_df = pd.DataFrame(rows, columns=columns)

        print("Best logloss: ", best_score)
        print("Grid:", best_grid)

        # save_file is a project helper that is not defined in this notebook
        if save is not None:
            save_file(results_df, save + str(fold) + '.joblib', dirName='../../models')
        display(results_df)

In [12]:
t = ['t_1', 't_3', 't_7', 't_14', 't_21', 't_30', 't_1c', 't_3c', 
         't_7c', 't_14c', 't_21c', 't_30c']
hist = ['hist_reward_completed',
        'hist_reward_possible', 'hist_difficulty_completed', 
        'hist_difficulty_possible', 'hist_previous_completed',
        'hist_previous_offers', 'hist_viewed_and_completed',
        'hist_complete_not_viewed', 'hist_failed_complete', 
        'hist_viewed', 'hist_received_spend', 'hist_viewed_spend']
day = ['day', 'weekday', 'month', 'year']
last = ['last_amount', 'last_transaction_days']

In [11]:
def remove_features(df, feature_list):
    '''
    Removes specified groups of features from DataFrame
    '''
    remove_list = []
    for features in feature_list:
        for feature in features:
            remove_list.append(feature)
    
    df_removed = df.drop(remove_list, axis=1)         
    
    return df_removed

In [152]:
df = remove_features(df, [t, hist, day, last])

### 1.3. Running Grid Search by Person Split, base features

In [None]:
df = joblib.load('../../data/interim/transcript_final.joblib')
df = src.models.train_model.drop_completion_features(df)
df = remove_features(df, [t, hist, day, last])

# test sample set, comment out for the full run
df = df.iloc[0:1000]

X = df.drop('completed', axis=1)
y = df.completed

cv = GroupShuffleSplit(n_splits=1, test_size=.2, train_size=.8, random_state=42).split(X , y, groups=X.person)
X_train, y_train, X_test, y_test = generate_folds(cv, X_train=X, y_train=y)
X_train=X_train[0]
y_train=y_train[0]
X_test=X_test[0]
y_test=y_test[0]

grid = {"learning_rate": [0.1, .07, 0.03, 0.01, 0.005],
        "max_depth": [5,6,7,8,9,10]}

cat_features = [0, 4, 5]
cv = GroupKFold(n_splits=5).split(X_train, y_train, groups=X_train.person)
folds = list(range(0,5))

# test grid, comment out for the full run
grid = {"learning_rate": [0.1, .07], "max_depth": [5,6]}
folds = [0,1]

gridsearch_early_stopping(cv, X_train, y_train, folds, grid, cat_features=cat_features, save='grid_search_person_no_feaures_27_10_')

### 1.4. Running Grid Search by Person Split, all features

In [None]:
df = joblib.load('../../data/interim/transcript_final.joblib')
df = src.models.train_model.drop_completion_features(df)

# test sample set, comment out for the full run
df = df.iloc[0:1000]

X = df.drop('completed', axis=1)
y = df.completed

cv = GroupShuffleSplit(n_splits=1, test_size=.2, train_size=.8, random_state=42).split(X , y, groups=X.person)
X_train, y_train, X_test, y_test = generate_folds(cv, X_train=X, y_train=y)
X_train=X_train[0]
y_train=y_train[0]
X_test=X_test[0]
y_test=y_test[0]

grid = {"learning_rate": [0.1, .07, 0.03, 0.01, 0.005],
        "max_depth": [5,6,7,8,9,10]}

cat_features = [0, 4, 5]
cv = GroupKFold(n_splits=5).split(X_train, y_train, groups=X_train.person)
folds = list(range(0,5))

# test grid, comment out for the full run
grid = {"learning_rate": [0.1, .07], "max_depth": [5,6]}
folds = [0,1]

gridsearch_early_stopping(cv, X_train, y_train, folds, grid, cat_features=cat_features, save='grid_search_corrected_person_split_27_10')

### 1.5. Running Grid Search by TimeSeriesSplit, base features

In [13]:
df = joblib.load('../../data/interim/transcript_final.joblib')
df = src.models.train_model.drop_completion_features(df)
df.sort_values('time_days', inplace=True)
df = remove_features(df, [t, hist, day, last])

# test sample set, comment out for the full run
df = df.iloc[0:1000]

X = df.drop('completed', axis=1)
y = df.completed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=42)
grid = {"learning_rate": [0.1, .07, 0.03, 0.01, 0.005],
        "max_depth": [5,6,7,8,9,10]}

cat_features = [0, 4, 5]
cv = TimeSeriesSplit(n_splits=5).split(X_train, y_train)
folds = list(range(0,5))

# test grid, comment out for the full run
grid = {"learning_rate": [0.1, .07], "max_depth": [5,6]}
folds = [0,1]

gridsearch_early_stopping(cv, X_train, y_train, folds, grid, cat_features=cat_features, save='grid_search__time_series_no_features_27_10_fold')

positive classification % per fold and length
train[0] 0.3778 (135,)
test[0]  0.3985 (133,)
train[1] 0.3881 (268,)
test[1]  0.3383 (133,)
train[2] 0.3716 (401,)
test[2]  0.3158 (133,)
train[3] 0.3577 (534,)
test[3]  0.3383 (133,)
train[4] 0.3538 (667,)
test[4]  0.3985 (133,)
0:	learn: 0.6301647	test: 0.6332944	best: 0.6332944 (0)	total: 56ms	remaining: 2m 47s


KeyboardInterrupt: 
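The fold lengths printed above follow directly from how `TimeSeriesSplit` works: each fold trains on an expanding window of earlier rows and tests on the block that comes next. The sketch below reproduces the sizes for the 800 training rows left after the 80/20 split of the 1000-row test sample; `X_demo` is a placeholder array, not the real data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 800                      # 80% of the 1000-row test sample
X_demo = np.zeros((n_rows, 1))    # placeholder, only the row count matters

sizes = []
ordered = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    sizes.append((len(train_idx), len(test_idx)))
    # every training row comes before every test row
    ordered.append(train_idx.max() < test_idx.min())

for fold, (n_train, n_test) in enumerate(sizes):
    print("fold", fold, "train rows:", n_train, "test rows:", n_test)
```

Since the data is sorted by `time_days` before splitting, each fold is always evaluated on offers that were sent later than the ones it was trained on.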

### 1.6. Running Grid Search with TimeSeriesSplit all features

In [12]:
df = joblib.load('../../data/interim/transcript_final.joblib')
df = src.models.train_model.drop_completion_features(df)
df.sort_values('time_days', inplace=True)

# test sample set, comment out for the full run
df = df.iloc[0:1000]

X = df.drop('completed', axis=1)
y = df.completed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=42)
grid = {"learning_rate": [0.1, .07, 0.03, 0.01, 0.005],
        "max_depth": [5,6,7,8,9,10]}

cat_features = [0, 4, 5]
cv = TimeSeriesSplit(n_splits=5).split(X_train, y_train)
folds = list(range(0,5))

# test grid, comment out for the full run
grid = {"learning_rate": [0.1, .07], "max_depth": [5,6]}
folds = [0,1]

gridsearch_early_stopping(cv, X_train, y_train, folds, grid, cat_features=cat_features, save='grid_search__time_series_27_10_fold')

positive classification % per fold and length
train[0] 0.3778 (135,)
test[0]  0.3985 (133,)
train[1] 0.3881 (268,)
test[1]  0.3383 (133,)
train[2] 0.3716 (401,)
test[2]  0.3158 (133,)
train[3] 0.3577 (534,)
test[3]  0.3383 (133,)
train[4] 0.3538 (667,)
test[4]  0.3985 (133,)
0:	learn: 0.6441201	test: 0.6720411	best: 0.6720411 (0)	total: 50.1ms	remaining: 2m 30s
bestTest = 0.4804176962
bestIteration = 31
Shrink model to first 32 iterations.
{'Logloss': 0.48041769615689617, 'AUC': 0.8311320543289185}
0:	learn: 0.6348458	test: 0.6734636	best: 0.6734636 (0)	total: 59.8ms	remaining: 2m 59s
bestTest = 0.5027058537
bestIteration = 13
Shrink model to first 14 iterations.
{'Logloss': 0.5027058536845043, 'AUC': 0.8113207519054413}
0:	learn: 0.6581963	test: 0.6777078	best: 0.6777078 (0)	total: 44.5ms	remaining: 2m 13s
bestTest = 0.4809713722
bestIteration = 31
Shrink model to first 32 iterations.
{'Logloss': 0.480971372217164, 'AUC': 0.8240566253662109}
0:	learn: 0.6515422	test: 0.6787093	best: 0

Unnamed: 0,params0,logloss0,AUC0,iteration0
0,"{'learning_rate': 0.1, 'max_depth': 5}",0.480418,0.831132,31
0,"{'learning_rate': 0.1, 'max_depth': 6}",0.502706,0.811321,13
0,"{'learning_rate': 0.07, 'max_depth': 5}",0.480971,0.824057,31
0,"{'learning_rate': 0.07, 'max_depth': 6}",0.469547,0.838208,47


0:	learn: 0.6262044	test: 0.6427626	best: 0.6427626 (0)	total: 132ms	remaining: 6m 34s
bestTest = 0.4832254854
bestIteration = 42
Shrink model to first 43 iterations.
{'Logloss': 0.48322548543600213, 'AUC': 0.8459596037864685}
0:	learn: 0.6262044	test: 0.6427626	best: 0.6427626 (0)	total: 159ms	remaining: 7m 57s
bestTest = 0.4930741733
bestIteration = 37
Shrink model to first 38 iterations.
{'Logloss': 0.493074173317816, 'AUC': 0.8347222208976746}
0:	learn: 0.6450055	test: 0.6567745	best: 0.6567745 (0)	total: 133ms	remaining: 6m 40s
bestTest = 0.4819498277
bestIteration = 72
Shrink model to first 73 iterations.
{'Logloss': 0.48194982772482964, 'AUC': 0.8424242436885834}
0:	learn: 0.6450055	test: 0.6567745	best: 0.6567745 (0)	total: 161ms	remaining: 8m 3s
bestTest = 0.4838668135
bestIteration = 38
Shrink model to first 39 iterations.
{'Logloss': 0.4838668134875764, 'AUC': 0.8443181812763214}
Best logloss:  0.48194982772482964
Grid: {'learning_rate': 0.07, 'max_depth': 5}
saved as ../../

Unnamed: 0,params1,logloss1,AUC1,iteration1
0,"{'learning_rate': 0.1, 'max_depth': 5}",0.483225,0.84596,42
0,"{'learning_rate': 0.1, 'max_depth': 6}",0.493074,0.834722,37
0,"{'learning_rate': 0.07, 'max_depth': 5}",0.48195,0.842424,72
0,"{'learning_rate': 0.07, 'max_depth': 6}",0.483867,0.844318,38


### Notes
* Accuracy looks far superior when we do a completely random train_test_split, with the historical features showing greater importance.
* Need to check that there is no data leakage in this case - will need to look at the best trees and see if this is the case
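One possible source of leakage with a completely random split is that the same person ends up in both train and test, so the historical features describe customers the model has already seen. The sketch below is a toy illustration of that effect on synthetic person ids, not a diagnosis of this dataset; all names and sizes are made up.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Synthetic stand-in: 20 persons with 5 offers each.
n_persons, offers_per_person = 20, 5
person = np.repeat(np.arange(n_persons), offers_per_person)
rows = np.arange(person.size)

# Completely random row split: persons can land on both sides.
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
random_overlap = np.intersect1d(person[train_rows], person[test_rows])

# Group-aware split: every person is kept on one side only.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
g_train, g_test = next(gss.split(rows, groups=person))
group_overlap = np.intersect1d(person[g_train], person[g_test])

print("persons in both train and test, random split:", random_overlap.size)
print("persons in both train and test, group split: ", group_overlap.size)
```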

## Grid Search Results
___

In [2]:
def grid_search_results(raw_file, num_folds):
    '''
    Loads raw cross validation fold results.
    Displays results highlighting best scores
    
    Parameters
    -----------
    raw_file: string, the name of the file excluding fold number and '.joblib' extension
    num_folds: number of cv folds
                
    Returns
    -------
    results_df, best_scores: DataFrames
    '''
    
    # list of folds
    results_files = [0 for i in range(0, num_folds)]
        
    # read results files for each fold
    for i in range(0, num_folds):
        results_files[i] = joblib.load('../../models/' + raw_file + str(i) + ".joblib")            
    
    # join results files in one dataframe
    results_df = pd.concat([results_files[i] for i in range(0, num_folds)], axis=1)
    metrics = int(results_df.shape[1] / num_folds - 1)
    
    # drop extra params columns
    results_df.rename(columns={"params0": "Params"}, inplace=True)
    results_df.drop([i for i in results_df.columns if 'params' in i], axis=1, inplace=True)
    
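    # convert data columns to numeric, leaving non-numeric columns (e.g. Params) unchanged
    # (errors='ignore' is deprecated in recent pandas versions)
    def to_numeric_ignore(x):
        try:
            return pd.to_numeric(x)
        except (ValueError, TypeError):
            return x
    results_df = results_df.apply(to_numeric_ignore)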
    
    # loops through metrics and create mean column for each metric
    metric_names=[]
    for i in results_df.columns[1:metrics+1]:
        i = i[:-1]
        metric_names.append(i)
        results_df[i + '_mean'] = results_df[[x for x in results_df.columns if i in x]].mean(axis=1)
    
    results_df.reset_index(drop=True, inplace=True)
    
    # metrics where lower / higher values are better
    negative_better = ['logloss', 'iteration']
    positive_better = ['AUC']
        
    # collect the best parameters for each metric
    best_rows = []
    for i in metric_names:
        if i in negative_better:
            best_param_idx = results_df[i + '_mean'].idxmin()
        elif i in positive_better:
            best_param_idx = results_df[i + '_mean'].idxmax()
        else:
            continue

        best_rows.append({'Params': results_df.loc[best_param_idx, 'Params'],
                          'Metric': i + '_mean',
                          'Score': results_df.loc[best_param_idx, i + '_mean']})

    # build best_scores in one go (DataFrame.append was removed in pandas 2.0)
    best_scores = pd.DataFrame(best_rows, columns=['Params', 'Metric', 'Score'])

    display(best_scores)
    
    negative_columns = []
    positive_columns = []
    
    # highlight columns where negative metrics are better
    for i in negative_better:
        negative_columns.extend([x for x in results_df.columns if i in x])
    
    # highlight columns where positive metrics are better
    for i in positive_better:
        positive_columns.extend([x for x in results_df.columns if i in x])
        
    display(results_df.style
    .highlight_max(subset = positive_columns, color='lightgreen')
    .highlight_min(subset= negative_columns, color='lightgreen'))
    
    return results_df, best_scores
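The `_mean` columns are plain averages of a metric across the folds. As a sanity check, the sketch below recomputes `logloss_mean` for the first two rows of the "Time Series Split All Features" table shown in the results section; the shortened `Params` labels are only for readability.

```python
import pandas as pd

# Per-fold logloss for two rows of the "Time Series Split All Features" table.
results = pd.DataFrame({
    "Params": ["lr=0.1, depth=5", "lr=0.1, depth=6"],
    "logloss0": [0.421439, 0.421395],
    "logloss1": [0.358939, 0.358469],
    "logloss2": [0.338198, 0.337528],
    "logloss3": [0.322296, 0.311792],
    "logloss4": [0.334839, 0.317564],
})

# Average the per-fold columns row by row, as grid_search_results does.
fold_cols = [c for c in results.columns if c.startswith("logloss")]
results["logloss_mean"] = results[fold_cols].mean(axis=1)

# Lower logloss is better, so the best row is the one with the smallest mean.
best_row = results.loc[results["logloss_mean"].idxmin()]

print(results[["Params", "logloss_mean"]].round(6))
print("lowest mean logloss of these two rows:", best_row["Params"])
```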

### 2.1. Person Split, basic features

In [None]:
results_person, best_scores = grid_search_results('grid_search_person_no_feaures_27_10_', 5)

### 2.2. Person Split All Features

In [266]:
results_person, best_scores = grid_search_results('grid_search_corrected_person_split_27_10', 5)

Unnamed: 0,Params,logloss0,AUC0,iteration0,logloss1,AUC1,iteration1,logloss2,AUC2,iteration2,logloss3,AUC3,iteration3,logloss4,AUC4,iteration4,logloss_mean,AUC_mean,iteration_mean
0,"{'learning_rate': 0.1, 'max_depth': 5}",0.421035,0.879496,143,0.58569,0.775331,3,0.441507,0.875826,108,0.572271,0.814087,4,0.515431,0.859334,47,0.507187,0.840815,61.0
1,"{'learning_rate': 0.1, 'max_depth': 6}",0.452161,0.886498,149,0.566285,0.784513,2,0.549433,0.807813,2,0.486043,0.847403,114,0.403135,0.893299,205,0.491411,0.843905,94.4
2,"{'learning_rate': 0.1, 'max_depth': 7}",0.383702,0.901855,320,0.434363,0.880214,207,0.407983,0.895531,119,0.398933,0.89556,329,0.390315,0.903159,165,0.403059,0.895264,228.0
3,"{'learning_rate': 0.1, 'max_depth': 8}",0.388921,0.900176,272,0.449617,0.881787,197,0.403736,0.893516,192,0.418945,0.892654,171,0.383577,0.908761,196,0.408959,0.895379,205.6
4,"{'learning_rate': 0.1, 'max_depth': 9}",0.384725,0.901958,150,0.408415,0.890495,184,0.3851,0.908188,149,0.398577,0.901521,233,0.501601,0.84166,7,0.415683,0.888764,144.6
5,"{'learning_rate': 0.1, 'max_depth': 10}",0.393124,0.897863,184,0.433599,0.891608,158,0.381672,0.909839,253,0.39214,0.906556,208,0.38994,0.900036,163,0.398095,0.90118,193.2
6,"{'learning_rate': 0.07, 'max_depth': 5}",0.567181,0.762592,8,0.583225,0.746964,5,0.429423,0.884155,155,0.567107,0.817043,12,0.56801,0.750106,4,0.542989,0.792172,36.8
7,"{'learning_rate': 0.07, 'max_depth': 6}",0.401686,0.893019,246,0.567647,0.792648,3,0.401449,0.896622,231,0.554324,0.831145,20,0.55889,0.826098,11,0.496799,0.847907,102.2
8,"{'learning_rate': 0.07, 'max_depth': 7}",0.393888,0.897989,297,0.546241,0.824922,9,0.390171,0.90512,284,0.526225,0.833418,22,0.548863,0.814391,11,0.481078,0.855168,124.6
9,"{'learning_rate': 0.07, 'max_depth': 8}",0.387595,0.901722,212,0.552145,0.819545,15,0.394292,0.902159,184,0.517677,0.856599,16,0.378141,0.907944,224,0.44597,0.877594,130.2


### 2.3. Time Series Split, basic features

In [267]:
results_time, best_scores = grid_search_results('grid_search__time_series_no_features_27_10_fold', 5)

Unnamed: 0,Params,logloss0,AUC0,iteration0,logloss1,AUC1,iteration1,logloss2,AUC2,iteration2,logloss3,AUC3,iteration3,logloss4,AUC4,iteration4,logloss_mean,AUC_mean,iteration_mean
0,"{'learning_rate': 0.1, 'max_depth': 5}",0.420984,0.877393,249,0.3877,0.901614,347,0.367793,0.911156,362,0.342012,0.924812,365,0.344976,0.923275,307,0.372693,0.90765,326.0
1,"{'learning_rate': 0.1, 'max_depth': 6}",0.421132,0.877148,145,0.387052,0.901644,244,0.367225,0.911558,248,0.340982,0.925416,202,0.34447,0.923418,217,0.372172,0.907837,211.2
2,"{'learning_rate': 0.1, 'max_depth': 7}",0.421462,0.87691,276,0.38783,0.901385,125,0.366501,0.911773,292,0.340009,0.925842,196,0.343491,0.923846,306,0.371859,0.907951,239.0
3,"{'learning_rate': 0.1, 'max_depth': 8}",0.421385,0.877021,205,0.388261,0.901477,189,0.367047,0.911549,212,0.340875,0.925565,166,0.343909,0.923796,196,0.372295,0.907882,193.6
4,"{'learning_rate': 0.1, 'max_depth': 9}",0.421891,0.876634,115,0.387667,0.901167,124,0.366518,0.911671,147,0.339422,0.926461,264,0.343488,0.923786,124,0.371797,0.907944,154.8
5,"{'learning_rate': 0.1, 'max_depth': 10}",0.422043,0.877059,206,0.387443,0.901541,102,0.366163,0.912024,156,0.340275,0.925817,130,0.344354,0.923432,139,0.372056,0.907975,146.6
6,"{'learning_rate': 0.07, 'max_depth': 5}",0.420464,0.877451,370,0.387803,0.901427,375,0.366359,0.911796,878,0.340122,0.925777,533,0.344452,0.923564,339,0.37184,0.908003,499.0
7,"{'learning_rate': 0.07, 'max_depth': 6}",0.421325,0.876855,371,0.387559,0.901522,330,0.366736,0.911526,515,0.341335,0.925232,267,0.343097,0.924056,402,0.372011,0.907838,377.0
8,"{'learning_rate': 0.07, 'max_depth': 7}",0.4216,0.876748,258,0.387713,0.901327,358,0.366343,0.911827,464,0.339756,0.925986,262,0.343967,0.923627,424,0.371876,0.907903,353.2
9,"{'learning_rate': 0.07, 'max_depth': 8}",0.42122,0.876997,351,0.388075,0.901366,300,0.366806,0.911749,248,0.339551,0.926215,215,0.34281,0.92418,486,0.371693,0.908101,320.0


### 2.4. Time Series Split All Features

In [268]:
results_time, best_scores = grid_search_results('grid_search__time_series_27_10_fold', 5)

Unnamed: 0,Params,logloss0,AUC0,iteration0,logloss1,AUC1,iteration1,logloss2,AUC2,iteration2,logloss3,AUC3,iteration3,logloss4,AUC4,iteration4,logloss_mean,AUC_mean,iteration_mean
0,"{'learning_rate': 0.1, 'max_depth': 5}",0.421439,0.876941,299,0.358939,0.917626,196,0.338198,0.928035,147,0.322296,0.9368,134,0.334839,0.934359,70,0.355142,0.918752,169.2
1,"{'learning_rate': 0.1, 'max_depth': 6}",0.421395,0.87691,248,0.358469,0.91775,228,0.337528,0.927667,281,0.311792,0.94073,401,0.317564,0.938845,149,0.34935,0.920381,261.4
2,"{'learning_rate': 0.1, 'max_depth': 7}",0.421588,0.877033,181,0.360575,0.916692,289,0.339195,0.927542,126,0.3163,0.93965,185,0.31965,0.939234,163,0.351462,0.92003,188.8
3,"{'learning_rate': 0.1, 'max_depth': 8}",0.421384,0.877418,84,0.361693,0.916895,175,0.340645,0.926997,148,0.313106,0.94126,281,0.31417,0.94032,258,0.3502,0.920578,189.2
4,"{'learning_rate': 0.1, 'max_depth': 9}",0.421963,0.87709,117,0.361769,0.916463,210,0.341446,0.926165,106,0.311114,0.940264,257,0.320903,0.93932,104,0.351439,0.91986,158.8
5,"{'learning_rate': 0.1, 'max_depth': 10}",0.422537,0.876844,85,0.366221,0.915572,131,0.338261,0.92789,161,0.31087,0.940089,207,0.322366,0.939243,168,0.352051,0.919928,150.4
6,"{'learning_rate': 0.07, 'max_depth': 5}",0.421444,0.876789,339,0.359028,0.917715,255,0.339962,0.927348,237,0.315888,0.940605,934,0.325865,0.937499,332,0.352437,0.919991,419.4
7,"{'learning_rate': 0.07, 'max_depth': 6}",0.421456,0.877131,230,0.36067,0.91691,289,0.336868,0.928567,162,0.311883,0.941276,557,0.323619,0.937845,193,0.350899,0.920346,286.2
8,"{'learning_rate': 0.07, 'max_depth': 7}",0.420804,0.877721,324,0.358228,0.918396,247,0.337407,0.928173,289,0.309385,0.941314,473,0.321129,0.938614,157,0.349391,0.920844,298.0
9,"{'learning_rate': 0.07, 'max_depth': 8}",0.422348,0.877189,178,0.361191,0.917997,169,0.341068,0.92684,229,0.313952,0.93969,277,0.311648,0.940798,387,0.350041,0.920503,248.0


# Cross Validation Conclusions

When we compare results  

# ------------------------- NOTES REMOVE LATER ------------------------------


In [None]:
from sklearn.metrics import log_loss

def show_results(model, X_test, y_test):
    '''
    Predicts with model on X_test and compares against y_test, displaying:
    - confusion matrix
    - accuracy
    - log loss
    - classification_report
    
    Parameters
    -----------
    model: fitted classifier with predict and predict_proba methods
    X_test: test features
    y_test: test labels

    Returns
    -------
    accuracy score
    '''
    
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    print(confusion_matrix(y_test, pred), 
          ' accuracy:  ', round(accuracy_score(y_test, pred), 4),
          ' log_loss: ', round(log_loss(y_test, proba), 4)
         )
    print()
    print(classification_report(y_test, pred))
    return accuracy_score(y_test, pred)