# Experiment 1.1 - XGBoost with k-fold cross-validation

The base XGBoost model using default settings gave a reasonable AUC of 0.64.

To try to improve the score without changing any model settings, let's try k-fold cross-validation. Stratified k-fold keeps the ratio of classes the same in each fold, which should give a more reliable picture of how well the model generalises on our imbalanced dataset.
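
As a quick check of what stratification does, the sketch below splits a synthetic label vector and prints the share of positives in each held-out fold. The 80/20 class split and the `_demo`-free names used here are illustrative assumptions, not values taken from the real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, illustrative labels only: 80% class 1, 20% class 0
y = np.array([1] * 800 + [0] * 200)
X = np.zeros((len(y), 1))  # the features play no part in how the folds are drawn

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=8)

# Share of positives in each held-out fold
fold_ratios = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

print("overall positive rate:", y.mean())
print("per-fold positive rates:", np.round(fold_ratios, 3))
```

Each fold's positive rate should sit very close to the overall rate, which is the property we rely on for an imbalanced target.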

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load datasets

In [4]:
# enable autoreload so that edits to the helper file are picked up
%load_ext autoreload
%autoreload 2

# fix system path
import sys
sys.path.append("/home/jovyan/work")

In [54]:
from src.features.helper_functions import load_sets

X_train, y_train, X_val, y_val, X_test = load_sets()

In [55]:
print(X_train)
print(X_val)

[[ 7.416e+03  6.400e+01  1.390e+01 ...  4.000e-01  1.000e-01  7.000e-01]
 [ 4.919e+03  8.100e+01  2.080e+01 ...  6.000e-01  1.000e-01  1.300e+00]
 [ 7.672e+03  5.000e+01  5.600e+00 ...  4.000e-01 -3.000e-01  3.000e-01]
 ...
 [ 5.832e+03  5.500e+01  1.560e+01 ...  5.000e-01  3.000e-01  8.000e-01]
 [ 5.163e+03  9.100e+01  4.790e+01 ...  9.000e-01  6.000e-01  1.600e+00]
 [ 8.346e+03  6.700e+01  2.640e+01 ...  6.000e-01  2.400e+00  1.000e+00]]
[[0.000e+00 5.600e+01 9.100e+00 ... 2.000e-01 3.000e-01 8.000e-01]
 [1.000e+00 4.300e+01 1.930e+01 ... 6.000e-01 0.000e+00 1.800e+00]
 [2.000e+00 8.200e+01 3.390e+01 ... 1.300e+00 3.000e-01 2.000e+00]
 ...
 [3.796e+03 5.300e+01 9.900e+00 ... 4.000e-01 2.000e-01 5.000e-01]
 [3.797e+03 8.900e+01 3.830e+01 ... 1.300e+00 3.000e-01 2.400e+00]
 [3.798e+03 5.500e+01 1.200e+01 ... 3.000e-01 2.000e-01 1.200e+00]]


## k-fold settings
Example taken from https://machinelearningmastery.com/evaluate-gradient-boosting-models-xgboost-python/

In [58]:
from sklearn.model_selection import StratifiedKFold

# choose the number of splits; shuffle=True with random_state=8 makes the folds reproducible
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=8)

## Train model

Follow the same process as the base model, this time using k-fold cross-validation.

In [61]:
%pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.3-py3-none-manylinux2010_x86_64.whl (157.5 MB)

In [62]:
from xgboost import XGBClassifier
# TODO: the environment won't recognise the Pipfile, so xgboost is installed with pip above; revisit this

In [63]:
# instantiate model
model = XGBClassifier()

In [64]:
# evaluate on the train set using sklearn's cross_val_score
from sklearn.model_selection import cross_val_score

results = cross_val_score(model, X_train, y_train, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))



Accuracy: 81.81% (0.50%)


Accuracy on the validation set in the base model was slightly higher, at 82%.

Next, repeat this on all of the training data and with different numbers of folds.
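
One thing to keep in mind: the baseline was quoted as an AUC, while `cross_val_score` reports accuracy for a classifier unless told otherwise. The sketch below shows how the same call could return AUC through the `scoring="roc_auc"` argument. It runs on synthetic data with sklearn's `GradientBoostingClassifier` as a stand-in so that it is self-contained; `X_demo`, `y_demo`, `demo_model` and `kfold_demo` are hypothetical names and none of the numbers it prints relate to our dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.2, 0.8], random_state=8
)

# Stand-in model (assumption); XGBClassifier() could be passed in the same way
demo_model = GradientBoostingClassifier(n_estimators=20, random_state=8)
kfold_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

# Default scoring for a classifier is accuracy
acc_scores = cross_val_score(demo_model, X_demo, y_demo, cv=kfold_demo)
# Same folds, scored by area under the ROC curve instead
auc_scores = cross_val_score(
    demo_model, X_demo, y_demo, cv=kfold_demo, scoring="roc_auc"
)

print("Accuracy: %.2f%% (%.2f%%)" % (acc_scores.mean() * 100, acc_scores.std() * 100))
print("AUC: %.3f (%.3f)" % (auc_scores.mean(), auc_scores.std()))
```

Scoring the cross-validation runs with AUC would make them directly comparable with the 0.64 from the base model.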


In [65]:
results

array([0.8125   , 0.8140625, 0.81875  , 0.8234375, 0.8203125, 0.8109375,
       0.81875  , 0.828125 , 0.8140625, 0.8203125])

In [66]:
# import dataset - train
training_data = pd.read_csv('../data/raw/train (1).csv')

In [67]:
df_cleaned = training_data.copy()

In [68]:
# remove Id_old
df_cleaned.drop('Id_old', axis=1, inplace=True)

In [69]:
df_cleaned.shape

(8000, 21)

In [70]:
# separate the target column (y) from the features
target = df_cleaned.pop('TARGET_5Yrs')

In [71]:
# use all training data with 10 folds
results_10 = cross_val_score(model, df_cleaned, target, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results_10.mean()*100, results_10.std()*100))



Accuracy: 82.23% (0.49%)


In [72]:
# use fewer folds
kfold5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

results_5 = cross_val_score(model, df_cleaned, target, cv=kfold5)
print("Accuracy: %.2f%% (%.2f%%)" % (results_5.mean()*100, results_5.std()*100))



Accuracy: 81.92% (0.23%)


In [73]:
# use more folds
kfold20 = StratifiedKFold(n_splits=20, shuffle=True, random_state=8)

results_20 = cross_val_score(model, df_cleaned, target, cv=kfold20)
print("Accuracy: %.2f%% (%.2f%%)" % (results_20.mean()*100, results_20.std()*100))



Accuracy: 81.91% (1.05%)
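
The spread across folds grows with the number of folds (0.23%, 0.49% and 1.05% for 5, 10 and 20 folds). One likely contributor is that each held-out fold gets smaller. The rough sketch below works out the fold sizes for the 8000 training rows and a simple binomial standard error for an accuracy of about 82%. The binomial formula is an assumption that treats every prediction as independent, so only the trend, not the exact values, should be read from it.

```python
import math

n_rows = 8000    # rows in the full training set (df_cleaned.shape above)
accuracy = 0.82  # approximate accuracy reported above

fold_stats = {}
for k in (5, 10, 20):
    fold_size = n_rows // k  # rows in each held-out fold
    # rough binomial standard error of an accuracy measured on one fold
    std_err = math.sqrt(accuracy * (1 - accuracy) / fold_size)
    fold_stats[k] = (fold_size, std_err)
    print(f"{k:>2} folds: {fold_size} rows per held-out fold, "
          f"rough standard error {std_err * 100:.2f}%")
```

Smaller held-out folds give noisier per-fold scores, so a wider spread at 20 folds does not by itself mean the model is less stable.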


## Verdict

Using k-fold cross-validation did not change the accuracy by any meaningful amount: every run came out between 81.8% and 82.2%, in line with the 82% from the base model's validation set.

Even though the classes are imbalanced, evaluating on smaller random folds doesn't seem to make a difference.

We can keep using the train/val split to evaluate models.