In [16]:
from __future__ import division  # __future__ imports must come first
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
from sklearn.metrics import confusion_matrix, make_scorer, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

warnings.filterwarnings("ignore")

This imbalanced classification problem can be approached in several ways. First, since the data set is large and only 3.6% of the
samples are positive, the majority class can be undersampled, at the cost of discarding some information. Second, ensemble learning methods such as balance cascade split the data into
balanced subsets while retaining all the information. Third, an XGBoost model can be trained on the whole data set, with sample weights giving more importance to the positive class.
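
The first approach, random undersampling of the majority class, can be sketched as follows. This is a minimal illustration on toy data, not the code used in this notebook; the helper name `undersample_majority` and the default of one negative per positive are assumptions.

```python
import numpy as np

def undersample_majority(X, y, ratio=1.0, seed=0):
    """Keep every positive row and a random subset of the negatives.

    ratio is the number of negatives kept per positive (hypothetical default).
    """
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(neg_idx), int(round(ratio * len(pos_idx))))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    idx = np.concatenate([pos_idx, kept_neg])
    rng.shuffle(idx)  # avoid all positives coming first
    return X[idx], y[idx]

# Toy data with roughly the same 3.6% positive rate as the training set.
rng = np.random.RandomState(42)
y_toy = (rng.rand(10000) < 0.036).astype(int)
X_toy = rng.rand(10000, 5)
X_bal, y_bal = undersample_majority(X_toy, y_toy)
print("positive rate after undersampling: %.3f" % y_bal.mean())
```

Every discarded negative row is information the model never sees, which is the loss mentioned above.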

After trying all of the above methods, we decided to go with the third approach, which gave the best results.

In the given data set the raw features are encoded, so meaningful feature engineering is not possible for this problem. This
has been a major obstacle to attaining good accuracy.

**Importing the training data**

In [3]:
df=pd.read_csv("final_train.csv")

In [4]:
df.columns

Index([u'Unnamed: 0', u'id', u'ps_calc_01', u'ps_calc_02', u'ps_calc_03',
       u'ps_calc_04', u'ps_calc_05', u'ps_calc_06', u'ps_calc_07',
       u'ps_calc_08',
       ...
       u'ps_ind_02_cat-4', u'ps_ind_04_cat-0', u'ps_ind_04_cat-1',
       u'ps_ind_05_cat-0', u'ps_ind_05_cat-1', u'ps_ind_05_cat-2',
       u'ps_ind_05_cat-3', u'ps_ind_05_cat-4', u'ps_ind_05_cat-5',
       u'ps_ind_05_cat-6'],
      dtype='object', length=216)

In [5]:
del(df[u'Unnamed: 0'])
del(df[u'id'])

In [6]:
list_predictors=list(df.columns)
list_predictors.remove('target')
# .values returns the underlying NumPy array (get_values() is deprecated in newer pandas)
X=df[list_predictors].values
Y=df['target'].values

In [7]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=42,stratify=Y)

In [8]:
print("The hit rate in train data: %.3f" %(Y_train.mean()))
print("The hit rate in test data: %.3f" %(Y_test.mean()))

The hit rate in train data: 0.036
The hit rate in test data: 0.036


Since feature engineering is not possible, we will tune the hyperparameters of the model to attain the highest AUC.
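
AUC is used rather than accuracy because, with only 3.6% positives, a model that predicts "negative" for every row is already 96.4% accurate. AUC can be read as the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The sketch below computes it by brute force on a made-up example; the function name `pairwise_auc` and the toy labels and scores are assumptions for illustration only.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. The cost is O(n_pos * n_neg), so this is only
    suitable for small examples.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7]
print(pairwise_auc(labels, scores))  # 3 of 6 pairs are ordered correctly -> 0.5
```

For real data, `roc_auc_score` from scikit-learn (already imported above) computes the same quantity efficiently.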

In [9]:
#Data Frames for training after randomised split
df_train_predictors=pd.DataFrame(X_train,columns=list_predictors)
df_train_labels=pd.DataFrame(Y_train,columns=['target'])

In [10]:
#Data Frames for testing after randomised split
df_test_predictors=pd.DataFrame(X_test,columns=list_predictors)
df_test_labels=pd.DataFrame(Y_test,columns=['target'])

Step 1: Start with some default hyperparameters

In [11]:
XGB=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=1000, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=15, base_score=0.5, random_state=0, seed=0, missing=None)

In [12]:
XGB.fit(df_train_predictors,df_train_labels,eval_set=[(df_test_predictors,df_test_labels)],eval_metric='auc',early_stopping_rounds=50,verbose=True)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


[0]	validation_0-auc:0.591476
Will train until validation_0-auc hasn't improved in 50 rounds.
[1]	validation_0-auc:0.595975
[2]	validation_0-auc:0.602613
[3]	validation_0-auc:0.608549
[4]	validation_0-auc:0.60917
[5]	validation_0-auc:0.613519
[6]	validation_0-auc:0.616597
[7]	validation_0-auc:0.61802
[8]	validation_0-auc:0.619402
[9]	validation_0-auc:0.620097
[10]	validation_0-auc:0.620543
[11]	validation_0-auc:0.621739
[12]	validation_0-auc:0.622338
[13]	validation_0-auc:0.622913
[14]	validation_0-auc:0.624365
[15]	validation_0-auc:0.624535
[16]	validation_0-auc:0.625862
[17]	validation_0-auc:0.626534
[18]	validation_0-auc:0.626687
[19]	validation_0-auc:0.627232
[20]	validation_0-auc:0.628326
[21]	validation_0-auc:0.628755
[22]	validation_0-auc:0.629146
[23]	validation_0-auc:0.629643
[24]	validation_0-auc:0.630028
[25]	validation_0-auc:0.630518
[26]	validation_0-auc:0.630795
[27]	validation_0-auc:0.630821
[28]	validation_0-auc:0.631002
[29]	validation_0-auc:0.631747
[30]	validation_0-

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)
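
The stopping rule announced in the log ("Will train until validation_0-auc hasn't improved in 50 rounds") can be sketched in plain Python. This illustrates the rule only and is not XGBoost's implementation; the function name `best_round` and the AUC values are hypothetical.

```python
def best_round(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric.

    Training stops once `patience` rounds have passed without beating the
    best score seen so far; stop_index is the last round that was evaluated.
    """
    best_idx = 0
    for i, s in enumerate(scores):
        if s > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(scores) - 1

# Hypothetical per-round validation AUCs (not taken from the log above).
aucs = [0.59, 0.60, 0.61, 0.62, 0.619, 0.618, 0.6195, 0.617]
print(best_round(aucs, patience=3))  # (3, 6)
```

The best round found this way is a natural choice for a fixed `n_estimators` when the model is refit without an evaluation set.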

In [13]:
XGB.set_params(n_estimators=189)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=189,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)

In [14]:
XGB.fit(df_train_predictors,df_train_labels,eval_metric='auc')

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=189,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)

The above default parameters gave an AUC of 0.645. In the next steps we will see whether tuning these parameters gives a score above
this.

**Tuning max_depth & min_child_weight**

max_depth is the maximum depth of each tree, i.e. the largest number of splits allowed along any path from the root to a leaf. It plays the same role as max depth in the Random Forest algorithm.

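min_child_weight is the minimum sum of instance weights (hessians) required in a child node. A split that would create a child below this value is not made, so it limits how finely a tree of a given max_depth can partition the data.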
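
For the `binary:logistic` objective, the hessian of a row with predicted probability p is p * (1 - p), so the min_child_weight check can be sketched as below. This is a simplified illustration that assumes unit sample weights unless others are passed; the helper names `child_hessian_sum` and `split_allowed` are hypothetical and not part of the XGBoost API.

```python
import numpy as np

def child_hessian_sum(pred_prob, sample_weight=None):
    """Sum of logistic-loss hessians p * (1 - p), optionally weighted."""
    p = np.asarray(pred_prob, dtype=float)
    if sample_weight is None:
        w = np.ones_like(p)
    else:
        w = np.asarray(sample_weight, dtype=float)
    return float(np.sum(w * p * (1.0 - p)))

def split_allowed(left_probs, right_probs, min_child_weight):
    """A split is kept only if both children reach min_child_weight."""
    return (child_hessian_sum(left_probs) >= min_child_weight and
            child_hessian_sum(right_probs) >= min_child_weight)

# With base_score=0.5 and unit weights, every row has hessian 0.25 at the
# first boosting round, so a child needs at least 20 rows to reach 5.
print(split_allowed([0.5] * 20, [0.5] * 100, min_child_weight=5))  # True
print(split_allowed([0.5] * 19, [0.5] * 100, min_child_weight=5))  # False
```

As predictions move away from 0.5 the hessians shrink, so the same min_child_weight demands more rows per leaf in later rounds.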

In [19]:
tuning_parameters={'max_depth':[1,3,5],'min_child_weight':[1,3,5]}
gridsearch1=GridSearchCV(XGB,param_grid=tuning_parameters,scoring='roc_auc',cv=5)
gridsearch1.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=189,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [1, 3, 5], 'min_child_weight': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [20]:
gridsearch1.best_params_

{'max_depth': 3, 'min_child_weight': 5}

Grid search gives max_depth=3 and min_child_weight=5 as the best parameters. We will include these in the next step to check the
performance.

In [21]:
XGB3=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=5, missing=None, n_estimators=189,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)

In [22]:
XGB3.fit(df_train_predictors,df_train_labels,eval_metric='auc',eval_set=[(df_test_predictors,df_test_labels)])

[0]	validation_0-auc:0.591476
[1]	validation_0-auc:0.595975
[2]	validation_0-auc:0.602613
[3]	validation_0-auc:0.608549
[4]	validation_0-auc:0.60917
[5]	validation_0-auc:0.613519
[6]	validation_0-auc:0.616597
[7]	validation_0-auc:0.61802
[8]	validation_0-auc:0.619402
[9]	validation_0-auc:0.620097
[10]	validation_0-auc:0.620543
[11]	validation_0-auc:0.621739
[12]	validation_0-auc:0.622338
[13]	validation_0-auc:0.622913
[14]	validation_0-auc:0.624365
[15]	validation_0-auc:0.624535
[16]	validation_0-auc:0.625862
[17]	validation_0-auc:0.626534
[18]	validation_0-auc:0.626687
[19]	validation_0-auc:0.627232
[20]	validation_0-auc:0.628326
[21]	validation_0-auc:0.628755
[22]	validation_0-auc:0.629146
[23]	validation_0-auc:0.629643
[24]	validation_0-auc:0.630028
[25]	validation_0-auc:0.630518
[26]	validation_0-auc:0.630795
[27]	validation_0-auc:0.630821
[28]	validation_0-auc:0.631002
[29]	validation_0-auc:0.631747
[30]	validation_0-auc:0.632283
[31]	validation_0-auc:0.63261
[32]	validation_0-auc

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=5, missing=None, n_estimators=189,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)

**Tuning max_delta_step**

In [23]:
XGB4=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=5, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1)
tuning_parameters={'max_delta_step':[0,2,4,6,8,10]}
gridsearch2=GridSearchCV(XGB4,param_grid=tuning_parameters,scoring='roc_auc',cv=5)
gridsearch2.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=5, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_delta_step': [0, 2, 4, 6, 8, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [24]:
gridsearch2.best_params_

{'max_delta_step': 0}

A max_delta_step of 0 is the best value for this model, so we will leave
max_delta_step as it is.

**Tuning gamma**

In [27]:
gamma_tuning={'gamma':[i/10 for i in np.arange(1,6)]}

In [28]:
gridsearch3=GridSearchCV(XGB4,param_grid=gamma_tuning,scoring='roc_auc',cv=5)
gridsearch3.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'gamma': [0.1, 0.2, 0.3, 0.4, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [29]:
gridsearch3.grid_scores_

[mean: 0.63634, std: 0.00427, params: {'gamma': 0.1},
 mean: 0.63634, std: 0.00427, params: {'gamma': 0.2},
 mean: 0.63634, std: 0.00427, params: {'gamma': 0.3},
 mean: 0.63634, std: 0.00427, params: {'gamma': 0.4},
 mean: 0.63634, std: 0.00427, params: {'gamma': 0.5}]

In [30]:
gridsearch3.best_params_

{'gamma': 0.1}

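All five gamma values give the same mean AUC (0.63634), so best_params_ simply reports the first value in the grid, 0.1. Gamma has no measurable effect in this range, so we leave it at its default of 0 for now.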
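
gamma is the minimum loss reduction a split must achieve to be kept. One plausible reason the scores do not move is that the splits being made have gains well above 0.5, so subtracting a small gamma changes nothing. The sketch below uses the standard second-order split gain with the L1 term (reg_alpha) assumed to be 0; the gradient and hessian sums are hypothetical numbers, not values from this model.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, net of the gamma penalty.

    g_* and h_* are the sums of gradients and hessians in each child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma

# Hypothetical gradient/hessian sums for the two children.
for gamma in (0.0, 0.1, 0.5):
    print(gamma, round(split_gain(-12.0, 30.0, 8.0, 25.0, gamma=gamma), 4))
```

A split is kept only when this value is positive, so gamma starts to prune only once it approaches the size of the raw gains.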

**Tuning subsample and colsample_bytree**

In [31]:
sample_parameters={'subsample':[0.6,0.7,0.8,1.0],'colsample_bytree':[0.6,0.7,0.8,1.0]}
gridsearch4=GridSearchCV(XGB4,param_grid=sample_parameters,scoring='roc_auc',cv=5)
gridsearch4.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'subsample': [0.6, 0.7, 0.8, 1.0], 'colsample_bytree': [0.6, 0.7, 0.8, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [32]:
gridsearch4.grid_scores_

[mean: 0.63621, std: 0.00474, params: {'subsample': 0.6, 'colsample_bytree': 0.6},
 mean: 0.63647, std: 0.00447, params: {'subsample': 0.7, 'colsample_bytree': 0.6},
 mean: 0.63711, std: 0.00458, params: {'subsample': 0.8, 'colsample_bytree': 0.6},
 mean: 0.63681, std: 0.00405, params: {'subsample': 1.0, 'colsample_bytree': 0.6},
 mean: 0.63651, std: 0.00456, params: {'subsample': 0.6, 'colsample_bytree': 0.7},
 mean: 0.63632, std: 0.00387, params: {'subsample': 0.7, 'colsample_bytree': 0.7},
 mean: 0.63704, std: 0.00400, params: {'subsample': 0.8, 'colsample_bytree': 0.7},
 mean: 0.63655, std: 0.00405, params: {'subsample': 1.0, 'colsample_bytree': 0.7},
 mean: 0.63669, std: 0.00407, params: {'subsample': 0.6, 'colsample_bytree': 0.8},
 mean: 0.63647, std: 0.00391, params: {'subsample': 0.7, 'colsample_bytree': 0.8},
 mean: 0.63744, std: 0.00403, params: {'subsample': 0.8, 'colsample_bytree': 0.8},
 mean: 0.63617, std: 0.00427, params: {'subsample': 1.0, 'colsample_bytree': 0.8},
 mea

In [33]:
gridsearch4.best_params_

{'colsample_bytree': 1.0, 'subsample': 0.8}

The best combination is subsample=0.8 and colsample_bytree=1.0.
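
What these two parameters control can be sketched with NumPy: subsample is the fraction of training rows drawn for each tree and colsample_bytree the fraction of columns. This is only an illustration of the idea; XGBoost's internal sampling may differ in detail, and the helper name `sample_rows_and_columns` is hypothetical.

```python
import numpy as np

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed=0):
    """Pick the row and column indices that one tree would be grown on."""
    rng = np.random.RandomState(seed)
    n_r = max(1, int(subsample * n_rows))
    n_c = max(1, int(colsample_bytree * n_cols))
    rows = rng.choice(n_rows, size=n_r, replace=False)
    cols = rng.choice(n_cols, size=n_c, replace=False)
    return np.sort(rows), np.sort(cols)

# 213 predictors is assumed here: the 216 columns minus 'Unnamed: 0', 'id'
# and 'target'. The row count of 1000 is a toy value.
rows, cols = sample_rows_and_columns(1000, 213, subsample=0.8, colsample_bytree=1.0)
print(len(rows), len(cols))  # 800 213
```

With subsample=0.8 each tree sees a different 80% of the rows, which adds randomness between trees while every tree still has access to all columns.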

**Tuning regularisation parameters**

In [34]:
XGB4.set_params(subsample=0.8,colsample_bytree=1)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=0.8)

In [35]:
alpha_grid={'reg_alpha':[0,1e-5,1e-2,1,10,100]}
Gridsearch5=GridSearchCV(XGB4,param_grid=alpha_grid,scoring='roc_auc',cv=5)
Gridsearch5.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=0.8),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'reg_alpha': [0, 1e-05, 0.01, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [36]:
Gridsearch5.grid_scores_

[mean: 0.63747, std: 0.00403, params: {'reg_alpha': 0},
 mean: 0.63747, std: 0.00403, params: {'reg_alpha': 1e-05},
 mean: 0.63743, std: 0.00398, params: {'reg_alpha': 0.01},
 mean: 0.63738, std: 0.00417, params: {'reg_alpha': 1},
 mean: 0.63813, std: 0.00468, params: {'reg_alpha': 10},
 mean: 0.63773, std: 0.00437, params: {'reg_alpha': 100}]

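reg_alpha=10 gives the highest mean AUC (0.63813), but its gain over reg_alpha=0 (0.63747) is far smaller than the cross-validation standard deviation (about 0.004), so we keep reg_alpha at 0.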
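
When grid scores are this close, it helps to compare the gap between two settings with the spread across folds. The sketch below applies a simple one-standard-deviation rule of thumb to the scores printed above; the rule and the helper name `within_noise` are assumptions of this sketch, not a formal significance test.

```python
def within_noise(best, baseline):
    """True if the mean gap is no larger than the baseline's CV std.

    best and baseline are (mean, std) tuples.
    """
    return (best[0] - baseline[0]) <= baseline[1]

# (mean, std) pairs copied from the grid scores above.
alpha_scores = {
    0: (0.63747, 0.00403),
    1e-05: (0.63747, 0.00403),
    0.01: (0.63743, 0.00398),
    1: (0.63738, 0.00417),
    10: (0.63813, 0.00468),
    100: (0.63773, 0.00437),
}
best_alpha = max(alpha_scores, key=lambda a: alpha_scores[a][0])
print(best_alpha, within_noise(alpha_scores[best_alpha], alpha_scores[0]))  # 10 True
```

A gap of 0.00066 against a fold-to-fold spread of about 0.004 gives little reason to prefer one value over the other.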

**Since this is an imbalanced problem, we try different values of scale_pos_weight to see which one gives the best result**

In [39]:
pos_sample_weights={'scale_pos_weight':[1,5,10,15,20,25,30]}
Gridsearchcv6=GridSearchCV(XGB4,param_grid=pos_sample_weights,scoring='roc_auc',cv=5)
Gridsearchcv6.fit(df_train_predictors,df_train_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=0.8),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'scale_pos_weight': [1, 5, 10, 15, 20, 25, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [40]:
Gridsearchcv6.grid_scores_

[mean: 0.63715, std: 0.00370, params: {'scale_pos_weight': 1},
 mean: 0.63744, std: 0.00436, params: {'scale_pos_weight': 5},
 mean: 0.63724, std: 0.00436, params: {'scale_pos_weight': 10},
 mean: 0.63747, std: 0.00403, params: {'scale_pos_weight': 15},
 mean: 0.63694, std: 0.00441, params: {'scale_pos_weight': 20},
 mean: 0.63697, std: 0.00469, params: {'scale_pos_weight': 25},
 mean: 0.63666, std: 0.00439, params: {'scale_pos_weight': 30}]

In [41]:
Gridsearchcv6.best_params_

{'scale_pos_weight': 15}

A scale_pos_weight of 15 gives the best score, although the differences across the grid are within one standard deviation.
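
For reference, a commonly suggested starting value for scale_pos_weight is the ratio of negative to positive samples. The small sketch below derives it from the hit rate printed for the training split; the function name `balanced_scale_pos_weight` is hypothetical.

```python
def balanced_scale_pos_weight(positive_rate):
    """Ratio of negatives to positives implied by a positive-class rate."""
    return (1.0 - positive_rate) / positive_rate

# Hit rate printed for the training split earlier in the notebook.
print("%.1f" % balanced_scale_pos_weight(0.036))  # 26.8
```

The grid values 25 and 30 bracket this ratio, yet 15 scored marginally higher on AUC, which suggests the ratio is a starting point rather than an optimum for this data.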

In [44]:
XGB4.set_params(scale_pos_weight=15)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=0.8)

Checking the performance of the fully updated model

In [45]:
XGB4.fit(df_train_predictors,df_train_labels,eval_metric='auc',eval_set=[(df_test_predictors,df_test_labels)])

[0]	validation_0-auc:0.589986
[1]	validation_0-auc:0.594509
[2]	validation_0-auc:0.602985
[3]	validation_0-auc:0.605613
[4]	validation_0-auc:0.607152
[5]	validation_0-auc:0.612631
[6]	validation_0-auc:0.614455
[7]	validation_0-auc:0.616436
[8]	validation_0-auc:0.616451
[9]	validation_0-auc:0.617587
[10]	validation_0-auc:0.619255
[11]	validation_0-auc:0.62017
[12]	validation_0-auc:0.620511
[13]	validation_0-auc:0.622017
[14]	validation_0-auc:0.622575
[15]	validation_0-auc:0.624018
[16]	validation_0-auc:0.624852
[17]	validation_0-auc:0.625906
[18]	validation_0-auc:0.626235
[19]	validation_0-auc:0.626954
[20]	validation_0-auc:0.627123
[21]	validation_0-auc:0.628209
[22]	validation_0-auc:0.628929
[23]	validation_0-auc:0.629643
[24]	validation_0-auc:0.630016
[25]	validation_0-auc:0.630538
[26]	validation_0-auc:0.630913
[27]	validation_0-auc:0.63126
[28]	validation_0-auc:0.632173
[29]	validation_0-auc:0.6322
[30]	validation_0-auc:0.632721
[31]	validation_0-auc:0.632683
[32]	validation_0-auc:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=6, missing=None, n_estimators=126,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=15, seed=0,
       silent=False, subsample=0.8)

The final model gives an AUC of 0.640376 on the validation set. As mentioned earlier, the accuracy of any model depends on
feature engineering, the choice of machine learning model, and the optimisation of that model's parameters.

In the current exercise we have not tested different models, but XGBoost has given the best results on many previous
problems, so it is reasonable to go forward with that assumption. We have also tried different hyperparameter settings,
but none of them yielded any meaningful improvement in the results. Part of the limitation comes from the obfuscated features, which prevent
us from doing any real feature engineering.

Overall, this problem is more of a parameter optimisation exercise, and we have explored the various hyperparameters available
for XGBoost.

The next attempt to improve the score on this data set will use neural networks.