# Model selection
This notebook trains different gradient boosting models on the cleaned version of the application training data and produces predictions for the application test data. The notebook is structured as follows:
- Data preparation 
    - load cleaned and merged data
    - create train and validation sets for model selection
- Model selection
    - tune hyperparameters based on out-of-sample (OOS) AUC on the validation set
    - save best model
    - examine predictions of best model for further ideas
- Make predictions on test set
    - load selected model
    - predict on test set and create submission file

## Load cleaned data

In [1]:
import numpy as np
import pandas as pd

import basic_application_data_cleaner as cleaner

In [2]:
# location of all the unzipped data files on the local machine
path_to_kaggle_data='~/kaggle_JPFGM/Data/'
# load cleaned training and test data
df_train, df_test = cleaner.load_cleaned_application_data(path_to_kaggle_data)

Raw training data size: (307511, 121)
Raw test data size: (48744, 120)
Cleaned training data shape:  (307511, 246)
Cleaned testing data shape:  (48744, 245)


## Create train and validation sets

In [3]:
from sklearn.model_selection import train_test_split
import time
from sklearn.metrics import accuracy_score, roc_auc_score

In [4]:
# all of the training data (with labels)
X = df_train.drop(['TARGET'], axis=1)
y = df_train['TARGET']

In [14]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)

In [15]:
y_train.value_counts(normalize=True)

0    0.91927
1    0.08073
Name: TARGET, dtype: float64

In [16]:
y_val.value_counts(normalize=True)

0    0.919274
1    0.080726
Name: TARGET, dtype: float64

Note: X, X_train and X_val are still pandas DataFrames, not NumPy arrays.
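
The matching class shares above come from `stratify=y`. A minimal sketch of the same effect on synthetic data, assuming labels with roughly 8% positives (`X_demo` and `y_demo` are made up for illustration and are not the application data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: roughly 8% positives, 3 random features.
rng = np.random.RandomState(42)
y_demo = (rng.rand(10000) < 0.08).astype(int)
X_demo = rng.rand(10000, 3)

# Same split settings as the cell above.
_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

# With stratification the positive rate is almost identical in both parts.
rate_tr = y_tr.mean()
rate_va = y_va.mean()
print("train positive rate: %.4f" % rate_tr)
print("val positive rate:   %.4f" % rate_va)
```

Without `stratify`, the two rates would only agree up to sampling noise, which matters more when the positive class is this rare.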

## Gradient Boosting Machines
Note that gradient boosting is implemented by several libraries. This notebook compares the output, performance and speed of sklearn's GradientBoostingClassifier and xgboost.

Main hyperparameters to tune:
- learning_rate and n_estimators: set a small learning rate and choose n_estimators by early stopping. warm_start lets an already fitted model add further trees, which helps when growing the ensemble step by step for early stopping
- max_depth: tune it; the best value depends on the input and could be fairly large given our large number of features
- subsample: to decorrelate trees, use subsample = 0.5 (each tree is fit on a random half of the training rows). With subsample < 1 the model also produces out-of-bag estimates, so this could be done on the whole of df_train
- max_features: to decorrelate trees and reduce run time, use a small max_features
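
A minimal sketch of how these settings fit together in sklearn, on synthetic data. The specific values (learning rate 0.05, 50 trees, `max_features='sqrt'`) are placeholders for illustration, not tuned choices from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, imbalanced stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)

gbc_demo = GradientBoostingClassifier(
    learning_rate=0.05,    # small learning rate
    n_estimators=50,       # would normally be chosen by early stopping
    max_depth=3,
    subsample=0.5,         # each tree sees a random half of the rows
    max_features='sqrt',   # each split considers a subset of the features
    random_state=0,
)
gbc_demo.fit(X_demo, y_demo)

# With subsample < 1, sklearn records the per-stage change in loss
# measured on the rows left out of each tree's sample (out-of-bag rows).
oob_gain = gbc_demo.oob_improvement_
print(oob_gain.shape)
```

`oob_improvement_` is only available when `subsample` is below 1.0, which is what makes an out-of-bag check possible without a separate validation set.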

### Using sklearn.ensemble.GradientBoostingClassifier
Original sklearn module for tree boosting. It is used like other sklearn models.

In [7]:
from sklearn.ensemble import GradientBoostingClassifier

#### Fit model and get out-of-sample performance

In [8]:
# fit sklearn gradient boosting model on training data
start = time.time()

# by default fits 100 trees of max_depth 3
gbc_clf = GradientBoostingClassifier(learning_rate=0.1, random_state=0)
gbc_clf.fit(X_train, y_train)

end = time.time()
print('Execution time:', np.round(end - start, 1), 'seconds')

Execution time: 184.3 seconds


In [10]:
# model performance on validation set
y_proba_gbc = gbc_clf.predict_proba(X_val)[:,1]

auc_gbc = roc_auc_score(y_val, y_proba_gbc)
print("AUC: %.2f%%" % (auc_gbc * 100.0))

AUC: 75.27%


#### Tune model

In [16]:
# choose n_estimators by early stopping: train a large number of trees, then find the optimal number
# note: this runs slowly
gbc_clf = GradientBoostingClassifier(learning_rate=0.5, n_estimators=120, random_state=0)
gbc_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=120,
              presort='auto', random_state=0, subsample=1.0, verbose=0,
              warm_start=False)

In [17]:
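# validation AUC after each boosting stage (entry i uses the first i + 1 trees)
auc_gbc_staged = [roc_auc_score(y_val, proba[:,1]) for proba in gbc_clf.staged_predict_proba(X_val)]
# note: np.argmax returns a 0-based index, so the corresponding number of trees is bst_n_estimator + 1
bst_n_estimator = np.argmax(auc_gbc_staged)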

In [37]:
print('optimal num_estimators = %d with validation set AUC = %.2f%%' 
      %(bst_n_estimator, max(auc_gbc_staged)*100))

optimal num_estimators = 96 with validation set AUC = 75.58%
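
The staged search above can be sketched end to end on synthetic data. This is an illustrative version only: `X_demo`, `y_demo` and the choice of 60 trees are assumptions, and the sketch converts the 0-based stage index into a tree count explicitly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

# Train more trees than expected to be needed.
gbc_demo = GradientBoostingClassifier(learning_rate=0.5, n_estimators=60, random_state=0)
gbc_demo.fit(X_tr, y_tr)

# staged_predict_proba yields predictions after 1, 2, ..., 60 trees.
staged_auc = [
    roc_auc_score(y_va, proba[:, 1])
    for proba in gbc_demo.staged_predict_proba(X_va)
]

best_stage = int(np.argmax(staged_auc))  # 0-based position in the list
best_n_trees = best_stage + 1            # number of trees at that stage
print("best n_estimators = %d, validation AUC = %.4f"
      % (best_n_trees, staged_auc[best_stage]))
```

Passing `best_n_trees` as `n_estimators` in a fresh model is one way to reuse the result.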


#### Examine predictions of best model

In [38]:
y_pred_gbc = gbc_clf.predict(X_val)  # creates np array
acr_gbc = accuracy_score(y_val, y_pred_gbc)
print("Accuracy: %.2f%%" % (acr_gbc * 100.0))

Accuracy: 92.01%
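
The accuracy above is close to what a trivial model achieves: about 91.93% of the validation labels are 0, so always predicting 0 already scores about that much. A minimal sketch of that baseline, using a synthetic label vector built from the class share printed earlier (`y_demo` and the sample size of 100000 are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Share of class 0 in y_val, taken from the value_counts output above.
share_negative = 0.919274

# Synthetic labels with that class share (illustrative size).
n_rows = 100000
n_negative = int(round(share_negative * n_rows))
y_demo = np.array([0] * n_negative + [1] * (n_rows - n_negative))

# A "model" that always predicts the majority class.
y_all_zero = np.zeros_like(y_demo)
baseline_acc = accuracy_score(y_demo, y_all_zero)
print("Majority-class baseline accuracy: %.2f%%" % (baseline_acc * 100.0))
```

This is why the notebook selects models by AUC rather than accuracy: with such imbalanced classes, accuracy barely separates a fitted model from the constant prediction.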


### Using xgboost.XGBClassifier
Using xgboost in our case has several advantages over sklearn's GradientBoostingClassifier:
- includes regularization (L1 and L2 penalties on the leaf weights)
- can output feature importance (sklearn's feature_importances_ offers this too)
- well optimized for sparse data (e.g. one-hot-encoded categorical data)
- supports early stopping, which many Kagglers rely on
- faster than sklearn's implementation here: the default fit below takes 95.8 seconds versus 184.3 seconds above
- can be run in a distributed way

In [17]:
import xgboost as xgb

#### Fit model and evaluate OOS performance

In [18]:
start = time.time()

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)

end = time.time()
print('Execution time:', np.round(end - start, 1), 'seconds')

Execution time: 95.8 seconds


In [19]:
# model performance on validation set
y_proba_xgb = xgb_clf.predict_proba(X_val)[:,1]

auc_xgb = roc_auc_score(y_val, y_proba_xgb)
print("AUC: %.2f%%" % (auc_xgb * 100.0))

AUC: 74.99%


In [52]:
y_pred_xgb = xgb_clf.predict(X_val)  # creates np array
acr_xgb = accuracy_score(y_val, y_pred_xgb)
print("Accuracy: %.2f%%" % (acr_xgb * 100.0))

Accuracy: 92.03%


#### Tune xgboost
Parameters for XGBClassifier to set:
- objective : use default 'binary:logistic'
- booster: use default 'gbtree'
- n_jobs : default 1, set higher for parallel processing
- random_state : set a number for reproducibility

Parameters for XGBClassifier to tune:
- gamma : float, default 0
    Minimum loss reduction required to make a further split on a leaf node.
- max_depth : int, default 3
    Maximum tree depth for base learners.
- learning_rate : float, default 0.1
    Boosting learning rate (xgb's "eta")
- n_estimators : int, default 100
    Number of boosted trees to fit.
- subsample : float
    Subsample ratio of the training instance.
- reg_alpha : float (xgb's alpha)
    L1 regularization term on weights
- reg_lambda : float (xgb's lambda)
    L2 regularization term on weights
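
One possible way to explore the parameters listed above is a small random search. The sketch below only draws candidate settings with sklearn's `ParameterSampler`; the value ranges are assumptions chosen for illustration, not recommendations from this notebook:

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space over the tunable parameters listed above.
param_space = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'subsample': [0.5, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 5, 10],
}

# Draw 5 random combinations; random_state makes the draw repeatable.
candidates = list(ParameterSampler(param_space, n_iter=5, random_state=123))
for params in candidates:
    print(params)
```

Each dictionary could then be passed on as `xgb.XGBClassifier(**params)` and scored by validation AUC, keeping the setting with the highest score.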

In [29]:
# for scale positive weight
sum(y_train==0)/sum(y_train==1)

11.386941622076595
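
The ratio above is the number of negative rows per positive row, which is the usual starting value for `scale_pos_weight`. A minimal sketch as a reusable helper (`neg_pos_ratio` and `y_demo` are hypothetical names introduced for illustration):

```python
import numpy as np

def neg_pos_ratio(y):
    """Return (# rows with label 0) / (# rows with label 1)."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples, ratio is undefined")
    return n_neg / n_pos

# Toy labels: 92 negatives and 8 positives.
y_demo = np.array([0] * 92 + [1] * 8)
ratio = neg_pos_ratio(y_demo)
print(ratio)  # 11.5
```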

In [27]:
xgb_clf = xgb.XGBClassifier(scale_pos_weight=11,
                            random_state=123)

xgb_clf.fit(X_train, y_train)

y_proba_xgb = xgb_clf.predict_proba(X_val)[:,1]

auc_xgb = roc_auc_score(y_val, y_proba_xgb)
print("AUC: %.2f%%" % (auc_xgb * 100.0))

AUC: 75.08%


In [28]:
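# note: without parentheses this displays the bound method, whose repr includes the estimator's parameters;
# xgb_clf.get_params() would return them as a dict
xgb_clf.get_params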

<bound method XGBModel.get_params of XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=11,
       seed=None, silent=True, subsample=1)>

#### xgboost with DMatrix

In [55]:
# encode data in xgboost's efficient internal data structure
dtrain = xgb.DMatrix(X_train, y_train)
# note: despite its name, dtest holds the validation set
dtest = xgb.DMatrix(X_val, y_val)

In [None]:
num_round = 30  # number of boosting rounds
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
# evaluation sets whose metrics are reported after every round
evallist = [(dtest, 'eval'), (dtrain, 'train')]

bst = xgb.train(param, dtrain, num_round, evallist)

In [None]:
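# with early stopping: training stops once the metric on the last set in evallist
# (here the validation set, named 'test') has not improved for 10 rounds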
num_round = 30
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtrain, 'train'), (dtest, 'test')]

bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=evallist, early_stopping_rounds=10)

In [None]:
# predicted probabilities of the positive class on the validation set
ypred = bst.predict(dtest)
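
With the 'binary:logistic' objective, `bst.predict` returns a one-dimensional array of positive-class probabilities, not the two-column array that `predict_proba` gives. A minimal sketch of scoring such an array, using made-up numbers in place of the real booster output (`ypred_demo` and `y_true_demo` are illustrative stand-ins):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and probabilities standing in for y_val and bst.predict(dtest).
y_true_demo = np.array([0, 0, 1, 0, 1, 0])
ypred_demo = np.array([0.10, 0.40, 0.80, 0.20, 0.35, 0.05])

# AUC works directly on the 1-D probabilities (no [:, 1] indexing needed).
auc_demo = roc_auc_score(y_true_demo, ypred_demo)

# Class labels require an explicit threshold; 0.5 is an assumed choice.
labels_demo = (ypred_demo >= 0.5).astype(int)
acc_demo = accuracy_score(y_true_demo, labels_demo)

print("AUC: %.2f%%" % (auc_demo * 100.0))
print("Accuracy: %.2f%%" % (acc_demo * 100.0))
```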

### Generate test set predictions

In [20]:
model_select = xgb_clf
# predict_proba returns one column per class; column 1 is the probability of TARGET = 1
y_submit = model_select.predict_proba(df_test)

df_submit = pd.DataFrame({'SK_ID_CURR': df_test.index,
                          'TARGET': y_submit[:,1]})
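
The outline lists saving and reloading the selected model, which the cells here do not show. One possible sketch uses Python's `pickle` on a small sklearn model trained on synthetic data; the model, the data and the in-memory round trip are assumptions for illustration, not the workflow used for the submission:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic model standing in for the selected model.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
model_demo = GradientBoostingClassifier(n_estimators=20, random_state=0)
model_demo.fit(X_demo, y_demo)

# Serialize to bytes and restore; pickle.dump / pickle.load do the same with a file.
blob = pickle.dumps(model_demo)
model_loaded = pickle.loads(blob)

# The restored model should reproduce the original predictions.
same = np.allclose(
    model_demo.predict_proba(X_demo), model_loaded.predict_proba(X_demo)
)
print("predictions identical after reload:", same)
```

Pickled models are generally tied to the library version that wrote them, so the sklearn or xgboost version is worth recording next to the file.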

In [21]:
df_submit.head()

   SK_ID_CURR    TARGET
0      100001  0.050252
1      100005  0.117547
2      100013  0.021563
3      100028  0.035864
4      100038  0.120923


In [23]:
df_submit.shape

(48744, 2)

In [22]:
filename_output='baseline2_xgb.csv'
df_submit.to_csv(filename_output, index = False)
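
Before uploading, it can help to read the file back and check its basic shape. A minimal sketch, assuming the two-column layout shown above; `check_submission` is a hypothetical helper, and the demo frame reuses the first three rows printed earlier:

```python
import io

import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ['SK_ID_CURR', 'TARGET']:
        return ['unexpected columns: %s' % list(df.columns)]
    problems = []
    if len(df) != expected_rows:
        problems.append('expected %d rows, found %d' % (expected_rows, len(df)))
    if df['SK_ID_CURR'].duplicated().any():
        problems.append('duplicate SK_ID_CURR values')
    if df['TARGET'].isnull().any():
        problems.append('missing TARGET values')
    elif not df['TARGET'].between(0, 1).all():
        problems.append('TARGET values outside [0, 1]')
    return problems

# First three rows of the submission shown above.
df_demo = pd.DataFrame({'SK_ID_CURR': [100001, 100005, 100013],
                        'TARGET': [0.050252, 0.117547, 0.021563]})

# Round trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

problems = check_submission(df_roundtrip, expected_rows=3)
print(problems)
```

For the real file, the same check would be called with `expected_rows=48744`, the number of rows in the test data.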