# XGBoost on Application data
This notebook trains an XGBoost model on the cleaned application training data and produces predictions for the application test data. It is structured as follows:
- Data preparation 
    - load cleaned and merged data
    - create train and validation sets for model selection
- Model selection
    - benchmark: default model performance
    - tune hyperparameters, based on AUC and out-of-sample (OOS) performance
    - save best model and examine prediction errors
- Make predictions on test set
    - load selected model
    - predict on test set and create submission file

## Load cleaned data

In [2]:
import numpy as np
import pandas as pd

import basic_application_data_cleaner as cleaner

In [3]:
path_to_kaggle_data='~/kaggle_JPFGM/Data/'  # location of all the unzipped data files on local machine
df_train, df_test = cleaner.load_cleaned_application_data(path_to_kaggle_data)

Raw training data size: (307511, 121)
Raw test data size: (48744, 120)
Cleaned training data shape:  (307511, 246)
Cleaned testing data shape:  (48744, 245)


## Create train and validation sets

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# all of the training data (with labels)
# SK_ID_CURR was set as the index during data cleaning
X = df_train.drop(['TARGET'], axis=1)
y = df_train['TARGET']

In [7]:
# Create stratified train and validation sets (class proportions are preserved in both)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)

In [8]:
y_train.value_counts(normalize=True)

0    0.91927
1    0.08073
Name: TARGET, dtype: float64

In [9]:
y_val.value_counts(normalize=True)

0    0.919274
1    0.080726
Name: TARGET, dtype: float64

Note: X, X_train and X_val are still pandas DataFrames, not NumPy arrays.

## XGBoost Model

In [11]:
import xgboost as xgb
import time
from sklearn.metrics import accuracy_score, roc_auc_score

In [27]:
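def model_performance(model, X, y, name='validation set',
                      metric=('auc', 'accuracy')):
    """Print the requested metrics for model on (X, y) and return the model.

    metric is a single name ('auc' or 'accuracy') or a collection of names.
    """
    # accept both a single string and a collection of metric names
    metrics = [metric] if isinstance(metric, str) else list(metric)
    print('Performance on %s:' % name)
    if 'auc' in metrics:
        # AUC uses the predicted probability of the positive class
        y_proba = model.predict_proba(X)[:, 1]
        auc = roc_auc_score(y, y_proba)
        print("AUC: %.2f%%" % (auc * 100.0))
    if 'accuracy' in metrics:
        # accuracy uses the hard class predictions
        y_pred = model.predict(X)
        acr = accuracy_score(y, y_pred)
        print("Accuracy: %.2f%%" % (acr * 100.0))
    return model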

### Fit default model and evaluate val set performance

In [12]:
start = time.time()

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)

end = time.time()
print('Execution time:', np.round(end - start, 1), 'seconds')

Execution time: 101.5 seconds


In [35]:
# default parameters
xgb_clf

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [31]:
model_performance(xgb_clf, X_val, y_val, 'validation set');

Performance on validation set:
AUC: 74.99%
Accuracy: 91.95%


In [33]:
model_performance(xgb_clf, X_train, y_train, 'training set');

Performance on training set:
AUC: 75.92%
Accuracy: 91.95%
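
The accuracy figures above are close to the share of class 0 in the data (about 91.93%), which is what a model that never predicts a default would score. The sketch below illustrates why AUC is the more informative metric here. It uses toy labels with 8% positives rather than the notebook's data, and the helpers `accuracy` and `auc` are illustrative names, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the number of examples, so this
    is only meant for small illustrations.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 92 negatives and 8 positives (roughly the imbalance seen above)
y_true = [0] * 92 + [1] * 8

# A "model" that gives everyone the same score and always predicts class 0
constant_scores = [0.08] * len(y_true)
constant_labels = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, constant_labels)
baseline_auc = auc(y_true, constant_scores)

print("Baseline accuracy: %.2f%%" % (baseline_accuracy * 100.0))
print("Baseline AUC: %.2f%%" % (baseline_auc * 100.0))
```

The constant predictor scores high on accuracy while its AUC sits at chance level, so on data this imbalanced accuracy alone says little about how well defaults are ranked.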


### Tune hyperparameters

#### Hyperparameters to set as fixed:
- n_jobs : default 1, set higher for parallel processing
- silent : boolean, default True. Set to False if you want printed messages while running boosting.

#### Hyperparameters to tune:
- Number of trees:
    - learning_rate : float, default 0.1. Boosting learning rate (xgb's "eta")
    - n_estimators : int, default 100. Number of boosted trees to fit.
    - Recommended to set a small learning rate, and choose n_estimators by early stopping
- Decorrelate trees:
    - subsample : float, default 1. Subsample ratio of the training instances.
    - colsample_bytree : float, default 1. Subsample ratio of columns when constructing each tree.
    - colsample_bylevel : float, default 1. Subsample ratio of columns for each level of a tree.
    - These column-subsampling ratios play the role of scikit-learn's max_features, which XGBoost does not have; use small values to decorrelate trees.
- Regularization strengths:
    - reg_alpha : float (xgb's alpha), default 0. L1 regularization term on weights.
    - reg_lambda : float (xgb's lambda), default 1. L2 regularization term on weights.
- Individual tree properties:
    - max_depth : int, default 3. Increase to allow more complex trees.
    - gamma : float, default 0. Minimum loss reduction required to make a further partition on a leaf node of the tree.
    - min_child_weight : int, default 1. Minimum sum of instance weight (hessian) needed in a child.
- Imbalanced data:
    - scale_pos_weight : float, default 1. Balancing of positive and negative weights.
    - max_delta_step : int, default 0. Maximum delta step allowed for each tree's weight estimation; it can help when the classes are extremely imbalanced.
    - base_score : float, default 0.5. The initial prediction score of all instances (global bias).
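
One way to explore the parameters listed above is a small random search. The sketch below only draws candidate configurations; the candidate values in `search_space` and the helper `sample_configs` are hypothetical choices for illustration, not settings used in this notebook.

```python
import random

# Hypothetical candidate values for a subset of the parameters listed above
search_space = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 6],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 5, 10],
    "reg_alpha": [0, 0.1, 1],
    "reg_lambda": [1, 5, 10],
}


def sample_configs(space, n_configs, seed=0):
    """Draw n_configs random parameter dicts, one value per parameter."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_configs)
    ]


configs = sample_configs(search_space, n_configs=5)
for config in configs:
    print(config)
```

Each sampled dict could then be passed as keyword arguments to `xgb.XGBClassifier` and the resulting models compared by validation AUC.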

In [29]:
# for scale positive weight
sum(y_train==0)/sum(y_train==1)

11.386941622076595
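
The ratio above is the number of negative examples per positive example. Setting scale_pos_weight close to it up-weights each positive so that both classes carry roughly the same total weight in the loss. The sketch below illustrates that idea on toy labels; it is not XGBoost's internal code, and `weighted_log_loss` is a hypothetical helper.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Average log loss in which each positive example counts pos_weight times."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = pos_weight if y == 1 else 1.0
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Toy labels: 11 negatives for every positive
y_toy = [0] * 11 + [1]
n_neg = y_toy.count(0)
n_pos = y_toy.count(1)
pos_weight = n_neg / n_pos

# A model that assigns a low default probability to everyone
p_low = [0.05] * len(y_toy)

loss_unweighted = weighted_log_loss(y_toy, p_low)
loss_weighted = weighted_log_loss(y_toy, p_low, pos_weight=pos_weight)

print("pos_weight:", pos_weight)
print("unweighted loss: %.4f" % loss_unweighted)
print("weighted loss:   %.4f" % loss_weighted)
```

With the weight applied, ignoring the minority class becomes much more costly, which is the effect scale_pos_weight is meant to have.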

In [27]:
xgb_clf = xgb.XGBClassifier(scale_pos_weight=11,
                            random_state=123)

xgb_clf.fit(X_train, y_train)

y_proba_xgb = xgb_clf.predict_proba(X_val)[:,1]

auc_xgb = roc_auc_score(y_val, y_proba_xgb)
print("AUC: %.2f%%" % (auc_xgb * 100.0))

AUC: 75.08%


In [28]:
xgb_clf.get_params

<bound method XGBModel.get_params of XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=11,
       seed=None, silent=True, subsample=1)>

#### xgboost with DMatrix

In [55]:
# encode data in XGBoost's efficient internal data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
# note: dtest holds the validation split, not the Kaggle test set
dtest = xgb.DMatrix(X_val, label=y_val)

In [None]:
num_round = 30
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtest, 'eval'), (dtrain, 'train')]

bst = xgb.train(param, dtrain, num_round, evallist)

In [None]:
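# with early stopping: training stops once the metric on the last entry of evals
# (here the validation data, named 'test') has not improved for 10 rounds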
num_round = 30
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtrain, 'train'), (dtest, 'test')]

bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=evallist, early_stopping_rounds=10)
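
With `early_stopping_rounds=10`, `xgb.train` monitors the last entry in `evals` and stops once its metric has gone 10 rounds without improving on its best value. The pure-Python sketch below illustrates that stopping rule only; it is not XGBoost's implementation, and `early_stopping_round` and the score list are made up for the example.

```python
def early_stopping_round(scores, patience, maximize=False):
    """Return (best_round, best_score) under a patience-based stopping rule.

    scores holds one validation metric value per boosting round. Iteration
    stops once `patience` rounds have passed without a new best value.
    """
    best_score = None
    best_round = -1
    for rnd, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score


# Made-up validation errors per round (lower is better)
val_error = [0.30, 0.25, 0.22, 0.23, 0.24, 0.22, 0.25]
best_round, best_score = early_stopping_round(val_error, patience=3)
print("best round:", best_round, "best score:", best_score)
```

For a metric such as AUC, where higher is better, the same rule applies with `maximize=True`.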

In [None]:
ypred = bst.predict(dtest)

## Examine predictions of best model

### Classification errors
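
One possible starting point for examining classification errors is to count the four kinds of outcome at a chosen probability threshold. The sketch below assumes predicted probabilities and true labels like `y_proba_xgb` and `y_val` above, but uses toy lists so that it runs on its own; `confusion_counts` and the 0.5 threshold are illustrative choices.

```python
def confusion_counts(y_true, y_proba, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy labels and predicted default probabilities
y_toy = [0, 0, 0, 0, 1, 1]
p_toy = [0.05, 0.20, 0.60, 0.10, 0.70, 0.30]

counts = confusion_counts(y_toy, p_toy)
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(counts)
print("recall: %.2f" % recall)
```

Lowering the threshold trades false negatives for false positives, which matters when missed defaults cost more than false alarms.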

### Feature importance

## Generate predictions on test set for submission

In [20]:
model_select = xgb_clf
y_submit = model_select.predict_proba(df_test)

df_submit = pd.DataFrame({'SK_ID_CURR': df_test.index,
                          'TARGET': y_submit[:,1]})

In [21]:
df_submit.head()

   SK_ID_CURR    TARGET
0      100001  0.050252
1      100005  0.117547
2      100013  0.021563
3      100028  0.035864
4      100038  0.120923


In [23]:
df_submit.shape

(48744, 2)

In [22]:
filename_output = 'baseline2_xgb.csv'
# uncomment to write the submission file
#df_submit.to_csv(filename_output, index=False)