# XGBoost on Application data
This notebook trains an XGBoost model on the cleaned application training data and produces predictions for the application test data. It is structured as follows:
- Data preparation 
    - load cleaned and merged data
    - create train and validation sets for model selection
- Model selection
    - benchmark: default model performance
    - tune hyperparameters, based on AUC and out-of-sample (OOS) performance
    - save best model and examine prediction errors
- Make predictions on test set
    - load selected model
    - predict on test set and create submission file

## Load cleaned data

In [4]:
import numpy as np
import pandas as pd

import basic_application_data_cleaner as cleaner

In [5]:
path_to_kaggle_data='~/kaggle_JPFGM/Data/'  # location of all the unzipped data files on local machine
df_train, df_test = cleaner.load_cleaned_application_data(path_to_kaggle_data)

Raw training data size: (307511, 121)
Raw test data size: (48744, 120)
Cleaned training data shape:  (307511, 246)
Cleaned testing data shape:  (48744, 245)


## Create train and validation sets

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# all of the training data (with labels)
# SK_ID is set as index in previous data cleaning
X = df_train.drop(['TARGET'], axis=1)
y = df_train['TARGET']

In [8]:
# Create train and validation sets in stratified way
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)

In [33]:
print('Fraction of positive samples in training set: %.2f%%' % (100 * (y_train == 1).mean()))
print('Fraction of positive samples in validation set: %.2f%%' % (100 * (y_val == 1).mean()))

Fraction of positive samples in training set: 8.07%
Fraction of positive samples in validation set: 8.07%


Note: X, X_train and X_val are still pandas DataFrames, not numpy arrays.
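The identical positive rates printed above are a consequence of `stratify=y`. As a minimal sketch of that effect, the snippet below compares a stratified and a non-stratified split on synthetic labels with roughly 8% positives. The names `X_demo`, `y_demo`, `gap_strat` and `gap_plain` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (about 8% positives), standing in for TARGET
rng = np.random.RandomState(42)
y_demo = (rng.rand(10000) < 0.08).astype(int)
X_demo = rng.rand(10000, 3)

# Same split size and seed, with and without stratification
_, _, y_tr_strat, y_va_strat = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42)
_, _, y_tr_plain, y_va_plain = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Absolute difference in positive rate between the two parts of each split
gap_strat = abs(y_tr_strat.mean() - y_va_strat.mean())
gap_plain = abs(y_tr_plain.mean() - y_va_plain.mean())
print('Positive-rate gap, stratified: %.4f' % gap_strat)
print('Positive-rate gap, plain:      %.4f' % gap_plain)
```

With stratification the gap is limited to rounding effects, whereas a plain random split can drift by chance.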

## XGBoost Model

In [34]:
import xgboost as xgb
import time
from sklearn.metrics import accuracy_score, roc_auc_score

In [35]:
def _model_performance_metric(model, X, y, metric):
    """Predict on X and evaluate performance metric using labels y.
    
    Parameters
    ----------
    model: pre-trained model.
        Needs to have .predict() and .predict_proba() methods
    X: np array or pandas dataframe.
        Data to predict and evaluate model on
    y: np array or pandas Series
        labels to evaluate model
    metric: string or list of strings. Supports 'auc' and 'accuracy'
        Evaluation metrics.

    Returns
    -------
    pd.Series of scores formatted as percentage strings, indexed by metric name.
    """
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)[:,1]

    if isinstance(metric, str):
        l_metric = [metric]
    elif isinstance(metric, list):
        l_metric = metric
    else:
        raise ValueError('metric has to be either string or list of strings')
    scores = []
    for name in l_metric:  # loop variable kept distinct from the `metric` argument
        if name == 'auc':
            score = roc_auc_score(y, y_proba)
        elif name == 'accuracy':
            score = accuracy_score(y, y_pred)
        else:
            raise ValueError('metric not defined: %r' % name)
        scores.append("%.2f%%" % (score * 100.0))  # format as percentage string
    s_scores = pd.Series(scores, index=l_metric)
    return s_scores

In [36]:
def model_performance_train_test_split(model, X_train, X_val, y_train, y_val, metric=['auc', 'accuracy']):
    """Compare model performance on the training and validation sets.

    Parameters
    ----------
    model: pre-trained model.
        Needs to have .predict() and .predict_proba() methods
    X_train, X_val: np array or pandas dataframe.
        Data to predict and evaluate model on
    y_train, y_val: np array or pandas Series
        labels to evaluate model
    metric: string or list of strings.
        Evaluation metrics.
    """
    train_scores = _model_performance_metric(model, X_train, y_train, metric)
    val_scores = _model_performance_metric(model, X_val, y_val, metric)
    df_scores = pd.concat([train_scores, val_scores], axis=1).rename(columns={0:'Training Set', 1:'Validation Set'})
    return df_scores

### Fit default model and evaluate validation set performance

In [37]:
start = time.time()

xgb_def = xgb.XGBClassifier()
xgb_def.fit(X_train, y_train)

end = time.time()
print('Execution time:', np.round(end - start, 1), 'seconds')

Execution time: 98.9 seconds


In [38]:
# default parameters
xgb_def

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [40]:
model_performance_train_test_split(xgb_def, X_train, X_val, y_train, y_val)

| | Training Set | Validation Set |
|---|---|---|
| auc | 75.92% | 74.99% |
| accuracy | 91.95% | 91.95% |
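The accuracy of 91.95% is close to what a classifier that always predicts class 0 would achieve, given that about 8.07% of the samples are positive. The sketch below illustrates this on synthetic labels with the same positive rate. It does not use the notebook's data, and `y_demo`, `y_pred_majority` and `y_score_constant` are illustrative names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with the positive rate reported for the training set (8.07%)
rng = np.random.RandomState(0)
y_demo = (rng.rand(100000) < 0.0807).astype(int)

# A "model" that always predicts the majority class (0)
y_pred_majority = np.zeros_like(y_demo)
# It assigns the same score to every sample, so it cannot rank them at all
y_score_constant = np.full(len(y_demo), 0.0807)

acc = accuracy_score(y_demo, y_pred_majority)
auc = roc_auc_score(y_demo, y_score_constant)
print('Majority-class accuracy: %.2f%%' % (100 * acc))
print('Constant-score AUC:      %.2f%%' % (100 * auc))
```

This is why AUC, rather than accuracy, is the more informative of the two metrics for this imbalanced problem.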


### Tune hyperparameters

#### Hyperparameters to set as fixed:
- **n_jobs** : default 1, set higher for parallel processing
- **silent** : boolean, default True. Set to False if you want printed messages while running boosting.
- **scale_pos_weight** : float, default 1. Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances)
- **base_score** : float, default 0.5. The initial prediction score of all instances, global bias.
- **eval_metric** : set to 'auc'

#### Hyperparameters to tune:
- Number of trees:
    - **learning_rate** : float, default 0.1. Boosting learning rate (xgb's "eta"). Recommended to set a small learning rate, and choose n_estimators by early stopping
    - **n_estimators** : int, default 100. Number of boosted trees to fit. 
- Decorrelate trees:
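    - use a small max_features (scikit-learn's name; in xgboost the column subsampling parameters colsample_* below play this role)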
    - (max_delta_step) : int, default 0. Maximum delta step we allow each tree's weight estimation to be. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
    - **colsample_bytree**: float, default 1. Subsample ratio of columns when constructing each tree.
    - (colsample_bylevel) : float, default 1. Subsample ratio of columns for each split, in each level.
- Regularization strengths (default: L2 regularization):
    - reg_alpha : float (xgb's alpha), default 0. L1 regularization term on weights. Increasing this value will make the model more conservative.
    - **reg_lambda** : float (xgb's lambda), default 1. L2 regularization term on weights. Increasing this value will make the model more conservative.
    - **subsample** : float, default 1. Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting.
- Individual tree complexity: (start with one of them)
    - **max_depth** : int, default 3. Increase to allow more complex trees
    - (gamma): float, default 0. Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
    - (min_child_weight): int, default 1. Minimum sum of instance weight (hessian) needed in a child. The larger min_child_weight is, the more conservative the algorithm will be.

In [41]:
# parameters to set for unbalanced data
pos_proba = sum(y_train==1)/len(y_train)
print('Naive initial prediction score:', pos_proba)

pos_weight = sum(y_train==0) / sum(y_train==1)
print('Negative to positive instances, for use as scale_pos_weight:', pos_weight)

Naive initial prediction score: 0.0807301778365
Negative to positive instances, for use as scale_pos_weight: 11.3869416221
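The two printed values are linked: if p is the fraction of positives, the negative-to-positive ratio equals (1 - p) / p. The short check below reuses the printed numbers and also shows that weighting each positive by this ratio gives the two classes equal total weight. The variable names `pos_weight_from_proba` and `weighted_pos_share` are illustrative.

```python
# Values printed by the cell above
pos_proba = 0.0807301778365
pos_weight = 11.3869416221

# n_neg / n_pos = (1 - p) / p when p = n_pos / (n_pos + n_neg)
pos_weight_from_proba = (1 - pos_proba) / pos_proba
print(pos_weight_from_proba)

# Share of total weight held by positives once each positive counts pos_weight times
weighted_pos_share = pos_weight * pos_proba / (pos_weight * pos_proba + (1 - pos_proba))
print(weighted_pos_share)
```

The first value reproduces the printed ratio (about 11.387) and the second is about 0.5, i.e. the reweighted classes are balanced.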


In [42]:
start = time.time()

# reasonable choices for the parameters that are not tuned here
xgb_clf = xgb.XGBClassifier(base_score=pos_proba,
                            scale_pos_weight=pos_weight,
                            max_depth=6,
                            subsample=0.8, colsample_bytree=0.8,  # decorrelate trees and faster run
                            eval_metric='auc',
                            silent=False)

xgb_clf.fit(X_train, y_train)

end = time.time()
print('Execution time:', np.round(end - start, 1), 'seconds')

Execution time: 425.0 seconds


In [43]:
model_performance_train_test_split(xgb_clf, X_train, X_val, y_train, y_val)

| | Training Set | Validation Set |
|---|---|---|
| auc | 81.69% | 75.32% |
| accuracy | 72.79% | 71.37% |
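Compared with the default model, validation AUC is roughly unchanged while accuracy drops from about 92% to about 71%. One plausible reading, not verified in this notebook, is that `scale_pos_weight` pushes predicted probabilities upward, so many more samples cross the 0.5 threshold used by `.predict()`, while the ranking of samples changes little. The sketch below shows the general mechanism on synthetic scores: multiplying the odds by a constant leaves AUC unchanged but alters accuracy at a fixed threshold. All names and numbers here are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
y_demo = (rng.rand(20000) < 0.08).astype(int)
# Synthetic scores: positives tend to score higher, nearly all below 0.5
p_demo = np.clip(0.08 + 0.05 * y_demo + 0.05 * rng.randn(20000), 0.001, 0.999)

# Multiply the odds by a constant factor: a strictly increasing transform
w = 11.4
odds = w * p_demo / (1 - p_demo)
p_shifted = odds / (1 + odds)

# Ranking-based metric before and after the shift
auc_before = roc_auc_score(y_demo, p_demo)
auc_after = roc_auc_score(y_demo, p_shifted)

# Threshold-based metric before and after the shift
acc_before = accuracy_score(y_demo, (p_demo > 0.5).astype(int))
acc_after = accuracy_score(y_demo, (p_shifted > 0.5).astype(int))

print('AUC before / after:      %.4f / %.4f' % (auc_before, auc_after))
print('Accuracy before / after: %.4f / %.4f' % (acc_before, acc_after))
```

If accuracy matters for the application, the decision threshold would need to be chosen separately rather than left at 0.5.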


In [None]:
# tune main parameters: learning rate & number of trees, reg_lambda


#### xgboost with DMatrix

In [55]:
# encode data in xgboost's efficient internal data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_val, label=y_val)  # note: holds the validation set, not the Kaggle test set

In [None]:
num_round = 30
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtest, 'eval'), (dtrain, 'train')]

bst = xgb.train(param, dtrain, num_round, evallist)

In [None]:
# with early stopping: xgb.train monitors the last entry in evals (here 'test', i.e. the validation data)
num_round = 30
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtrain, 'train'), (dtest, 'test')]

bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=evallist, early_stopping_rounds=10)

In [None]:
# with objective 'binary:logistic', predict returns the probability of the positive class
ypred = bst.predict(dtest)

## Examine predictions of best model

### Classification errors

### Feature importance

## Generate predictions on test set for submission

In [20]:
model_select = xgb_clf
y_submit = model_select.predict_proba(df_test)

df_submit = pd.DataFrame({'SK_ID_CURR': df_test.index,
                          'TARGET': y_submit[:,1]})

In [21]:
df_submit.head()

| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.050252 |
| 1 | 100005 | 0.117547 |
| 2 | 100013 | 0.021563 |
| 3 | 100028 | 0.035864 |
| 4 | 100038 | 0.120923 |


In [23]:
df_submit.shape

(48744, 2)

In [22]:
filename_output = 'baseline2_xgb.csv'
# uncomment to write the submission file
#df_submit.to_csv(filename_output, index=False)
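Before writing the file, it can be worth checking the submission frame for the properties seen above: two columns, one row per test sample, and probabilities in [0, 1]. The helper `check_submission` below is hypothetical (it is not part of the notebook or of any library) and is shown on a toy frame built from the first rows of `df_submit.head()`.

```python
import pandas as pd

def check_submission(df, n_expected):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(df.columns) == ['SK_ID_CURR', 'TARGET']  # expected column order
    assert len(df) == n_expected                         # one row per test sample
    assert df['SK_ID_CURR'].is_unique                    # no duplicated IDs
    assert df['TARGET'].between(0, 1).all()              # valid probabilities, no NaN
    return True

# Toy frame using the first rows shown by df_submit.head()
df_demo = pd.DataFrame({'SK_ID_CURR': [100001, 100005, 100013],
                        'TARGET': [0.050252, 0.117547, 0.021563]})
ok = check_submission(df_demo, n_expected=3)
print(ok)
```

For the real submission the call would be `check_submission(df_submit, n_expected=48744)`, matching the shape printed above.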