# Case Study: Loan Default Prediction

#### Author: Ewen

## Description

This competition asks you to determine whether a loan will default, as well as the loss incurred if it does default. Unlike traditional finance-based approaches to this problem, which classify counterparties as simply good or bad, we seek to anticipate and incorporate both the default and the severity of the resulting losses. In doing so, we build a bridge between traditional banking, which aims to reduce the consumption of economic capital, and an asset-management perspective, which optimizes the risk to the financial investor.

This competition is sponsored by researchers at Imperial College London.

## Evaluation

This competition is evaluated on the mean absolute error (MAE).
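
As a quick illustration of the metric, the sketch below computes MAE with NumPy. The `actual` and `predicted` values are made up for the example and are not competition data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted losses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical losses: most loans do not default, so most targets are 0.
actual = [0, 0, 60, 0, 10]
predicted = [0, 5, 40, 0, 0]

mae = mean_absolute_error(actual, predicted)
# Baseline that always predicts "no loss", for comparison.
mae_all_zero = mean_absolute_error(actual, [0] * len(actual))
print(mae, mae_all_zero)
```

Because most targets are zero, an all-zero prediction can already score a low MAE, which is worth keeping in mind when judging a model.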

## Data Description

This data corresponds to a set of financial transactions associated with individuals. The data has been standardized, de-trended, and anonymized. You are provided with over two hundred thousand observations and nearly 800 features. Each observation is independent of the previous one.

For each observation, it was recorded whether a default was triggered. In case of a default, the loss was measured. This quantity lies between 0 and 100. It has been normalised, considering that the notional of each transaction at inception is 100. For example, a loss of 60 means that only 40 is reimbursed. If the loan did not default, the loss was 0. You are asked to predict the losses for each observation in the test set.
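
The relationship between loss, default, and reimbursement described above can be sketched in a few lines of pandas. The `loans` frame and its values are hypothetical; only the notional of 100 and the rule "loss > 0 means default" come from the description.

```python
import pandas as pd

NOTIONAL = 100  # per the description, each transaction is normalised to 100

# Hypothetical loss values for five loans (not taken from the data set).
loans = pd.DataFrame({"loss": [0, 60, 0, 5, 100]})

# A loan defaulted exactly when a positive loss was recorded.
loans["default"] = (loans["loss"] > 0).astype(int)
# Whatever was not lost was reimbursed, e.g. a loss of 60 leaves 40.
loans["reimbursed"] = NOTIONAL - loans["loss"]
print(loans)
```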

Missing feature values have been kept as is, so that the competing teams can really use the maximum data available, implementing a strategy to fill the gaps if desired. Note that some variables may be categorical (e.g. f776 and f777).
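
One simple gap-filling strategy is median imputation. This is a minimal sketch on toy frames (`train_part` and `test_part` are hypothetical names), not the approach this notebook settles on. The medians are computed on the training part only and then reused for the test part.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the real feature matrices.
train_part = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
test_part = pd.DataFrame({"f1": [np.nan, 4.0], "f2": [np.nan, np.nan]})

# Column medians from the training rows (NaN values are skipped by default).
medians = train_part.median()

# Fill both parts with the same training medians.
train_filled = train_part.fillna(medians)
test_filled = test_part.fillna(medians)
print(test_filled)
```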

The competition sponsor has worked to remove time-dimensionality from the data. However, the observations are still listed in order from old to new in the training set. In the test set they are in random order.
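
Since the training rows are ordered from old to new, a validation split that respects that order is one option to consider alongside a random split. The helper below is a hypothetical sketch on a toy frame; `chronological_split` is not part of any library.

```python
import pandas as pd

def chronological_split(df, holdout_fraction=0.33):
    """Keep the oldest rows for fitting and the newest rows for validation."""
    n_holdout = int(round(len(df) * holdout_fraction))
    cut = len(df) - n_holdout
    return df.iloc[:cut], df.iloc[cut:]

# Toy frame whose row order mimics the old-to-new ordering of the training set.
frame = pd.DataFrame({"id": range(1, 11), "loss": [0, 0, 3, 0, 0, 0, 12, 0, 0, 0]})
older, newer = chronological_split(frame, holdout_fraction=0.3)
print(len(older), len(newer))
```

A random split, as used later in this notebook, mixes old and new rows, so the two strategies may give different validation scores.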

## Import Packages

In [110]:
import os
import numpy as np
import pandas as pd
from datetime import datetime
import xgboost as xgb
import lightgbm as lgb
from lightgbm import LGBMClassifier

from sklearn import metrics 
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns  # used by the histogram and heatmap cells below
from matplotlib.pylab import rcParams
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
rcParams['figure.figsize'] = 12, 20

## Set Work Directory

In [28]:
wd = '/Users/ewenwang/Downloads/'
os.chdir(wd)

## Exploratory Data Analysis

In [30]:
train = pd.read_csv('train_v2.csv', low_memory=False)
train.info()

In [31]:
train.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f770,f771,f772,f773,f774,f775,f776,f777,f778,loss
0,1,126,10,0.686842,1100,3,13699,7201.0,4949.0,126.75,...,5,2.14,-1.54,1.18,0.1833,0.7873,1,0,5,0
1,2,121,10,0.782776,1100,3,84645,240.0,1625.0,123.52,...,6,0.54,-0.24,0.13,0.1926,-0.6787,1,0,5,0
2,3,126,10,0.50008,1100,3,83607,1800.0,1527.0,127.76,...,13,2.89,-1.73,1.04,0.2521,0.7258,1,0,5,0
3,4,134,10,0.439874,1100,3,82642,7542.0,1730.0,132.94,...,4,1.29,-0.89,0.66,0.2498,0.7119,1,0,5,0
4,5,109,9,0.502749,2900,4,79124,89.0,491.0,122.72,...,26,6.11,-3.82,2.51,0.2282,-0.5399,0,0,5,0


In [5]:
np.shape(train)

(105471, 771)

In [6]:
# histogram
p = sns.color_palette()
plt.hist(train.loss, color = p[2])
plt.ylabel('Number of Observations')
plt.xlabel('Loss')
plt.title('Distribution of Loss')

Text(0.5,1,'Distribution of Loss')

In [8]:
# heatmap
f, ax = plt.subplots(figsize = (10, 8))
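# Correlate numeric columns only; object columns cannot be correlated.
corr = train.select_dtypes(include='number').corr()
sns.heatmap(corr,
            mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed from NumPy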
            cmap = sns.diverging_palette(220, 10, as_cmap = True),
            square = True,
            ax = ax)

## Data Preparation

In [18]:
train.select_dtypes(include=['object']).nunique()

f137      4719
f138     31151
f206     15793
f207     14510
f276      4415
f277     28709
f338      8663
f390    104661
f391    104658
f419     23557
f420     25772
f469     86420
f472    102913
f534     85376
f537    104114
f626    104753
f627    104750
f695     93729
f698     91985
dtype: int64
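
The very high unique counts suggest these object columns may be numeric values that pandas failed to parse, rather than true categories. That is an assumption, not something the output above confirms. Under that assumption, one way to recover them is `pd.to_numeric` with `errors="coerce"`, shown here on a hypothetical frame.

```python
import pandas as pd

# Hypothetical frame: a numeric column stored as strings with an 'NA' marker.
raw = pd.DataFrame({
    "f_big": pd.Series(["1000000", "NA", "42"], dtype=object),
    "f_ok": [1, 2, 3],
})

# Find the object-typed columns, as the cell above does.
object_cols = raw.select_dtypes(include=["object"]).columns

converted = raw.copy()
for col in object_cols:
    # Unparseable entries such as 'NA' become NaN instead of raising.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
```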

In [32]:
dtypes = train.dtypes.apply(lambda x: x.name).to_dict()
int_cols, float_cols, str_cols = [], [], []
for col, dtype in dtypes.items():
    if dtype == 'int64' and col not in ['id', 'loss']:
        int_cols.append(col)
    elif dtype == 'float64':
        float_cols.append(col)
    elif dtype == 'object':
        str_cols.append(col)

In [19]:
target = 'default'
train['default'] = train.loss.apply(lambda x: 1 if x > 0 else 0)
predictors = [x for x in train.columns if x not in [target, 'id', 'loss']]

train_x = train[predictors]
train_y = train[target]

X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.333, random_state=2017)

In [20]:
train_x.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f769,f770,f771,f772,f773,f774,f775,f776,f777,f778
0,126,10,0.686842,1100,3,13699,7201.0,4949.0,126.75,126.03,...,-3.14,5,2.14,-1.54,1.18,0.1833,0.7873,1,0,5
1,121,10,0.782776,1100,3,84645,240.0,1625.0,123.52,121.35,...,-1.38,6,0.54,-0.24,0.13,0.1926,-0.6787,1,0,5
2,126,10,0.50008,1100,3,83607,1800.0,1527.0,127.76,126.49,...,-5.18,13,2.89,-1.73,1.04,0.2521,0.7258,1,0,5
3,134,10,0.439874,1100,3,82642,7542.0,1730.0,132.94,133.58,...,-2.04,4,1.29,-0.89,0.66,0.2498,0.7119,1,0,5
4,109,9,0.502749,2900,4,79124,89.0,491.0,122.72,112.77,...,-11.12,26,6.11,-3.82,2.51,0.2282,-0.5399,0,0,5


In [40]:

from functools import partial

path = '/Users/ewenwang/Downloads'
filename = 'train_v2.csv'
train_file = os.path.join(path, filename)
read_csv = partial(pd.read_csv, na_values=['NA', 'na'], low_memory=False)

dataset = read_csv(train_file)

test_size, seed = 0.33, 2017
train, test = train_test_split(dataset, test_size = test_size, random_state = seed)

In [43]:
test.drop(['loss'], axis=1).head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f769,f770,f771,f772,f773,f774,f775,f776,f777,f778
29622,29623,121,6,0.179645,1300,4,627,4335.0,6044.0,120.32,...,-3.09,5,2.25,-1.74,1.37,0.2609,0.9291,0,0,36
4833,4834,118,8,0.365902,2300,7,86027,144.0,5004.0,119.7,...,-4.79,9,3.63,-2.98,2.55,0.3465,0.649,0,0,15
101249,101250,126,9,0.19275,3100,7,113,7868.0,1333.0,125.98,...,-3.72,7,2.36,-1.63,1.19,0.2339,0.7056,1,0,31
78584,78585,124,9,0.936603,1800,4,14629,3820.0,1832.0,122.18,...,-4.65,8,3.51,-2.86,2.41,0.3176,0.7727,0,0,93
98822,98823,137,6,0.326305,3100,4,79202,5190.0,3144.0,133.52,...,-9.05,14,6.31,-4.69,3.64,0.1812,-0.5781,0,0,21


In [76]:
import os 
import pandas as pd 
from functools import partial
from sklearn.model_selection import train_test_split
import models
from simulated_annealing.optimize import SimulatedAnneal

read_csv = partial(pd.read_csv, na_values=['NA', 'na'], low_memory=False)

In [136]:
path = '/Users/ewenwang/Downloads'
filename = 'train_v2.csv'

In [140]:
test_size, seed = 0.33, 2017
dataset = read_csv(os.path.join(path, filename))
train, test = train_test_split(dataset, test_size=test_size, random_state=seed)

In [153]:
train.loss.isnull().any()

False

In [141]:
test_y = pd.DataFrame(test['loss'].values, columns=['loss'])
test_y[test_y>0] = 1
train_y = pd.DataFrame(train['loss'].values, columns=['loss'])
train_y[train_y>0] = 1

test_X = pd.DataFrame(models.FeatureSelector().fit_transform(test),
                      columns=["Var %d" % (i + 1) for i in range(435)])
train_X = pd.DataFrame(models.FeatureSelector().fit_transform(train),
                       columns=["Var %d" % (i + 1) for i in range(435)])

dtrain, dtest = train_y.join(train_X), test_y.join(test_X)

In [168]:
test_y.groupby('loss').size()

loss
0    31630
1     3176
dtype: int64

In [169]:
dat = test_y.join(test_X)

In [170]:
dat.head()

Unnamed: 0,loss,Var 1,Var 2,Var 3,Var 4,Var 5,Var 6,Var 7,Var 8,Var 9,...,Var 426,Var 427,Var 428,Var 429,Var 430,Var 431,Var 432,Var 433,Var 434,Var 435
0,0,232.32,-0.05,-232.37,1.0,464.76,6.0,692.13,0.0,14368060.0,...,2.533013e+22,162000000000000.0,9.926057e+27,1.269371e+37,1.33222e+28,9.46e+18,7.825182e+18,-725.879733,-35822620.0,-4987393.0
1,0,60.12,-0.06,-60.18,0.0,120.37,8.0,1945.11,0.0,63628640.0,...,2.011937e+19,1810000000000000.0,1.013411e+29,2.431146e+38,1.2354099999999998e+29,4.9748e+19,4.358754e+19,-175.981155,-158635400.0,-22077010.0
2,0,498.32,693.15,194.83,0.0,381.36,9.0,643.38,1.0,2513073.0,...,2.625488e+23,2160000000000.0,2.787357e+29,1.1352369999999999e+39,3.226438e+29,8.108634e+19,7.330177e+19,3180.896507,-6262373.0,-876283.7
3,0,399.07,-0.62,-399.69,1.0,799.39,9.0,1245.16,0.0,9841677.0,...,2.12e+17,49500000000000.0,4.033224e+28,7.752740999999999e+37,5.30178e+28,2.614922e+19,2.1757e+19,-1257.260402,-24539540.0,-3422478.0
4,0,10844.55,-216.58,-11061.13,0.0,22122.27,6.0,3940.2,0.0,1790669000.0,...,1.095946e+26,2.01e+18,5.4212949999999995e+29,1.966588e+39,6.234091e+29,1.683045e+20,1.533345e+20,-37630.684606,-4464298000.0,-620983400.0


In [155]:
train_y.isnull().any()

loss    False
dtype: bool

In [126]:
test_X = pd.DataFrame(models.FeatureSelector().fit_transform(test), 
                      columns = ["Var %d" % (i + 1) for i in range(435)])
train_X = pd.DataFrame(models.FeatureSelector().fit_transform(train), 
                      columns = ["Var %d" % (i + 1) for i in range(435)])

In [133]:
test_y = pd.DataFrame(test['loss'].apply(lambda x: 0 if x == 0 else 1))
train_y = pd.DataFrame(train['loss'].apply(lambda x: 0 if x == 0 else 1))

In [134]:
test_y.groupby('loss').size()

loss
0    31630
1     3176
dtype: int64

In [135]:
train_y.groupby('loss').size()

loss
0    64058
1     6607
dtype: int64

In [107]:
target = 'loss'
predictors = [x for x in dtrain.columns if x not in [target]]

In [120]:
print(train.shape, test.shape)

(70665, 779) (34806, 779)


In [119]:
print(dtest.groupby('loss').size()/dtest.shape[0])
print(dtrain.groupby('loss').size()/dtrain.shape[0])

loss
0    0.906503
1    0.093497
dtype: float64
loss
0    0.908751
1    0.091249
dtype: float64
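
Roughly 9% of loans default in either split, so the classes are clearly imbalanced. The sketch below quantifies this from the `train_y` class counts printed earlier (64058 and 6607). LightGBM documents a `scale_pos_weight` parameter as an alternative to `is_unbalance`. Using the negative-to-positive ratio for it is a common heuristic and an assumption here, not something this notebook does.

```python
# Class counts reported for train_y earlier in this notebook.
n_negative, n_positive = 64058, 6607

# Share of defaulted loans in the training split.
positive_rate = n_positive / (n_negative + n_positive)

# Negative-to-positive ratio, a common heuristic value for scale_pos_weight.
neg_to_pos = n_negative / n_positive

print(round(positive_rate, 4), round(neg_to_pos, 2))
```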


In [105]:
dtrain.head()

Unnamed: 0,Var 1,Var 2,Var 3,Var 4,Var 5,Var 6,Var 7,Var 8,Var 9,Var 10,...,Var 427,Var 428,Var 429,Var 430,Var 431,Var 432,Var 433,Var 434,Var 435,loss
0,232.32,-0.05,-232.37,1.0,464.76,6.0,692.13,0.0,14368060.0,0.0,...,162000000000000.0,9.926057e+27,1.269371e+37,1.33222e+28,9.46e+18,7.825182e+18,-725.879733,-35822620.0,-4987393.0,0
1,60.12,-0.06,-60.18,0.0,120.37,8.0,1945.11,0.0,63628640.0,0.0,...,1810000000000000.0,1.013411e+29,2.431146e+38,1.2354099999999998e+29,4.9748e+19,4.358754e+19,-175.981155,-158635400.0,-22077010.0,0
2,498.32,693.15,194.83,0.0,381.36,9.0,643.38,1.0,2513073.0,0.0,...,2160000000000.0,2.787357e+29,1.1352369999999999e+39,3.226438e+29,8.108634e+19,7.330177e+19,3180.896507,-6262373.0,-876283.7,0
3,399.07,-0.62,-399.69,1.0,799.39,9.0,1245.16,0.0,9841677.0,0.0,...,49500000000000.0,4.033224e+28,7.752740999999999e+37,5.30178e+28,2.614922e+19,2.1757e+19,-1257.260402,-24539540.0,-3422478.0,0
4,10844.55,-216.58,-11061.13,0.0,22122.27,6.0,3940.2,0.0,1790669000.0,0.0,...,2.01e+18,5.4212949999999995e+29,1.966588e+39,6.234091e+29,1.683045e+20,1.533345e+20,-37630.684606,-4464298000.0,-620983400.0,0


In [106]:
dtest.head()

Unnamed: 0,Var 1,Var 2,Var 3,Var 4,Var 5,Var 6,Var 7,Var 8,Var 9,Var 10,...,Var 427,Var 428,Var 429,Var 430,Var 431,Var 432,Var 433,Var 434,Var 435,loss
0,31.59,0.0,-31.59,1.0,63.19,9.0,0.0,0.0,0.0,0.0,...,0.0,2.280525e+27,2.1961e+36,2.569752e+27,2.63e+18,2.43e+18,-82.699036,-191.3142,-47.0453,0
1,1643.72,720.33,-923.39,1.0,2706.8,8.0,918.5,1.0,3180385.0,1.0,...,8670000000000.0,1.4955619999999999e+29,7.938513e+38,1.6852359999999997e+29,3.050988e+19,2.817537e+19,-699.595127,-7928606.0,-1113485.0,0
2,893.68,1300.11,406.43,0.0,632.15,8.0,940.43,0.0,14023310.0,0.0,...,91100000000000.0,4.221158e+28,8.138116e+37,5.89591e+28,2.783101e+19,2.251259e+19,6087.42616,-34954520.0,-4869284.0,0
3,4668.41,-37.39,-4705.8,0.0,9411.62,7.0,1248.23,0.0,22128900000.0,0.0,...,2.810976e+20,3.4025629999999996e+29,6.311994e+38,4.018477e+29,2.093283e+20,1.875754e+20,-15351.531152,-55168980000.0,-7673270000.0,0
4,0.18,0.0,-0.18,0.0,0.37,8.0,385.83,0.0,27526660.0,1.0,...,663000000000000.0,2.1596159999999998e+30,1.1715939999999999e+40,2.4313369999999999e+30,4.554427e+20,4.179866e+20,14.984433,-68627430.0,-9547599.0,0


## Parameter Tuning

In [72]:
param = {
#     'boosting_type': ['gbdt', 'dart'],
#     'num_leaves': [i for i in range(3, 20, 1)],
#     'max_depth': [i for i in range(1, 5, 1)],
    'subsample': [i / 100.0 for i in range(20, 90, 1)],
    'colsample_bytree': [i / 100.0 for i in range(20, 90, 1)],
}

gbm = LGBMClassifier(
    learning_rate=0.01, n_estimators=5000, objective='binary', metric='auc', num_leaves=7,
    max_depth=3, save_binary=True, is_unbalance=True, random_state=2017
)

In [77]:
sa = SimulatedAnneal(gbm, param, T=10.0, T_min=0.001, alpha=0.75,
                     verbose=True, max_iter=0.25, n_trans=5, max_runtime=300,
                     cv=3, scoring='roc_auc', refit=True)


INFO: Number of possible iterations given cooling schedule: 160
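
The figure of 160 in the log line can be reproduced from the cooling parameters with a short calculation. The formula below is an assumption about how that count arises (geometric cooling steps before the temperature falls below `T_min`, times `n_trans` transitions per temperature). It is not taken from the library's source.

```python
import math

def annealing_steps(T, T_min, alpha, n_trans):
    """Cooling steps k with T * alpha**k still at or above T_min, times n_trans."""
    n_temperatures = math.floor(math.log(T_min / T) / math.log(alpha))
    return n_temperatures * n_trans

# Same settings as the SimulatedAnneal call above.
steps = annealing_steps(T=10.0, T_min=0.001, alpha=0.75, n_trans=5)
print(steps)
```

Note that `max_runtime=300` may stop the search before all of these transitions are used.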



In [112]:
# DataFrame.as_matrix() was removed in pandas 1.0; .values is the equivalent
sa.fit(dtrain[predictors].values, dtrain[target].values)
# Print the best score and the best params
print(sa.best_score_, sa.best_params_)
# Use the best estimator to predict classes
optimized_clf = sa.best_estimator_
y_test_pred = optimized_clf.predict(dtest[predictors])
# Print a report of precision, recall, f1_score
print(classification_report(dtest[target], y_test_pred))

## Modeling

In [83]:
def modelfit(lgbm, dtrain, dtest, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    """Fit a LightGBM classifier and print a report; uses the global `target` column name."""
    if useTrainCV:
        lgb_param = lgbm.get_params()
        lgbtrain = lgb.Dataset(dtrain[predictors].values, label=dtrain[target].values)
        lgbtest = lgb.Dataset(dtest[predictors].values, label=dtest[target].values, reference=lgbtrain)
        cvresult = lgb.cv(lgb_param, 
                          lgbtrain, 
                          num_boost_round=lgbm.get_params()['n_estimators'], 
                          nfold=cv_folds,
                          metrics='auc', 
                          early_stopping_rounds=early_stopping_rounds)
        cv = pd.DataFrame(cvresult)
        lgbm.set_params(n_estimators=cv.shape[0])
        print(cv.tail(10))
    
    lgbm.fit(dtrain[predictors], dtrain[target], eval_metric='auc')

    dtrain_predictions = lgbm.predict(dtrain[predictors])
    dtest_predictions = lgbm.predict(dtest[predictors])
    dtrain_predprob = lgbm.predict_proba(dtrain[predictors])[:,1]
    dtest_predprob = lgbm.predict_proba(dtest[predictors])[:,1]
        
    print("\nModel Report")
    print("Accuracy (Train): %.4g" % metrics.accuracy_score(dtrain[target], dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob)) 
    print('AUC Score (Test): %f' % metrics.roc_auc_score(dtest[target], dtest_predprob))
    print(classification_report(dtest[target], dtest_predictions))
    
    lgb.plot_importance(lgbm, figsize=(12, 16), grid=False)
    return None

In [171]:
lgbm = LGBMClassifier(
    boosting_type='gbdt', 
    num_leaves=18, 
    max_depth=6, 
    learning_rate=0.01, 
    n_estimators=5000, 
    objective='binary', 
    subsample=0.7193, 
    colsample_bytree=0.7178, 
    random_state=2017
)

st = datetime.now()
modelfit(lgbm, dtrain, dtest, predictors, useTrainCV=False)
print(datetime.now()-st)

ValueError: Unknown label type: 'continuous'
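
scikit-learn raises this error when a classifier receives labels that look like a regression target rather than discrete classes. The notebook state does not show the exact cause. One plausible explanation, offered as an assumption, is that `dtrain['loss']` held non-binary or float values when this cell ran, for example because cells were executed out of order. A hypothetical guard such as `to_binary_labels` below would make the labels strictly 0/1 integers and fail early on missing values.

```python
import pandas as pd

def to_binary_labels(loss):
    """Map raw loss values to 0/1 integers, refusing missing values."""
    loss = pd.Series(loss, dtype=float)
    if loss.isnull().any():
        raise ValueError("loss contains missing values; check index alignment")
    return (loss > 0).astype(int)

# Hypothetical raw losses, including a non-integer value.
labels = to_binary_labels([0.0, 12.5, 0.0, 3.0])
print(labels.tolist())
```

Applying such a check to `dtrain[target]` and `dtest[target]` right before `modelfit` would show whether the labels are the source of the error.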