## Predicting loan defaults with decision trees

In this project, we use past data from the peer-to-peer lending company LendingClub to predict whether a future loan is likely to be paid off or to default. The notebook covers:

- exploratory data analysis on the loan data
- data cleaning and feature extraction
- building a classifier with decision trees
- validating the model to avoid over-fitting

The data is provided as part of the UW classification class. I extended the class project into a more thorough analysis presented in this notebook.

In [1]:
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (8., 10.0) 

### Loading the data

First, let's read in the data and print the first few rows.

In [2]:
loans = pd.read_csv('lending-club-data.csv') 
print loans.shape
loans.head()


(122607, 68)




Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1,1,1,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1,1,1,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1,1,1,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1,1,1,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1,1,1,0,5.21533,20141201T000000,1,1,1


Whether a loan defaulted is stored in the column "bad_loans": 1 means the loan was risky and 0 means it was safe. In this model, we take this column as the ground truth for whether the loan was safe.

Let's find out what percentage of the loans here are risky and what percentage are safe.

In [3]:
num_bad = (loans['bad_loans'] == 1).sum()
num_total = loans['bad_loans'].count()
print 'number of bad loans = ', num_bad
print 'bad loan rate = ', float(num_bad)/num_total
print 'good loan rate = ', 1-float(num_bad)/num_total

number of bad loans =  23150
bad loan rate =  0.188814668004
good loan rate =  0.811185331996
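
The same proportions can be read off in one step with `value_counts`. Below is a minimal sketch on a tiny made-up series; the values are illustrative only and are not taken from the LendingClub data.

```python
import pandas as pd

# Toy stand-in for loans['bad_loans']; the values are made up for illustration.
bad_loans = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 1])

# normalize=True returns the fraction of rows per label instead of raw counts.
class_share = bad_loans.value_counts(normalize=True)
print(class_share)

bad_rate = class_share.loc[1]    # share of risky loans (label 1)
good_rate = class_share.loc[0]   # share of safe loans (label 0)
```

On the full table above, a model that always predicts "safe" would be right about 81% of the time, which is a useful baseline to keep in mind. The training and validation subsets used later may have a different class mix.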


Let's use a subset of the features to build our decision tree model. Among these features, emp_length_num, dti, revol_util and total_rec_late_fee are numerical, and all the others are categorical.

In [4]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'bad_loans'                    # prediction target (y) (1 means risky, 0 means safe)

# Extract the feature columns and target column
loans = loans[features + [target]]

In [5]:
loans.head()

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,bad_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.7,0.0,0
1,C,C4,1,1,RENT,1.0,car,60 months,1,1,9.4,0.0,1
2,C,C5,0,11,RENT,8.72,small_business,36 months,1,1,98.5,0.0,0
3,C,C1,0,11,RENT,20.0,other,36 months,0,1,21.0,16.97,0
4,A,A4,0,4,RENT,11.2,wedding,36 months,1,1,28.3,0.0,0


### Preprocessing


scikit-learn's decision trees only accept numerical inputs, so we need to preprocess the categorical variables and create a dummy (0/1) variable for each category.

In [6]:
# One-hot encode the categorical columns with pd.get_dummies
loans = pd.get_dummies(loans)
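
To make the effect of `get_dummies` concrete, here is a small sketch on a made-up frame. The column names mirror the loan data, but the rows are invented for illustration.

```python
import pandas as pd

# Toy frame: one categorical column and two numerical ones (made-up values).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C'],
    'dti': [11.2, 27.65, 1.0, 8.72],
    'bad_loans': [0, 0, 1, 0],
})

# Numerical columns pass through unchanged; 'grade' is replaced by one
# indicator column per category (grade_A, grade_B, grade_C).
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Each row has exactly one indicator set among the `grade_*` columns, which is why the column count grows with the number of distinct categories.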

In [7]:
list(loans.columns.values)

['short_emp',
 'emp_length_num',
 'dti',
 'last_delinq_none',
 'last_major_derog_none',
 'revol_util',
 'total_rec_late_fee',
 'bad_loans',
 'grade_A',
 'grade_B',
 'grade_C',
 'grade_D',
 'grade_E',
 'grade_F',
 'grade_G',
 'sub_grade_A1',
 'sub_grade_A2',
 'sub_grade_A3',
 'sub_grade_A4',
 'sub_grade_A5',
 'sub_grade_B1',
 'sub_grade_B2',
 'sub_grade_B3',
 'sub_grade_B4',
 'sub_grade_B5',
 'sub_grade_C1',
 'sub_grade_C2',
 'sub_grade_C3',
 'sub_grade_C4',
 'sub_grade_C5',
 'sub_grade_D1',
 'sub_grade_D2',
 'sub_grade_D3',
 'sub_grade_D4',
 'sub_grade_D5',
 'sub_grade_E1',
 'sub_grade_E2',
 'sub_grade_E3',
 'sub_grade_E4',
 'sub_grade_E5',
 'sub_grade_F1',
 'sub_grade_F2',
 'sub_grade_F3',
 'sub_grade_F4',
 'sub_grade_F5',
 'sub_grade_G1',
 'sub_grade_G2',
 'sub_grade_G3',
 'sub_grade_G4',
 'sub_grade_G5',
 'home_ownership_MORTGAGE',
 'home_ownership_OTHER',
 'home_ownership_OWN',
 'home_ownership_RENT',
 'purpose_car',
 'purpose_credit_card',
 'purpose_debt_consolidation',
 'purpose_

Now let's separate the training and validation data. 

In [8]:
import json
with open('module-5-assignment-1-train-idx.json') as train_file:    
    train_index = json.load(train_file)
with open('module-5-assignment-1-validation-idx.json') as valid_file:    
    valid_index = json.load(valid_file)

In [9]:
print len(train_index)
print len(valid_index)

37224
9284


In [10]:
train_data = loans.iloc[train_index]
valid_data = loans.iloc[valid_index]

In [11]:
cols = [col for col in train_data.columns if col not in ['bad_loans']]
train_x = train_data[cols]
train_y = train_data['bad_loans']

cols = [col for col in valid_data.columns if col not in ['bad_loans']]
valid_x = valid_data[cols]
valid_y = valid_data['bad_loans']
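
The index files above come from the class assignment. When no predefined split is available, a random hold-out split is a common substitute. The sketch below uses scikit-learn's `train_test_split` on a made-up frame; the 80/20 ratio, the toy columns and the `toy_*` names are assumptions for illustration, not the split used in this notebook, and it assumes a scikit-learn version that ships `sklearn.model_selection`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for the encoded loans table.
rng = np.random.RandomState(1)
toy = pd.DataFrame({
    'dti': rng.uniform(0, 30, size=100),
    'grade_A': rng.randint(0, 2, size=100),
    'bad_loans': rng.randint(0, 2, size=100),
})

feature_cols = [c for c in toy.columns if c != 'bad_loans']

# stratify keeps the share of bad loans similar in both parts.
toy_train_x, toy_valid_x, toy_train_y, toy_valid_y = train_test_split(
    toy[feature_cols], toy['bad_loans'],
    test_size=0.2, random_state=1, stratify=toy['bad_loans'])

print(toy_train_x.shape, toy_valid_x.shape)
```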

### Training a decision tree classifier

In [12]:
import sklearn 
from sklearn.tree import DecisionTreeClassifier

In [13]:
treel6 = DecisionTreeClassifier(max_depth=6, random_state=1)
treel6.fit(train_x, train_y)

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=6, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1, splitter='best')

In [14]:
treel2 = DecisionTreeClassifier(max_depth=2, random_state=1)
treel2.fit(train_x, train_y)

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=2, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1, splitter='best')

In [15]:
treel10 = DecisionTreeClassifier(max_depth=10, random_state=1)
treel10.fit(train_x, train_y)

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=10, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=1, splitter='best')

In [None]:
# Export the 6-level tree in Graphviz .dot format
from sklearn import tree
tree.export_graphviz(treel6, out_file='tree.dot')

In [16]:
# What are the top 5 most important features for the 6-level tree?
pd.DataFrame({'feature': cols, 'importance': treel6.feature_importances_}).sort_values('importance', ascending=False).head()

Unnamed: 0,feature,importance
7,grade_A,0.341343
8,grade_B,0.192375
6,total_rec_late_fee,0.177496
2,dti,0.110301
9,grade_C,0.07458


In [17]:
# And for the 2-level tree?
pd.DataFrame({'feature': cols, 'importance': treel2.feature_importances_}).sort_values('importance', ascending=False).head()

Unnamed: 0,feature,importance
7,grade_A,0.603376
8,grade_B,0.340052
6,total_rec_late_fee,0.056573
0,short_emp,0.0
43,sub_grade_F5,0.0


In [18]:
# And for the 10-level tree?
pd.DataFrame({'feature': cols, 'importance': treel10.feature_importances_}).sort_values('importance', ascending=False).head()

Unnamed: 0,feature,importance
7,grade_A,0.239427
2,dti,0.145065
8,grade_B,0.134937
6,total_rec_late_fee,0.132555
5,revol_util,0.087105
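
For reference, `feature_importances_` holds the normalized total impurity reduction contributed by each feature, so the values sum to 1 for a tree with at least one split. The sketch below shows this on synthetic data where only one column carries signal; the column names `signal` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends only on the 'signal' column.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    'signal': rng.uniform(size=200),
    'noise': rng.uniform(size=200),
})
y = (X['signal'] > 0.5).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Label each importance with its feature name and rank them.
importances = pd.Series(clf.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Note that with one-hot encoding the importance of a categorical variable such as grade is spread over its dummy columns (grade_A, grade_B, ...), so the per-column numbers in the tables above understate the weight of the variable as a whole.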


In [30]:
from sklearn.tree import export_graphviz
#export_graphviz(treeclf, feature_names=cols)
%matplotlib inline  
export_graphviz(treel6, out_file='treel6.dot', feature_names=cols)


In [32]:
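# Let's try to render the tree as a PDF (this needs the pydotplus package)
import os
import pydotplus
from sklearn import tree

os.unlink('treel6.dot')  # remove the .dot file written by the previous cell
dot_data = tree.export_graphviz(treel6, out_file=None, feature_names=cols)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("treel6.pdf")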

ImportError: No module named pydotplus

In [19]:
treel2train = treel2.predict(train_x)
treel2valid = treel2.predict(valid_x)

treel6train = treel6.predict(train_x)
treel6valid = treel6.predict(valid_x)

treel10train = treel10.predict(train_x)
treel10valid = treel10.predict(valid_x)

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [23]:
print accuracy_score(train_y, treel2train)  # correct/total
print accuracy_score(valid_y, treel2valid)  # correct/total


0.613502041694
0.619345109866


In [21]:
print accuracy_score(train_y, treel6train)  # correct/total
print accuracy_score(valid_y, treel6valid)  # correct/total


0.640527616591
0.636148211978


In [22]:
print accuracy_score(train_y, treel10train)  # correct/total
print accuracy_score(valid_y, treel10valid)

0.66379217709
0.62483843171
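
The three result cells above follow the usual over-fitting pattern: training accuracy rises with depth (about 0.614, 0.641, 0.664), while validation accuracy peaks at depth 6 (0.636) and drops again at depth 10 (0.625). The comparison can be written as a single loop. The sketch below runs on synthetic data from `make_classification` rather than the loan data, so its numbers will differ from the ones above; the sample size and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (not the loan data).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Map each depth to its (training accuracy, validation accuracy) pair.
results = {}
for depth in (2, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    results[depth] = (accuracy_score(y_train, model.predict(X_train)),
                      accuracy_score(y_valid, model.predict(X_valid)))
    print(depth, results[depth])
```

A growing gap between the two numbers as depth increases is the signal to stop adding depth.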


In [24]:
#For the quiz only
validation_safe_loans = valid_data[valid_data['bad_loans'] == 0]
validation_risky_loans = valid_data[valid_data['bad_loans'] == 1]

sample_validation_data_risky = validation_risky_loans.iloc[0:2]
sample_validation_data_safe = validation_safe_loans.iloc[0:2]

sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,bad_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
19,0,11,11.18,1,1,82.4,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0,1,1,0,...,0,0,0,0,0,0,0,0,1,0


In [25]:
sample_validation_data['bad_loans']
sample_validation_data[cols]

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,grade_A,grade_B,grade_C,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
19,0,11,11.18,1,1,82.4,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0


In [26]:
# note 1 means a bad loan, and 0 means a safe one
# in the question, 1 means good and -1 means bad
treel6.predict(sample_validation_data[cols])

array([0, 1, 1, 0])

In [27]:
# the probability of predicting 0 (good) and 1 (bad)
treel6.predict_proba(sample_validation_data[cols])

array([[ 0.65843457,  0.34156543],
       [ 0.46369354,  0.53630646],
       [ 0.35249042,  0.64750958],
       [ 0.79210526,  0.20789474]])
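
The class predictions and the probabilities are linked: `predict` returns the class whose column in `predict_proba` is largest. In the output above, the two rows whose second column exceeds 0.5 are exactly the ones predicted as 1. A minimal sketch on synthetic data (not the loan data) that checks this relationship:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, used only to have a fitted tree at hand.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# One row per sample, one column per class, in the order of clf.classes_.
proba = clf.predict_proba(X[:5])

# Pick the most probable class for each row and compare with predict().
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
same = np.array_equal(labels_from_proba, clf.predict(X[:5]))
print(same)
```

For a decision tree, each probability is the class fraction among the training samples in the leaf that the row falls into.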

In [29]:
treel2.predict(sample_validation_data[cols])

array([0, 1, 1, 0])

Let us assume that each mistake costs money:

- Assume a cost of $10,000 per false negative.
- Assume a cost of $20,000 per false positive.

What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.

In [41]:
type(valid_y)
type(treel6valid)

numpy.ndarray

In [44]:
# Following the quiz, "safe" is treated as the positive class here.
# False negatives: loans that were actually safe but were predicted to be risky.
# False positives: loans that were actually risky but were predicted to be safe.

fn = 0  # predicted default (1), actually safe (0)
fp = 0  # predicted safe (0), actually default (1)

for i in range(len(valid_y)):
    if valid_y.iloc[i] == 0 and treel6valid[i] == 1:
        fn += 1
    if valid_y.iloc[i] == 1 and treel6valid[i] == 0:
        fp += 1

# $10,000 per false negative and $20,000 per false positive
print fn*1e4 + fp*2e4

50390000.0
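
The counting loop above can also be expressed with scikit-learn's `confusion_matrix`. The sketch below uses made-up labels and the per-mistake costs stated in the question ($10,000 and $20,000). As in the notebook, "safe" (label 0) is treated as the positive class, which is the opposite of the convention many tools assume.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = risky loan, 0 = safe loan.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])

# Rows are true labels, columns are predicted labels, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Notebook convention: "safe" is the positive class.
false_negatives = cm[0, 1]   # truly safe (0), predicted risky (1)
false_positives = cm[1, 0]   # truly risky (1), predicted safe (0)

total_cost = 10000 * false_negatives + 20000 * false_positives
print(total_cost)
```

Passing `labels=[0, 1]` explicitly fixes the row and column order, so the two off-diagonal cells cannot be mixed up silently.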


In [None]:
# Big question:
# Why did we select those features? Let's work on some feature selection.

In [6]:
# Dealing with categorical data with LabelEncoder and OneHotEncoder

from sklearn import preprocessing
catcol = ['grade', 'sub_grade', 'short_emp', 'home_ownership', 'purpose', 'term', 
'last_delinq_none','last_major_derog_none','bad_loans']
loans_cat = loans[catcol].apply(preprocessing.LabelEncoder().fit_transform)

In [None]:
onehot = preprocessing.OneHotEncoder()
onehot.fit(loans_cat)

In [18]:
onehot.get_params()
loans_cat_onehot = onehot.transform(loans_cat)
loans_cat_onehot.shape

{'categorical_features': 'all',
 'dtype': float,
 'n_values': 'auto',
 'sparse': True}

I could populate a sparse DataFrame from the sparse matrix, as in the snippet below, but this would take extra memory, so let's keep things simple and process the data as NumPy arrays.

loans_cat_onehot = pd.SparseDataFrame([ pd.SparseSeries(loans_cat_onehot[i].toarray().ravel()) 
                              for i in np.arange(loans_cat_onehot.shape[0]) ])

In [34]:
# Now, let's combine the processed categorical data and the numerical data
# first let's output the numerical data as a numpy array
numerical = loans[['emp_length_num', 'dti', 'revol_util', 'total_rec_late_fee']].values

In [39]:
from scipy.sparse import hstack
prepped= hstack((loans_cat_onehot,numerical))
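
The encode-then-stack steps above can be tried end to end on a small made-up frame. This sketch assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns directly, so the `LabelEncoder` step is not needed; the rows and the choice of columns are invented for illustration.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two categorical and two numerical columns (made-up values).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C'],
    'term': [' 36 months', ' 60 months', ' 36 months', ' 36 months'],
    'dti': [11.2, 27.65, 1.0, 8.72],
    'revol_util': [28.3, 83.7, 9.4, 98.5],
})

# One-hot encode the categorical columns; the result is a sparse matrix.
encoder = OneHotEncoder()
cat_onehot = encoder.fit_transform(toy[['grade', 'term']])

# Numerical columns as a plain NumPy array.
numerical = toy[['dti', 'revol_util']].values

# Stack side by side: 3 grade columns + 2 term columns + 2 numerical columns.
prepped = hstack((cat_onehot, numerical)).tocsr()
print(prepped.shape)
```

Keeping the result sparse avoids materializing the mostly-zero indicator columns, and scikit-learn's tree estimators accept sparse input.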


In [None]:
print "Number of safe loans  : %s" % len(safe_loans_raw)

In [None]:
# For the random prediction, the answer is -1