#Identifying safe loans with decision trees

The LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be charged off and possibly go into default. In this assignment you will:

Use SFrames to do some feature engineering.
Train a decision-tree on the LendingClub dataset.
Visualize the tree.
Predict whether a loan will default along with prediction probabilities (on a validation set).
Train a complex tree model and compare it to a simple tree model.

#Load the LendingClub dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
loans = pd.read_csv('/Users/April/Downloads/lending-club-data.csv')



# Explore some features

In [3]:
loans.columns

Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'is_inc_v', u'issue_d', u'loan_status', u'pymnt_plan', u'url', u'desc',
       u'purpose', u'title', u'zip_code', u'addr_state', u'dti',
       u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'not_compliant',
       u'status', u'inactiv

In [4]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1,1,1,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1,1,1,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1,1,1,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1,1,1,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1,1,1,0,5.21533,20141201T000000,1,1,1


#Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

+1 as a safe loan
-1 as a risky (bad) loan

In [5]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis = 1)
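
The same relabeling can also be written without a Python-level lambda. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative, not taken from the LendingClub data) and `Series.map` with an explicit dictionary, which makes the 0 to +1 and 1 to -1 mapping easy to audit.

```python
import pandas as pd

# Hypothetical toy data standing in for the real loans table.
toy = pd.DataFrame({'bad_loans': [0, 1, 0, 0, 1]})

# Map bad_loans (0 = safe, 1 = risky) onto the +1 / -1 convention.
toy['safe_loans'] = toy['bad_loans'].map({0: 1, 1: -1})
toy = toy.drop('bad_loans', axis=1)

print(toy['safe_loans'].tolist())
```

Any value outside the dictionary would become NaN, which can be a useful early warning if the column contains unexpected codes.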

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.

You should have:

Around 81% safe loans
Around 19% risky loans
It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.



In [6]:
safe_loans_prob = round(float(sum(loans['safe_loans'] == 1))/len(loans),2)

In [7]:
safe_loans_prob

0.81
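
The cell above computes only the safe fraction. As a sketch of how both percentages could be printed at once, `value_counts(normalize=True)` returns the share of each label. The `labels` Series below is invented for illustration and does not reproduce the real 81% / 19% split.

```python
import pandas as pd

# Hypothetical labels: 8 safe (+1) and 2 risky (-1) loans.
labels = pd.Series([1, 1, 1, 1, -1, 1, 1, -1, 1, 1], name='safe_loans')

# normalize=True turns counts into fractions that sum to 1.
shares = labels.value_counts(normalize=True)

print('safe:  {:.0%}'.format(shares.loc[1]))
print('risky: {:.0%}'.format(shares.loc[-1]))
```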

#Features for the classification algorithm

 In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features. Extract these feature columns and target column from the dataset. We will only use these features.

In [8]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

# One-hot encoding

In [10]:
# Original SFrame-based one-hot encoding, kept for reference only (not executed);
# the pandas cells below replace it.
'''categorical_variables = []
for feat_name, feat_type in zip(loans.columns, loans.dtypes):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    print loans_data_one_hot_encoded
    print loans_data_unpacked

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans.remove_column(feature)
    loans.add_columns(loans_data_unpacked)'''

In [10]:
categorical_var = [m for m in loans.columns if loans[m].dtypes == object]
categorical_var

['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']

In [9]:
loans

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.70,0.00,1
1,C,C4,1,1,RENT,1.00,car,60 months,1,1,9.40,0.00,-1
2,C,C5,0,11,RENT,8.72,small_business,36 months,1,1,98.50,0.00,1
3,C,C1,0,11,RENT,20.00,other,36 months,0,1,21.00,16.97,1
4,A,A4,0,4,RENT,11.20,wedding,36 months,1,1,28.30,0.00,1
5,E,E1,0,10,RENT,5.35,car,36 months,1,1,87.50,0.00,1
6,F,F2,0,5,OWN,5.55,small_business,60 months,1,1,32.60,0.00,-1
7,B,B5,1,1,RENT,18.08,other,60 months,1,1,36.50,0.00,-1
8,C,C3,0,6,OWN,16.12,debt_consolidation,60 months,1,1,20.60,0.00,1
9,B,B5,0,11,OWN,10.78,debt_consolidation,36 months,1,1,67.10,0.00,1


In [11]:
loans = pd.get_dummies(loans)

In [12]:
loans

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,0,11,27.65,1,1,83.70,0.00,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,1,1,1.00,1,1,9.40,0.00,-1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,11,8.72,1,1,98.50,0.00,1,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,11,20.00,0,1,21.00,16.97,1,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0,4,11.20,1,1,28.30,0.00,1,1,0,...,0,0,0,0,0,0,0,1,1,0
5,0,10,5.35,1,1,87.50,0.00,1,0,0,...,0,0,0,0,0,0,0,0,1,0
6,0,5,5.55,1,1,32.60,0.00,-1,0,0,...,0,0,0,0,0,1,0,0,0,1
7,1,1,18.08,1,1,36.50,0.00,-1,0,1,...,0,0,0,0,1,0,0,0,0,1
8,0,6,16.12,1,1,20.60,0.00,1,0,0,...,0,0,0,0,0,0,0,0,0,1
9,0,11,10.78,1,1,67.10,0.00,1,0,1,...,0,0,0,0,0,0,0,0,1,0
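
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row frame in the style of the first rows shown above (the frame `toy` is illustrative). Text-valued columns are expanded into one indicator column per category, named `<column>_<value>`, while numeric columns pass through unchanged. Passing `dtype=int` is an assumption added here so that the indicators are 0/1 integers; recent pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({
    'grade': ['B', 'C', 'C'],     # categorical (text) column
    'dti': [27.65, 1.00, 8.72],   # numeric, left untouched
})

# dtype=int keeps the indicator columns as 0/1 integers.
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded)
```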


In [13]:
import json
with open('/Users/April/Desktop/datasci_course_materials-master/assignment1/train index.json', 'r') as f: # Reads the row indices of the training set
    train_idx = json.load(f)
with open('/Users/April/Desktop/datasci_course_materials-master/assignment1/validation index.json', 'r') as f1: # Reads the row indices of the validation set
    validation_idx = json.load(f1)

In [14]:
train_data = loans.iloc[train_idx]
validation_data = loans.iloc[validation_idx]

In [15]:
train_data.shape

(37224, 68)

In [16]:
validation_data.shape

(9284, 68)
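
The split above relies on row positions stored as JSON lists. A minimal sketch of that mechanism follows, using inline JSON strings and a made-up frame in place of the course files (both are assumptions for illustration).

```python
import json
import pandas as pd

toy = pd.DataFrame({'dti': [27.65, 1.00, 8.72, 20.00, 11.20],
                    'safe_loans': [1, -1, 1, 1, 1]})

# Stand-ins for the contents of the two index files.
toy_train_idx = json.loads('[0, 2, 3]')
toy_validation_idx = json.loads('[1, 4]')

# iloc selects rows by integer position, not by index label.
train_toy = toy.iloc[toy_train_idx]
validation_toy = toy.iloc[toy_validation_idx]

print(train_toy.shape, validation_toy.shape)
```

Because the positions are fixed in the files, everyone who uses them gets the same training and validation rows.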

In [19]:
safe_loans_prob = round(float(sum(validation_data['safe_loans'] == 1))/len(validation_data),2)
safe_loans_prob

0.5

#Build a decision tree classifier

Now, let's use the built-in scikit-learn decision tree learner (sklearn.tree.DecisionTreeClassifier) to create a loan prediction model on the training data. To do this, you will need to import sklearn, sklearn.tree, and numpy.

Note: You will have to first convert the SFrame into a numpy data matrix, and extract the target labels as a numpy array (Hint: you can use the .to_numpy() method call on SFrame to turn SFrames into numpy arrays). See the API for more information. Make sure to set max_depth=6.
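
The note above is written for SFrames. With a pandas DataFrame the equivalent conversion can be sketched as follows; the frame `toy` and its values are hypothetical, and `DataFrame.to_numpy()` assumes pandas 0.24 or later. scikit-learn also accepts DataFrames directly, so the explicit conversion is optional.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature of the training data.
toy = pd.DataFrame({
    'dti': [27.65, 1.00, 8.72, 20.00, 11.20, 5.55],
    'short_emp': [0, 1, 0, 0, 0, 0],
    'safe_loans': [1, -1, 1, 1, 1, -1],
})

# Feature matrix and label vector as plain numpy arrays.
X = toy.drop('safe_loans', axis=1).to_numpy()
y = toy['safe_loans'].to_numpy()

model = DecisionTreeClassifier(max_depth=6, random_state=0)
model.fit(X, y)

print(X.shape, y.shape, model.classes_)
```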

In [20]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [21]:
decision_tree_model = tree.DecisionTreeClassifier(max_depth=6)

In [22]:
x = train_data.drop('safe_loans', axis=1)

In [23]:
decision_tree_model.fit(x, train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

In [24]:
small_model = tree.DecisionTreeClassifier(max_depth=2)

In [25]:
small_model.fit(train_data.drop('safe_loans', axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

#Visualizing a learned model (Optional)

In [25]:
tree.export_graphviz(small_model, out_file='tree.dot')  

In [29]:
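from IPython.display import Image
import pydot  # needs the pydot package and a Graphviz installation
# out_file=None makes export_graphviz return the DOT source as a string
dot_data = tree.export_graphviz(small_model, out_file=None)
graph = pydot.graph_from_dot_data(dot_data)
# newer pydot versions return a list of graphs, older ones a single graph
graph = graph[0] if isinstance(graph, list) else graph
Image(graph.create_png())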



ImportError: No module named pydot

In [30]:
from sklearn.tree import export_graphviz
import graphviz

export_graphviz(small_model, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

ImportError: No module named graphviz
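
Both attempts above fail because the plotting packages are missing. If installing Graphviz is not an option, one alternative (not part of the original assignment) is `sklearn.tree.export_text`, which is available in scikit-learn 0.21 and later and prints the tree as indented text. The sketch below fits a depth-2 tree on made-up data; `toy_X`, `toy_y`, and the feature names are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: columns are short_emp and dti, labels are +1 / -1.
toy_X = [[0, 10.0], [0, 25.0], [1, 5.0], [1, 30.0], [0, 8.0], [1, 12.0]]
toy_y = [1, -1, 1, -1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(toy_X, toy_y)

# One line per node; leaf lines show the predicted class.
report = export_text(toy_tree, feature_names=['short_emp', 'dti'])
print(report)
```

On the real data the same call would be `export_text(small_model, feature_names=list(x.columns))`, assuming `x` is the training feature frame defined earlier.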

#Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

Predict whether or not a loan is safe.
Predict the probability that a loan is safe.

First, let's grab 2 positive examples and 2 negative examples.

In [26]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
19,0,11,11.18,1,1,82.4,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0,-1,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0,-1,1,0,...,0,0,0,0,0,0,0,0,1,0


Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the .predict() method)

Quiz Question: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

In [27]:
decision_tree_model.predict(sample_validation_data.drop('safe_loans', axis=1))

array([ 1, -1, -1,  1])
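
To answer the quiz question, compare the predictions with the true labels. The sketch below hard-codes the four `safe_loans` values from the sample table and the array printed above, so it runs without the dataset.

```python
import numpy as np

# True labels of the four sampled loans (safe_loans column above).
actual = np.array([1, 1, -1, -1])
# Output of decision_tree_model.predict on the same rows.
predicted = np.array([1, -1, -1, 1])

# Fraction of positions where prediction and label agree.
sample_accuracy = np.mean(predicted == actual)
print(sample_accuracy)  # 0.5
```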

#Explore probability predictions

For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe? (Hint: if you are using scikit-learn, you can use the .predict_proba() method)

Quiz Question: Which loan has the highest probability of being classified as a safe loan?

In [28]:
decision_tree_model.predict_proba(sample_validation_data.drop('safe_loans', axis=1))

array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])
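
scikit-learn orders the columns of `predict_proba` by the model's `classes_` attribute, which is sorted, so here column 0 is the probability of -1 (risky) and column 1 is the probability of +1 (safe). A short sketch that reads the answer off the array printed above:

```python
import numpy as np

# predict_proba output copied from the cell above.
proba = np.array([[0.34156543, 0.65843457],
                  [0.53630646, 0.46369354],
                  [0.64750958, 0.35249042],
                  [0.20789474, 0.79210526]])

p_safe = proba[:, 1]                 # column for class +1
most_safe = int(np.argmax(p_safe))   # 0-based row position

print(most_safe, p_safe[most_safe])
```

Row position 3 is the fourth loan in the sample, the row with index label 41.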

#Tricky predictions!

Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?



In [29]:
small_model.predict_proba(sample_validation_data.drop('safe_loans', axis=1))

array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

Quiz Question: Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?
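
One way to investigate is to ask the tree which leaf each row lands in. A tree's predicted probability is the class fraction among the training rows in that leaf, so rows that share a leaf share a probability, and a depth-2 tree has at most four leaves. The sketch below uses made-up data (`toy_X` and `toy_y` are illustrative) and the `apply` method, which returns the leaf index for each row.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature data with noisy labels.
toy_X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
toy_y = np.array([-1, -1, 1, -1, 1, 1, -1, 1])

shallow = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow.fit(toy_X, toy_y)

# Two different inputs, both larger than every training value,
# so they follow the same path down the tree.
queries = np.array([[9.5], [12.0]])
leaves = shallow.apply(queries)
probs = shallow.predict_proba(queries)

print(leaves)
print(probs)
```

Running `small_model.apply(...)` on the sample rows would show in the same way whether the 2nd and 3rd loans share a leaf.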

# Visualize the prediction on a tree

Quiz Question: Based on the visualized tree, what prediction would you make for this data point (according to small_model)? (If you don't have Graphviz, you can answer this quiz question by executing the next part.)

In [30]:
small_model.predict(sample_validation_data.drop('safe_loans', axis=1))

array([ 1, -1, -1,  1])

#Evaluating accuracy of the decision tree model

Evaluate the accuracy of small_model and decision_tree_model on the training data. (Hint: if you are using scikit-learn, you can use the .score() method)



In [31]:
decision_tree_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None)

0.64052761659144641

In [32]:
small_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None)

0.61350204169353106

 Now, evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.

In [33]:
decision_tree_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None)

0.63636363636363635

In [34]:
small_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None)

0.61934510986643687

#Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

In [35]:
big_model = tree.DecisionTreeClassifier(max_depth=10)

Evaluate the accuracy of big_model on the training set and validation set.

In [37]:
big_model.fit(x, train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')

In [38]:
big_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None)

0.66384590586718251

In [39]:
big_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None)

0.62656182679879358
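
Collecting the six scores reported in this notebook in one place makes the pattern easier to see: training accuracy keeps rising with depth, while validation accuracy peaks at depth 6 and drops at depth 10, the usual signature of overfitting. The snippet only restates numbers printed in the cells above, rounded to four decimals.

```python
# (max_depth, training accuracy, validation accuracy) from the cells above
scores = [
    (2, 0.6135, 0.6193),
    (6, 0.6405, 0.6364),
    (10, 0.6638, 0.6266),
]

for depth, train_acc, valid_acc in scores:
    gap = train_acc - valid_acc
    print('depth {:>2}: train {:.4f}  valid {:.4f}  gap {:+.4f}'.format(
        depth, train_acc, valid_acc, gap))

# Depth with the highest validation accuracy.
best_depth = max(scores, key=lambda row: row[2])[0]
print('best depth on validation:', best_depth)
```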

#Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try to quantify the cost of each mistake made by the model. Assume the following:

False negatives: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.

False positives: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.

Correct predictions: Correct predictions typically don't incur any cost.


Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:

First, let us compute the predictions made by the model.
Second, compute the number of false positives.
Third, compute the number of false negatives.
Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

Quiz Question: Let's assume that each mistake costs us money: a false negative costs $10,000, while a false positive costs $20,000. What is the total cost of mistakes made by decision_tree_model on validation_data?
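
The four steps can be sketched on a handful of made-up labels before running them on the full validation set. The function name `cost_of_mistakes` and the toy arrays are illustrative; the default costs are the ones given in the quiz question.

```python
import numpy as np

def cost_of_mistakes(actual, predicted, fn_cost=10000, fp_cost=20000):
    """Return (false_negatives, false_positives, total_cost) for +1/-1 labels."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    # False negative: a safe loan (+1) predicted as risky (-1).
    false_negatives = int(np.sum((actual == 1) & (predicted == -1)))
    # False positive: a risky loan (-1) predicted as safe (+1).
    false_positives = int(np.sum((actual == -1) & (predicted == 1)))
    total = fn_cost * false_negatives + fp_cost * false_positives
    return false_negatives, false_positives, total

toy_actual = [1, 1, -1, -1, 1]
toy_predicted = [1, -1, 1, -1, -1]
print(cost_of_mistakes(toy_actual, toy_predicted))  # (2, 1, 40000)
```

Spelling out the two conditions is more verbose than comparing the arrays with `<` and `>`, but it does not depend on the labels being ordered numbers.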

In [41]:
validation_prediction = decision_tree_model.predict(validation_data.drop('safe_loans', axis=1))

In [42]:
false_negative_counts = sum(validation_prediction < validation_data[target])

In [43]:
false_positive_counts = sum(validation_prediction > validation_data[target])

In [44]:
total_cost = 10000*false_negative_counts + 20000*false_positive_counts

In [45]:
total_cost

50370000