# Identifying safe loans with decision trees
LendingClub is a peer-to-peer lending company that directly connects borrowers with potential lenders and investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

In [1]:
import pandas as pd
import numpy as np


# Load LendingClub dataset

We will be using a dataset from LendingClub. A parsed and cleaned form of the dataset is available here. Make sure you download the dataset before running the following command.

In [2]:
loans = pd.read_csv("lending-club-data.csv")



# Exploring some features

The dataset has feature columns describing the grade of the loan, annual income, home ownership status, etc. Let's start by listing all the columns in the dataset.

In [3]:
loans.columns


Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans',
       'emp_length_num', 'grade_num', 'sub_gra
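
The column listing alone does not show how loans are spread across grades. Below is a minimal sketch of how that distribution could be inspected. It uses a small made-up frame (`toy_loans` is hypothetical, not the LendingClub data); on the real data the equivalent call would be on `loans['grade']`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `loans` frame.
toy_loans = pd.DataFrame({"grade": ["A", "B", "B", "C", "B", "A"]})

# Count loans per grade, sorted by grade label rather than by frequency.
grade_counts = toy_loans["grade"].value_counts().sort_index()
print(grade_counts)
```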


# Exploring the target column
The target column (label column) of the dataset is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

- +1 as a safe loan,
- -1 as a risky (bad) loan.

We put this in a new column called safe_loans.

In [7]:
loans["safe_loans"] = loans.bad_loans.apply(lambda x : +1 if x == 0 else -1)

In [8]:
loans.safe_loans

0         1
1        -1
2         1
3         1
4         1
         ..
122602   -1
122603    1
122604   -1
122605   -1
122606    1
Name: safe_loans, Length: 122607, dtype: int64
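
The relabeling can also be written without a per-row lambda. The sketch below shows one possible vectorized form with `np.where` on a hypothetical toy frame (`toy` is made up for illustration); it is meant to produce the same +1/-1 mapping, not to replace the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels: 1 = risky (bad) loan, 0 = safe loan.
toy = pd.DataFrame({"bad_loans": [0, 1, 0, 0, 1]})

# Same mapping as the lambda above: 0 -> +1 (safe), anything else -> -1 (risky).
toy["safe_loans"] = np.where(toy["bad_loans"] == 0, 1, -1)
print(toy["safe_loans"].tolist())
```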

In [9]:
loans.drop("bad_loans",axis =1 ,inplace = True)

In [11]:
sum(loans.safe_loans == 1)/loans.shape[0]

0.8111853319957262

- Around 81% safe loans
- Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
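
One way to see why the imbalance is a problem: a model that labels every loan as safe already reaches about 81% accuracy while catching no risky loans at all. The sketch below rebuilds the label vector from the class counts reported in this notebook (99457 safe, 23150 risky); the `always_safe` predictor is a hypothetical baseline, not a model used elsewhere here.

```python
import numpy as np

# Class counts as reported by this notebook (99457 safe, 23150 risky).
n_safe, n_risky = 99457, 23150
labels = np.concatenate([np.ones(n_safe, dtype=int), -np.ones(n_risky, dtype=int)])

# A trivial "classifier" that labels every loan as safe (+1).
always_safe = np.ones_like(labels)

# Fraction of all loans labelled correctly.
baseline_accuracy = (always_safe == labels).mean()
# Fraction of risky loans that the baseline flags as risky.
risky_recall = (always_safe[labels == -1] == -1).mean()
print(f"accuracy: {baseline_accuracy:.4f}, risky loans caught: {risky_recall:.0%}")
```

Accuracy alone is therefore a weak yardstick on the raw data, which motivates balancing the classes before training.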

# Features for the classification algorithm


In [12]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

In [13]:
loans.head()

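| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B | B2 | 0 | 11 | RENT | 27.65 | credit_card | 36 months | 1 | 1 | 83.7 | 0.0 | 1 |
| 1 | C | C4 | 1 | 1 | RENT | 1.0 | car | 60 months | 1 | 1 | 9.4 | 0.0 | -1 |
| 2 | C | C5 | 0 | 11 | RENT | 8.72 | small_business | 36 months | 1 | 1 | 98.5 | 0.0 | 1 |
| 3 | C | C1 | 0 | 11 | RENT | 20.0 | other | 36 months | 0 | 1 | 21.0 | 16.97 | 1 |
| 4 | A | A4 | 0 | 4 | RENT | 11.2 | wedding | 36 months | 1 | 1 | 28.3 | 0.0 | 1 |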


What remains now is a subset of features and the target that we will use for the rest of this notebook.



# Sample data to balance classes


As we explored above, our data is disproportionally full of safe loans. Let's create two datasets: one with just the safe loans (safe_loans_raw) and one with just the risky loans (risky_loans_raw).

In [15]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print ("Number of safe loans  : %s" % len(safe_loans_raw))
print ("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150


In [34]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage, random_state=1)

# Combine risky_loans with the downsampled version of safe_loans
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
loans_data = pd.concat([risky_loans, safe_loans])
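
The undersampling step can be checked on a tiny example. The sketch below uses a hypothetical 8-safe / 2-risky toy frame and samples with `n=` instead of `frac=`; that is a small variation on the cell above which guarantees exactly equal class sizes, since `frac=` rounds the sample size.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 8 safe (+1) loans and 2 risky (-1) loans.
toy = pd.DataFrame({"safe_loans": [1] * 8 + [-1] * 2})

toy_safe = toy[toy["safe_loans"] == 1]
toy_risky = toy[toy["safe_loans"] == -1]

# Keep all risky loans and draw an equally sized random sample of safe loans.
toy_safe_sampled = toy_safe.sample(n=len(toy_risky), random_state=1)
toy_balanced = pd.concat([toy_risky, toy_safe_sampled])

print(toy_balanced["safe_loans"].value_counts())
```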

In [36]:
loans_data

| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | C4 | 1 | 1 | RENT | 1.00 | car | 60 months | 1 | 1 | 9.4 | 0.0 | -1 |
| 6 | F | F2 | 0 | 5 | OWN | 5.55 | small_business | 60 months | 1 | 1 | 32.6 | 0.0 | -1 |
| 7 | B | B5 | 1 | 1 | RENT | 18.08 | other | 60 months | 1 | 1 | 36.5 | 0.0 | -1 |
| 10 | C | C1 | 1 | 1 | RENT | 10.08 | debt_consolidation | 36 months | 1 | 1 | 91.7 | 0.0 | -1 |
| 12 | B | B2 | 0 | 4 | RENT | 7.06 | other | 36 months | 1 | 1 | 55.5 | 0.0 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10016 | A | A4 | 0 | 4 | MORTGAGE | 21.36 | debt_consolidation | 36 months | 1 | 1 | 16.5 | 0.0 | 1 |
| 58367 | B | B1 | 0 | 8 | MORTGAGE | 16.29 | debt_consolidation | 36 months | 1 | 1 | 41.0 | 0.0 | 1 |
| 90431 | B | B1 | 0 | 9 | RENT | 14.40 | debt_consolidation | 36 months | 1 | 1 | 36.3 | 0.0 | 1 |
| 115727 | F | F5 | 0 | 4 | RENT | 29.45 | debt_consolidation | 60 months | 0 | 0 | 36.2 | 0.0 | 1 |


In [37]:
sum(loans_data.safe_loans == 1)/loans_data.shape[0]

0.5

In [42]:
lo_mat=pd.get_dummies(loans)


In [43]:
lo_mat

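| | short_emp | emp_length_num | dti | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans | grade_A | grade_B | ... | purpose_house | purpose_major_purchase | purpose_medical | purpose_moving | purpose_other | purpose_small_business | purpose_vacation | purpose_wedding | term_ 36 months | term_ 60 months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11 | 27.65 | 1 | 1 | 83.7 | 0.00 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1.00 | 1 | 1 | 9.4 | 0.00 | -1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 11 | 8.72 | 1 | 1 | 98.5 | 0.00 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 11 | 20.00 | 0 | 1 | 21.0 | 16.97 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 4 | 11.20 | 1 | 1 | 28.3 | 0.00 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 122602 | 1 | 0 | 1.50 | 0 | 0 | 14.6 | 0.00 | -1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 122603 | 0 | 11 | 11.26 | 0 | 0 | 15.2 | 0.00 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 122604 | 0 | 6 | 12.28 | 0 | 0 | 10.7 | 0.00 | -1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 122605 | 0 | 11 | 18.45 | 1 | 1 | 46.3 | 0.00 | -1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |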
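
The effect of `pd.get_dummies` is easier to see on a small frame. In this sketch (`toy` is hypothetical), the numeric column passes through unchanged and the categorical column is expanded into one indicator column per category, named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "dti": [27.65, 1.00, 8.72],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Numeric columns pass through; each category becomes its own indicator column.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

Recent pandas versions return boolean indicator columns by default; passing `dtype=int` to `pd.get_dummies` gives 0/1 values like those shown in the output above.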


In [44]:
train_index=pd.read_json('module-5-assignment-1-train-idx.json')
valid_index=pd.read_json('module-5-assignment-1-validation-idx.json')

In [45]:
train_data=lo_mat.loc[train_index[0]]
valid_data=lo_mat.loc[valid_index[0]]
train_data.shape

(37224, 68)
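
The split above selects rows by index label rather than at random, so everyone working from the same index files gets the same training and validation sets. A minimal sketch of the idea follows, assuming each index file is a plain JSON list of row labels; the toy frame and the two index lists are hypothetical stand-ins.

```python
import json
import pandas as pd

# Hypothetical toy frame; its index labels play the role of the loan row ids.
toy = pd.DataFrame({"dti": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[10, 11, 12, 13, 14])

# Stand-ins for the contents of the two JSON index files (plain lists of labels).
toy_train_idx = json.loads("[10, 12, 14]")
toy_valid_idx = json.loads("[11, 13]")

# .loc selects rows by index label, so the split is reproducible.
toy_train = toy.loc[toy_train_idx]
toy_valid = toy.loc[toy_valid_idx]

# The two sets should not share any rows.
assert set(toy_train.index).isdisjoint(toy_valid.index)
print(toy_train.shape, toy_valid.shape)
```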

# Use decision tree to build a classifier


In [49]:
from sklearn import tree


In [54]:
y = train_data[['safe_loans']]
x = train_data.drop("safe_loans",axis = 1)


In [55]:
x.shape

(37224, 67)

In [56]:
y.shape

(37224, 1)

In [57]:
decision_tree_model = tree.DecisionTreeClassifier(max_depth = 6)
decision_tree_model.fit(x,y)

DecisionTreeClassifier(max_depth=6)

In [58]:
small_model = tree.DecisionTreeClassifier(max_depth=2)
small_model.fit(x,y)

DecisionTreeClassifier(max_depth=2)
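
To illustrate what `max_depth` controls, here is a minimal sketch that fits a depth-1 tree (a single split) on hypothetical one-feature data. The data and the names `X_toy`, `y_toy` and `stump` are made up for illustration; the models in this notebook use depths 2 and 6 on the real features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature toy data: small values are risky (-1), large are safe (+1).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([-1, -1, 1, 1])

# max_depth caps the number of splits on any path from the root to a leaf.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

print(stump.predict([[0.5], [2.5]]))
print(stump.get_depth())
```

A deeper tree can carve the feature space into more regions, which tends to raise training accuracy but does not necessarily help on unseen data.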

In [96]:
validation_safe_loans = valid_data[valid_data[target] == 1]
validation_risky_loans = valid_data[valid_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

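| | short_emp | emp_length_num | dti | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans | grade_A | grade_B | ... | purpose_house | purpose_major_purchase | purpose_medical | purpose_moving | purpose_other | purpose_small_business | purpose_vacation | purpose_wedding | term_ 36 months | term_ 60 months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 0 | 11 | 11.18 | 1 | 1 | 82.4 | 0.0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 79 | 0 | 10 | 16.85 | 1 | 1 | 96.4 | 0.0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 24 | 0 | 3 | 13.97 | 0 | 1 | 59.5 | 0.0 | -1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 41 | 0 | 11 | 16.33 | 1 | 1 | 62.1 | 0.0 | -1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |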


In [97]:
sample_validation_data.safe_loans

19    1
79    1
24   -1
41   -1
Name: safe_loans, dtype: int64

In [98]:
y_valid = sample_validation_data.safe_loans
x_valid = sample_validation_data.drop("safe_loans",axis = 1)

# Explore label predictions


In [99]:
small_model.predict(x_valid)

array([ 1, -1, -1,  1], dtype=int64)

# Explore probability predictions


In [102]:
small_model.predict_proba(x_valid)

array([[0.41896585, 0.58103415],
       [0.59255339, 0.40744661],
       [0.59255339, 0.40744661],
       [0.23120112, 0.76879888]])
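
The label and probability outputs are linked: each row of `predict_proba` holds one probability per class, and the predicted label is the class with the larger value. scikit-learn orders the columns by sorted class label (available as `small_model.classes_` on the fitted model), so here column 0 is the risky class (-1) and column 1 is the safe class (+1). The sketch below recovers the labels from the probabilities printed above.

```python
import numpy as np

# Probabilities copied from the small_model output above.
proba = np.array([
    [0.41896585, 0.58103415],
    [0.59255339, 0.40744661],
    [0.59255339, 0.40744661],
    [0.23120112, 0.76879888],
])

# Columns follow the sorted class labels: [-1 (risky), +1 (safe)].
classes = np.array([-1, 1])

# The predicted label is the class with the highest probability in each row.
labels_from_proba = classes[proba.argmax(axis=1)]
print(labels_from_proba)
```

The result matches the array returned by `small_model.predict(x_valid)`. Note that the second and third rows have identical probabilities, which suggests both loans fall into the same leaf of the depth-2 tree.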

In [103]:
decision_tree_model.predict(x_valid)

array([ 1, -1, -1,  1], dtype=int64)

In [104]:
decision_tree_model.predict_proba(x_valid)

array([[0.34156543, 0.65843457],
       [0.53630646, 0.46369354],
       [0.64750958, 0.35249042],
       [0.20789474, 0.79210526]])

# Evaluating accuracy of the decision tree model

In [109]:
small_model.score(x,y)

0.6135020416935311

In [108]:
decision_tree_model.score(x,y)

0.6405276165914464

In [110]:
y_valid_entire = valid_data[['safe_loans']]
x_valid_entire = valid_data.drop("safe_loans",axis = 1)

In [112]:
small_model.score(x_valid_entire,y_valid_entire)

0.6193451098664369

In [111]:
decision_tree_model.score(x_valid_entire,y_valid_entire)

0.6363636363636364

# Evaluating accuracy of a complex decision tree model

In [113]:
big_model =tree.DecisionTreeClassifier(max_depth = 10)
big_model.fit(x,y)

DecisionTreeClassifier(max_depth=10)

In [115]:
big_model.score(x,y)

0.6637921770900495

In [114]:
big_model.score(x_valid_entire,y_valid_entire)

0.6261309780267126
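
Putting the reported accuracies side by side makes the trade-off visible. The sketch below only rearranges numbers already shown in this notebook: the depth-10 tree has the highest training accuracy but a lower validation accuracy than the depth-6 tree, a pattern usually read as overfitting.

```python
# Accuracies copied from the cells above: (training, validation) per max_depth.
scores = {
    2: (0.6135020416935311, 0.6193451098664369),
    6: (0.6405276165914464, 0.6363636363636364),
    10: (0.6637921770900495, 0.6261309780267126),
}

for depth, (train_acc, valid_acc) in scores.items():
    # A large positive gap means the model does better on data it has seen.
    gap = train_acc - valid_acc
    print(f"max_depth={depth:>2}: train={train_acc:.4f} valid={valid_acc:.4f} gap={gap:+.4f}")

# Depth with the best validation accuracy among the three models.
best_depth = max(scores, key=lambda d: scores[d][1])
print("best depth on validation data:", best_depth)
```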


# Quantifying the cost of mistakes
Every mistake the model makes costs money. In this section, we will try to quantify the cost of each mistake made by the model.

## Assume the following:

- **False negatives**: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
- **False positives**: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
- **Correct predictions**: Correct predictions don't typically incur any cost.

In [116]:
from sklearn.metrics import confusion_matrix

In [124]:
y_pred =decision_tree_model.predict(x_valid_entire)

In [125]:
confusion_matrix(y_valid_entire,y_pred)

array([[3013, 1661],
       [1715, 2895]], dtype=int64)
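
Reading the matrix requires knowing its layout. In scikit-learn's convention, rows are true labels and columns are predicted labels, both in sorted label order, so here the order is risky (-1) then safe (+1). The sketch below unpacks the four cells of the matrix printed above and checks that they reproduce the validation accuracy of the depth-6 model.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, order: [-1 (risky), +1 (safe)].
cm = np.array([[3013, 1661],
               [1715, 2895]])

# With "safe" as the positive class: ravel() yields TN, FP, FN, TP.
true_neg, false_pos, false_neg, true_pos = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy = (true_neg + true_pos) / cm.sum()
print(f"false positives (risky predicted safe): {false_pos}")
print(f"false negatives (safe predicted risky): {false_neg}")
print(f"accuracy: {accuracy:.4f}")
```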

In [126]:
(1715*10000)+(1661*20000)


50370000
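
The arithmetic in the last cell implies a cost of 10,000 per false negative and 20,000 per false positive. The sketch below restates it with named values so the assumptions are explicit; the constant names are hypothetical, and the counts come from the confusion matrix above.

```python
# Counts read from the confusion matrix above.
false_negatives = 1715  # safe loans predicted risky
false_positives = 1661  # risky loans predicted safe

# Per-mistake costs implied by the arithmetic in the cell above.
COST_FALSE_NEGATIVE = 10_000
COST_FALSE_POSITIVE = 20_000

total_cost = (false_negatives * COST_FALSE_NEGATIVE
              + false_positives * COST_FALSE_POSITIVE)
print(f"{total_cost:,}")
```

Although the model makes slightly fewer false positives than false negatives, false positives account for roughly two thirds of the total because each one is assumed to cost twice as much.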