# Identifying safe loans with decision trees

The [LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_%28finance%29).

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. In this assignment you will:

* Train a decision-tree on the LendingClub dataset.
* Visualize the tree.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.

Let's get started!

In [1]:
import pandas as pd
import numpy as np

# Load LendingClub dataset

We will be using a dataset from the [LendingClub](https://www.lendingclub.com/). A parsed and cleaned form of the dataset is available [here](https://github.com/learnml/machine-learning-specialization-private). Make sure you **download the dataset** before running the following command.

In [2]:
loans = pd.read_csv('./lending-club-data.csv')



## Exploring some features

Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.

In [3]:
loans.columns.values

array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans',
       'bad_loans', 'emp_length_num', 'grade_num', '

Here, we see that we have some feature columns that have to do with the grade of the loan, annual income, home ownership status, etc. Let's take a look at the first few rows of the dataset.

In [4]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1


## Exploring the target column

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column **1** means a risky (bad) loan and **0** means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We put this in a new column called `safe_loans`.

In [5]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis=1)

Now, let us explore the distribution of the column `safe_loans`. This gives us a sense of how many safe and risky loans are present in the dataset.

In [32]:
count_all  = len(loans['safe_loans'])
count_safe = len(loans[loans['safe_loans'] == +1])
count_risky = len(loans[loans['safe_loans'] == -1])
ratio_safe = round(float(count_safe)/count_all, 2)
ratio_risky = round(float(count_risky)/count_all, 2)
print("ratio_safe = %s" % ratio_safe)
print("ratio_risky = %s" % ratio_risky)

ratio_safe = 0.81
ratio_risky = 0.19


You should have:
* Around 81% safe loans
* Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
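To see why the imbalance makes this challenging, here is a small illustrative sketch. It assumes the roughly 81/19 split reported above and uses synthetic labels (the names `always_safe`, `baseline_accuracy`, and `risky_caught` are ours, not part of the assignment). A classifier that labels every loan as safe already looks accurate, yet it never identifies a single risky loan.

```python
# Illustrative sketch: accuracy of a classifier that always predicts "safe" (+1).
# The 81/19 ratio comes from the cell above; the labels themselves are synthetic.
labels = [+1] * 81 + [-1] * 19

always_safe = [+1] * len(labels)  # naive majority-class predictions

correct = sum(1 for y, y_hat in zip(labels, always_safe) if y == y_hat)
baseline_accuracy = correct / float(len(labels))

# Count the risky loans that the naive classifier actually flags as risky.
risky_caught = sum(1 for y, y_hat in zip(labels, always_safe)
                   if y == -1 and y_hat == -1)

print("baseline accuracy = %s" % baseline_accuracy)
print("risky loans identified = %s" % risky_caught)
```

Any model we train on the raw data has to beat this baseline to be useful, which is one motivation for balancing the classes.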

## Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [7]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a **subset of features** and the **target** that we will use for the rest of this notebook. 

## Sample data to balance classes

As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [8]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150


Now, write some code below to compute the percentage of safe and risky loans in the dataset, and validate these numbers against the ratios computed earlier in the assignment:

In [9]:
print("Percentage of safe loans  : %s" % round(float(len(safe_loans_raw))/len(loans[target]), 2))
print("Percentage of risky loans : %s" % round(float(len(risky_loans_raw))/len(loans[target]), 2))

Percentage of safe loans  : 0.81
Percentage of risky loans : 0.19
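The cells above only split the loans by class and report the percentages; the balancing step itself is not shown. One common way to balance classes is to undersample the larger class. The following is a minimal sketch on toy data, not necessarily the procedure used to produce the index files loaded later in this notebook. The `toy_*` names and `random_state=1` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the loans table: 8 safe (+1) rows and 2 risky (-1) rows.
toy = pd.DataFrame({
    'dti': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'safe_loans': [1, 1, 1, 1, 1, 1, 1, 1, -1, -1],
})

toy_safe = toy[toy['safe_loans'] == +1]
toy_risky = toy[toy['safe_loans'] == -1]

# Keep every risky row and a random sample of safe rows of the same size.
# random_state=1 is an arbitrary choice that makes the sample repeatable.
toy_safe_sampled = toy_safe.sample(n=len(toy_risky), random_state=1)
toy_balanced = pd.concat([toy_risky, toy_safe_sampled])

print(toy_balanced['safe_loans'].value_counts())
```

Undersampling throws away data from the majority class, so it trades some information for a class ratio close to 50/50.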


# One-hot encoding

scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding.

In [10]:
categorical_variables = [var for var in loans.columns if loans[var].dtypes == object]
categorical_variables

['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']

In [11]:
loans_decoded = pd.get_dummies(loans, columns = categorical_variables, prefix_sep = '.')
loans_decoded

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade.A,grade.B,...,purpose.house,purpose.major_purchase,purpose.medical,purpose.moving,purpose.other,purpose.small_business,purpose.vacation,purpose.wedding,term. 36 months,term. 60 months
0,0,11,27.65,1,1,83.70,0.00,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,1,1,1.00,1,1,9.40,0.00,-1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,11,8.72,1,1,98.50,0.00,1,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,11,20.00,0,1,21.00,16.97,1,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0,4,11.20,1,1,28.30,0.00,1,1,0,...,0,0,0,0,0,0,0,1,1,0
5,0,10,5.35,1,1,87.50,0.00,1,0,0,...,0,0,0,0,0,0,0,0,1,0
6,0,5,5.55,1,1,32.60,0.00,-1,0,0,...,0,0,0,0,0,1,0,0,0,1
7,1,1,18.08,1,1,36.50,0.00,-1,0,1,...,0,0,0,0,1,0,0,0,0,1
8,0,6,16.12,1,1,20.60,0.00,1,0,0,...,0,0,0,0,0,0,0,0,0,1
9,0,11,10.78,1,1,67.10,0.00,1,0,1,...,0,0,0,0,0,0,0,0,1,0
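To make the encoding easier to see, here is a tiny sketch on a made-up three-row frame (the `toy` names and values are hypothetical). Passing `dtype=int` is an assumption on our part: it asks for 0/1 integers like the output above, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are made up).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'dti': [10.0, 20.0, 30.0],
})

# Each distinct category becomes its own 0/1 column named <column>.<value>.
# dtype=int requests 0/1 integers; newer pandas versions default to booleans.
toy_encoded = pd.get_dummies(toy, columns=['grade'], prefix_sep='.', dtype=int)

print(toy_encoded)
```

The numeric column passes through unchanged, while `grade` is replaced by `grade.A` and `grade.B`.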


## Split data into training and validation sets

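We split the data into training and validation sets using an 80/20 split. So that everyone gets the same results, the code below reads the row positions for each set from two index files provided with the assignment instead of drawing a random split with a fixed seed.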

**Note**: In previous assignments, we have called this a **train-test split**. However, the portion of data that we don't train on will be used to help **select model parameters** (this is known as model selection). Thus, this portion of data should be called a **validation set**. Recall that examining the performance of various potential models (i.e. models with different parameters) should be done on the validation set, while evaluation of the final selected model should always be done on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.
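For reference, the three-way split described in the note could look like the sketch below. This is a hypothetical illustration only: the 80/10/10 proportions, the seed, and the `*_positions` names are assumptions, and this notebook itself uses only a training and a validation set.

```python
import random

# Hypothetical sketch of a train / validation / test split over row positions.
# The 80/10/10 proportions and the seed are assumptions for illustration only.
n_rows = 100
positions = list(range(n_rows))

rng = random.Random(1)      # fixed seed so the shuffle is repeatable
rng.shuffle(positions)

n_train = n_rows * 8 // 10
n_validation = n_rows // 10

train_positions = positions[:n_train]                             # fit models
validation_positions = positions[n_train:n_train + n_validation]  # choose parameters
test_positions = positions[n_train + n_validation:]               # final evaluation only

print("%s %s %s" % (len(train_positions), len(validation_positions), len(test_positions)))
```

The key property is that the three sets are disjoint, so the test rows play no part in either fitting or model selection.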

In [12]:
# Read the comma-separated row positions for each split from the provided files
# (each file is assumed to hold a single line of integers)
with open("module-5-assignment-1-train-idx.txt", "r") as train_indices:
    train_list = [int(x) for x in train_indices.read().split(',')]
with open("module-5-assignment-1-validation-idx.txt", "r") as validation_indices:
    validation_list = [int(x) for x in validation_indices.read().split(',')]
# Select rows by integer position
train_data = loans_decoded.iloc[train_list]
validation_data = loans_decoded.iloc[validation_list]

In [13]:
safe_loans_ratio = round(float(sum(validation_data['safe_loans'] == 1))/len(validation_data),2)
safe_loans_ratio

0.5

# Use decision tree to build a classifier

In [14]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [15]:
decision_tree_model = tree.DecisionTreeClassifier(max_depth=6)
small_model = tree.DecisionTreeClassifier(max_depth=2)

In [16]:
decision_tree_model.fit(train_data.drop('safe_loans', axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [17]:
small_model.fit(train_data.drop('safe_loans', axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

# Making predictions

Let's consider two positive and two negative examples **from the validation set** and see what the model predicts. We will do the following:
* Predict whether or not a loan is safe.
* Predict the probability that a loan is safe.

In [18]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade.A,grade.B,...,purpose.house,purpose.major_purchase,purpose.medical,purpose.moving,purpose.other,purpose.small_business,purpose.vacation,purpose.wedding,term. 36 months,term. 60 months
19,0,11,11.18,1,1,82.4,0.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0.0,-1,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0.0,-1,1,0,...,0,0,0,0,0,0,0,0,1,0


## Explore label predictions

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the **sample_validation_data**, use the **decision_tree_model** to predict whether or not the loan is classified as a **safe loan**.

In [19]:
decision_tree_model.predict(sample_validation_data.drop('safe_loans', axis=1))

array([ 1, -1, -1,  1])

**Quiz Question:** What percentage of the predictions on `sample_validation_data` did `decision_tree_model` get correct?

## Explore probability predictions

For each row in the **sample_validation_data**, what is the probability (according to **decision_tree_model**) of a loan being classified as **safe**? (Hint: if you are using scikit-learn, you can use the `.predict_proba()` method.)

In [20]:
decision_tree_model.predict_proba(sample_validation_data.drop('safe_loans', axis=1))

array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])

**Quiz Question:** Which loan has the highest probability of being classified as a **safe loan**?

**Checkpoint:** Can you verify that for all the predictions with `probability >= 0.5`, the model predicted the label **+1**?
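One way to run this check by hand is sketched below, using the numbers copied from the two outputs above. It relies on scikit-learn ordering the probability columns by sorted class label, so column 0 is class -1 and column 1 is class +1. The name `thresholded` is ours.

```python
# Probabilities copied from the predict_proba output above.
# Column 0 is P(class -1), column 1 is P(class +1), following the sorted class labels.
probabilities = [
    [0.34156543, 0.65843457],
    [0.53630646, 0.46369354],
    [0.64750958, 0.35249042],
    [0.20789474, 0.79210526],
]
predicted_labels = [1, -1, -1, 1]  # copied from the predict output above

# Apply the 0.5 threshold to P(safe) and compare with the predicted labels.
thresholded = [+1 if p_safe >= 0.5 else -1 for _, p_safe in probabilities]
print(thresholded == predicted_labels)
```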

### Tricky predictions!

Now, we will explore something pretty interesting. For each row in the **sample_validation_data**, what is the probability (according to **small_model**) of a loan being classified as **safe**?

In [21]:
small_model.predict_proba(sample_validation_data.drop('safe_loans', axis=1))

array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

**Quiz Question:** Notice that the probability predictions are the **exact same** for the 2nd and 3rd loans. Why would this happen?
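One way to investigate this is to look at which leaf each row lands in. The sketch below uses synthetic data and a hypothetical `toy_small_model`, not the notebook's own data, together with scikit-learn's `apply()` method, which returns the leaf index for each row.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration: 200 points, 3 numeric features.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = np.where(X[:, 0] + 0.3 * rng.rand(200) > 0.6, 1, -1)

toy_small_model = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_small_model.fit(X, y)

# apply() returns the index of the leaf each row ends up in.
leaf_ids = toy_small_model.apply(X)
probabilities = toy_small_model.predict_proba(X)

# Rows that share a leaf share the same probability vector, so a depth-2 tree
# can produce at most 4 distinct probability vectors.
distinct_rows = set(tuple(row) for row in probabilities)
print("leaves used: %s" % len(set(leaf_ids)))
print("distinct probability vectors: %s" % len(distinct_rows))
```

Running `small_model.apply(...)` on the four sample rows would show whether the 2nd and 3rd loans fall into the same leaf.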

## Visualize the prediction on a tree


Note that you should be able to look at the small tree, traverse it yourself, and visualize the prediction being made. Consider the following point in the **sample_validation_data**:

In [22]:
sample_validation_data.iloc[1]

short_emp                      0.00
emp_length_num                10.00
dti                           16.85
last_delinq_none               1.00
last_major_derog_none          1.00
revol_util                    96.40
total_rec_late_fee             0.00
safe_loans                     1.00
grade.A                        0.00
grade.B                        0.00
grade.C                        0.00
grade.D                        1.00
grade.E                        0.00
grade.F                        0.00
grade.G                        0.00
sub_grade.A1                   0.00
sub_grade.A2                   0.00
sub_grade.A3                   0.00
sub_grade.A4                   0.00
sub_grade.A5                   0.00
sub_grade.B1                   0.00
sub_grade.B2                   0.00
sub_grade.B3                   0.00
sub_grade.B4                   0.00
sub_grade.B5                   0.00
sub_grade.C1                   0.00
sub_grade.C2                   0.00
sub_grade.C3                

Let's visualize the small tree here to do the traversing for this data point.
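If Graphviz is not available, a plain-text rendering is one possible alternative. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later, which is newer than the version that produced the outputs in this notebook). It uses synthetic data and a hypothetical `toy_tree`; the feature names are borrowed from this notebook purely for readability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are borrowed from this notebook
# purely for readability and the values are made up.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.where(X[:, 0] > 0.5, 1, -1)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X, y)

# export_text renders the learned splits as indented plain text (no Graphviz needed).
tree_text = export_text(toy_tree, feature_names=['dti', 'revol_util'])
print(tree_text)
```

For the notebook's own model, the same call would take `small_model` and `list(train_data.drop('safe_loans', axis=1).columns)` as the feature names.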

**Note:** In the tree visualization above, the values at the leaf nodes are not class predictions but scores (a slightly advanced concept that is out of the scope of this course). You can read more about this [here](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf).  If the score is $\geq$ 0, the class +1 is predicted.  Otherwise, if the score < 0, we predict class -1.


**Quiz Question:** Based on the visualized tree, what prediction would you make for this data point (according to **small_model**)? (If you don't have Graphviz, you can answer this quiz question by executing the next part.)

Now, verify your prediction by examining the prediction made using small_model.


In [23]:
small_model.predict(sample_validation_data.drop('safe_loans', axis=1))

array([ 1, -1, -1,  1])

# Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$
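As a quick illustration of the formula, here is a minimal sketch in plain Python. The labels are made up, and the `accuracy` helper is ours; the cells in this notebook use the model's built-in `score` method instead.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    num_correct = sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y == y_hat)
    return num_correct / float(len(true_labels))

# Made-up labels for illustration: 3 of the 5 predictions match.
toy_true = [+1, +1, -1, -1, +1]
toy_predicted = [+1, -1, -1, +1, +1]
print(accuracy(toy_true, toy_predicted))
```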

Let us start by evaluating the accuracy of the `small_model` and `decision_tree_model` on the training data:

In [24]:
print("accuracy of decision_tree_model = %s" % decision_tree_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None))
print("accuracy of small_model = %s" % small_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None))

accuracy of decision_tree_model = 0.640527616591
accuracy of small_model = 0.613502041694


**Checkpoint:** You should see that the **small_model** performs worse than the **decision_tree_model** on the training data.


Now, let us evaluate the accuracy of the **small_model** and **decision_tree_model** on the entire **validation_data**, not just the subsample considered above.

In [25]:
print("accuracy of decision_tree_model = %s" % decision_tree_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None))
print("accuracy of small_model = %s" % small_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None))

accuracy of decision_tree_model = 0.636148211978
accuracy of small_model = 0.619345109866


**Quiz Question:** What is the accuracy of `decision_tree_model` on the validation set, rounded to the nearest .01?

## Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with `max_depth=10`. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

In [26]:
big_model = tree.DecisionTreeClassifier(max_depth=10)
big_model.fit(train_data.drop('safe_loans', axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Now, let us evaluate **big_model** on the training set and validation set.

In [27]:
print("accuracy of big_model on train = %s" % big_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None))
print("accuracy of big_model on validation = %s" % big_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target], sample_weight=None))

accuracy of big_model on train = 0.66379217709
accuracy of big_model on validation = 0.626992675571


**Checkpoint:** We should see that **big_model** has even better performance on the training set than **decision_tree_model** did on the training set.

**Quiz Question:** How does the performance of **big_model** on the validation set compare to **decision_tree_model** on the validation set? Is this a sign of overfitting?
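To help with this comparison, the sketch below lines up the accuracies reported in the outputs above and computes the gap between training and validation accuracy for each depth. The `reported` dictionary is just a transcription of those printed numbers, not a new experiment.

```python
# Accuracies copied from the outputs above, keyed by max_depth.
reported = {
    2:  {'train': 0.613502041694, 'validation': 0.619345109866},
    6:  {'train': 0.640527616591, 'validation': 0.636148211978},
    10: {'train': 0.66379217709,  'validation': 0.626992675571},
}

for depth in sorted(reported):
    train_acc = reported[depth]['train']
    validation_acc = reported[depth]['validation']
    gap = train_acc - validation_acc
    print("max_depth=%2d  train=%.4f  validation=%.4f  gap=%+.4f"
          % (depth, train_acc, validation_acc, gap))
```

A gap that widens as the tree gets deeper, while validation accuracy stops improving, is the pattern to look for.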

### Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try and quantify the cost of each mistake made by the model.

Assume the following:

* **False negatives**: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
* **False positives**: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
* **Correct predictions**: All correct predictions don't typically incur any cost.


Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:
1. First, let us compute the predictions made by the model.
2. Second, compute the number of false positives.
3. Third, compute the number of false negatives.
4. Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

First, let us make predictions on `validation_data` using the `decision_tree_model`:

In [28]:
validation_prediction = decision_tree_model.predict(validation_data.drop('safe_loans', axis=1))

**False positives** are predictions where the model predicts +1 but the true label is -1. Complete the following code block for the number of false positives:

In [29]:
count_false_positive = sum(validation_prediction > validation_data[target])

**False negatives** are predictions where the model predicts -1 but the true label is +1. Complete the following code block for the number of false negatives:

In [30]:
count_false_negative = sum(validation_prediction < validation_data[target])

**Quiz Question:** Let us assume that each mistake costs money:
* Assume a cost of \$10,000 per false negative.
* Assume a cost of \$20,000 per false positive.

What is the total cost of mistakes made by `decision_tree_model` on `validation_data`?

In [31]:
total_cost = 10000*count_false_negative + 20000*count_false_positive
print("total_cost = %s" % total_cost)

total_cost = 50390000
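The four steps above can also be collected into one small helper. The following is a plain-Python sketch on made-up labels; the function name `cost_of_mistakes` and its default costs (taken from the quiz question) are our own choices for illustration.

```python
def cost_of_mistakes(true_labels, predicted_labels,
                     false_negative_cost=10000, false_positive_cost=20000):
    """Return (false positives, false negatives, total cost) for +1/-1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    # False positive: predicted safe (+1) but actually risky (-1).
    false_positives = sum(1 for y, y_hat in pairs if y == -1 and y_hat == +1)
    # False negative: predicted risky (-1) but actually safe (+1).
    false_negatives = sum(1 for y, y_hat in pairs if y == +1 and y_hat == -1)
    total = (false_negative_cost * false_negatives
             + false_positive_cost * false_positives)
    return false_positives, false_negatives, total

# Made-up labels for illustration: one false negative and two false positives.
toy_true = [+1, +1, -1, -1, -1]
toy_predicted = [+1, -1, +1, +1, -1]
print(cost_of_mistakes(toy_true, toy_predicted))
```

Writing the two conditions out explicitly makes it harder to mix up the two kinds of mistakes than comparing labels with `>` and `<`.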
