# Identifying safe loans with decision trees
The [LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_%28finance%29).

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. In this assignment you will:

* Use Pandas to do some feature engineering.
* Train a decision-tree on the LendingClub dataset.
* Visualize the tree.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.

Let's get started!

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
from sklearn.model_selection import train_test_split

## Data Preprocessing

### Load LendingClub dataset
**1**. We will be using a dataset from the [LendingClub](https://www.lendingclub.com/). A parsed and cleaned form of the dataset is available [here](https://github.com/learnml/machine-learning-specialization-private). Make sure you **download the dataset** before running the following command.

In [3]:
DATA_DIR = os.path.join('data')

print(os.listdir(DATA_DIR))

['.DS_Store', 'lending-club-data.sframe', 'module-5-assignment-1-validation-idx.json', 'module-5-assignment-1-train-idx.json', 'c3w3_Quiz.jpeg', 'lending-club-data.csv']


In [4]:
loans = pd.read_csv(os.path.join(DATA_DIR, 'lending-club-data.csv'))



### Exploring some features

**2**. Let's quickly explore what the dataset looks like. First, let's check the number of rows and the first few records, then print out the column names to see what features we have in this dataset.

In [14]:
len(loans)

122607

In [5]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1


In [6]:
loans.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans',
       'emp_length_num', 'grade_num', 'sub_gra

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.

In [7]:
loans['grade'].value_counts()

B    37172
C    29950
A    22314
D    19175
E     8990
F     3932
G     1074
Name: grade, dtype: int64

We can see that over half of the loan grades are assigned values `B` or `C`. Each loan is assigned one of these grades, along with a more finely discretized feature called `sub_grade` (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found [here](https://www.lendingclub.com/public/rates-and-fees.action).
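
The "over half" claim can be checked numerically. The sketch below rebuilds the counts printed above by hand in a Series called `grade_counts` (a name introduced here for illustration only); on the real data, `loans['grade'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
grade_counts = pd.Series(
    {"B": 37172, "C": 29950, "A": 22314, "D": 19175, "E": 8990, "F": 3932, "G": 1074}
)

# Convert counts to fractions of all loans.
grade_share = grade_counts / grade_counts.sum()

# Combined share of grades B and C.
b_or_c_share = grade_share[["B", "C"]].sum()
print(round(b_or_c_share, 3))
```

The combined share comes out at roughly 55%, which is consistent with the statement above.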

In [8]:
loans['sub_grade'].unique()

array(['B2', 'C4', 'C5', 'C1', 'A4', 'E1', 'F2', 'B5', 'C3', 'B1', 'D1',
       'A1', 'B3', 'B4', 'C2', 'D2', 'A3', 'A5', 'D5', 'A2', 'E4', 'D3',
       'D4', 'F3', 'E3', 'F1', 'E5', 'G4', 'E2', 'G2', 'F5', 'F4', 'G5',
       'G1', 'G3'], dtype=object)

Now, let's look at a different feature.

In [9]:
loans['home_ownership'].value_counts()

MORTGAGE    59240
RENT        53245
OWN          9943
OTHER         179
Name: home_ownership, dtype: int64

This feature describes whether the borrower has a mortgage, rents, or owns a home. We can see that only a small percentage of the borrowers (about 8%) own a home.

### Exploring the target column

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe loan,
* **-1** as a risky (bad) loan.

**3**. We put this in a new column called `safe_loans`.

In [10]:
loans['bad_loans'].value_counts()

0    99457
1    23150
Name: bad_loans, dtype: int64

In [11]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)

loans.drop('bad_loans', inplace=True, axis=1)

In [12]:
loans['safe_loans'].value_counts()

 1    99457
-1    23150
Name: safe_loans, dtype: int64

We should have:
* Around 81% safe loans
* Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
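
The relabeling and the class-proportion check can be sketched on a toy column. This is only an illustration: the `bad_loans` values below are made up, and `map` with a lookup dict is shown as one possible alternative to the `apply` call used above.

```python
import pandas as pd

# Hypothetical toy labels: 1 = risky (bad) loan, 0 = safe loan.
bad_loans = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

# Map 0 -> +1 (safe) and 1 -> -1 (risky) with a lookup dict.
safe_loans = bad_loans.map({0: 1, 1: -1})

# Fraction of each class, analogous to the 81% / 19% check above.
class_share = safe_loans.value_counts(normalize=True)
print(class_share)
```

On the real counts, 99457 / 122607 is about 0.81 and 23150 / 122607 is about 0.19, matching the expected proportions.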

### Features for the classification algorithm
**5**. In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [15]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a **subset of features** and the **target** that we will use for the rest of this notebook. 

### Sample data to balance classes

**6**. As we explored above, our data is disproportionally full of safe loans, so we should balance the two classes in our dataset.

**To make sure our final results match the reference results, we use the specific training and validation datasets provided for this assignment.**

Follow these steps:

* Load the JSON files into the lists `train_idx` and `validation_idx`.
* Apply one-hot encoding to `loans`. Your tool may have a function for one-hot encoding. Alternatively, see #7 for implementation hints.
* Perform the train/validation split using `train_idx` and `validation_idx`, for instance with `loans.iloc[...]` in pandas.

Loading the provided index lists is how we handle the class imbalance in this assignment:

In [16]:
import json

with open(os.path.join(DATA_DIR, 'module-5-assignment-1-train-idx.json')) as json_file:
    train_idx = json.load(json_file)
    
with open(os.path.join(DATA_DIR, 'module-5-assignment-1-validation-idx.json')) as json_file:
    validation_idx = json.load(json_file)
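
The first and last of the steps listed above can be sketched on a toy frame. This is an illustration under stated assumptions: the JSON text is inlined rather than read from the assignment files, and `toy_loans` and its values are made up. The overlap check is an extra sanity step, not part of the assignment.

```python
import json
import pandas as pd

# Hypothetical toy table standing in for the loans data.
toy_loans = pd.DataFrame({
    "dti": [11.2, 27.65, 8.72, 20.0, 1.0],
    "safe_loans": [1, 1, -1, 1, -1],
})

# The index files are plain JSON lists of row positions; inlined here.
train_idx = json.loads("[0, 2, 3]")
validation_idx = json.loads("[1, 4]")

# A row should not appear in both splits.
overlap = set(train_idx) & set(validation_idx)

# Select rows by integer position.
train_data = toy_loans.iloc[train_idx]
validation_data = toy_loans.iloc[validation_idx]
print(len(train_data), len(validation_data), len(overlap))
```

Because `iloc` selects by position rather than by label, this works even if the frame's index has been reordered or filtered.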

**Outside of this assignment, we could balance the classes as follows** (*skip this section for this assignment*):

---
As we explored above, our data is disproportionally full of safe loans.  Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [17]:
safe_loans_raw = loans[loans['safe_loans'] == +1]
risky_loans_raw = loans[loans['safe_loans'] == -1]

print(f"Number of safe loans: {len(safe_loans_raw)}")
print(f"Number of risky loans: {len(risky_loans_raw)}")

Number of safe loans: 99457
Number of risky loans: 23150


In [18]:
print(f"Percentage of safe loans  : {round(len(safe_loans_raw)/(len(safe_loans_raw)+len(risky_loans_raw)), 2)}") 
print(f"Percentage of risky loans : {round(len(risky_loans_raw)/(len(safe_loans_raw)+len(risky_loans_raw)), 2)}")

Percentage of safe loans  : 0.81
Percentage of risky loans : 0.19


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use `random_state=1` so everyone gets the same results.

In [19]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage, random_state=1)

# Combine the risky_loans with the downsampled version of safe_loans
loans_data = pd.concat([risky_loans, safe_loans])

Now, let's verify that safe and risky loans each make up nearly 50% of the resulting dataset.

In [20]:
print(f"Percentage of safe loans                 : {len(safe_loans) / float(len(loans_data))}")
print(f"Percentage of risky loans                : {len(risky_loans) / float(len(loans_data))}")
print(f"Total number of loans in our new dataset : {len(loans_data)}")

Percentage of safe loans                 : 0.5
Percentage of risky loans                : 0.5
Total number of loans in our new dataset : 46300
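
An alternative way to undersample is to draw the same number of rows from every class with a grouped sample. The sketch below uses a made-up toy frame; `toy`, `n_minority`, and `balanced` are illustrative names, and this is not necessarily how the reference solution was produced.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 8 safe (+1) loans, 2 risky (-1) loans.
toy = pd.DataFrame({
    "dti": range(10),
    "safe_loans": [1, 1, 1, 1, -1, 1, 1, -1, 1, 1],
})

# Size of the smallest class.
n_minority = int(toy["safe_loans"].value_counts().min())

# Draw n_minority rows from every class; the minority class is kept whole.
balanced = toy.groupby("safe_loans").sample(n=n_minority, random_state=1)
print(balanced["safe_loans"].value_counts())
```

Sampling a fixed count per class guarantees an exact 50/50 split, whereas sampling by fraction depends on how the fraction rounds.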


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907). For this assignment, we use the simplest possible approach, where we subsample the over-represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
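
As one illustration of adjusting the learner instead of the data, scikit-learn's tree accepts a `class_weight` argument. The sketch below is not part of the assignment and uses made-up toy arrays; it only shows what the "balanced" weighting computes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 8 safe (+1) and 2 risky (-1) loans.
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, -1, -1])
X = np.arange(10, dtype=float).reshape(-1, 1)

# "balanced" weights are n_samples / (n_classes * class_count).
classes = np.array([-1, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))

# The same weighting can be requested directly from the tree.
weighted_tree = DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=0)
weighted_tree.fit(X, y)
```

Here the rare class gets weight 10 / (2 * 2) = 2.5 and the common class gets 10 / (2 * 8) = 0.625, so no rows need to be thrown away.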

---

### One-hot encoding
**7**. scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding. **The next assignment has more details about this**.

If you are using SFrame, refer to the SFrame API documentation for the equivalent operation. This notebook uses pandas, so the encoding is written out by hand below. If you are using different machine learning software, make sure you prepare the data in the form that software expects.
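
For reference, pandas also offers a built-in one-hot encoder, `pd.get_dummies`. The sketch below applies it to a made-up toy frame (`toy` and its values are illustrative); it is an alternative to a hand-written loop, not the method this notebook uses.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "dti": [27.65, 1.0, 8.72],
    "grade": ["B", "C", "C"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Replace each listed column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["grade", "home_ownership"], dtype=int)
print(encoded.columns.tolist())
```

The new columns are named `<column>_<category>`, and the original categorical columns are dropped automatically.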

First, we should select the categorical data.

In [21]:
loans.head()

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.7,0.0,1
1,C,C4,1,1,RENT,1.0,car,60 months,1,1,9.4,0.0,-1
2,C,C5,0,11,RENT,8.72,small_business,36 months,1,1,98.5,0.0,1
3,C,C1,0,11,RENT,20.0,other,36 months,0,1,21.0,16.97,1
4,A,A4,0,4,RENT,11.2,wedding,36 months,1,1,28.3,0.0,1


In [22]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 13 columns):
grade                    122607 non-null object
sub_grade                122607 non-null object
short_emp                122607 non-null int64
emp_length_num           122607 non-null int64
home_ownership           122607 non-null object
dti                      122607 non-null float64
purpose                  122607 non-null object
term                     122607 non-null object
last_delinq_none         122607 non-null int64
last_major_derog_none    122607 non-null int64
revol_util               122607 non-null float64
total_rec_late_fee       122607 non-null float64
safe_loans               122607 non-null int64
dtypes: float64(3), int64(5), object(5)
memory usage: 12.2+ MB


Extract the categorical columns.

In [23]:
categorical_variables = loans.select_dtypes(include=['object']).columns.values

print(categorical_variables)

['grade' 'sub_grade' 'home_ownership' 'purpose' 'term']


Check whether the values of any categorical variable need preprocessing.

In [24]:
for col_name in categorical_variables:
    print(col_name)
    print(loans[col_name].value_counts(), '\n')

grade
B    37172
C    29950
A    22314
D    19175
E     8990
F     3932
G     1074
Name: grade, dtype: int64 

sub_grade
B3    9036
B4    8279
B2    7096
C1    7068
B5    6924
C2    6726
A5    6027
A4    5993
B1    5837
C3    5690
C4    5402
C5    5064
D1    4593
D2    4391
A3    3955
D3    3745
D4    3489
A2    3352
A1    2987
D5    2957
E2    2184
E1    2080
E3    1785
E4    1581
E5    1360
F1    1105
F2     930
F3     770
F4     629
F5     498
G1     370
G2     241
G3     167
G4     152
G5     144
Name: sub_grade, dtype: int64 

home_ownership
MORTGAGE    59240
RENT        53245
OWN          9943
OTHER         179
Name: home_ownership, dtype: int64 

purpose
debt_consolidation    68233
credit_card           22050
other                  9087
home_improvement       7543
major_purchase         3877
small_business         3264
car                    2375
medical                1607
wedding                1526
moving                 1180
house                  1005
vacation              

In [25]:
def convert_to_one_hot(df, col_name):
    # Add one 0/1 indicator column per class of this categorical variable
    for cla in sorted(df[col_name].unique()):
        df[f"{col_name}_{cla}"] = df[col_name].apply(lambda x: 1 if x == cla else 0)

    # Return the modified DataFrame
    return df

In [26]:
for col in categorical_variables:
    loans = convert_to_one_hot(loans, col)

After one-hot encoding, remove the original column.

In [27]:
loans.drop(categorical_variables, axis=1, inplace=True)

In [28]:
loans.columns

Index(['short_emp', 'emp_length_num', 'dti', 'last_delinq_none',
       'last_major_derog_none', 'revol_util', 'total_rec_late_fee',
       'safe_loans', 'grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
       'grade_F', 'grade_G', 'sub_grade_A1', 'sub_grade_A2', 'sub_grade_A3',
       'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2',
       'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1',
       'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5',
       'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4',
       'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3',
       'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2',
       'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1',
       'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5',
       'home_ownership_MORTGAGE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'purpose_car', 'purpose_credit_card'

In [29]:
loans.head()

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,0,11,27.65,1,1,83.7,0.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,1,1,1.0,1,1,9.4,0.0,-1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,11,8.72,1,1,98.5,0.0,1,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,11,20.0,0,1,21.0,16.97,1,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0,4,11.2,1,1,28.3,0.0,1,1,0,...,0,0,0,0,0,0,0,1,1,0
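
The `term_ 36 months` and `term_ 60 months` column names above carry a leading space inherited from the raw `term` values. If cleaner names are wanted, one option is to strip whitespace before encoding. The sketch below uses a made-up toy Series; note that doing this on the real data would change the column names relative to the output shown above.

```python
import pandas as pd

# Hypothetical toy column mimicking the leading space in the raw values.
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# Remove leading and trailing whitespace from every string.
term_clean = term.str.strip()

# Column names produced by one-hot encoding no longer contain the space.
encoded = pd.get_dummies(term_clean, prefix="term", dtype=int)
print(encoded.columns.tolist())
```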


### Split data into training and validation sets

**8**. We split the data into training and validation sets using an 80/20 split and specifying `random_state=1` so everyone gets the same results.

**To make sure our final results match the reference results, we use the provided training and validation indices here.**

In [30]:
train_data = loans.iloc[train_idx]
validation_data = loans.iloc[validation_idx]

In [31]:
print(round((len(train_data)/(len(train_data)+len(validation_data))), 2))
print(round((len(validation_data)/(len(train_data)+len(validation_data))), 2))

0.8
0.2


**Outside of this assignment, we could split the data as follows** (*skip this section for this assignment*):

---

**Note**: In previous assignments, we have called this a **train-test split**. However, the portion of data that we don't train on will be used to help **select model parameters** (this is known as model selection). Thus, this portion of data should be called a **validation set**. Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.
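
The three-way arrangement described in the note (train, validation, and a held-out test set) can be sketched by calling `train_test_split` twice. The arrays and the 60/20/20 proportions below are assumptions chosen for illustration; the assignment itself only uses a train/validation split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.array([1, -1] * 50)

# First hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Then split the remainder 75/25, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
print(len(X_train), len(X_val), len(X_test))
```

Model parameters would be compared on the validation portion, and the test portion would be touched only once, for the final selected model.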

In [32]:
train_dataset, validation_dataset = train_test_split(loans_data, test_size=0.2, random_state=1)

In [33]:
print(round((len(train_dataset)/len(loans_data)), 2))
print(round((len(validation_dataset)/len(loans_data)), 2))

0.8
0.2


---

## Build a decision tree classifier

**9**. Now, let's use the built-in scikit-learn decision tree learner ([sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)) to create a loan prediction model on the training data. To do this, we will need to import **sklearn**, **sklearn.tree**, and **numpy**.

Note: We first have to convert the training DataFrame into a NumPy data matrix and extract the target labels as a NumPy array (hint: the `.to_numpy()` method or the `.values` attribute of a pandas DataFrame does this). See the API documentation for more information. Make sure to set `max_depth=6`.

Call this model **decision_tree_model**.

In [34]:
from sklearn.tree import DecisionTreeClassifier

In [46]:
# Build the feature matrix and labels from the training split only
X_train = train_data.drop('safe_loans', axis=1).values
y_train = train_data['safe_loans'].values

print(X_train.shape)
print(y_train.shape)


In [None]:
# Create Decision Tree classifier object
decision_tree_model = DecisionTreeClassifier(max_depth=6)

# Train Decision Tree classifier
decision_tree_model = decision_tree_model.fit(X_train, y_train)
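
Once a tree is fitted, class predictions and prediction probabilities can be obtained with `predict` and `predict_proba`. The sketch below shows the whole round trip on a made-up toy frame (`toy_train`, `toy_tree`, and the values are illustrative, not the assignment data).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training frame with the target in 'safe_loans'.
toy_train = pd.DataFrame({
    "dti": [5.0, 8.0, 12.0, 25.0, 30.0, 35.0],
    "grade_A": [1, 1, 1, 0, 0, 0],
    "safe_loans": [1, 1, 1, -1, -1, -1],
})

# Feature matrix and label vector as NumPy arrays.
X_toy = toy_train.drop(columns="safe_loans").to_numpy()
y_toy = toy_train["safe_loans"].to_numpy()

toy_tree = DecisionTreeClassifier(max_depth=6, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Class labels and per-class probabilities; columns follow toy_tree.classes_.
predictions = toy_tree.predict(X_toy)
probabilities = toy_tree.predict_proba(X_toy)
print(toy_tree.classes_, predictions, probabilities[:, 1])
```

The columns of `predict_proba` are ordered like `classes_`, so with labels -1 and +1 the second column is the estimated probability of a safe loan.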