### Implementing binary decision trees

In [3]:
import numpy as np
import pandas as pd

In [5]:
loans = pd.read_csv("lending-club-data.csv")



### Exploring data

In [7]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1



The target column (label column) of the dataset is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

To make this more intuitive and consistent with the lectures, we reassign the target as:
* **+1** for a safe loan,
* **-1** for a risky (bad) loan.

We put this in a new column called `safe_loans`.

In [9]:
loans["safe_loans"] = loans["bad_loans"].apply(lambda x: +1 if x == 0 else -1)
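
The same relabeling can also be written without a per-row lambda. The following is a minimal sketch on a toy frame (the values are illustrative, not taken from the dataset), assuming `bad_loans` contains only 0 and 1:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `bad_loans` column (illustrative values only).
toy = pd.DataFrame({"bad_loans": [0, 1, 0, 0, 1]})

# Vectorized relabeling: 0 (safe) -> +1, 1 (risky) -> -1.
toy["safe_loans"] = np.where(toy["bad_loans"] == 0, 1, -1)

print(toy["safe_loans"].tolist())  # [1, -1, 1, 1, -1]
```

On a large frame this avoids calling a Python function once per row, but the result should be the same as the `apply` version above.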

In [10]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none,safe_loans
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1,-1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1,1


Now, let us explore the distribution of the column `safe_loans`. This gives us a sense of how many safe and risky loans are present in the dataset.

In [19]:
print("Fraction of safe loans in the input dataset is:")
print(round(sum(loans["safe_loans"] == +1) / len(loans), 2))

Fraction of safe loans in the input dataset is:
0.81


In [21]:
print("Fraction of risky loans in the input dataset is:")
print(round(sum(loans["safe_loans"] == -1) / len(loans), 2))

Fraction of risky loans in the input dataset is:
0.19
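
Both fractions can also be obtained in a single call. This is a small sketch using `value_counts(normalize=True)` on a toy label series (hypothetical values, not the real data):

```python
import pandas as pd

# Toy labels: four safe loans (+1) and one risky loan (-1).
toy_labels = pd.Series([1, 1, 1, -1, 1], name="safe_loans")

# normalize=True returns the share of each label instead of raw counts.
fractions = toy_labels.value_counts(normalize=True)

print(round(fractions.loc[1], 2))   # 0.8
print(round(fractions.loc[-1], 2))  # 0.2
```

Applied to `loans["safe_loans"]`, the same call would report the share of each class side by side.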


### Features for the classification algorithm

In this assignment, we use a subset of the features (categorical and numeric), each **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [22]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

In [23]:
loans[features + [target]]

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.70,0.00,1
1,C,C4,1,1,RENT,1.00,car,60 months,1,1,9.40,0.00,-1
2,C,C5,0,11,RENT,8.72,small_business,36 months,1,1,98.50,0.00,1
3,C,C1,0,11,RENT,20.00,other,36 months,0,1,21.00,16.97,1
4,A,A4,0,4,RENT,11.20,wedding,36 months,1,1,28.30,0.00,1
5,E,E1,0,10,RENT,5.35,car,36 months,1,1,87.50,0.00,1
6,F,F2,0,5,OWN,5.55,small_business,60 months,1,1,32.60,0.00,-1
7,B,B5,1,1,RENT,18.08,other,60 months,1,1,36.50,0.00,-1
8,C,C3,0,6,OWN,16.12,debt_consolidation,60 months,1,1,20.60,0.00,1
9,B,B5,0,11,OWN,10.78,debt_consolidation,36 months,1,1,67.10,0.00,1
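
To check which of the selected features are categorical and which are numeric, one option is to split the columns by dtype. The sketch below uses a toy frame with a few of the feature names and the first two rows shown above. It assumes the categorical features are stored as strings, which may not match how your copy of the CSV is parsed:

```python
import pandas as pd

# Toy frame with a few of the selected features (first two rows shown above).
toy = pd.DataFrame({
    "grade": ["B", "C"],
    "term": ["36 months", "60 months"],
    "dti": [27.65, 1.00],
    "short_emp": [0, 1],
})

# Columns with a numeric dtype vs. everything else (here: strings).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['dti', 'short_emp']
print(categorical_cols)  # ['grade', 'term']
```

Note that a 0/1 indicator such as `short_emp` is reported as numeric by this check even though it encodes a category.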


### Sample data to balance classes

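As we saw above, our data is disproportionately full of safe loans. To balance the classes, we create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`). The cells below then load row indices from two JSON files and use them to select the training and validation sets.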

In [26]:
import json

with open("module-5-assignment-1-train-idx.json", "r") as f:
    train_idx = json.load(f)
with open("module-5-assignment-1-validation-idx.json", "r") as f:
    valid_idx = json.load(f)

In [27]:
train_data = loans.iloc[train_idx]

In [28]:
valid_data = loans.iloc[valid_idx]
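
The cells above do not show how the classes are balanced or how the index files were produced. As a rough illustration only, the sketch below undersamples the majority class on a toy frame and then selects rows by position with `iloc`. The toy values, the `random_state`, and `hypothetical_idx` are all assumptions made for this example:

```python
import pandas as pd

# Toy data: four safe loans (+1) and two risky loans (-1).
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 5000, 3000],
    "safe_loans": [1, -1, 1, 1, 1, -1],
})

safe_loans_raw = toy[toy["safe_loans"] == 1]
risky_loans_raw = toy[toy["safe_loans"] == -1]

# Undersample the majority (safe) class down to the size of the risky class.
safe_sample = safe_loans_raw.sample(n=len(risky_loans_raw), random_state=1)
balanced = pd.concat([risky_loans_raw, safe_sample])

# Positional selection, in the same way `loans.iloc[train_idx]` uses a list of row positions.
hypothetical_idx = [0, 2]
subset = balanced.iloc[hypothetical_idx]

print(balanced["safe_loans"].value_counts().to_dict())
```

Because `iloc` selects by position rather than by index label, the lists loaded from the JSON files must refer to row positions in `loans` as it was read from the CSV.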

### Build a decision tree classifier