# Identifying safe loans with decision trees

[LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_%28finance%29).

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. In this assignment you will:

* Use SFrames to do some feature engineering.
* Train a decision tree on the LendingClub dataset.
* Visualize the tree.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.

Let's get started!

## Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create. If you can't find the decision tree module, upgrade GraphLab Create using

```
   pip install graphlab-create --upgrade
```

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load LendingClub dataset

We will be using a dataset from the [LendingClub](https://www.lendingclub.com/). A parsed and cleaned form of the dataset is available [here](https://github.com/learnml/machine-learning-specialization-private). Make sure you **download the dataset** before running the following command.

In [2]:
loans = pd.read_csv('loan.csv')



## Exploring some features

Let's quickly explore what the dataset looks like. First, let's print the data, then list the column names to see what features we have in this dataset.

In [3]:
#loans = loans[:40000]
print(loans)

              id  member_id  loan_amnt  funded_amnt  funded_amnt_inv  \
0        1077501    1296599     5000.0       5000.0      4975.000000   
1        1077430    1314167     2500.0       2500.0      2500.000000   
2        1077175    1313524     2400.0       2400.0      2400.000000   
3        1076863    1277178    10000.0      10000.0     10000.000000   
4        1075358    1311748     3000.0       3000.0      3000.000000   
5        1075269    1311441     5000.0       5000.0      5000.000000   
6        1069639    1304742     7000.0       7000.0      7000.000000   
7        1072053    1288686     3000.0       3000.0      3000.000000   
8        1071795    1306957     5600.0       5600.0      5600.000000   
9        1071570    1306721     5375.0       5375.0      5350.000000   
10       1070078    1305201     6500.0       6500.0      6500.000000   
11       1069908    1305008    12000.0      12000.0     12000.000000   
12       1064687    1298717     9000.0       9000.0      9000.00

In [4]:
# Select a single column by position (here the third column, loan_amnt)
print(loans.iloc[:,2])

0          5000.0
1          2500.0
2          2400.0
3         10000.0
4          3000.0
5          5000.0
6          7000.0
7          3000.0
8          5600.0
9          5375.0
10         6500.0
11        12000.0
12         9000.0
13         3000.0
14        10000.0
15         1000.0
16        10000.0
17         3600.0
18         6000.0
19         9200.0
20        20250.0
21        21000.0
22        10000.0
23        10000.0
24         6000.0
25        15000.0
26        15000.0
27         5000.0
28         4000.0
29         8500.0
           ...   
887349    20000.0
887350    10300.0
887351     4200.0
887352    15000.0
887353    15000.0
887354     6000.0
887355    26950.0
887356    23000.0
887357    18700.0
887358    25000.0
887359    25000.0
887360    26500.0
887361    21000.0
887362     8000.0
887363    12000.0
887364    10775.0
887365     7000.0
887366     6225.0
887367    10000.0
887368    13150.0
887369     4000.0
887370     7500.0
887371    10850.0
887372    12000.0
887373    

In [5]:
loans.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose',
       'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
       'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med',
       'mths_since_last_major_derog', 'policy_code', 'application_type',
       'annual_inc_joint', 'dti_joint', 'verification_status_joint',
    

In [6]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 74 columns):
id                             887379 non-null int64
member_id                      887379 non-null int64
loan_amnt                      887379 non-null float64
funded_amnt                    887379 non-null float64
funded_amnt_inv                887379 non-null float64
term                           887379 non-null object
int_rate                       887379 non-null float64
installment                    887379 non-null float64
grade                          887379 non-null object
sub_grade                      887379 non-null object
emp_title                      835917 non-null object
emp_length                     842554 non-null object
home_ownership                 887379 non-null object
annual_inc                     887375 non-null float64
verification_status            887379 non-null object
issue_d                        887379 non-null object
loan_status          

In [7]:
loans['total_cu_tl'].value_counts()

0.0     11478
1.0      3467
2.0      1992
3.0      1257
4.0       911
5.0       616
6.0       425
7.0       313
8.0       239
9.0       157
10.0      124
11.0       86
12.0       73
13.0       57
14.0       44
15.0       34
16.0       24
17.0       21
18.0       15
21.0        8
20.0        7
19.0        5
24.0        5
22.0        4
25.0        2
32.0        1
27.0        1
30.0        1
28.0        1
23.0        1
35.0        1
33.0        1
29.0        1
Name: total_cu_tl, dtype: int64

In [8]:
loans['loan_status'].value_counts()

Current                                                601779
Fully Paid                                             207723
Charged Off                                             45248
Late (31-120 days)                                      11591
Issued                                                   8460
In Grace Period                                          6253
Late (16-30 days)                                        2357
Does not meet the credit policy. Status:Fully Paid       1988
Default                                                  1219
Does not meet the credit policy. Status:Charged Off       761
Name: loan_status, dtype: int64

In [9]:
# Rename some columns and drop others

loans = loans.rename(columns={"loan_amnt": "loan_amount", "funded_amnt": "funded_amount", "funded_amnt_inv": "investor_funds",
                       "int_rate": "interest_rate", "annual_inc": "annual_income"})

# Drop irrelevant columns
loans.drop(['id', 'member_id', 'emp_title', 'url', 'desc', 'zip_code', 'title'], axis=1, inplace=True)

In [10]:
# Determining the loans that are bad from loan_status column

bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period", 
            "Late (16-30 days)", "Late (31-120 days)"]


In [11]:
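# Keep the first 20,000 rows; copy() avoids SettingWithCopyWarning on later column assignments
loans = loans[:20000].copy()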

In [12]:

loans['loan_condition'] = np.nan

def loan_condition(status):
    if status in bad_loan:
        return 'Bad Loan'
    else:
        return 'Good Loan'
    
    

In [13]:
loans['loan_condition'] = loans['loan_status'].apply(loan_condition)

In [14]:
loans['loan_condition'].value_counts()

Good Loan    16952
Bad Loan      3048
Name: loan_condition, dtype: int64
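
The same labels can also be produced without a row-wise `apply`. The following is a minimal sketch of one vectorized alternative, run on a small hypothetical frame (`demo`) rather than the real `loans` data; the status list is the one defined above.

```python
import numpy as np
import pandas as pd

# Statuses treated as bad loans (same list as in the cell above).
bad_loan = ["Charged Off", "Default",
            "Does not meet the credit policy. Status:Charged Off",
            "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"]

# Hypothetical toy data standing in for loans['loan_status'].
demo = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off", "Current", "Default"]})

# isin() builds a boolean mask; np.where() maps it to the two labels.
demo["loan_condition"] = np.where(demo["loan_status"].isin(bad_loan), "Bad Loan", "Good Loan")
print(demo["loan_condition"].value_counts())
```

On large frames this usually runs faster than calling a Python function once per row, and the result is the same column of labels.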

In [15]:
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,loan_condition
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,...,,,,,,,,,,Good Loan
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,,,,,,,,,,Bad Loan
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,...,,,,,,,,,,Good Loan
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,...,,,,,,,,,,Good Loan
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,...,,,,,,,,,,Good Loan
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,...,,,,,,,,,,Good Loan
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,...,,,,,,,,,,Good Loan
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,...,,,,,,,,,,Good Loan
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,...,,,,,,,,,,Bad Loan
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,...,,,,,,,,,,Bad Loan


# One-hot encoding with get_dummies()

In [16]:
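# Toy example: one-hot encode a small 'country' column.
# drop_first=True drops the first category so the remaining columns are not redundant.
# The result is not displayed because this is not the last expression in the cell.
df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea','germany']})

pd.get_dummies(df,prefix=['country'], drop_first=True)

# Append one-hot columns for 'term' to the loans data and store the result in df1.
df1 = pd.concat([loans,pd.get_dummies(loans['term'], prefix='term')],axis=1)
print(df1)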

       loan_amount  funded_amount  investor_funds        term  interest_rate  \
0           5000.0         5000.0     4975.000000   36 months          10.65   
1           2500.0         2500.0     2500.000000   60 months          15.27   
2           2400.0         2400.0     2400.000000   36 months          15.96   
3          10000.0        10000.0    10000.000000   36 months          13.49   
4           3000.0         3000.0     3000.000000   60 months          12.69   
5           5000.0         5000.0     5000.000000   36 months           7.90   
6           7000.0         7000.0     7000.000000   60 months          15.96   
7           3000.0         3000.0     3000.000000   36 months          18.64   
8           5600.0         5600.0     5600.000000   60 months          21.28   
9           5375.0         5375.0     5350.000000   60 months          12.69   
10          6500.0         6500.0     6500.000000   60 months          14.65   
11         12000.0        12000.0    120

In [17]:
# one hot encoding of term column 

pd.get_dummies(loans['term'] ,prefix=['term'])

Unnamed: 0,['term']_ 36 months,['term']_ 60 months
0,1,0
1,0,1
2,1,0
3,1,0
4,0,1
5,1,0
6,0,1
7,1,0
8,0,1
9,0,1
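
The column names in the output above start with `['term']_` because `prefix` was given as a one-element list while the input is a single Series. The sketch below, which uses a small hypothetical Series in place of `loans['term']`, shows that passing the prefix as a plain string gives cleaner names.

```python
import pandas as pd

# Hypothetical toy column with the same values (including the leading space) as loans['term'].
term = pd.Series([" 36 months", " 60 months", " 36 months"], name="term")

# A plain string prefix is joined to each category with an underscore.
dummies = pd.get_dummies(term, prefix="term")
print(list(dummies.columns))
```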


In [18]:
loans.iloc[:,:10]

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT


In [19]:
features = ['term' ,'grade' , 'emp_length' ,'home_ownership']
for feature in features:
    loans = pd.concat([loans,pd.get_dummies(loans[feature], prefix=feature)],axis=1)
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,emp_length_5 years,emp_length_6 years,emp_length_7 years,emp_length_8 years,emp_length_9 years,emp_length_< 1 year,home_ownership_MORTGAGE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,...,0,0,0,0,0,0,0,0,0,1
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,0,0,0,0,0,1,0,0,0,1
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,...,0,0,0,0,0,0,0,0,0,1
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,...,0,0,0,0,0,0,0,0,0,1
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,...,0,0,0,0,0,0,0,0,0,1
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,...,0,0,0,0,0,0,0,0,0,1
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,...,0,0,0,1,0,0,0,0,0,1
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,...,0,0,0,0,1,0,0,0,0,1
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,...,0,0,0,0,0,0,0,0,1,0
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,...,0,0,0,0,0,1,0,0,0,1


In [20]:
loans['addr_state'].unique()

# Make a list with each of the regions by state.

west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']


In [21]:

loans['region'] = np.nan

def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'
    

In [22]:
loans['region'] = loans['addr_state'].apply(finding_regions)

In [23]:
loans['region'].value_counts()

West         5202
NorthEast    5163
SouthEast    4687
MidWest      2878
SouthWest    2070
Name: region, dtype: int64
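
The chain of `if`/`elif` tests can also be replaced by a dictionary lookup. This is only a sketch: the region lists are abbreviated to a few states each, and `regions`, `state_to_region` and `demo` are names introduced here for illustration.

```python
import pandas as pd

# Region lists as defined above, abbreviated to a few states each for brevity.
regions = {
    "West": ["CA", "OR", "WA"],
    "SouthWest": ["AZ", "TX"],
    "SouthEast": ["GA", "FL"],
    "MidWest": ["IL", "OH"],
    "NorthEast": ["NY", "NJ"],
}

# Invert the mapping into a state -> region lookup table.
state_to_region = {state: region for region, states in regions.items() for state in states}

# Hypothetical toy data; 'ZZ' is not in any list, so it maps to NaN.
demo = pd.DataFrame({"addr_state": ["CA", "TX", "NY", "ZZ"]})
demo["region"] = demo["addr_state"].map(state_to_region)
print(demo)
```

States missing from every list end up as missing values, which mirrors how `finding_regions` returns `None` when no branch matches.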

In [24]:
# Let's create categories for annual_income since most of the bad loans are located below 100k

loans['income_category'] = np.nan
lst = [loans]

for col in lst:
    col.loc[col['annual_income'] <= 100000, 'income_category'] = 'Low'
    col.loc[(col['annual_income'] > 100000) & (col['annual_income'] <= 200000), 'income_category'] = 'Medium'
    col.loc[col['annual_income'] > 200000, 'income_category'] = 'High'
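
The three `loc` assignments above amount to binning a numeric column. As a hedged sketch, the same thresholds can be expressed with `pd.cut`; the `demo` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes standing in for loans['annual_income'].
demo = pd.DataFrame({"annual_income": [24000.0, 100000.0, 150000.0, 200000.0, 250000.0]})

# Right-closed bins: (-inf, 100k] is Low, (100k, 200k] is Medium, (200k, inf) is High.
demo["income_category"] = pd.cut(
    demo["annual_income"],
    bins=[-np.inf, 100000, 200000, np.inf],
    labels=["Low", "Medium", "High"],
)
print(demo)
```

Note that `pd.cut` returns a categorical column, and rows with a missing income stay missing, just as they do with the `loc` approach.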

In [25]:
print( loans['income_category'] ,loans['income_category'].value_counts()  )

0           Low
1           Low
2           Low
3           Low
4           Low
5           Low
6           Low
7           Low
8           Low
9           Low
10          Low
11          Low
12          Low
13          Low
14          Low
15          Low
16          Low
17       Medium
18          Low
19          Low
20          Low
21       Medium
22          Low
23          Low
24          Low
25          Low
26          Low
27          Low
28       Medium
29          Low
          ...  
19970       Low
19971       Low
19972    Medium
19973    Medium
19974    Medium
19975       Low
19976      High
19977       Low
19978       Low
19979       Low
19980    Medium
19981       Low
19982    Medium
19983    Medium
19984       Low
19985       Low
19986       Low
19987       Low
19988       Low
19989       Low
19990       Low
19991       Low
19992       Low
19993       Low
19994       Low
19995       Low
19996       Low
19997       Low
19998       Low
19999       Low
Name: income_category, L

In [26]:
loans['interest_rate'].describe()
# The average interest rate is about 13.26%. Rates above the 13.23 cutoff used below
# are labeled as high risk; let's see if this holds.
loans['interest_payments'] = np.nan
lst = [loans]

for col in lst:
    col.loc[col['interest_rate'] <= 13.23, 'interest_payments'] = 'Low'
    col.loc[col['interest_rate'] > 13.23, 'interest_payments'] = 'High'
    
loans.head()

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,emp_length_8 years,emp_length_9 years,emp_length_< 1 year,home_ownership_MORTGAGE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT,region,income_category,interest_payments
0,5000.0,5000.0,4975.0,36 months,10.65,162.87,B,B2,10+ years,RENT,...,0,0,0,0,0,0,1,SouthWest,Low,Low
1,2500.0,2500.0,2500.0,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,0,0,1,0,0,0,1,SouthEast,Low,High
2,2400.0,2400.0,2400.0,36 months,15.96,84.33,C,C5,10+ years,RENT,...,0,0,0,0,0,0,1,MidWest,Low,High
3,10000.0,10000.0,10000.0,36 months,13.49,339.31,C,C1,10+ years,RENT,...,0,0,0,0,0,0,1,West,Low,High
4,3000.0,3000.0,3000.0,60 months,12.69,67.79,B,B5,1 year,RENT,...,0,0,0,0,0,0,1,West,Low,Low


In [27]:
loans['interest_payments'].value_counts()

Low     12174
High     7826
Name: interest_payments, dtype: int64

In [29]:
feature_1 = ['region' , 'income_category' , 'interest_payments']
features =  features + feature_1
features

['term',
 'grade',
 'emp_length',
 'home_ownership',
 'region',
 'income_category',
 'interest_payments']

In [30]:
for feature in feature_1:
    loans = pd.concat([loans , pd.get_dummies(loans[feature], prefix=feature)] , axis=1)
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,region_MidWest,region_NorthEast,region_SouthEast,region_SouthWest,region_West,income_category_High,income_category_Low,income_category_Medium,interest_payments_High,interest_payments_Low
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,...,0,0,0,1,0,0,1,0,0,1
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,0,0,1,0,0,0,1,0,1,0
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,...,1,0,0,0,0,0,1,0,1,0
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,...,0,0,0,0,1,0,1,0,1,0
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,...,0,0,0,0,1,0,1,0,0,1
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,...,0,0,0,1,0,0,1,0,0,1
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,...,0,0,1,0,0,0,1,0,1,0
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,...,0,0,0,0,1,0,1,0,1,0
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,...,0,0,0,0,1,0,1,0,1,0
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,...,0,0,0,1,0,0,1,0,0,1


In [31]:
loans['loan_amount'].value_counts()

12000.0    1291
10000.0    1175
6000.0      920
5000.0      877
15000.0     849
20000.0     756
8000.0      753
35000.0     685
4000.0      495
3000.0      482
7000.0      461
25000.0     457
16000.0     440
30000.0     380
14000.0     366
18000.0     341
9000.0      305
24000.0     255
2000.0      250
4800.0      219
7200.0      192
13000.0     177
9600.0      174
7500.0      172
2500.0      166
21000.0     162
3600.0      160
3500.0      152
28000.0     149
11000.0     148
           ... 
26025.0       1
18275.0       1
10025.0       1
19300.0       1
18875.0       1
14050.0       1
17900.0       1
8275.0        1
16350.0       1
19925.0       1
30100.0       1
24975.0       1
3550.0        1
7675.0        1
13625.0       1
18600.0       1
10925.0       1
22875.0       1
23425.0       1
15025.0       1
8025.0        1
2075.0        1
9850.0        1
22900.0       1
28250.0       1
34200.0       1
21700.0       1
16775.0       1
29550.0       1
34525.0       1
Name: loan_amount, Lengt

In [32]:
loans['loan_amount_range'] = np.nan
lst = [loans]

for col in lst:
    col.loc[col['loan_amount'] <= 10000, 'loan_amount_range'] = 'Low'
    col.loc[(col['loan_amount'] > 10000) & (col['loan_amount'] <= 20000), 'loan_amount_range'] = 'Medium'
    col.loc[col['loan_amount'] > 20000, 'loan_amount_range'] = 'High'

In [33]:
loans = pd.concat([loans , pd.get_dummies(loans['loan_amount_range'] , prefix = ['loan_amount_range'])] ,axis =1)
#loans = pd.concat([loans , pd.get_dummies(loans[feature], prefix=feature)] , axis=1)


In [34]:
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,region_West,income_category_High,income_category_Low,income_category_Medium,interest_payments_High,interest_payments_Low,loan_amount_range,['loan_amount_range']_High,['loan_amount_range']_Low,['loan_amount_range']_Medium
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,...,0,0,1,0,0,1,Low,0,1,0
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,0,0,1,0,1,0,Low,0,1,0
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,...,0,0,1,0,1,0,Low,0,1,0
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,...,1,0,1,0,1,0,Low,0,1,0
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,...,1,0,1,0,0,1,Low,0,1,0
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,...,0,0,1,0,0,1,Low,0,1,0
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,...,0,0,1,0,1,0,Low,0,1,0
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,...,1,0,1,0,1,0,Low,0,1,0
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,...,1,0,1,0,1,0,Low,0,1,0
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,...,0,0,1,0,0,1,Low,0,1,0


Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.

In [35]:
# Copy Dataframe
complete_df = loans.copy()


# Handling Missing Numeric Values

# Fill missing values in these columns with 0.
# TODO: double-check what each of these variables means.
for col in ('dti_joint', 'annual_inc_joint', 'il_util', 'mths_since_rcnt_il', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
           'open_il_24m', 'inq_last_12m', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl',
           'mths_since_last_record', 'mths_since_last_major_derog', 'mths_since_last_delinq', 'total_bal_il', 'tot_coll_amt',
           'tot_cur_bal', 'total_rev_hi_lim', 'revol_util', 'collections_12_mths_ex_med', 'open_acc', 'inq_last_6mths',
           'verification_status_joint', 'acc_now_delinq'):
    complete_df[col] = complete_df[col].fillna(0)
    

In [36]:
# Fill missing dates (next payment, last payment, last credit pull, earliest credit line)
# with the most frequent value (mode) within each region.
# mode() must be called and returns a Series, so take its first entry.
def fill_with_mode(x):
    modes = x.mode()
    return x.fillna(modes.iloc[0]) if not modes.empty else x

for col in ("next_pymnt_d", "last_pymnt_d", "last_credit_pull_d", "earliest_cr_line"):
    complete_df[col] = complete_df.groupby("region")[col].transform(fill_with_mode)


In [37]:
# Fill missing pub_rec values with the median for the client's region
complete_df["pub_rec"] = complete_df.groupby("region")["pub_rec"].transform(lambda x: x.fillna(x.median()))

# Fill missing annual income with the mean for the client's region
complete_df["annual_income"] = complete_df.groupby("region")["annual_income"].transform(lambda x: x.fillna(x.mean()))

# Fill the missing total number of credit lines with the regional median
complete_df["total_acc"] = complete_df.groupby("region")["total_acc"].transform(lambda x: x.fillna(x.median()))

# Fill missing delinquency counts (past two years) with the regional mean
complete_df["delinq_2yrs"] = complete_df.groupby("region")["delinq_2yrs"].transform(lambda x: x.fillna(x.mean()))
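
To see what group-wise imputation does, here is a small sketch on hypothetical data: each missing `total_acc` is replaced by the median of the other rows in the same region.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per region.
demo = pd.DataFrame({
    "region": ["West", "West", "West", "MidWest", "MidWest"],
    "total_acc": [10.0, np.nan, 20.0, 4.0, np.nan],
})

# transform() returns a result aligned row-by-row with the original frame,
# so it can be assigned straight back to the column.
demo["total_acc"] = demo.groupby("region")["total_acc"].transform(lambda x: x.fillna(x.median()))
print(demo)
```

The West gap is filled with 15.0 (the median of 10 and 20) and the MidWest gap with 4.0.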

In [38]:
# Encode loan_condition as an integer column: good loan = 1, bad loan = -1

complete_df['loan_condition_int'] = complete_df['loan_condition'].apply(lambda x: 1 if x=='Good Loan' else -1)


In [41]:
complete_df = complete_df.drop(features ,axis =1)

In [45]:
complete_df = complete_df.iloc[:,64:]
complete_df

Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,emp_length_1 year,...,income_category_High,income_category_Low,income_category_Medium,interest_payments_High,interest_payments_Low,loan_amount_range,['loan_amount_range']_High,['loan_amount_range']_Low,['loan_amount_range']_Medium,loan_condition_int
0,1,0,0,1,0,0,0,0,0,0,...,0,1,0,0,1,Low,0,1,0,1
1,0,1,0,0,1,0,0,0,0,0,...,0,1,0,1,0,Low,0,1,0,-1
2,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,Low,0,1,0,1
3,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,Low,0,1,0,1
4,0,1,0,1,0,0,0,0,0,1,...,0,1,0,0,1,Low,0,1,0,1
5,1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,1,Low,0,1,0,1
6,0,1,0,0,1,0,0,0,0,0,...,0,1,0,1,0,Low,0,1,0,1
7,1,0,0,0,0,0,1,0,0,0,...,0,1,0,1,0,Low,0,1,0,1
8,0,1,0,0,0,0,0,1,0,0,...,0,1,0,1,0,Low,0,1,0,-1
9,0,1,0,1,0,0,0,0,0,0,...,0,1,0,0,1,Low,0,1,0,-1


In [110]:
# Drop these variables before scaling but don't drop these when we perform feature engineering on missing values.
# Columns to delete or fix: earliest_cr_line, last_pymnt_d, next_pymnt_d, last_credit_pull_d, verification_status_joint

# ---->>>> Fix the problems shown during scaling with the columns above.

complete_df.drop(['income_category', 'region',  'emp_length', 'loan_condition',
                 'earliest_cr_line', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d', 
                 'verification_status_joint', 'emp_length_int', 'total_rec_prncp', 'funded_amount', 'investor_funds', 
                 'sub_grade', 'loan_status', 'interest_payments', 
                 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
               'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'recoveries',
               'collection_recovery_fee', 'last_pymnt_amnt',
               'collections_12_mths_ex_med', 'mths_since_last_major_derog',
               'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
               'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
               'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
               'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
               'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m'], axis=1, inplace=True)

KeyError: "['income_category' 'region' 'emp_length' 'loan_condition'\n 'earliest_cr_line' 'last_pymnt_d' 'next_pymnt_d' 'last_credit_pull_d'\n 'verification_status_joint' 'emp_length_int' 'total_rec_prncp'\n 'funded_amount' 'investor_funds' 'sub_grade' 'loan_status'\n 'interest_payments' 'initial_list_status' 'out_prncp' 'out_prncp_inv'\n 'total_pymnt' 'total_pymnt_inv' 'total_rec_int' 'total_rec_late_fee'\n 'recoveries' 'collection_recovery_fee' 'last_pymnt_amnt'\n 'collections_12_mths_ex_med' 'mths_since_last_major_derog' 'policy_code'\n 'application_type' 'annual_inc_joint' 'dti_joint' 'acc_now_delinq'\n 'tot_coll_amt' 'tot_cur_bal' 'open_acc_6m' 'open_il_6m' 'open_il_12m'\n 'open_il_24m' 'mths_since_rcnt_il' 'total_bal_il' 'il_util' 'open_rv_12m'\n 'open_rv_24m' 'max_bal_bc' 'all_util' 'total_rev_hi_lim' 'inq_fi'\n 'total_cu_tl' 'inq_last_12m'] not found in axis"

In [46]:
complete_df.columns

Index(['term_ 36 months', 'term_ 60 months', 'grade_A', 'grade_B', 'grade_C',
       'grade_D', 'grade_E', 'grade_F', 'grade_G', 'emp_length_1 year',
       'emp_length_10+ years', 'emp_length_2 years', 'emp_length_3 years',
       'emp_length_4 years', 'emp_length_5 years', 'emp_length_6 years',
       'emp_length_7 years', 'emp_length_8 years', 'emp_length_9 years',
       'emp_length_< 1 year', 'home_ownership_MORTGAGE',
       'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT',
       'region_MidWest', 'region_NorthEast', 'region_SouthEast',
       'region_SouthWest', 'region_West', 'income_category_High',
       'income_category_Low', 'income_category_Medium',
       'interest_payments_High', 'interest_payments_Low', 'loan_amount_range',
       '['loan_amount_range']_High', '['loan_amount_range']_Low',
       '['loan_amount_range']_Medium', 'loan_condition_int'],
      dtype='object')

In [47]:
complete_df.isnull().sum().max() # Maximum number of nulls.


0

We can see that over half of the loan grades are assigned values `B` or `C`. Each loan is assigned one of these grades, along with a more finely discretized feature called `sub_grade` (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found [here](https://www.lendingclub.com/public/rates-and-fees.action).

Now, let's look at a different feature.

In [48]:
complete_df['loan_condition_int'].value_counts()


 1    16952
-1     3048
Name: loan_condition_int, dtype: int64

 # some other code

This feature describes whether the loanee has a mortgage, rents, or owns a home. We can see that a small percentage of the loanees own a home.

## Exploring the target column

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We put this in a new column called `safe_loans`.

In [49]:
complete_df['loan_condition_int'].value_counts()


 1    16952
-1     3048
Name: loan_condition_int, dtype: int64

Now, let us explore the distribution of the column `safe_loans`. This gives us a sense of how many safe and risky loans are present in the dataset.

In [13]:
#loans['safe_loans'].show(view = 'Categorical')

You should have:
* Around 81% safe loans
* Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
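
For the 20,000-row pandas subset used in the cells above, the class shares can be worked out directly from the counts printed by `value_counts()` (16952 safe and 3048 risky). A minimal sketch of that arithmetic:

```python
import pandas as pd

# Class counts as printed by complete_df['loan_condition_int'].value_counts() above.
counts = pd.Series({1: 16952, -1: 3048})

# Dividing by the total gives each class's share of the data.
shares = counts / counts.sum()
print(shares)
```

This gives roughly 85% safe and 15% risky for the subset. The 81% and 19% figures quoted above match the larger dataset of 99457 safe and 23150 risky loans used later in this notebook. On a real column, `value_counts(normalize=True)` returns the shares in one step.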

## Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [16]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]


In [19]:
# Only the features listed above (plus the target) are kept in the data
#loans

## Sample data to balance classes

As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [20]:
# target = 'safe_loans', as defined above
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150


In [25]:
len(safe_loans_raw)/float(len(loans['safe_loans']))


0.8111853319957262

Now, write some code to compute below the percentage of safe and risky loans in the dataset and validate these numbers against what was given using `.show` earlier in the assignment:

In [26]:
print "Percentage of safe loans  :", len(safe_loans_raw)/float(len(loans['safe_loans']))

print "Percentage of risky loans :", len(risky_loans_raw)/float(len(loans['safe_loans']))


Percentage of safe loans  : 0.811185331996
Percentage of risky loans : 0.188814668004


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used `seed=1` so everyone gets the same results.

In [47]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

# Keep every risky loan and draw a random sample of the safe loans.
# `percentage` is about 0.23, so roughly 23% of the safe loans are kept.
risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)

In [48]:
# There were 99457 safe loans; sampling a fraction of about 0.23 keeps 23358 of them,
# which are combined with the 23150 risky loans for 46508 loans in total.
print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

Percentage of safe loans                 : 0.502236174422
Percentage of risky loans                : 0.497763825578
Total number of loans in our new dataset : 46508
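The cell above relies on SFrame-style `sample(fraction, seed=...)` and `append`. With plain pandas, the same undersampling idea could be sketched as follows. The `toy` frame and its values are hypothetical; `DataFrame.sample(frac=..., random_state=...)` and `pd.concat` are the pandas calls assumed to stand in for the SFrame ones.

```python
import pandas as pd

# Hypothetical toy frame: 8 safe loans (+1) and 2 risky loans (-1).
toy = pd.DataFrame({'safe_loans': [1] * 8 + [-1] * 2,
                    'dti': [float(i) for i in range(10)]})

safe_raw = toy[toy['safe_loans'] == 1]
risky_raw = toy[toy['safe_loans'] == -1]

# Fraction of safe loans to keep so both classes end up about the same size.
fraction = len(risky_raw) / float(len(safe_raw))

# Keep every risky loan and a random subset of the safe loans.
safe_down = safe_raw.sample(frac=fraction, random_state=1)
balanced = pd.concat([risky_raw, safe_down])

print(balanced['safe_loans'].value_counts())
```

pandas draws a fixed number of rows for a given `frac`, so on the real data the resulting counts would not match the SFrame output recorded above exactly.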


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907). For this assignment, we use the simplest possible approach, where we subsample the over-represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

# After feature engineering, `complete_df` is the new dataset used in the steps below

In [56]:
complete_df = complete_df.drop(['loan_amount_range'] , axis =1)
complete_df

Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,emp_length_1 year,...,region_West,income_category_High,income_category_Low,income_category_Medium,interest_payments_High,interest_payments_Low,['loan_amount_range']_High,['loan_amount_range']_Low,['loan_amount_range']_Medium,loan_condition_int
0,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,1,0,1,0,1
1,0,1,0,0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,-1
2,1,0,0,0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,1
3,1,0,0,0,1,0,0,0,0,0,...,1,0,1,0,1,0,0,1,0,1
4,0,1,0,1,0,0,0,0,0,1,...,1,0,1,0,0,1,0,1,0,1
5,1,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,1,0,1
6,0,1,0,0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,1
7,1,0,0,0,0,0,1,0,0,0,...,1,0,1,0,1,0,0,1,0,1
8,0,1,0,0,0,0,0,1,0,0,...,1,0,1,0,1,0,0,1,0,-1
9,0,1,0,1,0,0,0,0,0,0,...,0,0,1,0,0,1,0,1,0,-1


In [57]:

features = complete_df.columns


In [58]:
x = complete_df.drop(['loan_condition_int'] ,axis =1)
y = complete_df['loan_condition_int']
print(x.shape ,y.shape)

(20000, 37) (20000,)
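The column names in `complete_df` (for example `grade_A` or `term_ 36 months`) look like one-hot indicator columns. The feature-engineering step that produced them is not shown here, so the following is only a sketch of how such columns could be built, assuming `pd.get_dummies` was used; the `raw` frame is hypothetical.

```python
import pandas as pd

# Hypothetical raw rows; the real feature-engineering step is not shown here.
raw = pd.DataFrame({'grade': ['B', 'C', 'A'],
                    'term': [' 36 months', ' 60 months', ' 36 months'],
                    'loan_condition_int': [1, -1, 1]})

# One indicator column per category value; the target column is left untouched.
encoded = pd.get_dummies(raw, columns=['grade', 'term'], dtype=int)

x_toy = encoded.drop(['loan_condition_int'], axis=1)
y_toy = encoded['loan_condition_int']
print(list(x_toy.columns))
print(x_toy.shape, y_toy.shape)
```

scikit-learn's tree learners expect numeric input, which is why categorical columns such as `grade` need an encoding of this kind before training.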


## Split data into training and validation sets

We split the data into training and validation sets using an 80/20 split, fixing `random_state=10` so the split is reproducible.

**Note**: In previous assignments, we have called this a **train-test split**. However, the portion of data that we don't train on will be used to help **select model parameters** (this is known as model selection). Thus, this portion of data should be called a **validation set**. Recall that examining the performance of various potential models (i.e. models with different parameters) should be done on the validation set, while evaluation of the final selected model should always be done on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.

In [59]:
from sklearn.model_selection import train_test_split

x_train, x_validation, y_train, y_validation = train_test_split(
    x, y, test_size=0.2, random_state=10)
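The note above mentions setting aside a real test set in addition to the validation set. A minimal sketch of such a three-way split, calling `train_test_split` twice on hypothetical stand-in data (`x_toy`, `y_toy` are not the loan data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, labels in {-1, +1}.
rng = np.random.RandomState(0)
x_toy = rng.rand(100, 3)
y_toy = np.where(rng.rand(100) > 0.2, 1, -1)

# First hold out 20% as a test set that is only touched once, at the very end.
x_rest, x_test, y_rest, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=10)

# Then split the remainder into training and validation sets (0.25 of 80% = 20%).
x_tr, x_val, y_tr, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=10)

print(len(x_tr), len(x_val), len(x_test))
```

This gives a 60/20/20 split; the exact proportions are a design choice and not something this assignment prescribes.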

# Use decision tree to build a classifier

Now, let's use scikit-learn's `DecisionTreeClassifier` to create a loan prediction model on the training data. (In the next assignment, you will implement your own decision tree learning algorithm.)  Our feature columns and target column have already been decided above. Note that no `random_state` is set below, so results may differ slightly between runs.

In [62]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_model = DecisionTreeClassifier()

In [63]:
decision_tree_model.fit(x_train ,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [64]:
y_pred = decision_tree_model.predict(x_validation)
y_pred

array([-1,  1,  1, ..., -1,  1,  1], dtype=int64)

In [67]:
from sklearn.metrics import accuracy_score , confusion_matrix , f1_score , roc_auc_score
valid_acc = accuracy_score(y_validation ,y_pred)
valid_confusion_matrix = confusion_matrix(y_validation ,y_pred)
valid_f1_score = f1_score(y_validation ,y_pred)
valid_roc_auc_score = roc_auc_score(y_validation ,y_pred)
print('valid_acc ::' , valid_acc)
print('valid_confusion_matrix ::' ,valid_confusion_matrix)

print('valid_f1_score ::' , valid_f1_score)
print('valid_roc_auc_score ::' ,valid_roc_auc_score)

valid_acc :: 0.7845
valid_confusion_matrix :: [[  91  560]
 [ 302 3047]]
valid_f1_score :: 0.8760782058654398
valid_roc_auc_score :: 0.5248043871224599
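The metrics printed above can be recomputed by hand from the confusion matrix, which helps to see what each one measures. The sketch below uses only the four recorded counts and assumes scikit-learn's layout (true labels in rows, predicted labels in columns, classes in sorted order):

```python
# Confusion matrix recorded above; scikit-learn puts true labels in rows and
# predicted labels in columns, with classes sorted as (-1, +1).
tn, fp = 91, 560
fn, tp = 302, 3047
total = tn + fp + fn + tp

accuracy = (tp + tn) / float(total)

# F1 for the positive class (+1), which is what f1_score reports by default.
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard -1/+1 predictions, ROC AUC reduces to the mean of the
# true-positive rate and the true-negative rate.
auc_from_labels = 0.5 * (recall + tn / float(tn + fp))

print(accuracy, f1, auc_from_labels)
```

Because `roc_auc_score` was given hard labels here, it cannot use the ranking of the examples. Passing the probability of the +1 class (for example `predict_proba(x_validation)[:, 1]`) is usually more informative.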


## Visualizing a learned model

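In the original GraphLab Create version of this assignment, the [documentation](https://dato.com/products/create/docs/generated/graphlab.boosted_trees_classifier.create.html#graphlab.boosted_trees_classifier.create) notes that the max depth of the tree is typically capped at 6, and a smaller model with max depth of 2 is learned for visualization. scikit-learn's `DecisionTreeClassifier` has no depth limit by default (`max_depth=None`, as the output above shows), and such a tree can be hard to visualize graphically.  Here, we instead learn a second model with **max depth of 10** and compare it against the unrestricted tree.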

In [73]:
decision_tree_model_depth = DecisionTreeClassifier(max_depth=10)
decision_tree_model_depth.fit(x_train,y_train)
y_pred_depth = decision_tree_model_depth.predict(x_validation)
y_pred_depth

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In the view that is provided by GraphLab Create, you can see each node, and each split at each node. This visualization is great for considering what happens when this model predicts the target of a new data point. 

**Note:** To better understand this visual:
* The root node is represented using pink. 
* Intermediate nodes are in green. 
* Leaf nodes in blue and orange. 

In [74]:
valid_acc_depth = accuracy_score(y_validation ,y_pred_depth)
valid_confusion_matrix_depth = confusion_matrix(y_validation ,y_pred_depth)
valid_f1_score_depth = f1_score(y_validation ,y_pred_depth)
valid_roc_auc_score_depth = roc_auc_score(y_validation ,y_pred_depth)
print('valid_acc_depth ::' , valid_acc_depth)
print('valid_confusion_matrix_depth ::' ,valid_confusion_matrix_depth)

print('valid_f1_score_depth ::' , valid_f1_score_depth)
print('valid_roc_auc_score_depth ::' ,valid_roc_auc_score_depth)

valid_acc_depth :: 0.83225
valid_confusion_matrix_depth :: [[  20  631]
 [  40 3309]]
valid_f1_score_depth :: 0.9079434764713952
valid_roc_auc_score_depth :: 0.5093890511829425


# Making predictions

Let's consider four examples **from the validation set** (the first four rows, all of which are labeled +1 in this split) and see what the model predicts. We will do the following:
* Predict whether or not a loan is safe.
* Predict the probability that a loan is safe.

In [75]:
y_pred_1 = decision_tree_model.predict(x_train)

train_acc = accuracy_score(y_train ,y_pred_1)
train_confusion_matrix = confusion_matrix(y_train ,y_pred_1)
train_f1_score = f1_score(y_train ,y_pred_1)
train_roc_auc_score = roc_auc_score(y_train ,y_pred_1)
print('train_acc ::' , train_acc)
print('train_confusion_matrix ::' ,train_confusion_matrix)

print('train_f1_score ::' , train_f1_score)
print('train_roc_auc_score ::' ,train_roc_auc_score)

train_acc :: 0.8859375
train_confusion_matrix :: [[ 1026  1371]
 [  454 13149]]
train_f1_score :: 0.9351064964619705
train_roc_auc_score :: 0.6973300264969527


In [76]:
y_pred_depth_1 = decision_tree_model_depth.predict(x_train)


train_acc_depth = accuracy_score(y_train ,y_pred_depth_1)
train_confusion_matrix_depth = confusion_matrix(y_train ,y_pred_depth_1)
train_f1_score_depth = f1_score(y_train ,y_pred_depth_1)
train_roc_auc_score_depth = roc_auc_score(y_train ,y_pred_depth_1)
print('train_acc_depth ::' , train_acc_depth)
print('train_confusion_matrix_depth ::' ,train_confusion_matrix_depth)

print('train_f1_score_depth ::' , train_f1_score_depth)
print('train_roc_auc_score_depth ::' ,train_roc_auc_score_depth)

train_acc_depth :: 0.8570625
train_confusion_matrix_depth :: [[  195  2202]
 [   85 13518]]
train_f1_score_depth :: 0.9220066159669883
train_roc_auc_score_depth :: 0.5375515339922164
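To compare the two trees, it helps to put the training and validation accuracies side by side. The sketch below only copies the numbers recorded above and computes the gap between them:

```python
# Accuracies copied from the outputs recorded above.
results = {
    'unrestricted tree': {'train': 0.8859375, 'validation': 0.7845},
    'max_depth=10 tree': {'train': 0.8570625, 'validation': 0.83225},
}

gaps = {}
for name, acc in results.items():
    # A large train/validation gap is a common symptom of overfitting.
    gaps[name] = acc['train'] - acc['validation']
    print('%-18s train=%.4f  validation=%.4f  gap=%.4f'
          % (name, acc['train'], acc['validation'], gaps[name]))
```

Note that 3349 of the 4000 validation loans are labeled +1, so always predicting +1 would score 0.83725. Both validation accuracies should be read against that baseline.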


## Explore label predictions

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample of validation data (the first four rows, stored in `x_validation_data` below), use the **decision_tree_model** to predict whether or not the loan is classified as a **safe loan**.

**Hint:** Be sure to use the `.predict()` method.

In [85]:
x_validation_data = x_validation[:4]
y_validation_data = y_validation[:4]
x_validation_data 


Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,emp_length_1 year,...,region_SouthWest,region_West,income_category_High,income_category_Low,income_category_Medium,interest_payments_High,interest_payments_Low,['loan_amount_range']_High,['loan_amount_range']_Low,['loan_amount_range']_Medium
19778,0,1,0,0,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,1,0
4376,1,0,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,1
10188,1,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0
9887,1,0,1,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0


In [86]:
print(y_validation_data)


19778    1
4376     1
10188    1
9887     1
Name: loan_condition_int, dtype: int64


In [88]:
# Compare the predicted class probabilities with the true labels of the sample
print(decision_tree_model.predict_proba(x_validation_data))
print(y_validation_data)

[[0.6        0.4       ]
 [0.09090909 0.90909091]
 [0.2        0.8       ]
 [0.         1.        ]]
19778    1
4376     1
10188    1
9887     1
Name: loan_condition_int, dtype: int64
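Each row of the `predict_proba` output holds one probability per class, with columns in the order of the classifier's `classes_` attribute (sorted, so -1 first and +1 second). A small sketch of how the probabilities recorded above map to class labels:

```python
import numpy as np

# Probabilities recorded above; columns follow the classifier's classes_
# attribute, which scikit-learn keeps in sorted order: [-1, +1].
classes = np.array([-1, 1])
proba = np.array([[0.6, 0.4],
                  [0.09090909, 0.90909091],
                  [0.2, 0.8],
                  [0.0, 1.0]])
y_true = np.array([1, 1, 1, 1])

# predict() returns the class with the highest predicted probability.
labels = classes[np.argmax(proba, axis=1)]
p_safe = proba[:, 1]  # probability of the +1 class

print(labels)
print(np.mean(labels == y_true))
```

Here the first loan is assigned -1 because its probability for the +1 class (0.4) is below 0.5, while the other three are assigned +1.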


**Quiz Question:** What percentage of the predictions on `sample_validation_data` did `decision_tree_model` get correct?

## Explore probability predictions

For each row in the **sample_validation_data**, what is the probability (according to **decision_tree_model**) of a loan being classified as **safe**?


**Hint:** Set `output_type='probability'` to make **probability** predictions using **decision_tree_model** on `sample_validation_data`:

In [56]:
probability_predictions = decision_tree_model.predict(sample_validation_data, output_type='probability')
probability_predictions

dtype: float
Rows: 4
[0.5473231077194214, 0.48868122696876526, 0.45579513907432556, 0.5458507537841797]

**Quiz Question:** Which loan has the highest probability of being classified as a **safe loan**?

**Checkpoint:** Can you verify that for all the predictions with `probability >= 0.5`, the model predicted the label **+1**?

### Tricky predictions!

Now, we will explore something pretty interesting. For each row in the **sample_validation_data**, what is the probability (according to **small_model**) of a loan being classified as **safe**?

**Hint:** Set `output_type='probability'` to make **probability** predictions using **small_model** on `sample_validation_data`:

In [57]:
print small_model.predict(sample_validation_data, output_type='probability')


[0.5242817997932434, 0.47226759791374207, 0.47226759791374207, 0.5797497630119324]


**Quiz Question:** Notice that the probability predictions are the **exact same** for the 2nd and 3rd loans. Why would this happen?

## Visualize the prediction on a tree


Note that you should be able to look at the small tree, traverse it yourself, and visualize the prediction being made. Consider the following point in the **sample_validation_data**:

In [95]:
import graphviz
from sklearn.tree import export_graphviz

# class_names must be strings, listed in ascending order of the class labels (-1, then +1).
# Passing y_train here raises "TypeError: must be str, not numpy.int64".
dot_data = export_graphviz(decision_tree_model, out_file=None,
                           feature_names=list(x_train.columns),
                           class_names=['-1', '+1'],
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph
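When Graphviz is not installed, a plain-text rendering is one alternative. The sketch below assumes scikit-learn 0.21 or later, where `sklearn.tree.export_text` is available, and uses a hypothetical four-row toy dataset rather than the loan data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data with two indicator features; the first one decides the label.
x_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [-1, -1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_toy, y_toy)

# export_text returns a plain string, one line per node.
report = export_text(toy_tree, feature_names=['grade_A', 'term_ 60 months'])
print(report)
```

Each indented line is one node, so a prediction can be traced by following the conditions from the top down to a `class:` line.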

In [58]:
sample_validation_data[1]

{'dti': 16.85,
 'emp_length_num': 10L,
 'grade': 'D',
 'home_ownership': 'RENT',
 'last_delinq_none': 1L,
 'last_major_derog_none': 1L,
 'purpose': 'debt_consolidation',
 'revol_util': 96.4,
 'safe_loans': 1L,
 'short_emp': 0L,
 'sub_grade': 'D1',
 'term': ' 36 months',
 'total_rec_late_fee': 0.0}

Let's visualize the small tree here to do the traversing for this data point.

In [59]:
small_model.show(view="Tree")

**Note:** In the tree visualization above, the values at the leaf nodes are not class predictions but scores (a slightly advanced concept that is out of the scope of this course). You can read more about this [here](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf).  If the score is $\geq$ 0, the class +1 is predicted.  Otherwise, if the score < 0, we predict class -1.


**Quiz Question:** Based on the visualized tree, what prediction would you make for this data point?

Now, let's verify your prediction by examining the prediction made using GraphLab Create.  Use the `.predict` function on `small_model`.

In [60]:
small_model.predict(sample_validation_data[1])


dtype: int
Rows: 1
[-1L]

# Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

Let us start by evaluating the accuracy of the `small_model` and `decision_tree_model` on the training data:

In [61]:
print small_model.evaluate(train_data)['accuracy']
print decision_tree_model.evaluate(train_data)['accuracy']

0.613421448528


0.639641091769


**Checkpoint:** You should see that the **small_model** performs worse than the **decision_tree_model** on the training data.


Now, let us evaluate the accuracy of the **small_model** and **decision_tree_model** on the entire **validation_data**, not just the subsample considered above.

In [65]:
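# Evaluate on the full validation_data; the 0.5 values recorded below came from
# running on the 4-row sample_validation_data instead.
print small_model.evaluate(validation_data)['accuracy']
print decision_tree_model.evaluate(validation_data)['accuracy']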


0.5


0.5


**Quiz Question:** What is the accuracy of `decision_tree_model` on the validation set, rounded to the nearest .01?

## Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with `max_depth=10`. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

In [66]:
big_model = graphlab.decision_tree_classifier.create(train_data, validation_set=None,
                   target = target, features = features, max_depth = 10)

Now, let us evaluate **big_model** on the training set and validation set.

In [67]:
print big_model.evaluate(train_data)['accuracy']
print big_model.evaluate(validation_data)['accuracy']

0.665189125296


0.620853080569


**Checkpoint:** We should see that **big_model** has even better performance on the training set than **decision_tree_model** did on the training set.

**Quiz Question:** How does the performance of **big_model** on the validation set compare to **decision_tree_model** on the validation set? Is this a sign of overfitting?
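The effect of tree depth on training versus validation accuracy can also be explored with scikit-learn. The following is a sketch on synthetic data from `make_classification` (an assumption made for illustration, not the LendingClub data), so the numbers it prints say nothing about the loan models above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (not the LendingClub data).
x_syn, y_syn = make_classification(n_samples=2000, n_features=20,
                                   n_informative=5, flip_y=0.1,
                                   random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(x_syn, y_syn, test_size=0.2,
                                            random_state=10)

scores = {}
for depth in [2, 6, 10, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(x_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth.
    scores[depth] = (model.score(x_tr, y_tr), model.score(x_val, y_val))
    print(depth, scores[depth])
```

Training accuracy tends to keep rising with depth, while validation accuracy typically levels off or drops, which is the pattern the quiz question asks you to look for.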

### Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try and quantify the cost of each mistake made by the model.

Assume the following:

* **False negatives**: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted. 
* **False positives**: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given. 
* **Correct predictions**: All correct predictions don't typically incur any cost.


Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:
1. First, let us compute the predictions made by the model.
2. Second, compute the number of false positives.
3. Third, compute the number of false negatives.
4. Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

First, let us make predictions on `validation_data` using the `decision_tree_model`:

In [69]:
predictions = decision_tree_model.predict(validation_data)
predictions

dtype: int
Rows: 9284
[-1L, 1L, -1L, -1L, 1L, -1L, 1L, 1L, -1L, -1L, -1L, 1L, 1L, -1L, 1L, 1L, 1L, -1L, -1L, -1L, 1L, 1L, -1L, -1L, -1L, 1L, 1L, -1L, 1L, -1L, -1L, -1L, -1L, 1L, 1L, -1L, 1L, -1L, 1L, -1L, 1L, 1L, 1L, 1L, 1L, -1L, 1L, -1L, 1L, 1L, -1L, -1L, -1L, -1L, -1L, 1L, 1L, 1L, -1L, -1L, 1L, -1L, -1L, -1L, 1L, 1L, -1L, -1L, -1L, -1L, -1L, 1L, -1L, 1L, -1L, -1L, -1L, 1L, -1L, -1L, -1L, -1L, 1L, 1L, -1L, 1L, 1L, -1L, -1L, 1L, -1L, 1L, -1L, -1L, -1L, 1L, -1L, -1L, -1L, -1L, ... ]

**False positives** are predictions where the model predicts +1 but the true label is -1. Complete the following code block for the number of false positives:

In [70]:
# Count the predictions of +1 whose true label (safe_loans) is -1
true_labels = validation_data['safe_loans']
false_positives = sum((predictions == +1) & (true_labels == -1))
false_positives

1666L

**False negatives** are predictions where the model predicts -1 but the true label is +1. Complete the following code block for the number of false negatives:

In [71]:
false_negatives = sum((predictions == -1) & (true_labels == +1))
false_negatives

1729L

**Quiz Question:** Let us assume that each mistake costs money:
* Assume a cost of \$10,000 per false negative.
* Assume a cost of \$20,000 per false positive.

What is the total cost of mistakes made by `decision_tree_model` on `validation_data`?

In [72]:
total_cost = 10000 * false_negatives + 20000 * false_positives
total_cost

50610000L
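The steps above can be wrapped in a small helper. This is a sketch in plain Python; the function name `cost_of_mistakes` and its default costs are assumptions taken from the quiz question, not part of any library:

```python
def cost_of_mistakes(y_true, y_pred, fn_cost=10000, fp_cost=20000):
    """Total cost given true and predicted labels in {-1, +1}."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == -1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if p == -1 and t == 1)
    return fn_cost * false_neg + fp_cost * false_pos


# Tiny hypothetical example: one false positive and one false negative.
toy_cost = cost_of_mistakes([1, -1, 1, -1], [1, 1, -1, -1])
print(toy_cost)

# Applying the same assumed costs to counts recorded earlier in this notebook:
graphlab_cost = 10000 * 1729 + 20000 * 1666  # GraphLab decision_tree_model
sklearn_cost = 10000 * 302 + 20000 * 560     # scikit-learn tree, validation confusion matrix
print(graphlab_cost, sklearn_cost)
```

Weighting false positives twice as heavily as false negatives means two models with the same accuracy can have quite different costs.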