## Implementing binary decision trees

The goal of this notebook is to implement your own binary decision tree classifier. You will:
    
* Use SFrames to do some feature engineering.
* Transform categorical variables into binary variables.
* Write a function to compute the number of misclassified examples in an intermediate node.
* Write a function to find the best feature to split on.
* Build a binary decision tree from scratch.
* Make predictions using the decision tree.
* Evaluate the accuracy of the decision tree.
* Visualize the decision at the root node.

**Important Note**: In this assignment, we will focus on building decision trees where the data contain **only binary (0 or 1) features**. This allows us to avoid dealing with:
* Multiple intermediate nodes in a split
* The thresholding issues of real-valued features.

This assignment **may be challenging**, so brace yourself :)

Make sure you have the latest version of GraphLab Create.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

 # here data engineering is done means features are converted into sub_columns as (0 &1 )

# Load the lending club dataset

In [4]:
loans = pd.read_csv('loan.csv')

We will be using the same [LendingClub](https://www.lendingclub.com/) dataset as in the previous assignment.

In [5]:
loans

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m
0,1077501,1296599,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,...,,,,,,,,,,
1,1077430,1314167,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,...,,,,,,,,,,
2,1077175,1313524,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,...,,,,,,,,,,
3,1076863,1277178,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,...,,,,,,,,,,
4,1075358,1311748,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,...,,,,,,,,,,
5,1075269,1311441,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,...,,,,,,,,,,
6,1069639,1304742,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,...,,,,,,,,,,
7,1072053,1288686,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,...,,,,,,,,,,
8,1071795,1306957,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,...,,,,,,,,,,
9,1071570,1306721,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,...,,,,,,,,,,


In [6]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 74 columns):
id                             887379 non-null int64
member_id                      887379 non-null int64
loan_amnt                      887379 non-null float64
funded_amnt                    887379 non-null float64
funded_amnt_inv                887379 non-null float64
term                           887379 non-null object
int_rate                       887379 non-null float64
installment                    887379 non-null float64
grade                          887379 non-null object
sub_grade                      887379 non-null object
emp_title                      835917 non-null object
emp_length                     842554 non-null object
home_ownership                 887379 non-null object
annual_inc                     887375 non-null float64
verification_status            887379 non-null object
issue_d                        887379 non-null object
loan_status          

In [11]:
loans['verification_status'].value_counts() # its just about status of loan so no need as a  column which effects data 

Source Verified    329558
Verified           291071
Not Verified       266750
Name: verification_status, dtype: int64

In [12]:
loans['loan_status'].value_counts()

Current                                                601779
Fully Paid                                             207723
Charged Off                                             45248
Late (31-120 days)                                      11591
Issued                                                   8460
In Grace Period                                          6253
Late (16-30 days)                                        2357
Does not meet the credit policy. Status:Fully Paid       1988
Default                                                  1219
Does not meet the credit policy. Status:Charged Off       761
Name: loan_status, dtype: int64

In [13]:
# replace some columns names and drop some of them 

loans = loans.rename(columns={"loan_amnt": "loan_amount", "funded_amnt": "funded_amount", "funded_amnt_inv": "investor_funds",
                       "int_rate": "interest_rate", "annual_inc": "annual_income"})

# Drop irrelevant columns
loans.drop(['id', 'member_id', 'emp_title', 'url', 'desc', 'zip_code', 'title'], axis=1, inplace=True)

In [21]:
# removed the columns which are not usefull from 18 to all 
loans = loans.iloc[: , :17]
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified,Dec-2011,Fully Paid,n,credit_card,AZ
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified,Dec-2011,Charged Off,n,car,GA
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified,Dec-2011,Fully Paid,n,small_business,IL
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified,Dec-2011,Fully Paid,n,other,CA
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified,Dec-2011,Current,n,other,OR
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified,Dec-2011,Fully Paid,n,wedding,AZ
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified,Dec-2011,Current,n,debt_consolidation,NC
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified,Dec-2011,Fully Paid,n,car,CA
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified,Dec-2011,Charged Off,n,small_business,CA
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified,Dec-2011,Charged Off,n,other,TX


# change loan_status to good_loan or bad_loan for further process

In [23]:
# consider current and fully paid are good loan and others bad loan
Good_loan = ['Current' , 'Fully Paid']  # directly change good to 1 else  -1 

loans['loan_condition'] = np.nan

def loan_condition(status):
    if status in Good_loan :
        return 1
    else:
        return -1
    

In [25]:
loans['loan_condition'] = loans['loan_status'].apply(loan_condition)
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state,loan_condition
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified,Dec-2011,Fully Paid,n,credit_card,AZ,1
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified,Dec-2011,Charged Off,n,car,GA,-1
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified,Dec-2011,Fully Paid,n,small_business,IL,1
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified,Dec-2011,Fully Paid,n,other,CA,1
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified,Dec-2011,Current,n,other,OR,1
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified,Dec-2011,Fully Paid,n,wedding,AZ,1
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified,Dec-2011,Current,n,debt_consolidation,NC,1
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified,Dec-2011,Fully Paid,n,car,CA,1
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified,Dec-2011,Charged Off,n,small_business,CA,-1
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified,Dec-2011,Charged Off,n,other,TX,-1


In [26]:
loans['loan_condition'] .value_counts()

 1    809502
-1     77877
Name: loan_condition, dtype: int64

# now addr _state is changed to areas like west to east .. and then convert them into sub columns as adr.east ,adr.west .. as (1,0)

In [30]:
loans['addr_state'].value_counts().head(10)

CA    129517
NY     74086
TX     71138
FL     60935
IL     35476
NJ     33256
PA     31393
OH     29631
GA     29085
VA     26255
Name: addr_state, dtype: int64

In [31]:
loans['addr_state'].unique()

# Make a list with each of the regions by state.

west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']

In [32]:
# get the respected places into region so get region column 
loans['region'] = np.nan

def finding_regions(state):
    if state in west :
        return 'west'
    elif state in south_west :
        return 'SouthWest'
    elif state in south_east :
        return 'SouthEast'
    elif state in mid_west :
        return 'MidWest'
    elif state in north_east :
        return 'NorthEast'
    

In [33]:
loans['region'] = loans['addr_state'].apply(finding_regions)
loans['region'].value_counts()

SouthEast    214646
west         208731
NorthEast    204399
MidWest      155029
SouthWest    104574
Name: region, dtype: int64

In [41]:
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state,loan_condition,region
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified,Dec-2011,Fully Paid,n,credit_card,AZ,1,SouthWest
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified,Dec-2011,Charged Off,n,car,GA,-1,SouthEast
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified,Dec-2011,Fully Paid,n,small_business,IL,1,MidWest
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified,Dec-2011,Fully Paid,n,other,CA,1,west
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified,Dec-2011,Current,n,other,OR,1,west
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified,Dec-2011,Fully Paid,n,wedding,AZ,1,SouthWest
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified,Dec-2011,Current,n,debt_consolidation,NC,1,SouthEast
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified,Dec-2011,Fully Paid,n,car,CA,1,west
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified,Dec-2011,Charged Off,n,small_business,CA,-1,west
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified,Dec-2011,Charged Off,n,other,TX,-1,SouthWest


# now check for purpose which is the loan taken 

In [42]:
loans['purpose'].value_counts()

debt_consolidation    524215
credit_card           206182
home_improvement       51829
other                  42894
major_purchase         17277
small_business         10377
car                     8863
medical                 8540
moving                  5414
vacation                4736
house                   3707
wedding                 2347
renewable_energy         575
educational              423
Name: purpose, dtype: int64

 # now the purpose of the loan is divided into two segments like whether the loan is taken for Bussiness_purpose or Personal_purpose 

In [43]:
# consider small_business and renewable_energy and educational are taken as bussiness_purpose and others personal_purpose 

Bussiness = ['small_business' , 'renewable_energy' , 'educational' , 'debt_consolidation']

loans['loan_purpose'] = np.nan

def finding_purpose(purpose) :
    if purpose in Bussiness :
        return 'Bussiness'
    else :
        return 'Personal'
    

In [44]:
loans['loan_purpose'] = loans['purpose'].apply(finding_purpose)
loans['loan_purpose'].value_counts()

Bussiness    535590
Personal     351789
Name: loan_purpose, dtype: int64

In [45]:
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state,loan_condition,region,loan_purpose
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified,Dec-2011,Fully Paid,n,credit_card,AZ,1,SouthWest,Personal
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified,Dec-2011,Charged Off,n,car,GA,-1,SouthEast,Personal
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified,Dec-2011,Fully Paid,n,small_business,IL,1,MidWest,Bussiness
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified,Dec-2011,Fully Paid,n,other,CA,1,west,Personal
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified,Dec-2011,Current,n,other,OR,1,west,Personal
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified,Dec-2011,Fully Paid,n,wedding,AZ,1,SouthWest,Personal
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified,Dec-2011,Current,n,debt_consolidation,NC,1,SouthEast,Bussiness
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified,Dec-2011,Fully Paid,n,car,CA,1,west,Personal
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified,Dec-2011,Charged Off,n,small_business,CA,-1,west,Bussiness
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified,Dec-2011,Charged Off,n,other,TX,-1,SouthWest,Personal


# now for annual_income  
# Let's create categories for annual_income since most of the bad loans are located below 100k


In [47]:
loans['annual_income'].value_counts()

60000.00     34281
50000.00     30575
65000.00     25498
70000.00     24121
40000.00     23943
80000.00     22729
45000.00     22699
75000.00     22435
55000.00     20755
90000.00     17159
100000.00    17131
85000.00     15648
35000.00     14868
30000.00     13764
120000.00    13202
52000.00     12174
42000.00     11705
48000.00     11330
110000.00    11090
72000.00      9656
95000.00      9274
150000.00     8136
62000.00      7770
36000.00      7700
38000.00      7208
125000.00     7006
32000.00      6774
54000.00      6627
58000.00      6621
56000.00      6557
             ...  
65032.00         1
65033.00         1
65035.00         1
65037.00         1
26260.00         1
65039.00         1
49275.22         1
65041.00         1
65045.00         1
101081.00        1
25270.00         1
25269.00         1
65052.00         1
15964.00         1
36895.02         1
65056.00         1
82999.92         1
73330.67         1
65062.00         1
101073.00        1
65065.00         1
101069.00   

In [66]:
def income_category(amount):
    if amount <= 50000 :
        return 'Low'
    elif (amount >=50000) & (amount <=200000) :
        return 'Medium'
    else :
        return 'High'
    
loans['income_category'] = loans['annual_income'].apply(income_category)
loans

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,...,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state,loan_condition,region,loan_purpose,income_category
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,...,Verified,Dec-2011,Fully Paid,n,credit_card,AZ,1,SouthWest,Personal,Low
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,...,Source Verified,Dec-2011,Charged Off,n,car,GA,-1,SouthEast,Personal,Low
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,...,Not Verified,Dec-2011,Fully Paid,n,small_business,IL,1,MidWest,Bussiness,Low
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,...,Source Verified,Dec-2011,Fully Paid,n,other,CA,1,west,Personal,Low
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,...,Source Verified,Dec-2011,Current,n,other,OR,1,west,Personal,Medium
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,...,Source Verified,Dec-2011,Fully Paid,n,wedding,AZ,1,SouthWest,Personal,Low
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,...,Not Verified,Dec-2011,Current,n,debt_consolidation,NC,1,SouthEast,Bussiness,Low
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,...,Source Verified,Dec-2011,Fully Paid,n,car,CA,1,west,Personal,Low
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,...,Source Verified,Dec-2011,Charged Off,n,small_business,CA,-1,west,Bussiness,Low
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,...,Verified,Dec-2011,Charged Off,n,other,TX,-1,SouthWest,Personal,Low


In [67]:
loans['income_category'].value_counts()

Medium    579454
Low       291135
High       16790
Name: income_category, dtype: int64

In [74]:
loans.iloc[ :,:12]

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified


# now consider emp_length feature get it into 3 categories as < 4 as Bad and <7  and  >4 as medium and > 7 as good 

In [76]:
loans['emp_length'].value_counts()

10+ years    291569
2 years       78870
< 1 year      70605
3 years       70026
1 year        57095
5 years       55704
4 years       52529
7 years       44594
8 years       43955
6 years       42950
9 years       34657
Name: emp_length, dtype: int64

In [77]:
Good_length = ['10+ years' , '9 years' , '8 years' , '7 years']
Medium_length = ['4 years' , '5 years' , '6 years']


In [78]:
loans['emp_status'] = np.nan

def emp_length(length) :
    if length in Good_length :
        return 'Good'
    elif length in Medium_length :
        return 'Medium'
    else :
        return 'Bad'
    

In [79]:
loans['emp_status'] = loans['emp_length'].apply(emp_length)
loans['emp_status'].value_counts()

Good      414775
Bad       321421
Medium    151183
Name: emp_status, dtype: int64

In [80]:
loans.iloc[:, :12]

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified


 # now get interest rate into categories such low and high if interest                      is >13.3  get as high else low

In [92]:
loans['interest_rate'].value_counts().head()

10.99    34624
9.17     25720
15.61    25201
9.99     21553
7.89     20311
Name: interest_rate, dtype: int64

In [93]:
loans['interest_rate'].mean()

13.246739679437987

In [95]:
loans['interest'] = np.nan

def interest_rate(rate):
    if rate > 13.24 :
        return 'High'
    else :
        return 'Low'

In [96]:
loans['interest'] = loans['interest_rate'].apply(interest_rate)
loans['interest'].value_counts()

Low     465372
High    422007
Name: interest, dtype: int64

In [97]:
loans.iloc[:,:12]

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified


# now get investor funds details into categories as investment_high or low

In [101]:
loans['investor_funds'].value_counts()

10000.000000    56111
12000.000000    44899
15000.000000    41566
20000.000000    40188
35000.000000    29563
8000.000000     25796
5000.000000     25566
6000.000000     24569
16000.000000    20683
25000.000000    20139
18000.000000    19227
24000.000000    19128
30000.000000    14482
7000.000000     13102
28000.000000    12877
14000.000000    11796
21000.000000    10333
9000.000000     10135
4000.000000      9889
3000.000000      9228
13000.000000     6760
11000.000000     6756
9600.000000      6314
7200.000000      6018
14400.000000     5212
17000.000000     5168
2000.000000      4945
7500.000000      4853
22000.000000     4596
4800.000000      4114
                ...  
11886.095098        1
13644.503996        1
3949.809998         1
5891.520812         1
6307.966092         1
814.283898          1
31878.539014        1
4987.018686         1
7540.076422         1
8933.600000         1
1549.998789         1
13518.257268        1
7021.989688         1
4249.999890         1
7849.18630

In [102]:
loans['investor_funds'].mean()

14702.46438322972

In [103]:
loans['investment'] = np.nan

def investor_funds(funds):
    if funds >= 15000 :
        return 'High'
    else:
        return 'Low'

In [104]:
loans['investment'] = loans['investor_funds'].apply(investor_funds)
loans['investment'].value_counts()

Low     495377
High    392002
Name: investment, dtype: int64

In [106]:
loans.iloc[:,:12]

Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_income,verification_status
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.00,Verified
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.00,Source Verified
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.00,Not Verified
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.00,Source Verified
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.00,Source Verified
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT,36000.00,Source Verified
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT,47004.00,Not Verified
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT,48000.00,Source Verified
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN,40000.00,Source Verified
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT,15000.00,Verified


# loan_amount is get into categories as low medium and high

In [107]:
loans['loan_amount'].mean()

14755.26460508982

In [111]:
loans['loan_amount'].value_counts().head(10)

10000.0    61837
12000.0    50183
15000.0    47210
20000.0    46932
35000.0    36368
8000.0     27870
5000.0     27167
6000.0     26207
25000.0    24125
16000.0    23708
Name: loan_amount, dtype: int64

In [112]:
loans['loan_range'] = np.nan

def loan_amount(amount):
    if amount < 10000 :
        return 'Low'
    elif (amount > 10000) & (amount < 20000) :
        return 'Medium'
    else :
        return 'High'

In [113]:
loans['loan_range'] = loans['loan_amount'].apply(loan_amount)
loans['loan_range'].value_counts()

High      309606
Medium    304030
Low       273743
Name: loan_range, dtype: int64

In [115]:
loans.iloc[:,:10]


Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,home_ownership
0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,RENT
1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,RENT
2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,RENT
3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,RENT
4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,RENT
5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,RENT
6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,RENT
7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,RENT
8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,OWN
9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,RENT


# now get data into one hot encoding of all feaures using get_dummies()

In [117]:
features = ['term', 'grade', 'home_ownership', 
            'region','loan_purpose','income_category',
            'emp_status','interest','investment','loan_range']

In [132]:
for feature in features:
    loans = pd.concat([loans , pd.get_dummies(loans[feature] , prefix = feature)] ,axis =1)


In [133]:
loans.iloc[:,13:25]

Unnamed: 0,loan_status,pymnt_plan,purpose,addr_state,loan_condition,region,loan_purpose,income_category,emp_status,interest,investment,loan_range
0,Fully Paid,n,credit_card,AZ,1,SouthWest,Personal,Low,Good,Low,Low,Low
1,Charged Off,n,car,GA,-1,SouthEast,Personal,Low,Bad,High,Low,Low
2,Fully Paid,n,small_business,IL,1,MidWest,Bussiness,Low,Good,High,Low,Low
3,Fully Paid,n,other,CA,1,west,Personal,Low,Good,High,Low,High
4,Current,n,other,OR,1,west,Personal,Medium,Bad,Low,Low,Low
5,Fully Paid,n,wedding,AZ,1,SouthWest,Personal,Low,Bad,Low,Low,Low
6,Current,n,debt_consolidation,NC,1,SouthEast,Bussiness,Low,Good,High,Low,Low
7,Fully Paid,n,car,CA,1,west,Personal,Low,Good,High,Low,Low
8,Charged Off,n,small_business,CA,-1,west,Bussiness,Low,Medium,High,Low,Low
9,Charged Off,n,other,TX,-1,SouthWest,Personal,Low,Bad,Low,Low,Low


In [134]:
loans.columns

Index(['loan_amount', 'funded_amount', 'investor_funds', 'term',
       'interest_rate', 'installment', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'annual_income', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'purpose', 'addr_state', 'loan_condition',
       'region', 'loan_purpose', 'income_category', 'emp_status', 'interest',
       'investment', 'loan_range', 'term_ 36 months', 'term_ 60 months',
       'grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F',
       'grade_G', 'home_ownership_ANY', 'home_ownership_MORTGAGE',
       'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'region_MidWest', 'region_NorthEast',
       'region_SouthEast', 'region_SouthWest', 'region_west',
       'loan_purpose_Bussiness', 'loan_purpose_Personal',
       'income_category_High', 'income_category_Low', 'income_category_Medium',
       'emp_status_Bad', 'emp_status_Good', 'emp_status_Medium',
     

In [135]:
loans['loan_condition_int'] = np.nan
loans['loan_condition_int'] = loans['loan_condition'] 

In [137]:
loans['loan_condition_int'].value_counts()

 1    809502
-1     77877
Name: loan_condition_int, dtype: int64

In [143]:
loans.to_csv('lending_loans.csv', sep=',')


# from here the actcual process of decision tree classifier model is developed

In [2]:
loans = pd.read_csv('lending_loans.csv')
loans

Unnamed: 0.1,Unnamed: 0,loan_amount,funded_amount,investor_funds,term,interest_rate,installment,grade,sub_grade,emp_length,...,emp_status_Good,emp_status_Medium,interest_High,interest_Low,investment_High,investment_Low,loan_range_High,loan_range_Low,loan_range_Medium,loan_condition_int
0,0,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,10+ years,...,1,0,0,1,0,1,0,1,0,1
1,1,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,< 1 year,...,0,0,1,0,0,1,0,1,0,-1
2,2,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,10+ years,...,1,0,1,0,0,1,0,1,0,1
3,3,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,10+ years,...,1,0,1,0,0,1,1,0,0,1
4,4,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,1 year,...,0,0,0,1,0,1,0,1,0,1
5,5,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,3 years,...,0,0,0,1,0,1,0,1,0,1
6,6,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,8 years,...,1,0,1,0,0,1,0,1,0,1
7,7,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,9 years,...,1,0,1,0,0,1,0,1,0,1
8,8,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,4 years,...,0,1,1,0,0,1,0,1,0,-1
9,9,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,< 1 year,...,0,0,0,1,0,1,0,1,0,-1


In [3]:
loans.columns

Index(['Unnamed: 0', 'loan_amount', 'funded_amount', 'investor_funds', 'term',
       'interest_rate', 'installment', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'annual_income', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'purpose', 'addr_state', 'loan_condition',
       'region', 'loan_purpose', 'income_category', 'emp_status', 'interest',
       'investment', 'loan_range', 'term_ 36 months', 'term_ 60 months',
       'grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F',
       'grade_G', 'home_ownership_ANY', 'home_ownership_MORTGAGE',
       'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'region_MidWest', 'region_NorthEast',
       'region_SouthEast', 'region_SouthWest', 'region_west',
       'loan_purpose_Bussiness', 'loan_purpose_Personal',
       'income_category_High', 'income_category_Low', 'income_category_Medium',
       'emp_status_Bad', 'emp_status_Good', 'emp_status_

In [4]:
df =loans.iloc[:,26:61]
df

Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,home_ownership_ANY,...,emp_status_Bad,emp_status_Good,emp_status_Medium,interest_High,interest_Low,investment_High,investment_Low,loan_range_High,loan_range_Low,loan_range_Medium
0,1,0,0,1,0,0,0,0,0,0,...,0,1,0,0,1,0,1,0,1,0
1,0,1,0,0,1,0,0,0,0,0,...,1,0,0,1,0,0,1,0,1,0
2,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
3,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,1,0,0
4,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0
5,1,0,1,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0
6,0,1,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
7,1,0,0,0,0,0,1,0,0,0,...,0,1,0,1,0,0,1,0,1,0
8,0,1,0,0,0,0,0,1,0,0,...,0,0,1,1,0,0,1,0,1,0
9,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0


In [5]:
df

Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,home_ownership_ANY,...,emp_status_Bad,emp_status_Good,emp_status_Medium,interest_High,interest_Low,investment_High,investment_Low,loan_range_High,loan_range_Low,loan_range_Medium
0,1,0,0,1,0,0,0,0,0,0,...,0,1,0,0,1,0,1,0,1,0
1,0,1,0,0,1,0,0,0,0,0,...,1,0,0,1,0,0,1,0,1,0
2,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
3,1,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,1,0,0
4,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0
5,1,0,1,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0
6,0,1,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
7,1,0,0,0,0,0,1,0,0,0,...,0,1,0,1,0,0,1,0,1,0
8,0,1,0,0,0,0,0,1,0,0,...,0,0,1,1,0,0,1,0,1,0
9,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,1,0


In [6]:
# upto 26 columns remove them because its raw data  rest are converted data as (0  &  1)

loans = loans.iloc[:,26:62]
loans

# here the end of normal process next step is down at data split and further process  

Like the previous assignment, we reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.

Unlike the previous assignment where we used several features, in this assignment, we will just be using 4 categorical
features: 

1. grade of the loan 
2. the length of the loan term
3. the home ownership status: own, mortgage, rent
4. number of years of employment.

Since we are building a binary decision tree, we will have to convert these categorical features to a binary representation in a subsequent section using 1-hot encoding.

In [4]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

Let's explore what the dataset looks like.

In [5]:
s=loans[:10]
s

grade,term,home_ownership,emp_length,safe_loans
B,36 months,RENT,10+ years,1
C,60 months,RENT,< 1 year,-1
C,36 months,RENT,10+ years,1
C,36 months,RENT,10+ years,1
A,36 months,RENT,3 years,1
E,36 months,RENT,9 years,1
F,60 months,OWN,4 years,-1
B,60 months,RENT,< 1 year,-1
C,60 months,OWN,5 years,1
B,36 months,OWN,10+ years,1


## Subsample dataset to make sure classes are balanced

Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use `seed=1` so everyone gets the same results.

In [6]:
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are less risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

Percentage of safe loans                 : 0.502236174422
Percentage of risky loans                : 0.497763825578
Total number of loans in our new dataset : 46508


In [14]:
print(safe_loans, risky_loans[10:], loans_data)

(Columns:
	grade	str
	term	str
	home_ownership	str
	emp_length	str
	safe_loans	int

Rows: 23358

Data:
+-------+------------+----------------+------------+------------+
| grade |    term    | home_ownership | emp_length | safe_loans |
+-------+------------+----------------+------------+------------+
|   B   |  36 months |      OWN       | 10+ years  |     1      |
|   B   |  36 months |    MORTGAGE    |  2 years   |     1      |
|   B   |  36 months |      RENT      |  < 1 year  |     1      |
|   A   |  36 months |    MORTGAGE    |  5 years   |     1      |
|   C   |  36 months |      RENT      |  7 years   |     1      |
|   B   |  36 months |      RENT      | 10+ years  |     1      |
|   B   |  36 months |      RENT      |   1 year   |     1      |
|   A   |  36 months |      RENT      |  5 years   |     1      |
|   B   |  36 months |      RENT      |   1 year   |     1      |
|   B   |  36 months |      RENT      |  2 years   |     1      |
+-------+------------+----------------+

**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in "[Learning from Imbalanced Data](http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf)" by Haibo He and Edwardo A. Garcia, *IEEE Transactions on Knowledge and Data Engineering* **21**(9) (June 26, 2009), p. 1263–1284. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

## Transform categorical data into binary features

In this assignment, we will implement **binary decision trees** (decision trees for binary features, a specific case of categorical variables taking on two values, e.g., true/false). Since all of our features are currently categorical features, we want to turn them into binary features. 

For instance, the **home_ownership** feature represents the home ownership status of the loanee, which is either `own`, `mortgage` or `rent`. For example, if a data point has the feature 
```
   {'home_ownership': 'RENT'}
```
we want to turn this into three features: 
```
 { 
   'home_ownership = OWN'      : 0, 
   'home_ownership = MORTGAGE' : 0, 
   'home_ownership = RENT'     : 1
 }
```

Since this code requires a few Python and GraphLab tricks, feel free to use this block of code as is. Refer to the API documentation for a deeper understanding.

In [13]:
feature='grade'
loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})    
print(loans_data_one_hot_encoded)
#here each column name is taken then its prefixes are used as column names replacing them( grade as grade .a grade.B ,,,)
loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
print loans_data_unpacked
loans_data_unpacked['grade.A'].fillna(0) #here column A will repace none with 0 

[{'C': 1L}, {'F': 1L}, {'B': 1L}, {'C': 1L}, {'B': 1L}, {'B': 1L}, {'B': 1L}, {'C': 1L}, {'D': 1L}, {'A': 1L}, {'B': 1L}, {'C': 1L}, {'E': 1L}, {'D': 1L}, {'F': 1L}, {'D': 1L}, {'D': 1L}, {'B': 1L}, {'D': 1L}, {'D': 1L}, {'B': 1L}, {'E': 1L}, {'C': 1L}, {'A': 1L}, {'B': 1L}, {'B': 1L}, {'C': 1L}, {'C': 1L}, {'A': 1L}, {'C': 1L}, {'D': 1L}, {'F': 1L}, {'B': 1L}, {'D': 1L}, {'D': 1L}, {'B': 1L}, {'C': 1L}, {'B': 1L}, {'D': 1L}, {'C': 1L}, {'B': 1L}, {'D': 1L}, {'B': 1L}, {'B': 1L}, {'B': 1L}, {'B': 1L}, {'A': 1L}, {'C': 1L}, {'B': 1L}, {'A': 1L}, {'D': 1L}, {'B': 1L}, {'D': 1L}, {'E': 1L}, {'E': 1L}, {'C': 1L}, {'B': 1L}, {'B': 1L}, {'B': 1L}, {'A': 1L}, {'B': 1L}, {'B': 1L}, {'A': 1L}, {'B': 1L}, {'A': 1L}, {'B': 1L}, {'B': 1L}, {'A': 1L}, {'E': 1L}, {'C': 1L}, {'C': 1L}, {'C': 1L}, {'B': 1L}, {'D': 1L}, {'B': 1L}, {'D': 1L}, {'D': 1L}, {'A': 1L}, {'C': 1L}, {'A': 1L}, {'E': 1L}, {'C': 1L}, {'B': 1L}, {'C': 1L}, {'B': 1L}, {'E': 1L}, {'A': 1L}, {'C': 1L}, {'E': 1L}, {'D': 1L}, {'D': 1L}

dtype: int
Rows: 46508
[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, ... ]

In [15]:
loans_data = risky_loans.append(safe_loans)
for feature in features:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})    
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    
    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

Let's see what the feature columns look like now:

In [21]:
features = loans_data.column_names()
features.remove('safe_loans')  # Remove the response variable
print features
loans_data

['grade.A', 'grade.B', 'grade.C', 'grade.D', 'grade.E', 'grade.F', 'grade.G', 'term. 36 months', 'term. 60 months', 'home_ownership.MORTGAGE', 'home_ownership.OTHER', 'home_ownership.OWN', 'home_ownership.RENT', 'emp_length.1 year', 'emp_length.10+ years', 'emp_length.2 years', 'emp_length.3 years', 'emp_length.4 years', 'emp_length.5 years', 'emp_length.6 years', 'emp_length.7 years', 'emp_length.8 years', 'emp_length.9 years', 'emp_length.< 1 year', 'emp_length.n/a']


safe_loans,grade.A,grade.B,grade.C,grade.D,grade.E,grade.F,grade.G,term. 36 months,term. 60 months
-1,0,0,1,0,0,0,0,0,1
-1,0,0,0,0,0,1,0,0,1
-1,0,1,0,0,0,0,0,0,1
-1,0,0,1,0,0,0,0,1,0
-1,0,1,0,0,0,0,0,1,0
-1,0,1,0,0,0,0,0,1,0
-1,0,1,0,0,0,0,0,1,0
-1,0,0,1,0,0,0,0,1,0
-1,0,0,0,1,0,0,0,0,1
-1,1,0,0,0,0,0,0,1,0

home_ownership.MORTGAGE,home_ownership.OTHER,home_ownership.OWN,home_ownership.RENT,emp_length.1 year,emp_length.10+ years
0,0,0,1,0,0
0,0,1,0,0,0
0,0,0,1,0,0
0,0,0,1,0,0
0,0,0,1,0,0
0,0,0,1,0,1
0,0,0,1,1,0
0,0,0,1,0,0
0,0,0,1,0,0
1,0,0,0,0,1

emp_length.2 years,emp_length.3 years,emp_length.4 years,emp_length.5 years,emp_length.6 years,emp_length.7 years
0,0,0,0,0,0
0,0,1,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,1,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
0,0,0,0,0,0
1,0,0,0,0,0
0,0,0,0,0,0

emp_length.8 years,emp_length.9 years,emp_length.< 1 year,emp_length.n/a
0,0,1,0
0,0,0,0
0,0,1,0
0,0,1,0
0,0,0,0
0,0,0,0
0,0,0,0
0,1,0,0
0,0,0,0
0,0,0,0


In [22]:
print "Number of features (after binarizing categorical variables) = %s" % len(features)

Number of features (after binarizing categorical variables) = 25


Let's explore what one of these columns looks like:

In [23]:
loans_data['grade.A']

dtype: int
Rows: 46508
[0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, ... ]

This column is set to 1 if the loan grade is A and 0 otherwise.

**Checkpoint:** Make sure the following answers match up.

In [24]:
print "Total number of grade.A loans : %s" % loans_data['grade.A'].sum()
print "Expexted answer               : 6422"

Total number of grade.A loans : 6422
Expexted answer               : 6422


## Train-test split

We split the data into a train test split with 80% of the data in the training set and 20% of the data in the test set. We use `seed=1` so that everyone gets the same result.

In [7]:
x = loans
y = loans.iloc[:,61:62]
print(x.shape ,y.shape)

In [8]:
from sklearn.model_selection import train_test_split


In [9]:
x_train ,x_test ,y_train ,y_test = train_test_split(x ,y ,test_size= 0.2 ,random_state = 40)

In [10]:
print(x_train.shape ,y_train.shape) 
print(x_test.shape,y_test.shape) 

(709903, 36) (709903, 0)
(177476, 36) (177476, 0)


In [11]:
x_train['loan_condition_int'].value_counts()

 1    647559
-1     62344
Name: loan_condition_int, dtype: int64

# Decision tree implementation

In this section, we will implement binary decision trees from scratch. There are several steps involved in building a decision tree. For that reason, we have split the entire assignment into several sections.

## Function to count number of mistakes while predicting majority class

Recall from the lecture that prediction at an intermediate node works by predicting the **majority class** for all data points that belong to this node.

Now, we will write a function that calculates the number of **missclassified examples** when predicting the **majority class**. This will be used to help determine which feature is the best to split on at a given node of the tree.

**Note**: Keep in mind that in order to compute the number of mistakes for a majority classifier, we only need the label (y values) of the data points in the node. 

** Steps to follow **:
* ** Step 1:** Calculate the number of safe loans and risky loans.
* ** Step 2:** Since we are assuming majority class prediction, all the data points that are **not** in the majority class are considered **mistakes**.
* ** Step 3:** Return the number of **mistakes**.


Now, let us write the function `intermediate_node_num_mistakes` which computes the number of misclassified examples of an intermediate node given the set of labels (y values) of the data points contained in the node. Fill in the places where you find `## YOUR CODE HERE`. There are **three** places in this function for you to fill in.

In [12]:
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0
      
    # Count the number of 1's (safe loans)
    safe_loans_count = sum(labels_in_node == +1)
    #print('safe_loans_count ::' , safe_loans_count)
    
    # Count the number of -1's (risky loans)
    risky_loans_count = sum(labels_in_node == -1)
    #print('risky_loans_count ::', risky_loans_count)
    
    # Return the number of mistakes that the majority classifier makes.
    return min(safe_loans_count, risky_loans_count) 
#here if safe has higher no then risky then its min is risky and viseversa  

Because there are several steps in this assignment, we have introduced some stopping points where you can check your code and make sure it is correct before proceeding. To test your `intermediate_node_num_mistakes` function, run the following code until you get a **Test passed!**, then you should proceed. Otherwise, you should spend some time figuring out where things went wrong.

# here decision_tree with gini index and impurity model is developing

In [17]:
# here getting the probalility of the each feature with respect to output values using------> (log2)

safe_len = 5
risky_len = 9

total_len = safe_len + risky_len
safe      = (safe_len /total_len)
risky     = (risky_len /total_len)   
prob = - ( (safe)* (np.log2(safe)) ) - ( (risky) * (np.log2(risky)) )
prob

0.9402859586706311

In [44]:
def intermediate_node_num_probability(labels_in_node):
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0
      
    # Count the number of 1's (safe loans)
    safe_len = sum(labels_in_node == +1)
    #print('safe_len ::' , safe_len)
    
    # Count the number of -1's (risky loans)
    risky_len = sum(labels_in_node == -1)
    #print('risky_len ::', risky_len)
    
    total_len = safe_len + risky_len
    safe      = (safe_len /total_len)
    risky     = (risky_len /total_len)  
    
    probability = - ( (safe)* (np.log(safe)) ) - ( (risky) * (np.log(risky)) )
     
    
    # Return the probability that the majority classifier makes.
    return  probability.round(4) # rounded value to four digits


In [30]:
lab = np.array([-1, -1, 1, 1, 1])
lab

array([-1, -1,  1,  1,  1])

In [31]:
# Test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 1 failed... try again!')

# Test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 2 failed... try again!')
    
# Test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 3 failed... try again!')

Test passed!
Test passed!
Test passed!


# here testing intermediate_node_num_probability 

In [45]:
#example_labels = np.array([-1, -1, 1, 1, 1])
#example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])

intermediate_node_num_probability(example_labels)

0.5983

In [46]:
# Test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
if intermediate_node_num_probability(example_labels) ==0.673 :
    print ('Test passed!')
else:
    print ('Test 1 failed... try again!')

# Test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_probability(example_labels) == 0.5983:
    print ('Test passed!')
else:
    print ('Test 2 failed... try again!')
    
# Test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_probability(example_labels) == 0.5983:
    print ('Test passed!')
else:
    print ('Test 3 failed... try again!')

Test passed!
Test passed!
Test passed!


## Function to pick best feature to split on

The function **best_splitting_feature** takes 3 arguments: 
1. The data (SFrame of data which includes all of the feature columns and label column)
2. The features to consider for splits (a list of strings of column names to consider for splits)
3. The name of the target/label column (string)

The function will loop through the list of possible features, and consider splitting on each of them. It will calculate the classification error of each split and return the feature that had the smallest classification error when split on.

Recall that the **classification error** is defined as follows:
$$
\mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}}
$$

Follow these steps: 
* **Step 1:** Loop over each feature in the feature list
* **Step 2:** Within the loop, split the data into two groups: one group where all of the data has feature value 0 or False (we will call this the **left** split), and one group where all of the data has feature value 1 or True (we will call this the **right** split). Make sure the **left** split corresponds with 0 and the **right** split corresponds with 1 to ensure your implementation fits with our implementation of the tree building process.
* **Step 3:** Calculate the number of misclassified examples in both groups of data and use the above formula to compute the **classification error**.
* **Step 4:** If the computed error is smaller than the best error found so far, store this **feature and its error**.

This may seem like a lot, but we have provided pseudocode in the comments in order to help you implement the function correctly.

**Note:** Remember that since we are only dealing with binary features, we do not have to consider thresholds for real-valued features. This makes the implementation of this function much easier.

Fill in the places where you find `## YOUR CODE HERE`. There are **five** places in this function for you to fill in.

In [40]:
s=x_train.columns[1:3]
print(s)
for i in s:
    print(i)

Index(['term_ 60 months', 'grade_A'], dtype='object')
term_ 60 months
grade_A


In [41]:
#feature = 'grade_A'
for feature in s:
    l_split = x_test[x_test[feature] == 0]
    print(l_split)

        term_ 36 months  term_ 60 months  grade_A  grade_B  grade_C  grade_D  \
733358                1                0        0        1        0        0   
336299                1                0        0        0        1        0   
806382                1                0        1        0        0        0   
17265                 1                0        1        0        0        0   
585889                1                0        0        0        1        0   
247519                1                0        0        0        1        0   
836102                1                0        1        0        0        0   
223391                1                0        0        1        0        0   
324405                1                0        0        0        0        1   
834239                1                0        0        0        1        0   
479679                1                0        0        0        1        0   
533958                1                0

        term_ 36 months  term_ 60 months  grade_A  grade_B  grade_C  grade_D  \
733358                1                0        0        1        0        0   
336299                1                0        0        0        1        0   
585889                1                0        0        0        1        0   
393268                0                1        0        1        0        0   
370528                0                1        0        0        0        1   
247519                1                0        0        0        1        0   
694936                0                1        0        1        0        0   
223391                1                0        0        1        0        0   
324405                1                0        0        0        0        1   
834239                1                0        0        0        1        0   
479679                1                0        0        0        1        0   
218516                0                1

In [42]:
target = 'loan_condition_int'
l_mist = intermediate_node_num_mistakes(l_split[target])            
l_mist

14508

# checking with probability 

In [47]:
l_prob = intermediate_node_num_probability(l_split[target])            
l_prob

0.3214

# best_feature_splitting with error technique

In [48]:

def best_splitting_feature(data, features, target):
    
    best_feature = None # Keep track of the best feature 
    best_error = 10     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should intialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        #print ('left_split_len ::' , len(left_split) )               # here all features check which point has  0/1 , and separte them into left and right 
        
        #print (left_split['safe_loans'])
        # The right split will have all data points where the feature value is 1
        right_split = data[data[feature] == 1]
        #print ("right_split_len ::" , len(right_split) )
                           
        #left contains -1 values of safe_loans right as +1 
        #thus in left +1 becomes minority(mistakes) and in right -1 becomes min 
        
        left_mistakes = intermediate_node_num_mistakes(left_split[target])            
        #print ('left_mistakes ::',  left_mistakes)
        # Calculate the number of misclassified examples in the right split.
        right_mistakes = intermediate_node_num_mistakes(right_split[target])
        #print ('right_mistakes ::',  right_mistakes)    
        # Compute the classification error of this split.
        error = (left_mistakes + right_mistakes)/num_data_points
        #print ('error ::' ,error)
        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        if error < best_error:
            best_error = error
            best_feature = feature
    
    return best_feature # Return the best feature we found

# here best_feature_spliting with probability 

In [50]:
# l_split is above is also done
output_prob = intermediate_node_num_probability(l_split[target])            
output_prob

0.3214

In [75]:

def prob_splitting_feature(data, features, target):
    
    best_feature = None # Keep track of the best feature 
    best_error = 0     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should intialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    
    # here  output probability is getting because it is used in information_gain 
    #where it is subtracted from each feature _probab i.e (ouput_prob  - _indiviual_feature_prob)
    ouput_prob = intermediate_node_num_probability(data[target])            
    #print(ouput_prob)
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        len_left   = len(left_split)
        #print ('left_len ::' , len_left ) 
        # here all features check which point has  0/1 , and separte them into left and right 
        
        #print (left_split['safe_loans'])
        # The right split will have all data points where the feature value is 1
        right_split = data[data[feature] == 1]
        len_right   = len(right_split)
        #print ("right_len ::" , len_right )
                           
        #left contains -1 values of safe_loans right as +1 
        #here getting probability of above splits for futher process 
        
        prob_left = intermediate_node_num_probability(left_split[target])            
        #print ('left_prob ::',  prob_left)
        # Calculate the number of misclassified examples in the right split.
        prob_right = intermediate_node_num_probability(right_split[target])
        #print ('right_prob ::',  prob_right)    
        
        len_total = len_left + len_right 
        left  = len_left/len_total 
        #print('left::',left)
        right = len_right/len_total
        #print('right::',right)
        
        information_feature = (left * prob_left) + (right * prob_right)
        #print(information_feature)
        
        # Compute the classification error of this split.
        information_gain = ouput_prob - information_feature
        #print('information_gain:::' , information_gain)
        #error = (left_mistakes + right_mistakes)/num_data_points
        #print ('error ::' ,error)
        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        if  information_gain > best_error:
            best_gain    = information_gain
            best_feature = feature
    
    return best_feature # Return the best feature we found

In [69]:
data =l_split
features = x_train.columns[2:35]
target = 'loan_condition_int'
best_feat = prob_splitting_feature(data, features, target)
best_feat

0.3214
information_gain::: 0.0
information_gain::: 0.0038119876938089714
information_gain::: 0.00024076493548741418
information_gain::: 0.0009880570313347548
information_gain::: 0.0013012827984387187
information_gain::: 0.001979948227257944
information_gain::: 0.0006335818063537002
information_gain::: 0.0
information_gain::: 0.0005233404261086871
information_gain::: -9.505448335722644e-06
information_gain::: 4.141751599262555e-05
information_gain::: 6.072251436628484e-05
information_gain::: 0.0005853518377968436
information_gain::: 6.941952184758016e-05
information_gain::: 6.0574108207755994e-05
information_gain::: 1.8480971484324815e-05
information_gain::: 8.63385015721918e-05
information_gain::: 6.177626585712748e-05
information_gain::: 0.00012168491813946414
information_gain::: 0.00012168491813946414
information_gain::: 4.9441613357958936e-05
information_gain::: 0.0007710621543966512
information_gain::: 0.0006795287596227384
information_gain::: 0.0001644909465466915
information_gain

'loan_range_Medium'

# test best_splitting_feature 

To test your `best_splitting_feature` function, run the following code:

In [70]:
features = x_train.columns[2:35]
target = 'loan_condition_int'
print (features)
print(target)

Index(['grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F',
       'grade_G', 'home_ownership_ANY', 'home_ownership_MORTGAGE',
       'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'region_MidWest', 'region_NorthEast',
       'region_SouthEast', 'region_SouthWest', 'region_west',
       'loan_purpose_Bussiness', 'loan_purpose_Personal',
       'income_category_High', 'income_category_Low', 'income_category_Medium',
       'emp_status_Bad', 'emp_status_Good', 'emp_status_Medium',
       'interest_High', 'interest_Low', 'investment_High', 'investment_Low',
       'loan_range_High', 'loan_range_Low', 'loan_range_Medium'],
      dtype='object')
loan_condition_int


In [71]:
x1 = x_test[:1000]
x1_feat = x1.columns
for i in x1_feat:
    print(x1[i].value_counts())

1    696
0    304
Name: term_ 36 months, dtype: int64
0    696
1    304
Name: term_ 60 months, dtype: int64
0    842
1    158
Name: grade_A, dtype: int64
0    702
1    298
Name: grade_B, dtype: int64
0    714
1    286
Name: grade_C, dtype: int64
0    864
1    136
Name: grade_D, dtype: int64
0    914
1     86
Name: grade_E, dtype: int64
0    974
1     26
Name: grade_F, dtype: int64
0    990
1     10
Name: grade_G, dtype: int64
0    1000
Name: home_ownership_ANY, dtype: int64
1    504
0    496
Name: home_ownership_MORTGAGE, dtype: int64
0    1000
Name: home_ownership_NONE, dtype: int64
0    1000
Name: home_ownership_OTHER, dtype: int64
0    906
1     94
Name: home_ownership_OWN, dtype: int64
0    598
1    402
Name: home_ownership_RENT, dtype: int64
0    842
1    158
Name: region_MidWest, dtype: int64
0    756
1    244
Name: region_NorthEast, dtype: int64
0    752
1    248
Name: region_SouthEast, dtype: int64
0    867
1    133
Name: region_SouthWest, dtype: int64
0    783
1    217
Name: r

In [72]:
if best_splitting_feature(x_test, features, target) == 'grade_A':
    print ('Test passed!')
else:
    print ('Test failed... try again!')

Test passed!


# here testing prob_splitting_feature with x_test 


In [74]:
best_feature_prob = prob_splitting_feature(x_test, features, target)
best_feature_prob

0.2968


'loan_range_Medium'

# now get on train_data get best_splitting feature

In [76]:
best_feature=  best_splitting_feature(x_train, features, target) 
best_feature

'grade_A'

# here get on x_train for best_feature_prob

In [77]:
best_feature_prob1 = prob_splitting_feature(x_train, features, target)
best_feature_prob1



'loan_range_Medium'

## Building the tree

With the above functions implemented correctly, we are now ready to build our decision tree. Each node in the decision tree is represented as a dictionary which contains the following keys and possible values:

    { 
       'is_leaf'            : True/False.
       'prediction'         : Prediction at the leaf node.
       'left'               : (dictionary corresponding to the left tree).
       'right'              : (dictionary corresponding to the right tree).
       'splitting_feature'  : The feature that this node splits on.
    }

First, we will write a function that creates a leaf node given a set of target values. Fill in the places where you find `## YOUR CODE HERE`. There are **three** places in this function for you to fill in.

# here checking the value of prediction with respect to create_leaf using below 

In [78]:
#this is checking what is happening  (below one) 
target = 'loan_condition_int'
target_values=x_test[target]
num_ones = len(target_values[target_values == +1])
num_minus_ones = len(target_values[target_values == -1])
print (num_ones ,num_minus_ones)

leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True    }   
    
if num_ones > num_minus_ones:
    leaf['prediction'] =   +1       ## YOUR CODE HERE
else:
    leaf['prediction'] = -1
leaf


161943 15533


{'splitting_feature': None,
 'left': None,
 'right': None,
 'is_leaf': True,
 'prediction': 1}

In [79]:
def create_leaf(target_values , split_feature):
    
    # Create a leaf node
    leaf = {'splitting_feature' : split_feature,
            'left' : None,
            'right' : None,
            'is_leaf': True    }   ## YOUR CODE HERE
    
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    print(num_ones)
    num_minus_ones = len(target_values[target_values == -1])
    print(num_minus_ones)
    
    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] =   +1       ## YOUR CODE HERE
    else:
        leaf['prediction'] =   -1       ## YOUR CODE HERE
        
    # Return the leaf node        
    return leaf 

We have provided a function that learns the decision tree recursively and implements 3 stopping conditions:
1. **Stopping condition 1:** All data points in a node are from the same class.
2. **Stopping condition 2:** No more features to split on.
3. **Additional stopping condition:** In addition to the above two stopping conditions covered in lecture, in this assignment we will also consider a stopping condition based on the **max_depth** of the tree. By not letting the tree grow too deep, we will save computational effort in the learning process. 

Now, we will write down the skeleton of the learning algorithm. Fill in the places where you find `## YOUR CODE HERE`. There are **seven** places in this function for you to fill in.

In [80]:
# as features list is array type it can't get directly checked ==0 ,by using .any()
# or get len(features) to test whether it is ==0 or not

feats = x_test.columns
rem_features = feats[:] # Make a copy of the features.
#rem_feat = len(rem_features)
#if rem_features.any()==0 :
if  len(rem_features) == 0:
    print('ok')
else:
    print('no')
        

no


In [74]:
feats = feats.drop(['grade_A']  )
feats

Index(['term_ 36 months', 'term_ 60 months', 'grade_B', 'grade_C', 'grade_D',
       'grade_E', 'grade_F', 'grade_G', 'home_ownership_ANY',
       'home_ownership_MORTGAGE', 'home_ownership_NONE',
       'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT',
       'region_MidWest', 'region_NorthEast', 'region_SouthEast',
       'region_SouthWest', 'region_west', 'loan_purpose_Bussiness',
       'loan_purpose_Personal', 'income_category_High', 'income_category_Low',
       'income_category_Medium', 'emp_status_Bad', 'emp_status_Good',
       'emp_status_Medium', 'interest_High', 'interest_Low', 'investment_High',
       'investment_Low', 'loan_range_High', 'loan_range_Low',
       'loan_range_Medium', 'loan_condition_int'],
      dtype='object')

In [81]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    splitting_feature = None
    
    target_values = data[target]
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    
    # Stopping condition 1
    # (Check if there are mistakes at current node.
    # Recall you wrote a function intermediate_node_num_mistakes to compute this.)
    if intermediate_node_num_mistakes(target_values) == 0:  ## YOUR CODE HERE
        print("Stopping condition 1 reached.")
        # If not mistakes at current node, make current node a leaf node
        #return create_leaf(target_values)
        return create_leaf(target_values ,splitting_feature)
    
    # Stopping condition 2 (check if there are remaining features to consider splitting on)
    if len(remaining_features) == 0:   ## YOUR CODE HERE
        print("Stopping condition 2 reached.")
        # If there are no remaining features to consider, make current node a leaf node
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(target_values)    
    
    # Additional stopping condition (limit tree depth)
    if current_depth >= max_depth:  ## YOUR CODE HERE
        print("Reached maximum depth. Stopping for now.")
        # If the max tree depth has been reached, make current node a leaf node
        #return create_leaf(target_values)
        return create_leaf(target_values ,splitting_feature)

    # Find the best splitting feature (recall the function best_splitting_feature implemented above)
    ## YOUR CODE HERE
    splitting_feature = best_splitting_feature(data, features, target)
    
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    #print left_split
    right_split = data[data[splitting_feature] == 1]     
    #print right_split
    #remaining_features.remove(splitting_feature)
    remaining_features = remaining_features.drop([splitting_feature])
    
    print("Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print("Creating leaf node.")
        ## YOUR CODE HERE
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(right_split[target])
    
    
    current_depth = current_depth + 1
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth , max_depth)        
    ## YOUR CODE HERE
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth , max_depth)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

In [82]:
fea = x_test.columns[2:35]
print(fea)
target = 'loan_condition_int'
print(target)
small_data_tree = decision_tree_create(x_test, fea, target, max_depth = 3)
small_data_tree

Index(['grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F',
       'grade_G', 'home_ownership_ANY', 'home_ownership_MORTGAGE',
       'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'region_MidWest', 'region_NorthEast',
       'region_SouthEast', 'region_SouthWest', 'region_west',
       'loan_purpose_Bussiness', 'loan_purpose_Personal',
       'income_category_High', 'income_category_Low', 'income_category_Medium',
       'emp_status_Bad', 'emp_status_Good', 'emp_status_Medium',
       'interest_High', 'interest_Low', 'investment_High', 'investment_Low',
       'loan_range_High', 'loan_range_Low', 'loan_range_Medium'],
      dtype='object')
loan_condition_int
--------------------------------------------------------------------
Subtree, depth = 0 (177476 data points).
Split on feature grade_A. (147568, 29908)
--------------------------------------------------------------------
Subtree, depth = 1 (147568 data points).
Split

{'is_leaf': False,
 'prediction': None,
 'splitting_feature': 'grade_A',
 'left': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'grade_B',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'grade_C',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}},
  'right': {'splitting_feature': 'grade_C',
   'left': None,
   'right': None,
   'is_leaf': True,
   'prediction': 1}},
 'right': {'splitting_feature': 'grade_B',
  'left': None,
  'right': None,
  'is_leaf': True,
  'prediction': 1}}

Here is a recursive function to count the nodes in your tree:

# here build decision_tree_model with probability (inforamtion _gain) technique

In [83]:
def decision_tree_model(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    splitting_feature = None
    
    target_values = data[target]
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    
    # Stopping condition 1
    # (Check if there are mistakes at current node.
    # Recall you wrote a function intermediate_node_num_mistakes to compute this.)
    if intermediate_node_num_probability(target_values) == 0:  ## YOUR CODE HERE
        print("Stopping condition 1 reached.")
        # If not mistakes at current node, make current node a leaf node
        #return create_leaf(target_values)
        return create_leaf(target_values ,splitting_feature)
    
    # Stopping condition 2 (check if there are remaining features to consider splitting on)
    if len(remaining_features) == 0:   ## YOUR CODE HERE
        print("Stopping condition 2 reached.")
        # If there are no remaining features to consider, make current node a leaf node
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(target_values)    
    
    # Additional stopping condition (limit tree depth)
    if current_depth >= max_depth:  ## YOUR CODE HERE
        print("Reached maximum depth. Stopping for now.")
        # If the max tree depth has been reached, make current node a leaf node
        #return create_leaf(target_values)
        return create_leaf(target_values ,splitting_feature)

    # Find the best splitting feature (recall the function best_splitting_feature implemented above)
    ## YOUR CODE HERE
    splitting_feature = prob_splitting_feature(data, features, target)
    
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    #print left_split
    right_split = data[data[splitting_feature] == 1]     
    #print right_split
    #remaining_features.remove(splitting_feature)
    remaining_features = remaining_features.drop([splitting_feature])
    
    print("Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print("Creating leaf node.")
        ## YOUR CODE HERE
        return create_leaf(target_values ,splitting_feature)
        #return create_leaf(right_split[target])
    
    
    current_depth = current_depth + 1
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_model(left_split, remaining_features, target, current_depth , max_depth)        
    ## YOUR CODE HERE
    right_tree = decision_tree_model(right_split, remaining_features, target, current_depth , max_depth)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

In [84]:
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

# test of decision_tree_model (information_gain) using x_test 

In [87]:

model_test = decision_tree_model(x_test, features, target, current_depth = 0, max_depth = 3)
model_test

--------------------------------------------------------------------
Subtree, depth = 0 (177476 data points).
Split on feature loan_range_Medium. (116606, 60870)
--------------------------------------------------------------------
Subtree, depth = 1 (116606 data points).
Split on feature loan_range_Low. (61845, 54761)
--------------------------------------------------------------------
Subtree, depth = 2 (61845 data points).
Split on feature investment_Low. (49282, 12563)
--------------------------------------------------------------------
Subtree, depth = 3 (49282 data points).
Reached maximum depth. Stopping for now.
45045
4237
--------------------------------------------------------------------
Subtree, depth = 3 (12563 data points).
Reached maximum depth. Stopping for now.
11508
1055
--------------------------------------------------------------------
Subtree, depth = 2 (54761 data points).
Split on feature interest_Low. (22999, 31762)
----------------------------------------------



Split on feature interest_Low. (14148, 14949)
--------------------------------------------------------------------
Subtree, depth = 3 (14148 data points).
Reached maximum depth. Stopping for now.
12349
1799
--------------------------------------------------------------------
Subtree, depth = 3 (14949 data points).
Reached maximum depth. Stopping for now.
14202
747
--------------------------------------------------------------------
Subtree, depth = 2 (31773 data points).
Split on feature interest_Low. (16030, 15743)
--------------------------------------------------------------------
Subtree, depth = 3 (16030 data points).
Reached maximum depth. Stopping for now.
14016
2014
--------------------------------------------------------------------
Subtree, depth = 3 (15743 data points).
Reached maximum depth. Stopping for now.
14922
821


{'is_leaf': False,
 'prediction': None,
 'splitting_feature': 'loan_range_Medium',
 'left': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'loan_range_Low',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'investment_Low',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}},
  'right': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'interest_Low',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}}},
 'right': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'investment_Low',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_

 # removing two features term_36 months and term_60 months because         #they are stopping at early ( 0 , 15555) zero is getting in both

In [88]:
small_tree = decision_tree_create(x_train, features, target, max_depth = 3)
small_tree

--------------------------------------------------------------------
Subtree, depth = 0 (709903 data points).
Split on feature grade_A. (591609, 118294)
--------------------------------------------------------------------
Subtree, depth = 1 (591609 data points).
Split on feature grade_B. (388003, 203606)
--------------------------------------------------------------------
Subtree, depth = 2 (388003 data points).
Split on feature grade_C. (191187, 196816)
--------------------------------------------------------------------
Subtree, depth = 3 (191187 data points).
Reached maximum depth. Stopping for now.
163593
27594
--------------------------------------------------------------------
Subtree, depth = 3 (196816 data points).
Reached maximum depth. Stopping for now.
179246
17570
--------------------------------------------------------------------
Subtree, depth = 2 (203606 data points).
Split on feature grade_C. (203606, 0)
Creating leaf node.
190602
13004
--------------------------------

{'is_leaf': False,
 'prediction': None,
 'splitting_feature': 'grade_A',
 'left': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'grade_B',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'grade_C',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}},
  'right': {'splitting_feature': 'grade_C',
   'left': None,
   'right': None,
   'is_leaf': True,
   'prediction': 1}},
 'right': {'splitting_feature': 'grade_B',
  'left': None,
  'right': None,
  'is_leaf': True,
  'prediction': 1}}

# get on x_train decision_tree_model with(information_gain)

In [89]:
model_train = decision_tree_model(x_train, features, target, current_depth = 0, max_depth = 3)
model_train

--------------------------------------------------------------------
Subtree, depth = 0 (709903 data points).




Split on feature loan_range_Medium. (466743, 243160)
--------------------------------------------------------------------
Subtree, depth = 1 (466743 data points).
Split on feature investment_Low. (197751, 268992)
--------------------------------------------------------------------
Subtree, depth = 2 (197751 data points).
Split on feature interest_Low. (106848, 90903)
--------------------------------------------------------------------
Subtree, depth = 3 (106848 data points).
Reached maximum depth. Stopping for now.
93866
12982
--------------------------------------------------------------------
Subtree, depth = 3 (90903 data points).
Reached maximum depth. Stopping for now.
87001
3902
--------------------------------------------------------------------
Subtree, depth = 2 (268992 data points).
Split on feature interest_Low. (110810, 158182)
--------------------------------------------------------------------
Subtree, depth = 3 (110810 data points).
Reached maximum depth. Stopping for no

{'is_leaf': False,
 'prediction': None,
 'splitting_feature': 'loan_range_Medium',
 'left': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'investment_Low',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'interest_Low',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}},
  'right': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'interest_Low',
   'left': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1},
   'right': {'splitting_feature': None,
    'left': None,
    'right': None,
    'is_leaf': True,
    'prediction': 1}}},
 'right': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'investment_Low',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_fe

# get now on train_data for small _decision_tree

Run the following test code to check your implementation. Make sure you get **'Test passed'** before proceeding.

In [90]:
small_data_decision_tree = decision_tree_create(x_train, features, target, max_depth = 5)
if count_nodes(small_data_decision_tree) == 13:
    print ('Test passed!')
else:
    print ('Test failed... try again!')
    print ('Number of nodes found  :', count_nodes(small_data_decision_tree))
    print ('Number of nodes that should be there : 13' )

--------------------------------------------------------------------
Subtree, depth = 0 (709903 data points).
Split on feature grade_A. (591609, 118294)
--------------------------------------------------------------------
Subtree, depth = 1 (591609 data points).
Split on feature grade_B. (388003, 203606)
--------------------------------------------------------------------
Subtree, depth = 2 (388003 data points).
Split on feature grade_C. (191187, 196816)
--------------------------------------------------------------------
Subtree, depth = 3 (191187 data points).
Split on feature interest_High. (268, 190919)
--------------------------------------------------------------------
Subtree, depth = 4 (268 data points).
Split on feature investment_High. (244, 24)
--------------------------------------------------------------------
Subtree, depth = 5 (244 data points).
Reached maximum depth. Stopping for now.
88
156
--------------------------------------------------------------------
Subtree, d

## Build the tree!

Now that all the tests are passing, we will train a tree model on the **train_data**. Limit the depth to 6 (**max_depth = 6**) to make sure the algorithm doesn't run for too long. Call this tree **my_decision_tree**. 

**Warning**: This code block may take 1-2 minutes to learn. 

In [86]:
# above depth  =5 
small_data_decision_tree

{'is_leaf': False,
 'prediction': None,
 'splitting_feature': 'grade_A',
 'left': {'is_leaf': False,
  'prediction': None,
  'splitting_feature': 'grade_B',
  'left': {'is_leaf': False,
   'prediction': None,
   'splitting_feature': 'grade_C',
   'left': {'is_leaf': False,
    'prediction': None,
    'splitting_feature': 'interest_High',
    'left': {'is_leaf': False,
     'prediction': None,
     'splitting_feature': 'investment_High',
     'left': {'splitting_feature': None,
      'left': None,
      'right': None,
      'is_leaf': True,
      'prediction': -1},
     'right': {'splitting_feature': None,
      'left': None,
      'right': None,
      'is_leaf': True,
      'prediction': 1}},
    'right': {'is_leaf': False,
     'prediction': None,
     'splitting_feature': 'home_ownership_OTHER',
     'left': {'splitting_feature': None,
      'left': None,
      'right': None,
      'is_leaf': True,
      'prediction': 1},
     'right': {'splitting_feature': None,
      'left': None,


## Making predictions with a decision tree

As discussed in the lecture, we can make predictions from the decision tree with a simple recursive function. Below, we call this function `classify`, which takes in a learned `tree` and a test point `x` to classify.  We include an option `annotate` that describes the prediction path when set to `True`.

Fill in the places where you find `## YOUR CODE HERE`. There is **one** place in this function for you to fill in.

In [87]:
def classify(tree, x, annotate = False):   
    # if the node is a leaf node.
    if tree['is_leaf']:
        if annotate: 
            print ("At leaf, predicting %s" % tree['prediction'] )
        return tree['prediction'] 
    else:
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate: 
            print ("Split on %s = %s" % (tree['splitting_feature'], split_feature_value) )
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            ### YOUR CODE HERE
            return classify(tree['right'], x, annotate)


Now, let's consider the first example of the test set and see what `my_decision_tree` model predicts for this data point.

In [93]:
x_test.iloc[0:1]
x_test[:1]

Unnamed: 0,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,home_ownership_ANY,...,emp_status_Good,emp_status_Medium,interest_High,interest_Low,investment_High,investment_Low,loan_range_High,loan_range_Low,loan_range_Medium,loan_condition_int
733358,1,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,1


In [96]:
print ('Predicted class: %s ' % classify(small_data_decision_tree, x_test[:1].any()) )


Predicted class: 1 


Let's add some annotations to our prediction to see what the prediction path was that lead to this predicted class:

In [103]:
classify(small_data_decision_tree, x_test[:1].any(), annotate=True)

Split on grade_A = False
Split on grade_B = True
At leaf, predicting 1


1

** Quiz Question:** What was the feature that **my_decision_tree** first split on while making the prediction for test_data[0]?

** Quiz Question:** What was the first feature that lead to a right split of test_data[0]?

** Quiz Question:** What was the last feature split on before reaching a leaf node for test_data[0]?

## Evaluating your decision tree

Now, we will write a function to evaluate a decision tree by computing the classification error of the tree on the given dataset.

Again, recall that the **classification error** is defined as follows:
$$
\mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}}
$$

Now, write a function called `evaluate_classification_error` that takes in as input:
1. `tree` (as described above)
2. `data` (an SFrame)
3. `target` (a string - the name of the target/label column)

This function should calculate a prediction (class label) for each row in `data` using the decision `tree` and return the classification error computed using the above formula. Fill in the places where you find `## YOUR CODE HERE`. There is **one** place in this function for you to fill in.

In [98]:
def evaluate_classification_error(tree, data, target):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x))
    
    print (prediction,len(prediction) )
    # Once you've made the predictions, calculate the classification error and return it
    ## YOUR CODE HERE
    num_mistakes = (prediction != data[target]).sum()/float(len(data))
    s=(prediction != data[target]).sum()
    print ('real true values', data[target] )
    print ('no of  mistakes', s)
    return num_mistakes  #this is error percentage

Now, let's use this function to evaluate the classification error on the test set.

In [100]:
evaluate_classification_error(small_data_decision_tree, x_test.any(), target)

TypeError: 'bool' object is not subscriptable

**Quiz Question:** Rounded to 2nd decimal point, what is the classification error of **my_decision_tree** on the **test_data**?

## Printing out a decision stump

As discussed in the lecture, we can print out a single decision stump (printing out the entire tree is left as an exercise to the curious reader). 

In [55]:
def print_stump(tree, name = 'root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    split_feature, split_value = split_name.split('.')
    print '                       %s' % name
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]               [{0} == 1]    '.format(split_name)
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree'))

In [56]:
print_stump(my_decision_tree)

                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [term. 36 months == 0]               [term. 36 months == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


**Quiz Question:** What is the feature that is used for the split at the root node?

### Exploring the intermediate left subtree

The tree is a recursive dictionary, so we do have access to all the nodes! We can use
* `my_decision_tree['left']` to go left
* `my_decision_tree['right']` to go right

In [57]:
print_stump(my_decision_tree['left'], my_decision_tree['splitting_feature'])

                       term. 36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [grade.A == 0]               [grade.A == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


### Exploring the left subtree of the left subtree


In [58]:
print_stump(my_decision_tree['left']['left'], my_decision_tree['left']['splitting_feature'])

                       grade.A
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [grade.B == 0]               [grade.B == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


**Quiz Question:** What is the path of the **first 3 feature splits** considered along the **left-most** branch of **my_decision_tree**?

**Quiz Question:** What is the path of the **first 3 feature splits** considered along the **right-most** branch of **my_decision_tree**?

In [59]:
print_stump(my_decision_tree['right']['right'], my_decision_tree['right']['splitting_feature'])


(leaf, label: -1)
