# Implementing binary decision trees
The goal of this notebook is to implement your own binary decision tree classifier. You will:
    
* Use Pandas to do some feature engineering.
* Transform categorical variables into binary variables.
* Write a function to compute the number of misclassified examples in an intermediate node.
* Write a function to find the best feature to split on.
* Build a binary decision tree from scratch.
* Make predictions using the decision tree.
* Evaluate the accuracy of the decision tree.
* Visualize the decision at the root node.

**Important Note**: In this assignment, we will focus on building decision trees where the data contain **only binary (0 or 1) features**. This allows us to avoid dealing with:
* Multiple intermediate nodes in a split
* The thresholding issues of real-valued features.

This assignment **may be challenging**, so brace yourself :)

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split

## Data Preprocessing

### Load LendingClub dataset
**1**. We will be using a dataset from [LendingClub](https://www.lendingclub.com/). A parsed and cleaned form of the dataset is available [here](https://github.com/learnml/machine-learning-specialization-private).

In [3]:
DATA_DIR = os.path.join('data')

print(os.listdir(DATA_DIR))

['.DS_Store', 'lending-club-data.sframe', 'module-5-assignment-1-validation-idx.json', 'module-5-assignment-1-train-idx.json', 'module-5-assignment-2-test-idx.json', 'c3w3_Quiz.jpeg', 'lending-club-data.csv', 'module-5-assignment-2-train-idx.json']


In [4]:
loans = pd.read_csv(os.path.join(DATA_DIR, 'lending-club-data.csv'))



In [5]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1


**2**. Like the previous assignment, reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.

In [6]:
loans['bad_loans'].value_counts()

0    99457
1    23150
Name: bad_loans, dtype: int64

In [7]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)

loans.drop('bad_loans', axis=1, inplace=True)

In [8]:
loans['safe_loans'].value_counts()

 1    99457
-1    23150
Name: safe_loans, dtype: int64

**3**. Unlike the previous assignment, we will only be considering these four features:

In [9]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'

Extract these feature columns from the dataset, and discard the rest of the feature columns.

In [10]:
loans = loans[features + [target]]

In [11]:
loans.head()

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
0,B,36 months,RENT,10+ years,1
1,C,60 months,RENT,< 1 year,-1
2,C,36 months,RENT,10+ years,1
3,C,36 months,RENT,10+ years,1
4,A,36 months,RENT,3 years,1


Check if there are null values in the columns.

In [12]:
loans.isnull().sum()

grade                0
term                 0
home_ownership       0
emp_length        4091
safe_loans           0
dtype: int64

Convert the null values to 0s.

In [13]:
# Fill every missing value with 0 in one call; only emp_length has nulls here.
loans = loans.fillna(0)

In [14]:
loans.isnull().sum()

grade             0
term              0
home_ownership    0
emp_length        0
safe_loans        0
dtype: int64

### Sample data to balance classes

**4**. Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset.

**To make sure our final results match the reference results, we use the predefined training and test index lists here instead of sampling the data ourselves.**

Follow these steps:

* Load the JSON files into the lists train_idx and test_idx.
* Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. Alternatively, see step **5** below for implementation hints.
* Perform the train/test split using train_idx and test_idx.

Loading the predefined index lists stands in for rebalancing the imbalanced dataset ourselves:

In [15]:
import json

with open(os.path.join(DATA_DIR, 'module-5-assignment-2-train-idx.json')) as json_file:
    train_idx = json.load(json_file)
    
with open(os.path.join(DATA_DIR, 'module-5-assignment-2-test-idx.json')) as json_file:
    test_idx = json.load(json_file)
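
Once the index lists are loaded, the split itself can be sketched as follows. This is a minimal illustration on a toy frame, not the assignment's data: the names `toy_loans`, `toy_train_idx` and `toy_test_idx` are hypothetical, and it assumes the JSON lists hold positional row numbers into `loans`, which is why `.iloc` is used.

```python
import pandas as pd

# Toy stand-in for the one-hot encoded `loans` frame (values are illustrative).
toy_loans = pd.DataFrame({
    'safe_loans': [1, -1, 1, -1, 1, -1],
    'grade_A':    [1, 0, 0, 0, 1, 0],
})

# Hypothetical index lists; in the notebook these come from the JSON files.
toy_train_idx = [0, 2, 3, 5]
toy_test_idx = [1, 4]

# Assumption: the indices are positional row numbers, hence .iloc rather than .loc.
toy_train_data = toy_loans.iloc[toy_train_idx]
toy_test_data = toy_loans.iloc[toy_test_idx]

print(len(toy_train_data), len(toy_test_data))
```

In the notebook, the same pattern would be applied to `loans` after one-hot encoding, so that both subsets share the same binary feature columns.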

**Outside this assignment, we could balance the classes ourselves as follows** (*skip this section for this assignment*):

---
As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [16]:
safe_loans_raw = loans[loans['safe_loans'] == +1]
risky_loans_raw = loans[loans['safe_loans'] == -1]

print(f"Number of safe loans: {len(safe_loans_raw)}")
print(f"Number of risky loans: {len(risky_loans_raw)}")
print()
print(f"Percentage of safe loans  : {round(len(safe_loans_raw)/(len(safe_loans_raw)+len(risky_loans_raw)), 2)}") 
print(f"Percentage of risky loans : {round(len(risky_loans_raw)/(len(safe_loans_raw)+len(risky_loans_raw)), 2)}")

Number of safe loans: 99457
Number of risky loans: 23150

Percentage of safe loans  : 0.81
Percentage of risky loans : 0.19


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used `random_state=1` so everyone gets the same results.

In [17]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage, random_state=1)

# Combine risky_loans with the downsampled version of safe_loans.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
loans_data = pd.concat([risky_loans, safe_loans])

Now, let's verify that the resulting proportions of safe and risky loans are each close to 50%.

In [18]:
print(f"Percentage of safe loans                 : {len(safe_loans) / float(len(loans_data))}")
print(f"Percentage of risky loans                : {len(risky_loans) / float(len(loans_data))}")
print(f"Total number of loans in our new dataset : {len(loans_data)}")

Percentage of safe loans                 : 0.5
Percentage of risky loans                : 0.5
Total number of loans in our new dataset : 46300


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907). For this assignment, we use the simplest possible approach, where we subsample the over-represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

---

### Transform categorical data into binary features -- one-hot encoding
In this assignment, we will implement **binary decision trees** (decision trees for binary features, a specific case of categorical variables taking on two values, e.g., true/false). Since all of our features are currently categorical features, we want to turn them into binary features.

For instance, the **home_ownership** feature represents the home ownership status of the loanee, which is either `own`, `mortgage` or `rent`. If a data point has the feature 
```
   {'home_ownership': 'RENT'}
```
we want to turn this into three features: 
```
 { 
   'home_ownership = OWN'      : 0, 
   'home_ownership = MORTGAGE' : 0, 
   'home_ownership = RENT'     : 1
 }
```
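
The transformation above can be sketched on a toy frame with `pd.get_dummies`. This is only an illustration: the frame `toy` and the choice of `prefix_sep=' = '` (used here to reproduce the `home_ownership = RENT` style of name) are assumptions for the example, and the notebook itself uses the default `_` separator.

```python
import pandas as pd

# Toy data with one categorical column (values are illustrative).
toy = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE']})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy,
    columns=['home_ownership'],
    prefix_sep=' = ',
    dtype=int,
)

# The first row is the RENT data point from the example above.
print(encoded.iloc[0].to_dict())
```

Each row has exactly one of the new columns set to 1, which is what makes the encoding "one-hot".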

**5**. This technique of turning categorical variables into binary variables is called one-hot encoding. Perform one-hot encoding on the four features described above. You should now have 25 binary features.

In [19]:
# def one_hot_encoding(df, col_name):
#     for cla in sorted(df[col_name].unique()):
#         df[f"{col_name}_{cla}"] = df[col_name].apply(lambda x: 1 if x == cla else 0)
        
#     return df

In [19]:
loans.columns

Index(['grade', 'term', 'home_ownership', 'emp_length', 'safe_loans'], dtype='object')

In [21]:
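# Encode only the categorical feature columns; dtype=int keeps the 0/1 values
# shown below (newer pandas versions default to booleans).
loans = pd.get_dummies(loans, columns=features, prefix=features, dtype=int)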

In [22]:
loans.head()

Unnamed: 0,safe_loans,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,term_ 36 months,term_ 60 months,...,emp_length_10+ years,emp_length_2 years,emp_length_3 years,emp_length_4 years,emp_length_5 years,emp_length_6 years,emp_length_7 years,emp_length_8 years,emp_length_9 years,emp_length_< 1 year
0,1,0,1,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1,-1,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
3,1,0,0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0


In [23]:
loans.columns

Index(['safe_loans', 'grade_A', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
       'grade_F', 'grade_G', 'term_ 36 months', 'term_ 60 months',
       'home_ownership_MORTGAGE', 'home_ownership_OTHER', 'home_ownership_OWN',
       'home_ownership_RENT', 'emp_length_0', 'emp_length_1 year',
       'emp_length_10+ years', 'emp_length_2 years', 'emp_length_3 years',
       'emp_length_4 years', 'emp_length_5 years', 'emp_length_6 years',
       'emp_length_7 years', 'emp_length_8 years', 'emp_length_9 years',
       'emp_length_< 1 year'],
      dtype='object')

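`pd.get_dummies` already removes the original columns it encodes, so no separate removal step is needed. The drop below is kept only as a safeguard: `errors='ignore'` makes it a no-op when the columns are already gone, instead of raising a `KeyError`.

In [25]:
loans.drop(columns=features, errors='ignore', inplace=True)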