# 6. Decision Trees and Ensemble Learning

In this section, we'll talk about decision trees and tree-based ensemble algorithms.

## 6.1 Credit Risk Scoring Project

In this project, we'll use credit risk scoring data to predict whether a client is eligible for a bank loan. The dataset is available at this [link](https://github.com/gastonstat/CreditScoring).

Below is a description of the columns in the dataset:

- `Status`: credit status
- `Seniority`: job seniority (years)
- `Home`: type of home ownership
- `Time`: time of requested loan
- `Age`: client's age
- `Marital`: marital status
- `Records`: existence of records
- `Job`: type of job
- `Expenses`: amount of expenses
- `Income`: amount of income
- `Assets`: amount of assets
- `Debt`: amount of debt
- `Amount`: amount of loan requested
- `Price`: price of goods

To begin, we need to import the libraries required for the project:

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## 6.2 Data Cleaning and Preparation

- Downloading the dataset
- Re-encoding the categorical variables
- Doing the train/validation/test split

In [2]:
# Dataset url
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'

# Download the data
if not os.path.isfile('CreditScoring.csv'):
    !wget $data
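The `!wget` line above only works inside Jupyter on a machine that has `wget` installed. As a hedged alternative, the same "download only if missing" step can be sketched with the standard library. The helper name `download_if_missing` is hypothetical and not part of the original notebook:

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download `url` to `path` unless a file already exists there.

    Hypothetical helper (not from the original notebook); returns `path`.
    """
    if not os.path.isfile(path):
        # urlretrieve copies the remote resource to the local path
        urllib.request.urlretrieve(url, path)
    return path
```

Calling `download_if_missing(data, 'CreditScoring.csv')` would then take the place of the `if not os.path.isfile(...)` block in the cell above.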

In [3]:
# Read the data into a dataframe
df = pd.read_csv('CreditScoring.csv')
df.head()

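| | Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |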


In [4]:
# Check the number of rows and columns
df.shape

(4455, 14)

The dataset has 4455 rows and 14 columns, but the column names are not in lowercase, so let's normalize them:

In [5]:
# Convert columns to lowercase
df.columns = df.columns.str.lower()
df.head()

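| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |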


The next step is to check the data types of these columns.

In [6]:
# Check the data type of each column
df.dtypes

status       int64
seniority    int64
home         int64
time         int64
age          int64
marital      int64
records      int64
job          int64
expenses     int64
income       int64
assets       int64
debt         int64
amount       int64
price        int64
dtype: object

We see an inconsistency in the data types. For example, columns like `status`, `home`, `marital`, `records`, and `job` are categorical, but they are stored as integers. We want to convert them to the right data type.

For this purpose, we'll create a list of categorical columns and loop over them to find their unique values:

In [7]:
# List of categorical columns
categorical_cols = ['status', 'home', 'marital', 'records', 'job']

# Check the unique values in each column
for c in categorical_cols:
    display(df[c].value_counts())

1    3200
2    1254
0       1
Name: status, dtype: int64

2    2107
1     973
5     783
6     319
3     247
4      20
0       6
Name: home, dtype: int64

2    3241
1     978
4     130
3      67
5      38
0       1
Name: marital, dtype: int64

1    3682
2     773
Name: records, dtype: int64

1    2806
3    1024
2     452
4     171
0       2
Name: job, dtype: int64

Some of the columns above contain `0` values, which we'll treat as unknown. We'll replace the remaining codes with descriptive labels using the pandas [map()](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) method:

In [8]:
# Map dict for 'status'
status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}

df.status = df.status.map(status_values)

In [9]:
# Check unique values of 'status' after reformatting
df.status.value_counts()

ok         3200
default    1254
unk           1
Name: status, dtype: int64

In [10]:
# Apply the same reformatting to the rest of the categorical columns
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

df.home = df.home.map(home_values)

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

df.records = df.records.map(records_values)

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}

df.job = df.job.map(job_values)
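One detail worth keeping in mind with `map()` and a dict: any value that is not a key in the dict becomes `NaN`. The sketch below illustrates this on a toy Series. The code `7` and the `fillna('unk')` guard are assumptions added for illustration and are not part of the original notebook:

```python
import pandas as pd

# Toy codes; 7 is deliberately missing from the mapping dict
codes = pd.Series([1, 2, 0, 7])
status_values = {1: 'ok', 2: 'default', 0: 'unk'}

mapped = codes.map(status_values)
# Count how many codes the dict did not cover (they became NaN)
n_unmapped = mapped.isna().sum()
print(n_unmapped)

# One possible guard (an assumption, not the notebook's method):
# treat anything unmapped as unknown
mapped_safe = mapped.fillna('unk')
print(mapped_safe.tolist())
```

Checking `isna().sum()` after each `map()` call is a cheap way to confirm that the mapping dicts cover every code in the data.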

In [11]:
# View the dataframe
df.head()

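| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |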


The columns are now correctly formatted. Next, let's look at the summary statistics of the numerical columns.

In [12]:
df.describe().round()

| | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |


The maximum value of `income`, `assets`, and `debt` is unusually large (99999999). We'll replace these values with `NaN`.

In [13]:
# Replace the 99999999 placeholder with NaN
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

Since we have replaced the above values with NaNs, we need one more step: filling these missing values with `0` so that we can use the data for the model.

In [14]:
# Fill missing values with 0
df = df.fillna(0)
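To see why the placeholder has to go, here is a small sketch on a toy column (the four income values are made up for illustration). It compares the mean with the placeholder in place, after replacing it with `NaN`, and after filling with `0`:

```python
import numpy as np
import pandas as pd

# Toy data (made up): one row carries the 99999999 placeholder
toy = pd.DataFrame({'income': [80, 120, 99999999, 166]})

mean_raw = toy['income'].mean()          # dominated by the placeholder

toy['income'] = toy['income'].replace(to_replace=99999999, value=np.nan)
mean_nan = toy['income'].mean()          # NaN is skipped by default

mean_zero = toy['income'].fillna(0).mean()  # the zero counts as a real value

print(mean_raw, mean_nan, mean_zero)
```

Filling with `0` is the simple choice used in this notebook. It keeps every row usable, at the cost of treating an unknown amount as zero.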

In [15]:
# Check the summary statistics again
df.describe().round()

| | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 130.0 | 5346.0 | 342.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 87.0 | 11525.0 | 1244.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 119.0 | 3000.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 164.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 959.0 | 300000.0 | 30000.0 | 5000.0 | 11140.0 |


The maximum values are now in a reasonable range. Next, we'll deal with the categorical values one more time. Our target column `status` has three categories, `ok`, `default`, and `unk`, but we are only interested in clients whose status is either ok or default. Therefore, we'll keep only the rows where `status` is known.

In [16]:
# Keep only the rows where 'status' is not 'unk', then reset the index
df = df[df.status != 'unk'].reset_index(drop=True)
df.shape

(4454, 14)

Next, we'll split the data into 60% train, 20% validation, and 20% test sets with a random state of 11.

In [17]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)
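The `test_size=0.25` in the second call can look surprising. It is applied to the 80% that remains after the test set is held out, and 25% of 80% is 20% of the full dataset. The sketch below checks the resulting sizes on a stand-in array of 4454 row indices (the array `rows` is an assumption used in place of the dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 4454 rows of the cleaned dataframe
rows = np.arange(4454)

# First split: hold out 20% of all rows for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=11)
# Second split: 25% of the remaining 80% becomes validation (20% overall)
train, val = train_test_split(full_train, test_size=0.25, random_state=11)

sizes = (len(train), len(val), len(test))
print(sizes)
print([round(n / len(rows), 2) for n in sizes])
```

In general, to get a validation share `v` of the full data after holding out a test share `t`, the second `test_size` is `v / (1 - t)`.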

In [18]:
# Reset the index of train/val/test
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [19]:
# Convert target variable 'status' from categorical to binary for train/val/test
y_train = (df_train.status == 'default').astype(int).values
y_val = (df_val.status == 'default').astype(int).values
y_test = (df_test.status == 'default').astype(int).values

In [20]:
# Drop 'status' column from train/val/test
del df_train['status']
del df_val['status']
del df_test['status']

In [21]:
# Verify the split
df_train.shape[0], df_val.shape[0], df_test.shape[0]

(2672, 891, 891)

In [22]:
df_train.shape[0] + df_val.shape[0] + df_test.shape[0]

4454

In [23]:
y_train.shape, y_val.shape, y_test.shape

((2672,), (891,), (891,))

## 6.3 Decision Trees

- What a decision tree looks like
- Training a decision tree
- Overfitting
- Controlling the size of a tree

In simple terms, a decision tree makes predictions based on a series of *if/else* statements. It starts at a single node and then splits into two or more branches.

To replicate how a decision tree works, let's create a function `assess_risk()` that takes a client as its parameter and applies a few tree-based rules:

In [24]:
def assess_risk(client):
    if client['records'] == 'yes':
        if client['job'] == 'partime':  # spelled as in the job_values mapping
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else:
            return 'default'

Now, let's extract a client from the train dataframe. As we learned in the previous sessions, our model will receive requests in JSON/dict format, so we need to convert the pandas Series to a dictionary.

In [25]:
xi = df_train.iloc[0].to_dict()
xi

{'seniority': 10,
 'home': 'owner',
 'time': 36,
 'age': 36,
 'marital': 'married',
 'records': 'no',
 'job': 'freelance',
 'expenses': 75,
 'income': 0.0,
 'assets': 10000.0,
 'debt': 0.0,
 'amount': 1000,
 'price': 1400}

In [26]:
# Make prediction using 'assess_risk()' function
assess_risk(xi)

'ok'

Since the client has no records and the assets are more than 6000, the function returns ok.

Let's now train an actual decision tree on the training data and make predictions on the validation set using sklearn's `DecisionTreeClassifier`. We'll also need to import a few other classes and functions from sklearn:

In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.tree import export_text

In [28]:
# Convert the train dataframe to a list of dictionaries
train_dicts = df_train.to_dict(orient='records')
# Instantiate DictVectorizer
dv = DictVectorizer(sparse=False)
# Fit 'dv' and transform the train data into the feature matrix X
X_train = dv.fit_transform(train_dicts)

# Create the decision tree model
dt = DecisionTreeClassifier()
# Train the model
dt.fit(X_train, y_train)

Our model is trained. Next, we need to make predictions and evaluate the model on the validation set.

In [29]:
# Convert val dataframe to dict
val_dicts = df_val.to_dict(orient='records')
# Apply 'dv' to transform val data to X features
X_val = dv.transform(val_dicts)

# Make predictions on val data
y_pred = dt.predict_proba(X_val)[:, 1]
# Calculate model AUC score on y_val
roc_auc_score(y_val, y_pred)

0.6640369572061708

The model is not performing well on the unseen data. Let's check the AUC on the training data.

In [30]:
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

1.0

Our model performs perfectly on the training data but poorly on the validation (unseen) data. This is known as overfitting. One reason is the tree's depth: with no limit, the tree keeps splitting until it memorizes the training data, including patterns that are specific to it and do not carry over to the validation data.

One way to overcome this problem is to limit the depth of the decision tree with the `max_depth` hyperparameter.

In [31]:
# Reduce the max depth to 2 in the model
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

In [32]:
# Make predictions on train and validation data
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train:', auc)

y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val:', auc)

train: 0.7054989859726213
val: 0.6685264343319367


Even with only two levels, the validation AUC improves slightly (from 0.664 to 0.669), and the gap between the train and validation scores shrinks. Let's print the rules the tree has learned:

In [33]:
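# Print the rules the decision tree uses for its predictions
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))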

|--- records=yes <= 0.50
|   |--- job=partime <= 0.50
|   |   |--- class: 0
|   |--- job=partime >  0.50
|   |   |--- class: 1
|--- records=yes >  0.50
|   |--- seniority <= 6.50
|   |   |--- class: 1
|   |--- seniority >  6.50
|   |   |--- class: 0





From the tree structure above, we can see that when the client has `records` (i.e., `records=yes > 0.50`) and the job `seniority` is less than or equal to 6.5 years, the client is predicted as *default*; if the `seniority` is more than 6.5 years, the client is *ok*.

On the other hand, if the client has **no** `records` (i.e., `records=yes <= 0.50`) and the `job` is not partime, the client is *ok*; if the `job` is partime, the client is *default*.

Let's create a decision tree model with a maximum depth of 1 and use it to make predictions.

In [34]:
dt = DecisionTreeClassifier(max_depth=1)
dt.fit(X_train, y_train)
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
auc

0.6058644740984719

We see that the score is worse than that of our first model. A decision tree with a depth of 1 is called a **decision stump**. It has only one split.
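The three depths tried above (unlimited, 2, and 1) suggest comparing a wider range of values. The sketch below does this on synthetic data from `make_classification`, so that it runs on its own. The dataset, the list of depths, and all variable names are assumptions for illustration, and the scores will differ from those of the credit scoring data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

scores = {}
for depth in [1, 2, 3, 4, 5, 6, 10, 15, 20, None]:  # None means no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    # AUC on the data the tree was trained on vs. on held-out data
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_va = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    scores[depth] = (auc_tr, auc_va)
    print(f'{str(depth):>4}  train={auc_tr:.3f}  val={auc_va:.3f}')
```

Comparing the two columns for each depth shows where the train score keeps climbing while the validation score stops improving, which is the point where a deeper tree starts to overfit.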

## 6.4 Decision Tree Learning Algorithm

- Finding the best split for one column
- Finding the best split for the entire dataset
- Stopping criteria
- Decision tree learning algorithm