## Decision Trees and Ensemble Learning

In this project, we'll learn about tree-based models using <b>Decision Trees</b> and <b>Ensemble Learning</b>.

The project builds a <b>Credit Risk Score</b> for loan applicants using the <b>CreditScoring.csv</b> dataset.

## Data Preparation and Cleaning

- Loading the dataset
- Re-encoding the categorical variables
- Doing the train/validation/test split

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("CreditScoring.csv")
df.head()

In [None]:
df.columns = df.columns.str.lower()
df.head()

Several categorical variables are stored as numeric codes and should be converted back to readable labels. They are <b>status, home, marital, records and job</b>.

In [None]:
# let's start with status
status_values = {
    1: "ok",
    2: "default",
    0: "unk"
}
df.status = df.status.map(status_values)

In [None]:
# Now for all other categorical features
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

df.home = df.home.map(home_values)

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

df.records = df.records.map(records_values)

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}

df.job = df.job.map(job_values)

In [None]:
df.head()

In [None]:
df.describe().round()

The summary statistics above show a maximum of <b>99999999.0</b> for some features, which means that there are <b>missing values</b> encoded with this number. We'll handle them next.

In [None]:
# replace those 99999999.0 with nan
for c in ["income", "assets", "debt"]:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

In [None]:
df.describe().round()

In [None]:
# Let's also remove rows with an unknown status so that only ok and default remain
df = df[df.status != "unk"].reset_index(drop=True)

Next, we split the data frame into <b>train, validation and test</b> sets.

In [None]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
len(df_train), len(df_val), len(df_test)
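
The second call uses `test_size=0.25` because it is applied to the 80% that remains after the test set is removed, and 25% of 80% is 20% of the full data. A minimal sketch of that arithmetic, using only the two `test_size` values from the cells above (actual row counts can differ by one because of rounding):

```python
from fractions import Fraction

test_share = Fraction(20, 100)                     # test_size=0.2 on the full data
full_train_share = 1 - test_share                  # what is left for train + validation
val_share = full_train_share * Fraction(25, 100)   # test_size=0.25 on the remainder
train_share = full_train_share - val_share

for name, share in [("train", train_share), ("val", val_share), ("test", test_share)]:
    print(f"{name}: {float(share):.0%}")
```

The result is a 60/20/20 split between train, validation and test.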

In [None]:
# create the target variable: 1 if the client defaulted, 0 otherwise
y_train = (df_train.status == "default").astype("int").values
y_val = (df_val.status == "default").astype("int").values
y_test = (df_test.status == "default").astype("int").values

In [None]:
# now we remove the target variables from the rest so that they are not accidentally used as X
del df_train["status"]
del df_val["status"]
del df_test["status"]

In [None]:
df_train

## Decision Trees

- What a decision tree looks like
- Training a decision tree
- Overfitting
- Controlling the size of a tree

In [None]:
# a simple decision tree using if else statement
def assess_risk(client):
    if client["records"] == "yes":
        if client["job"] == "partime":  # spelled as in job_values above
            return "default"
        else:
            return "ok"
    else:
        if client["assets"] > 6000:
            return "ok"
        else:
            return "default"

In [None]:
# let's test the decision tree above
xi = df_train.iloc[0].to_dict()
xi

In [None]:
assess_risk(xi)

The result matches the rules above, given that the client's job is freelance and the assets are 10000.

In [None]:
# Now let's train using sklearn's DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

In [None]:
# convert the train data frame into a list of dictionaries, filling missing values with zeros
train_dicts = df_train.fillna(0).to_dict(orient="records")

In [None]:
# one-hot encode the categorical variables
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [None]:
# check feature names
dv.get_feature_names_out()
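
To make the encoding step less opaque, here is a simplified re-implementation of the idea behind `DictVectorizer`: numeric values are kept as they are, and every string value becomes its own `feature=value` column containing 0 or 1. This is only an illustrative sketch, not scikit-learn's code; `one_hot_encode` and the two sample clients are hypothetical.

```python
def one_hot_encode(dicts):
    """Turn a list of feature dicts into (column names, numeric matrix)."""
    names = set()
    for d in dicts:
        for key, value in d.items():
            # string values get one column per distinct value
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    matrix = []
    for d in dicts:
        row = [0.0] * len(names)
        for key, value in d.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        matrix.append(row)
    return names, matrix


# two hypothetical clients with one numeric and one categorical feature
clients = [
    {"job": "fixed", "assets": 6000},
    {"job": "freelance", "assets": 0},
]
names, matrix = one_hot_encode(clients)
print(names)
print(matrix)
```

Column names such as `job=fixed` are what `dv.get_feature_names_out()` shows for the categorical features.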

In [None]:
# Now train the DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
# test the model using validation dataset
val_dicts = df_val.fillna(0).to_dict(orient="records")
X_val = dv.transform(val_dicts)

In [None]:
y_pred = dt.predict_proba(X_val)[:, 1]

In [None]:
# calculate the roc auc of the model
roc_auc_score(y_val, y_pred)
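
For intuition about the number returned by `roc_auc_score`: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way on a tiny hand-made example; `pairwise_auc` is a hypothetical helper written for illustration and is much slower than the scikit-learn function.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# hand-made labels and scores, not taken from the credit dataset
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_demo, y_score_demo))  # 3 of 4 pairs ordered correctly: prints 0.75
```

A score of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of defaulting clients above non-defaulting ones.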

In [None]:
# AUC of the training dataset
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

The model's <b>ROC AUC score</b> is 0.65 on the validation set but 1.0 on the training set, which is a clear sign of overfitting.

This happens because the tree is allowed to grow so deep that it memorizes individual clients in the training data and therefore fails to <b>generalize</b>.

We can counter this by limiting the depth the tree is allowed to reach during training, as shown below.

In [None]:
# create a new model limited to a depth of 3
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)

In [None]:
# auc when the tree is restricted to 3 levels
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print("train: ", auc)

y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print("val: ", auc)

As seen above, the validation score is better than it was for the unrestricted tree. However, if the depth is restricted too much, the tree becomes too simple and the results get worse again.

In [None]:
# let's print the rules the tree has learned
from sklearn.tree import export_text

In [None]:
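# pass the feature names so the rules show column names instead of feature_0, feature_1, ...
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))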

## Decision tree learning algorithm

- Finding the best split for one column
- Finding the best split for the entire dataset
- Stopping criteria
- Decision tree learning algorithm

In [None]:
# let's create a small dataset for demonstration
data = [
    [8000, "default"],
    [2000, "default"],
    [0, "default"],
    [5000, "ok"],
    [5000, "ok"],
    [4000, "ok"],
    [9000, "ok"],
    [3000, "default"],
]

df_example = pd.DataFrame(data, columns=["assets", "status"])
df_example

In [None]:
df_example.sort_values("assets")

In [None]:
# potential thresholds for splitting the dataframe
Ts = [0, 2000, 3000, 4000, 5000, 8000]

In [None]:
from IPython.display import display

In [None]:
# demonstrate splitting using the various splits
for T in Ts:
    print(T)
    df_left = df_example[df_example.assets <= T]
    df_right = df_example[df_example.assets > T]
    
    display(df_left)
    print(df_left.status.value_counts(normalize=True))
    display(df_right)
    print(df_right.status.value_counts(normalize=True))

    
    print()
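
The loop above only prints the class proportions on each side of a split. To pick a threshold, those proportions have to be turned into a single number. The sketch below does this with the misclassification rate (the share of rows that do not belong to the majority class), weighted by the size of each side. This choice of impurity measure and the helper names are assumptions made for illustration; scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default.

```python
from collections import Counter

# the single-feature toy dataset from above: (assets, status)
toy_rows = [
    (8000, "default"), (2000, "default"), (0, "default"), (5000, "ok"),
    (5000, "ok"), (4000, "ok"), (9000, "ok"), (3000, "default"),
]
candidate_thresholds = [0, 2000, 3000, 4000, 5000, 8000]


def misclassification_rate(labels):
    """Share of labels that differ from the majority class."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)


def split_impurity(rows, threshold):
    """Size-weighted average impurity of the two sides of a split."""
    left = [status for assets, status in rows if assets <= threshold]
    right = [status for assets, status in rows if assets > threshold]
    weighted = (len(left) * misclassification_rate(left)
                + len(right) * misclassification_rate(right))
    return weighted / len(rows)


impurities = {t: split_impurity(toy_rows, t) for t in candidate_thresholds}
best_threshold = min(impurities, key=impurities.get)

for t, value in impurities.items():
    print(f"T={t:>5}: weighted impurity = {value:.3f}")
print("best threshold:", best_threshold)
```

On this dataset the lowest impurity is reached at `T=3000`: everything on the left is a default, and only one of the five rows on the right is misclassified.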

In [None]:
# dataset with more than one feature
data = [
    [8000, 3000, "default"],
    [2000, 1000, "default"],
    [0, 1000, "default"],
    [5000, 1000, "ok"],
    [5000, 1000, "ok"],
    [4000, 1000, "ok"],
    [9000, 500, "ok"],
    [3000, 2000, "default"],
]

df_example = pd.DataFrame(data, columns=["assets","debt", "status"])
df_example

In [None]:
df_example.sort_values("debt")

In [None]:
# generalized potential thresholds for splitting the dataset with more than one feature
thresholds = {
    "assets": [0, 2000, 3000, 4000, 5000, 8000],
    "debt": [500, 1000, 2000]
}

In [None]:
for feature, Ts in thresholds.items():
    print("##########")
    print(feature)
    for T in Ts:
        print(T)
        df_left = df_example[df_example[feature] <= T]
        df_right = df_example[df_example[feature] > T]

        display(df_left)
        print(df_left.status.value_counts(normalize=True))
        display(df_right)
        print(df_right.status.value_counts(normalize=True))


        print()
    print("##########")
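
The last two items of the outline, stopping criteria and the learning algorithm itself, can be sketched as a short recursive function: find the best split over all features, split the data, and repeat on each side until a stopping criterion is met. The code below is a simplified illustration under stated assumptions (misclassification rate as impurity, every observed value except the largest as a candidate threshold, hypothetical function names); it is not how scikit-learn implements its trees.

```python
from collections import Counter

FEATURES = ["assets", "debt"]
# the two-feature toy dataset from above: (assets, debt, status)
toy_rows = [
    (8000, 3000, "default"), (2000, 1000, "default"), (0, 1000, "default"),
    (5000, 1000, "ok"), (5000, 1000, "ok"), (4000, 1000, "ok"),
    (9000, 500, "ok"), (3000, 2000, "default"),
]


def impurity(labels):
    """Misclassification rate of a non-empty list of labels."""
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)


def best_split(rows):
    """Return (weighted impurity, feature index, threshold) or None."""
    best = None
    for f in range(len(FEATURES)):
        values = sorted({r[f] for r in rows})
        for t in values[:-1]:  # the largest value would leave the right side empty
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            score = (len(left) * impurity(left)
                     + len(right) * impurity(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, depth=0, max_depth=2, min_samples_leaf=1):
    labels = [r[-1] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # stopping criteria: depth limit reached, node already pure, or too few rows
    if (depth >= max_depth or impurity(labels) == 0
            or len(rows) < 2 * min_samples_leaf):
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    _, f, t = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
        return majority
    return {
        "feature": FEATURES[f],
        "threshold": t,
        "left": build_tree(left, depth + 1, max_depth, min_samples_leaf),
        "right": build_tree(right, depth + 1, max_depth, min_samples_leaf),
    }


def predict(tree, client):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        side = "left" if client[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree


toy_tree = build_tree(toy_rows)
print(toy_tree)
print(predict(toy_tree, {"assets": 9000, "debt": 500}))
```

With `max_depth=2` the sketch first splits on `assets <= 3000` and then on `debt <= 1000`, which separates the toy data perfectly. The `max_depth` and `min_samples_leaf` arguments play the same role as the scikit-learn parameters of the same name.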

## Decision Trees Parameter Tuning

- Selecting max_depth
- Selecting min_samples_leaf

In [None]:
# train a decision tree for each max_depth value and compute its validation AUC
for d in [1, 2, 3, 4, 5, 6, 10, 15, 20, None]:
    dt = DecisionTreeClassifier(max_depth=d)
    dt.fit(X_train, y_train)
    
    y_pred = dt.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    
    print("%4s -> %.3f" % (d, auc))

In [None]:
# add min_samples_leaf to the search
scores = []

for d in [4, 5, 6, 7, 10, 15, 20]:
    for s in [1, 2, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)

        scores.append((d, s, auc))

In [None]:
# create a data frame for the scores
columns = ["max_depth", "min_samples_leaf", "auc"]
df_scores = pd.DataFrame(scores, columns=columns)
df_scores.head()

In [None]:
df_scores.sort_values(by="auc", ascending=False).head()

In [None]:
# pivot the data frame to make it easier to read
df_scores_pivot = df_scores.pivot(index="min_samples_leaf",
                                 columns=["max_depth"], values=["auc"])
df_scores_pivot.round(3)

In [None]:
# visualize it as a heatmap
sns.heatmap(df_scores_pivot, annot=True, fmt=".3f")

A max_depth of 6 with a min_samples_leaf of 15 seems to work well, so let's train a model with these values.

In [None]:
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)

## Ensemble Learning and Random Forest

- Board of experts
- Ensembling models
- Random forest - ensembling decision trees
- Tuning random forest
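
The "board of experts" idea can be sketched in a few lines: every expert gives its own probability of default and the ensemble averages them. In a random forest the experts are decision trees, and by default each tree is trained on a bootstrap sample of the rows (scikit-learn additionally considers a random subset of features at each split). The helper names and the three probabilities below are hypothetical and only illustrate the mechanics.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def ensemble_probability(expert_probabilities):
    """Average the default probabilities proposed by the individual experts."""
    return sum(expert_probabilities) / len(expert_probabilities)


rng = random.Random(1)
row_ids = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(row_ids, rng)
print(sorted(sample))                     # rows may repeat, others may be left out

# hypothetical probabilities of default from three experts (or trees)
votes = [0.9, 0.4, 0.6]
p_default = ensemble_probability(votes)
print(round(p_default, 3), "default" if p_default >= 0.5 else "ok")
```

Because each expert sees slightly different data, their individual errors differ, and averaging tends to cancel part of them out.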

In [None]:
# import RandomForestClassifier from sklearn's ensemble module
from sklearn.ensemble import RandomForestClassifier

In [None]:
# create a random forest model
rf = RandomForestClassifier(n_estimators=10, random_state=1)
rf.fit(X_train, y_train)

In [None]:
# probability predict based on validation set
y_pred = rf.predict_proba(X_val)[:, 1]
# auc score
roc_auc_score(y_val, y_pred)

In [None]:
# examine what happens when the number of estimators changes
scores = []

for n in range(10, 201, 10):
    rf = RandomForestClassifier(n_estimators=n, random_state=1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    scores.append((n, auc))

In [None]:
# create a scores data frame for easy visualization
df_scores = pd.DataFrame(scores, columns=["n_estimators", "auc"])

In [None]:
df_scores

In [None]:
# plot the number of estimators against the AUC score to see how many trees the model needs
plt.plot(df_scores.n_estimators, df_scores.auc)

The plot above shows that the model's score increases until it reaches about 50 trees and then stays roughly flat as more trees are added.

This implies that about 50 trees are enough for this model; the additional trees contribute little to its performance.
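
Reading the plateau off a plot is a judgment call. One way to make it explicit is to take the smallest number of trees whose AUC is within a small tolerance of the best score. The sketch below assumes a list of `(n_estimators, auc)` tuples like the `scores` list built above; the function name, the tolerance of 0.005 and the example values are made up for illustration and are not results from this notebook.

```python
def smallest_sufficient_n(scores, tolerance=0.005):
    """Smallest n_estimators whose AUC is within `tolerance` of the best AUC."""
    best_auc = max(auc for _, auc in scores)
    for n, auc in sorted(scores):
        if auc >= best_auc - tolerance:
            return n


# made-up (n_estimators, auc) pairs, for illustration only
example_scores = [(10, 0.70), (20, 0.75), (30, 0.78), (40, 0.79),
                  (50, 0.80), (60, 0.80), (70, 0.801)]
print(smallest_sufficient_n(example_scores))
```

A larger tolerance favors smaller, faster models; a tolerance of 0 simply returns the number of trees with the best score.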

In [None]:
# Now let's tune the random forest by training it using different depths for the trees
# using max_depth
scores = []

for d in [5, 10, 15]:
    for n in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=n, 
                                    max_depth=d, 
                                    random_state=1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        scores.append((d, n, auc))

In [None]:
# df_scores data frame with the max_depth column
columns = ["max_depth", "n_estimators", "auc"]
df_scores = pd.DataFrame(scores, columns=columns)
df_scores.head()

In [None]:
# A plot of the different depths performances
for d in [5, 10, 15]:
    df_subset = df_scores[df_scores.max_depth == d]
    plt.plot(df_subset.n_estimators, df_subset.auc,
            label="max_depth=%d" % d)
    
plt.legend()

From the plot above, the best max_depth is 10.

Now let's find the best value for min_samples_leaf.

In [None]:
max_depth = 10

In [None]:
# tune further to find the best min_samples_leaf
scores = []

for s in [1, 3, 5, 10, 50]:
    for n in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=n, 
                                    max_depth=max_depth,
                                    min_samples_leaf=s,
                                    random_state=1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        scores.append((s, n, auc))

In [None]:
columns = ["min_samples_leaf", "n_estimators", "auc"]
df_scores = pd.DataFrame(scores, columns=columns)
df_scores.head()

In [None]:
# A plot of the different min_samples_leaf performances
colors = ["black", "blue", "orange", "red", "grey"]
min_samples_leaf_values = [1, 3, 5, 10, 50]
for s, col in zip(min_samples_leaf_values, colors):
    df_subset = df_scores[df_scores.min_samples_leaf == s]
    plt.plot(df_subset.n_estimators, df_subset.auc, color=col,
            label="min_samples_leaf=%s" % s)
    
plt.legend()

In [None]:
min_samples_leaf = 3

Here the best min_samples_leaf is 3, and it works well at around 100 trees because beyond that the score is fairly flat.

We'll now retrain the model using these parameters, that is, a max_depth of 10 and a min_samples_leaf of 3.

In [None]:
# retrain the model
rf = RandomForestClassifier(n_estimators=100, 
                            max_depth=max_depth,
                            min_samples_leaf=min_samples_leaf,
                            random_state=1,
                            n_jobs=-1)
rf.fit(X_train, y_train)

## Gradient boosting and XGBoost

- Gradient boosting vs Random Forest
- Installing XGBoost
- Training the first model
- Performance monitoring
- Parsing xgboost's monitoring output
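
The key difference from a random forest is that boosting trains its models one after another, and each new model tries to correct the errors left by the ones before it. The sketch below illustrates that loop with one-split regression trees (stumps) and squared error on a tiny hand-made dataset. It is a deliberately simplified illustration with hypothetical function names: XGBoost itself uses a logistic objective for this problem, deeper trees and regularization. The `eta` argument plays the role of the learning rate, like `eta` in the XGBoost parameters.

```python
def fit_stump(xs, residuals):
    """Best single-threshold stump under squared error (needs 2+ distinct xs)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_rounds=10, eta=0.3):
    """Sequentially fit stumps to the residuals of the current prediction."""
    base = sum(ys) / len(ys)          # start from the mean of the target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # take only a fraction eta of each correction
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + eta * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# hand-made data: the target switches from 0 to 1 between x=3 and x=4
xs_demo = [1, 2, 3, 4, 5, 6]
ys_demo = [0, 0, 0, 1, 1, 1]
one_round = boost(xs_demo, ys_demo, n_rounds=1)
ten_rounds = boost(xs_demo, ys_demo, n_rounds=10)
print(mse(one_round, xs_demo, ys_demo), mse(ten_rounds, xs_demo, ys_demo))
```

The training error shrinks with every round, which is also why boosted models can overfit if the number of rounds is not monitored on a validation set.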

In [None]:
# import xgboost
import xgboost as xgb

In [None]:
# wrap the training and validation data in XGBoost's DMatrix structure for training with xgboost
# (feature names are converted to a plain list, which recent xgboost versions require)
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

In [None]:
# train the model
xgb_params = {
    "eta": 0.3,
    "max_depth": 6,
    "min_child_weight": 1,
    
    "objective": "binary:logistic",
    "nthread": 8,
    
    "seed": 1,
    "verbosity": 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=10)

In [None]:
y_pred = model.predict(dval)

In [None]:
roc_auc_score(y_val, y_pred)

In [None]:
# create a watchlist so that xgboost evaluates on both the training and validation data
watchlist = [(dtrain, "train"), (dval, "val")]

In [None]:
# `%%capture` is a built-in IPython cell magic, so it does not need to be defined
# (and `def %%capture(...)` is not valid Python). Placed on the first line of a cell,
# `%%capture output` stores everything the cell prints in `output`.


In [None]:
%%capture output
# retrain the xgboost model; the cell magic must be the first line of the cell
xgb_params = {
    "eta": 0.3,
    "max_depth": 6,
    "min_child_weight": 1,
    
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "nthread": 8,
    
    "seed": 1,
    "verbosity": 1,
}

model = xgb.train(xgb_params, dtrain,
                  evals=watchlist,
                  verbose_eval=5,
                  num_boost_round=200)

In [None]:
s = output.stdout
print(s)

In [None]:
# parse the captured output into iteration number, train AUC and validation AUC
def parse_xgb_output(output):
    results = []
    
    for line in output.stdout.strip().split("\n"):
        it_line, train_line, val_line = line.split("\t")
        
        it = int(it_line.strip("[]"))
        train = float(train_line.split(":")[1])
        val = float(val_line.split(":")[1])
        
        # store the parsed float, not the raw "train-auc:..." string
        results.append((it, train, val))
        
    columns = ["num_iter", "train_auc", "val_auc"]
    df_results = pd.DataFrame(results, columns=columns)
    return df_results
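
To see what the parser works on, here is a stand-alone sketch that parses a few lines in the same tab-separated layout, this time with a regular expression so that lines that do not contain metrics are skipped instead of raising an error. The sample lines and their AUC values are made up for illustration, and `parse_lines` is a hypothetical variant of the function above.

```python
import re

# Illustrative lines in the layout the parser above expects;
# the AUC values are made up and are not results from this notebook.
sample_stdout = (
    "[0]\ttrain-auc:0.86\tval-auc:0.77\n"
    "[5]\ttrain-auc:0.93\tval-auc:0.80\n"
    "[10]\ttrain-auc:0.95\tval-auc:0.81\n"
)

LINE_PATTERN = re.compile(r"\[(\d+)\]\s+train-auc:([\d.]+)\s+val-auc:([\d.]+)")


def parse_lines(stdout_text):
    """Return a list of (iteration, train_auc, val_auc) tuples."""
    rows = []
    for line in stdout_text.strip().split("\n"):
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip warnings or other non-metric lines
        it, train, val = match.groups()
        rows.append((int(it), float(train), float(val)))
    return rows


parsed_rows = parse_lines(sample_stdout)
best_iter, _, best_val = max(parsed_rows, key=lambda r: r[2])
print(parsed_rows)
print("best validation AUC", best_val, "at iteration", best_iter)
```

The names `train-auc` and `val-auc` come from the labels in the watchlist and the `eval_metric` setting, so the pattern has to change if either of them does.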

In [None]:
df_score = parse_xgb_output(output)

In [None]:
# plot the score dataframe
plt.plot(df_score.num_iter, df_score.train_auc, label="train")
plt.plot(df_score.num_iter, df_score.val_auc, label="val")
plt.legend()