# Classification with decision trees

In this notebook, we use decision trees to classify the data, relying on the `DecisionTreeClassifier` class from the `sklearn.tree` module. Decision trees are a popular method for classification and regression tasks. They are easy to understand and interpret, and they are often used as a baseline for more complex models.
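To make the interpretability point concrete, here is a minimal sketch on synthetic data (not the dataset used in this notebook). The names `X_demo`, `y_demo`, `demo_tree` and the feature names `f0`..`f3` are placeholders chosen for illustration. It fits a shallow tree and prints its learned rules as plain text.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: 200 samples, 4 features, 2 classes
# (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)

# A shallow tree keeps the rule set small enough to read
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# export_text renders the fitted tree as nested if/else rules
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Each printed branch is a threshold test on one feature, and each leaf reports the predicted class, which is what makes a small tree easy to audit.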

We start by loading the data and preparing the train set and the test set.

In [7]:
import pandas as pd
from os import path
import numpy as np
from preprocessing import get_train_test_data

X_train, y_train, X_test, y_test, columns_to_keep = get_train_test_data()

In [8]:
from sklearn.metrics import classification_report
def report_scores(test_label, test_pred):
    print(classification_report(test_label,
                            test_pred,
                            target_names=['0', '1']))


The data is now set up. Next, we evaluate candidate models on the training data to see which configuration works best.

In [9]:
X_train.shape

(554459, 10)

We proceed in the following steps:
1. We define a grid of hyperparameter values so that we can tune the model with a grid search.
2. We split the training data into a training set (80%) and a validation set (20%), stratified by label.
3. We iterate over a `ParameterGrid` and fit one tree per combination. The scores of each combination are stored in the `params_tested` list so that they can be analyzed later.
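As a small sketch of what step 3 iterates over, the snippet below builds the same grid values used in this notebook and inspects it. The names `demo_grid`, `n_combinations` and `first_combination` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the notebook's hyperparameter grid
demo_grid = ParameterGrid({
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
})

# One tree is trained per combination: 2 * 2 * 3 * 3 * 3 = 108
n_combinations = len(demo_grid)

# Each item is a plain dict that can be unpacked into the estimator
first_combination = next(iter(demo_grid))

print(n_combinations)
print(first_combination)
```

Because every item is a dict of keyword arguments, it can be passed directly as `DecisionTreeClassifier(**comb)`, and the grid size gives a quick estimate of how many fits the search will cost.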

In [13]:
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

NUM_FOLDS = 5  # not used in the cells shown here: the search uses a single hold-out split
RANDOM_SEED = 42

# Definition of the hyperparameters grid
hyper_params = {
    'criterion': ['gini', 'entropy'],   # Try different impurity criteria
    'splitter': ['best', 'random'],     # Try different splitting strategies
    'max_depth': [5, 10, 15],           # Max depth of the tree
    'min_samples_split': [2, 10, 20],   # Min samples required to split a node
    'min_samples_leaf': [1, 5, 10],     # Min samples required at each leaf node
}

grid_params = ParameterGrid(hyper_params)

X_train_set, X_val_set, Y_train_set, Y_val_set = train_test_split(
    X_train, y_train,
    test_size=0.2,
    stratify=y_train,
    random_state=RANDOM_SEED,
    shuffle=True
)

params_tested = list()

for comb in grid_params:
    # Fit one tree per hyperparameter combination
    dt = DecisionTreeClassifier(**comb)
    dt.fit(X_train_set, Y_train_set)
    Y_pred_train_set = dt.predict(X_train_set)
    Y_pred_val_set = dt.predict(X_val_set)
    # Macro-averaged F1 gives both classes equal weight
    train_f_score = f1_score(Y_train_set, Y_pred_train_set, average='macro')
    val_f_score = f1_score(Y_val_set, Y_pred_val_set, average='macro')
    # Attach the scores to the parameter dict (dict |= requires Python 3.9+).
    # comb is updated in place, so the printed dict includes the scores.
    comb |= {
        'train_f_score': train_f_score,
        'val_f_score': val_f_score
    }
    print(comb)
    report_scores(Y_val_set, Y_pred_val_set)
    params_tested.append(comb)

{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best', 'train_f_score': 0.5529273329088534, 'val_f_score': 0.5491130604450959}
              precision    recall  f1-score   support

           0       0.84      0.99      0.91     92129
           1       0.74      0.11      0.19     18763

    accuracy                           0.84    110892
   macro avg       0.79      0.55      0.55    110892
weighted avg       0.83      0.84      0.79    110892

{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random', 'train_f_score': 0.5557565030131106, 'val_f_score': 0.5537558555544488}
              precision    recall  f1-score   support

           0       0.85      0.99      0.91     92129
           1       0.64      0.12      0.20     18763

    accuracy                           0.84    110892
   macro avg       0.74      0.55      0.55    110892
weighted avg       0.81      0.84      0.79  
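
The `macro avg` F1 in the reports above is the unweighted mean of the per-class F1 scores. For the first report, (0.91 + 0.19) / 2 = 0.55, even though accuracy is 0.84. The sketch below reproduces that relationship on a tiny hand-made example; the labels `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: six negatives, four positives (invented data)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# A classifier that mostly predicts the majority class
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# average=None returns one F1 score per class, in label order
per_class_f1 = f1_score(y_true_demo, y_pred_demo, average=None)

# average='macro' is the plain mean of those per-class scores
macro_f1 = f1_score(y_true_demo, y_pred_demo, average='macro')

accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(per_class_f1, macro_f1, accuracy)
```

Since the minority class counts as much as the majority class in the macro average, a model with poor recall on class `1` scores low on this metric even when its accuracy looks acceptable.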

Since the search for the best hyperparameters is computationally expensive, we store the results contained in the `params_tested` list in a CSV file. This way, we can analyze the results later without having to re-run the search.

In [14]:
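params_df = pd.DataFrame(params_tested)

# sort_values returns a new DataFrame; the result is not assigned here,
# so the CSV below is written in grid order
params_df.sort_values(by='val_f_score', ascending=False)

params_df.to_csv('params_dt/test_f1_averaged.csv')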

In [15]:
pd.read_csv('params_dt/test_f1_averaged.csv')

| Unnamed: 0.1 | Unnamed: 0 | criterion | max_depth | min_samples_leaf | min_samples_split | splitter | train_f_score | val_f_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | gini | 5 | 1 | 2 | best | 0.552927 | 0.549113 |
| 1 | 1 | gini | 5 | 1 | 2 | random | 0.555757 | 0.553756 |
| 2 | 2 | gini | 5 | 1 | 10 | best | 0.552927 | 0.549113 |
| 3 | 3 | gini | 5 | 1 | 10 | random | 0.549858 | 0.547797 |
| 4 | 4 | gini | 5 | 1 | 20 | best | 0.552927 | 0.549113 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 103 | entropy | 15 | 10 | 2 | random | 0.648588 | 0.630046 |
| 104 | 104 | entropy | 15 | 10 | 10 | best | 0.693909 | 0.655314 |
| 105 | 105 | entropy | 15 | 10 | 10 | random | 0.639378 | 0.618496 |
| 106 | 106 | entropy | 15 | 10 | 20 | best | 0.693811 | 0.655371 |
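
Columns named `Unnamed: 0` typically appear when a DataFrame's index is written to CSV and then read back as an ordinary column. The sketch below shows the effect and one way to avoid it, using an in-memory buffer instead of a file. `demo_results` holds two rows copied by hand from the table above, and all variable names are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the results table, trimmed to three columns
demo_results = pd.DataFrame([
    {'criterion': 'gini', 'max_depth': 5, 'val_f_score': 0.549113},
    {'criterion': 'entropy', 'max_depth': 15, 'val_f_score': 0.655314},
])

# Default behaviour: the index is written as an unnamed first column
buffer_with_index = io.StringIO()
demo_results.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

# index=False keeps the round trip free of extra columns
buffer_without_index = io.StringIO()
demo_results.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
without_index = pd.read_csv(buffer_without_index)

# sort_values returns a new DataFrame, so the result must be assigned
ranked = without_index.sort_values(by='val_f_score', ascending=False)

print(list(with_index.columns))
print(list(without_index.columns))
print(ranked)
```

Alternatively, passing `index_col=0` to `pd.read_csv` restores the saved index instead of dropping it.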


Finally, after finding the best hyperparameters, we train the model with the entire training set and evaluate it with the test set.
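
Rather than copying the winning values by hand, one option is to read them from the results table. The sketch below assumes a results DataFrame with the same columns as `params_df`. It uses only three rows visible in the truncated table above, so the combination it selects is not the overall best one used in the next cell. `demo_results`, `param_names` and `demo_model` are illustrative names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Three rows copied from the visible part of the results table
demo_results = pd.DataFrame([
    {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1,
     'min_samples_split': 2, 'splitter': 'best', 'val_f_score': 0.549113},
    {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 10,
     'min_samples_split': 10, 'splitter': 'best', 'val_f_score': 0.655314},
    {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 10,
     'min_samples_split': 20, 'splitter': 'best', 'val_f_score': 0.655371},
])

param_names = ['criterion', 'max_depth', 'min_samples_leaf',
               'min_samples_split', 'splitter']
int_params = ['max_depth', 'min_samples_leaf', 'min_samples_split']

# Row with the highest validation macro F1
best_row = demo_results.loc[demo_results['val_f_score'].idxmax()]

# Keep only the hyperparameter columns and cast integers to plain ints
best_params = {name: best_row[name] for name in param_names}
for name in int_params:
    best_params[name] = int(best_params[name])

demo_model = DecisionTreeClassifier(**best_params)
print(best_params)
```

Selecting on `val_f_score` rather than `train_f_score` matters here, since the deeper trees in the table score noticeably higher on the training split than on the validation split.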

In [24]:
best_model = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=15,
    min_samples_leaf=5,
    min_samples_split=20,
    splitter='best',
)
best_model.fit(X_train, y_train)

In [25]:
test_pred_dt = best_model.predict(X_test)

In [26]:
report_scores(y_test, test_pred_dt)

              precision    recall  f1-score   support

           0       0.87      0.97      0.91     30219
           1       0.41      0.13      0.19      5187

    accuracy                           0.84     35406
   macro avg       0.64      0.55      0.55     35406
weighted avg       0.80      0.84      0.81     35406

