In [1]:
!git clone https://github.com/DadeOrsu/dm_project24_group_6.git

Cloning into 'dm_project24_group_6'...
remote: Enumerating objects: 1269, done.
remote: Counting objects: 100% (298/298), done.
remote: Compressing objects: 100% (193/193), done.
remote: Total 1269 (delta 195), reused 178 (delta 93), pack-reused 971 (from 1)
Receiving objects: 100% (1269/1269), 51.43 MiB | 11.49 MiB/s, done.
Resolving deltas: 100% (844/844), done.
Updating files: 100% (39/39), done.


In [2]:
cd dm_project24_group_6/src/task4_machine_learning/

/content/dm_project24_group_6/src/task4_machine_learning


# Classification with decision trees

In this notebook, we will use decision trees to classify the data. We will use the `DecisionTreeClassifier` class from the `sklearn.tree` module. Decision trees are a popular method for various machine learning tasks. They are easy to understand and interpret, and they are often used as a baseline for more complex models.

We start by loading the data and preparing the train set and the test set.

In [3]:
import pandas as pd
from os import path
import numpy as np
from preprocessing import get_train_test_data

X_train, y_train, X_test, y_test, columns_to_keep = get_train_test_data()

In [4]:
from sklearn.metrics import classification_report


def report_scores(test_label, test_pred):
    """Print per-class precision, recall and F1 for the two classes."""
    print(classification_report(test_label,
                                test_pred,
                                target_names=['0', '1']))


With the data set up, we first take a look at the training data before deciding which configuration works best.

In [5]:
X_train.head()

Unnamed: 0,bmi,career_points,career_duration(days),debut_year,difficulty_score,competitive_age,is_tarmac,climbing_efficiency,startlist_quality,avg_pos
0,23.765432,0.0,0.0,1977.0,0.635375,22,True,0.006796,1241,0.0
1,20.897959,0.0,0.0,1974.0,0.635375,27,True,0.006796,1241,0.0
2,22.790329,0.0,0.0,1977.0,0.635375,24,True,0.006796,1241,0.0
3,21.46915,0.0,0.0,1970.0,0.635375,30,True,0.006796,1241,0.0
4,21.295295,0.0,0.0,1977.0,0.635375,27,True,0.006796,1241,0.0


We proceed in the following steps:
1. We define the hyperparameters of the model so that we can tune them later with a grid search.
2. We split the training data into a training set (80%) and a validation set (20%).
3. The code iterates through a parameter grid to find the best hyperparameters for the model. The result of each combination of parameters is stored in the `params_tested` list, so that it can be analyzed later.
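
To get a sense of how expensive the search is, the number of combinations can be counted before any model is trained. The sketch below is a standard-library stand-in for `ParameterGrid` built on `itertools.product`; the helper name `enumerate_grid` is hypothetical, and the value lists are copied from the grid defined in the following cell.

```python
from itertools import product

# Value lists copied from the hyperparameter grid used in this notebook.
hyper_params = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'class_weight': ['balanced'],
}


def enumerate_grid(grid):
    """Return every combination of the grid as a list of dicts.

    Keys are visited in alphabetical order (an assumption made here so that
    the enumeration is deterministic).
    """
    keys = sorted(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combinations = enumerate_grid(hyper_params)

# 2 * 2 * 3 * 3 * 3 * 1 combinations, i.e. one tree to fit per combination
print(len(combinations))
print(combinations[0])
```

Each combination costs one full tree fit on the training split, so adding a value to any list multiplies the total run time accordingly.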

In [6]:
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

NUM_FOLDS = 5
RANDOM_SEED = 42

# Definition of the hyperparameters grid
hyper_params = {
    'criterion': ['gini', 'entropy'],   # Try different impurity criteria
    'splitter': ['best', 'random'],     # Try different splitting strategies
    'max_depth': [5, 10, 15],           # Max depth of the tree
    'min_samples_split': [2, 10, 20],   # Min samples required to split a node
    'min_samples_leaf': [1, 5, 10],     # Min samples required at each leaf node
    'class_weight': ['balanced']        # Weight classes inversely to their frequency
}

grid_params = ParameterGrid(hyper_params)

X_train_set, X_val_set, Y_train_set, Y_val_set = train_test_split(
    X_train,y_train,
    test_size=0.2,
    stratify=y_train,
    random_state=RANDOM_SEED,
    shuffle=True
)

params_tested = list()

for comb in grid_params:
    # Fixed seed so that runs with splitter='random' are reproducible
    dt = DecisionTreeClassifier(**comb, random_state=RANDOM_SEED)
    dt.fit(X_train_set, Y_train_set)
    Y_pred_train_set = dt.predict(X_train_set)
    Y_pred_val_set = dt.predict(X_val_set)
    train_f_score = f1_score(Y_train_set, Y_pred_train_set, average='macro')
    val_f_score = f1_score(Y_val_set, Y_pred_val_set, average='macro')
    # Build a new dict with the scores instead of mutating `comb` in place
    new_comb = comb | {
        'train_f_score': train_f_score,
        'val_f_score': val_f_score
    }
    print(new_comb)
    report_scores(Y_val_set, Y_pred_val_set)
    params_tested.append(new_comb)

{'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best', 'train_f_score': 0.6259963029987817, 'val_f_score': 0.6212178132155295}
              precision    recall  f1-score   support

           0       0.90      0.76      0.82     92129
           1       0.33      0.57      0.42     18763

    accuracy                           0.73    110892
   macro avg       0.61      0.67      0.62    110892
weighted avg       0.80      0.73      0.76    110892

{'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random', 'train_f_score': 0.6084994740171923, 'val_f_score': 0.6038455756897013}
              precision    recall  f1-score   support

           0       0.88      0.79      0.83     92129
           1       0.31      0.47      0.38     18763

    accuracy                           0.73    110892
   macro avg       0.60      0.63      0.60
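
The `val_f_score` stored for each combination is the macro-averaged F1, that is, the unweighted mean of the per-class F1 scores. As a rough check, the sketch below recomputes it from the rounded precision and recall values printed in the first report above. The helper `f1` is written here for illustration and is not the scikit-learn function, and because the inputs are rounded the result only matches the stored score to about two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) per class, copied from the first report above.
per_class = {'0': (0.90, 0.76), '1': (0.33, 0.57)}

f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}

# Macro average: every class counts equally, regardless of its support.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"class {label}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

Because both classes contribute equally, a low F1 on the minority class `1` pulls the macro score down even though class `0` dominates the data.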

Since the hyperparameter search is computationally expensive, we store the results collected in the `params_tested` list in a CSV file. This way, we can analyze them later without re-running the search.

In [7]:
params_df = pd.DataFrame(params_tested)

# Persist the raw results; they are sorted by validation score when read back
params_df.to_csv('params_dt/test_f1_averaged.csv')

In [8]:
pd.read_csv('params_dt/test_f1_averaged.csv').sort_values(by='val_f_score',ascending=False).head(10)

Unnamed: 0.1,Unnamed: 0,class_weight,criterion,max_depth,min_samples_leaf,min_samples_split,splitter,train_f_score,val_f_score
90,90,balanced,entropy,15,1,2,best,0.687762,0.628812
92,92,balanced,entropy,15,1,10,best,0.684247,0.628353
94,94,balanced,entropy,15,1,20,best,0.679707,0.627972
104,104,balanced,entropy,15,10,10,best,0.675909,0.627502
106,106,balanced,entropy,15,10,20,best,0.675895,0.627477
51,51,balanced,gini,15,10,10,random,0.646557,0.627469
102,102,balanced,entropy,15,10,2,best,0.676097,0.627224
100,100,balanced,entropy,15,5,20,best,0.67706,0.626789
96,96,balanced,entropy,15,5,2,best,0.680367,0.626788
98,98,balanced,entropy,15,5,10,best,0.680263,0.626748
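
One possible way to read this table is to look at the gap between training and validation scores, not only at the validation score itself. The sketch below uses the first six rows copied from the table above; the choice of the gap as an overfitting indicator is an assumption of this example, not a criterion used elsewhere in the notebook.

```python
# (row id, train macro F1, validation macro F1) copied from the table above.
rows = [
    (90, 0.687762, 0.628812),
    (92, 0.684247, 0.628353),
    (94, 0.679707, 0.627972),
    (104, 0.675909, 0.627502),
    (106, 0.675895, 0.627477),
    (51, 0.646557, 0.627469),
]

# Train-validation gap per row: a larger gap suggests more overfitting.
gaps = {row_id: round(train - val, 6) for row_id, train, val in rows}

for row_id, gap in gaps.items():
    print(f"row {row_id}: gap = {gap:.4f}")

smallest_gap_row = min(gaps, key=gaps.get)
print("smallest gap:", smallest_gap_row)
```

Among these rows the validation scores differ only in the third decimal, while the gaps differ more noticeably, so the ranking by validation score alone should be read with some caution.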


Finally, we take the combination with the highest validation macro F1 (row 90 in the table above), retrain the model on the entire training set, and evaluate it on the test set.

In [10]:
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters of the top-ranked combination on the validation set
best_model = DecisionTreeClassifier(
    class_weight='balanced',
    criterion='entropy',
    max_depth=15,
    min_samples_leaf=1,
    min_samples_split=2,
    splitter='best',
)
best_model.fit(X_train, y_train)
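
The `class_weight='balanced'` setting weights each class by `n_samples / (n_classes * class_count)`, which is the formula given in the scikit-learn documentation. The sketch below reproduces that formula with the standard library only; the helper `balanced_class_weights` is hypothetical, and the class counts are the validation supports from the reports above, used here only as an illustration of the class proportions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Illustrative labels with the validation supports: 92129 zeros, 18763 ones.
labels = [0] * 92129 + [1] * 18763
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"class {label}: weight = {weight:.3f}")
```

With these weights the two classes carry the same total weight, so the minority class `1` counts roughly five times as much per sample as class `0`.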

In [11]:
test_pred_dt = best_model.predict(X_test)

In [12]:
report_scores(y_test, test_pred_dt)

              precision    recall  f1-score   support

           0       0.90      0.75      0.82     30219
           1       0.27      0.54      0.36      5187

    accuracy                           0.72     35406
   macro avg       0.59      0.65      0.59     35406
weighted avg       0.81      0.72      0.76     35406
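
To put these numbers in context, it can help to compare them with a trivial baseline that always predicts the majority class `0`. The sketch below derives that baseline's accuracy and macro F1 from the class supports in the report above; it is an illustrative calculation, not a result produced by the notebook.

```python
# Class supports copied from the test report above.
support = {'0': 30219, '1': 5187}
total = sum(support.values())

# Baseline: every sample is predicted as the majority class '0'.
baseline_accuracy = support['0'] / total

precision_0 = support['0'] / total   # all predictions are '0'
recall_0 = 1.0                       # every true '0' is recovered
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0                           # class '1' is never predicted

baseline_macro_f1 = (f1_0 + f1_1) / 2

print(f"baseline accuracy = {baseline_accuracy:.2f}")
print(f"baseline macro F1 = {baseline_macro_f1:.2f}")
```

The baseline reaches a higher accuracy than the tree (about 0.85 against 0.72) but a lower macro F1 (about 0.46 against 0.59), which illustrates why macro F1 is the more informative score on this imbalanced data.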

