# Bagging and Random Forests

In this notebook, we use the Random Forest algorithm to predict whether a cyclist will finish in the top 20 of their next race.

We start by loading the data and preparing the train set and the test set.

In [1]:
import pandas as pd
from os import path
import numpy as np

races_final_path = path.join('..','dataset', 'engineered_races.csv')
cyclists_final_path = path.join('..','dataset', 'cyclists_final_enhanced.csv')


cyclists_data = pd.read_csv(cyclists_final_path)
races_data = pd.read_csv(races_final_path)

In [2]:

cyclists_data.rename(columns={'name': 'cyclist'}, inplace=True)


merged_data = races_data.merge(cyclists_data, left_on='cyclist', right_on='_url', how='inner')

merged_data['top_20'] = merged_data['position'].apply(lambda x: 1 if x <= 20 else 0)


merged_data['date'] = pd.to_datetime(merged_data['date'])

columns_to_keep = [

    'bmi','career_points','career_duration(days)','debut_year', # cyclists features
    'points','difficulty_score','competitive_age','climbing_efficiency', # races features
    'top_20'# target feature
]

train_set = merged_data[merged_data['date'] < '2022-01-01']
test_set = merged_data[merged_data['date'] >= '2022-01-01']

train_set = train_set[columns_to_keep]
test_set = test_set[columns_to_keep]

X_train = train_set.drop(columns=['top_20'])
y_train = train_set['top_20']


X_test = test_set.drop(columns=['top_20'])
y_test = test_set['top_20']

The data is now set up. Next, we need to evaluate candidate models on the training data to see which approach works best.

In [3]:
X_train.shape

(554459, 8)

In [4]:
from sklearn.metrics import classification_report
def report_scores(test_label, test_pred):
    print(classification_report(test_label, 
                            test_pred, 
                            target_names=['0', '1']))

The dataset is large: k-fold cross-validation would give the most reliable estimate but is very time-consuming, so a holdout split is more practical in this case.
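To make the cost argument concrete, the sketch below counts how many model fits each strategy would need for the search space used in the next cell. The 5-fold figure is an assumption taken from the `NUM_FOLDS` constant; actual run time also depends on the size of each forest, which this count ignores.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the grid used in this notebook.
hyper_params = {
    'n_estimators': [50, 100, 200, 500],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 2, 3, 4, 5],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 3, 5, 10],
    'bootstrap': [True],
    'class_weight': ['balanced'],
}

NUM_FOLDS = 5  # assumed number of folds for the comparison

n_combinations = len(ParameterGrid(hyper_params))
holdout_fits = n_combinations            # one fit per combination
kfold_fits = n_combinations * NUM_FOLDS  # one fit per combination per fold

print(f"combinations: {n_combinations}")
print(f"fits with holdout: {holdout_fits}")
print(f"fits with {NUM_FOLDS}-fold CV: {kfold_fits}")
```

With more than half a million training rows, multiplying the number of fits by five is what makes the holdout split the pragmatic choice here.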

In [None]:
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold , GridSearchCV, train_test_split,ParameterGrid
from sklearn.ensemble import RandomForestClassifier

RANDOM_SEED=42
NUM_FOLDS=5

# class 1 is heavily under-represented, so the search uses class weights that compensate for the imbalance

hyper_params={
    'n_estimators': [50, 100, 200,500],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 2, 3,4,5],
    'max_depth': [10, 20,30, None],
    'min_samples_split': [2, 5, 10,20],
    'min_samples_leaf': [1, 3, 5,10],
    'bootstrap': [True],
    'class_weight': ['balanced']# heavy class imbalance -> penalize misclassification on minority class
}

# hyper params grid
grid_params=ParameterGrid(hyper_params)

X_train_set, X_val_set, Y_train_set, Y_val_set = train_test_split(
    X_train,y_train,
    test_size=0.2,
    stratify=y_train,
    random_state=RANDOM_SEED,
    shuffle=True
)

params_tested=list()

for comb in grid_params:
    rfc=RandomForestClassifier(**comb,n_jobs=-1,random_state=RANDOM_SEED)
    rfc=rfc.fit(X_train_set,Y_train_set)

    Y_pred_train_set=rfc.predict(X_train_set)
    Y_pred_val_set=rfc.predict(X_val_set)
    train_f_score=f1_score(Y_train_set,Y_pred_train_set,average='macro')
    val_f_score=f1_score(Y_val_set,Y_pred_val_set,average='macro')
    # store the scores alongside the hyperparameters of this combination
    new_comb={
        **comb,
        'train_f1_score':train_f_score,
        'val_f1_score':val_f_score,
    }
    print(new_comb)
    report_scores(Y_val_set,Y_pred_val_set)
    params_tested.append(new_comb)

params_df=pd.DataFrame(params_tested)


# best combinations first
params_df.sort_values(by='val_f1_score',ascending=False)

{'bootstrap': True, 'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'train_f1_score': 0.6439403162692163, 'val_f1_score': 0.6332770211896618}
              precision    recall  f1-score   support

           0       0.91      0.76      0.83     92129
           1       0.34      0.61      0.44     18763

    accuracy                           0.74    110892
   macro avg       0.62      0.69      0.63    110892
weighted avg       0.81      0.74      0.76    110892

{'bootstrap': True, 'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100, 'train_f1_score': 0.6415093791495892, 'val_f1_score': 0.6311789014835959}
              precision    recall  f1-score   support

           0       0.91      0.76      0.83     92129
           1       0.34      0.61      0.44     18763

    ac

KeyboardInterrupt: 

In [None]:
import json

params_df=pd.DataFrame(params_tested)


params_df.sort_values(by='val_f1_score',ascending=False)

params_df.to_csv('params_rf/test_f1_averaged.csv')

In [12]:
pd.read_json('params_rf/test_f1_averaged.json')

| Unnamed: 0 | bootstrap | class_weight | criterion | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_f1_score | val_f1_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | balanced | gini | 10.0 | sqrt | 1 | 2 | 50 | 0.643940 | 0.633277 |
| 1 | True | balanced | gini | 10.0 | sqrt | 1 | 2 | 100 | 0.641509 | 0.631179 |
| 2 | True | balanced | gini | 10.0 | sqrt | 1 | 2 | 200 | 0.644905 | 0.633983 |
| 3 | True | balanced | gini | 10.0 | sqrt | 1 | 2 | 500 | 0.644478 | 0.633727 |
| 4 | True | balanced | gini | 10.0 | sqrt | 1 | 5 | 50 | 0.643003 | 0.631634 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3351 | True | {'0': 1, '1': 9} | gini | 30.0 | 3 | 3 | 5 | 500 | 0.923057 | 0.730526 |
| 3352 | True | {'0': 1, '1': 9} | gini | 30.0 | 3 | 3 | 10 | 50 | 0.887271 | 0.725352 |
| 3353 | True | {'0': 1, '1': 9} | gini | 30.0 | 3 | 3 | 10 | 100 | 0.890735 | 0.727283 |
| 3354 | True | {'0': 1, '1': 9} | gini | 30.0 | 3 | 3 | 10 | 200 | 0.892685 | 0.726863 |
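The results above mix two `class_weight` settings: `'balanced'` and an explicit 1:9 mapping. As a rough illustration of how they relate, the sketch below computes the weights that `'balanced'` produces on hypothetical labels with about the same imbalance as the validation split (roughly 83% class 0, 17% class 1). The label vector is made up for the example and is not the real dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with roughly the imbalance seen in the validation report;
# these are illustrative values, not the notebook's data.
y = np.array([0] * 83 + [1] * 17)
classes = np.array([0, 1])

# Weights that class_weight='balanced' assigns to each class.
balanced = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# 'balanced' is defined as n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

# Relative penalty of a minority-class error versus a majority-class error.
ratio = balanced[1] / balanced[0]

print(dict(zip(classes.tolist(), balanced.round(3).tolist())))
print(f"minority/majority weight ratio: {ratio:.2f}")
```

On labels like these, `'balanced'` weights the minority class a little under five times as much as the majority class, so a 1:9 mapping is a noticeably stronger penalty on minority-class mistakes.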


In [None]:
#metrics computed on the test set
from sklearn.metrics import classification_report
def report_scores(test_label, test_pred):
    print(classification_report(test_label, 
                            test_pred, 
                            target_names=['0', '1']))

In [None]:


# compute the performance of the model on the test set
# (test_pred_rf holds the random forest's predictions for X_test)
report_scores(y_test, test_pred_rf)

              precision    recall  f1-score   support

           0       0.94      0.88      0.91     30219
           1       0.47      0.65      0.55      5187

    accuracy                           0.84     35406
   macro avg       0.70      0.76      0.73     35406
weighted avg       0.87      0.84      0.85     35406
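The report above is computed from `test_pred_rf`, which the cells shown do not define. One plausible way to obtain such predictions is sketched below: pick the row with the best validation macro F1, refit on the training data, and predict on the test set. This is only a sketch under stated assumptions: the data is synthetic (`make_classification`), the two-row `params_df` is hypothetical, and only three hyperparameters are carried over for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the train/test sets (8 features, about 17% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.83, 0.17], random_state=RANDOM_SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED)

# Hypothetical search results laid out like params_df.
params_df = pd.DataFrame([
    {'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 1, 'val_f1_score': 0.63},
    {'n_estimators': 100, 'max_depth': 30, 'min_samples_leaf': 3, 'val_f1_score': 0.73},
])

# Take the row with the highest validation macro F1.
best_row = params_df.sort_values(by='val_f1_score', ascending=False).iloc[0]
best_params = {
    'n_estimators': int(best_row['n_estimators']),
    'max_depth': int(best_row['max_depth']),
    'min_samples_leaf': int(best_row['min_samples_leaf']),
}

# Refit with the selected hyperparameters and predict on the held-out test set.
rf = RandomForestClassifier(**best_params, class_weight='balanced',
                            random_state=RANDOM_SEED)
rf.fit(X_train, y_train)
test_pred_rf = rf.predict(X_test)

print(classification_report(y_test, test_pred_rf, target_names=['0', '1']))
```

Selecting on the validation score and touching the test set only once keeps the test report an honest estimate of generalization.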



Clearly, a simple model is not enough, although it discriminates class 0 well (F1 of 0.91, against 0.55 for class 1). A natural alternative is bagging. For that we first need a base learner: we can use a random forest and see how it performs when bagged.

From theory, we know that bagging works well when the errors of the models in the ensemble are independent, as discussed in section 7.11 [here](https://www.deeplearningbook.org/contents/regularization.html). Bootstrapping helps substantially here: training each model on a different resample of the data makes their errors less correlated, which yields a more robust ensemble.
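The independence argument can be written out in a few lines, following the setting of the section cited above. Assume $k$ models whose errors $\epsilon_i$ have zero mean, variance $\mathbb{E}[\epsilon_i^2] = v$ and pairwise covariance $\mathbb{E}[\epsilon_i \epsilon_j] = c$. The expected squared error of the averaged prediction is then

$$
\mathbb{E}\left[\left(\frac{1}{k}\sum_{i=1}^{k}\epsilon_i\right)^{2}\right]
= \frac{1}{k^{2}}\,\mathbb{E}\left[\sum_{i=1}^{k}\left(\epsilon_i^{2} + \sum_{j \neq i}\epsilon_i\epsilon_j\right)\right]
= \frac{1}{k}\,v + \frac{k-1}{k}\,c .
$$

When the errors are perfectly correlated ($c = v$) the result is $v$ and averaging does not help. When they are uncorrelated ($c = 0$) the error drops to $v/k$, so it shrinks linearly with the ensemble size.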

Given the high dimensionality of the data, random forests seem the most sensible choice, and bootstrapping should be enough to improve generalization.
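As a minimal sketch of the bagging idea described above, the code below wraps a random forest in scikit-learn's `BaggingClassifier` and scores it with macro F1 on a holdout split. The data is synthetic and the hyperparameter values are illustrative assumptions, not tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the training data (8 features, about 17% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.83, 0.17], random_state=RANDOM_SEED)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED)

# Base learner: a small random forest (illustrative settings).
base_rf = RandomForestClassifier(n_estimators=20, max_depth=10,
                                 class_weight='balanced',
                                 random_state=RANDOM_SEED)

# Each of the 5 forests is trained on its own bootstrap resample of X_tr.
bagged_rf = BaggingClassifier(base_rf, n_estimators=5, bootstrap=True,
                              random_state=RANDOM_SEED)
bagged_rf.fit(X_tr, y_tr)

val_pred = bagged_rf.predict(X_val)
val_f1 = f1_score(y_val, val_pred, average='macro')
print(f"validation macro F1: {val_f1:.3f}")
```

A random forest already bootstraps its own trees, so the outer layer of bagging may add little beyond extra averaging. Comparing the bagged and plain forests on the same validation split is the way to check whether it is worth the cost.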