# Bagging and Random Forests

In this notebook, we will use the Random Forest algorithm to predict whether a cyclist will finish in the top 20 in the next race.

We start by loading the data and preparing the train set and the test set.

In [1]:
import pandas as pd
from os import path
import numpy as np

races_final_path = path.join('..','dataset', 'races_cleaned.csv')
cyclists_final_path = path.join('..','dataset', 'cyclists_cleaned.csv')


cyclists_data = pd.read_csv(cyclists_final_path)
races_data = pd.read_csv(races_final_path)

In [2]:

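# Rename the cyclists' 'name' column to 'cyclist'
cyclists_data.rename(columns={'name': 'cyclist'}, inplace=True)

# Join each race result with the data of the cyclist who rode it
merged_data = races_data.merge(cyclists_data, left_on='cyclist', right_on='_url', how='inner')

# Binary target: 1 if the cyclist finished in the top 20, 0 otherwise
merged_data['top_20'] = (merged_data['position'] <= 20).astype(int)

# Parse dates so that the data can be split chronologically
merged_data['date'] = pd.to_datetime(merged_data['date'])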

columns_to_keep = [
    'points', 'length', 'profile', 'startlist_quality', 'cyclist_age',
    'is_tarmac', 'delta', 'top_20', 'weight', 'height'
]


# Chronological split: races before 2022 for training, races from 2022 onwards for testing
train_set = merged_data[merged_data['date'] < '2022-01-01']
test_set = merged_data[merged_data['date'] >= '2022-01-01']

train_set = train_set[columns_to_keep]
test_set = test_set[columns_to_keep]

X_train = train_set.drop(columns=['top_20'])
y_train = train_set['top_20']


X_test = test_set.drop(columns=['top_20'])
y_test = test_set['top_20']

The data is now set up. Next, we need to evaluate candidate models on the training data to see which approach works best.

In [3]:
X_train.shape

(554459, 9)

In [11]:
from sklearn.metrics import classification_report

def report_scores(test_label, test_pred):
    """Print precision, recall and F1 score for both classes."""
    print(classification_report(test_label, test_pred, target_names=['0', '1']))

The dataset is large: k-fold cross-validation would give the most reliable estimate, but it is very time-consuming, so a holdout split is more practical in this case.
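To make the cost argument concrete, the sketch below counts how many model fits each strategy needs for the hyperparameter grid used in the search cell, and then runs both strategies once on a small synthetic dataset. The synthetic data (`X_demo`, `y_demo`) and the reduced forest size are assumptions made only so the example runs quickly; they stand in for the real training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    ParameterGrid, StratifiedKFold, cross_val_score, train_test_split
)

# Same grid as in the search cell, used here only to count the number of fits
hyper_params = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_features': ['sqrt', 2, 3],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5],
    'bootstrap': [True],
    'class_weight': [None, 'balanced'],
}
n_combinations = len(ParameterGrid(hyper_params))
fits_holdout = n_combinations       # one fit per combination
fits_5fold = n_combinations * 5     # one fit per combination and per fold
print(f"holdout: {fits_holdout} fits, 5-fold: {fits_5fold} fits")

# Synthetic, imbalanced stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, weights=[0.83, 0.17], random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Holdout: a single stratified 80/20 split, one fit
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
holdout_f1 = f1_score(y_val, model.fit(X_tr, y_tr).predict(X_val), average='weighted')

# Stratified 5-fold: five fits, one score per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1_weighted')

print(f"holdout weighted F1: {holdout_f1:.3f}")
print(f"5-fold weighted F1: mean {cv_f1.mean():.3f}, std {cv_f1.std():.3f}")
```

The k-fold estimate also reports its spread across folds, which the holdout cannot provide. With more than half a million training rows, multiplying every fit by five is the price of that extra information.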

In [13]:
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.ensemble import RandomForestClassifier

RANDOM_SEED=42
NUM_FOLDS=5

hyper_params={
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy','log_loss'],
    'max_features': ['sqrt', 2, 3],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5],
    'bootstrap': [True],
    'class_weight': [None, 'balanced']
}

# hyper params grid
grid_params=ParameterGrid(hyper_params)

X_train_set, X_val_set, Y_train_set, Y_val_set = train_test_split(
    X_train,y_train,
    test_size=0.2,
    stratify=y_train,
    random_state=RANDOM_SEED,
    shuffle=True
)

params_tested=list()

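for comb in grid_params:
    rfc = RandomForestClassifier(**comb, n_jobs=-1)
    rfc = rfc.fit(X_train_set, Y_train_set)

    Y_pred_train_set = rfc.predict(X_train_set)
    Y_pred_val_set = rfc.predict(X_val_set)

    # Score on the training split and on the held-out validation split.
    # (The recorded output below was produced with both scores computed on the
    # validation split, which is why the two score columns are identical there.)
    train_f_score = f1_score(Y_train_set, Y_pred_train_set, average='weighted')
    val_f_score = f1_score(Y_val_set, Y_pred_val_set, average='weighted')

    # Store the scores alongside the hyperparameters
    new_comb = comb | {
        'train_f1_score': train_f_score,
        'val_f1_score': val_f_score,
    }
    print(new_comb)
    report_scores(Y_val_set, Y_pred_val_set)
    params_tested.append(new_comb)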

params_df=pd.DataFrame(params_tested)


params_df.sort_values(by='val_f1_score')

{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'train_f1_score': 0.833588482779484, 'val_f1_score': 0.833588482779484}
              precision    recall  f1-score   support

           0       0.87      0.98      0.92     92129
           1       0.78      0.26      0.40     18763

    accuracy                           0.86    110892
   macro avg       0.82      0.62      0.66    110892
weighted avg       0.85      0.86      0.83    110892

{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100, 'train_f1_score': 0.8339691592193942, 'val_f1_score': 0.8339691592193942}
              precision    recall  f1-score   support

           0       0.87      0.99      0.92     92129
           1       0.79      0.26      0.40     18763

    accuracy        

| | bootstrap | class_weight | criterion | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_f1_score | val_f1_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 1266 | True | balanced | log_loss | 10.0 | 2 | 5 | 10 | 50 | 0.754712 | 0.754712 |
| 1252 | True | balanced | log_loss | 10.0 | 2 | 3 | 2 | 100 | 0.755688 | 0.755688 |
| 1246 | True | balanced | log_loss | 10.0 | 2 | 1 | 5 | 100 | 0.758160 | 0.758160 |
| 765 | True | balanced | gini | 10.0 | 2 | 3 | 2 | 50 | 0.758360 | 0.758360 |
| 1005 | True | balanced | entropy | 10.0 | 2 | 1 | 10 | 50 | 0.758548 | 0.758548 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1436 | True | balanced | log_loss | NaN | 3 | 1 | 5 | 200 | 0.905578 | 0.905578 |
| 707 | True | None | log_loss | NaN | 3 | 1 | 5 | 200 | 0.905823 | 0.905823 |
| 1382 | True | balanced | log_loss | NaN | sqrt | 1 | 5 | 200 | 0.905890 | 0.905890 |
| 1139 | True | balanced | entropy | NaN | sqrt | 1 | 5 | 200 | 0.905952 | 0.905952 |
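The following cells use `best_params` and `best_model`, which are not defined anywhere in the notebook as shown. One possible way to obtain them is sketched below: pick the combination with the highest validation F1 and refit a forest with it. This is only a sketch under stated assumptions, not necessarily how the reported test results were produced. The synthetic data, the tiny grid `small_grid` and the choice to refit on all training data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, weights=[0.83, 0.17], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Hypothetical reduced grid, kept small so the example runs quickly
small_grid = ParameterGrid({'n_estimators': [20, 40], 'max_depth': [5, None]})

combos = list(small_grid)   # original parameter dicts, with types intact
rows = []
for comb in combos:
    rfc = RandomForestClassifier(**comb, random_state=42).fit(X_tr, y_tr)
    val_f1 = f1_score(y_val, rfc.predict(X_val), average='weighted')
    rows.append({**comb, 'val_f1_score': val_f1})
results = pd.DataFrame(rows)

# Index of the best validation score, used to look up the original dict
best_idx = results['val_f1_score'].idxmax()
best_params = combos[best_idx]
print(best_params)

# Refit on all available training data with the selected hyperparameters
best_model = RandomForestClassifier(**best_params, random_state=42).fit(X_demo, y_demo)
```

The parameters are read back from the list of dictionaries rather than from the DataFrame row on purpose. In the table, `max_depth=None` has become a missing value and `10` has become `10.0`, and such values would be rejected when passed back to `RandomForestClassifier`.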


In [None]:
print(best_params)

In [None]:
#predict on the test set
test_pred_rf = best_model.predict(X_test)

In [None]:


#compute the performance of the model
report_scores(y_test, test_pred_rf)

              precision    recall  f1-score   support

           0       0.94      0.88      0.91     30219
           1       0.47      0.65      0.55      5187

    accuracy                           0.84     35406
   macro avg       0.70      0.76      0.73     35406
weighted avg       0.87      0.84      0.85     35406



Clearly, a simple model is not enough, although it manages to discriminate the first class (class 0) well. A nice alternative could be bagging. We first need a base learner: we can use a RandomForest and see how it performs with bagging.

From theory, we know that bagging works well when the errors of the models inside the ensemble are independent, as explained in section 7.11 [here](https://www.deeplearningbook.org/contents/regularization.html). Bootstrapping also brings a large benefit: by training each model on a different resample of the data, it makes their errors less correlated, which allows for more robust models.
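The result from that section can be checked numerically. If each of the k models makes an error with variance v, and any two models' errors have covariance c, the expected squared error of the averaged ensemble is v/k + (k-1)c/k. The simulation below is a small sketch of this: it draws zero-mean Gaussian errors (the Gaussian shape and the values of `k`, `v` and `c` are assumptions chosen for illustration) and compares the empirical value with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10             # number of models in the ensemble
v = 1.0            # variance of each model's error
n_draws = 100_000  # number of simulated error vectors

def ensemble_mse(c):
    """Empirical E[(average of the k errors)^2] for pairwise covariance c."""
    cov = np.full((k, k), c)
    np.fill_diagonal(cov, v)
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n_draws)
    return np.mean(errors.mean(axis=1) ** 2)

def theoretical_mse(c):
    """Expected squared error of the ensemble average: v/k + (k-1)/k * c."""
    return v / k + (k - 1) / k * c

for c in (0.0, 0.5, 0.9):
    print(f"c={c}: simulated {ensemble_mse(c):.4f}, theory {theoretical_mse(c):.4f}")
```

With independent errors (c = 0) the ensemble error shrinks as v/k, whereas with strongly correlated errors averaging brings almost no gain. This is why decorrelating the models through bootstrapping matters.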

Given the high dimensionality of the data, random forests seem to be the most sensible choice, and bootstrapping should be enough to improve generalization.
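As a rough sketch of the idea, scikit-learn's `BaggingClassifier` can wrap a `RandomForestClassifier` as its base learner, so that each forest is trained on a different bootstrap sample. The synthetic data and all hyperparameter values below are assumptions chosen so that the example runs quickly; they are not tuned values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=9, weights=[0.83, 0.17], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Base learner: a small random forest (illustrative hyperparameters)
base_learner = RandomForestClassifier(
    n_estimators=20, max_depth=10, class_weight='balanced', random_state=42
)

# The base learner is passed positionally because the keyword name of this
# argument differs between scikit-learn versions
bagged_rf = BaggingClassifier(
    base_learner,
    n_estimators=5,     # number of forests in the bag
    max_samples=0.8,    # each forest sees a bootstrap sample of 80% of the rows
    bootstrap=True,     # sample rows with replacement
    random_state=42,
)
bagged_rf.fit(X_tr, y_tr)

test_pred_bag = bagged_rf.predict(X_te)
print(classification_report(y_te, test_pred_bag, target_names=['0', '1']))
```

Since a random forest already bootstraps the rows for each of its trees, the outer bagging layer adds a second level of resampling. Whether it improves on a single, larger forest has to be checked on the validation set.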