Jake Onkka
https://www.kaggle.com/datasets/anthonytherrien/gladiator-combat-records-and-profiles-dataset

In [1]:
import pandas as pd
import numpy as np
import math
from sklearn.pipeline import Pipeline
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
data = pd.read_csv('gladiator_data.csv')
#there are over 730,000 rows, and the grid search below tries 6 hyperparameter combinations (each one cross-validated, 5 folds by default)
#for the purposes of this assignment, I use a 10% sample of the data for my xs and ys so the runtime is around 1.5 minutes
data = data.sample(frac = .1) 
data.fillna(value=0,inplace=True)

xs = data.drop(columns = ["Survived", "Name"])
ys = data['Survived']
xs = pd.get_dummies(xs)
print(xs)
print(ys)

        Age  Birth Year  Height  Weight  Wins  Losses  Public Favor  \
360044   31         224     148      78     8       1      0.780466   
399960   38         234     169      64    20       3      0.862680   
398256   37         235     179      90    14       2      0.990932   
497081   24         289     174     105     5       0      0.718914   
137368   43         100     178      71     7       2      0.718894   
...     ...         ...     ...     ...   ...     ...           ...   
111488   33          94     179      80     8       3      0.545995   
704005   41         347     182      59     7       0      0.735550   
385436   31         235     162      75     4       1      0.488531   
656497   24         347     178     114     4       1      0.518854   
670608   21         356     169     104     2       0      0.562891   

        Mental Resilience  Battle Experience  Origin_Gaul  ...  \
360044           6.187498                  9        False  ...   
399960         

In [3]:
gradientboosting_pipeline = Pipeline([
    ('scaler',MinMaxScaler()),
    ('gradient_boosting', GradientBoostingClassifier())
])

gradientboosting_grid = {
    'gradient_boosting__n_estimators': [5,30],  #number of boosting stages, larger number tends to do better
    'gradient_boosting__max_depth': [3,5,7],  #maximum depth of each tree; deeper trees can model more complex patterns
}
gradientboosting_search = GridSearchCV(gradientboosting_pipeline, gradientboosting_grid, scoring='balanced_accuracy', n_jobs=-1)
gradientboosting_search.fit(xs, ys)


In [4]:
gradientboosting_params = gradientboosting_search.best_params_
gradientboosting_score = gradientboosting_search.best_score_
print(f"Balanced_Accuracy: {gradientboosting_score}")
print(f"Best params: {gradientboosting_params}\n")

Balanced_Accuracy: 0.8416056186858321
Best params: {'gradient_boosting__max_depth': 7, 'gradient_boosting__n_estimators': 30}



Reflections:
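If I had used a single train/test split instead of cross-validation, I would expect a lot more variance in my score. With only one split, the result depends heavily on which rows happen to land in the test set, and there is no way to know whether the train and test sets have a similar class balance. Cross-validation averages the score over several splits, which makes it more stable.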
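The expectation about split-to-split variance can be checked with a small experiment. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in for the gladiator data, and the sample size, class weights, and seeds are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, mildly imbalanced stand-in for the real data (an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=0
)

def single_split_score(seed):
    """Balanced accuracy from one random 80/20 train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GradientBoostingClassifier(n_estimators=30, random_state=0)
    model.fit(X_train, y_train)
    return balanced_accuracy_score(y_test, model.predict(X_test))

# Ten different single splits: each seed gives a different test set.
split_scores = np.array([single_split_score(seed) for seed in range(10)])

# One 5-fold cross-validation run over the same data.
cv_scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    X, y, cv=5, scoring="balanced_accuracy",
)

print(f"single splits: min={split_scores.min():.3f}, max={split_scores.max():.3f}")
print(f"5-fold CV mean: {cv_scores.mean():.3f}")
```

The gap between the minimum and maximum single-split scores gives a rough sense of how much one split can mislead. Passing `stratify=y` to `train_test_split` would also keep the class balance the same in both sets.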

I chose balanced_accuracy as my metric because there may be some class imbalance in the dataset. It is the average of the recall on each class, so it works well for a simple boolean target: a model cannot score well just by always predicting the majority class.
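A tiny worked example may make the definition concrete. The labels below are made up for illustration (they are not taken from the gladiator data): eight positives, two negatives, and a model that gets one negative wrong.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 8 samples of class 1, 2 samples of class 0.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# The model predicts class 1 for everything except the last sample.
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Recall computed separately for each class, ordered by label: [class 0, class 1].
recall_per_class = recall_score(y_true, y_pred, average=None)

# Balanced accuracy is the unweighted mean of the per-class recalls.
manual_balanced_accuracy = recall_per_class.mean()

print("recall per class:", recall_per_class)             # class 0: 1/2, class 1: 8/8
print("manual balanced accuracy:", manual_balanced_accuracy)
print("sklearn balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("plain accuracy:", accuracy_score(y_true, y_pred))
```

Here plain accuracy is 0.9 while balanced accuracy is 0.75, because the minority class counts as much as the majority class.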

For the hyperparameters, I'm not surprised that the higher n_estimators value did better, since more boosting stages give the model more chances to learn from its earlier errors. I'm also not surprised that the highest max_depth did best, because I believe many of the features have important relationships with the target, and deeper trees can capture more of those relationships for a better score. Both best values sit at the upper edge of my grid, so even larger values might score higher.
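One way to look beyond the single best combination is to tabulate every row of `cv_results_`. The sketch below reuses the same pipeline and grid as the notebook but fits on a small synthetic dataset (an assumption, so that it runs on its own), so its scores will not match the ones reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for xs and ys (an assumption, not the gladiator data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("gradient_boosting", GradientBoostingClassifier(random_state=0)),
])
grid = {
    "gradient_boosting__n_estimators": [5, 30],
    "gradient_boosting__max_depth": [3, 5, 7],
}
search = GridSearchCV(pipeline, grid, scoring="balanced_accuracy", cv=5)
search.fit(X, y)

# One row per hyperparameter combination, best mean score first.
columns = [
    "param_gradient_boosting__n_estimators",
    "param_gradient_boosting__max_depth",
    "mean_test_score",
    "std_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values(
    "mean_test_score", ascending=False
)
print(results.to_string(index=False))
```

Comparing `mean_test_score` against `std_test_score` shows whether the gap between, say, depth 5 and depth 7 is larger than the fold-to-fold noise.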