# ML1 Final Project: F1 Race Finish Prediction
**Work by LT 2**

---

## Background

F1 is one of the most prestigious motorsport competitions in the world, if not the most prestigious. Over the course of a season, 20 drivers from 10 teams race against each other at speeds reaching 370 kph to determine the best driver and the best team in terms of car performance, strategy, and other factors. F1 earned $3.65 billion in revenue in 2024, 25% more than in the previous year. This revenue comes from race promotion fees, media rights, sponsorships, and other sources such as high-margin hospitality, support series, and merchandise. Of these revenue streams, media rights contribute the largest share of annual revenue (32.8%) through lucrative broadcasting agreements with major global networks and digital streaming services, alongside the partnership with Netflix's Drive to Survive television series. As with many other globally renowned sports, F1 is closely tied to sports betting, a highly lucrative market projected to reach $17.23 billion in revenue in 2025. F1 makes up 0.4% of the global betting handle.

## Motivation

With these figures in mind, both teams and sports bettors stand to gain a lot from knowing whether a driver and their car are likely to win a race. Teams would be able to detect when their car and driver are performing poorly, which would allow them to make the necessary adjustments as early as the end of the practice sessions or as late as the end of qualifying (the session held before the race that determines each driver's starting position). Bettors, for their part, want to know which driver-car duo has the highest probability of winning so that they can place the right bets. Whether for an F1 team or a bettor, the goal is ultimately to win. For an F1 team, winning means more prize money and increased merchandise and car sales, especially for car brands associated with teams.

This study was also inspired by the work of Katelyn Castillo, Christopher Nash Jasmin, Jhedson Angelo Petilo, and Louie Sangalang from the MSDS 2025 cohort, whose final project for DMW1 last year produced a driver performance index that quantifies the performance of F1 drivers. To continue their work, the team wanted to include other factors that influence race results, such as car/team performance and track difficulty. This section is included specifically to acknowledge the contribution of Kate and Nash's team to the current study.

## Objective

Given this background, the group set out to train a machine learning model to predict whether a driver-car duo would get a podium finish (1st-3rd place) or not based on driver performance, car/team performance, and track difficulty.

## Dataset Information

To train machine learning models to predict F1 race finishes, data was collected from three sources: the FastF1/Ergast API, the official F1 website, and Kaggle. The FastF1 Python package facilitates data collection through the FastF1 API and Ergast API for data such as telemetry, lap times, and race results. Historical constructor points and driver points were collected from the official F1 website. Lastly, historical data on race track incidents and the reasons for those incidents was collected from Kaggle. Data from the three sources was then consolidated for analysis and modelling.

## Import Libraries

In [1]:
import time
from math import ceil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import LinearSVC

In [36]:
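import warnings

# Suppress all warnings for cleaner notebook output
warnings.filterwarnings("ignore")

# Also skip displaying DeprecationWarning if the filter above is ever relaxed
def showwarning(*args, **kwargs):
    if args[1] is DeprecationWarning:
        return
    warnings._showwarning_orig(*args, **kwargs)
warnings.showwarning = showwarning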

## Load Data

In [5]:
# Load the raw data and work on a copy
data_raw = pd.read_csv("F1_main_data_v9.csv")
data = data_raw.copy()
data.head()

Unnamed: 0,Timestamp,driver_code,Consistency_Race,Style_Race,Technical_Race,Pace_Race,PerformanceIndex_Race,Consistency_Qual,Style_Qual,Technical_Qual,...,Finish_pct,Accident_pct,Collision_pct,Damage Related_pct,DNF_pct,Race_Complexity_Score,Safety_Index,mechanical_faults,avg_stops_per_car_race,avg_pitstop_ms
0,11/11/2025 20:10,HAM,0.487105,0.869994,0.307713,0.884615,0.637357,0.254063,0.202593,0.403174,...,84.415584,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215
1,11/11/2025 20:10,LEC,0.5,0.203907,0.573512,0.5,0.444355,0.0,0.199975,0.305004,...,84.415584,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215
2,11/11/2025 20:10,NOR,0.663963,0.548893,0.326395,1.0,0.634813,0.16895,0.5,0.41743,...,84.415584,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215
3,11/11/2025 20:10,PIA,0.459342,0.38213,0.708799,0.961538,0.627952,0.340871,0.202912,0.190251,...,84.415584,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215
4,11/11/2025 20:10,RUS,0.480019,0.531531,0.402736,0.653846,0.517033,0.463925,0.79203,0.231651,...,84.415584,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215


In [19]:
data['Safety_Index']

0      0.352468
1      0.352468
2      0.352468
3      0.352468
4      0.352468
         ...   
721    0.463846
722    0.463846
723    0.463846
724    0.463846
725    0.463846
Name: Safety_Index, Length: 726, dtype: float64

## Data Exploration

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 726 entries, 0 to 725
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Timestamp               726 non-null    object 
 1   driver_code             726 non-null    object 
 2   Consistency_Race        713 non-null    float64
 3   Style_Race              714 non-null    float64
 4   Technical_Race          726 non-null    float64
 5   Pace_Race               726 non-null    float64
 6   PerformanceIndex_Race   712 non-null    float64
 7   Consistency_Qual        725 non-null    float64
 8   Style_Qual              725 non-null    float64
 9   Technical_Qual          726 non-null    float64
 10  Pace_Qual               726 non-null    float64
 11  PerformanceIndex_Qual   724 non-null    float64
 12  GrandPrix               726 non-null    object 
 13  Round                   726 non-null    int64  
 14  year                    726 non-null    in

Several columns contain null values. These will be imputed later on as appropriate.

In [7]:
data.columns

Index(['Timestamp', 'driver_code', 'Consistency_Race', 'Style_Race',
       'Technical_Race', 'Pace_Race', 'PerformanceIndex_Race',
       'Consistency_Qual', 'Style_Qual', 'Technical_Qual', 'Pace_Qual',
       'PerformanceIndex_Qual', 'GrandPrix', 'Round', 'year',
       'QualifyingPosition', 'RaceFinishPosition', 'team', 'driver_points',
       'team_points', 'Laps', 'Corners', 'Circuit length (km)',
       'Race distance (km)', 'Direction', 'Accident', 'Collision',
       'Damage Related', 'Finish', 'Total_Entries', 'Finish_pct',
       'Accident_pct', 'Collision_pct', 'Damage Related_pct', 'DNF_pct',
       'Race_Complexity_Score', 'Safety_Index', 'mechanical_faults',
       'avg_stops_per_car_race', 'avg_pitstop_ms'],
      dtype='object')

The dataset comprises 726 entries (individual driver results) for 9 drivers who have raced consistently from 2020 to the present. There are 40 columns (features) in total, each describing an aspect of an individual driver result. The following are brief descriptions of each feature:

- **Timestamp**: Time and date when data was scraped.
- **driver_code**: 3-letter abbreviation of a driver's last name
- **Consistency_Race**: Floating point number quantifying consistency of a driver in the actual race
- **Style_Race**: Floating point number quantifying the driving style of a driver in the actual race
- **Technical_Race**: Floating point number quantifying the technical execution of a driver in the actual race
- **Pace_Race**: Floating point number quantifying the pace of a driver in the actual race
- **PerformanceIndex_Race**: Floating point number quantifying the overall performance of a driver based on consistency, driving style, technical execution, and pace in the actual race
- **Consistency_Qual**: Floating point number quantifying consistency of a driver in qualifying
- **Style_Qual**: Floating point number quantifying the driving style of a driver in qualifying
- **Technical_Qual**: Floating point number quantifying the technical execution of a driver in qualifying
- **Pace_Qual**: Floating point number quantifying the pace of a driver in qualifying
- **PerformanceIndex_Qual**: Floating point number quantifying the overall performance of a driver based on consistency, driving style, technical execution, and pace in qualifying
- **GrandPrix**: Name of the grand prix(race) typically associated to where the race took place
- **Round**: Race number for the year
- **year**: Year when the race took place
- **QualifyingPosition**: Position of the driver after the qualifying session
- **RaceFinishPosition**: Position of the driver after the actual race **(Target Variable)**
- **team**: Team that the driver drove for in that particular race
- **driver_points**: Total number of points that a driver received in that year
- **team_points**: Total number of points that a team received in that year
- **Laps**: Number of laps in the race
- **Corners**: Number of corners of the track where the race took place
- **Circuit length (km)**: Length of the track in km
- **Race distance (km)**: Total distance covered by a car in a race obtained by multiplying the number of laps with the circuit length 
- **Direction**: Direction of the race (either clockwise or counter-clockwise)
- **Accident**: Total number of accidents in a year at a specific track
- **Collision**: Total number of collisions in a year at a specific track
- **Damage Related**: Total number of damage-related incidents in a year at a specific track
- **Finish**: Total number of race finishes at a specific track
- **Total_Entries**: Total number of race starts at a specific track
- **Finish_pct**: Percentage of race starts that ended in a finish at a specific track for a given year
- **Accident_pct**: Percentage of race starts that ended in an accident at a specific track for a given year
- **Collision_pct**: Percentage of race starts that ended in a collision at a specific track for a given year
- **Damage Related_pct**: Percentage of race starts that ended in a damage-related incident at a specific track for a given year
- **DNF_pct**: Percentage of race starts that ended in a DNF (did not finish) at a specific track for a given year
- **Race_Complexity_Score**: Floating point number quantifying the complexity/difficulty of a track
- **Safety_Index**: Floating point number quantifying the safety level of a track
- **mechanical_faults**: Average number of mechanical faults of a specific team per race for a given year
- **avg_stops_per_car_race**: Average number of stops per car of a specific team per race for a given year
- **avg_pitstop_ms**: Average pitstop time in milliseconds of a specific team for a given year
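
Several of the track-level features above are simple ratios of other columns. The sketch below recomputes them in plain Python, assuming that each `_pct` column is a count divided by `Total_Entries` and expressed as a percentage, and that a DNF is any entry that did not finish. The counts, lap count, and circuit length are hypothetical values chosen for illustration; they are not read from the dataset.

```python
# Hypothetical per-track counts (illustrative, not taken from the dataset).
total_entries = 77
finish = 65
accident = 3
collision = 9
damage_related = 0


def pct(count, total):
    """Express a count as a percentage of all race starts."""
    return count / total * 100


finish_pct = pct(finish, total_entries)
accident_pct = pct(accident, total_entries)
collision_pct = pct(collision, total_entries)
damage_related_pct = pct(damage_related, total_entries)

# Assumption: a DNF is any entry that did not finish.
dnf_pct = 100 - finish_pct

# Race distance as described above: number of laps times circuit length.
laps = 57                  # illustrative value
circuit_length_km = 5.412  # illustrative value
race_distance_km = laps * circuit_length_km

print(round(finish_pct, 6), round(dnf_pct, 6), round(race_distance_km, 3))
```

Under these assumptions the incident percentages add up to `DNF_pct`, which is one way to sanity-check the consolidated columns.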

In [20]:
data.describe()

Unnamed: 0,Consistency_Race,Style_Race,Technical_Race,Pace_Race,PerformanceIndex_Race,Consistency_Qual,Style_Qual,Technical_Qual,Pace_Qual,PerformanceIndex_Qual,...,Finish_pct,Accident_pct,Collision_pct,Damage Related_pct,DNF_pct,Race_Complexity_Score,Safety_Index,mechanical_faults,avg_stops_per_car_race,avg_pitstop_ms
count,713.0,714.0,726.0,726.0,712.0,725.0,725.0,726.0,726.0,724.0,...,726.0,726.0,726.0,726.0,726.0,726.0,726.0,726.0,726.0,726.0
mean,0.5545828,0.450378,0.481263,0.800398,0.57238,0.563431,0.388492,0.389483,0.79367,0.53392,...,88.999015,3.153359,7.533249,0.314377,11.000985,0.381604,0.508386,0.822314,1.89162,160011.280215
std,0.3179646,0.204018,0.178434,0.165668,0.111998,0.359871,0.22687,0.149213,0.163252,0.120205,...,3.999096,2.345899,3.664557,0.613144,3.999096,0.095063,0.133067,1.397574,0.090201,19826.872578
min,-4.440892e-16,0.0,0.020357,0.5,0.23743,0.0,0.0,0.0,0.5,0.229744,...,80.0,0.0,0.0,0.0,2.941176,0.228824,0.219836,0.0,1.612903,99782.851852
25%,0.305426,0.30526,0.351222,0.682416,0.497102,0.24087,0.211024,0.28158,0.678571,0.442981,...,86.792453,1.041667,5.076508,0.0,6.61157,0.2875,0.405263,0.0,1.89162,160011.280215
50%,0.5721452,0.446033,0.484563,0.84375,0.575344,0.559262,0.361005,0.383827,0.833333,0.538607,...,88.235294,3.508772,8.333333,0.0,11.764706,0.392427,0.481359,0.0,1.89162,160011.280215
75%,0.8395323,0.580773,0.595681,0.941176,0.652805,0.979844,0.509317,0.488183,0.933333,0.623445,...,93.38843,4.494382,10.416667,0.561798,13.207547,0.45068,0.646364,1.0,1.89162,160011.280215
max,1.0,1.0,0.982764,1.0,0.852287,1.0,1.0,0.968698,1.0,0.839893,...,97.058824,9.016393,20.0,2.083333,20.0,0.583443,0.741765,5.0,2.15,249916.878378


## Data Preparation and Processing

In [21]:
data=data.drop(columns=["Timestamp","driver_code","GrandPrix","Consistency_Race", "Style_Race",
                        "Technical_Race","Pace_Race","PerformanceIndex_Race","driver_points","team_points"])
target = 'RaceFinishPosition'

Columns that are only available after the actual race (such as the race-session metrics ending in _Race) were dropped to avoid data leakage, which would otherwise lead to overly optimistic, overfitted models. The rest were dropped due to redundancy or a clear lack of relevance to the modelling performed later on. The target column is RaceFinishPosition, which gives the position of the driver after the actual race.

In [23]:
data.shape

(726, 30)

### Modified Race Finish Position as Ordinal Values

In [24]:
# # METHOD 1 What Position ==========================================
# # Converting Race Finish Output to Integer Values
# # data2.loc[:, "RaceFinishPosition"] = data2["RaceFinishPosition"].astype(int)
# data["RaceFinishPosition"] = pd.to_numeric(data["RaceFinishPosition"], errors="coerce").fillna(0).astype(int)


# # METHOD 2 Podium vs No Podium ==========================================
# # Convert to numeric safely
# data["RaceFinishPosition"] = pd.to_numeric(data["RaceFinishPosition"], errors="coerce")

# # Classify: 1 if Podium (positions 1, 2, 3), 0 otherwise
# data["RaceFinishPosition"] = np.where(data["RaceFinishPosition"].between(1, 3), 1, 0)

# # METHOD 3 1st place or None ==========================================
# # Convert to numeric safely
# data["RaceFinishPosition"] = pd.to_numeric(data["RaceFinishPosition"], errors="coerce")

# # Classify: 1 if 1st place, 0 otherwise
# data["RaceFinishPosition"] = np.where(data["RaceFinishPosition"].between(1, 1), 1, 0)

# data["RaceFinishPosition"].head()

# METHOD 4. 4 Categories. 1st 2nd 3rd and No Podium =========================
# Convert to numeric safely
data["RaceFinishPosition"] = pd.to_numeric(data["RaceFinishPosition"], errors="coerce")

# 4 categories: 1st, 2nd, 3rd, no podium (0)
data["RaceFinishPosition"] = np.select(
    [
        data["RaceFinishPosition"] == 1,
        data["RaceFinishPosition"] == 2,
        data["RaceFinishPosition"] == 3,
    ],
    [1, 2, 3],
    default=0
)

data["RaceFinishPosition"].head()


0    0
1    0
2    1
3    2
4    0
Name: RaceFinishPosition, dtype: int64

### Check Numerical and Categorical Columns

In [25]:
# Identify categorical and numeric columns
cat_cols = data.select_dtypes(include=['object', 'category']).columns
num_cols = data.select_dtypes(exclude=['object', 'category']).columns

print("Numeric columns:", num_cols.tolist())
print("Categorical columns:", cat_cols.tolist())

# --- Check for missing values ---
missing_info = data.isna().sum()  # count NaN per column
missing_info = missing_info[missing_info > 0]  # only show columns with NaN

if not missing_info.empty:
    print("\n🧩 Columns with missing values:")
    print(missing_info.sort_values(ascending=False))
else:
    print("\n✅ No missing values found in the dataset!")

# Fill all numeric columns with their mean
data[num_cols] = data[num_cols].apply(lambda x: x.fillna(x.mean()))

# (Optional) For categorical columns, you can fill NaN with a placeholder
data[cat_cols] = data[cat_cols].fillna("Unknown")

# Verify if any NaN remain
print("\nRemaining missing values after mean imputation:")
print(data.isna().sum()[data.isna().sum() > 0])



Numeric columns: ['Consistency_Qual', 'Style_Qual', 'Technical_Qual', 'Pace_Qual', 'PerformanceIndex_Qual', 'Round', 'year', 'QualifyingPosition', 'RaceFinishPosition', 'Laps', 'Corners', 'Circuit length (km)', 'Race distance (km)', 'Accident', 'Collision', 'Damage Related', 'Finish', 'Total_Entries', 'Finish_pct', 'Accident_pct', 'Collision_pct', 'Damage Related_pct', 'DNF_pct', 'Race_Complexity_Score', 'Safety_Index', 'mechanical_faults', 'avg_stops_per_car_race', 'avg_pitstop_ms']
Categorical columns: ['team', 'Direction']

🧩 Columns with missing values:
PerformanceIndex_Qual    2
Consistency_Qual         1
Style_Qual               1
dtype: int64

Remaining missing values after mean imputation:
Series([], dtype: int64)


Three columns contain null values. The missing entries in these columns were filled with the column mean (mean imputation).
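
To make the imputation step concrete, here is a minimal plain-Python sketch of mean imputation on a single column. The values are hypothetical, and `mean_impute` is an illustrative helper, not part of the notebook.

```python
from statistics import fmean


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with one missing value.
style_qual = [0.25, None, 0.75, 0.5]
imputed = mean_impute(style_qual)
print(imputed)
```

One design consideration: computing the mean from the training rows only, and reusing it for validation and test rows, keeps information from held-out data out of the training features.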

### One Hot Encoding

In [26]:
X = data.drop(columns=[target])
y = data[target]

# Separate categorical and numeric columns
cat_cols = X.select_dtypes(include=['object', 'category']).columns
num_cols = X.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Only build encoder if categorical columns exist
if len(cat_cols) > 0:
    ohe = OneHotEncoder(
        handle_unknown='infrequent_if_exist',
        min_frequency=0.01,   # ~1% threshold for infrequent categories
        sparse_output=False,
        dtype=int              # ensures 0/1 not True/False
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', ohe, cat_cols),
            ('num', 'passthrough', num_cols)
        ],
        remainder='drop'
    )

    # ⚠️ Use fit_transform only the first time
    X_encoded = preprocessor.fit_transform(X)
    feature_names = preprocessor.get_feature_names_out()
else:
    X_encoded = X.values
    feature_names = X.columns

data2 = pd.DataFrame(X_encoded, columns=feature_names)

# Attach target
data2 = pd.concat([data2, y.reset_index(drop=True)], axis=1)

data2.head()

Unnamed: 0,cat__team_Alpine Renault,cat__team_Aston Martin Aramco Mercedes,cat__team_Ferrari,cat__team_McLaren,cat__team_McLaren Mercedes,cat__team_McLaren Renault,cat__team_Mercedes,cat__team_Red Bull Racing,cat__team_Red Bull Racing Honda,cat__team_Red Bull Racing Honda RBPT,...,num__Accident_pct,num__Collision_pct,num__Damage Related_pct,num__DNF_pct,num__Race_Complexity_Score,num__Safety_Index,num__mechanical_faults,num__avg_stops_per_car_race,num__avg_pitstop_ms,RaceFinishPosition
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215,0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215,0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215,1
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215,2
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,3.896104,11.688312,0.0,15.584416,0.491688,0.352468,0.0,1.89162,160011.280215,0


The categorical columns identified ('team' and 'Direction') were one-hot encoded so that they could be included in the modelling later on.
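
As a small illustration of what one-hot encoding produces, the sketch below encodes a categorical column by hand. The category labels are hypothetical, and `one_hot` is an illustrative helper; it simplifies the handling of unseen categories to an all-zero row and does not reproduce the infrequent-category grouping configured in the cell above.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 vector over the known categories.

    Values outside `categories` become an all-zero row (a simplification).
    """
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical labels for a direction-like column.
categories = ["Anti-clockwise", "Clockwise"]
rows = ["Clockwise", "Anti-clockwise", "Clockwise"]

encoded = one_hot(rows, categories)
for label, vec in zip(rows, encoded):
    print(label, vec)
```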

## Split Train, Validation, and Test Sets

### Select Test Data: Predicting 2025 Data

In [27]:
# ✅ Split dataset by year
train_val_data = data2[data2["num__year"] < 2025].copy()    # all years before 2025
data_2025 = data2[data2["num__year"] == 2025].copy()        # only 2025 data

# --- Control what fraction of 2025 data to use for test ---
test_fraction = 0.99 # 👈 set this to the fraction (e.g., 0.3 = 30%, 0.5 = 50%)

data_2025_train, data_2025_test = train_test_split(
    data_2025,
    test_size=test_fraction,
    random_state=42,     # reproducibility
    shuffle=True         # shuffle so random subset of 2025 data
)

# Combine all training data (pre-2025 + part of 2025)
train_val_data = pd.concat([train_val_data, data_2025_train], ignore_index=True)
test_data = data_2025_test.copy()

# Define features and target
X_train_val = train_val_data.drop(columns=[target])
y_train_val = train_val_data[target]

X_test = test_data.drop(columns=[target])
y_test = test_data[target]

# --- Display summary ---
print(f"Training + Validation size: {len(X_train_val)} rows ({len(X_train_val)/len(data2)*100:.1f}%)")
print(f"Test size ({test_fraction*100:.0f}% of 2025 data): {len(X_test)} rows ({len(X_test)/len(data2)*100:.1f}%)")

Training + Validation size: 601 rows (82.8%)
Test size (99% of 2025 data): 125 rows (17.2%)
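
The idea behind this split is a temporal holdout: earlier seasons are used for training and the latest season is kept for testing. A minimal sketch of that idea follows, using hypothetical `(year, label)` rows. Unlike the cell above, which moves a small fraction of 2025 rows into training, this sketch holds out the whole season.

```python
# Hypothetical rows: (year, finish_label). Not taken from the dataset.
rows = [(2022, 0), (2023, 1), (2024, 0), (2024, 3), (2025, 2), (2025, 0)]

holdout_year = 2025
train_val_rows = [r for r in rows if r[0] < holdout_year]
test_rows = [r for r in rows if r[0] == holdout_year]

# No training row may come from the held-out season.
assert all(year < holdout_year for year, _ in train_val_rows)

print(len(train_val_rows), "training rows;", len(test_rows), "test rows")
```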


In [28]:
X_test.head()

Unnamed: 0,cat__team_Alpine Renault,cat__team_Aston Martin Aramco Mercedes,cat__team_Ferrari,cat__team_McLaren,cat__team_McLaren Mercedes,cat__team_McLaren Renault,cat__team_Mercedes,cat__team_Red Bull Racing,cat__team_Red Bull Racing Honda,cat__team_Red Bull Racing Honda RBPT,...,num__Finish_pct,num__Accident_pct,num__Collision_pct,num__Damage Related_pct,num__DNF_pct,num__Race_Complexity_Score,num__Safety_Index,num__mechanical_faults,num__avg_stops_per_car_race,num__avg_pitstop_ms
73,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,86.27451,2.941176,10.784314,0.0,13.72549,0.41451,0.448235,0.0,1.89162,160011.280215
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,94.117647,1.176471,4.705882,0.0,5.882353,0.277647,0.663529,0.0,1.89162,160011.280215
116,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,85.6,6.4,8.0,0.0,14.4,0.478,0.378,0.0,1.89162,160011.280215
67,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,88.541667,1.041667,10.416667,0.0,11.458333,0.369167,0.51625,0.0,1.89162,160011.280215
94,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,87.378641,3.883495,8.737864,0.0,12.621359,0.392427,0.481359,0.0,1.89162,160011.280215


### Select Validation Data

In [29]:
X = X_train_val.copy()
y = y_train_val.copy()

# --- Split into Train (90%) and Validation (10%) ---
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.10,          # 10% held out for validation
    random_state=42,         # reproducibility
    stratify=None            # set to y to preserve class proportions
)

print(f"Train size: {len(X_train)} rows ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation size: {len(X_val)} rows ({len(X_val)/len(X)*100:.1f}%)")

Train size: 540 rows (89.9%)
Validation size: 61 rows (10.1%)
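
The split above passes `stratify=None`. If the target classes are imbalanced, a stratified split keeps the class proportions the same in both subsets, which is what `train_test_split` does when `stratify=y` is passed. The plain-Python sketch below illustrates the idea on hypothetical labels; `stratified_split` is an illustrative helper, not the scikit-learn implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction, seed=42):
    """Return (train_idx, val_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced labels: 0 = no podium, 1-3 = podium places.
labels = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
train_idx, val_idx = stratified_split(labels, val_fraction=0.10)

val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```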


## Machine Learning Modelling

The models are first trained in a single run with default settings and no hyperparameter tuning.

In [32]:
# --- imports ---
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.pipeline import make_pipeline
import pandas as pd

# NEW: import an ordinal classifier
# !pip install mord --quiet
import mord as m  # Mord implements ordinal logistic regression (LogisticIT, OrdinalRidge, etc.)

# --- define models (pipelines where scaling helps) ---
pipe_lr  = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
pipe_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
pipe_svm = make_pipeline(MinMaxScaler(), SVC())
pipe_nb  = make_pipeline(MinMaxScaler(), GaussianNB())

DT  = DecisionTreeClassifier()
RF  = RandomForestClassifier()
GBM = GradientBoostingClassifier()
LGB = LGBMClassifier(verbose=-1)

# NEW: Ordinal Logistic Regression model (LogisticIT)
ORD = make_pipeline(MinMaxScaler(), m.LogisticIT())  # cumulative link model

# --- XGBoost requires encoded y ---
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_val_enc   = le.transform(y_val)
y_test_enc  = le.transform(y_test)
XGB = xgb.XGBClassifier(verbosity=0, eval_metric='mlogloss')

# --- fit on X_train/y_train ---
ORD.fit(X_train, y_train)
pipe_lr.fit(X_train, y_train)
pipe_knn.fit(X_train, y_train)
pipe_svm.fit(X_train, y_train)
pipe_nb.fit(X_train, y_train)

DT.fit(X_train, y_train)
RF.fit(X_train, y_train)
GBM.fit(X_train, y_train)
LGB.fit(X_train, y_train)
XGB.fit(X_train, y_train_enc)

# --- evaluate models ---
round_val = 5
cols = ['Machine Learning Classification Method', 'Train Accuracy', 'Validation Accuracy']
df_result = pd.DataFrame(columns=cols)

df_result.loc[0] = ['Ordinal Logistic Regression (mord)', round(ORD.score(X_train, y_train), round_val), round(ORD.score(X_val, y_val), round_val)]
df_result.loc[1] = ['Logistic Regression', round(pipe_lr.score(X_train, y_train), round_val), round(pipe_lr.score(X_val, y_val), round_val)]
df_result.loc[2] = ['kNN', round(pipe_knn.score(X_train, y_train), round_val), round(pipe_knn.score(X_val, y_val), round_val)]
df_result.loc[3] = ['Decision Tree', round(DT.score(X_train, y_train), round_val), round(DT.score(X_val, y_val), round_val)]
df_result.loc[4] = ['Random Forest', round(RF.score(X_train, y_train), round_val), round(RF.score(X_val, y_val), round_val)]
df_result.loc[5] = ['Gradient Boosting', round(GBM.score(X_train, y_train), round_val), round(GBM.score(X_val, y_val), round_val)]
df_result.loc[6] = ['XGBoost', round(XGB.score(X_train, y_train_enc), round_val), round(XGB.score(X_val, y_val_enc), round_val)]
df_result.loc[7] = ['LightGBM', round(LGB.score(X_train, y_train), round_val), round(LGB.score(X_val, y_val), round_val)]
df_result.loc[8] = ['Support Vector Machine', round(pipe_svm.score(X_train, y_train), round_val), round(pipe_svm.score(X_val, y_val), round_val)]
df_result.loc[9] = ['Naive Bayes', round(pipe_nb.score(X_train, y_train), round_val), round(pipe_nb.score(X_val, y_val), round_val)]

df_result = df_result.sort_values(by='Validation Accuracy', ascending=False)
df_result

| | Machine Learning Classification Method | Train Accuracy | Validation Accuracy |
|---|---|---|---|
| 7 | LightGBM | 1.0 | 0.78689 |
| 6 | XGBoost | 1.0 | 0.77049 |
| 2 | kNN | 0.78889 | 0.7541 |
| 1 | Logistic Regression | 0.78519 | 0.7541 |
| 8 | Support Vector Machine | 0.76296 | 0.7541 |
| 0 | Ordinal Logistic Regression (mord) | 0.73889 | 0.7377 |
| 4 | Random Forest | 1.0 | 0.7377 |
| 5 | Gradient Boosting | 1.0 | 0.7377 |
| 3 | Decision Tree | 1.0 | 0.70492 |
| 9 | Naive Bayes | 0.34444 | 0.39344 |
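
The accuracies above are the fraction of rows whose predicted class matches the true class. When the target is imbalanced, it helps to compare them against a majority-class baseline, that is, the accuracy of always predicting the most frequent class. The sketch below shows both calculations on hypothetical labels; the actual class distribution in the validation set is not shown here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical validation labels: 0 = no podium, 1-3 = podium places.
y_val_example = [0, 0, 0, 1, 0, 2, 0, 3, 0, 0]

# Baseline: always predict the most frequent class.
majority_class = Counter(y_val_example).most_common(1)[0][0]
baseline_pred = [majority_class] * len(y_val_example)
baseline_acc = accuracy(y_val_example, baseline_pred)

print(majority_class, baseline_acc)
```

A model is only adding value to the extent that its validation accuracy exceeds this baseline.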


The same single-run models are then also scored on the held-out test data.

In [33]:
# table
cols = ['Machine Learning Classification Method','Train Accuracy','Validation Accuracy','Test Accuracy']
df_result2 = pd.DataFrame(columns=cols)
round_val = 8
df_result2.loc[0] = ['Logistic Regression', round(pipe_lr.score(X_train,y_train),round_val), round(pipe_lr.score(X_val,y_val),round_val), round(pipe_lr.score(X_test,y_test),round_val)]
df_result2.loc[1] = ['kNN', round(pipe_knn.score(X_train,y_train),round_val), round(pipe_knn.score(X_val,y_val),round_val), round(pipe_knn.score(X_test,y_test),round_val)]
df_result2.loc[2] = ['Decision Tree', round(DT.score(X_train,y_train),round_val), round(DT.score(X_val,y_val),round_val), round(DT.score(X_test,y_test),round_val)]
df_result2.loc[3] = ['Random Forest', round(RF.score(X_train,y_train),round_val), round(RF.score(X_val,y_val),round_val), round(RF.score(X_test,y_test),round_val)]
df_result2.loc[4] = ['Gradient Boosting', round(GBM.score(X_train,y_train),round_val), round(GBM.score(X_val,y_val),round_val), round(GBM.score(X_test,y_test),round_val)]
df_result2.loc[5] = ['XGBoost', round(XGB.score(X_train,y_train_enc),round_val), round(XGB.score(X_val,y_val_enc),round_val), round(XGB.score(X_test,y_test_enc),round_val)]
df_result2.loc[6] = ['LightGBM', round(LGB.score(X_train,y_train),round_val), round(LGB.score(X_val,y_val),round_val), round(LGB.score(X_test,y_test),round_val)]
df_result2.loc[7] = ['Support Vector Machine', round(pipe_svm.score(X_train,y_train),round_val), round(pipe_svm.score(X_val,y_val),round_val), round(pipe_svm.score(X_test,y_test),round_val)]
df_result2.loc[8] = ['Naive Bayes', round(pipe_nb.score(X_train,y_train),round_val), round(pipe_nb.score(X_val,y_val),round_val), round(pipe_nb.score(X_test,y_test),round_val)]

df_result2 = df_result2.sort_values(by='Validation Accuracy', ascending=False)
df_result2

| | Machine Learning Classification Method | Train Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|---|
| 6 | LightGBM | 1.0 | 0.786885 | 0.648 |
| 5 | XGBoost | 1.0 | 0.770492 | 0.648 |
| 0 | Logistic Regression | 0.785185 | 0.754098 | 0.6 |
| 1 | kNN | 0.788889 | 0.754098 | 0.632 |
| 7 | Support Vector Machine | 0.762963 | 0.754098 | 0.6 |
| 3 | Random Forest | 1.0 | 0.737705 | 0.656 |
| 4 | Gradient Boosting | 1.0 | 0.737705 | 0.64 |
| 2 | Decision Tree | 1.0 | 0.704918 | 0.616 |
| 8 | Naive Bayes | 0.344444 | 0.393443 | 0.152 |
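
One way to read this table is through the gaps between the three accuracies. A large train-validation gap suggests overfitting, while the validation-test gap shows how much performance drops on the 2025 season. The sketch below computes both gaps for a few of the models, using the accuracies copied from the table above; the interpretation in the comments is a rule of thumb, not a formal test.

```python
# (train, validation, test) accuracies copied from the table above.
results = {
    "LightGBM": (1.0, 0.786885, 0.648),
    "XGBoost": (1.0, 0.770492, 0.648),
    "Logistic Regression": (0.785185, 0.754098, 0.6),
    "Random Forest": (1.0, 0.737705, 0.656),
}

for name, (train, val, test) in results.items():
    train_val_gap = train - val  # a large gap suggests overfitting
    val_test_gap = val - test    # drop when moving to the 2025 season
    print(f"{name:<20} train-val={train_val_gap:.3f}  val-test={val_test_gap:.3f}")

# Model with the smallest train-validation gap among those listed.
most_stable = min(results, key=lambda k: results[k][0] - results[k][1])
print("Smallest train-validation gap:", most_stable)
```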


## Hyperparameter Tuning: Bayesian Search

### All Models

In [34]:
# ============================================================
# 🏎️ Machine Learning Training Framework with Ordinal + Test Eval
# ============================================================

import time
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, cross_val_score
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical, Real
from skopt.callbacks import DeadlineStopper

# === Classifiers ===
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
import xgboost as xgb
import mord as m
import ast

# ============================================================
# === Configuration ===
# ============================================================

cv = KFold(n_splits=3, shuffle=True, random_state=42)

# ============================================================
# === Helper functions ===
# ============================================================

def _final_estimator(est):
    """Return final estimator from pipeline or model."""
    return list(est.named_steps.values())[-1] if hasattr(est, "named_steps") else est

def _top_feature(estimator, X):
    """Return most important feature based on coefficients or importances."""
    est = _final_estimator(estimator)
    try:
        if hasattr(est, "feature_importances_"):
            idx = int(np.argmax(est.feature_importances_))
            return X.columns[idx]
        if hasattr(est, "coef_"):
            coef = np.asarray(est.coef_)
            if coef.ndim > 1:
                coef = np.mean(np.abs(coef), axis=0)
            idx = int(np.argmax(np.abs(coef)))
            return X.columns[idx]
    except Exception:
        pass
    return "NA"

# ============================================================
# === Core Training Function Template ===
# ============================================================

def _train_model(model_name, model, search_space, X, y, X_test, y_test):
    """Unified function for BayesSearchCV + test evaluation."""
    t0 = time.time()

    # The mord pipeline skips BayesSearchCV because it is given an empty search space
    if model_name == "Ordinal Logistic Regression (mord)":
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy", n_jobs=-1)
        model.fit(X, y)
        test_score = model.score(X_test, y_test)
        return [
            model_name,
            scores.mean(),
            "{}",
            _top_feature(model, X),
            test_score,
            time.time() - t0,
        ]

    # Standard models with BayesSearchCV
    bayes = BayesSearchCV(model, search_space, n_iter=30, scoring="accuracy",
                          cv=cv, n_jobs=-1, random_state=42, verbose=0)
    bayes.fit(X, y, callback=[DeadlineStopper(60)])

    # Evaluate best model on test data
    best_model = bayes.best_estimator_
    test_score = best_model.score(X_test, y_test)

    return [
        model_name,
        bayes.best_score_,
        str(bayes.best_params_),
        _top_feature(best_model, X),
        test_score,
        time.time() - t0,
    ]

# ============================================================
# === Individual Model Functions ===
# ============================================================

def train_ordinal(X, y, X_test, y_test):
    model = make_pipeline(MinMaxScaler(), m.LogisticIT())
    return _train_model("Ordinal Logistic Regression (mord)", model, {}, X, y, X_test, y_test)

def train_logreg(X, y, X_test, y_test):
    pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
    search = {
        "logisticregression__C": Real(1e-3, 1e3, prior="log-uniform"),
        "logisticregression__penalty": Categorical(["l2", "l1"]),
        "logisticregression__solver": Categorical(["liblinear", "saga"]),
    }
    return _train_model("Logistic Regression", pipe, search, X, y, X_test, y_test)

def train_knn(X, y, X_test, y_test):
    pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
    search = {
        "kneighborsclassifier__n_neighbors": Integer(1, 50),
        "kneighborsclassifier__weights": Categorical(["uniform", "distance"]),
        "kneighborsclassifier__p": Categorical([1, 2]),
    }
    return _train_model("kNN", pipe, search, X, y, X_test, y_test)

def train_dt(X, y, X_test, y_test):
    model = DecisionTreeClassifier(random_state=42)
    search = {
        "max_depth": Integer(2, 50),
        "min_samples_split": Integer(2, 20),
        "min_samples_leaf": Integer(1, 20),
        "max_features": Categorical([None, "sqrt", "log2"]),
    }
    return _train_model("Decision Tree", model, search, X, y, X_test, y_test)

def train_rf(X, y, X_test, y_test):
    model = RandomForestClassifier(random_state=42, n_jobs=-1)
    search = {
        "n_estimators": Integer(80, 200),
        "max_depth": Integer(2, 20),
        "max_features": Categorical(["sqrt", "log2"]),
        "bootstrap": Categorical([True, False]),
        
    }
    return _train_model("Random Forest", model, search, X, y, X_test, y_test)

def train_gbm(X, y, X_test, y_test):
    model = GradientBoostingClassifier(random_state=42)
    search = {
        "n_estimators": Integer(80, 200),
        "learning_rate": Real(1e-2, 0.3, prior="log-uniform"),
        "max_depth": Integer(2, 6),
        "subsample": Real(0.5, 1.0),
    }
    return _train_model("Gradient Boosting", model, search, X, y, X_test, y_test)

def train_xgb(X, y, X_test, y_test):
    # n_jobs replaces the deprecated nthread alias; use_label_encoder is no longer used by recent xgboost
    model = xgb.XGBClassifier(random_state=42, n_jobs=-1, tree_method="hist",
                              eval_metric="logloss")
    search = {
        "n_estimators": Integer(100, 1000),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "max_depth": Integer(2, 10),
        "subsample": Real(0.5, 1.0),
        "colsample_bytree": Real(0.5, 1.0),
        "reg_lambda": Real(1e-3, 15.0, prior="log-uniform"),
        "reg_alpha": Real(1e-6, 1.0, prior="log-uniform"),
    }
    return _train_model("XGBoost", model, search, X, y, X_test, y_test)

def train_lgb(X, y, X_test, y_test):
    model = LGBMClassifier(random_state=42, n_jobs=-1)
    search = {
        "n_estimators": Integer(200, 1200),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "max_depth": Integer(-1, 50),
        "num_leaves": Integer(15, 255),
        "subsample": Real(0.5, 1.0),
    }
    return _train_model("LightGBM", model, search, X, y, X_test, y_test)

def train_svm(X, y, X_test, y_test):
    pipe = make_pipeline(MinMaxScaler(), SVC())
    search = {
        "svc__C": Real(1e-3, 1e3, prior="log-uniform"),
        "svc__kernel": Categorical(["linear", "rbf", "poly"]),
    }
    return _train_model("SVM", pipe, search, X, y, X_test, y_test)

def train_nb(X, y, X_test, y_test):
    model = GaussianNB()
    t0 = time.time()
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy", n_jobs=-1)
    model.fit(X, y)
    test_score = model.score(X_test, y_test)
    return [
        "Naive Bayes",
        scores.mean(),
        "{}",
        _top_feature(model, X),
        test_score,
        time.time() - t0,
    ]
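
For multiclass linear models, `coef_` has one row per class, which is why `_top_feature` averages absolute values across rows before taking the argmax. The sketch below illustrates that step on a made-up coefficient matrix; the feature names and numbers are hypothetical and are not taken from the fitted models.

```python
import numpy as np

# Hypothetical feature names and coefficients (3 classes x 3 features),
# invented purely to illustrate the ranking step.
feature_names = ["QualifyingPosition", "Pace_Qual", "avg_pitstop_ms"]
coef = np.array([
    [ 2.0, -0.5,  0.1],
    [-1.0,  0.4,  0.2],
    [-1.0,  0.1, -0.3],
])

# A plain mean lets opposite-signed class coefficients cancel out.
signed_mean = coef.mean(axis=0)

# Averaging absolute values keeps the magnitude of each feature's effect.
mean_abs = np.mean(np.abs(coef), axis=0)
top_feature = feature_names[int(np.argmax(mean_abs))]

print(signed_mean)   # first entry is 0.0 despite large per-class weights
print(top_feature)
```

Estimators that expose neither `feature_importances_` nor `coef_`, such as kNN and Gaussian Naive Bayes, fall through to the `"NA"` return, which presumably explains the empty Top Feature cells for those two models in the summary table.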

In [None]:
# ==========================================
# Run Classification Models
# ==========================================
results = []

# Ordinal Logistic Regression
results.append(train_ordinal(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Ordinal Logistic Regression")

# Logistic Regression
results.append(train_logreg(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Logistic Regression")

# kNN
results.append(train_knn(X_train_val, y_train_val, X_test, y_test))
print("✅ done - kNN")

# Decision Tree
results.append(train_dt(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Decision Tree")

✅ done - Ordinal Logistic Regression
✅ done - Logistic Regression
✅ done - kNN
✅ done - Decision Tree


In [38]:
# Random Forest
results.append(train_rf(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Random Forest")

✅ done - Random Forest


In [39]:
# Gradient Boosting
results.append(train_gbm(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Gradient Boosting")

# XGBoost: encode y only for XGB
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_val_enc = le.fit_transform(y_train_val)
y_test_enc = le.transform(y_test)

results.append(train_xgb(X_train_val, y_train_val_enc, X_test, y_test_enc))
print("✅ done - XGBoost")

✅ done - Gradient Boosting
✅ done - XGBoost


In [40]:
# LightGBM
results.append(train_lgb(X_train_val, y_train_val, X_test, y_test))
print("✅ done - LightGBM")

# Support Vector Machine
results.append(train_svm(X_train_val, y_train_val, X_test, y_test))
print("✅ done - SVM")

# Naive Bayes
results.append(train_nb(X_train_val, y_train_val, X_test, y_test))
print("✅ done - Naive Bayes")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002157 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002212 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1010
[LightGBM] [Info] Total Bins 1010
[LightGBM] [Info] Number of data points in the train set: 401, number of used features: 37
[LightGBM] [Info] Number of data points in the train set: 400, number of used features: 36
[LightGBM] [Info] Start training from score -0.293518
[LightGBM] [Info] Start training from score -2.559974
[LightGBM] [Info] Start training from score -2.356375
[LightGBM] [Info] Start training from score -2.497454
[LightGBM] [Info] Start training from score -0.294371
[LightGBM] [Info] Start trai

In [41]:
# ==========================================
# Combine and Display Results
# ==========================================
df = pd.DataFrame(
    results,
    columns=['Model', 'CV Accuracy', 'Best Params', 'Top Feature', 'Test Accuracy', 'Runtime (s)']
)

# Sort by best CV Accuracy
df = df.sort_values('CV Accuracy', ascending=False).reset_index(drop=True)

print("\n=== Final Model Performance Summary ===")
display(df)


=== Final Model Performance Summary ===


Unnamed: 0,Model,CV Accuracy,Best Params,Top Feature,Test Accuracy,Runtime (s)
0,Decision Tree,0.775406,"OrderedDict({'max_depth': 48, 'max_features': ...",num__Pace_Qual,0.696,26.401844
1,Logistic Regression,0.770373,OrderedDict({'logisticregression__C': 0.640385...,num__QualifyingPosition,0.616,27.561792
2,Random Forest,0.762056,"OrderedDict({'bootstrap': False, 'max_depth': ...",num__Pace_Qual,0.656,39.852602
3,Gradient Boosting,0.757056,OrderedDict({'learning_rate': 0.01340889179078...,num__Pace_Qual,0.656,56.910965
4,XGBoost,0.757048,"OrderedDict({'colsample_bytree': 1.0, 'learnin...",num__Pace_Qual,0.616,56.952723
5,SVM,0.748748,"OrderedDict({'svc__C': 5.987513324867395, 'svc...",num__QualifyingPosition,0.624,38.546102
6,kNN,0.745423,OrderedDict({'kneighborsclassifier__n_neighbor...,,0.6,18.882892
7,Ordinal Logistic Regression (mord),0.737098,{},num__Pace_Qual,0.6,0.271716
8,LightGBM,0.727114,OrderedDict({'learning_rate': 0.01037235167503...,num__Pace_Qual,0.648,84.46308
9,Naive Bayes,0.628947,{},,0.632,0.076178
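
One way to read the summary above is to look at the gap between cross-validated and test accuracy for each model. The sketch below recomputes that gap from the figures in the table (rounded to three decimals); the `Gap` column name is introduced here for illustration only.

```python
import pandas as pd

# CV and test accuracy copied from the summary table, rounded to 3 decimals.
summary = pd.DataFrame({
    "Model": [
        "Decision Tree", "Logistic Regression", "Random Forest",
        "Gradient Boosting", "XGBoost", "SVM", "kNN",
        "Ordinal Logistic Regression (mord)", "LightGBM", "Naive Bayes",
    ],
    "CV Accuracy":   [0.775, 0.770, 0.762, 0.757, 0.757, 0.749, 0.745, 0.737, 0.727, 0.629],
    "Test Accuracy": [0.696, 0.616, 0.656, 0.656, 0.616, 0.624, 0.600, 0.600, 0.648, 0.632],
})

# Positive gap: the model scored better in cross-validation than on the test set.
summary["Gap"] = (summary["CV Accuracy"] - summary["Test Accuracy"]).round(3)
gap_table = summary.sort_values("Gap", ascending=False).reset_index(drop=True)
print(gap_table)
```

A large positive gap suggests the tuned configuration may not generalize as well as its CV score implies, so ranking by CV accuracy alone can be misleading on a data set of this size.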


### Refined Hyperparameter Space
- Explore a larger parameter space
- Use more refined tuning settings (5-fold CV, `n_iter=60`)
- Return more predictors (top 10 features instead of the single top feature)
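
The refined search widens `C` to `Real(1e-8, 1e8, prior="log-uniform")`. With a log-uniform prior the exponent, rather than the value itself, is spread evenly over the range, so each decade receives roughly equal attention. The sketch below mimics that idea with NumPy; it illustrates the distribution only and is not skopt's internal sampler, and the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 1e-8, 1e8

# Log-uniform: draw the exponent uniformly, then exponentiate.
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)

# Plain uniform over the same interval, for contrast.
uniform = rng.uniform(low, high, size=10_000)

# Share of candidate C values below 1.0 (stronger regularization).
frac_log = float(np.mean(log_uniform < 1.0))
frac_uni = float(np.mean(uniform < 1.0))
print(f"log-uniform: {frac_log:.2f} of samples below 1.0")
print(f"uniform:     {frac_uni:.2f} of samples below 1.0")
```

Under a plain uniform prior almost every candidate would land in the weak-regularization end of the range, which is why a log scale is the usual choice for parameters such as `C` and `learning_rate`.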

In [42]:
#  ============================================================
# === Configuration ===
# ============================================================

pd.options.display.float_format = '{:.3f}'.format
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# ============================================================
# === Start Logistic Regression Model Training
# ============================================================

print("🚀 Starting Logistic Regression training...")
t0 = time.time()

# Pipeline and parameter space
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
search = {
    "logisticregression__C": Real(1e-8, 1e8, prior="log-uniform"),
    "logisticregression__penalty": Categorical(["l2", "l1"]),
    "logisticregression__solver": Categorical(["liblinear", "saga"]),
}

# Run Bayesian Optimization
bayes = BayesSearchCV(
    pipe,
    search,
    n_iter=60,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=0
)
bayes.fit(X_train_val, y_train_val, callback=[DeadlineStopper(60)])

best_model = bayes.best_estimator_

# ============================================================
# === Predictions and Test Accuracy
# ============================================================

y_pred = best_model.predict(X_test)
test_score = best_model.score(X_test, y_test)
runtime = round(time.time() - t0, 3)

# ============================================================
# === Extract Top 10 Features (Inline)
# ============================================================

est = list(best_model.named_steps.values())[-1] if hasattr(best_model, "named_steps") else best_model
top_features = "N/A"

try:
    if hasattr(est, "feature_importances_"):
        importances = np.asarray(est.feature_importances_)
        top_idx = np.argsort(importances)[::-1][:10]
        top_features = ", ".join([f"{X_train_val.columns[i]} ({importances[i]:.3f})" for i in top_idx])
    elif hasattr(est, "coef_"):
        coef = np.asarray(est.coef_)
        if coef.ndim > 1:
            coef = np.mean(np.abs(coef), axis=0)
        top_idx = np.argsort(np.abs(coef))[::-1][:10]
        top_features = ", ".join([f"{X_train_val.columns[i]} ({coef[i]:.3f})" for i in top_idx])
except Exception:
    top_features = "N/A"

# ============================================================
# === Build Summary DataFrame
# ============================================================

result = {
    "Model": "Logistic Regression",
    "CV Accuracy": round(bayes.best_score_, 3),
    "Best Params": bayes.best_params_,
    "Top 10 Features": top_features,
    "Test Accuracy": round(test_score, 3),
    "Runtime (s)": runtime,
}

df = pd.DataFrame([result])
print("\n=== Final Model Performance Summary ===")
display(df)

# ============================================================
# === Visualization of Search Space
# ============================================================

results_df = pd.DataFrame(bayes.cv_results_)
param_cols = [c for c in results_df.columns if c.startswith('param_')]
results_df = results_df.rename(columns={'mean_test_score': 'CV_Accuracy'})

# plt.figure(figsize=(10, 6))
# sns.scatterplot(
#     data=results_df,
#     x=param_cols[0],
#     y='CV_Accuracy',
#     hue='CV_Accuracy',
#     palette='viridis',
#     s=80
# )
# plt.title(f"Logistic Regression Search Space ({param_cols[0]} vs Accuracy)")
# plt.xlabel(param_cols[0])
# plt.ylabel("Cross-Validation Accuracy")
# plt.tight_layout()
# plt.show()

# ============================================================
# === Show Predictions
# ============================================================

test_results = X_test.copy()
test_results["Actual"] = y_test
test_results["Predicted"] = y_pred

if hasattr(best_model, "predict_proba"):
    test_results["Predicted_Prob"] = np.max(best_model.predict_proba(X_test), axis=1)

print("\n=== Sample Predictions ===")
print(test_results.head(10))

print("\n✅ Training complete!")
print(f"CV Accuracy: {result['CV Accuracy']}")
print(f"Test Accuracy: {result['Test Accuracy']}")
print(f"Runtime: {result['Runtime (s)']} seconds")
print(f"Top 10 Features:\n{result['Top 10 Features']}")

🚀 Starting Logistic Regression training...

=== Final Model Performance Summary ===


Unnamed: 0,Model,CV Accuracy,Best Params,Top 10 Features,Test Accuracy,Runtime (s)
0,Logistic Regression,0.769,"{'logisticregression__C': 1.1983644169040468, ...","num__QualifyingPosition (6.058), num__avg_pits...",0.6,50.683



=== Sample Predictions ===
     cat__team_Alpine Renault  cat__team_Aston Martin Aramco Mercedes  \
73                      0.000                                   0.000   
19                      0.000                                   0.000   
116                     0.000                                   0.000   
67                      0.000                                   0.000   
94                      0.000                                   0.000   
77                      0.000                                   0.000   
31                      0.000                                   0.000   
53                      0.000                                   0.000   
117                     0.000                                   0.000   
44                      0.000                                   0.000   

     cat__team_Ferrari  cat__team_McLaren  cat__team_McLaren Mercedes  \
73               0.000              1.000                       0.000   
19               0.000

In [43]:
# ============================================================
# === Configuration ===
# ============================================================

pd.options.display.float_format = '{:.3f}'.format
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# ============================================================
# === Start Gradient Boosting Model Training
# ============================================================

print("🚀 Starting Gradient Boosting training...")
t0 = time.time()

# Pipeline and parameter space
pipe = make_pipeline(MinMaxScaler(), GradientBoostingClassifier(random_state=42))
search = {
    "gradientboostingclassifier__n_estimators": Integer(10, 500),
    "gradientboostingclassifier__learning_rate": Real(0.00001, 0.2, prior="log-uniform"),
    "gradientboostingclassifier__max_depth": Integer(2, 10),
    "gradientboostingclassifier__subsample": Real(0.5, 1.0, prior="uniform"),
    "gradientboostingclassifier__min_samples_split": Integer(2, 20),
}

# Run Bayesian Optimization
bayes = BayesSearchCV(
    pipe,
    search,
    n_iter=60,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=0
)
bayes.fit(X_train_val, y_train_val, callback=[DeadlineStopper(60)])

best_model = bayes.best_estimator_

# ============================================================
# === Predictions and Test Accuracy
# ============================================================

y_pred = best_model.predict(X_test)
test_score = best_model.score(X_test, y_test)
runtime = round(time.time() - t0, 3)

# ============================================================
# === Extract Top 10 Features (Inline)
# ============================================================

est = list(best_model.named_steps.values())[-1] if hasattr(best_model, "named_steps") else best_model
top_features = "N/A"

try:
    if hasattr(est, "feature_importances_"):
        importances = np.asarray(est.feature_importances_)
        top_idx = np.argsort(importances)[::-1][:10]
        top_features = ", ".join([f"{X_train_val.columns[i]} ({importances[i]:.3f})" for i in top_idx])
    elif hasattr(est, "coef_"):
        coef = np.asarray(est.coef_)
        if coef.ndim > 1:
            coef = np.mean(np.abs(coef), axis=0)
        top_idx = np.argsort(np.abs(coef))[::-1][:10]
        top_features = ", ".join([f"{X_train_val.columns[i]} ({coef[i]:.3f})" for i in top_idx])
except Exception:
    top_features = "N/A"

# ============================================================
# === Build Summary DataFrame
# ============================================================

result = {
    "Model": "Gradient Boosting",
    "CV Accuracy": round(bayes.best_score_, 3),
    "Best Params": bayes.best_params_,
    "Top 10 Features": top_features,
    "Test Accuracy": round(test_score, 3),
    "Runtime (s)": runtime,
}

df = pd.DataFrame([result])
print("\n=== Final Model Performance Summary ===")
display(df)

# ============================================================
# === Visualization of Search Space
# ============================================================

results_df = pd.DataFrame(bayes.cv_results_)
param_cols = [c for c in results_df.columns if c.startswith('param_')]
results_df = results_df.rename(columns={'mean_test_score': 'CV_Accuracy'})

# ============================================================
# === Show Predictions
# ============================================================


test_results = X_test.copy()
test_results["Actual"] = y_test
test_results["Predicted"] = y_pred

if hasattr(best_model, "predict_proba"):
    test_results["Predicted_Prob"] = np.max(best_model.predict_proba(X_test), axis=1)
print("\n✅ Training complete!")
print(f"CV Accuracy: {result['CV Accuracy']}")
print(f"Test Accuracy: {result['Test Accuracy']}")
print(f"Runtime: {result['Runtime (s)']} seconds")
print(f"Top 10 Features:\n{result['Top 10 Features']}")

🚀 Starting Gradient Boosting training...

=== Final Model Performance Summary ===


Unnamed: 0,Model,CV Accuracy,Best Params,Top 10 Features,Test Accuracy,Runtime (s)
0,Gradient Boosting,0.745,{'gradientboostingclassifier__learning_rate': ...,"num__Pace_Qual (0.369), num__QualifyingPositio...",0.6,44.764



✅ Training complete!
CV Accuracy: 0.745
Test Accuracy: 0.6
Runtime: 44.764 seconds
Top 10 Features:
num__Pace_Qual (0.369), num__QualifyingPosition (0.091), num__Technical_Qual (0.085), num__PerformanceIndex_Qual (0.050), num__Style_Qual (0.043), num__avg_pitstop_ms (0.038), num__mechanical_faults (0.033), num__Consistency_Qual (0.033), num__Round (0.027), num__Race distance (km) (0.023)


In [1]:
# ============================================================
# === Interactive Contour Plot: n_estimators vs learning_rate (CV Accuracy)
# ============================================================

import plotly.graph_objects as go
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# Prepare data
contour_df = results_df.copy()
contour_df["n_estimators"] = contour_df["param_gradientboostingclassifier__n_estimators"].astype(float)
contour_df["learning_rate"] = contour_df["param_gradientboostingclassifier__learning_rate"].astype(float)
contour_df["CV_Accuracy"] = contour_df["CV_Accuracy"].astype(float)

# Extract data for interpolation
x = contour_df["learning_rate"]
y = contour_df["n_estimators"]
z = contour_df["CV_Accuracy"]

# Create interpolation grid
xi = np.linspace(x.min(), x.max(), 1000)
yi = np.linspace(y.min(), y.max(), 1000)
Xi, Yi = np.meshgrid(xi, yi)
Zi = griddata((x, y), z, (Xi, Yi), method='cubic')

# Create contour plot
fig = go.Figure(data=
    go.Contour(
        z=Zi,
        x=xi,  # learning_rate
        y=yi,  # n_estimators
        colorscale='Viridis',
        ncontours=100,
        contours=dict(showlabels=True, labelfont=dict(size=12, color='white')),
        colorbar=dict(title='CV Accuracy'),
    )
)

# Overlay actual sampled points
fig.add_trace(go.Scatter(
    x=x,
    y=y,
    mode='markers',
    marker=dict(
        size=7,
        color=z,
        colorscale='Viridis',
        line=dict(width=0.7, color='white'),
        showscale=False
    ),
    text=[f"CV Accuracy: {val:.3f}" for val in z],
    hovertemplate="Learning Rate: %{x:.4f}<br>n_estimators: %{y}<br>CV Accuracy: %{text}<extra></extra>",
    name="Sampled Points"
))

# Annotate best point
best_idx = np.argmax(z)
fig.add_trace(go.Scatter(
    x=[x.iloc[best_idx]],
    y=[y.iloc[best_idx]],
    mode='markers+text',
    marker=dict(size=12, color='red', symbol='star'),
    text=["Best"],
    textposition="top center",
    name="Best Model"
))

# Layout
fig.update_layout(
    title="Gradient Boosting CV Accuracy Contour (Interactive)",
    xaxis_title="Learning Rate",
    yaxis_title="Number of Estimators",
    template="plotly_white",
    width=900,
    height=600
)

fig.show()

NameError: name 'results_df' is not defined
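
The `NameError` above is consistent with the cell having been run in a fresh kernel (its execution count is `In [1]`), where `results_df` from the Gradient Boosting tuning cell no longer exists. One way to make the plotting cell independent of kernel state is to write the search results to disk right after tuning and reload them before plotting. The following is a minimal sketch under that assumption; the file name `gbm_search_results.csv` and the sample values are hypothetical stand-ins for `pd.DataFrame(bayes.cv_results_)`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the renamed cv_results_ frame; the values are invented.
results_df = pd.DataFrame({
    "param_gradientboostingclassifier__n_estimators": [50, 200, 400],
    "param_gradientboostingclassifier__learning_rate": [0.001, 0.01, 0.1],
    "CV_Accuracy": [0.70, 0.74, 0.72],
})

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "gbm_search_results.csv")

# Step 1 (tuning cell): persist the search history.
results_df.to_csv(path, index=False)

# Step 2 (plotting cell, possibly in a new kernel): reload before use.
reloaded = pd.read_csv(path)
print(reloaded.shape)
```

Alternatively, re-running the Gradient Boosting tuning cell in the same session before the contour cell restores `results_df` without any file I/O.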

## Plotting Results

Top Features

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import re

# Parse "feature (importance)" pairs from the result string
feature_str = result["Top 10 Features"]

if isinstance(feature_str, str) and feature_str != "N/A":
    # Use regex to extract (feature, importance)
    pattern = r"([\w\-]+)\s*\(([-+]?\d*\.\d+|\d+)\)"
    parsed = re.findall(pattern, feature_str)

    if parsed:
        top_features_df = pd.DataFrame(parsed, columns=["Feature", "Importance"])
        top_features_df["Importance"] = top_features_df["Importance"].astype(float)

        # Select top 5
        top5 = top_features_df.head(5)

        # Plot
        plt.figure(figsize=(8, 5))
        plt.barh(top5["Feature"], top5["Importance"])
        plt.xlabel("Importance")
        plt.ylabel("Feature")
        # Use the model name stored in `result` so the title matches the model being plotted
        plt.title(f"Top 5 Features ({result['Model']})")
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
    else:
        print("⚠️ Could not parse feature importance string.")
else:
    print("⚠️ No feature importance data available.")

Merging Data for Plotting

In [None]:
data_raw.head()

In [None]:
test_results.head()

In [None]:
test_results = test_results.rename(columns={
    'num__PerformanceIndex_Qual': 'PerformanceIndex_Qual',
    # 'race_name': 'GrandPrix',
    # 'season': 'Year'
})

final_results = pd.merge(
    test_results,
    data_raw[['PerformanceIndex_Qual','driver_code', 'GrandPrix', 'year','team',"RaceFinishPosition"]],  # keep only relevant keys
    on=['PerformanceIndex_Qual'],
    how='left'   # keeps all rows from test_results
)

final_results.head()
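
The merge above joins on `PerformanceIndex_Qual`, a float column. If that value is not unique in `data_raw`, a left merge silently duplicates test rows. The sketch below shows the effect on invented data and uses the `validate` argument of `DataFrame.merge` to surface it; the driver codes and values are hypothetical.

```python
import pandas as pd

# Invented stand-ins for test_results (left) and data_raw (right).
left = pd.DataFrame({
    "PerformanceIndex_Qual": [0.91, 0.85],
    "Predicted": [1, 0],
})
right = pd.DataFrame({
    "PerformanceIndex_Qual": [0.91, 0.91, 0.85],  # 0.91 appears twice
    "driver_code": ["AAA", "BBB", "CCC"],
})

# A plain left merge duplicates the left row whose key matches twice.
merged = left.merge(right, on="PerformanceIndex_Qual", how="left")
print(len(left), "->", len(merged))

# validate="many_to_one" raises if the key is not unique on the right side.
try:
    left.merge(right, on="PerformanceIndex_Qual", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate keys detected:", raised)
```

If identifying columns such as driver, Grand Prix, and year are available on both sides, merging on those instead would avoid relying on float equality.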

Plotting Test Results

In [None]:
data_results_v2=pd.read_csv(r"C:\Enzo_Files\AIM Data Science\AIM_Sharing\ML1_Final_Project\F1_Model_Results_v2.csv")
# data_results_v2 = final_results.copy()
data_results_v2.columns

In [None]:
data_results_v2_2 = data_results_v2[[
    'PerformanceIndex_Qual',
    'num__Round',
    'num__year',
    'num__QualifyingPosition',
    # 'num__driver_points',
    # 'num__team_points',
    'num__Finish_pct',
    'num__Accident_pct',
    'num__Collision_pct',
    'num__Race_Complexity_Score',
    'num__Safety_Index',
    'Actual',
    'Predicted',
    'Predicted_Prob',
    'driver_code',
    'GrandPrix',
    'year',
    'team',
    'RaceFinishPosition'
]].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

data_results_v2_2.head()


In [None]:
import plotly.graph_objects as go

# Sort by round so races appear in order
data_results_v2_2 = data_results_v2_2.sort_values(by='num__Round')

# Unique drivers
drivers = data_results_v2_2['driver_code'].unique()

# Create figure
fig = go.Figure()

# One trace per driver
for driver in drivers:
    driver_data = data_results_v2_2[data_results_v2_2['driver_code'] == driver]
    fig.add_trace(go.Scatter(
        x=driver_data['GrandPrix'],
        y=driver_data['RaceFinishPosition'],
        mode='markers+lines+text',
        name=driver,
        text=driver_data['RaceFinishPosition'],
        textposition='top center',
        marker=dict(size=10, line=dict(width=1, color='black')),
        hovertemplate=(
            "<b>%{x}</b><br>"
            "Driver: <b>%{customdata[0]}</b><br>"
            "Team: %{customdata[1]}<br>"
            "Round: %{customdata[2]}<br>"
            "Finish: %{y}<br>"
            "Predicted: %{customdata[3]:.3f}<br>"
            "Actual: %{customdata[4]:.3f}<br>"
            "Prob: %{customdata[5]:.3f}<extra></extra>"
        ),
        customdata=np.stack([
            driver_data['driver_code'],
            driver_data['team'],
            driver_data['num__Round'],
            driver_data['Predicted'],
            driver_data['Actual'],
            driver_data['Predicted_Prob']
        ], axis=-1)
    ))

# Reverse Y-axis (P1 at the top)
fig.update_yaxes(
    autorange='reversed',
    title_text='Race Finish Position'
)

# Grand Prix on X-axis
fig.update_xaxes(
    title_text='Grand Prix',
    tickangle=45,
    tickmode='array',
    tickvals=data_results_v2_2['GrandPrix'].unique()
)

# ============================================================
# 🏆 Add podium line at position = 3
# ============================================================
fig.add_shape(
    type='line',
    x0=-0.5,  # extend slightly before first point
    x1=len(data_results_v2_2['GrandPrix'].unique()) - 0.5,
    y0=3,
    y1=3,
    line=dict(color='red', width=3, dash='dash'),
    xref='x',
    yref='y'
)

# Add annotation label
fig.add_annotation(
    xref='paper',
    yref='y',
    x=1.02,
    y=3,
    text='🏆 Podium Cutoff (P3)',
    showarrow=False,
    font=dict(color='gold', size=12)
)

# Layout styling
fig.update_layout(
    title='🏁 Race Finish Position by Grand Prix (Colored by Driver)',
    template='plotly_white',
    hovermode='closest',
    legend_title_text='Driver',
    title_x=0.5,
    height=700,
)

fig.show()


In [None]:
import plotly.graph_objects as go
import numpy as np

# Sort data for chronological order
data_results_v2_2 = data_results_v2_2.sort_values(by='num__Round')

# Add a boolean column for correct predictions
data_results_v2_2['Correct'] = np.where(
    data_results_v2_2['Actual'] == data_results_v2_2['Predicted'], 1, 0
)

# Map colors: green = correct, red = wrong
data_results_v2_2['Color'] = np.where(
    data_results_v2_2['Correct'] == 1, "#13de3c", 'red'
)

# Get list of unique drivers
drivers = data_results_v2_2['driver_code'].unique()

# Create figure
fig = go.Figure()

# Add one trace per driver
for driver in drivers:
    driver_data = data_results_v2_2[data_results_v2_2['driver_code'] == driver]
    fig.add_trace(go.Scatter(
        x=driver_data['GrandPrix'],
        y=driver_data['RaceFinishPosition'],
        mode='markers+lines',
        name=driver,
        text=driver_data['RaceFinishPosition'],
        textposition='top center',
        marker=dict(
            size=15,
            color=driver_data['Color'],      # red/green based on correctness
            line=dict(width=1, color='black')
        ),
        hovertemplate=(
            "<b>%{x}</b><br>"
            "Driver: <b>%{customdata[0]}</b><br>"
            "Team: %{customdata[1]}<br>"
            "Round: %{customdata[2]}<br>"
            "Finish: %{y}<br>"
            "Predicted: %{customdata[3]:.3f}<br>"
            "Actual: %{customdata[4]:.3f}<br>"
            "<b>Correct:</b> %{customdata[5]}<br>"
            "Prob: %{customdata[6]:.3f}<extra></extra>"
        ),
        customdata=np.stack([
            driver_data['driver_code'],
            driver_data['team'],
            driver_data['num__Round'],
            driver_data['Predicted'],
            driver_data['Actual'],
            np.where(driver_data['Correct'] == 1, '✅ Yes', '❌ No'),
            driver_data['Predicted_Prob']
        ], axis=-1)
    ))

# Reverse Y-axis so P1 is on top
fig.update_yaxes(
    autorange='reversed',
    title_text='Race Finish Position'
)

# Set Grand Prix labels on X-axis
fig.update_xaxes(
    title_text='Grand Prix',
    tickangle=45,
    tickmode='array',
    tickvals=data_results_v2_2['GrandPrix'].unique()
)

# Add Podium line (P3)
fig.add_shape(
    type='line',
    x0=-0.5,
    x1=len(data_results_v2_2['GrandPrix'].unique()) - 0.5,
    y0=3,
    y1=3,
    line=dict(color='red', width=3, dash='dash'),
    xref='x',
    yref='y'
)

# Layout styling
fig.update_layout(
    title='🏁 Race Finish Position by Grand Prix (Colored by Prediction Accuracy)',
    template='plotly_white',
    hovermode='closest',
    legend_title_text='Driver',
    title_x=0.5,
    height=700,
)

fig.write_html("F1_Main_graph.html")
fig.show()



Comparing with Benchmark

data_check

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Work on a copy so the benchmark column is not added to data_results_v2_2 itself
data_check = data_results_v2_2.copy()

data_check["Qual_prediction"] = (data_check["num__QualifyingPosition"] <= 3).astype(int)


data_check.head()

# If needed: ensure binary ints (adjust mapping as appropriate)
y_true = pd.to_numeric(data_check["Actual"], errors="coerce").astype(int)

# If Predicted is probabilities, threshold at 0.5; otherwise cast to int
pred_raw = data_check["Predicted"]
y_pred = ((pred_raw >= 0.5).astype(int)
          if np.issubdtype(pred_raw.dtype, np.number) and not np.issubdtype(pred_raw.dtype, np.integer)
          else pd.to_numeric(pred_raw, errors="coerce").astype(int))

y_rule = pd.to_numeric(data_check["Qual_prediction"], errors="coerce").astype(int)

# Confusion matrices
cm_model = confusion_matrix(y_true, y_pred, labels=[0, 1])
cm_rule  = confusion_matrix(y_true, y_rule, labels=[0, 1])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay(cm_model, display_labels=[0,1]).plot(ax=axes[0], colorbar=False)
axes[0].set_title("Model Predicted vs Actual")
ConfusionMatrixDisplay(cm_rule, display_labels=[0,1]).plot(ax=axes[1], colorbar=False)
axes[1].set_title("Rule Qual_prediction vs Actual")
plt.tight_layout()
plt.show()

# Summary metrics
summary = pd.DataFrame({
    "accuracy":  [accuracy_score(y_true, y_pred), accuracy_score(y_true, y_rule)],
    "precision": [precision_score(y_true, y_pred, zero_division=0), precision_score(y_true, y_rule, zero_division=0)],
    "recall":    [recall_score(y_true, y_pred, zero_division=0), recall_score(y_true, y_rule, zero_division=0)],
    "f1":        [f1_score(y_true, y_pred, zero_division=0), f1_score(y_true, y_rule, zero_division=0)],
}, index=["Model", "Rule"])
print(summary)
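
The benchmark in the cell above is a one-line rule: predict a positive outcome whenever the driver qualified in the top three. As a sketch of how that rule is scored, the example below computes accuracy, precision, and recall by hand on a made-up eight-row sample. The `Actual` label is assumed here to be a binary podium indicator, and the numbers are illustrative only; they say nothing about the real results.

```python
import pandas as pd

# Invented sample: qualifying position and an assumed binary podium label.
toy = pd.DataFrame({
    "QualifyingPosition": [1, 2, 3, 4, 5, 6, 7, 8],
    "Actual":             [1, 1, 0, 1, 0, 0, 0, 0],
})

# Rule-based benchmark: top-3 qualifiers are predicted as positives.
toy["Qual_prediction"] = (toy["QualifyingPosition"] <= 3).astype(int)

# Confusion-matrix cells counted directly from the two columns.
tp = int(((toy["Qual_prediction"] == 1) & (toy["Actual"] == 1)).sum())
fp = int(((toy["Qual_prediction"] == 1) & (toy["Actual"] == 0)).sum())
fn = int(((toy["Qual_prediction"] == 0) & (toy["Actual"] == 1)).sum())
tn = int(((toy["Qual_prediction"] == 0) & (toy["Actual"] == 0)).sum())

accuracy = (tp + tn) / len(toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Because the rule needs no training, it gives a floor that any tuned model should clear on the same rows before its added complexity is justified.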



