# 7. F1 Prediction Project - Model performance for betting (2024 F1 Data)

<img src="Betting.png" alt="Betting" width="500"/>

# Table of Contents
- [Introduction](#Introduction)
- [1. Import Packages](#1.-Import-Packages)
- [2. Load Data](#2.-Load-Data)
- [3. Load Random Forest Model](#3.-Load-Random-Forest-Model)
- [4. Extract Feature Importance from Model](#4-extract-feature-importance-from-model)
- [5. Classification Report and Confusion Matrix](#classification-report-and-confusion-matrix)
- [6. 2024 Betting Analysis](#2024-betting-analysis)
- [7. SHAP Values Analysis](#where-did-the-model-make-errors---shapley-values)
- [Conclusion](#conclusion)


# Introduction

So far in this project, we've embarked on a captivating journey through the world of Formula 1, wielding the tools of data science to predict race winners and assess betting performance. In this notebook, we explore how well our best model predicts the winners of the 2024 races, a dataset the model hasn't seen yet.

As we step through the analysis, you'll learn how our model interprets the wealth of race data and how each variable influences the likelihood of a driver emerging victorious. The notebook culminates in an exploration of betting strategies, employing our model's predictions to hypothetically place bets across the season's races and evaluate the financial outcomes.

By the end of this journey, you'll have witnessed the power of machine learning in sports analytics, gained insight into the interpretability of complex models, and understood how predictive modeling can inform real-world decisions—even in scenarios as dynamic and uncertain as Formula 1 racing.

#### 1. Import Packages

In [2]:
# Basic Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from sklearn.metrics import confusion_matrix, classification_report


import joblib

# If you get an ImportError, you may need to install these packages using pip.
try:
    # Import shap, lime
    import shap
    import lime
    import lime.lime_tabular
except ImportError as e:
    print(e)
    print("You need to install the missing modules using pip (e.g., 'pip install shap lime')")


#### 2. Load Data

In [3]:
# Load DataFrame
file = 'C:/Users/Alex/OneDrive/BrainStation/Data_Science_Bootcamp/Capstone_Project/capstone-Aboard89/Notebooks/model_data_2024_race_4_no_winner.csv'
df = pd.read_csv(file)

We'll do a quick check on what the data looks like:

In [4]:
# Display all columns
pd.set_option('display.max_columns', None)

# Now when you run df.head(), you'll see all columns
df.head()

Unnamed: 0,year,age,years_in_f1,races_with_each_team_since_1995,F2_champion,Former_F1_World_Champion,home_race,starting_grid_position,points_in_previous_race,laps_in_previous_race,constructorId_points_at_stage_of_season,driver_points_at_stage_of_season,race_70th Anniversary Grand Prix,race_Abu Dhabi Grand Prix,race_Argentine Grand Prix,race_Australian Grand Prix,race_Austrian Grand Prix,race_Azerbaijan Grand Prix,race_Bahrain Grand Prix,race_Belgian Grand Prix,race_Brazilian Grand Prix,race_British Grand Prix,race_Canadian Grand Prix,race_Chinese Grand Prix,race_Dutch Grand Prix,race_Eifel Grand Prix,race_Emilia Romagna Grand Prix,race_European Grand Prix,race_French Grand Prix,race_German Grand Prix,race_Hungarian Grand Prix,race_Indian Grand Prix,race_Italian Grand Prix,race_Japanese Grand Prix,race_Korean Grand Prix,race_Luxembourg Grand Prix,race_Malaysian Grand Prix,race_Mexican Grand Prix,race_Mexico City Grand Prix,race_Miami Grand Prix,race_Monaco Grand Prix,race_Pacific Grand Prix,race_Portuguese Grand Prix,race_Qatar Grand Prix,race_Russian Grand Prix,race_Sakhir Grand Prix,race_San Marino Grand Prix,race_Saudi Arabian Grand Prix,race_Singapore Grand Prix,race_Spanish Grand Prix,race_Styrian Grand Prix,race_SÃ£o Paulo Grand Prix,race_Turkish Grand Prix,race_Tuscan Grand Prix,race_United States Grand Prix,engine_manufacturer_Acer,engine_manufacturer_Arrows,engine_manufacturer_Asiatech,engine_manufacturer_BMW,engine_manufacturer_Cosworth,engine_manufacturer_Ferrari,engine_manufacturer_Ford,engine_manufacturer_Hart,engine_manufacturer_Honda,engine_manufacturer_Mecachrome,engine_manufacturer_Mercedes,engine_manufacturer_Mugen-Honda,engine_manufacturer_Petronas,engine_manufacturer_Peugeot,engine_manufacturer_Playlife,engine_manufacturer_Red Bull,engine_manufacturer_Renault,engine_manufacturer_Supertec,engine_manufacturer_Toro 
Rosso,engine_manufacturer_Toyota,engine_manufacturer_Yamaha,constructor_nationality_American,constructor_nationality_Austrian,constructor_nationality_British,constructor_nationality_Dutch,constructor_nationality_French,constructor_nationality_German,constructor_nationality_Indian,constructor_nationality_Irish,constructor_nationality_Italian,constructor_nationality_Japanese,constructor_nationality_Malaysian,constructor_nationality_Russian,constructor_nationality_Spanish,constructor_nationality_Swiss,Nationality_American,Nationality_Argentine,Nationality_Australian,Nationality_Austrian,Nationality_Belgian,Nationality_Brazilian,Nationality_British,Nationality_Canadian,Nationality_Chinese,Nationality_Colombian,Nationality_Czech,Nationality_Danish,Nationality_Dutch,Nationality_Finnish,Nationality_French,Nationality_German,Nationality_Hungarian,Nationality_Indian,Nationality_Indonesian,Nationality_Irish,Nationality_Italian,Nationality_Japanese,Nationality_Malaysian,Nationality_Mexican,Nationality_Monegasque,Nationality_New Zealander,Nationality_Polish,Nationality_Portuguese,Nationality_Russian,Nationality_Spanish,Nationality_Swedish,Nationality_Swiss,Nationality_Thai,Nationality_Venezuelan,Mechanical,Driver_Issue,Lapped,Number_Of_Stops,Total_time_in_pits,Avg_time_in_pits,Weather_Conditions_Dry,Weather_Conditions_Rain,Weather_Conditions_Very changeable,Circuit_Type_Permanent Race Track,Circuit_Type_Street Circuit,Circuit_Type_Street 
Circuit.1,constructorId_10,constructorId_11,constructorId_117,constructorId_12,constructorId_13,constructorId_131,constructorId_14,constructorId_15,constructorId_16,constructorId_164,constructorId_166,constructorId_17,constructorId_18,constructorId_19,constructorId_2,constructorId_20,constructorId_205,constructorId_206,constructorId_207,constructorId_208,constructorId_209,constructorId_21,constructorId_210,constructorId_211,constructorId_213,constructorId_214,constructorId_22,constructorId_23,constructorId_24,constructorId_25,constructorId_26,constructorId_27,constructorId_28,constructorId_29,constructorId_3,constructorId_30,constructorId_31,constructorId_4,constructorId_5,constructorId_51,constructorId_6,constructorId_7,constructorId_8,constructorId_9
0,2024,29,9,66,0,0,0,4,0,57,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,48.34,24.17,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,2024,34,13,66,0,0,0,5,12,58,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,49.08,24.54,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,2024,26,9,166,0,1,0,1,26,58,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,49.27,24.64,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,2024,24,2,44,0,0,0,17,0,58,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,50.3,25.15,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,2024,34,11,44,0,0,0,16,0,57,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,26.42,13.21,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


### 3. Load Random Forest Model

Here we import our best-performing model from Notebook 6, which was saved as a `.pkl` file. A `.pkl` file, also known as a pickle file, stores a Python object while preserving its structure and contents. In a data science project, this is useful for saving machine learning models after training so they can be reused later without retraining. Saving a model serializes the model object, converting it into a byte stream that can be written to a file. Loading the file later deserializes the byte stream back into the original object, ready for making predictions or further analysis. In this notebook we load the file with `joblib`, which builds on the same pickle mechanism and is commonly used for scikit-learn models.

Here we apply our best model (Random Forest SMOTE) to the 2024 F1 race test set, which the model hasn't seen before.
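As a minimal sketch of the save-and-reload round trip described above, the snippet below trains a toy classifier, writes it to a temporary file with `joblib`, and loads it back. The data, the name `toy_model`, and the file name are illustrative assumptions, not the project's actual model or paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data, purely illustrative (not the F1 dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train a small model, then serialize it to disk
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
joblib.dump(toy_model, path)

# Later (or in another notebook): deserialize and reuse without retraining
reloaded = joblib.load(path)
same_predictions = bool((reloaded.predict(X) == toy_model.predict(X)).all())
print(same_predictions)
```

Pickle-based files can execute arbitrary code when loaded, so only load files you trust, and it is safest to load a model with the same scikit-learn version that saved it.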

In [5]:
import joblib  # or you can use: import pickle

# Load the Random Forest model
model = joblib.load('C:/Users/Alex/OneDrive/BrainStation/Data_Science_Bootcamp/Capstone_Project/capstone-Aboard89/Notebooks/random_forest_grid_search.pkl')
# If you prefer pickle, use: model = pickle.load(open('/mnt/data/random_forest_grid_search.pkl', 'rb'))

### 4. Extract Feature Importance from Model

In [6]:
best_model = model.best_estimator_

The code `best_model = model.best_estimator_` is assigning the best performing model, found during the hyperparameter tuning process (like grid search or random search), to the variable `best_model` for later use.
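To make the role of `best_estimator_` concrete, here is a small sketch of a grid search over a pipeline. The step names `scl` and `clf` match the ones used later in this notebook, but the toy data, the `StandardScaler`, and the parameter grid are assumptions for illustration; the real pipeline from Notebook 6 may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = (X[:, 0] > 0).astype(int)

# Hypothetical pipeline: a scaler followed by a random forest
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {"clf__n_estimators": [10, 20], "clf__max_depth": [2, 4]}

# Fit every parameter combination with 3-fold cross-validation
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)

# The winning pipeline, refitted on all of X and y
best = search.best_estimator_
print(search.best_params_)
print(type(best.named_steps["clf"]).__name__)

# With the default refit=True, search.predict() delegates to best_estimator_
delegates = bool((search.predict(X) == best.predict(X)).all())
print(delegates)
```

This delegation is why calling `predict` on the loaded search object and on `best_model` should give the same result.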

In [7]:
model

In [8]:
# To make predictions
y_pred = model.predict(df)

In [9]:
# now we extract the soft prediction from the model
y_pred_proba = model.predict_proba(df)

In [10]:
y_pred_proba[:,1]

array([0.05      , 0.08      , 0.3       , 0.        , 0.        ,
       0.        , 0.01      , 0.07      , 0.18      , 0.02      ,
       0.        , 0.01      , 0.        , 0.02      , 0.        ,
       0.        , 0.        , 0.1       , 0.        , 0.        ,
       0.14      , 0.3       , 0.59      , 0.01      , 0.01      ,
       0.04      , 0.01      , 0.08      , 0.13      , 0.01      ,
       0.02      , 0.01      , 0.        , 0.02      , 0.02      ,
       0.        , 0.        , 0.16      , 0.        , 0.        ,
       0.27995548, 0.17      , 0.18      , 0.04      , 0.04      ,
       0.        , 0.04      , 0.12      , 0.18      , 0.01      ,
       0.        , 0.01      , 0.        , 0.09      , 0.02      ,
       0.        , 0.06      , 0.13      , 0.03      , 0.        ,
       0.15      , 0.33      , 0.56      , 0.        , 0.        ,
       0.05      , 0.05      , 0.14      , 0.14      , 0.02      ,
       0.04      , 0.01      , 0.        , 0.13      , 0.01   

The output above shows the model's predicted probability of the positive class (a win) for each driver in each race. These are the soft predictions from which the model derives its hard prediction of who will win.
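The relationship between soft and hard predictions can be sketched as follows. The probabilities are made up, and the sketch assumes that rows are ordered by race with the same number of drivers per race; neither assumption is checked against the real data.

```python
import numpy as np

# Illustrative win probabilities for 2 races x 3 drivers (not model output)
proba_win = np.array([0.05, 0.30, 0.08, 0.14, 0.59, 0.01])
drivers_per_race = 3  # assumption for this toy example

# Hard prediction: 1 only when the win probability exceeds 0.5
hard = (proba_win > 0.5).astype(int)

# Soft pick: the most likely winner within each race, even if below 0.5
per_race = proba_win.reshape(-1, drivers_per_race)
top_pick = per_race.argmax(axis=1)

print(hard.tolist())      # [0, 0, 0, 0, 1, 0]
print(top_pick.tolist())  # [1, 1]
```

In the first toy race nobody crosses 0.5, so there is no hard prediction, yet the soft pick still identifies the driver the model favors.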

In [11]:
# adding these prediction back to the dataframe as 'prediction_probability'
df['prediction_probability'] = y_pred_proba[:,1]

In [12]:
# adding hard predictions as column named 'predictions'
df['predictions'] = y_pred

In [13]:
# save this to CSV
df.to_csv('2024_Races_with_predictions.csv', index=False)

### Classification Report and Confusion Matrix

In [14]:
file = 'C:/Users/Alex/OneDrive/BrainStation/Data_Science_Bootcamp/Capstone_Project/capstone-Aboard89/Data/Betting/y_true.csv'
y_true = pd.read_csv(file)


Generate Confusion Matrix
To generate the confusion matrix, you can use the confusion_matrix function.

In [15]:
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[76  0]
 [ 2  2]]


The confusion matrix is a fundamental tool in classification tasks, helping us understand the performance of our predictive model by showing how its predictions stack up against the actual outcomes. In our F1 race winner prediction project, the matrix provides a snapshot of the model's performance over the first four races of the 2024 season.

Looking at the matrix, we have:
- 76 true negatives (TN): instances where the model correctly predicted a non-winner.
- 2 false negatives (FN): occasions where the model incorrectly predicted a non-winner when the driver actually won.
- 0 false positives (FP): there were no cases where the model predicted a win for a driver who did not win.
- 2 true positives (TP): instances where the model accurately predicted the race winner.

From these figures, we can see that our model is quite conservative: it rarely predicts a win, but it is accurate when it does (TP). The absence of false positives (FP) indicates that when the model predicts a win, it has so far always been right. However, the two false negatives (FN) show that half of the actual winners were missed, so there is room to improve the model's sensitivity to potential winners.

For the predictive analytics of our F1 project, this confusion matrix reveals that our model is highly specific but not overly sensitive. It excels at identifying non-winning outcomes, which is valuable, but to be truly effective, it must also minimize false negatives and not miss out on identifying the winners. This information is pivotal as we refine our model, aiming for a balance between precision (minimizing FP) and recall (minimizing FN), ensuring that we capture as many true winners (TP) as possible without mistakenly flagging non-winners as winners.
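The metrics discussed above can be derived directly from the four cells of the matrix. This is a small sketch using the matrix printed earlier; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: actual, columns: predicted)
cm = np.array([[76, 0],
               [2, 2]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # how often a predicted win is a real win
recall = tp / (tp + fn)           # share of actual winners that were found
specificity = tn / (tn + fp)      # share of non-winners correctly rejected
accuracy = (tp + tn) / cm.sum()   # share of all rows predicted correctly

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.3f}")
```

With 76 of the 80 rows being non-winners, accuracy is dominated by the negative class, which is why recall for winners is the more informative number here.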

### Generate Classification Report
To generate the classification report, which includes metrics such as precision, recall, f1-score, and accuracy, use the `classification_report` function.

In [16]:
cr = classification_report(y_true, y_pred)
print("Classification Report:")
print(cr)


Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        76
           1       1.00      0.50      0.67         4

    accuracy                           0.97        80
   macro avg       0.99      0.75      0.83        80
weighted avg       0.98      0.97      0.97        80



The classification report for the 2024 test dataset summarizes the performance metrics of our best classification model (Random Forest SMOTE from Notebook 6). For Class 0 (non-winners), the precision is very high at 0.97, which indicates that when the model predicts a driver won't win, it's correct 97% of the time. The recall for Class 0 is perfect, at 1.00, meaning the model identifies all actual non-winners correctly. The F1-score, which balances precision and recall, is 0.99, showing outstanding performance for the negative class.

For Class 1 (winners), the precision is also perfect at 1.00, indicating that every driver predicted to win actually won. However, the recall is only 0.50, meaning the model identified just two of the four actual winners. The F1-score for winners is lower at 0.67, reflecting the trade-off between the high precision and lower recall. This implies that while the model is very reliable when it does predict a winner, it is conservative and misses half of the actual winning cases.
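The F1-scores in the report can be reproduced by hand, since F1 is the harmonic mean of precision and recall. The helper name `f1_score_from` is illustrative; the counts come from the confusion matrix shown earlier.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class 1 (winners): precision 2/2, recall 2/4
f1_winners = f1_score_from(2 / 2, 2 / 4)

# Class 0 (non-winners): precision 76/78, recall 76/76
f1_non_winners = f1_score_from(76 / 78, 76 / 76)

# Macro average: unweighted mean of the per-class scores
macro_f1 = (f1_winners + f1_non_winners) / 2

print(round(f1_winners, 2), round(f1_non_winners, 2), round(macro_f1, 2))
# 0.67 0.99 0.83
```

The macro average treats both classes equally, which is why it sits well below the weighted average in the report.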

Comparing these results with the original test results for the same Random Forest SMOTE model tells a different story. On the original test set, the model had an overall accuracy of 0.95, a macro F1-score of 0.74, and, for winners, a precision of 0.50 and a recall of 0.52. On the 2024 data, recall for winners is essentially unchanged (0.50), while precision is much higher (1.00).

For the rest of the season, the trade-off between false positives and false negatives still matters. The 2024 figures are based on only 80 rows and four race winners, and the original test results showed a precision of 0.50 and a recall of 0.52, so although the model performed well here, we shouldn't blindly accept its predictions. Instead, it can serve as another data-driven tool to inform a betting strategy for Formula 1 races.

In conclusion, the original test results suggest that the model will not usually be as precise at identifying winners as the 2024 classification report implies, but its performance on both datasets indicates that it is more useful for predicting race outcomes across a season than blind guessing. As with any model used in a dynamic environment such as F1 racing, continuous monitoring and adjustment of the model would be required to adapt to the evolving data through the season.

# 2024 Betting Analysis

![would We Have Won Money](would_we_have_won_money.png)


The graphic above presents the model's predictions against the actual race winners for a selection of the 2024 F1 races, along with the corresponding betting odds and potential winnings.

Highlighted in bold are the races where our model confidently predicted the winner (the "hard predictions"), while the more faded images represent races where the model saw a high likelihood of winning, but not enough to make a firm call (these are "soft predictions"). The strong visual contrast effectively communicates the model's level of certainty in its predictions.

The betting odds are denoted by the multipliers (e.g., 1.25x), indicating the return on a bet for the predicted winner. For instance, betting £100 on a driver with odds of 1.25x would return £125, yielding a profit of £25 if the prediction was correct. In this example, betting on the model’s hard predictions would have resulted in winning bets for some races, evidenced by the highlighted winnings, contributing to an overall profit of £92 for the season thus far.
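The payout arithmetic above can be sketched as a small helper. The function name `bet_profit` and the list of bets are hypothetical; only the £100 at 1.25x example comes from the text, and the other odds and outcomes are made up for illustration.

```python
def bet_profit(stake: float, decimal_odds: float, won: bool) -> float:
    """Profit for a single bet at decimal odds; a losing bet forfeits the stake."""
    return stake * (decimal_odds - 1) if won else -stake

# The worked example from the text: £100 at 1.25x
print(bet_profit(100, 1.25, won=True))   # 25.0
print(bet_profit(100, 1.25, won=False))  # -100

# Season profit is the sum over all bets (odds and outcomes here are made up)
hypothetical_bets = [(100, 1.25, True), (100, 2.0, False), (100, 1.5, True)]
total = sum(bet_profit(s, o, w) for s, o, w in hypothetical_bets)
print(total)  # -25.0
```

Note that short odds leave little margin for error: a bet at 1.25x only breaks even in the long run if the pick wins at least 1 / 1.25 = 80% of the time.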

This visualization not only serves as a validation of the model's predictive capabilities but also illustrates a practical application: using data science insights to inform betting strategies in sports. While the model has shown promise, it’s also clear that high-probability predictions (soft predictions) don't always translate to actual wins, reminding us of the inherent uncertainties in both the sport and in predictive modeling.

# Where did the model make errors - Shapley Values?

Imagine we have a group of advisors (our RandomForestClassifier model) who help us predict who will win a Formula 1 race. We want to understand how each advisor makes their decisions. To do this, we have a tool (SHAP TreeExplainer) that allows us to see which factors each advisor considers most important when predicting the winner. 

Before we can use this tool, we need to make sure we only give it the information that our advisors used to make their predictions—like the driver's past performance and car specifications—without any extra details. We also need to present this information in a way they're used to seeing it, which is why we 'transform' it first, kind of like translating it into their language.

Once everything is set up, we use the SHAP tool on the transformed information, and it tells us exactly how much each piece of information influenced the advisors' predictions. This is a bit like finding out whether an advisor gives more weight to a driver’s experience, the car’s speed, or the weather conditions on race day. This process is like having a behind-the-scenes look at how the model thinks and makes its choices for a driver's chance of winning.
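One simple way to read "how much each piece of information influenced the prediction" is to rank the features of a single row by the absolute size of their SHAP values. The helper below is a hypothetical sketch with toy numbers and toy feature names; it does not use values from the model.

```python
import numpy as np

def top_contributions(shap_row, feature_names, k=3):
    """Return the k features with the largest absolute SHAP value for one row."""
    shap_row = np.asarray(shap_row, dtype=float)
    # Sort by descending magnitude, keeping the sign for interpretation
    order = np.argsort(-np.abs(shap_row))[:k]
    return [(feature_names[i], float(shap_row[i])) for i in order]

# Toy numbers, not taken from the model
toy_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
toy_shap = [0.02, -0.15, 0.07, -0.01]

print(top_contributions(toy_shap, toy_names, k=2))
# [('feature_b', -0.15), ('feature_c', 0.07)]
```

Once `shap_values` has been computed in the cell below, a call such as `top_contributions(shap_values[1][2], list(df_shapley.columns))` would apply the same idea to a real row, assuming `shap_values` is a list with one array per class as the printed shapes indicate.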

In [22]:
%%time
# This calculation might take 2-3 minutes!

# Extract the RandomForestClassifier from the pipeline
rf_model = best_model.named_steps['clf']

# Now you can create a TreeExplainer with the extracted RandomForest model
explainer = shap.TreeExplainer(rf_model)

# Drop the prediction columns added earlier so only the model's input features remain
df_shapley = df.drop(columns=['prediction_probability', 'predictions'])

# Transform the features using the scaler from the pipeline
df_test_transformed = best_model.named_steps['scl'].transform(df_shapley)

# Now obtain SHAP values for the transformed test set
shap_values = explainer.shap_values(df_test_transformed)

CPU times: total: 6.25 s
Wall time: 8.33 s


In [23]:
for array in shap_values:
    print(array.shape)

(80, 180)
(80, 180)


In [None]:
# shap_values is a list with one (80, 180) array per class,
# so index the class first (1 = win), then the row
print(list(df_shapley.columns))
print(list(shap_values[1][0]))
print(list(shap_values[1][10]))

['year', 'age', 'years_in_f1', 'races_with_each_team_since_1995', 'F2_champion', 'Former_F1_World_Champion', 'home_race', 'starting_grid_position', 'points_in_previous_race', 'laps_in_previous_race', 'constructorId_points_at_stage_of_season', 'driver_points_at_stage_of_season', 'race_70th Anniversary Grand Prix', 'race_Abu Dhabi Grand Prix', 'race_Argentine Grand Prix', 'race_Australian Grand Prix', 'race_Austrian Grand Prix', 'race_Azerbaijan Grand Prix', 'race_Bahrain Grand Prix', 'race_Belgian Grand Prix', 'race_Brazilian Grand Prix', 'race_British Grand Prix', 'race_Canadian Grand Prix', 'race_Chinese Grand Prix', 'race_Dutch Grand Prix', 'race_Eifel Grand Prix', 'race_Emilia Romagna Grand Prix', 'race_European Grand Prix', 'race_French Grand Prix', 'race_German Grand Prix', 'race_Hungarian Grand Prix', 'race_Indian Grand Prix', 'race_Italian Grand Prix', 'race_Japanese Grand Prix', 'race_Korean Grand Prix', 'race_Luxembourg Grand Prix', 'race_Malaysian Grand Prix', 'race_Mexican G

In [33]:
# Initiate Javascript for visualization
shap.initjs()

# Plot SHAP values for row 2 (index 2) of the positive class
shap.force_plot(
    explainer.expected_value[1],      # Expected (base) value for the positive class
    shap_values[1][2],                # SHAP values for the positive class, row 2
    features=df_test_transformed[2],  # The same row from the transformed features
    feature_names=df_shapley.columns  # Feature names from the original DataFrame
)


The force plot above is a SHAP value analysis of a single prediction. It shows how individual features, such as a driver's grid position and the points they and their team have at this stage of the season, influence the model's output. In SHAP's default color scheme, features pushing the prediction higher (toward "WIN") are shown in red, and features pushing it lower (toward "NO WIN") are shown in blue. This plot relates to the first race of the year and to Max Verstappen, the driver to whom the model gave the highest soft win prediction. Let's break this down in more detail.

![first_race_shapley](first_race_shapley.png)

For this race (the first of the season), both `constructorId_points_at_stage_of_season` and `driver_points_at_stage_of_season` were zero, because neither the team nor the driver had accumulated any points yet. These features likely have a positive relationship with the likelihood of a win (the model has probably learned that drivers and teams with more points are more likely to win), so the lack of points contributed negatively, pulling the prediction below the base value of 0.5003. The base value is the model's average output, which SHAP uses as the starting point for every explanation. It is not itself the decision threshold, although it sits very close to the 0.5 probability a driver must exceed for the model to make a hard "win" prediction.

Even though the prediction for Max Verstappen was not a hard "win", he was the driver with the highest soft probability. Despite the lack of points data, other factors such as team, starting grid position, and performance in the final race of 2023 influenced the model to lean in his favor, albeit not enough to cross the threshold for a definitive win prediction. This instance illustrates the model's nuanced approach to prediction, where it assesses all available data to estimate the outcome, and also highlights the model's potential cautiousness at the start of the season when less data is available.
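SHAP values are additive: the base value plus the sum of a row's SHAP values equals the model's output for that row. The sketch below uses that property to work out how much the features must have moved this prediction in total. The base value is the one quoted in the text; the prediction of 0.30 is an assumption, taken as the win probability for row 2 in the array printed earlier.

```python
def implied_shap_total(base_value: float, prediction: float) -> float:
    """Total SHAP contribution implied by base_value + sum(shap_values) == prediction."""
    return prediction - base_value

base_value = 0.5003  # base value quoted in the text
prediction = 0.30    # assumed: win probability for row 2 in the earlier output

total = implied_shap_total(base_value, prediction)
print(round(total, 4))  # -0.2003
```

In other words, under these assumptions the features shown in the force plot jointly pull the prediction down by about 0.20 from the starting point.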

# Conclusion

In our notebook on the F1 Betting Performance project, we've taken a comprehensive journey through the intricacies of predictive analytics in the context of sports betting. We began by importing the necessary libraries and data, setting the stage for our analytical endeavor. With our data in hand, we deployed our best model (Random Forest SMOTE), fine-tuned through meticulous grid searching to forecast the outcomes of F1 races in 2024.

We delved into the heart of the model to extract its wisdom—understanding the features that most significantly impacted its predictions using SHAP values, a technique that reveals the contribution of each feature towards the predictive outcome. We've learned that factors such as a driver's points in the season, their team's performance, and starting grid positions are pivotal in shaping the model's foresight.

Through the confusion matrix and classification report, we've quantified the model's predictive accuracy and gleaned insights into its strengths and weaknesses. Notably, we identified a conservative bent in the model's predictions—excellent at flagging non-winners but sometimes missing the mark on actual winners, which provides a clear direction for future model refinement.

We've also ventured into the practical application of our predictions by translating them into a betting strategy. By contrasting the model's predictions with real race outcomes and the associated betting odds, we established a hypothetical financial outcome, finding that our model, while not infallible, could indeed be a valuable tool for informed betting, albeit with room for improvement.

In essence, this notebook has not only enhanced our understanding of model interpretability and evaluation but also showcased how data science can intersect with real-world applications like betting, reinforcing the value of a data-driven approach to decision-making in uncertain environments like F1 racing.