# ML1 Final Project: F1 Race Finish Prediction
**Work by LT 2**

---

## Background

Formula 1 (F1) is one of the most prestigious motorsport competitions in the world, if not the most prestigious. Over the course of a season, 20 drivers from 10 teams race against each other at speeds of up to 370 km/h to determine who is the best driver and which team is the best in terms of car performance, strategy, and more. F1 earned $3.65 billion in revenue in 2024, 25% more than in the previous year. This revenue comes from race promotion fees, media rights, sponsorships, and other sources such as high-margin hospitality, support series, and merchandise. Of these revenue streams, media rights contribute the largest share of annual revenue (32.8%), in the form of lucrative broadcasting agreements with major global networks and digital streaming services, including the partnership with Netflix on the Drive to Survive television series. As with many other globally renowned sports, F1 is closely tied to sports betting. Sports betting is a highly lucrative market projected to reach $17.23 billion in revenue in 2025, and F1 accounts for 0.4% of the global betting handle.

## Motivation

With these figures in mind, both teams and sports bettors stand to gain a lot from knowing whether a driver and their car will win a race. Teams would be able to tell when their car and driver are performing poorly, which would allow them to make the necessary adjustments as early as the end of the practice sessions or as late as the end of qualifying (the session held before the race that determines each driver's starting position). Bettors, for their part, want to know which driver-car duo has the highest probability of winning so they can place the right bets. Whether for an F1 team or a bettor, the goal is ultimately to win. For an F1 team, winning means more prize money and increased merchandise and car sales, especially for car brands associated with teams.

This study was also inspired by the work of Katelyn Castillo, Christopher Nash Jasmin, Jhedson Angelo Petilo, and Louie Sangalang from the MSDS 2025 cohort, whose DMW1 final project last year produced a driver performance index that quantifies the performance of F1 drivers. To continue their work, the team wanted to include other factors that influence race results, such as car/team performance and track difficulty. This section is included specifically to acknowledge the contribution of Kate and Nash's team to the current study.

## Objective

With this in mind, the group set out to train a machine learning model that predicts whether a driver-car duo will finish on the podium (1st to 3rd place), based on driver performance, car/team performance, and track difficulty.
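
The podium label described above can be derived from a finishing position column. The following is a minimal sketch, assuming the `RaceFinishPosition` column that this notebook sets as its target; the `podium` column name and the toy values are illustrative only, and treating a missing finish position as "no podium" is an assumption rather than a documented choice of the group.

```python
import numpy as np
import pandas as pd

# Toy finishing positions for illustration only (not taken from the dataset).
# NaN stands in for an entry without a recorded finish position.
results = pd.DataFrame({"RaceFinishPosition": [1.0, 4.0, 3.0, np.nan, 12.0]})

# Assumption: a podium is a finish in 1st-3rd place. NaN compares as False,
# so entries without a finish position are labelled 0 (no podium).
results["podium"] = (results["RaceFinishPosition"] <= 3).astype(int)

print(results)
```

Framing the problem this way turns a 20-way position prediction into a binary classification task, which matches the classifiers imported later in the notebook.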

## Dataset Information

To train machine learning models that predict F1 race finishes, data was collected from three sources: the FastF1/Ergast API, the official F1 website, and Kaggle. The FastF1 Python package retrieves data such as telemetry, lap times, and race results through the FastF1 and Ergast APIs. Historical constructors' and drivers' points were collected from the official F1 website. Lastly, historical data on race track incidents and their causes was collected from Kaggle. The data from the three sources was then consolidated for analysis and modelling.
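
The consolidation step is not shown in this notebook, so the following is only a sketch of how three such tables might be joined with pandas. The frame names, the choice of join keys (`year`, `Round`, `team`), the assignment of columns to sources, and all values are assumptions made for illustration.

```python
import pandas as pd

# Placeholder frames standing in for the three sources. All values are made up,
# and which source supplies which column is an assumption.
session_df = pd.DataFrame({          # e.g. session results from FastF1/Ergast
    "year": [2024, 2024],
    "Round": [1, 1],
    "team": ["Team A", "Team B"],
    "QualifyingPosition": [1, 2],
})
points_df = pd.DataFrame({           # e.g. points from the official F1 website
    "year": [2024, 2024],
    "Round": [1, 1],
    "team": ["Team A", "Team B"],
    "team_points": [0, 0],
})
track_df = pd.DataFrame({            # e.g. track incident statistics from Kaggle
    "year": [2024],
    "Round": [1],
    "DNF_pct": [15.0],
})

# Left joins keep every session row; `validate` raises if the keys are not
# unique in the way each join expects, which catches accidental row duplication.
merged = (
    session_df
    .merge(points_df, on=["year", "Round", "team"], how="left", validate="one_to_one")
    .merge(track_df, on=["year", "Round"], how="left", validate="many_to_one")
)
print(merged)
```

Track-level statistics are repeated for every entry in the same race, which is why the second join is many-to-one.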

## Import Libraries

In [1]:
# Standard library
import time
from math import ceil

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots

# scikit-learn: preprocessing, models, and model selection
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import LinearSVC

## Load Data

In [None]:
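# Load the raw data and work on a copy so the original frame stays untouched
data_raw = pd.read_csv("F1_main_data_v9.csv")
data = data_raw.copy()

# Drop the timestamp, identifier columns, race-session metrics (*_Race),
# and the points columns
data = data.drop(columns=[
    "Timestamp", "driver_code", "GrandPrix",
    "Consistency_Race", "Style_Race", "Technical_Race", "Pace_Race",
    "PerformanceIndex_Race", "driver_points", "team_points",
])
target = "RaceFinishPosition"
data.head()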

| | Consistency_Qual | Style_Qual | Technical_Qual | Pace_Qual | PerformanceIndex_Qual | Round | year | QualifyingPosition | RaceFinishPosition | team | ... | Finish_pct | Accident_pct | Collision_pct | Damage Related_pct | DNF_pct | Race_Complexity_Score | Safety_Index | mechanical_faults | avg_stops_per_car_race | avg_pitstop_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.254063 | 0.202593 | 0.403174 | 0.8 | 0.414957 | 12 | 2025 | 5 | 4.0 | Ferrari | ... | 84.415584 | 3.896104 | 11.688312 | 0.0 | 15.584416 | 0.491688 | 0.352468 | 0.0 | 1.89162 | 160011.280215 |
| 1 | 0.0 | 0.199975 | 0.305004 | 0.75 | 0.313745 | 12 | 2025 | 6 | 14.0 | Ferrari | ... | 84.415584 | 3.896104 | 11.688312 | 0.0 | 15.584416 | 0.491688 | 0.352468 | 0.0 | 1.89162 | 160011.280215 |
| 2 | 0.16895 | 0.5 | 0.41743 | 0.9 | 0.496595 | 12 | 2025 | 3 | 1.0 | McLaren | ... | 84.415584 | 3.896104 | 11.688312 | 0.0 | 15.584416 | 0.491688 | 0.352468 | 0.0 | 1.89162 | 160011.280215 |
| 3 | 0.340871 | 0.202912 | 0.190251 | 0.95 | 0.421008 | 12 | 2025 | 2 | 2.0 | McLaren | ... | 84.415584 | 3.896104 | 11.688312 | 0.0 | 15.584416 | 0.491688 | 0.352468 | 0.0 | 1.89162 | 160011.280215 |
| 4 | 0.463925 | 0.79203 | 0.231651 | 0.85 | 0.584402 | 12 | 2025 | 4 | 10.0 | Mercedes | ... | 84.415584 | 3.896104 | 11.688312 | 0.0 | 15.584416 | 0.491688 | 0.352468 | 0.0 | 1.89162 | 160011.280215 |


## Exploratory Data Analysis (EDA)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 726 entries, 0 to 725
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Consistency_Qual        725 non-null    float64
 1   Style_Qual              725 non-null    float64
 2   Technical_Qual          726 non-null    float64
 3   Pace_Qual               726 non-null    float64
 4   PerformanceIndex_Qual   724 non-null    float64
 5   Round                   726 non-null    int64  
 6   year                    726 non-null    int64  
 7   QualifyingPosition      726 non-null    int64  
 8   RaceFinishPosition      706 non-null    float64
 9   team                    726 non-null    object 
 10  Laps                    726 non-null    int64  
 11  Corners                 726 non-null    int64  
 12  Circuit length (km)     726 non-null    float64
 13  Race distance (km)      726 non-null    float64
 14  Direction               726 non-null    ob
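
The `data.info()` output above shows that a few columns have fewer non-null values than the 726 rows (for example, `RaceFinishPosition` has 706). A per-column summary makes such gaps easier to read. This is a minimal sketch: `missing_summary` is a hypothetical helper, not part of the notebook, and the toy frame only borrows column names from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Consistency_Qual": [0.25, np.nan, 0.17, 0.34],
    "Pace_Qual": [0.8, 0.75, 0.9, 0.95],
    "RaceFinishPosition": [4.0, np.nan, np.nan, 2.0],
})
print(missing_summary(toy))
```

In the notebook, the same helper could be called as `missing_summary(data)` before deciding whether to drop or impute the affected rows.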

In [4]:
data.columns

Index(['Consistency_Qual', 'Style_Qual', 'Technical_Qual', 'Pace_Qual',
       'PerformanceIndex_Qual', 'Round', 'year', 'QualifyingPosition',
       'RaceFinishPosition', 'team', 'Laps', 'Corners', 'Circuit length (km)',
       'Race distance (km)', 'Direction', 'Accident', 'Collision',
       'Damage Related', 'Finish', 'Total_Entries', 'Finish_pct',
       'Accident_pct', 'Collision_pct', 'Damage Related_pct', 'DNF_pct',
       'Race_Complexity_Score', 'Safety_Index', 'mechanical_faults',
       'avg_stops_per_car_race', 'avg_pitstop_ms'],
      dtype='object')
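
The column list mixes categorical fields such as `team` with numeric ones, and the notebook imports `OneHotEncoder`, `MinMaxScaler`, and `ColumnTransformer`. How the group combined them is not shown in this section, so the following is only a sketch of one common arrangement. It assumes `Direction` is categorical like `team`; the subset of numeric columns, the `Direction` labels, and all toy values are placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Assumed split of columns by type (a subset, for illustration)
categorical_cols = ["team", "Direction"]
numeric_cols = ["QualifyingPosition", "Pace_Qual", "Corners"]

preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode categories; unseen categories at predict time become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Rescale each numeric column to the [0, 1] range
        ("num", MinMaxScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Toy rows for illustration only; the Direction labels are placeholders
toy = pd.DataFrame({
    "team": ["Ferrari", "McLaren", "Mercedes"],
    "Direction": ["Clockwise", "Clockwise", "Anti-clockwise"],
    "QualifyingPosition": [5, 3, 4],
    "Pace_Qual": [0.8, 0.9, 0.85],
    "Corners": [14, 19, 10],
})

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Scaling matters mainly for distance-based and linear models such as `KNeighborsClassifier` and `LinearSVC`; tree ensembles are largely insensitive to it.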