This project uses historical data to predict Formula 1 race results: first, which drivers are likely to finish in the top three, and second, whether the exact finishing position of each driver can be predicted.

Data source: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

Context
Formula 1 (a.k.a. F1 or Formula One) is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. The word "formula" in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on purpose-built circuits and on public roads.

Content
The dataset covers Formula 1 races, drivers, constructors, qualifying, circuits, lap times, pit stops, and championships from 1950 through the 2024 season.

In [43]:

import pandas as pd
import os

path_cartella = './Dataset/'
data_costruttori = pd.read_csv(path_cartella + 'constructors.csv')
data_piloti = pd.read_csv(path_cartella + 'drivers.csv')
data_gare = pd.read_csv(path_cartella + 'races.csv')
data_risultati = pd.read_csv(path_cartella + 'results.csv')
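# Show all columns when displaying dataframes (set_option changes the setting; get_option only reads it)
pd.set_option("display.max_columns", None)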

In [None]:
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

We organize the race data by year and round and keep only seasons from 1982 onwards, as F1 cars before 1982 differ significantly from today's models.

In [None]:
pd.set_option("display.max_columns", None)
race_df = data_gare[["raceId", "year", "round", "circuitId"]].copy()
race_df = race_df.sort_values(by=['year', 'round'])
race_df = race_df[race_df["year"] >= 1982]

print(race_df)

We select the following columns from the original results dataframe: raceId, driverId, constructorId, grid (starting position), positionOrder (finishing position), and points.

In [None]:
results = data_risultati[["raceId", "driverId", "constructorId", "grid", "positionOrder", "points"]].copy()
# Check that each raceId appears only once in race_df before merging
duplicati = race_df.duplicated(subset="raceId")
num_duplicati = duplicati.sum()
print(f"Number of duplicate raceId rows: {num_duplicati}")

We merge the two dataframes on raceId to attach the year, round, and circuit to each result. Since race_df contains no duplicate race IDs, the merge cannot duplicate any result rows.

In [None]:
df = pd.merge(race_df, results, on='raceId')
df.describe().T
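As a hedged illustration of the merge above, the sketch below uses small made-up frames (toy_races and toy_results are hypothetical stand-ins, not the real data) to show two things: pandas can enforce the "no duplicate race IDs" assumption through the validate argument, and the default inner join silently drops results whose raceId was filtered out of the race table.

```python
import pandas as pd

# Toy stand-ins for race_df and results (values are made up for illustration).
toy_races = pd.DataFrame({
    "raceId": [1, 2],
    "year": [1982, 1982],
    "round": [1, 2],
    "circuitId": [10, 11],
})
toy_results = pd.DataFrame({
    "raceId": [1, 1, 2, 3],
    "driverId": [100, 101, 100, 100],
    "positionOrder": [1, 2, 1, 5],
})

# validate="one_to_many" raises pandas.errors.MergeError if raceId repeats in the left frame.
toy_merged = pd.merge(toy_races, toy_results, on="raceId", validate="one_to_many")

# The default inner join keeps only results whose raceId exists in toy_races,
# so the row with raceId 3 is dropped.
print(toy_merged)
```

In the notebook this is the behaviour we rely on: results from races before 1982 have no match in race_df and therefore do not appear in df.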

"Top 3 Finish": Introducing the Target Variable
We create a new feature called "Top 3 Finish", which is 1 if a driver finished within the top three positions of a race and 0 otherwise. This feature will serve as the target variable for our model.

In [None]:
df['Top 3 Finish'] = df['positionOrder'].le(3).astype(int)

print(df['Top 3 Finish'].value_counts())

Driver:
Percentage of top 3 finishes in the previous season

Percentage of top 3 finishes in the current season up to the race before the current one

Constructor (Team):
Percentage of top 3 finishes in the previous season

Percentage of top 3 finishes in the current season up to the race before the current one

To help predict future race results, we aim to capture past performance trends. One effective method is to calculate the percentage of top 3 finishes for each driver and constructor, both for the previous season and for the current season up to the previous race.

It's crucial to only include data from before the current race to avoid data leakage.

Also, the round number of the race is important, as it indicates how much data is available for the current season.

When calculating last year’s percentage, we add 1 to the 'year' value of the yearly statistics, so that the statistics of season Y are joined to the races of season Y + 1.

In [None]:
# Per driver and season: number of races entered and number of top 3 finishes
driver_yearly_stats = df.groupby(['year', 'driverId']).agg(
    total_race =('raceId', 'nunique'),
    top3_finishes =('Top 3 Finish', 'sum')
).reset_index()
print(driver_yearly_stats)
driver_yearly_stats['Driver Top 3 Finish Percentage (This Year)'] = (driver_yearly_stats['top3_finishes'] / driver_yearly_stats['total_race']) * 100
# Add 1 to 'year' so that each season's statistics are joined to the following season's races
driver_last_year_stats = driver_yearly_stats.copy()
driver_last_year_stats['year'] += 1
driver_last_year_stats = driver_last_year_stats.rename(columns={'Driver Top 3 Finish Percentage (This Year)': 'Driver Top 3 Finish Percentage (Last Year)'})

# Left join: drivers with no results in the previous season get NaN
df = pd.merge(df, driver_last_year_stats[['year', 'driverId', 'Driver Top 3 Finish Percentage (Last Year)']], on=['year', 'driverId'], how='left')
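The second driver feature listed earlier, the percentage of top 3 finishes in the current season up to the race before the current one, could be computed along the following lines. This is a minimal sketch on made-up data, not necessarily the approach used in the rest of the notebook: the toy frame, the column name "Driver Top 3 Finish Percentage (This Year, Before Race)", and the assumption of one row per driver per race are all illustrative. The key point is that each row only uses rounds that came before it, which avoids data leakage.

```python
import pandas as pd

# Toy data: two drivers, three rounds of one season (values are made up).
toy = pd.DataFrame({
    "year": [2020] * 6,
    "round": [1, 2, 3, 1, 2, 3],
    "driverId": [1, 1, 1, 2, 2, 2],
    "Top 3 Finish": [1, 0, 1, 0, 0, 1],
})

# Order rows chronologically within each driver's season.
toy = toy.sort_values(["year", "driverId", "round"]).reset_index(drop=True)
grouped = toy.groupby(["year", "driverId"])["Top 3 Finish"]

# Top 3 finishes and races strictly before the current row
# (assumes one row per driver per race).
prior_top3 = grouped.cumsum() - toy["Top 3 Finish"]
prior_races = grouped.cumcount()

# Hypothetical column name. The first round of a season has no prior races,
# so the denominator is masked to NaN and the feature is NaN there.
col = "Driver Top 3 Finish Percentage (This Year, Before Race)"
toy[col] = prior_top3 / prior_races.where(prior_races > 0) * 100

print(toy)
```

The NaN in the first round of each season is deliberate: with no earlier races there is nothing to average, and the round number tells the model how much current-season data stands behind the value.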

Constructor (Team): Percentage of top 3 finishes in the previous season
Here, a constructor's top 3 percentage is defined as the average of its drivers' top 3 finish percentages.

In [None]:
constructor_last_year_stats = df.groupby(['year', 'constructorId', 'round']).agg(
    Sum_Top_3_Finishes_Last_Year=('Driver Top 3 Finish Percentage (Last Year)', 'sum')
).reset_index()

print("Constructor annual stats")
print(constructor_last_year_stats)

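# Calculating the percentage of top 3 finishes for each constructor last year.
# Dividing by 2 assumes two drivers per constructor in each race; sum() also counts missing driver values as 0.
constructor_last_year_stats['Constructor Top 3 Finish Percentage (Last Year)'] = constructor_last_year_stats["Sum_Top_3_Finishes_Last_Year"]/2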

df = pd.merge(df, constructor_last_year_stats[['year', 'constructorId', 'round', 'Constructor Top 3 Finish Percentage (Last Year)']], on=['year', 'constructorId', 'round'], how='left')
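The constructor feature above divides the summed driver percentages by 2. As a hedged alternative, the sketch below takes the group mean instead, which does not assume a fixed number of drivers and skips missing values rather than counting them as 0. The toy data and the column name "Constructor Top 3 Finish Percentage (Last Year, mean)" are hypothetical, and whether a driver without a previous season should count as 0 or be ignored is a modeling choice, not something the data decides.

```python
import numpy as np
import pandas as pd

# Toy data: two constructors in one race; driver 21 has no previous season (NaN).
toy = pd.DataFrame({
    "year": [1990] * 4,
    "round": [1] * 4,
    "constructorId": [1, 1, 2, 2],
    "driverId": [10, 11, 20, 21],
    "Driver Top 3 Finish Percentage (Last Year)": [50.0, 30.0, 40.0, np.nan],
})

keys = ["year", "constructorId", "round"]

# transform("mean") returns one value per original row, so no merge is needed.
# NaN values are skipped: constructor 2 averages over its one known driver.
toy["Constructor Top 3 Finish Percentage (Last Year, mean)"] = (
    toy.groupby(keys)["Driver Top 3 Finish Percentage (Last Year)"].transform("mean")
)

print(toy)
```

With the sum-divided-by-2 approach, constructor 2 in this toy example would get 20.0 instead of 40.0, because the missing driver is effectively treated as having a 0% record.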
