The goal of this project is to use historical data to predict the finishing positions of drivers in Formula 1 races. Specifically, we want to identify which drivers are likely to finish in the top three, and to find out whether the exact finishing position of each driver can be predicted.

Data source: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

Context
Formula 1 (a.k.a. F1 or Formula One) is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. The word "formula" in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on purpose-built circuits and on public roads.

Content
The dataset contains information on Formula 1 races, drivers, constructors, qualifying, circuits, lap times, pit stops, and championships from 1950 through the latest 2024 season.

In [43]:

import pandas as pd
import os

path_cartella = './Dataset/'
data_costruttori = pd.read_csv(path_cartella + 'constructors.csv')
data_piloti = pd.read_csv(path_cartella + 'drivers.csv')
data_gare = pd.read_csv(path_cartella + 'races.csv')
data_risultati = pd.read_csv(path_cartella + 'results.csv')
pd.set_option("display.max_columns", None)  # show every column when displaying a DataFrame

In [None]:
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

We sort the races by year and round and keep only seasons from 1982 onwards, since F1 cars from earlier seasons differ significantly from today's models.

In [None]:
pd.set_option("display.max_columns", None)  # show every column when displaying a DataFrame
race_df = data_gare[["raceId", "year", "round", "circuitId"]].copy()
race_df = race_df.sort_values(by=['year', 'round'])
race_df = race_df[race_df["year"] >= 1982]

print(race_df)

From the original results dataframe we keep the following columns: raceId, driverId, constructorId, grid (starting position), positionOrder (finishing position), and points.

In [None]:
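# Keep only the result columns needed for the analysis
results = data_risultati[["raceId", "driverId", "constructorId", "grid", "positionOrder", "points"]].copy()
#print(results)
# Flag races whose raceId appears more than once in race_df
duplicati = race_df.duplicated(subset="raceId")
num_duplicati = duplicati.sum()
#print(f"Number of duplicate race IDs: {num_duplicati}")
#print(race_df)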

We merge the two datasets on raceId to attach the year, round, and circuit to each result. Since race_df contains no duplicate race IDs, each result matches at most one race and the merge cannot multiply rows.
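As a complementary safeguard, pandas can enforce the uniqueness assumption during the merge itself. The sketch below is only an illustration: it uses small made-up frames (`toy_races`, `toy_results`, and their values are assumptions, not taken from the dataset) and the `validate` argument of `pd.merge`, which raises `MergeError` when the left-hand keys are not unique.

```python
import pandas as pd

# Toy stand-ins for race_df and results; the values are illustrative only.
toy_races = pd.DataFrame({"raceId": [1, 2], "year": [1982, 1982], "round": [1, 2]})
toy_results = pd.DataFrame(
    {"raceId": [1, 1, 2], "driverId": [10, 11, 10], "positionOrder": [1, 2, 1]}
)

# validate="one_to_many" asserts that raceId is unique in the left frame.
merged = pd.merge(toy_races, toy_results, on="raceId", validate="one_to_many")

# With a duplicated race row, the same merge fails loudly
# instead of silently duplicating result rows.
dup_races = pd.concat([toy_races, toy_races.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dup_races, toy_results, on="raceId", validate="one_to_many")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
```

Passing `validate` keeps the check next to the merge, so it still runs if the filtering steps above are changed later.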

In [52]:
df = pd.merge(race_df, results, on='raceId')
df.describe().T

                 count         mean         std     min     25%     50%     75%     max
raceId         17321.0    477.44651  349.475072     1.0   206.0   364.0   881.0  1110.0
year           17321.0  2001.829687   12.253518  1982.0  1991.0  2001.0  2013.0  2023.0
round          17321.0     9.240517    5.183158     1.0     5.0     9.0    13.0    22.0
circuitId      17321.0    19.726863    17.79454     1.0     8.0    14.0    25.0    79.0
driverId       17321.0   225.175914  307.095663     1.0    23.0    94.0   173.0   858.0
constructorId  17321.0    35.653426   55.326192     1.0     6.0    17.0    33.0   214.0
grid           17321.0    11.281508    7.041134     0.0     5.0    11.0    17.0    29.0
positionOrder  17321.0    12.716009    7.606928     1.0     6.0    12.0    18.0    39.0
points         17321.0     2.363576    4.873451     0.0     0.0     0.0     2.0    50.0


"Top 3 Finish": Introducing the Target Variable
We add a binary column called "Top 3 Finish", which is 1 if a driver finished within the top three positions of a race and 0 otherwise. This column will serve as the target variable for our model.
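Because only three drivers per race can finish in the top three, the positive class is likely to be a minority. The definition and a quick class-balance check can be sketched on toy data as follows; the `toy` frame and its ten-car race are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy finishing order for a single ten-car race; illustrative values only.
toy = pd.DataFrame({"positionOrder": list(range(1, 11))})

# Same rule as in the notebook: 1 if the driver finished third or better, else 0.
toy["Top 3 Finish"] = toy["positionOrder"].le(3).astype(int)

# Share of each class; a strong imbalance suggests using stratified splits
# and metrics other than plain accuracy.
class_share = toy["Top 3 Finish"].value_counts(normalize=True)
print(class_share)
```

On the real data the same `value_counts(normalize=True)` call would show how rare top-three finishes are relative to all results.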

In [None]:
df['Top 3 Finish'] = df['positionOrder'].le(3).astype(int)

print(df['Top 3 Finish'].value_counts())

       raceId  year  round  circuitId  driverId  constructorId  grid  \
0         467  1982      1         30       117              4     5   
1         467  1982      1         30       199              3     8   
2         467  1982      1         30       163              4     1   
3         467  1982      1         30       182              1    13   
4         467  1982      1         30       177              3     7   
...       ...   ...    ...        ...       ...            ...   ...   
17316    1110  2023     12         13       817            213    19   
17317    1110  2023     12         13       858              3    18   
17318    1110  2023     12         13       807            210     0   
17319    1110  2023     12         13       832              6     4   
17320    1110  2023     12         13       857              1     5   

       positionOrder  points  Top 3 Finish  
0                  1     9.0             1  
1                  2     6.0             1  
