# Formula 1 Race Predictor
### CP468 Final Project
### By: Robert Mazza and Ronny Yehia

<img src="./Images/F1-logo.png" alt="F1 Logo" width="50%"/>

Data set used: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

See the README for more contextual information around this project.

# Data Collection and Cleaning

### Importing the data

In [None]:
import pandas as pd
import numpy as np

# Importing the datasets
results_df = pd.read_csv('./data/results.csv')
status_df = pd.read_csv('./data/status.csv')
drivers_df = pd.read_csv('./data/drivers.csv')
races_df = pd.read_csv('./data/races.csv')
constructor_df = pd.read_csv('./data/constructors.csv')
driver_standings_df = pd.read_csv('./data/driver_standings.csv')
quali_df = pd.read_csv('./data/qualifying.csv')
constructor_standings_df = pd.read_csv('./data/constructor_standings.csv')
# show every column when displaying a dataframe
pd.set_option("display.max_columns", None)


We import the packages and data sets needed for the analysis. The data sets cover races, drivers, constructors, standings, results, qualifying sessions, and finishing statuses.
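
The Kaggle export appears to mark missing values with the literal string `\N`, which pandas reads as ordinary text unless told otherwise. A minimal sketch, assuming that marker and using a small in-memory CSV in place of the real files (the `load_csv` helper is hypothetical and not part of this notebook), shows how `na_values` makes those gaps visible to `isna()`:

```python
import io

import pandas as pd


def load_csv(source, missing_marker="\\N"):
    """Read a CSV, treating the given marker as a missing value (hypothetical helper)."""
    return pd.read_csv(source, na_values=[missing_marker])


# two made-up result rows; the second has no position or fastest lap speed
sample = io.StringIO("resultId,position,fastestLapSpeed\n1,1,218.3\n2,\\N,\\N\n")
sample_df = load_csv(sample)

# count of missing values per column
print(sample_df.isna().sum())
```

Without `na_values`, the same columns would be read as text and `isna().sum()` would report zero missing values for them.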

## Qualifying
The rules of qualifying have changed over the years, both in how many sessions there are and in how long they last. The one thing that has never changed is that the fastest lap each driver sets determines where they start the race. For this reason we focus only on the starting position of each driver, since it is the result of their qualifying performance.

<img src="./Images/starting-grid.jpg" width="50%"/>


In [None]:

# drop driver number column as it is not needed since we have the driver id
quali_df.drop(['number'], axis=1, inplace=True)

# drop Q1, Q2, Q3 columns as we only care about starting position
quali_df.drop(['q1','q2','q3'], axis=1, inplace=True)

# check for null values
print(quali_df.isna().sum())
# drop all rows that have NULL values
quali_df.dropna(inplace = True)

print((quali_df['qualifyId'] >= 0).all())
print((quali_df['raceId'] >= 0).all())
print((quali_df['driverId'] >= 0).all())
print((quali_df['constructorId'] >= 0).all())
print((quali_df['position'] >= 0).all())

print()


Before we can predict where drivers will finish, we need to know where they start. Starting position is decided by the lap times set in qualifying, before the race itself begins, and it affects each driver's chances in the race.
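
The repeated `>= 0` checks above can also be run in one call. A minimal sketch on made-up data, where `all_non_negative` is a hypothetical helper and not something the notebook defines:

```python
import pandas as pd


def all_non_negative(frame, columns):
    """Return a Series mapping each column name to True if every value is >= 0."""
    return (frame[columns] >= 0).all()


toy_quali = pd.DataFrame({
    "qualifyId": [1, 2, 3],
    "raceId": [18, 18, 18],
    "position": [1, 2, -1],  # deliberately invalid value
})

checks = all_non_negative(toy_quali, ["qualifyId", "raceId", "position"])
print(checks)
```

The result has one True/False entry per column, so a failing column can be spotted by name rather than by its position in the printed output.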

In [None]:
quali_df.head()

## Races


In [None]:
# drop the time column as it is not needed
races_df.drop(['time'], axis=1, inplace=True)

# drop the wikipedia URLs column
races_df.drop(['url'], axis=1, inplace=True)

# check for null values
print(races_df.isna().sum())

races_df.info()


The sums above show that no null values remain after dropping the columns that are not needed to predict race placement.

In [None]:
races_df.head()

## Results


In [None]:
results_df.head()

In [None]:
# check for null values
results_df.isna().sum()

In [None]:
# dropping unneeded columns
results_df.drop(['time', 'number'], axis=1, inplace=True)

In [None]:
results_df.info()

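We can see that both "position" and "fastestLapSpeed" are stored as objects rather than numbers. `pd.to_numeric` returns a new Series instead of modifying the column, so its result has to be assigned back for the conversion to persist.

In [None]:
# convert fastestLapSpeed; entries that are not numbers become NaN
results_df['fastestLapSpeed'] = pd.to_numeric(results_df['fastestLapSpeed'], errors='coerce')
# preview the numeric form of position; the column itself is left unchanged
# because the numeric positionOrder column is what the model uses
pd.to_numeric(results_df['position'], errors='coerce')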
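
As a small illustration of how `pd.to_numeric` behaves, the sketch below uses a made-up column (not the real results data) and compares calling it without and with an assignment:

```python
import pandas as pd

toy = pd.DataFrame({"fastestLapSpeed": ["218.3", "\\N", "215.0"]})

# the converted Series is returned but not stored, so the column is still text
pd.to_numeric(toy["fastestLapSpeed"], errors="coerce")
dtype_before = toy["fastestLapSpeed"].dtype

# assigning the result back makes the conversion stick
toy["fastestLapSpeed"] = pd.to_numeric(toy["fastestLapSpeed"], errors="coerce")
dtype_after = toy["fastestLapSpeed"].dtype

print(dtype_before, dtype_after)
```

With `errors="coerce"`, any entry that cannot be parsed as a number becomes NaN instead of raising an error.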

## Driver Standings

In [None]:
# check for null values
print(driver_standings_df.isna().sum())

driver_standings_df.info()

print("Checking if all values are equal or greater than zero")
# check that each numeric column only has values greater than or equal to zero
# True = good, False = bad
print((driver_standings_df['driverStandingsId'] >= 0).all())
print((driver_standings_df['raceId'] >= 0).all())
print((driver_standings_df['driverId'] >= 0).all())
print((driver_standings_df['points'] >= 0).all())
print((driver_standings_df['position'] >= 0).all())
print((driver_standings_df['wins'] >= 0).all())

# dropping unneeded columns
driver_standings_df.drop(['driverStandingsId', 'positionText' ], axis=1, inplace=True)
# renaming the columns for easier understanding
driver_standings_df.rename(columns = {'wins':'driver_standings_race_wins', 'points':'driver_standings_points', 'position': 'driver_standings_position'}, inplace = True)



Driver standings are summarised by several values: points, championship position, and wins. The checks above confirm that each of these, along with the ID columns, is greater than or equal to zero. These values describe a driver's form going into a race, which is relevant when predicting where they are likely to place.

In [None]:
driver_standings_df.head()

## Constructors Standings

Formula 1 did not have a Constructors' Championship until 1958; before that, the only championship was the Drivers' Championship. This dataset therefore has no values for races before 1958.

In [None]:
# check for null values
print(constructor_standings_df.isna().sum())



print("Checking if all values are equal or greater than zero")
# check that each numeric column only has values greater than or equal to zero
# True = good, False = bad
print((constructor_standings_df['raceId'] >= 0).all())
print((constructor_standings_df['constructorId'] >= 0).all())
print((constructor_standings_df['points'] >= 0).all())
print((constructor_standings_df['position'] >= 0).all())
print((constructor_standings_df['wins'] >= 0).all())

# dropping useless columns
constructor_standings_df.drop(['constructorStandingsId', 'positionText'], axis=1, inplace=True)
# rename columns to avoid confusion with other dataframes
constructor_standings_df.rename(columns = {'wins':'constructor_race_wins', 'points':'constructor_points', 'position': 'constructor_position'}, inplace = True)

constructor_standings_df.info()

In [None]:
constructor_standings_df.head()

## Constructors

There have been many different constructors over the years. Not all of them still race in current seasons, but a few, such as Ferrari, Williams, and McLaren, have never left F1.

In [None]:
# check for null values
print(constructor_df.isna().sum())

# drop URL column
constructor_df.drop(['url'], axis=1, inplace=True)
# drop constructor nationality column as it is not needed
constructor_df.drop(['nationality'], axis=1, inplace=True)

constructor_df.info()

print("Checking if all values are equal or greater than zero")
# check that each ID column is greater than or equal to zero and each text column is not ''
# True = good, False = bad
print((constructor_df['name'] != '').all())
print((constructor_df['constructorRef'] != '').all())
print((constructor_df['constructorId'] >= 0).all())


Most of the code above tests that the data is usable: it checks for NULL values, negative IDs, and empty strings. Rows with missing or invalid values cannot be used to predict placement.

In [None]:
constructor_df.head()

## Drivers

In [None]:
# check for null values
print(drivers_df.isna().sum())
# drop URL column
drivers_df.drop(['url'], axis=1, inplace=True)
# drop driver nationality column as it is not needed
drivers_df.drop(['nationality'], axis=1, inplace=True)
# drop driver number column, we will refer to each driver by their last name
drivers_df.drop(['number'], axis=1, inplace=True )

drivers_df.info()

print("Checking if all values are equal or greater than zero")
# check that each ID column is greater than or equal to zero and each text column is not ''
# True = good, False = bad
print((drivers_df['driverRef'] != '').all())
print((drivers_df['code'] != '').all())
print((drivers_df['forename'] != '').all())
print((drivers_df['surname'] != '').all())
print((drivers_df['driverId'] >= 0).all())


In [None]:
drivers_df.head()

## Status
Describes the status of the car with respect to finishing the race. If the car did not finish, its "Status" gives more detail on why.

In [None]:
# check for null values
print(status_df.isna().sum())

status_df.info()

print((status_df['statusId'] >= 0).all())
print((status_df['status'] != '').all())


In [None]:
status_df.head()

# Data Preparation

Merging all the dataframes into a single comprehensive one.

In [None]:
# merging all separate dataframes into a single dataframe, df

df0 = pd.merge(results_df, constructor_standings_df, on = ['raceId', 'constructorId'])
df0.head()
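
`pd.merge` defaults to an inner join, so result rows without a matching constructor standing (for example, races held before the Constructors' Championship existed) are dropped without warning. A minimal sketch with made-up rows, not the real data, shows how a left join with `indicator=True` can count how many rows would be lost:

```python
import pandas as pd

toy_results = pd.DataFrame({
    "raceId": [1, 1, 2],
    "constructorId": [6, 9, 6],
    "grid": [1, 2, 1],
})
toy_standings = pd.DataFrame({
    "raceId": [1, 1],
    "constructorId": [6, 9],
    "constructor_points": [25.0, 18.0],
})

keys = ["raceId", "constructorId"]

# default inner join: only rows with a match in both frames survive
inner = pd.merge(toy_results, toy_standings, on=keys)

# left join with an indicator column to see which result rows had no match
audit = pd.merge(toy_results, toy_standings, on=keys, how="left", indicator=True)
unmatched = (audit["_merge"] == "left_only").sum()

print(len(toy_results), len(inner), unmatched)
```

Comparing row counts before and after each merge in this way is a cheap check that the combined dataframe still covers the seasons the model is meant to use.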

In [None]:

df1 = pd.merge(df0, races_df, on ='raceId')
df1.head()


In [None]:

df2 = pd.merge(df1, drivers_df, on = 'driverId')

df2.head()

In [None]:

df3 = pd.merge(df2, constructor_df, on ='constructorId')
df3.head()

In [None]:

df = pd.merge(df3, status_df, on ='statusId')
pd.set_option("display.max_columns", None)

df.drop(['statusId', 'rank','fastestLapTime', 'constructorId', 'constructorRef', 'driverRef'], axis=1, inplace=True)

df.head()

In [None]:
df.columns

In [None]:
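# rename() silently ignores keys that do not match an existing column,
# so only the entries present after the merges take effect
col_name = {'milliseconds':'timetaken_in_millisec','fastestLapSpeed':'max_speed',
 'name_x':'grand_prix','number_y':'driver_num','code':'driver_code','name_y':'constructor_name',
 'raceId_x':'racerId','points_x':'points'}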

df.rename(columns=col_name, inplace=True)

df.head()

In [None]:
# Combining the forename and surname of the driver into a single column
df['driver_name'] = df['forename'] + ' ' + df['surname']
df  = df.drop(['forename', 'surname'], axis=1)

In [None]:
df.info()

The date and dob columns are stored as objects, so they are converted to datetime below.

In [None]:
pd.to_datetime(df.date)

In [None]:
df['dob'] = pd.to_datetime(df['dob'])
df['date'] = pd.to_datetime(df['date'])

In [None]:
from datetime import datetime

# getting each driver's age (as of today) and storing it in a new column 'age'

dates = datetime.today()-df['dob']
age = dates.dt.days/365
# round to the nearest whole year
df['age'] = round(age) 
pd.set_option('display.max_columns', None)

df.drop(['dob'], axis=1, inplace=True)

df.head()
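
The cell above measures age against today's date, so a driver's age is the same for every race they entered. If age on race day is wanted instead, a minimal sketch (two made-up rows and a hypothetical `age_at_race` column, neither taken from the notebook) subtracts `dob` from the race `date`:

```python
import pandas as pd

toy = pd.DataFrame({
    "dob": pd.to_datetime(["1985-01-07", "1997-09-30"]),
    "date": pd.to_datetime(["2008-11-02", "2021-12-12"]),
})

# age in years on race day; 365.25 days per year allows for leap years
toy["age_at_race"] = (toy["date"] - toy["dob"]).dt.days / 365.25

print(toy)
```

In the notebook this would have to run before the `dob` column is dropped.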

In [None]:
# changing datatype

l = ['timetaken_in_millisec','fastestLap','max_speed']
for i in l:
    df[i] = pd.to_numeric(df[i],errors='coerce')

In [None]:
# filling missing values
df[['fastestLap']] = df[['fastestLap']].fillna(0)
df['timetaken_in_millisec'] = df['timetaken_in_millisec'].fillna(df['timetaken_in_millisec'].mean())
df['max_speed']= df['max_speed'].fillna(df['max_speed'].mean())
df.isnull().sum() / len(df) * 100
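
To make the effect of mean imputation concrete, the sketch below fills a made-up speed column the same way. It is only an illustration on toy numbers, not a result from the real data:

```python
import numpy as np
import pandas as pd

speeds = pd.Series([210.0, np.nan, 220.0, np.nan])

# mean() skips NaN, so the fill value is the mean of the observed speeds
filled = speeds.fillna(speeds.mean())

print(filled.tolist())
print(speeds.std(), filled.std())
```

The filled column keeps the same mean but has a smaller spread, because every imputed row sits exactly at the average.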

In [None]:

cat = []
num = []
for i in df.columns:
    if df[i].dtypes == 'O':
        cat.append(i)
    else:
        num.append(i)

In [None]:
df[cat].head()

In [None]:
df[num].head()

# Exploratory Analysis

## Qualifying Advantage?
Everyone knows that qualifying is a big deal, but how important is a high starting position? More specifically, how important is qualifying first? The answer depends on the circuit: some circuits are narrow and make overtaking hard, such as Monaco, where qualifying position is almost more important than the race itself.

In [None]:
import matplotlib.pyplot as plt

# circuitId 6 is Monaco; keep only results with status 'Finished'
monaco = df[(df.circuitId == 6) & (df.status == 'Finished')]
x = monaco.grid
y = monaco.positionOrder

plt.scatter(x, y)

# least-squares line: b is the slope, a is the intercept
b, a = np.polyfit(x, y, deg=1)
plt.plot(x, a + b * x, color='red')
plt.xlabel('Starting grid position')
plt.ylabel('Finishing position')

print("Correlation:", x.corr(y))


We can see that there is a moderate positive correlation between a driver's starting position and finishing position at Monaco. The correlation is likely not stronger because of the other variables in racing, like crashes and pit strategies, that shuffle the order set in qualifying.
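
For reference, the correlation and the fitted line can be reproduced with NumPy alone. The sketch below uses six made-up grid and finishing positions, not the Monaco data:

```python
import numpy as np

grid = np.array([1, 2, 3, 4, 5, 6])
finish = np.array([2, 1, 3, 6, 4, 5])

# polyfit returns coefficients from the highest power down: slope, then intercept
slope, intercept = np.polyfit(grid, finish, deg=1)

# Pearson correlation coefficient between the two arrays
r = np.corrcoef(grid, finish)[0, 1]

print(slope, intercept, r)
```

A slope below 1 means that, on average, a one-place difference on the grid translates into less than a one-place difference at the finish.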

### How Important Is Starting From 1st?



In [None]:
x = df[(df.grid == 1) & (df.status == 'Finished')].positionOrder
plt.hist(x)

In [None]:
x.mean()

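From the figure above and a mean finishing position of 1.8 for drivers who started from 1st, we can conclude that qualifying position is vital for winning a race. Note that this figure only covers results with the status 'Finished', so retirements from 1st on the grid are not counted.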
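
The mean finishing position can be complemented by the share of races actually won from 1st on the grid. A minimal sketch on made-up rows, using the same column names as the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "grid": [1, 1, 1, 1, 2, 3],
    "positionOrder": [1, 1, 2, 4, 1, 1],
})

# finishing positions of drivers who started first
from_pole = toy.loc[toy["grid"] == 1, "positionOrder"]

win_rate = (from_pole == 1).mean()   # fraction of those starts that ended in a win
mean_finish = from_pole.mean()

print(win_rate, mean_finish)
```

The win rate is less sensitive than the mean to a single poor finish, so the two numbers together give a fuller picture.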

## Grand Prix Locations
Formula 1 has been running since the 1950s. In that time the sport has grown from a total of 7 Grands Prix in a championship to 19-20 Grands Prix located all around the world. Below is a map showing every circuit F1 has visited since the 1950s.

In [None]:
import folium

circuits_df = pd.read_csv('./data/circuits.csv')

# plotting all circuits F1 has raced at in the world
maps = folium.Map(zoom_start=2, tiles='OpenStreetMap')
for lat, lng, name in zip(circuits_df['lat'], circuits_df['lng'], circuits_df['name']):
    marker = folium.Circle(
        location=[lat, lng],
        radius=1000,
        popup="<strong>{0}</strong>".format(name))  # <strong> bolds the circuit name
    marker.add_to(maps)
maps

# Machine Learning Model

### Regression

Processing the data into an x and a y set. x will contain all the data used to predict y, which is the finishing position. We are only going to look at races after 1990, as there are major gaps in the data before then.

### Data Splitting
Splitting the data so that the training set is the F1 seasons after 1990 and before 2021 (exclusive). We will test the model on the 2021 season.
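
A season-based split like this can be sketched as a small function. `split_by_season` is a hypothetical helper, and unlike the notebook's `year > 1990` filter it takes the first training season as an inclusive bound:

```python
import pandas as pd


def split_by_season(frame, first_year, test_year):
    """Train on seasons in [first_year, test_year), test on test_year only."""
    train = frame[(frame["year"] >= first_year) & (frame["year"] < test_year)]
    test = frame[frame["year"] == test_year]
    return train, test


toy = pd.DataFrame({
    "year": [1989, 1991, 2020, 2021],
    "positionOrder": [3, 1, 2, 1],
})

train, test = split_by_season(toy, first_year=1991, test_year=2021)
print(len(train), len(test))
```

Splitting by season rather than at random keeps every test race later in time than every training race.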


In [None]:

# filter data for only Finished results after 1990 and before 2021

x_train = df[(df.status == 'Finished') & (df.year > 1990) & (df.year < 2021)]

# only keep numerical columns
x_train = x_train.select_dtypes(['number'])

x_train.drop(['resultId', 'positionOrder', 'points'], axis=1, inplace=True)

x_train.head()



Our training inputs are the race data from seasons after 1990 and before 2021.

In [None]:
x_test = df[(df.status == 'Finished') & (df.year == 2021)]
x_test = x_test.select_dtypes(['number'])

x_test.drop(['resultId', 'positionOrder', 'points'], axis=1, inplace=True)

x_test.head()

Our training targets are the finishing positions (`positionOrder`) from seasons after 1990 and before 2021.

In [None]:
y_train = df[(df.status == 'Finished') & (df.year > 1990) & (df.year < 2021)].positionOrder
y_train.head()

Filtering the main dataframe to create y_test dataframe containing all the finishing results from the 2021 season.

In [None]:
y_test = df[(df.status == 'Finished') & (df.year == 2021)].positionOrder

In [None]:

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor



Scaling our data for use in our neural network

In [None]:

scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns = x_train.columns)
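
What the scaler does can be written out by hand. The sketch below uses plain NumPy on made-up numbers to show the standardisation formula and why the test data reuses the training mean and standard deviation:

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, do not refit

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test values are expressed on that same scale.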

Creating a custom function to score the regression based on if the model was able to predict the race winner correctly or not. Purely focused on the race winner and not other positions.

In [None]:

def score_regression(model):
    score = 0
    # test the model on the test set, one round (race) of the 2021 season at a time
    rounds = df[df.year == 2021]['round'].unique()
    for round_num in rounds:
        test = df[(df['year'] == 2021) & (df['round'] == round_num)]
        x_test = test.select_dtypes(['number'])

        x_test.drop(['resultId', 'positionOrder', 'points'], axis=1, inplace=True)
        y_test = test.positionOrder

        # scaling the data with the scaler fitted on the training set
        x_test = pd.DataFrame(scaler.transform(x_test), columns = x_test.columns)

        # make predictions dataframe
        prediction_df = pd.DataFrame(model.predict(x_test), columns = ['results'])
        prediction_df['positionOrder'] = y_test.reset_index(drop = True)
        # actual = 1 for the real race winner, 0 otherwise
        prediction_df['actual'] = prediction_df.positionOrder.map(lambda x: 1 if x == 1 else 0)
        # the driver with the lowest predicted position is the predicted winner
        prediction_df.sort_values('results', ascending = True, inplace = True)
        prediction_df.reset_index(inplace = True, drop = True)
        prediction_df['predicted'] = prediction_df.index
        prediction_df['predicted'] = prediction_df.predicted.map(lambda x: 1 if x == 0 else 0)

        print("Round", round_num)

        # with a single predicted winner, precision is 1 for a correct pick and 0 otherwise
        score += precision_score(prediction_df.actual, prediction_df.predicted)

    # average over the number of rounds scored
    model_score = score / len(rounds)
    return model_score
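
Because exactly one driver per race is marked as the predicted winner, the precision for a race is 1 when that pick is correct and 0 otherwise, so the final score is the fraction of races whose winner was predicted correctly. A minimal sketch of the same idea on made-up predictions (`winner_hit_rate` and `predicted_position` are hypothetical names, not part of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "round": [1, 1, 1, 2, 2, 2],
    "predicted_position": [1.4, 2.2, 3.1, 2.5, 1.9, 3.3],
    "positionOrder": [1, 2, 3, 1, 2, 3],
})


def winner_hit_rate(frame):
    """Fraction of rounds where the lowest predicted position is the real winner."""
    hits = 0
    rounds = frame["round"].unique()
    for round_num in rounds:
        race = frame[frame["round"] == round_num]
        # row label of the driver with the lowest predicted finishing position
        predicted_winner = race["predicted_position"].idxmin()
        hits += int(race.loc[predicted_winner, "positionOrder"] == 1)
    return hits / len(rounds)


hit_rate = winner_hit_rate(toy)
print(hit_rate)
```

In this toy data the pick is right in round 1 and wrong in round 2, so the score is one half.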


### Model Building

Creating a dictionary that will allow us to compare results from all our different models.

In [None]:
comparison_dict ={'model':[],
                  'params': [],
                  'score': []}

Linear regression model, fitted both with and without an intercept (the "fit_intercept" parameter). We expect this model to be a weak baseline rather than an accurate predictor.

In [None]:
# Linear Regression

# booleans, not strings: the strings 'True' and 'False' are both truthy
params={'fit_intercept': [True, False]}

for fit_intercept in params['fit_intercept']:
    model_params = (fit_intercept)
    model = LinearRegression(fit_intercept = fit_intercept)
    model.fit(x_train, y_train)
            
    model_score = score_regression(model)
            
    comparison_dict['model'].append('linear_regression')
    comparison_dict['params'].append(model_params)
    comparison_dict['score'].append(model_score)

In [None]:
pd.DataFrame(comparison_dict).groupby('model')['score'].max()

In [None]:
# Neural network

params={'hidden_layer_sizes': [(80,20,40,5), (75,30,50,10,3)], 
        'activation': ['identity', 'relu','logistic', 'tanh',], 
        'solver': ['lbfgs','sgd', 'adam'], 
        'alpha': np.logspace(-4,1,20)} 

for hidden_layer_sizes in params['hidden_layer_sizes']:
    for activation in params['activation']:
        for solver in params['solver']:
            for alpha in params['alpha']:
                model_params = (hidden_layer_sizes, activation, solver, alpha )
                model = MLPRegressor(hidden_layer_sizes = hidden_layer_sizes,
                                      activation = activation, solver = solver,
                                       alpha = alpha, random_state = 1, max_iter = 1000)
                model.fit(x_train, y_train)

                model_score = score_regression(model)

                comparison_dict['model'].append('nn_regressor')
                comparison_dict['params'].append(model_params)
                comparison_dict['score'].append(model_score)

In [None]:
pd.DataFrame(comparison_dict).groupby('model')['score'].max()

In [None]:
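# params of the best-scoring run for each model
# (taking max() of params would pick the largest tuple, not the best score)
results = pd.DataFrame(comparison_dict)
results.loc[results.groupby('model')['score'].idxmax(), ['model', 'params', 'score']]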