# F1 Win Prediction Project
#### Alex Boardman - BrainStation

### Introduction to Feature Engineering

In the high-octane world of Formula 1 racing, where fractions of a second can be the difference between victory and defeat, the art and science of predictive modeling take on a thrilling edge. At the heart of this endeavor is Feature Engineering, a critical process where we carefully select and transform raw data into informative inputs (features) that our predictive models can understand and utilize. This process is not just helpful but essential in tailoring our data to reflect the nuances of the sport, allowing us to capture the complexities of race dynamics that influence a driver's likelihood of winning.

In our quest to predict F1 race winners, we've refined our features to reflect the multifaceted nature of the sport. By decomposing the 'Status' of race outcomes into categories like 'Mechanical Issues' and 'Driver Issues', we aim to give the model a clearer picture of team reliability and driver consistency. We've also incorporated 'Weather Conditions', recognizing their profound impact on strategy and performance. The weather data was compiled from both a dedicated Formula 1 dataset and detailed race reports.
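As a rough illustration of the 'Status' decomposition, the sketch below maps a free-text status column onto two binary flags named after the `Mechanical` and `Driver_Issue` columns in our data. The `status` column name and the label sets are assumptions made for this example; the labels in the real source data may differ.

```python
import pandas as pd

# Hypothetical status labels, chosen for illustration only.
MECHANICAL = {"Engine", "Gearbox", "Hydraulics", "Brakes"}
DRIVER = {"Collision", "Accident", "Spun off"}


def add_status_flags(df: pd.DataFrame, status_col: str = "status") -> pd.DataFrame:
    """Return a copy of df with binary Mechanical and Driver_Issue columns."""
    out = df.copy()
    out["Mechanical"] = out[status_col].isin(MECHANICAL).astype(int)
    out["Driver_Issue"] = out[status_col].isin(DRIVER).astype(int)
    return out


example = pd.DataFrame({"status": ["Finished", "Engine", "Collision", "+1 Lap"]})
flagged = add_status_flags(example)
print(flagged)
```

Any status outside both sets (a normal finish, a lapped finish) leaves both flags at 0.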

Moreover, 'Circuit Characteristics' were evaluated to account for the diverse challenges posed by different tracks, whether they favor high-speed performance or the technical prowess suited to street circuits. We've collated data on 'Recent Form' to catch the momentum of drivers and teams, understanding that past performance can be a harbinger of future results.
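A feature such as `points_in_previous_race` can be derived by shifting each driver's results by one race. This is a minimal sketch on made-up data; the `driver`, `round` and `points` column names are assumptions, and filling a driver's first race with 0 is one possible choice rather than necessarily the one used to build our dataset.

```python
import pandas as pd

# Made-up results for two drivers over three rounds.
results = pd.DataFrame({
    "driver": ["A", "A", "A", "B", "B", "B"],
    "round": [1, 2, 3, 1, 2, 3],
    "points": [25, 18, 0, 10, 25, 15],
})

# Sort so that shifting moves each row's value to the driver's next race.
results = results.sort_values(["driver", "round"])

# Previous-race points per driver; the first race has no history, so use 0.
results["points_in_previous_race"] = (
    results.groupby("driver")["points"].shift(1).fillna(0).astype(int)
)
print(results)
```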

The unsung heroes in the pit lane have not been overlooked; 'Team Strategy and Pit Crew Performance' data has been reintegrated to spotlight their influence on race outcomes. And lastly, the 'Engine and Tyre Performance' feature draws on historical data to estimate the durability of these critical components, which often decide the fate of a race.
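For the pit features, the average time per stop appears to be the total pit time divided by the number of stops. The sketch below uses the `Number_Of_Stops`, `Total_time_in_pits` and `Avg_time_in_pits` column names from our data. The zero-stop row and the decision to report 0.0 for it are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

pits = pd.DataFrame({
    "Number_Of_Stops": [2, 2, 0],
    "Total_time_in_pits": [48.34, 26.42, 0.0],
})

# Replace zero stops with NaN so the division cannot produce inf.
stops = pits["Number_Of_Stops"].replace(0, np.nan)

# Average seconds per stop; drivers with no stops get 0.0 (an assumption).
pits["Avg_time_in_pits"] = (
    (pits["Total_time_in_pits"] / stops).fillna(0.0).round(2)
)
print(pits)
```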

Through these enhancements in feature engineering, we've set the stage for a model that doesn't just process raw results but also captures the race context behind them: reliability, weather, circuit type, recent form and pit performance.

## Data Preprocessing

### Importing Libraries and Notebook Setup

In [56]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [57]:
process_df = pd.read_csv('C:/Users/Alex/OneDrive/BrainStation/Data_Science_Bootcamp/Capstone_Project/capstone-Aboard89/data/2024_data.csv')

In [58]:
process_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 25 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   year                                     480 non-null    int64  
 1   age                                      480 non-null    int64  
 2   years_in_f1                              480 non-null    int64  
 3   races_with_each_team_since_1995          480 non-null    int64  
 4   F2_champion                              480 non-null    int64  
 5   Former_F1_World_Champion                 480 non-null    int64  
 6   home_race                                480 non-null    int64  
 7   starting_grid_position                   480 non-null    int64  
 8   points_in_previous_race                  480 non-null    int64  
 9   laps_in_previous_race                    480 non-null    int64  
 10  constructorId_points_at_stage_of_season  480 non-n

In [59]:
pd.set_option('display.max_columns', None)
process_df.head()

Unnamed: 0,year,age,years_in_f1,races_with_each_team_since_1995,F2_champion,Former_F1_World_Champion,home_race,starting_grid_position,points_in_previous_race,laps_in_previous_race,constructorId_points_at_stage_of_season,driver_points_at_stage_of_season,race,engine_manufacturer,constructor_nationality_,nationality,Mechanical,Driver_Issue,Lapped,Number_Of_Stops,Total_time_in_pits,Avg_time_in_pits,weather_conditions,circuit_type,constructor_id
0,2024,29,9,66,0,0,0,4,0,57,0,0,Bahrain Grand Prix,Ferrari,Italian,Spanish,0,0,0,2,48.34,24.17,Dry,Permanent Race Track,6
1,2024,34,13,66,0,0,0,5,12,58,0,0,Bahrain Grand Prix,Red Bull,Austrian,Mexican,0,0,0,2,49.08,24.54,Dry,Permanent Race Track,9
2,2024,26,9,166,0,1,0,1,26,58,0,0,Bahrain Grand Prix,Red Bull,Austrian,Dutch,0,0,0,2,49.27,24.64,Dry,Permanent Race Track,9
3,2024,24,2,44,0,0,0,17,0,58,0,0,Bahrain Grand Prix,Ferrari,Swiss,Chinese,0,0,1,2,50.3,25.15,Dry,Permanent Race Track,51
4,2024,34,11,44,0,0,0,16,0,57,0,0,Bahrain Grand Prix,Ferrari,Swiss,Finnish,0,0,1,2,26.42,13.21,Dry,Permanent Race Track,51


### One-Hot Encoding

NB: every `drop_first` argument below is set to `False`. The original model data covers teams since 1995, so we keep every category here and will still need to add the extra columns (teams, nationalities, etc.) that exist in that data but do not appear in the 2024 file.
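One way to add those extra columns is to reindex the encoded frame against the column list of the original model data. The sketch below assumes such a list is available as `training_columns`; the list shown here is a hypothetical stand-in, not the real schema.

```python
import pandas as pd

# Hypothetical subset of the original model's columns (illustration only).
training_columns = [
    "year",
    "race_Bahrain Grand Prix",
    "race_Turkish Grand Prix",
    "constructor_id_4",
    "constructor_id_6",
]

new_df = pd.DataFrame({
    "year": [2024, 2024],
    "race": ["Bahrain Grand Prix", "Bahrain Grand Prix"],
    "constructor_id": ["6", "9"],
})

encoded = pd.get_dummies(new_df, columns=["race", "constructor_id"], drop_first=False)

# Add missing training columns filled with 0 and match the training order.
# Columns absent from training_columns (constructor_id_9 here) are dropped.
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Because `reindex` silently drops anything not in the list, it is worth checking which encoded columns disappear before relying on the aligned frame.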

In [60]:
# One-hot encode "race" column
process_df = pd.get_dummies(process_df, columns=['race'], drop_first=False)


In [61]:
# One-hot encode "engine_manufacturer" column
process_df = pd.get_dummies(process_df, columns=['engine_manufacturer'], drop_first=False)

In [62]:
# One-hot encode "Constructor Nationality" column
process_df = pd.get_dummies(process_df, columns=['constructor_nationality_'], drop_first=False)

In [63]:
# One-hot encode "Nationality" column
process_df = pd.get_dummies(process_df, columns=['nationality'], drop_first=False)

In [64]:
# One-hot encode "weather_conditions" column
process_df = pd.get_dummies(process_df, columns=['weather_conditions'], drop_first=False)

In [65]:
# One-hot encode "circuit_type" column
process_df = pd.get_dummies(process_df, columns=['circuit_type'], drop_first=False)

In [66]:
# Keep a string copy of 'constructor_id' in a separate 'constructorId' column
process_df['constructorId'] = process_df['constructor_id'].astype(str)

# One-hot encode "constructorId" column
process_df = pd.get_dummies(process_df, columns=['constructor_id'], drop_first=False)
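The separate encoding cells above could also be written as a single `pd.get_dummies` call, since `columns=` accepts a list and encodes the listed columns whatever their dtype. This is a sketch on a small made-up frame; `encode_categoricals` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Categorical columns encoded in this notebook.
CATEGORICAL_COLUMNS = [
    "race",
    "engine_manufacturer",
    "constructor_nationality_",
    "nationality",
    "weather_conditions",
    "circuit_type",
    "constructor_id",
]


def encode_categoricals(df: pd.DataFrame, columns=CATEGORICAL_COLUMNS) -> pd.DataFrame:
    """One-hot encode every listed column that is present in df."""
    present = [c for c in columns if c in df.columns]
    return pd.get_dummies(df, columns=present, drop_first=False)


sample = pd.DataFrame({
    "year": [2024, 2024],
    "race": ["Bahrain Grand Prix", "Monaco Grand Prix"],
    "circuit_type": ["Permanent Race Track", "Street Circuit"],
    "constructor_id": [6, 9],
})
encoded_sample = encode_categoricals(sample)
print(encoded_sample.columns.tolist())
```

The integer `constructor_id` column is encoded without a prior cast to string, because it is named explicitly in `columns`.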

In [67]:
pd.set_option('display.max_columns', None)
process_df.head()

Unnamed: 0,year,age,years_in_f1,races_with_each_team_since_1995,F2_champion,Former_F1_World_Champion,home_race,starting_grid_position,points_in_previous_race,laps_in_previous_race,constructorId_points_at_stage_of_season,driver_points_at_stage_of_season,Mechanical,Driver_Issue,Lapped,Number_Of_Stops,Total_time_in_pits,Avg_time_in_pits,race_Abu Dhabi Grand Prix,race_Australian Grand Prix,race_Austrian Grand Prix,race_Azerbaijan Grand Prix,race_Bahrain Grand Prix,race_Belgian Grand Prix,race_Brazilian Grand Prix,race_British Grand Prix,race_Canadian Grand Prix,race_Chinese Grand Prix,race_Dutch Grand Prix,race_Emilia Romagna Grand Prix,race_Hungarian Grand Prix,race_Italian Grand Prix,race_Japanese Grand Prix,race_Las Vegas Grand Prix,race_Mexico City Grand Prix,race_Miami Grand Prix,race_Monaco Grand Prix,race_Qatar Grand Prix,race_Saudi Arabian Grand Prix,race_Singapore Grand Prix,race_Spanish Grand Prix,race_United States Grand Prix,engine_manufacturer_Ferrari,engine_manufacturer_Mercedes,engine_manufacturer_Red Bull,engine_manufacturer_Renault,constructor_nationality__American,constructor_nationality__Austrian,constructor_nationality__British,constructor_nationality__French,constructor_nationality__Italian,constructor_nationality__Swiss,nationality_American,nationality_Australian,nationality_British,nationality_Canadian,nationality_Chinese,nationality_Danish,nationality_Dutch,nationality_Finnish,nationality_French,nationality_German,nationality_Japanese,nationality_Mexican,nationality_Monegasque,nationality_Spanish,nationality_Thai,weather_conditions_Dry,circuit_type_Permanent Race Track,circuit_type_Street Circuit,constructorId,constructor_id_1,constructor_id_3,constructor_id_6,constructor_id_9,constructor_id_51,constructor_id_117,constructor_id_131,constructor_id_210,constructor_id_213,constructor_id_214
0,2024,29,9,66,0,0,0,4,0,57,0,0,0,0,0,2,48.34,24.17,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,True,False,6,False,False,True,False,False,False,False,False,False,False
1,2024,34,13,66,0,0,0,5,12,58,0,0,0,0,0,2,49.08,24.54,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,True,False,9,False,False,False,True,False,False,False,False,False,False
2,2024,26,9,166,0,1,0,1,26,58,0,0,0,0,0,2,49.27,24.64,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,9,False,False,False,True,False,False,False,False,False,False
3,2024,24,2,44,0,0,0,17,0,58,0,0,0,0,1,2,50.3,25.15,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,True,False,51,False,False,False,False,True,False,False,False,False,False
4,2024,34,11,44,0,0,0,16,0,57,0,0,0,0,1,2,26.42,13.21,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,True,False,51,False,False,False,False,True,False,False,False,False,False


In [68]:
process_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 81 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   year                                     480 non-null    int64  
 1   age                                      480 non-null    int64  
 2   years_in_f1                              480 non-null    int64  
 3   races_with_each_team_since_1995          480 non-null    int64  
 4   F2_champion                              480 non-null    int64  
 5   Former_F1_World_Champion                 480 non-null    int64  
 6   home_race                                480 non-null    int64  
 7   starting_grid_position                   480 non-null    int64  
 8   points_in_previous_race                  480 non-null    int64  
 9   laps_in_previous_race                    480 non-null    int64  
 10  constructorId_points_at_stage_of_season  480 non-n

### Null Values

In [69]:
# Count null values per column
null_values_per_column = process_df.isnull().sum()

# Display the counts
print(null_values_per_column)

year                               0
age                                0
years_in_f1                        0
races_with_each_team_since_1995    0
F2_champion                        0
                                  ..
constructor_id_117                 0
constructor_id_131                 0
constructor_id_210                 0
constructor_id_213                 0
constructor_id_214                 0
Length: 81, dtype: int64


In [70]:
# Filter and show only columns that have null values
null_values_per_column = null_values_per_column[null_values_per_column > 0]

# Display the filtered counts
print(null_values_per_column)

Series([], dtype: int64)
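The empty Series above shows that no column contains nulls. The same check can be made to fail loudly with an `assert`, so the notebook stops if a future data file has gaps. This is a small sketch; `assert_no_nulls` is a hypothetical helper name and the frame is made up.

```python
import pandas as pd


def assert_no_nulls(df: pd.DataFrame) -> None:
    """Raise AssertionError listing any columns that contain null values."""
    null_counts = df.isnull().sum()
    offenders = null_counts[null_counts > 0]
    assert offenders.empty, f"Columns with nulls: {offenders.to_dict()}"


clean = pd.DataFrame({"year": [2024, 2024], "age": [29, 34]})
assert_no_nulls(clean)  # passes silently when there are no nulls
```

In this notebook the call would be `assert_no_nulls(process_df)`.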


In [71]:
pd.set_option('display.max_columns', None)
process_df.head()

Unnamed: 0,year,age,years_in_f1,races_with_each_team_since_1995,F2_champion,Former_F1_World_Champion,home_race,starting_grid_position,points_in_previous_race,laps_in_previous_race,constructorId_points_at_stage_of_season,driver_points_at_stage_of_season,Mechanical,Driver_Issue,Lapped,Number_Of_Stops,Total_time_in_pits,Avg_time_in_pits,race_Abu Dhabi Grand Prix,race_Australian Grand Prix,race_Austrian Grand Prix,race_Azerbaijan Grand Prix,race_Bahrain Grand Prix,race_Belgian Grand Prix,race_Brazilian Grand Prix,race_British Grand Prix,race_Canadian Grand Prix,race_Chinese Grand Prix,race_Dutch Grand Prix,race_Emilia Romagna Grand Prix,race_Hungarian Grand Prix,race_Italian Grand Prix,race_Japanese Grand Prix,race_Las Vegas Grand Prix,race_Mexico City Grand Prix,race_Miami Grand Prix,race_Monaco Grand Prix,race_Qatar Grand Prix,race_Saudi Arabian Grand Prix,race_Singapore Grand Prix,race_Spanish Grand Prix,race_United States Grand Prix,engine_manufacturer_Ferrari,engine_manufacturer_Mercedes,engine_manufacturer_Red Bull,engine_manufacturer_Renault,constructor_nationality__American,constructor_nationality__Austrian,constructor_nationality__British,constructor_nationality__French,constructor_nationality__Italian,constructor_nationality__Swiss,nationality_American,nationality_Australian,nationality_British,nationality_Canadian,nationality_Chinese,nationality_Danish,nationality_Dutch,nationality_Finnish,nationality_French,nationality_German,nationality_Japanese,nationality_Mexican,nationality_Monegasque,nationality_Spanish,nationality_Thai,weather_conditions_Dry,circuit_type_Permanent Race Track,circuit_type_Street Circuit,constructorId,constructor_id_1,constructor_id_3,constructor_id_6,constructor_id_9,constructor_id_51,constructor_id_117,constructor_id_131,constructor_id_210,constructor_id_213,constructor_id_214
0,2024,29,9,66,0,0,0,4,0,57,0,0,0,0,0,2,48.34,24.17,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,True,False,6,False,False,True,False,False,False,False,False,False,False
1,2024,34,13,66,0,0,0,5,12,58,0,0,0,0,0,2,49.08,24.54,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,True,False,9,False,False,False,True,False,False,False,False,False,False
2,2024,26,9,166,0,1,0,1,26,58,0,0,0,0,0,2,49.27,24.64,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,9,False,False,False,True,False,False,False,False,False,False
3,2024,24,2,44,0,0,0,17,0,58,0,0,0,0,1,2,50.3,25.15,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,True,False,51,False,False,False,False,True,False,False,False,False,False
4,2024,34,11,44,0,0,0,16,0,57,0,0,0,0,1,2,26.42,13.21,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,True,False,51,False,False,False,False,True,False,False,False,False,False


### Send to CSV

In [73]:
process_df.to_csv('model_data_2024.csv', index=False)