# F1 Win Prediction Project
#### Alex Boardman - BrainStation

### Areas to Fix

- **Data Types**: Ensure that each column has the appropriate data type for the kind of data it contains. For instance, categorical data should not be typed as numeric and vice versa. If any columns are meant to be categorical or date/time but are currently recognized as 'object' or 'int64', they should be converted to the proper data type.

- **Missing Data**: Your dataset contains columns with high percentages of missing values, such as 'team_development_rank_last_year' and 'status_finished_last_race'. You need to decide how to handle these, whether by imputation, deletion, or acquisition of more data if possible. For columns with a small amount of missing data, imputation might be feasible, while for those with a large percentage, it might be more appropriate to consider dropping the column.

- **Duplicate Rows**: Check for any duplicate rows that might skew your analysis. If duplicates are not meaningful for your study, they should be removed.

- **Uniqueness of Data**: Some columns like 'raceId' and 'driverId' are expected to have a high degree of uniqueness and serve as identifiers. Other columns that should normally have a diverse set of values but show a high degree of similarity (low uniqueness) may not be very informative and could potentially be candidates for removal.

- **Data Range**: Verify the range of values in numerical columns. For instance, if 'year' has a minimum value that's in the future or a past date that's not plausible, these could be data entry errors. Check for outliers that don't make sense within the context of the data.

- **Consistency**: Ensure that the data is consistent throughout the dataset. For example, if 'country' and 'nationality_of_circuit' are supposed to represent the same information, they should be consistent and possibly merged if they are redundant.

- **Correctness**: For columns with 'inf' values for uniqueness, ensure they are correctly calculated. An 'inf' value might indicate a division by zero error, suggesting that the column might be entirely unique or entirely composed of a single value, each of which has different implications.

- **Data Integrity**: Ensure that related columns correctly reflect relationships in the data. For example, 'number_of_pit_stops' should correlate with 'average_time_lost_in_pits' in a way that makes sense.

- **Normalization/Standardization**: For machine learning purposes, you may need to standardize or normalize numerical data to ensure that the scale of the data does not unduly influence the model.
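
Several of the checks listed above (data types, missing data, duplicates, uniqueness) can be run in one pass. The following is a minimal sketch on a made-up toy frame rather than the real dataset; the helper name `audit_columns` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def audit_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, share of missing values, and uniqueness per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(dropna=True),
    })


# Toy frame standing in for the real data (values are made up)
toy_df = pd.DataFrame({
    "year": [1995, 1995, 1996, 1996],
    "driverId": [30, 57, 30, 57],
    "points": [10.0, 3.0, np.nan, 6.0],
    "empty_col": [np.nan] * 4,
})

summary = audit_columns(toy_df)
n_duplicate_rows = int(toy_df.duplicated().sum())

print(summary)
print(f"Duplicate rows: {n_duplicate_rows}")
```

Columns with a very high `missing_pct` or an `n_unique` of 0 or 1 are the first candidates to look at when deciding what to drop.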

## Data Preprocessing

### Importing Libraries and Notebook Setup

In [18]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [19]:
process_df = pd.read_csv('C:/Users/Alex/OneDrive/BrainStation/Data_Science_Bootcamp/Capstone_Project/capstone-Aboard89/data/data_analysis.csv')

In [20]:
process_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 36 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Index                             11235 non-null  int64  
 1   resultId                          11235 non-null  int64  
 2   raceId                            11235 non-null  int64  
 3   year                              11235 non-null  int64  
 4   race                              11235 non-null  object 
 5   country                           11235 non-null  object 
 6   nationality_of_circuit            11235 non-null  object 
 7   driverId                          11235 non-null  int64  
 8   number                            11235 non-null  int64  
 9   driver_name                       11235 non-null  object 
 10  F2_champion                       11235 non-null  int64  
 11  Former_F1_World_Champion          11235 non-null  int64  
 12  Nati

In [21]:
pd.set_option('display.max_columns', None)
process_df.head()

Unnamed: 0,Index,resultId,raceId,year,race,country,nationality_of_circuit,driverId,number,driver_name,F2_champion,Former_F1_World_Champion,Nationality,home_race,constructorId,constructor,engine_manufacturer,constructor_nationality,number.1,starting_grid_position,positionOrder,points,points_in_previous_race,laps,laps_completed_in_previous_races,time,milliseconds,fastestLap_ms,fastest_lap_from_last_race,number_of_pit_stops,average_time_lost_in_pits,statusId,status,major_regulation_change,laps_in_previous_race,race_win
0,1,4721,240,1995,Brazilian Grand Prix,brazil,Brazilian,30,1,Michael Schumacher,0,1,German,0,22,Benetton,Renault,Italian,1,2,1,10.0,,71,,01:38:34.154000,5914154,81009,,3,31.83,1,Finished,0,,1
1,1,4724,240,1995,Brazilian Grand Prix,brazil,Brazilian,57,8,Mika Häkkinen,0,0,Finnish,0,1,McLaren,Mercedes,British,8,7,4,3.0,,70,,\N,\N,Not Found,,0,0.0,11,+1 Lap,0,,0
2,1,4746,240,1995,Brazilian Grand Prix,brazil,Brazilian,94,23,Pierluigi Martini,0,0,Italian,0,18,Minardi,Ford,Italian,23,17,26,0.0,,0,,\N,\N,Not Found,,0,0.0,6,Gearbox,0,,0
3,1,4745,240,1995,Brazilian Grand Prix,brazil,Brazilian,44,26,Olivier Panis,0,0,French,0,27,Ligier,Mugen-Honda,French,26,10,25,0.0,,0,,\N,\N,Not Found,,0,0.0,4,Collision,0,,0
4,1,4744,240,1995,Brazilian Grand Prix,brazil,Brazilian,49,30,Heinz-Harald Frentzen,0,0,German,0,15,Sauber,Ford,Swiss,30,14,24,0.0,,10,,\N,\N,84001,,0,0.0,10,Electrical,0,,0


In [22]:
process_df['new_index'] = process_df['Index'] + process_df['driverId']


The cell above creates a new index column so that I can track which driver the model thinks is going to win each race when we get to the prediction stage.

I then want to make this the new index of my DataFrame.

In [23]:
process_df = process_df.set_index('new_index')


In [24]:
process_df.head()

new_index,Index,resultId,raceId,year,race,country,nationality_of_circuit,driverId,number,driver_name,F2_champion,Former_F1_World_Champion,Nationality,home_race,constructorId,constructor,engine_manufacturer,constructor_nationality,number.1,starting_grid_position,positionOrder,points,points_in_previous_race,laps,laps_completed_in_previous_races,time,milliseconds,fastestLap_ms,fastest_lap_from_last_race,number_of_pit_stops,average_time_lost_in_pits,statusId,status,major_regulation_change,laps_in_previous_race,race_win
31,1,4721,240,1995,Brazilian Grand Prix,brazil,Brazilian,30,1,Michael Schumacher,0,1,German,0,22,Benetton,Renault,Italian,1,2,1,10.0,,71,,01:38:34.154000,5914154,81009,,3,31.83,1,Finished,0,,1
58,1,4724,240,1995,Brazilian Grand Prix,brazil,Brazilian,57,8,Mika Häkkinen,0,0,Finnish,0,1,McLaren,Mercedes,British,8,7,4,3.0,,70,,\N,\N,Not Found,,0,0.0,11,+1 Lap,0,,0
95,1,4746,240,1995,Brazilian Grand Prix,brazil,Brazilian,94,23,Pierluigi Martini,0,0,Italian,0,18,Minardi,Ford,Italian,23,17,26,0.0,,0,,\N,\N,Not Found,,0,0.0,6,Gearbox,0,,0
45,1,4745,240,1995,Brazilian Grand Prix,brazil,Brazilian,44,26,Olivier Panis,0,0,French,0,27,Ligier,Mugen-Honda,French,26,10,25,0.0,,0,,\N,\N,Not Found,,0,0.0,4,Collision,0,,0
50,1,4744,240,1995,Brazilian Grand Prix,brazil,Brazilian,49,30,Heinz-Harald Frentzen,0,0,German,0,15,Sauber,Ford,Swiss,30,14,24,0.0,,10,,\N,\N,84001,,0,0.0,10,Electrical,0,,0


That looks like it has worked correctly.
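
One thing worth checking is whether `Index + driverId` is actually unique, because different race/driver pairs can add up to the same number (for example 1 + 30 and 2 + 29). The sketch below uses made-up rows to show the collision, plus one possible alternative: keeping both identifiers as a two-level index. The alternative is an assumption about what the key is for, not part of the original workflow.

```python
import pandas as pd

# Toy rows (made up) where two different race/driver pairs sum to the same value
toy_df = pd.DataFrame({
    "Index": [1, 2, 2],
    "driverId": [30, 29, 30],
    "points": [10.0, 0.0, 6.0],
})

# Additive key, built the same way as new_index
additive_key = toy_df["Index"] + toy_df["driverId"]
print(additive_key.is_unique)  # False: 1 + 30 and 2 + 29 both give 31

# Alternative: keep both identifiers as a two-level index
keyed_df = toy_df.set_index(["Index", "driverId"])
print(keyed_df.index.is_unique)  # True
```

On the real data, `process_df.index.is_unique` gives the same check in one line.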

### Rename Columns

In [1]:
process_df = process_df.rename(columns={'Index': 'race_index'})

NameError: name 'process_df' is not defined

### Drop Redundant Columns

In [25]:
# Define the columns to be dropped
cols_to_drop = ["Index",
    "raceId", "resultId", "race", "country",
    "nationality_of_circuit", "driver_name", "driverId", "number",
    "fastestLap_ms", "status", "constructor", "positionOrder", "number.1",
    "laps", "major_regulation_change", "time", "milliseconds"
]

# Verify that the columns exist in the DataFrame before dropping them
existing_cols_to_drop = [col for col in cols_to_drop if col in process_df.columns]

# Drop the verified columns from the DataFrame
process_df.drop(columns=existing_cols_to_drop, inplace=True)

In [26]:
process_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11235 entries, 31 to 1384
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   year                              11235 non-null  int64  
 1   F2_champion                       11235 non-null  int64  
 2   Former_F1_World_Champion          11235 non-null  int64  
 3   Nationality                       11235 non-null  object 
 4   home_race                         11235 non-null  int64  
 5   constructorId                     11235 non-null  int64  
 6   engine_manufacturer               11235 non-null  object 
 7   constructor_nationality           11235 non-null  object 
 8   starting_grid_position            11235 non-null  int64  
 9   points                            11235 non-null  float64
 10  points_in_previous_race           10921 non-null  float64
 11  laps_completed_in_previous_races  0 non-null      float64
 12  fastest_l

### Changing Data Types

In [27]:
# # Convert columns to the right data types
# df[col] = df[col].astype('string')
# df[col] = df[col].astype('int')
# df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

# # Convert to categorical datatype
# col_cat = ptypes.CategoricalDtype(categories=['A', 'B', 'C'], ordered=True)
# df['col_cat'] = df['col_cat'].astype(col_cat)

In [28]:
# # Verify conversion
# assert ptypes.is_string_dtype(df[col])
# assert ptypes.is_numeric_dtype(df[col])
# cols_to_check = []
# assert all(ptypes.is_datetime64_any_dtype(df[col]) for col in cols_to_check)

### Dropping Duplicates

In [29]:
# # Drop entirely duplicated rows
# df.drop_duplicates(inplace=True, ignore_index=True)

In [30]:
# # Verify rows dropped
# assert df.duplicated().sum()==0

### Handling Unreasonable Data Ranges

In [31]:
# # Drop affected rows
# df = df.loc[~((df['A'] == 0) | (df['B'] > 100))].reset_index()

In [32]:
# # Verify rows dropped
# len(df)

### Feature Engineering / Transformation

#### constructor_points_at_stage_of_season

In [33]:
# constructor_points_sum_df.info()

In [34]:
constructor_points_sum_df = process_df.copy()

In [35]:
# First, ensure 'race_index', 'year', and 'constructorId' are sorted in the order we want to process them
constructor_points_sum_df = constructor_points_sum_df.sort_values(by=['year', 'race_index', 'constructorId'])

# Initialize a new column for corrected constructorId points
constructor_points_sum_df['corrected_constructorId_points'] = 0

# Use a temporary DataFrame to assist with the cumulative sum calculation
temp_df = constructor_points_sum_df.groupby(['year', 'race_index', 'constructorId'])['points'].sum().groupby(level=[0, 2]).cumsum().reset_index()

# Merge this temporary DataFrame back to the original sorted DataFrame
# This step ensures each driver for the constructorId at that race_index sees the summed points on the constructorId level
df_merged = pd.merge(constructor_points_sum_df, temp_df, on=['year', 'race_index', 'constructorId'], how='left')

# The merged DataFrame now has an additional column with the cumulative points which needs to be renamed and checked
df_merged = df_merged.rename(columns={'points_y': 'constructorId_points_at_stage_of_season', 'points_x': 'points'})

# Drop the placeholder column created earlier in this cell
df_merged.drop(columns=['corrected_constructorId_points'], inplace=True)

KeyError: 'race_index'
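
This `KeyError` is consistent with the Rename Columns cell not having run in this session (it is numbered `In [1]` and raised a `NameError`), while the drop cell removes `Index`, so no `race_index` column exists by the time of the sort. A minimal sketch of one way to order the steps, on a made-up toy frame with the same column names:

```python
import pandas as pd

# Toy frame (made-up values) with the same column names as the raw data
toy_df = pd.DataFrame({
    "Index": [1, 1, 2],
    "raceId": [240, 240, 241],
    "year": [1995, 1995, 1995],
    "constructorId": [22, 1, 22],
    "points": [10.0, 3.0, 6.0],
})

# Rename first, so the race counter survives under its new name
toy_df = toy_df.rename(columns={"Index": "race_index"})

# "Index" is deliberately left out of the drop list
cols_to_drop = ["raceId"]
toy_df = toy_df.drop(columns=[c for c in cols_to_drop if c in toy_df.columns])

# Sorting on race_index now succeeds
toy_df = toy_df.sort_values(by=["year", "race_index", "constructorId"])
print(toy_df.columns.tolist())
```

Restarting the kernel and running the cells top to bottom is a quick way to confirm the notebook no longer depends on execution order.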

In [None]:
# Check the first few rows to ensure the new column has been correctly calculated
df_merged[['race_index', 'driver_name', 'year', 'constructorId', 'points', 'constructorId_points_at_stage_of_season']].head(50)

Unnamed: 0,race_index,driver_name,year,constructorId,points,constructorId_points_at_stage_of_season
0,1,Mika Häkkinen,1995,1,3.0,4.0
1,1,Mark Blundell,1995,1,1.0,4.0
2,1,Damon Hill,1995,3,0.0,6.0
3,1,David Coulthard,1995,3,6.0,6.0
4,1,Gerhard Berger,1995,6,4.0,6.0
5,1,Jean Alesi,1995,6,2.0,6.0
6,1,Heinz-Harald Frentzen,1995,15,0.0,0.0
7,1,Karl Wendlinger,1995,15,0.0,0.0
8,1,Eddie Irvine,1995,17,0.0,0.0
9,1,Rubens Barrichello,1995,17,0.0,0.0


In [None]:
# Check the last few rows to ensure the new column has been correctly calculated
df_merged[['race_index', 'driver_name', 'year', 'constructorId', 'points', 'constructorId_points_at_stage_of_season']].tail(50)

Unnamed: 0,race_index,driver_name,year,constructorId,points,constructorId_points_at_stage_of_season
11185,525,Lance Stroll,2023,117,0.0,168.0
11186,525,Fernando Alonso,2023,117,6.0,168.0
11187,525,George Russell,2023,131,10.0,195.0
11188,525,Lewis Hamilton,2023,131,15.0,195.0
11189,525,Kevin Magnussen,2023,210,0.0,8.0
11190,525,Nico Hülkenberg,2023,210,0.0,8.0
11191,525,Nyck de Vries,2023,213,0.0,2.0
11192,525,Yuki Tsunoda,2023,213,0.0,2.0
11193,525,Pierre Gasly,2023,214,0.0,45.0
11194,525,Esteban Ocon,2023,214,0.0,45.0


That seems to have worked - now we have a column that has the cumulative points for the constructor at this stage of the season for each driver.
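
Note that the running total includes the points scored in the race on that row: in the first race of 1995 the two McLaren rows already show 4.0. If the feature is meant to describe what is known before the race starts, one option is to subtract the current race's team total from the running total. This is a sketch on made-up data, and the column name `constructor_points_before_race` is hypothetical.

```python
import pandas as pd

# Toy season (made-up points): two constructors, two races
toy_df = pd.DataFrame({
    "year": [1995] * 6,
    "race_index": [1, 1, 1, 1, 2, 2],
    "constructorId": [1, 1, 3, 3, 1, 3],
    "points": [3.0, 1.0, 6.0, 0.0, 4.0, 10.0],
})

# Points scored by each constructor in each race
race_totals = (
    toy_df.groupby(["year", "race_index", "constructorId"], as_index=False)["points"]
    .sum()
    .sort_values(["year", "constructorId", "race_index"])
)

# Running total within a season, minus the current race's contribution
running = race_totals.groupby(["year", "constructorId"])["points"].cumsum()
race_totals["constructor_points_before_race"] = running - race_totals["points"]

# Attach the pre-race total to every driver row
toy_df = toy_df.merge(
    race_totals.drop(columns="points"),
    on=["year", "race_index", "constructorId"],
    how="left",
)
print(toy_df)
```

With this variant every constructor starts each season on 0.0, and the value on a row only reflects earlier races.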

#### driver_points_at_stage_of_season

In [None]:
driver_points_sum_df = df_merged.copy()

In [None]:
# Use a temporary DataFrame to assist with the cumulative sum calculation for drivers
temp_driver_df = driver_points_sum_df.groupby(['year', 'race_index', 'driver_name'])['points'].sum().groupby(level=[0, 2]).cumsum().reset_index()

# Merge this temporary DataFrame back to the original DataFrame
# This step ensures each driver sees the summed points at that stage of the season
df_merged_with_driver_points = pd.merge(driver_points_sum_df, temp_driver_df, on=['year', 'race_index', 'driver_name'], how='left')

# The merged DataFrame now has an additional column with the cumulative points which needs to be renamed and checked
df_merged_with_driver_points = df_merged_with_driver_points.rename(columns={'points_y': 'driver_points_at_stage_of_season', 'points_x': 'points'})

In [None]:
# Check the first few rows to ensure the new column has been correctly calculated
df_merged_with_driver_points[['race_index', 'year', 'driver_name', 'points', 'driver_points_at_stage_of_season']].head(50)

Unnamed: 0,race_index,year,driver_name,points,driver_points_at_stage_of_season
0,1,1995,Mika Häkkinen,3.0,3.0
1,1,1995,Mark Blundell,1.0,1.0
2,1,1995,Damon Hill,0.0,0.0
3,1,1995,David Coulthard,6.0,6.0
4,1,1995,Gerhard Berger,4.0,4.0
5,1,1995,Jean Alesi,2.0,2.0
6,1,1995,Heinz-Harald Frentzen,0.0,0.0
7,1,1995,Karl Wendlinger,0.0,0.0
8,1,1995,Eddie Irvine,0.0,0.0
9,1,1995,Rubens Barrichello,0.0,0.0


In [None]:
# Check the last few rows to ensure the new column has been correctly calculated
df_merged_with_driver_points[['race_index', 'year', 'driver_name', 'points', 'driver_points_at_stage_of_season']].tail(50)

Unnamed: 0,race_index,year,driver_name,points,driver_points_at_stage_of_season
11185,525,2023,Lance Stroll,0.0,38.0
11186,525,2023,Fernando Alonso,6.0,130.0
11187,525,2023,George Russell,10.0,76.0
11188,525,2023,Lewis Hamilton,15.0,119.0
11189,525,2023,Kevin Magnussen,0.0,2.0
11190,525,2023,Nico Hülkenberg,0.0,6.0
11191,525,2023,Nyck de Vries,0.0,0.0
11192,525,2023,Yuki Tsunoda,0.0,2.0
11193,525,2023,Pierre Gasly,0.0,16.0
11194,525,2023,Esteban Ocon,0.0,29.0


That seems to have worked - now we have a column that holds each driver's cumulative points at this stage of the season.

### Other Feature Engineering Ideas

1) team_development_rank_last_year
2) statusId_finished_last_race
3) team_rank_first_race_after_major_regulation_change
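
As a rough illustration of the second idea, a "finished last race" flag could be built by shifting each driver's `statusId` by one race. The sketch below uses made-up rows and assumes that `statusId == 1` means "Finished", as in the sample rows earlier; whether lapped finishers such as "+1 Lap" should also count is left open.

```python
import pandas as pd

# Toy results (made up): two drivers over three races
toy_df = pd.DataFrame({
    "race_index": [1, 1, 2, 2, 3, 3],
    "driverId": [30, 57, 30, 57, 30, 57],
    "statusId": [1, 11, 4, 1, 1, 1],
})

# Order each driver's races chronologically, then look one race back
toy_df = toy_df.sort_values(["driverId", "race_index"])
previous_status = toy_df.groupby("driverId")["statusId"].shift(1)

# 1.0 = finished last time out, 0.0 = anything else, NaN = no earlier race
toy_df["statusId_finished_last_race"] = (
    previous_status.eq(1).astype(float).where(previous_status.notna())
)

toy_df = toy_df.sort_values(["race_index", "driverId"])
print(toy_df)
```

The first race for each driver has no history, so the flag stays missing there rather than being set to 0.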

In [None]:
df_merged_with_driver_points.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 22 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               11235 non-null  int64  
 1   year                                     11235 non-null  int64  
 2   driver_name                              11235 non-null  object 
 3   F2_champion                              11235 non-null  int64  
 4   Former_F1_World_Champion                 11235 non-null  int64  
 5   Nationality                              11235 non-null  object 
 6   home_race                                11235 non-null  int64  
 7   constructorId                            11235 non-null  int64  
 8   engine_manufacturer                      11235 non-null  object 
 9   constructor_nationality                  11235 non-null  object 
 10  starting_grid_position                   11235
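
The `RangeIndex` reported above suggests that the `new_index` set earlier did not survive the merges: `pd.merge` on columns returns a fresh 0..n-1 index. If that label is still needed at prediction time, one option is to carry it through as a column and restore it afterwards. A minimal sketch with made-up frames (`left`, `right`, and `team_total` are illustrative names only):

```python
import pandas as pd

# Toy frames (made-up values); left carries a labelled index like new_index
left = pd.DataFrame(
    {"year": [1995, 1995], "constructorId": [22, 1], "points": [10.0, 3.0]},
    index=pd.Index([31, 58], name="new_index"),
)
right = pd.DataFrame(
    {"year": [1995, 1995], "constructorId": [22, 1], "team_total": [10.0, 4.0]}
)

# A plain column-on-column merge returns a fresh integer index
plain = pd.merge(left, right, on=["year", "constructorId"], how="left")

# Carrying the index through as a column keeps the labels
kept = (
    left.reset_index()
    .merge(right, on=["year", "constructorId"], how="left")
    .set_index("new_index")
)

print(plain.index.tolist())  # [0, 1]
print(kept.index.tolist())   # [31, 58]
```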

In [None]:
df_merged_with_driver_points.head().T

Unnamed: 0,0,1,2,3,4
race_index,1,1,1,1,1
year,1995,1995,1995,1995,1995
driver_name,Mika Häkkinen,Mark Blundell,Damon Hill,David Coulthard,Gerhard Berger
F2_champion,0,0,0,0,0
Former_F1_World_Champion,0,0,1,0,0
Nationality,Finnish,British,British,British,Austrian
home_race,0,0,0,0,0
constructorId,1,1,3,3,6
engine_manufacturer,Mercedes,Mercedes,Renault,Renault,Ferrari
constructor_nationality,British,British,British,British,Italian


In [None]:
# Define the columns to be dropped
cols_to_drop = [
    "driver_name", "resultId", "number_of_pit_stops", "average_time_lost_in_pits",
    "statusId", "laps_completed_in_previous_races", "points"
]

NB - I have dropped `laps_completed_in_previous_races` for the time being, as it was full of null values. I will run the first models, use them as benchmarks, and then see if I can improve the scores.
The other columns I didn't think were helpful, for the following reasons:
- `number_of_pit_stops` & `average_time_lost_in_pits` - as we are trying to predict the race winner before the race, this information isn't very helpful at this stage. It may be interesting to see how last year's pit strategy at a particular race will impact next year's race, but it's not so helpful in its current form.
- `points` - this isn't helpful, as we want to predict forwards and this is captured by the `constructorId_points_at_stage_of_season` & `driver_points_at_stage_of_season` columns.

In [None]:

# Verify that the columns exist in the DataFrame before dropping them
existing_cols_to_drop = [col for col in cols_to_drop if col in df_merged_with_driver_points.columns]

# Drop the verified columns from the DataFrame
df_merged_with_driver_points.drop(columns=existing_cols_to_drop, inplace=True)

In [None]:
df_merged_with_driver_points.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 16 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               11235 non-null  int64  
 1   year                                     11235 non-null  int64  
 2   F2_champion                              11235 non-null  int64  
 3   Former_F1_World_Champion                 11235 non-null  int64  
 4   Nationality                              11235 non-null  object 
 5   home_race                                11235 non-null  int64  
 6   constructorId                            11235 non-null  int64  
 7   engine_manufacturer                      11235 non-null  object 
 8   constructor_nationality                  11235 non-null  object 
 9   starting_grid_position                   11235 non-null  int64  
 10  points_in_previous_race                  10921

In [None]:
df_merged_with_driver_points.to_csv('merged_with_driver_points.csv', index=False)

### One Hot Encoding

In [None]:
# Perform one-hot encoding on the specified columns
df_with_dummies = pd.get_dummies(df_merged_with_driver_points, columns=['engine_manufacturer', 'constructor_nationality', 'Nationality'])

# Now df_with_dummies contains the original data along with the one-hot encoded columns


In [None]:
df_with_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 82 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               11235 non-null  int64  
 1   year                                     11235 non-null  int64  
 2   F2_champion                              11235 non-null  int64  
 3   Former_F1_World_Champion                 11235 non-null  int64  
 4   home_race                                11235 non-null  int64  
 5   constructorId                            11235 non-null  int64  
 6   starting_grid_position                   11235 non-null  int64  
 7   points_in_previous_race                  10921 non-null  float64
 8   fastest_lap_from_last_race               10921 non-null  object 
 9   laps_in_previous_race                    10921 non-null  float64
 10  race_win                                 11235

### Reviewing fastest_lap_from_last_race - not sure why this is an object

In [None]:
print(df_with_dummies['fastest_lap_from_last_race'].unique())


[nan '83003' 'Not Found' '82005' '81000' '83002' '83007' '84004' '84001'
 '83004' '86003' '81009' '83000' '84007' '89008' '88008' '86005' '83008'
 '87008' '87004' '86000' '88005' '91003' '92001' '91005' '92009' '94003'
 '94008' '95001' '96008' '93001' '94006' '94001' '94000' '101007' '95003'
 '94004' '96009' '95004' '90000' '90007' '90006' '92008' '91009' '94005'
 '93005' '90001' '106007' '97002' '100008' '91001' '98005' '98009'
 '120007' '93004' '96001' '107001' '96005' '105006' '85004' '85005'
 '86007' '86008' '87003' '86002' '86009' '89000' '85008' '87005' '87000'
 '86004' '92004' '88002' '90009' '88009' '87007' '85006' '88004' '87009'
 '88006' '90002' '90004' '90005' '92005' '91008' '101009' '93000' '91002'
 '91006' '93009' '89002' '92002' '92007' '98001' '96006' '81006' '81002'
 '82008' '81004' '83009' '81005' '80002' '83001' '90008' '91004' '93006'
 '104000' '90003' '93002' '98007' '95000' '95006' '96004' '111008'
 '114000' '110007' '111007' '110009' '111004' '112000' '111006' '1

Looking at this, the values should be integers, as they are recorded in milliseconds. I will convert this column to an integer type.

In [None]:
# Replace 'Not Found' with NaN
df_with_dummies['fastest_lap_from_last_race'] = df_with_dummies['fastest_lap_from_last_race'].replace('Not Found', np.nan)

# Convert the column to Pandas nullable integer type
df_with_dummies['fastest_lap_from_last_race'] = df_with_dummies['fastest_lap_from_last_race'].astype('Int64')
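
An alternative that does not rely on knowing every non-numeric placeholder in advance is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable (including `'Not Found'`) into a missing value before the cast. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy series (made up) mixing NaN, numeric strings, and a text placeholder
raw = pd.Series([np.nan, "83003", "Not Found", "82005"], dtype="object")

# Unparseable entries become NaN, then cast to the nullable integer type
lap_ms = pd.to_numeric(raw, errors="coerce").astype("Int64")

print(lap_ms)
print(lap_ms.isna().sum())  # 2
```

The trade-off is that unexpected placeholders are silently coerced, so it is worth comparing the missing-value count before and after the conversion.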

In [None]:
df_with_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 82 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               11235 non-null  int64  
 1   year                                     11235 non-null  int64  
 2   F2_champion                              11235 non-null  int64  
 3   Former_F1_World_Champion                 11235 non-null  int64  
 4   home_race                                11235 non-null  int64  
 5   constructorId                            11235 non-null  int64  
 6   starting_grid_position                   11235 non-null  int64  
 7   points_in_previous_race                  10921 non-null  float64
 8   fastest_lap_from_last_race               8902 non-null   Int64  
 9   laps_in_previous_race                    10921 non-null  float64
 10  race_win                                 11235

In [None]:
rows_with_nan = df_with_dummies[df_with_dummies['fastest_lap_from_last_race'].isna()]
print(rows_with_nan)

       race_index  year  F2_champion  Former_F1_World_Champion  home_race  \
0               1  1995            0                         0          0   
1               1  1995            0                         0          0   
2               1  1995            0                         1          0   
3               1  1995            0                         0          0   
4               1  1995            0                         0          0   
...           ...   ...          ...                       ...        ...   
11222         527  2023            0                         0          0   
11224         527  2023            0                         0          0   
11230         527  2023            1                         0          0   
11233         527  2023            1                         0          0   
11234         527  2023            0                         0          0   

       constructorId  starting_grid_position  points_in_previous_race  \
0 

In [None]:
nan_percentage = df_with_dummies['fastest_lap_from_last_race'].isna().mean() * 100
print(f'Percentage of NaN values in fastest_lap_from_last_race: {nan_percentage:.2f}%')

Percentage of NaN values in fastest_lap_from_last_race: 20.77%


About 20.77% of the values in `fastest_lap_from_last_race` are missing, so I'm going to drop this column for the time being. I might bring it back after running some initial models if I can decide on an imputation strategy.
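
For reference, one candidate imputation strategy is to fill gaps with the median for that season and keep a flag marking which values were imputed. This is only a sketch on made-up numbers, not a recommendation: lap times depend heavily on the circuit, and in a real pipeline the medians would need to come from the training split only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): lap times in milliseconds with gaps
toy_df = pd.DataFrame({
    "year": [1995, 1995, 1995, 2023, 2023, 2023],
    "fastest_lap_from_last_race": [83000.0, np.nan, 85000.0, 92000.0, 94000.0, np.nan],
})

col = "fastest_lap_from_last_race"

# Keep a record of which rows were originally missing
toy_df[f"{col}_missing"] = toy_df[col].isna().astype(int)

# Fill each gap with the median of the same season
season_median = toy_df.groupby("year")[col].transform("median")
toy_df[col] = toy_df[col].fillna(season_median)

print(toy_df)
```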

In [None]:
df_with_dummies = df_with_dummies.drop('fastest_lap_from_last_race', axis=1)

### Drop Remaining Null Values

In [None]:
df_with_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11235 entries, 0 to 11234
Data columns (total 81 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               11235 non-null  int64  
 1   year                                     11235 non-null  int64  
 2   F2_champion                              11235 non-null  int64  
 3   Former_F1_World_Champion                 11235 non-null  int64  
 4   home_race                                11235 non-null  int64  
 5   constructorId                            11235 non-null  int64  
 6   starting_grid_position                   11235 non-null  int64  
 7   points_in_previous_race                  10921 non-null  float64
 8   laps_in_previous_race                    10921 non-null  float64
 9   race_win                                 11235 non-null  int64  
 10  constructorId_points_at_stage_of_season  11235

In [None]:
df_with_dummies = df_with_dummies.dropna()

In [None]:
df_with_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10921 entries, 26 to 11234
Data columns (total 81 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   race_index                               10921 non-null  int64  
 1   year                                     10921 non-null  int64  
 2   F2_champion                              10921 non-null  int64  
 3   Former_F1_World_Champion                 10921 non-null  int64  
 4   home_race                                10921 non-null  int64  
 5   constructorId                            10921 non-null  int64  
 6   starting_grid_position                   10921 non-null  int64  
 7   points_in_previous_race                  10921 non-null  float64
 8   laps_in_previous_race                    10921 non-null  float64
 9   race_win                                 10921 non-null  int64  
 10  constructorId_points_at_stage_of_season  10921 non

In [49]:
df_with_dummies.to_csv('model_data.csv', index=False)