# KPI 4 - Driver Lap Times - Data Validation and Sanity Checks

In [2]:
import pandas as pd

# read csv file
df_laptimes = pd.read_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times.csv')

# dataframe basic info
print(df_laptimes.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8222 entries, 0 to 8221
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   race_id                8222 non-null   int64 
 1   gp_year                8222 non-null   int64 
 2   gp_name                8222 non-null   object
 3   gp_round               8222 non-null   int64 
 4   driver_id              8222 non-null   int64 
 5   driver_name            8222 non-null   object
 6   rookie_or_experienced  8222 non-null   object
 7   lap_number             8222 non-null   int64 
 8   lap_time               8222 non-null   object
 9   lap_time_ms            8222 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 642.5+ KB
None


## Summary of processed dataset 'driver-lap-times.csv'

- Filepath: /Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times.csv *(potentially fix from absolute to relative path later?)*
- Range: 8222 entries, 0 to 8221.
- Columns: 10
- Data types: int64(6), object(4) *(objects are strings)*
- Memory usage: 642.5+ KB
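
On the absolute-versus-relative path note above, one possible approach is to build the path with `pathlib`. This is a minimal sketch that assumes the notebook runs from a folder one level below the project root; `PROJECT_ROOT`, `DATA_DIR`, and `LAPTIMES_CSV` are hypothetical names, not part of the existing code.

```python
from pathlib import Path

# Assumption: the notebook's working directory sits one level below the
# project root, which contains processed_data/. Adjust to the real layout.
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "processed_data"
LAPTIMES_CSV = DATA_DIR / "driver-lap-times.csv"

print(LAPTIMES_CSV)

# The read call would then become:
# df_laptimes = pd.read_csv(LAPTIMES_CSV)
```

With this in place the notebook no longer depends on one user's home directory, as long as the assumed folder layout holds.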

## Column data types

In [3]:
print(df_laptimes.dtypes)

race_id                   int64
gp_year                   int64
gp_name                  object
gp_round                  int64
driver_id                 int64
driver_name              object
rookie_or_experienced    object
lap_number                int64
lap_time                 object
lap_time_ms               int64
dtype: object


## Missing or null values

In [4]:
df_laptimes.isnull().sum() # No nulls present across the dataset!

race_id                  0
gp_year                  0
gp_name                  0
gp_round                 0
driver_id                0
driver_name              0
rookie_or_experienced    0
lap_number               0
lap_time                 0
lap_time_ms              0
dtype: int64

## Check for duplicates

In [5]:
df_laptimes.duplicated().sum() # No duplicates found

0
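
`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below, looks for repeats of the natural key. It assumes one row per (`race_id`, `driver_id`, `lap_number`), which the notebook does not state explicitly, and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "race_id": [930, 930, 930, 931],
    "driver_id": [13, 13, 13, 13],
    "lap_number": [1, 2, 2, 1],
    "lap_time_ms": [95000, 91000, 91250, 88000],
})

# Assumed natural key: one row per driver per lap per race.
key_cols = ["race_id", "driver_id", "lap_number"]

full_dupes = sample.duplicated().sum()                 # identical in every column
key_dupes = sample.duplicated(subset=key_cols).sum()   # same key, any lap time

print(f"Full-row duplicates: {full_dupes}")
print(f"Key duplicates: {key_dupes}")
```

In the sample, lap 2 of race 930 appears twice with different times, so the full-row check passes while the key check catches it.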

## Summary statistics

In [6]:
df_laptimes.describe()

| | race_id | gp_year | gp_round | driver_id | lap_number | lap_time_ms |
|---|---|---|---|---|---|---|
| count | 8222.0 | 8222.0 | 8222.0 | 8222.0 | 8222.0 | 8222.0 |
| mean | 986.430066 | 2017.356604 | 11.509 | 602.163221 | 31.275724 | 93303.52 |
| std | 29.875159 | 1.401271 | 4.512304 | 368.615879 | 18.877443 | 46443.72 |
| min | 930.0 | 2015.0 | 5.0 | 9.0 | 1.0 | 67847.0 |
| 25% | 962.0 | 2016.0 | 8.0 | 13.0 | 15.0 | 79150.75 |
| 50% | 993.0 | 2018.0 | 12.0 | 822.0 | 30.0 | 87566.0 |
| 75% | 1015.0 | 2019.0 | 14.0 | 840.0 | 46.0 | 100130.5 |
| max | 1029.0 | 2019.0 | 20.0 | 847.0 | 78.0 | 2118323.0 |


From this we can roughly tell that:
- The dataset contains 8,222 laps recorded across all Williams drivers between 2015 and 2019, on the ten selected consistent circuits.
- Mean lap time is ~93,304 ms (~1:33.3), in line with typical F1 race pace, though this depends on circuit length.
- Median lap time is 87,566 ms (~1:27.57), faster than the mean, indicating a right-skewed distribution (some laps are much slower, possibly due to pit stops or SC/VSC periods).
- Minimum lap time is 67,847 ms (~1:07.85), likely recorded on a short circuit or during a qualifying-style push lap.
- Maximum lap time is 2,118,323 ms (~35 minutes), clearly an extreme outlier, likely due to a data error or unclean recording (e.g. stuck in the pits, technical failure).
- Standard deviation is ~46,444 ms (~46 s), suggesting significant variability, as expected for a mix of laps under normal, SC, pit stop, or mechanical issue conditions.

- Lap numbers range from 1 to 78. The maximum of 78 matches the Monaco GP race distance, a sign that Monaco is included.
- GP years span 2015 to 2019, as expected.
- Median year is 2018, suggesting a larger representation of more recent seasons.
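
The millisecond-to-clock conversions quoted above are easy to get wrong by hand, so a small helper can do them. This is a minimal sketch; `format_lap_time` is a hypothetical name, not an existing function in the project, and the input values are copied from the `describe()` output.

```python
def format_lap_time(ms: float) -> str:
    """Format a duration in milliseconds as m:ss.mmm."""
    total_ms = int(round(ms))
    minutes, rem_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rem_ms, 1_000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


# Values taken from the lap_time_ms column of the summary statistics.
summary_ms = {
    "mean": 93303.52,
    "median": 87566.0,
    "min": 67847.0,
    "max": 2118323.0,
}

for label, value in summary_ms.items():
    print(f"{label:>6}: {format_lap_time(value)}")
```

The maximum formats as `35:18.323`, which confirms the roughly 35-minute outlier.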

In [7]:
# The maximum lap time of ~35 minutes is clearly invalid, so drop any lap slower than 2 minutes (120,000 ms).
df_laptimes = df_laptimes[df_laptimes['lap_time_ms'] <= 120000]

# Re-check the dataset after dropping invalid lap times
print(df_laptimes.info())

<class 'pandas.core.frame.DataFrame'>
Index: 7719 entries, 0 to 8221
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   race_id                7719 non-null   int64 
 1   gp_year                7719 non-null   int64 
 2   gp_name                7719 non-null   object
 3   gp_round               7719 non-null   int64 
 4   driver_id              7719 non-null   int64 
 5   driver_name            7719 non-null   object
 6   rookie_or_experienced  7719 non-null   object
 7   lap_number             7719 non-null   int64 
 8   lap_time               7719 non-null   object
 9   lap_time_ms            7719 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 663.4+ KB
None


In [8]:
# Check how many observations have been dropped due to invalid lap times
print(f"Total laps before dropping invalid lap times: {8222}")  # Original number of laps
print(f"Total laps after dropping invalid lap times: {len(df_laptimes)}")  # Number of laps after filtering
print(f"Number of laps dropped due to invalid lap times: {8222 - len(df_laptimes)}")  # Calculate the number of laps dropped

Total laps before dropping invalid lap times: 8222
Total laps after dropping invalid lap times: 7719
Number of laps dropped due to invalid lap times: 503
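
The cell above hard-codes the original row count of 8222. A sketch of an alternative is to record the count before filtering and break the dropped laps down by Grand Prix. The sample rows and the `Circuit A`/`Circuit B` labels are made up for illustration, and `MAX_LAP_MS`, `is_valid`, and `dropped_per_gp` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "gp_name": ["Circuit A", "Circuit A", "Circuit B", "Circuit B", "Circuit B"],
    "lap_time_ms": [90000, 125000, 80000, 2118323, 119983],
})

MAX_LAP_MS = 120_000  # same 2-minute threshold as the notebook

n_before = len(sample)                       # record the count before filtering
is_valid = sample["lap_time_ms"] <= MAX_LAP_MS

# Count the dropped laps per Grand Prix to see where the filter bites hardest.
dropped_per_gp = (~is_valid).groupby(sample["gp_name"]).sum()

filtered = sample[is_valid]
n_dropped = n_before - len(filtered)

print(f"Total laps before filtering: {n_before}")
print(f"Total laps after filtering: {len(filtered)}")
print(f"Laps dropped: {n_dropped}")
print(dropped_per_gp)
```

A per-circuit breakdown like this would show whether a fixed 2-minute cutoff removes a disproportionate share of laps at longer circuits.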


In [9]:
df_laptimes.describe()

| | race_id | gp_year | gp_round | driver_id | lap_number | lap_time_ms |
|---|---|---|---|---|---|---|
| count | 7719.0 | 7719.0 | 7719.0 | 7719.0 | 7719.0 | 7719.0 |
| mean | 986.746599 | 2017.375826 | 11.433217 | 604.033813 | 32.063609 | 89039.073326 |
| std | 30.046807 | 1.407359 | 4.54224 | 367.797534 | 18.720381 | 13161.967057 |
| min | 930.0 | 2015.0 | 5.0 | 9.0 | 1.0 | 67847.0 |
| 25% | 962.0 | 2016.0 | 8.0 | 13.0 | 16.0 | 78446.5 |
| 50% | 993.0 | 2018.0 | 11.0 | 822.0 | 31.0 | 86703.0 |
| 75% | 1015.0 | 2019.0 | 14.0 | 840.0 | 46.0 | 97515.5 |
| max | 1029.0 | 2019.0 | 20.0 | 847.0 | 78.0 | 119983.0 |


From the updated summary statistics we can tell that:
- New mean lap time is 89,039 ms (~1:29.04), aligning better with typical midfield lap times.
- New max lap time is 119,983 ms (~1:59.98), just under 2 minutes and still plausible under race conditions, such as a very wet Belgian GP.
- Standard deviation is 13,162 ms (~13.2 s), a large improvement on the previous ~46 s, which was severely skewed by outliers.
- The interquartile range runs from 78,446.5 to 97,515.5 ms, capturing normal lap time variation across different drivers and circuits.
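
The effect of the filter can be checked with plain arithmetic on the two `describe()` outputs. The sketch below copies the `lap_time_ms` figures from those outputs and compares the mean-median gap (a rough indicator of right skew), the spread, and the interquartile range; the variable names are illustrative only.

```python
# lap_time_ms figures copied from the describe() output before filtering
before = {"mean": 93303.52, "median": 87566.0, "std": 46443.72}

# lap_time_ms figures copied from the describe() output after filtering
after = {
    "mean": 89039.073326,
    "median": 86703.0,
    "std": 13161.967057,
    "q1": 78446.5,
    "q3": 97515.5,
}

# A mean well above the median suggests a long right tail of slow laps.
gap_before = before["mean"] - before["median"]
gap_after = after["mean"] - after["median"]

# Interquartile range of the filtered data.
iqr_after = after["q3"] - after["q1"]

# Share of the original standard deviation that remains after filtering.
std_ratio = after["std"] / before["std"]

print(f"Mean-median gap before: {gap_before:.2f} ms")
print(f"Mean-median gap after:  {gap_after:.2f} ms")
print(f"IQR after filtering:    {iqr_after:.1f} ms")
print(f"Std after / std before: {std_ratio:.2f}")
```

The gap shrinks from about 5.7 s to about 2.3 s, so the distribution is still right-skewed after filtering, only less so.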

In [10]:
# Export the validated lap times dataframe to a new CSV file, called "driver-lap-times-validated.csv"
df_laptimes.to_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times-validated.csv', index=False)

## Validation conclusion
- No null values found.
- No duplicate rows found.
- Column data types are as expected.
- 503 laps with lap times over 2 minutes (120,000 ms) were dropped, leaving 7,719 of the original 8,222.
- Proceed with feature engineering using the newly validated and exported "driver-lap-times-validated.csv".
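
The checks in this notebook could be bundled into one reusable function and run against the exported file before feature engineering. This is only a sketch: `run_sanity_checks` and `MAX_LAP_MS` are hypothetical names, and the sample rows are made up rather than read from the real CSV.

```python
import pandas as pd

MAX_LAP_MS = 120_000  # 2-minute threshold used in this notebook


def run_sanity_checks(df: pd.DataFrame) -> dict:
    """Return a pass/fail flag for each validation check."""
    return {
        "no_nulls": bool(df.isnull().sum().sum() == 0),
        "no_duplicates": bool(df.duplicated().sum() == 0),
        "lap_time_ms_is_int": bool(pd.api.types.is_integer_dtype(df["lap_time_ms"])),
        "lap_times_within_limit": bool((df["lap_time_ms"] <= MAX_LAP_MS).all()),
    }


# Made-up rows for illustration only; not taken from the real dataset.
good_sample = pd.DataFrame({"lap_number": [1, 2, 3], "lap_time_ms": [95000, 91000, 119983]})
bad_sample = pd.DataFrame({"lap_number": [1, 2, 3], "lap_time_ms": [95000, 125000, 91000]})

good_results = run_sanity_checks(good_sample)
bad_results = run_sanity_checks(bad_sample)

print(good_results)
print(bad_results)
```

In practice the function would be called on the DataFrame loaded from "driver-lap-times-validated.csv" rather than on sample data.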