# Q4 - Driver Lap Times - Data Validation and Sanity Checks

In [2]:
import pandas as pd

# read csv file
df_laptimes = pd.read_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times.csv')

# dataframe basic info
print(df_laptimes.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15982 entries, 0 to 15981
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   race_id                15982 non-null  int64 
 1   gp_year                15982 non-null  int64 
 2   gp_name                15982 non-null  object
 3   gp_round               15982 non-null  int64 
 4   driver_id              15982 non-null  int64 
 5   driver_name            15982 non-null  object
 6   rookie_or_experienced  15982 non-null  object
 7   lap_number             15982 non-null  int64 
 8   lap_time               15982 non-null  object
 9   lap_time_ms            15982 non-null  int64 
dtypes: int64(6), object(4)
memory usage: 1.2+ MB
None


## Summary of processed dataset 'driver-lap-times.csv'

- Filepath: /Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times.csv *(potentially fix from absolute to relative path later?)*
- Range: 15982 entries, 0 to 15981.
- Columns: 10
- Data types: int64(6), object(4) *(objects are strings)*
- Memory usage: 1.2+ MB
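
The note above about moving from an absolute to a relative path could be handled along these lines. This is only a sketch: `find_project_root` and `laptimes_csv` are hypothetical helpers, and it assumes the notebook runs somewhere inside the repository that contains the `processed_data` folder.

```python
from pathlib import Path

DATA_DIR_NAME = "processed_data"  # folder name taken from the filepath above


def find_project_root(start: Path, marker: str = DATA_DIR_NAME) -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def laptimes_csv(start: Path, filename: str = "driver-lap-times.csv") -> Path:
    """Build the CSV path relative to the discovered project root."""
    return find_project_root(start) / DATA_DIR_NAME / filename


# Hypothetical usage inside the notebook:
# df_laptimes = pd.read_csv(laptimes_csv(Path.cwd()))
```

With this, the notebook no longer depends on one user's home directory, as long as it is run from within the project tree.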

## Column data types

In [3]:
print(df_laptimes.dtypes)

race_id                   int64
gp_year                   int64
gp_name                  object
gp_round                  int64
driver_id                 int64
driver_name              object
rookie_or_experienced    object
lap_number                int64
lap_time                 object
lap_time_ms               int64
dtype: object


## Missing or null values

In [4]:
df_laptimes.isnull().sum() # No nulls present across the dataset!

race_id                  0
gp_year                  0
gp_name                  0
gp_round                 0
driver_id                0
driver_name              0
rookie_or_experienced    0
lap_number               0
lap_time                 0
lap_time_ms              0
dtype: int64

## Check for duplicates

In [None]:
df_laptimes.duplicated().sum() # No duplicates found

0

## Summary statistics

In [6]:
df_laptimes.describe()

| statistic | race_id | gp_year | gp_round | driver_id | lap_number | lap_time_ms |
|---|---|---|---|---|---|---|
| count | 15982.0 | 15982.0 | 15982.0 | 15982.0 | 15982.0 | 15982.0 |
| mean | 986.778751 | 2017.40702 | 10.816293 | 604.311538 | 30.252159 | 95997.97 |
| std | 29.73545 | 1.389695 | 5.897613 | 367.468889 | 18.162977 | 41711.24 |
| min | 926.0 | 2015.0 | 1.0 | 9.0 | 1.0 | 67847.0 |
| 25% | 964.0 | 2016.0 | 6.0 | 13.0 | 15.0 | 82555.5 |
| 50% | 991.0 | 2018.0 | 11.0 | 822.0 | 29.0 | 94398.5 |
| 75% | 1013.0 | 2019.0 | 16.0 | 840.0 | 44.0 | 103647.2 |
| max | 1030.0 | 2019.0 | 21.0 | 847.0 | 78.0 | 2118323.0 |


From this we can roughly tell that:
- The dataset contains 15,982 laps recorded across all Williams drivers between 2015 and 2019.
- Mean lap time is ~95,998 ms (~1:36.00), in line with typical F1 race pace, though this depends on circuit length.
- Median lap time is ~94,398 ms (~1:34.40), slightly faster than the mean, indicating a right-skewed distribution (some outlier laps are much slower, possibly due to pit stops or SC/VSC periods).
- Minimum lap time is 67,847 ms (~1:07.85), likely recorded on a short circuit or during a qualifying-style push lap.
- Maximum lap time is 2,118,323 ms (~35 minutes!), clearly an extreme outlier, likely due to a data error or unclean recording (e.g. stuck in the pits, technical failure).
- Standard deviation is ~41,711 ms (~41.7 s), suggesting significant variability, which is expected given that laps are run under normal, SC, pit stop, or mechanical issue conditions.

- Lap numbers range from 1 to 78. A maximum of 78 laps indicates that the Monaco GP is included.
- Median lap number is 29, which is plausible: it sits near the midpoint of a typical race distance.
- GP years span from 2015 to 2019, as expected.
- Median year is 2018, suggesting a larger representation of more recent seasons.
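
The millisecond-to-clock conversions quoted above can be double-checked with a small helper. This is a minimal sketch; `format_lap_time` is a hypothetical name, not something defined elsewhere in this notebook, and the input values are copied from the `describe()` output.

```python
def format_lap_time(ms: float) -> str:
    """Convert a lap time in milliseconds to an m:ss.mmm string."""
    total_ms = round(ms)  # describe() returns floats, so round to whole ms first
    minutes, remainder = divmod(total_ms, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


# Values copied from the lap_time_ms column of the summary statistics
summary_ms = {"mean": 95997.97, "median": 94398.5, "min": 67847, "max": 2118323}
for name, value in summary_ms.items():
    print(f"{name}: {format_lap_time(value)}")
```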

In [None]:
# The ~35-minute maximum shows the data contains implausible laps, so keep only laps of 2 minutes (120,000 ms) or less.
df_laptimes = df_laptimes[df_laptimes['lap_time_ms'] <= 120000]

# Re-check the dataset after dropping invalid lap times
print(df_laptimes.info())

<class 'pandas.core.frame.DataFrame'>
Index: 15073 entries, 3 to 15981
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   race_id                15073 non-null  int64 
 1   gp_year                15073 non-null  int64 
 2   gp_name                15073 non-null  object
 3   gp_round               15073 non-null  int64 
 4   driver_id              15073 non-null  int64 
 5   driver_name            15073 non-null  object
 6   rookie_or_experienced  15073 non-null  object
 7   lap_number             15073 non-null  int64 
 8   lap_time               15073 non-null  object
 9   lap_time_ms            15073 non-null  int64 
dtypes: int64(6), object(4)
memory usage: 1.3+ MB
None


In [12]:
# Check how many observations have been dropped due to invalid lap times
print(f"Total laps before dropping invalid lap times: {15982}")  # Original number of laps
print(f"Total laps after dropping invalid lap times: {len(df_laptimes)}")  # Number of laps after filtering
print(f"Number of laps dropped due to invalid lap times: {15982 - len(df_laptimes)}")  # Calculate the number of laps dropped

Total laps before dropping invalid lap times: 15982
Total laps after dropping invalid lap times: 15073
Number of laps dropped due to invalid lap times: 909
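
The before/after count above relies on the hardcoded value 15982. One way to avoid that, sketched here under the assumption that the filter is applied through a small helper (`drop_slow_laps` is a hypothetical name), is to return the number of dropped rows together with the filtered frame. The toy data below is illustrative, not taken from the real CSV.

```python
import pandas as pd

MAX_LAP_TIME_MS = 120_000  # the 2-minute cutoff used in this notebook


def drop_slow_laps(df: pd.DataFrame, max_ms: int = MAX_LAP_TIME_MS) -> tuple:
    """Return (filtered copy, number of rows dropped) for the lap time cutoff."""
    keep = df["lap_time_ms"] <= max_ms
    # .copy() avoids chained-assignment warnings if columns are added later
    return df.loc[keep].copy(), int((~keep).sum())


# Toy frame for illustration only
toy = pd.DataFrame({"lap_time_ms": [67_847, 94_398, 120_000, 2_118_323]})
kept, n_dropped = drop_slow_laps(toy)
print(f"Kept {len(kept)} laps, dropped {n_dropped}")
```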


In [13]:
df_laptimes.describe()

| statistic | race_id | gp_year | gp_round | driver_id | lap_number | lap_time_ms |
|---|---|---|---|---|---|---|
| count | 15073.0 | 15073.0 | 15073.0 | 15073.0 | 15073.0 | 15073.0 |
| mean | 987.067074 | 2017.422146 | 10.793936 | 605.529556 | 30.913819 | 92293.839448 |
| std | 29.81756 | 1.391867 | 5.92164 | 366.930401 | 18.021379 | 12391.176214 |
| min | 926.0 | 2015.0 | 1.0 | 9.0 | 1.0 | 67847.0 |
| 25% | 964.0 | 2016.0 | 6.0 | 13.0 | 16.0 | 81949.0 |
| 50% | 991.0 | 2018.0 | 11.0 | 822.0 | 30.0 | 92760.0 |
| 75% | 1014.0 | 2019.0 | 16.0 | 840.0 | 45.0 | 101999.0 |
| max | 1030.0 | 2019.0 | 21.0 | 847.0 | 78.0 | 119983.0 |


From the updated summary statistics we can tell that:
- New mean lap time is 92,294 ms (~1:32.29), aligning better with typical midfield lap times.
- New max lap time is 119,983 ms (~1:59.98), just under 2 minutes and still within plausible race conditions, such as a very wet Belgian GP.
- Standard deviation is 12,391 ms (~12.4 s), a marked improvement on the previous ~41.7 s, which was severely skewed by outliers.
- The middle 50% of laps fall between 81,949 ms and 101,999 ms (an IQR of 20,050 ms), capturing normal lap time variation across different drivers and circuits.
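
Because lap times depend on circuit length, a single 120,000 ms cutoff is stricter at long circuits than at short ones. As a possible alternative, not the method used in this notebook, slow laps could be flagged relative to the median lap of the same race. The sketch below assumes a hypothetical helper `flag_slow_laps` and an arbitrary `factor` of 1.3 that would need tuning against the real data; the toy frame is illustrative only.

```python
import pandas as pd


def flag_slow_laps(df: pd.DataFrame, factor: float = 1.3) -> pd.Series:
    """Flag laps slower than `factor` times the median lap time of the same race."""
    race_median = df.groupby("race_id")["lap_time_ms"].transform("median")
    return df["lap_time_ms"] > factor * race_median


# Toy data: race 1 is a short circuit, race 2 a long one
toy = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2, 2],
    "lap_time_ms": [70_000, 71_000, 110_000, 108_000, 109_000, 150_000],
})
toy["is_slow"] = flag_slow_laps(toy)
print(toy)
```

In this toy example the 110,000 ms lap is flagged at the short circuit even though it would pass the fixed 2-minute cutoff, while laps of similar duration at the long circuit are kept.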

In [14]:
# Export the validated lap times dataframe to a new CSV file, called "driver-lap-times-validated.csv"
df_laptimes.to_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times-validated.csv', index=False)

## Validation conclusion
- No null values or duplicate rows found
- Column data types are correct
- 909 laps (about 5.7% of the original 15,982) exceeded the 2-minute cutoff and were dropped.
- Proceed with feature engineering using the newly validated and exported "driver-lap-times-validated.csv"
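
Before feature engineering, the checks from this notebook could be re-run on the exported file in one call. The following is a minimal sketch, assuming a hypothetical helper `validate_lap_times`; the column name `lap_time_ms` and the 120,000 ms cutoff come from the steps above, and the toy frame stands in for the real data.

```python
import pandas as pd


def validate_lap_times(df: pd.DataFrame, max_ms: int = 120_000) -> dict:
    """Return a mapping of check name to True/False for the checks run above."""
    return {
        "no_nulls": not df.isnull().any().any(),
        "no_duplicates": not df.duplicated().any(),
        "lap_time_ms_is_integer": pd.api.types.is_integer_dtype(df["lap_time_ms"]),
        "lap_time_ms_within_cutoff": bool((df["lap_time_ms"] <= max_ms).all()),
    }


# Toy frame for illustration only
clean = pd.DataFrame({"lap_number": [1, 2, 3], "lap_time_ms": [81_949, 92_760, 119_983]})
results = validate_lap_times(clean)
print(results)
```

Any `False` value in the result would indicate that the exported file no longer matches the conclusions listed above.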