# KPI 2.1 - Constructor Pit Stops - Data Validation and Sanity Checks

In [35]:
import pandas as pd

# read csv file
df_pitstops = pd.read_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/constructor-pit-stops.csv')

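# dataframe basic info
# note: info() prints its report directly and returns None, so wrapping it in print() adds the trailing "None" line
print(df_pitstops.info())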

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 667 entries, 0 to 666
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   race_id          667 non-null    int64  
 1   gp_year          667 non-null    int64  
 2   gp_name          667 non-null    object 
 3   gp_round         667 non-null    int64  
 4   driver_id        667 non-null    int64  
 5   driver_name      667 non-null    object 
 6   constructor      667 non-null    object 
 7   constructor_ref  667 non-null    object 
 8   is_williams      667 non-null    bool   
 9   stop_number      667 non-null    int64  
 10  lap_number       667 non-null    int64  
 11  time_of_stop     667 non-null    object 
 12  pit_duration     667 non-null    object 
 13  pit_duration_ms  667 non-null    int64  
 14  pit_duration_s   667 non-null    float64
dtypes: bool(1), float64(1), int64(7), object(6)
memory usage: 73.7+ KB
None


## Summary of processed dataset 'constructor-pit-stops.csv'

- Filepath: /Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/constructor-pit-stops.csv *(potentially fix from absolute to relative path later?)*
- Range: 667 entries, 0 to 666.
- Columns: 15
- Data types: float64(1), int64(7), object(6), bool(1) *(objects are strings)*
- Memory usage: 73.7+ KB
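
On the absolute-path note above, a minimal sketch of building the file path relative to a project root with `pathlib`. The helper name `processed_path` and the assumption that the notebook is run from the repository root are illustrative, not part of the original notebook.

```python
from pathlib import Path


def processed_path(filename, project_root=None):
    """Return <project_root>/processed_data/<filename>.

    project_root defaults to the current working directory, which is an
    assumption about where the notebook is launched from.
    """
    root = Path(project_root) if project_root is not None else Path.cwd()
    return root / "processed_data" / filename


# Hypothetical usage: the path no longer depends on a user-specific home folder.
csv_path = processed_path("constructor-pit-stops.csv", project_root="williams-racing-strategies")
print(csv_path)
```

The resulting path could then be passed to `pd.read_csv` in place of the hard-coded string.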

## Column data types

In [23]:
print(df_pitstops.dtypes)

race_id              int64
gp_year              int64
gp_name             object
gp_round             int64
driver_id            int64
driver_name         object
constructor         object
constructor_ref     object
is_williams           bool
stop_number          int64
lap_number           int64
time_of_stop        object
pit_duration        object
pit_duration_ms      int64
pit_duration_s     float64
dtype: object
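
Since `pit_duration` is stored as a string while `pit_duration_ms` and `pit_duration_s` are numeric, one possible extra sanity check is that the three columns agree with each other. The sketch below uses a small illustrative frame (the first row mirrors the `16:38.468` / `998468` values shown later in this notebook; the second row is made up), and `duration_to_ms` is a hypothetical helper, not something from the original code.

```python
import pandas as pd


def duration_to_ms(text):
    """Convert 'SS.mmm' or 'M:SS.mmm' style strings to integer milliseconds."""
    total_seconds = 0.0
    for part in str(text).split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        total_seconds = total_seconds * 60 + float(part)
    return int(round(total_seconds * 1000))


# Illustrative sample, not the real dataset.
sample = pd.DataFrame({
    "pit_duration": ["16:38.468", "24.198"],
    "pit_duration_ms": [998468, 24198],
    "pit_duration_s": [998.468, 24.198],
})

parsed_ms = sample["pit_duration"].map(duration_to_ms)
text_matches_ms = parsed_ms.eq(sample["pit_duration_ms"])
ms_matches_s = (sample["pit_duration_ms"] / 1000 - sample["pit_duration_s"]).abs() < 1e-6

print(text_matches_ms.all(), ms_matches_s.all())
```

On the real data, the same two boolean Series could be computed on `df_pitstops` and any `False` rows inspected.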


## Missing or null values

In [24]:
df_pitstops.isnull().sum() # No nulls present across the dataset!

race_id            0
gp_year            0
gp_name            0
gp_round           0
driver_id          0
driver_name        0
constructor        0
constructor_ref    0
is_williams        0
stop_number        0
lap_number         0
time_of_stop       0
pit_duration       0
pit_duration_ms    0
pit_duration_s     0
dtype: int64

## Check for duplicates

In [25]:
df_pitstops.duplicated().sum() # no duplicates found

0
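
`duplicated()` with no arguments only catches rows that are identical in every column. A stricter check, sketched below on illustrative data, is to look for repeats of what is presumably the natural key of a pit stop. Treating (`race_id`, `driver_id`, `stop_number`) as that key is my assumption, not something the dataset documents.

```python
import pandas as pd

# Illustrative rows only; the column names follow the dataset above.
toy = pd.DataFrame({
    "race_id": [936, 936, 936, 960],
    "driver_id": [815, 815, 13, 807],
    "stop_number": [4, 5, 4, 2],
})

# Assumed natural key: one row per driver, per race, per stop number.
key = ["race_id", "driver_id", "stop_number"]

# keep=False marks every member of a duplicated group, not just the repeats.
key_duplicates = toy[toy.duplicated(subset=key, keep=False)]
print(len(key_duplicates))

# Appending a copy of the first row shows what a violation would look like.
with_dup = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
n_flagged = int(with_dup.duplicated(subset=key, keep=False).sum())
print(n_flagged)
```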

## Summary statistics

In [26]:
df_pitstops.describe()

| | race_id | gp_year | gp_round | driver_id | stop_number | lap_number | pit_duration_ms | pit_duration_s |
|---|---|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 978.916042 | 2016.967016 | 11.994003 | 648.922039 | 1.776612 | 24.731634 | 66517.67 | 66.517667 |
| std | 27.891287 | 1.302155 | 4.478507 | 320.777242 | 1.013212 | 15.239379 | 250868.8 | 250.868779 |
| min | 930.0 | 2015.0 | 5.0 | 9.0 | 1.0 | 1.0 | 16224.0 | 16.224 |
| 25% | 957.0 | 2016.0 | 9.0 | 807.0 | 1.0 | 13.0 | 22576.0 | 22.576 |
| 50% | 979.0 | 2017.0 | 12.0 | 815.0 | 1.0 | 24.0 | 24198.0 | 24.198 |
| 75% | 1001.0 | 2018.0 | 15.0 | 832.0 | 2.0 | 34.5 | 28773.5 | 28.7735 |
| max | 1029.0 | 2019.0 | 20.0 | 847.0 | 6.0 | 72.0 | 2011147.0 | 2011.147 |


From this we can roughly tell that:
- Most pit stops occur early or mid-race. The median lap number is 24, and the middle 50% of stops fall between laps 13 and 34.5.
- Mean pit duration is 66.5 s. This is inflated by extreme outliers (the standard deviation is 250.9 s).
- Median pit duration is 24.20 s, which is far more realistic.
- The middle 50% of pit stops fall between 22.58 s (25th percentile) and 28.77 s (75th percentile).

- The max duration of 2011.147 s, or about 33.5 minutes, is clearly abnormal. It is likely a retired car or an incorrectly logged time, and should be flagged or removed.
- A max stop number of 6 is unusual, as most cars pit only 1-3 times. It could signal a chaotic race, multiple penalties, or heavy tyre degradation.
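
To put numbers on how far the maximum sits from the bulk of the data, here is a short worked calculation using the quartiles from the summary table. The 1.5 x IQR upper fence is a common rule of thumb that I am adding for comparison; it is not a threshold used elsewhere in this notebook.

```python
# Quartiles and extremes of pit_duration_s, copied from the describe() table.
q1, q3 = 22.576, 28.7735
mean_s, median_s = 66.517667, 24.198
max_stop_s = 2011.147

iqr = q3 - q1                    # 6.1975 s
upper_fence = q3 + 1.5 * iqr     # about 38.07 s

max_stop_min = max_stop_s / 60   # about 33.5 minutes
mean_to_median = mean_s / median_s

print(f"IQR: {iqr:.4f} s, upper fence: {upper_fence:.2f} s")
print(f"Max stop: {max_stop_min:.1f} min, mean/median ratio: {mean_to_median:.2f}")
```

The mean lies well above even the upper fence, which is consistent with a handful of very long stops dominating it.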

## Drop, or flag outliers?

- This dataset will be used heavily alongside Fast-F1 to understand safety car and VSC periods.
- For that reason, I'm inclined to flag rather than drop extreme pit durations of over 90 s and abnormal strategies of more than 3 stops.
- Keeping these rows will help analyse context like safety cars, weather, and chaotic race conditions (e.g. Germany 2019).

In [28]:
# flag any pit stop that lasts longer than 90 seconds as a long stop
df_pitstops['long_stop_flag'] = df_pitstops['pit_duration_s'] > 90
# flag a driver's 4th or later stop in a race (earlier stops by the same driver are not flagged)
df_pitstops['multi_stops_flag'] = df_pitstops['stop_number'] > 3
# flag a row as chaotic if it is either a long stop or a 4th-or-later stop (this is per row, not per race)
df_pitstops['chaotic_race_flag'] = df_pitstops['long_stop_flag'] | df_pitstops['multi_stops_flag']
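
Because the flags above are set per pit stop, `chaotic_race_flag` only marks the offending rows, not every stop in the same race. If a race-level indicator is wanted later, one option is to broadcast the flag across each `race_id`. The sketch below uses illustrative rows, and the column name `race_has_chaotic_stop` is hypothetical.

```python
import pandas as pd

# Illustrative rows only; thresholds match the flags defined above.
toy = pd.DataFrame({
    "race_id": [936, 936, 960, 979],
    "pit_duration_s": [16.958, 24.2, 998.468, 23.1],
    "stop_number": [4, 1, 2, 1],
})

toy["long_stop_flag"] = toy["pit_duration_s"] > 90
toy["multi_stops_flag"] = toy["stop_number"] > 3
toy["chaotic_race_flag"] = toy["long_stop_flag"] | toy["multi_stops_flag"]

# True for every row of a race that contains at least one flagged stop.
toy["race_has_chaotic_stop"] = (
    toy.groupby("race_id")["chaotic_race_flag"].transform("any")
)

print(toy[["race_id", "chaotic_race_flag", "race_has_chaotic_stop"]])
```

Here the second row (race 936, first stop) is not flagged on its own but picks up the race-level indicator from the 4th stop in the same race.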

In [29]:
df_pitstops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 667 entries, 0 to 666
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   race_id            667 non-null    int64  
 1   gp_year            667 non-null    int64  
 2   gp_name            667 non-null    object 
 3   gp_round           667 non-null    int64  
 4   driver_id          667 non-null    int64  
 5   driver_name        667 non-null    object 
 6   constructor        667 non-null    object 
 7   constructor_ref    667 non-null    object 
 8   is_williams        667 non-null    bool   
 9   stop_number        667 non-null    int64  
 10  lap_number         667 non-null    int64  
 11  time_of_stop       667 non-null    object 
 12  pit_duration       667 non-null    object 
 13  pit_duration_ms    667 non-null    int64  
 14  pit_duration_s     667 non-null    float64
 15  long_stop_flag     667 non-null    bool   
 16  multi_stops_flag   667 non

In [30]:
# Access observations with long pit stops
long_stops = df_pitstops[df_pitstops['long_stop_flag']]
print(long_stops.head())

     race_id  gp_year             gp_name  gp_round  driver_id  \
192      960     2016  Belgian Grand Prix        13        807   
193      960     2016  Belgian Grand Prix        13        815   
198      960     2016  Belgian Grand Prix        13        154   
199      960     2016  Belgian Grand Prix        13        821   
203      960     2016  Belgian Grand Prix        13        835   

           driver_name   constructor constructor_ref  is_williams  \
192    Nico Hülkenberg   Force India     force_india        False   
193       Sergio Pérez   Force India     force_india        False   
198    Romain Grosjean  Haas F1 Team            haas        False   
199  Esteban Gutiérrez  Haas F1 Team            haas        False   
203      Jolyon Palmer       Renault         renault        False   

     stop_number  lap_number time_of_stop pit_duration  pit_duration_ms  \
192            2           9     14:24:31    16:38.468           998468   
193            2           9     14:24

In [31]:
# Access observations with multi stops
multi_stops = df_pitstops[df_pitstops['multi_stops_flag']]
print(multi_stops.head())

    race_id  gp_year               gp_name  gp_round  driver_id  \
36      936     2015  Hungarian Grand Prix        10        815   
37      936     2015  Hungarian Grand Prix        10        815   
44      936     2015  Hungarian Grand Prix        10         13   
45      936     2015  Hungarian Grand Prix        10        822   
46      936     2015  Hungarian Grand Prix        10        822   

        driver_name  constructor constructor_ref  is_williams  stop_number  \
36     Sergio Pérez  Force India     force_india        False            4   
37     Sergio Pérez  Force India     force_india        False            5   
44     Felipe Massa     Williams        williams         True            4   
45  Valtteri Bottas     Williams        williams         True            4   
46  Valtteri Bottas     Williams        williams         True            5   

    lap_number time_of_stop pit_duration  pit_duration_ms  pit_duration_s  \
36          44     15:14:51       16.958           

In [32]:
# Access observations that are both long stops AND multi stops
# (chaotic_race_flag is long OR multi, so including it in the AND is redundant)
chaotic_stops = df_pitstops[df_pitstops['long_stop_flag'] & df_pitstops['multi_stops_flag'] & df_pitstops['chaotic_race_flag']]
print(chaotic_stops.head()) # empty: no stop is both a long stop and a 4th-or-later stop

Empty DataFrame
Columns: [race_id, gp_year, gp_name, gp_round, driver_id, driver_name, constructor, constructor_ref, is_williams, stop_number, lap_number, time_of_stop, pit_duration, pit_duration_ms, pit_duration_s, long_stop_flag, multi_stops_flag, chaotic_race_flag]
Index: []
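
The empty result means the two flags never coincide on the same row. A compact way to see how the flags overlap, sketched here on illustrative data rather than the real dataset, is to count the intersection and union directly or to cross-tabulate them with `pd.crosstab`.

```python
import pandas as pd

# Illustrative flags only: one long stop, one 4th-or-later stop, no overlap.
toy = pd.DataFrame({
    "long_stop_flag": [False, False, True, False],
    "multi_stops_flag": [True, False, False, False],
})
toy["chaotic_race_flag"] = toy["long_stop_flag"] | toy["multi_stops_flag"]

both = int((toy["long_stop_flag"] & toy["multi_stops_flag"]).sum())
either = int(toy["chaotic_race_flag"].sum())

# ANDing with chaotic_race_flag cannot change the result, since it is the OR
# of the other two flags.
triple = toy["long_stop_flag"] & toy["multi_stops_flag"] & toy["chaotic_race_flag"]
same_as_pair = triple.equals(toy["long_stop_flag"] & toy["multi_stops_flag"])

# Contingency table of the two flags.
overlap_table = pd.crosstab(toy["long_stop_flag"], toy["multi_stops_flag"])
print(overlap_table)
print(both, either, same_as_pair)
```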


In [34]:
df_pitstops.to_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/constructor-pit-stops-validated.csv', index=False)
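
As a final check on the export, it may be worth confirming that the new boolean flag columns survive a CSV round trip. The sketch below writes an illustrative frame to an in-memory buffer instead of the real output file, so no path is assumed.

```python
import io

import pandas as pd

# Illustrative frame standing in for the validated dataset.
toy = pd.DataFrame({
    "race_id": [936, 960],
    "pit_duration_s": [16.958, 998.468],
    "long_stop_flag": [False, True],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# The True/False strings should be parsed back into a boolean column.
print(reloaded.dtypes)
print(list(reloaded.columns) == list(toy.columns), reloaded.shape == toy.shape)
```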

## Validation conclusion
- No null values or duplicate rows found
- Column data types are as expected
- Long stops (over 90 s) and 4th-or-later stops are flagged in two new columns; `chaotic_race_flag` marks rows where either flag is set.
- Proceed with feature engineering using the validated CSV data in 'constructor-pit-stops-validated.csv'