# KPI 3 - Driver Lap Times - Feature Engineering

This notebook focuses on feature engineering to answer 'Strategic Question 3' from the project's research stage.

**Question 3:** 
>*How does the standard deviation of lap times for each Williams driver during a race compare to their teammate across a season, and what interventions can improve consistency?*

**KPI 3:**
>*Driver Lap Time Consistency Index - Lap time standard deviation per driver, per race.*

**Hypothesis 3:**
>*Rookie or less experienced Williams drivers had significantly higher lap time variance than their teammates during races in the 2015-2019 seasons, suggesting lower in-race consistency due to inexperience or adaptability challenges.*

**Steps required:**
1. Write a general function that groups laps by `df['rookie_or_experienced']` and calculates the standard deviation of `df['lap_time_ms']`.
2. Add a column to each resulting dataframe that converts `lap_time_ms` to `mm:ss.mmm` format.
3. Compare the two dataframes (rookie and experienced) with each other.

Next steps (stage 4): visualisation with filters, and hypothesis testing using `ttest_ind()` and similar methods.
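
As a rough preview of that testing step, the sketch below compares two sets of lap times. It assumes SciPy is available, and the lap times are hypothetical values rather than project data. Because the hypothesis concerns variance, the sketch uses Levene's test for the spread alongside `ttest_ind()` for the means; Levene's test is a suggestion here, not a method named in the research stage.

```python
from scipy import stats

# Hypothetical lap times in ms for two drivers in one race (illustrative values only)
rookie_laps = [90500, 91800, 89900, 93200, 90100, 92700, 89500, 94000]
experienced_laps = [90200, 90400, 90100, 90600, 90300, 90500, 90250, 90450]

# Levene's test (median-centred) compares the spread of the two samples
levene_stat, levene_p = stats.levene(rookie_laps, experienced_laps, center='median')

# Welch's t-test compares the means without assuming equal variances
t_stat, t_p = stats.ttest_ind(rookie_laps, experienced_laps, equal_var=False)

print(f"Levene: W={levene_stat:.3f}, p={levene_p:.4f}")
print(f"Welch t-test: t={t_stat:.3f}, p={t_p:.4f}")
```

A small Levene p-value would suggest the two drivers differ in lap time spread, which is closer to the consistency question than a difference in means.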

In [11]:
import pandas as pd

# load the validated lap time data
df = pd.read_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/driver-lap-times-validated.csv')

## Steps 1, 2, 3: All-in-one function aggregating rookie and experienced lap time data

In [12]:
def get_laptime_consistency(
        df: pd.DataFrame, 
        experience_level: str | None = None,
        year: int | list[int] | None = None,
        gp_name: str | list[str] | None = None,
        verbose: bool = True) -> pd.DataFrame:
    """
    Steps:
    1. Apply optional filters for experience level, year, and GP name.
    2. Drop missing or invalid lap times.
    3. Group by experience level and calculate the mean and standard deviation of lap times in milliseconds.
    4. Count the number of laps for each experience level.
    5. Convert mean and standard deviation lap times from milliseconds to mm:ss.mmm format.
    6. Merge results into a summary DataFrame and rename columns for clarity.

    Arguments:
    df -- DataFrame containing lap time data
    experience_level -- 'rookie' or 'experienced' to filter by experience level (optional)
    year -- Single year or list of years to filter (optional)
    gp_name -- Single GP name or list of GP names to filter (optional)
    verbose -- If True, print filtering information (default: True)

    Return:
    A DataFrame with the mean and standard deviation of lap times (in both ms and mm:ss.mmm format),
    along with lap counts for each experience level, considering optional filters.
    """

    # 1. ---------- filter the data if parameters are provided ---------- 
    if experience_level is not None: 
        df = df[df['rookie_or_experienced'] == experience_level]
        if verbose: 
            print(f"Filtering data for experience level: {experience_level}")
    if year is not None: 
        df = df[df['gp_year'].isin([year] if isinstance(year, int) else year)]
        if verbose: 
            print(f"Filtering data for year(s): {year}")
    if gp_name is not None: 
        df = df[df['gp_name'].isin([gp_name] if isinstance(gp_name, str) else gp_name)]
        if verbose: 
            print(f"Filtering data for GP name(s): {gp_name}")

    # 2. ---------- drop missing or invalid times, just in case ----------
    df = df[df['lap_time_ms'] > 0]

    # 3. ---------- group and calculate statistical metrics ----------
    times_in_ms = df[['rookie_or_experienced', 'lap_time_ms']] # select the relevant columns from df

    # group by experience level and calculate mean and standard deviation lap time
    grouped_by_experience = times_in_ms.groupby('rookie_or_experienced').agg(
        mean_lap_time_ms=('lap_time_ms', 'mean'),
        std_dev_lap_time_ms=('lap_time_ms', 'std')
    ).reset_index()

    # 4. ---------- count laps per experience level for statistical testing ----------
    n_laps = times_in_ms.groupby('rookie_or_experienced').size().reset_index(name='n_laps')

    # merge counts in the main summary dataframe
    grouped_by_experience = pd.merge(grouped_by_experience, n_laps, on='rookie_or_experienced')

    # 5. ---------- convert ms to mm:ss.mmm ----------
    def format_ms(ms: float) -> str:
        # split into whole minutes, whole seconds and milliseconds
        minutes = int(ms // 60000)
        seconds = int((ms % 60000) // 1000)
        millis = int(ms % 1000) # truncated, not rounded
        return f"{minutes:02}:{seconds:02}.{millis:03}"

    grouped_by_experience['mean_lap_time'] = grouped_by_experience['mean_lap_time_ms'].apply(format_ms)
    grouped_by_experience['std_dev_lap_time'] = grouped_by_experience['std_dev_lap_time_ms'].apply(format_ms)

    # 6. ---------- rename columns and return result ----------
    grouped_by_experience = grouped_by_experience.rename(columns={ # rename columns for clarity
        'rookie_or_experienced': 'experience_level',
        'mean_lap_time_ms': 'mean_ms',
        'mean_lap_time': 'mean_formatted',
        'std_dev_lap_time_ms': 'std_dev_ms',
        'std_dev_lap_time': 'std_dev_formatted'
    })
    return grouped_by_experience[['experience_level', 'mean_ms', 'mean_formatted', 'std_dev_ms', 'std_dev_formatted', 'n_laps']]
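
`get_laptime_consistency` pools every lap that passes the filters, so laps from different races end up in one standard deviation. KPI 3 is worded per driver, per race. A minimal sketch of that finer grouping might look like the following. The lap times and driver labels are hypothetical; only the column names `race_id`, `driver_name` and `lap_time_ms` are taken from `df`.

```python
import pandas as pd

# Hypothetical miniature of the lap-time data; column names follow the notebook's df
laps = pd.DataFrame({
    'race_id':     [930, 930, 930, 930, 930, 930, 931, 931, 931, 931],
    'driver_name': ['Driver A', 'Driver A', 'Driver A', 'Driver B', 'Driver B', 'Driver B',
                    'Driver A', 'Driver A', 'Driver B', 'Driver B'],
    'lap_time_ms': [90000, 91000, 92000, 90500, 90500, 90500, 75000, 77000, 76000, 76000],
})

# Standard deviation of lap times per driver, per race (pandas uses the sample std, ddof=1)
per_race = (
    laps.groupby(['race_id', 'driver_name'])['lap_time_ms']
        .agg(std_dev_ms='std', n_laps='size')
        .reset_index()
)

# One row per race, one column per driver, so teammates sit side by side
teammates = per_race.pivot(index='race_id', columns='driver_name', values='std_dev_ms')
print(teammates)
```

Grouping by race first keeps differences in lap length between circuits out of the standard deviation, which may matter when circuits with very different lap times are combined.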

## Testing:

In [13]:
technical_circuits = ['Monaco Grand Prix', 'Hungarian Grand Prix', 'Singapore Grand Prix']
power_circuits = ['Italian Grand Prix', 'Austrian Grand Prix']
balanced_circuits = ['British Grand Prix', 'Belgian Grand Prix', 'Brazilian Grand Prix']

In [14]:
# Filtering high-downforce tracks between 2017 - 2019

print(get_laptime_consistency(df, 
                              year = [2017, 2018, 2019], 
                              gp_name = technical_circuits
                              ))

Filtering data for year(s): [2017, 2018, 2019]
Filtering data for GP name(s): ['Monaco Grand Prix', 'Hungarian Grand Prix', 'Singapore Grand Prix']
  experience_level       mean_ms mean_formatted    std_dev_ms  \
0      experienced  88112.064740      01:28.112  12716.414177   
1           rookie  89246.914962      01:29.246  12283.945506   

  std_dev_formatted  n_laps  
0         00:12.716     865  
1         00:12.283     929  


In [15]:
# Filtering low-downforce power tracks between 2017 - 2019

print(get_laptime_consistency(df, 
                              year = [2017, 2018, 2019], 
                              gp_name = power_circuits
                              )) 

Filtering data for year(s): [2017, 2018, 2019]
Filtering data for GP name(s): ['Italian Grand Prix', 'Austrian Grand Prix']
  experience_level       mean_ms mean_formatted   std_dev_ms  \
0      experienced  77900.437613      01:17.900  8678.220250   
1           rookie  78093.240066      01:18.093  8943.847054   

  std_dev_formatted  n_laps  
0         00:08.678     553  
1         00:08.943     604  


In [16]:
# Filtering balanced tracks between 2017-2019:

print(get_laptime_consistency(df, 
                              year = [2017, 2018, 2019],
                              gp_name = balanced_circuits
                              ))

Filtering data for year(s): [2017, 2018, 2019]
Filtering data for GP name(s): ['British Grand Prix', 'Belgian Grand Prix', 'Brazilian Grand Prix']
  experience_level       mean_ms mean_formatted    std_dev_ms  \
0      experienced  90293.974967      01:30.293  14919.515835   
1           rookie  91366.022251      01:31.366  14857.392865   

  std_dev_formatted  n_laps  
0         00:14.919     759  
1         00:14.857     764  
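
To compare the three circuit groups at a glance, the `std_dev_ms` values printed above can be placed side by side. The sketch below copies those values by hand into a small DataFrame; the `circuit_group` labels and the `rookie_to_experienced` ratio column are names introduced here for illustration.

```python
import pandas as pd

# std_dev_ms values copied from the three printed summaries above (2017-2019)
summary = pd.DataFrame({
    'circuit_group': ['technical', 'technical', 'power', 'power', 'balanced', 'balanced'],
    'experience_level': ['experienced', 'rookie'] * 3,
    'std_dev_ms': [12716.414177, 12283.945506,
                   8678.220250, 8943.847054,
                   14919.515835, 14857.392865],
})

# One row per circuit group, one column per experience level
wide = summary.pivot(index='circuit_group', columns='experience_level', values='std_dev_ms')

# Ratio above 1 means rookie laps were more spread out than experienced laps
wide['rookie_to_experienced'] = wide['rookie'] / wide['experienced']
print(wide.round(3))
```

In this pooled view the ratio is above 1 only for the power circuits, so the direction of the difference is not the same in every group.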


In [17]:
# All years, all tracks
print(get_laptime_consistency(df))

  experience_level       mean_ms mean_formatted    std_dev_ms  \
0      experienced  89687.006121      01:29.687  13318.445767   
1           rookie  87912.203691      01:27.912  12809.785979   

  std_dev_formatted  n_laps  
0         00:13.318    4901  
1         00:12.809    2818  


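Across all years and tracks, experienced drivers have a slower mean lap time than rookies (01:29.687 vs 01:27.912), whereas in all three filtered subsets above they are faster on average. Outliers among the experienced drivers' laps may explain this reversal. The unfiltered view also includes the 2015-2016 seasons and circuits outside the three groups, and it contains far more experienced laps (4901) than rookie laps (2818), so the two groups may not cover the same mix of races.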

This is a point to investigate in stage 4 - visualisation.
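
One possible starting point for that investigation is to flag unusually slow or fast laps within each race before computing the standard deviation. The sketch below uses the common 1.5 x IQR rule; that rule, the `is_outlier` column and the lap times are assumptions for illustration, not part of the project's method.

```python
import pandas as pd

# Hypothetical laps for one race; the 120000 ms lap stands in for a pit stop or safety-car lap
laps = pd.DataFrame({
    'race_id': [930] * 8,
    'lap_time_ms': [90000, 90400, 89800, 90200, 120000, 90100, 89900, 90300],
})

# Quartiles computed within each race and broadcast back to every lap of that race
grouped = laps.groupby('race_id')['lap_time_ms']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag laps outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
laps['is_outlier'] = (laps['lap_time_ms'] < q1 - 1.5 * iqr) | (laps['lap_time_ms'] > q3 + 1.5 * iqr)

# Compare the spread with and without the flagged laps
std_all = laps['lap_time_ms'].std()
std_clean = laps.loc[~laps['is_outlier'], 'lap_time_ms'].std()
print(laps[laps['is_outlier']])
print(f"std with outliers: {std_all:.1f} ms, without: {std_clean:.1f} ms")
```

Whether flagged laps should be dropped or kept is a modelling choice, since pit stops and safety-car periods are part of a real race.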

In [18]:
# print the underlying df that get_laptime_consistency is applied to
# df was loaded from the csv 'driver-lap-times-validated.csv'
# use this to access data for further processing

print(df)

      Unnamed: 0  race_id  gp_year               gp_name  gp_round  driver_id  \
0              0      930     2015    Spanish Grand Prix         5         13   
1              1      930     2015    Spanish Grand Prix         5         13   
2              2      930     2015    Spanish Grand Prix         5         13   
3              3      930     2015    Spanish Grand Prix         5         13   
4              4      930     2015    Spanish Grand Prix         5         13   
...          ...      ...      ...                   ...       ...        ...   
7714        7714     1029     2019  Brazilian Grand Prix        20        822   
7715        7715     1029     2019  Brazilian Grand Prix        20        822   
7716        7716     1029     2019  Brazilian Grand Prix        20        822   
7717        7717     1029     2019  Brazilian Grand Prix        20        822   
7718        7718     1029     2019  Brazilian Grand Prix        20        822   

          driver_name rooki