# KPI 1 - Grid-to-Finish Delta - Feature Engineering

This notebook focuses on feature engineering to respond to 'Strategic Question 1' from the project's research stage. 

**Question 1:** 
>*Which circuits saw Williams lose the most positions from race start to finish during the 2015-2019 F1 seasons, and what track characteristics explain these losses to inform targeted setup and strategy adjustments?*

**KPI 1:**
>*Grid-to-Finish Position Delta - Average positions gained/lost from the start to end of race.*

**Hypothesis 1:**
>*Williams' grid-to-finish position delta was significantly worse at high-downforce, technical circuits between 2015-2019 compared to midfield rivals, likely due to cornering limitations in car performance that reduced overtaking and defending capabilities.*

**Steps required:**

Calculate the delta between the grid position and the finish position for each driver.
This can be scaled up to constructors by aggregating the deltas of all drivers who raced for that constructor.

1. Retrieve driver-level delta first, found in `df_results['grid_delta']`.
2. Aggregate the deltas to get the constructor-level delta: filter on `df['is_williams'] == True` for Williams drivers only, 
    or filter/group by `df['constructor_ref']` for all drivers in a constructor.
3. Calculate the average delta for Williams drivers and rival constructors on all tracks.

The rest of Hypothesis 1 concerns high-downforce tracks: *Monaco, Singapore and the Hungaroring*, three of the ten selected circuits.

4. Apply constructor-level deltas, but filter for high-downforce tracks by checking whether `gp_name` falls into a predefined list, 
    `['Monaco Grand Prix', 'Singapore Grand Prix', 'Hungarian Grand Prix']`.
5. Calculate the average delta for Williams and rival constructors on these tracks.

Next steps, in stage 4 - hypothesis testing using `ttest_ind()` and similar methods.
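
As a rough preview of that later stage, the sketch below runs `scipy.stats.ttest_ind` on two small groups of deltas. The numbers are invented purely for illustration and are not taken from the project data, and the choice of Welch's variant (`equal_var=False`) is an assumption rather than a decision made in this notebook.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy data, invented for illustration only (not from the project dataset).
toy = pd.DataFrame({
    "is_williams": [True, True, True, True, False, False, False, False],
    "grid_delta":  [-2, -1, -3, 0, 1, 2, 0, 3],
})

# Split the deltas into the two groups being compared.
williams = toy.loc[toy["is_williams"], "grid_delta"]
rivals = toy.loc[~toy["is_williams"], "grid_delta"]

# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = ttest_ind(williams, rivals, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A negative t statistic here would mean the first group (Williams) has the lower mean delta, i.e. gained fewer or lost more places than the rivals in the toy sample.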

In [2]:
import pandas as pd
df = pd.read_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/grid-to-finish-validated.csv') # load the data

## Step 1 - retrieve driver-level delta

In [3]:
def get_driver_level_delta(df: pd.DataFrame) -> pd.DataFrame:
    """
    Retrieve the grid-to-finish delta for each driver.
    Add a column 'gained_or_lost' to indicate if the driver gained or lost positions.
    Add a column 'num_places' to indicate places gained/lost - this is the absolute value of the delta.

    Arguments:
    df (pd.DataFrame): The dataframe containing the grid-to-finish data.

    Returns:
    pd.DataFrame: A dataframe containing the driver name, GP year, GP name, 
    grid delta, gained or lost status, and number of places.
    """
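    # note: df is modified in place, so these two columns also appear on the caller's dataframe
    # a delta of 0 (no change) is labelled 'gained' by this rule
    df['gained_or_lost'] = df['grid_delta'].apply(lambda x: 'lost' if x < 0 else 'gained')
    df['num_places'] = df['grid_delta'].abs()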

    return df[['driver_name', 'gp_year', 'gp_name', 'grid_delta', 'gained_or_lost', 'num_places']]


print(get_driver_level_delta(df)) # test code

         driver_name  gp_year               gp_name  grid_delta  \
0     George Russell     2019  Brazilian Grand Prix           6   
1     George Russell     2019    British Grand Prix           5   
2     George Russell     2019    Italian Grand Prix           0   
3     George Russell     2019     Monaco Grand Prix           4   
4      Robert Kubica     2019    British Grand Prix           5   
..               ...      ...                   ...         ...   
296  Valtteri Bottas     2016    Italian Grand Prix          -1   
297  Valtteri Bottas     2016    Belgian Grand Prix           0   
298  Valtteri Bottas     2015    Belgian Grand Prix          -6   
299  Valtteri Bottas     2016   Austrian Grand Prix          -2   
300  Valtteri Bottas     2016  Hungarian Grand Prix           1   

    gained_or_lost  num_places  
0           gained           6  
1           gained           5  
2           gained           0  
3           gained           4  
4           gained           5

In `grid_delta`, 
- `+` means a driver or constructor gained grid places.
- `-` means a driver or constructor lost grid places.
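
The preview rows printed later in this notebook (start 19, finish 15, delta 4; start 15, finish 16, delta -1) suggest that `grid_delta` equals `start_position - final_position`. The sketch below checks that reading on those two rows; how the column was actually built upstream is an assumption here, and `grid_delta_check` is a hypothetical column name.

```python
import pandas as pd

# Two rows copied from the high-downforce preview in this notebook.
rows = pd.DataFrame({
    "driver_name": ["George Russell", "George Russell"],
    "gp_name": ["Monaco Grand Prix", "Hungarian Grand Prix"],
    "start_position": [19, 15],
    "final_position": [15, 16],
    "grid_delta": [4, -1],
})

# Assumed definition: a lower finishing number than starting number is a gain.
rows["grid_delta_check"] = rows["start_position"] - rows["final_position"]

# True if the assumed definition reproduces the stored column on these rows.
matches = (rows["grid_delta_check"] == rows["grid_delta"]).all()
print(rows[["gp_name", "grid_delta", "grid_delta_check"]])
print("matches:", matches)
```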

## Step 2 - aggregate driver-level deltas for constructor-level delta

- Create a function with a `constructor_ref` parameter, defaulting to `'williams'`.
- Passing any other constructor reference selects that constructor instead.

In [4]:
# Get all unique constructor reference names from the dataframe.
unique_constructors = df['constructor_ref'].unique().tolist()
print(unique_constructors)

['williams', 'renault', 'haas', 'force_india', 'racing_point']


In [5]:
def get_constructor_level_delta(df: pd.DataFrame, constructor_ref: str = 'williams') -> pd.DataFrame:
    """
    Filter for one constructor, then group by gp_year and gp_name and calculate the mean of grid_delta for each race.
    
    Arguments:
    df (pd.DataFrame): The dataframe containing the grid-to-finish data.
    constructor_ref (str): The constructor reference name to filter by. Default is 'williams'.

    Returns:
    pd.DataFrame: A dataframe containing the average grid-to-finish delta per race for the specified constructor.
    Columns are 'constructor_ref', 'gp_year', 'gp_name', 'avg_grid_delta', 'gained_or_lost', 'num_places'.
    """

    # filter the dataframe for the specified constructor
    df_constructor = df[df['constructor_ref'] == constructor_ref]

    # group by gp_year and gp_name, and calculate the mean of grid_delta
    df_constructor_grouped = df_constructor.groupby(['gp_year', 'gp_name']).agg(
        avg_grid_delta=('grid_delta', 'mean')
    ).reset_index()

    # add a column for the constructor reference name
    df_constructor_grouped['constructor_ref'] = constructor_ref

    # flag whether, on average, the constructor gained or lost places at each race
    df_constructor_grouped['gained_or_lost'] = df_constructor_grouped['avg_grid_delta'].apply(lambda x: 'lost' if x < 0 else 'gained')

    # specify the number of places gained or lost
    df_constructor_grouped['num_places'] = df_constructor_grouped['avg_grid_delta'].abs()

    return df_constructor_grouped[['constructor_ref', 'gp_year', 'gp_name', 'avg_grid_delta', 'gained_or_lost', 'num_places']]

In [6]:
print(get_constructor_level_delta(df).head()) # test on williams, it being the default constructor for the function

  constructor_ref  gp_year               gp_name  avg_grid_delta  \
0        williams     2015   Austrian Grand Prix             1.0   
1        williams     2015    Belgian Grand Prix            -3.0   
2        williams     2015  Brazilian Grand Prix             2.0   
3        williams     2015    British Grand Prix            -1.0   
4        williams     2015  Hungarian Grand Prix            -5.5   

  gained_or_lost  num_places  
0         gained         1.0  
1           lost         3.0  
2         gained         2.0  
3           lost         1.0  
4           lost         5.5  


In [7]:
# test on all constructors, previewing the head of each separate constructor DataFrame
for constructor in unique_constructors:
    print(f'Constructor: {constructor}')
    print(get_constructor_level_delta(df, constructor_ref=constructor).head(5))
    print('\n') # add a newline for readability

Constructor: williams
  constructor_ref  gp_year               gp_name  avg_grid_delta  \
0        williams     2015   Austrian Grand Prix             1.0   
1        williams     2015    Belgian Grand Prix            -3.0   
2        williams     2015  Brazilian Grand Prix             2.0   
3        williams     2015    British Grand Prix            -1.0   
4        williams     2015  Hungarian Grand Prix            -5.5   

  gained_or_lost  num_places  
0         gained         1.0  
1           lost         3.0  
2         gained         2.0  
3           lost         1.0  
4           lost         5.5  


Constructor: renault
  constructor_ref  gp_year               gp_name  avg_grid_delta  \
0         renault     2016   Austrian Grand Prix             5.0   
1         renault     2016    Belgian Grand Prix            -2.0   
2         renault     2016  Brazilian Grand Prix             4.0   
3         renault     2016    British Grand Prix            -1.0   
4         renault   

## Step 3 - Calculate average delta for Williams and rivals

In [8]:
def get_average_delta_all_tracks(df: pd.DataFrame) -> pd.DataFrame: 
    """
    Using the function get_constructor_level_delta, which calculates the average grid-to-finish delta for a constructor and returns a dataframe,
    this function will calculate the average grid-to-finish delta for all constructors on all tracks.

    We need to loop through a list of midfield constructors, and apply the get_constructor_level_delta function to each constructor.
    Then we will extract and concatenate the results into a single dataframe.

    Arguments:
    df (pd.DataFrame): The dataframe containing the grid-to-finish data.

    Returns:
    pd.DataFrame: A dataframe containing the average grid-to-finish delta for all constructors on all tracks.
    Columns are 'constructor_ref', 'avg_grid_delta_year'
    (despite its name, avg_grid_delta_year here is the average of avg_grid_delta for each constructor
    across all tracks and all seasons in the data. 
    It is negative when positions were lost on average, positive when they were gained)
    """

    # collect one row per constructor, then build the dataframe once
    # (avoids repeatedly concatenating onto an empty dataframe)
    results = []

    # get all constructor reference names from the dataframe
    constructor_refs = df['constructor_ref'].unique().tolist()

    for constructor in constructor_refs:
        df_constructor = get_constructor_level_delta(df, constructor) # gets constructor level deltas by race
        avg_grid_delta_year = df_constructor['avg_grid_delta'].mean() # mean of the per-race averages across all seasons and tracks
        results.append({'constructor_ref': constructor, 'avg_grid_delta_year': avg_grid_delta_year})

    return pd.DataFrame(results, columns=['constructor_ref', 'avg_grid_delta_year'])


print(get_average_delta_all_tracks(df)) # test the function

  constructor_ref  avg_grid_delta_year
0        williams             1.050000
1         renault             1.710526
2            haas            -0.223684
3     force_india             1.089744
4    racing_point             2.650000




- The issue with the above function is that it averages across all five seasons from 2015 to 2019.
- The function needs a filter by year, in the form of a parameter.
- Stakeholders want to see the average delta for each constructor by year, not across all years.
- A per-year breakdown will also help with data viz and hypothesis testing.
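
For comparison, the same per-year breakdown can be sketched without a loop by grouping twice. This is only an illustration on invented toy rows, not the approach this notebook uses; the column names mirror the notebook's, but the data and the variable names `per_race` and `per_year` are hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "constructor_ref": ["williams", "williams", "williams", "haas", "haas", "haas"],
    "gp_year":         [2016, 2016, 2017, 2016, 2016, 2016],
    "gp_name":         ["Monaco Grand Prix", "Monaco Grand Prix", "Monaco Grand Prix",
                        "Monaco Grand Prix", "Italian Grand Prix", "Italian Grand Prix"],
    "grid_delta":      [2, -4, 3, 1, 2, 4],
})

# Stage 1: average the drivers within each race.
per_race = (
    toy.groupby(["constructor_ref", "gp_year", "gp_name"], as_index=False)
       .agg(avg_grid_delta=("grid_delta", "mean"))
)

# Stage 2: average those per-race means within each season.
per_year = (
    per_race.groupby(["constructor_ref", "gp_year"], as_index=False)
            .agg(avg_grid_delta_year=("avg_grid_delta", "mean"))
            .sort_values(["gp_year", "avg_grid_delta_year"], ascending=[True, False])
            .reset_index(drop=True)
)
print(per_year)
```

Averaging in two stages gives each race equal weight. In the toy rows, haas has one finisher at Monaco and two at Monza, so the mean of its per-race means (2.0) differs from the plain mean of its three driver deltas (about 2.33).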

In [9]:
def get_average_constructor_delta_by_year(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """
    An improved version of the above function, now providing a breakdown of average grid delta by year.

    Arguments:
    df (pd.DataFrame): The dataframe containing the grid-to-finish data.
    year (int): The year to view average constructor deltas.

    Returns:
    pd.DataFrame: A dataframe containing the average grid-to-finish delta for all constructors on all tracks.
    Columns are 'constructor_ref', 'year', 'avg_grid_delta_year'
    (avg_grid_delta_year is the average of avg_grid_delta for each constructor across all tracks. 
    This is negative, indicating lost positions, or positive, indicating gained ones)
    """

    results = [] # list to store each row's dict - passed into pd.DataFrame on function return

    # get a list of constructor reference names from the main dataframe
    constructor_refs = df['constructor_ref'].unique().tolist()

    for constructor in constructor_refs:
        df_constructor = get_constructor_level_delta(df, constructor)  # gets constructor level deltas by race
        df_constructor_year = df_constructor[df_constructor["gp_year"] == year]  # filter for race deltas of the year specified
        avg_grid_delta_year = df_constructor_year["avg_grid_delta"].mean() # get the mean delta throughout the entire year

        if not pd.isna(avg_grid_delta_year): # skip constructors with no races in this year - not every constructor raced in every season 2015-2019
            results.append({
                "constructor_ref": constructor,
                "year": year,
                "avg_grid_delta_year": avg_grid_delta_year
            })

    return pd.DataFrame(results).sort_values(by = 'avg_grid_delta_year', ascending=False).reset_index(drop=True) # sort by avg delta, and reset index


# view average constructor deltas between 2015-2019
for year in range(2015, 2020):
    print(f"Year: {year}")
    print(get_average_constructor_delta_by_year(df, year))
    print("\n") # new line for better legibility

Year: 2015
  constructor_ref  year  avg_grid_delta_year
0     force_india  2015             1.611111
1        williams  2015            -0.850000


Year: 2016
  constructor_ref  year  avg_grid_delta_year
0         renault  2016             2.722222
1     force_india  2016             1.050000
2            haas  2016             0.611111
3        williams  2016             0.350000


Year: 2017
  constructor_ref  year  avg_grid_delta_year
0        williams  2017                 2.50
1         renault  2017                 1.65
2            haas  2017                 1.15
3     force_india  2017                 0.05


Year: 2018
  constructor_ref  year  avg_grid_delta_year
0        williams  2018                  2.0
1         renault  2018                  1.7
2     force_india  2018                  1.7
3            haas  2018                 -0.6


Year: 2019
  constructor_ref  year  avg_grid_delta_year
0    racing_point  2019             2.650000
1        williams  2019             1

## Steps 4 & 5 - filter for high-downforce tracks 

In [10]:
df_high_downforce = df[df['gp_name'].isin(['Monaco Grand Prix', 'Singapore Grand Prix', 'Hungarian Grand Prix'])].reset_index(drop=True)

df_high_downforce.to_csv('/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/delta-high-downforce.csv') # for observation

print(df_high_downforce.head())

   race_id  gp_year               gp_name  gp_round     driver_name  \
0     1015     2019     Monaco Grand Prix         6  George Russell   
1     1021     2019  Hungarian Grand Prix        12  George Russell   
2     1024     2019  Singapore Grand Prix        15   Robert Kubica   
3     1015     2019     Monaco Grand Prix         6   Robert Kubica   
4     1021     2019  Hungarian Grand Prix        12   Robert Kubica   

  constructor constructor_ref  is_williams  start_position  final_position  \
0    Williams        williams         True              19              15   
1    Williams        williams         True              15              16   
2    Williams        williams         True              19              16   
3    Williams        williams         True              20              18   
4    Williams        williams         True              19              19   

   grid_delta gained_or_lost  num_places  
0           4         gained           4  
1          -1     

In [11]:
# quick sanity checks before analysis
# 1. checking team-level row counts:
print(df_high_downforce.groupby("constructor")["grid_delta"].count().sort_values(ascending=False))
print("\n")

# 2. per-circuit distribution - checking for over-representation
print(df_high_downforce['gp_name'].value_counts())

constructor
Williams        26
Haas F1 Team    20
Renault         20
Force India     19
Racing Point     5
Name: grid_delta, dtype: int64


gp_name
Monaco Grand Prix       33
Hungarian Grand Prix    31
Singapore Grand Prix    26
Name: count, dtype: int64
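
Because the per-year averages on only three circuits rest on very few rows, it may also be worth counting rows per constructor and season. A minimal sketch with `pd.crosstab` on invented toy rows (not the project data) follows; `counts` is a hypothetical variable name.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "constructor": ["Williams", "Williams", "Williams", "Renault", "Renault"],
    "gp_year":     [2015, 2015, 2016, 2016, 2016],
})

# Rows = constructors, columns = seasons, cells = number of race entries.
counts = pd.crosstab(toy["constructor"], toy["gp_year"])
print(counts)
```

A zero in this table marks a season in which the constructor has no rows, which is the case the NaN check in `get_average_constructor_delta_by_year` guards against.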


In [12]:
# Focusing on the three high-downforce, technical tracks, 
# run get_average_constructor_delta_by_year function
# on df_high_downforce

print("Constructor-level grid-to-finish position delta, by year, on high-downforce & technical tracks.")
print("2015-2019. Monaco, Singapore, Hungarian GPs. Williams, Renault, Haas, Racing Point/Force India")
print("Note: 'avg_grid_delta_year'")
print("'+' means a constructor, on average, gained positions in-race compared to their starting position.")
print("'-' means a constructor typically lost positions compared to their starting position.\n")
for year in range(2015, 2020):
    print(f"Year: {year}")
    print(get_average_constructor_delta_by_year(df_high_downforce, year))
    print("\n") # new line for better legibility

Constructor-level grid-to-finish position delta, by year, on high-downforce & technical tracks.
2015-2019. Monaco, Singapore, Hungarian GPs. Williams, Renault, Haas, Racing Point/Force India
Note: 'avg_grid_delta_year'
'+' means a constructor, on average, gained positions in-race compared to their starting position.
'-' means a constructor typically lost positions compared to their starting position.

Year: 2015
  constructor_ref  year  avg_grid_delta_year
0     force_india  2015             3.000000
1        williams  2015            -1.333333


Year: 2016
  constructor_ref  year  avg_grid_delta_year
0         renault  2016             4.250000
1     force_india  2016             3.666667
2            haas  2016             1.000000
3        williams  2016             0.166667


Year: 2017
  constructor_ref  year  avg_grid_delta_year
0        williams  2017             4.833333
1            haas  2017             2.833333
2     force_india  2017             2.500000
3         renault 

In [13]:
# Return the full dataframe - all circuits

print(df)

     race_id  gp_year               gp_name  gp_round      driver_name  \
0       1029     2019  Brazilian Grand Prix        20   George Russell   
1       1019     2019    British Grand Prix        10   George Russell   
2       1023     2019    Italian Grand Prix        14   George Russell   
3       1015     2019     Monaco Grand Prix         6   George Russell   
4       1019     2019    British Grand Prix        10    Robert Kubica   
..       ...      ...                   ...       ...              ...   
296      961     2016    Italian Grand Prix        14  Valtteri Bottas   
297      960     2016    Belgian Grand Prix        13  Valtteri Bottas   
298      937     2015    Belgian Grand Prix        11  Valtteri Bottas   
299      956     2016   Austrian Grand Prix         9  Valtteri Bottas   
300      958     2016  Hungarian Grand Prix        11  Valtteri Bottas   

    constructor constructor_ref  is_williams  start_position  final_position  \
0      Williams        williams

In [19]:
df.to_csv("/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/delta-all-circuits.csv")