# Group Assignment: Formula 1 Data

Solve the questions regarding the `f1_data.csv` dataset.

The full grade will be split as follows:
* 60 % notebook, code, and explanations of the questions
* 30 % presentation in class: quality of material, presentation, and Q&A
    * The presentation has to be about one specific driver, constructor, or circuit. Build a data-based history about the driver/circuit/constructor and present it.
    * You can present in any format you want: PPT, PDF, notebook, whatever
* 10 % visualization to support the answers and the presentation

The data is composed of the following variables:
* `car_number`: the number of the car
* `grid_starting_position`: the starting position of the car in the grid
* `final_position`: the position in which that driver ended
* `points`: the points earned by the driver in the race
* `laps`: the number of laps completed by the driver
* `total_race_time_ms`: the total time the driver took to complete the race in milliseconds
* `fastest_lap`: the fastest lap completed by the driver
* `rank`: the rank of the driver in the race
* `fastest_lap_time`: the time taken to complete the fastest lap
* `fastest_lap_speed`: the speed of the fastest lap
* `year`: the year of the race
* `race_number_season`: the number of the race in the season
* `race_name`: the name of the race
* `race_date`: the date of the race
* `race_start_time`: the start time of the race
* `circuit_name`: the name of the circuit where the race took place
* `circuit_location`: the location of the circuit
* `circuit_country`: the country where the circuit is located
* `circuit_lat`: the latitude of the circuit
* `circuit_lng`: the longitude of the circuit
* `circuit_altitude`: the altitude of the circuit
* `driver`: the name of the driver
* `driver_dob`: the date of birth of the driver
* `driver_nationality`: the nationality of the driver
* `constructor_name`: the name of the constructor team
* `constructor_nationality`: the nationality of the constructor team
* `status`: the status of the driver in the race (e.g., Finished, Did Not Finish, etc.)

**SUBMISSION: failing to comply with the submission format will result in a 0 grade**
* ONE (1) SINGLE ZIP FILE containing:
    * The notebook with the code and the answers
    * The presentation itself (PPT, PDF, notebook, whatever)
    * `f1_data.csv`
* The ZIP file should be named as follows: `group_assignment_<group_id>.zip`
  * For example, if you are group 1, the file should be named `group_assignment_1.zip`

### 0. Group Information

* Group ID
* Members:
  * ...

### 1. Basic operations. (1 point)

* Open the dataset as a pandas dataframe and show the first 10 rows
* Show the number of rows and columns
* Show the data types of each column
* Calculate a column called `age` which represents the age of each driver on the date of the race:
    * Hint1: use the `pd.to_datetime` function to convert the date columns to datetime
    * Hint2: use the `driver_dob` and substract from it the `race_date` column
    * Hint3: use the `dt.days` property to convert the result to days, then divide by 365.25 to get the age in years

In [2]:
import pandas as pd

file_path = 'f1_data.csv'
df = pd.read_csv(file_path)

pd.set_option('display.max_columns', None)

print(df.head(10))

   car_number  grid_starting_position  final_position  points  laps  \
0        22.0                       1             1.0    10.0    58   
1         3.0                       5             2.0     8.0    58   
2         7.0                       7             3.0     6.0    58   
3         5.0                      11             4.0     5.0    58   
4        23.0                       3             5.0     4.0    58   
5         8.0                      13             6.0     3.0    57   
6        14.0                      17             7.0     2.0    55   
7         1.0                      15             8.0     1.0    53   
8         4.0                       2             NaN     0.0    47   
9        12.0                      18             NaN     0.0    43   

   total_race_time_ms  fastest_lap  rank fastest_lap_time  fastest_lap_speed  \
0           5690616.0         39.0   2.0         1:27.452            218.300   
1           5696094.0         41.0   3.0         1:27.739 

In [None]:
print(len(df))
print(len(df.columns))

26080
27


In [None]:
df.dtypes

Unnamed: 0,0
car_number,float64
grid_starting_position,int64
final_position,float64
points,float64
laps,int64
total_race_time_ms,float64
fastest_lap,float64
rank,float64
fastest_lap_time,object
fastest_lap_speed,float64


In [None]:
df['driver_dob'] = pd.to_datetime(df['driver_dob'])
df['race_date'] = pd.to_datetime(df['race_date'])

df['age'] = ((df['race_date'] - df['driver_dob']).dt.days) / 365.25

### 2. Why do we have missing values in the `final_position` column? (1 point)

The final values are a result of drivers not actually completing the laps required for the race, and therefore not finishing it.

### 3. Constructor analytics (3 points)

* Which constructor has the most race wins? (0.5 points)
* Which constructor has the most podiums (position 1, 2, or 3)? (0.5 points)
* Which constructor has the biggest probability of not finishing a race, according to the dataset? (0.5 points)
* Which country has the most successful constructors in terms of race victories? (0.5 points)
* Which are the current constructors (from 2023) with the longest history in Formula 1? (0.5 points)
* Which is the constructor with the most drivers in Formula 1 across its history? (0.5 points)

In [None]:
most_wins_constructor = df[df['final_position'] == 1]['constructor_name'].mode()[0]

print(most_wins_constructor)

most_podiums_constructor = df[(df['final_position'] == 1) | (df['final_position'] == 2) | (df['final_position'] == 3)]['constructor_name'].mode()[0]

print(most_podiums_constructor)

most_unfinished_constructor = df[pd.isna(df['final_position'])]['constructor_name'].mode()[0]

print(most_unfinished_constructor)

most_wins_nationality = df[df['final_position'] == 1]['constructor_nationality'].mode()[0]

print(most_wins_nationality)

longest_history_constructors = df['constructor_name'].mode()[0]

print(longest_history_constructors)

constructor_driver_counts = df.groupby('constructor_name')['driver'].nunique()
most_drivers_constructor = constructor_driver_counts.idxmax()

print(most_drivers_constructor)




Ferrari
Ferrari
Ferrari
British
Ferrari
Ferrari


### 4. Driver analytics (3 points)

* With the data available, who is the fastest driver in Formula 1? (0.5 point)
* Which is the driver with the most podiums without a win (position 2 or 3)? (0.5 point)
* Calculate the historical probability of each country of having a driver in the podium (0.5 point)
* Calculate the historical probability of each country of having a driver win a race (0.5 point)
* Which driver was the youngest to win a race? (0.5 point)
* Which drivers are the current ones with the longest history in Formula 1? (0.5 point)

Hint: remember that a probability is calculated as the number of times an event happened divided by the total number of events

In [None]:
average_lap_times_driver = df.groupby('driver')['fastest_lap_speed'].mean()
fastest_driver = average_lap_times_driver.idxmax()

print(fastest_driver)

driver_most_podiums_no_win = df[df['final_position'] == 1]['constructor_nationality'].mode()[0]

podium_drivers = df[df['final_position'].isin([2, 3])]

drivers_with_wins = df[df['final_position'] == 1]['driver'].unique()
podium_without_win = podium_drivers[~podium_drivers['driver'].isin(drivers_with_wins)]

podium_counts = podium_without_win['driver'].value_counts()

most_podiums_driver = podium_counts.idxmax()

print(most_podiums_driver)

podium_finishes = df[df['final_position'].isin([1, 2, 3])]
country_podium_counts = podium_finishes['driver_nationality'].value_counts()
total_podiums = country_podium_counts.sum()

country_probabilities_podium = (country_podium_counts / total_podiums)

print(country_probabilities_podium)

win_finishes= df[df['final_position'] == 1]
country_win_counts = win_finishes['driver_nationality'].value_counts()
total_wins = country_win_counts.sum()

country_probabilities_win = (country_win_counts)/total_wins

print(country_probabilities_win)

win_finishes= df[df['final_position'] == 1]
youngest_winner = win_finishes.loc[win_finishes['age'].idxmin()]

print(youngest_winner['driver'])








Antônio Pizzonia
Nick Heidfeld
driver_nationality
British          0.225865
German           0.125987
French           0.093807
Brazilian        0.088950
Finnish          0.074378
Italian          0.062842
Australian       0.039466
American         0.039162
Spanish          0.036733
Austrian         0.035823
Argentine        0.029751
Dutch            0.027626
New Zealander    0.021554
Belgian          0.013661
Swedish          0.013358
Mexican          0.012143
Canadian         0.011840
South African    0.010929
Swiss            0.010929
Colombian        0.009107
Monegasque       0.008500
Polish           0.003643
Russian          0.001214
Japanese         0.000911
Thai             0.000607
Portuguese       0.000304
Rhodesian        0.000304
Venezuelan       0.000304
Danish           0.000304
Name: count, dtype: float64
driver_nationality
British          0.282450
German           0.163620
Brazilian        0.092322
French           0.074040
Finnish          0.052102
Dutch            0.

### 5. Circuit analytics (2 points)

* Which would you say is the toughest circuit in Formula 1? (0.5 point)
* Which circuit requires the most f1 experience to win? (0.5 point)
* Which circuit and year saw the most number of non-finishers? (0.5 point)
* For each constructor, which is their best circuit in terms of amount of podiums? (0.5 point)

In [31]:
races_per_circuit = (
    df.groupby('circuit_name')['final_position']
    .agg(
        nan_count=lambda x: x.isna().sum(),
        non_nan_count=lambda x: (~x.isna()).sum()
    )
    .assign(ratio=lambda x: x['nan_count'] / x['non_nan_count'])
    .sort_values(by='ratio', ascending=False)
)

races_per_circuit

Unnamed: 0_level_0,nan_count,non_nan_count,ratio
circuit_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fair Park,18,8,2.250000
Phoenix street circuit,72,36,2.000000
Sebring International Raceway,12,7,1.714286
Detroit Street Circuit,119,72,1.652778
Long Beach,137,83,1.650602
...,...,...,...
Yas Marina Circuit,41,255,0.160784
Valencia Street Circuit,13,99,0.131313
Losail International Circuit,2,18,0.111111
Miami International Autodrome,3,37,0.081081


In [None]:
winners = df[df['final_position'] == 1]

average_age_per_race = winners.groupby('race_name')['age'].mean()
average_age_per_race.idxmax()

'Swiss Grand Prix'

In [None]:
non_finishes = df[df['final_position'].isna()]
non_finishes_count = non_finishes.groupby(['year','race_name']).size()
year_race_with_max_non_finishes = non_finishes_count.idxmax()

year_race_with_max_non_finishes

(1989, 'Australian Grand Prix')

In [33]:
podiums = df[(df['final_position'] == 1) | (df['final_position'] == 2 | (df['final_position'] == 3))]

podiums_count = podiums.groupby(['constructor_name', 'circuit_name']).size()

max_podium_row = podiums_count.loc[podiums_count['podium_count'].idxmax()]

constructor = max_podium_row['constructor_name']
race = max_podium_row['circuit_name']
max_count = max_podium_row['podium_count']

print(constructor, race, max_count)



Ferrari Autodromo Nazionale di Monza 19


## Code for the presentation goes here

Build a data-based history about a driver/circuit/constructor and present it using your preferred format. You can use any of the data available in the dataset.

Examples:
* The most successful driver in Formula 1 history
* Why Monaco is the most difficult circuit in Formula 1
* The history of Ferrari in Formula 1
* ...

In [3]:
# Check for inconsistencies in the driver column
df_cleaned = df[['driver']].copy()

# Handle missing values explicitly
df_cleaned = df_cleaned.dropna()

# Ensure the driver column is treated as a string
df_cleaned['driver'] = df_cleaned['driver'].astype(str)

# Calculate the correct row counts
driver_row_counts_corrected = df_cleaned['driver'].value_counts().reset_index()

# Rename columns for clarity
driver_row_counts_corrected.columns = ['driver', 'row_count']

# Sort the results by row_count in descending order
driver_row_counts_corrected_sorted = driver_row_counts_corrected.sort_values(by='row_count', ascending=False).reset_index(drop=True)

# Display the top rows of the corrected counts
print(driver_row_counts_corrected_sorted.head())





               driver  row_count
0     Fernando Alonso        370
1      Kimi Räikkönen        352
2  Rubens Barrichello        326
3      Lewis Hamilton        322
4       Jenson Button        309


In [4]:
df_filtered = df[['driver', 'constructor_name', 'race_name', 'final_position', 'year']].copy()

# Replace NaN in 'final_position' with 24 (for non-finishers)
df_filtered['final_position'] = pd.to_numeric(df_filtered['final_position'], errors='coerce').fillna(24)

# Combine race_name and year to create a unique race identifier
df_filtered['unique_race'] = df_filtered['race_name'].astype(str) + "_" + df_filtered['year'].astype(str)

# Function to compare drivers within a group
def calculate_better_finishes(group):
    # Sort by final position (lower is better)
    group = group.sort_values('final_position').reset_index(drop=True)
    group['better_than_teammate'] = group['final_position'] < group['final_position'].shift(-1)
    return group

# Group by unique_race and constructor_name to compare teammates
df_comparison = df_filtered.groupby(['unique_race', 'constructor_name']).apply(calculate_better_finishes)

# Count the number of races where a driver finished better than their teammate
driver_better_counts = df_comparison.groupby('driver')['better_than_teammate'].sum().reset_index()

# Rename the column for clarity
driver_better_counts.columns = ['driver', 'better_than_teammate_count']

# Order the results by better_than_teammate_count in descending order
driver_better_counts = driver_better_counts.sort_values(by='better_than_teammate_count', ascending=False).reset_index(drop=True)

# Display the ordered results
print(driver_better_counts.head(10))  # Display the top drivers


               driver  better_than_teammate_count
0     Fernando Alonso                         241
1      Lewis Hamilton                         199
2  Michael Schumacher                         193
3    Sebastian Vettel                         178
4      Kimi Räikkönen                         175
5       Jenson Button                         161
6  Rubens Barrichello                         121
7      Max Verstappen                         118
8    Daniel Ricciardo                         115
9        Sergio Pérez                         111


In [5]:
# Merge DataFrames
driver_combined = driver_row_counts_corrected_sorted.merge(driver_better_counts, on='driver', how='inner')

# Calculate Percentage
driver_combined['percentage_above_teammate'] = (
    driver_combined['better_than_teammate_count'] / driver_combined['row_count'] * 100
)

# Sort Results
driver_combined = driver_combined.sort_values(by='percentage_above_teammate', ascending=False).reset_index(drop=True)

# Display or save the final DataFrame
print(driver_combined.head(20))  # Show top drivers


                       driver  row_count  better_than_teammate_count  \
0                Dennis Poore          2                           2   
1   Alberto Rodriguez Larreta          1                           1   
2              Larry Crockett          1                           1   
3                 Don Beauman          1                           1   
4                Chuck Arnold          1                           1   
5               Peter Broeker          1                           1   
6                Oscar Gálvez          1                           1   
7           Jonathan Williams          1                           1   
8                 John Barber          1                           1   
9              Robert La Caze          1                           1   
10               George Amick          1                           1   
11                 Guy Tunmer          1                           1   
12               André Guelfi          1                        

In [6]:
# Correct Driver Counts
df_cleaned = df[['driver']].copy()
df_cleaned = df_cleaned.dropna()
df_cleaned['driver'] = df_cleaned['driver'].astype(str)
driver_row_counts_corrected = df_cleaned['driver'].value_counts().reset_index()
driver_row_counts_corrected.columns = ['driver', 'row_count']
driver_row_counts_corrected_sorted = driver_row_counts_corrected.sort_values(by='row_count', ascending=False).reset_index(drop=True)

# Filter drivers with more than 45 races
drivers_with_min_races_60 = driver_row_counts_corrected_sorted[driver_row_counts_corrected_sorted['row_count'] > 45]

# Teammate Comparison DataFrame
df_filtered = df[['driver', 'constructor_name', 'race_name', 'final_position', 'year']].copy()
df_filtered['final_position'] = pd.to_numeric(df_filtered['final_position'], errors='coerce').fillna(24)
df_filtered['unique_race'] = df_filtered['race_name'].astype(str) + "_" + df_filtered['year'].astype(str)


def calculate_better_finishes(group):
    group = group.sort_values('final_position').reset_index(drop=True)
    group['better_than_teammate'] = group['final_position'] < group['final_position'].shift(-1)
    return group


df_comparison = df_filtered.groupby(['unique_race', 'constructor_name']).apply(calculate_better_finishes)
driver_better_counts = df_comparison.groupby('driver')['better_than_teammate'].sum().reset_index()
driver_better_counts.columns = ['driver', 'better_than_teammate_count']

# Merge DataFrames
driver_combined = drivers_with_min_races_60.merge(driver_better_counts, on='driver', how='inner')

# Calculate Percentage
driver_combined['percentage_above_teammate'] = (
    driver_combined['better_than_teammate_count'] / driver_combined['row_count'] * 100
)

# Sort Results
driver_combined = driver_combined.sort_values(by='percentage_above_teammate', ascending=False).reset_index(drop=True)

# Display or save the final DataFrame
print(driver_combined.head(10))  # Show top drivers


               driver  row_count  better_than_teammate_count  \
0         Juan Fangio         58                          40   
1      Max Verstappen        175                         118   
2     Fernando Alonso        370                         241   
3  Michael Schumacher        308                         193   
4       Mike Hawthorn         48                          30   
5      Lewis Hamilton        322                         199   
6           Jim Clark         73                          45   
7      Richie Ginther         54                          33   
8        Lando Norris         94                          57   
9      George Russell         94                          57   

   percentage_above_teammate  
0                  68.965517  
1                  67.428571  
2                  65.135135  
3                  62.662338  
4                  62.500000  
5                  61.801242  
6                  61.643836  
7                  61.111111  
8               

In [7]:
# Define the points mapping according to the current F1 system
points_mapping = {
    1: 25,
    2: 18,
    3: 15,
    4: 12,
    5: 10,
    6: 8,
    7: 6,
    8: 4,
    9: 2,
    10: 1
}

# Apply the points system to create the corrected_points column
df['corrected_points'] = df['final_position'].map(points_mapping).fillna(0).astype(int)

# Display the updated DataFrame
print(df[['driver', 'final_position', 'corrected_points']].head())

df.head(10)

              driver  final_position  corrected_points
0     Lewis Hamilton             1.0                25
1      Nick Heidfeld             2.0                18
2       Nico Rosberg             3.0                15
3    Fernando Alonso             4.0                12
4  Heikki Kovalainen             5.0                10


Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status,corrected_points
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.3,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished,25
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished,18
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished,15
3,5.0,11,4.0,5.0,58,5707797.0,58.0,7.0,1:28.603,215.464,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Fernando Alonso,1981-07-29,Spanish,Renault,French,Finished,12
4,23.0,3,5.0,4.0,58,5708630.0,43.0,1.0,1:27.418,218.385,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Heikki Kovalainen,1981-10-19,Finnish,McLaren,British,Finished,10
5,8.0,13,6.0,3.0,57,,50.0,14.0,1:29.639,212.974,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Kazuki Nakajima,1985-01-11,Japanese,Williams,British,+1 Lap,8
6,14.0,17,7.0,2.0,55,,22.0,12.0,1:29.534,213.224,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Sébastien Bourdais,1979-02-28,French,Toro Rosso,Italian,Engine,6
7,1.0,15,8.0,1.0,53,,20.0,4.0,1:27.903,217.18,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Kimi Räikkönen,1979-10-17,Finnish,Ferrari,Italian,Engine,4
8,4.0,2,,0.0,47,,15.0,9.0,1:28.753,215.1,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Robert Kubica,1984-12-07,Polish,BMW Sauber,German,Collision,0
9,12.0,18,,0.0,43,,23.0,13.0,1:29.558,213.166,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Timo Glock,1982-03-18,German,Toyota,Japanese,Accident,0


In [8]:
# Group by driver to calculate total corrected points and total races
driver_performance_ratio = df.groupby('driver').agg(
    total_corrected_points=('corrected_points', 'sum'),
    total_races=('driver', 'count')  # Count the rows per driver as total races
).reset_index()

# Filter drivers with more than 45 races
driver_performance_ratio = driver_performance_ratio[driver_performance_ratio['total_races'] > 45]

# Calculate the ratio of corrected points to total races
driver_performance_ratio['ratio_points_per_race'] = (
    driver_performance_ratio['total_corrected_points'] / (driver_performance_ratio['total_races'] * 25)
)

# Sort by points_per_race in descending order
driver_performance_ratio = driver_performance_ratio.sort_values(by='ratio_points_per_race', ascending=False).reset_index(drop=True)

# Display the top rows of the sorted DataFrame
print(driver_performance_ratio.head(30))



                driver  total_corrected_points  total_races  \
0       Lewis Hamilton                    4879          322   
1          Juan Fangio                     873           58   
2       Max Verstappen                    2266          175   
3   Michael Schumacher                    3890          308   
4          Alain Prost                    2483          202   
5         Ayrton Senna                    1881          162   
6            Jim Clark                     839           73   
7       Jackie Stewart                    1109          100   
8     Sebastian Vettel                    3287          300   
9        Mike Hawthorn                     463           48   
10          Damon Hill                    1091          122   
11  Juan Pablo Montoya                     825           95   
12        Nico Rosberg                    1739          206   
13       Stirling Moss                     616           73   
14         Denny Hulme                     940         

In [9]:
# Filter relevant columns and create unique race identifier
df_filtered = df[['driver', 'constructor_name', 'race_name', 'final_position', 'year']].copy()
df_filtered['final_position'] = pd.to_numeric(df_filtered['final_position'], errors='coerce').fillna(24)
df_filtered['unique_race'] = df_filtered['race_name'].astype(str) + "_" + df_filtered['year'].astype(str)

# Apply updated logic to count races where drivers beat all teammates
df_comparison = df_filtered.groupby(['unique_race', 'constructor_name']).apply(calculate_better_finishes)

# Count the number of races where a driver beat all teammates
driver_wins_against_teammates = df_comparison[df_comparison['better_than_all_teammates']].groupby('driver').size().reset_index(name='wins_against_teammates')

#Order the results by wins_against_teammates in descending order
driver_wins_against_teammates = driver_wins_against_teammates.sort_values(by='wins_against_teammates', ascending=False).reset_index(drop=True)

# Calculate the count of times each driver appears in the dataset
driver_appearance_count = df['driver'].value_counts().reset_index()
driver_appearance_count.columns = ['driver', 'total_appearances']

# Filter for drivers with more than 45 appearances
drivers_with_min_races = driver_appearance_count[driver_appearance_count['total_appearances'] > 100]

# Merge the appearance count with the driver_wins_against_teammates DataFrame
driver_wins_against_teammates = driver_wins_against_teammates.merge(
    drivers_with_min_races, on='driver', how='inner'
)

# Calculate the ratio of wins against teammates to total appearances
driver_wins_against_teammates['win_ratio'] = (
    driver_wins_against_teammates['wins_against_teammates'] /
    driver_wins_against_teammates['total_appearances']
)

# Sort by the win_ratio in descending order
driver_wins_against_teammates = driver_wins_against_teammates.sort_values(by='win_ratio', ascending=False).reset_index(drop=True)

# Display the results
print(driver_wins_against_teammates.head(30))

KeyError: 'better_than_all_teammates'

In [37]:
# Merge the two DataFrames on the 'driver' column
combined_df = driver_performance_ratio.merge(
    driver_wins_against_teammates[['driver', 'win_ratio']],
    on='driver',
    how='inner'
)

# Create a new column for the multiplication of ratio_points_per_race and win_ratio
combined_df['combined_metric'] = combined_df['ratio_points_per_race'] * combined_df['win_ratio']

# Sort by the combined_metric in descending order
combined_df = combined_df.sort_values(by='combined_metric', ascending=False).reset_index(drop=True)

# Display the resulting DataFrame
print(combined_df[['driver', 'ratio_points_per_race', 'win_ratio', 'combined_metric']].head(30))


                driver  ratio_points_per_race  win_ratio  combined_metric
0       Lewis Hamilton               0.606087   0.627329         0.380216
1       Max Verstappen               0.517943   0.697143         0.361080
2         Ayrton Senna               0.464444   0.746914         0.346900
3          Alain Prost               0.491683   0.702970         0.345639
4   Michael Schumacher               0.505195   0.675325         0.341171
5     Sebastian Vettel               0.438267   0.626667         0.274647
6        Nelson Piquet               0.324251   0.729469         0.236531
7      Fernando Alonso               0.329081   0.705405         0.232136
8        Mika Häkkinen               0.335030   0.678788         0.227415
9           Damon Hill               0.357705   0.614754         0.219901
10  Emerson Fittipaldi               0.266846   0.805369         0.214909
11      Jody Scheckter               0.317168   0.663717         0.210510
12        Nico Rosberg               0