# Group Assignment: Formula 1 Data

Answer the questions below using the `f1_data.csv` dataset.

The full grade will be split as follows:
* 60 % notebook, code, and explanations of the questions
* 30 % presentation in class: quality of material, presentation, and Q&A
    * The presentation has to be about one specific driver, constructor, or circuit. Build a data-based history about the driver/circuit/constructor and present it.
    * You can present in any format you want: PPT, PDF, notebook, whatever
* 10 % visualization to support the answers and the presentation

The data is composed of the following variables:
* `car_number`: the number of the car
* `grid_starting_position`: the starting position of the car in the grid
* `final_position`: the position in which that driver ended
* `points`: the points earned by the driver in the race
* `laps`: the number of laps completed by the driver
* `total_race_time_ms`: the total time the driver took to complete the race in milliseconds
* `fastest_lap`: the lap number on which the driver set their fastest lap
* `rank`: the rank of the driver's fastest lap time among all drivers in that race
* `fastest_lap_time`: the time taken to complete the fastest lap
* `fastest_lap_speed`: the speed of the fastest lap
* `year`: the year of the race
* `race_number_season`: the number of the race in the season
* `race_name`: the name of the race
* `race_date`: the date of the race
* `race_start_time`: the start time of the race
* `circuit_name`: the name of the circuit where the race took place
* `circuit_location`: the location of the circuit
* `circuit_country`: the country where the circuit is located
* `circuit_lat`: the latitude of the circuit
* `circuit_lng`: the longitude of the circuit
* `circuit_altitude`: the altitude of the circuit
* `driver`: the name of the driver
* `driver_dob`: the date of birth of the driver
* `driver_nationality`: the nationality of the driver
* `constructor_name`: the name of the constructor team
* `constructor_nationality`: the nationality of the constructor team
* `status`: the status of the driver in the race (e.g., Finished, Did Not Finish, etc.)

**SUBMISSION: failing to comply with the submission format will result in a 0 grade**
* ONE (1) SINGLE ZIP FILE containing:
    * The notebook with the code and the answers
    * The presentation itself (PPT, PDF, notebook, whatever)
    * `f1_data.csv`
* The ZIP file should be named as follows: `group_assignment_<group_id>.zip`
  * For example, if you are group 1, the file should be named `group_assignment_1.zip`
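
The packaging rules above can be scripted so the archive name and contents are always consistent. The following is a minimal sketch, not a required tool: `build_submission` is a hypothetical helper, and the notebook and presentation file names are placeholders to be replaced with the real ones.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission(group_id, files, out_dir="."):
    """Bundle the given files into group_assignment_<group_id>.zip."""
    zip_path = Path(out_dir) / f"group_assignment_{group_id}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for file in map(Path, files):
            # Store each file at the top level of the archive, without folders
            zf.write(file, arcname=file.name)
    return zip_path


# Demo in a temporary folder with empty placeholder files (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    placeholders = ["group_assignment.ipynb", "presentation.pdf", "f1_data.csv"]
    paths = [Path(tmp) / name for name in placeholders]
    for path in paths:
        path.touch()
    archive = build_submission(1, paths, out_dir=tmp)
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        archive_members = sorted(zf.namelist())

print(archive_name, archive_members)
```

For a real submission, call `build_submission(<group_id>, [...])` with the paths of the actual notebook, presentation, and `f1_data.csv`.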

### 0. Group Information

* Group ID
* Members:
  * ...

### 1. Basic operations. (1 point)

* Open the dataset as a pandas dataframe and show the first 10 rows
* Show the number of rows and columns
* Show the data types of each column
* Calculate a column called `age` which represents the age of each driver on the date of the race:
    * Hint1: use the `pd.to_datetime` function to convert the date columns to datetime
    * Hint2: subtract the `driver_dob` column from the `race_date` column
    * Hint3: use the `dt.days` property to convert the result to days, then divide by 365.25 to get the age in years

In [2]:
# Open the dataset as a pandas dataframe and show the first 10 rows
import pandas as pd

df = pd.read_csv('f1_data.csv')
pd.set_option('display.max_columns', None)

df.head(10)

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.3,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished
3,5.0,11,4.0,5.0,58,5707797.0,58.0,7.0,1:28.603,215.464,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Fernando Alonso,1981-07-29,Spanish,Renault,French,Finished
4,23.0,3,5.0,4.0,58,5708630.0,43.0,1.0,1:27.418,218.385,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Heikki Kovalainen,1981-10-19,Finnish,McLaren,British,Finished
5,8.0,13,6.0,3.0,57,,50.0,14.0,1:29.639,212.974,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Kazuki Nakajima,1985-01-11,Japanese,Williams,British,+1 Lap
6,14.0,17,7.0,2.0,55,,22.0,12.0,1:29.534,213.224,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Sébastien Bourdais,1979-02-28,French,Toro Rosso,Italian,Engine
7,1.0,15,8.0,1.0,53,,20.0,4.0,1:27.903,217.18,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Kimi Räikkönen,1979-10-17,Finnish,Ferrari,Italian,Engine
8,4.0,2,,0.0,47,,15.0,9.0,1:28.753,215.1,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Robert Kubica,1984-12-07,Polish,BMW Sauber,German,Collision
9,12.0,18,,0.0,43,,23.0,13.0,1:29.558,213.166,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Timo Glock,1982-03-18,German,Toyota,Japanese,Accident


In [3]:
# Show the number of rows and columns
print(df.shape)

(26080, 27)


In [4]:
# Show the data types of each column
print(df.dtypes)


car_number                 float64
grid_starting_position       int64
final_position             float64
points                     float64
laps                         int64
total_race_time_ms         float64
fastest_lap                float64
rank                       float64
fastest_lap_time            object
fastest_lap_speed          float64
year                         int64
race_number_season           int64
race_name                   object
race_date                   object
race_start_time             object
circuit_name                object
circuit_location            object
circuit_country             object
circuit_lat                float64
circuit_lng                float64
circuit_altitude           float64
driver                      object
driver_dob                  object
driver_nationality          object
constructor_name            object
constructor_nationality     object
status                      object
dtype: object


In [7]:
# Calculate a column called `age` which represents the age of each driver on the date of the race

df['driver_dob'] = pd.to_datetime(df['driver_dob'])
df['race_date'] = pd.to_datetime(df['race_date'])

df['age'] = (df['race_date'] - df['driver_dob']).dt.days / 365.25

df[['age']].head(5)

         age
0  23.186858
1  30.850103
2  22.718686
3  26.631075
4  26.406571


### 2. Why do we have missing values in the `final_position` column? (1 point)

In [22]:
df[df['final_position'].isna()].sample(3)

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status,age
24515,88.0,18,,0.0,28,,24.0,19.0,1:42.327,205.74,2019,16,Russian Grand Prix,2019-09-29,11:10:00,Sochi Autodrom,Sochi,Russia,43.4057,39.9578,2.0,Robert Kubica,1984-12-07,Polish,Williams,British,Brakes,34.809035
12283,20.0,21,,0.0,16,,,,,,1980,11,Dutch Grand Prix,1980-08-31,,Circuit Park Zandvoort,Zandvoort,Netherlands,52.3888,4.54092,6.0,Emerson Fittipaldi,1946-12-12,Brazilian,Fittipaldi,Brazilian,Brakes,33.71937
24115,27.0,16,,0.0,37,,32.0,17.0,1:34.934,220.207,2018,17,Japanese Grand Prix,2018-10-07,05:10:00,Suzuka Circuit,Suzuka,Japan,34.8431,136.541,45.0,Nico Hülkenberg,1987-08-19,German,Renault,French,Engine,31.134839


In [11]:
# Inspect the status of every row that has no final position
print(list(df[df['final_position'].isna()]['status'].value_counts().index)) 
print(df[df['final_position'].isna()]['status'].value_counts())

# Rows with NaN in `final_position` still have values in other columns (laps, status, ...),
# so the position is missing because the driver was never classified, not because the record is incomplete.
# Every status that occurs with a NaN final position describes a driver who did not finish,
# did not qualify, was not classified, or was disqualified.

['Engine', 'Did not qualify', 'Accident', 'Collision', 'Gearbox', 'Spun off', 'Suspension', 'Did not prequalify', 'Transmission', 'Electrical', 'Withdrew', 'Brakes', 'Clutch', 'Not classified', 'Disqualified', 'Turbo', 'Fuel system', 'Hydraulics', 'Overheating', 'Oil leak', 'Ignition', 'Throttle', 'Halfshaft', 'Retired', 'Wheel', 'Oil pressure', 'Fuel pump', 'Differential', 'Handling', 'Tyre', 'Fuel leak', 'Steering', 'Collision damage', 'Radiator', 'Power Unit', 'Puncture', 'Wheel bearing', 'Injection', 'Water leak', 'Physical', 'Fuel pressure', 'Chassis', 'Exhaust', 'Mechanical', 'Alternator', 'Driveshaft', 'Out of fuel', 'Magneto', 'Axle', 'Heat shield fire', 'Battery', 'Power loss', 'Oil pump', 'Distributor', 'Injury', 'Oil pipe', 'Driver unwell', 'Electronics', 'Broken wing', '107% Rule', 'Excluded', 'Wheel nut', 'Rear wing', 'Water pump', 'Injured', 'Vibrations', 'Water pressure', 'Supercharger', 'Front wing', '+11 Laps', 'Pneumatics', 'ERS', 'Fuel', '+9 Laps', 'Spark plugs', 'Fa
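
The reading above can be checked directly instead of by eye: if a missing position always means "not classified", no row should combine a NaN `final_position` with the status `Finished`. The sketch below uses a few made-up rows in place of `f1_data.csv`. Only `Finished` is treated as a clean finish, because the printed list shows that lapped statuses such as `+11 Laps` can also occur with a missing position.

```python
import pandas as pd

# Toy rows (made up for illustration) with the two relevant columns
toy = pd.DataFrame({
    "final_position": [1.0, 2.0, 6.0, None, None, None],
    "status": ["Finished", "Finished", "+1 Lap",
               "Engine", "Disqualified", "Did not qualify"],
})

missing = toy["final_position"].isna()
finished = toy["status"].eq("Finished")

# Status counts split by whether the final position is missing
summary = (
    toy.assign(position_missing=missing)
    .groupby("position_missing")["status"]
    .value_counts()
)
print(summary)

# Rows that would contradict the hypothesis: no position, yet status "Finished"
n_missing = int(missing.sum())
n_exceptions = int((missing & finished).sum())
print(f"{n_exceptions} of {n_missing} rows without a position have status 'Finished'")
```

On the real data, replace `toy` with `df`; a non-zero exception count would mean the explanation needs refining.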

### 3. Constructor analytics (3 points)

* Which constructor has the most race wins? (0.5 points)
* Which constructor has the most podiums (position 1, 2, or 3)? (0.5 points)
* Which constructor has the biggest probability of not finishing a race, according to the dataset? (0.5 points)
* Which country has the most successful constructors in terms of race victories? (0.5 points)
* Which are the current constructors (from 2023) with the longest history in Formula 1? (0.5 points)
* Which is the constructor with the most drivers in Formula 1 across its history? (0.5 points)

In [21]:
# Race wins per constructor
df_rank = df[df['final_position'] == 1]
print(df_rank.groupby('constructor_name')['final_position'].count().idxmax())

# Podiums (positions 1, 2, or 3) per constructor
df_positions = df[df['final_position'].isin([1, 2, 3])]
print(df_positions.groupby('constructor_name')['final_position'].count().idxmax())

# Share of entries with a classified finish per constructor. idxmax() returns the constructor
# that finishes most often; the probability of not finishing is the complement, x.isna().mean().
print(df.groupby('constructor_name')['final_position'].apply(lambda x: x.notna().sum() / x.size).idxmax())

# Race wins per constructor nationality
print(df_rank.groupby('constructor_nationality')['final_position'].count().idxmax())

#df_year = df[df['year'] <= 2023]
#df[df['race_date'] == df_year.groupby('constructor_name')['race_date'].min().min()]['constructor_name'] 

# Constructor with the most distinct drivers
most_drivers = df.groupby('constructor_name')['driver'].nunique().idxmax()

print(f'The constructor with the most distinct drivers is {most_drivers}')

Ferrari
Ferrari
Behra-Porsche
British
The constructor with the most distinct drivers is Ferrari
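
Two of the constructor questions benefit from a more explicit formulation: the probability of not finishing, and the current constructors with the longest history. The sketch below runs on made-up rows and is only one possible reading. `MIN_ENTRIES` is a hypothetical threshold that keeps constructors with a handful of entries from dominating the result, and "current" is assumed to mean "has at least one entry in 2023".

```python
import pandas as pd

# Toy data (made up): constructors A and B race in 2023, C only raced in 1960
toy = pd.DataFrame({
    "constructor_name": ["A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "year":             [1950, 1950, 2023, 2023, 2010, 2010, 2023, 2023, 1960],
    "final_position":   [1.0, None, 2.0, 3.0, None, None, None, 4.0, 5.0],
})

MIN_ENTRIES = 4  # hypothetical cut-off, tune it for the real dataset

# Probability of not finishing = entries without a final position / all entries
stats = toy.groupby("constructor_name")["final_position"].agg(
    entries="size",
    dnf_prob=lambda s: s.isna().mean(),
)
eligible = stats[stats["entries"] >= MIN_ENTRIES]
least_reliable = eligible["dnf_prob"].idxmax()
print(stats)
print("Highest DNF probability:", least_reliable)

# Current constructors (entered in 2023), ordered by their first season
current = toy.loc[toy["year"] == 2023, "constructor_name"].unique()
first_year = (
    toy[toy["constructor_name"].isin(current)]
    .groupby("constructor_name")["year"]
    .min()
    .sort_values()
)
print(first_year)
```

The first season is taken over the full history, not only over 2023 rows, which is why the filter on current constructors is applied before the `groupby` rather than to the year column.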


### 4. Driver analytics (3 points)

* With the data available, who is the fastest driver in Formula 1? (0.5 points)
* Which driver has the most podiums without a win (position 2 or 3)? (0.5 points)
* Calculate the historical probability of each country having a driver on the podium (0.5 points)
* Calculate the historical probability of each country having a driver win a race (0.5 points)
* Which driver was the youngest to win a race? (0.5 points)
* Which current drivers have the longest history in Formula 1? (0.5 points)

Hint: remember that a probability is calculated as the number of times an event happened divided by the total number of events
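
The hint leaves open what counts as one "event", and the answer changes with that choice. The sketch below, on made-up rows, contrasts two plausible readings: the share of a country's race entries that ended on the podium, and the share of races in which at least one driver from that country stood on the podium. Treating a race as a unique `(year, race_number_season)` pair is an assumption.

```python
import pandas as pd

# Toy data (made up): two races with four entries each
toy = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "race_number_season": [1] * 8,
    "driver_nationality": ["British", "British", "German", "Finnish",
                           "British", "German", "German", "Finnish"],
    "final_position": [1.0, 2.0, 3.0, 4.0, 4.0, 1.0, 2.0, 3.0],
})
toy["podium"] = toy["final_position"].isin([1, 2, 3])

# Reading 1: podium finishes / entries, per nationality
per_entry = toy.groupby("driver_nationality")["podium"].mean()

# Reading 2: races with at least one podium / all races, per nationality
race_keys = ["year", "race_number_season"]
n_races = toy[race_keys].drop_duplicates().shape[0]
per_race = (
    toy[toy["podium"]]
    .drop_duplicates(race_keys + ["driver_nationality"])
    .groupby("driver_nationality")
    .size()
    .div(n_races)
    .reindex(per_entry.index, fill_value=0.0)  # countries with no podium get 0
)

print(pd.DataFrame({"per_entry": per_entry, "per_race": per_race}))
```

Replacing `isin([1, 2, 3])` with `eq(1)` gives the same two readings for race wins.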

In [34]:
# `fastest_lap` is the lap number on which the fastest lap was set (e.g. 39.0), not a lap time,
# so the minimum below picks the earliest such lap per circuit and driver.
driver_circuit_fastest_lap = (
    df.groupby(['circuit_name', 'driver'])['fastest_lap'].min()
    .reset_index().dropna()
    .sort_values(by = 'fastest_lap', ascending = True)
)
print(driver_circuit_fastest_lap.head())

# Row with the smallest `fastest_lap` value on each circuit
fastest_lap_per_circuit = driver_circuit_fastest_lap.loc[driver_circuit_fastest_lap.groupby('circuit_name')['fastest_lap'].idxmin()]
print(fastest_lap_per_circuit.head(3))

# Driver who appears most often in that per-circuit table
print(fastest_lap_per_circuit['driver'].value_counts().idxmax())

# Second and third places only; this counts every such finish, including those of drivers who also have wins
df_podiums = df[df['final_position'].isin([2, 3])]
df_podiums.head()

print(df_podiums.groupby('driver')['final_position'].size().idxmax())

# Share of each nationality's entries that ended on the podium
print(df.groupby('driver_nationality')['final_position'].apply(lambda x: x.isin([1, 2, 3]).sum() / x.size).sort_values(ascending = False))

                    circuit_name                driver  fastest_lap
1923           Baku City Circuit          Esteban Ocon          2.0
3489           Circuit de Monaco     Zsolt Baumgartner          2.0
3457           Circuit de Monaco           Takuma Sato          2.0
1371  Autódromo José Carlos Pace  Giancarlo Fisichella          2.0
1407  Autódromo José Carlos Pace         Jenson Button          2.0
                             circuit_name           driver  fastest_lap
267        Albert Park Grand Prix Circuit   Lewis Hamilton          2.0
406         Autodromo Enzo e Dino Ferrari     Jarno Trulli          4.0
531  Autodromo Internazionale del Mugello  Kevin Magnussen          4.0
Jarno Trulli
Lewis Hamilton
driver_nationality
Argentine            0.262735
Colombian            0.240000
Finnish              0.211389
Monegasque           0.194444
Dutch                0.181275
New Zealander        0.179293
German               0.173205
South African        0.171429
Austrian         
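
Two of the questions in this cell can be approached differently. In the sample rows, `fastest_lap` holds values such as 39.0 next to lap times such as `1:27.452`, which suggests it is a lap number, so a speed comparison probably needs `fastest_lap_time` instead. And "most podiums without a win" reads as a question about drivers who never won. The sketch below illustrates both ideas on made-up rows. Comparing lap times only within the same race, and identifying a race by `(year, race_number_season)`, are assumptions.

```python
import pandas as pd

# Toy data (made up): three races, three drivers
toy = pd.DataFrame({
    "year": [2000] * 3 + [2001] * 3 + [2002] * 3,
    "race_number_season": [1] * 9,
    "driver": ["Ann", "Bob", "Cy"] * 3,
    "final_position": [1.0, 2.0, 3.0, 2.0, 3.0, None, 2.0, 3.0, 1.0],
    "fastest_lap_time": ["1:27.452", "1:27.100", "1:28.000",
                         "1:30.000", None, "1:29.500",
                         "1:26.900", "1:26.950", "1:27.000"],
})

# Convert "m:ss.fff" strings to seconds; missing times stay NaN
parts = toy["fastest_lap_time"].str.split(":", expand=True)
toy["fastest_lap_s"] = parts[0].astype(float) * 60 + parts[1].astype(float)

# Driver with the quickest lap in each race, then how often each driver achieved it
timed = toy.dropna(subset=["fastest_lap_s"])
idx = timed.groupby(["year", "race_number_season"])["fastest_lap_s"].idxmin()
fastest_per_race = toy.loc[idx.to_numpy(), "driver"]
fastest_counts = fastest_per_race.value_counts()
print(fastest_counts)

# Podiums (2nd or 3rd) of drivers who never won a race
winners = set(toy.loc[toy["final_position"] == 1, "driver"])
no_win_podiums = toy[toy["final_position"].isin([2, 3]) & ~toy["driver"].isin(winners)]
counts = no_win_podiums.groupby("driver").size()
best_without_win = counts.idxmax()
print(best_without_win, counts[best_without_win])
```

Counting fastest laps per race avoids comparing lap times across circuits of different lengths; `fastest_lap_speed` would be another candidate metric.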

In [35]:
print(df.groupby('driver_nationality')['final_position'].apply(lambda x: (x == 1).sum() / x.size).sort_values(ascending = False))

driver_nationality
Argentine            0.101877
Dutch                0.089641
German               0.074708
British              0.069376
Austrian             0.059420
Colombian            0.056000
Brazilian            0.051715
Australian           0.051497
Finnish              0.049180
South African        0.047619
Spanish              0.039007
Canadian             0.036638
Monegasque           0.034722
New Zealander        0.030303
French               0.026750
American             0.025562
Swedish              0.023529
Belgian              0.018613
Mexican              0.018433
Swiss                0.014113
Italian              0.012580
Polish               0.010101
Venezuelan           0.008333
Irish                0.000000
Rhodesian            0.000000
Uruguayan            0.000000
Thai                 0.000000
Argentine-Italian    0.000000
Chilean              0.000000
Chinese              0.000000
Czech                0.000000
Russian              0.000000
Portuguese           

In [38]:
df_wins = df[df['final_position'] == 1]

df_wins[['driver', 'final_position', 'age', 'year']].loc[df_wins['age'].idxmin()]

driver            Max Verstappen
final_position               1.0
age                    18.622861
year                        2016
Name: 23000, dtype: object

In [130]:
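# `year <= 2023` keeps every season in the dataset, so this returns the first race of all drivers, not only the current ones
df_2023 = df[df['year'] <= 2023]

df_2023.groupby('driver')['race_date'].min().sort_values(ascending = True)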

driver
Joe Kelly            1950-05-13
Reg Parnell          1950-05-13
Philippe Étancelin   1950-05-13
Peter Walker         1950-05-13
Nino Farina          1950-05-13
                        ...    
Nikita Mazepin       2021-03-28
Guanyu Zhou          2022-03-20
Nyck de Vries        2022-09-11
Oscar Piastri        2023-03-05
Logan Sargeant       2023-03-05
Name: race_date, Length: 857, dtype: datetime64[ns]
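
The listing above covers all 857 drivers. To restrict it to current drivers, one option is to first select the drivers with an entry in 2023 and then look up their first race over the full history. This is a sketch on made-up rows; defining "current" as "raced in 2023" is an assumption.

```python
import pandas as pd

# Toy data (made up): one retired driver and three who raced in 2023
toy = pd.DataFrame({
    "driver": ["Old Hand", "Old Hand", "Rookie", "Retired", "Retired",
               "Veteran", "Veteran"],
    "race_date": pd.to_datetime([
        "2001-03-04", "2023-03-05", "2023-03-05", "1990-03-11", "2005-03-06",
        "2007-03-18", "2023-03-05",
    ]),
})
toy["year"] = toy["race_date"].dt.year

# Step 1: who is current? (assumed: at least one entry in 2023)
current_drivers = toy.loc[toy["year"] == 2023, "driver"].unique()

# Step 2: first race of those drivers over the whole history
debut = (
    toy[toy["driver"].isin(current_drivers)]
    .groupby("driver")["race_date"]
    .min()
    .sort_values()
)
print(debut)
```

The earliest dates at the top of `debut` belong to the current drivers with the longest history.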

### 5. Circuit analytics (2 points)

* Which would you say is the toughest circuit in Formula 1? (0.5 points)
* Which circuit requires the most F1 experience to win? (0.5 points)
* Which circuit and year saw the highest number of non-finishers? (0.5 points)
* For each constructor, which is their best circuit in terms of number of podiums? (0.5 points)

In [50]:
df.head(5)
df['status'].value_counts()

df_not_finished = df[df['final_position'].isna()]

df_not_finished.groupby('circuit_name')['status'].size().reset_index().sort_values(by = 'status', ascending = False) #Maureen has percentages

                          circuit_name  status
25                   Circuit de Monaco     854
7         Autodromo Nazionale di Monza     816
68                 Silverstone Circuit     550
28        Circuit de Spa-Francorchamps     478
20           Circuit Gilles Villeneuve     454
..                                 ...     ...
62     Riverside International Raceway       7
51               Monsanto Park Circuit       6
50       Miami International Autodrome       3
10  Autódromo Internacional do Algarve       2


In [53]:
df_wins.groupby('circuit_name')['age'].mean().idxmax() # Use Maureen

'Circuit Bremgarten'

In [56]:
no_finish = df[df['final_position'].isna()]

no_finish.groupby(['circuit_name', 'year'])['final_position'].size().reset_index().sort_values(by = 'final_position', ascending = False)


                  circuit_name  year  final_position
5      Adelaide Street Circuit  1989              31
314  Circuit Gilles Villeneuve  1989              31
830     Phoenix street circuit  1989              30
994             Suzuka Circuit  1989              29
463          Circuit de Monaco  1990              28
..                         ...   ...             ...
679                Hungaroring  2019               1
676                Hungaroring  2016               1
572    Circuit of the Americas  2013               1
869              Red Bull Ring  2023               1


In [63]:
df_podiums = df[df['final_position'].isin([1, 2, 3])]

constructor_circuit = df_podiums.groupby(['constructor_name', 'circuit_name'])['final_position'].size().reset_index(name = 'podium_count')
constructor_circuit
constructor_circuit.loc[constructor_circuit.groupby('constructor_name')['podium_count'].idxmax()].sort_values(by = 'podium_count', ascending = False)

     constructor_name                   circuit_name  podium_count
201           Ferrari   Autodromo Nazionale di Monza            73
421           McLaren   Autodromo Nazionale di Monza            28
734          Williams                 Hockenheimring            21
301      Kurtis Kraft    Indianapolis Motor Speedway            19
511          Mercedes            Silverstone Circuit            19
..                ...                            ...           ...
192     Eagle-Weslake   Circuit de Spa-Francorchamps             1
189           Dallara  Autodromo Enzo e Dino Ferrari             1
516              Onyx           Autódromo do Estoril             1
517            Penske                   Brands Hatch             1


## Code for the presentation goes here

Build a data-based history about a driver/circuit/constructor and present it using your preferred format. You can use any of the data available in the dataset.

Examples:
* The most successful driver in Formula 1 history
* Why Monaco is the most difficult circuit in Formula 1
* The history of Ferrari in Formula 1
* ...

In [6]:
...
# Rivalry between drivers
# Hardest Circuit (based on year and amount of no finishes)

Ellipsis

Get Story Down / Main Points  
- The Evolution of Circuit Difficulty Over Time
- Identifying the Most Challenging Circuits
- Technological Advancements vs. Circuit Demands
- Case Study: The Monaco Conundrum
- The Human Element
- Environmental and External Factors (did they improve the tracks?)

Exploratory (Proportion of drivers who finished the race on each circuit)

Human Element (Circuits Requiring Most Experience) (Experience = Highest Seasons & Highest Time in F1)

In [None]:
#Need to double check if you should use nunique

driver_career_dates = df.groupby('driver')['race_date'].agg(['min', 'max']).reset_index()
driver_career_dates.rename(columns = {'min': 'first_race', 'max': 'last_race'}, inplace = True)

driver_career_dates['carrer_duration_years'] = (driver_career_dates['last_race'] - driver_career_dates['first_race']).dt.days / 365.25

driver_career_dates['race_count'] = df.groupby('driver').size().reset_index(name = 'race_count')['race_count']

driver_career_dates['avg_race_per_year'] = driver_career_dates['race_count'] / driver_career_dates['carrer_duration_years'].replace(0, 1)

driver_career_dates.sort_values(by = 'avg_race_per_year', ascending = False)

# Keep only drivers whose career lasted at least one year
driver_experience_filtered = driver_career_dates[driver_career_dates['carrer_duration_years'] >= 1]

driver_experience_filtered.sort_values(by = 'avg_race_per_year', ascending = False)

Unnamed: 0,driver,first_race,last_race,carrer_duration_years,race_count,avg_race_per_year
578,Mick Schumacher,2021-03-28,2022-11-20,1.648186,44,26.696013
608,Nicholas Latifi,2020-07-05,2022-11-20,2.376454,61,25.668491
322,Guanyu Zhou,2022-03-20,2023-07-30,1.360712,34,24.986922
850,Yuki Tsunoda,2021-03-28,2023-07-30,2.338125,56,23.950820
464,Jolyon Palmer,2016-03-20,2017-10-08,1.552361,37,23.834656
...,...,...,...,...,...,...
403,Jean-Louis Schlesser,1983-04-17,1988-09-11,5.404517,2,0.370061
830,Vic Wilson,1960-09-04,1966-06-12,5.768652,2,0.346701
233,Eppie Wietzes,1967-08-27,1974-09-22,7.071869,2,0.282811
280,Gene Force,1951-05-30,1960-05-30,9.002053,2,0.222172


In [89]:
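# Reassign instead of using inplace=True: the frame is a filtered slice, and modifying it in place can trigger SettingWithCopyWarning
driver_experience_filtered = driver_experience_filtered.set_index('driver')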

career_duration_mapping = driver_experience_filtered['carrer_duration_years']
total_races_mapping = driver_experience_filtered['race_count']
average_races_per_year_mapping = driver_experience_filtered['avg_race_per_year']

df['career_duration_years'] = df['driver'].map(career_duration_mapping)
df['total_races'] = df['driver'].map(total_races_mapping)
df['average_races_per_year'] = df['driver'].map(average_races_per_year_mapping)

df.head(3)

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status,age,career_duration_years,total_races,average_races_per_year
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.3,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished,23.186858,16.366872,322.0,19.673888
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished,30.850103,11.383984,184.0,16.163059
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished,22.718686,10.71321,206.0,19.228597


In [94]:
#Need to double check if you should use nunique

completion_ratio = df.groupby('circuit_name')['final_position'].apply(lambda x: x.notna().sum() / x.size).reset_index(name = 'prob_of_success').set_index('circuit_name')

df_completed = df[df['final_position'].notna()]

average_experience_by_circuit = df_completed.groupby('circuit_name')[['career_duration_years', 'total_races', 'average_races_per_year']].agg({
    'career_duration_years': 'mean',
    'total_races': 'mean',
    'average_races_per_year': 'mean'
}).reset_index().rename(columns = {'career_duration_years': 'avg_career_duration', 'total_races': 'avg_total_races', 'average_races_per_year': 'avg_races_per_year'})

prob = completion_ratio['prob_of_success']
average_experience_by_circuit['prob_of_success'] = average_experience_by_circuit['circuit_name'].map(prob)

# avg_career_duration: mean career length (years) of drivers who finished at the circuit
# avg_total_races: mean career race count of drivers who finished at the circuit
# avg_races_per_year: mean races-per-year ratio of drivers who finished at the circuit
# prob_of_success: drivers who finished / all drivers entered

print(average_experience_by_circuit[average_experience_by_circuit['circuit_name'] == 'Circuit de Monaco'])

average_experience_by_circuit.sort_values(by = 'prob_of_success', ascending = True)

         circuit_name  avg_career_duration  avg_total_races  \
25  Circuit de Monaco            10.389778       151.520104   

    avg_races_per_year  prob_of_success  
25           14.353569         0.480535  


Unnamed: 0,circuit_name,avg_career_duration,avg_total_races,avg_races_per_year,prob_of_success
35,Fair Park,8.622519,119.750000,13.748761,0.307692
58,Phoenix street circuit,9.662408,135.305556,14.009534,0.333333
65,Sebring International Raceway,9.860826,73.833333,7.313259,0.368421
32,Detroit Street Circuit,10.199941,142.457143,13.828578,0.376963
47,Long Beach,10.179025,133.530864,13.273635,0.377273
...,...,...,...,...,...
73,Yas Marina Circuit,10.288591,183.680000,18.445261,0.861486
71,Valencia Street Circuit,11.370099,184.234694,16.382259,0.883929
48,Losail International Circuit,9.293876,177.588235,20.104861,0.900000
50,Miami International Autodrome,8.559407,164.264706,20.219744,0.925000


In [None]:
# Map first_race onto table to get the experience of the driver up until the race so you can do race_date - first_race 
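
The idea in the comment above can be sketched with `groupby(...).transform`, which attaches each driver's first race to every row so that experience is measured at the time of the race and not over the whole career. The rows below are made up, and `experience_years` and `prior_races` are hypothetical column names.

```python
import pandas as pd

# Toy data (made up): two drivers, two circuits
toy = pd.DataFrame({
    "driver": ["Ann", "Ann", "Bob", "Bob", "Bob"],
    "circuit_name": ["Circuit X", "Circuit Y", "Circuit X", "Circuit Y", "Circuit X"],
    "race_date": pd.to_datetime([
        "2000-01-01", "2010-01-01", "2005-01-01", "2006-01-01", "2007-01-01",
    ]),
    "final_position": [2.0, 1.0, 1.0, 3.0, 1.0],
})

# Each driver's debut date, repeated on every one of their rows
toy["first_race"] = toy.groupby("driver")["race_date"].transform("min")
toy["experience_years"] = (toy["race_date"] - toy["first_race"]).dt.days / 365.25

# Number of races the driver had entered before this one
toy = toy.sort_values("race_date")
toy["prior_races"] = toy.groupby("driver").cumcount()

# Mean experience of winners per circuit, most demanding first
winners = toy[toy["final_position"] == 1]
exp_to_win = (
    winners.groupby("circuit_name")["experience_years"]
    .mean()
    .sort_values(ascending=False)
)
print(exp_to_win)
```

Unlike the career-level columns mapped earlier, these values never include races that happened after the race in question.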

Impact of Time Periods
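
One way to look at the impact of time periods is to group seasons into decades and compare the share of entries without a final position. This is a sketch on made-up rows; bucketing by decade is an assumption, and any other period definition works the same way.

```python
import pandas as pd

# Toy data (made up): four entries in 1989 and four in 2019
toy = pd.DataFrame({
    "year": [1989, 1989, 1989, 1989, 2019, 2019, 2019, 2019],
    "final_position": [1.0, None, None, None, 1.0, 2.0, 3.0, None],
})

# 1989 -> 1980, 2019 -> 2010
toy["decade"] = toy["year"] // 10 * 10

# Share of entries without a final position in each decade
dnf_by_decade = toy.groupby("decade")["final_position"].apply(lambda s: s.isna().mean())
print(dnf_by_decade)
```

Adding `circuit_name` to the `groupby` keys would show how the difficulty of a single circuit changed over time.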

In [97]:
df['car_number'].value_counts()

car_number
6.0      994
8.0      993
4.0      985
16.0     971
3.0      971
        ... 
123.0      1
120.0      1
126.0      1
110.0      1
95.0       1
Name: count, Length: 129, dtype: int64