# Group Assignment: Formula 1 Data

Answer the questions below using the `f1_data.csv` dataset.

The full grade will be split as follows:
* 60 % notebook, code, and explanations of the questions
* 30 % presentation in class: quality of material, presentation, and Q&A
    * The presentation must focus on one specific driver, constructor, or circuit. Build a data-driven story about that driver, constructor, or circuit and present it.
    * You can present in any format you want: PPT, PDF, notebook, whatever
* 10 % visualization to support the answers and the presentation

The data is composed of the following variables:
* `car_number`: the number of the car
* `grid_starting_position`: the starting position of the car in the grid
* `final_position`: the position in which that driver ended
* `points`: the points earned by the driver in the race
* `laps`: the number of laps completed by the driver
* `total_race_time_ms`: the total time the driver took to complete the race in milliseconds
* `fastest_lap`: the lap number on which the driver set their fastest lap
* `rank`: the rank of the driver's fastest lap among all drivers in the race (1 = fastest lap of the race)
* `fastest_lap_time`: the time taken to complete the fastest lap
* `fastest_lap_speed`: the average speed of the fastest lap, in km/h
* `year`: the year of the race
* `race_number_season`: the number of the race in the season
* `race_name`: the name of the race
* `race_date`: the date of the race
* `race_start_time`: the start time of the race
* `circuit_name`: the name of the circuit where the race took place
* `circuit_location`: the location of the circuit
* `circuit_country`: the country where the circuit is located
* `circuit_lat`: the latitude of the circuit
* `circuit_lng`: the longitude of the circuit
* `circuit_altitude`: the altitude of the circuit
* `driver`: the name of the driver
* `driver_dob`: the date of birth of the driver
* `driver_nationality`: the nationality of the driver
* `constructor_name`: the name of the constructor team
* `constructor_nationality`: the nationality of the constructor team
* `status`: the status of the driver in the race (e.g., Finished, Did Not Finish, etc.)

**SUBMISSION: failing to comply with the submission format will result in a 0 grade**
* ONE (1) SINGLE ZIP FILE containing:
    * The notebook with the code and the answers
    * The presentation itself (PPT, PDF, notebook, whatever)
    * `f1_data.csv`
* The ZIP file should be named as follows: `group_assignment_<group_id>.zip`
  * For example, if you are group 1, the file should be named `group_assignment_1.zip`

### 0. Group Information

* Group ID
* Members:
  * ...

### 1. Basic operations. (1 point)

* Open the dataset as a pandas dataframe and show the first 10 rows
* Show the number of rows and columns
* Show the data types of each column
* Calculate a column called `age` which represents the age of each driver on the date of the race:
    * Hint1: use the `pd.to_datetime` function to convert the date columns to datetime
    * Hint2: subtract the `driver_dob` column from the `race_date` column
    * Hint3: use the `dt.days` property to convert the result to days, then divide by 365.25 to get the age in years
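
The three hints can be sketched on a couple of rows as follows. This is a minimal illustration rather than the required solution: the `sample` dataframe is a hypothetical stand-in for the real data, with values copied from the first rows of `f1_data.csv` shown in this notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for f1_data.csv
sample = pd.DataFrame({
    "driver": ["Lewis Hamilton", "Nick Heidfeld"],
    "driver_dob": ["1985-01-07", "1977-05-10"],
    "race_date": ["2008-03-16", "2008-03-16"],
})

# Hint1: parse both date columns into datetimes
sample["driver_dob"] = pd.to_datetime(sample["driver_dob"])
sample["race_date"] = pd.to_datetime(sample["race_date"])

# Hint2: race date minus date of birth gives a timedelta per row
elapsed = sample["race_date"] - sample["driver_dob"]

# Hint3: whole days divided by 365.25 (average year length including leap years)
sample["age"] = elapsed.dt.days / 365.25

print(sample[["driver", "age"]])
```

Subtracting in the other order (`driver_dob - race_date`) would produce negative ages, which is an easy way to spot a reversed subtraction.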

In [23]:
import pandas as pd

df = pd.read_csv('f1_data.csv')
pd.set_option('display.max_columns', None)

df.head(5)

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.3,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished
3,5.0,11,4.0,5.0,58,5707797.0,58.0,7.0,1:28.603,215.464,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Fernando Alonso,1981-07-29,Spanish,Renault,French,Finished
4,23.0,3,5.0,4.0,58,5708630.0,43.0,1.0,1:27.418,218.385,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Heikki Kovalainen,1981-10-19,Finnish,McLaren,British,Finished


In [24]:
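# Number of rows and columns
print(df.shape)

# Parse the date columns so they can be subtracted from each other
df['driver_dob'] = pd.to_datetime(df['driver_dob'])
df['race_date'] = pd.to_datetime(df['race_date'])

# Data type of each column
print(df.dtypes)

# Age in years on race day (uses a 365-day year; Hint3 suggests 365.25 to average over leap years)
df['age'] = (df['race_date'] - df['driver_dob']).dt.days / 365

df.head(5)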

(26080, 27)
car_number                        float64
grid_starting_position              int64
final_position                    float64
points                            float64
laps                                int64
total_race_time_ms                float64
fastest_lap                       float64
rank                              float64
fastest_lap_time                   object
fastest_lap_speed                 float64
year                                int64
race_number_season                  int64
race_name                          object
race_date                  datetime64[ns]
race_start_time                    object
circuit_name                       object
circuit_location                   object
circuit_country                    object
circuit_lat                       float64
circuit_lng                       float64
circuit_altitude                  float64
driver                             object
driver_dob                 datetime64[ns]
driver_nationality    

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status,age
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.3,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished,23.20274
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished,30.871233
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished,22.734247
3,5.0,11,4.0,5.0,58,5707797.0,58.0,7.0,1:28.603,215.464,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Fernando Alonso,1981-07-29,Spanish,Renault,French,Finished,26.649315
4,23.0,3,5.0,4.0,58,5708630.0,43.0,1.0,1:27.418,218.385,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10.0,Heikki Kovalainen,1981-10-19,Finnish,McLaren,British,Finished,26.424658


### 2. Why do we have missing values in the `final_position` column? (1 point)

In [34]:
import numpy as np

df['laps'].describe()

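# Caution: NaN never compares equal to anything, including itself, so `!= np.nan` is True
# for every row and this returns the whole dataframe; use .isna() / .notna() to test for missing values
df[df['final_position'] != np.nan]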

Unnamed: 0,car_number,grid_starting_position,final_position,points,laps,total_race_time_ms,fastest_lap,rank,fastest_lap_time,fastest_lap_speed,year,race_number_season,race_name,race_date,race_start_time,circuit_name,circuit_location,circuit_country,circuit_lat,circuit_lng,circuit_altitude,driver,driver_dob,driver_nationality,constructor_name,constructor_nationality,status,age
0,22.0,1,1.0,10.0,58,5690616.0,39.0,2.0,1:27.452,218.300,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.96800,10.0,Lewis Hamilton,1985-01-07,British,McLaren,British,Finished,23.202740
1,3.0,5,2.0,8.0,58,5696094.0,41.0,3.0,1:27.739,217.586,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.96800,10.0,Nick Heidfeld,1977-05-10,German,BMW Sauber,German,Finished,30.871233
2,7.0,7,3.0,6.0,58,5698779.0,41.0,5.0,1:28.090,216.719,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.96800,10.0,Nico Rosberg,1985-06-27,German,Williams,British,Finished,22.734247
3,5.0,11,4.0,5.0,58,5707797.0,58.0,7.0,1:28.603,215.464,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.96800,10.0,Fernando Alonso,1981-07-29,Spanish,Renault,French,Finished,26.649315
4,23.0,3,5.0,4.0,58,5708630.0,43.0,1.0,1:27.418,218.385,2008,1,Australian Grand Prix,2008-03-16,04:30:00,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.96800,10.0,Heikki Kovalainen,1981-10-19,Finnish,McLaren,British,Finished,26.424658
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26075,3.0,19,16.0,0.0,44,5053521.0,25.0,15.0,1:50.994,227.169,2023,12,Belgian Grand Prix,2023-07-30,13:00:00,Circuit de Spa-Francorchamps,Spa,Belgium,50.4372,5.97139,401.0,Daniel Ricciardo,1989-07-01,Australian,AlphaTauri,Italian,Finished,34.101370
26076,2.0,18,17.0,0.0,44,5054926.0,37.0,9.0,1:50.486,228.213,2023,12,Belgian Grand Prix,2023-07-30,13:00:00,Circuit de Spa-Francorchamps,Spa,Belgium,50.4372,5.97139,401.0,Logan Sargeant,2000-12-31,American,Williams,British,Finished,22.591781
26077,27.0,0,18.0,0.0,44,5060900.0,26.0,4.0,1:49.907,229.415,2023,12,Belgian Grand Prix,2023-07-30,13:00:00,Circuit de Spa-Francorchamps,Spa,Belgium,50.4372,5.97139,401.0,Nico Hülkenberg,1987-08-19,German,Haas F1 Team,American,Finished,35.969863
26078,55.0,4,,0.0,23,,9.0,19.0,1:53.138,222.864,2023,12,Belgian Grand Prix,2023-07-30,13:00:00,Circuit de Spa-Francorchamps,Spa,Belgium,50.4372,5.97139,401.0,Carlos Sainz,1994-09-01,Spanish,Ferrari,Italian,Collision damage,28.928767
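
One caveat about the filter in the cell above: the last row of its output still has an empty `final_position`, because comparing against `np.nan` with `!=` does not remove missing values. The sketch below shows the difference on a hypothetical four-row dataframe (`toy` is invented for illustration; only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using two of the dataset's column names
toy = pd.DataFrame({
    "final_position": [1.0, 2.0, np.nan, np.nan],
    "status": ["Finished", "Finished", "Engine", "Collision damage"],
})

# NaN is never equal to anything, so this mask is True on every row
not_equal_mask = toy["final_position"] != np.nan

# isna() / notna() are the reliable tests for missing values
missing = toy[toy["final_position"].isna()]
classified = toy[toy["final_position"].notna()]

print(not_equal_mask.all())
print(missing["status"].value_counts())
```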


In [35]:
# Rows without a final position, next to their recorded status
missing = df[df['final_position'].isna()]
missing[['final_position', 'status']]
list(missing['status'].value_counts().index)

# The statuses of drivers with no final position are all reasons for not being classified
# (mechanical failure, accident, failure to qualify, ...), so these drivers did not finish the race
missing['status'].value_counts()


status
Engine             1878
Did not qualify    1025
Accident            976
Collision           790
Gearbox             770
                   ... 
Crankshaft            1
Not restarted         1
+4 Laps               1
Underweight           1
Engine fire           1
Name: count, Length: 119, dtype: int64

### 3. Constructor analytics (3 points)

* Which constructor has the most race wins? (0.5 points)
* Which constructor has the most podiums (position 1, 2, or 3)? (0.5 points)
* Which constructor has the highest probability of not finishing a race, according to the dataset? (0.5 points)
* Which country has the most successful constructors in terms of race victories? (0.5 points)
* Which are the current constructors (from 2023) with the longest history in Formula 1? (0.5 points)
* Which is the constructor with the most drivers in Formula 1 across its history? (0.5 points)
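
As a rough sketch of how the first three questions could be approached, the example below counts wins and podiums from `final_position` and estimates a non-finish probability per constructor. The `toy` dataframe and the constructor names `A`, `B`, `C` are hypothetical, and treating a missing `final_position` as a non-finish is an assumption based on the status counts seen in section 2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy results; the notebook would use df loaded from f1_data.csv
toy = pd.DataFrame({
    "constructor_name": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "final_position": [1, 1, np.nan, 2, 3, 1, 2, np.nan, 4],
})

# Wins: rows whose final position is 1
wins = toy[toy["final_position"] == 1].groupby("constructor_name").size()

# Podiums: rows whose final position is 1, 2, or 3
podiums = toy[toy["final_position"].isin([1, 2, 3])].groupby("constructor_name").size()

# Assumption: a missing final_position marks a non-finish.
# The mean of a boolean column is the fraction of True values.
toy["did_not_finish"] = toy["final_position"].isna()
dnf_probability = toy.groupby("constructor_name")["did_not_finish"].mean()

print(wins.idxmax(), podiums.idxmax(), dnf_probability.idxmax())
```

Note that `idxmax` returns only the first label when several constructors tie, so sorting the counts and inspecting the top rows is safer on real data.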

In [None]:
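# Caution: `rank` looks like the fastest-lap ranking (in the rows above, the race winner has rank 2.0
# and the driver with the quickest lap has rank 1.0), so this counts fastest laps rather than wins;
# race wins are the rows with final_position == 1
df_rank = df[df['rank'] == 1]
print(df_rank.groupby('constructor_name')['rank'].count().idxmax())

# Podiums: final position 1, 2, or 3
df_positions = df[df['final_position'].isin([1, 2, 3])]
print(df_positions.groupby('constructor_name')['final_position'].count().idxmax())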



Mercedes
Ferrari


### 4. Driver analytics (3 points)

* With the data available, who is the fastest driver in Formula 1? (0.5 points)
* Which driver has the most podiums without a win (position 2 or 3)? (0.5 points)
* For each country, calculate the historical probability of having a driver on the podium (0.5 points)
* For each country, calculate the historical probability of having a driver win a race (0.5 points)
* Which driver was the youngest to win a race? (0.5 points)
* Which current drivers have the longest history in Formula 1? (0.5 points)

Hint: remember that a probability is calculated as the number of times an event happened divided by the total number of events
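
One possible reading of the hint, sketched on invented data: take each race as an event, and count the races in which a given nationality appears on the podium (or wins). The `toy` dataframe and the helper column `race_key` are hypothetical, and using `driver_nationality` as the "country" is an assumption; other definitions of the event (for example, one event per podium slot) are equally defensible.

```python
import pandas as pd

# Hypothetical toy data: two races with four classified drivers each
toy = pd.DataFrame({
    "year": [2008] * 8,
    "race_number_season": [1, 1, 1, 1, 2, 2, 2, 2],
    "driver_nationality": ["British", "German", "Spanish", "Finnish",
                           "German", "German", "British", "Spanish"],
    "final_position": [1, 2, 3, 4, 1, 2, 3, 4],
})

# Hypothetical helper column that identifies a race uniquely
toy["race_key"] = toy["year"].astype(str) + "-" + toy["race_number_season"].astype(str)

# Total number of events: distinct races
total_races = toy["race_key"].nunique()

# Number of races in which each nationality reached the podium, divided by total races
podium_rows = toy[toy["final_position"].isin([1, 2, 3])]
podium_probability = podium_rows.groupby("driver_nationality")["race_key"].nunique() / total_races

# Same idea for wins
win_rows = toy[toy["final_position"] == 1]
win_probability = win_rows.groupby("driver_nationality")["race_key"].nunique() / total_races

print(podium_probability)
print(win_probability)
```

Counting distinct races rather than rows keeps a nationality with two drivers on the same podium from being counted twice.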

### 5. Circuit analytics (2 points)

* Which would you say is the toughest circuit in Formula 1? (0.5 points)
* Which circuit requires the most F1 experience to win? (0.5 points)
* Which circuit and year saw the highest number of non-finishers? (0.5 points)
* For each constructor, which is their best circuit in terms of number of podiums? (0.5 points)
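
The last two questions are both "group, count, then pick the maximum" problems. The sketch below illustrates that pattern on a hypothetical `toy` dataframe (circuit and constructor labels are placeholders), again assuming that a missing `final_position` marks a non-finisher.

```python
import numpy as np
import pandas as pd

# Hypothetical toy results
toy = pd.DataFrame({
    "circuit_name": ["Monaco", "Monaco", "Monaco", "Monaco", "Spa", "Spa", "Spa", "Monaco"],
    "year": [2008, 2008, 2008, 2008, 2008, 2008, 2009, 2009],
    "constructor_name": ["A", "B", "C", "C", "A", "B", "A", "B"],
    "final_position": [1, 2, np.nan, np.nan, 1, np.nan, 2, 3],
})

# Non-finishers per (circuit, year); idxmax returns the (circuit, year) pair with the most
non_finishers = toy[toy["final_position"].isna()].groupby(["circuit_name", "year"]).size()
worst_circuit, worst_year = non_finishers.idxmax()

# Podium counts per (constructor, circuit)
podium_counts = (
    toy[toy["final_position"].isin([1, 2, 3])]
    .groupby(["constructor_name", "circuit_name"])
    .size()
    .reset_index(name="podiums")
)

# Keep the circuit with the highest podium count for each constructor
best_circuit = (
    podium_counts.sort_values("podiums", ascending=False)
    .drop_duplicates("constructor_name")
    .set_index("constructor_name")["circuit_name"]
)

print(worst_circuit, worst_year)
print(best_circuit)
```

Constructors with no podiums at all (like `C` here) drop out of `best_circuit`, which is worth stating explicitly when presenting the result.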

## Code for the presentation goes here

Build a data-driven story about a driver, circuit, or constructor and present it using your preferred format. You can use any of the data available in the dataset.

Examples:
* The most successful driver in Formula 1 history
* Why Monaco is the most difficult circuit in Formula 1
* The history of Ferrari in Formula 1
* ...

In [None]:
...