# Final Group Project: 
# Prediction of the Top 3 Winners of the upcoming Azerbaijan Grand Prix
#### Group Members: Catherine Jin, Oliver Zhao, Conny Zhou
#### Section: QTM 151, Section 3

## I. Introduction
- Background: What is F1?

    Formula One, also known as F1, is the highest class of single-seater auto racing that is sanctioned by the Fédération Internationale de l'Automobile (FIA). The F1 World Championship is the most prestigious and popular motorsport event in the world, featuring a series of races called Grands Prix that take place across the globe. F1 cars are the most technologically advanced and fastest racing cars in the world, capable of reaching speeds over 370 km/h (230 mph). The sport is known for its high levels of competition, drama, and excitement, with drivers and teams competing for the World Championship title each season. F1 is watched by millions of fans around the world and has a rich history dating back to the 1950s.

- Purpose

    Our group aims to forecast the top three finishers of the upcoming Azerbaijan Grand Prix on April 30, 2023, by analyzing data from the past six years, starting from 2017. First, we examine the correlation between a driver's starting "grid position" and his/her final "rank", expecting a strong positive correlation between the two variables. We then use "grid position" as an indicator of a driver's potential success in the upcoming Azerbaijan race.

- Results

    The analysis shows a positive relationship between grid position and final rank: drivers starting from the front grid positions are more likely to finish with a higher rank.
    Building on this observation, we analyze the past records of the 20 drivers who are about to compete in the upcoming Azerbaijan Grand Prix. We use an interactive polar graph to visualize the records of each driver who competed in the Azerbaijan Grand Prix in previous years. More detailed descriptions are given above the polar graph.
    

## II. Data Description

- We chose 4 of the 14 datasets provided.
    
    drivers.csv: 
    1. The dataset contains all the drivers participating in Formula One races 
    2. The dataset has 857 observations, with each row representing information about a driver, including name, nationality, etc.

    results.csv:
    1. The dataset contains all the results of the past races
    2. The dataset has 25840 observations, with each row indicating the Id of one particular race, the driver Id, the constructor Id, and a plethora of detailed time data

    races.csv:
    1. The dataset contains the general location and pre-racing information
    2. The dataset has 1102 observations, with each row specifying the date for free practice, qualifying races, the final race, as well as the location for the race

    circuits.csv:
    1. The dataset contains all the circuits that were and are used for holding the Grand Prix
    2. The dataset has 77 observations, with each row indicating the location, altitude, longitude, and other information about the circuits

In [2]:
#Import several libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px

In [17]:
#Import datasets
drivers = pd.read_csv("data_raw/drivers.csv")
results = pd.read_csv("data_raw/results.csv")
races = pd.read_csv("data_raw/races.csv")
circuits = pd.read_csv("data_raw/circuits.csv")

- Merging + Cleaning

    To examine the relationship between grid position and final rank, we query the dataset for the results whose status Id is 1, meaning the driver finished the race normally.

    We then compute the mean final rank for each grid position from 1 to 20.


In [32]:
# Keep only the results whose statusId is 1 (the driver finished the race normally)
# and whose grid position is between 1 and 20, then compute the mean final rank per grid position
vec_grid = np.arange(1,21)
result_corr = (results.query("statusId == 1")
               .query("grid.isin(@vec_grid)")
               .groupby(["grid"])
               .agg(mean_rank = ('positionOrder', 'mean')))
display(result_corr)

| grid | mean_rank |
|------|-----------|
| 1 | 1.794118 |
| 2 | 2.472574 |
| 3 | 2.998525 |
| 4 | 3.504902 |
| 5 | 4.066421 |
| 6 | 4.39501 |
| 7 | 5.253933 |
| 8 | 5.544335 |
| 9 | 5.948087 |
| 10 | 6.356688 |
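
The group means above suggest a positive association, and a single rank correlation coefficient can summarize it. The sketch below is only an illustration: `toy_results` is a hypothetical stand-in for results.csv with invented values, and `grid_rank_correlation` is a helper name introduced here. It computes Spearman's coefficient as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Hypothetical stand-in for results.csv (values invented for illustration).
toy_results = pd.DataFrame({
    "grid":          [1, 2, 3, 4, 5, 6, 0, 1, 2, 3],
    "positionOrder": [1, 3, 2, 5, 4, 6, 9, 2, 1, 12],
    "statusId":      [1, 1, 1, 1, 1, 1, 1, 1, 1, 4],
})

def grid_rank_correlation(results: pd.DataFrame) -> float:
    """Spearman correlation between grid and positionOrder for finished races."""
    # Same filter as the cell above: statusId 1 and grid between 1 and 20.
    keep = (results["statusId"] == 1) & results["grid"].between(1, 20)
    finished = results[keep]
    # Spearman's rho equals the Pearson correlation of the ranked values.
    return finished["grid"].rank().corr(finished["positionOrder"].rank())

rho = grid_rank_correlation(toy_results)
print(f"Spearman correlation on the toy data: {rho:.3f}")
```

On the real data, calling `grid_rank_correlation(results)` would give one number to report next to the table; a value close to 1 would support using grid position as a predictor.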


- Merging + Cleaning + Manipulating

    In the first chunk, we select the 20 drivers in this year's competition. We then format the data into a vector containing only the driverId to make subsequent operations easier.

    We then perform a series of merges that combine the datasets above into the final dataset used for the polar graph. The final dataset is named results_Baku, after Baku, the city where the 2023 Azerbaijan Grand Prix is going to take place.

    In particular, to make our data compatible with the polar graph, we define the column [year_str], which is the year represented in string format.

In [30]:
drivers_2023 = drivers.query("(surname == 'Verstappen' and forename == 'Max') or \
                             (surname == 'Pérez' and forename == 'Sergio') or \
                             (surname == 'Alonso' and forename == 'Fernando') or \
                             (surname == 'Stroll' and forename == 'Lance') or \
                             (surname == 'Russell' and forename == 'George') or \
                             (surname == 'Hamilton' and forename == 'Lewis') or \
                             (surname == 'Leclerc' and forename == 'Charles') or \
                             (surname == 'Sainz' and forename == 'Carlos') or \
                             (surname == 'Piastri' and forename == 'Oscar') or \
                             (surname == 'Norris' and forename == 'Lando') or \
                             (surname == 'Ocon' and forename == 'Esteban') or \
                             (surname == 'Gasly' and forename == 'Pierre') or \
                             (surname == 'Hülkenberg' and forename == 'Nico') or \
                             (surname == 'Magnussen' and forename == 'Kevin') or \
                             (surname == 'Zhou' and forename == 'Guanyu') or \
                             (surname == 'Bottas' and forename == 'Valtteri') or \
                             (surname == 'de Vries' and forename == 'Nyck') or \
                             (surname == 'Tsunoda' and forename == 'Yuki') or \
                             (surname == 'Sargeant' and forename == 'Logan') or \
                             (surname == 'Albon' and forename == 'Alexander')")
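
A more compact way to express the same selection is to keep the names in a small table and merge on both name columns. This is a sketch under assumptions: `toy_drivers` is a hypothetical excerpt standing in for drivers.csv, `grid_2023` lists only three of the 20 names, and the ids shown are placeholders. The left merge with `indicator=True` also reveals any name that failed to match, for example because of an accent or spelling difference.

```python
import pandas as pd

# Hypothetical excerpt of drivers.csv; ids are placeholders for illustration.
toy_drivers = pd.DataFrame({
    "driverId": [1, 830, 815, 3],
    "forename": ["Lewis", "Max", "Sergio", "Nico"],
    "surname":  ["Hamilton", "Verstappen", "Pérez", "Rosberg"],
})

# The drivers we want, as (forename, surname) pairs (shortened list).
grid_2023 = pd.DataFrame(
    [("Max", "Verstappen"), ("Sergio", "Pérez"), ("Lewis", "Hamilton")],
    columns=["forename", "surname"],
)

# Inner merge keeps only the drivers whose full name appears in grid_2023.
selected = toy_drivers.merge(grid_2023, on=["forename", "surname"], how="inner")

# Left merge from the wanted list flags names with no match in the dataset.
check = grid_2023.merge(toy_drivers, on=["forename", "surname"],
                        how="left", indicator=True)
missing = check[check["_merge"] == "left_only"]

print(selected)
print("Unmatched names:", len(missing))
```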

In [31]:
## Predict the outcome of the forthcoming 4/30 Azerbaijan Grand Prix based on previous performance
## Extract past races that share the following:
## Circuit ID: 73, Baku, Azerbaijan
## Driver ID: the 20 drivers in 2023
series_drId = drivers_2023["driverId"]
vec_drId = np.array(series_drId.values)
results_Baku = (results[results['driverId'].isin(vec_drId)]
                .query("raceId == [976, 992, 1013, 1057, 1081]"))

               
##merge the data with races on raceId
results_Baku = pd.merge(results_Baku, races, on='raceId')
##define a new column year in string format
results_Baku['year_str'] = results_Baku['year'].astype(str)
##add the corresponding forename and surname to the driverId
results_Baku = pd.merge(results_Baku, drivers, on='driverId')
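
The race ids in the cell above are typed in by hand. As a hedged alternative, they could be derived from races.csv using the circuit id (73, as noted in the comments) and the starting year (2017, as stated in the Purpose). The sketch below uses small hypothetical frames in place of the real files, and `baku_results` is a helper name introduced here. The `validate="many_to_one"` argument makes pandas raise an error if a merge key is unexpectedly duplicated on the right-hand side.

```python
import pandas as pd

# Hypothetical stand-ins for races.csv, results.csv and drivers.csv.
toy_races = pd.DataFrame({
    "raceId":    [10, 11, 12, 13],
    "year":      [2016, 2017, 2018, 2018],
    "circuitId": [73, 73, 73, 1],
})
toy_results = pd.DataFrame({
    "raceId":        [10, 11, 11, 12, 13],
    "driverId":      [1, 1, 2, 1, 1],
    "grid":          [3, 1, 4, 2, 5],
    "positionOrder": [2, 5, 3, 1, 4],
})
toy_drivers = pd.DataFrame({
    "driverId": [1, 2],
    "forename": ["Lewis", "Max"],
    "surname":  ["Hamilton", "Verstappen"],
})

def baku_results(results, races, drivers, driver_ids,
                 circuit_id=73, first_year=2017):
    """Results at one circuit from first_year on, with race year and driver names."""
    # Select the races at the circuit instead of hard-coding their ids.
    at_circuit = races[(races["circuitId"] == circuit_id)
                       & (races["year"] >= first_year)]
    out = (results[results["driverId"].isin(driver_ids)]
           .merge(at_circuit[["raceId", "year"]], on="raceId",
                  how="inner", validate="many_to_one")
           .merge(drivers[["driverId", "forename", "surname"]], on="driverId",
                  how="left", validate="many_to_one"))
    # String version of the year, used as the categorical angle in the polar plot.
    out["year_str"] = out["year"].astype(str)
    return out

toy_baku = baku_results(toy_results, toy_races, toy_drivers, driver_ids=[1, 2])
print(toy_baku)
```

Selecting only the needed columns before each merge also avoids the suffixed duplicates (such as `url_x` and `url_y`) that appear when both frames share a column name.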





- Column Description

    I. Polar Plot Rationale 

    Starting from a front grid position (var grid) is vital on tracks where overtaking is difficult, as it provides the benefit of being a few meters ahead and on the standard racing line, which is typically cleaner and offers better traction. The subsequent chart illustrates how this plays out in the drivers' past races at the Baku circuit (var circuitId).



    II. Polar Plot Analysis 

    In this polar plot, the radius (r) represents the driver's position in each year he/she participated, and the circle consists of 5 points, one for each year the race was held. The code passes the column positionOrder (the final classification) as r, so the plotted value is the finishing position rather than the starting grid slot. The closer a point is to the center of the circle, the better the position. Taking Hamilton as an example, he was in 4 races and was near the front in every year except 2021.

In [33]:
fig = px.line_polar(results_Baku, r='positionOrder', theta='year_str',
                    color='surname', line_close=True,
                    title='Polar seasonal plot',
                    width=600, height=500)
fig.show()


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated a

## III. Results

From the previous aggregate statistics table, we see that grid position is positively related to final rank. Intuitively this makes sense: drivers starting from grid positions 10-20 can rarely pass all the drivers ahead of them, while the drivers at the front are more likely to win the race.

Next, consider the polar graph. We interpret it by examining the area enclosed by each driver's connected dots. If the overall area for a particular driver is small, meaning that the dots are clustered around the center and that he/she earned positions near the front, he/she is likely to do well in the 2023 Azerbaijan Grand Prix. It is not necessary to calculate the area precisely for each driver, since the visualization is clear enough to the naked eye. On a side note, if a driver participated in just 1 or 2 races, his/her area is simply a dot or a line, which can lead to a misleading interpretation if we look only at the area. Thus, the other parameter we should take into consideration is the sample size. Because only 5 Azerbaijan Grands Prix are available in our data, we are limited in the number of observations. To work around this problem, it is reasonable to expect a higher chance of winning for a driver who competed more times with a bigger area than for a driver who competed only once or twice with a better position.
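
Although reading the area by eye is enough for this report, it can also be computed. The sketch below is an illustration under assumptions: `polar_polygon_area` is a helper name introduced here, the years are assumed to be spaced evenly around the circle, and the points are assumed to be joined in year order. It converts each point to Cartesian coordinates and applies the shoelace formula, returning 0 when a driver has fewer than 3 races, which is the dot or line case described above.

```python
import numpy as np

def polar_polygon_area(radii, year_index, n_years=5):
    """Area enclosed by points (r, theta) joined in order and closed.

    radii      : position in each race the driver entered
    year_index : 0-based slot of each race among the n_years plotted years
    """
    radii = np.asarray(radii, dtype=float)
    if radii.size < 3:
        # One race is a dot and two races are a line: no enclosed area.
        return 0.0
    theta = 2 * np.pi * np.asarray(year_index, dtype=float) / n_years
    x, y = radii * np.cos(theta), radii * np.sin(theta)
    # Shoelace formula on the closed polygon.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Hypothetical drivers: one always near the front, one with only two races.
area_front = polar_polygon_area([1, 2, 1, 3, 2], [0, 1, 2, 3, 4])
area_two_races = polar_polygon_area([1, 1], [0, 3])
print(area_front, area_two_races)
```

Reporting the number of races next to the area keeps the sample size visible, so a driver with an area of 0 from two races is not mistaken for a consistent front-runner.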

Finally, an interesting phenomenon appears in Max Verstappen's records. His grid position is always at the very back, even though Mr. Verstappen is considered one of the most skillful drivers, especially in 2023, when Team Red Bull is very competitive. A possible reason is the engine penalty.

"The first time an additional element is used, the driver gets a 10-place grid penalty. The next time an additional element is used, the driver gets a five-place grid penalty. If a driver incurs a penalty exceeding 15 grid places, they will be required to start the race at the back."(https://www.formula1.com/en/latest/article.how-do-f1-engine-penalties-work.7aLmj23MgHiv9Rin48ROrY.html)

This quotation from the official F1 website explains that some drivers may start at the back because they use additional engine elements. The team is confident that the disadvantage of starting at the back can be offset by the new elements and the driver's performance. However, this situation occurs rarely. The effect of additional engine elements on driver performance is beyond the scope of this project, but it is something to keep in mind.
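
The quoted rule can be written as a small function to see how quickly penalties add up. This is a simplified sketch, not the official procedure: the function names are introduced here, every use after the first is assumed to cost five places, the grid is assumed to hold 20 cars, and penalties given to other drivers are ignored.

```python
def total_penalty(additional_uses: int) -> int:
    """Grid places lost: 10 for the first additional element, 5 for each later one."""
    if additional_uses <= 0:
        return 0
    return 10 + 5 * (additional_uses - 1)

def starting_slot(qualifying_position: int, additional_uses: int,
                  grid_size: int = 20) -> int:
    """Starting slot after penalties under the simplified reading above."""
    penalty = total_penalty(additional_uses)
    if penalty > 15:
        # More than 15 places: the driver starts at the back of the grid.
        return grid_size
    # Otherwise drop back by the penalty, but never beyond the last slot.
    return min(qualifying_position + penalty, grid_size)

for uses in range(4):
    print(uses, total_penalty(uses), starting_slot(1, uses))
```

Under these assumptions, a driver who qualifies first and takes one additional element starts 11th, while a third additional element sends the driver to the back.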

## IV. Discussion

In conclusion, based on the table and graph above, grid position is positively related to final rank.
For the upcoming 2023 Azerbaijan Grand Prix, we infer that Lewis Hamilton, Lando Norris, and Sergio Pérez are the drivers most likely to finish in the top positions.
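
To make the selection of three drivers reproducible, the visual reading of the polar graph could be backed by a simple ranking. The sketch below is one possible approach and not the method used to reach the conclusion above: `rank_candidates` and its `min_races` threshold are introduced here, and `toy_records` contains invented values. It averages each driver's plotted position and drops drivers with too few races, which reflects the sample-size concern raised in the Results.

```python
import pandas as pd

# Hypothetical records in the shape of results_Baku (values invented).
toy_records = pd.DataFrame({
    "surname":       ["A", "A", "A", "B", "B", "B", "C", "D", "D", "D"],
    "year":          [2018, 2019, 2021, 2018, 2019, 2021, 2022, 2018, 2019, 2021],
    "positionOrder": [2, 1, 4, 3, 5, 2, 1, 10, 8, 12],
})

def rank_candidates(df, value_col="positionOrder", min_races=2, top_n=3):
    """Top drivers by mean position, ignoring those with fewer than min_races."""
    summary = df.groupby("surname").agg(
        n_races=(value_col, "size"),
        mean_position=(value_col, "mean"),
    )
    # Drop drivers with too few observations to be meaningful.
    summary = summary[summary["n_races"] >= min_races]
    # Lower mean position is better; more races breaks ties.
    summary = summary.sort_values(["mean_position", "n_races"],
                                  ascending=[True, False])
    return summary.head(top_n)

top_three = rank_candidates(toy_records)
print(top_three)
```

Passing `value_col="grid"` instead would rank the drivers by starting position, which may be closer to the argument made in the Introduction.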

