# **Racing Data Analysis Project**

Our primary goal is to analyze and determine which cars dominate on specific tracks and identify the most versatile vehicles.


## **1\. Methodology**

### **1.1. Accrual of points**

We award points to the top 15 finishers in every race using the following system:

1. 1st place: 15 points + 3 bonus points = 18 points
2. 2nd place: 14 points + 2 bonus points = 16 points
3. 3rd place: 13 points + 1 bonus point = 14 points
4. 4th place: 12 points
5. 5th place: 11 points
6. 6th place: 10 points
7. 7th place: 9 points
8. 8th place: 8 points
9. 9th place: 7 points
10. 10th place: 6 points
11. 11th place: 5 points
12. 12th place: 4 points
13. 13th place: 3 points
14. 14th place: 2 points
15. 15th place: 1 point


### **1.2. Weighting and normalizing points**

In the GT World Challenge Europe championship, tracks may host different numbers of races of different durations. There may be 5 races at one track and 25 at another.

To cope with the varying number and duration of races, we need to normalize the scores so that the results are comparable. This can be done in several ways:


#### **1.2.1. Weighting by the race duration**

We multiply the points by a coefficient directly proportional to the length of the race in laps.

The base race length is set to 50 laps.<br>
If the race is longer, for example 64 laps, the points are multiplied by a factor of 64/50.<br>
If the race is shorter, for example 42 laps, the points are multiplied by a factor of 42/50.<br>
So, if a win is worth 18 points in a 50-lap race, in a 64-lap race it is worth 18 * (64/50) = 23.04 points.


#### **1.2.2. Normalizing by the number of races**

Divide the total points at each track by the number of races held at that track.
For example, if the Mercedes-AMG GT3 scored 60 points in total across 3 races in Barcelona, we divide 60 by 3 to get 20 points per race.
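
The two steps in section 1.2 can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the car names, the `Base points` column, and all values are made up, and `BASE_LAPS` simply mirrors the 50-lap base length chosen above.

```python
import pandas as pd

BASE_LAPS = 50  # base race length from section 1.2.1

# Hypothetical toy results for one track (illustrative values only)
toy = pd.DataFrame({
    "Event name": ["Race 1", "Race 1", "Race 2", "Race 2"],
    "Car": ["Car A", "Car B", "Car A", "Car B"],
    "Base points": [18, 16, 16, 18],
    "Laps": [50, 50, 64, 64],
})

# Step 1: weight points by race length relative to the base length
toy["Points"] = toy["Base points"] * toy["Laps"] / BASE_LAPS

# Step 2: divide each car's total by the number of races at the track
n_races = toy["Event name"].nunique()
points_per_race = (toy.groupby("Car")["Points"].sum() / n_races).round(2)
print(points_per_race)
```

In this toy case both cars have one win and one second place, but Car B ends up slightly ahead because its win came in the longer race.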


### **1.3. Statistical analysis**

We will calculate the median finishing position of each car at each track.


### **1.4. Data visualization**

Present the results graphically for clarity:<br>
Bar graphs: display the total points scored or the weighted average for each car.
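
A bar graph of this kind might look like the sketch below. It assumes a DataFrame with `Car` and `Points` columns, like the aggregated results built later in the notebook; `plot_points_by_car` is a hypothetical helper name and the values are toy data.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_points_by_car(points: pd.DataFrame, title: str):
    """Draw a horizontal bar chart of points per car (hypothetical helper)."""
    ordered = points.sort_values("Points")  # largest bar ends up on top
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(ordered["Car"], ordered["Points"])
    ax.set_xlabel("Normalized points")
    ax.set_title(title)
    fig.tight_layout()
    return ax

# Illustrative values only, not taken from the dataset
example = pd.DataFrame({
    "Car": ["Car A", "Car B", "Car C"],
    "Points": [42.0, 30.5, 12.25],
})
ax = plot_points_by_car(example, "Points per car (toy data)")
```

Horizontal bars are used here because the car names are long and are easier to read on the vertical axis.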

## **2\. Load Libraries**

In [32]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **3\. Load Data**

In [33]:
df = pd.read_parquet(".\\cleaned_data\\race_data.parquet")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7319 entries, 0 to 7318
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype          
---  ------                --------------  -----          
 0   Season                7319 non-null   int64          
 1   Meeting               7319 non-null   object         
 2   Event name            7319 non-null   object         
 3   Pos                   7319 non-null   int64          
 4   Car #                 7319 non-null   int64          
 5   Class                 7319 non-null   category       
 6   Special Class         7319 non-null   bool           
 7   Drivers               7319 non-null   object         
 8   Team                  7319 non-null   object         
 9   Car                   7319 non-null   object         
 10  Best lap set          7319 non-null   bool           
 11  Time                  7319 non-null   object         
 12  Time timedelta        7319 non-null   timedelta64[ns]
 13  Lap

## **4\. Add weighted 'Points' column**

In [58]:
# Definition of the point system
points_system = {1: 18, 2: 16, 3: 14, 4: 12, 5: 11, 6: 10, 7: 9, 8: 8, 9: 7, 10: 6, 11: 5, 12: 4, 13: 3, 14: 2, 15: 1}

# Function to assign points based on position
def assign_points(pos, laps, base_laps=50):
    base_points = points_system.get(pos, 0)
    lap_factor = laps / base_laps
    return base_points * lap_factor

# Adding a 'Points' column to the DataFrame
df['Points'] = df.apply(lambda row: assign_points(row['Pos'], row['Laps']), axis=1)

### **4.1. Normalizing the 'Points' column**

In [59]:
# Preparing the dataframe to store aggregate results
normalized_points = pd.DataFrame()

# Retrieving unique tracks and seasons
unique_tracks = df['Meeting'].unique()
unique_seasons = df['Season'].unique()

# Looping through all seasons and tracks to calculate points
for season in unique_seasons:
    for track in unique_tracks:
        # Filtering data for a specific track and season
        filtered_data = df[(df['Meeting'] == track) & (df['Season'] == season)]
        
        # Checking if there is data for this track and season
        if not filtered_data.empty:
            # Calculating and summing points
            points_by_car = filtered_data.groupby('Car')['Points'].sum() / (len(filtered_data['Event name'].unique()))
            points_by_car = points_by_car.round(2).sort_values(ascending=False).reset_index()
            # Adding information about the track and season
            points_by_car['Meeting'] = track
            points_by_car['Season'] = season
            
            # Saving results in the aggregate dataframe
            normalized_points = pd.concat([normalized_points, points_by_car])

# Resetting the index in the final dataframe
normalized_points.reset_index(drop=True, inplace=True)

# Displaying the final dataframe
normalized_points.head()


Unnamed: 0,Car,Points,Meeting,Season
0,Mercedes-AMG GT3,54.17,Barcelona,2021
1,Lamborghini Huracan GT3 Evo,38.37,Barcelona,2021
2,Porsche 911 GT3-R (991.II),35.19,Barcelona,2021
3,Audi R8 LMS GT3,24.38,Barcelona,2021
4,Ferrari 488 GT3,21.71,Barcelona,2021
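
The nested loop above could also be written with `groupby`. The following is a sketch under the assumption that the DataFrame has the same `Season`, `Meeting`, `Event name`, `Car` and `Points` columns; `normalize_points` is a hypothetical name, and the row and column order of its result differs from the loop version.

```python
import pandas as pd

def normalize_points(df: pd.DataFrame) -> pd.DataFrame:
    """Sum points per car and divide by the number of events per (season, track)."""
    keys = ["Season", "Meeting"]
    # Total points for each car at each track in each season
    totals = df.groupby(keys + ["Car"])["Points"].sum().reset_index()
    # Number of distinct events held at each track in each season
    n_events = (
        df.groupby(keys)["Event name"].nunique().rename("n_events").reset_index()
    )
    out = totals.merge(n_events, on=keys)
    out["Points"] = (out["Points"] / out["n_events"]).round(2)
    return (
        out.drop(columns="n_events")
        .sort_values(keys + ["Points"], ascending=[True, True, False])
        .reset_index(drop=True)
    )

# Toy data, illustrative only
toy = pd.DataFrame({
    "Season": [2021] * 5,
    "Meeting": ["Barcelona"] * 4 + ["Monza"],
    "Event name": ["Race 1", "Race 1", "Race 2", "Race 2", "Race 1"],
    "Car": ["A", "B", "A", "B", "A"],
    "Points": [18.0, 16.0, 14.0, 0.0, 10.0],
})
result = normalize_points(toy)
print(result)
```

Counting events per `(Season, Meeting)` pair keeps the divisor correct even when different tracks reuse the same event names.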


## **5\. Add 'Median finishing position' column**

In [60]:

median_positions = df.groupby(['Season', 'Meeting', 'Car'])['Pos'].median().reset_index()

# Renaming columns for better readability
median_positions.rename(columns={'Pos': 'Median Finish Position'}, inplace=True)
# Changing data type to int (note: astype(int) truncates half-values, e.g. 20.5 becomes 20)
median_positions['Median Finish Position'] = median_positions['Median Finish Position'].astype(int)

median_positions.head()

Unnamed: 0,Season,Meeting,Car,Median Finish Position
0,2021,Barcelona,Aston Martin Vantage AMR GT3,31
1,2021,Barcelona,Audi R8 LMS GT3,17
2,2021,Barcelona,Audi R8 LMS GT3 EVO 2,10
3,2021,Barcelona,BMW M4 GT3,16
4,2021,Barcelona,BMW M6 GT3,39


---

In [50]:
df['Meeting'].unique()

array(['Barcelona', 'Brands Hatch', 'Circuit Paul Ricard', 'Magny-Cours',
       'Misano', 'Monza', 'Nürburgring', 'Circuit de Spa-Francorchamps',
       'Valencia', 'Zandvoort', 'Hockenheim', 'Imola'], dtype=object)

In [51]:
filtered_data = df[(df['Meeting'] == 'Circuit de Spa-Francorchamps') & (df['Season'] == 2023)]
total_points_by_car = filtered_data.groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
total_points_by_car

Unnamed: 0,Car,Points
0,Audi R8 LMS GT3 EVO 2,2582.94
1,BMW M4 GT3,2431.52
2,Porsche 911 GT3 R (992),1580.96
3,Mercedes-AMG GT3,1561.62
4,McLaren 720S GT3 EVO,243.58
5,Ferrari 296 GT3,117.26
6,Lamborghini Huracan GT3 EVO 2,60.42
7,Aston Martin Vantage GT3,0.0
8,Ferrari 488 GT3,0.0
9,Porsche 911 GT3-R (991.II),0.0


In [39]:
print(df['Pos'].mean())

#df.to_csv('C:\\Users\\ireev\\Desktop\\cleaned_race_data_v2.csv', index=False)

df.groupby('Meeting')['Pos'].max().sort_values(ascending=False).reset_index()

28.25058068042082


Unnamed: 0,Meeting,Pos
0,Circuit de Spa-Francorchamps,70
1,Circuit Paul Ricard,56
2,Barcelona,54
3,Nürburgring,54
4,Monza,53
5,Imola,49
6,Hockenheim,48
7,Misano,37
8,Valencia,36
9,Brands Hatch,28


---------------------


In [69]:
filtered_data = normalized_points[normalized_points['Season'] == 2023]
best_cars = filtered_data.groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
best_cars

Unnamed: 0,Car,Points
0,Audi R8 LMS GT3 EVO 2,507.66
1,BMW M4 GT3,452.32
2,Mercedes-AMG GT3,231.85
3,Porsche 911 GT3 R (992),228.57
4,Ferrari 296 GT3,198.11
5,Mercedes-AMG GT3 EVO,158.33
6,McLaren 720S GT3 EVO,83.71
7,Lamborghini Huracan GT3 EVO 2,80.3
8,Lamborghini Huracan GT3 Evo,0.34
9,Aston Martin Vantage GT3,0.0


Defining the overall leader(s) by total points is not really suitable, because some tracks, such as Spa, host more racing events than others, and this may skew our data.

## Best on Barcelona

In [72]:
filtered_data = normalized_points[(normalized_points['Season'] == 2023) & (normalized_points['Meeting'] == 'Barcelona')]
best_cars = filtered_data.groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
best_cars

Unnamed: 0,Car,Points
0,Mercedes-AMG GT3 EVO,50.07
1,Ferrari 296 GT3,43.08
2,Porsche 911 GT3 R (992),30.99
3,BMW M4 GT3,19.78
4,Audi R8 LMS GT3 EVO 2,14.91
5,McLaren 720S GT3 EVO,10.8
6,Lamborghini Huracan GT3 EVO 2,0.47
7,Aston Martin Vantage GT3,0.0
8,Mercedes-AMG GT3,0.0


In [57]:
# Build the filter from median_positions itself, not from normalized_points
median_positions[(median_positions['Meeting'] == 'Barcelona') & (median_positions['Season'] == 2023)][['Car', 'Median Finish Position']].sort_values(by='Median Finish Position', ascending=True).reset_index(drop=True)

Unnamed: 0,Car,Median Finish Position
0,McLaren 720S GT3 EVO,20
1,BMW M4 GT3,21
2,Lamborghini Huracan GT3 EVO 2,21
3,Ferrari 296 GT3,24
4,Mercedes-AMG GT3 EVO,24
5,Porsche 911 GT3 R (992),24
6,Audi R8 LMS GT3 EVO 2,32
7,Mercedes-AMG GT3,40
8,Aston Martin Vantage GT3,41


Let's run some statistical tests to see whether there are statistically significant leaders. Good options for us are ANOVA and Tukey's test.
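
One possible way to run these tests is sketched below using SciPy's `f_oneway` and `tukey_hsd` (the latter requires SciPy 1.8 or newer). The choice of comparing per-race points grouped by car is an assumption, `compare_cars` is a hypothetical helper, and the data are toy values rather than results from the dataset.

```python
import pandas as pd
from scipy import stats

def compare_cars(results: pd.DataFrame, value_col: str = "Points"):
    """One-way ANOVA across cars, followed by Tukey's HSD (hypothetical helper)."""
    # One array of observations per car; groupby sorts car names alphabetically
    groups = {car: g[value_col].to_numpy() for car, g in results.groupby("Car")}
    names = list(groups)
    anova = stats.f_oneway(*groups.values())
    tukey = stats.tukey_hsd(*groups.values())
    # Pairwise p-values as a labeled square table
    pairwise = pd.DataFrame(tukey.pvalue, index=names, columns=names)
    return anova, pairwise

# Toy data, illustrative only
toy = pd.DataFrame({
    "Car": ["Car A"] * 5 + ["Car B"] * 5 + ["Car C"] * 5,
    "Points": [18, 16, 18, 14, 16, 17, 18, 14, 16, 15, 2, 0, 4, 1, 3],
})
anova, pairwise = compare_cars(toy)
print(f"ANOVA p-value: {anova.pvalue:.4g}")
print(pairwise.round(3))
```

ANOVA only indicates whether at least one car differs from the others; the Tukey table then shows which pairs differ. Both tests assume roughly normal residuals and similar variances across groups, which should be checked on the real data before relying on the p-values.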