# **Racing Data Analysis Project**

Our primary goal is to analyze and determine which cars dominate on specific tracks and identify the most versatile vehicles.


## **1\. Methodology**

### **1.1. Accrual of points**

The top 10 finishers in every race are awarded points using the following system:

1. 1st place: 10 points + 3 bonus points = 13 points
2. 2nd place: 9 points + 2 bonus points = 11 points
3. 3rd place: 8 points + 1 bonus point = 9 points
4. 4th place: 7 points
5. 5th place: 6 points
6. 6th place: 5 points
7. 7th place: 4 points
8. 8th place: 3 points
9. 9th place: 2 points
10. 10th place: 1 point



### **1.2. Weighting and normalizing points**

In the GT World Challenge Europe championship, tracks differ in both the number of races and their duration. One track may host 5 races and another 25.

To account for the varying number and duration of races, the scores must be normalized so that the results are comparable. This can be done in several ways:


#### **1.2.1. Weighting the results by race duration**

We multiply the points for each result by a coefficient directly proportional to the duration of the race.

Set the base race length to 50 laps.<br>
If the race is longer, for example 64 laps, the points are multiplied by 64/50.<br>
If the race is shorter, for example 42 laps, the points are multiplied by 42/50.<br>
So if a win is worth 13 points in a 50-lap race, it is worth 13 * (64/50) = 16.64 points in a 64-lap race.
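
The lap weighting described here can be sketched in a few lines. This is a minimal illustration that assumes the points table from section 1.1 and the 50-lap base; the names `POINTS_TABLE`, `BASE_LAPS`, and `weighted_points` are hypothetical and are not taken from the notebook code.

```python
# Points table from section 1.1 (base points plus podium bonuses).
POINTS_TABLE = {1: 13, 2: 11, 3: 9, 4: 7, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}
BASE_LAPS = 50  # base race length from the text


def weighted_points(position, laps, base_laps=BASE_LAPS):
    """Scale the base points for a finishing position by race length."""
    base = POINTS_TABLE.get(position, 0)  # outside the top 10 scores nothing
    return base * laps / base_laps


print(round(weighted_points(1, 64), 2))  # 16.64 (longer race, win is worth more)
print(round(weighted_points(1, 42), 2))  # 10.92 (shorter race, win is worth less)
print(weighted_points(11, 64))           # 0.0
```

Because the factor is linear in the lap count, very long races such as 24-hour events carry a proportionally large weight in any sum of these points.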


#### **1.2.2. Normalizing results by the number of races**

Divide the total points at each track by the number of races at that track.
For example, if the Mercedes-AMG GT3 scored a total of 60 points over 3 races in Barcelona, we divide 60 by 3 to get 20 points per race.
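
As a small sketch of this normalization in pandas: the Barcelona row reproduces the example from the text, while the second row ("Example Track") is made up purely for illustration. The column names are assumptions chosen for this example.

```python
import pandas as pd

# Barcelona row: the example from the text. "Example Track" row: made up.
totals = pd.DataFrame({
    "Car": ["Mercedes-AMG GT3", "Mercedes-AMG GT3"],
    "Meeting": ["Barcelona", "Example Track"],
    "Total_Points": [60.0, 45.0],
    "Races": [3, 5],
})

# Points per race make tracks with different race counts comparable.
totals["Average_Points"] = totals["Total_Points"] / totals["Races"]
print(totals[["Meeting", "Average_Points"]])
```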



### **1.3. Statistical analysis**

We will use the weighted median to reduce the impact of outliers on the results.
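
One possible way to compute a weighted median is sketched below. It returns the lower weighted median: the smallest value at which the cumulative weight reaches half of the total. The choice of weights is an assumption (race length in laps would be one candidate), and `weighted_median` is a hypothetical helper, not a function from the notebook.

```python
def weighted_median(values, weights):
    """Return the lower weighted median of `values` (lists of equal length)."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    half = sum(weights) / 2
    cumulative = 0.0
    # Walk through the values in ascending order, accumulating their weights.
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Made-up per-race points with one outlier (40); equal weights.
points = [2, 3, 4, 40]
laps = [50, 50, 50, 50]
print(weighted_median(points, laps))   # 3
print(sum(points) / len(points))       # 12.25 (the mean is pulled up by the outlier)
```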


### **1.4. Data visualization**

Present the results graphically for clarity:<br>
Bar graphs: display the total points scored or the weighted average for each car.
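
A bar graph of this kind could look roughly like the following sketch. The car names and values are placeholders, not results from the dataset, and the column name `Average_Points` is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative placeholder values only, not results from the dataset.
scores = pd.DataFrame({
    "Car": ["Car A", "Car B", "Car C"],
    "Average_Points": [12.5, 9.0, 4.25],
}).sort_values("Average_Points", ascending=False)

# One bar per car, tallest first.
fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(scores["Car"], scores["Average_Points"])
ax.set_xlabel("Car")
ax.set_ylabel("Average points per race")
ax.set_title("Average points by car")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.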

## **2\. Load Libraries**

In [108]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **3\. Load Data**

In [109]:
df = pd.read_parquet(".\\cleaned_data\\race_data.parquet")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7319 entries, 0 to 7318
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype          
---  ------                --------------  -----          
 0   Season                7319 non-null   int64          
 1   Meeting               7319 non-null   object         
 2   Event name            7319 non-null   object         
 3   Pos                   7319 non-null   int64          
 4   Car #                 7319 non-null   int64          
 5   Class                 7319 non-null   category       
 6   Special Class         7319 non-null   bool           
 7   Drivers               7319 non-null   object         
 8   Team                  7319 non-null   object         
 9   Car                   7319 non-null   object         
 10  Best lap set          7319 non-null   bool           
 11  Time                  7319 non-null   object         
 12  Time timedelta        7319 non-null   timedelta64[ns]
 13  Lap

## Add 'Points' column

In [110]:
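# Points awarded per finishing position (positions outside the top 10 score 0).
# Note: the podium values here (15/12/10) differ from section 1.1 (13/11/9).
points_system = {1: 15, 2: 12, 3: 10, 4: 7, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}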

# Function to assign points based on position
def assign_points(pos, laps, base_laps=50):
    base_points = points_system.get(pos, 0)
    lap_factor = laps / base_laps
    return base_points * lap_factor

# Adding a 'Points' column to the DataFrame
df['Points'] = df.apply(lambda row: assign_points(row['Pos'], row['Laps']), axis=1)

In [111]:
# Prepare a DataFrame to hold the combined results
new_df = pd.DataFrame()

# Get the unique meetings and seasons
unique_meetings = df['Meeting'].unique()
unique_seasons = df['Season'].unique()

# Loop over every season and meeting to compute points
for season in unique_seasons:
    for meeting in unique_meetings:
        # Filter the data for this meeting and season
        filtered_data = df[(df['Meeting'] == meeting) & (df['Season'] == season)]

        # Only process combinations that actually have data
        if not filtered_data.empty:
            # Sum the points and count the distinct races for each car
            aggregated_data = filtered_data.groupby('Car').agg(
                Total_Points=pd.NamedAgg(column='Points', aggfunc='sum'),
                Races=pd.NamedAgg(column='Event name', aggfunc='nunique')
            )

            # Compute the average points per race
            aggregated_data['Average_Points'] = (aggregated_data['Total_Points'] / aggregated_data['Races']).round(2)

            # Sort by average points
            aggregated_data = aggregated_data.sort_values(by='Average_Points', ascending=False).reset_index()

            # Add the meeting and season information
            aggregated_data['Meeting'] = meeting
            aggregated_data['Season'] = season

            # Append the results to the combined DataFrame
            new_df = pd.concat([new_df, aggregated_data], ignore_index=True)
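
The same per-season, per-meeting aggregation can also be expressed as a single `groupby` without explicit loops. The sketch below is an alternative formulation under the assumption that the column names match those used in the loop; it runs on a small made-up frame (`toy`) rather than the real data, and the row order of the result differs from the loop version.

```python
import pandas as pd

# Toy data with the same column names as `df`; the values are made up.
toy = pd.DataFrame({
    "Season": [2021, 2021, 2021, 2021],
    "Meeting": ["Barcelona", "Barcelona", "Barcelona", "Monza"],
    "Event name": ["Race 1", "Race 2", "Race 1", "Main Race"],
    "Car": ["Car A", "Car A", "Car B", "Car A"],
    "Points": [10.0, 6.0, 7.0, 5.0],
})

# Group once by season, meeting, and car instead of looping over each pair.
summary = (
    toy.groupby(["Season", "Meeting", "Car"], as_index=False)
       .agg(Total_Points=("Points", "sum"), Races=("Event name", "nunique"))
)
summary["Average_Points"] = (summary["Total_Points"] / summary["Races"]).round(2)

# Within each season and meeting, list the best-scoring cars first.
summary = summary.sort_values(
    ["Season", "Meeting", "Average_Points"], ascending=[True, True, False]
).reset_index(drop=True)
print(summary)
```

Avoiding repeated `pd.concat` calls inside a loop also avoids copying the accumulated frame on every iteration.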

In [112]:
new_df.head()

| | Car | Total_Points | Races | Average_Points | Meeting | Season |
|---|---|---|---|---|---|---|
| 0 | Mercedes-AMG GT3 | 103.9 | 3 | 34.63 | Barcelona | 2021 |
| 1 | Porsche 911 GT3-R (991.II) | 65.44 | 3 | 21.81 | Barcelona | 2021 |
| 2 | Lamborghini Huracan GT3 Evo | 49.2 | 3 | 16.4 | Barcelona | 2021 |
| 3 | Audi R8 LMS GT3 | 34.96 | 3 | 11.65 | Barcelona | 2021 |
| 4 | Ferrari 488 GT3 | 30.04 | 3 | 10.01 | Barcelona | 2021 |


In [122]:
new_df[new_df['Season'] == 2023].groupby('Car')['Average_Points'].sum().sort_values(ascending=False).reset_index()

| | Car | Average_Points |
|---|---|---|
| 0 | Audi R8 LMS GT3 EVO 2 | 266.63 |
| 1 | BMW M4 GT3 | 251.62 |
| 2 | Mercedes-AMG GT3 | 131.9 |
| 3 | Porsche 911 GT3 R (992) | 103.23 |
| 4 | Ferrari 296 GT3 | 100.71 |
| 5 | Mercedes-AMG GT3 EVO | 99.64 |
| 6 | Lamborghini Huracan GT3 EVO 2 | 28.62 |
| 7 | McLaren 720S GT3 EVO | 19.21 |
| 8 | Aston Martin Vantage GT3 | 0.0 |
| 9 | Ferrari 488 GT3 | 0.0 |


In [114]:
df[df['Season'] == 2023]['Meeting'].unique()

array(['Barcelona', 'Brands Hatch', 'Circuit Paul Ricard 1000Km',
       'CrowdStrike 24 Hours of Spa', 'Hockenheim', 'Misano', 'Monza',
       'Nürburgring', 'Valencia', 'Zandvoort'], dtype=object)

In [115]:
filtered_data = df[(df['Meeting'] == 'CrowdStrike 24 Hours of Spa') & (df['Season'] == 2023)]
total_points_by_car = filtered_data.groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
total_points_by_car

| | Car | Points |
|---|---|---|
| 0 | Audi R8 LMS GT3 EVO 2 | 2582.94 |
| 1 | BMW M4 GT3 | 2431.52 |
| 2 | Porsche 911 GT3 R (992) | 1580.96 |
| 3 | Mercedes-AMG GT3 | 1561.62 |
| 4 | McLaren 720S GT3 EVO | 243.58 |
| 5 | Ferrari 296 GT3 | 117.26 |
| 6 | Lamborghini Huracan GT3 EVO 2 | 60.42 |
| 7 | Aston Martin Vantage GT3 | 0.0 |
| 8 | Ferrari 488 GT3 | 0.0 |
| 9 | Porsche 911 GT3-R (991.II) | 0.0 |


In [116]:
print(df['Pos'].mean())

#df.to_csv('C:\\Users\\ireev\\Desktop\\cleaned_race_data_v2.csv', index=False)

df.groupby('Meeting')['Pos'].max().sort_values(ascending=False).reset_index()

28.25058068042082


| | Meeting | Pos |
|---|---|---|
| 0 | CrowdStrike 24 Hours of Spa | 70 |
| 1 | TotalEnergies 24 Hours of Spa | 66 |
| 2 | Circuit Paul Ricard 1000Km | 56 |
| 3 | Barcelona | 54 |
| 4 | Nürburgring | 54 |
| 5 | Monza | 53 |
| 6 | Imola | 49 |
| 7 | Hockenheim | 48 |
| 8 | Misano | 37 |
| 9 | Valencia | 36 |


---------------------


In [117]:
len(df[(df['Season'] == 2021) & (df['Meeting'] == 'Barcelona')]['Event name'].unique())

3

In [118]:
# Prepare a DataFrame to hold the combined results
df_total_points = pd.DataFrame()

# Get the unique tracks and seasons
unique_tracks = df['Meeting'].unique()
unique_seasons = df['Season'].unique()

# Loop over every season and track to compute points
for season in unique_seasons:
    for track in unique_tracks:
        # Filter the data for this track and season
        filtered_data = df[(df['Meeting'] == track) & (df['Season'] == season)]

        # Only process combinations that actually have data
        if not filtered_data.empty:
            # Sum the points per car and divide by the number of races at this track
            points_by_car = filtered_data.groupby('Car')['Points'].sum() / (len(filtered_data['Event name'].unique()))
            points_by_car = points_by_car.round(2).sort_values(ascending=False).reset_index()
            # Add the track and season information
            points_by_car['Meeting'] = track
            points_by_car['Season'] = season

            # Append the results to the combined DataFrame
            df_total_points = pd.concat([df_total_points, points_by_car])

# Reset the index of the final DataFrame
df_total_points.reset_index(drop=True, inplace=True)

# Display the final DataFrame
df_total_points.head()


| | Car | Points | Meeting | Season |
|---|---|---|---|---|
| 0 | Mercedes-AMG GT3 | 34.63 | Barcelona | 2021 |
| 1 | Porsche 911 GT3-R (991.II) | 21.81 | Barcelona | 2021 |
| 2 | Lamborghini Huracan GT3 Evo | 16.4 | Barcelona | 2021 |
| 3 | Audi R8 LMS GT3 | 11.65 | Barcelona | 2021 |
| 4 | Ferrari 488 GT3 | 10.01 | Barcelona | 2021 |


In [119]:
filtered_data = df_total_points[df_total_points['Season'] == 2023]
best_cars = filtered_data.groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
best_cars

| | Car | Points |
|---|---|---|
| 0 | Audi R8 LMS GT3 EVO 2 | 266.63 |
| 1 | BMW M4 GT3 | 251.62 |
| 2 | Mercedes-AMG GT3 | 131.9 |
| 3 | Porsche 911 GT3 R (992) | 103.23 |
| 4 | Ferrari 296 GT3 | 100.71 |
| 5 | Mercedes-AMG GT3 EVO | 99.64 |
| 6 | Lamborghini Huracan GT3 EVO 2 | 28.62 |
| 7 | McLaren 720S GT3 EVO | 19.21 |
| 8 | Aston Martin Vantage GT3 | 0.0 |
| 9 | Ferrari 488 GT3 | 0.0 |


Identifying the overall leader or leaders by summed points is not really suitable, because some tracks, such as Spa, have more racing events than others, and this may skew our data.

## Best at Barcelona

In [120]:
barcelona_data = df_total_points[(df_total_points['Season'] == 2023) & (df_total_points['Meeting'] == 'Barcelona')].groupby('Car')['Points'].sum().sort_values(ascending=False).reset_index()
barcelona_data

| | Car | Points |
|---|---|---|
| 0 | Ferrari 296 GT3 | 32.69 |
| 1 | Mercedes-AMG GT3 EVO | 28.59 |
| 2 | Porsche 911 GT3 R (992) | 18.37 |
| 3 | McLaren 720S GT3 EVO | 4.05 |
| 4 | Audi R8 LMS GT3 EVO 2 | 2.23 |
| 5 | BMW M4 GT3 | 1.82 |
| 6 | Aston Martin Vantage GT3 | 0.0 |
| 7 | Lamborghini Huracan GT3 EVO 2 | 0.0 |
| 8 | Mercedes-AMG GT3 | 0.0 |

Next, let's run statistical tests to see whether any car is a statistically significant leader. Good options for us are ANOVA and Tukey's test.
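
A minimal sketch of such a test, assuming SciPy 1.8 or later is available (`scipy.stats.f_oneway` for one-way ANOVA and `scipy.stats.tukey_hsd` for Tukey's HSD). The per-race points below are made up for three hypothetical cars and are not taken from the dataset. Both tests assume roughly normal, independent samples with similar variances, which would need checking on the real data.

```python
from scipy import stats

# Made-up per-race points for three hypothetical cars (not from the dataset).
car_a = [30.1, 28.4, 33.0, 29.5, 31.2]
car_b = [12.3, 15.1, 10.8, 14.0, 13.5]
car_c = [11.9, 14.2, 12.7, 13.1, 15.0]

# One-way ANOVA: does at least one car's mean differ from the others?
f_stat, p_value = stats.f_oneway(car_a, car_b, car_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey's HSD: which specific pairs of cars differ?
tukey = stats.tukey_hsd(car_a, car_b, car_c)
print(tukey)

# tukey.pvalue[i, j] is the adjusted p-value for comparing group i with group j.
print(tukey.pvalue[0, 1] < 0.05)  # car_a vs car_b
print(tukey.pvalue[1, 2] < 0.05)  # car_b vs car_c
```

A small ANOVA p-value only says that some difference exists; the pairwise Tukey comparison is what indicates which cars stand apart.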