In [105]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import datetime

In [106]:
results = pd.read_csv('../data/results.csv')
races = pd.read_csv('../data/races.csv')
circuits = pd.read_csv('../data/circuits.csv')

## Question 2
### At which tracks does starting position have a higher impact on the final position?

In Formula 1 there are many different types of circuits. On the one hand, there are high-speed circuits with long straights and fast corners, such as Monza, Italy or Spa-Francorchamps, Belgium. On the other hand, there are twisty street circuits like Monte-Carlo, Monaco. Furthermore, there are circuits whose layout makes it generally hard to overtake the car in front (for various reasons).

We hypothesise that the starting position (also called 'grid position' or 'grid') affects the finishing position the most on street circuits and on circuits with a layout that does not support easy overtakes.

### Merging

In [108]:
# merge results and races to obtain circuitId and date
results_copy = results[['resultId', 'raceId', 'grid', 'positionOrder', 'positionText', 'statusId']].merge(
    races[['raceId', 'circuitId', 'date']],
    on='raceId'
)

# converting datatypes
results_copy['grid'] = pd.to_numeric(results_copy['grid'], errors='coerce')
results_copy['positionOrder'] = pd.to_numeric(results_copy['positionOrder'], errors='coerce')
results_copy['date'] = pd.to_datetime(results_copy['date'])

### Selecting

The selected data
- only contains results from 2000 onwards, because changing car concepts and the general evolution of the cars made it easier or harder to overtake on certain tracks over time.
- does not contain drivers who started from the pit lane (e.g. due to a penalty, recorded as grid 0), since the pit lane is not an official grid position.
- only contains results of drivers who started and finished within the top 10. That's our way of keeping the influence of retiring cars as small as possible.

In [109]:
# only care about races after 2000
results_copy = results_copy[results_copy['date'] > datetime.datetime(2000, 1, 1)]

# remove pit lane starters
results_copy = results_copy[results_copy['grid'] != 0]

# only take cars into account that started and finished within the top 10
results_copy = results_copy[results_copy['grid'] <= 10]
results_copy = results_copy[results_copy['positionOrder'] <= 10]

# display dataframe
print(results_copy.shape)
results_copy.head()

(3170, 8)


Unnamed: 0,resultId,raceId,grid,positionOrder,positionText,statusId,circuitId,date
0,1,18,1,1,1,1,1,2008-03-16
1,2,18,5,2,2,1,1,2008-03-16
2,3,18,7,3,3,1,1,2008-03-16
4,5,18,3,5,5,1,1,2008-03-16
8,9,18,2,9,R,4,1,2008-03-16


### Find out for which circuit the starting position matters most

In [110]:
# find out number of races for all circuits
race_per_circuit = results_copy.groupby('circuitId').agg(
    num_races = ('raceId', 'nunique')
).reset_index()

# display dataframe
print(race_per_circuit.shape)
race_per_circuit.head()

(37, 2)


Unnamed: 0,circuitId,num_races
0,1,21
1,2,18
2,3,20
3,4,23
4,5,9


In [111]:
# for all circuit-grid combinations, find the average finishing position
circuit_grid_grouped = results_copy.groupby(['circuitId', 'grid']).agg(
    positionOrder = ('positionOrder', 'mean')
).reset_index()

# and calculate the difference
circuit_grid_grouped['delta'] = circuit_grid_grouped['grid'] - circuit_grid_grouped['positionOrder']

# display dataframe
print(circuit_grid_grouped.shape)
circuit_grid_grouped.head()

(357, 4)


Unnamed: 0,circuitId,grid,positionOrder,delta
0,1,1,2.176471,-1.176471
1,1,2,3.470588,-1.470588
2,1,3,3.235294,-0.235294
3,1,4,4.1875,-0.1875
4,1,5,4.214286,0.785714


In [112]:
# group by circuit
circuit_grouped = circuit_grid_grouped.groupby('circuitId').agg(
    mean_delta = ('delta', 'mean')
).reset_index()

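# only take into account circuits with 10 or more races
# (filter on a merged column instead of a mask taken from another dataframe)
circuit_grouped = circuit_grouped.merge(race_per_circuit, on='circuitId')
circuit_grouped = circuit_grouped[circuit_grouped['num_races'] >= 10].drop(columns='num_races')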

# merge with circuits to obtain location and country of the circuit
circuit_grouped = circuit_grouped.merge(circuits[['circuitId', 'location', 'country']], on='circuitId')

### Results

In [113]:
# display results
circuit_grouped.sort_values(by='mean_delta')

Unnamed: 0,circuitId,mean_delta,location,country
17,24,0.266008,Abu Dhabi,UAE
12,17,0.313844,Shanghai,China
7,10,0.36539,Hockenheim,Germany
8,11,0.396813,Budapest,Hungary
2,3,0.492332,Sakhir,Bahrain
13,18,0.507398,São Paulo,Brazil
18,69,0.520833,Austin,USA
6,9,0.540136,Silverstone,UK
5,7,0.5613,Montreal,Canada
0,1,0.576015,Melbourne,Australia


Looking at the results, one can see that Abu Dhabi, Shanghai and Hockenheim are the circuits where, on average, the finishing position differs the least from the starting position. One could say that on those circuits starting position has the highest impact.
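
One caveat on reading `mean_delta` this way: it averages signed differences, so positions gained and positions lost can cancel each other out. The sketch below uses made-up numbers (not taken from the dataset) to show how a signed mean of zero can hide large position changes. The mean absolute difference shown next to it is an alternative measure that is not used in the analysis above; it is included only for comparison.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from results.csv).
# Circuit A: everyone finishes where they started.
# Circuit B: the top four finish in exactly reversed order.
toy = pd.DataFrame({
    'circuit': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'grid': [1, 2, 3, 4, 1, 2, 3, 4],
    'positionOrder': [1, 2, 3, 4, 4, 3, 2, 1],
})

# Same sign convention as in the notebook: grid minus finishing position.
toy['delta'] = toy['grid'] - toy['positionOrder']
# Alternative measure (an assumption, not part of the original analysis).
toy['abs_delta'] = toy['delta'].abs()

summary = toy.groupby('circuit').agg(
    mean_delta=('delta', 'mean'),
    mean_abs_delta=('abs_delta', 'mean'),
)
print(summary)
```

Both toy circuits have a signed mean of zero, but only the absolute measure separates the circuit where nothing changes from the one where the order is reversed.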

As stated in the beginning, we expected street circuits like Monaco, or circuits where it's generally hard to overtake, to lead this list. However, both kinds of circuit find themselves in the lower part of the list, which indicates that starting position does not matter as much there as expected. We assume this is because such circuits, which are usually very demanding for the driver both physically and mentally, see a lot of retiring cars (due to crashes, collisions or other kinds of damage), which affects the finishing position of the other cars.
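
The retirement assumption could be checked by comparing the share of retired cars per circuit. The following is a minimal sketch on invented rows, assuming that `positionText == 'R'` marks a retirement (as the sample output above suggests). The circuit IDs are arbitrary placeholders. On the real data this would have to run on the merged results before the top-10 filter, since that filter removes most retired cars.

```python
import pandas as pd

# Toy rows, invented for illustration; column names mirror results_copy.
# The circuitId values are arbitrary placeholders.
toy = pd.DataFrame({
    'circuitId': [6, 6, 6, 6, 14, 14, 14, 14],
    'positionText': ['1', 'R', 'R', '2', '1', '2', '3', 'R'],
})

# Assumption: positionText == 'R' marks a retirement.
toy['retired'] = toy['positionText'].eq('R')

# The mean of a boolean column is the share of True values.
retirement_share = (
    toy.groupby('circuitId')['retired']
    .mean()
    .rename('retirement_share')
    .reset_index()
)
print(retirement_share)
```

Circuits with a high retirement share and a large `mean_delta` would be consistent with the explanation given above, although this alone would not prove it.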