# **DNF F1 analysis project**

## Part 1 - Loading and manipulating data

In [278]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

In [279]:
# Load the data into a DataFrame
data = pd.read_csv('data.csv')
print(data.head(1).to_string(), "\n")
print("Shape of dataframe: ", data.shape)

   resultId  raceId  year  round  grid  positionOrder  points  laps milliseconds fastestLap rank fastestLapTime fastestLapSpeed  driverRef    surname forename         dob nationality_x constructorRef         name nationality_y   circuitRef  circuitId   name_y  location  country      lat      lng  alt        date  target_finish
0      2460     136  2002     13    11              4     3.0  77.0          NaN         \N   \N             \N              \N  raikkonen  Räikkönen     Kimi  1979-10-17       Finnish        mclaren  Hungaroring       British  hungaroring         11  McLaren  Budapest  Hungary  47.5789  19.2486  264  2002-08-18              1 

Shape of dataframe:  (10000, 31)


As we can see, we have 10,000 rows (cars) and 31 columns (features). Many of these columns contain NULL, NaN, or \N values, so we have to remove them to get features that can be used in our models.
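
One possible way to handle the `\N` placeholder already at load time is the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV with made-up values (not the project data). pandas treats the string `NULL` and empty fields as missing by default, so only `\N` has to be listed.

```python
import io

import pandas as pd

# Toy CSV with made-up values; "\N", "NULL" and the empty field stand in for missing entries
csv_text = "rank,laps\n1,77\n\\N,\nNULL,60\n"

# "NULL" and empty fields become NaN by default; "\N" must be listed explicitly
toy = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

With the placeholders parsed as NaN, a plain `isna()` is enough to count missing values, and the affected columns get a numeric dtype instead of `object`.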

In [280]:

# Count, per column, the values that are NaN, '\N' or 'NULL'
null_counts = (data.isna() | data.eq('\\N') | data.eq('NULL')).sum()

print(null_counts.sort_values(ascending=False))
print("Columncount:  ",len(data.columns))


milliseconds       7393
fastestLapSpeed    7191
fastestLap         6895
fastestLapTime     6895
rank               6798
laps                978
points              971
circuitRef            0
circuitId             0
name_y                0
location              0
country               0
resultId              0
name                  0
lat                   0
lng                   0
alt                   0
date                  0
nationality_y         0
forename              0
constructorRef        0
nationality_x         0
dob                   0
raceId                0
surname               0
driverRef             0
positionOrder         0
grid                  0
round                 0
year                  0
target_finish         0
dtype: int64
Columncount:   31


We remove the first five columns because they are leakage columns: they are recorded from the race itself, so they would give our prediction away too easily.
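
A quick way to see why such a column leaks is to compare how often it is missing within each target class. The sketch below runs on a toy frame with made-up values, not on the project data, and assumes that `target_finish` is 1 for a finished race and 0 otherwise.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the project data)
toy = pd.DataFrame({
    "milliseconds": [5400000.0, 5412000.0, np.nan, np.nan, np.nan],
    "target_finish": [1, 1, 1, 0, 0],
})

# Share of missing race times within each target class
missing_rate = toy["milliseconds"].isna().groupby(toy["target_finish"]).mean()
print(missing_rate)
```

If the missing rate differs strongly between the two classes, the mere presence of a value already tells the model a lot about the outcome.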

In [281]:
leakage_cols = [
    "milliseconds",
    "fastestLapSpeed",
    "fastestLap",
    "fastestLapTime",
    "rank",
]

# `columns=` already selects the axis, so `axis=1` is not needed
data = data.drop(columns=leakage_cols)

print("Columncount:  ", len(data.columns))


Columncount:   26


Our dataset now consists of 26 columns. We also need to remove "target_finish" from the feature set, as this is the label we are trying to predict.

In [282]:
# Remove target_finish, leaving 25 columns
data = data.drop(columns="target_finish")
print("Columncount:  ", len(data.columns))

Columncount:   25


In [283]:
data.columns

Index(['resultId', 'raceId', 'year', 'round', 'grid', 'positionOrder',
       'points', 'laps', 'driverRef', 'surname', 'forename', 'dob',
       'nationality_x', 'constructorRef', 'name', 'nationality_y',
       'circuitRef', 'circuitId', 'name_y', 'location', 'country', 'lat',
       'lng', 'alt', 'date'],
      dtype='object')

We are now left with 25 columns. Let us look at which of them are actually useful for our DNF prediction:

✅ Column decisions 

- resultId: **DROP**  – Unique ID, no predictive value

- raceId: **DROP** – Race identifier, not useful as a feature

- circuitId: **DROP** – Redundant with circuitRef

**Race metadata**

- year: **KEEP** – DNF rates vary between seasons

- round: **KEEP** – Early/late season influences DNF

- date: **KEEP** – Weather/season patterns

- country: **KEEP** – Conditions differ across countries

**Driver & team**

- driverRef: **KEEP** – Stable driver ID, good for OHE

- surname: **DROP** – High cardinality, redundant

- forename: **DROP** – High cardinality, redundant

- dob: **DROP** – Raw date, weak predictor

- nationality_x: **DROP** – Weak signal

- constructorRef: **KEEP** – Team strongly impacts DNF

- nationality_y: **DROP** – Team nationality irrelevant

**Performance stats**

- grid: **KEEP** – Starting position affects crash risk

- points: **KEEP** – Reflects skill/performance level (caution: points are awarded from the race result, so this column may also leak the outcome)

- positionOrder: **DROP** – Leakage (reveals final result)

- laps: **DROP** – Leakage, low laps = DNF

**Circuit info**

- circuitRef: **KEEP** – Tracks differ in DNF probability

- name: **DROP** – Circuit name, duplicate of circuitRef

- name_y: **DROP** – Constructor name, duplicate of constructorRef

- location: **DROP** – Text field, not useful

**Geographical**

- lat: **DROP** – Raw coordinate not meaningful

- lng: **DROP** – Same as above

- alt: **KEEP** – Circuit altitude
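
Two of the decisions above can be sketched in code: one-hot encoding `driverRef` and turning `date` into a coarser seasonal feature. The month feature is an assumption for illustration, since the notebook does not state how `date` is used later. The frame is a toy example, and `driver_b` and the last two dates are made up.

```python
import pandas as pd

# Toy frame; "driver_b" and the second and third dates are made up
toy = pd.DataFrame({
    "driverRef": ["raikkonen", "driver_b", "raikkonen"],
    "date": ["2002-08-18", "2002-03-03", "2003-03-23"],
})

# Derive a month feature from the date string
toy["month"] = pd.to_datetime(toy["date"]).dt.month

# One-hot encode the driver identifier
encoded = pd.get_dummies(toy, columns=["driverRef"])
print(encoded.columns.tolist())
```

`pd.get_dummies` creates one indicator column per distinct driver, which is why a stable identifier such as `driverRef` is preferable to separate name columns.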

In [284]:
keep_cols = [
    'year', 'round', 'grid', 'points',
    'driverRef', 'constructorRef',
    'circuitRef', 'country', 'alt', 'date',
    'target_finish',  # already dropped above, so the filter below skips it
]


existing_cols = [c for c in keep_cols if c in data.columns]

data = data[existing_cols]

print("Columns kept:", existing_cols)
print("New shape:", data.shape)

Columns kept: ['year', 'round', 'grid', 'points', 'driverRef', 'constructorRef', 'circuitRef', 'country', 'alt', 'date']
New shape: (10000, 10)


In [285]:
# Count, per column, the values that are NaN, '\N' or 'NULL'
null_counts = (data.isna() | data.eq('\\N') | data.eq('NULL')).sum()

print(null_counts.sort_values(ascending=False))
print("Columncount:  ",len(data.columns))

points            971
year                0
round               0
grid                0
driverRef           0
constructorRef      0
circuitRef          0
country             0
alt                 0
date                0
dtype: int64
Columncount:   10


Let us remove the rows that do not have a points value.

In [287]:
data = data.dropna(subset=['points'])
print("New shape: ", data.shape)

New shape:  (9029, 10)
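
Because `target_finish` was dropped before this row filter, the label is no longer part of `data`. One possible way to keep features and label aligned is to set the label aside first and then select it by the index of the surviving rows. This is a minimal sketch on a toy frame with made-up values, and the names `raw`, `features` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the project data)
raw = pd.DataFrame({
    "grid": [11, 3, 7],
    "points": [3.0, np.nan, 0.0],
    "target_finish": [1, 0, 1],
})

# Set the label aside before removing it from the features
target = raw["target_finish"]
features = raw.drop(columns="target_finish")

# Drop rows without points, then keep only the matching labels
features = features.dropna(subset=["points"])
target = target.loc[features.index]

print(features.shape, target.shape)
```

Selecting by `features.index` ensures that every remaining feature row still has its own label, even after rows have been removed.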
