# **DNF F1 analysis project**

## Part 1 - Loading and manipulating data

In [357]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

In [358]:
# Import the data into a dataframe
data = pd.read_csv('data.csv')
print(data.head(1).to_string(), "\n")
print("Shape of dataframe: ", data.shape)

   resultId  raceId  year  round  grid  positionOrder  points  laps milliseconds fastestLap rank fastestLapTime fastestLapSpeed  driverRef    surname forename         dob nationality_x constructorRef         name nationality_y   circuitRef  circuitId   name_y  location  country      lat      lng  alt        date  target_finish
0      2460     136  2002     13    11              4     3.0  77.0          NaN         \N   \N             \N              \N  raikkonen  Räikkönen     Kimi  1979-10-17       Finnish        mclaren  Hungaroring       British  hungaroring         11  McLaren  Budapest  Hungary  47.5789  19.2486  264  2002-08-18              1 

Shape of dataframe:  (10000, 31)


As we can see, we have 10,000 rows (one per car entry in a race) and 31 columns/features. Many of these columns contain NULL, NaN or \N values, so we have to remove these to get features that can be used in our models.
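
One option, sketched below on a small inline CSV rather than the real `data.csv`, is to let pandas convert these markers to NaN at load time through the `na_values` parameter of `pd.read_csv`, so that a single `isna()` check covers all of them. The sample rows and the column subset are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the markers seen in data.csv
sample_csv = io.StringIO(
    "grid,points,fastestLap,rank\n"
    "11,3.0,\\N,\\N\n"
    "23,,NULL,4\n"
    "5,10.0,44,1\n"
)

# Treat the literal strings \N and NULL as missing, on top of pandas' defaults
sample = pd.read_csv(sample_csv, na_values=["\\N", "NULL"])

# After loading, every kind of missing marker is a plain NaN
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the markers normalised this way, the separate `eq('\\N')` and `eq('NULL')` comparisons would no longer be needed.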

In [359]:

# Count, per column, the values that are NaN, '\N' or 'NULL'
null_counts = (data.isna() | data.eq('\\N') | data.eq('NULL')).sum()

print(null_counts.sort_values(ascending=False))
print("Columncount:  ",len(data.columns))


milliseconds       7393
fastestLapSpeed    7191
fastestLap         6895
fastestLapTime     6895
rank               6798
laps                978
points              971
circuitRef            0
circuitId             0
name_y                0
location              0
country               0
resultId              0
name                  0
lat                   0
lng                   0
alt                   0
date                  0
nationality_y         0
forename              0
constructorRef        0
nationality_x         0
dob                   0
raceId                0
surname               0
driverRef             0
positionOrder         0
grid                  0
round                 0
year                  0
target_finish         0
dtype: int64
Columncount:   31


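We decide to remove the five columns with the most missing values (milliseconds, fastestLapSpeed, fastestLap, fastestLapTime and rank). They are leakage columns: they are only recorded during or after the race, so they would give our prediction away too easily.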

In [360]:
leakage_cols = [
    "milliseconds",
    "fastestLapSpeed",
    "fastestLap",
    "fastestLapTime",
    "rank",
]

data = data.drop(columns=leakage_cols)

print("Columncount:  ", len(data.columns))


Columncount:   26
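
A quick way to see why such columns leak the outcome is to check how closely their missingness tracks the target. The sketch below uses a made-up six-row frame and assumes that `target_finish == 1` means the car finished; neither the rows nor that encoding are confirmed by the dataset, and `time_recorded` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a race time is only recorded for cars that finished
toy = pd.DataFrame({
    "milliseconds": [5_400_000, np.nan, 5_412_000, np.nan, np.nan, 5_430_000],
    "target_finish": [1, 0, 1, 0, 0, 1],  # assumed: 1 = finished, 0 = DNF
})

# Flag whether a race time exists for each row
toy["time_recorded"] = toy["milliseconds"].notna().map({True: "yes", False: "no"})

# Finish rate among rows with and without a recorded race time
finish_rate = toy.groupby("time_recorded")["target_finish"].mean()
print(finish_rate)
```

If the two groups have very different finish rates, as in this toy frame, the column mostly encodes the result itself rather than anything known before the race.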


Our dataset now consists of 26 columns. We also need to remove "target_finish" from the features, as this is what we are trying to predict.

In [361]:
# Remove target_finish, leaving 25 columns
data = data.drop(columns="target_finish")
print("Columncount:  ", len(data.columns))

Columncount:   25
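
If the labels are needed again for training, one option is to move them into their own Series instead of discarding them. The following is a sketch on a made-up frame, not the notebook's actual approach: `features` and `target` are names introduced here, and `pop` plus index-based realignment is just one way to keep rows and labels in sync after later row filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with the same column names as the notebook
toy = pd.DataFrame({
    "grid": [11, 23, 0, 19],
    "points": [3.0, np.nan, 0.0, 0.0],
    "target_finish": [1, 0, 0, 1],
})

# pop() removes the column from the frame and returns it as a Series
target = toy.pop("target_finish")

# Later row filtering (here: dropping rows without points) changes the index...
features = toy.dropna(subset=["points"])

# ...so the labels are realigned to the surviving rows by index
target = target.loc[features.index]

print(features.shape, target.shape)
```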


In [362]:
data.columns

Index(['resultId', 'raceId', 'year', 'round', 'grid', 'positionOrder',
       'points', 'laps', 'driverRef', 'surname', 'forename', 'dob',
       'nationality_x', 'constructorRef', 'name', 'nationality_y',
       'circuitRef', 'circuitId', 'name_y', 'location', 'country', 'lat',
       'lng', 'alt', 'date'],
      dtype='object')

We are now left with 25 columns. Let us look more closely at which columns are actually useful for our DNF prediction:

✅ Column decisions 

**Identifiers**

- resultId: **DROP**  – Unique ID, no predictive value

- raceId: **DROP** – Unique ID, not useful

- circuitId: **DROP** – Redundant with circuitRef

**Race metadata**

- year: **KEEP** – DNF rates vary between seasons

- round: **KEEP** – Early/late season influences DNF

- date: **KEEP** – Weather/season patterns

- country: **KEEP** – Conditions differ across countries

**Driver & team**

- driverRef: **KEEP** – Stable driver ID, suitable for categorical encoding

- surname: **DROP** – High cardinality, redundant

- forename: **DROP** – High cardinality, redundant

- dob: **DROP** – Raw date, weak predictor

- nationality_x: **DROP** – Weak signal

- constructorRef: **KEEP** – Team strongly impacts DNF

- nationality_y: **DROP** – Team nationality irrelevant

**Performance stats**

- grid: **KEEP** – Starting position affects crash risk

- points: **KEEP** – Reflects skill/performance level

- positionOrder: **DROP** – Leakage (reveals final result)

- laps: **DROP** – Leakage, low laps = DNF

**Circuit info**

- circuitRef: **KEEP** – Tracks differ in DNF probability

- name: **DROP** – Circuit name, duplicate of circuitRef

- name_y: **DROP** – Constructor name, duplicate of constructorRef

- location: **DROP** – Text field, not useful

**Geographical**

- lat: **DROP** – Raw coordinate not meaningful

- lng: **DROP** – Same as above

- alt: **KEEP** – Circuit altitude, already numeric

In [363]:
keep_cols = [
    'year', 'round', 'grid', 'points',
    'driverRef', 'constructorRef',
    'circuitRef', 'country', 'alt', 'date',
    'target_finish'
]


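# Keep only the columns that are still present; 'target_finish' was dropped above, so it is skipped here
existing_cols = [c for c in keep_cols if c in data.columns]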

data = data[existing_cols]

print("Columns kept:", existing_cols)
print("New shape:", data.shape)

Columns kept: ['year', 'round', 'grid', 'points', 'driverRef', 'constructorRef', 'circuitRef', 'country', 'alt', 'date']
New shape: (10000, 10)


In [364]:
# Count, per column, the values that are NaN, '\N' or 'NULL'
null_counts = (data.isna() | data.eq('\\N') | data.eq('NULL')).sum()

print(null_counts.sort_values(ascending=False))
print("Columncount:  ",len(data.columns))

points            971
year                0
round               0
grid                0
driverRef           0
constructorRef      0
circuitRef          0
country             0
alt                 0
date                0
dtype: int64
Columncount:   10


Let us remove the rows that do not have a points value.

In [365]:
data = data.dropna(subset=['points'])
print("New shape: ", data.shape)

New shape:  (9029, 10)


Next, we need to find out which columns hold strings (object dtype), since our prediction models can only work with numeric input.

In [366]:
string_cols = data.select_dtypes(include=['object', 'string']).columns.tolist()
print(string_cols)

['driverRef', 'constructorRef', 'circuitRef', 'country', 'date']


In [367]:
# Number of unique values for each of our string features
unique_counts = data[string_cols].nunique()
print(unique_counts)

driverRef          663
constructorRef     173
circuitRef          77
country             35
date              1125
dtype: int64


As we can see, one-hot encoding is not practical here, since it would create far too many new features: 663 + 173 + 77 + 35 + 1125 = 2073, to be precise.
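
The feature count can be checked on a small scale: with `pd.get_dummies`, each categorical column expands into one indicator column per distinct value, so the total width is the sum of the unique counts. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical categorical columns with 3, 2 and 2 distinct values
toy = pd.DataFrame({
    "driverRef": ["raikkonen", "hamilton", "alonso", "hamilton"],
    "constructorRef": ["mclaren", "mercedes", "mclaren", "mercedes"],
    "country": ["Hungary", "Hungary", "Italy", "Italy"],
})

# One indicator column is created per distinct value in each column
encoded = pd.get_dummies(toy)

# The resulting width equals the sum of the per-column unique counts
expected_width = int(toy.nunique().sum())
print(encoded.shape[1], expected_width)
```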

We can start by splitting date into month and day (year is already a column of its own).

In [368]:
data['date'] = pd.to_datetime(data['date'])
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day

# DROP date after extracting features
data = data.drop(columns=['date'])


print(data.columns)
print("New shape: ", data.shape)

Index(['year', 'round', 'grid', 'points', 'driverRef', 'constructorRef',
       'circuitRef', 'country', 'alt', 'month', 'day'],
      dtype='object')
New shape:  (9029, 11)
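
The same extraction on a few literal date strings shows what the new columns contain. The first two dates are written in the style of the `date` column and the third is deliberately malformed; `errors="coerce"` is an optional extra, not used in the notebook, that turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

# Two valid ISO dates and one malformed entry (hypothetical)
raw = pd.Series(["2002-08-18", "1981-03-15", "not-a-date"])

# errors="coerce" maps anything unparseable to NaT
parsed = pd.to_datetime(raw, errors="coerce")

# Extract the calendar parts; rows with NaT yield NaN
parts = pd.DataFrame({
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(parts)
```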


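Let us use frequency encoding on the 4 remaining string columns. Each category is replaced by the share of rows in which it appears, giving values between 0 and 1. As an illustration with made-up numbers, a driver who appears in 1000 of the 9029 rows would be encoded as 1000/9029 ≈ 0.1108.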

In [369]:
# Find all categorical columns (object dtype)
cat_cols = data.select_dtypes(include=['object']).columns
print("Categorical columns:", cat_cols)

# Frequency encode each column
for col in cat_cols:
    freq = data[col].value_counts(normalize=True)   # normalized frequency
    data[col] = data[col].map(freq)

data.head()

Categorical columns: Index(['driverRef', 'constructorRef', 'circuitRef', 'country'], dtype='object')


   year  round  grid  points  driverRef  constructorRef  circuitRef   country  alt  month  day
0  2002     13    11     3.0   0.014730        0.072987    0.037324  0.037324  264      8   18
1  1981      1    23     0.0   0.006313        0.072987    0.009414  0.079189   12      3   15
2  1958      8     0     0.0   0.000332        0.016281    0.035884  0.072323  578      8    3
3  2021      8    19     0.0   0.001551        0.014952    0.033448  0.034002  678      6   27
4  1988     12     0     0.0   0.003987        0.004098    0.068668  0.099568  162      9   11
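
The loop above derives the frequencies from every row. If the data is later split into training and test sets, a common variant is to compute the frequencies on the training rows only and reuse that mapping on the test rows, filling categories never seen in training with 0. The sketch below assumes such a split; the toy rows and the helper name `fit_frequency_map` are made up for illustration.

```python
import pandas as pd


def fit_frequency_map(column: pd.Series) -> pd.Series:
    """Return the relative frequency of each category in `column`."""
    return column.value_counts(normalize=True)


# Hypothetical training and test rows
train = pd.DataFrame({"driverRef": ["hamilton", "hamilton", "alonso", "raikkonen"]})
test = pd.DataFrame({"driverRef": ["hamilton", "newcomer"]})

# Frequencies are learned from the training rows only
freq = fit_frequency_map(train["driverRef"])

# Map categories to their training frequency; unseen ones become 0.0
train_encoded = train["driverRef"].map(freq)
test_encoded = test["driverRef"].map(freq).fillna(0.0)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

Fitting on the training rows alone keeps information about the test rows out of the encoded features.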


Let's go! Our dataset is ready, and we can now move on to Part 2: training our models!

## Part 2 - Training our models