In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

### Loading data and printing out head & shape ###

In [19]:
# Import the data into a DataFrame
raw_data = pd.read_csv('data.csv')
print(raw_data.head(10).to_string())
print("Shape: ", raw_data.shape)

   resultId  raceId  year  round  grid  positionOrder  points  laps milliseconds fastestLap rank fastestLapTime fastestLapSpeed        driverRef     surname forename         dob nationality_x constructorRef                           name nationality_y     circuitRef  circuitId        name_y        location    country      lat        lng  alt        date  target_finish
0      2460     136  2002     13    11              4     3.0  77.0          NaN         \N   \N             \N              \N        raikkonen   Räikkönen     Kimi  1979-10-17       Finnish        mclaren                    Hungaroring       British    hungaroring         11       McLaren        Budapest    Hungary  47.5789   19.24860  264  2002-08-18              1
1     11565     483  1981      1    23             21     0.0  16.0           \N         \N   \N             \N              \N           watson      Watson     John  1946-05-04       British        mclaren                     Long Beach       British     lo
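
The output above shows `\N` used as a missing-value marker. As an alternative to replacing such markers after loading, they can be declared when reading the file. This is a minimal sketch on a made-up three-column CSV (the `csv_text` contents are hypothetical, not taken from `data.csv`), using the `na_values` parameter of `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the '\N' and 'NULL' markers seen in the data
csv_text = "driverRef,laps,milliseconds\nraikkonen,77,\\N\nwatson,16,NULL\n"

# Treat these strings as missing values while parsing
missing_markers = ["\\N", "NULL", "null", ""]
df = pd.read_csv(io.StringIO(csv_text), na_values=missing_markers)

print(df)
print(df.isnull().sum())  # NA count per column
```

Declaring the markers at load time also lets pandas infer numeric dtypes for columns that would otherwise be read as strings.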

### Print overview of columns containing NA values, and drop columns with fewer than 90% non-NA values ###

In [42]:
# Make sure other missing-data markers are also registered as missing
missing_markers = ['\\N', 'NULL', 'null', '']
raw_data = raw_data.replace(missing_markers, np.nan)

nullValues = raw_data.isnull().sum()
print("Columns with NA and NA count: ")
print(nullValues[nullValues > 0])
print("\nShape with 90% threshold for dropping column:")
# thresh is the minimum number of non-NA values a column needs in order to be kept
thresh = round(0.9*raw_data.shape[0])
trimmed_raw_data = raw_data.dropna(axis=1, thresh=thresh)
print(trimmed_raw_data.shape)

Columns with NA and NA count: 
points              971
laps                978
milliseconds       7393
fastestLap         6895
rank               6798
fastestLapTime     6895
fastestLapSpeed    7191
dtype: int64

Shape with 90% threshold for dropping column:
(10000, 26)
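
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-NA values required to keep a column, not the share of NA values that triggers a drop. A small sketch on a toy frame (the column names and values are made up for illustration) shows the same rule as above:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 10 rows, columns with 0%, 10% and 70% missing values
toy = pd.DataFrame({
    "full": range(10),
    "mostly_full": [np.nan] + list(range(9)),
    "mostly_empty": [np.nan] * 7 + [1.0, 2.0, 3.0],
})

# Same rule as in the cell above: keep columns with at least 90% non-NA values
thresh = round(0.9 * toy.shape[0])
kept = toy.dropna(axis=1, thresh=thresh)

print(thresh)              # minimum number of non-NA values per column
print(list(kept.columns))  # columns that survive the threshold
```

Here the column with 10% missing values is kept and the one with 70% missing values is dropped, which matches what happens to `points`/`laps` versus `milliseconds` in the real data.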


### Dropping rows with NA values ###

In [43]:
trimmed_raw_data = trimmed_raw_data.dropna()
print("Shape after dropping columns with <90% non-NA values and rows with NA values:\n", trimmed_raw_data.shape)

Shape after dropping columns with <90% non-NA values and rows with NA values:
 (8155, 26)


### Printing out description of dataframe, and ranked correlation between numerical features for analysis ###

In [44]:
print(trimmed_raw_data.describe().to_string())
print("\nCorrelation between numerical features and target_finish ranked on abs value")
print(trimmed_raw_data.corr(numeric_only=True)['target_finish'].sort_values(key=abs, ascending=False)[1:])

           resultId       raceId         year        round         grid  positionOrder       points         laps    circuitId          lat          lng          alt  target_finish
count   8155.000000  8155.000000  8155.000000  8155.000000  8155.000000    8155.000000  8155.000000  8155.000000  8155.000000  8155.000000  8155.000000  8155.000000    8155.000000
mean   13441.308032   554.213857  1991.561496     8.557940    11.199142      12.701288     2.010072    46.764562    23.894298    34.221987     4.866133   283.150092       0.287676
std     7758.834745   314.591211    20.045827     5.063339     7.248409       7.595927     4.429293    29.781355    19.163225    25.026052    57.453667   416.136823       0.452707
min        9.000000     1.000000  1950.000000     1.000000     0.000000       1.000000     0.000000     0.000000     1.000000   -37.849700  -118.189000    -7.000000       0.000000
25%     6633.500000   300.000000  1977.000000     4.000000     5.000000       6.000000     0.000000 
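
The ranking step can be tried in isolation on a toy frame. This sketch uses made-up values and drops the target by label instead of slicing with `[1:]`, which is an assumption about what is more robust, not a change to the cell above:

```python
import pandas as pd

# Toy frame (hypothetical values), with one non-numerical column to show numeric_only
toy = pd.DataFrame({
    "grid": [1, 5, 10, 15, 20, 22],
    "year": [1990, 2001, 1985, 2010, 1995, 2005],
    "driver": list("abcdef"),
    "target_finish": [1, 1, 1, 0, 0, 0],
})

# Correlation of every numerical column with the target, without the target itself
corr = toy.corr(numeric_only=True)["target_finish"].drop("target_finish")

# Rank by absolute value so strong negative correlations are not pushed to the bottom
ranked = corr.sort_values(key=abs, ascending=False)
print(ranked)
```

Dropping `target_finish` by name avoids relying on it always sorting first.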

### Further feature selection ###

Some columns are leakage features, meaning they are (almost) perfectly correlated with the target we are trying to predict. These need to be removed, as they would make the predictions too "easy".

Some of these are also measured after the target value is determined, so they would not be available at prediction time and cannot legitimately be used as predictors. Examples are "positionOrder" and "points", which depend on whether the driver finishes the race or not.
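
One rough way to spot candidate leakage features is to flag numerical columns whose absolute correlation with the target is close to 1. The sketch below uses a toy frame and a hypothetical cutoff `LEAK_CUTOFF`; both are assumptions for illustration, and any flagged column still needs a manual check of when it is measured:

```python
import pandas as pd

# Toy frame (hypothetical values): positionOrder is almost determined by the target
toy = pd.DataFrame({
    "grid": [1, 5, 10, 15, 20, 22],
    "positionOrder": [1, 2, 3, 18, 19, 20],
    "target_finish": [1, 1, 1, 0, 0, 0],
})

LEAK_CUTOFF = 0.95  # hypothetical cutoff, chosen only for this example

# Absolute correlation of each numerical feature with the target
abs_corr = toy.corr(numeric_only=True)["target_finish"].drop("target_finish").abs()

# Columns above the cutoff are suspects, not confirmed leaks
suspects = abs_corr[abs_corr > LEAK_CUTOFF].index.tolist()
print(suspects)
```

A high correlation alone does not prove leakage, and a leaking column can also have a modest linear correlation, so this is only a first filter.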

We also need to decide which of the non-numerical columns to discard even if they aren't leakage features, simply because their values are too diverse relative to the amount of data. As an example, the forename of the driver will be too sparse for the model to learn which drivers are reckless and which aren't.
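
To make "too diverse relative to the amount of data" concrete, one option is to count the distinct values per non-numerical column and compare that with the number of rows. A minimal sketch on a toy frame (values are made up; the column names are borrowed from the dataset only for readability):

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "forename": ["Kimi", "John", "Alain", "Ayrton", "Nigel", "Niki"],
    "nationality_x": ["Finnish", "British", "French", "Brazilian", "British", "Austrian"],
    "country": ["Hungary", "USA", "USA", "Hungary", "USA", "Hungary"],
})

# Keep only the non-numerical columns
categorical = toy.select_dtypes(exclude="number")

# Average number of rows per distinct value; low numbers mean sparse categories
rows_per_value = (len(toy) / categorical.nunique()).sort_values()
print(rows_per_value)
```

Columns with very few rows per distinct value give the model little to learn from for each category.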

We will also remove the geographical features.

In [46]:
f = ["points", "resultId", "positionOrder", "lng", "lat"]