# **Data Preprocessing**

In this notebook we prepare the data for modelling.
Preprocessing takes place in six stages:

* **1. Loading in the Data**
* **2. Renaming the Dataframe Fields**
* **3. Merging the Dataframes**
* **4. Creating New Variables**
* **5. Filling in Missing Values**
* **6. Adding Dummy Variables**

After preprocessing is complete, the data will be ready for training a machine learning model.

## **Dependencies**

In [1]:
import pandas as pd
import numpy as np

## **1. Loading in the Data**

We read our six dataframes directly from GitHub, so other users do not have to download the CSV files locally.

In [3]:
race_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/race.csv'
races = pd.read_csv(race_url)

results_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/results.csv'
results = pd.read_csv(results_url)

qualifying_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/qualifying_results.csv'
qualifying = pd.read_csv(qualifying_url)

driver_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/driver_standings.csv'
driver_standings = pd.read_csv(driver_url)

constructor_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/constructor_standings.csv'
constructor_standings = pd.read_csv(constructor_url)

weather_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/weather_info.csv'
weather = pd.read_csv(weather_url)
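
The six reads above follow the same pattern, so they could be collapsed into a loop. The following is a minimal sketch rather than the notebook's own code: `BASE_URL`, `FILES`, and `build_urls` are names introduced here for illustration, and the download itself is left commented out so the snippet runs without network access.

```python
import pandas as pd

# Assumption: all six files sit in the same raw GitHub folder, as in the cell above.
BASE_URL = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/'

# Hypothetical mapping from the dataframe names used in this notebook to file names.
FILES = {
    'races': 'race.csv',
    'results': 'results.csv',
    'qualifying': 'qualifying_results.csv',
    'driver_standings': 'driver_standings.csv',
    'constructor_standings': 'constructor_standings.csv',
    'weather': 'weather_info.csv',
}


def build_urls(base_url, files):
    """Return a dict mapping each dataframe name to its full CSV URL."""
    return {name: base_url + filename for name, filename in files.items()}


urls = build_urls(BASE_URL, FILES)

# Uncomment to download the data (requires network access):
# frames = {name: pd.read_csv(url) for name, url in urls.items()}
# races, weather = frames['races'], frames['weather']
```

Keeping the file names in one dictionary means a renamed or relocated file only needs to be changed in one place.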

## **2. Renaming the Dataframe Fields**

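Before we can merge the dataframes, we need to rename any fields that hold the same information under different names. Here, `grid_position` in the qualifying dataframe becomes `grid`, the name used as a merge key when the dataframes are combined: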

In [14]:
qualifying.rename(columns = {'grid_position': 'grid'}, inplace = True)

print(qualifying.shape)
qualifying.head()

(14559, 6)


Unnamed: 0,grid,driver_name,car,qualifying_time,season,round
0,1,Keke Rosberg ROS,Williams Honda,1:34.526,1983,1
1,2,Alain Prost PRO,Renault,1:34.672,1983,1
2,3,Patrick Tambay TAM,Ferrari,1:34.758,1983,1
3,4,Nelson Piquet PIQ,Brabham BMW,1:35.114,1983,1
4,5,Derek Warwick WAR,Toleman Hart,1:35.206,1983,1
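
As a hedged illustration of the renaming step, the sketch below applies the same `rename` call to a small stand-in dataframe and then checks that the columns later used as merge keys are present. `qualifying_demo`, `renamed`, and `merge_keys` are names made up for this example; the rows are copied from the output above.

```python
import pandas as pd

# Toy stand-in for the qualifying dataframe (first two rows of the output above).
qualifying_demo = pd.DataFrame({
    'grid_position': [1, 2],
    'driver_name': ['Keke Rosberg ROS', 'Alain Prost PRO'],
    'season': [1983, 1983],
    'round': [1, 1],
})

# Without inplace=True, rename returns a new dataframe and leaves the original untouched.
renamed = qualifying_demo.rename(columns={'grid_position': 'grid'})

# Check that every column used as a merge key for qualifying is present.
merge_keys = ['season', 'round', 'grid']
missing = [key for key in merge_keys if key not in renamed.columns]
assert not missing, f'missing merge keys: {missing}'
```

Returning a new dataframe instead of renaming in place makes the cell safe to re-run, at the cost of holding a second reference to the data.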


## **3. Merging the Dataframes**

We can now begin merging the dataframes. Since no single one-to-one identifier is shared by all of the dataframes, we merge them two at a time on the keys each pair has in common, repeating until all six are combined into a single dataframe:

In [23]:
# Join races with weather, then attach the race results.
df1 = pd.merge(races, weather, how='inner', on=['season', 'round', 'circuit_id']).drop(['lat', 'long', 'country', 'weather'], axis=1)
df2 = pd.merge(df1, results, how='inner', on=['season', 'round', 'circuit_id']).drop(['url', 'points', 'status', 'time'], axis=1)

# Left joins keep every result row even when no standings entry matches.
df3 = pd.merge(df2, driver_standings, how='left', on=['season', 'round', 'driver'])
df4 = pd.merge(df3, constructor_standings, how='left', on=['season', 'round', 'constructor'])  # from 1958

# Match qualifying rows by grid position, then drop the qualifying name columns.
final_df = pd.merge(df4, qualifying, how='inner', on=['season', 'round', 'grid']).drop(['driver_name', 'car'], axis=1)
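
The merges above mix `how='inner'` and `how='left'`, which treat unmatched rows differently. The sketch below shows the difference on toy data and uses two optional `pd.merge` parameters, `indicator` and `validate`, to make the outcome of a merge visible. The frames `results_demo` and `standings_demo` are made up for this example, and the `driver_points` values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy data: three result rows, but standings for only two of the drivers.
results_demo = pd.DataFrame({
    'season': [1983, 1983, 1983],
    'round': [1, 1, 1],
    'driver': ['piquet', 'lauda', 'laffite'],
    'grid': [4, 9, 18],
})
standings_demo = pd.DataFrame({
    'season': [1983, 1983],
    'round': [1, 1],
    'driver': ['piquet', 'lauda'],
    'driver_points': [9.0, 6.0],  # illustrative values only
})

keys = ['season', 'round', 'driver']

# how='inner' keeps only rows whose key appears in both frames.
inner = pd.merge(results_demo, standings_demo, how='inner', on=keys)

# how='left' keeps every row of the left frame; unmatched rows get NaN.
# indicator=True adds a '_merge' column recording where each row was found.
# validate='one_to_one' raises an error if a key is duplicated on either side.
left = pd.merge(results_demo, standings_demo, how='left', on=keys,
                indicator=True, validate='one_to_one')

print(len(inner), len(left))
print(left[['driver', 'driver_points', '_merge']])
```

In this toy case the inner merge drops the driver without a standings entry, while the left merge keeps that row with a missing `driver_points` value, which is the kind of gap a later missing-value step has to fill.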

In [24]:
final_df

Unnamed: 0,season,round,circuit_id,date,weather_warm,weather_cold,weather_dry,weather_wet,weather_cloudy,driver,...,constructor,grid,podium,driver_points,driver_wins,driver_standings_pos,constructor_points,constructor_wins,constructor_standings_pos,qualifying_time
0,1983,1,jacarepagua,1983-03-13,0,0,1,0,0,piquet,...,brabham,4,1,0.0,0.0,0.0,0.0,0.0,0.0,1:35.114
1,1983,1,jacarepagua,1983-03-13,0,0,1,0,0,lauda,...,mclaren,9,2,0.0,0.0,0.0,0.0,0.0,0.0,1:36.054
2,1983,1,jacarepagua,1983-03-13,0,0,1,0,0,laffite,...,williams,18,3,0.0,0.0,0.0,0.0,0.0,0.0,1:38.234
3,1983,1,jacarepagua,1983-03-13,0,0,1,0,0,tambay,...,ferrari,3,4,0.0,0.0,0.0,0.0,0.0,0.0,1:34.758
4,1983,1,jacarepagua,1983-03-13,0,0,1,0,0,surer,...,arrows,20,5,0.0,0.0,0.0,0.0,0.0,0.0,1:38.468
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14531,2019,21,yas_marina,2019-12-01,1,0,0,0,0,giovinazzi,...,alfa,16,16,14.0,0.0,17.0,57.0,0.0,8.0,1:38.114
14532,2019,21,yas_marina,2019-12-01,1,0,0,0,0,russell,...,williams,18,17,0.0,0.0,20.0,1.0,0.0,10.0,1:38.717
14533,2019,21,yas_marina,2019-12-01,1,0,0,0,0,gasly,...,toro_rosso,11,18,95.0,0.0,6.0,83.0,0.0,6.0,1:37.089
14534,2019,21,yas_marina,2019-12-01,1,0,0,0,0,kubica,...,williams,19,19,1.0,0.0,19.0,1.0,0.0,10.0,1:39.236
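
After a chain of merges it can be worth checking that no rows were duplicated along the way. The sketch below is one possible check on a toy frame, assuming the intended grain is one row per driver per race; `merged_demo` and `key` are names introduced for this example, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for a merged frame; the last row is a deliberate duplicate.
merged_demo = pd.DataFrame({
    'season': [1983, 1983, 1983],
    'round': [1, 1, 1],
    'driver': ['piquet', 'lauda', 'lauda'],
    'grid': [4, 9, 9],
})

# Assumption: one row per driver per race is the intended grain.
key = ['season', 'round', 'driver']

# duplicated() flags every repeat of a key after its first occurrence.
n_duplicates = int(merged_demo.duplicated(subset=key).sum())

# drop_duplicates() keeps the first occurrence of each key by default.
deduplicated = merged_demo.drop_duplicates(subset=key)

print(n_duplicates, len(deduplicated))
```

The same two calls could be run on `final_df` to confirm that the merge on `grid` did not match any result row to more than one qualifying row.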
