---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---


# Code 


## Clean the News Data

In [1]:
# import required libraries
import os
import json
import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nandinikodali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from sklearn.preprocessing import StandardScaler

In [37]:
import re
from nltk.corpus import stopwords

# Load English stop words
stop_words = set(stopwords.words('english'))

def string_cleaner(input_string):
    try:
        # Remove unwanted punctuation
        out = re.sub(r"[.,;@#?!&$-]+", " ", input_string)  # Replace punctuation with space
        
        # Remove escape characters like \u201e
        out = re.sub(r'\\u[0-9a-fA-F]{4}', '', out)
        
        # Remove extra whitespace
        out = re.sub(r'\s+', ' ', out).strip()
        
        # Convert to lowercase
        out = out.lower()
        
        # Remove stop words and words of length <= 3
        words = out.split()
        words = [word for word in words if len(word) > 3 and word not in stop_words]
        
        # Join words back into a single string
        out = ' '.join(words)
        
    except Exception as e:
        print(f"Error cleaning string: {e}")
        out = ''
    
    return out

# Example Usage
text = r"thema kdssdn\u201eformel 1\u201c lesen sie jetzt j\u201everstappen im training"
cleaned_text = string_cleaner(text)
print(f"Cleaned Text: {cleaned_text}")


Cleaned Text: thema kdssdnformel lesen jetzt jverstappen training


In [44]:
# function for cleaning the text
stop_words = set(stopwords.words('english'))

def string_cleaner(input_string):
    try:
        out = re.sub(r"""
                    [,.;@#?!&$-]+
                    \ *          
                    """,
                    " ",
                    input_string, flags=re.VERBOSE)
        out = re.sub(r'\\u[0-9a-fA-F]{4}', '', out)
        out = re.sub('[’.]+', '', out)

        out = re.sub(r'\s+', ' ', out)
        out = out.lower()
        words = out.split()
        words = [
            word for word in words
            if len(word) > 3 and word not in stop_words and not re.search(r'\d', word)
        ]
        out = ' '.join(words)
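    except Exception as e:
        # Report what went wrong instead of silently swallowing every error
        print(f"Error cleaning string: {e}")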
        out = ''
    return out
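
The cleaning steps can be tried on a single headline without the NLTK download. The sketch below mirrors the function above, but it uses a tiny stand-in stop-word set (`DEMO_STOP_WORDS`) and a made-up sample sentence; both are assumptions introduced only for this illustration.

```python
import re

# Stand-in for the NLTK English stop words (assumption for this example only)
DEMO_STOP_WORDS = {"this", "that", "with", "from"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    # Replace runs of punctuation (and any spaces that follow) with one space
    out = re.sub(r"[,.;@#?!&$-]+ *", " ", text)
    # Drop literal escape sequences such as \u2019
    out = re.sub(r"\\u[0-9a-fA-F]{4}", "", out)
    # Drop curly apostrophes and leftover periods
    out = re.sub(r"[’.]+", "", out)
    # Collapse whitespace and lowercase
    out = re.sub(r"\s+", " ", out).lower()
    # Keep words longer than 3 characters that are not stop words
    # and contain no digits
    kept = [
        w for w in out.split()
        if len(w) > 3 and w not in stop_words and not re.search(r"\d", w)
    ]
    return " ".join(kept)

# Made-up headline; the raw string keeps \u2019 as literal characters
sample = r"Verstappen wins with P1, Hamilton\u2019s 2nd place from 2021!"
print(clean_text_demo(sample))
```

Short tokens such as `P1` and `2nd` are removed by the length filter, and `2021` is removed by the digit filter, so race numbers and years do not survive this cleaning.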

In [45]:
# Function to clean news data
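def clean_news_data(raw_data_dir, clean_data_dir):
    # Make sure the output directory exists before writing to it
    os.makedirs(clean_data_dir, exist_ok=True)

    # Iterate through raw data files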
    for file_name in os.listdir(raw_data_dir):
        if file_name.endswith("_raw_text.json"):  # Process only raw data files
            
            # Load the raw data
            raw_file_path = os.path.join(raw_data_dir, file_name)
            with open(raw_file_path, 'r') as raw_file:
                raw_data = json.load(raw_file)
            
            # Clean the data
            clean_data = {}
            for article in raw_data:
                title = article.get('title', '')
                description = article.get('description', '')
                
                if title and description:
                    clean_title = string_cleaner(title)
                    clean_description = string_cleaner(description)
                    clean_data[clean_title] = clean_description
            
            # Save the cleaned data to a new file
            clean_file_name = file_name.replace("_raw_text.json", "_clean_news.json")
            clean_file_path = os.path.join(clean_data_dir, clean_file_name)
            with open(clean_file_path, 'w') as clean_file:
                json.dump(clean_data, clean_file, indent=4)

In [46]:
# Define directories

# Directory with raw data files
raw_data_dir = "../../data/raw-data/News_Drivers"  

# Directory for cleaned data files
clean_data_dir = "../../data/processed-data/News_drivers"  


In [47]:
# Clean the news data
clean_news_data(raw_data_dir, clean_data_dir)

## Clean the Driver Standings

In [52]:
input_file = "../../data/raw-data/Driver_standings/driver_standings_2000_2023.json"
output_file = "../../data/processed-data/driver_standings_2000_2023.csv"


In [53]:
with open(input_file, 'r') as f:
    data = json.load(f)

# Prepare a list to store extracted records
cleaned_data = []

# Loop through each season in the JSON
for season, season_data in data.items():
    standings_lists = season_data.get('MRData', {}).get('StandingsTable', {}).get('StandingsLists', [])
    
    for standings in standings_lists:
        driver_standings = standings.get('DriverStandings', [])
        
        for entry in driver_standings:
            # Extract required fields
            position = entry.get('position', '')
            points = entry.get('points', '')
            wins = entry.get('wins', '')
            driver = entry.get('Driver', {})
            constructors = entry.get('Constructors', [])
            
            # Extract driver and constructor details
            given_name = driver.get('givenName', '')
            family_name = driver.get('familyName', '')
            constructor_id = constructors[0].get('constructorId', '') if constructors else ''
            constructor_name = constructors[0].get('name', '') if constructors else ''
            
            # Append the record to the cleaned data list
            cleaned_data.append({
                "Season": season,
                "Position": position,
                "FirstName": given_name,
                "LastName": family_name,
                "Constructor_ID": constructor_id,
                "Constructor_Name": constructor_name,
                "Points": points,
                "Wins": wins
            })


In [55]:
df = pd.DataFrame(cleaned_data)
df['driverName'] = df['FirstName'] + " " + df['LastName']
df = df.drop(['FirstName', 'LastName'], axis=1)
df.to_csv(output_file, index=False)
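
To make the flattening step easier to follow, here is a minimal sketch that applies the same chain of `.get()` calls to a hand-written record. The `sample` dictionary, the season `"2099"`, and the driver and team names are all hypothetical; only the nesting imitates the file read above.

```python
import pandas as pd

# Hypothetical sample that imitates the nested layout of the standings file
sample = {
    "2099": {
        "MRData": {"StandingsTable": {"StandingsLists": [{
            "DriverStandings": [
                {"position": "1", "points": "18.5", "wins": "1",
                 "Driver": {"givenName": "Ada", "familyName": "Example"},
                 "Constructors": [{"constructorId": "team_a", "name": "Team A"}]},
                {"position": "2", "points": "12", "wins": "0",
                 "Driver": {"givenName": "Bo", "familyName": "Sample"},
                 "Constructors": []},  # empty list exercises the fallback
            ]
        }]}}
    }
}

rows = []
for season, season_data in sample.items():
    lists = (season_data.get("MRData", {})
             .get("StandingsTable", {})
             .get("StandingsLists", []))
    for standings in lists:
        for entry in standings.get("DriverStandings", []):
            constructors = entry.get("Constructors", [])
            first = constructors[0] if constructors else {}
            driver = entry.get("Driver", {})
            rows.append({
                "Season": season,
                "Position": entry.get("position", ""),
                "Points": entry.get("points", ""),
                "Wins": entry.get("wins", ""),
                "Constructor_Name": first.get("name", ""),
                "driverName": f"{driver.get('givenName', '')} {driver.get('familyName', '')}",
            })

df = pd.DataFrame(rows)
# The JSON stores numbers as strings, so convert before doing arithmetic
for col in ["Position", "Points", "Wins"]:
    df[col] = pd.to_numeric(df[col])
print(df)
```

The explicit numeric conversion is optional when the result is written to CSV and read back, since `pd.read_csv` infers numeric types, but it matters if the DataFrame is used directly.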

## Clean the Circuit Information

In [58]:
input_file = "../../data/raw-data/circuit_data/circuit_data.json"
output_file = "../../data/processed-data/circuit_data_clean.csv"


In [59]:
# Ensure the output directory exists
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# Read the JSON file
with open(input_file, 'r') as f:
    data = json.load(f)

# Extract circuit data
circuits = data.get('MRData', {}).get('CircuitTable', {}).get('Circuits', [])

# Prepare a list to store extracted records
cleaned_data = []

for circuit in circuits:
    circuit_id = circuit.get('circuitId', '')
    circuit_name = circuit.get('circuitName', '')
    country = circuit.get('Location', {}).get('country', '')
    latitude = circuit.get('Location', {}).get('lat', '')
    longitude = circuit.get('Location', {}).get('long', '')
    
    # Append to the list
    cleaned_data.append({
        "Circuit_ID": circuit_id,
        "Circuit_Name": circuit_name,
        "Country": country,
        "Latitude": latitude,
        "Longitude": longitude
    })

# Convert the list to a Pandas DataFrame
df = pd.DataFrame(cleaned_data)

# Save the DataFrame to a CSV file
df.to_csv(output_file, index=False)

## Combine the Race Data

In [7]:
# Clean all the race data files and append them into a single CSV file

# input directory
input_dir = "../../data/raw-data/"
# output file
output_file = "../../data/processed-data/all_race_results_cleaned.csv"

# create the output directory if it does not exist
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# initialize a list to hold all results
all_combined_results = []

# process each JSON file in the input directory
# they are the only .json files in the directory
for file_name in os.listdir(input_dir):
    # process only JSON files
    if file_name.endswith(".json"): 
        file_path = os.path.join(input_dir, file_name)
        #print(f"Processing file: {file_path}")
        
        # read the JSON file
        with open(file_path, 'r') as f:
            data = json.load(f)
        
        # extract races from the JSON
        races = data.get('MRData', {}).get('RaceTable', {}).get('Races', [])
        
        # prepare a list to hold flattened race results for this file
        file_results = []

        # loop through each race and flatten its data
        for race in races:
            # extract required information 
            race_info = { 
                "season": race.get("season", ""),
                "round": race.get("round", ""),
                "raceName": race.get("raceName", ""),
                "url": race.get("url",""),
                "circuitName": race.get("Circuit", {}).get("circuitName", ""),
                "locality": race.get("Circuit", {}).get("Location", {}).get("locality", ""),
                "country": race.get("Circuit", {}).get("Location", {}).get("country", ""),
                "lat": race.get("Circuit", {}).get("Location", {}).get("lat", ""),
                "long": race.get("Circuit", {}).get("Location", {}).get("long", ""),
                "date": race.get("date", ""),
            }
            
            # extract results and combine with useful details
            results = race.get("Results", [])
            for result in results:
                # combine race-level and result-level data
                combined_data = {**race_info, **result}
                # add flattened driver and constructor details
                combined_data.update({
                    "driverId": result.get("Driver", {}).get("driverId", ""),
                    "driverGivenName": result.get("Driver", {}).get("givenName", ""),
                    "driverFamilyName": result.get("Driver", {}).get("familyName", ""),
                    "constructorId": result.get("Constructor", {}).get("constructorId", ""),
                    "constructorName": result.get("Constructor", {}).get("name", ""),
                    "status": result.get("status", ""),
                    "timeMillis": result.get("Time", {}).get("millis", ""),
                    "time": result.get("Time", {}).get("time", "")
                })
                file_results.append(combined_data)

        # append the results for this file to the combined list
        all_combined_results.extend(file_results)

# convert the combined results to a Pandas DataFrame
df = pd.DataFrame(all_combined_results)

# save the combined DataFrame to a CSV file
df.to_csv(output_file, index=False)



## Cleaning Weather Data

In [66]:
weather_df = pd.read_csv("../../data/raw-data/weather/race_data_with_weather.csv")

In [67]:
weather_df.isnull().sum()

season      0
raceName    0
url         0
weather     0
dtype: int64

In [68]:
weather_df['weather']

0                                                  Sunny
1                      Overcast with light rain at start
2                                     Mainly cloudy, dry
3                                           Cloudy, rain
4                                     Mainly cloudy, dry
                             ...                        
117    Sunny with temperatures reaching up to 27 °C (...
118    Dry start, with heavy rain and thunderstorm/mo...
119                                                 Rain
120                                                Sunny
121                                          Warm, Sunny
Name: weather, Length: 122, dtype: object

#### We will categorise each weather description into one of the following categories:
1. Sunny
2. Cloudy
3. Rainy
4. Windy

In [69]:
def classify_weather(weather_description):

    weather_description = weather_description.lower()
    
    if "sunny" in weather_description or "fine" in weather_description or "clear" in weather_description or "dry" in weather_description:
        return "Sunny"
    elif "cloudy" in weather_description or "overcast" in weather_description or "cloud" in weather_description:
        return "Cloudy"
    elif "rain" in weather_description or "thunderstorms" in weather_description or "drizzle" in weather_description:
        return "Rainy"
    elif "windy" in weather_description:
        return "Windy"
    # If no match, classify as "Not Available"
    else:
        return "Not Available" 
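
Because the branches are checked in order, a description that mentions several conditions is assigned to the first category that matches. The sketch below restates the same keyword rules as an ordered list and applies them to descriptions taken from the output above. The names `RULES` and `classify_weather_demo` are introduced here for illustration only.

```python
# Same keywords as classify_weather(), written as an ordered list of rules
RULES = [
    ("Sunny", ("sunny", "fine", "clear", "dry")),
    ("Cloudy", ("cloudy", "overcast", "cloud")),
    ("Rainy", ("rain", "thunderstorms", "drizzle")),
    ("Windy", ("windy",)),
]

def classify_weather_demo(description):
    text = description.lower()
    # The first rule with a matching keyword wins
    for label, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return "Not Available"

examples = [
    "Sunny",
    "Overcast with light rain at start",
    "Mainly cloudy, dry",
    "Cloudy, rain",
    "Rain",
]
for description in examples:
    print(f"{description!r} -> {classify_weather_demo(description)}")
```

With this ordering, "Mainly cloudy, dry" is labelled `Sunny` because "dry" is checked before "cloudy", and "Cloudy, rain" is labelled `Cloudy` because the cloud keywords are checked before the rain keywords. If rain should take precedence, the rules would need to be reordered, which would change the class counts.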

In [70]:
# call classify_weather()
weather_df['weather_class'] = weather_df['weather'].apply(classify_weather)

# Save the updated DataFrame to a CSV file
output_csv = "../../data/processed-data/classified_weather_data.csv"
weather_df.to_csv(output_csv, index=False)

In [71]:
weather_df['weather_class'].value_counts()

weather_class
Sunny            95
Cloudy           24
Rainy             2
Not Available     1
Name: count, dtype: int64

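The weather description for the 2006 European Grand Prix is not available on Wikipedia, which is why one race is classified as `Not Available`. Using the circuit's longitude and latitude together with the race date, the weather for that race was found to be `Sunny`, so we fill in that value.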

In [72]:
weather_df['weather_class'] = weather_df['weather_class'].replace('Not Available', 'Sunny')

In [73]:
output_csv = "../../data/processed-data/classified_weather_data.csv"
weather_df.to_csv(output_csv, index=False)

In [75]:
# merge race results and weather information
race_df = pd.read_csv("../../data/processed-data/all_race_results_cleaned.csv")
weather_df = pd.read_csv("../../data/processed-data/classified_weather_data.csv")

merged_df = race_df.merge(weather_df[['url','weather_class']], on='url', how='left')

merged_df.to_csv("../../data/processed-data/race_weather_merged.csv", index=False)
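
The left merge keeps every race-result row and attaches the weather class of its race through the shared `url` key. As a hedged sketch of two optional safety checks, the example below uses two tiny hypothetical frames (`race` and `weather`, with placeholder URLs) together with the `validate` and `indicator` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
race = pd.DataFrame({
    "url": ["u1", "u1", "u2"],
    "driverId": ["a", "b", "a"],
})
weather = pd.DataFrame({
    "url": ["u1", "u2"],
    "weather_class": ["Sunny", "Rainy"],
})

merged = race.merge(
    weather[["url", "weather_class"]],
    on="url",
    how="left",
    validate="many_to_one",  # raises MergeError if a url repeats in weather
    indicator=True,          # adds a _merge column describing each row's match
)

# Rows that found no weather record would be marked "left_only"
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"Rows without weather: {len(unmatched)}")
```

`validate="many_to_one"` guards against duplicated race URLs in the weather table, which would otherwise silently duplicate result rows.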

## Clean the merged data

In [33]:
main_df = pd.read_csv("../../data/processed-data/race_weather_merged.csv")
main_df.head()

|   | season | round | raceName | url | circuitName | locality | country | lat | long | date | ... | Time | FastestLap | driverId | driverGivenName | driverFamilyName | constructorId | constructorName | timeMillis | time | weather_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 26.0325 | 50.5106 | 2010-03-14 | ... | {'millis': '5960396', 'time': '1:39:20.396'} | {'rank': '1', 'lap': '45', 'Time': {'time': '1... | alonso | Fernando | Alonso | ferrari | Ferrari | 5960396.0 | 1:39:20.396 | Sunny |
| 1 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 26.0325 | 50.5106 | 2010-03-14 | ... | {'millis': '5976495', 'time': '+16.099'} | {'rank': '5', 'lap': '38', 'Time': {'time': '1... | massa | Felipe | Massa | ferrari | Ferrari | 5976495.0 | +16.099 | Sunny |
| 2 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 26.0325 | 50.5106 | 2010-03-14 | ... | {'millis': '5983578', 'time': '+23.182'} | {'rank': '4', 'lap': '42', 'Time': {'time': '1... | hamilton | Lewis | Hamilton | mclaren | McLaren | 5983578.0 | +23.182 | Sunny |
| 3 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 26.0325 | 50.5106 | 2010-03-14 | ... | {'millis': '5999195', 'time': '+38.799'} | {'rank': '12', 'lap': '32', 'Time': {'time': '... | vettel | Sebastian | Vettel | red_bull | Red Bull | 5999195.0 | +38.799 | Sunny |
| 4 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 26.0325 | 50.5106 | 2010-03-14 | ... | {'millis': '6000609', 'time': '+40.213'} | {'rank': '13', 'lap': '45', 'Time': {'time': '... | rosberg | Nico | Rosberg | mercedes | Mercedes | 6000609.0 | +40.213 | Sunny |


In [34]:
main_df.columns

Index(['season', 'round', 'raceName', 'url', 'circuitName', 'locality',
       'country', 'lat', 'long', 'date', 'number', 'position', 'positionText',
       'points', 'Driver', 'Constructor', 'grid', 'laps', 'status', 'Time',
       'FastestLap', 'driverId', 'driverGivenName', 'driverFamilyName',
       'constructorId', 'constructorName', 'timeMillis', 'time',
       'weather_class'],
      dtype='object')

In [35]:
# drop un-needed columns
main_df = main_df.drop(['lat', 'long', 'number', 'positionText', 'Driver', 'Constructor', 'Time', 'FastestLap'], axis=1)

In [36]:
main_df.columns

Index(['season', 'round', 'raceName', 'url', 'circuitName', 'locality',
       'country', 'date', 'position', 'points', 'grid', 'laps', 'status',
       'driverId', 'driverGivenName', 'driverFamilyName', 'constructorId',
       'constructorName', 'timeMillis', 'time', 'weather_class'],
      dtype='object')

In [37]:
main_df.isnull().sum()

season                 0
round                  0
raceName               0
url                    0
circuitName            0
locality               0
country                0
date                   0
position               0
points                 0
grid                   0
laps                   0
status                 0
driverId               0
driverGivenName        0
driverFamilyName       0
constructorId          0
constructorName        0
timeMillis          1291
time                1291
weather_class          0
dtype: int64

The missing values in the `timeMillis` and `time` columns belong to drivers who did not complete the full race distance, so no finishing time was recorded for them. We therefore drop both columns and analyse performance using other metrics.

In [38]:
# drop un-needed columns
main_df = main_df.drop(['timeMillis', 'time'], axis=1)

In [39]:
main_df['constructorName'].value_counts()

constructorName
Ferrari           235
McLaren           234
Williams          229
Red Bull          184
Renault           142
Sauber            139
Mercedes          134
Toro Rosso        127
Force India       100
Haas F1 Team       79
Toyota             76
Jordan             57
BAR                56
Minardi            54
Alfa Romeo         50
Jaguar             45
BMW Sauber         40
AlphaTauri         40
Lotus F1           38
Alpine F1 Team     30
Aston Martin       30
Honda              28
Arrows             26
HRT                24
Marussia           24
Super Aguri        24
Caterham           24
Racing Point       20
Prost              18
Benetton           18
Virgin             16
Manor Marussia     16
Lotus              16
Brawn              10
MF1                 9
Spyker              8
Name: count, dtype: int64

In [40]:
main_df = main_df.drop(['constructorId'], axis=1)

Some team names changed over the years because of rebranding or a change in ownership.
For a consistent analysis, we map the older constructor names to a single, more recent name for each team.



In [41]:
constructor_mapping= {
    "Jaguar" : "Red Bull",
    "BMW Sauber" : "Sauber",
    "Alfa Romeo" : "Sauber",
    "BAR" : "Mercedes",
    "Honda" : "Mercedes",
    "Brawn" : "Mercedes",
    "Minardi" : "AlphaTauri",
    "Toro Rosso" : "AlphaTauri",
    "Force India" : "Aston Martin",
    "Jordan" : "Aston Martin",
    "Racing Point" : "Aston Martin",
    "MF1" : "Aston Martin",
    "Spyker" : "Aston Martin",
    "Lotus F1" : "Alpine F1 Team",
    "Renault" : "Alpine F1 Team",
    "Benetton" : "Alpine F1 Team",
    "Manor Marussia" : "Marussia",
    "Virgin" : "Marussia",
    "Lotus" : "Caterham",

}

In [42]:
main_df['constructorName'] = main_df['constructorName'].replace(constructor_mapping)
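
As a small illustration of how `replace` behaves with a dictionary, the sketch below applies a subset of the mapping above to a hypothetical Series of constructor names. Names that are not keys in the mapping are left unchanged.

```python
import pandas as pd

# Hypothetical column containing old and current constructor names
names = pd.Series(["Sauber", "BMW Sauber", "Alfa Romeo", "Ferrari", "Jaguar"])

# Subset of the constructor mapping used above
mapping = {
    "BMW Sauber": "Sauber",
    "Alfa Romeo": "Sauber",
    "Jaguar": "Red Bull",
}

unified = names.replace(mapping)
counts = unified.value_counts()
print(counts)
```

A quick sanity check on the real data is to add up the earlier counts by hand. In the counts shown before the mapping, Sauber (139), BMW Sauber (40) and Alfa Romeo (50) sum to 229, which is the number a correct mapping should report for Sauber afterwards.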

In [43]:
main_df['constructorName'].value_counts()

constructorName
Ferrari           235
McLaren           234
Red Bull          229
Williams          229
Sauber            229
Mercedes          228
Alpine F1 Team    228
Aston Martin      224
AlphaTauri        221
Haas F1 Team       79
Toyota             76
Marussia           56
Caterham           40
Arrows             26
Super Aguri        24
HRT                24
Prost              18
Name: count, dtype: int64

In [44]:
main_df['driverName'] = main_df['driverGivenName'] + " " + main_df['driverFamilyName']

In [45]:
main_df = main_df.drop(['driverGivenName','driverFamilyName'], axis = 1)

In [46]:
main_df.head(2)

|   | season | round | raceName | url | circuitName | locality | country | date | position | points | grid | laps | status | driverId | constructorName | weather_class | driverName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 1 | 25.0 | 3 | 49 | Finished | alonso | Ferrari | Sunny | Fernando Alonso |
| 1 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 2 | 18.0 | 2 | 49 | Finished | massa | Ferrari | Sunny | Felipe Massa |


In [47]:
main_df['status'].unique()

array(['Finished', '+1 Lap', '+2 Laps', 'Electrical', 'Hydraulics',
       'Overheating', 'Gearbox', 'Suspension', 'Accident', '+5 Laps',
       'Wheel', 'Engine', 'Spun off', 'Collision', '+3 Laps', '+4 Laps',
       '+10 Laps', 'Throttle', 'Clutch', 'Technical', 'Mechanical',
       'Driveshaft', 'Transmission', 'Steering', 'Puncture', 'Brakes',
       'Retired', 'Tyre', 'Fuel pressure', '+9 Laps', 'Water leak',
       'Disqualified', 'Did not qualify', '+42 Laps', 'Engine misfire',
       'Power Unit', 'Oil pressure', 'Safety concerns', 'Fuel system',
       '+6 Laps', 'Electronics', 'Collision damage', 'Wheel nut',
       'Battery', 'Oil leak', '+7 Laps', 'Stalled', 'Exhaust',
       'Vibrations', 'Broken wing', 'Fuel', 'Wheel rim', 'Power loss',
       '107% Rule', '+8 Laps', 'ERS', 'Withdrew', 'Cooling system',
       'Water pump', 'Fuel leak', 'Front wing', 'Turbo', 'Damage',
       'Out of fuel', 'Radiator', 'Oil line', 'Fuel rig',
       'Launch control', 'Not classified', 'Pn

In [48]:
# classifying different categories under status
def classify_status(status):
    if status == 'Finished':
        return 'Finished'
    elif 'Lap' in status:  # Handles all with 'Lap'
        return 'Lapped'
    elif status in ['Accident', 'Collision', 'Spun off', 'Withdrew']:
        return 'Accident'
    else:
        return 'Mechanical'

main_df['status'] = main_df['status'].apply(classify_status)


In [49]:
main_df['status'].value_counts()

status
Finished      1105
Lapped         693
Mechanical     412
Accident       190
Name: count, dtype: int64

In [50]:
main_df.to_csv("../../data/processed-data/race_info.csv", index=False)

In [None]:
# Finish category - new categorical variable
data = pd.read_csv("../../data/processed-data/race_info.csv")

def classify_finish(position):
    if position in [1, 2, 3]:
        return "Podium"
    elif position in [4, 5, 6, 7, 8, 9, 10]:
        return "Points Finish"
    else:
        return "No Points"

# Assign the whole column at once; chained indexing such as
# data['FinishCategory'][i] = ... may write to a temporary copy
data['FinishCategory'] = data['position'].apply(classify_finish)

In [52]:
data.head()

|   | season | round | raceName | url | circuitName | locality | country | date | position | points | grid | laps | status | driverId | constructorName | weather_class | driverName | FinishCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 1 | 25.0 | 3 | 49 | Finished | alonso | Ferrari | Sunny | Fernando Alonso | Podium |
| 1 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 2 | 18.0 | 2 | 49 | Finished | massa | Ferrari | Sunny | Felipe Massa | Podium |
| 2 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 3 | 15.0 | 4 | 49 | Finished | hamilton | McLaren | Sunny | Lewis Hamilton | Podium |
| 3 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 4 | 12.0 | 1 | 49 | Finished | vettel | Red Bull | Sunny | Sebastian Vettel | Points Finish |
| 4 | 2010 | 1 | Bahrain Grand Prix | http://en.wikipedia.org/wiki/2010_Bahrain_Gran... | Bahrain International Circuit | Sakhir | Bahrain | 2010-03-14 | 5 | 10.0 | 5 | 49 | Finished | rosberg | Mercedes | Sunny | Nico Rosberg | Points Finish |


In [53]:
data.to_csv("../../data/processed-data/race_info.csv", index=False)
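
The same three finish categories can also be derived by binning the finishing position. The following is a sketch using `pd.cut` on a hypothetical Series of positions; it is an alternative formulation and not the code that produced the file above.

```python
import pandas as pd

# Hypothetical finishing positions
positions = pd.Series([1, 3, 4, 10, 11, 20])

# Bins are right-inclusive: (0, 3], (3, 10], (10, inf)
finish_category = pd.cut(
    positions,
    bins=[0, 3, 10, float("inf")],
    labels=["Podium", "Points Finish", "No Points"],
)
print(finish_category.astype(str).tolist())
```

This mirrors the fixed top-10 rule used above rather than consulting the `points` column, so it carries the same assumption about which positions score points.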

In [25]:
# race track features
data = pd.read_csv("../../data/raw-data/circuit_data/merged_circuit_features.csv")
data.head()

|   | Year | Grand Prix | Track Length (m) | Max Speed (km/h) | Full Throttle (%) | Number of Corners | Number of Straights |
|---|---|---|---|---|---|---|---|
| 0 | 2020 | Pre-Season Test 1 | 4312.438437 | 323 | 70.673953 | 1 | 4 |
| 1 | 2020 | Pre-Season Test 2 | 4312.438437 | 323 | 70.673953 | 1 | 4 |
| 2 | 2020 | Austrian Grand Prix | 4312.438437 | 323 | 70.673953 | 1 | 4 |
| 3 | 2020 | Styrian Grand Prix | 4292.610384 | 300 | 46.556886 | 2 | 6 |
| 4 | 2020 | Hungarian Grand Prix | 4348.049386 | 318 | 58.114374 | 0 | 6 |


In [26]:
data.isnull().sum()

Year                   0
Grand Prix             0
Track Length (m)       0
Max Speed (km/h)       0
Full Throttle (%)      0
Number of Corners      0
Number of Straights    0
dtype: int64

In [27]:
req_cols = ["Track Length (m)", "Max Speed (km/h)", "Full Throttle (%)","Number of Corners", "Number of Straights"]
scaler = StandardScaler()

# Apply the scaler to the dataframe
data[req_cols] = scaler.fit_transform(data[req_cols])
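
`StandardScaler` standardises each column to zero mean and unit variance, using the population standard deviation (equivalent to `numpy.std` with `ddof=0`). The sketch below reproduces that formula with NumPy on three `Max Speed (km/h)` values taken from the preview above. The resulting numbers will differ from the scaled table in this section because the real scaler is fitted on the full column.

```python
import numpy as np

# Three Max Speed (km/h) values from the preview above
max_speed = np.array([323.0, 300.0, 318.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
z_scores = (max_speed - max_speed.mean()) / max_speed.std(ddof=0)

print(z_scores)
print(z_scores.mean(), z_scores.std(ddof=0))
```

After scaling, the column has a mean of (approximately) 0 and a standard deviation of 1, which puts features measured in metres, km/h, and percentages on a comparable scale.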

In [31]:
data.head()

|   | Year | Grand Prix | Track Length (m) | Max Speed (km/h) | Full Throttle (%) | Number of Corners | Number of Straights |
|---|---|---|---|---|---|---|---|
| 0 | 2020 | Pre-Season Test 1 | -1.000607 | -0.11567 | 1.059667 | -0.789651 | -0.938394 |
| 1 | 2020 | Pre-Season Test 2 | -1.000607 | -0.11567 | 1.059667 | -0.789651 | -0.938394 |
| 2 | 2020 | Austrian Grand Prix | -1.000607 | -0.11567 | 1.059667 | -0.789651 | -0.938394 |
| 3 | 2020 | Styrian Grand Prix | -1.024865 | -1.84098 | -1.757479 | -0.275003 | -0.037811 |
| 4 | 2020 | Hungarian Grand Prix | -0.957039 | -0.490737 | -0.407433 | -1.3043 | -0.037811 |


In [32]:
data.to_csv("../../data/processed-data/race_track_features.csv", index=False)

## Pitstop Data

In [54]:
df = pd.read_csv("../../data/raw-data/pitstop_data.csv")
df.head()

|   | Year | Round | RaceName | DriverID | Lap | Stop | Time | Duration |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 1 | Australian Grand Prix | alguersuari | 1 | 1 | 17:05:23 | 26.898 |
| 1 | 2011 | 1 | Australian Grand Prix | michael_schumacher | 1 | 1 | 17:05:52 | 25.021 |
| 2 | 2011 | 1 | Australian Grand Prix | webber | 11 | 1 | 17:20:48 | 23.426 |
| 3 | 2011 | 1 | Australian Grand Prix | alonso | 12 | 1 | 17:22:34 | 23.251 |
| 4 | 2011 | 1 | Australian Grand Prix | massa | 13 | 1 | 17:24:10 | 23.842 |


In [60]:
transformed = (
    df.assign(PitStopNumber=lambda x: x.groupby(['Year', 'Round', 'RaceName', 'DriverID']).cumcount() + 1)
    .pivot(index=['Year', 'Round', 'RaceName', 'DriverID'], columns='PitStopNumber', values=['Lap', 'Stop', 'Time', 'Duration'])
)

# Flatten the multi-index columns
transformed.columns = [f"{col[0]}{col[1]}" for col in transformed.columns]

# Reset the index so the grouping keys become regular columns
transformed.reset_index(inplace=True)

# Fill the slots for pit stops that did not happen with 0
transformed = transformed.fillna(0)
# Preview the transformed data
print(transformed.head(2))

   Year  Round               RaceName     DriverID  Lap1  Lap2  Lap3  Lap4  \
0  2011      1  Australian Grand Prix  alguersuari     1    17    35     0   
1  2011      1  Australian Grand Prix       alonso    12    27    42     0   

   Lap5  Lap6  ...  Time5  Time6  Time7  Duration1  Duration2  Duration3  \
0     0     0  ...      0      0      0     26.898     24.463     26.348   
1     0     0  ...      0      0      0     23.251     24.733     24.181   

   Duration4  Duration5 Duration6 Duration7  
0          0          0         0         0  
1          0          0         0         0  

[2 rows x 32 columns]


In [56]:
transformed.shape

(5118, 32)
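
The long-to-wide reshape above relies on two steps: `cumcount` numbers each driver's pit stops within a race, and `pivot` spreads those numbered stops across columns. Here is a minimal sketch of the same idea on a hypothetical three-row table with a single grouping key.

```python
import pandas as pd

# Hypothetical pit stop records: driver "a" stops twice, driver "b" once
stops = pd.DataFrame({
    "DriverID": ["a", "a", "b"],
    "Lap": [12, 30, 15],
    "Duration": [23.2, 24.1, 22.9],
})

# Number each driver's stops in order of appearance: 1, 2, ...
stops["PitStopNumber"] = stops.groupby("DriverID").cumcount() + 1

# One row per driver, one column per (measure, stop number) pair
wide = stops.pivot(
    index="DriverID",
    columns="PitStopNumber",
    values=["Lap", "Duration"],
)

# Flatten ("Lap", 1) into "Lap1", and so on
wide.columns = [f"{name}{number}" for name, number in wide.columns]
wide = wide.reset_index()
print(wide)
```

Driver `b` has no second stop, so `Lap2` and `Duration2` are missing for that row. Filling those gaps with 0, as done above, is convenient for modelling, but it makes a stop that never happened look like a zero-second stop on lap 0, so keeping `NaN` may be preferable for some analyses.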