---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---


# Code 

This page walks through the cleaning steps for four raw datasets: news articles, driver standings for 2000 to 2023, circuit information, and race results for the 2000 season. Each section loads a raw JSON file, extracts the fields of interest, and writes a cleaned JSON or CSV file under `../../data/processed-data`.

## Clean the News Data

In [22]:
# Import required libraries
import os
import json
import re

import pandas as pd

In [23]:
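# Function for cleaning a single text field

def string_cleaner(input_string):
    try:
        # Replace runs of punctuation (plus any trailing spaces) with one space
        out = re.sub(r"""
                    [,.;@#?!&$-]+
                    \ *          
                    """,
                    " ",
                    input_string, flags=re.VERBOSE)
        # Drop curly apostrophes (periods were already replaced above)
        out = re.sub('[’.]+', '', out)
        # Drop literal \uXXXX escape sequences left in the text
        out = re.sub(r'\\u[0-9a-fA-F]{4}', '', out)
        # Collapse whitespace runs, then lowercase
        out = re.sub(r'\s+', ' ', out)
        out = out.lower()
    except TypeError:
        # re.sub raises TypeError when the input is not a string (e.g. None)
        print("ERROR")
        out = ''
    return out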
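
To see what the cleaner does in practice, the sketch below restates `string_cleaner` in compact form (so the block runs on its own) and applies it to a few made-up strings. The sample inputs are assumptions for illustration and are not drawn from the news data.

```python
import re

def string_cleaner(input_string):
    # Same steps as the function above, restated so this block is self-contained
    try:
        out = re.sub(r"[,.;@#?!&$-]+ *", " ", input_string)
        out = re.sub("[’.]+", "", out)
        out = re.sub(r"\\u[0-9a-fA-F]{4}", "", out)
        out = re.sub(r"\s+", " ", out)
        out = out.lower()
    except TypeError:
        print("ERROR")
        out = ""
    return out

# Made-up sample inputs (illustrative only)
print(repr(string_cleaner("Hamilton wins, Verstappen second!  #F1 & more")))
# 'hamilton wins verstappen second f1 more'
print(repr(string_cleaner("Verstappen’s Red-Bull wins!")))
# 'verstappens red bull wins '   (note the trailing space)
print(repr(string_cleaner(r"caf\u00e9 society")))
# 'caf society'
print(repr(string_cleaner(None)))
# prints ERROR first, then ''
```

Because the function never strips its result, text that ends in punctuation comes back with a trailing space; appending `.strip()` to the final step would remove it if that matters downstream.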

In [24]:
# Function to clean news data
def clean_news_data(raw_data_dir, clean_data_dir):
    
    # Make sure the output directory exists before writing to it
    os.makedirs(clean_data_dir, exist_ok=True)
    
    # Iterate through raw data files
    for file_name in os.listdir(raw_data_dir):
        if file_name.endswith("_raw_text.json"):  # Process only raw data files
            
            # Load the raw data
            raw_file_path = os.path.join(raw_data_dir, file_name)
            with open(raw_file_path, 'r', encoding='utf-8') as raw_file:
                raw_data = json.load(raw_file)
            
            # Clean the data, keeping only articles that have both a title and a description
            clean_data = {}
            for article in raw_data:
                title = article.get('title', '')
                description = article.get('description', '')
                
                if title and description:
                    clean_title = string_cleaner(title)
                    clean_description = string_cleaner(description)
                    # Keyed by cleaned title: a repeated title overwrites the earlier entry
                    clean_data[clean_title] = clean_description
            
            # Save the cleaned data to a new file
            clean_file_name = file_name.replace("_raw_text.json", "_clean_news.json")
            clean_file_path = os.path.join(clean_data_dir, clean_file_name)
            with open(clean_file_path, 'w', encoding='utf-8') as clean_file:
                json.dump(clean_data, clean_file, indent=4)

In [25]:
# Define directories

# Directory with raw data files
raw_data_dir = "../../data/raw-data/News_Drivers"  

# Directory for cleaned data files
clean_data_dir = "../../data/processed-data/News_drivers"  


In [26]:
# Clean the news data
clean_news_data(raw_data_dir, clean_data_dir)
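
One consequence of storing the cleaned articles in a dictionary keyed by title is that two articles whose titles clean to the same string collapse into one entry. The sketch below illustrates this with made-up articles and a simplified stand-in cleaner (`simple_clean` is hypothetical, not the `string_cleaner` used above), and contrasts it with keeping a list of records. It is meant as an illustration of the trade-off, not as a change to the workflow.

```python
# Made-up articles; the first two share a title once cleaned
articles = [
    {"title": "Hamilton wins!", "description": "First report."},
    {"title": "Hamilton wins", "description": "Second report."},
    {"title": "", "description": "No title, so this one is skipped."},
]

def simple_clean(text):
    # Hypothetical stand-in cleaner: lowercase and drop trailing "!" characters
    return text.lower().rstrip("!")

as_dict = {}   # mirrors the title -> description layout used above
as_list = []   # alternative: one record per article

for article in articles:
    title = article.get("title", "")
    description = article.get("description", "")
    if title and description:
        as_dict[simple_clean(title)] = simple_clean(description)
        as_list.append({
            "title": simple_clean(title),
            "description": simple_clean(description),
        })

print(len(as_dict))  # 1 (the second article replaced the first)
print(len(as_list))  # 2
```

If duplicate headlines are expected in the raw files, the list layout preserves every article at the cost of a slightly different JSON shape.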

## Clean the Driver Standings

In [32]:
input_file = "../../data/raw-data/Driver_standings/driver_standings_2000_2023.json"
output_file = "../../data/processed-data/driver_standings_2000_2023.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

In [33]:
with open(input_file, 'r') as f:
    data = json.load(f)

# Prepare a list to store extracted records
cleaned_data = []

# Loop through each season in the JSON
for season, season_data in data.items():
    standings_lists = season_data.get('MRData', {}).get('StandingsTable', {}).get('StandingsLists', [])
    
    for standings in standings_lists:
        driver_standings = standings.get('DriverStandings', [])
        
        for entry in driver_standings:
            # Extract required fields
            position = entry.get('position', '')
            points = entry.get('points', '')
            wins = entry.get('wins', '')
            driver = entry.get('Driver', {})
            constructors = entry.get('Constructors', [])
            
            # Extract driver and constructor details
            given_name = driver.get('givenName', '')
            family_name = driver.get('familyName', '')
            constructor_id = constructors[0].get('constructorId', '') if constructors else ''
            constructor_name = constructors[0].get('name', '') if constructors else ''
            
            # Append the record to the cleaned data list
            cleaned_data.append({
                "Season": season,
                "Position": position,
                "FirstName": given_name,
                "LastName": family_name,
                "Constructor_ID": constructor_id,
                "Constructor_Name": constructor_name,
                "Points": points,
                "Wins": wins
            })

# Convert the list to a Pandas DataFrame
df = pd.DataFrame(cleaned_data)

# Save the DataFrame to a CSV file
df.to_csv(output_file, index=False)
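
The fields pulled from the JSON with `.get(..., '')` are stored as they arrive, and the race output later on this page shows numeric values such as `millis` held as strings in the raw data. If the standings are used in memory before being re-read from CSV, a conversion step like the following may be useful. This is a minimal sketch on made-up rows; the column names match the records above, but the values are illustrative.

```python
import pandas as pd

# Made-up rows in the same shape as the standings records above
rows = [
    {"Season": "2000", "Position": "1", "Points": "108", "Wins": "9"},
    {"Season": "2000", "Position": "2", "Points": "89", "Wins": "4"},
    {"Season": "2000", "Position": "", "Points": "0.5", "Wins": "0"},
]
df = pd.DataFrame(rows)

# Convert string columns to numbers; values that cannot be parsed become NaN
numeric_cols = ["Season", "Position", "Points", "Wins"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
print(df["Points"].sum())  # 197.5
```

With `errors="coerce"`, the empty `Position` in the third row becomes `NaN` instead of raising, so missing values stay visible after the conversion.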

## Clean the Circuit Information

In [46]:
input_file = "../../data/raw-data/circuit_data.json"
output_file = "../../data/processed-data/circuit_data_clean.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

In [47]:
# Read the JSON file
with open(input_file, 'r') as f:
    data = json.load(f)

# Extract circuit data
circuits = data.get('MRData', {}).get('CircuitTable', {}).get('Circuits', [])

# Prepare a list to store extracted records
cleaned_data = []

for circuit in circuits:
    circuit_id = circuit.get('circuitId', '')
    circuit_name = circuit.get('circuitName', '')
    country = circuit.get('Location', {}).get('country', '')
    latitude = circuit.get('Location', {}).get('lat', '')
    longitude = circuit.get('Location', {}).get('long', '')
    
    # Append to the list
    cleaned_data.append({
        "Circuit_ID": circuit_id,
        "Circuit_Name": circuit_name,
        "Country": country,
        "Latitude": latitude,
        "Longitude": longitude
    })

# Convert the list to a Pandas DataFrame
df = pd.DataFrame(cleaned_data)

# Save the DataFrame to a CSV file
df.to_csv(output_file, index=False)
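
The extraction code on this page repeats the pattern `.get('A', {}).get('B', '')` many times. A small helper can express the same lookup once and also tolerate a key that is present but set to `None`. The helper name `dig` and the sample values below are assumptions for illustration only.

```python
def dig(record, *keys, default=""):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Made-up circuit record in the same shape as the JSON above
circuit = {
    "circuitId": "albert_park",
    "Location": {"country": "Australia", "lat": "-37.8497"},
}

print(dig(circuit, "Location", "country"))                 # Australia
print(repr(dig(circuit, "Location", "long")))              # ''
print(repr(dig({"Location": None}, "Location", "country")))  # ''
```

The last call is the case where chained `.get` calls would raise an `AttributeError`, because `.get('Location', {})` returns `None` when the key exists with a null value.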

## Clean the Race Data

In [36]:
input_file = "../../data/raw-data/race_data_2000.json"
output_file = "../../data/processed-data/race_data/race_data_2000_clean.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

In [37]:
# Read the JSON file
with open(input_file, 'r') as f:
    data = json.load(f)

# Extract races from the JSON
races = data.get('MRData', {}).get('RaceTable', {}).get('Races', [])

# Prepare a list to hold flattened race results
all_results = []

# Loop through each race and flatten its data
for race in races:
    race_info = {  # Extract race-level details
        "season": race.get("season", ""),
        "round": race.get("round", ""),
        "raceName": race.get("raceName", ""),
        "circuitName": race.get("Circuit", {}).get("circuitName", ""),
        "locality": race.get("Circuit", {}).get("Location", {}).get("locality", ""),
        "country": race.get("Circuit", {}).get("Location", {}).get("country", ""),
        "date": race.get("date", ""),
    }
    
    # Extract results and combine with race-level details
    results = race.get("Results", [])
    for result in results:
        # Combine race-level and result-level data
        combined_data = {**race_info, **result}
        # Add flattened driver and constructor details
        combined_data.update({
            "driverId": result.get("Driver", {}).get("driverId", ""),
            "driverGivenName": result.get("Driver", {}).get("givenName", ""),
            "driverFamilyName": result.get("Driver", {}).get("familyName", ""),
            "constructorId": result.get("Constructor", {}).get("constructorId", ""),
            "constructorName": result.get("Constructor", {}).get("name", ""),
            "status": result.get("status", ""),
            "timeMillis": result.get("Time", {}).get("millis", ""),
            "time": result.get("Time", {}).get("time", "")
        })
        all_results.append(combined_data)

# Convert to a Pandas DataFrame
df = pd.DataFrame(all_results)

# Save the DataFrame to a CSV file
df.to_csv(output_file, index=False)
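
The loop above merges each result dictionary into the row as-is, so the nested `Driver`, `Constructor`, and `Time` dictionaries are carried along as dictionary-valued columns. One alternative worth knowing is `pandas.json_normalize`, which flattens nested records directly. The sketch below uses a heavily trimmed, partly made-up payload in the same shape as the race JSON; the second result deliberately omits `Time` to mimic a driver without a finishing time.

```python
import pandas as pd

# Trimmed sample in the shape of the 'Races' list (values are illustrative)
races = [
    {
        "season": "2000",
        "round": "1",
        "raceName": "Australian Grand Prix",
        "Results": [
            {
                "position": "1",
                "Driver": {"driverId": "michael_schumacher"},
                "Constructor": {"constructorId": "ferrari"},
                "Time": {"millis": "5641987", "time": "1:34:01.987"},
            },
            {
                "position": "2",
                "Driver": {"driverId": "barrichello"},
                "Constructor": {"constructorId": "ferrari"},
                # No "Time" key here, on purpose
            },
        ],
    }
]

# One row per entry in "Results"; race-level fields are repeated via `meta`
flat = pd.json_normalize(
    races,
    record_path="Results",
    meta=["season", "round", "raceName"],
    sep="_",
)

print(sorted(flat.columns))
print(len(flat))  # 2
```

Nested keys become columns such as `Driver_driverId` and `Time_millis`, and a missing `Time` shows up as `NaN`, so no dictionary-valued columns need to be dropped afterwards. The column names differ from the ones produced by the loop above, so downstream code would need to be adjusted accordingly.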

In [39]:
race_df = pd.read_csv("../../data/processed-data/race_data/race_data_2000_clean.csv")
race_df.head()

| | season | round | raceName | circuitName | locality | country | date | number | position | positionText | ... | laps | status | Time | driverId | driverGivenName | driverFamilyName | constructorId | constructorName | timeMillis | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 3 | 1 | 1 | ... | 58 | Finished | {'millis': '5641987', 'time': '1:34:01.987'} | michael_schumacher | Michael | Schumacher | ferrari | Ferrari | 5641987.0 | 1:34:01.987 |
| 1 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 4 | 2 | 2 | ... | 58 | Finished | {'millis': '5653402', 'time': '+11.415'} | barrichello | Rubens | Barrichello | ferrari | Ferrari | 5653402.0 | +11.415 |
| 2 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 9 | 3 | 3 | ... | 58 | Finished | {'millis': '5661996', 'time': '+20.009'} | ralf_schumacher | Ralf | Schumacher | williams | Williams | 5661996.0 | +20.009 |
| 3 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 22 | 4 | 4 | ... | 58 | Finished | {'millis': '5686434', 'time': '+44.447'} | villeneuve | Jacques | Villeneuve | bar | BAR | 5686434.0 | +44.447 |
| 4 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 11 | 5 | 5 | ... | 58 | Finished | {'millis': '5687152', 'time': '+45.165'} | fisichella | Giancarlo | Fisichella | benetton | Benetton | 5687152.0 | +45.165 |


In [41]:
race_df.columns

Index(['season', 'round', 'raceName', 'circuitName', 'locality', 'country',
       'date', 'number', 'position', 'positionText', 'points', 'Driver',
       'Constructor', 'grid', 'laps', 'status', 'Time', 'driverId',
       'driverGivenName', 'driverFamilyName', 'constructorId',
       'constructorName', 'timeMillis', 'time'],
      dtype='object')

In [42]:
# Drop the redundant text position and the nested columns that were flattened above
race_df = race_df.drop(columns=['positionText', 'Driver', 'Constructor', 'Time'])

In [44]:
race_df.head()

| | season | round | raceName | circuitName | locality | country | date | number | position | points | grid | laps | status | driverId | driverGivenName | driverFamilyName | constructorId | constructorName | timeMillis | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 3 | 1 | 10 | 3 | 58 | Finished | michael_schumacher | Michael | Schumacher | ferrari | Ferrari | 5641987.0 | 1:34:01.987 |
| 1 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 4 | 2 | 6 | 4 | 58 | Finished | barrichello | Rubens | Barrichello | ferrari | Ferrari | 5653402.0 | +11.415 |
| 2 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 9 | 3 | 4 | 11 | 58 | Finished | ralf_schumacher | Ralf | Schumacher | williams | Williams | 5661996.0 | +20.009 |
| 3 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 22 | 4 | 3 | 8 | 58 | Finished | villeneuve | Jacques | Villeneuve | bar | BAR | 5686434.0 | +44.447 |
| 4 | 2000 | 1 | Australian Grand Prix | Albert Park Grand Prix Circuit | Melbourne | Australia | 2000-03-12 | 11 | 5 | 2 | 9 | 58 | Finished | fisichella | Giancarlo | Fisichella | benetton | Benetton | 5687152.0 | +45.165 |
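
In the output above, `time` holds the winner's total race time but only a gap (for example `+11.415`) for the other finishers, while `timeMillis` is an absolute time for every row. A quick consistency check is to recompute the gap from `timeMillis` and compare it with `time`. The sketch below rebuilds the first three rows shown above by hand; the column name `gapSeconds` is a hypothetical addition, not part of the cleaned file.

```python
import pandas as pd

# First three rows of the cleaned output shown above, re-entered by hand
race_df = pd.DataFrame({
    "season": [2000, 2000, 2000],
    "round": [1, 1, 1],
    "driverId": ["michael_schumacher", "barrichello", "ralf_schumacher"],
    "timeMillis": [5641987.0, 5653402.0, 5661996.0],
    "time": ["1:34:01.987", "+11.415", "+20.009"],
})

# Fastest absolute time within each race is the winner's time
winner_millis = race_df.groupby(["season", "round"])["timeMillis"].transform("min")

# Gap to the winner in seconds (hypothetical helper column)
race_df["gapSeconds"] = ((race_df["timeMillis"] - winner_millis) / 1000).round(3)

print(race_df[["driverId", "time", "gapSeconds"]])
```

For these rows the recomputed gaps are 0.0, 11.415, and 20.009 seconds, which agree with the `+11.415` and `+20.009` strings in the `time` column.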
