---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include inline prose to describe what is going on.

## Example

In the following code, we first utilized the requests library to retrieve the HTML content from the Wikipedia page. Afterward, we employed BeautifulSoup to parse the HTML and locate the specific table of interest by using the find function. Once the table was identified, we extracted the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we used Pandas to store the collected data in a DataFrame, allowing for easy analysis and visualization. The data could also be optionally saved as a CSV file for further use. 


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 1:
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
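
The printed output shows that the populations are stored as strings with thousands separators, and that at least one row (index 2) is misaligned, with a percentage in the `Population` column. The following is a minimal sketch of one way to convert the values to integers and set aside rows that do not parse. The helper name `parse_population` is hypothetical and the rows are copied from the output above.

```python
import re

def parse_population(text):
    """Return the population as an int, or None if the cell is not a plain count."""
    cleaned = text.strip().replace(",", "")
    # Accept digits only; values such as '17.3%' are treated as misaligned cells.
    return int(cleaned) if re.fullmatch(r"\d+", cleaned) else None

# Rows copied from the printed output above.
rows = [
    ("World", "8,119,000,000"),
    ("China", "1,409,670,000"),
    ("1,404,910,000", "17.3%"),  # misaligned row
    ("United States", "335,893,238"),
]

clean_rows = []
suspect_rows = []
for country, population in rows:
    value = parse_population(population)
    if value is None:
        suspect_rows.append((country, population))
    else:
        clean_rows.append((country, value))

print(clean_rows)
print(suspect_rows)
```

The same helper could be applied to the DataFrame column with `data['Population'].map(parse_population)`, leaving the suspect rows for manual inspection.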


{{< include closing.qmd >}} 

### Import Required Libraries

In [11]:
import requests
import pandas as pd
import numpy as np
import json
import os
import re
from datetime import datetime
import time
from bs4 import BeautifulSoup
import csv

### Race Information for 2000-2023 Seasons

In [11]:
def get_race_results(url, year, offset, limit=1000):
    full_url = f"{url}/{year}/results.json?limit={limit}&offset={offset}"
    result = requests.get(full_url)
    return result.json()
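
`get_race_results` exposes `limit` and `offset`, so a response that does not fit in one page can be collected in several requests. The sketch below assumes that each response reports the total number of records under `MRData["total"]`; the helper `fetch_all_pages` and the offline stand-in `fake_fetch` are hypothetical names used only for illustration.

```python
def fetch_all_pages(fetch_page, limit=1000):
    """Collect every page from an offset-paginated endpoint.

    fetch_page(offset, limit) must return the decoded JSON of one page.
    """
    pages = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        pages.append(page)
        # Assumption: the total record count is reported as a string.
        total = int(page["MRData"]["total"])
        offset += limit
        if offset >= total:
            break
    return pages

# Offline stand-in for the API so the sketch runs without network access.
def fake_fetch(offset, limit, total=2500):
    return {"MRData": {"total": str(total), "limit": str(limit), "offset": str(offset)}}

pages = fetch_all_pages(fake_fetch, limit=1000)
print([page["MRData"]["offset"] for page in pages])
```

With the real API, `fetch_page` could be a small wrapper such as `lambda offset, limit: get_race_results(url, year, offset, limit)`.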


In [21]:
# Testing for 2023 season
season_2023_json = get_race_results(url='http://ergast.com/api/f1', year=2023, offset=0)

# Save the data to a JSON file
with open('../../data/raw-data/race_data_2023.json', 'w') as outfile:
    json.dump(season_2023_json, outfile)

In [18]:
# collecting data from 2000 to 2022

# function to loop through years and fetch the results
def race_data(start_year, end_year, output_dr, url):

    for year in range(start_year, end_year + 1):
        # use a distinct local name so the function name is not shadowed
        season_results = get_race_results(url, year, offset=0)
        # save the output
        output_file = os.path.join(output_dr, f"race_data_{year}.json")
        with open(output_file, 'w') as f:
            json.dump(season_results, f)

In [20]:
# call race_data()
race_data(
    start_year = 2000,
    end_year = 2009,
    output_dr = "../../data/raw-data",
    url = 'http://ergast.com/api/f1'
)
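
Each saved season file is nested JSON, so it has to be flattened before it can be loaded into a table. The sketch below shows one possible way to do that. It assumes the races sit under `MRData -> RaceTable -> Races` (the same path used in the pit stop code on this page) and that each race carries a `Results` list with `Driver`, `position`, and `points` fields. The sample payload is made up for illustration.

```python
def flatten_results(season_json):
    """Turn one season's nested results into a flat list of row dictionaries."""
    rows = []
    races = season_json.get("MRData", {}).get("RaceTable", {}).get("Races", [])
    for race in races:
        # Assumption: each race has a "Results" list with one entry per driver.
        for result in race.get("Results", []):
            rows.append({
                "season": race.get("season"),
                "round": race.get("round"),
                "raceName": race.get("raceName"),
                "driverId": result.get("Driver", {}).get("driverId"),
                "position": result.get("position"),
                "points": result.get("points"),
            })
    return rows

# Made-up sample that mirrors the assumed structure.
sample = {
    "MRData": {"RaceTable": {"Races": [
        {"season": "2023", "round": "1", "raceName": "Sample Grand Prix",
         "Results": [
             {"position": "1", "points": "25", "Driver": {"driverId": "driver_a"}},
             {"position": "2", "points": "18", "Driver": {"driverId": "driver_b"}},
         ]},
    ]}}
}

rows = flatten_results(sample)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.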

### Driver Standings for 2000-2023 Seasons

In [26]:
def driverstanding_info(url, season):
    full_url = f"{url}/{season}/driverStandings.json"
    response = requests.get(full_url)
    return response.json()

# Function to fetch and save all driver standings for the given seasons
def driverstandings_info(start_year, end_year, output_file, url="http://ergast.com/api/f1"):
    driver_standings_data = {}
    
    for year in range(start_year, end_year + 1):
        data = driverstanding_info(url, year)
        driver_standings_data[year] = data
    
    # Save to output file
    with open(output_file, 'w') as outfile:
        json.dump(driver_standings_data, outfile)

# Call the function for seasons 2000–2023
driverstandings_info(
    start_year=2000,
    end_year=2023,
    output_file="../../data/raw-data/driver_standings/driver_standings_2000_2023.json"
)
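
One detail to keep in mind when reading this file back: `driver_standings_data` is keyed by integer years, but JSON object keys are always strings, so the years come back as `"2000"`, `"2001"`, and so on. A small sketch of the round trip, using placeholder payloads:

```python
import json

# Placeholder payloads keyed by integer year, as in driverstandings_info.
standings_by_year = {2000: {"MRData": {}}, 2001: {"MRData": {}}}

# Simulate writing the file and reading it back.
restored = json.loads(json.dumps(standings_by_year))
print(list(restored.keys()))

# Convert the keys back to integers if later code expects numeric years.
by_int_year = {int(year): payload for year, payload in restored.items()}
print(list(by_int_year.keys()))
```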

### Circuit Information for 2000-2023 Seasons

In [22]:
def circuit_info(output_file, url):
    results = requests.get(url)
    
    # save to output file
    with open(output_file, 'w') as f:
        json.dump(results.json(), f)
        

In [24]:
circuit_info(output_file='../../data/raw-data/circuit_data.json', url = "http://ergast.com/api/f1/circuits.json")

### News of Top 10 drivers in 2024 season (so far)
- Using News-API
- Resources: https://jfh.georgetown.domains/centralized-lecture-content/content/data-science/data-collection/share/API-newapi/news-api.html

#### Set Credentials

In [50]:
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

API_KEY='86d4dac5a4864ece92da90bc31277e53'

In [55]:
def news_data(topic, API_KEY, total_requests=1, verbose=True):
    baseURL = "https://newsapi.org/v2/everything?"

    # API parameters
    URLpost = {
        'apiKey': API_KEY,
        'q': '+'+topic,
        'sortBy': 'relevancy',
        'pageSize': 100,
        'page': 1
    }
    # use the driver's last name to avoid spaces in the file name
    file_name = topic.split()[-1]
    all_articles = []

    # make an API request 
    for request_num in range(total_requests):
        response = requests.get(baseURL, params=URLpost)
        response_data = response.json()

        articles = response_data.get('articles', [])
        all_articles.extend(articles)

        URLpost['page'] += 1


    # output file path
    output_dr = "../../data/raw-data/News_Drivers"
    output_file = os.path.join(output_dr, f"{file_name}_raw_text.json")

    # save to output file
    with open(output_file, 'w') as f:
        json.dump(all_articles, f, indent=4)
    
    return all_articles
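
The raw article list can contain the same story more than once and carries more fields than a text analysis needs. The sketch below shows one way to reduce it to one record per URL. It assumes each article is a dictionary with `url`, `title`, `description`, and `publishedAt` keys; the helper name `summarize_articles` and the sample articles are hypothetical.

```python
def summarize_articles(articles):
    """Keep one record per URL with the text fields needed for later analysis."""
    seen_urls = set()
    records = []
    for article in articles:
        url = article.get("url")
        if not url or url in seen_urls:
            continue  # skip duplicates and entries without a URL
        seen_urls.add(url)
        title = article.get("title") or ""
        description = article.get("description") or ""
        records.append({
            "url": url,
            "publishedAt": article.get("publishedAt"),
            "text": f"{title}. {description}".strip(),
        })
    return records

# Made-up articles used only to exercise the helper.
sample_articles = [
    {"url": "https://example.com/a", "title": "Sample headline",
     "description": "Sample description", "publishedAt": "2024-11-24T00:00:00Z"},
    {"url": "https://example.com/a", "title": "Sample headline",
     "description": "Sample description", "publishedAt": "2024-11-24T00:00:00Z"},
    {"url": "https://example.com/b", "title": "Another headline",
     "description": None, "publishedAt": None},
]

records = summarize_articles(sample_articles)
print(records)
```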


#### Top 10 Drivers as of Round 22 (Las Vegas Grand Prix)

1. Max Verstappen
2. Lando Norris
3. Charles Leclerc
4. Oscar Piastri
5. Carlos Sainz
6. George Russell
7. Lewis Hamilton
8. Sergio Perez
9. Fernando Alonso
10. Nico Hulkenberg

In [56]:
# testing 
text_data = news_data('Max Verstappen', API_KEY, total_requests=1, verbose=True)

In [57]:
# remaining drivers in the top 10
remaining_drivers = [
    'Lando Norris',
    'Charles Leclerc',
    'Oscar Piastri',
    'Carlos Sainz',
    'George Russell',
    'Lewis Hamilton',
    'Sergio Perez',
    'Fernando Alonso',
    'Nico Hulkenberg',
]

for driver in remaining_drivers:
    text_data = news_data(driver, API_KEY, total_requests=1, verbose=True)


### Fetch Weather data on the day of the race
![](../../data/images/weather_wiki.png)[^1]

[^1]: [2010 Bahrain Grand Prix](https://en.wikipedia.org/wiki/2010_Bahrain_Grand_Prix)

Weather conditions are scraped from the infobox on each race's Wikipedia page, as in the screenshot above.

In [61]:
# the url for each race is in the race data collected using ergast API
race_df = pd.read_csv("../../data/processed-data/all_race_results_cleaned.csv")

In [62]:
race_data = race_df[['season', 'raceName', 'url']]

In [63]:
race_data = race_data.drop_duplicates()

In [64]:
race_data.head()

    season               raceName                                                url
0     2010     Bahrain Grand Prix  http://en.wikipedia.org/wiki/2010_Bahrain_Gran...
24    2010  Australian Grand Prix  http://en.wikipedia.org/wiki/2010_Australian_G...
48    2010   Malaysian Grand Prix  http://en.wikipedia.org/wiki/2010_Malaysian_Gr...
72    2010     Chinese Grand Prix  http://en.wikipedia.org/wiki/2010_Chinese_Gran...
96    2010     Spanish Grand Prix  http://en.wikipedia.org/wiki/2010_Spanish_Gran...


In [None]:

def get_weather_from_wikipedia(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text, 'html.parser')  
    
    # locate the infobox table
    table = bs.find('table', {'class': 'infobox infobox-table vevent'})
    if not table:
        print(f"No infobox found on the page: {url}")
        return "Not Available"
    
    # search for the "Weather" row in the table
    for row in table.find_all('tr'):
        # find the header cell with class 'infobox-label'
        header = row.find('th', {'class': 'infobox-label'})
        # check if it contains "Weather"
        if header and 'Weather' in header.text:
            # find the corresponding data cell with class 'infobox-data'
            data = row.find('td', {'class': 'infobox-data'})

            if data:
                return data.text.strip()

    # no "Weather" row was found, so return the same placeholder as above
    return "Not Available"
    
    
race_data['weather'] = None

# fetch weather information for each URL
for index, row in race_data.iterrows():
    url = row['url']
    # for debugging purposes
    print(f"Fetching weather for: {url}")
    
    # get the weather information
    weather = get_weather_from_wikipedia(url)
    
    # update the weather column
    race_data.at[index, 'weather'] = weather

# save to output file
output_csv = "../../data/raw-data/weather/race_data_with_weather.csv"
race_data.to_csv(output_csv, index=False)

print(f"Updated race data saved to: {output_csv}")
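
The loop above requests one page per row and stops on the first network error. A possible refinement, sketched below, fetches each distinct URL once, pauses between requests, and records a placeholder when a request fails. The helper name `fetch_weather_once`, the delay value, and the offline stand-in `fake_fetch` are assumptions made for illustration, not part of the original workflow.

```python
import time

def fetch_weather_once(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) once per distinct URL and return a {url: weather} mapping."""
    weather_by_url = {}
    for url in urls:
        if url in weather_by_url:
            continue  # already fetched, skip the repeat request
        try:
            weather_by_url[url] = fetch(url)
        except Exception as exc:
            # Keep going so one bad page does not end the whole run.
            print(f"Failed for {url}: {exc}")
            weather_by_url[url] = "Not Available"
        time.sleep(delay_seconds)  # pause between requests
    return weather_by_url

# Offline stand-in for get_weather_from_wikipedia.
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("simulated network error")
    return "Sunny"

urls = ["page_a", "page_a", "page_broken"]
weather_by_url = fetch_weather_once(urls, fake_fetch, delay_seconds=0)
print(weather_by_url)
```

With the real scraper, the mapping could be applied with `race_data['url'].map(fetch_weather_once(race_data['url'], get_weather_from_wikipedia))`.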

In [None]:
import fastf1

# cache downloaded session data locally so repeated runs do not re-download it
os.makedirs('cache', exist_ok=True)
fastf1.Cache.enable_cache('cache')

In [None]:
track_data = []

def extract_track_features(year, race_name):
    session = fastf1.get_session(year, race_name, 'Q') 
    session.load()

    # Get the fastest lap
    fastest_lap = session.laps.pick_fastest()
    telemetry = fastest_lap.get_telemetry()

    # Track Length
    track_length = telemetry['Distance'].iloc[-1]  # Distance of the fastest lap

    # Max Speed
    max_speed = telemetry['Speed'].max()

    # Average Speed in m/s (lap distance in metres / lap time in seconds); not part of the returned dict
    avg_speed = track_length / fastest_lap['LapTime'].total_seconds()

    # Percentage of Full Throttle
    full_throttle = telemetry[telemetry['Throttle'] >= 95]
    perc_full_throttle = (len(full_throttle) / len(telemetry)) * 100

    # Number of Corners
    telemetry['is_corner'] = telemetry['Speed'] < 100
    num_corners = (telemetry['is_corner'] & ~telemetry['is_corner'].shift(1, fill_value=False)).sum()

    # Number of Straights
    telemetry['is_straight'] = telemetry['Speed'] > 150
    num_straights = (telemetry['is_straight'] & ~telemetry['is_straight'].shift(1, fill_value=False)).sum()

    return {
        "Year": year,
        "Grand Prix": race_name,
        "Track Length (m)": track_length,
        "Max Speed (km/h)": max_speed,
        "Full Throttle (%)": perc_full_throttle,
        "Number of Corners": num_corners,
        "Number of Straights": num_straights
    }

year = 2023
schedule = fastf1.get_event_schedule(year)

for _, event in schedule.iterrows():  
    if not pd.isna(event['Session1']):  
        try:
            track_features = extract_track_features(year, event['EventName'])
            track_data.append(track_features)
        except Exception as e:
            print(f"Failed for {event['EventName']} in {year}: {e}")

df_tracks = pd.DataFrame(track_data)
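
The corner and straight counts in `extract_track_features` rely on counting rising edges: a segment is counted each time the flag switches from `False` to `True`, so a run of consecutive slow (or fast) samples counts once. The plain-Python sketch below illustrates the same idea on a made-up list of speeds, using the 100 and 150 km/h thresholds from the code above; `count_segments` is a hypothetical helper name.

```python
def count_segments(flags):
    """Count runs of consecutive True values (rising edges) in a sequence."""
    count = 0
    previous = False
    for current in flags:
        if current and not previous:
            count += 1  # a new run starts here
        previous = current
    return count

# Made-up speed trace in km/h.
speeds = [210, 180, 95, 80, 90, 140, 220, 99, 85, 160]

is_corner = [speed < 100 for speed in speeds]
is_straight = [speed > 150 for speed in speeds]

print(count_segments(is_corner))
print(count_segments(is_straight))
```

Note that these thresholds are heuristics, so the counts approximate rather than reproduce the official number of corners for a circuit.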



In [None]:
# merging all racetrack features into a single csv
folder_path = "../../data/raw-data/circuit_data/"
output_name = "merged_circuit_features.csv"

dataframes = []

for file_name in os.listdir(folder_path):
    # skip the merged output itself so re-running this cell does not duplicate rows
    if file_name.endswith('.csv') and file_name != output_name:
        file_path = os.path.join(folder_path, file_name)
        df = pd.read_csv(file_path)
        dataframes.append(df)

merged_df = pd.concat(dataframes, ignore_index=True)

output_file = os.path.join(folder_path, output_name)
os.makedirs(os.path.dirname(output_file), exist_ok=True)
merged_df.to_csv(output_file, index=False)

### Pitstop data

In [21]:
# data available from 2011

# Function to fetch pitstop data for a specific race
def get_pitstop_data(year, round_number):
    url = f"http://ergast.com/api/f1/{year}/{round_number}/pitstops.json?limit=1000"
    response = requests.get(url)
    
    if response.status_code == 200 and response.text.strip():
        try:
            return response.json()
        except Exception as e:
            print(f"Error parsing JSON for {year} Round {round_number}: {e}")
            return None
    else:
        print(f"Failed to fetch data for {year} Round {round_number}: {response.status_code}")
        return None

# Function to fetch race schedule
def get_race_schedule(year):
    url = f"http://ergast.com/api/f1/{year}.json"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json().get("MRData", {}).get("RaceTable", {}).get("Races", [])
    else:
        print(f"Failed to fetch schedule for {year}: {response.status_code}")
        return []

# Function to extract and save pitstop data to CSV
def fetch_and_save_pitstop_data(start_year, end_year, output_csv):
    # Create the CSV file and write the header
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ["Year", "Round", "RaceName", "DriverID", "Lap", "Stop", "Time", "Duration"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        # Loop through years and races
        for year in range(start_year, end_year + 1):
            print(f"Fetching data for year: {year}")
            races = get_race_schedule(year)
            
            for race in races:
                round_number = race.get("round")
                race_name = race.get("raceName")
                print(f"Processing {race_name} (Round {round_number}) in {year}")

                # Fetch pitstop data for this race
                pitstop_data = get_pitstop_data(year, round_number)
                if pitstop_data:
                    races_list = pitstop_data.get("MRData", {}).get("RaceTable", {}).get("Races", [])
                    
                    # Ensure there is race data
                    if races_list:
                        pitstops = races_list[0].get("PitStops", [])
                        
                        # Write each pitstop to the CSV
                        for pitstop in pitstops:
                            writer.writerow({
                                "Year": year,
                                "Round": round_number,
                                "RaceName": race_name,
                                "DriverID": pitstop.get("driverId"),
                                "Lap": pitstop.get("lap"),
                                "Stop": pitstop.get("stop"),
                                "Time": pitstop.get("time"),
                                "Duration": pitstop.get("duration")
                            })
                    else:
                        print(f"No race data available for {race_name} in {year}")
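
The `Duration` column is written as text, so it needs converting before any numeric analysis. The sketch below is one possible converter. It assumes that most durations are plain seconds such as `23.227` and that longer stops may appear in a `minutes:seconds` form; that second format is an assumption and should be checked against the collected CSV. `duration_to_seconds` is a hypothetical helper name.

```python
def duration_to_seconds(duration):
    """Convert a duration string to seconds, or return None if it cannot be parsed."""
    if duration is None or duration == "":
        return None
    try:
        seconds = 0.0
        # Each ':' separates a larger unit, e.g. '1:02.5' is 1 minute and 2.5 seconds.
        for part in duration.split(":"):
            seconds = seconds * 60 + float(part)
    except ValueError:
        return None
    return seconds

print(duration_to_seconds("23.227"))
print(duration_to_seconds("1:02.5"))
print(duration_to_seconds("not a number"))
```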



In [23]:
output_csv = "../../data/raw-data/pitstop_data_test2.csv"
fetch_and_save_pitstop_data(
    start_year=2000,
    end_year=2023,
    output_csv=output_csv
)

Fetching data for year: 2000
Processing Australian Grand Prix (Round 1) in 2000
No race data available for Australian Grand Prix in 2000
Processing Brazilian Grand Prix (Round 2) in 2000
No race data available for Brazilian Grand Prix in 2000
Processing San Marino Grand Prix (Round 3) in 2000
No race data available for San Marino Grand Prix in 2000
Processing British Grand Prix (Round 4) in 2000
No race data available for British Grand Prix in 2000
Processing Spanish Grand Prix (Round 5) in 2000
No race data available for Spanish Grand Prix in 2000
Processing European Grand Prix (Round 6) in 2000
No race data available for European Grand Prix in 2000
Processing Monaco Grand Prix (Round 7) in 2000
No race data available for Monaco Grand Prix in 2000
Processing Canadian Grand Prix (Round 8) in 2000
No race data available for Canadian Grand Prix in 2000
Processing French Grand Prix (Round 9) in 2000
No race data available for French Grand Prix in 2000
Processing Austrian Grand Prix (Round