---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include inline prose that describes what each step does.

## Example

In the following code, we first utilized the requests library to retrieve the HTML content from the Wikipedia page. Afterward, we employed BeautifulSoup to parse the HTML and locate the specific table of interest by using the find function. Once the table was identified, we extracted the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we used Pandas to store the collected data in a DataFrame, allowing for easy analysis and visualization. The data could also be optionally saved as a CSV file for further use. 


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 2:  # Need at least three cells to read indices 1 and 2
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
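
Row 2 of the output above holds a population figure in the `Country` column and a percentage in the `Population` column, which suggests that at least one table row has a different cell layout. The following is a minimal sketch of one way to flag such rows before analysis. The helper names `looks_numeric` and `find_misaligned_rows` are hypothetical, and the sample values are copied from the first rows of the output above.

```python
def looks_numeric(text):
    """Return True if text is a number once commas and a trailing % are removed."""
    cleaned = text.replace(",", "").rstrip("%").strip()
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def find_misaligned_rows(countries, populations):
    """Return indices where the country looks numeric or the population is not a plain count."""
    flagged = []
    for i, (country, population) in enumerate(zip(countries, populations)):
        population_ok = population.replace(",", "").isdigit()
        if looks_numeric(country) or not population_ok:
            flagged.append(i)
    return flagged


# Sample values copied from the first rows of the output above
sample_countries = ["World", "China", "1,404,910,000", "United States"]
sample_populations = ["8,119,000,000", "1,409,670,000", "17.3%", "335,893,238"]
print(find_misaligned_rows(sample_countries, sample_populations))
```

Flagged indices can then be inspected by hand or dropped, depending on how the rest of the workflow treats incomplete rows.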


{{< include closing.qmd >}} 

### Import Required Libraries

In [27]:
import requests
import pandas as pd
import numpy as np
import json
import os
import re
from datetime import datetime

### Race Information for 2000-2023 Seasons

In [11]:
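def get_race_results(url, year, offset, limit=1000):
    """Fetch one page of race results for a season from the Ergast API."""
    full_url = f"{url}/{year}/results.json?limit={limit}&offset={offset}"
    result = requests.get(full_url)
    result.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
    return result.json()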
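
Because `get_race_results` accepts `offset` and `limit`, a season with more results than one page can hold could be collected page by page. The sketch below shows one possible loop, assuming each response reports the total record count as a string under `MRData["total"]`, as Ergast responses do. The names `fetch_all_pages` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the real request so the example runs offline.

```python
def fetch_all_pages(fetch_page, limit=100):
    """Collect every page from a fetch_page(offset, limit) callable.

    fetch_page is expected to return a dict shaped like an Ergast response,
    where MRData carries 'total', 'limit' and 'offset' as strings.
    """
    pages = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        pages.append(page)
        total = int(page["MRData"]["total"])
        offset += limit
        if offset >= total:  # Every record has been requested
            break
    return pages


def fake_fetch(offset, limit, total=250):
    """Offline stand-in for an API call (hypothetical, for illustration only)."""
    return {"MRData": {"total": str(total), "limit": str(limit), "offset": str(offset)}}


pages = fetch_all_pages(fake_fetch, limit=100)
print(len(pages))
```

With the real API, the callable could be `lambda offset, limit: get_race_results('http://ergast.com/api/f1', 2023, offset, limit)`.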


In [21]:
# Testing for 2023 season
season_2023_json = get_race_results(url='http://ergast.com/api/f1', year=2023, offset=0)

# Save the data to a JSON file
with open('../../data/raw-data/race_data_2023.json', 'w') as outfile:
    json.dump(season_2023_json, outfile)

In [18]:
# collecting data from 2000 to 2022

# function to loop through years and fetch the results
def race_data(start_year, end_year, output_dr, url):

    for year in range(start_year, end_year + 1):
        season_data = get_race_results(url, year, offset=0)  # avoid shadowing race_data()
        # save the output
        output_file = os.path.join(output_dr, f"race_data_{year}.json")
        with open(output_file, 'w') as f:
            json.dump(season_data, f)

In [20]:
# call race_data()
race_data(
    start_year = 2000,
    end_year = 2009,
    output_dr = "../../data/raw-data",
    url = 'http://ergast.com/api/f1'
)
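
The call above covers 2000 to 2009, while this section targets the 2000-2023 seasons, so the remaining years presumably come from further runs. A small check like the one below could confirm which season files are still missing. It is a sketch that assumes the `race_data_<year>.json` naming used by `race_data()`. The function name `missing_seasons` is hypothetical, and the demo writes to a temporary directory instead of the project's data folder.

```python
import os
import tempfile


def missing_seasons(output_dir, start_year, end_year):
    """Return the years in [start_year, end_year] with no race_data_<year>.json file."""
    missing = []
    for year in range(start_year, end_year + 1):
        path = os.path.join(output_dir, f"race_data_{year}.json")
        if not os.path.exists(path):
            missing.append(year)
    return missing


# Demo: create files for two seasons, then ask about four
with tempfile.TemporaryDirectory() as tmp_dir:
    for year in (2000, 2001):
        with open(os.path.join(tmp_dir, f"race_data_{year}.json"), "w") as f:
            f.write("{}")
    demo_missing = missing_seasons(tmp_dir, 2000, 2003)

print(demo_missing)
```

Pointing `output_dir` at `"../../data/raw-data"` would run the same check against the files saved by this page.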

### Driver Standings for 2000-2023 Seasons

In [26]:
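def driverstanding_info(url, season):
    """Fetch the final driver standings for a single season."""
    full_url = f"{url}/{season}/driverStandings.json"
    response = requests.get(full_url)
    response.raise_for_status()  # Fail early on HTTP errors
    return response.json()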

# Function to fetch and save all driver standings for the given seasons
def driverstandings_info(start_year, end_year, output_file, url="http://ergast.com/api/f1"):
    driver_standings_data = {}
    
    for year in range(start_year, end_year + 1):
        data = driverstanding_info(url, year)
        driver_standings_data[year] = data
    
    # Save to output file
    with open(output_file, 'w') as outfile:
        json.dump(driver_standings_data, outfile)

# Call the function for seasons 2000–2023
driverstandings_info(
    start_year=2000,
    end_year=2023,
    output_file="../../data/raw-data/driver_standings/driver_standings_2000_2023.json"
)
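
One detail to keep in mind when reading this file back: `driver_standings_data` is keyed by integer years, but JSON object keys are always strings, so `json.dump` writes them as `"2000"`, `"2001"`, and so on. The short sketch below illustrates the round trip with placeholder payloads. The helper `standings_for` is hypothetical.

```python
import json

# Placeholder payloads keyed by integer year, mirroring driver_standings_data
standings_by_year = {2000: {"MRData": {}}, 2001: {"MRData": {}}}

# Serialising and parsing again turns the integer keys into strings
round_tripped = json.loads(json.dumps(standings_by_year))
print(list(round_tripped.keys()))


def standings_for(data, year):
    """Look up a season in data loaded from the JSON file (keys are strings)."""
    return data[str(year)]


print(standings_for(round_tripped, 2000))
```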

### Circuit Information for 2000-2023 Seasons

In [22]:
def circuit_info(output_file, url):
    results = requests.get(url)
    
    # save to output file
    with open(output_file, 'w') as f:
        json.dump(results.json(), f)
        

In [24]:
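# Ergast returns 30 records per request by default, so request a larger page to get every circuit
circuit_info(output_file='../../data/raw-data/circuit_data.json', url="http://ergast.com/api/f1/circuits.json?limit=1000")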

### News on the Top 10 Drivers in the 2024 Season (So Far)
- Using News-API
- Resources: https://jfh.georgetown.domains/centralized-lecture-content/content/data-science/data-collection/share/API-newapi/news-api.html

#### Set Credentials

In [50]:
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

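# Read the key from an environment variable instead of hard-coding it
# (the variable name NEWS_API_KEY is an assumption; use whatever name you export)
API_KEY = os.environ.get('NEWS_API_KEY')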

In [55]:
def news_data(topic, API_KEY, total_requests=1, verbose=True):
    baseURL = "https://newsapi.org/v2/everything?"

    # API parameters
    URLpost = {
        'apiKey': API_KEY,
        'q': '+'+topic,
        'sortBy': 'relevancy',
        'pageSize': 100,
        'page': 1
    }
    # use the driver's last name to avoid spaces in the file name
    file_name = topic.split()[-1]
    all_articles = []

    # make an API request 
    for request_num in range(total_requests):
        response = requests.get(baseURL, params=URLpost)
        response_data = response.json()

        articles = response_data.get('articles', [])
        all_articles.extend(articles)

        URLpost['page'] += 1


    # output file path
    output_dr = "../../data/raw-data/News_Drivers"
    output_file = os.path.join(output_dr, f"{file_name}_raw_text.json")

    # save to output file
    with open(output_file, 'w') as f:
        json.dump(all_articles, f, indent=4)
    
    return all_articles
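
`news_data` reads `articles` with a default of `[]`, so a failed request (for example an invalid key or an exhausted quota) silently produces an empty file. The sketch below shows one way to surface such failures, assuming the payload carries a `status` field of `ok` or `error` plus `code` and `message` on errors, as the News API documentation describes. The helper name `extract_articles` and both sample payloads are hypothetical.

```python
def extract_articles(response_data):
    """Return the 'articles' list, raising if the payload reports an error."""
    if response_data.get("status") != "ok":
        code = response_data.get("code", "unknown")
        message = response_data.get("message", "no message provided")
        raise RuntimeError(f"News API error ({code}): {message}")
    return response_data.get("articles", [])


# Hand-written sample payloads, not real API output
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example headline"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Example message"}

print(len(extract_articles(ok_payload)))
try:
    extract_articles(error_payload)
except RuntimeError as err:
    print(err)
```

Inside `news_data`, the line `articles = response_data.get('articles', [])` could call this helper instead, so a bad response stops the run before anything is written.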


Top 10 Drivers as of Round 22 (Las Vegas Grand Prix):
1. Max Verstappen
2. Lando Norris
3. Charles Leclerc
4. Oscar Piastri
5. Carlos Sainz
6. George Russell
7. Lewis Hamilton
8. Sergio Perez
9. Fernando Alonso
10. Nico Hulkenberg

In [56]:
# testing 
text_data = news_data('Max Verstappen', API_KEY, total_requests=1, verbose=True)

In [57]:
# Fetch news for the remaining nine drivers in the top 10
remaining_drivers = [
    'Lando Norris',
    'Charles Leclerc',
    'Oscar Piastri',
    'Carlos Sainz',
    'George Russell',
    'Lewis Hamilton',
    'Sergio Perez',
    'Fernando Alonso',
    'Nico Hulkenberg',
]
for driver in remaining_drivers:
    text_data = news_data(driver, API_KEY, total_requests=1, verbose=True)
