---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include inline prose to describe what is going on.

## Example

In the following code, we first used the `requests` library to retrieve the HTML content of the Wikipedia page. We then parsed the HTML with BeautifulSoup and located the table of interest with the `find` method, which returns the first table with the class `wikitable`. Once the table was identified, we iterated through its rows and collected country names and their populations. Finally, we stored the collected data in a Pandas DataFrame for analysis and visualization, and saved it as a CSV file for further use.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the request failed

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 1:
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
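
The scraped populations are strings with thousands separators, and one row in the output above is misaligned (a population appears in the `Country` column and a percentage in `Population`). The following is a minimal cleaning sketch, not part of the original workflow. The `sample` frame and the `clean_population` helper are hypothetical and only mirror a few rows of the printed output.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the scraped output above,
# including the misaligned row whose "Population" is a percentage.
sample = pd.DataFrame({
    "Country": ["World", "China", "1,404,910,000", "United States", "Vatican City"],
    "Population": ["8,119,000,000", "1,409,670,000", "17.3%", "335,893,238", "764"],
})


def clean_population(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Population as integers; rows that fail to parse are dropped."""
    out = df.copy()
    # Remove thousands separators, then coerce non-numeric values (e.g. "17.3%") to NaN
    out["Population"] = pd.to_numeric(
        out["Population"].str.replace(",", "", regex=False), errors="coerce"
    )
    out = out.dropna(subset=["Population"])
    out["Population"] = out["Population"].astype("int64")
    return out.reset_index(drop=True)


cleaned = clean_population(sample)
print(cleaned)
```

Dropping unparseable rows hides the misaligned record rather than repairing it, so it is worth inspecting which rows were removed before relying on the cleaned table.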


In [7]:
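import os

from googleapiclient.discovery import build
import pandas as pd

# API key: read from an environment variable instead of hard-coding it in the page
# (the variable name YOUTUBE_API_KEY is an assumption; use whatever your setup defines)
api_key = os.environ["YOUTUBE_API_KEY"]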

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read song data and fetch YouTube statistics
with open('song_data.txt', 'r') as file:
    for line in file:
        # Strip newline characters and spaces
        query = line.strip()

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,  # Use the query from the file
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        # Get video IDs
        video_ids = [item['id']['videoId'] for item in search_response['items']]
        if not video_ids:
            continue  # Skip if no results

        # Fetch video details (statistics)
        video_request = youtube.videos().list(
            part="statistics",
            id=",".join(video_ids)
        )
        video_response = video_request.execute()

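        # Collect results for the current query.
        # Match statistics to search results by video ID rather than by position,
        # since the videos response is not guaranteed to contain every video
        # or to preserve the order of the search results.
        stats_by_id = {v['id']: v['statistics'] for v in video_response['items']}
        query_data = []
        for item in search_response['items']:
            stats = stats_by_id.get(item['id']['videoId'])
            if stats is None or 'viewCount' not in stats:
                continue  # Skip videos without an available view count
            query_data.append({
                "titles": item['snippet']['title'],
                "view_counts": int(stats['viewCount']),
                "query": query
            })

        if not query_data:
            continue  # Nothing usable for this query

        # Convert query-specific data to a DataFrame and sort by view_counts
        query_df = pd.DataFrame(query_data)
        query_df = query_df.sort_values(by="view_counts", ascending=False)

        # Append the sorted data to the final list
        all_data.append(query_df)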

# Concatenate all sorted query-specific DataFrames into one
final_df = pd.concat(all_data, ignore_index=True)

final_df


Unnamed: 0,titles,view_counts,query
0,Taylor Swift - Anti-Hero (Official Music Video),212893513,Anti-Hero by Taylor Swift
1,Taylor Swift - Anti-Hero (Official Lyric Video),34875876,Anti-Hero by Taylor Swift
2,Taylor Swift - Anti-Hero (Lyrics),14266885,Anti-Hero by Taylor Swift
3,Taylor Swift - Anti Hero (Lyrics) &quot;It&#39...,5444496,Anti-Hero by Taylor Swift
4,Taylor Swift - Anti-Hero,1332074,Anti-Hero by Taylor Swift
...,...,...,...
110,Lorde - Tennis Court,131679562,Tennis Court by Lorde
111,Lorde - Tennis Court (Flume Remix),115590517,Tennis Court by Lorde
112,Lorde - Tennis Court (Audio),2182597,Tennis Court by Lorde
113,Lorde - Tennis Court (Glastonbury 2017),365081,Tennis Court by Lorde
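
Some titles in the output above contain HTML entities such as `&quot;`. As a small sketch of one way to decode them, the standard-library `html.unescape` function can be applied to each title. The `raw_title` string below is a hypothetical example, not a row from the collected data.

```python
import html

# Hypothetical title containing the kinds of entities seen in the output above
raw_title = "Example Artist - Example Song (Lyrics) &quot;It&#39;s a demo&quot;"

# html.unescape converts named and numeric character references to plain text
clean_title = html.unescape(raw_title)
print(clean_title)
```

Assuming `final_df` is the DataFrame built above, the same function could be mapped over the column with `final_df["titles"].map(html.unescape)` before saving.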


In [8]:
# index=False avoids writing the row index as an extra column
final_df.to_csv('view_counts.csv', index=False)

{{< include closing.qmd >}} 