---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include in-line prose to describe what is going on.

## Example

In the following code, we first use the `requests` library to retrieve the HTML content of the Wikipedia page. We then parse the HTML with BeautifulSoup and locate the table of interest using its `find` method. Once the table is identified, we extract the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we store the collected data in a Pandas DataFrame for analysis and visualization, and save it as a CSV file for further use.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 2:  # At least three cells are needed because cells[2] is read below
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
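
Row 2 of the output above holds a population figure in the `Country` column, which suggests that some table rows do not match the fixed indices `cells[1]` and `cells[2]`. The populations are also still strings with thousands separators. One possible validation step, sketched below with only the standard library, converts each population string to an integer and skips rows that look misaligned. The helper names `parse_population` and `clean_rows` are illustrative and not part of the original workflow; the sample rows are copied from the output above.

```python
import re

def parse_population(text):
    """Return the population as an int, or None if text is not a plain number."""
    cleaned = text.replace(",", "").strip()
    # Accept only digit strings; values such as "17.3%" are rejected
    return int(cleaned) if re.fullmatch(r"\d+", cleaned) else None

def clean_rows(rows):
    """Keep (country, population) pairs that look correctly aligned."""
    cleaned = []
    for country, population in rows:
        value = parse_population(population)
        # A numeric "country" or an unparseable population signals misalignment
        if value is None or parse_population(country) is not None:
            continue
        cleaned.append((country, value))
    return cleaned

sample = [
    ("China", "1,409,670,000"),
    ("1,404,910,000", "17.3%"),  # misaligned row from the output above
    ("Vatican City", "764"),
]
print(clean_rows(sample))
```

Skipping a row discards that country entirely, so flagged rows may deserve a manual look rather than silent removal.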


In [7]:
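import os
from googleapiclient.discovery import build
import pandas as pd

# API key, read from an environment variable instead of being hard-coded
# (the variable name YOUTUBE_API_KEY is an assumption)
api_key = os.environ["YOUTUBE_API_KEY"]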

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read song data and fetch YouTube statistics
with open('song_data.txt', 'r') as file:
    for line in file:
        # Strip newline characters and spaces
        query = line.strip()

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,  # Use the query from the file
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        # Get video IDs
        video_ids = [item['id']['videoId'] for item in search_response['items']]
        if not video_ids:
            continue  # Skip if no results

        # Fetch video details (statistics)
        video_request = youtube.videos().list(
            part="statistics",
            id=",".join(video_ids)
        )
        video_response = video_request.execute()

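        # Index statistics by video ID; the videos response is not guaranteed
        # to match the order or length of the search results
        stats_by_id = {v['id']: v['statistics'] for v in video_response['items']}

        # Collect results for the current query, matching each search hit by ID
        query_data = []
        for item in search_response['items']:
            stats = stats_by_id.get(item['id']['videoId'])
            if stats is None or 'viewCount' not in stats:
                continue  # Skip videos without public view counts
            query_data.append({
                "titles": item['snippet']['title'],
                "view_counts": int(stats['viewCount']),
                "query": query
            })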

        # Convert query-specific data to a DataFrame and sort by view_counts
        query_df = pd.DataFrame(query_data)
        query_df = query_df.sort_values(by="view_counts", ascending=False)

        # Append the sorted data to the final list
        all_data.append(query_df)

# Concatenate all sorted query-specific DataFrames into one
final_df = pd.concat(all_data, ignore_index=True)

final_df


Unnamed: 0,titles,view_counts,query
0,Taylor Swift - Anti-Hero (Official Music Video),212893513,Anti-Hero by Taylor Swift
1,Taylor Swift - Anti-Hero (Official Lyric Video),34875876,Anti-Hero by Taylor Swift
2,Taylor Swift - Anti-Hero (Lyrics),14266885,Anti-Hero by Taylor Swift
3,Taylor Swift - Anti Hero (Lyrics) &quot;It&#39...,5444496,Anti-Hero by Taylor Swift
4,Taylor Swift - Anti-Hero,1332074,Anti-Hero by Taylor Swift
...,...,...,...
110,Lorde - Tennis Court,131679562,Tennis Court by Lorde
111,Lorde - Tennis Court (Flume Remix),115590517,Tennis Court by Lorde
112,Lorde - Tennis Court (Audio),2182597,Tennis Court by Lorde
113,Lorde - Tennis Court (Glastonbury 2017),365081,Tennis Court by Lorde


In [8]:
final_df.to_csv('view_counts.csv', index=False)  # Omit the index to avoid an "Unnamed: 0" column on reload
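
One title in the output above contains HTML entities (`&quot;`, `&#39`), so the titles appear to arrive HTML-escaped. If plain text is preferred in the saved file, Python's standard `html.unescape` can decode them. A small sketch follows; `clean_title` is a hypothetical helper, and the sample title is made up for illustration rather than taken from the data.

```python
import html

def clean_title(title):
    """Decode HTML entities such as &quot; and &#39; in a video title."""
    return html.unescape(title)

# Hypothetical title, for illustration only
print(clean_title("Artist - Song (Lyrics) &quot;It&#39;s a sample&quot;"))
```

With pandas, the same function could be applied to the whole column, for example `final_df['titles'].map(clean_title)`, before writing the CSV.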

In [None]:
import os
import csv
from googleapiclient.discovery import build

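# Set up the YouTube Data API key and service client
# (the key is read from an environment variable; the name YOUTUBE_API_KEY is an assumption)
API_KEY = os.environ["YOUTUBE_API_KEY"]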
youtube = build("youtube", "v3", developerKey=API_KEY)

# List of video IDs
video_ids = [
    "EqDlrimnMCE", "m6N6jOt7heY"
]

# Fields to extract
fields = [
    "videoId", "title", "description", "publishedAt", "tags",
    "viewCount", "likeCount", "commentCount", "categoryId", "duration",
    "dimension", "definition"
]

# List that stores the results
results = []

# Fetch the relevant data for a given video ID (called once per ID below)
def get_video_data(video_id):
    request = youtube.videos().list(
        part="snippet,statistics,contentDetails",
        id=video_id
    )
    response = request.execute()
    
    if "items" in response and response["items"]:
        item = response["items"][0]
        snippet = item.get("snippet", {})
        statistics = item.get("statistics", {})
        content_details = item.get("contentDetails", {})
        
        data = {
            "videoId": video_id,
            "title": snippet.get("title", ""),
            "description": snippet.get("description", ""),
            "publishedAt": snippet.get("publishedAt", ""),
            "tags": snippet.get("tags", []),
            "viewCount": statistics.get("viewCount", "0"),
            "likeCount": statistics.get("likeCount", "0"),
            "commentCount": statistics.get("commentCount", "0"),
            "categoryId": snippet.get("categoryId", ""),
            "duration": content_details.get("duration", ""),
            "dimension": content_details.get("dimension", ""),
            "definition": content_details.get("definition", "")
        }
        return data
    return None

for video_id in video_ids:
    video_data = get_video_data(video_id)
    if video_data:
        results.append(video_data)

# Write the results to a CSV file
output_path = "../data/youtube_video_data.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()
    for row in results:
        writer.writerow(row)

print(f"Data saved to {output_path}")
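
The loop above issues one `videos().list` call per video ID. The earlier search example already passes several IDs joined with commas in a single call, so longer ID lists could be grouped into batches to use fewer requests. A minimal sketch of the batching step follows. The default batch size of 50 is an assumption about the per-request limit and should be checked against the API documentation, and `batch_ids` is a hypothetical helper.

```python
def batch_ids(video_ids, batch_size=50):
    """Yield comma-joined ID strings, each covering at most batch_size IDs."""
    for start in range(0, len(video_ids), batch_size):
        yield ",".join(video_ids[start:start + batch_size])

ids = ["EqDlrimnMCE", "m6N6jOt7heY"]
for id_param in batch_ids(ids):
    # Each string could be passed as id=id_param to youtube.videos().list(...)
    print(id_param)
```

With batching, the response contains several items per call, so the processing code would need to loop over `response["items"]` instead of reading only the first element.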

In [12]:
import os
import requests

# API key, read from an environment variable (the name FBI_API_KEY is an assumption)
api_key = os.environ['FBI_API_KEY']

# Request URL: hate-crime data for Virginia
url = 'https://api.usa.gov/crime/fbi/cde/hate-crime/state/VA'

# Query parameters: year range, bias type, and the API key
params = {'from': '2020', 'type': 'race', 'to': '2021', 'API_KEY': api_key}

# Send the GET request
response = requests.get(url, params=params)

# Check the response status code
if response.status_code == 200:
    data = response.json()  # Parse the returned JSON data
    print(data)
else:
    print("Failed to retrieve data:", response.status_code)

Failed to retrieve data: 400
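
The request above returned status 400, and the cause is not established here. When debugging such a response, it can help to build the query string from a dictionary so that each parameter is visible and easy to change. A sketch using only the standard library follows. The path and parameter values mirror the URL used in the request above and are not verified against the API; `build_url` is a hypothetical helper and `YOUR_KEY` is a placeholder.

```python
from urllib.parse import urlencode

def build_url(base_url, path, params):
    """Join a base URL, a path, and URL-encoded query parameters."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}?{urlencode(params)}"

url = build_url(
    "https://api.usa.gov/crime/fbi/cde/",
    "hate-crime/state/VA",
    {"from": "2020", "type": "race", "to": "2021", "API_KEY": "YOUR_KEY"},
)
print(url)
```

Printing `response.text` alongside the status code may also reveal the server's own explanation of the error.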


In [13]:
import os
from googleapiclient.discovery import build

# API key, read from an environment variable (the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ['YOUTUBE_API_KEY']

# Create the YouTube client object
youtube = build('youtube', 'v3', developerKey=api_key)

# Video ID
video_id = 'rDolt3jJRsM'

# Call the API to fetch the video details
video_response = youtube.videos().list(
    part='snippet,contentDetails,statistics',
    id=video_id
).execute()

# Print the video details
for video in video_response.get('items', []):
    title = video['snippet']['title']
    description = video['snippet']['description']
    duration = video['contentDetails']['duration']
    view_count = video['statistics']['viewCount']
    like_count = video['statistics'].get('likeCount', 'Unavailable')
    comment_count = video['statistics'].get('commentCount', 'Unavailable')

    print(f'Title: {title}')
    print(f'Description: {description}')
    print(f'Duration: {duration}')
    print(f'View count: {view_count}')
    print(f'Like count: {like_count}')
    print(f'Comment count: {comment_count}')

# Fetch the top comments
comments_response = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    order='relevance',  # Sort by relevance
    maxResults=5  # Fetch the top 5 comments
).execute()

print("\nTop Comments:")
for comment in comments_response.get('items', []):
    author = comment['snippet']['topLevelComment']['snippet']['authorDisplayName']
    text = comment['snippet']['topLevelComment']['snippet']['textDisplay']
    print(f'{author}: {text}')

Title: ENHYPEN (엔하이픈) 'No Doubt' Official MV
Description: ENHYPEN (엔하이픈) 'No Doubt' Official MV

Credits:
Directed by Yunah Sheep

ⓒ BELIFT LAB Inc. All Rights Reserved

Connect with ENHYPEN
OFFICIAL WEBSITE https://ENHYPEN.com
ENHYPEN Weverse https://www.weverse.io/enhypen
OFFICIAL YOUTUBE https://www.youtube.com/ENHYPENOFFICIAL
OFFICIAL X (TWITTER) https://twitter.com/ENHYPEN
ENHYPEN X (TWITTER) https://twitter.com/ENHYPEN_members
OFFICIAL FACEBOOK https://www.facebook.com/officialENHYPEN
OFFICIAL INSTAGRAM https://www.instagram.com/enhypen
OFFICIAL TIKTOK  https://www.tiktok.com/@enhypen
OFFICIAL WEIBO https://weibo.com/ENHYPEN
OFFICIAL BILIBILI https://space.bilibili.com/3493119035181246
OFFICIAL JAPAN X (TWITTER) https://twitter.com/ENHYPEN_JP

#ENHYPEN #엔하이픈 #ROMANCE_UNTOLD_daydream #NoDoubt
Duration: PT3M5S
View count: 28195106
Like count: 819592
Comment count: 64086

Top Comments:
@attaetude: I LOVE THE CHOREOGRAPHY THE SHOULDER DANCE AND THE WHISTLE THING THE SONG THE OUTFITS 
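
The `Duration` value printed above, `PT3M5S`, looks like an ISO 8601 duration string, which is awkward to compare or aggregate directly. A minimal sketch of converting such strings to seconds follows. It handles only days, hours, minutes, and whole seconds, which is an assumption about the values that occur, and `duration_to_seconds` is a hypothetical helper.

```python
import re

# Matches durations such as PT3M5S, PT1H2M3S, or P1DT2H
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def duration_to_seconds(duration):
    """Convert a duration string such as 'PT3M5S' to a number of seconds."""
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

print(duration_to_seconds("PT3M5S"))  # 185
```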

In [16]:
import os
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import dateutil.parser

# YouTube API key, read from an environment variable
# (the variable name YOUTUBE_API_KEY is an assumption)
api_key = os.environ['YOUTUBE_API_KEY']

# Video ID
video_id = 'rDolt3jJRsM'

# Create a YouTube object
youtube = build('youtube', 'v3', developerKey=api_key)

# Fetch video details
video_response = youtube.videos().list(
    part='snippet,contentDetails,statistics',
    id=video_id
).execute()

# Extract video and channel details
video = video_response['items'][0]
snippet = video['snippet']
statistics = video['statistics']
content_details = video['contentDetails']

# Calculate days since published
published_at = dateutil.parser.parse(snippet['publishedAt'])
days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

# Get channel details for subscriber count
channel_id = snippet['channelId']
channel_response = youtube.channels().list(
    part='statistics',
    id=channel_id
).execute()
subscriber_count = channel_response['items'][0]['statistics']['subscriberCount']

# Create a DataFrame
data = {
    'Title': snippet['title'],
    'Description': snippet['description'],
    'Published At': snippet['publishedAt'],
    'Days Since Published': days_since_published,
    'View Count': statistics['viewCount'],
    'Like Count': statistics.get('likeCount', 'Unavailable'),
    'Comment Count': statistics.get('commentCount', 'Unavailable'),
    'Subscriber Count': subscriber_count,
    'Category ID': snippet['categoryId'],
    'Definition': content_details['definition']
}
df = pd.DataFrame([data])

# Get top 10 comments
comments_response = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    order='relevance',
    maxResults=10
).execute()

top_comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                for comment in comments_response.get('items', [])]
df['Top Comments'] = pd.Series([top_comments])

# Save DataFrame to CSV
safe_title = "".join(x for x in snippet['title'] if x.isalnum() or x in " _-").rstrip()
filename = f"{safe_title}.csv"
df.to_csv(filename, index=False)

print(f'Data saved to {filename}')

Data saved to ENHYPEN 엔하이픈 No Doubt Official MV.csv
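
The "days since published" calculation above relies on the third-party `dateutil` parser. Where that dependency is unwanted, the same value can be computed with the standard library alone. A minimal sketch follows, assuming the timestamp is an ISO 8601 string in UTC with a trailing `Z`; `days_since` is a hypothetical helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timezone

def days_since(published_at, now=None):
    """Return whole days elapsed since an ISO 8601 UTC timestamp."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - published).days

# Hypothetical timestamps, for illustration only
fixed_now = datetime(2024, 1, 11, 12, 0, tzinfo=timezone.utc)
print(days_since("2024-01-01T00:00:00Z", now=fixed_now))  # 10
```

Passing `now` explicitly keeps the function testable, while the default falls back to the current UTC time.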
