---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, NOT just a notebook with a collection of code cells. Include in-line prose to describe what is going on.

## Example

In the following code, we first utilized the requests library to retrieve the HTML content from the Wikipedia page. Afterward, we employed BeautifulSoup to parse the HTML and locate the specific table of interest by using the find function. Once the table was identified, we extracted the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we used Pandas to store the collected data in a DataFrame, allowing for easy analysis and visualization. The data could also be optionally saved as a CSV file for further use. 


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 1:
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
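
Row 2 of the output above has a population figure in the `Country` column and a percentage in `Population`, which suggests that some table rows do not follow the fixed column positions the loop assumes. The following is a minimal sketch of one way to catch such rows after scraping, assuming that valid population values are comma-separated digit strings. The helper name `parse_population` is hypothetical and is not part of the original workflow.

```python
import re

def parse_population(text):
    """Return the population as an int, or None if the text is not a plain count."""
    cleaned = text.strip().replace(",", "")
    # Accept only strings made entirely of digits, e.g. "1409670000".
    if re.fullmatch(r"\d+", cleaned):
        return int(cleaned)
    return None

# Three rows copied from the output above, including the misaligned one.
rows = [
    ("China", "1,409,670,000"),
    ("1,404,910,000", "17.3%"),
    ("Vatican City", "764"),
]

# Keep only rows whose population parses cleanly.
valid = [(country, parse_population(pop)) for country, pop in rows
         if parse_population(pop) is not None]
print(valid)
```

Rows that fail the check can then be inspected by hand instead of silently entering the dataset.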


In [7]:
import os
from googleapiclient.discovery import build
import pandas as pd

# API key, read from an environment variable (the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ["YOUTUBE_API_KEY"]

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read song data and fetch YouTube statistics
with open('song_data.txt', 'r') as file:
    for line in file:
        # Strip newline characters and spaces
        query = line.strip()

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,  # Use the query from the file
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        # Get video IDs
        video_ids = [item['id']['videoId'] for item in search_response['items']]
        if not video_ids:
            continue  # Skip if no results

        # Fetch video details (statistics)
        video_request = youtube.videos().list(
            part="statistics",
            id=",".join(video_ids)
        )
        video_response = video_request.execute()

        # Collect results for the current query
        query_data = []
        for item, stats in zip(search_response['items'], video_response['items']):
            query_data.append({
                "titles": item['snippet']['title'],
                "view_counts": int(stats['statistics']['viewCount']),
                "query": query
            })

        # Convert query-specific data to a DataFrame and sort by view_counts
        query_df = pd.DataFrame(query_data)
        query_df = query_df.sort_values(by="view_counts", ascending=False)

        # Append the sorted data to the final list
        all_data.append(query_df)

# Concatenate all sorted query-specific DataFrames into one
final_df = pd.concat(all_data, ignore_index=True)

final_df


Unnamed: 0,titles,view_counts,query
0,Taylor Swift - Anti-Hero (Official Music Video),212893513,Anti-Hero by Taylor Swift
1,Taylor Swift - Anti-Hero (Official Lyric Video),34875876,Anti-Hero by Taylor Swift
2,Taylor Swift - Anti-Hero (Lyrics),14266885,Anti-Hero by Taylor Swift
3,Taylor Swift - Anti Hero (Lyrics) &quot;It&#39...,5444496,Anti-Hero by Taylor Swift
4,Taylor Swift - Anti-Hero,1332074,Anti-Hero by Taylor Swift
...,...,...,...
110,Lorde - Tennis Court,131679562,Tennis Court by Lorde
111,Lorde - Tennis Court (Flume Remix),115590517,Tennis Court by Lorde
112,Lorde - Tennis Court (Audio),2182597,Tennis Court by Lorde
113,Lorde - Tennis Court (Glastonbury 2017),365081,Tennis Court by Lorde
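
Some titles in the output above contain HTML entities such as `&quot;` and `&#39;`. A small sketch of decoding them with the standard library before the titles are stored follows; the sample strings are made up for illustration and are not taken from the API response.

```python
import html

# Made-up titles containing HTML entities.
raw_titles = [
    "Artist - Song (Lyrics) &quot;Hello&quot;",
    "Tom &amp; Jerry &#39;live&#39;",
]

# html.unescape converts named and numeric entities back to characters.
clean_titles = [html.unescape(title) for title in raw_titles]
for title in clean_titles:
    print(title)
```

In the cell above, the same call could be applied to `item['snippet']['title']` before it is appended to `query_data`.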


In [8]:
final_df.to_csv('view_counts.csv')

In [None]:
import os
import csv
from googleapiclient.discovery import build

# Set up the YouTube Data API key and service
# (the key is read from an environment variable; the name YOUTUBE_API_KEY is an assumption)
API_KEY = os.environ["YOUTUBE_API_KEY"]
youtube = build("youtube", "v3", developerKey=API_KEY)

# List of video IDs
video_ids = [
    "EqDlrimnMCE", "m6N6jOt7heY"
]

# Fields to extract
fields = [
    "videoId", "title", "description", "publishedAt", "tags",
    "viewCount", "likeCount", "commentCount", "categoryId", "duration",
    "dimension", "definition"
]

# List that stores the results
results = []

# Fetch the relevant data for each video ID
def get_video_data(video_id):
    request = youtube.videos().list(
        part="snippet,statistics,contentDetails",
        id=video_id
    )
    response = request.execute()
    
    if "items" in response and response["items"]:
        item = response["items"][0]
        snippet = item.get("snippet", {})
        statistics = item.get("statistics", {})
        content_details = item.get("contentDetails", {})
        
        data = {
            "videoId": video_id,
            "title": snippet.get("title", ""),
            "description": snippet.get("description", ""),
            "publishedAt": snippet.get("publishedAt", ""),
            "tags": snippet.get("tags", []),
            "viewCount": statistics.get("viewCount", "0"),
            "likeCount": statistics.get("likeCount", "0"),
            "commentCount": statistics.get("commentCount", "0"),
            "categoryId": snippet.get("categoryId", ""),
            "duration": content_details.get("duration", ""),
            "dimension": content_details.get("dimension", ""),
            "definition": content_details.get("definition", "")
        }
        return data
    return None

for video_id in video_ids:
    video_data = get_video_data(video_id)
    if video_data:
        results.append(video_data)

# Write the results to a CSV file
output_path = "../data/youtube_video_data.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()
    for row in results:
        writer.writerow(row)

print(f"Data saved to {output_path}")

In [12]:
import os
import requests

# Set the API key (read from an environment variable; the name FBI_API_KEY is an assumption)
api_key = os.environ.get('FBI_API_KEY', '')

# Build the request URL (base_url, endpoint and year are not used by the request below)
base_url = 'https://api.usa.gov/crime/fbi/sapi/'
endpoint = 'api/nibrs/violent-crime/offense/national/count'
# Year of interest
year = '2019'

# Full URL
url = f'https://api.usa.gov/crime/fbi/cde/hate-crime/state/VA?from=2020&type=race&to=2021&API_KEY={api_key}'

# Send the GET request
response = requests.get(url)

# Check the response status code
if response.status_code == 200:
    data = response.json()  # Parse the returned JSON data
    print(data)
else:
    print("Failed to retrieve data:", response.status_code)

Failed to retrieve data: 400
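
The cell above defines `base_url`, `endpoint`, and `year` but then requests a separately hand-written URL, so a mismatch between the two is easy to miss. The sketch below assembles the URL from its parts with the standard library. It reuses the path and parameter values from the request above, the function name `build_url` and the `YOUR_API_KEY` placeholder are hypothetical, and it does not by itself resolve the 400 response, whose cause is not shown in the output.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.usa.gov/crime/fbi/cde"

def build_url(path, params):
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    return f"{BASE_URL}/{path.strip('/')}?{urlencode(params)}"

# Parameter values copied from the request above; the key is a placeholder.
params = {
    "from": "2020",
    "type": "race",
    "to": "2021",
    "API_KEY": "YOUR_API_KEY",
}
url = build_url("hate-crime/state/VA", params)
print(url)
```

Keeping the parameters in a dictionary makes it easier to change one value at a time while working out which part of the request the server rejects.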


In [13]:
import os
from googleapiclient.discovery import build

# Your API key (read from an environment variable; the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ['YOUTUBE_API_KEY']

# Create the YouTube client object
youtube = build('youtube', 'v3', developerKey=api_key)

# Video ID
video_id = 'rDolt3jJRsM'

# Call the API to fetch the video details
video_response = youtube.videos().list(
    part='snippet,contentDetails,statistics',
    id=video_id
).execute()

# Print the video details
for video in video_response.get('items', []):
    title = video['snippet']['title']
    description = video['snippet']['description']
    duration = video['contentDetails']['duration']
    view_count = video['statistics']['viewCount']
    like_count = video['statistics'].get('likeCount', 'Unavailable')
    comment_count = video['statistics'].get('commentCount', 'Unavailable')

    print(f'Title: {title}')
    print(f'Description: {description}')
    print(f'Duration: {duration}')
    print(f'View count: {view_count}')
    print(f'Like count: {like_count}')
    print(f'Comment count: {comment_count}')

# Fetch the top comments
comments_response = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    order='relevance',  # Sort by relevance
    maxResults=5  # Fetch the top 5 comments
).execute()

print("\nTop Comments:")
for comment in comments_response.get('items', []):
    author = comment['snippet']['topLevelComment']['snippet']['authorDisplayName']
    text = comment['snippet']['topLevelComment']['snippet']['textDisplay']
    print(f'{author}: {text}')

Title: ENHYPEN (엔하이픈) 'No Doubt' Official MV
Description: ENHYPEN (엔하이픈) 'No Doubt' Official MV

Credits:
Directed by Yunah Sheep

ⓒ BELIFT LAB Inc. All Rights Reserved

Connect with ENHYPEN
OFFICIAL WEBSITE https://ENHYPEN.com
ENHYPEN Weverse https://www.weverse.io/enhypen
OFFICIAL YOUTUBE https://www.youtube.com/ENHYPENOFFICIAL
OFFICIAL X (TWITTER) https://twitter.com/ENHYPEN
ENHYPEN X (TWITTER) https://twitter.com/ENHYPEN_members
OFFICIAL FACEBOOK https://www.facebook.com/officialENHYPEN
OFFICIAL INSTAGRAM https://www.instagram.com/enhypen
OFFICIAL TIKTOK  https://www.tiktok.com/@enhypen
OFFICIAL WEIBO https://weibo.com/ENHYPEN
OFFICIAL BILIBILI https://space.bilibili.com/3493119035181246
OFFICIAL JAPAN X (TWITTER) https://twitter.com/ENHYPEN_JP

#ENHYPEN #엔하이픈 #ROMANCE_UNTOLD_daydream #NoDoubt
Duration: PT3M5S
View count: 28195106
Like count: 819592
Comment count: 64086

Top Comments:
@attaetude: I LOVE THE CHOREOGRAPHY THE SHOULDER DANCE AND THE WHISTLE THING THE SONG THE OUTFITS 
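
The `Duration` value printed above (`PT3M5S`) is an ISO 8601 duration string, which is awkward to analyse directly. A minimal sketch of converting it to seconds follows, assuming only hour, minute, and second components appear; durations with a day component are not handled. The function name `duration_to_seconds` is hypothetical.

```python
import re

# Matches strings such as "PT3M5S", "PT1H2M3S" or "PT45S".
_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")

def duration_to_seconds(value):
    """Convert an ISO 8601 time-only duration to a number of seconds."""
    match = _DURATION.fullmatch(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")
    # Missing components (for example, no hours) count as zero.
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds

print(duration_to_seconds("PT3M5S"))  # 3 minutes 5 seconds -> 185
```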

In [16]:
import os
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import dateutil.parser

# YouTube API key, read from an environment variable (the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ['YOUTUBE_API_KEY']

# Video ID
video_id = 'rDolt3jJRsM'

# Create a YouTube object
youtube = build('youtube', 'v3', developerKey=api_key)

# Fetch video details
video_response = youtube.videos().list(
    part='snippet,contentDetails,statistics',
    id=video_id
).execute()

# Extract video and channel details
video = video_response['items'][0]
snippet = video['snippet']
statistics = video['statistics']
content_details = video['contentDetails']

# Calculate days since published
published_at = dateutil.parser.parse(snippet['publishedAt'])
days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

# Get channel details for subscriber count
channel_id = snippet['channelId']
channel_response = youtube.channels().list(
    part='statistics',
    id=channel_id
).execute()
subscriber_count = channel_response['items'][0]['statistics']['subscriberCount']

# Create a DataFrame
data = {
    'Title': snippet['title'],
    'Description': snippet['description'],
    'Published At': snippet['publishedAt'],
    'Days Since Published': days_since_published,
    'View Count': statistics['viewCount'],
    'Like Count': statistics.get('likeCount', 'Unavailable'),
    'Comment Count': statistics.get('commentCount', 'Unavailable'),
    'Subscriber Count': subscriber_count,
    'Category ID': snippet['categoryId'],
    'Definition': content_details['definition']
}
df = pd.DataFrame([data])

# Get top 10 comments
comments_response = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    order='relevance',
    maxResults=10
).execute()

top_comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                for comment in comments_response.get('items', [])]
df['Top Comments'] = pd.Series([top_comments])

# Save DataFrame to CSV
safe_title = "".join(x for x in snippet['title'] if x.isalnum() or x in " _-").rstrip()
filename = f"{safe_title}.csv"
df.to_csv(filename, index=False)

print(f'Data saved to {filename}')

Data saved to ENHYPEN 엔하이픈 No Doubt Official MV.csv


In [17]:
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import dateutil.parser

# Replace 'YOUR_API_KEY' with your actual YouTube API key
#api_key = 'YOUR_API_KEY'
youtube = build('youtube', 'v3', developerKey=api_key)

# YouTube Video ID
video_id = 'rDolt3jJRsM'

# Fetching video details
video_response = youtube.videos().list(
    part='snippet,contentDetails,statistics',
    id=video_id
).execute()

video = video_response['items'][0]
snippet = video['snippet']
statistics = video['statistics']
content_details = video['contentDetails']

# Calculating days since the video was published
published_at = dateutil.parser.parse(snippet['publishedAt'])
days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

# Fetching channel details for subscriber count
channel_response = youtube.channels().list(
    part='statistics',
    id=snippet['channelId']
).execute()
subscriber_count = channel_response['items'][0]['statistics']['subscriberCount']

# Fetching top 10 comments
comments_response = youtube.commentThreads().list(
    part='snippet',
    videoId=video_id,
    order='relevance',
    maxResults=10
).execute()

top_comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                for comment in comments_response.get('items', [])]

# Preparing data for CSV
data = {
    'Title': snippet['title'],
    'Description': snippet['description'],
    'Published At': snippet['publishedAt'],
    'Days Since Published': days_since_published,
    'View Count': statistics['viewCount'],
    'Like Count': statistics.get('likeCount', 'Unavailable'),
    'Comment Count': statistics.get('commentCount', 'Unavailable'),
    'Subscriber Count': subscriber_count,
    'Category ID': snippet['categoryId'],
    'Definition': content_details['definition'],
    'Tags': snippet.get('tags', []),
    'Top Comments': top_comments
}
df = pd.DataFrame([data])

# Saving to CSV
filename = f"{snippet['title']}.csv".replace('/', '_').replace('\\', '_')  # Cleaning filename
df.to_csv(filename, index=False)

print(f'Data saved to {filename}')

Data saved to ENHYPEN (엔하이픈) 'No Doubt' Official MV.csv
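
Both cells above place Python lists (`Tags`, `Top Comments`) in a single cell before writing the CSV, so they are stored as plain text and come back as strings when the file is read again. One possible way to keep them recoverable is to store them as JSON strings. The sketch below uses only the standard library and made-up values, and writes to an in-memory buffer rather than a file.

```python
import csv
import io
import json

# Made-up record with two list-valued fields.
row = {
    "Title": "Example video",
    "Tags": ["pop", "live"],
    "Top Comments": ["Great!", "Love it, really"],
}

# Serialise list-valued fields as JSON strings before writing.
encoded = {key: json.dumps(value) if isinstance(value, list) else value
           for key, value in row.items()}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(encoded))
writer.writeheader()
writer.writerow(encoded)

# Reading back: decode the JSON columns into lists again.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
tags = json.loads(restored["Tags"])
comments = json.loads(restored["Top Comments"])
print(tags, comments)
```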


In [18]:
import pandas as pd
import random
import string

def generate_track_name(genre):
    """Generate a track name in the style of the given genre."""
    adjectives = {
        'Pop': ['Shiny', 'Happy', 'Bright', 'Lovely', 'Sweet'],
        'Rock': ['Loud', 'Wild', 'Intense', 'Dark', 'Heavy'],
        'Electronic': ['Cyber', 'Digital', 'Synthetic', 'Techno', 'Future'],
        'Hip Hop': ['Street', 'Urban', 'Raw', 'Smooth', 'Groove']
    }
    
    def random_string(length=3):
        return ''.join(random.choices(string.ascii_uppercase, k=length))
    
    return f"{random.choice(adjectives[genre])} {random.choice(['Beat', 'Rhythm', 'Sound', 'Pulse', 'Vibe'])} {random_string()}"

def generate_artists(genre):
    """Generate an artist name that fits the given genre."""
    artist_prefixes = {
        'Pop': ['Pop Star', 'Music', 'Melody', 'Rhythm'],
        'Rock': ['Rock Band', 'Guitar', 'Sonic', 'Metal'],
        'Electronic': ['Digital', 'Synth', 'Tech', 'Beat'],
        'Hip Hop': ['Flow', 'Street', 'Urban', 'Rap']
    }
    
    def random_string(length=3):
        return ''.join(random.choices(string.ascii_uppercase, k=length))
    
    return f"{random.choice(artist_prefixes[genre])} {random_string()}"

def generate_dataset():
    random.seed(42)  # Fix the random seed so the generated data is identical on every run
    genres = ['Pop', 'Rock', 'Electronic', 'Hip Hop']
    all_tracks = []

    for genre in genres:
        for _ in range(100):
            track = {
                'Genre': genre,
                'Track Name': generate_track_name(genre),
                'Artist': generate_artists(genre),
                'YouTube MV Link': f"https://youtube.com/watch?v={random.randint(10000, 99999)}",
                'Duration (min)': round(random.uniform(2.5, 5.5), 2)
            }
            all_tracks.append(track)

    df = pd.DataFrame(all_tracks)
    return df

# Generate the dataset
dataset = generate_dataset()

# Show all the data
print(dataset)

# Save to CSV
dataset.to_csv('music_tracks_dataset.csv', index=False)
print("\nDataset saved as 'music_tracks_dataset.csv'")

       Genre         Track Name        Artist  \
0        Pop     Shiny Beat TGD  Pop Star RXC   
1        Pop   Shiny Rhythm GPO    Rhythm FPV   
2        Pop    Happy Pulse IEY    Melody CJJ   
3        Pop    Shiny Pulse NZJ    Melody VQW   
4        Pop     Shiny Beat RUZ     Music WJL   
..       ...                ...           ...   
395  Hip Hop  Street Rhythm LSE    Street NUU   
396  Hip Hop  Street Rhythm KQU    Street MPC   
397  Hip Hop    Street Vibe AJZ     Urban SPY   
398  Hip Hop    Smooth Beat FRU     Urban FUN   
399  Hip Hop   Street Sound GAE     Urban UJK   

                       YouTube MV Link  Duration (min)  
0    https://youtube.com/watch?v=65302            2.60  
1    https://youtube.com/watch?v=10851            4.78  
2    https://youtube.com/watch?v=55082            4.31  
3    https://youtube.com/watch?v=85674            3.08  
4    https://youtube.com/watch?v=57819            2.99  
..                                 ...             ...  
395  https:/

In [20]:
import pandas as pd

# Let's create a hypothetical dataset representing 4 music genres with 100 music videos (tracks) each on YouTube.

# Define the genres
genres = ['Rock', 'Hip-Hop', 'Pop', 'Electronic/Dance']

# Simulating the data generation by randomly creating 'artist - track' names
# For simplicity, the artist names and tracks are made up and should be replaced with real data.

# Dictionary to hold genre, artist, and track
data = {'Genre': [], 'Artist': [], 'Track': []}

# Generating dummy data for each genre
for genre in genres:
    for i in range(1, 101):  # Assuming 100 unique tracks per genre
        data['Genre'].append(genre)
        data['Artist'].append(f"Artist_{i}_{genre}")
        data['Track'].append(f"Track_{i}_{genre}")

# Creating a DataFrame
df = pd.DataFrame(data)

# Display a portion of the DataFrame to verify
print(df.head())

# Save the DataFrame to CSV
csv_file_path = "/mnt/data/Music_Videos_Dataset.csv"
df.to_csv(csv_file_path, index=False)

csv_file_path

  Genre         Artist         Track
0  Rock  Artist_1_Rock  Track_1_Rock
1  Rock  Artist_2_Rock  Track_2_Rock
2  Rock  Artist_3_Rock  Track_3_Rock
3  Rock  Artist_4_Rock  Track_4_Rock
4  Rock  Artist_5_Rock  Track_5_Rock


OSError: Cannot save file into a non-existent directory: '/mnt/data'
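
The `OSError` above is raised because the target directory does not exist. A small sketch of creating the parent directory before writing follows. It uses the standard library `csv` module and a temporary directory so that it runs anywhere; the helper name `save_rows` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def save_rows(rows, fieldnames, path):
    """Write rows to a CSV file, creating missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path

# Write into a fresh temporary directory; the "data" subfolder does not exist yet.
base = Path(tempfile.mkdtemp())
out = save_rows(
    [{"Genre": "Rock", "Artist": "Artist_1_Rock", "Track": "Track_1_Rock"}],
    ["Genre", "Artist", "Track"],
    base / "data" / "Music_Videos_Dataset.csv",
)
print(out.exists())
```

With pandas, the same `mkdir` call placed before `df.to_csv(...)` has the same effect.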

*I started writing from here.*
# Data Collection
This project collects music data through the YouTube and Spotify APIs. It covers the works of 20 representative artists in each of five genres: Electronic, Jazz, Hip-Hop, Pop, and Rock. The data processing flow is as follows:

## 1. YouTube Data Collection
## 1.1 Acquiring Official Music Video Data

In [7]:
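import os
from datetime import datetime

import dateutil.parser
from googleapiclient.discovery import build
import pandas as pd

# API key, read from an environment variable (the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ["YOUTUBE_API_KEY"]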

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read song data and fetch YouTube statistics
with open('song_data.txt', 'r') as file:
    for line in file:
        artist = line.strip()
        query = f"{artist} official music video"

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        for item in search_response['items']:
            video_id = item['id']['videoId']
            video_request = youtube.videos().list(
                part="snippet,contentDetails,statistics",
                id=video_id
            )
            video_response = video_request.execute()

            for video in video_response['items']:
                snippet = video['snippet']
                content_details = video['contentDetails']
                statistics = video['statistics']

                # Calculate days since video was published
                published_at = dateutil.parser.parse(snippet['publishedAt'])
                days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

                # Fetch channel details for subscriber count
                channel_response = youtube.channels().list(
                    part='statistics',
                    id=snippet['channelId']
                ).execute()
                subscriber_count = channel_response['items'][0]['statistics'].get('subscriberCount', '0')

                # Fetch comments
                comments_request = youtube.commentThreads().list(
                    part='snippet',
                    videoId=video_id,
                    order='relevance',
                    maxResults=10
                )
                comments_response = comments_request.execute()

                comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                            for comment in comments_response.get('items', [])]

                # Store all data in a dictionary
                video_data = {
                    'Video ID': video_id,
                    'Title': snippet['title'],
                    'Description': snippet['description'],
                    'Published At': snippet['publishedAt'],
                    'Days Since Published': days_since_published,
                    'View Count': statistics.get('viewCount', '0'),
                    'Like Count': statistics.get('likeCount', '0'),
                    'Comment Count': statistics.get('commentCount', '0'),
                    'Comments': comments,
                    'Subscriber Count': subscriber_count,
                    'Category ID': snippet['categoryId'],
                    'Definition': content_details['definition'],
                    'Duration': content_details['duration']
                }

                all_data.append(video_data)

# Convert the list to a DataFrame
final_df = pd.DataFrame(all_data)

# Optionally save the DataFrame to a CSV file
final_df.to_csv('Detailed_YouTube_Video_Data.csv', index=False)

# Display the DataFrame
print(final_df.head())

FileNotFoundError: [Errno 2] No such file or directory: 'song_data.txt'

In [8]:
import os
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import dateutil.parser

# API key, read from an environment variable (the name YOUTUBE_API_KEY is an assumption)
api_key = os.environ["YOUTUBE_API_KEY"]

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read song data and fetch YouTube statistics
# The full run uses 20 artists for each of the 5 genres; for this demonstration, find.txt lists only three artists
with open('find.txt', 'r') as file:
    for line in file:
        artist = line.strip()
        query = f"{artist} official music video"

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        for item in search_response['items']:
            video_id = item['id']['videoId']
            video_request = youtube.videos().list(
                part="snippet,contentDetails,statistics",
                id=video_id
            )
            video_response = video_request.execute()

            for video in video_response['items']:
                snippet = video['snippet']
                content_details = video['contentDetails']
                statistics = video['statistics']

                # Calculate days since video was published
                published_at = dateutil.parser.parse(snippet['publishedAt'])
                days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

                # Fetch channel details for subscriber count
                channel_response = youtube.channels().list(
                    part='statistics',
                    id=snippet['channelId']
                ).execute()
                subscriber_count = channel_response['items'][0]['statistics'].get('subscriberCount', '0')

                # Fetch comments
                comments_request = youtube.commentThreads().list(
                    part='snippet',
                    videoId=video_id,
                    order='relevance',
                    maxResults=10
                )
                comments_response = comments_request.execute()

                comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                            for comment in comments_response.get('items', [])]

                # Store all data in a dictionary
                video_data = {
                    'Video ID': video_id,
                    'Title': snippet['title'],
                    'Description': snippet['description'],
                    'Published At': snippet['publishedAt'],
                    'Days Since Published': days_since_published,
                    'View Count': statistics.get('viewCount', '0'),
                    'Like Count': statistics.get('likeCount', '0'),
                    'Comment Count': statistics.get('commentCount', '0'),
                    'Comments': comments,
                    'Subscriber Count': subscriber_count,
                    'Category ID': snippet['categoryId'],
                    'Definition': content_details['definition'],
                    'Duration': content_details['duration']
                }

                all_data.append(video_data)

# Convert the list to a DataFrame
final_df = pd.DataFrame(all_data)

# Optionally save the DataFrame to a CSV file
final_df.to_csv('Example_Detailed_YouTube_Video_Data.csv', index=False)

# Display the DataFrame
print(final_df.head())

      Video ID                                              Title  \
0  1VQ_3sBZEm0    Foo Fighters - Learn To Fly (Official HD Video)   
1  eBG7P-K-r1Y        Foo Fighters - Everlong (Official HD Video)   
2  SBjQ9tuuTJQ                       Foo Fighters - The Pretender   
3  EqWRaAF6_WY         Foo Fighters - My Hero (Official HD Video)   
4  h_L4Rixya64  Foo Fighters - Best Of You (Official Music Video)   

                                         Description          Published At  \
0  Foo Fighters' official music video for 'Learn ...  2009-10-03T04:46:13Z   
1  "Everlong" by Foo Fighters \nListen to Foo Fig...  2009-10-03T04:49:58Z   
2  Watch the official music video for "The Preten...  2009-10-03T04:46:14Z   
3  "My Hero" by Foo Fighters \nListen to Foo Figh...  2011-03-18T19:35:42Z   
4  Watch the official music video for "Best Of Yo...  2009-10-03T20:49:33Z   

   Days Since Published View Count Like Count Comment Count  \
0                  5550  183921366     808172        
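
The loop above calls `videos().list` once per video, whereas the first YouTube cell on this page passes a comma-joined list of IDs in a single call. A sketch of batching IDs so that fewer requests are needed follows, assuming a cap of 50 IDs per request, which should be checked against the API documentation. The helper name `chunked` and the placeholder IDs are hypothetical.

```python
def chunked(items, size=50):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs, not real videos.
video_ids = [f"id{i}" for i in range(120)]

# Each entry is a comma-joined string suitable for the `id` parameter.
batches = [",".join(batch) for batch in chunked(video_ids)]
print(len(batches))
```

Each string in `batches` could then be passed as `id=` to one `videos().list` call, which reduces the number of requests made for the same set of videos.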

## 1.2 Merge artist name and music genre into csv
Each row of the initial find.csv (artist name and genre) is repeated five times with Python, so that it lines up with the five music videos chosen for each artist.
These two columns are then merged into the video CSV by copying them over manually.

In [None]:
import pandas as pd

input_file = './find.csv'  
output_file = './Example_singer_info.csv'  

# Load the input CSV file
df = pd.read_csv(input_file)

# Repeat each row 5 times, keeping the repeats adjacent, and renumber the index
repeated_df = df.loc[df.index.repeat(5)].reset_index(drop=True)

# Save the processed DataFrame to a new CSV file
repeated_df.to_csv(output_file, index=False)

print(f"Data processed successfully and saved as {output_file}")

Data processed successfully and saved as ./Example_singer_info.csv
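
The manual copy step described above could also be done in code. The sketch below shows the idea with plain Python lists and made-up names: repeat each artist row, then attach one song per row, failing loudly if the lengths do not match. The helper names, the `genre` key, and the sample values are hypothetical, and two songs per artist are used instead of five to keep the example short.

```python
def repeat_rows(rows, times):
    """Return a new list in which every row appears `times` times in a row."""
    return [dict(row) for row in rows for _ in range(times)]

def attach_column(rows, name, values):
    """Add one value per row under `name`; lengths must match exactly."""
    if len(rows) != len(values):
        raise ValueError(f"{len(rows)} rows but {len(values)} values")
    return [{**row, name: value} for row, value in zip(rows, values)]

# Made-up input: two artists, two songs each.
artists = [
    {"singer": "Artist A", "genre": "Rock"},
    {"singer": "Artist B", "genre": "Pop"},
]
songs = ["Song A1", "Song A2", "Song B1", "Song B2"]

merged = attach_column(repeat_rows(artists, times=2), "Extracted Song Name", songs)
for row in merged:
    print(row)
```

The length check matters because a search that returns fewer videos than expected for one artist would otherwise shift every later row.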


## 1.3 Data preprocessing
Extract the song name from the `Title` column of the CSV and write it to a new CSV file; the singer names are added in the manual merge at the end of this step.

In [11]:
import pandas as pd
import re

# Load the CSV file with YouTube video data
df = pd.read_csv('./Example_Detailed_YouTube_Video_Data.csv', encoding='MacRoman')

# Function to extract the song name from the title
def extract_song_name(title):
    # Use regular expression to find text between " - " and "("
    match = re.search(r' - (.*?) \(.*\)', title)
    if match:
        return match.group(1) 
    else:
        return title  

# Create a new DataFrame with extracted song names
new_df = pd.DataFrame({
    'Extracted Song Name': df['Title'].apply(extract_song_name)
})

# Save the new DataFrame to a CSV file in the current directory
output_file = './Example_extracted_song_names.csv'
new_df.to_csv(output_file, index=False, encoding='MacRoman')

print(f"Song titles extracted and saved to {output_file}")

Song titles extracted and saved to ./Example_extracted_song_names.csv


In [12]:

input_file = './Example_extracted_song_names.csv'
df = pd.read_csv(input_file, encoding='MacRoman')

# Function to remove content inside brackets (e.g., [example])
def remove_brackets(text):
    return re.sub(r'\[.*?\]', '', text)

# Apply the function to the 'Extracted Song Name' column
df['Extracted Song Name'] = df['Extracted Song Name'].apply(remove_brackets)


output_file = './Example_extracted_song_names_cleaned.csv'
df.to_csv(output_file, index=False)

print(f"Processed data saved to {output_file}")

Processed data saved to ./Example_extracted_song_names_cleaned.csv
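
The regular expression in `extract_song_name` only matches titles that contain a parenthesised suffix, so a title such as `Foo Fighters - The Pretender` from the earlier output is returned unchanged, artist name included. A more tolerant variant is sketched below, assuming titles follow an `Artist - Song` pattern with optional bracketed or parenthesised extras. It also strips labels such as `(Flume Remix)`, which may or may not be wanted. The name `extract_song_name_tolerant` is hypothetical.

```python
import re

def extract_song_name_tolerant(title):
    """Return the song part of an 'Artist - Song (extras)' style title."""
    # Drop the "Artist - " prefix if present (split on the first " - " only).
    _, separator, rest = title.partition(" - ")
    name = rest if separator else title
    # Remove parenthesised or bracketed segments such as "(Official HD Video)".
    name = re.sub(r"\s*[\(\[].*?[\)\]]", "", name)
    return name.strip()

titles = [
    "Foo Fighters - Learn To Fly (Official HD Video)",
    "Foo Fighters - The Pretender",
    "Foo Fighters - Best Of You (Official Music Video)",
]
for title in titles:
    print(extract_song_name_tolerant(title))
```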


Manually merge Example_extracted_song_names_cleaned.csv with Example_singer_info.csv.

## 2. Spotify Collection
## 2.1 Acquiring Track Information

In [33]:
import os
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set up Spotify client credentials
# (read from environment variables; the variable names are an assumption)
client_credentials_manager = SpotifyClientCredentials(
    client_id=os.environ['SPOTIPY_CLIENT_ID'],
    client_secret=os.environ['SPOTIPY_CLIENT_SECRET']
)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Input and output file paths
input_file = './Example_extracted_song_names_cleaned.csv'
output_file = './Example_spotify_track_info.csv'

# Load the input CSV file
df = pd.read_csv(input_file, encoding='MacRoman')

# Initialize a list to store results
results = []

# Process each row in the DataFrame
for _, row in df.iterrows():
    song = row['Extracted Song Name']
    artist = row['singer']
    query = f'{song} {artist}'  # Concatenate song and artist for the search query

    try:
        # Search for the track on Spotify
        result = sp.search(q=query, limit=1, type='track')
        tracks = result.get('tracks', {}).get('items', [])

        if tracks:
            # If a track is found, extract its details
            track = tracks[0]
            track_info = {
                'Track Name': track['name'],
                'Artist Name': track['artists'][0]['name'],
                'Album Name': track['album']['name'],
                'Popularity': track['popularity'],
                'Duration (ms)': track['duration_ms'],
                'Track ID': track['id'],
                'Spotify URL': track['external_urls']['spotify']
            }
            print(f"Found: {track_info['Track Name']} by {track_info['Artist Name']}")
        else:
            # If no track is found, append placeholders
            print(f"Track not found: {query}")
            track_info = {
                'Track Name': song,
                'Artist Name': artist,
                'Album Name': None,
                'Popularity': None,
                'Duration (ms)': None,
                'Track ID': None,
                'Spotify URL': None
            }

        # Append the result to the list
        results.append(track_info)

    except Exception as e:
        # Handle exceptions during the search
        print(f"Error processing query '{query}': {e}")
        results.append({
            'Track Name': song,
            'Artist Name': artist,
            'Album Name': None,
            'Popularity': None,
            'Duration (ms)': None,
            'Track ID': None,
            'Spotify URL': None
        })

# Create a DataFrame from the results
output_df = pd.DataFrame(results)

# Save the output DataFrame to a CSV file
output_df.to_csv(output_file, index=False, encoding='utf-8')

print(f"Processed data saved to: {output_file}")


Found: Learn to Fly by Foo Fighters
Found: Everlong by Foo Fighters
Found: The Pretender by Foo Fighters
Found: My Hero by Foo Fighters
Found: Best of You by Foo Fighters
Found: Mr. Brightside by The Killers
Found: When You Were Young by The Killers
Found: Mr. Brightside by The Killers
Found: Somebody Told Me by The Killers
Found: One Empty Grave by A Sound of Thunder
Found: Basket Case by Green Day
Found: When I Come Around by Green Day
Found: American Idiot by Green Day
Found: Boulevard of Broken Dreams by Green Day
Found: Wake Me up When September Ends by Green Day
Processed data saved to: ./Example_spotify_track_info.csv
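
A free-text query can return a track by a different artist than the one requested; the entry "One Empty Grave by A Sound of Thunder" in the log above appears to be such a case. Spotify's search also accepts field filters such as track: and artist: inside the q string. A minimal sketch of building such a query follows. The helper build_track_query is hypothetical, and whether it improves matching on this dataset has not been tested.

```python
def build_track_query(song, artist):
    # Restrict matching to the track-title and artist fields
    # instead of searching all text for both strings
    return f'track:{song.strip()} artist:{artist.strip()}'

query = build_track_query('Everlong ', 'Foo Fighters')
print(query)  # track:Everlong artist:Foo Fighters
```

The resulting string would be passed as q to sp.search(q=query, limit=1, type='track'), exactly as in the loop above.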


## 2.2 Obtaining Artist Information

In [19]:
import pandas as pd
import requests

# Function to obtain an access token for Spotify API
def get_access_token(client_id, client_secret):
    auth_url = 'https://accounts.spotify.com/api/token'
    auth_data = {
        'grant_type': 'client_credentials',
        'client_id': client_id,
        'client_secret': client_secret
    }
    response = requests.post(auth_url, data=auth_data)
    if response.status_code == 200:
        return response.json()['access_token']
    else:
        raise RuntimeError(f'Failed to obtain access token (HTTP {response.status_code})')

# Function to get the Spotify artist ID using the artist name
def get_artist_id(artist_name, access_token):
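    search_url = 'https://api.spotify.com/v1/search'
    # Pass the query via params so names containing '&', '#', or spaces are URL-encoded
    params = {'q': artist_name, 'type': 'artist', 'limit': 1}
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(search_url, headers=headers, params=params)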
    if response.status_code == 200:
        search_results = response.json()
        artists = search_results['artists']['items']
        if artists:
            return artists[0]['id']
        else:
            return None
    else:
        return None

# Function to retrieve the artist's followers and popularity
def get_artist_followers(artist_id, access_token):
    artist_url = f'https://api.spotify.com/v1/artists/{artist_id}'
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(artist_url, headers=headers)
    if response.status_code == 200:
        artist_data = response.json()
        return artist_data['followers']['total'], artist_data['popularity']
    else:
        return None, None

# Load the input CSV file 
input_file = './Example_singer_info.csv' 
df = pd.read_csv(input_file)

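# Spotify API credentials, read from environment variables rather than hard-coded
# (the variable names follow spotipy's convention; the Web API itself does not require them)
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']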

# Obtain Spotify API access token
access_token = get_access_token(client_id, client_secret)

# Initialize lists to store followers and popularity data
followers_list = []
popularity_list = []

# Process each artist in the DataFrame
for artist_name in df['artist']:
    artist_id = get_artist_id(artist_name, access_token)
    if artist_id:
        followers, popularity = get_artist_followers(artist_id, access_token)
        followers_list.append(followers)
        popularity_list.append(popularity)
    else:
        # If the artist is not found, append None
        followers_list.append(None)
        popularity_list.append(None)

# Add followers and popularity data to the DataFrame
df['Followers'] = followers_list
df['Popularity'] = popularity_list

# Save the updated DataFrame to a new CSV file
output_file = './Example_artist_data_with_followers_and_popularity.csv'  
df.to_csv(output_file, index=False)

print(f"Processed data saved to: {output_file}")

Processed data saved to: ./Example_artist_data_with_followers_and_popularity.csv
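
When many artists are looked up in a loop, the Spotify Web API may answer with status 429 and a Retry-After header giving the number of seconds to wait. The helper functions above would record such an artist as not found. A minimal sketch of a retry wrapper follows; get_with_retry and its parameters are hypothetical and not part of the original workflow. It takes a zero-argument callable so that it can wrap either requests.get call.

```python
import time

def get_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until the response is not HTTP 429 or attempts run out.

    Assumes max_attempts >= 1; returns the last response received.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        # Stop on any non-429 status, or when no attempts remain
        if response.status_code != 429 or attempt == max_attempts - 1:
            break
        # Retry-After is given in seconds; fall back to 1 second if absent
        sleep(int(response.headers.get('Retry-After', 1)))
    return response
```

Inside get_artist_followers, for example, the request line could become response = get_with_retry(lambda: requests.get(artist_url, headers=headers)).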


## 3. Data Integration and Cleansing
First, manually merge the Spotify and YouTube CSV files.
Then, for each row, compare the artist reported by Spotify with the artist from the YouTube data, and delete the rows where the two do not match.

In [35]:
import pandas as pd

input_file = './Example_Detailed_YouTube_Video_Data.csv' 
df = pd.read_csv(input_file, encoding='MacRoman')

# Keep only rows where 'Artist Name' exactly matches 'artist' (case-sensitive)
df_cleaned = df[df['Artist Name'] == df['artist']]

# Save the cleaned DataFrame to a new CSV file
output_file = './Example_spotify_youtube.csv'  
df_cleaned.to_csv(output_file, index=False)
print(f"Cleaned data saved to: {output_file}")

Cleaned data saved to: ./Example_spotify_youtube.csv
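
The equality test above is exact, so rows that differ only in letter case or surrounding whitespace are dropped as mismatches. A more tolerant variant is sketched below on a small made-up frame. The column names follow the cell above; the normalization rule and the sample values are assumptions rather than part of the original workflow.

```python
import pandas as pd

# Made-up rows: one exact match, one cosmetic difference, one real mismatch
df = pd.DataFrame({
    'Artist Name': ['Foo Fighters', 'the killers ', 'A Sound of Thunder'],
    'artist': ['Foo Fighters', 'The Killers', 'The Killers'],
})

def normalize(names):
    # Trim and lower-case so cosmetic differences do not count as mismatches
    return names.str.strip().str.lower()

mask = normalize(df['Artist Name']) == normalize(df['artist'])
df_cleaned = df[mask]
print(f"Kept {len(df_cleaned)} of {len(df)} rows")  # Kept 2 of 3 rows
```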


This completes the data collection steps.