---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include inline prose to describe what is going on.

## Example

In the following code, we first utilized the requests library to retrieve the HTML content from the Wikipedia page. Afterward, we employed BeautifulSoup to parse the HTML and locate the specific table of interest by using the find function. Once the table was identified, we extracted the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we used Pandas to store the collected data in a DataFrame, allowing for easy analysis and visualization. The data could also be optionally saved as a CSV file for further use. 


### Data Collection
This project collects music data through the YouTube and Spotify APIs, covering the work of 20 representative artists in each of five genres: Electronic, Jazz, Hip-Hop, Pop, and Rock. The data processing flow is as follows:

### 1. YouTube Data Collection
#### 1.1 Acquiring Official Music Video Data
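
Before running the collection loop at full scale, it can help to estimate how much API quota it consumes. The sketch below is a rough estimate, not a measurement: the per-call costs in `QUOTA_COST` are assumptions taken from the YouTube Data API documentation and should be checked against the current quota table, and `estimate_quota` is a hypothetical helper name. It assumes one search per artist and then one `videos`, one `channels`, and one `commentThreads` call per video, which is the call pattern of the loop in this section.

```python
# Per-call quota costs in units. These values are assumptions based on the
# YouTube Data API documentation; verify them before relying on the estimate.
QUOTA_COST = {
    "search.list": 100,
    "videos.list": 1,
    "channels.list": 1,
    "commentThreads.list": 1,
}


def estimate_quota(n_artists: int, videos_per_artist: int = 5) -> int:
    """Estimate quota units for one search per artist plus three calls per video."""
    per_video = (
        QUOTA_COST["videos.list"]
        + QUOTA_COST["channels.list"]
        + QUOTA_COST["commentThreads.list"]
    )
    per_artist = QUOTA_COST["search.list"] + videos_per_artist * per_video
    return n_artists * per_artist


print(estimate_quota(3))    # the three demonstration artists
print(estimate_quota(100))  # 20 artists in each of 5 genres
```

Under these assumptions the search call dominates the cost, so a full run may need to be split across several days or API projects if the daily quota is smaller than the estimate.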

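from googleapiclient.discovery import build
import os
import pandas as pd
from datetime import datetime
import dateutil.parser

# API key, read from an environment variable so the credential is not stored in this page.
# The variable name YOUTUBE_API_KEY is an assumption; use whichever name you export.
api_key = os.environ["YOUTUBE_API_KEY"]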

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)

# List to store data
all_data = []

# Read artist names and fetch YouTube statistics for each one
# The full project uses 20 artists in each of 5 genres; for demonstration, find.txt lists only three
with open('find.txt', 'r') as file:
    for line in file:
        artist = line.strip()
        query = f"{artist} official music video"

        # Search request for the query
        search_request = youtube.search().list(
            part="snippet",
            q=query,
            maxResults=5,
            type="video",
            order='relevance'
        )
        search_response = search_request.execute()

        for item in search_response['items']:
            video_id = item['id']['videoId']
            video_request = youtube.videos().list(
                part="snippet,contentDetails,statistics",
                id=video_id
            )
            video_response = video_request.execute()

            for video in video_response['items']:
                snippet = video['snippet']
                content_details = video['contentDetails']
                statistics = video['statistics']

                # Calculate days since video was published
                published_at = dateutil.parser.parse(snippet['publishedAt'])
                days_since_published = (datetime.now(published_at.tzinfo) - published_at).days

                # Fetch channel details for subscriber count
                channel_response = youtube.channels().list(
                    part='statistics',
                    id=snippet['channelId']
                ).execute()
                subscriber_count = channel_response['items'][0]['statistics'].get('subscriberCount', '0')

                # Fetch comments
                comments_request = youtube.commentThreads().list(
                    part='snippet',
                    videoId=video_id,
                    order='relevance',
                    maxResults=10
                )
                comments_response = comments_request.execute()

                comments = [comment['snippet']['topLevelComment']['snippet']['textDisplay']
                            for comment in comments_response.get('items', [])]

                # Store all data in a dictionary
                video_data = {
                    'Video ID': video_id,
                    'Title': snippet['title'],
                    'Description': snippet['description'],
                    'Published At': snippet['publishedAt'],
                    'Days Since Published': days_since_published,
                    'View Count': statistics.get('viewCount', '0'),
                    'Like Count': statistics.get('likeCount', '0'),
                    'Comment Count': statistics.get('commentCount', '0'),
                    'Comments': comments,
                    'Subscriber Count': subscriber_count,
                    'Category ID': snippet['categoryId'],
                    'Definition': content_details['definition'],
                    'Duration': content_details['duration']
                }

                all_data.append(video_data)

# Convert the list to a DataFrame
final_df = pd.DataFrame(all_data)

# Optionally save the DataFrame to a CSV file
final_df.to_csv('Example_Detailed_YouTube_Video_Data.csv', index=False)

# Display the DataFrame
print(final_df.head())

      Video ID                                              Title  \
0  1VQ_3sBZEm0    Foo Fighters - Learn To Fly (Official HD Video)   
1  eBG7P-K-r1Y        Foo Fighters - Everlong (Official HD Video)   
2  SBjQ9tuuTJQ                       Foo Fighters - The Pretender   
3  EqWRaAF6_WY         Foo Fighters - My Hero (Official HD Video)   
4  h_L4Rixya64  Foo Fighters - Best Of You (Official Music Video)   

                                         Description          Published At  \
0  Foo Fighters' official music video for 'Learn ...  2009-10-03T04:46:13Z   
1  "Everlong" by Foo Fighters \nListen to Foo Fig...  2009-10-03T04:49:58Z   
2  Watch the official music video for "The Preten...  2009-10-03T04:46:14Z   
3  "My Hero" by Foo Fighters \nListen to Foo Figh...  2011-03-18T19:35:42Z   
4  Watch the official music video for "Best Of Yo...  2009-10-03T20:49:33Z   

   Days Since Published View Count Like Count Comment Count  \
0                  5550  183921366     808172        
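
The API returns the count fields as strings and the `Duration` field as an ISO 8601 string, so both need converting before numeric analysis. The following is a minimal sketch of that conversion using only the standard library. The helper names `duration_to_seconds` and `to_int` are hypothetical, and the pattern handles only hour, minute, and second components (for example `PT4M13S`), which is an assumption about the videos collected here.

```python
import re

# Matches durations such as "PT4M13S" or "PT1H2M3S".
# Day components (for example "P1DT2H") are not handled in this sketch.
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration with H/M/S parts into total seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def to_int(value, default=0):
    """Convert a count such as '183921366' to int, falling back to a default."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


print(duration_to_seconds("PT4M13S"))
print(to_int("183921366"))
```

Applied to the DataFrame, these could be used as `final_df['Duration'].apply(duration_to_seconds)` and `final_df['View Count'].apply(to_int)`.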

#### 1.2 Merge artist name and music genre into the CSV
Each row of the initial find.csv (artist name and genre) is repeated five times in Python so that it lines up with the five music videos chosen for each artist.
The two columns are then merged into the YouTube CSV by copying them in manually.

import pandas as pd

input_file = './find.csv'  
output_file = './Example_singer_info.csv'  

# Load the input CSV file
df = pd.read_csv(input_file)

# Repeat each row 5 times, once for each of the artist's five videos
repeated_df = df.loc[df.index.repeat(5)].reset_index(drop=True)

# Save the processed DataFrame to a new CSV file
repeated_df.to_csv(output_file, index=False)

print(f"Data processed successfully and saved as {output_file}")

Data processed successfully and saved as ./Example_singer_info.csv
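
The manual copy step could also be done in pandas. The sketch below is one possible approach, not the procedure used in this project: it repeats the artist rows and concatenates them column-wise with the video table. The toy data, the `videos_per_artist` variable, and the row-count check are illustrative assumptions. The approach only works if both tables are in the same order and every search returned the full number of videos.

```python
import pandas as pd

# Toy stand-ins for find.csv and the YouTube results (hypothetical values).
artists = pd.DataFrame({"artist": ["Artist A", "Artist B"], "genre": ["Rock", "Jazz"]})
videos = pd.DataFrame({"Title": [f"Video {i}" for i in range(4)]})

videos_per_artist = 2  # the project uses 5

# Repeat each artist row once per video and renumber the rows from zero.
repeated = artists.loc[artists.index.repeat(videos_per_artist)].reset_index(drop=True)

# The two tables must line up row for row before they are concatenated.
if len(repeated) != len(videos):
    raise ValueError("Row counts differ; a search may have returned fewer videos")

# Place the artist and genre columns next to the video columns.
merged = pd.concat([repeated, videos], axis=1)
print(merged)
```

The length check matters because a search that returns fewer results than requested would silently shift every later row if the columns were pasted together by position.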


#### 1.3 Data preprocessing
Extract the song name from each video title and save it to a new CSV file. The artist names are added afterwards in a manual merge.

import pandas as pd
import re

# Load the CSV file with YouTube video data
df = pd.read_csv('./Example_Detailed_YouTube_Video_Data.csv', encoding='MacRoman')

# Function to extract the song name from the title
def extract_song_name(title):
    # Use regular expression to find text between " - " and "("
    match = re.search(r' - (.*?) \(.*\)', title)
    if match:
        return match.group(1) 
    else:
        return title  

# Create a new DataFrame with extracted song names
new_df = pd.DataFrame({
    'Extracted Song Name': df['Title'].apply(extract_song_name)
})

# Save the new DataFrame to a CSV file in the current directory
output_file = './Example_extracted_song_names.csv'
new_df.to_csv(output_file, index=False, encoding='MacRoman')

print(f"Song titles extracted and saved to {output_file}")

Song titles extracted and saved to ./Example_extracted_song_names.csv


import pandas as pd
import re

input_file = './Example_extracted_song_names.csv'
df = pd.read_csv(input_file, encoding='MacRoman')

# Function to remove content inside brackets (e.g., [example])
def remove_brackets(text):
    return re.sub(r'\[.*?\]', '', text)

# Apply the function to the 'Extracted Song Name' column
df['Extracted Song Name'] = df['Extracted Song Name'].apply(remove_brackets)


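output_file = './Example_extracted_song_names_cleaned.csv'
# Write with the same encoding that the later steps use when reading this file
df.to_csv(output_file, index=False, encoding='MacRoman')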

print(f"Processed data saved to {output_file}")

Processed data saved to ./Example_extracted_song_names_cleaned.csv
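
The two cleaning passes above can be combined into a single function. The sketch below is an alternative under stated assumptions, not the code used in this project: `clean_song_title` is a hypothetical name, titles are assumed to follow the "Artist - Song (annotation)" pattern, and every parenthesised or bracketed segment is removed, which would also drop annotations such as featured artists. Unlike the regular expression above, it also handles titles that have no parentheses.

```python
import re


def clean_song_title(title: str) -> str:
    """Return the song name from a title such as 'Artist - Song (Official Video)'."""
    # Drop the "Artist - " prefix when the separator is present.
    _, sep, rest = title.partition(" - ")
    name = rest if sep else title
    # Remove parenthesised or bracketed annotations such as "(Official HD Video)" or "[HD]".
    name = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", name)
    return name.strip()


for title in [
    "Foo Fighters - Learn To Fly (Official HD Video)",
    "Foo Fighters - The Pretender",
]:
    print(clean_song_title(title))
```

With pandas, this could be applied as `df['Title'].apply(clean_song_title)`.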


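Manually merge Example_extracted_song_names_cleaned.csv with Example_singer_info.csv so that each song row also carries its artist. The Spotify step below reads the artist from a column named `singer`.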

### 2. Spotify Collection
#### 2.1 Acquiring Track Information

import os
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set up Spotify client credentials, read from environment variables
# (the variable names are an assumption; use whichever names you export)
client_credentials_manager = SpotifyClientCredentials(
    client_id=os.environ['SPOTIPY_CLIENT_ID'],
    client_secret=os.environ['SPOTIPY_CLIENT_SECRET']
)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# Input and output file paths
input_file = './Example_extracted_song_names_cleaned.csv'
output_file = './Example_spotify_track_info.csv'

# Load the input CSV file
df = pd.read_csv(input_file, encoding='MacRoman')

# Initialize a list to store results
results = []

# Process each row in the DataFrame
for _, row in df.iterrows():
    song = row['Extracted Song Name']
    artist = row['singer']
    query = f'{song} {artist}'  # Concatenate song and artist for the search query

    try:
        # Search for the track on Spotify
        result = sp.search(q=query, limit=1, type='track')
        tracks = result.get('tracks', {}).get('items', [])

        if tracks:
            # If a track is found, extract its details
            track = tracks[0]
            track_info = {
                'Track Name': track['name'],
                'Artist Name': track['artists'][0]['name'],
                'Album Name': track['album']['name'],
                'Popularity': track['popularity'],
                'Duration (ms)': track['duration_ms'],
                'Track ID': track['id'],
                'Spotify URL': track['external_urls']['spotify']
            }
            print(f"Found: {track_info['Track Name']} by {track_info['Artist Name']}")
        else:
            # If no track is found, append placeholders
            print(f"Track not found: {query}")
            track_info = {
                'Track Name': song,
                'Artist Name': artist,
                'Album Name': None,
                'Popularity': None,
                'Duration (ms)': None,
                'Track ID': None,
                'Spotify URL': None
            }

        # Append the result to the list
        results.append(track_info)

    except Exception as e:
        # Handle exceptions during the search
        print(f"Error processing query '{query}': {e}")
        results.append({
            'Track Name': song,
            'Artist Name': artist,
            'Album Name': None,
            'Popularity': None,
            'Duration (ms)': None,
            'Track ID': None,
            'Spotify URL': None
        })

# Create a DataFrame from the results
output_df = pd.DataFrame(results)

# Save the output DataFrame to a CSV file
output_df.to_csv(output_file, index=False, encoding='utf-8')

print(f"Processed data saved to: {output_file}")


Found: Learn to Fly by Foo Fighters
Found: Everlong by Foo Fighters
Found: The Pretender by Foo Fighters
Found: My Hero by Foo Fighters
Found: Best of You by Foo Fighters
Found: Mr. Brightside by The Killers
Found: When You Were Young by The Killers
Found: Mr. Brightside by The Killers
Found: Somebody Told Me by The Killers
Found: One Empty Grave by A Sound of Thunder
Found: Basket Case by Green Day
Found: When I Come Around by Green Day
Found: American Idiot by Green Day
Found: Boulevard of Broken Dreams by Green Day
Found: Wake Me up When September Ends by Green Day
Processed data saved to: ./Example_spotify_track_info.csv
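
The log above includes a result whose artist differs from the artists that were searched for, which suggests that a free-text query can match an unrelated track. One option is to use Spotify's `track:` and `artist:` field filters in the query string. The sketch below only builds the query. `build_track_query` is a hypothetical helper, and whether the filters improve matching for this dataset has not been tested here.

```python
def build_track_query(song: str, artist: str) -> str:
    """Build a Spotify search string that uses the track: and artist: field filters."""
    # Collapse stray whitespace that title cleaning may have left behind.
    song = " ".join(song.split())
    artist = " ".join(artist.split())
    return f"track:{song} artist:{artist}"


print(build_track_query("Everlong ", "Foo Fighters"))
```

The returned string could be passed as `q` to `sp.search(q=..., type='track', limit=1)` in place of the concatenated query.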


#### 2.2 Obtaining Artist Information

import pandas as pd
import requests

# Function to obtain an access token for Spotify API
def get_access_token(client_id, client_secret):
    auth_url = 'https://accounts.spotify.com/api/token'
    auth_data = {
        'grant_type': 'client_credentials',
        'client_id': client_id,
        'client_secret': client_secret
    }
    response = requests.post(auth_url, data=auth_data)
    if response.status_code == 200:
        return response.json()['access_token']
    else:
        raise Exception('Failed to obtain access token')

# Function to get the Spotify artist ID using the artist name
def get_artist_id(artist_name, access_token):
    search_url = 'https://api.spotify.com/v1/search'
    # Pass the query as params so names containing spaces or '&' are URL-encoded
    params = {'q': artist_name, 'type': 'artist', 'limit': 1}
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(search_url, headers=headers, params=params)
    if response.status_code != 200:
        return None
    artists = response.json()['artists']['items']
    # Return the top match, or None when the search has no results
    return artists[0]['id'] if artists else None

# Function to retrieve the artist's followers and popularity
def get_artist_followers(artist_id, access_token):
    artist_url = f'https://api.spotify.com/v1/artists/{artist_id}'
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(artist_url, headers=headers)
    if response.status_code == 200:
        artist_data = response.json()
        return artist_data['followers']['total'], artist_data['popularity']
    else:
        return None, None

# Load the input CSV file 
input_file = './Example_singer_info.csv' 
df = pd.read_csv(input_file)

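# Spotify API credentials, read from environment variables
# (the variable names are an assumption; use whichever names you export)
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']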

# Obtain Spotify API access token
access_token = get_access_token(client_id, client_secret)

# Initialize lists to store followers and popularity data
followers_list = []
popularity_list = []

# Process each artist in the DataFrame
for artist_name in df['artist']:
    artist_id = get_artist_id(artist_name, access_token)
    if artist_id:
        followers, popularity = get_artist_followers(artist_id, access_token)
        followers_list.append(followers)
        popularity_list.append(popularity)
    else:
        # If the artist is not found, append None
        followers_list.append(None)
        popularity_list.append(None)

# Add followers and popularity data to the DataFrame
df['Followers'] = followers_list
df['Popularity'] = popularity_list

# Save the updated DataFrame to a new CSV file
output_file = './Example_artist_data_with_followers_and_popularity.csv'  
df.to_csv(output_file, index=False)

print(f"Processed data saved to: {output_file}")

Processed data saved to: ./Example_artist_data_with_followers_and_popularity.csv
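
Because Example_singer_info.csv repeats every artist five times, the loop above asks Spotify for the same artist five times. A small cache avoids the repeated requests. The sketch below is an illustration under assumptions: `lookup_with_cache` is a hypothetical helper, and `fake_fetch` stands in for the real Spotify calls so that the example runs offline and returns made-up numbers.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

ArtistStats = Tuple[Optional[int], Optional[int]]  # (followers, popularity)


def lookup_with_cache(
    names: Iterable[str], fetch: Callable[[str], ArtistStats]
) -> List[ArtistStats]:
    """Call fetch once per distinct name and reuse the result for repeats."""
    cache: Dict[str, ArtistStats] = {}
    results: List[ArtistStats] = []
    for name in names:
        if name not in cache:
            cache[name] = fetch(name)
        results.append(cache[name])
    return results


# Stand-in for the real Spotify lookups; the returned values are made up.
calls: List[str] = []


def fake_fetch(name: str) -> ArtistStats:
    calls.append(name)
    return (len(name), 50)


names = ["Artist A"] * 5 + ["Artist B"] * 5
stats = lookup_with_cache(names, fake_fetch)
print(len(stats), len(calls))
```

In the real pipeline, `fetch` could wrap `get_artist_id` followed by `get_artist_followers`, returning `(None, None)` when the artist is not found.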


### 3. Data integration and cleansing
First, manually merge the Spotify and YouTube CSV files.
Then compare the artist name returned by Spotify with the artist name used for the YouTube search, and delete every row where the two do not match.

import pandas as pd

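# This file must already hold both the YouTube and the Spotify columns from the manual merge
input_file = './Example_Detailed_YouTube_Video_Data.csv'
df = pd.read_csv(input_file, encoding='MacRoman')

# Keep only the rows where Spotify's 'Artist Name' equals the searched 'artist'
df_cleaned = df[df['Artist Name'] == df['artist']]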

# Save the cleaned DataFrame to a new CSV file
output_file = './Example_spotify_youtube.csv'  
df_cleaned.to_csv(output_file, index=False)
print(f"Cleaned data saved to: {output_file}")

Cleaned data saved to: ./Example_spotify_youtube.csv
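
An exact string comparison also drops rows that differ only in letter case or surrounding whitespace. The sketch below shows a more tolerant comparison as one possible refinement. The rows are illustrative, `normalize` is a hypothetical helper, and the project itself uses the exact comparison shown above.

```python
import pandas as pd

# Illustrative rows: one differs only in case and spacing, one is a true mismatch.
df = pd.DataFrame({
    "artist": ["Foo Fighters", "The Killers", "The Killers"],
    "Artist Name": ["foo fighters ", "The Killers", "A Sound of Thunder"],
})


def normalize(col: pd.Series) -> pd.Series:
    """Trim whitespace and fold case so cosmetic differences do not count."""
    return col.str.strip().str.casefold()


mask = normalize(df["artist"]) == normalize(df["Artist Name"])
df_cleaned = df[mask].reset_index(drop=True)
dropped = df[~mask]

print(df_cleaned)
print(dropped)
```

Keeping the `dropped` rows in a separate frame makes it possible to review what was removed before discarding it.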


This completes the data collection steps.