## Title
### Group 14 

> Github repository: https://github.com/LivDreyer/CSS24.git

> Shortlog of git commits:
- x  LivDreyer
- x  AIAndreas
- x  FelixxAI


> Contribution: The workload was distributed equally between all members of the group. 

![Crowd](crowd.png)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import json
import networkx as nx
import netwulf as nu
from networkx.readwrite import json_graph
import random
import pandas as pd
import ast
from collections import Counter
from wordcloud import WordCloud
import re

## Motivation

**Dataset** 

- What is our dataset? 
- What is our motivation for choosing this data set? 
- What was your goal for the end user's experience? 

This project aims to answer the research question: "What is the relationship between artist collaboration patterns, popularity, and lyrical expression of genre themes?". To answer this question, we decided to center our network analysis around the world's largest music streaming service: Spotify. Given our project's focus on artist popularity, using a platform such as Spotify for insights is logical. With streaming services contributing 84% of the music industry's revenue, and Spotify holding a dominant market share of 30.5%, it offers comprehensive insight into artist popularity. 

For the textual analysis, Genius, among others, serves as an "online music encyclopedia". Although the textual analysis in this project could utilize a variety of song lyric APIs, Genius lets artists and users annotate lyrics, making it an interesting choice for other analyses in the future. 

This project takes its starting point in the Kaggle dataset "US Top 10K Artists and Their Popular Songs". The dataset, created by Spoorthi Uday Karakaraddi, was collected using the Spotify API and features several attributes of the top 10k artists in the US in 2023. It serves as the foundation for constructing the final dataset used for the network analysis, which in turn is used to construct the dataset for our textual analysis. 


- https://en.wikipedia.org/wiki/Genius_(company)
- https://www.kaggle.com/datasets/spoorthiuk/us-top-10k-artists-and-their-popular-songs 
- https://explodingtopics.com/blog/music-streaming-stats

## Datasets and basic statistics 

Since this project works with APIs, we construct the datasets ourselves, taking our starting point, as mentioned, in the Kaggle dataset "US Top 10K Artists and Their Popular Songs". To provide a clear explanation of our course of action, this data section first covers fetching data from the Spotify API, then the construction of the dataset used for the network analysis, and subsequently the one used for the text analysis. Each block of code is labeled with the name of the resulting csv-file. 


> **Note:** In this project, two artists are considered to have collaborated if one of them has the other as a featured artist on a song. For example, the artist "SZA" is featured on Rihanna's song "Consideration", which in this project is considered a collaboration. To explore artists' collaboration patterns, we used the "US Top 10K Artists and Their Popular Songs" dataset to create our first list of artists.  
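
As a minimal sketch of this definition, the snippet below turns one song row into (main artist, featured artist) pairs. The function name `collaboration_pairs` is hypothetical and not part of the project code; it assumes featured artists are stored as a comma-separated string, as in the CSV files built later in this section.

```python
def collaboration_pairs(main_artist, featured_artists):
    """Return (main, featured) pairs for one song.

    `featured_artists` is assumed to be a comma-separated string such as
    "SZA" or "Drake, Future"; an empty string means no collaboration.
    """
    names = [name.strip() for name in featured_artists.split(",")]
    return [(main_artist, name) for name in names if name and name != main_artist]


# Example from the note above: "Consideration" by Rihanna features SZA.
pairs = collaboration_pairs("Rihanna", "SZA")
print(pairs)
```

Note that artist names containing a comma, such as "Tyler, The Creator", would be split incorrectly by this simple approach.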

### Fetching data from the Spotify API

The first query to the Spotify API retrieved the top 10 tracks of each artist to reveal possible featured artists. Due to a rate limit of 5000 requests per day on the Spotify API, with the API key being valid for only one hour, only the top 4250 artists from the Kaggle dataset, ranked by Spotify's measure of popularity, were used. This resulted in a dataframe with the song ID, song name, main artist, featured artist(s) of the song, and the genre of the song. 

##### Complete_Songs_with_Artists_and_Features.csv

In [None]:
import requests
import pandas as pd
import time
from tqdm import tqdm
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Spotify API credentials
client_id = '#'  
client_secret = '#'  

def refresh_token():
    """ Refresh the Spotify API token. """
    url = 'https://accounts.spotify.com/api/token'
    payload = {'grant_type': 'client_credentials'}
    response = requests.post(url, auth=(client_id, client_secret), data=payload)
    if response.status_code == 200:
        new_token = response.json()['access_token']
        logging.info("Token refreshed successfully.")
        return new_token
    else:
        logging.error(f"Failed to refresh token: {response.text}")
        raise Exception("Failed to refresh token")

# Initial token
token = refresh_token()
headers = {'Authorization': f'Bearer {token}'}

def get_top_tracks(artist_id, retry_count=0):
    """ Fetch top tracks for a given artist ID from Spotify, handling rate limits dynamically. """
    global headers
    top_tracks_url = f"https://api.spotify.com/v1/artists/{artist_id}/top-tracks?country=US"
    response = requests.get(top_tracks_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return [(track['id'], track['name'], [artist['name'] for artist in track['artists']]) for track in data.get('tracks', [])]
    elif response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 30))
        wait_time = min(retry_after, 30)  # Cap at 30 seconds to avoid overly long delays
        logging.warning(f"Rate limit exceeded, retrying after {wait_time} seconds...")
        time.sleep(wait_time)
        return get_top_tracks(artist_id, retry_count + 1) if retry_count < 5 else []
    elif response.status_code == 401 and retry_count < 5:
        logging.warning("Token expired, refreshing token...")
        headers['Authorization'] = 'Bearer ' + refresh_token()
        return get_top_tracks(artist_id, retry_count + 1)
    else:
        logging.error(f"Failed to fetch data: {response.status_code}")
        return []

def separate_artists(artists, main_artist_name):
    """ Separate main artist from featured artists. """
    featured_artists = [artist for artist in artists if artist != main_artist_name]
    return main_artist_name, ', '.join(featured_artists)

def gather_top_tracks(df):
    """ Process each artist and fetch their top tracks. """
    songs = []
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Fetching top tracks"):
        artist_id = row['ID']
        results = get_top_tracks(artist_id)
        for song_id, song_name, artists in results:
            main_artist, features = separate_artists(artists, row['Name'])
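            # Each row needs five values to match the five columns of songs_df below.
            # Spotify track objects carry no genre, so the main artist's genres are used
            # (assumes Artists.csv has a 'Genres' column, as used in the later merge).
            songs.append([song_id, song_name, main_artist, features, row['Genres']])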
    return songs


df_artists = pd.read_csv('Artists.csv')
first_part = df_artists.iloc[:4250]
logging.info("Processing the first 4250 artists...")
song_data = gather_top_tracks(first_part)
songs_df = pd.DataFrame(song_data, columns=['Song ID', 'Song Name', 'Main Artist', 'Featured Artists', 'Genres'])

# Save the data to CSV
songs_df.to_csv('Complete_Songs_with_Artists_and_Features.csv', index=False)
logging.info("Data saved to Complete_Songs_with_Artists_and_Features.csv.")

We then created a dataframe of the 4250 artists from the Kaggle dataset containing the following information: Main artist, Top 10 tracks, Artist ID, Genre (categorized by Spotify), Popularity Score, Follower Count, and their URI. See the following code block. 

##### Final_Artist_Tracks_Info.csv

In [None]:
import pandas as pd

# Load the datasets
songs_df = pd.read_csv('Complete_Songs_with_Artists_and_Features.csv')
artists_info_df = pd.read_csv('Artists.csv')

songs_df['Main Artist'] = songs_df['Main Artist'].str.strip().str.lower()
artists_info_df['Name'] = artists_info_df['Name'].str.strip().str.lower()
top_tracks = songs_df.groupby('Main Artist')['Song Name'].apply(list).reset_index()
top_tracks.rename(columns={'Main Artist': 'Name'}, inplace=True)
final_df = pd.merge(top_tracks, artists_info_df, on='Name', how='left')
final_df.rename(columns={'Name': 'Main Artist'}, inplace=True)
final_df = final_df[['Main Artist', 'Song Name', 'ID', 'Genres', 'Popularity', 'Followers', 'URI']]

# Save the final merged dataset to a new CSV file
final_df.to_csv('Final_Artist_Tracks_Info.csv', index=False)

We now have a dataframe of the first 4250 artists with the attributes previously mentioned, and we know which artists are featured on those 4250 artists' top tracks. We will now query the same information for the featured artists, so that we can merge the data into a final dataset.  
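
The next code block reads the featured artists' names from *Unique_Features.txt*, whose creation is not shown in this notebook. A minimal sketch of how such a list could be derived is given below; the helper `unique_featured_artists` is hypothetical, and it assumes the comma-separated format of the "Featured Artists" column.

```python
def unique_featured_artists(featured_column, known_artists):
    """Collect featured artist names that are not already known main artists.

    `featured_column` is an iterable of comma-separated strings (or None for
    songs without features); `known_artists` are the names already covered.
    """
    known = {name.strip().lower() for name in known_artists}
    unique = {}
    for cell in featured_column:
        if not cell:
            continue
        for name in cell.split(","):
            name = name.strip()
            key = name.lower()
            if name and key not in known and key not in unique:
                unique[key] = name  # keep the first spelling encountered
    return sorted(unique.values())


# Made-up sample values for illustration only.
features = ["SZA", "Drake, Future", None, "sza", "Rihanna"]
print(unique_featured_artists(features, known_artists=["Rihanna", "Drake"]))
```

When applied to a pandas column, missing values would first have to be dropped (for example with `.dropna()`), since they are stored as NaN rather than None.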

##### Complete_Artists_Info.csv

In [None]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from tqdm import tqdm

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="#", client_secret="#"))

# Function to search for an artist on Spotify and get their info
def get_artist_info(artist_name):
    try:
        results = sp.search(q='artist:' + artist_name, type='artist', limit=1)
        items = results['artists']['items']
        if items:
            artist = items[0]
            return {
                'Name': artist['name'],
                'ID': artist['id'],
                'Genres': ', '.join(artist['genres']),
                'Popularity': artist['popularity'],
                'Followers': artist['followers']['total'],
                'URI': artist['uri']
            }
    except spotipy.client.SpotifyException as e:
        print(f"Spotify API error for {artist_name}: {e}")
    return None

# Read the text file with artist names
with open('Unique_Features.txt', 'r') as file:
    unique_features = file.read().splitlines()

artists_to_query = unique_features
artists_info = []
for artist_name in tqdm(artists_to_query, desc='Querying artists'):
    artist_info = get_artist_info(artist_name)
    if artist_info:
        artists_info.append(artist_info)
    else:
        print(f"No data found for artist: {artist_name}")

artists_df = pd.DataFrame(artists_info)
artists_df.to_csv('Complete_Artists_Info.csv', index=False)
print("Finished collecting all artists' information.")

We have now found the following information on the featured artists: Artist, Artist ID, Genre (categorized by Spotify), Popularity Score, Follower Count, and their URI. To obtain the top 10 tracks for each of the featured artists, who are from now on simply treated as artists as well, we query the Spotify API yet again. As we are now interested in the top 10 tracks of 11260 artists, we run the following code in three parts. Each run obtains the songs of about 4000 artists. 
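
The three runs correspond to three row slices of the artist table. As a small sketch (the helper `chunk_bounds` is hypothetical; the chunk size of 4300 matches the slices used in the code below), the boundaries can be computed as follows:

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) index pairs covering range(n_rows) in chunks."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


# 11260 artists split into runs of at most 4300 artists each.
print(chunk_bounds(11260, 4300))
```

Each `(start, stop)` pair can then be used as `df_artists.iloc[start:stop]` for one run.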

##### The following two blocks of code result in Feature_Artist_Songs_Combined.csv

In [None]:
import requests
import pandas as pd
import time
from tqdm import tqdm
import logging


logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Spotify API credentials
client_id = '#'  
client_secret = '#'  

def refresh_token():
    """ Refresh the Spotify API token. """
    url = 'https://accounts.spotify.com/api/token'
    payload = {'grant_type': 'client_credentials'}
    response = requests.post(url, auth=(client_id, client_secret), data=payload)
    if response.status_code == 200:
        new_token = response.json()['access_token']
        logging.info("Token refreshed successfully.")
        return new_token
    else:
        logging.error(f"Failed to refresh token: {response.text}")
        raise Exception("Failed to refresh token")

# Initial token
token = refresh_token()
headers = {'Authorization': f'Bearer {token}'}

def get_top_tracks(artist_id, retry_count=0):
    """Fetch top tracks for a given artist ID from Spotify, handling rate limits dynamically."""
    global headers
    top_tracks_url = f"https://api.spotify.com/v1/artists/{artist_id}/top-tracks?country=US"
    response = requests.get(top_tracks_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return [(track['id'], track['name'], [artist['name'] for artist in track['artists']]) for track in data.get('tracks', [])]
    elif response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 30))
        wait_time = min(retry_after, 30)  # Cap at 30 seconds to avoid overly long delays
        logging.warning(f"Rate limit exceeded, retrying after {wait_time} seconds...")
        time.sleep(wait_time)
        return get_top_tracks(artist_id, retry_count + 1) if retry_count < 5 else []
    elif response.status_code == 401 and retry_count < 5:
        logging.warning("Token expired, refreshing token...")
        headers['Authorization'] = 'Bearer ' + refresh_token()
        return get_top_tracks(artist_id, retry_count + 1)
    else:
        logging.error(f"Failed to fetch data: {response.status_code}")
        return []

def separate_artists(artists, main_artist_name):
    """Separate main artist from featured artists."""
    featured_artists = [artist for artist in artists if artist != main_artist_name]
    return main_artist_name, ', '.join(featured_artists)

def gather_top_tracks(df):
    """Process each artist and fetch their top tracks."""
    songs = []
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Fetching top tracks"):
        artist_id = row['ID']
        results = get_top_tracks(artist_id)
        for song_id, song_name, artists in results:
            main_artist, features = separate_artists(artists, row['Name'])
            songs.append([song_id, song_name, main_artist, features])
    return songs

# Load the dataset
df_artists = pd.read_csv('Complete_Artists_Info.csv')

# Split the dataset into 3 parts and process one part per run:
# uncomment the slice and the matching output file for the part being processed
#part = df_artists.iloc[:4300]
#part = df_artists.iloc[4300:8600]
part = df_artists.iloc[8600:]


logging.info("Processing the selected part of the dataset...")
part_data = gather_top_tracks(part)
songs_df = pd.DataFrame(part_data, columns=['Song ID', 'Song Name', 'Main Artist', 'Featured Artists'])
#songs_df.to_csv('Feature_Artist_Songs_Part1.csv', index=False)
#songs_df.to_csv('Feature_Artist_Songs_Part2.csv', index=False)
songs_df.to_csv('Feature_Artist_Songs_Part3.csv', index=False)

The three parts are then merged into a single dataframe. 

In [None]:
import pandas as pd

# Load each part of the dataset
df_part1 = pd.read_csv('Feature_Artist_Songs_Part1.csv')
df_part2 = pd.read_csv('Feature_Artist_Songs_Part2.csv')
df_part3 = pd.read_csv('Feature_Artist_Songs_Part3.csv')

df_combined = pd.concat([df_part1, df_part2, df_part3], ignore_index=True)
df_combined.to_csv('Feature_Artist_Songs_Combined.csv', index=False)

print("Combined DataFrame Info:")
print(df_combined.info())

We then create a full dataset of featured artists and their top 10 songs.

##### Final_Feature_Artist_Tracks_Info.csv

In [None]:
import pandas as pd

# Load the datasets
songs_df = pd.read_csv('Feature_Artist_Songs_Combined.csv')
artists_info_df = pd.read_csv('Featured/Complete_Feature_Artists_Info.csv')

# Normalize artist names to ensure the case and trimming issues don't affect the merge
songs_df['Main Artist'] = songs_df['Main Artist'].str.strip().str.lower()
artists_info_df['Name'] = artists_info_df['Name'].str.strip().str.lower()

# Aggregate the top tracks for each main artist
top_tracks = songs_df.groupby('Main Artist')['Song Name'].apply(list).reset_index()
top_tracks.rename(columns={'Main Artist': 'Name'}, inplace=True)
final_df = pd.merge(top_tracks, artists_info_df, on='Name', how='left')
final_df.rename(columns={'Name': 'Main Artist'}, inplace=True)
final_df = final_df[['Main Artist', 'Song Name', 'ID', 'Genres', 'Popularity', 'Followers', 'URI']]
final_df.to_csv('Final_Feature_Artist_Tracks_Info.csv', index=False)

print("Final DataFrame head:")
print(final_df.head())

Lastly, we combine *Final_Feature_Artist_Tracks_Info.csv* and *Final_Artist_Tracks_Info.csv* into one dataframe with the following attributes: Artist, Top songs, Artist ID, Genre, Popularity score, Followers, and URI. 

##### Combined_Artist_Tracks_Info.csv

In [None]:
import pandas as pd


featured_tracks = pd.read_csv('Featured/Final_Feature_Artist_Tracks_Info.csv')  # forward slash works on all platforms
final_artist_tracks = pd.read_csv('Final_Artist_Tracks_Info.csv')

combined_data = pd.concat([featured_tracks, final_artist_tracks], ignore_index=True)
combined_data = combined_data.drop_duplicates()
combined_data.to_csv('Combined_Artist_Tracks_Info.csv', index=False)

We have now fetched all data needed from the Spotify API. The following sections will be divided into the data used for the network analysis followed by the data used for the textual analysis. 
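
One detail worth keeping in mind when reloading these files: the "Song Name" column holds Python lists, which are written to CSV as their string representation. A minimal sketch of how such a cell could be turned back into a list with `ast.literal_eval` follows; the sample row is made up for illustration and the columns are reduced.

```python
import ast
import csv
import io

# A made-up row in the shape of Combined_Artist_Tracks_Info.csv (columns reduced).
csv_text = 'Main Artist,Song Name\nrihanna,"[\'Consideration\', \'Umbrella\']"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))
raw_cell = rows[0]["Song Name"]          # still a plain string after loading
songs = ast.literal_eval(raw_cell)       # parsed back into a Python list

print(type(raw_cell).__name__, type(songs).__name__, songs)
```

With pandas, the same conversion could be applied to the whole column, for example via `df['Song Name'].apply(ast.literal_eval)`.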

#### Dataset and basic statistics - Network Analysis 

For the network analysis, we will use the following csv-files created in the previous section: 

- Final_Feature_Artist_Tracks_Info.csv
- Feature_Artist_Songs_Combined.csv
- Complete_Songs_with_Artists_and_Features.csv

To create the final dataset used for the network analysis, we first join *Complete_Songs_with_Artists_and_Features.csv*, with the attributes Song Name, Main Artist and Featured Artist(s), with the Kaggle "US Top 10K Artists and Their Popular Songs" dataset (*Artists.csv*) on the name of the main artist. This adds the Genre, Popularity Score, and Followers of the first 4250 artists. 

In [None]:
df_artist = pd.read_csv('Artists.csv')

df1 = pd.read_csv('Complete_Songs_with_Artists_and_Features.csv', usecols=['Song Name','Main Artist', 'Featured Artists'])

merged_df = df1.merge(df_artist, left_on='Main Artist', right_on='Name', how='left')

df1_with_data = merged_df.drop(columns=['Name', 'ID', 'URI', 'Age', 'Country', 'Gender'])

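We then take *Feature_Artist_Songs_Combined.csv*, with the attributes Song Name, Main Artist and Featured Artist(s), and join it on "Main Artist" with *Final_Feature_Artist_Tracks_Info.csv*, which holds the attributes Main Artist, Genre, Popularity Score, and Followers. 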

In [None]:
df2 = pd.read_csv('Feature_Artist_Songs_Combined.csv', usecols=['Song Name','Main Artist', 'Featured Artists'])

df3 = pd.read_csv('Final_Feature_Artist_Tracks_Info.csv', usecols=['Main Artist', 'Genres', 'Popularity', 'Followers'])  # file name casing matches the file saved earlier

df2['Main Artist'] = df2['Main Artist'].str.lower()

merged_df = df2.merge(df3, on='Main Artist', how='inner')

Lastly, we combine the two dataframes, resulting in the final one shown below: 

In [None]:
df1_with_data['Main Artist'] = df1_with_data['Main Artist'].str.lower()
merged_df = merged_df[~merged_df['Main Artist'].isin(df1_with_data['Main Artist'])]

extended_df = pd.concat([df1_with_data, merged_df], ignore_index=True)

cleaned_df = extended_df.drop_duplicates()
cleaned_df.reset_index(drop=True, inplace=True)
cleaned_df.to_csv('Final_Songs_with_Artists_and_Features.csv', index=False)

In [None]:
#show df 
cleaned_df