## Title
### Group 14 

> Github repository: https://github.com/LivDreyer/CSS24.git

> Shortlog of git commits:
- x  LivDreyer
- x  AIAndreas
- x  FelixxAI


> Contribution: The workload was distributed equally between all members of the group. 

![Crowd](crowd.png)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import json
import networkx as nx
import netwulf as nu
from networkx.readwrite import json_graph
import random
import pandas as pd
import ast
from collections import Counter
from wordcloud import WordCloud
import re
import requests
import time
from tqdm import tqdm
import logging
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from bs4 import BeautifulSoup
import os

# Table of Contents
* [1. Motivation](#motivation)
    * [1.1 Datasets and motivation](#datasets)
    * [1.2 Goal for the end user's experience](#goaluser)
* [2. Data construction and Basic Statistics](#datacons)
    * [2.1 Data collection from the Spotify API](#spotify)
    * [2.2 Data collection from the Genius API](#textdata)
        * [2.2.1 Creating data for textual analysis](#textdata_create)
        * [2.2.2 Basic Statistics of textual data](#basic_stats_text)
    * [2.3 Artist collaboration network data](#networkdata)
        * [2.3.1 Preprocessing and cleaning](#networkpreclean)
        * [2.3.2 Creation of network](#creation_of_network)
        * [2.3.3 Basic statistics - network data](#basic_stats_network)
* [3. Data analysis](#data_analysis)
    * [3.1 Network analysis](#network_analysis)
        * [3.1.1 Degree distribution](#degree_distribution)
        * [3.1.2 Assortativity](#assortivity)
        * [3.1.3 Communities](#communities)
    * [3.2 Textual analysis](#text_analysis)
        * [3.2.1 TF-IDF](#TF_IDF)
        * [3.2.2 Wordclouds](#wordclouds)
* [4. Discussion](#discussion)
* [5. References](#references)


## 1. Motivation <a class="anchor" id="motivation"></a>

### 1.1 What are our datasets and what is our motivation for choosing these? <a class="anchor" id="datasets"></a>

This project aims to answer the research question: "What is the relationship between artist collaboration patterns, popularity, and lyrical expression of genre themes?" To answer this question, we center our network analysis on the world's largest music streaming service: Spotify. Given the project's focus on artist popularity, Spotify is a natural source: streaming services contribute 84% of the music industry's revenue, and Spotify holds a dominant market share of 30.5% [1]. This project uses the [Spotify API](https://developer.spotify.com/documentation/web-api) to gain insight into artist collaborations and artist popularity.

For the textual analysis, we use [Genius](https://genius.com), which serves as an "online music encyclopedia" [2]. Although the textual analysis in this project could use a variety of song lyric APIs, [Genius' API](https://docs.genius.com) also lets artists and users annotate lyrics, making it an interesting choice for future analyses.

This project takes its starting point in the Kaggle dataset ["US Top 10K Artists and Their Popular Songs"](https://www.kaggle.com/datasets/spoorthiuk/us-top-10k-artists-and-their-popular-songs). The dataset, created by Spoorthi Uday Karakaraddi, was collected using the Spotify API and features several attributes of the top 10k artists in the US in 2023. It serves as the foundation for constructing the dataset used for network analysis, which is subsequently used to construct the dataset for our textual analysis.

[1] https://explodingtopics.com/blog/music-streaming-stats

[2] https://en.wikipedia.org/wiki/Genius_(company)

### 1.2 Our goal for the end user's experience <a class="anchor" id="goaluser"></a>

Text

## 2. Data construction and basic statistics <a class="anchor" id="datacons"></a>

Since this project works with APIs, we construct the datasets from API data, taking our starting point, as mentioned, in the Kaggle dataset ["US Top 10K Artists and Their Popular Songs"](https://www.kaggle.com/datasets/spoorthiuk/us-top-10k-artists-and-their-popular-songs). To explain our course of action clearly, this data section first covers the initial work of fetching data from the Spotify API and is then divided into the construction of the dataset used for the text analysis and the one used for the network analysis. Each block of code is labeled with the name of the resulting csv-file.


> **Note:** In this project, two artists are considered to have collaborated if one of them is featured on a song by the other. For example, the artist "SZA" is featured on Rihanna's song "Consideration", which this project counts as a collaboration between the two. To explore artists' collaboration patterns, we used the "US Top 10K Artists and Their Popular Songs" dataset to create our first list of artists.
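
To make the definition in the note concrete, the sketch below turns (main artist, featured artists) rows into counted collaboration pairs. It is only an illustration under assumptions: the helper name `collaboration_edges` and the sample rows are hypothetical, and featured artists are assumed to be stored as one `', '`-joined string, as in the CSV files built later in this notebook.

```python
from collections import Counter

def collaboration_edges(rows):
    """Count collaborations from (main_artist, featured_string) rows.

    Hypothetical helper: featured artists are assumed to be joined by ', '.
    """
    edges = Counter()
    for main_artist, featured in rows:
        if not featured:
            continue  # solo track, no collaboration
        for name in featured.split(', '):
            name = name.strip()
            if name and name != main_artist:
                # Sort the pair so (A, B) and (B, A) count as the same undirected edge
                edges[tuple(sorted((main_artist, name)))] += 1
    return edges

sample_rows = [
    ("Rihanna", "SZA"),  # "Consideration" features SZA
    ("Rihanna", ""),     # solo track
    ("SZA", "Rihanna"),  # hypothetical reverse credit, same undirected edge
]
edges = collaboration_edges(sample_rows)
print(edges)
```

One caveat of this representation is that artist names which themselves contain a comma would be split into two names, so a list-valued column would be more robust than a joined string.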

### 2.1 Data collection from the Spotify API <a class="anchor" id="spotify"></a>

The first query to the Spotify API retrieved the top 10 tracks of each artist to reveal possible featured artists. Due to a rate limit of 5000 requests per day on the Spotify API, and with the access token being valid for only one hour, only the top 4250 artists from the Kaggle dataset, ranked by Spotify's measure of popularity, were used. This resulted in a dataframe with the song ID, song name, main artist, and featured artist(s) of each song.

##### Complete_Songs_with_Artists_and_Features.csv

In [None]:
import requests
import pandas as pd
import time
from tqdm import tqdm
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Spotify API credentials
client_id = '#'  
client_secret = '#'  

def refresh_token():
    """ Refresh the Spotify API token. """
    url = 'https://accounts.spotify.com/api/token'
    payload = {'grant_type': 'client_credentials'}
    response = requests.post(url, auth=(client_id, client_secret), data=payload)
    if response.status_code == 200:
        new_token = response.json()['access_token']
        logging.info("Token refreshed successfully.")
        return new_token
    else:
        logging.error(f"Failed to refresh token: {response.text}")
        raise Exception("Failed to refresh token")

# Initial token
token = refresh_token()
headers = {'Authorization': f'Bearer {token}'}

def get_top_tracks(artist_id, retry_count=0):
    """ Fetch top tracks for a given artist ID from Spotify, handling rate limits dynamically. """
    global headers
    top_tracks_url = f"https://api.spotify.com/v1/artists/{artist_id}/top-tracks?country=US"
    response = requests.get(top_tracks_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return [(track['id'], track['name'], [artist['name'] for artist in track['artists']]) for track in data.get('tracks', [])]
    elif response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 30))
        wait_time = min(retry_after, 30)  # Cap at 30 seconds to avoid overly long delays
        logging.warning(f"Rate limit exceeded, retrying after {wait_time} seconds...")
        time.sleep(wait_time)
        return get_top_tracks(artist_id, retry_count + 1) if retry_count < 5 else []
    elif response.status_code == 401 and retry_count < 5:
        logging.warning("Token expired, refreshing token...")
        headers['Authorization'] = 'Bearer ' + refresh_token()
        return get_top_tracks(artist_id, retry_count + 1)
    else:
        logging.error(f"Failed to fetch data: {response.status_code}")
        return []

def separate_artists(artists, main_artist_name):
    """ Separate main artist from featured artists. """
    featured_artists = [artist for artist in artists if artist != main_artist_name]
    return main_artist_name, ', '.join(featured_artists)

def gather_top_tracks(df):
    """ Process each artist and fetch their top tracks. """
    songs = []
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Fetching top tracks"):
        artist_id = row['ID']
        results = get_top_tracks(artist_id)
        for song_id, song_name, artists in results:
            main_artist, features = separate_artists(artists, row['Name'])
            songs.append([song_id, song_name, main_artist, features])
    return songs


df_artists = pd.read_csv('Artists.csv')
first_part = df_artists.iloc[:4250]
logging.info("Processing the first 4250 artists...")
song_data = gather_top_tracks(first_part)
# Each row holds four values, so the column list must have four names
songs_df = pd.DataFrame(song_data, columns=['Song ID', 'Song Name', 'Main Artist', 'Featured Artists'])

# Save the data to CSV
songs_df.to_csv('Complete_Songs_with_Artists_and_Features.csv', index=False)
logging.info("Data saved to Complete_Songs_with_Artists_and_Features.csv.")

We then created a dataframe of the 4250 artists from the Kaggle dataset containing the following information: Main artist, Top 10 tracks, Artist ID, Genre (categorized by Spotify), Popularity Score, Follower Count, and their URI. See the following code block. 

##### Final_Artist_Tracks_Info.csv

In [None]:
import pandas as pd

# Load the datasets
songs_df = pd.read_csv('Complete_Songs_with_Artists_and_Features.csv')
artists_info_df = pd.read_csv('Artists.csv')

songs_df['Main Artist'] = songs_df['Main Artist'].str.strip().str.lower()
artists_info_df['Name'] = artists_info_df['Name'].str.strip().str.lower()
top_tracks = songs_df.groupby('Main Artist')['Song Name'].apply(list).reset_index()
top_tracks.rename(columns={'Main Artist': 'Name'}, inplace=True)
final_df = pd.merge(top_tracks, artists_info_df, on='Name', how='left')
final_df.rename(columns={'Name': 'Main Artist'}, inplace=True)
final_df = final_df[['Main Artist', 'Song Name', 'ID', 'Genres', 'Popularity', 'Followers', 'URI']]

# Save the final merged dataset to a new CSV file
final_df.to_csv('Final_Artist_Tracks_Info.csv', index=False)

We now have both a dataframe of the first 4250 artists with the attributes mentioned above and a record of which artists are featured on those artists' top tracks. Next, we query the same information for the featured artists so that the data can be merged into a final dataset.
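
The cell under *Complete_Artists_Info.csv* reads artist names from `Unique_Features.txt`, a file that is not created in the cells shown in this notebook. A minimal sketch of how such a list could be derived from the `Featured Artists` column is given here. The helper name `unique_featured_artists` and the choice to exclude artists that were already queried as main artists are assumptions, not necessarily the procedure that was actually used.

```python
def unique_featured_artists(featured_column, main_artists=()):
    """Return sorted unique featured-artist names.

    featured_column: iterable of ', '-joined strings (None/NaN for solo tracks).
    main_artists: names to exclude because they were already queried (assumption).
    """
    known = {name.strip().lower() for name in main_artists if isinstance(name, str)}
    unique = set()
    for cell in featured_column:
        if not isinstance(cell, str):
            continue  # skip NaN / missing values
        for name in cell.split(', '):
            name = name.strip()
            if name and name.lower() not in known:
                unique.add(name)
    return sorted(unique)

names = unique_featured_artists(
    ["SZA", "SZA, Drake", None, ""],
    main_artists=["Drake"],
)
print(names)  # ['SZA']

# With pandas, the input could come from the CSV built above, for example:
# songs_df = pd.read_csv('Complete_Songs_with_Artists_and_Features.csv')
# names = unique_featured_artists(songs_df['Featured Artists'], songs_df['Main Artist'])
# with open('Unique_Features.txt', 'w', encoding='utf-8') as f:
#     f.write('\n'.join(names))
```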

##### Complete_Artists_Info.csv

In [None]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from tqdm import tqdm

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="#", client_secret="#"))

# Function to search for an artist on Spotify and get their info
def get_artist_info(artist_name):
    try:
        results = sp.search(q='artist:' + artist_name, type='artist', limit=1)
        items = results['artists']['items']
        if items:
            artist = items[0]
            return {
                'Name': artist['name'],
                'ID': artist['id'],
                'Genres': ', '.join(artist['genres']),
                'Popularity': artist['popularity'],
                'Followers': artist['followers']['total'],
                'URI': artist['uri']
            }
    except spotipy.client.SpotifyException as e:
        print(f"Spotify API error for {artist_name}: {e}")
    return None

# Read the text file with artist names
with open('Unique_Features.txt', 'r') as file:
    unique_features = file.read().splitlines()

artists_to_query = unique_features
artists_info = []
for artist_name in tqdm(artists_to_query, desc='Querying artists'):
    artist_info = get_artist_info(artist_name)
    if artist_info:
        artists_info.append(artist_info)
    else:
        print(f"No data found for artist: {artist_name}")

artists_df = pd.DataFrame(artists_info)
artists_df.to_csv('Complete_Artists_Info.csv', index=False)
print("Finished collecting all artists' information.")

We have now found the following information on the featured artists: Artist, Artist ID, Genre (categorized by Spotify), Popularity Score, Follower Count, and URI. To obtain the top 10 tracks for each of the featured artists, who from here on are simply treated as artists, we query the Spotify API again. As we are now interested in the top 10 tracks of 11260 artists, we run the following code in three parts. Each run retrieves the songs of about 4000 artists.
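
Splitting the artist list into fixed-size runs, as described above, can be sketched as follows. The helper `chunk_bounds` is hypothetical; the chunk size of 4300 matches the slices used in the cell below, and 11260 is the artist count quoted above.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(11260, 4300))
print(bounds)  # [(0, 4300), (4300, 8600), (8600, 11260)]
```

Each pair could then be used as `df_artists.iloc[start:stop]`, with one pair processed per run to stay within the daily request limit.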

##### The following two blocks of code result in Feature_Artist_Songs_Combined.csv

In [None]:
import requests
import pandas as pd
import time
from tqdm import tqdm
import logging


logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Spotify API credentials
client_id = '#'  
client_secret = '#'  

def refresh_token():
    """ Refresh the Spotify API token. """
    url = 'https://accounts.spotify.com/api/token'
    payload = {'grant_type': 'client_credentials'}
    response = requests.post(url, auth=(client_id, client_secret), data=payload)
    if response.status_code == 200:
        new_token = response.json()['access_token']
        logging.info("Token refreshed successfully.")
        return new_token
    else:
        logging.error(f"Failed to refresh token: {response.text}")
        raise Exception("Failed to refresh token")

# Initial token
token = refresh_token()
headers = {'Authorization': f'Bearer {token}'}

def get_top_tracks(artist_id, retry_count=0):
    """Fetch top tracks for a given artist ID from Spotify, handling rate limits dynamically."""
    global headers
    top_tracks_url = f"https://api.spotify.com/v1/artists/{artist_id}/top-tracks?country=US"
    response = requests.get(top_tracks_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return [(track['id'], track['name'], [artist['name'] for artist in track['artists']]) for track in data.get('tracks', [])]
    elif response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 30))
        wait_time = min(retry_after, 30)  # Cap at 30 seconds to avoid overly long delays
        logging.warning(f"Rate limit exceeded, retrying after {wait_time} seconds...")
        time.sleep(wait_time)
        return get_top_tracks(artist_id, retry_count + 1) if retry_count < 5 else []
    elif response.status_code == 401 and retry_count < 5:
        logging.warning("Token expired, refreshing token...")
        headers['Authorization'] = 'Bearer ' + refresh_token()
        return get_top_tracks(artist_id, retry_count + 1)
    else:
        logging.error(f"Failed to fetch data: {response.status_code}")
        return []

def separate_artists(artists, main_artist_name):
    """Separate main artist from featured artists."""
    featured_artists = [artist for artist in artists if artist != main_artist_name]
    return main_artist_name, ', '.join(featured_artists)

def gather_top_tracks(df):
    """Process each artist and fetch their top tracks."""
    songs = []
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Fetching top tracks"):
        artist_id = row['ID']
        results = get_top_tracks(artist_id)
        for song_id, song_name, artists in results:
            main_artist, features = separate_artists(artists, row['Name'])
            songs.append([song_id, song_name, main_artist, features])
    return songs

# Load the dataset
df_artists = pd.read_csv('Complete_Artists_Info.csv')

# Split the dataset into 3 parts and process one part per run.
# Uncomment the slice and the matching to_csv line for the part being processed.
#part = df_artists.iloc[:4300]
#part = df_artists.iloc[4300:8600]
part = df_artists.iloc[8600:]


logging.info("Processing the selected part of the dataset...")
part_data = gather_top_tracks(part)
songs_df = pd.DataFrame(part_data, columns=['Song ID', 'Song Name', 'Main Artist', 'Featured Artists'])
#songs_df.to_csv('Feature_Artist_Songs_Part1.csv', index=False)
#songs_df.to_csv('Feature_Artist_Songs_Part2.csv', index=False)
songs_df.to_csv('Feature_Artist_Songs_Part3.csv', index=False)

They are then merged into one single dataframe. 

In [None]:
import pandas as pd

# Load each part of the dataset
df_part1 = pd.read_csv('Feature_Artist_Songs_Part1.csv')
df_part2 = pd.read_csv('Feature_Artist_Songs_Part2.csv')
df_part3 = pd.read_csv('Feature_Artist_Songs_Part3.csv')

df_combined = pd.concat([df_part1, df_part2, df_part3], ignore_index=True)
df_combined.to_csv('Feature_Artist_Songs_Combined.csv', index=False)

print("Combined DataFrame Info:")
print(df_combined.info())

We then create a full dataset of featured artists and their top 10 songs.

##### Final_Feature_Artist_Tracks_Info.csv

In [None]:
import pandas as pd

# Load the datasets
songs_df = pd.read_csv('Feature_Artist_Songs_Combined.csv')
artists_info_df = pd.read_csv('Featured/Complete_Feature_Artists_Info.csv')

# Normalize artist names to ensure the case and trimming issues don't affect the merge
songs_df['Main Artist'] = songs_df['Main Artist'].str.strip().str.lower()
artists_info_df['Name'] = artists_info_df['Name'].str.strip().str.lower()

# Aggregate the top tracks for each main artist
top_tracks = songs_df.groupby('Main Artist')['Song Name'].apply(list).reset_index()
top_tracks.rename(columns={'Main Artist': 'Name'}, inplace=True)
final_df = pd.merge(top_tracks, artists_info_df, on='Name', how='left')
final_df.rename(columns={'Name': 'Main Artist'}, inplace=True)
final_df = final_df[['Main Artist', 'Song Name', 'ID', 'Genres', 'Popularity', 'Followers', 'URI']]
final_df.to_csv('Final_Feature_Artist_Tracks_Info.csv', index=False)

print("Final DataFrame head:")
print(final_df.head())

Lastly we combine the *Final_Feature_Artist_Tracks_Info.csv* and *Final_Artist_Tracks_Info.csv* into one dataframe with the following attributes: Artist, Top songs, Artist ID, Genre, Popularity score, Followers, and URI. 

##### Combined_Artist_Tracks_Info.csv

In [None]:
import pandas as pd


# Forward slashes work on Windows, macOS, and Linux alike
featured_tracks = pd.read_csv('Featured/Final_Feature_Artist_Tracks_Info.csv')
final_artist_tracks = pd.read_csv('Final_Artist_Tracks_Info.csv')

combined_data = pd.concat([featured_tracks, final_artist_tracks], ignore_index=True)
combined_data = combined_data.drop_duplicates()
combined_data.to_csv('Combined_Artist_Tracks_Info.csv', index=False)

We have now fetched all data needed from the Spotify API. The following sections will be divided into the data used for the network analysis followed by the data used for the textual analysis. 

### 2.2 Data collection from the Genius API <a class="anchor" id="textdata"></a>

#### 2.2.1 Creating data for textual analysis <a class="anchor" id="textdata_create"></a>

We use *Combined_Artist_Tracks_Info.csv*, which has about 14k rows and the following attributes: Artist, Songs, ID, Genre, Popularity, Followers, and URI. We first rename the genres using the same umbrella-genre grouping as in section 2.3.1, with the slight difference that the "other" group is removed. We then find the 10 most popular genres based on total follower count, find the top 100 artists within each of these genres, and randomly choose one of each artist's top tracks to represent a song within the given genre.

In [None]:
import pandas as pd
import random
import ast  

# Load dataset
data = pd.read_csv('Combined_Artist_Tracks_Info.csv')

# Function to convert string representations of lists into actual list objects
def parse_song_names(song_names):
    try:
        return ast.literal_eval(song_names)
    except (ValueError, SyntaxError):
        # Malformed or missing entries are treated as an empty song list
        return []


data['Song Name'] = data['Song Name'].apply(parse_song_names)

def rename_genres(original_genre_name):
    pop_acronyms = ['pop']
    rock_acronyms = ['rock', 'metal', 'punk', 'grunge']
    hip_hop_acronyms = ['hip hop', 'rap', 'trap']
    rnb_acronyms = ['r&b', 'jazz', 'blues', 'funk', 'lounge', 'soul']
    country_acronyms = ['country']
    indie_acronyms = ['indie']
    electronic_acronyms = ['electronic', 'electro', 'edm', 'house', 'techno', 'dubstep', 'basshall', 'bass']
    latin_acronyms = ['latino', 'corrido', 'latin', 'banda', 'ranchera', 'mariachi', 'cantautor', 'arrocha', 'sertanejo', 'vallenato']
    raggae_acronyms = ['reggaeton', 'reggae']
    hindi_acronyms = ['bollywood', 'filmi']
    hollywood_acronyms = ['hollywood', 'soundtrack', 'movie tunes']
    
    new_genre_name = []
    if type(original_genre_name) != str:
        original_genre_name = str(original_genre_name)
    if any(x in original_genre_name for x in pop_acronyms):
        new_genre_name.append('pop')
    if any(x in original_genre_name for x in rock_acronyms):
        new_genre_name.append('rock')
    if any(x in original_genre_name for x in hip_hop_acronyms):
        new_genre_name.append('hip hop')
    if any(x in original_genre_name for x in rnb_acronyms):
        new_genre_name.append('r&b')
    if any(x in original_genre_name for x in electronic_acronyms):
        new_genre_name.append('electronic')
    if any(x in original_genre_name for x in country_acronyms):
        new_genre_name.append('country')
    if any(x in original_genre_name for x in indie_acronyms):
        new_genre_name.append('indie')
    if any(x in original_genre_name for x in latin_acronyms):
        new_genre_name.append('latin')
    if any(x in original_genre_name for x in raggae_acronyms):
        new_genre_name.append('reggae')
    if any(x in original_genre_name for x in hindi_acronyms):
        new_genre_name.append('hindi/bollywood')
    if any(x in original_genre_name for x in hollywood_acronyms):
        new_genre_name.append('hollywood')
    if not new_genre_name:
        new_genre_name.append('other')
    return new_genre_name

# Apply the rename_genres function
data['Genres'] = data['Genres'].apply(lambda x: rename_genres(x))

# Exploding the list of genres into separate rows
data_exploded = data.explode('Genres')
# Exclude 'other' genre
data_filtered = data_exploded[data_exploded['Genres'] != 'other']

# Calculate total followers for each genre and identify the top 10 genres
genre_followers = data_filtered.groupby('Genres')['Followers'].sum().reset_index()
top_genres = genre_followers.nlargest(10, 'Followers')['Genres']

# Prepare a DataFrame to collect all top 100 artists across genres
all_top_artists = pd.DataFrame()

# Find top 100 artists in each of the top 10 genres
for genre in top_genres:
    filtered_data = data_filtered[data_filtered['Genres'] == genre]
    top_artists = filtered_data.groupby(['Main Artist', 'Genres'])['Followers'].sum().reset_index()
    top_artists_sorted = top_artists.nlargest(100, 'Followers')
    
    # Add a column for a random song from each artist's list of songs
    top_artists_sorted['Random Song'] = top_artists_sorted['Main Artist'].map(
        lambda artist: random.choice(data[data['Main Artist'] == artist]['Song Name'].iloc[0])
        if data[data['Main Artist'] == artist]['Song Name'].iloc[0] else None
    )

    all_top_artists = pd.concat([all_top_artists, top_artists_sorted], ignore_index=True)

# Save the consolidated list of top 100 artists from each genre with a random song to a single CSV file
all_top_artists.to_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song.csv', index=False)


The above csv-file serves as our foundation for querying the Genius API. When we give the Genius API a song title, we receive a URL for the song on Genius' website in return. From there we can scrape the lyrics of the song and save them to our dataframe alongside the attributes mentioned above.
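
The rule used to pick a Genius search hit can be illustrated without network access. The sketch below applies the same case-insensitive substring checks as the cell below to a hand-written response. The function name `pick_song_url` and the sample hits are hypothetical, and only the `result` fields used in this notebook (`primary_artist.name`, `title`, `url`) are assumed.

```python
def pick_song_url(hits, artist_name, song_title):
    """Return the URL of the first hit matching both artist and title, else None."""
    artist = artist_name.lower()
    title = song_title.lower()
    for hit in hits:
        result = hit['result']
        if artist in result['primary_artist']['name'].lower() and title in result['title'].lower():
            return result['url']
    return None

# Hand-written stand-in for the list of hits in a search response (not real API output)
sample_hits = [
    {'result': {'primary_artist': {'name': 'Someone Else'},
                'title': 'Consideration',
                'url': 'https://example.com/other'}},
    {'result': {'primary_artist': {'name': 'Rihanna'},
                'title': 'Consideration',
                'url': 'https://example.com/rihanna-consideration'}},
]
url = pick_song_url(sample_hits, 'Rihanna', 'Consideration')
missing = pick_song_url(sample_hits, 'Rihanna', 'Unknown Song')
print(url, missing)
```

Because the check is a substring match, a short title can also match longer titles by the same artist, which is one reason some lookups may return an unexpected page.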

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

GENIUS_API_TOKEN = '#'  # Replace with your actual Genius API token

def clean_song_title(song_title):
    # Truncate song title at the first occurrence of '(' or '-'
    return re.split(r'\(|-', song_title)[0].strip()

def request_song_url(artist_name, song_title):
    clean_title = clean_song_title(song_title)
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + GENIUS_API_TOKEN}
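    # Pass the query via params so spaces and characters such as '&' are URL-encoded
    response = requests.get(f"{base_url}/search", headers=headers, params={'q': f"{artist_name} {clean_title}"})
    results = response.json()  # named 'results' to avoid shadowing the json module

    for hit in results['response']['hits']: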
        if artist_name.lower() in hit['result']['primary_artist']['name'].lower() and clean_title.lower() in hit['result']['title'].lower():
            return hit['result']['url']
    return None

def scrape_song_lyrics(url):
    if url is None:
        return "URL not found"
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics_div = html.find('div', class_='lyrics') or html.find('div', class_=re.compile(r'Lyrics__Container'))
    if lyrics_div:
        # Use a newline separator so words on adjacent lines are not fused together
        lyrics = lyrics_div.get_text(separator='\n', strip=True)
        lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
        lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
    else:
        lyrics = "Lyrics not found"
    return lyrics

# Load the dataset
data = pd.read_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song.csv')

# Add a new column for the lyrics
data['Lyrics'] = ''

# Process each row to fetch lyrics
for index, row in data.iterrows():
    url = request_song_url(row['Main Artist'], row['Random Song'])
    lyrics = scrape_song_lyrics(url)
    data.at[index, 'Lyrics'] = lyrics  # Store the lyrics in the DataFrame

# Save the updated DataFrame with lyrics to a new CSV file
data.to_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song_with_Lyrics.csv', index=False)


In some instances, no URL or lyrics can be found on Genius for the randomly chosen song, so we have to adjust for that. We do so by selecting another song by the same artist and repeating the process until only 3% of the 1000 songs are missing lyrics.
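
The "select another song and repeat" procedure can be summarized as a loop. The sketch below is an assumption about how the repetition could be automated; in this notebook the reselection and refetching steps are separate blocks that are rerun as needed. `fetch_lyrics` is a hypothetical callable standing in for the Genius lookup and scrape, and `max_rounds` is an illustrative safeguard.

```python
import random

FAILED = {"URL not found", "Lyrics not found"}

def fill_missing_lyrics(rows, songs_by_artist, fetch_lyrics, max_missing=0.03, max_rounds=10):
    """Retry rows without lyrics using other songs by the same artist.

    rows: list of dicts with keys 'artist', 'song', 'lyrics' (modified in place).
    songs_by_artist: dict mapping artist -> list of candidate song titles.
    fetch_lyrics: hypothetical callable (artist, song) -> lyrics or a FAILED marker.
    """
    tried = [{row['song']} for row in rows]  # songs already attempted per row
    for _ in range(max_rounds):
        missing = [i for i, row in enumerate(rows) if row['lyrics'] in FAILED]
        if len(missing) <= max_missing * len(rows):
            break  # share of missing lyrics is acceptable
        progressed = False
        for i in missing:
            row = rows[i]
            untried = [s for s in songs_by_artist.get(row['artist'], []) if s not in tried[i]]
            if not untried:
                continue  # no songs left to try for this artist
            row['song'] = random.choice(untried)
            tried[i].add(row['song'])
            row['lyrics'] = fetch_lyrics(row['artist'], row['song'])
            progressed = True
        if not progressed:
            break  # nothing left to try anywhere
    return rows

def fake_fetch(artist, song):
    # Stand-in for the real lookup: only song "B" has lyrics
    return "la la la" if song == "B" else "Lyrics not found"

rows = [{'artist': 'X', 'song': 'A', 'lyrics': 'URL not found'}]
fill_missing_lyrics(rows, {'X': ['A', 'B', 'C']}, fake_fetch)
print(rows[0])
```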

In [None]:
import pandas as pd
import random
import ast  

# Load datasets
lyrics_data = pd.read_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song_with_Lyrics.csv')
tracks_data = pd.read_csv('Combined_Artist_Tracks_Info.csv')

# Convert string representation of lists in 'Song Name' to actual list objects
tracks_data['Song Name'] = tracks_data['Song Name'].apply(lambda x: ast.literal_eval(x))

# Function to get a new random song that is not the same as the current one
def get_new_song(artist, current_song):
    artist_rows = tracks_data[tracks_data['Main Artist'] == artist]
    if artist_rows.empty:
        return None  # artist not present in the tracks dataset
    # Build a new list instead of mutating the list stored in the DataFrame
    possible_songs = [song for song in artist_rows['Song Name'].iloc[0] if song != current_song]
    return random.choice(possible_songs) if possible_songs else None

# Find rows with 'URL not found' and 'Lyrics not found' then replace the 'Random Song' with a new random song
for index, row in lyrics_data.iterrows():
    if row['Lyrics'] == 'URL not found' or row['Lyrics'] == 'Lyrics not found':
        new_song = get_new_song(row['Main Artist'], row['Random Song'])
        if new_song:
            lyrics_data.at[index, 'Random Song'] = new_song
            lyrics_data.at[index, 'Lyrics'] = 'New song selected, lyrics need re-fetching'  # Update lyrics status

# Save the modified lyrics dataset
lyrics_data.to_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song_with_Lyrics.csv', index=False)


import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import os  # needed for os.linesep in scrape_song_lyrics

GENIUS_API_TOKEN = '#'  

def clean_song_title(song_title):
    """ Truncate song title at the first occurrence of '(' or '-' """
    return re.split(r'\(|-', song_title)[0].strip()

def request_song_url(artist_name, song_title):
    """ Request song URL from Genius API """
    clean_title = clean_song_title(song_title)
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + GENIUS_API_TOKEN}
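    # Pass the query via params so spaces and characters such as '&' are URL-encoded
    response = requests.get(f"{base_url}/search", headers=headers, params={'q': f"{artist_name} {clean_title}"})
    results = response.json()  # named 'results' to avoid shadowing the json module

    for hit in results['response']['hits']: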
        if artist_name.lower() in hit['result']['primary_artist']['name'].lower() and clean_title.lower() in hit['result']['title'].lower():
            return hit['result']['url']
    return None

def scrape_song_lyrics(url):
    """ Scrape song lyrics from Genius URL """
    if url is None:
        return "URL not found"
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics_div = html.find('div', class_='lyrics') or html.find('div', class_=re.compile(r'Lyrics__Container'))
    if lyrics_div:
        # Use a newline separator so words on adjacent lines are not fused together
        lyrics = lyrics_div.get_text(separator='\n', strip=True)
        lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
        lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
    else:
        lyrics = "Lyrics not found"
    return lyrics

# Load the dataset
data = pd.read_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song_with_Lyrics.csv')

# Process each row to fetch lyrics only for entries marked as needing re-fetching
for index, row in data.iterrows():
    if row['Lyrics'] == "New song selected, lyrics need re-fetching":
        url = request_song_url(row['Main Artist'], row['Random Song'])
        lyrics = scrape_song_lyrics(url)
        data.at[index, 'Lyrics'] = lyrics  # Update the lyrics in the DataFrame

# Update the csv file with the new lyrics
data.to_csv('Top_100_Artists_Across_Top_Genres_With_Random_Song_with_Lyrics.csv', index=False)


#### 2.2.2 Basic statistics of textual data <a class="anchor" id="basic_stats_text"></a>

### 2.3 Artist collaboration network data <a class="anchor" id="networkdata"></a>

For the network analysis, we will use the following csv-files created in the previous section:

- Final_Feature_Artist_Tracks_Info.csv
- Feature_Artist_Songs_Combined.csv
- Complete_Songs_with_Artists_and_Features.csv

#### 2.3.1 Preprocessing and cleaning <a class="anchor" id="networkpreclean"></a>

To create the final dataset used for the network analysis, we first join *Complete_Songs_with_Artists_and_Features.csv*, which has the attributes Song Name, Main Artist, and Featured Artist(s), with the Kaggle "US Top 10K Artists and Their Popular Songs" dataset on the name of the main artist.

In [None]:
df_artist = pd.read_csv('Artists.csv')

df1 = pd.read_csv('Complete_Songs_with_Artists_and_Features.csv', usecols=['Song Name','Main Artist', 'Featured Artists'])

merged_df = df1.merge(df_artist, left_on='Main Artist', right_on='Name', how='left')

df1_with_data = merged_df.drop(columns=['Name', 'ID', 'URI', 'Age', 'Country', 'Gender'])

We then take *Feature_Artist_Songs_Combined.csv*, which has the attributes Song Name, Main Artist, and Featured Artist(s), and join it on "Main Artist" with *Final_Feature_Artist_Tracks_Info.csv*, which has the attributes Main Artist, Genre, Popularity Score, and Followers.

In [None]:
df2 = pd.read_csv('Feature_Artist_Songs_Combined.csv', usecols=['Song Name','Main Artist', 'Featured Artists'])

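# File name matches the one written earlier (capital "I" in "Info"), which matters on case-sensitive file systems
df3 = pd.read_csv('Final_Feature_Artist_Tracks_Info.csv', usecols=['Main Artist', 'Genres', 'Popularity', 'Followers'])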

df2['Main Artist'] = df2['Main Artist'].str.lower()

merged_df = df2.merge(df3, on='Main Artist', how='inner')

Lastly we combine the two dataframes, resulting in the final one shown below: 

In [None]:
df1_with_data['Main Artist'] = df1_with_data['Main Artist'].str.lower()
merged_df = merged_df[~merged_df['Main Artist'].isin(df1_with_data['Main Artist'])]

extended_df = pd.concat([df1_with_data, merged_df], ignore_index=True)

cleaned_df = extended_df.drop_duplicates()
cleaned_df.reset_index(drop=True, inplace=True)
cleaned_df.to_csv('Final_Songs_with_Artists_and_Features.csv', index=False)

In [None]:
#show df 
cleaned_df

As of November 2023, Spotify had 6000 different genres, and artists often have more than one genre assigned to them.

We have therefore grouped the genres into broader "umbrella" genres.
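
The grouping rule in the next cell is a substring test: an artist receives an umbrella genre if any of that genre's keywords occurs in Spotify's genre string. A compact, dictionary-driven sketch of that rule on a reduced keyword set is shown here for illustration only. `UMBRELLA_KEYWORDS` and `umbrella_genres` are hypothetical names, and the keyword lists are a subset of those used below.

```python
# Reduced keyword table for illustration (subset of the lists in the next cell)
UMBRELLA_KEYWORDS = {
    'pop': ['pop'],
    'rock': ['rock', 'metal', 'punk', 'grunge'],
    'hip hop': ['hip hop', 'rap', 'trap'],
    'country': ['country'],
}

def umbrella_genres(genre_string):
    """Map a Spotify genre string to a list of umbrella genres via substring matching."""
    text = str(genre_string)  # missing values (NaN) become the string 'nan'
    matches = [umbrella for umbrella, keywords in UMBRELLA_KEYWORDS.items()
               if any(keyword in text for keyword in keywords)]
    return matches or ['other']

print(umbrella_genres('dance pop, pop rap'))  # ['pop', 'hip hop']
print(umbrella_genres('country rock'))        # ['rock', 'country']
print(umbrella_genres(float('nan')))          # ['other']
```

Since the test is a plain substring match, an artist can belong to several umbrella genres at once, and any genre string containing a keyword is matched regardless of context.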

In [None]:
pop_acronyms = ['pop']	
rock_acronyms = ['rock', 'metal', 'punk', 'grunge']
hip_hop_acronyms = ['hip hop', 'rap', 'trap']
rnb_acronyms = ['r&b', 'jazz', 'blues', 'funk', 'lounge', 'soul']
country_acronyms = ['country']
indie_acronyms = ['indie']
electronic_acronyms = ['electronic','electro', 'edm', 'house', 'techno', 'dubstep', 'basshall', 'bass']
latin_acronyms = ['latino', 'corrido', 'latin', 'banda', 'ranchera', 'mariachi', 'cantautor', 'arrocha', 'sertanejo', 'vallenato'] # cantautor should maybe be added to rock and pop
raggae_acronyms = ['reggaeton', 'reggae']
hindi_acronyms = ['bollywood', 'filmi']
hollywood_acronyms = ['hollywood', 'soundtrack', 'movie tunes']

def rename_genres(original_genre_name):
    new_genre_name = []
    if type(original_genre_name) != str:
        original_genre_name = str(original_genre_name)
    if any(x in original_genre_name for x in pop_acronyms):
        new_genre_name.append('pop')
    if any(x in original_genre_name for x in rock_acronyms):
        new_genre_name.append('rock')
    if any(x in original_genre_name for x in hip_hop_acronyms):
        new_genre_name.append('hip hop')
    if any(x in original_genre_name for x in rnb_acronyms):
        new_genre_name.append('r&b')
    if any(x in original_genre_name for x in electronic_acronyms):
        new_genre_name.append('electronic')
    if any(x in original_genre_name for x in country_acronyms):
        new_genre_name.append('country')
    if any(x in original_genre_name for x in indie_acronyms):
        new_genre_name.append('indie')
    if any(x in original_genre_name for x in latin_acronyms):
        new_genre_name.append('latin')
    if any(x in original_genre_name for x in raggae_acronyms):
        new_genre_name.append('reggae')
    if any(x in original_genre_name for x in hindi_acronyms):
        new_genre_name.append('hindi/bollywood')
    if any(x in original_genre_name for x in hollywood_acronyms):
        new_genre_name.append('hollywood')
    if not new_genre_name:
        new_genre_name.append('other')
    return new_genre_name

df = pd.read_csv('Final_Songs_with_Artists_and_Features.csv')

df['Genres'] = df['Genres'].apply(rename_genres) # Rename genres

print(df['Genres'].head())

#### 2.3.2 Creation of network <a class="anchor" id="creation_of_network"></a>

Text

#### 2.3.3 Basic statistics - network data <a class="anchor" id="basic_stats_network"></a>

## 3. Data Analysis <a class="anchor" id="data_analysis"></a>

### 3.1 Network Analysis <a class="anchor" id="network_analysis"></a>

#### 3.1.1 Degree Distribution <a class="anchor" id="degree_distribution"></a>

#### 3.1.2 Assortativity <a class="anchor" id="assortivity"></a>

#### 3.1.3 Communities <a class="anchor" id="communities"></a>

### 3.2 Textual Analysis <a class="anchor" id="text_analysis"></a>

#### 3.2.1 TF-IDF <a class="anchor" id="TF_IDF"></a>

#### 3.2.2 Wordclouds <a class="anchor" id="wordclouds"></a>

## 4. Discussion <a class="anchor" id="discussion"></a>

## 5. References <a class="anchor" id="references"></a>