# Data Pre-Processing Notebook

In this notebook we go through most of the code used to fetch and generate the data used in our project. <br> 
In total we combine 4 data sources: 
* List of top 1000 artists from Acclaimed Music
* Wikipedia Articles of each artist
* Spotify API for artist to song relation mapping, metadata and audio features
* Spotify & Genius API for lyrics

In [None]:
# Imports
import os
import networkx as nx
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from fa2 import ForceAtlas2
import community
from operator import itemgetter
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.probability import FreqDist
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize
import math
from collections import Counter
import copy
import codecs
from bs4 import BeautifulSoup
from urllib.parse import quote
import wikipedia
import urllib.request
import json
import requests
from dataclasses import dataclass
from typing import List
import re
import time

# Getting Top 1000 artist list

We use basic HTML parsing through `BeautifulSoup` to retrieve all the artist names

In [None]:
links = [f"http://www.acclaimedmusic.net/061024/1948-09art{n}.htm" for n in ["", "2", "3", "4", "5"]]
docs = [requests.get(link).text for link in links]

In [None]:
artists = []
for doc in docs:
    soup = BeautifulSoup(doc, 'html.parser')
    rows = soup.find("table").find_all("tr")

    for row in rows:
        cols = row.find_all("td")
        if len(cols) == 11:
            artist_name = cols[1].find("a").string
            artists.append(artist_name)

Once all artists have been fetched, they are saved to a plain text file, one name per line.

In [None]:
with codecs.open("./artists_1000.txt", "w","utf-8") as f:
    f.write("\n".join(artists))

# Getting artist links from Wikipedia Articles

### Downloading Wikipedia "Content" and "Links"

First off, we use the Wikipedia API (MediaWiki API) to get the article's page ID. With the page ID we can use the `wikipedia` Python library to obtain the article content and the links in the article. This extra step is needed because the search functionality in the library does not seem to match the Wikipedia API.
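The lookup relies on the shape of the JSON response, so it is worth spelling out. The following is a minimal sketch, assuming the default JSON layout in which `query.pages` is an object keyed by page ID and a title without an article is reported under a negative key such as `"-1"`. The helper name `extract_page_id` and both sample payloads are made up for illustration.

```python
from typing import Optional


def extract_page_id(wikijson: dict) -> Optional[str]:
    """Return the first page ID in a MediaWiki query response, or None if the page is missing."""
    pages = wikijson.get("query", {}).get("pages", {})
    for page_id, page in pages.items():
        # Missing titles are keyed by a negative ID and carry a "missing" field
        if page_id.startswith("-") or "missing" in page:
            return None
        return page_id
    return None


# Hand-written payloads that mimic the response layout (not real API output)
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Example Artist"}}}}
missing = {"query": {"pages": {"-1": {"title": "No Such Artist", "missing": ""}}}}

print(extract_page_id(found))
print(extract_page_id(missing))
```

Returning `None` for a missing page would let the caller move on to the next name variant rather than passing a negative ID on to the library.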

In [None]:
def get_wikipedia_key(band):

    baseurl = "https://en.wikipedia.org/w/api.php?"
    action = "action=query"
    content = "prop=revisions&rvprop=content&rvslots=*"
    dataformat ="format=json"
    band = band.replace(' ', '_')
    band = quote(band)
    title = "titles=" + band

    query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)

    # Now we get the data
    wikiresponse = urllib.request.urlopen(query)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')
    wikijson = json.loads(wikitext)

    return list(wikijson['query']['pages'].keys())[0]

# Find key and then article
def get_document(artist):
    key = get_wikipedia_key(artist)
    doc = wikipedia.page(pageid=key)
    return doc

We load the artist names and then fetch the document for each of them. When searching for an artist on Wikipedia we need to deal with ambiguity, so we try different variants of the artist's name (e.g. with "(band)" as a suffix in the search term). When we find the artist, we save the content of the article (which ended up not being used in the project) and the links, keeping only links to artists that exist in our artist list.
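The "try variants until one works" idea can be isolated into a small helper. This is only a sketch: `first_match`, `fake_fetch` and the `fake_wiki` dictionary are hypothetical stand-ins, and the real lookup would go through Wikipedia instead of a dictionary.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_match(variants: Iterable[str], fetch: Callable[[str], object]) -> Optional[Tuple[str, object]]:
    """Try each title in order; return (title, result) for the first fetch that succeeds."""
    for title in variants:
        try:
            return title, fetch(title)
        except LookupError:
            continue  # This variant does not resolve, try the next one
    return None  # No variant worked


# Stand-in for a real lookup: only one title "exists"
fake_wiki = {"Example (band)": "article text"}


def fake_fetch(title: str) -> str:
    return fake_wiki[title]  # Raises KeyError (a LookupError) for unknown titles


artist = "Example"
variants = [artist, f"{artist} (band)", f"{artist} (musician)"]
print(first_match(variants, fake_fetch))
print(first_match(["Nothing"], fake_fetch))
```

Returning `None` when every variant fails makes the "not found" case explicit, so the caller can skip the artist instead of continuing with stale data.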

In [None]:
## Importing artist list:
with open("artists_1000.txt") as f:
    df_artists = f.read().split('\n')

offset = 524
for i, artist in enumerate(df_artists[offset:]):
    print(f"[{i+offset}] Artist: {artist}")

    variants = [artist, f"{artist} (band)", f"{artist} (musician)", f"{artist} (singer)", f"{artist} (group)", artist.replace("The", "the")]

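    doc = None
    for name in variants:
        try:
            doc = get_document(name)
        except Exception:
            continue # Try next variant
        break # If no errors, no need to check others, we have the needed document

    if doc is None:
        print(f"  No Wikipedia article found for {artist}")
        continue # Skip instead of reusing the previous artist's document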

    links = doc.links
    content = doc.content
    links = [link for link in links if link in df_artists and content.find(link) > 0]

    artist_file_name = artist.replace("/", "-")
    with codecs.open(f"./content/{artist_file_name}.txt", "w","utf-8") as f:
        f.write(content)

    with codecs.open(f"./links/{artist_file_name}.txt", "w","utf-8") as f:
        f.write("\n".join(links))
    
     

### Construct graph from "Links"

From the links we can now construct the graph. We load all the link files and create a dictionary. From this dictionary we can then build the graph in `networkx`. 

In [None]:
# Deal with different naming on Wiki versus artist text file
translator = {
    "Public Image Ltd." : "Public Image Ltd",
    "Run-D.M.C." : "Run-DMC",
    "Notorious B.I.G." : "The Notorious B.I.G.",
    "The Bee Gees": "Bee Gees",
    "N.W.A." : "N.W.A"
}

def translate(artist):
    if artist in translator:
        return translator[artist]
    else:
        return artist

In [None]:
# Create dictionary with links by artist
artist_links = dict()

for artist in artists:
    artist_file_name = artist.replace("/", "-")
    with codecs.open(f"./links/{artist_file_name}.txt", "r", "utf-8") as f: # Files were written as UTF-8
        links = f.read().split("\n")
        links = [link for link in links if link != ""]
        links = [translate(link) for link in links]
        artist_links[artist] = links

In [None]:
# Add links to graph
G = nx.DiGraph()
G.add_nodes_from(artists)
for artist in artists: #add links
    edges = [(artist, to_artist) for to_artist in artist_links[artist]]
    G.add_edges_from(edges)

In [None]:
nx.draw(G, with_labels = False, node_size = 20) # takes 2 minutes

From the graph we can now obtain the giant connected component (GCC), i.e. the largest connected component when edge direction is ignored, which will be used for the rest of the analysis. The artists contained in the GCC are saved in a text file and the whole graph object is saved with pickle.
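To make explicit what "largest connected component" means for a directed link graph, here is a dependency-free sketch of the same idea. It ignores edge direction and uses a breadth-first search; the function name and the toy link dictionary are made up, and the notebook itself relies on `networkx` for this step.

```python
from collections import deque
from typing import Dict, List, Set


def largest_weak_component(links: Dict[str, List[str]]) -> Set[str]:
    """Largest set of nodes that are connected when edge direction is ignored."""
    # Build an undirected adjacency map from the directed link lists
    neighbours: Dict[str, Set[str]] = {node: set() for node in links}
    for source, targets in links.items():
        for target in targets:
            neighbours.setdefault(target, set())
            neighbours[source].add(target)
            neighbours[target].add(source)

    seen: Set[str] = set()
    best: Set[str] = set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects one component
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node] - component:
                component.add(other)
                queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy link dictionary: A -> B -> C form one component, D -> E another, F is isolated
toy_links = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": [], "F": []}
print(sorted(largest_weak_component(toy_links)))
```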

In [None]:
## Extracting the GCC (giant connected component):
G_undirected = G.to_undirected()
largest_cc = max(nx.connected_components(G_undirected), key=len)
GCC = G.subgraph(largest_cc).copy()

In [None]:
# Save list of artists that we need
with codecs.open("GCC_artist.txt", "w", "utf-8") as textfile:
    for element in largest_cc:
        textfile.write(element + "\n")

In [None]:
# Save graph
with open("GCC_Grahp.pickle", 'wb') as f:
    pickle.dump(GCC, f)

# Getting Spotify Data

Spotify has had a nicely documented API for many years. All we need to do is register an "app" with our personal Spotify account and generate an access token for the API. For convenience, a temporary token can be generated from the API Console (https://developer.spotify.com/console/); it expires after an hour. <br><br>
In this step we use 3 endpoints in the Spotify API:
- /search
- /artists
- /audio-features
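Since the console token mentioned above expires after an hour, a longer run may need a way to request fresh tokens. One option is Spotify's Client Credentials flow, which this notebook does not use. The sketch below only builds the request and does not send it; `build_token_request` is a hypothetical helper and the credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a Client Credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method="POST")


# Placeholder credentials, to be replaced with the values from the registered app
request = build_token_request("my-client-id", "my-client-secret")
print(request.get_method(), request.full_url)
# Sending it would look roughly like this; the JSON reply should hold an "access_token" field:
# with urllib.request.urlopen(request) as response:
#     token = json.loads(response.read())["access_token"]
```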

We do a text search using the /search endpoint. The term is the artist name given from our Top 1000 list. This gives us a collection of artists, from which we choose the first one (in the hope that Spotify search engine makes the best guess). <br>
From the found artist we get the artist ID. This is then used with the /artists endpoint to get the top tracks (using a sub-endpoint). From the returned collection we again choose the first item, as it is sorted by popularity. This item gives us the track name, track ID and other metadata like release date. <br>
From the track ID we can look up audio features. These features are created by Spotify and give a value for different properties like danceability, acousticness and so on. <br><br>
Codewise, we create some dataclasses that help us put names on our properties. Otherwise it's just a simple REST JSON-based API.
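Artist names often contain characters such as `&` or `/` that have a special meaning in URLs, so the search term should be percent-encoded before it is put in the query string. A minimal sketch, assuming the documented `q` parameter of the /search endpoint; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.spotify.com/v1/search"


def build_search_url(artist_name: str, limit: int = 3) -> str:
    """Return a /search URL with the artist name safely percent-encoded."""
    params = {"q": artist_name, "type": "artist", "limit": limit}
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("AC/DC"))
print(build_search_url("Simon & Garfunkel"))
```

Without the encoding, the `&` in the second name would be read as a parameter separator and the search term would be cut off after "Simon".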

In [None]:
# Expired
SPOTIFY_ACCESS_TOKEN = 'BQAXSBwgJRPFlXp_wwjjWvAZVyKNhNKaYZ9yalrETtLLKkJXa4XFhPqbWvzDbAG_QSrqMwmCLK-8LeciOkmykFqiFSsjyCM2ohEmkziw3Yu4SRl-MiCe829WWiazPqq9Cfs2te8pLC0xMg'

In [None]:
@dataclass
class Artist:
    artist_name: str
    artist_id: str
    genres: List[str]
    followers: int
    image_url: str

@dataclass
class Track:
    track_name: str
    track_id: str
    # Audio Features
    danceability : float
    energy: float
    loudness: float
    speechiness: float
    acousticness: float
    instrumentalness: float
    liveness: float
    valence: float
    tempo: float
    release_date: str

def get_json(url):
    r = requests.get(url, headers={'Authorization': 'Bearer '+ SPOTIFY_ACCESS_TOKEN})
    return r.json()

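# Find artist by searching for term
def search(artist_name: str):
    artist_name_url = quote(artist_name)
    search_base_url = 'https://api.spotify.com/v1/search'
    limit = 3
    item_type = 'artist'
    # Use the URL-encoded name so characters like "&" or "/" do not break the query
    url = f"{search_base_url}?query={artist_name_url}&limit={limit}&type={item_type}"

    response = get_json(url) # Named "response" to avoid shadowing the json module
    try:
        a_obj = response["artists"]["items"][0]
    except (KeyError, IndexError):
        print(response)
        raise # Without a match there is nothing to build an Artist from
    image_url = a_obj["images"][0]["url"] if a_obj["images"] else "" # Some artists may have no image
    return Artist(artist_name=artist_name, artist_id=a_obj["id"], genres=a_obj["genres"], followers=a_obj["followers"]["total"], image_url=image_url)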

def find_top_track(artist: Artist):
    artists_base_url = 'https://api.spotify.com/v1/artists'
    audio_feature_base_url = 'https://api.spotify.com/v1/audio-features'
    market = "DK"
    
    # Find top tracks by artist
    url = f"{artists_base_url}/{artist.artist_id}/top-tracks?market={market}"
    json = get_json(url)
    id = json["tracks"][0]["id"]
    name = json["tracks"][0]["name"]
    release_date = json["tracks"][0]["album"]["release_date"]

    # Find audio features by track
    url = f"{audio_feature_base_url}/{id}"
    json = get_json(url)

    if 'error' in json:
        print("Error in ", name)
        return Track(track_name=name, track_id=id, 
            danceability = 0,
            energy = 0,
            loudness = 0,
            speechiness = 0,
            acousticness = 0,
            instrumentalness = 0,
            liveness = 0,
            valence = 0,
            tempo = 0,
            release_date=release_date
        )
    else:
        return Track(track_name=name, track_id=id, 
            danceability = json["danceability"],
            energy = json["energy"],
            loudness = json["loudness"],
            speechiness = json["speechiness"],
            acousticness = json["acousticness"],
            instrumentalness = json["instrumentalness"],
            liveness = json["liveness"],
            valence = json["valence"],
            tempo = json["tempo"],
            release_date=release_date
        )


In [None]:
artists = []
with open("artists_1000.txt") as f:
    artists = f.read().split('\n')

In [None]:
# Deal with different naming on Spotify versus artist text file
def translate(artist):
    translator = {
        "Ian Dury and The Blockheads" : "The Blockheads",
        "The Dead Kennedys" : "Dead Kennedys",
        "The Mahavishnu Orchestra" : "Mahavishnu Orchestra",
        "The Mekons" : "Mekons",
        "Sam the Sham and The Pharaos" : "Sam the Sham and The Pharaohs",
        "Michael Hurley/The Unholy Modal Rounders" : "Michael Hurley",
        "Rythim Is Rythim" : "Rhythim Is Rhythim",
        "Question Mark and the Mysterians" : "? & The Mysterians",
        "The Screaming Trees" : "Screaming Trees",
        "The Sparks" : "Sparks",
        "The Raspberries" : "Raspberries"
    }
    if artist in translator:
        return translator[artist]
    else:
        return artist

In [None]:
rows = [] # Collected records; skip this line when resuming with a non-zero offset
offset = 735
for idx, artist_name in enumerate(artists[offset:]):
    print(f"[{idx+offset}]: {artist_name}")
    translated_artist_name = translate(artist_name)
    if translated_artist_name != artist_name:
        print(f" - Translated to {translated_artist_name}")

    # Search for Spotify artist by Acclaimed artist name
    artist = search(translated_artist_name)

    # Find top track for found artist
    track = find_top_track(artist)
    rows.append({**artist.__dict__, **track.__dict__}) # For making columns in future DataFrame
    time.sleep(0.25) # Don't smash Spotify API to avoid potential ban

In [None]:
# Create DataFrame
df_artists = pd.DataFrame(rows)
df_artists.drop_duplicates(subset=["artist_name"], inplace=True)
df_artists = df_artists[df_artists["artist_name"] != "The Original Soundtrack"]
df_artists

# Getting Lyrics 

### Getting lyrics directly from Spotify

The lyrics feature is brand new on Spotify, which is probably why it is not available in the official API. Inspired by a JavaScript library we found, we opened the Spotify Web Player in the browser, looked in the Network tab of the Developer Tools and found the request to an internal REST API. Using right-click, Copy > "Copy as fetch", we could see the exact request with its required headers (like "app-platform": "WebPlayer"). By replaying that request from Python with the track ID swapped for the one we wanted, we got the lines of the lyrics, even with timestamps, although those were not needed in this project.
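The response is a list of timed lines that has to be flattened into plain text. The sketch below shows that step on a hand-written payload. The `lyrics` > `lines` > `words` nesting follows the code in this notebook, while the `startTimeMs` field name and the use of "♪" as a marker for instrumental passages are assumptions. Unlike the notebook's own helper, this variant drops such lines entirely instead of leaving empty lines behind.

```python
def lyrics_lines_to_text(payload: dict) -> str:
    """Join the "words" of every line, dropping lines that only hold a note symbol."""
    lines = payload["lyrics"]["lines"]
    words = [line["words"] for line in lines]
    # Lines consisting of a single note symbol are assumed to mark instrumental passages
    kept = [w for w in words if w.strip() != "♪"]
    return "\n".join(kept)


# Hand-written sample in the assumed shape (not real API output)
sample = {
    "lyrics": {
        "lines": [
            {"startTimeMs": "0", "words": "♪"},
            {"startTimeMs": "1200", "words": "first line"},
            {"startTimeMs": "3400", "words": "second line"},
        ]
    }
}
print(lyrics_lines_to_text(sample))
```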

In [None]:
df_artists = pd.read_csv("./spotify_data.csv", index_col=0)

In [None]:
# Expired. Obtained by using Web Player and looking into Developer Tools in the browser.
SPOTIFY_ACCESS_TOKEN = 'BQBuDVO7QS10HH72STkMtxNOX4-yZr4urMfganhnFQ_W-KHub1wDAILVSIr__sRPojcolSngaISqP8eD7dlMSQZ05T3Gixg2mNdCOx1b81daPpKm9T6YrajJb6k2dTTnqhH3Zqzz49e9CS5eAd5vtkSxp7y9pJFecZMXG_fafHly5AFFVEfps3M6gZY_2VSiBzOib4Zs9GHot7Vje_ls9c8FE79FuG-5yJeXkr3zmjfOGjz6SDWcUjtX1dY3EJ3vJSjh3yRulMAIVdfsOtTxzw37cXYOiuNJOPS1Uwk'

In [None]:
def get_lyrics(track_id):
    url = f"https://spclient.wg.spotify.com/color-lyrics/v2/track/{track_id}?format=json&vocalRemoval=false&market=from_token"
    print(url)
    r = requests.get(url, headers={
        "accept": "application/json",
        "accept-language": "da",
        "app-platform": "WebPlayer",
        "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"96\", \"Google Chrome\";v=\"96\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-site",
        "spotify-app-version": "1.1.73.498.gd50d3243",
        "referrer": "https://open.spotify.com/",
        "referrerPolicy": "strict-origin-when-cross-origin",
        "authorization": "Bearer "+ SPOTIFY_ACCESS_TOKEN
    })
    js = None
    try:
        js = r.json()
    except:
        print("No lyrics found")
    return js

def format_lyrics_to_text(lyrics):
    lines = lyrics["lyrics"]["lines"]
    text = "\n".join(map(lambda x: x["words"],lines))
    text = text.replace("♪", "")
    return text

def write_lyrics_text_to_file(artist_name, text):
    artist_name = artist_name.replace("/", "-")
    with codecs.open(f"./lyrics/{artist_name}.txt", "w","utf-8") as f:
        f.write(text)

In [None]:
offset = 0
error_artists = [] # Artists whose top track has no lyrics on Spotify
rows = [(row["artist_name"], row["track_id"]) for _, row in df_artists.iterrows()]
for idx, (artist_name, track_id) in enumerate(rows[offset:]):
    print(f"[{idx+offset}]: {artist_name}")

    lyrics = get_lyrics(track_id)
    if lyrics is not None:
        text = format_lyrics_to_text(lyrics)
        write_lyrics_text_to_file(artist_name,text)
    else: 
        error_artists.append(artist_name)
    time.sleep(0.35) # Don't smash the API with requests to avoid potential IP ban

Some songs on Spotify had no lyrics. For these, we saved the artist and their top song in a CSV file for further processing.

In [None]:
print(len(error_artists))
df_error_artists = df_artists[df_artists["artist_name"].isin(error_artists)][["artist_name", "artist_id", "track_name", "track_id"]]
df_error_artists.to_csv("./missing_lyrics.csv")
df_error_artists

### Getting missing lyrics not found on Spotify

For the songs whose lyrics we could not find on Spotify, we turned to the Genius API (an online lyrics service).

In [None]:
missing_lyrics= pd.read_csv('missing_lyrics.csv')

Since our track names are given by the Spotify API, and we now want to search on Genius, we need to clean the track names of text like "... - Remastered", "... (feat. DJ John Doe)" etc. This was done using the below regular expression. <br> <br>
The Genius API, although handy for our purpose, is not as well built as the Spotify API. When there is no result it returns a list of other popular songs instead, and when there is a result it simply returns the lyrics. To tell success from failure, we found that we could look for text like [Verse 1], [Chorus] and so on, where again a regular expression helps. If there were no proper results, we simply save an empty lyrics file.
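Both regular expressions are easier to judge on concrete input. The sketch below wraps the notebook's two patterns in small helpers (`clean_track_name` and `looks_like_lyrics` are hypothetical names) and applies them to made-up strings. The extra `.strip()` is an addition that removes the trailing space left behind by the substitution.

```python
import re

# The two patterns used in this notebook, written as raw strings
CLEAN_PATTERN = re.compile(r"[\(\[].*?[\)\]]|-.*")
SECTION_PATTERN = re.compile(r"\[Verse .\]|\[Intro\]|\[Chorus\]|\[Outro\]|\[Bridge\]")


def clean_track_name(name: str) -> str:
    """Drop bracketed additions and everything after the first hyphen."""
    return CLEAN_PATTERN.sub("", name).strip()


def looks_like_lyrics(text: str) -> bool:
    """True if the text starts with a section header such as [Verse 1]."""
    return SECTION_PATTERN.match(text) is not None


print(clean_track_name("Some Song - Remastered"))
print(clean_track_name("Some Song (feat. DJ John Doe)"))
print(looks_like_lyrics("[Verse 1]\nfirst line"))
print(looks_like_lyrics("A list of other popular songs"))
```

One caveat of the cleaning pattern is that it also truncates titles that legitimately contain a hyphen, so a name like "Up-Tempo Tune" would be reduced to "Up".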

In [None]:
## Removing () and [] additions and "- ..." suffixes in track names
missing_lyrics["track_name"] = missing_lyrics["track_name"].str.replace(
    r"[\(\[].*?[\)\]]|-.*", "", regex=True
)

In [None]:
# NOTE: `genius` is not defined in this notebook; it is assumed to be an
# authenticated client from the lyricsgenius package, created beforehand
section_header = re.compile(r"\[Verse .\]|\[Intro\]|\[Chorus\]|\[Outro\]|\[Bridge\]")

for i in range(len(missing_lyrics)):
    artist_name = missing_lyrics.artist_name.iloc[i]
    track_name = missing_lyrics.track_name.iloc[i]
    try:
        artist = genius.search_artist(artist_name, max_songs=0, sort="title")
        song = artist.song(track_name)
        # Accept only lyrics that start with a section header, otherwise treat it as a miss
        text = song.lyrics if section_header.match(song.lyrics) else ""
    except Exception:
        text = "" # Lookup failed, save an empty lyrics file

    file_name = artist_name.replace("/", "-")
    with codecs.open(f"./lyrics/{file_name}.txt", "w","utf-8") as f:
        f.write(text)

### Tokenization of lyrics

Once we have fetched all the lyrics we can, we process them into lists of tokens. These are used for lexical analysis and for sentiment analysis with the VADER method. We tokenize the lyrics and remove non-alphabetic tokens as well as stopwords in English, Spanish, German and French. We also lemmatize the words, so that inflected forms are reduced to their dictionary form (lemma).
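The order of the cleaning steps matters, so here is a dependency-free sketch of the pipeline on a made-up line. It uses the regular expression that NLTK's `WordPunctTokenizer` is built on, a tiny illustrative stopword set instead of the NLTK lists, and leaves out lemmatization, which needs the WordNet data.

```python
import re
from typing import List, Set

# WordPunctTokenizer splits text with this regular expression
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def clean_tokens(text: str, stopwords: Set[str]) -> List[str]:
    """Tokenize, keep alphabetic tokens, lowercase them and drop stopwords."""
    tokens = WORD_PUNCT.findall(text)
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]


# Tiny illustrative stopword set, not the NLTK lists
toy_stopwords = {"the", "a", "s"}
print(clean_tokens("The cat's 9 lives, the end!", toy_stopwords))
```

Lowercasing before the stopword filter is what lets "The" at the start of the line match the lowercase stopword "the".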

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
sw = stopwords.words("english")
sw_spanish = stopwords.words("spanish")
sw_german = stopwords.words("german")
sw_french = stopwords.words("french")
tokenizer = WordPunctTokenizer()
lemmatizer = WordNetLemmatizer()

In [None]:
lyrics_files = os.listdir("./lyrics" )
offset = 0
for i, file_name in enumerate(lyrics_files[offset:]):
    print(f"[{i+offset+1} / {len(lyrics_files)}]: {file_name}")
    with codecs.open(f"./lyrics/{file_name}", "r", "utf-8") as f:
        lyrics_text = f.read()

        tokens = tokenizer.tokenize(lyrics_text)

        # Only keep purely alphabetic tokens
        tokens = [t for t in tokens if t.isalpha()]
        tokens = [t.lower() for t in tokens]

        # Remove stopwords from four different languages
        tokens = [t for t in tokens if t not in sw]
        tokens = [t for t in tokens if t not in sw_spanish]
        tokens = [t for t in tokens if t not in sw_german]
        tokens = [t for t in tokens if t not in sw_french]

        # Lemmatize 
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

        with codecs.open(f"./tokens/{file_name}", "w", "utf-8") as f:
            f.write("\n".join(tokens))
