In [None]:
# As always, we start by importing everything we need
import pandas as pd
import os
import re
import hdf5_getters as getters
import requests
from bs4 import BeautifulSoup
import numpy as np
from collections import OrderedDict

# Introduction

Our project explores the lyrics of a large number of songs to find their themes and to track how word usage changes over time. We use the Million Song dataset for song metadata, and several other datasets and sources for the lyrics themselves.

# Data Collection and Descriptive Analysis

We begin by listing all the files in our dataset. The Million Song dataset spreads its records across multiple files and directories. The following code snippet collects these files and displays how many were found.

In [None]:
all_files = []
for (dirpath, dirnames, filenames) in os.walk("MillionSongSubset/data"):
    all_files.extend(os.path.join(dirpath, filename) for filename in filenames if filename.endswith(".h5"))
all_files_num = len(all_files)
all_files_num
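
As an aside, the same recursive search can be sketched with `pathlib`. This is only an illustration, not the code used in the project: the helper name `find_h5_files` and the temporary directory tree are assumptions made so that the example runs without the dataset.

```python
import tempfile
from pathlib import Path


def find_h5_files(root):
    """Return every .h5 file below `root`, sorted for a reproducible order."""
    return sorted(str(path) for path in Path(root).rglob("*.h5"))


# Build a tiny, hypothetical directory tree so the sketch runs on its own.
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "A" / "B"
    nested.mkdir(parents=True)
    (nested / "track_a.h5").touch()
    (nested / "notes.txt").touch()  # ignored: not an .h5 file
    found = find_h5_files(root)

print(len(found))
```

On the real data, the call would be `find_h5_files("MillionSongSubset/data")`.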

The Million Song dataset is not stored as plain text but encoded in the [Hierarchical Data Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format). The following functions extract the relevant data from a file. Each file is a single record of the dataset: one song described by multiple fields. These functions simply call the getter functions provided with the dataset. Since we only need a few fields, we take just the track id, title, artist name and year. These fields will be relevant later on for our analysis and visualisation.

The track id identifies the track in our analysis, while the title and the artist name help us find the lyrics of the song. The year will be used by our visualisation tool to show how the vocabulary and themes of the songs evolve over time.

In [None]:
def get_track_id(filename):
    h5 = getters.open_h5_file_read(filename)
    track_id = getters.get_track_id(h5)
    h5.close()
    return track_id

In [None]:
def get_title(filename):
    h5 = getters.open_h5_file_read(filename)
    title = getters.get_title(h5).decode()
    h5.close()
    return title

In [None]:
def get_artist_name(filename):
    h5 = getters.open_h5_file_read(filename)
    artist_name = getters.get_artist_name(h5).decode()
    h5.close()
    return artist_name

In [None]:
def get_year(filename):
    h5 = getters.open_h5_file_read(filename)
    year = getters.get_year(h5)
    h5.close()
    return year
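
Each helper above opens and closes the file separately, so reading four fields means opening the same file four times. A possible alternative is to open it once and apply all the getters to the same handle. The sketch below is an assumption, not part of the project: `read_fields` and `FakeHandle` are hypothetical names, and the fake handle only exists so the example runs without the dataset.

```python
from contextlib import closing


def read_fields(filename, open_fn, field_getters):
    """Open `filename` once and apply every getter to the same handle."""
    # closing() calls h5.close() even if one of the getters raises.
    with closing(open_fn(filename)) as h5:
        return {name: getter(h5) for name, getter in field_getters.items()}


class FakeHandle:
    """Stand-in for an open HDF5 file, used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


handle = FakeHandle()
record = read_fields(
    "song.h5",  # hypothetical file name
    open_fn=lambda _filename: handle,
    field_getters={"title": lambda h5: "Some Title", "year": lambda h5: 1999},
)
print(record, handle.closed)
```

With the real library, one would presumably pass `getters.open_h5_file_read` as `open_fn` and a dictionary such as `{"track_id": getters.get_track_id, "title": getters.get_title}` as `field_getters`.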

The Million Song dataset does not contain information about the genre of the songs. However, there is an additional dataset from the same source that contains this information. Unfortunately, it does not cover all the tracks of the Million Song dataset. We read this genre dataset here and will later link the genres with the data we obtain from the main dataset.

Note that the file read here is not the one obtained directly from the source but a reduced copy that keeps only the genre and the track id, since these are the only columns we need.

The genre dataset only has around 60 thousand tracks, which is substantially fewer than the full Million Song dataset. However, we believe this number of tracks will be enough for our data analysis and visualisation.

In [None]:
#df = pd.read_csv('MillionSongSubset/msd_genre_dataset.txt')
#genre_dataset = df[['genre', 'track_id']].set_index('track_id')
#genre_dataset.to_csv('MillionSongSubset/genre_dataset.txt')
genre_dataset = pd.read_csv('MillionSongSubset/genre_dataset.txt').set_index('track_id')

This function gets the genres for a single track given its id.

In [None]:
def get_song_genres(track_id):
    if track_id in genre_dataset.index:
        return "&".join(genre_dataset.loc[[track_id]].values[0][0].split(' and '))
    else:
        return None
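
To make the string handling in `get_song_genres` concrete, here is a small dependency-free sketch of the same transformation. The track ids and genre labels are made up for illustration, and a plain dictionary stands in for `genre_dataset`.

```python
def format_genres(raw_genre):
    """Turn a label such as 'pop and rock' into 'pop&rock'."""
    return "&".join(raw_genre.split(" and "))


# Hypothetical rows standing in for genre_dataset (track_id -> genre label).
sample_genres = {"TRACK_A": "pop and rock", "TRACK_B": "jazz"}


def lookup_genres(track_id):
    """Return the formatted genres of a track, or None if the track is unknown."""
    raw = sample_genres.get(track_id)
    return format_genres(raw) if raw is not None else None


print(lookup_genres("TRACK_A"))
print(lookup_genres("TRACK_C"))
```

Joining with `&` keeps multi-genre labels in a single column while still making them easy to split again later.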

The following snippet reads the Million Song dataset in its entirety and uses the genre dataset to link the two. It gets all the fields we need as we discussed above and also gets the genres of each track. This information is then put into a dataframe. 

For convenience, we save this data in a new `.csv` file.

In [None]:
"""
rows = []

curr_percent = -1
for i, filename in enumerate(all_files):
    # Print the progress only when the percentage changes
    percent = int(100 * i / float(all_files_num))
    if percent != curr_percent:
        curr_percent = percent
        print(curr_percent, "%", end='\r')
    
    track_id = get_track_id(filename).decode()
    genres = get_song_genres(track_id)
    # Keep only the tracks that appear in the genre dataset
    if genres:
        to_add = [('track_id', track_id), ('genres', genres), ('artist_name', get_artist_name(filename)), ('title', get_title(filename)), ('year', get_year(filename)), ('lyrics', "")]
        rows.append(OrderedDict(to_add))

# DataFrame.append was removed in pandas 2.0, so we build the dataframe once from the collected rows
data = pd.DataFrame(rows)
data.set_index('track_id', inplace=True)
data.to_csv('data/data.csv')
"""

data = pd.read_csv('data/data.csv').set_index('track_id')

Finally, we need to obtain lyrics for our tracks. For this, we have found two datasets. Both contain the artist, track title and lyrics, which we read in the following code snippets. We try to get the lyrics from both datasets, but it is possible that neither of them contains the lyrics for some of our tracks. For this reason, we also look at genius.com, a website that hosts a large collection of lyrics.

In [None]:
lyrics_df1 = pd.read_csv('lyrics/songdata1.csv')
lyrics_df1.set_index(['artist', 'song'], inplace=True)

In [None]:
lyrics_df2_raw = pd.read_csv('lyrics/songdata2.csv', na_filter=False)
lyrics_df2 = lyrics_df2_raw[['song', 'artist', 'lyrics']].set_index(['artist', 'song'])

In [None]:
def get_lyrics_csv1(artist_name, title):
    if (artist_name, title) in lyrics_df1.index:
        return lyrics_df1.loc[artist_name, title].values[0][1]
    else:
        return None

In [None]:
def get_lyrics_csv2(artist_name, title):
    index_artist_name = artist_name.lower().replace(' ', '-')
    index_title = title.lower().replace(' ', '-')
    if (index_artist_name, index_title) in lyrics_df2.index:
        lyrics = lyrics_df2.loc[index_artist_name, index_title].values[0][0]
        if len(lyrics) == 0:
            return None
        else:
            return lyrics
    else:
        return None
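
The second lookup differs from the first because the artist and song are lowercased and hyphenated before being used as keys. The sketch below illustrates that normalisation with a plain dictionary in place of `lyrics_df2`. The entries and the helper names `to_key` and `find_lyrics` are hypothetical.

```python
def to_key(text):
    """Normalise a name the way get_lyrics_csv2 does: lowercase, spaces to hyphens."""
    return text.lower().replace(" ", "-")


# Hypothetical entries standing in for lyrics_df2, keyed by (artist, song).
sample_lyrics = {
    ("some-artist", "some-song"): "la la la",
    ("some-artist", "empty-song"): "",
}


def find_lyrics(artist_name, title):
    """Return the lyrics, or None when they are missing or empty."""
    lyrics = sample_lyrics.get((to_key(artist_name), to_key(title)))
    return lyrics or None


print(find_lyrics("Some Artist", "Some Song"))
print(find_lyrics("Some Artist", "Empty Song"))
```

Treating an empty string like a missing entry mirrors the `len(lyrics) == 0` check above, so the caller can fall through to the next source.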

In [None]:
def get_lyrics(artist_name, title):
    lyrics = get_lyrics_csv1(artist_name, title)
    if lyrics:
        return lyrics
    
    lyrics = get_lyrics_csv2(artist_name, title)
    if lyrics:
        return lyrics
    
    return ""

We make a new dataframe that contains lyrics information for our previous data. If the lyrics are not found in either of the lyrics datasets, we generate the genius.com url to search for that song's lyrics.

In [None]:
#Match lyrics
data_lyrics = data.copy()
urls = {}
i = 1
for index, row in data.iterrows():
    lyrics = get_lyrics(row['artist_name'], row['title'])
    
    if lyrics == "":
        url = (row['artist_name'].lower().replace(' ', '-') + '-' + re.sub(r'\([^)]*\)', '', row['title']).rstrip().lower().replace(' ', '-') + '-lyrics').capitalize().replace("'", '')
        urls[index] = 'https://genius.com/' + url
    
    print(i, '/', data.shape[0])
    i += 1
    data_lyrics.loc[index, 'lyrics'] = lyrics
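
The URL construction in the loop above is dense, so here is the same logic unpacked into a small function. The name `genius_url` is ours, the artist and title are only an example, and the resulting URL is a guess at the page address: nothing here checks that the page exists.

```python
import re


def genius_url(artist_name, title):
    """Build the genius.com URL guessed for a song, as in the loop above."""
    artist_slug = artist_name.lower().replace(" ", "-")
    # Drop parenthesised parts of the title, e.g. "(Remastered)".
    clean_title = re.sub(r"\([^)]*\)", "", title).rstrip()
    title_slug = clean_title.lower().replace(" ", "-")
    slug = (artist_slug + "-" + title_slug + "-lyrics").capitalize().replace("'", "")
    return "https://genius.com/" + slug


print(genius_url("The Beatles", "Let It Be (Remastered)"))
# https://genius.com/The-beatles-let-it-be-lyrics
```

Note that `capitalize()` uppercases only the first character and lowercases the rest, which is why only the first letter of the slug is capitalised.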

The genius.com URLs are written to a file so that they can be fed into a scraper.

In [None]:
with open('data/urls', 'w') as urls_files:
    for index, url in urls.items():
        print(index, url, file=urls_files)

At this point, we run our scraper, which uses `scrapy`. This is not done in this notebook; the scraper code is in the `scrapper` folder of this repository. The scraper produces a file that contains the track ids together with the lyrics found on genius.com.

The resulting file is then read and its data is added to our data.

In [None]:
import json
with open('data/missing_lyrics.json') as lyrics_file:
    lyrics_json = json.load(lyrics_file)
    for item in lyrics_json:
        for index, lyrics in item.items():
            data_lyrics.loc[index, 'lyrics'] = lyrics
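
Judging from the loop above, the scraper output is a JSON list in which each item maps a track id to its lyrics. The following sketch shows that shape on made-up data and merges it into a plain dictionary. The ids and lyrics are placeholders, not real scraper output.

```python
import json

# Hypothetical content in the shape the loop above expects from data/missing_lyrics.json.
raw = '[{"TRACK_A": "some lyrics"}, {"TRACK_B": "other lyrics"}]'

scraped = {}
for item in json.loads(raw):
    # Each item is a one-entry dictionary: {track_id: lyrics}.
    for track_id, lyrics in item.items():
        scraped[track_id] = lyrics

print(scraped)
```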

We save the resulting data in a file for convenience. This is the final state of our data and contains everything we need.

In [None]:
data_lyrics.to_csv('data/data_lyrics.csv')

Now let's see how many songs in the Million Song Subset have lyrics.

In [None]:
data_lyrics = data_lyrics[data_lyrics.lyrics != '']

In [None]:
data_lyrics.shape

In [None]:
data_lyrics

At 235 songs, we have a good working dataset for the next part of the project.

# Analysis

Our analysis will follow the steps below.

### Language Detection

Since the lyrics we find might be in languages other than English, we need to identify the language of each song's lyrics. To find themes or to compare the words used across songs, it is important that we work with lyrics in a single language. We know that our dataset contains lyrics in other languages, so we keep only the English songs to obtain a meaningful analysis in the next part.

This is done using [langdetect](https://pypi.python.org/pypi/langdetect?).
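
A minimal sketch of how this filtering step might look is given below. It is an assumption about the design, not the final implementation: the detector is passed in as a function so that the example runs without `langdetect` installed, and `fake_detect` is a toy stand-in for it.

```python
def keep_english(lyrics_by_track, detect):
    """Keep only the tracks whose lyrics `detect` labels as English ('en')."""
    english = {}
    for track_id, lyrics in lyrics_by_track.items():
        try:
            language = detect(lyrics)
        except Exception:
            # Assumption: a detector may raise on text it cannot classify; skip those tracks.
            continue
        if language == "en":
            english[track_id] = lyrics
    return english


def fake_detect(text):
    """Toy detector used only to make this sketch runnable without langdetect."""
    if not text.strip():
        raise ValueError("no text to classify")
    return "fr" if "bonjour" in text.lower() else "en"


# Hypothetical lyrics keyed by track id.
sample = {"TRACK_A": "hello my love", "TRACK_B": "Bonjour mon amour", "TRACK_C": "  "}
english_only = keep_english(sample, fake_detect)
print(sorted(english_only))
```

With the library installed, `from langdetect import detect` would provide the function to pass in place of `fake_detect`. Its results are not deterministic by default, so setting `DetectorFactory.seed` beforehand is advisable for reproducible runs.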

### Lyrics Processing

We would like to extract different themes from our lyrics. Beyond seeing how the use of some words evolves over time and varies with the genres of the songs, it is interesting to see the themes or sentiments that a song's lyrics portray. For this, we will use Natural Language Processing (NLP) libraries to extract this information for each track.

From the lyrics that we have, we apply the bag-of-words model and only keep the interesting (meaningful) words.
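
As a rough illustration of the bag-of-words step, the sketch below counts the words of a lyric and drops a few common ones. The tiny stop word list and the tokenisation rule are assumptions chosen for brevity. A real run would likely rely on the stop word list of an NLP library.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stop word list.
STOP_WORDS = {"the", "a", "and", "i", "you", "to", "of", "in", "my", "me", "it", "is"}


def bag_of_words(lyrics):
    """Count the words of `lyrics`, ignoring case, punctuation and stop words."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


bag = bag_of_words("I love you and you love me, love is in the air")
print(bag.most_common(2))
# [('love', 3), ('air', 1)]
```

The word order is discarded on purpose: only the counts are needed to compare vocabulary across songs.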

To find the general sentiment of a song's lyrics, we use [TextBlob](https://textblob.readthedocs.io/en/dev/).

And finally, to find the themes of our songs, we use [Gensim](https://radimrehurek.com/gensim/). 

### Aggregation

Ultimately, we will want to see our tracks grouped by the genres and the year of their release. This way we can compare the words used as well as themes portrayed in the tracks depending on their genres and their evolution over time. For this we aggregate the data on genre and year.
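
The aggregation could be sketched as follows, assuming each track already carries its word counts. The sample tracks are made up, and counting a multi-genre track once under each of its genres is our assumption rather than a settled choice.

```python
from collections import Counter, defaultdict

# Hypothetical tracks: genres joined by "&" as in our data, plus per-track word counts.
tracks = [
    {"genres": "pop&rock", "year": 1999, "words": Counter({"love": 3, "air": 1})},
    {"genres": "pop", "year": 1999, "words": Counter({"love": 1, "night": 2})},
    {"genres": "rock", "year": 2005, "words": Counter({"road": 4})},
]


def aggregate_by_genre_and_year(tracks):
    """Sum the word counts of all tracks sharing a (genre, year) pair."""
    totals = defaultdict(Counter)
    for track in tracks:
        # A track with several genres contributes to each of them.
        for genre in track["genres"].split("&"):
            totals[(genre, track["year"])].update(track["words"])
    return totals


totals = aggregate_by_genre_and_year(tracks)
print(totals[("pop", 1999)]["love"])
```

The same grouping can be done with a pandas `groupby` on the genre and year columns once the genres have been split into one row per genre.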

# Visualization

For the visualization, we would like to be able to explore our data. Our plan is to be able to enter a word and obtain a graph that shows the songs that contain this word. The songs would be identifiable by their genre and organised so that we can see the year of their release. The same would be possible for themes. This way we could explore interesting ideas about the lyrical content of songs over the years and obtain meaningful insights about them.