# Obtaining lyrics for songs in Million Song Dataset, Attempt #2


This notebook describes the process of obtaining our dataset of song lyrics with genre annotations. It makes use of two datasets built on the [Million Song Dataset (MSD)](http://millionsongdataset.com/): the [musiXmatch dataset](http://millionsongdataset.com/musixmatch/) and the [tagtraum genre annotations for the Million Song Dataset](https://www.tagtraum.com/msd_genre_datasets.html). As the musiXmatch dataset does not contain full-text lyrics, we have to fetch the lyrics ourselves using the [musiXmatch API](https://developer.musixmatch.com/). Unfortunately, the free API plan only allows us to fetch a maximum of 2000 lyrics per day, and each response is limited to the first 30% of the song's lyrics. Still, this should give us a solid basis for our project.
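To get a feeling for what the daily limit means in practice, here is a minimal sketch of how the fetching could be split into daily batches. The helper names `next_batch` and `days_needed` are hypothetical, and the sketch assumes that one API request corresponds to one fetched lyric:

```python
def next_batch(all_track_ids, already_fetched, daily_limit=2000):
    """Return up to `daily_limit` track IDs that have not been fetched yet, in order."""
    remaining = [tid for tid in all_track_ids if tid not in already_fetched]
    return remaining[:daily_limit]


def days_needed(n_tracks, daily_limit=2000):
    """Number of days needed at `daily_limit` requests per day (ceiling division)."""
    return -(-n_tracks // daily_limit)


# Toy example: five tracks, two of them already fetched, limit of two per day
print(next_batch([11, 12, 13, 14, 15], already_fetched={12, 14}, daily_limit=2))
print(days_needed(10_000))
```

Keeping the set of already fetched IDs on disk between runs would make the process resumable across days.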

In [4]:
import numpy as np
import pandas as pd
import os
from io import BytesIO
from zipfile import ZipFile
import requests

Let's define some helpers:

In [5]:
def get_remote_zip(url):
  resp = requests.get(url)
  return ZipFile(BytesIO(resp.content))

In [6]:
def get_remote_textfile(url):
  resp = requests.get(url)
  return resp.text

In [7]:
def fetch_data(url, target_path, is_zip=False):
  # Download a (possibly zipped) text file and store it at target_path
  if is_zip:
    zip_file = get_remote_zip(url)
    # Only the first member of the archive is extracted
    with zip_file.open(zip_file.namelist()[0]) as f:
      content = f.read().decode()
  else:
    content = get_remote_textfile(url)
  with open(target_path, "w") as out:
    out.write(content)


In [8]:
data_dir = "data"

## Load data with MSD track ID -> musiXmatch track ID mappings

We load the `msd_to_mxm.csv` we generated (for details on that see the `get_mxm_tids.ipynb` notebook):

In [9]:
msd_to_mxm_path = os.path.join(data_dir, "msd_to_mxm.csv")
msd_to_mxm = pd.read_csv(msd_to_mxm_path)
msd_to_mxm

Unnamed: 0,msd_tid,mxm_tid,is_test
0,TRAAAAV128F421A322,4623710,0
1,TRAAABD128F429CF47,6477168,0
2,TRAAAED128E0783FAB,2516445,0
3,TRAAAEF128F4273421,3759847,0
4,TRAAAEW128F42930C0,3783760,0
...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1
237658,TRZZXOQ128F932A083,4292070,1
237659,TRZZXVN128F93285B4,7528751,1
237660,TRZZYLF128F9316CAB,3748433,1


As we can see, this file also contains information about the train/test split suggested by the authors of the musiXmatch dataset.

## Add metadata from [mapping of MSD IDs to musiXmatch IDs](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip)

This "full mapping file" contains additional information about the artist name and song title from both the MSD and musiXmatch.

### Load file


In [10]:
mapping_file_path = os.path.join(data_dir, "mxm_779k_matches.txt")

if not os.path.exists(mapping_file_path):
  fetch_data(
    "http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip",
    mapping_file_path,
    True
  )




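# The multi-character separator is only supported by the Python parser engine,
# so we request it explicitly to avoid a ParserWarning
mxm_mapping_full = pd.read_table(
  mapping_file_path,
  skiprows=18,
  names=["msd_tid", "msd_artist_name", "msd_title", "mxm_tid", "mxm_artist_name", "mxm_title"],
  sep="<SEP>",
  engine="python"
)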
mxm_mapping_full



Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
0,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan
1,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
4,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall
...,...,...,...,...,...,...
779051,TRYYYZM128F428E804,SKYCLAD,Inequality Street,788003,Skyclad,Inequality Street
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees


### Merge with raw ID list

Note that we merge on MSD ID, as merging on musiXmatch ID would actually re-introduce duplicates that were filtered out from the full mapping list in the [musiXmatch dataset SQLite database file](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_dataset.db) that the `msd_to_mxm.csv` file is based on.
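To illustrate why the join key matters, here is a small sketch on made-up toy data (the IDs and titles are hypothetical, not taken from the dataset). It assumes, as described above, that several MSD tracks can point to the same musiXmatch ID in the full mapping. The `validate` argument of `pd.merge` could additionally serve as a safety net, since it raises an error if the key is not unique:

```python
import pandas as pd

# Hypothetical toy data: TR_A and TR_B share the musiXmatch ID 1 in the full mapping
id_list = pd.DataFrame({"msd_tid": ["TR_A", "TR_C"], "mxm_tid": [1, 2]})
full_mapping = pd.DataFrame({
    "msd_tid": ["TR_A", "TR_B", "TR_C"],
    "mxm_tid": [1, 1, 2],
    "mxm_title": ["Song", "Song", "Other"],
})

# Merging on the MSD ID keeps exactly one row per entry of the ID list
by_msd = pd.merge(id_list, full_mapping, on="msd_tid", validate="one_to_one")

# Merging on the musiXmatch ID matches TR_A against both TR_A and TR_B
by_mxm = pd.merge(id_list, full_mapping, on="mxm_tid")

print(len(by_msd), len(by_mxm))
```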

In [11]:
merged = pd.merge(msd_to_mxm, mxm_mapping_full, on="msd_tid")

In [12]:
merged

Unnamed: 0,msd_tid,mxm_tid_x,is_test,msd_artist_name,msd_title,mxm_tid_y,mxm_artist_name,mxm_title
0,TRAAAAV128F421A322,4623710,0,Western Addiction,A Poor Recipe For Civic Cohesion,4623710,Western Addiction,A Poor Recipe for Civic Cohesion
1,TRAAABD128F429CF47,6477168,0,The Box Tops,Soul Deep,6477168,The Box Tops,Soul Deep
2,TRAAAED128E0783FAB,2516445,0,Jamie Cullum,It's About Time,2516445,Jamie Cullum,It's About Time
3,TRAAAEF128F4273421,3759847,0,Adam Ant,Something Girls,3759847,Adam Ant,Something Girls
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),3783760,Broken Spindles,Burn My Body
...,...,...,...,...,...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1,Fragma,Toca Me,1265451,Fragma,Toca Me
237658,TRZZXOQ128F932A083,4292070,1,Riverside,After,4292070,Riverside,After
237659,TRZZXVN128F93285B4,7528751,1,ASP,Abschied,7528751,ASP,Abschied
237660,TRZZYLF128F9316CAB,3748433,1,Biagio Antonacci,Non Cambiare Tu,3748433,Biagio Antonacci,Non cambiare tu


In [13]:
len(merged[merged.mxm_tid_x != merged.mxm_tid_y])

0

As we can see, the `mxm_tid` columns from both dataframes are exactly the same, so we can drop one of them and rename the remaining column back to `mxm_tid`:

In [14]:
merged = merged.drop(merged.filter(regex='_y$').columns, axis=1).rename(columns={"mxm_tid_x": "mxm_tid"})
merged

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
0,TRAAAAV128F421A322,4623710,0,Western Addiction,A Poor Recipe For Civic Cohesion,Western Addiction,A Poor Recipe for Civic Cohesion
1,TRAAABD128F429CF47,6477168,0,The Box Tops,Soul Deep,The Box Tops,Soul Deep
2,TRAAAED128E0783FAB,2516445,0,Jamie Cullum,It's About Time,Jamie Cullum,It's About Time
3,TRAAAEF128F4273421,3759847,0,Adam Ant,Something Girls,Adam Ant,Something Girls
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),Broken Spindles,Burn My Body
...,...,...,...,...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1,Fragma,Toca Me,Fragma,Toca Me
237658,TRZZXOQ128F932A083,4292070,1,Riverside,After,Riverside,After
237659,TRZZXVN128F93285B4,7528751,1,ASP,Abschied,ASP,Abschied
237660,TRZZYLF128F9316CAB,3748433,1,Biagio Antonacci,Non Cambiare Tu,Biagio Antonacci,Non cambiare tu


### Check how often lowercase artist/track names don't match despite actually being the same track

In [15]:
merged[merged.msd_artist_name.str.lower() != merged.mxm_artist_name.str.lower()]

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
30,TRAACIR128F42963AC,6275430,0,Number Twelve Looks Like You,Cradle the Crater,The Number Twelve Looks Like You,Cradle in the Crater
39,TRAADCQ128F93436C3,1885215,0,Diomedes Diaz,El Verdadero Culpable,Diomedes Díaz,El Verdadero Culpable
40,TRAADKA12903CD2511,2288970,0,BLESTeNATION,They're Coming For You,Pete Shelley,They're Coming For You
54,TRAAEEQ128F42180B2,3561951,0,Explicit Samouraï,X.plicit sentence,Explicit Samourai,X.plicit sentence
56,TRAAEJH128E0785506,1018402,0,Hank Williams Jr.,Tuesday's Gone (Remastered Album Version),"Hank Williams, Jr.",Tuesday's Gone
...,...,...,...,...,...,...,...
237631,TRZZFDR128F14687CF,2534997,1,Bebe And Cece Winans,Celebrate New Life,BeBe & CeCe Winans,Celebrate New Life
237636,TRZZJFS128F422860B,4624085,1,Mantovani,O sole mio,Me First and the Gimme Gimmes,O sole mio
237645,TRZZPMG128F4228DC3,2331337,1,Cliffhanger,Born Again,Born Against,Born Again
237649,TRZZQHH128F1495208,2966726,1,Kierra Sheard,Done Did It,Kierra Kiki Sheard,Done Did It


In [16]:
merged[merged.msd_title.str.lower() != merged.mxm_title.str.lower()]

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),Broken Spindles,Burn My Body
7,TRAAAHJ128F931194C,5133845,0,Devotchka,The Last Beat Of My Heart (b-side),DeVotchKa,The Last Beat of My Heart
8,TRAAAHZ128E0799171,1619153,0,Snoop Dogg,The One And Only (Edited),Snoop Dogg,The One and Only
9,TRAAAJG128F9308A25,8525084,0,Malvina Reynolds,Tungsten (only issued previously on 45),Malvina Reynolds,Bitter Rain
10,TRAAAOF128F429C156,2973058,0,The Bonzo Dog Band,King Of Scurf (2007 Digital Remaster),The Bonzo Dog Band,King Of Scurf
...,...,...,...,...,...,...,...
237634,TRZZIIE128F92F7082,1137055,1,Bodyjar,Make A Difference (Live),Bodyjar,Make a Difference
237635,TRZZJBT12903CC4921,1180508,1,LITTLE TEXAS,A Night I'll Never Remember (Album Version),Little Texas,A Night I'll Never Remember
237637,TRZZKAB128F92E0BDC,1289726,1,Nekromantix,Devile smile,Nekromantix,Devil Smile
237654,TRZZUKM12903CB42AC,3911404,1,Kids Like Us,Dog Food (Live),Kids Like Us,Dog Food


We see that the researchers did a pretty good job: they matched many songs whose artist names or titles differ between the two sources and therefore wouldn't match if we chose a very simple matching strategy such as a case-insensitive string comparison.
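To make the idea of a "simple matching strategy" a bit more concrete, the following sketch shows a slightly less naive normalization that also removes accents and treats `&` like `and`. This is only an illustration; it is not the method the researchers used, and `normalize_name` is a hypothetical helper. The example pairs are taken from the artist table above:

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, unify '&' with 'and', and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = without_accents.lower().replace("&", "and")
    return " ".join(lowered.split())


pairs = [
    ("Diomedes Diaz", "Diomedes Díaz"),
    ("Explicit Samouraï", "Explicit Samourai"),
    ("Bebe And Cece Winans", "BeBe & CeCe Winans"),
    ("Mantovani", "Me First and the Gimme Gimmes"),
]
for msd_name, mxm_name in pairs:
    same = normalize_name(msd_name) == normalize_name(mxm_name)
    print(f"{msd_name!r} vs. {mxm_name!r}: {same}")
```

The first three pairs become equal after normalization, while the last one (a different performer of the same song) cannot be recovered by string normalization alone.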

## Demo: fetching lyrics using musiXmatch API

This is a quick demo for how we could finally get lyrics using the musiXmatch API.

We will need a musiXmatch API key. It should be stored in a `.env` file inside of this directory. The file content should look like this: 
```
MUSIXMATCH_API_KEY=<your key>
```

where `<your key>` is replaced with the actual API key that can be obtained after creating a musiXmatch developer account (for details check the API [docs](https://developer.musixmatch.com/documentation)).

We also need to make sure that python-dotenv is installed so that we can load the environment variable for the API key:

In [17]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [18]:
from dotenv import load_dotenv

load_dotenv()
musixmatch_api_key = os.getenv("MUSIXMATCH_API_KEY")

Pick some arbitrary track:

In [19]:
track_id = merged.mxm_tid[0]
track_id

4623710

There seems to be no nice Python wrapper for interacting with the API, so we need to code the API request ourselves, which luckily isn't hard:

In [20]:
api_base_url = "https://api.musixmatch.com/ws/1.1/"

def fetch_lyrics(track_id):
  # Let requests build and URL-encode the query string
  params = {"apikey": musixmatch_api_key, "track_id": track_id}
  response = requests.get(f"{api_base_url}track.lyrics.get", params=params)
  return response.json()["message"]["body"]["lyrics"]["lyrics_body"]

fetch_lyrics(track_id)


"If patience is virtuous\nI got the temperament for temperance\nPartitioning stems and seeds\nDamn lifeless galleries\nI'll implement the elements\nYou're soiling a sacrament\nSlicing sharks and your porcelain pedestals\nSomehow, it seems so pitiful\n\nCatastrophe\nOf the highest order\nWas likeness captured?\nFeigned composure\n...\n\n******* This Lyrics is NOT for Commercial use *******"

Notice that we only get 30% of the lyrics on the free plan, which is unfortunate :/
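When fetching many tracks, some requests will probably fail or return no lyrics, and the text we do get ends with a disclaimer line. Below is a sketch of how a decoded response could be handled more defensively. It only relies on the `message` → `body` → `lyrics` → `lyrics_body` path seen above; the shape of error responses is an assumption, and `extract_lyrics` and `strip_disclaimer` are hypothetical helpers:

```python
DISCLAIMER_MARKER = "******* This Lyrics is NOT for Commercial use *******"


def extract_lyrics(payload):
    """Return the lyrics text from a decoded JSON response, or None if it is missing."""
    body = payload.get("message", {}).get("body")
    # Assumption: for failed requests the body may be empty or not a dict
    if not isinstance(body, dict):
        return None
    lyrics = body.get("lyrics")
    if not isinstance(lyrics, dict):
        return None
    return lyrics.get("lyrics_body") or None


def strip_disclaimer(lyrics_body):
    """Cut off the disclaimer footer and the trailing '...' seen in the output above."""
    text = lyrics_body.split(DISCLAIMER_MARKER)[0].rstrip()
    if text.endswith("..."):
        text = text[:-3].rstrip()
    return text


# Hand-written payloads that mimic the structure used in fetch_lyrics
ok_payload = {"message": {"body": {"lyrics": {
    "lyrics_body": "Line one\nLine two\n...\n\n" + DISCLAIMER_MARKER}}}}
error_payload = {"message": {"body": []}}

print(repr(strip_disclaimer(extract_lyrics(ok_payload))))
print(extract_lyrics(error_payload))
```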

## Add genre annotations

The [tagtraum genre annotations for the MSD](https://www.tagtraum.com/msd_genre_datasets.html) are available in three variants (different combined datasets, abbreviated with CD):

| Name | Labels  | File                      | Description                                                                                                                     |
|------|---------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| CD1  | 133,676 | msd_tagtraum_cd1.cls.zip  | Constructed from BGD, LFMGD, and Top-MAGD, same labels as Top-MAGD, contains minority votes.                |
| CD2  | 280,831 | msd_tagtraum_cd2.cls.zip  | Based on modified BGD and LFMGD. Additional labels Metal and Punk, International = World, removed Vocal. Some labels ambiguous. |
| CD2C | 191,401 | msd_tagtraum_cd2c.cls.zip | Same as CD2 without ambiguous annotations.                                                                                      |


For now, we'll go with CD2C, as in this dataset every song has exactly one genre annotation and this makes the classification task easier (only one target class). The other two datasets use the concept of "majority" and "minority" genres.

In [24]:
tagtraum_cd2c_file_path = f"{data_dir}/msd_tagtraum_cd2c.cls"
tagtraum_cd2c_zip_link = f"https://www.tagtraum.com/genres/msd_tagtraum_cd2c.cls.zip"


if not os.path.exists(tagtraum_cd2c_file_path):
  fetch_data(
    tagtraum_cd2c_zip_link,
    tagtraum_cd2c_file_path,
    True
  )

In [25]:
tagtraum_cd2c = pd.read_table(tagtraum_cd2c_file_path, sep="\t", names=["msd_tid", "genre"], skiprows=7)

In [30]:
tagtraum_cd2c

Unnamed: 0,msd_tid,genre
0,TRAAAAK128F9318786,Rock
1,TRAAAAW128F429D538,Rap
2,TRAAADJ128F4287B47,Rock
3,TRAAADZ128F9348C2E,Latin
4,TRAAAED128E0783FAB,Jazz
...,...,...
191396,TRZZZMY128F426D7A2,Reggae
191397,TRZZZRJ128F42819AF,Rock
191398,TRZZZUK128F92E3C60,Folk
191399,TRZZZZD128F4236844,Rock


Now we can merge this genre information with the musiXmatch data we already have!

In [31]:
songs_and_genres = pd.merge(merged, tagtraum_cd2c, on="msd_tid")
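Note that `pd.merge` performs an inner join by default, so songs without a genre annotation are dropped silently. A quick way to see how many rows fall into which category is an outer join with `indicator=True`. The sketch below uses hypothetical toy IDs rather than the real data:

```python
import pandas as pd

# Hypothetical toy data: TR_A has no genre, TR_D has no lyrics mapping
songs = pd.DataFrame({"msd_tid": ["TR_A", "TR_B", "TR_C"], "mxm_tid": [1, 2, 3]})
genres = pd.DataFrame({"msd_tid": ["TR_B", "TR_C", "TR_D"], "genre": ["Rock", "Jazz", "Rap"]})

# Default inner join: only IDs present in both frames survive
inner = pd.merge(songs, genres, on="msd_tid")

# Outer join with an indicator column that records where each row came from
outer = pd.merge(songs, genres, on="msd_tid", how="outer", indicator=True)
counts = outer["_merge"].value_counts()

print(len(inner))
print(counts)
```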

## Quick "evaluation" of result 

In [32]:
songs_and_genres

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title,genre
0,TRAAAED128E0783FAB,2516445,0,Jamie Cullum,It's About Time,Jamie Cullum,It's About Time,Jazz
1,TRAAAEF128F4273421,3759847,0,Adam Ant,Something Girls,Adam Ant,Something Girls,Rock
2,TRAAAGF12903CEC202,5493388,0,Halvdan Sivertsen,Små Ord,Halvdan Sivertsen,Små ord,Pop
3,TRAAAHZ128E0799171,1619153,0,Snoop Dogg,The One And Only (Edited),Snoop Dogg,The One and Only,Rap
4,TRAAARJ128F9320760,1422131,0,Planet P Project,Pink World,Planet P Project,Pink World,Rock
...,...,...,...,...,...,...,...,...
83187,TRZZSXX128F93262C9,2736386,1,Tarkio,My Mother Was A Chinese Trapeze Artist,Tarkio,My Mother Was a Chinese Trapeze Artist,Rock
83188,TRZZUTD12903CADD68,8852681,1,Kid Cudi,Solo Dolo (nightmare),Kid Cudi,Solo Dolo (Nightmare),Rap
83189,TRZZWEM128F428BD9A,1441760,1,The Buzzcocks,Operator's Manual,Buzzcocks,Operator's Manual,Punk
83190,TRZZXOQ128F932A083,4292070,1,Riverside,After,Riverside,After,Rock


The values we can see here look very plausible (a quick Google search for the artists shows that the genres fit OK), nice!

Let's try to get an idea of which data source describes the artist name and track title better:

In [57]:
non_matching_artists = songs_and_genres.loc[
  (songs_and_genres.msd_artist_name.str.lower() != songs_and_genres.mxm_artist_name.str.lower()),
  ["msd_artist_name", "mxm_artist_name"]]

In [58]:
non_matching_artists

Unnamed: 0,msd_artist_name,mxm_artist_name
54,Isao Tomita,Cirque du Soleil
66,Ace Enders & A Million Different People,Ace Enders and A Million Different People
84,The B-52's,The B-52s
119,The Pernice Brothers,Pernice Brothers
120,Paul Revere & The Raiders,Paul Revere and The Raiders
...,...,...
83146,Tilly & The Wall,Tilly and the Wall
83150,Zen Cafe,Zen Café
83152,Jurassic 5 / Nelly Furtado,Jurassic 5
83162,Manolo Garcia,Manolo García


The first entry looks like a blunder, let's look at the details:

In [64]:
songs_and_genres.loc[54]

msd_tid            TRAAKHQ128F4289A3F
mxm_tid                       1229572
is_test                             0
msd_artist_name           Isao Tomita
msd_title                      Boléro
mxm_artist_name      Cirque du Soleil
mxm_title                      Boléro
genre                      Electronic
Name: 54, dtype: object

It looks like it's the same song, performed by a different artist. Actually, if we search for the track on YouTube, we find that there are several versions of it. Some sound electronic, but there are also some that don't have those electronic elements. This is a good example of how difficult it can be to find a clear genre label for a song.

If we had the time, we could look into the data in greater detail and analyze how common it is in general that the song name matches, but the artist doesn't. But for now we assume that the researchers involved in linking the data got it right for the vast majority of tracks.
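Such an analysis could start roughly like the sketch below, which computes the share of rows where the lowercased titles agree but the lowercased artist names differ. It runs on a tiny toy frame: the first two rows reuse values from the tables above, while the titles in the third row are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "msd_artist_name": ["Isao Tomita", "Fragma", "The B-52's"],
    "mxm_artist_name": ["Cirque du Soleil", "Fragma", "The B-52s"],
    "msd_title": ["Boléro", "Toca Me", "Some Title"],        # last title is hypothetical
    "mxm_title": ["Boléro", "Toca Me", "Another Title"],     # last title is hypothetical
})

same_title = toy.msd_title.str.lower() == toy.mxm_title.str.lower()
different_artist = toy.msd_artist_name.str.lower() != toy.mxm_artist_name.str.lower()

# Rows where only the artist differs are candidates for "same song, other performer"
suspicious = toy[same_title & different_artist]
share = (same_title & different_artist).mean()

print(suspicious[["msd_artist_name", "mxm_artist_name", "msd_title"]])
print(share)
```

On the real data, the same two masks could be applied to `songs_and_genres` instead of `toy`.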

In [59]:
non_matching_titles = songs_and_genres.loc[
  (songs_and_genres.msd_title.str.lower() != songs_and_genres.mxm_title.str.lower()),
  ["msd_title", "mxm_title"]]

In [60]:
non_matching_titles

Unnamed: 0,msd_title,mxm_title
3,The One And Only (Edited),The One and Only
25,Fall On Me (Live),Fall on Me
33,You Can,A Little Too Not Over You
44,My Eyes Burn (Album Version),My Eyes Burn
56,Come On You Slags,Come On You Slags!
...,...,...
83154,Ne Cheama Pamintul,Ne cheamă pămîntul
83158,Expectations,Climb On (a Back That's Strong)
83160,I Cry (LP Version),I Cry
83167,Got To Get You Off My Mind (LP Version),Got to Get You Off My Mind


It looks like in general the MSD titles are more specific about the version of a song (i.e. Album Version, Edited, LP Version). But as we're interested in the lyrical content, it makes sense to use the less specific general song names from musiXmatch.
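If we ever wanted to compare the titles of both sources more fairly, the version information could be removed first. The following sketch strips a trailing parenthesized version tag; the list of tags is an assumption based only on the few examples visible above and is certainly incomplete, and `strip_version_suffix` is a hypothetical helper:

```python
import re

# Assumption: version tags appear as a trailing "(...)" with one of these labels
VERSION_SUFFIX = re.compile(
    r"\s*\((?:album version|lp version|edited|live)\)\s*$",
    re.IGNORECASE,
)


def strip_version_suffix(title):
    """Remove a trailing version tag such as '(Album Version)' from a title."""
    return VERSION_SUFFIX.sub("", title)


for title in ["The One And Only (Edited)", "I Cry (LP Version)", "Solo Dolo (nightmare)"]:
    print(repr(strip_version_suffix(title)))
```

Parentheses that are part of the actual title, as in the last example, are left untouched.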

Also, there's a title that does not match at all. What's going on there?

In [63]:
songs_and_genres.loc[83158]

msd_tid                         TRZXMVR128F92EF0EF
mxm_tid                                     557041
is_test                                          1
msd_artist_name                     Caedmon's Call
msd_title                             Expectations
mxm_artist_name                     Caedmon's Call
mxm_title          Climb On (a Back That's Strong)
genre                                          Pop
Name: 83158, dtype: object

OK, at least the name of the band is the same. But this is just an example that shows us that the data we got is still not perfect. Actually, no real-world data can be, so we'll live with that.

In [35]:
songs_and_genres.to_csv(os.path.join(data_dir, "songs_and_genres.csv"), index=False)

Did the train/test split change because of the join?

In [39]:
merged.is_test.sum()/len(merged)

0.11420841362944013

In [38]:
songs_and_genres.is_test.sum()/len(songs_and_genres)

0.11459034522550245

Not really: the share of test songs is about 11.4% before the join and about 11.5% after it. That's also quite nice, as it might be an indicator that the train/test split is indeed a good random split.
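A possible next step would be to check the test share per genre as well, since a random split should give roughly similar shares across genres. Here is a minimal sketch on hypothetical toy data; note that the mean of a 0/1 column is the same as `sum()/len()` used above:

```python
import pandas as pd

# Hypothetical toy data with a 0/1 test flag per song
toy = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "Rock", "Rap", "Rap"],
    "is_test": [0, 0, 0, 1, 0, 1],
})

overall_share = toy.is_test.mean()
share_per_genre = toy.groupby("genre").is_test.mean()

print(overall_share)
print(share_per_genre)
```

On the real data, `toy` would be replaced with `songs_and_genres`.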

## Probably redundant: Loading tagtraum CD1 and CD2 
TODO: think about whether we want to use that data too

In [None]:
tagtraum_cd1_file_path = f"{data_dir}/msd_tagtraum_cd1.cls"
tagtraum_cd1_zip_link = f"https://www.tagtraum.com/genres/msd_tagtraum_cd1.cls.zip"


if not os.path.exists(tagtraum_cd1_file_path):
  fetch_data(
    tagtraum_cd1_zip_link,
    tagtraum_cd1_file_path,
    True
  )



In [None]:
tagtraum_cd1 = pd.read_table(tagtraum_cd1_file_path, sep="\t", names=["msd_tid", "majority_genre", "minority_genre"], skiprows=7)
tagtraum_cd1

Unnamed: 0,msd_tid,majority_genre,minority_genre
0,TRAAAAK128F9318786,Pop_Rock,
1,TRAAAAW128F429D538,Rap,
2,TRAAABD128F429CF47,Pop_Rock,
3,TRAAAED128E0783FAB,Jazz,Vocal
4,TRAAAEF128F4273421,Pop_Rock,
...,...,...,...
133671,TRZZZFV128F4259A2B,Pop_Rock,Electronic
133672,TRZZZHL128F9329CFB,Pop_Rock,
133673,TRZZZMY128F426D7A2,Reggae,Pop_Rock
133674,TRZZZYR128F92F0796,Pop_Rock,


In [27]:
tagtraum_cd2_file_path = f"{data_dir}/msd_tagtraum_cd2.cls"
tagtraum_cd2_zip_link = f"https://www.tagtraum.com/genres/msd_tagtraum_cd2.cls.zip"


if not os.path.exists(tagtraum_cd2_file_path):
  fetch_data(
    tagtraum_cd2_zip_link,
    tagtraum_cd2_file_path,
    True
  )

In [28]:
tagtraum_cd2 = pd.read_table(tagtraum_cd2_file_path, sep="\t", names=["msd_tid", "majority_genre", "minority_genre"], skiprows=7)


In [29]:
tagtraum_cd2

Unnamed: 0,msd_tid,majority_genre,minority_genre
0,TRAAAAK128F9318786,Rock,
1,TRAAAAW128F429D538,Rap,
2,TRAAABD128F429CF47,Rock,RnB
3,TRAAADJ128F4287B47,Rock,
4,TRAAADZ128F9348C2E,Latin,
...,...,...,...
280826,TRZZZRJ128F42819AF,Rock,
280827,TRZZZUK128F92E3C60,Folk,
280828,TRZZZYV128F92E996D,New Age,RnB
280829,TRZZZZD128F4236844,Rock,
