# Obtaining lyrics for songs in Million Song Dataset, Attempt #2


This notebook describes the process of obtaining our dataset of song lyrics with genre annotations. It makes use of two datasets built on the [Million Song Dataset (MSD)](http://millionsongdataset.com/): the [musiXmatch dataset](http://millionsongdataset.com/musixmatch/) and the [tagtraum genre annotations for the Million Song Dataset](https://www.tagtraum.com/msd_genre_datasets.html).

In [1]:
import numpy as np
import pandas as pd
import os
from io import BytesIO
from zipfile import ZipFile
import requests

Let's define some helpers:

In [3]:
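def get_remote_zip(url):
  """Download a ZIP archive and open it in memory."""
  resp = requests.get(url)
  resp.raise_for_status()  # fail early on HTTP errors
  return ZipFile(BytesIO(resp.content))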

In [4]:
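def get_remote_textfile(url):
  """Download a text file and return its decoded content."""
  resp = requests.get(url)
  resp.raise_for_status()  # fail early on HTTP errors
  return resp.text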

In [5]:
data_dir = "data"

## Load data with MSD track ID -> musiXmatch track ID mappings

We load the `msd_to_mxm.csv` file we generated earlier (see the `get_mxm_tids.ipynb` notebook for details):

In [6]:
msd_to_mxm_path = os.path.join(data_dir, "msd_to_mxm.csv")
msd_to_mxm = pd.read_csv(msd_to_mxm_path)
msd_to_mxm

Unnamed: 0,msd_tid,mxm_tid,is_test
0,TRAAAAV128F421A322,4623710,0
1,TRAAABD128F429CF47,6477168,0
2,TRAAAED128E0783FAB,2516445,0
3,TRAAAEF128F4273421,3759847,0
4,TRAAAEW128F42930C0,3783760,0
...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1
237658,TRZZXOQ128F932A083,4292070,1
237659,TRZZXVN128F93285B4,7528751,1
237660,TRZZYLF128F9316CAB,3748433,1


As we can see, this file also contains information about the train/test split suggested by the authors of the musiXmatch dataset.

## Add metadata from [mapping of MSD IDs to musiXmatch IDs](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip)

This "full mapping file" contains additional information about the artist name and song title from both the MSD and musiXmatch.

### Load file


In [9]:
mapping_file_path = os.path.join(data_dir, "mxm_779k_matches.txt")

# Download and extract the mapping file only if it is not cached locally
if not os.path.exists(mapping_file_path):
  mapping_zip = get_remote_zip("http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip")
  # The archive contains a single text file
  with mapping_zip.open(mapping_zip.namelist()[0]) as zipped:
    mapping_file_content = zipped.read().decode("utf-8")
  with open(mapping_file_path, "w", encoding="utf-8") as out:
    out.write(mapping_file_content)




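# A multi-character separator is only supported by the Python parser engine,
# so we request it explicitly instead of relying on the fallback (which warns).
mxm_mapping_full = pd.read_table(
  mapping_file_path,
  skiprows=18,  # skip the header lines at the top of the file
  names=["msd_tid", "msd_artist_name", "msd_title", "mxm_tid", "mxm_artist_name", "mxm_title"],
  sep="<SEP>",
  engine="python",
)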
mxm_mapping_full



Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
0,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan
1,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
4,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall
...,...,...,...,...,...,...
779051,TRYYYZM128F428E804,SKYCLAD,Inequality Street,788003,Skyclad,Inequality Street
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees
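
To make the file layout concrete, here is a minimal sketch of how a single data line could be split by hand. It assumes that every data line holds exactly six fields separated by the literal string `<SEP>`, in the order given by the `names` argument above. The helper `parse_mapping_line` and the constant `FIELDS` are our own illustrative names, not part of the dataset tooling.

```python
# Column names used in this notebook for the six fields of a data line.
FIELDS = ["msd_tid", "msd_artist_name", "msd_title",
          "mxm_tid", "mxm_artist_name", "mxm_title"]

def parse_mapping_line(line):
    """Split one '<SEP>'-delimited data line into a dict keyed by FIELDS."""
    parts = line.rstrip("\n").split("<SEP>")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["mxm_tid"] = int(record["mxm_tid"])  # musiXmatch IDs are integers
    return record

# Sample line rebuilt from the first row of the table above.
sample = ("TRMMMKD128F425225D<SEP>Karkkiautomaatti<SEP>Tanssi vaan"
          "<SEP>4418550<SEP>Karkkiautomaatti<SEP>Tanssi vaan")
record = parse_mapping_line(sample)
print(record["msd_tid"], record["mxm_tid"])
```

In practice `pd.read_table` does this for the whole file at once; the sketch is only meant to show what the separator and column names refer to.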


### Merge with raw ID list

Note that we merge on the MSD ID. Merging on the musiXmatch ID would re-introduce duplicates: the full mapping list still contains them, whereas they were filtered out of the [musiXmatch dataset SQLite database file](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_dataset.db) that the `msd_to_mxm.csv` file is based on.
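
The effect of the merge key can be illustrated on toy data. The sketch below uses invented placeholder IDs (`TR_A`, `TR_B`, `TR_C`) and titles, not real rows from either file, and assumes the situation described above: the ID list has one row per track, while the full mapping contains two MSD tracks pointing at the same musiXmatch ID.

```python
import pandas as pd

# Toy stand-in for msd_to_mxm.csv: one row per track, duplicates already removed.
ids = pd.DataFrame({"msd_tid": ["TR_A", "TR_B"], "mxm_tid": [1, 2]})

# Toy stand-in for the full mapping: two MSD tracks share musiXmatch ID 1.
full = pd.DataFrame({
    "msd_tid": ["TR_A", "TR_B", "TR_C"],
    "mxm_tid": [1, 2, 1],
    "msd_title": ["Song", "Other", "Song (Live)"],
})

on_msd = pd.merge(ids, full, on="msd_tid")
on_mxm = pd.merge(ids, full, on="mxm_tid")

print(len(on_msd))  # one row per entry in ids
print(len(on_mxm))  # one extra row, because mxm_tid 1 matches twice

# Optional safety net: ask pandas to verify that the merge is one-to-one.
try:
    pd.merge(ids, full, on="mxm_tid", validate="one_to_one")
except pd.errors.MergeError as err:
    print("not a one-to-one merge:", err)
```

Passing `validate="one_to_one"` to the real merge on `msd_tid` would be a cheap way to confirm that no rows are duplicated.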

In [23]:
merged = pd.merge(msd_to_mxm, mxm_mapping_full, on="msd_tid")

In [24]:
merged

Unnamed: 0,msd_tid,mxm_tid_x,is_test,msd_artist_name,msd_title,mxm_tid_y,mxm_artist_name,mxm_title
0,TRAAAAV128F421A322,4623710,0,Western Addiction,A Poor Recipe For Civic Cohesion,4623710,Western Addiction,A Poor Recipe for Civic Cohesion
1,TRAAABD128F429CF47,6477168,0,The Box Tops,Soul Deep,6477168,The Box Tops,Soul Deep
2,TRAAAED128E0783FAB,2516445,0,Jamie Cullum,It's About Time,2516445,Jamie Cullum,It's About Time
3,TRAAAEF128F4273421,3759847,0,Adam Ant,Something Girls,3759847,Adam Ant,Something Girls
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),3783760,Broken Spindles,Burn My Body
...,...,...,...,...,...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1,Fragma,Toca Me,1265451,Fragma,Toca Me
237658,TRZZXOQ128F932A083,4292070,1,Riverside,After,4292070,Riverside,After
237659,TRZZXVN128F93285B4,7528751,1,ASP,Abschied,7528751,ASP,Abschied
237660,TRZZYLF128F9316CAB,3748433,1,Biagio Antonacci,Non Cambiare Tu,3748433,Biagio Antonacci,Non cambiare tu


In [25]:
len(merged[merged.mxm_tid_x != merged.mxm_tid_y])

0

As we can see, the `mxm_tid` columns from both dataframes are exactly the same, so we can drop one of them and rename the remaining column back to `mxm_tid`:

In [27]:
merged = merged.drop(merged.filter(regex='_y$').columns, axis=1).rename(columns={"mxm_tid_x": "mxm_tid"})
merged

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
0,TRAAAAV128F421A322,4623710,0,Western Addiction,A Poor Recipe For Civic Cohesion,Western Addiction,A Poor Recipe for Civic Cohesion
1,TRAAABD128F429CF47,6477168,0,The Box Tops,Soul Deep,The Box Tops,Soul Deep
2,TRAAAED128E0783FAB,2516445,0,Jamie Cullum,It's About Time,Jamie Cullum,It's About Time
3,TRAAAEF128F4273421,3759847,0,Adam Ant,Something Girls,Adam Ant,Something Girls
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),Broken Spindles,Burn My Body
...,...,...,...,...,...,...,...
237657,TRZZXFY128F9342D0E,1265451,1,Fragma,Toca Me,Fragma,Toca Me
237658,TRZZXOQ128F932A083,4292070,1,Riverside,After,Riverside,After
237659,TRZZXVN128F93285B4,7528751,1,ASP,Abschied,ASP,Abschied
237660,TRZZYLF128F9316CAB,3748433,1,Biagio Antonacci,Non Cambiare Tu,Biagio Antonacci,Non cambiare tu


### Check how often lowercase artist/track names don't match despite actually being the same track

In [32]:
merged[merged.msd_artist_name.str.lower() != merged.mxm_artist_name.str.lower()]

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
30,TRAACIR128F42963AC,6275430,0,Number Twelve Looks Like You,Cradle the Crater,The Number Twelve Looks Like You,Cradle in the Crater
39,TRAADCQ128F93436C3,1885215,0,Diomedes Diaz,El Verdadero Culpable,Diomedes Díaz,El Verdadero Culpable
40,TRAADKA12903CD2511,2288970,0,BLESTeNATION,They're Coming For You,Pete Shelley,They're Coming For You
54,TRAAEEQ128F42180B2,3561951,0,Explicit Samouraï,X.plicit sentence,Explicit Samourai,X.plicit sentence
56,TRAAEJH128E0785506,1018402,0,Hank Williams Jr.,Tuesday's Gone (Remastered Album Version),"Hank Williams, Jr.",Tuesday's Gone
...,...,...,...,...,...,...,...
237631,TRZZFDR128F14687CF,2534997,1,Bebe And Cece Winans,Celebrate New Life,BeBe & CeCe Winans,Celebrate New Life
237636,TRZZJFS128F422860B,4624085,1,Mantovani,O sole mio,Me First and the Gimme Gimmes,O sole mio
237645,TRZZPMG128F4228DC3,2331337,1,Cliffhanger,Born Again,Born Against,Born Again
237649,TRZZQHH128F1495208,2966726,1,Kierra Sheard,Done Did It,Kierra Kiki Sheard,Done Did It


In [31]:
merged[merged.msd_title.str.lower() != merged.mxm_title.str.lower()]

Unnamed: 0,msd_tid,mxm_tid,is_test,msd_artist_name,msd_title,mxm_artist_name,mxm_title
4,TRAAAEW128F42930C0,3783760,0,Broken Spindles,Burn My Body (Album Version),Broken Spindles,Burn My Body
7,TRAAAHJ128F931194C,5133845,0,Devotchka,The Last Beat Of My Heart (b-side),DeVotchKa,The Last Beat of My Heart
8,TRAAAHZ128E0799171,1619153,0,Snoop Dogg,The One And Only (Edited),Snoop Dogg,The One and Only
9,TRAAAJG128F9308A25,8525084,0,Malvina Reynolds,Tungsten (only issued previously on 45),Malvina Reynolds,Bitter Rain
10,TRAAAOF128F429C156,2973058,0,The Bonzo Dog Band,King Of Scurf (2007 Digital Remaster),The Bonzo Dog Band,King Of Scurf
...,...,...,...,...,...,...,...
237634,TRZZIIE128F92F7082,1137055,1,Bodyjar,Make A Difference (Live),Bodyjar,Make a Difference
237635,TRZZJBT12903CC4921,1180508,1,LITTLE TEXAS,A Night I'll Never Remember (Album Version),Little Texas,A Night I'll Never Remember
237637,TRZZKAB128F92E0BDC,1289726,1,Nekromantix,Devile smile,Nekromantix,Devil Smile
237654,TRZZUKM12903CB42AC,3911404,1,Kids Like Us,Dog Food (Live),Kids Like Us,Dog Food


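We see that the researchers did a pretty good job: a large number of the matched songs differ in artist name or title between the two sources (accents, punctuation, suffixes such as "(Album Version)" or "(Live)"), so they would not have been found with a very simple matching strategy such as comparing lowercased strings.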
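
To get a feeling for why simple string comparison falls short, here is a small sketch of a slightly more forgiving comparison. The `normalize` helper is hypothetical and is not the matching procedure used by the dataset authors; it only lowercases, strips accents, and drops parenthesised suffixes. The example pairs are taken from the tables above.

```python
import re
import unicodedata

def normalize(name):
    """Lowercase, strip accents, drop parenthesised suffixes, collapse spaces."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove parenthesised notes such as "(Album Version)" or "(Live)".
    no_parens = re.sub(r"\s*\([^)]*\)", "", no_accents)
    return re.sub(r"\s+", " ", no_parens).strip().lower()

pairs = [
    ("Diomedes Diaz", "Diomedes Díaz"),
    ("Burn My Body (Album Version)", "Burn My Body"),
    ("Number Twelve Looks Like You", "The Number Twelve Looks Like You"),
]
for msd_name, mxm_name in pairs:
    print(msd_name, "|", mxm_name, "|", normalize(msd_name) == normalize(mxm_name))
```

The first two pairs agree after normalization, while the third still differs because of the leading "The", which shows that even this extra effort would miss some of the matches.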

## Demo: fetching lyrics using musiXmatch API

This is a quick demo of how we could finally get lyrics using the musiXmatch API.

We will need a musiXmatch API key. It should be stored in a `.env` file inside of this directory. The file content should look like this: 
```
MUSIXMATCH_API_KEY=<your key>
```

where `<your key>` is replaced with the actual API key, which can be obtained after creating a musiXmatch developer account (see the API [docs](https://developer.musixmatch.com/documentation) for details).

We also need to make sure that python-dotenv is installed so that we can load the environment variable for the API key:

In [35]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-0.21.0-py3-none-any.whl (18 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-0.21.0


In [41]:
from dotenv import load_dotenv

load_dotenv()
musixmatch_api_key = os.getenv("MUSIXMATCH_API_KEY")
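
Since `os.getenv` silently returns `None` when the variable is missing, a misplaced `.env` file would only surface later as a failed API request. A minimal sketch of an early check is shown below; the helper name `require_env` is our own and not part of `python-dotenv`.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if unset/empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file in this directory")
    return value

# Usage (commented out so the sketch runs without a key):
# musixmatch_api_key = require_env("MUSIXMATCH_API_KEY")
```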

Pick some arbitrary track:

In [42]:
track_id = merged.mxm_tid[0]
track_id

4623710

There seems to be no nice Python wrapper for interacting with the API, so we need to code the API request ourselves, which luckily isn't hard:

In [60]:
api_base_url = "https://api.musixmatch.com/ws/1.1/"

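def fetch_lyrics(track_id):
  # Pass the query parameters separately so that requests builds the URL for us
  params = {"apikey": musixmatch_api_key, "track_id": track_id}
  response = requests.get(f"{api_base_url}track.lyrics.get", params=params)
  response.raise_for_status()  # fail early on HTTP errors
  return response.json()["message"]["body"]["lyrics"]["lyrics_body"]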

fetch_lyrics(track_id)




"If patience is virtuous\nI got the temperament for temperance\nPartitioning stems and seeds\nDamn lifeless galleries\nI'll implement the elements\nYou're soiling a sacrament\nSlicing sharks and your porcelain pedestals\nSomehow, it seems so pitiful\n\nCatastrophe\nOf the highest order\nWas likeness captured?\nFeigned composure\n...\n\n******* This Lyrics is NOT for Commercial use *******"

Notice that we only get 30% of the lyrics on the free plan, which is unfortunate :/
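
If we go ahead with the truncated lyrics, we would probably want to strip the trailing disclaimer and the `...` cut-off marker before using the text. The sketch below works on an already parsed response and makes two assumptions that should be checked against the API docs: that the API reports its own status in `message.header.status_code`, and that the disclaimer line always starts with a run of asterisks. The function `extract_lyrics` and the sample payload are hypothetical.

```python
DISCLAIMER_MARK = "*******"

def extract_lyrics(payload):
    """Return cleaned lyrics text from a parsed API response, or None on failure."""
    message = payload.get("message", {})
    # Assumption: the API reports its own status in message.header.status_code.
    if message.get("header", {}).get("status_code") != 200:
        return None
    body = message.get("body")
    if not isinstance(body, dict):
        return None
    text = body.get("lyrics", {}).get("lyrics_body", "")
    # Keep only lines that are neither the disclaimer nor the "..." cut-off marker.
    lines = [ln for ln in text.split("\n")
             if not ln.startswith(DISCLAIMER_MARK) and ln.strip() != "..."]
    return "\n".join(lines).strip()

# Hypothetical payload shaped like the response shown above (lyrics shortened).
sample = {"message": {"header": {"status_code": 200},
                      "body": {"lyrics": {"lyrics_body":
    "Was likeness captured?\nFeigned composure\n...\n\n"
    "******* This Lyrics is NOT for Commercial use *******"}}}}
print(extract_lyrics(sample))
```

Returning `None` instead of raising keeps a bulk download loop simple: tracks without usable lyrics can just be skipped.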

## Add genre annotations

TODO