# Obtaining musiXmatch IDs for songs in the Million Song Dataset
At first sight, the [musiXmatch dataset](http://millionsongdataset.com/musixmatch/) looked like the perfect fit for this group project, as it contains lyric information for around 200k songs. Unfortunately, the lyrics are only available in bag-of-words format. We therefore use the [mapping of MSD IDs to musiXmatch IDs](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip) instead to obtain the list of songs that have a musiXmatch ID.

As the dataset description also notes, the list needs further filtering: instrumental tracks have to be removed, and so do tracks that were identified as duplicates and therefore appear in the [duplicate list](http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_duplicates.txt) (for details, see [this post](http://millionsongdataset.com/blog/11-3-15-921810-song-dataset-duplicates)).

In [1]:
import numpy as np
import pandas as pd

## Load mapping of MSD IDs to musiXmatch IDs


In [10]:
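mapping = pd.read_table(
    "data/mxm_779k_matches.txt",
    skiprows=18,  # skip the first 18 lines of the file (header)
    names=["msd_tid", "msd_artist_name", "msd_title", "mxm_tid", "mxm_artist_name", "mxm_title"],
    sep="<SEP>",
    engine="python",  # multi-character separators need the Python parsing engine
)
mapping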


Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
0,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan
1,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
4,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall
...,...,...,...,...,...,...
779051,TRYYYZM128F428E804,SKYCLAD,Inequality Street,788003,Skyclad,Inequality Street
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees
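
The hard-coded `skiprows=18` depends on the exact length of the file header. A minimal sketch of an alternative, assuming the header consists of lines starting with `#` (an assumption about the file, not verified here): count the leading comment lines first and pass that number to pandas. The helper `count_leading_comment_lines` and the two sample rows are hypothetical and only serve as illustration. Because `<SEP>` is longer than one character, pandas treats it as a regular expression and parses the data with its Python engine.

```python
import io

import pandas as pd

COLUMNS = ["msd_tid", "msd_artist_name", "msd_title",
           "mxm_tid", "mxm_artist_name", "mxm_title"]


def count_leading_comment_lines(lines, prefix="#"):
    """Count the consecutive lines at the start that begin with `prefix`."""
    count = 0
    for line in lines:
        if not line.startswith(prefix):
            break
        count += 1
    return count


# Hypothetical miniature of the mapping file: two comment lines, two data rows.
sample = (
    "# hypothetical header line\n"
    "# another comment line\n"
    "TRAAA<SEP>Artist A<SEP>Song #1<SEP>123<SEP>Artist A<SEP>Song #1\n"
    "TRBBB<SEP>Artist B<SEP>Title B<SEP>456<SEP>artist b<SEP>Title B\n"
)

n_skip = count_leading_comment_lines(io.StringIO(sample))
sample_mapping = pd.read_csv(
    io.StringIO(sample),
    sep="<SEP>",
    engine="python",  # required for multi-character separators
    skiprows=n_skip,
    names=COLUMNS,
)
print(sample_mapping)
```

Counting only the leading lines, rather than passing `comment="#"`, keeps titles that contain a `#` character intact. For the real file, the same helper could be applied to the open file handle before calling `pd.read_csv` on the path.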


### Checking how often lowercased artist/track names differ even though they refer to the same track

In [12]:
mapping[mapping.msd_artist_name.str.lower() != mapping.mxm_artist_name.str.lower()]

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
5,TRMMMHY12903CB53F1,Joseph Locke,Goodbye,793273,Joseph LoDuca,Goodbye
17,TRMMMTG128F426B5BB,Craze,Fuckin Ethic People (999),620542,DJ Craze,Fuckin Ethic People (999)
34,TRMMMBU128F9305AC3,The Maytals,Night And Day,3254947,Toots & The Maytals,Night and Day
36,TRMMMNI12903CE0AF1,Lil O,My Everything [Screwed] (feat. Trae The Truth),8589413,Lil' O,My Everything (feat. Trae tha Truth)
...,...,...,...,...,...,...
779003,TRYYKCY128F932323B,Narkoi,Show Time,8499737,羅志祥,Show Time
779004,TRYYYVQ128F4264186,Jose Luis Perales,Así Te Quiero Yo,8196388,José Luis Perales,Así te quiero yo
779008,TRYYYHR128F429DA7F,Blackbyrds,Wilford's Gone,3457853,Donald Byrd And The Blackbyrds,Wilford's Gone
779030,TRYYYPH128F933D084,Gabriel Le Mar,ambient jam,3682829,Gabriel Le Mar vs. Cylancer,Ambient Jam


In [14]:
mapping[mapping.msd_title.str.lower() != mapping.mxm_title.str.lower()]

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
11,TRMMMUT128F42646E8,Shawn Colvin,(Looking For) The Heart Of Saturday,674743,Shawn Colvin,(Looking for) The Heart of Saturday Night
18,TRMMMPJ128F9306985,Christian Castro,Tu Vida Con La Mía,3578541,Christian Castro,Tu vida con la mia
22,TRMMMWA128F1462C8C,Sev Statik,All For A Purpose (Speak Life Album Version),1941194,Sev Statik,All for a Purpose
...,...,...,...,...,...,...
779048,TRYYYHG128F9343EFB,Jazz Addixx,Chill,8484862,Jazz Addixx,Jazz Hop
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees


We see that the researchers did a pretty good job: their mapping contains several hundred thousand songs that would not have matched under a very simple strategy such as comparing lowercased names.
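
To put a number on this, the mismatches can be counted instead of displayed. The following is a minimal sketch on three rows copied from the output above; the helper name `count_mismatches` is hypothetical, and the same function could be called on the full `mapping` DataFrame.

```python
import pandas as pd


def count_mismatches(df, left, right):
    """Count rows where two text columns differ after lowercasing."""
    return int((df[left].str.lower() != df[right].str.lower()).sum())


# Three rows taken from the mapping shown above.
sample = pd.DataFrame({
    "msd_artist_name": ["Kris Kross", "The Maytals", "SKYCLAD"],
    "mxm_artist_name": ["Kris Kross", "Toots & The Maytals", "Skyclad"],
    "msd_title": ["2 Da Beat Ch'yall", "Night And Day", "Inequality Street"],
    "mxm_title": ["2 Da Beat Ch'yall", "Night and Day", "Inequality Street"],
})

artist_mismatches = count_mismatches(sample, "msd_artist_name", "mxm_artist_name")
title_mismatches = count_mismatches(sample, "msd_title", "mxm_title")

# Rows where the artist or the title (or both) differ.
either = (
    (sample.msd_artist_name.str.lower() != sample.mxm_artist_name.str.lower())
    | (sample.msd_title.str.lower() != sample.mxm_title.str.lower())
)
either_mismatches = int(either.sum())

print(artist_mismatches, title_mismatches, either_mismatches)
```

Note that a missing name (`NaN`) compares as unequal to everything, so rows with missing values are counted as mismatches by this comparison.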

## Load duplicates list and use it for filtering

The duplicates list has the following structure:
```
% ARTIST - TITLE
DUPLICATE_ID1
...
DUPLICATE_ID_N
% NEXT_ARTIST - NEXT_TITLE
...
```

So, we have some processing to do. We assume that if one track in a group of duplicates has a match in terms of musiXmatch ID, we can remove all the remaining matched tracks of that group from our dataset.
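
One way this processing could look is sketched below. It is not necessarily the approach taken in the rest of the notebook: the function names `parse_duplicate_groups` and `ids_to_drop` are hypothetical, the sample IDs are made up, and two assumptions are built in. First, lines starting with `#` are treated as comments. Second, within each group the first track that has a musiXmatch match is the one kept, which is an arbitrary choice.

```python
def parse_duplicate_groups(text):
    """Split the duplicates file into lists of track IDs, one per '%' header."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if line.startswith("%"):
            groups.append([])  # every '%' header opens a new group
        elif groups:
            groups[-1].append(line)
    return groups


def ids_to_drop(groups, mapped_ids):
    """Per group, keep the first ID found in `mapped_ids`; return the rest."""
    drop = set()
    for group in groups:
        matched = [tid for tid in group if tid in mapped_ids]
        drop.update(matched[1:])  # everything after the first match is redundant
    return drop


# Made-up example in the structure shown above.
sample = """% Artist A - Song
TR001
TR002
TR003
% Artist B - Other
TR004
TR005
"""

groups = parse_duplicate_groups(sample)
mapped = {"TR002", "TR003", "TR005", "TR999"}  # IDs that have a musiXmatch match
drop = ids_to_drop(groups, mapped)
kept = sorted(mapped - drop)
print(kept)
```

On the real data, the resulting set could be used to filter the mapping with something like `mapping[~mapping.msd_tid.isin(drop)]`.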

In [16]:
import re

with open("data/msd_duplicates.txt") as f:
  duplicates = f.read()
  re.