# Obtaining musiXmatch IDs for songs in the Million Song Dataset
At first sight, the [musiXmatch dataset](http://millionsongdataset.com/musixmatch/) looked like the perfect fit for this group project, as it contains lyric information for around 200k songs. Unfortunately, the lyrics are only available in bag-of-words format. We therefore use the [mapping of MSD IDs to musiXmatch IDs](http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip) instead to obtain the list of songs that have musiXmatch IDs.

As the dataset description also notes, the list needs further filtering: instrumental tracks have to be removed, and so do tracks that were identified as duplicates and are therefore contained in the [duplicate list](http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_duplicates.txt) (for details, see [this post](http://millionsongdataset.com/blog/11-3-15-921810-song-dataset-duplicates)).

In [1]:
import numpy as np
import pandas as pd
import os
from io import BytesIO
from zipfile import ZipFile
import requests

In [2]:
def get_remote_zip(url):
  resp = requests.get(url)
  resp.raise_for_status()  # fail early instead of parsing an error page as a zip
  return ZipFile(BytesIO(resp.content))

In [3]:
def get_remote_textfile(url):
  resp = requests.get(url)
  resp.raise_for_status()  # fail early instead of saving an error page as data
  return resp.text
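
The in-memory unzip pattern used in `get_remote_zip` can be tried without a network call. The following is a minimal sketch, assuming a made-up archive member name and content: it builds a small zip archive in memory and reads it back through `ZipFile(BytesIO(...))`, the same way the response body is handled above.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a tiny zip archive in memory (file name and content are made up).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "TRAAA<SEP>artist<SEP>title\n")

# Stand-in for `resp.content`: the raw bytes of the archive.
zip_bytes = buffer.getvalue()

# Same pattern as get_remote_zip: wrap the bytes, then open the first member.
with ZipFile(BytesIO(zip_bytes)) as archive:
    first_name = archive.namelist()[0]
    with archive.open(first_name) as member:
        text = member.read().decode("utf-8")
```

Nothing is written to disk here; the archive only ever exists as bytes in memory.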

In [4]:
data_dir = "data"
if not os.path.exists(data_dir):
  os.mkdir(data_dir)

## Load mapping of MSD IDs to musiXmatch IDs


In [5]:
mapping_file_path = os.path.join(data_dir, "mxm_779k_matches.txt")

if not os.path.exists(mapping_file_path):
  mapping_zip = get_remote_zip("http://millionsongdataset.com/sites/default/files/AdditionalFiles/mxm_779k_matches.txt.zip")
  # read the first file in the archive and store it locally
  with mapping_zip.open(mapping_zip.namelist()[0]) as zipped_file:
    mapping_file_content = zipped_file.read().decode("utf-8")
  with open(mapping_file_path, "w", encoding="utf-8") as f:
    f.write(mapping_file_content)




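mxm_mapping = pd.read_table(
  mapping_file_path,
  skiprows=18,  # skip the header lines at the top of the file
  names=["msd_tid", "msd_artist_name", "msd_title", "mxm_tid", "mxm_artist_name", "mxm_title"],
  sep="<SEP>",
  engine="python",  # multi-character separators require the Python parsing engine
)
mxm_mapping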



Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
0,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan
1,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
4,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall
...,...,...,...,...,...,...
779051,TRYYYZM128F428E804,SKYCLAD,Inequality Street,788003,Skyclad,Inequality Street
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees
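
To see how the `<SEP>`-separated layout is parsed, here is a small sketch on two made-up lines (the IDs, artists and titles are invented for illustration). It passes `engine="python"` explicitly, since pandas only supports separators longer than one character in the Python parsing engine.

```python
from io import StringIO

import pandas as pd

# Two made-up lines in the same "<SEP>"-separated layout as the mapping file.
sample = (
    "TRAAA<SEP>Artist A<SEP>Title A<SEP>111<SEP>Artist A<SEP>Title A\n"
    "TRBBB<SEP>Artist B<SEP>Title, with comma<SEP>222<SEP>Artist B<SEP>Title with comma\n"
)

columns = ["msd_tid", "msd_artist_name", "msd_title",
           "mxm_tid", "mxm_artist_name", "mxm_title"]

# Naming the Python engine explicitly avoids the warning about falling back
# from the C engine, which cannot handle multi-character separators.
toy_mapping = pd.read_csv(StringIO(sample), sep="<SEP>", names=columns, engine="python")
```

Commas inside titles are harmless here because the separator is the literal string `<SEP>`.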


### Checking how often lowercase artist/track names don't match despite actually being the same track

In [6]:
mxm_mapping[mxm_mapping.msd_artist_name.str.lower() != mxm_mapping.mxm_artist_name.str.lower()]

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
5,TRMMMHY12903CB53F1,Joseph Locke,Goodbye,793273,Joseph LoDuca,Goodbye
17,TRMMMTG128F426B5BB,Craze,Fuckin Ethic People (999),620542,DJ Craze,Fuckin Ethic People (999)
34,TRMMMBU128F9305AC3,The Maytals,Night And Day,3254947,Toots & The Maytals,Night and Day
36,TRMMMNI12903CE0AF1,Lil O,My Everything [Screwed] (feat. Trae The Truth),8589413,Lil' O,My Everything (feat. Trae tha Truth)
...,...,...,...,...,...,...
779003,TRYYKCY128F932323B,Narkoi,Show Time,8499737,羅志祥,Show Time
779004,TRYYYVQ128F4264186,Jose Luis Perales,Así Te Quiero Yo,8196388,José Luis Perales,Así te quiero yo
779008,TRYYYHR128F429DA7F,Blackbyrds,Wilford's Gone,3457853,Donald Byrd And The Blackbyrds,Wilford's Gone
779030,TRYYYPH128F933D084,Gabriel Le Mar,ambient jam,3682829,Gabriel Le Mar vs. Cylancer,Ambient Jam


In [7]:
mxm_mapping[mxm_mapping.msd_title.str.lower() != mxm_mapping.mxm_title.str.lower()]

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
11,TRMMMUT128F42646E8,Shawn Colvin,(Looking For) The Heart Of Saturday,674743,Shawn Colvin,(Looking for) The Heart of Saturday Night
18,TRMMMPJ128F9306985,Christian Castro,Tu Vida Con La Mía,3578541,Christian Castro,Tu vida con la mia
22,TRMMMWA128F1462C8C,Sev Statik,All For A Purpose (Speak Life Album Version),1941194,Sev Statik,All for a Purpose
...,...,...,...,...,...,...
779048,TRYYYHG128F9343EFB,Jazz Addixx,Chill,8484862,Jazz Addixx,Jazz Hop
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees


We see that the researchers did a thorough job: they matched several hundred thousand songs that a simple case-insensitive string comparison would have missed.

## Load duplicates list and use it for filtering

In [8]:
duplicate_file_path = os.path.join(data_dir, "msd_duplicates.txt")
if not os.path.exists(duplicate_file_path):
  duplicate_file_content = get_remote_textfile("http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_duplicates.txt")
  with open(duplicate_file_path, "w") as f:
    f.write(duplicate_file_content)

The duplicates list has the following structure:
```
% ARTIST - TITLE
DUPLICATE_ID1
...
DUPLICATE_ID_N
% NEXT_ARTIST - NEXT_TITLE
...
```

So we have some processing to do. We assume that once a group of duplicates has one entry with a musiXmatch ID, the remaining entries of that group can be removed from our dataset.

In [9]:
import re

with open(duplicate_file_path) as f:
  # remove comments (lines starting with "#")
  duplicates_data = "".join([line for line in f.readlines() if not line.startswith("#")])
  # we could extract the artist and track names for the duplicates from the file, like this:
  # artist_and_track_name_strs = re.findall(r"\%[0-9]*\s(.*)\n", duplicates_data)
  # splits = [str.split(" - ") for str in artist_and_track_name_strs]

  # However, there are a few instances where " - " is part of the song title or the artist's name,
  # and there is no clear rule for splitting those without introducing errors.
  # So, rather than picking the names from here, we just use the lists of duplicates.

  # split on the "%..." header lines: each remaining chunk holds the track IDs of one duplicate group
  duplicated_tracks = [ids_str.split("\n")[:-1] for ids_str in re.split(r"\%[0-9]*\s.*\n", duplicates_data)[1:]]
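
To make the regex split easier to follow, here is a sketch that runs the same steps on a tiny made-up sample. The track IDs, names and the numeric index after `%` are assumptions for illustration; the pattern matches the header line with or without such an index.

```python
import re

# Made-up sample: a "%" header line followed by the MSD track IDs of one group.
sample = (
    "# a comment line\n"
    "%0 Artist A - Song One\n"
    "TRAAA\n"
    "TRBBB\n"
    "%1 Artist B - Song - With Dash\n"
    "TRCCC\n"
    "TRDDD\n"
    "TREEE\n"
)

# Drop comment lines, as in the cell above.
data = "".join(
    line for line in sample.splitlines(keepends=True) if not line.startswith("#")
)

# Splitting on the header lines leaves one chunk of IDs per group; the chunk
# before the first header is empty and is skipped with [1:].
chunks = re.split(r"%[0-9]*\s.*\n", data)[1:]

# Each chunk ends with a newline, so the last element of the split is an
# empty string that [:-1] removes.
groups = [chunk.split("\n")[:-1] for chunk in chunks]
```

The second header contains two `" - "` separators, which is the case that makes splitting artist and title unreliable; the ID lists are unaffected by it.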

How many songs are actually duplicated?

In [10]:
len(duplicated_tracks)

53471

This matches the content of the file exactly: it also lists 53471 combinations of artist and track title.

The authors of the duplicate list also state that in total 131661 "song objects" are duplicates of another one. Let's make sure we extracted the same information:

In [11]:
sum([len(track_ids) for track_ids in duplicated_tracks])

131661

Looks good!

Now we need to figure out a way to actually remove the duplicates.

For this purpose, we first create a DataFrame with two columns: a "duplicate ID" that uniquely identifies one of the 53471 duplicated tracks, and the respective track ID from the MSD dataset.

In [12]:
dups_and_tracks = [[(i, track_id) for track_id in track_ids] for (i, track_ids) in enumerate(duplicated_tracks)]
dups_and_tracks_flat = [item for dup_and_tracks in dups_and_tracks for item in dup_and_tracks]
dup_mapping = pd.DataFrame(dups_and_tracks_flat, columns=["duplicate_id", "msd_tid"])
dup_mapping

Unnamed: 0,duplicate_id,msd_tid
0,0,TRFCVSW12903D0A298
1,0,TRCWFEM128F9320F94
2,0,TRKYJRK12903CE6493
3,0,TRWTOBV128F9300F8A
4,1,TRWFIGX128F42920CA
...,...,...
131656,53468,TRVTTQH12903C9B37B
131657,53469,TRDNEDV128F92FFE25
131658,53469,TRUYBTI128F422D6CC
131659,53470,TRXVMUN128E0784025
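
The same two-column frame can also be built with `Series.explode`, which may read more easily than the nested list comprehensions. This is only an alternative sketch on made-up groups, not the approach used in this notebook; `toy_dup_mapping` is a hypothetical name.

```python
import pandas as pd

# Made-up duplicate groups, in the same shape as `duplicated_tracks`.
groups = [["TRAAA", "TRBBB"], ["TRCCC", "TRDDD", "TREEE"]]

toy_dup_mapping = (
    pd.Series(groups, name="msd_tid")
    .explode()                    # one row per track ID; the group position stays in the index
    .rename_axis("duplicate_id")  # name the index so it becomes a proper column
    .reset_index()
)
```

The position of each group in the list becomes its `duplicate_id`, just as with `enumerate` above.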


Now we can use that to find out which tracks from the musiXmatch mappings are duplicates:

In [16]:
mxm_with_dup_info = pd.merge(mxm_mapping, dup_mapping, on="msd_tid", how="left")
mxm_with_dup_info

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title,duplicate_id
0,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan,
1,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever,
2,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres,
3,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":...",
4,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall,
...,...,...,...,...,...,...,...
779051,TRYYYZM128F428E804,SKYCLAD,Inequality Street,788003,Skyclad,Inequality Street,
779052,TRYYYON128F932585A,Loose Shus,Taurus (Keenhouse Remix),8564800,Loose Shus,Red Sonja,
779053,TRYYYUS12903CD2DF0,Kiko Navarro,O Samba Da Vida,8472838,Kiko Navarro,A Samba Da Vida,
779054,TRYYYMG128F4260ECA,Gabriel Le Mar,Novemba,1997445,Gabriel Le Mar,140 Degrees,


How many of the tracks with musiXmatch mappings are actually duplicated?

In [20]:
mxm_dups = mxm_with_dup_info[mxm_with_dup_info.duplicate_id.notna()].sort_values(by="duplicate_id")
mxm_dups

Unnamed: 0,msd_tid,msd_artist_name,msd_title,mxm_tid,mxm_artist_name,mxm_title,duplicate_id
215550,TRFCVSW12903D0A298,The Del Vikings,Whispering Bells,9124806,The Del-Vikings,Whispering Bells,0.0
748808,TRKYJRK12903CE6493,The Del Vikings,Whispering Bells,9124806,The Del-Vikings,Whispering Bells,0.0
47284,TRWTOBV128F9300F8A,The Del-Vikings,Whispering Bells,9124806,The Del-Vikings,Whispering Bells,0.0
121324,TRCWFEM128F9320F94,The Del Vikings,Whispering Bells,9124806,The Del-Vikings,Whispering Bells,0.0
217467,TRFBNON128F4292174,ANGELZOOM,Blasphemous rumours,3374054,Angelzoom,Blasphemous Rumours,1.0
...,...,...,...,...,...,...,...
720084,TRKMSYQ128F425D7BA,Lilly Allen,Naive,7686473,Lily Allen,Naïve,53467.0
401516,TRPZSRF128F1484C3B,Lily Allen,Naïve,7686473,Lily Allen,Naïve,53467.0
616320,TRVTTQH12903C9B37B,Eddie Money,She Came in Through the Bathroom Window,754252,The Beatles,She Came In Through the Bathroom Window,53468.0
193145,TRBAAOT128F4261A18,Harry Connick_ Jr.,Once,1147634,"Harry Connick, Jr.",Once,53470.0
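
The `.0` suffix in the `duplicate_id` column comes from the left merge: rows without a match get `NaN`, which turns the integer column into floats. The sketch below shows this on made-up data and, as an optional step that this notebook does not take, converts the column to pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Made-up stand-ins for the mapping and the duplicate list.
toy_mapping = pd.DataFrame(
    {"msd_tid": ["TRAAA", "TRBBB", "TRXXX"], "mxm_tid": [111, 111, 333]}
)
toy_dups = pd.DataFrame({"duplicate_id": [0, 0], "msd_tid": ["TRAAA", "TRBBB"]})

merged = pd.merge(toy_mapping, toy_dups, on="msd_tid", how="left")

# The unmatched row gets NaN, which forces the column to float64.
float_dtype = str(merged["duplicate_id"].dtype)

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>.
merged["duplicate_id"] = merged["duplicate_id"].astype("Int64")
```

Filtering with `notna()` works the same way on both dtypes, so the conversion is purely cosmetic.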


How often were tracks mapped to different musiXmatch track IDs despite actually being duplicates?

In [23]:
mxm_dups.groupby("duplicate_id").mxm_tid.nunique().value_counts()

1    41908
2     4457
3      172
4       12
Name: mxm_tid, dtype: int64

We see that the musiXmatch mapping is good but not perfect: 4641 of the 46549 duplicate groups map to more than one musiXmatch track ID, even though their tracks should be the same song.

Let's compare that with the number of tracks per duplicate group that appear in the musiXmatch mapping:

In [24]:
mxm_dups.groupby("duplicate_id").mxm_tid.count().value_counts()

2     35038
3      6274
4      2210
5      1007
1       729
6       530
7       287
8       200
9       102
10       65
11       45
12       20
13       12
14        9
18        4
16        4
15        3
17        2
19        2
20        2
87        1
25        1
31        1
29        1
Name: mxm_tid, dtype: int64
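
The difference between the two aggregations above is easiest to see on a toy frame. This is a small sketch with made-up IDs: `nunique` counts the distinct musiXmatch IDs per duplicate group, while `count` counts how many tracks of the group appear in the mapping at all.

```python
import pandas as pd

# Made-up data: group 0 is consistent, group 1 maps to two different
# musiXmatch IDs, and group 2 has only one track in the mapping.
toy = pd.DataFrame({
    "duplicate_id": [0, 0, 0, 1, 1, 2],
    "mxm_tid": [111, 111, 111, 222, 223, 333],
})

by_group = toy.groupby("duplicate_id")["mxm_tid"]

# nunique: how many different musiXmatch IDs a group maps to.
distinct_ids = by_group.nunique()

# count: how many rows (MSD tracks) of the group appear in the mapping.
rows_per_group = by_group.count()

# Groups whose tracks disagree on the musiXmatch ID.
inconsistent = distinct_ids[distinct_ids > 1].index.tolist()
```

A group with a count of 1, like group 2 here, corresponds to the 729 groups in the output above where only one of the duplicated tracks appears in the mapping.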