The Million Song Dataset is quite old, and its data is scattered across many files in different formats (txt, hdf5, sqlite3, etc.). This notebook lays out the steps to aggregate these files and prepare the data required for the final model.

Note: You can skip this notebook and work directly with model.ipynb for the demo, since the aggregated data is already provided with the project files (song_genre_year.csv).

If you wish to run on the whole dataset, we recommend Google Colab or the PACE cluster; on a personal machine the kernel might crash due to insufficient memory.

This notebook has been optimized to run within 12.7 GB of RAM on Google Colab when working with the full 1 million songs, by freeing memory at each step. Requesting resources similar to Google Colab's on https://ondemand-ice.pace.gatech.edu/pun/sys/dashboard/ should also be sufficient.

You can customize the number of rows to process. It is set to a low number so that the demo runs on a personal system; feel free to try a bigger subset or the whole dataset, as your resources allow.

Before you run the notebook, create a virtual environment by running the following command in this directory:
python3 -m venv venv

Then activate it with:
source venv/bin/activate

Either way, make sure the packages from requirements.txt are installed:

In [1]:
!pip3 install -r requirements.txt
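
To confirm that the notebook kernel is really using the virtual environment, a quick check along the following lines can help. This is only a minimal sketch: the helper name `in_virtualenv` is made up for illustration, and it relies on the fact that inside a `venv` environment `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If this prints `False` after activating `venv`, the notebook is probably attached to a different kernel than the one you installed the packages into.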



The following files are required to run this notebook (either run the next cell or download them manually): <br>
Song Meta Data Summary HDF5 file: http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5 <br>
Genre Dataset: https://www.tagtraum.com/genres/msd_tagtraum_cd2c.cls.zip <br>
Year Dataset: http://millionsongdataset.com/sites/default/files/AdditionalFiles/tracks_per_year.txt <br>
User Listening Data (used for the UI section at the end): http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip

In [23]:
!wget http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5
!wget http://millionsongdataset.com/sites/default/files/AdditionalFiles/tracks_per_year.txt
!wget https://www.tagtraum.com/genres/msd_tagtraum_cd2c.cls.zip
!wget http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip

--2023-12-02 16:07:12--  http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5
Resolving millionsongdataset.com (millionsongdataset.com)... 172.104.14.177
Connecting to millionsongdataset.com (millionsongdataset.com)|172.104.14.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 316104481 (301M)
Saving to: ‘msd_summary_file.h5.2’

--2023-12-02 16:09:59--  http://millionsongdataset.com/sites/default/files/AdditionalFiles/tracks_per_year.txt
Resolving millionsongdataset.com (millionsongdataset.com)... 172.104.14.177
Connecting to millionsongdataset.com (millionsongdataset.com)|172.104.14.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35028078 (33M) [text/plain]
Saving to: ‘tracks_per_year.txt.2’

--2023-12-02 16:10:24--  https://www.tagtraum.com/genres/msd_tagtraum_cd2c.cls.zip
Resolving www.tagtraum.com (www.tagtraum.com)... 81.169.145.77, 2a01:238:20a:202:1077::
Connecting to www.tagtraum.com (www.tagtraum.com)|81.169.145.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1621202 (1.5M) [application/zip]
Saving to: ‘msd_tagtraum_cd2c.cls.zip.2
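
Re-running the `wget` cell downloads every file again and saves numbered copies such as `msd_summary_file.h5.2`, as the log above shows. A possible alternative is to download from Python and skip files that are already present. The sketch below is an assumption-laden illustration, not part of the original pipeline: the helper `download_if_missing` and its `fetch` parameter are hypothetical names, and only the standard library is used.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

URLS = [
    "http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5",
    "http://millionsongdataset.com/sites/default/files/AdditionalFiles/tracks_per_year.txt",
    "https://www.tagtraum.com/genres/msd_tagtraum_cd2c.cls.zip",
    "http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip",
]


def download_if_missing(url, dest_dir=".", fetch=urlretrieve):
    """Download `url` into `dest_dir` unless a file of that name already exists.

    `fetch` is any callable taking (url, path); it defaults to urlretrieve.
    Returns (path, downloaded), where `downloaded` is False if the file was reused.
    """
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(dest_dir, filename)
    if os.path.exists(path):
        return path, False
    fetch(url, path)
    return path, True


# Uncomment to fetch all four files (the summary file alone is about 301 MB):
# for url in URLS:
#     print(download_if_missing(url))
```

Because existing files are reused, the rest of the notebook keeps reading the original file names rather than a numbered duplicate.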

In the following cell, set the number of songs you want to work with. It is initialized to 10_000 and can go up to 1_000_000. If you want to work with the whole Million Song Dataset, we recommend using Google Colab or the PACE cluster; we ran it on Google Colab.

In [1]:
nrows = 10_000

In [2]:
import h5py
import pandas as pd

h5_file_path = "msd_summary_file.h5"

# Slicing an h5py dataset reads only the first `nrows` records into a
# NumPy structured array, so the whole file is never loaded at once.
with h5py.File(h5_file_path, "r") as h5file:
    songs_data = h5file["metadata"]["songs"][:nrows]
    analysis_data = h5file["analysis"]["songs"][:nrows]
    musicbrainz_data = h5file["musicbrainz"]["songs"][:nrows]

songs_df = pd.DataFrame(songs_data)
analysis_df = pd.DataFrame(analysis_data)
musicbrainz_df = pd.DataFrame(musicbrainz_data)

print(songs_df.columns)
print(analysis_df.columns)
print(musicbrainz_df.columns)

Index(['analyzer_version', 'artist_7digitalid', 'artist_familiarity',
       'artist_hotttnesss', 'artist_id', 'artist_latitude', 'artist_location',
       'artist_longitude', 'artist_mbid', 'artist_name', 'artist_playmeid',
       'genre', 'idx_artist_terms', 'idx_similar_artists', 'release',
       'release_7digitalid', 'song_hotttnesss', 'song_id', 'title',
       'track_7digitalid'],
      dtype='object')
Index(['analysis_sample_rate', 'audio_md5', 'danceability', 'duration',
       'end_of_fade_in', 'energy', 'idx_bars_confidence', 'idx_bars_start',
       'idx_beats_confidence', 'idx_beats_start', 'idx_sections_confidence',
       'idx_sections_start', 'idx_segments_confidence',
       'idx_segments_loudness_max', 'idx_segments_loudness_max_time',
       'idx_segments_loudness_start', 'idx_segments_pitches',
       'idx_segments_start', 'idx_segments_timbre', 'idx_tatums_confidence',
       'idx_tatums_start', 'key', 'key_confidence', 'loudness', 'mode',
       'mode_confidence',

In [3]:
msd_df = pd.concat([songs_df, analysis_df, musicbrainz_df], axis=1)
del songs_df
del analysis_df
del musicbrainz_df
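
Deleting the intermediate frames is what keeps peak memory down when `nrows` is large. As a rough illustration of how to see the effect, the sketch below measures a frame with `memory_usage(deep=True)` and then releases the inputs. The toy frames and the helper name `frame_megabytes` are assumptions made for this example only.

```python
import gc

import pandas as pd


def frame_megabytes(df: pd.DataFrame) -> float:
    """Approximate in-memory size of a DataFrame in MB, including string contents."""
    return df.memory_usage(deep=True).sum() / 1e6


# Toy stand-ins for the intermediate frames (made-up values).
left = pd.DataFrame({"track_id": [f"TR{i:06d}" for i in range(1000)]})
right = pd.DataFrame({"tempo": [120.0] * 1000})

combined = pd.concat([left, right], axis=1)
print(f"combined frame: {frame_megabytes(combined):.3f} MB")

# Drop the references that are no longer needed, then ask the garbage
# collector to reclaim any leftover cyclic garbage.
del left, right
gc.collect()
```

Calling `frame_megabytes(msd_df)` at a few points in the notebook gives a quick idea of whether a larger `nrows` will fit in the available RAM.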

In [4]:
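def decode_bytes(x):
    return x.decode("utf-8") if isinstance(x, bytes) else x

# DataFrame.applymap is deprecated in newer pandas in favour of DataFrame.map.
elementwise = msd_df.map if hasattr(msd_df, "map") else msd_df.applymap
msd_df = elementwise(decode_bytes)
# year and genre are re-added below from the dedicated datasets.
msd_df.drop(["year", "genre"], axis=1, inplace=True)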
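
The decoding step is needed because the string fields of the summary file arrive as byte strings, which is why the code checks `isinstance(x, bytes)`. The following self-contained sketch shows the same idea on a made-up NumPy structured array; the field names and values are invented and only imitate the layout of the real data.

```python
import numpy as np
import pandas as pd

# A made-up structured array imitating a few fields of the summary file.
records = np.array(
    [(b"Home Again", 2003, 128.5), (b"City", 1998, 97.0)],
    dtype=[("title", "S32"), ("year", "i4"), ("tempo", "f8")],
)
demo_df = pd.DataFrame(records)


def decode_bytes(value):
    """Decode bytes to str and leave every other value untouched."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


# Column-by-column decoding; numeric values pass through unchanged.
for column in demo_df.columns:
    demo_df[column] = demo_df[column].map(decode_bytes)

print(demo_df)
```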



In [5]:
!unzip -n "msd_tagtraum_cd2c.cls.zip"

Archive:  msd_tagtraum_cd2c.cls.zip


In [6]:
file_path = "msd_tagtraum_cd2c.cls"

genre_df = pd.read_csv(
    file_path,
    sep="\t",
    header=None,
    names=["track_id", "genre", "minority_genre"],
    comment="#",
)

genre_df

Unnamed: 0,track_id,genre,minority_genre
0,TRAAAAK128F9318786,Rock,
1,TRAAAAW128F429D538,Rap,
2,TRAAADJ128F4287B47,Rock,
3,TRAAADZ128F9348C2E,Latin,
4,TRAAAED128E0783FAB,Jazz,
...,...,...,...
191396,TRZZZMY128F426D7A2,Reggae,
191397,TRZZZRJ128F42819AF,Rock,
191398,TRZZZUK128F92E3C60,Folk,
191399,TRZZZZD128F4236844,Rock,


In [7]:
genre_df.drop(["minority_genre"], axis=1, inplace=True)
print(genre_df)

                  track_id       genre
0       TRAAAAK128F9318786        Rock
1       TRAAAAW128F429D538         Rap
2       TRAAADJ128F4287B47        Rock
3       TRAAADZ128F9348C2E       Latin
4       TRAAAED128E0783FAB        Jazz
...                    ...         ...
191396  TRZZZMY128F426D7A2      Reggae
191397  TRZZZRJ128F42819AF        Rock
191398  TRZZZUK128F92E3C60        Folk
191399  TRZZZZD128F4236844        Rock
191400  TRZZZZZ12903D05E3A  Electronic

[191401 rows x 2 columns]

In [8]:
msd_df = genre_df.merge(msd_df, on="track_id")
del genre_df
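
Note that `merge` performs an inner join by default, so only tracks that also have a genre label survive this step, and the merged frame can be much smaller than `nrows`. The toy example below illustrates this; the data is made up, and `validate="one_to_one"` is an optional safeguard that assumes each track ID appears at most once in both frames.

```python
import pandas as pd

# Made-up miniature versions of the two frames.
genres = pd.DataFrame(
    {"track_id": ["TR1", "TR2", "TR3"], "genre": ["Rock", "Jazz", "Rap"]}
)
songs = pd.DataFrame(
    {"track_id": ["TR2", "TR3", "TR4", "TR5"], "tempo": [90.0, 120.0, 75.0, 140.0]}
)

# Inner join (the default): keep only track IDs present in both frames.
# validate raises a MergeError if a track ID is duplicated on either side.
merged = genres.merge(songs, on="track_id", how="inner", validate="one_to_one")

print(f"{len(songs)} songs, {len(merged)} with a genre label")
print(merged)
```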

In [9]:
file_path = "tracks_per_year.txt"

# Each line has the form year<SEP>track_id<SEP>artist_name<SEP>song_name.
year_df = pd.read_csv(
    file_path,
    sep="<SEP>",
    header=None,
    names=["year", "track_id", "artist_name", "song_name"],
    engine="python",  # multi-character separators need the Python parser
)

print(year_df)

        year            track_id         artist_name  \
0       1922  TRSGHLU128F421DF83      Alberta Hunter   
1       1922  TRMYDFV128F42511FC     Barrington Levy   
2       1922  TRRAHXQ128F42511FF     Barrington Levy   
3       1922  TRFAFTK12903CC77B8     Barrington Levy   
4       1922  TRSTBUY128F4251203     Barrington Levy   
...      ...                 ...                 ...   
515571  2010  TRPLSGR128F930135B      Ólafur Arnalds   
515572  2010  TRLYTZU128F930135F      Ólafur Arnalds   
515573  2010  TRJHYON128F9301354      Ólafur Arnalds   
515574  2010  TRXQZXQ128F9301352      Ólafur Arnalds   
515575  2011  TRCGTAH12903CAFAB8  The Joy Formidable   

                            song_name  
0                        Don't Pan Me  
1                  Warm And Sunny Day  
2                     Looking My Love  
3                  Warm And Sunny Day  
4                 Mandela You're Free  
...                               ...  
515571  Loftið ve

In [10]:
year_df.drop(["song_name", "artist_name"], axis=1, inplace=True)

In [11]:
msd_df = msd_df.merge(year_df, on="track_id")
del year_df
print(msd_df)

                track_id       genre analyzer_version  artist_7digitalid  \
0     TRMBAEP128F92FF86F  Electronic                               51120   
1     TRMBAGK128F92E59B5        Jazz                                9580   
2     TRMBALI128F4263C1B        Jazz                                 777   
3     TRMBANE128F423ECDB        Rock                               87687   
4     TRMBAXZ12903CF4300       Metal                               90917   
...                  ...         ...              ...                ...   
1469  TRMWZNP128F931E7B8  Electronic                               70840   
1470  TRMWZUQ128F934B40C       Metal                              111353   
1471  TRMWZWM128F92E8341     Country                               26388   
1472  TRMWZXH128F930691C        Rock                              106258   
1473  TRMWZZR128F42AE75E        Rock                               52287   

      artist_familiarity  artist_hotttnesss           art

In [12]:
msd_df.columns.to_list()

['track_id',
 'genre',
 'analyzer_version',
 'artist_7digitalid',
 'artist_familiarity',
 'artist_hotttnesss',
 'artist_id',
 'artist_latitude',
 'artist_location',
 'artist_longitude',
 'artist_mbid',
 'artist_name',
 'artist_playmeid',
 'idx_artist_terms',
 'idx_similar_artists',
 'release',
 'release_7digitalid',
 'song_hotttnesss',
 'song_id',
 'title',
 'track_7digitalid',
 'analysis_sample_rate',
 'audio_md5',
 'danceability',
 'duration',
 'end_of_fade_in',
 'energy',
 'idx_bars_confidence',
 'idx_bars_start',
 'idx_beats_confidence',
 'idx_beats_start',
 'idx_sections_confidence',
 'idx_sections_start',
 'idx_segments_confidence',
 'idx_segments_loudness_max',
 'idx_segments_loudness_max_time',
 'idx_segments_loudness_start',
 'idx_segments_pitches',
 'idx_segments_start',
 'idx_segments_timbre',
 'idx_tatums_confidence',
 'idx_tatums_start',
 'key',
 'key_confidence',
 'loudness',
 'mode',
 'mode_confidence',
 'start_of_fade_out',
 'tempo',
 'time_signature',
 'time_signature_

msd_df was saved to a file as song_genre_year.csv which we work with in our model.
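
The cell that writes this file is not shown in the notebook. A minimal sketch of that step could look like the following; `index=False` and the helper name `save_dataset` are assumptions rather than the settings actually used, and the toy frame holds made-up values.

```python
import os
import tempfile

import pandas as pd


def save_dataset(df: pd.DataFrame, path: str) -> None:
    """Write the aggregated frame to CSV without the positional index."""
    df.to_csv(path, index=False)


# Round-trip check on a toy frame (made-up values) in a temporary directory.
toy_df = pd.DataFrame(
    {"track_id": ["TR1", "TR2"], "genre": ["Rock", "Jazz"], "year": [1999, 2004]}
)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "song_genre_year.csv")
    save_dataset(toy_df, csv_path)
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

In the notebook itself the corresponding call would be `save_dataset(msd_df, "song_genre_year.csv")`.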

The following code snippets generate files that make the data easier for the UI to consume. They are not essential, but feel free to go through them.

Year Genre Song Hotness Count Chart

In [13]:
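# Per (year, genre): mean hotttnesss, mean duration, and number of songs.
chart_df = msd_df[["song_id", "genre", "year", "song_hotttnesss", "duration"]]
chart_df = (
    chart_df.groupby(["year", "genre"])
    .agg({"song_hotttnesss": "mean", "duration": "mean", "song_id": "size"})
    .reset_index()
)
chart_df = chart_df.rename(columns={"song_id": "count"})
# Assign back rather than using inplace=True on a column selection,
# which newer pandas versions warn about.
chart_df["song_hotttnesss"] = chart_df["song_hotttnesss"].fillna(0)

# Build every (year, genre) combination so the chart has no missing cells.
unique_years = chart_df["year"].unique()
unique_genres = chart_df["genre"].unique()
all_combinations = pd.MultiIndex.from_product(
    [unique_years, unique_genres], names=["year", "genre"]
)
all_combinations_df = pd.DataFrame(index=all_combinations).reset_index()

# Combinations that never occur in the data are filled with zeros.
final_chart_df = pd.merge(
    all_combinations_df, chart_df, on=["year", "genre"], how="left"
).fillna(0)

# Groups without any hotttnesss information are reported with a count of 0.
final_chart_df.loc[final_chart_df["song_hotttnesss"] == 0, "count"] = 0
final_chart_df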

Unnamed: 0,year,genre,song_hotttnesss,duration,count
0,1938,Blues,0.000000,171.676280,0.0
1,1938,Country,0.000000,0.000000,0.0
2,1938,Jazz,0.000000,0.000000,0.0
3,1938,Rock,0.000000,0.000000,0.0
4,1938,RnB,0.000000,0.000000,0.0
...,...,...,...,...,...
850,2010,Punk,0.000000,0.000000,0.0
851,2010,New Age,0.000000,0.000000,0.0
852,2010,Electronic,0.575618,285.661587,8.0
853,2010,Rap,0.576644,179.495603,3.0
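
How the UI loads the chart data is not shown here. Assuming it reads JSON, one way to hand the frame over is a list of row objects via `to_json(orient="records")`. The sketch below uses a toy frame with two rows copied from the table above; the choice of JSON and of the `records` layout is an assumption.

```python
import json

import pandas as pd

# Toy rows in the same layout as final_chart_df.
toy_chart_df = pd.DataFrame(
    {
        "year": [2010, 2010],
        "genre": ["Electronic", "Rap"],
        "song_hotttnesss": [0.575618, 0.576644],
        "duration": [285.661587, 179.495603],
        "count": [8.0, 3.0],
    }
)

# orient="records" yields a list of {column: value} objects, one per row.
chart_json = toy_chart_df.to_json(orient="records")
records = json.loads(chart_json)
print(records[0])
```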


We now work with the user-song interaction dataset. For the UI we need to show the top 5 songs played by each user. Because user_id is an alphanumeric string, we assign a fake name to each unique user_id for better UX. The following code snippets achieve this.

In [14]:
!unzip -n "train_triplets.txt.zip"

Archive:  train_triplets.txt.zip


In [15]:
column_names = ["user_id", "song_id", "listen_count"]
user_play_counts_df = pd.read_csv(
    "train_triplets.txt", sep="\t", names=column_names, nrows=10_000
)

In [16]:
pc_df = msd_df[["song_id", "title"]]
group_df = user_play_counts_df.merge(pc_df, on="song_id")
top_5_songs = (
    group_df.groupby("user_id")
    .apply(lambda x: x.nlargest(5, "listen_count"))
    .reset_index(drop=True)
)

In [17]:
top_5_songs

Unnamed: 0,user_id,song_id,listen_count,title
0,01655ae6bc52e29c9cd100a7dde4e9eeae5e4031,SOPRIUU12A8C133EF5,2,Home Again
1,01655ae6bc52e29c9cd100a7dde4e9eeae5e4031,SODJMNX12A8C1320C4,1,Astronaut
2,0ef42a19efb74d0a05c308d00636c8d8d41bec0c,SOSCMSY12A8C13C1E9,1,Warrior
3,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOAQZZG12A8C135BFF,2,Ain't No Good
4,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOIUVMN12A8AE47414,1,City
5,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOSCNMY12AC3DFB313,1,Official
6,17aa9f6dbdf753831da8f38c71b66b64373de613,SONMWXV12AB01803B4,1,Year 3000
7,18dc95f354220b343e98ebbb7c8564291284ed9f,SORWULZ12A6D4F5B1E,1,Queen Of My Double Wide Trailer
8,18dc95f354220b343e98ebbb7c8564291284ed9f,SOJUERX12AB0184DE1,1,Street Of Dreams
9,3187e658fd71be99d68d96f3a61a436c8a607365,SOWWQXA12A81C1F9CA,1,Scream
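
The same per-user top 5 can also be obtained without `groupby(...).apply`, which may become slow when there are many users. The sketch below sorts once and then keeps the first five rows of each group. It runs on made-up play counts, and rows with tied `listen_count` values may come out in a different order than with `nlargest`.

```python
import pandas as pd

# Made-up play counts: user "u1" has seven songs, user "u2" has two.
plays = pd.DataFrame(
    {
        "user_id": ["u1"] * 7 + ["u2"] * 2,
        "song_id": [f"SO{i}" for i in range(9)],
        "listen_count": [1, 2, 3, 4, 5, 6, 7, 9, 8],
    }
)

top_5 = (
    plays.sort_values(["user_id", "listen_count"], ascending=[True, False])
    .groupby("user_id")
    .head(5)  # first five rows of each user after sorting
    .reset_index(drop=True)
)
print(top_5)
```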


In [18]:
from faker import Faker

fake = Faker()

def generate_fake_names(df):
    user_ids = df["user_id"].unique()
    fake_names = [fake.name() for _ in range(len(user_ids))]
    fake_mapping = dict(zip(user_ids, fake_names))
    df["fake_name"] = df["user_id"].map(fake_mapping)
    return df

df_with_fake_names = generate_fake_names(top_5_songs)
df_with_fake_names

Unnamed: 0,user_id,song_id,listen_count,title,fake_name
0,01655ae6bc52e29c9cd100a7dde4e9eeae5e4031,SOPRIUU12A8C133EF5,2,Home Again,Barry Jenkins
1,01655ae6bc52e29c9cd100a7dde4e9eeae5e4031,SODJMNX12A8C1320C4,1,Astronaut,Barry Jenkins
2,0ef42a19efb74d0a05c308d00636c8d8d41bec0c,SOSCMSY12A8C13C1E9,1,Warrior,Donna Murphy
3,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOAQZZG12A8C135BFF,2,Ain't No Good,Joe Davis
4,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOIUVMN12A8AE47414,1,City,Joe Davis
5,12a4a991c8b53cd6906995caed8b1f2bd3b6436a,SOSCNMY12AC3DFB313,1,Official,Joe Davis
6,17aa9f6dbdf753831da8f38c71b66b64373de613,SONMWXV12AB01803B4,1,Year 3000,Daryl Wright
7,18dc95f354220b343e98ebbb7c8564291284ed9f,SORWULZ12A6D4F5B1E,1,Queen Of My Double Wide Trailer,Jeffrey James
8,18dc95f354220b343e98ebbb7c8564291284ed9f,SOJUERX12AB0184DE1,1,Street Of Dreams,Jeffrey James
9,3187e658fd71be99d68d96f3a61a436c8a607365,SOWWQXA12A81C1F9CA,1,Scream,Jessica Anderson
