# Steps

1. Create a way to find a particular song by title in the ~~Million Song Dataset (MSD)~~ (\*) Million Song Subset, a 10,000-song sample of the MSD, if the song exists in the subset.

2. Compile a list of songs for each genre.

3. Pick out at least 30 songs from each compiled list, which are included in the Million Song Subset. Collect their track IDs for future data lookups.

\* The MSD takes up about 300 GB of disk space, which makes it infeasible to download. The Million Song Subset is more reasonable at less than 2 GB of data.

### Backup plans
- The Tagtraum and MAGD genre annotation datasets, available from the MSD website's homepage, map MSD track ids to genres. These could be used to pull songs of different genres from the MSD (https://www.tagtraum.com/msd_genre_datasets.html, http://www.ifs.tuwien.ac.at/mir/msd/).

## Download the Million Song Subset

In [1]:
# download the 10_000 song subset from the Million Song Dataset

# !wget labrosa.ee.columbia.edu/~dpwe/tmp/millionsongsubset.tar.gz

In [3]:
# extract the files from the downloaded tar.gz archive

# !tar -xf millionsongsubset.tar.gz

## Finding a Song in the Million Song Subset by Song Title

In [4]:
# Download the reverse-index provided by MSD to find songs by title.
# Format of each entry (line) is track id<SEP>song id<SEP>artist name<SEP>song title

# !wget http://millionsongdataset.com/sites/default/files/AdditionalFiles/unique_tracks.txt 
    # this is for the Million Song Dataset, not Million Song Subset.
    # Need to find out which of these are in the Million Song Subset.

--2021-04-02 16:35:23--  http://millionsongdataset.com/sites/default/files/AdditionalFiles/unique_tracks.txt
Resolving millionsongdataset.com (millionsongdataset.com)... 172.104.14.177
Connecting to millionsongdataset.com (millionsongdataset.com)|172.104.14.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84046293 (80M) [text/plain]
Saving to: ‘unique_tracks.txt.1’


2021-04-02 16:35:29 (12.5 MB/s) - ‘unique_tracks.txt.1’ saved [84046293/84046293]



In [2]:
# connecting song titles to song IDs using the reverse index file for a million songs
song_ids = []
song_titles = []
track_ids = []
num_lines = 0
with open("unique_tracks.txt", 'r') as f:
  for x in f:
    num_lines = num_lines + 1
    line = x.split("<SEP>")
    track_ids.append(line[0])
    song_ids.append(line[1])
    song_titles.append(line[3])

In [3]:
print(num_lines)
print(len(song_titles))
print(len(song_ids))
print(len(track_ids))

1000000
1000000
1000000
1000000


In [4]:
# for i in range(50): print(f"title: {song_titles[i]} id: {song_ids[i]}")
song_title_to_trackid = {song_titles[i]:track_ids[i] for i in range(num_lines)}
song_title_to_songid = {song_titles[i]:song_ids[i] for i in range(num_lines)}

for i in range(30): print(f"{i}) title: {song_titles[i]} track_id: {song_title_to_trackid[song_titles[i]]} song_id: {song_title_to_songid[song_titles[i]]}")

0) title: Silent Night
 track_id: TRYIAOS128F92CD8E9 song_id: SOJJKTG12A8C140EBF
1) title: Tanssi vaan
 track_id: TRMMMKD128F425225D song_id: SOVFVAK12A8C1350D9
2) title: No One Could Ever
 track_id: TRMMMRX128F93187D9 song_id: SOGTUKN12AB017F4F1
3) title: Si Vos Querés
 track_id: TRMMMCH128F425532C song_id: SOBNYVR12A8C13558C
4) title: Tangle Of Aspens
 track_id: TRMMMWA128F426B589 song_id: SOHSBXH12A8C13B0DF
5) title: Symphony No. 1 G minor "Sinfonie Serieuse"/Allegro con energia
 track_id: TRMMMXN128F42936A5 song_id: SOZVAPQ12A8C13B63C
6) title: We Have Got Love
 track_id: TRMMMLR128F1494097 song_id: SOQVRHI12A6D4FB2D7
7) title: 2 Da Beat Ch'yall
 track_id: TRMMMBB12903CB7D21 song_id: SOEYRFT12AB018936C
8) title: Goodbye
 track_id: TRYNFCT128F4245B3F song_id: SOIONTS12A6D4FDA6D
9) title: Mama_ mama can't you see ?
 track_id: TRMMMML128F4280EE9 song_id: SOJCFMH12A8C13B0C2
10) title: L'antarctique
 track_id: TRMMMNS128F93548E1 song_id: SOYGNWH12AB018191E
11) title: El hijo del pueblo


In [5]:
# checking if a song, "Nervous" (index 22), is in the MSD, and if it is, return its track_id
print(type(song_titles[22]))
print(type('Nervous'))
print(song_titles[22])
print('Nervous' == song_titles[22]) # will print False
print('Nervous\n' == song_titles[22]) # will print True
# In other words, song_title_to_trackid['Nervous'] raises a KeyError.
# However, the following lookups work:
print(song_title_to_trackid['Nervous\n'])
print(song_title_to_songid['Nervous\n'])
# i.e., while we would expect song_title_to_trackid['Nervous'] or song_title_to_songid['Nervous'] to work,
#           they fail without the "\n" after the title. In unique_tracks.txt the song title is the
#           last field on each line, so every title read from the file keeps the newline character
#           that ends its line.

<class 'str'>
<class 'str'>
Nervous

False
True
TRXPSWB128F930E511
SOZRCWW12AF72A07F7
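
The trailing newline could also be removed at parse time, so that titles can be looked up as typed. The following is a minimal sketch, assuming the same `<SEP>`-separated layout described above. `parse_unique_tracks` is a hypothetical helper (not part of the MSD tooling), and the artist name in the sample line is a placeholder.

```python
def parse_unique_tracks(lines):
    """Yield (track_id, song_id, artist, title) tuples from reverse-index lines.

    Hypothetical helper: assumes each line looks like
    track id<SEP>song id<SEP>artist name<SEP>song title
    """
    for raw in lines:
        # Remove only the line terminator so titles keep any inner spaces.
        fields = raw.rstrip("\r\n").split("<SEP>")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        yield tuple(fields)


# One sample line; the artist name is a placeholder, not taken from the file.
sample = [
    "TRXPSWB128F930E511<SEP>SOZRCWW12AF72A07F7<SEP>Some Artist<SEP>Nervous\n",
]
rows = list(parse_unique_tracks(sample))
title_to_trackid = {title: track_id for track_id, _, _, title in rows}
print(title_to_trackid["Nervous"])  # the lookup works without "\n"
```

With the real file, the same generator could be fed an open file object, for example `parse_unique_tracks(open("unique_tracks.txt", encoding="utf-8"))`.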


Below, I try to access the song "Nervous" in the Million Song Subset, but it is not in the subset. I need a way to determine which songs from the reverse-index file exist in the Million Song Subset, and to make a list of them.

In [6]:
# preparing machinery to work with hdf5 files, the file format of the data in MSD and Million Song Subset

# !wget https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/master/PythonSrc/hdf5_getters.py
!pip install tables
import hdf5_getters as GETTERS 
# bug in hdf5_getters.py: line 39: tables.openFile(h5filename, mode='r') should be tables.open_file(h5filename, mode='r')
import os

Defaulting to user installation because normal site-packages is not writeable


In [7]:
# helper functions:
# - to construct the path to the data file.
# - to extract song features from data file.
#
# Citation: 
# Thierry Bertin-Mahieux (2011) Columbia University
# tb2332@columbia.edu
# http://millionsongdataset.com/sites/default/files/create_genre_dataset.py.txt

def path_from_trackid(msddir,trackid):
    """
    Create a full path from the main MSD dir and a track id.
    Does not check if the file actually exists.
    """
    p = os.path.join(msddir,trackid[2])
    p = os.path.join(p,trackid[3])
    p = os.path.join(p,trackid[4])
    p = os.path.join(p,trackid.upper()+'.h5')
    return p

import numpy as np  # used for the timbre mean and variance below

def feat_from_file(path):
    """
    Extract a list of features from an HDF5 song file, all converted to strings.
    """
    feats = []
    h5 = GETTERS.open_h5_file_read(path)
    # basic info (string fields come back as bytes under Python 3, so decode them)
    feats.append( GETTERS.get_track_id(h5).decode('utf-8') )
    feats.append( GETTERS.get_artist_name(h5).decode('utf-8').replace(',','') )
    feats.append( GETTERS.get_title(h5).decode('utf-8').replace(',','') )
    feats.append( GETTERS.get_loudness(h5) )
    feats.append( GETTERS.get_tempo(h5) )
    feats.append( GETTERS.get_time_signature(h5) )
    feats.append( GETTERS.get_key(h5) )
    feats.append( GETTERS.get_mode(h5) )
    feats.append( GETTERS.get_duration(h5) )
    # timbre: per-dimension mean and variance over all segments
    timbre = GETTERS.get_segments_timbre(h5)
    avg_timbre = np.average(timbre,axis=0)
    for k in avg_timbre:
        feats.append(k)
    var_timbre = np.var(timbre,axis=0)
    for k in var_timbre:
        feats.append(k)
    # done with h5 file
    h5.close()
    # return a list of strings (map() alone would return a lazy iterator in Python 3)
    return [str(x) for x in feats]

In [8]:
# constructing the path
path = path_from_trackid("MillionSongSubset", song_title_to_trackid['Nervous\n'])
path

'MillionSongSubset/X/P/S/TRXPSWB128F930E511.h5'

In [9]:
# what happens if we try to open a file that does not exist?
h5 = GETTERS.open_h5_file_read(path)

OSError: ``MillionSongSubset/X/P/S/TRXPSWB128F930E511.h5`` does not exist

In [10]:
# printing if path exists
os.path.exists(path)

False

In [11]:
song_exists_in_subset = [os.path.exists(path_from_trackid("MillionSongSubset", track_ids[i])) for i in range(len(track_ids))]
song_exists_in_subset[:10]

[False, False, False, False, False, False, False, False, False, False]

In [12]:
# Indices, for all of song_titles[1...1_000_000], track_ids[1...1_000_000], and song_ids[1...1_000_000], 
# of songs that exist in the Million Song Subset
existing_song_indices = [i for i in range(len(song_exists_in_subset)) if song_exists_in_subset[i]]
print(existing_song_indices[:50])
print(len(existing_song_indices)) # should print 10000

[233581, 233584, 233597, 233601, 233605, 233608, 233610, 233616, 233618, 233621, 233623, 233624, 233627, 233629, 233637, 233638, 233646, 233648, 233651, 233658, 233659, 233660, 233665, 233667, 233669, 233671, 233680, 233684, 233685, 233686, 233688, 233691, 233699, 233705, 233719, 233720, 233723, 233726, 233731, 233741, 233742, 233748, 233753, 233758, 233765, 233767, 233768, 233769, 233771, 233773]
10000
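
An alternative to calling `os.path.exists` once for each of the million lines is to walk the extracted directory once and read the track IDs off the file names. This is only a sketch, assuming every `.h5` file is named after its track ID, as the paths above suggest. `collect_track_ids` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real subset.

```python
import os
import tempfile


def collect_track_ids(root):
    """Return the set of track IDs found as <TRACK_ID>.h5 files under root."""
    found = set()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext == ".h5":
                found.add(stem)
    return found


# Self-check on a temporary directory that mimics the subset layout.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "B", "G", "M")
    os.makedirs(leaf)
    open(os.path.join(leaf, "TRBGMWQ12903CC23CD.h5"), "w").close()
    open(os.path.join(leaf, "notes.txt"), "w").close()  # ignored: not an .h5 file
    demo_ids = collect_track_ids(root)
print(demo_ids)
```

On the real data, `subset_ids = collect_track_ids("MillionSongSubset")` followed by `[i for i, t in enumerate(track_ids) if t in subset_ids]` should yield the same kind of index list as the cell above.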


In [13]:
# song_titles_subset = [song_titles[x] for x in existing_song_indices]
# print(song_titles_subset[:10])
# len(song_titles_subset)

In [14]:
# track_ids_subset = [track_ids[x] for x in existing_song_indices]
# print(track_ids_subset[:10])
# len(track_ids_subset)

In [15]:
# song_ids_subset = [song_ids[i] for i in existing_song_indices]
# print(song_ids_subset[:10])
# len(song_ids_subset)

In [16]:
# check that all the songs at the extracted indices do indeed exist in the million song subset
passed = 1
failed_indices = []
for i in range(10_000):
    if not os.path.exists(path_from_trackid("MillionSongSubset", track_ids[existing_song_indices[i]])):
        passed = 0
        failed_indices.append(existing_song_indices[i])
if passed == 1: print("pass")
else: print("fail")
print(len(failed_indices))
    

pass
0


In [17]:
# check that the mapping between song title and track id has remained intact
passed = 1
failed_indices = []
for i in range(10_000):
    if not track_ids[existing_song_indices[i]] == song_title_to_trackid[song_titles[existing_song_indices[i]]]:
        passed = 0
        failed_indices.append(existing_song_indices[i])
if passed == 1: print("pass")
else: print("fail")
print(len(failed_indices))
print(failed_indices[:10])

fail
3399
[233597, 233605, 233608, 233618, 233621, 233624, 233627, 233629, 233646, 233651]


In [18]:
for i in range(10):
    print(f"song title: {song_titles[failed_indices[i]]}, \
        correct track_id: {track_ids[failed_indices[i]]}, \
        incorrect track_id: {song_title_to_trackid[song_titles[failed_indices[i]]]}")

song title: Kicking And Screaming
,         correct track_id: TRBGMOG128F92D75BD,         incorrect track_id: TRDLSCJ12903D00EF5
song title: Terraplane Blues
,         correct track_id: TRBGMAW128F4231326,         incorrect track_id: TRKUJNM128F4233D6E
song title: Toda Mi Vida
,         correct track_id: TRBGMJB128F92E5936,         incorrect track_id: TRAWHIW128F421BC96
song title: Lookin' for My Baby
,         correct track_id: TRBGMWD12903D0F674,         incorrect track_id: TRLDQER128F426A286
song title: La Rosa
,         correct track_id: TRBGMJD128F4266852,         incorrect track_id: TRVQOSU128F935DC17
song title: The Harbour
,         correct track_id: TRBGMHU128EF356CB8,         incorrect track_id: TRXUTTK128F92FF227
song title: Slow And Low
,         correct track_id: TRBGMUA128F92E9A1C,         incorrect track_id: TRXNSFR12903CDBC9E
song title: Dio Nisia
,         correct track_id: TRBGMKO128F933A55B,         incorrect track_id: TRLQWFZ128F933A87F
song title: I Don't Like You


In [19]:
trackid_to_song = {track_ids[i]:song_titles[i] for i in range(num_lines)}
for i in range(10):
    print(f"{trackid_to_song[song_title_to_trackid[song_titles[failed_indices[i]]]]}{song_titles[failed_indices[i]]}")

# do both incorrect track id and correct track id map to the same song?
one_song_two_trackids_part_1 = True
for i in range(len(failed_indices)):
    if trackid_to_song[track_ids[failed_indices[i]]] != trackid_to_song[song_title_to_trackid[song_titles[failed_indices[i]]]]:
        one_song_two_trackids_part_1 = False
if one_song_two_trackids_part_1:
    print("Different track ids map to the same song title. All the songs at the failed indices have two track ids")

one_song_two_trackids_part_2 = True
for i in range(len(failed_indices)):
    if track_ids[failed_indices[i]] == song_title_to_trackid[song_titles[failed_indices[i]]]:
        one_song_two_trackids_part_2 = False
if one_song_two_trackids_part_2:
    print("Double checked that the track ids mapped to the same song are different. All the songs at the failed indices have two track ids")



Kicking And Screaming
Kicking And Screaming

Terraplane Blues
Terraplane Blues

Toda Mi Vida
Toda Mi Vida

Lookin' for My Baby
Lookin' for My Baby

La Rosa
La Rosa

The Harbour
The Harbour

Slow And Low
Slow And Low

Dio Nisia
Dio Nisia

I Don't Like You
I Don't Like You

Take The Time
Take The Time

Different track ids map to the same song title. All the songs at the failed indices have two track ids
Double checked that the track ids mapped to the same song are different. All the songs at the failed indices have two track ids


The "incorrect" track ids found through song_title_to_trackid\[song_titles\[failed_indices\[i\]\]\] all map back to the same song title, song_titles\[failed_indices\[i\]\], as the "correct" track ids track_ids\[failed_indices\[i\]\]. Yet the two track ids differ, so each title at a failed index is associated with at least two track ids. In other words, these titles appear more than once in unique_tracks.txt, and the title-keyed dictionary kept only the last track id it saw for each of them. Since a lookup by title is ambiguous for these songs, we discard the song titles, song ids, and track ids at those indices.
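
The overwriting behavior can be reproduced on a toy example. The sketch below uses three (track id, title) pairs taken from the outputs above; their order in `rows` is illustrative. It contrasts a plain dictionary, which keeps one track id per title, with a grouping that keeps all of them.

```python
from collections import defaultdict

rows = [
    ("TRBGMOG128F92D75BD", "Kicking And Screaming"),
    ("TRBGMWQ12903CC23CD", "Life's Blood"),
    ("TRDLSCJ12903D00EF5", "Kicking And Screaming"),
]

# A plain dict keeps only the last track id seen for each title.
last_wins = {title: track_id for track_id, title in rows}

# Grouping keeps every track id, which makes repeated titles visible.
title_to_trackids = defaultdict(list)
for track_id, title in rows:
    title_to_trackids[title].append(track_id)

repeated = {t: ids for t, ids in title_to_trackids.items() if len(ids) > 1}
print(last_wins["Kicking And Screaming"])
print(repeated)
```

Note that a shared title does not by itself show that two track ids refer to the same recording; different artists can release songs with the same name.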

In [20]:
# discarding repeated songs
failed_index_set = set(failed_indices)  # set membership avoids rescanning the list for every index
existing_song_indices = [i for i in existing_song_indices if i not in failed_index_set]
print(len(existing_song_indices) == 10000 - len(failed_indices))
len(existing_song_indices)

True


6601

In [21]:
# check that all the songs at the extracted indices do indeed exist in the million song subset
passed = 1
failed_indices = []
for i in range(len(existing_song_indices)):
    if not os.path.exists(path_from_trackid("MillionSongSubset", track_ids[existing_song_indices[i]])):
        passed = 0
        failed_indices.append(existing_song_indices[i])
if passed == 1: print("pass")
else: print("fail")
print(len(failed_indices))

pass
0


In [22]:
# check that the mapping between song title and track id has remained intact
passed = 1
failed_indices = []
for i in range(len(existing_song_indices)):
    if not (track_ids[existing_song_indices[i]] == song_title_to_trackid[song_titles[existing_song_indices[i]]]
        and song_titles[existing_song_indices[i]] == trackid_to_song[song_title_to_trackid[song_titles[existing_song_indices[i]]]]):
            passed = 0
            failed_indices.append(existing_song_indices[i])
if passed == 1: print("pass")
else: print("fail")
print(len(failed_indices))

pass
0


The data we can retrieve from the Million Song Subset is represented by:
- song_titles[1...1_000_000]
- track_ids[1...1_000_000]
- song_ids[1...1_000_000]
- existing_song_indices[1...6601]
- the dictionaries, song_title_to_trackid and song_title_to_songid 

In [23]:
# some songs available in the million songs subset:
for i in range(10): print(song_titles[existing_song_indices[i]])

Life's Blood

End Of The Beginning

Black Connect 3

Lift Jesus Up

Jubilation T. Cornpone

The Way Of Love (West Coast Diaries Vol. 1 Album Version)

L'idole des femmes

All Or Nothing  (Wonderful World Album Version)

Together We Stand (Album Version)

Cet Air Étrange (Take 1 - Abbey Road Rough Mix)



For convenience, I will create a new dictionary existing_song_to_trackid to hold the song title to track id mappings for songs in the Million Song Subset...

In [71]:
song_titles[4].upper()

'TANGLE OF ASPENS\n'

In [84]:
existing_song_to_trackid = {song_titles[i].upper():track_ids[i] for i in existing_song_indices}
print(len(existing_song_to_trackid) == len(existing_song_indices))
print(len(existing_song_to_trackid), len(existing_song_indices))
for i in range(10):
    print(existing_song_to_trackid[song_titles[existing_song_indices[i]].upper()])

False
6600 6601
TRBGMWQ12903CC23CD
TRBGMJF128F425E8EA
TRBGMXY128F92FC2B2
TRBGMMC12903CEFD3C
TRBGMNM128F932D140
TRBGMZJ128F4264AA6
TRBGMAZ12903CC3707
TRBGMIX128F9303ED6
TRBGWRL128F9301CB9
TRBGWRN128F92F1E46


Now, we can finally retrieve info about a specific song in the Million Song Subset.

In [24]:
path = path_from_trackid("MillionSongSubset", song_title_to_trackid["Life's Blood\n"])
print(path)
os.path.exists(path)

MillionSongSubset/B/G/M/TRBGMWQ12903CC23CD.h5


True

In [25]:
# open the hdf5 file specified by path
h5 = GETTERS.open_h5_file_read(path)
# extract song info from the file
print(GETTERS.get_num_songs(h5))
print(GETTERS.get_artist_familiarity(h5))
print(GETTERS.get_artist_name(h5))
print(GETTERS.get_title(h5))
# and so on

1
0.7523659740980329
b'Eighteen Visions'
b"Life's Blood"
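
The `b'...'` prefix in the output shows that the getters return byte strings under Python 3. A small helper can convert such values to regular strings before they are printed or compared with scraped titles. This is a sketch: `as_text` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
def as_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes-like object."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return str(value)


print(as_text(b"Eighteen Visions"))
print(as_text(b"Life's Blood"))
```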


## Compiling a List of Rock and Jazz Songs

Sources:
- Rock: https://www.rocknrollamerica.net/Top1000.html
- Rock: https://digitaldreamdoor.com/pages/best_songsddd.html
- Jazz: https://goldstandardsonglist.com/Pages_Sort_1a/Sort_1a_Jazz.htm#JAZZ%20LIST%20TOP
- Jazz: https://www.jazz24.org/the-jazz-100/

In [26]:
!pip3 install pandas 
# !sudo apt-get install python3-lxml
!pip3 install lxml

# Installing with `!pip install pandas` and `!pip install lxml` (rather than pip3) led to
# "ImportError: lxml not found, please install it" when calling pd.read_html()

import pandas as pd
import lxml

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


Let's create a list of rock songs from the rock source https://www.rocknrollamerica.net/Top1000.html

In [27]:
rock_df = pd.read_html("https://www.rocknrollamerica.net/Top1000.html")

In [28]:
len(rock_df)

1

In [29]:
rock_df # a list of pandas dataframe objects

[         0                         1               2
 0     Rank                      Song          Artist
 1        1        Stairway to Heaven    Led Zeppelin
 2        2                  Hey Jude         Beatles
 3        3  All Along the Watchtower   Hendrix, Jimi
 4        4              Satisfaction  Rolling Stones
 ...    ...                       ...             ...
 996    996             Strange Magic             ELO
 997    997       Great White Buffalo      Ted Nugent
 998    998                Outlaw Man          Eagles
 999    999         Get Out Of Denver      Seger, Bob
 1000  1000         Flying High Again   Ozzy Osbourne
 
 [1001 rows x 3 columns]]

In [30]:
rock_df[0] # one particular dataframe object in the list

         0                         1               2
0     Rank                      Song          Artist
1        1        Stairway to Heaven    Led Zeppelin
2        2                  Hey Jude         Beatles
3        3  All Along the Watchtower   Hendrix, Jimi
4        4              Satisfaction  Rolling Stones
...    ...                       ...             ...
996    996             Strange Magic             ELO
997    997       Great White Buffalo      Ted Nugent
998    998                Outlaw Man          Eagles
999    999         Get Out Of Denver      Seger, Bob


In [31]:
rock_df[0].columns

Int64Index([0, 1, 2], dtype='int64')

In [32]:
rock_df[0][1] # extracting the song title column from the rock_df[0] dataframe

0                           Song
1             Stairway to Heaven
2                       Hey Jude
3       All Along the Watchtower
4                   Satisfaction
                  ...           
996                Strange Magic
997          Great White Buffalo
998                   Outlaw Man
999            Get Out Of Denver
1000           Flying High Again
Name: 1, Length: 1001, dtype: object

In [33]:
rock_songs = rock_df[0][1]
rock_songs = rock_songs[1:] # discarding the title, "Song," of the extracted column
rock_songs

1             Stairway to Heaven
2                       Hey Jude
3       All Along the Watchtower
4                   Satisfaction
5           Like A Rolling Stone
                  ...           
996                Strange Magic
997          Great White Buffalo
998                   Outlaw Man
999            Get Out Of Denver
1000           Flying High Again
Name: 1, Length: 1000, dtype: object

In [34]:
type(rock_songs) # the data type of the column of song titles is a pandas series. Need a list of strings instead.

pandas.core.series.Series

In [35]:
rock_songs = rock_songs.astype(str).tolist() # converting the song title column into a list of strings
rock_songs

['Stairway to Heaven',
 'Hey Jude',
 'All Along the Watchtower',
 'Satisfaction',
 'Like A Rolling Stone',
 'Another Brick In The Wall',
 "Won't Get Fooled Again",
 'Hotel California',
 'Layla',
 'Sweet Home Alabama',
 'Bohemian Rhapsody',
 'Riders on the Storm',
 'Rock and Roll',
 'Barracuda',
 'La Grange',
 'Dream On',
 'You Really Got Me',
 'More Than a Feeling',
 'Sultans of Swing',
 'You Shook Me All Night Long',
 'Kashmir',
 'Lola',
 'Carry on Wayward Son',
 'Tiny Dancer',
 'Locomotive Breath',
 "I Still Haven't Found",
 'Magic Carpet Ride',
 'Free Bird',
 'Purple Haze',
 'Tom Sawyer',
 'Let It Be',
 "Baba O'Riley",
 'The Joker',
 'Roxanne',
 'Time',
 "It's A Long Way to the Top",
 'Whole Lotta Love',
 'The Chain',
 "I've Seen All Good People",
 "For What It's Worth",
 'Black Magic Woman',
 'Nights in White Satin',
 'While My Guitar Gently Weeps',
 'Gimme Shelter',
 'Gold Dust Woman',
 'Fortunate Son',
 'American Pie',
 'Bad Company',
 "Waitin' For The Bus/Jesus Just Left",
 'Ove

Amazing. We finally have our list of rock songs in rock_songs. Now let's do the same for jazz songs...

In [36]:
jazz_df_1 = pd.read_html("https://goldstandardsonglist.com/Pages_Sort_1a/Sort_1a_Jazz.htm#JAZZ%20LIST%20TOP")

In [37]:
len(jazz_df_1)

7

In [38]:
jazz_df_1[5] # the 6th dataframe in jazz_df_1 is the one we want

Unnamed: 0,0,1,2,3
0,1902,Bill Bailey Won't You Please Come Home,Jazz,"Cannon, Hughie; REC: Louis Armstrong; Pearl Ba..."
1,1900,Creole Belles,Jazz,"Lampe, J. Bodewalt; REC: The New Orleans Ragti..."
2,,,,
3,"JAZZ, 1910 - 1919 (16 Songs)","JAZZ, 1910 - 1919 (16 Songs)","JAZZ, 1910 - 1919 (16 Songs)","JAZZ, 1910 - 1919 (16 Songs)"
4,1911,Alexander's Ragtime Band,Jazz,"Berlin, Irving; REC: Arthur Collins, Byron Har..."
...,...,...,...,...
880,1998,"Taste Of Voodoo, A",Jazz,"Zorn, John; REC: John Zorn"
881,1995,Thelonius Melodius,Jazz,"McLaughlin, John; REC: John McLaughlin; Chick ..."
882,1996,Waltz For Ruth,Jazz,"Haden, Charlie; REC: Charlie Haden & Pat Metheny"
883,1999,What's He Building?,Jazz,"Waits, Tom; REC: Tom Waits"


In [39]:
print(len(jazz_df_1[5]))

885


In [40]:
jazz_songs = [jazz_df_1[5][1][i] for i in range(len(jazz_df_1[5])) if jazz_df_1[5][2][i] == "Jazz"]
print(len(jazz_songs))
jazz_songs[:10]

867


["Bill Bailey Won't You Please Come Home",
 'Creole Belles',
 "Alexander's Ragtime Band",
 'Back Home Again In Indiana',
 'Dardanella',
 "Darktown Strutters' Ball, The",
 'For Me And My Gal',
 "I Ain't Got Nobody",
 "I'm Always Chasing Rainbows",
 'Livery Stable Blues']
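
The list comprehension above indexes the DataFrame cell by cell. A boolean mask expresses the same filter more directly. The sketch below runs on a small hand-made frame that mimics the integer column labels of the scraped table, with rows copied from the output above.

```python
import pandas as pd

# Toy frame with the same integer column labels (0: year, 1: title, 2: genre).
df = pd.DataFrame(
    [
        ["1902", "Bill Bailey Won't You Please Come Home", "Jazz"],
        [None, None, None],                      # empty separator row
        ["JAZZ, 1910 - 1919 (16 Songs)"] * 3,    # section header row
        ["1911", "Alexander's Ragtime Band", "Jazz"],
    ]
)

# Keep titles only from rows whose genre column is exactly "Jazz".
jazz_titles = df.loc[df[2] == "Jazz", 1].tolist()
print(jazz_titles)
```

Applied to the scraped table, the equivalent call would be `jazz_df_1[5].loc[jazz_df_1[5][2] == "Jazz", 1].tolist()`.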

First jazz song source scraped. Let's go to the second one.

In [41]:
jazz_df_2 = pd.read_html("https://www.jazz24.org/the-jazz-100/")

In [42]:
len(jazz_df_2)

1

In [43]:
jazz_df_2[0]

Unnamed: 0,0,1,2
0,,Song,Artist
1,1.0,Take Five,Dave Brubeck
2,2.0,So What,Miles Davis
3,3.0,Take The A Train,Duke Ellington
4,4.0,Round Midnight,Thelonious Monk
...,...,...,...
96,96.0,Ceora,Lee Morgan
97,97.0,Sophisticated Lady,Duke Ellington
98,98.0,Sugar,Stanley Turrentine
99,99.0,Footprints,Wayne Shorter


In [44]:
print(len(jazz_df_2[0]))

101


In [45]:
print(jazz_df_2[0][1].astype(str).tolist()[:10])
jazz_df_2[0][1].astype(str).tolist()[1:11]

['Song', 'Take Five', 'So What', 'Take The A Train', 'Round Midnight', 'My Favorite Things', 'A Love Supreme (Acknowledgment)', 'All Blues', 'Birdland', 'The Girl From Ipanema']


['Take Five',
 'So What',
 'Take The A Train',
 'Round Midnight',
 'My Favorite Things',
 'A Love Supreme (Acknowledgment)',
 'All Blues',
 'Birdland',
 'The Girl From Ipanema',
 'Sing, Sing, Sing']

In [46]:
jazz_songs = jazz_songs + jazz_df_2[0][1].astype(str).tolist()[1:]

In [47]:
len(jazz_songs)

967

In [48]:
jazz_songs[:10]

["Bill Bailey Won't You Please Come Home",
 'Creole Belles',
 "Alexander's Ragtime Band",
 'Back Home Again In Indiana',
 'Dardanella',
 "Darktown Strutters' Ball, The",
 'For Me And My Gal',
 "I Ain't Got Nobody",
 "I'm Always Chasing Rainbows",
 'Livery Stable Blues']

Now, we have a list of about 1000 jazz songs and 1000 rock songs. Time to search through the Million Song Subset for them.

## Finding Rock and Jazz Song Track-IDs in the Million Song Subset

Finding Rock songs...

In [49]:
# songs in Million Song Subset
for i in range(10): print(song_titles[existing_song_indices[i]])

Life's Blood

End Of The Beginning

Black Connect 3

Lift Jesus Up

Jubilation T. Cornpone

The Way Of Love (West Coast Diaries Vol. 1 Album Version)

L'idole des femmes

All Or Nothing  (Wonderful World Album Version)

Together We Stand (Album Version)

Cet Air Étrange (Take 1 - Abbey Road Rough Mix)



In [75]:
rock_trackids = []
invalid_rock_songs = []
for x in rock_songs:
    try:
        rock_trackids.append(existing_song_to_trackid[x.upper() + '\n'])
    except KeyError:
        invalid_rock_songs.append(x)
len(rock_trackids), len(invalid_rock_songs)

(8, 992)

We only found 8 out of about 1000 rock songs. This is no good. Let's try the same for jazz songs. 

In [76]:
jazz_trackids = []
invalid_jazz_songs = []
for x in jazz_songs:
    try:
        jazz_trackids.append(existing_song_to_trackid[x.upper() + '\n'])
    except KeyError:
        invalid_jazz_songs.append(x)
len(jazz_trackids), len(invalid_jazz_songs)

(6, 961)

Only 6 usable jazz song track ids found... Disappointing. Time to try our backup plan.
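
One plausible reason for the low hit rate is that exact string matching is brittle: the subset titles printed earlier often carry suffixes such as "(Album Version)", and the scraped lists differ from the MSD in punctuation and capitalization. The sketch below shows one way titles could be normalized before comparison. The rules in `normalize_title` are assumptions chosen for illustration, and looser matching can also pair up different songs that happen to share a title.

```python
import re


def normalize_title(title):
    """Reduce a song title to a loose matching key (illustrative rules only)."""
    key = title.strip().casefold()
    key = re.sub(r"\([^)]*\)", " ", key)   # drop parenthetical suffixes
    key = re.sub(r"[^\w\s]", "", key)      # drop punctuation
    return " ".join(key.split())           # collapse runs of whitespace


print(normalize_title("All Or Nothing  (Wonderful World Album Version)\n"))
print(normalize_title("Together We Stand (Album Version)\n"))
print(normalize_title("Baba O'Riley"))
```

Both the scraped titles and the MSD titles would need to pass through the same function before being used as dictionary keys.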

## Backup Plan: Find Jazz and Rock Songs in the Million Song Subset Using the MAGD and Tagtraum Genre Datasets

In [85]:
!wget http://www.ifs.tuwien.ac.at/mir/msd/partitions/msd-MAGD-genreAssignment.cls

--2021-04-04 14:46:48--  http://www.ifs.tuwien.ac.at/mir/msd/partitions/msd-MAGD-genreAssignment.cls
Resolving www.ifs.tuwien.ac.at (www.ifs.tuwien.ac.at)... 128.131.167.11
Connecting to www.ifs.tuwien.ac.at (www.ifs.tuwien.ac.at)|128.131.167.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11625230 (11M) [text/plain]
Saving to: ‘msd-MAGD-genreAssignment.cls’


2021-04-04 14:46:59 (1017 KB/s) - ‘msd-MAGD-genreAssignment.cls’ saved [11625230/11625230]



In [None]:
# read genre assignments
trackids_to_genre = {}
with open("msd-MAGD-genreAssignment.cls", 'r') as f:
    for x in f:
        line = x.split("\t")  # fields are tab-separated: track id, then genre
        trackids_to_genre[line[0]] = line[1]
print(len(trackids_to_genre))

In [130]:
# list of track ids in Million Song Subset without song title repeats
subset_track_ids = [track_ids[i] for i in existing_song_indices]

In [131]:
# filtering out track ids with unknown genre from subset_track_ids
trackids_with_unknown_genre = []                # compiling a list of track ids with unknown genre
print_count = 0
for x in subset_track_ids:
    try:
        trackids_to_genre[x]
        if print_count < 10:
            print(trackids_to_genre[x])
        print_count = print_count + 1
    except KeyError:
        trackids_with_unknown_genre.append(x)
print(len(subset_track_ids))
print(len(trackids_with_unknown_genre))

for x in trackids_with_unknown_genre:            # filtering out track ids with unknown genre
    subset_track_ids.remove(x)
print(len(subset_track_ids) == len(existing_song_indices) - len(trackids_with_unknown_genre))

Rap

Pop_Rock

Religious

Pop_Rock

Pop_Rock

Pop_Rock

Pop_Rock

Pop_Rock

Pop_Rock

Pop_Rock

6601
3978
True


In [132]:
# what are the genres that are included in the MAGD genre annotations?
# how many track ids do we have for each genre included in the MAGD genre annotations?

genres_and_count = {}
for x in subset_track_ids:
    try:
        genres_and_count[trackids_to_genre[x]] += 1
    except KeyError:
        genres_and_count[trackids_to_genre[x]] = 1
genres_and_count
    

{'Rap\n': 168,
 'Pop_Rock\n': 1265,
 'Religious\n': 146,
 'Folk\n': 32,
 'New Age\n': 63,
 'Latin\n': 176,
 'International\n': 117,
 'Vocal\n': 22,
 'Country\n': 105,
 'Avant_Garde\n': 9,
 'Jazz\n': 99,
 'RnB\n': 64,
 'Blues\n': 63,
 'Easy_Listening\n': 13,
 'Reggae\n': 76,
 'Electronic\n': 139,
 'Comedy_Spoken\n': 34,
 'Stage \n': 25,
 'Classical\n': 4,
 'Children\n': 2,
 'Holiday\n': 1}

The counts for rock (Pop_Rock) and jazz are not bad, but they come from genre annotations for only 2623 (6601 - 3978) of the 6601 track ids. Let's look at the Tagtraum genre annotations.
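
The genre keys in the output above still carry line endings (and, for `'Stage \n'`, a trailing space). A sketch of the same tally with `collections.Counter`, stripping whitespace first, is shown below. The small dictionary is a toy stand-in for `trackids_to_genre`, and the last track id in it is a placeholder.

```python
from collections import Counter

# Toy stand-in for trackids_to_genre; values keep the raw line endings.
trackids_to_genre = {
    "TRBGEHK12903CEEFC0": "Jazz\n",
    "TRBGMAZ12903CC3707": "Pop_Rock\n",
    "TRBGWGD128F42418B2": "Pop_Rock\n",
    "TRAAAAA000000000000": "Stage \n",  # placeholder track id
}
subset_track_ids = list(trackids_to_genre)

# strip() removes the newline and any stray spaces before counting.
genre_counts = Counter(trackids_to_genre[t].strip() for t in subset_track_ids)
print(genre_counts.most_common())
```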

In [133]:
!wget https://www.tagtraum.com/genres/msd_tagtraum_cd2.cls.zip

--2021-04-04 16:31:57--  https://www.tagtraum.com/genres/msd_tagtraum_cd2.cls.zip
Resolving www.tagtraum.com (www.tagtraum.com)... 2a01:238:20a:202:1077::, 81.169.145.77
Connecting to www.tagtraum.com (www.tagtraum.com)|2a01:238:20a:202:1077::|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2473570 (2.4M) [application/zip]
Saving to: ‘msd_tagtraum_cd2.cls.zip’


2021-04-04 16:32:06 (295 KB/s) - ‘msd_tagtraum_cd2.cls.zip’ saved [2473570/2473570]



In [134]:
# !sudo apt install unzip 
!unzip msd_tagtraum_cd2.cls.zip

Archive:  msd_tagtraum_cd2.cls.zip
  inflating: msd_tagtraum_cd2.cls    


In [135]:
# getting track id - genre pairs from the tagtraum genre annotations file
tagtraum_trackids_to_genre = {}
with open("msd_tagtraum_cd2.cls", 'r') as f:
    for line in f:
        tagtraum_trackids_to_genre[line[0]] = line[1]

In [136]:
# filtering out track ids from trackids_with_unknown_genre that are not included in tagtraum_trackids_to_genre
subset_track_ids_1 = trackids_with_unknown_genre
trackids_with_unknown_genre = []
for x in subset_track_ids_1:
    try:
        tagtraum_trackids_to_genre[x]
    except KeyError:
        trackids_with_unknown_genre.append(x)
print(len(subset_track_ids_1))
print(len(trackids_with_unknown_genre))

3978
3978


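All 3978 track ids are still unmatched after the Tagtraum lookup. This does not necessarily mean that Tagtraum annotates only the track ids already covered by MAGD: the parsing cell above uses line\[0\] and line\[1\] on each raw line, which are its first two characters rather than the track id and genre fields, so no full track id could ever match a key. The Tagtraum result is therefore inconclusive. Let's move forward with the rock (Pop_Rock) and jazz track ids as annotated in MAGD.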
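
For reference, here is a sketch of a parser that splits each annotation line into fields before using them as dictionary keys. It assumes the `.cls` file is tab-separated with the track id in the first column and the genre in the second, and that lines starting with `#` are comments; these format details should be checked against the downloaded file. `parse_genre_lines` is a hypothetical helper, and the sample lines are illustrative rather than copied from the Tagtraum file.

```python
def parse_genre_lines(lines):
    """Map track id -> genre from tab-separated annotation lines."""
    mapping = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines (assumed format)
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no genre column on this line
        mapping[fields[0]] = fields[1].strip()
    return mapping


sample = [
    "# comment line\n",
    "TRBGEHK12903CEEFC0\tJazz\n",
    "TRBGMAZ12903CC3707\tPop_Rock\n",
]
genres = parse_genre_lines(sample)
print(genres)
```

Re-running the unknown-genre filter with a dictionary built this way would show whether Tagtraum covers any of the 3978 remaining track ids.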

In [137]:
# get rock song track ids from subset_track_ids
rock_trackids = []
for x in subset_track_ids:
    if trackids_to_genre[x] == "Pop_Rock\n":
            rock_trackids.append(x)
print(len(rock_trackids))
rock_trackids[:10]    

1265


['TRBGMAZ12903CC3707',
 'TRBGWGD128F42418B2',
 'TRBGGBF128F425E4D1',
 'TRBGGFE128E0785AC0',
 'TRBGHHB128F1487CD2',
 'TRBGHYG12903CEAC9C',
 'TRBGCWM12903CF5BF7',
 'TRBGCJQ128F1453CB3',
 'TRBGRAN128F9339A1B',
 'TRBGRGT12903CC5019']

In [138]:
# get jazz song track ids from subset_track_ids
jazz_trackids = []
for x in subset_track_ids:
    if trackids_to_genre[x] == "Jazz\n":
            jazz_trackids.append(x)
print(len(jazz_trackids))
jazz_trackids[:10] 

99


['TRBGEHK12903CEEFC0',
 'TRBGEZL12903CCC7E2',
 'TRBGDND128F9312B81',
 'TRBGKYA12903CF573B',
 'TRBHIPK128F92FFCF3',
 'TRBHAAD128F9341700',
 'TRBHYTA128F93425AC',
 'TRBCTKV128F42222BD',
 'TRBBBFO128F931535D',
 'TRBBTJG12903CD5B47']

In [139]:
# print some available rock songs
for i in range(10):
    print(f"{rock_trackids[i]} | {trackid_to_song[rock_trackids[i]]}")

TRBGMAZ12903CC3707 | L'idole des femmes

TRBGWGD128F42418B2 | When Rules Change

TRBGGBF128F425E4D1 | Rumour (Abstract Hip Hop Mix)

TRBGGFE128E0785AC0 | Coral Fang (Album Version)

TRBGHHB128F1487CD2 | I'm Your Money (12'' Extended Version) (2006 Digital Remaster)

TRBGHYG12903CEAC9C | When It Stings

TRBGCWM12903CF5BF7 | Enslaved By Propaganda

TRBGCJQ128F1453CB3 | Rakkahin -Be My Love-

TRBGRAN128F9339A1B | Creep Live Version

TRBGRGT12903CC5019 | 11th Street



In [140]:
# print some available jazz songs
for i in range(10):
    print(f"{jazz_trackids[i]} | {trackid_to_song[jazz_trackids[i]]}")

TRBGEHK12903CEEFC0 | Sannyasin

TRBGEZL12903CCC7E2 | Dance of the Blue Devils

TRBGDND128F9312B81 | Doodlin'

TRBGKYA12903CF573B | Take That!

TRBHIPK128F92FFCF3 | Des voiliers (Sur un poème de Claude Nougaro)

TRBHAAD128F9341700 | Latin Flute

TRBHYTA128F93425AC | Ayudame Freud

TRBCTKV128F42222BD | Tuscan Chica

TRBBBFO128F931535D | My Plastic Heart (Plastic Operator Remix)

TRBBTJG12903CD5B47 | Distant Cousin



## Write jazz_trackids and rock_trackids to a Text File Each

In [141]:
# jazz
with open("jazz_trackids.txt", "w") as f:  # "w" so a re-run does not append duplicate ids
    for track_id in jazz_trackids:
        f.write(track_id + '\n')

In [142]:
# Rock
with open("rock_trackids.txt", "w") as f:  # "w" so a re-run does not append duplicate ids
    for track_id in rock_trackids:
        f.write(track_id + '\n')
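
Since the track id files are meant for future lookups, it may help to have a matching reader. The sketch below writes a list of ids and reads it back on a temporary file; `write_track_ids` and `read_track_ids` are hypothetical helpers, and the two ids come from the jazz output above.

```python
import os
import tempfile


def write_track_ids(path, track_ids):
    """Write one track id per line, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(track_id + "\n" for track_id in track_ids)


def read_track_ids(path):
    """Read track ids back, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


ids = ["TRBGEHK12903CEEFC0", "TRBGEZL12903CCC7E2"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jazz_trackids.txt")
    write_track_ids(path, ids)
    write_track_ids(path, ids)  # a second run overwrites rather than appends
    round_trip = read_track_ids(path)
print(round_trip == ids)
```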

## Collect Jazz and Rock Data Files from the Million Song Subset

In [143]:
jazz_paths = [path_from_trackid("MillionSongSubset", jazz_trackids[i]) for i in range(len(jazz_trackids))]
print(len(jazz_paths) == len(jazz_trackids))
jazz_paths[:10]

True


['MillionSongSubset/B/G/E/TRBGEHK12903CEEFC0.h5',
 'MillionSongSubset/B/G/E/TRBGEZL12903CCC7E2.h5',
 'MillionSongSubset/B/G/D/TRBGDND128F9312B81.h5',
 'MillionSongSubset/B/G/K/TRBGKYA12903CF573B.h5',
 'MillionSongSubset/B/H/I/TRBHIPK128F92FFCF3.h5',
 'MillionSongSubset/B/H/A/TRBHAAD128F9341700.h5',
 'MillionSongSubset/B/H/Y/TRBHYTA128F93425AC.h5',
 'MillionSongSubset/B/C/T/TRBCTKV128F42222BD.h5',
 'MillionSongSubset/B/B/B/TRBBBFO128F931535D.h5',
 'MillionSongSubset/B/B/T/TRBBTJG12903CD5B47.h5']

In [144]:
rock_paths = [path_from_trackid("MillionSongSubset", rock_trackids[i]) for i in range(len(rock_trackids))]
print(len(rock_paths) == len(rock_trackids))
rock_paths[:10]

True


['MillionSongSubset/B/G/M/TRBGMAZ12903CC3707.h5',
 'MillionSongSubset/B/G/W/TRBGWGD128F42418B2.h5',
 'MillionSongSubset/B/G/G/TRBGGBF128F425E4D1.h5',
 'MillionSongSubset/B/G/G/TRBGGFE128E0785AC0.h5',
 'MillionSongSubset/B/G/H/TRBGHHB128F1487CD2.h5',
 'MillionSongSubset/B/G/H/TRBGHYG12903CEAC9C.h5',
 'MillionSongSubset/B/G/C/TRBGCWM12903CF5BF7.h5',
 'MillionSongSubset/B/G/C/TRBGCJQ128F1453CB3.h5',
 'MillionSongSubset/B/G/R/TRBGRAN128F9339A1B.h5',
 'MillionSongSubset/B/G/R/TRBGRGT12903CC5019.h5']

In [57]:
!mkdir "Jazz Subset"
!mkdir "Rock Subset"

In [145]:
import shutil # a module to move files between directories using python

In [None]:
# move .h5 files for jazz track ids into directory "Jazz Subset"
for x in jazz_paths:
    shutil.move(x, "Jazz Subset")

In [148]:
# move jazz_trackids.txt into "Jazz Subset"
shutil.move("jazz_trackids.txt", "Jazz Subset")

'Jazz Subset/jazz_trackids.txt'

In [147]:
# move .h5 files for rock track ids into directory "Rock Subset"
for x in rock_paths:
    shutil.move(x, "Rock Subset")

In [149]:
# move rock_trackids.txt into "Rock Subset"
shutil.move("rock_trackids.txt", "Rock Subset")

'Rock Subset/rock_trackids.txt'
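
Note that `shutil.move` removes the files from `MillionSongSubset`, so earlier cells that test `os.path.exists` on those paths would give different results if re-run afterwards. A copy-based alternative is sketched below. `copy_tracks` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile


def copy_tracks(paths, dest_dir):
    """Copy existing files into dest_dir; return the paths that were missing."""
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    for path in paths:
        if os.path.isfile(path):
            shutil.copy2(path, dest_dir)  # the source file stays in place
        else:
            missing.append(path)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "TRBGEHK12903CEEFC0.h5")
    open(src, "w").close()                             # empty placeholder file
    gone = os.path.join(tmp, "TRBGEZL12903CCC7E2.h5")  # never created
    dest = os.path.join(tmp, "Jazz Subset")
    missing = copy_tracks([src, gone], dest)
    copied = sorted(os.listdir(dest))
    source_kept = os.path.exists(src)
print(copied, source_kept, len(missing))
```

Copying roughly doubles the disk space used by the selected tracks, which is modest for a few hundred to a couple of thousand small `.h5` files but worth keeping in mind.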