## Data Reader
This notebook reads the Million Song Dataset (MSD) summary file. The file uses the same HDF5 format as the regular per-track files and contains all of the metadata, but none of the arrays (audio analysis, similar artists, tags). This data will eventually need to be joined with the MetroLyrics data.

### Get the Data
Summary file: http://millionsongdataset.com/sites/default/files/AdditionalFiles/msd_summary_file.h5

Structure of the summary file: https://github.com/tbertinmahieux/MSongsDB/blob/master/PythonSrc/hdf5_utils.py

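The following helper module helps with reading the data. Note that the command below saves it as `hdf5_getters_2.py`, while the cells that follow import a module named `hdf5_getters`: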

`wget https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/master/PythonSrc/hdf5_getters.py -O hdf5_getters_2.py`

In [2]:
import h5py
import numpy as np
import hdf5_getters
import pandas as pd
import os

%load_ext autoreload
%autoreload 2
DATA_DIR = 'data/'

In [3]:
dataFile = os.path.join(DATA_DIR, "msd_summary_file.h5")
features = {
    'metadata':["artist_familiarity", "artist_hotttnesss", "artist_id", "artist_mbid", "artist_playmeid", "artist_7digitalid", "artist_latitude", "artist_location", "artist_longitude", "artist_name", "release", "release_7digitalid", "song_id", "song_hotttnesss", "title", "track_7digitalid", "idx_similar_artists", "idx_artist_terms"],
    'analysis':["analysis_sample_rate", "audio_md5", "danceability", "duration", "end_of_fade_in", "energy", "key", "key_confidence", "loudness", "mode", "mode_confidence", "start_of_fade_out", "tempo", "time_signature", "time_signature_confidence", "track_id", "idx_segments_start", "idx_segments_confidence", "idx_segments_pitches", "idx_segments_timbre", "idx_segments_loudness_max", "idx_segments_loudness_max_time", "idx_segments_loudness_start", "idx_sections_start", "idx_sections_confidence", "idx_beats_start", "idx_beats_confidence", "idx_bars_start", "idx_bars_confidence", "idx_tatums_start", "idx_tatums_confidence"],
    'musicbrainz':["year"]
}

In [4]:
with h5py.File(dataFile, 'r') as f:
    print("The HDF5 file has the following groups:")
    print(f.keys())
    for group_name in f.keys():
        group = f[group_name]
        print("The group", group_name, "has the following keys: ")
        # Use a separate loop variable so the group name is not shadowed.
        for dataset_name in group.keys():
            print(dataset_name + ": ", group[dataset_name].shape)

The HDF5 file has the following groups:
<KeysViewHDF5 ['analysis', 'metadata', 'musicbrainz']>
The group analysis has the following keys: 
songs:  (1000000,)
The group metadata has the following keys: 
songs:  (1000000,)
The group musicbrainz has the following keys: 
songs:  (1000000,)
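
The output above shows that each group holds a single one-dimensional `songs` dataset with one row per track. As a minimal sketch, and assuming that h5py exposes these rows as a NumPy structured array (which the shapes above suggest but this notebook does not verify), a range of rows could be read in one slice and turned into a DataFrame. The helper name `read_songs_slice` and the tiny stand-in file with made-up values are assumptions for illustration only; they are not part of the MSD tooling.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def read_songs_slice(path, group, start, stop):
    """Read rows [start, stop) of <group>/songs into a DataFrame with one slice."""
    with h5py.File(path, "r") as f:
        rows = f[group]["songs"][start:stop]  # NumPy structured array
    return pd.DataFrame.from_records(rows)


# Tiny stand-in file with hypothetical values, so the sketch runs
# without the real summary file.
demo_dtype = np.dtype([("title", "S32"), ("artist_familiarity", "f8")])
demo_rows = np.array(
    [(b"Song A", 0.5), (b"Song B", 0.25), (b"Song C", 0.75)],
    dtype=demo_dtype,
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_summary.h5")
with h5py.File(demo_path, "w") as f:
    f.create_group("metadata").create_dataset("songs", data=demo_rows)

demo_frame = read_songs_slice(demo_path, "metadata", 0, 2)
print(demo_frame)
```

With the real file, the call would presumably look like `read_songs_slice(dataFile, 'metadata', 0, 1000)`, repeated for the `analysis` and `musicbrainz` groups and combined column-wise.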


In [5]:
data = hdf5_getters.open_h5_file_read(dataFile)

In [6]:
hdf5_getters.get_song(data, 1)

{'analysis_sample_rate': 22050,
 'artist_7digitalid': 113480,
 'artist_familiarity': 0.4396039666767154,
 'artist_hotttnesss': 0.3569921077564064,
 'artist_id': b'ARMVN3U1187FB3A1EB',
 'artist_latitude': nan,
 'artist_location': b'',
 'artist_longitude': nan,
 'artist_mbid': b'8d7ef530-a6fd-4f8f-b2e2-74aec765e0f9',
 'artist_name': b'Karkkiautomaatti',
 'artist_playmeid': -1,
 'audio_md5': b'ed222d07c83bac7689d52753610a513a',
 'danceability': 0.0,
 'duration': 156.55138,
 'end_of_fade_in': 0.258,
 'energy': 0.0,
 'key': 9,
 'key_confidence': 0.808,
 'loudness': -10.555,
 'mode': 1,
 'mode_confidence': 0.355,
 'release': b'Karkuteill\xc3\xa4',
 'release_7digitalid': 145266,
 'song_hotttnesss': 0.2998774882739778,
 'song_id': b'SOVFVAK12A8C1350D9',
 'start_of_fade_out': 148.66,
 'tempo': 150.778,
 'time_signature': 1,
 'time_signature_confidence': 0.0,
 'title': b'Tanssi vaan',
 'track_7digitalid': 1514808,
 'track_id': b'TRMMMKD128F425225D',
 'year': 1995}
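
String fields in the record above come back as `bytes` (for example `b'Karkuteill\xc3\xa4'`). A minimal sketch of converting them to `str`, assuming the strings are UTF-8 encoded, which holds for this sample but is not checked for the whole dataset; `decode_bytes_fields` is a hypothetical helper, not part of `hdf5_getters`:

```python
def decode_bytes_fields(song, encoding="utf-8"):
    """Return a copy of a song dict with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in song.items()
    }


# Subset of the record shown above.
sample = {
    "artist_name": b"Karkkiautomaatti",
    "release": b"Karkuteill\xc3\xa4",
    "title": b"Tanssi vaan",
    "year": 1995,
}
decoded = decode_bytes_fields(sample)
print(decoded["release"])  # Karkuteillä
```

Decoding early may also make the later join on artist and title text simpler, since the keys can then be compared as ordinary strings.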

### Create Pandas DataFrames from the HDF5 Data

In [16]:
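os.makedirs(os.path.join(DATA_DIR, 'chunks'), exist_ok=True)
for chunk in range(1000):
    # Collect the songs of this chunk in a list and build the DataFrame once;
    # DataFrame.append was removed in pandas 2.0 and copies the frame on every call.
    songs = [hdf5_getters.get_song(data, i)
             for i in range(1000*chunk, 1000*(chunk+1))]
    chunkFrame = pd.DataFrame(songs)
    print("Chunk: ", chunk+1, " done")
    chunkFrame.to_pickle(os.path.join(DATA_DIR, 'chunks', 'msddata' + str(chunk+1) + '.pkl'))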

Chunk:  1  done
Chunk:  2  done
Chunk:  3  done
...
Chunk:  59  done
Chunk:

KeyboardInterrupt: 
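
Because the run above was interrupted partway through, a restart would ideally skip the chunks that already exist on disk. The following is a minimal sketch of that idea; `pending_chunks` is a hypothetical helper, and the file name pattern `msddata<N>.pkl` is taken from the loop above. It assumes that every file already present was written completely.

```python
import os
import tempfile


def pending_chunks(n_chunks, out_dir, template="msddata{}.pkl"):
    """Yield (chunk_number, path) for chunks whose output file does not exist yet."""
    for number in range(1, n_chunks + 1):
        path = os.path.join(out_dir, template.format(number))
        if not os.path.exists(path):
            yield number, path


# Demo in a temporary directory where chunks 1 and 2 are already present.
demo_dir = tempfile.mkdtemp()
for done in (1, 2):
    open(os.path.join(demo_dir, "msddata{}.pkl".format(done)), "wb").close()

remaining = [number for number, _ in pending_chunks(5, demo_dir)]
print(remaining)  # [3, 4, 5]
```

In the notebook, the chunk loop could iterate over `pending_chunks(1000, 'data/chunks')` and read songs `1000*(number-1)` up to `1000*number` for each pending chunk. If an interrupt can occur while a pickle is being written, writing to a temporary name and renaming it afterwards would keep partial files from being mistaken for finished ones.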