In [2]:
import pandas as pd
import time
from IPython.display import display
from typing import List, Dict, Optional

# Data Acquisition & Import

## Primary Data Set and Platforms for Sharing our Data
The primary data source in this project was the **Million Playlist Dataset** provided by Spotify as part of the RecSys Challenge 2018.
https://labs.spotify.com/2018/05/30/introducing-the-million-playlist-dataset-and-recsys-challenge-2018/
The challenge ended on June 30th of 2018, so we used an archival copy of the dataset provided to us by the course staff.
All source code for this project is in a public GitHub repository:
https://github.com/IACS-CS-209-Group44/Spotify

Unfortunately, Git and GitHub are not well suited to working on large files, especially large binary files.
Because of the large size of this data set, we were forced to use a shared folder on Dropbox to share large files.
This makes it more challenging for us to share our data with the public and create a conveniently reproducible set of calculations.  We are committed to the goal of fully reproducible science including data science with large files.
For a full-scale research undertaking with a suitable budget, two ideas to better achieve this goal would be to host a publicly available database instance using a service like AWS and/or to share a container instance.  These techniques are beyond the scope of this project.  Anyone who is interested in accessing our dataset may email me at mse99@g.harvard.edu or michael.s.emanuel@gmail.com to request access.  Any reasonable request will be granted read-only access to the shared folder for as long as it is up.

## The JSON Data Set and the Choice of SQL Server Backend
The data provided by the MPD consists of 1,000 JSON files.  Each file is a "slice" of 1,000 playlists, and together the slices comprise the 1,000,000 playlists in the data set.  The JSON files are a highly denormalized representation of the data, meaning they have a large amount of duplicated or redundant data.  As one prominent example, each Spotify track has a unique identifier and a track name associated with this identifier.  The JSON files duplicate the full track name every time the track appears in a playlist.  We stored the data in a fully normalized format, with separate database tables for the logical entities of a Track, Playlist, PlaylistEntry, etc.  I will describe our data model in greater detail below.
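
To make the idea of normalization concrete, here is a minimal sketch that splits one toy slice into a track table and a playlist-entry table.  It is an illustration rather than our actual import code, which runs in SQL Server.  The JSON field names (`pid`, `tracks`, `pos`, `track_uri`, `track_name`) are assumed from the MPD file layout, the values are made up, and the helper `normalize_slice` is hypothetical.  It also assigns integer IDs in first-seen order, which is an assumption of the sketch.

```python
from typing import Dict, List, Tuple

# Toy slice in the MPD layout.  Field names are assumed; the values are made up.
toy_slice = {
    'playlists': [
        {'pid': 0, 'name': 'demo one', 'tracks': [
            {'pos': 0, 'track_uri': 'spotify:track:AAA', 'track_name': 'Song A'},
            {'pos': 1, 'track_uri': 'spotify:track:BBB', 'track_name': 'Song B'},
        ]},
        {'pid': 1, 'name': 'demo two', 'tracks': [
            {'pos': 0, 'track_uri': 'spotify:track:AAA', 'track_name': 'Song A'},
        ]},
    ]
}


def normalize_slice(mpd_slice: dict) -> Tuple[Dict[str, Tuple[int, str]],
                                              List[Tuple[int, int, int]]]:
    """Split one slice into a track table and a playlist-entry table."""
    # TrackUri -> (TrackID, TrackName); each track name is stored exactly once
    tracks: Dict[str, Tuple[int, str]] = {}
    # One (PlaylistID, Position, TrackID) tuple per playlist entry
    entries: List[Tuple[int, int, int]] = []
    for playlist in mpd_slice['playlists']:
        for tr in playlist['tracks']:
            uri = tr['track_uri']
            # Assign the next integer ID the first time a track is seen
            if uri not in tracks:
                tracks[uri] = (len(tracks) + 1, tr['track_name'])
            entries.append((playlist['pid'], tr['pos'], tracks[uri][0]))
    return tracks, entries


tracks, entries = normalize_slice(toy_slice)
print(f'{len(entries)} playlist entries reference {len(tracks)} distinct tracks.')
```

The playlist entries carry only small integers, while each track name lives in a single row of the track table.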

The choice to use SQL Server as our back end for data was an important strategic decision on this project.  
The JSON representation of the data is convenient for a person reviewing the data, but is extremely inefficient for doing large-scale computations.  As one example of how slow the JSON / Python API is, the data set comes with some Python utilities that do very simple summary calculations such as tabulating the most popular tracks.  This utility was on pace to take over 90 minutes to run on my desktop PC, which has top-of-the-line hardware for a desktop computer.

It was clear that we needed a far more efficient back end for storing our data in a more normalized form.  Most people in this course would probably have chosen to use Pandas and persisted a series of data frames.  That is a completely sound choice.  In my particular case, I have spent many years working with SQL (first MySQL, then SQL Server).  I have also invested a large amount of time and money configuring an instance of SQL Server running on a high performance server sitting on a rack in my basement.  By comparison, I am very new to Pandas.  I'm now proficient for basic tasks, but I have often spent hours trying to figure out how to do an operation I could do in a matter of minutes in SQL Server.  It is often true that the best tool for a given job is the one you know how to use.  For me and this problem, that tool was SQL Server.

## Data Import into SQL Server
The first step was to understand the data and break it up into logical entities.  Each logical entity corresponds to its own database table.  The four most important logical entities in this data set are Artist, Album, Track, and Playlist.  I am going to go out of order for a moment and jump ahead to the Pandas dataframes that we used for downstream computations.  The code below loads these frames from an h5 data file.  

Before this code will run, please copy the files `data.h5` and `playlist_entry.h5` from the Dropbox folder to the directory where you cloned the GitHub repo.  On my system, the repo is cloned to `D:\IACS-CS-209-Spotify\` and the Dropbox folder is at `D:\Dropbox\IACS-CS-209-Spotify\`.  I copy the file `D:\Dropbox\IACS-CS-209-Spotify\mpd\database_export\h5` into `D:\IACS-CS-209-Spotify\data\data.h5`.  I do the analogous operation for files `playlist_entry.h5` and `track_pairs.h5`.
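
Because the copy step is manual, it can help to check that the files are in place before trying to load them.  The sketch below is one way to do that, not part of the project code.  The helper `missing_h5_files` is hypothetical; the relative path `../data/` and the file names are the ones used by the loading code in this notebook.

```python
from pathlib import Path
from typing import Iterable, List


def missing_h5_files(data_dir: str, fnames: Iterable[str]) -> List[str]:
    """Return the names in fnames that are not present as files in data_dir."""
    base = Path(data_dir)
    return [fname for fname in fnames if not (base / fname).is_file()]


# Relative path and file names as used by the loading code in this notebook
missing = missing_h5_files('../data/', ['data.h5', 'playlist_entry.h5', 'track_pairs.h5'])
if missing:
    print(f'Copy these files from the Dropbox folder first: {missing}')
else:
    print('All h5 files are in place.')
```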

It is possible to automate all of this in a very slick and convenient way using GitHub LFS.  https://git-lfs.github.com/
But this is a paid service that costs at least \$10 a month, possibly quite a bit more given the size of this data set.  I spent several hours investigating alternatives that would allow us to bypass this manual copy step to synchronize the Dropbox folder, but did not come up with a better procedure.  Once the file `data.h5` is in place, the code below will load the data frames for Artist, Album, Track, and Playlist into memory.

In [7]:
def load_frames(frame_names: Optional[List[str]] = None) -> Dict[str, pd.DataFrame]:
    """Load all available data frames.  Return a dictionary keyed by frame_name."""
    # Relative path to h5 data files
    path_h5 = '../data/'
    
    # Dictionary of dataframes to be generated.
    # Key = frame name, value = fname_h5
    frame_tbl: Dict[str, str] = {
        # Basic schema
        'Artist': 'data.h5',
        'Album': 'data.h5',
        'Track': 'data.h5',
        'Playlist': 'data.h5',
    
        # Tables relating to prediction outcomes and scoring
        'TrainTestSplit': 'data.h5',
        'Playlist_Last10': 'data.h5',
        'Playlist_trn': 'data.h5',
        'Playlist_tst': 'data.h5',
        
        # Tables relating to the baseline and playlist name prediction models
        'TrackRank': 'data.h5',
        'PlaylistName': 'data.h5',
        'PlaylistSimpleName': 'data.h5',
        'TrackRankBySimpleName': 'data.h5',
    
        # PlaylistEntry table is big - saved in its own file
        'PlaylistEntry': 'playlist_entry.h5',
    
        # Audio features
        'AudioFeatures': 'data.h5',
        'Genre': 'data.h5',
        'MetaGenre': 'data.h5',
        'TrackGenre': 'data.h5',
        'TrackMetaGenre': 'data.h5',
        
        # TrackPairs table is big - saved in its own file
        'TrackPairs': 'track_pairs.h5',
        
        # Scores of three models (baseline, playlist name, naive bayes) plus the stacked model
        'Scores_Baseline': 'data.h5',
        'Scores_SimpleName': 'data.h5',
        'Scores_TrackPair': 'data.h5',
        'Scores_Stack': 'data.h5',
        
        # Survey responses
        'SurveyResponse': 'data.h5',
        'SurveyPlaylist': 'data.h5',
        'SurveyPlaylistEntry': 'data.h5',

        # Artists being promoted by policy (mid-tier, female)
        'PromotedArtist': 'data.h5',

        # Survey recommendations
        'SurveyRecommendations': 'data.h5',
        'SurveyRecommendationsPromoted': 'data.h5',
        }
    
    # Set frame_names to all tables if it was not specified
    if frame_names is None:
        frame_names = list(frame_tbl.keys())
    
    # Start timer
    t0 = time.time()
    # Dictionary of data frames
    frames: Dict[str, pd.DataFrame] = dict()
    # Iterate over entries in frame_names, loading them from h5 files
    for frame_name in frame_names:
        # h5 filename for this frame
        fname_h5 = frame_tbl[frame_name]
        # Read the data frame
        frames[frame_name] = pd.read_hdf(path_h5 + fname_h5, frame_name)
        # Status update
        print(f'Loaded {frame_name}.')
    
    # Status update
    t1 = time.time()
    elapsed = t1 - t0
    print(f'\nLoaded {len(frames)} Data Frames.')
    print(f'Elapsed Time: {elapsed:0.2f} seconds.')
    return frames

**Load the frames for Artist, Album, Track, Playlist & PlaylistEntry into memory**

In [13]:
frames = load_frames(['Artist', 'Album', 'Track', 'Playlist', 'PlaylistEntry'])

Loaded Artist.
Loaded Album.
Loaded Track.
Loaded Playlist.
Loaded PlaylistEntry.

Loaded 5 Data Frames.
Elapsed Time: 4.90 seconds.


Notice how much faster and more efficient this is than loading the JSON files.  The normalized frames loaded in under five seconds, while an analogous operation run directly on the JSON files took on the order of multiple minutes on my system.

## Data Model for Artist, Album, Track & Playlist  

In [14]:
display(frames['Artist'].head(10))

| | ArtistID | ArtistUri | ArtistName |
|---|---|---|---|
| 0 | 1 | spotify:artist:0001cekkfdEBoMlwVQvpLg | Jordan Colle |
| 1 | 2 | spotify:artist:0001wHqxbF2YYRQxGdbyER | Motion Drive |
| 2 | 3 | spotify:artist:0001ZVMPt41Vwzt1zsmuzp | Thyro & Yumi |
| 3 | 4 | spotify:artist:0004C5XZIKZyd2RWvP4sOq | "Faron Young, Nat Stuckey" |
| 4 | 5 | spotify:artist:000DnGPNOsxvqb2YEHBePR | The Ruins |
| 5 | 6 | spotify:artist:000Dq0VqTZpxOP6jQMscVL | Thug Brothers |
| 6 | 7 | spotify:artist:000h2XLY65iWC9u5zgcL1M | Kosmose |
| 7 | 8 | spotify:artist:000spuc3oKgwYmfg5IE26s | Parliament Syndicate |
| 8 | 9 | spotify:artist:000UUAlAdQqkTD9sfoyQGf | Darren Gibson |
| 9 | 10 | spotify:artist:000UxvYLQuybj6iVRRCAw1 | Primera Etica |


In [15]:
display(frames['Album'].head(10))

| | AlbumID | AlbumUri | AlbumName |
|---|---|---|---|
| 0 | 1 | spotify:album:00010fh2pSk7f1mGIhgorB | Okkadu (Original Motion Picture Soundtrack) |
| 1 | 2 | spotify:album:00045VFusrXwCSietfmspc | Let Love Begin Remixed |
| 2 | 3 | spotify:album:0005lpYtyKk9B3e0mWjdem | Stability |
| 3 | 4 | spotify:album:0005rH90S3le891y5XzPg4 | Mozart: Piano Concerto No. 27, KV595 |
| 4 | 5 | spotify:album:0008WZMLnvEBVnq418uZsI | Smart Flesh |
| 5 | 6 | spotify:album:0009lq7uJ6cW3Cxtf8eNUp | Earth: The Pale Blue Dot (Instrumental) |
| 6 | 7 | spotify:album:000aG92zPFtZ0FRLaaJHE5 | X |
| 7 | 8 | spotify:album:000f3dTtvpazVzv35NuZmn | Make It Fast, Make It Slow |
| 8 | 9 | spotify:album:000g9ysmwb8NNsd4u1o087 | Nennt es, wie Ihr wollt |
| 9 | 10 | spotify:album:000gdWY9uR4VYS5oZudY5o | Pérez Prado. Sus 40 Grandes Canciones |


In [16]:
display(frames['Track'].head(10))

| | TrackID | ArtistID | AlbumID | TrackUri | TrackName |
|---|---|---|---|---|---|
| 0 | 1 | 208716 | 355266 | spotify:track:0000uJA4xCdxThagdLkkLR | Heart As Cold As Stone |
| 1 | 2 | 110598 | 6666 | spotify:track:0002yNGLtYSYtc0X6ZnFvp | Muskrat Ramble |
| 2 | 3 | 93681 | 77525 | spotify:track:00039MgrmLoIzSpuYKurn9 | Thas What I Do |
| 3 | 4 | 5377 | 586855 | spotify:track:0003Z98F6hUq7XxqSRM87H | ???? ?????? ??? ??? |
| 4 | 5 | 285766 | 426742 | spotify:track:0004ExljAge0P5XWn1LXmW | Gita |
| 5 | 6 | 240510 | 337666 | spotify:track:0005rgjsSeVLp1cze57jIN | Mi Razón de Ser |
| 6 | 7 | 111429 | 625930 | spotify:track:0005w1bMJ7QAMl6DY98oxa | Sonata in G Major, BuxWV 271: Allegro - |
| 7 | 8 | 102394 | 113691 | spotify:track:0006Rv1e2Xfh6QooyKJqKS | Nightwood |
| 8 | 9 | 240180 | 287678 | spotify:track:0007AYhg2UQbEm88mxu7js | Mandarin Oranges Part 2 |
| 9 | 10 | 35528 | 570255 | spotify:track:0009mEWM7HILVo4VZYtqwc | Movement |


In [17]:
display(frames['Playlist'].head(10))

| | PlaylistID | PlaylistName | NumTracks | NumArtists | NumFollowers | NumEdits | DurationMS | IsCollaborative | ModifiedAt |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Throwbacks | 52 | 37 | 1 | 6 | 11532414 | 0 | 1493424000 |
| 1 | 1 | Awesome Playlist | 39 | 21 | 1 | 5 | 11656470 | 0 | 1506556800 |
| 2 | 2 | korean | 64 | 31 | 1 | 18 | 14039958 | 0 | 1505692800 |
| 3 | 3 | mat | 126 | 86 | 1 | 4 | 28926058 | 0 | 1501027200 |
| 4 | 4 | 90s | 17 | 16 | 2 | 7 | 4335282 | 0 | 1401667200 |
| 5 | 5 | Wedding | 80 | 56 | 1 | 3 | 19156557 | 0 | 1430956800 |
| 6 | 6 | I Put A Spell On You | 16 | 13 | 1 | 2 | 3408479 | 0 | 1477094400 |
| 7 | 7 | 2017 | 53 | 48 | 1 | 38 | 12674796 | 0 | 1509321600 |
| 8 | 8 | BOP | 46 | 23 | 2 | 21 | 9948921 | 0 | 1508976000 |
| 9 | 9 | old country | 21 | 18 | 1 | 10 | 4297488 | 0 | 1501804800 |


In [18]:
display(frames['PlaylistEntry'].head(10))

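| | PlaylistID | Position | TrackID |
|---|---|---|---|
| 0 | 0 | 0 | 236619 |
| 1 | 0 | 1 | 1866537 |
| 2 | 0 | 2 | 260403 |
| 3 | 0 | 3 | 347127 |
| 4 | 0 | 4 | 451364 |
| 5 | 0 | 5 | 270971 |
| 6 | 0 | 6 | 1784688 |
| 7 | 0 | 7 | 938244 |
| 8 | 0 | 8 | 2145897 |
| 9 | 0 | 9 | 776743 |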


### Comments on the Table Design
The design of these database tables follows a few simple best practices.  All tables have an integer ID as their primary key.  This gives a large boost to the performance of queries that join tables, because searching for an integer entry in an index is a much faster operation than comparing strings.  The original JSON data model did not have integer IDs for any of these entities except the playlist ID.  All entities that exist in Spotify also have a field that Spotify names a Uri.  These are string identifiers.  These fields are equipped with unique constraints, which both build in a data integrity check and cause SQL Server to build indexes that enforce the constraint and support fast joins on these fields.
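
The effect of combining an integer primary key with a unique constraint on the Uri can be tried out without a SQL Server instance.  The sketch below swaps in SQLite (via Python's built-in `sqlite3` module) purely for illustration; it is not the project's schema, and the SQLite column types differ from the SQL Server ones.  The helper names `make_artist_table` and `try_insert_artist` are hypothetical.

```python
import sqlite3


def make_artist_table(conn: sqlite3.Connection) -> None:
    """Create a toy Artist table with an integer key and a unique Uri."""
    conn.execute("""
        CREATE TABLE Artist(
            ArtistID INTEGER PRIMARY KEY,     -- integer surrogate key
            ArtistUri TEXT NOT NULL UNIQUE,   -- unique constraint, backed by an index
            ArtistName TEXT NOT NULL
        )""")


def try_insert_artist(conn: sqlite3.Connection, artist_uri: str, artist_name: str) -> bool:
    """Insert one artist; return False if a constraint rejects the row."""
    try:
        conn.execute('INSERT INTO Artist(ArtistUri, ArtistName) VALUES (?, ?)',
                     (artist_uri, artist_name))
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(':memory:')
make_artist_table(conn)
uri = 'spotify:artist:0001cekkfdEBoMlwVQvpLg'
first = try_insert_artist(conn, uri, 'Jordan Colle')
# A second row with the same Uri violates the unique constraint
second = try_insert_artist(conn, uri, 'Jordan Colle (duplicate)')
artist_id = conn.execute('SELECT ArtistID FROM Artist WHERE ArtistUri = ?', (uri,)).fetchone()[0]
print(first, second, artist_id)
```

The first insert succeeds and receives an automatically assigned integer ID; the duplicate Uri is rejected by the database itself rather than by application code.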

The Track table demonstrates foreign key relationships.  The fields ArtistID and AlbumID are foreign keys onto the Artist and Album tables, respectively.  Note that no redundant information such as the ArtistName or ArtistUri is stored in the Track table.  Any consumer of this information is expected to get it by joining the Artist table using ArtistID.  The foreign key relationships are enforced with foreign key constraints.  The primary key on the Track table is TrackID, and there is a separate unique constraint on the TrackUri.
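
On the Pandas side, the same join can be written with `DataFrame.merge`.  The sketch below uses toy frames with the same key columns as the real tables; the rows are made up and the helper `track_with_names` is hypothetical.  Passing `validate='many_to_one'` asks Pandas to raise an error if the key is not unique in the right-hand table, which loosely mirrors the primary key guarantee in the database.

```python
import pandas as pd

# Toy frames with the same key columns as the real tables; the rows are made up
artist = pd.DataFrame({'ArtistID': [1, 2], 'ArtistName': ['Artist A', 'Artist B']})
album = pd.DataFrame({'AlbumID': [10, 20], 'AlbumName': ['Album X', 'Album Y']})
track = pd.DataFrame({
    'TrackID': [100, 101, 102],
    'ArtistID': [1, 1, 2],
    'AlbumID': [10, 20, 20],
    'TrackName': ['Song 1', 'Song 2', 'Song 3'],
})


def track_with_names(track: pd.DataFrame, artist: pd.DataFrame,
                     album: pd.DataFrame) -> pd.DataFrame:
    """Attach ArtistName and AlbumName to each track by joining on the foreign keys."""
    # Many tracks map to one artist; merge raises MergeError if ArtistID repeats in artist
    out = track.merge(artist[['ArtistID', 'ArtistName']], on='ArtistID',
                      how='left', validate='many_to_one')
    # Same pattern for the album foreign key
    out = out.merge(album[['AlbumID', 'AlbumName']], on='AlbumID',
                    how='left', validate='many_to_one')
    return out


named = track_with_names(track, artist, album)
print(named[['TrackName', 'ArtistName', 'AlbumName']])
```

With the real data loaded, the analogous call would be `track_with_names(frames['Track'], frames['Artist'], frames['Album'])`.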

This table also demonstrates a consistent naming scheme followed in the SpotifyDB database.  The name of the integer primary key on the table Artist is ArtistID.  This is one popular approach.  Another approach is to name the field ID.  I prefer to name the field ArtistID because then when you join from the Track table to the Artist table, the join clause uses the field ArtistID on both sides of the equality.  There are multiple naming approaches that are strong.  As is often the case, the most important thing is to pick one strategy and then follow it consistently.

For all you SQL aficionados out there, below please find the SQL table definitions for the main logical entities comprising the MPD dataset.  (Don't worry, I won't include all 42 tables in the database, just a few of the important ones!)  The entirety of the SQL used in this project can be found in the `sql` directory under the GitHub repo.  The files are named with a numeric prefix, so that tables are built in the correct order.  As an example, the Track table references the Artist and Album tables, so Artist and Album must be created first.  The relevant SQL scripts are named `03_MakeTable_Artist.sql`, `04_MakeTable_Album.sql`, and `05_MakeTable_Track.sql`.
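
As a small illustration of why the numeric prefix matters, the sketch below orders a list of script names by that prefix.  It is not part of the project; the helper `order_scripts` and the file name `10_Hypothetical_Later_Step.sql` are made up for the example.  Comparing the prefix as an integer keeps the order correct even if some prefixes are not zero-padded.

```python
import re
from typing import List, Tuple


def order_scripts(fnames: List[str]) -> List[str]:
    """Sort SQL script names by their leading numeric prefix; skip names without one."""
    prefixed: List[Tuple[int, str]] = []
    for fname in fnames:
        # Match one or more digits followed by an underscore at the start of the name
        match = re.match(r'(\d+)_', fname)
        if match:
            prefixed.append((int(match.group(1)), fname))
    return [fname for _, fname in sorted(prefixed)]


# The last name is hypothetical, added to show ordering across two-digit prefixes
scripts = ['04_MakeTable_Album.sql', '10_Hypothetical_Later_Step.sql', '03_MakeTable_Artist.sql']
print(order_scripts(scripts))
```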

### SQL Table Definitions for Main Logical Entities

**Artist** / Script `03_MakeTable_Artist.sql`

```sql
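-- Drop the table before the sequence: SQL Server will not drop a sequence
-- that a column default still references
DROP TABLE IF EXISTS dbo.Artist;
DROP SEQUENCE IF EXISTS dbo.SEQ_ArtistID;
CREATE SEQUENCE dbo.SEQ_ArtistID
  AS INT START WITH 1 INCREMENT BY 1 NO CYCLE;
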
CREATE TABLE dbo.Artist(
ArtistID INT NOT NULL
  DEFAULT next value FOR dbo.SEQ_ArtistID,
ArtistUri CHAR(37) NOT NULL,
ArtistName VARCHAR(512) NOT NULL,
-- Primary Key and Unique constraints
CONSTRAINT PK_Artist_ArtistID PRIMARY KEY (ArtistID),
CONSTRAINT UNQ_Artist_ArtistUri UNIQUE(ArtistUri),
-- ArtistName should be unique, but unfortunately it's not; index it instead
INDEX IDX_Artist_ArtistName (ArtistName),
);
```