In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from os import listdir
from os.path import isfile, join

import time

sns.set()

Spot check an individual songs file to explore its format and data.

In [2]:
df = pd.read_csv('../data/Songs/songs285.csv')

In [3]:
print(df.shape)
list(df.columns)

(64928, 9)


['pid',
 'pos',
 'artist_name',
 'track_uri',
 'artist_uri',
 'track_name',
 'album_uri',
 'duration_ms',
 'album_name']

This is a *lot* of data: roughly $65,000$ rows per file across $1,000$ files, or around $65,000,000$ observations in total. Each observation is one song in one playlist, and many songs are certainly repeated.

In [4]:
df.head()

,pid,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms,album_name
0,0,0,Deftones,spotify:track:4rEGJ9KirDlKiOHxqVwcVg,spotify:artist:6Ghvu1VvMGScGpOUJBAHNH,Sextape,spotify:album:4RQnFSkkZlA65Xxchhnaha,241533,Diamond Eyes
1,0,1,Muse,spotify:track:0It6VJoMAare1zdV2wxqZq,spotify:artist:12Chz98pHFMPJEknJQMWvI,Undisclosed Desires,spotify:album:0eFHYz8NmK75zSplL5qlfM,235000,The Resistance
2,0,2,Pearl Jam,spotify:track:0LBmvPJYmtEJ7kkWvc3kbT,spotify:artist:1w5Kfo2jwwIPruYS2UWh56,Oceans,spotify:album:5B4PYA7wNN4WdEXdIJu58a,161893,Ten
3,0,3,My Chemical Romance,spotify:track:0uukw2CgEIApv4IWAjXrBC,spotify:artist:7FBcuc1gsnv6Y1nwFtNRCb,Dead!,spotify:album:0FZK97MXMm5mUQ8mtudjuK,195520,The Black Parade
4,0,4,Red Hot Chili Peppers,spotify:track:1iFIZUVDBCCkWe705FLXto,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,Dosed,spotify:album:6deiaArbeoqp1xPEGdEKp1,311866,By The Way


In [5]:
# Some very different playlist lengths
df.groupby('pid')['track_uri'].nunique()[10:20]

pid
10     22
11     63
12    212
13     12
14     11
15     25
16     96
17     24
18      9
19    165
Name: track_uri, dtype: int64

Every file appears to hold an arbitrary-length list of playlists. Each playlist has an id `pid` (scoped to that file) and an arbitrary-length list of songs, each with a position `pos` within the playlist. For each song, the `artist_name`, `track_uri`, `artist_uri`, `track_name`, `album_uri`, `duration_ms`, and `album_name` are all stored. Most of these values are repeated every time a song recurs (within and across files), so there is a lot of room for simplification. Furthermore, each URI is effectively a unique identifier for its song, artist, or album, and can be used as such.
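The two claims above (metadata is repeated, and a URI identifies its song) can be checked directly. The sketch below uses a tiny made-up DataFrame in place of a real songs file; the name `toy` and all values are illustrative assumptions, not data from the dataset.

```python
import pandas as pd

# Toy stand-in for one songs file; all values are made up
toy = pd.DataFrame({
    'pid': [0, 0, 1, 1],
    'pos': [0, 1, 0, 1],
    'track_uri': ['t1', 't2', 't1', 't3'],
    'track_name': ['x', 'y', 'x', 'z'],
    'artist_uri': ['a1', 'a2', 'a1', 'a1'],
})

# If track_uri identifies a song, every URI maps to exactly one value
# in each metadata column
values_per_uri = toy.groupby('track_uri')[['track_name', 'artist_uri']].nunique()
uri_is_identifier = bool((values_per_uri == 1).all().all())

# Share of rows whose song already appeared earlier in the file
repeat_share = toy['track_uri'].duplicated().mean()
print(uri_is_identifier, repeat_share)
```

Running the same two checks on `df` would show how much of a real file is redundant.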

$1,000$ playlist files, as expected.

In [6]:
all_files = listdir('../data/Songs')
print(len(all_files))
all_files[0:3]

1000


['songs284.csv', 'songs290.csv', 'songs247.csv']

### Structuring the data
A reasonable first step to slim down the dataset without losing information or fidelity is to parse all the files into a single reference table of all songs and their metadata. Each playlist can then be stored as a simple named object, where the name is a globally unique playlist id and the value is a vector of song ids.
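As a minimal sketch of that target structure, assuming a made-up four-row file (the names `toy`, `songs_ref`, and `playlists_ref` are hypothetical and not used elsewhere in this notebook):

```python
import pandas as pd

# Toy stand-in for one songs file; all values are made up
toy = pd.DataFrame({
    'pid': [0, 0, 1, 1],
    'pos': [0, 1, 0, 1],
    'track_uri': ['t1', 't2', 't1', 't3'],
    'track_name': ['x', 'y', 'x', 'z'],
})

# Reference table: one row per song, with a compact integer id
songs_ref = (toy.drop_duplicates('track_uri')
                .drop(columns=['pid', 'pos'])
                .set_index('track_uri'))
songs_ref['song_id'] = range(len(songs_ref))

# Playlists: playlist id -> ordered list of song ids
ordered = toy.sort_values(['pid', 'pos'])
ordered = ordered.assign(song_id=ordered['track_uri'].map(songs_ref['song_id']))
playlists_ref = ordered.groupby('pid')['song_id'].agg(list)
print(playlists_ref)
```

The song metadata is stored once in `songs_ref`, and each playlist shrinks to a short list of integers.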

### Looping over all our files to fill out the master DataFrame (songs) and Series (playlists)

In [17]:
start_time = time.time()
loop_start = time.time()

# List of all files
all_files = listdir('../data/Songs')
# Keep only the first 200 files (about 200,000 playlists); that is plenty
all_files = all_files[0:200]

# Load first file to get columns (standard across all)
df = pd.read_csv('../data/Songs/' + all_files[0])

# Master DataFrame of all unique songs included across all playlists
#songs = pd.DataFrame(columns = list(df.columns)[2:])
songs = pd.DataFrame()

# Master Series of playlists and the songs included in each
playlists = pd.Series()

# Aggregator functions to limit to one row per song and count occurrences across playlists
a1 = dict()
for key in df.columns[2:]:
    a1[key] = 'first'
a1['track_uri'] = 'count'

# Aggregator to consolidate into sum of songs across playlists
a2= dict()
for key in df.columns[2:]:
    a2[key] = 'first'
del a2['track_uri']
a2['count'] = 'sum'

# Loop over each file to extract data
for i, file in enumerate(all_files):
    # split on "." to split into "filename" and "csv"
    # Then select "filename" and ditch the first five letters "songs"
    filenum = file.split(".")[0][5:]
    
    # Load file and store in temporary dataframe
    fdf = pd.read_csv('../data/Songs/' + file)
    
    # --- SONGS IN FILE ---
    fdf_counts = fdf.iloc[:, 2:]
    fdf_counts = fdf_counts.groupby('track_uri').agg(a1)
    fdf_counts.rename(columns = {'track_uri': 'count'}, inplace = True)
    
    # Append this file's per-song counts; duplicates are consolidated periodically below
    songs = songs.append(fdf_counts)
    
    # -- SONGS IN EACH PLAYLIST --
    # Songs included in every playlist (ordered) in file
    # For each playlist, get list of track_uri's (unique identifiers)
    songs_in_playlist = fdf.groupby('pid')['track_uri'].unique()

    # Make the index globally unique by combining the file number with the in-file pid
    #songs_in_playlist.index = [filenum + '_' + str(pid) for pid in songs_in_playlist.index.values]
    songs_in_playlist.index = list(map(lambda x: filenum + '_' + str(x), songs_in_playlist.index.values))
    
    # Add playlists to master Series of all playlists
    playlists = playlists.append(songs_in_playlist)
    
    # Every 25 files, consolidate the songs table so it doesn't grow too big
    if (i+1)%25 == 0: 
        print('{}/{} -- {} s'.format(i+1, len(all_files), time.time() - loop_start))
        loop_start = time.time()
        songs = songs.groupby('track_uri').agg(a2, sort = True)
        print('   Consolidation: {} s'.format(time.time() - loop_start))
    
print("--- %s seconds ---" % (time.time() - start_time))

25/200 -- 28.84281301498413 s
   Consolidation: 2.741060733795166 s




50/200 -- 47.54296398162842 s
   Consolidation: 4.794252872467041 s
75/200 -- 51.14710998535156 s
   Consolidation: 5.952292203903198 s
100/200 -- 61.163079023361206 s
   Consolidation: 7.0074687004089355 s
125/200 -- 70.34738993644714 s
   Consolidation: 9.418751001358032 s
150/200 -- 75.32100200653076 s
   Consolidation: 8.680530071258545 s
175/200 -- 80.10952401161194 s
   Consolidation: 10.126471281051636 s
200/200 -- 83.53081130981445 s
   Consolidation: 10.37688398361206 s
--- 508.3829379081726 seconds ---
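The per-batch times above climb from about 29 s to about 84 s. One plausible contributor is that every `append` call copies the whole accumulated table. Note also that `DataFrame.append` and `Series.append` were removed in pandas 2.0. The sketch below shows an alternative that collects per-file pieces in lists and concatenates once at the end. It runs on two made-up in-memory "files", and `summarize_file` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def summarize_file(fdf, filenum):
    """Per-file song counts and per-playlist track lists (hypothetical helper)."""
    meta = (fdf.drop(columns=['pid', 'pos'])
               .drop_duplicates('track_uri')
               .set_index('track_uri'))
    # Number of rows in which each song appears in this file
    meta['count'] = fdf.groupby('track_uri').size()
    tracks = fdf.groupby('pid')['track_uri'].unique()
    tracks.index = [f'{filenum}_{pid}' for pid in tracks.index]
    return meta, tracks

# Two tiny made-up "files" standing in for the CSVs on disk
files = {
    '0': pd.DataFrame({'pid': [0, 0, 1], 'pos': [0, 1, 0],
                       'track_uri': ['t1', 't2', 't1'],
                       'track_name': ['x', 'y', 'x']}),
    '1': pd.DataFrame({'pid': [0, 0], 'pos': [0, 1],
                       'track_uri': ['t1', 't3'],
                       'track_name': ['x', 'z']}),
}

song_parts, playlist_parts = [], []
for filenum, fdf in files.items():
    meta, tracks = summarize_file(fdf, filenum)
    song_parts.append(meta)
    playlist_parts.append(tracks)

# One concat at the end instead of one copy of the growing table per file
all_songs = pd.concat(song_parts)
agg = {col: 'first' for col in all_songs.columns if col != 'count'}
agg['count'] = 'sum'
all_songs = all_songs.groupby(level=0).agg(agg)
all_playlists = pd.concat(playlist_parts)
print(all_songs)
```

With a million distinct songs the list of per-file tables may not fit in memory, so the periodic consolidation used in the loop above could still be applied to the collected parts.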


In [18]:
# Do a final consolidation just to be safe (should be very fast)
# Add song ID to table, now that it only contains unique songs
start_time = time.time()
songs_counts = songs.groupby('track_uri').agg(a2)
songs_counts['song_id'] = np.arange(len(songs_counts))
print("--- %s seconds ---" % (time.time() - start_time))

print(songs_counts.shape)
display(songs_counts.head())

--- 7.567456960678101 seconds ---
(1003760, 8)


track_uri,artist_name,artist_uri,track_name,album_uri,duration_ms,album_name,count,song_id
spotify:track:0002yNGLtYSYtc0X6ZnFvp,Sidney Bechet's Blue Note Jazzmen,spotify:artist:2XouUSO0EAJ9gMMoHiXqMt,Muskrat Ramble,spotify:album:04hQBJ7YSuNnZ0nbuXNYbY,220293,Jazz Classics,1,0
spotify:track:00039MgrmLoIzSpuYKurn9,Zach Farlow,spotify:artist:2jTojc4rAsOMx6200a8Ah1,Thas What I Do,spotify:album:0UHfgx3ITlxePDXLaN5Y6x,222727,The Great Escape 2,2,1
spotify:track:0006Rv1e2Xfh6QooyKJqKS,Two Steps from Hell,spotify:artist:2qvP9yerCZCS0U1gZU8wYp,Nightwood,spotify:album:1BD29pKydSXe1EsHFj0GrQ,189638,Colin Frake On Fire Mountain,4,2
spotify:track:0007AYhg2UQbEm88mxu7js,Little Simz,spotify:artist:6eXZu6O7nAUA5z6vLV8NKI,Mandarin Oranges Part 2,spotify:album:32RJzqlapfiU0fr2l4SSW9,198000,E.D.G.E,1,3
spotify:track:0009mEWM7HILVo4VZYtqwc,Slam,spotify:artist:0Y0Kj7BOR5DM0UevuY7IvO,Movement,spotify:album:62VkRE2ucNvZDnYMCsnNDh,447534,Movement,1,4


In [61]:
# Replace playlist track_uri with song_id
start_time = time.time()
loop_start = time.time()

playlists_songids = pd.Series(index = playlists.index)
playlists_songids = playlists_songids.astype(object)

i = 0
for ind, row in playlists.items():
    songids = np.array(songs_counts.loc[row, 'song_id'], 'int')
    playlists_songids.loc[str(ind)] = songids
    
    i += 1
    if i % (len(playlists) // 20) == 0:
        print('{}/{} -- {} s'.format(i, int(len(playlists)), time.time() - loop_start))
        loop_start = time.time()
    
print("--- %s seconds ---" % (time.time() - start_time))

print(playlists_songids.shape)
print(playlists_songids.head())

10000/200000 -- 38.7440550327301 s
20000/200000 -- 43.11882281303406 s
30000/200000 -- 44.18085217475891 s
40000/200000 -- 48.200636863708496 s
50000/200000 -- 52.817174196243286 s
60000/200000 -- 50.61113619804382 s
70000/200000 -- 57.99031400680542 s
80000/200000 -- 64.7493839263916 s
90000/200000 -- 67.53792810440063 s
100000/200000 -- 67.3009626865387 s
110000/200000 -- 68.4447910785675 s
120000/200000 -- 71.28671312332153 s
130000/200000 -- 72.44740080833435 s
140000/200000 -- 77.3995201587677 s
150000/200000 -- 80.82779884338379 s
160000/200000 -- 358.95413088798523 s
170000/200000 -- 90.39383912086487 s
180000/200000 -- 89.62128067016602 s
190000/200000 -- 100.55844020843506 s
200000/200000 -- 97.31215310096741 s
--- 1642.5336339473724 seconds ---
(200000,)
284_0    [340039, 125250, 881533, 653897, 49614, 356319...
284_1    [738782, 7646, 142078, 900881, 533258, 429837,...
284_2    [552361, 135177, 507876, 865927, 638474, 55164...
284_3    [214695, 27387, 700562, 448130, 1000188
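The loop above took about 1,640 s because it performs one `.loc` lookup per playlist. A possible alternative, sketched here on made-up stand-ins (`songs_toy` and `playlists_toy` are hypothetical names), flattens all playlists, translates every URI in a single vectorized lookup, and regroups. It yields lists rather than NumPy arrays, and it has not been timed on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values) for songs_counts and playlists
songs_toy = pd.DataFrame({'song_id': [0, 1, 2]}, index=['t1', 't2', 't3'])
playlists_toy = pd.Series({'284_0': np.array(['t3', 't1'], dtype=object),
                           '284_1': np.array(['t2'], dtype=object)})

# One row per (playlist, track), keeping the playlist id as the index
flat = playlists_toy.explode()

# Translate all URIs at once by looking them up in the song_id column
flat_ids = flat.map(songs_toy['song_id'])

# Regroup into one list of song ids per playlist, preserving order
ids_toy = flat_ids.groupby(level=0, sort=False).agg(list)
print(ids_toy)
```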

In [62]:
# Change songs table to have song id as index and track_uri as column
# We will be doing lookups on song_id while running
songs_counts_id = songs_counts.copy()
songs_counts_id['track_uri'] = songs_counts_id.index.values
songs_counts_id.set_index('song_id', inplace = True)
songs_counts_id.head()

song_id,artist_name,artist_uri,track_name,album_uri,duration_ms,album_name,count,track_uri
0,Sidney Bechet's Blue Note Jazzmen,spotify:artist:2XouUSO0EAJ9gMMoHiXqMt,Muskrat Ramble,spotify:album:04hQBJ7YSuNnZ0nbuXNYbY,220293,Jazz Classics,1,spotify:track:0002yNGLtYSYtc0X6ZnFvp
1,Zach Farlow,spotify:artist:2jTojc4rAsOMx6200a8Ah1,Thas What I Do,spotify:album:0UHfgx3ITlxePDXLaN5Y6x,222727,The Great Escape 2,2,spotify:track:00039MgrmLoIzSpuYKurn9
2,Two Steps from Hell,spotify:artist:2qvP9yerCZCS0U1gZU8wYp,Nightwood,spotify:album:1BD29pKydSXe1EsHFj0GrQ,189638,Colin Frake On Fire Mountain,4,spotify:track:0006Rv1e2Xfh6QooyKJqKS
3,Little Simz,spotify:artist:6eXZu6O7nAUA5z6vLV8NKI,Mandarin Oranges Part 2,spotify:album:32RJzqlapfiU0fr2l4SSW9,198000,E.D.G.E,1,spotify:track:0007AYhg2UQbEm88mxu7js
4,Slam,spotify:artist:0Y0Kj7BOR5DM0UevuY7IvO,Movement,spotify:album:62VkRE2ucNvZDnYMCsnNDh,447534,Movement,1,spotify:track:0009mEWM7HILVo4VZYtqwc


### Write files to disk (csv and pickle)
Use `pd.read_pickle` to easily read back in a data frame or series with the exact same structure as the one you dumped.
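A small round trip illustrates why the pickle copy is worth keeping alongside the CSV. This is a sketch on a made-up Series of lists written to a temporary directory; the file names are placeholders, not the paths used below.

```python
import os
import tempfile
import pandas as pd

# A Series whose values are sequences, similar in shape to playlists_songids
s = pd.Series({'284_0': [2, 0], '284_1': [1]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')
    csv_path = os.path.join(tmp, 'toy.csv')
    s.to_pickle(pkl_path)
    s.to_csv(csv_path, header=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path, header=None, index_col=0)[1]

# Pickle restores the original objects; CSV hands back their text form
print(type(from_pickle['284_0']), type(from_csv['284_0']))
```

The CSV remains useful for inspection in other tools, but the sequences would need to be parsed again after reading it back.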

In [65]:
songs_counts_id.to_csv('../data/songs_counts_'+str(len(all_files))+'.csv')
songs_counts_id.to_pickle('../data/songs_counts_'+str(len(all_files))+'.pkl')

In [66]:
playlists_songids.to_csv('../data/playlists_song_ids_'+str(len(all_files))+'.csv', header = False)
playlists_songids.to_pickle('../data/playlists_song_ids_'+str(len(all_files))+'.pkl')

### For later use:
Quickly filter out of playlists any songs that do not exist in the songs table. To drop songs that appear fewer than a certain number of times, first filter `songs_counts` on that threshold, then run `get` on the result. Some songs in some playlists will then be missing from `songs_counts`, and the code below drops them from the playlist. This is handy for limiting the size of the dataset.

In [15]:
songs_counts.song_id.get(['derp', *playlists[0]], np.nan)

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]


track_uri
derp                                         NaN
spotify:track:2d7LPtieXdIYzf7yHPooWd     90338.0
spotify:track:0y4TKcc7p2H6P0GJlt01EI     33049.0
spotify:track:6q4c1vPRZREh7nw3wG7Ixz    234643.0
spotify:track:54KFQB6N4pn926IUUYZGzK    173775.0
spotify:track:0NeJjNlprGfZpeX2LQuN6c     13104.0
spotify:track:2kuFVY6hWX6yavTiWHE3SQ     94661.0
spotify:track:66mmvchQ4C3LnPzq4DiAI3    209472.0
spotify:track:4gFxywaJejXWxo0NjlWzgg    160465.0
spotify:track:6wQSrFnJYm3evLsavFeCVT    238134.0
spotify:track:3ZjnFYlal0fXN6t61wdxhl    122726.0
spotify:track:617EQMgzYFe2THz093j68m    206271.0
spotify:track:6Hki3HcbeU2c4T72lJjyZ5    215590.0
spotify:track:6dbKl5pzPue67ttfvYCpK9    227718.0
Name: song_id, dtype: float64

In [16]:
np.array(songs_counts.song_id.get(['derp', *playlists[0]], np.nan).dropna(), 'int')

array([ 90338,  33049, 234643, 173775,  13104,  94661, 209472, 160465,
       238134, 122726, 206271, 215590, 227718])
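Newer pandas versions no longer return NaN for missing labels in a list lookup like the one above. A sketch of the same filter using `.reindex()` follows, on a made-up Series; `song_ids` and `playlist` are hypothetical names standing in for `songs_counts.song_id` and one playlist.

```python
import numpy as np
import pandas as pd

# Toy stand-in for songs_counts.song_id after thresholding (made-up values)
song_ids = pd.Series([0, 1, 2], index=['t1', 't2', 't3'], name='song_id')
playlist = ['derp', 't3', 't1']   # 'derp' is deliberately missing

# reindex returns NaN for missing labels instead of raising
looked_up = song_ids.reindex(playlist)

# Drop the misses and convert the surviving ids back to integers
kept = looked_up.dropna().to_numpy(dtype=int)
print(kept)
```

Filtering `songs_counts` on `count` before this lookup would then remove rare songs from every playlist in the same way.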