# Spotify Data Case Study
I listen to a lot of music on Spotify, as do many other people. But this is about me. My problem is that I promised a friend a list of my top 50 albums and I seriously don't know what they are. I can list maybe 25 before it becomes really hard to decide. The aim of this case study is to solve my pressing issue using supervised learning. And I guess other people can also download their Spotify data do the same thing. 

I decided on using a neural network since my music taste is quite sophistocated and therefore contains many complex patterns. Specifically, it will address the problem of varying feature length better than other methods (since I listen to some albums much more than others). For this case study I won't be working with the Spotify API or any song metadata, nor will I reference album, artist, or track ID as features in the learning. 

The results were better than I expected. The neural network and logistic regression each performed noticeably better than random assignment. The measurement is admittedly a little subjective: I counted how many of their 25 album picks I would even consider for my top 50. 

**Overview**
Step 1: Contact spotify support to download my full listening history. You can only download the last year from the account page, but I want all of it. 
Step 2: Clean the data, do some EDA, fit a neural network, predict my favorite albums.

## Initialize and Clean

In [1]:
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import time
import datetime

In [2]:
# read json files to df
for i in range(0,5):
    if i == 0:
        df = pd.read_json('MyData/endsong_' + str(i) + '.json')
    else:
        # ignore_index makes sure they fully combine
        df = df.append(pd.read_json('MyData/endsong_' + str(i) + '.json'), ignore_index=True)
y_train = pd.read_csv('y_train.csv')

**Clean**

In [3]:
%%capture
# drop podcasts
df.drop(df[~df['episode_show_name'].isnull()].index) 
# drop useless columns
renames = {"master_metadata_track_name":"track","master_metadata_album_artist_name":"artist","master_metadata_album_album_name":"album"}
drops = ["username","conn_country","ip_addr_decrypted","city","region","metro_code","longitude","latitude","offline","offline_timestamp","incognito_mode","user_agent_decrypted","platform","episode_name","episode_show_name","spotify_episode_uri","spotify_track_uri"]
df.rename(columns = renames, inplace = True)
df = df.drop(drops, axis = 1)
# drop Sleepy John
df = df.drop(df[df['artist'].str.match("Sleepy John",case=False,na=False)].index)
# drop empty observations
df.drop(df[df['spotify_track_uri'].isnull()].index)
# if I do not have over 20 observations of an album I assume it can't be one of my favorites
df = df.groupby(['artist', 'album']).filter(lambda x: len(x) > 20).reset_index()
# merge df with y_train file
df = df.merge(y_train, how='left', on=['artist', 'album'])

In [4]:
# create a date column
df['date'] = df['ts'].str.split("T", expand=True)[0]
# create a timestamp for number of days since unix epoch 
# technically it's the number of seconds as a time delta, but this makes no difference 
df['date_ts'] = (pd.to_datetime(df['date']) - np.datetime64('1970-01-01T00:00:00'))

In [39]:
# here's what the data looks like
df.head(6)

Unnamed: 0,index,ts,ms_played,track,artist,album,spotify_track_uri,reason_start,reason_end,shuffle,skipped,predict,date,date_ts
0,0,2018-07-15T17:15:45Z,153351,4th Dimension,KIDS SEE GHOSTS,KIDS SEE GHOSTS,spotify:track:6JyEh4kl9DLwmSAoNDRn5b,trackdone,trackdone,False,,,2018-07-15,17727 days
1,4,2019-07-25T23:58:38Z,44912,I'm in Love Again,Tomppabeats,Harbor,spotify:track:3Pm3R9cbWkanONrubREjW9,trackdone,trackdone,False,,,2019-07-25,18102 days
2,7,2018-11-08T18:54:12Z,46359,Shimmy,System Of A Down,Toxicity,spotify:track:1a3X8Y882vwSnlnHqf9ztF,trackdone,endplay,False,,,2018-11-08,17843 days
3,8,2019-02-15T06:12:55Z,245098,Kids See Ghosts,KIDS SEE GHOSTS,KIDS SEE GHOSTS,spotify:track:2I3dW2dCBZAJGj5X21E53k,trackdone,trackdone,False,,,2019-02-15,17942 days
4,9,2017-04-17T15:17:46Z,8904,"Sing About Me, I'm Dying Of Thirst",Kendrick Lamar,"good kid, m.A.A.d city",spotify:track:0sd6BRTa0O96tfEbFGhJF9,clickrow,endplay,False,,1.0,2017-04-17,17273 days
5,11,2017-03-24T19:19:26Z,493400,Holiday / Boulevard of Broken Dreams,Green Day,American Idiot,spotify:track:0MsrWnxQZxPAcov7c74sSo,trackdone,trackdone,False,,,2017-03-24,17249 days


**Preview**

In [40]:
%%capture
# describe all features
df.describe(include="all")

In [29]:
%%capture
# see what the frequency of trackdone and trackstart messages are
display(df["reason_end"].value_counts())
display(df["reason_start"].value_counts())

In [28]:
%%capture
# Observe the album I have listened to most: Blonde
album_grp = df.groupby(['album'])
album_grp.get_group('Blonde').head(4)

In [26]:
# 20 albums that I have the most observations for
df.groupby(['artist', 'album']).size().sort_values(ascending=False).head(20)

artist                 album                 
Frank Ocean            Blonde                    498
Chon                   Homey                     442
Flying Lotus           You're Dead!              388
Radiohead              Kid A                     324
Porter Robinson        Worlds                    317
Nujabes                Modal Soul                315
Jon Bellion            The Human Condition       301
Kanye West             The Life Of Pablo         299
Chon                   Grow                      282
Tokyo Police Club      A Lesson In Crime         270
Taylor Swift           Lover                     258
BROCKHAMPTON           SATURATION II             258
Red Hot Chili Peppers  Stadium Arcadium          254
Taylor Swift           1989                      253
Sufjan Stevens         Carrie & Lowell           244
Red Hot Chili Peppers  Californication           243
                       By the Way                242
Kendrick Lamar         good kid, m.A.A.d city    229


In [27]:
%%capture
# string matching for artist
df[df['artist'].str.match("ecco",case=False,na=False)].head(3)

## Feature and Neural Net Construction

In [12]:
# group by artist-album pair; each group will be one observation
dfgrouped = df.groupby(['artist','album'])
n_obs_max = dfgrouped.size().max()
n_groups = dfgrouped.ngroups

A = []
Y = []
# construct features, zero padding to deal with input length variation.
for name, group in dfgrouped:
    # timestamp (in days)
    a = np.array(group['date_ts'])
    a.resize(n_obs_max)
    # length of play
    b = np.array(group['ms_played'])
    b.resize(n_obs_max)
    # whether shuffle was on
    c = np.array(group['shuffle'])
    c.resize(n_obs_max)
    # standard deviation of song played. 
    # this feature distinguishes between spamming a single song and listening to many songs on the album
    e = np.array(pd.Series(group.track, dtype='category').index).std()
    A.append(a + b)
    Y.append(group['predict'].max())

A = np.vstack(A).T
A = A.astype(np.int64)
# training X
X_train = A[:,(~np.isnan(Y)).T].T
# training Y
Y = np.array(Y)[~np.isnan(Y)]

In [13]:
# the structure of this NN is extremely arbitrary. It's more of a proof of concept. 
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh', kernel_regularizer=tf.keras.regularizers.L2(0.1)),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
# binary crossentropy is fine for sigmoid activation
loss_fn = tf.keras.losses.BinaryCrossentropy()

# run model, didn't seem to get much after 20 epochs
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(X_train, Y, epochs=20)

# statistics about the prediction
# pd.DataFrame(model.predict(A.T).T[0]).describe()

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1d2872e1760>

In [14]:
# create grouped df to add predicted Y values. Inappropriately named 'test_y'
dfgrouped_merge = pd.DataFrame(dfgrouped.size())
dfgrouped_merge.columns = ['test_y']
dfgrouped_merge['test_y'] = model.predict(A.T).T[0]
dfgrouped_merge = df.merge(dfgrouped_merge, how='left', on=['artist', 'album'])

In [15]:
#df[df.test_y == 1].groupby(['artist','album','test_y']).size().head(55)
def f(x):
    k = f['test_y'].mean()
    return pd.Series(k, index='test_y')

df_predictions = pd.DataFrame(dfgrouped_merge.groupby(['artist','album']).apply(
    lambda x: pd.Series([x.test_y.mean(),x.predict.mean()], index=['test_y','predict']))).sort_values('test_y',ascending=0)

## Neural Net Results
Here I ran through a couple different ideas. Theres some variation due to the random starting parameters with gradient descent, so it's good to test out a couple things. I decided scrambling the predicted values was a good way to check the neural net against randomly chosen songs. It's a start at seeing how overfit the model is. I was also curious how a simple logistic regression would perform versus the "deep" neural net. Honestly the 'logistic regression' (single feature sigmoid) works pretty well, and it seems to catch a slightly different trend in the data, pulling up a couple good picks that the neural net missed. 

In [16]:
df_predictions[df_predictions.predict!=1].head(25)
# 1. ) NN set: 12/25 are considerations in my mind; Random set: 9/50 I barely recognized ~20 of them
# 2. ) ADDED "SONG SPAMMING FEATURE":    NN set: 10/25; Random set: 11/47  not sure what to make of the difference if any
# 3. )^ again:    NN set: 19/25. There are albums it consistently likes, that I like too. 
# The top 5-10 are consistently some of my considerations
# Random set: 10/44; Seems worse than the NN, but its pretty helpful in its own way. Randomly listing albums jogs my memory.
# 4. ) "logistic" regression (1 feature sigmoid NN). Performs exceptional on train set. 14/25 on test set !!!
# Random Set: 7/47; again some good picks that don't show up on the NN or LR.
# 5. ) Logistic regression again. 7/25   Random assignment: 10/45
# 6. ) Logistic 13/25, NN: 12/25, Random: not much different from other times
# 7. ) Added much more training data. 

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1
Tomppabeats,Harbor,0.924204,
BROCKHAMPTON,GINGER,0.896664,
Kendrick Lamar,To Pimp A Butterfly,0.896664,
Kanye West,The Life Of Pablo,0.890843,
Madvillain,Madvillainy,0.880098,
Rae Sremmurd,SremmLife 2,0.869153,
XXXTENTACION,17,0.869153,
"Tyler, The Creator",IGOR,0.86642,
Charly Bliss,Guppy,0.861609,
Flying Lotus,You're Dead!,0.847708,


## Random Assignment Results

In [17]:
# create a column of the same predictions, but permuted randomly
df_predictions['test_y_scramble'] = list(df_predictions.test_y.sample(frac=1).reset_index(drop=1))
# look at scrambled prediction values to see if the neural network is doing anything productive
# so far looks more effective than randomness... but there's a good chance of confirmation bias
df_predictions.sort_values('test_y_scramble',ascending=0).head(50)

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict,test_y_scramble
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mac DeMarco,This Old Dog,0.331755,,0.964504
Jimi Hendrix,Electric Ladyland,0.320928,,0.958369
Ugly Casanova,Sharpen Your Teeth,0.103632,,0.957338
System Of A Down,Toxicity,0.680597,,0.943285
Philanthrope,Clockwork,0.017902,,0.931907
KOAN Sound,Forgotten Myths,0.408924,,0.924204
Taylor Swift,Fearless,0.50696,,0.9148
DJ Okawari,Libyus Music Sound History 2004-2010,0.242424,,0.896664
Drake,More Life,0.196324,0.0,0.896664
"Tyler, The Creator",Flower Boy,0.199434,,0.890843


In [35]:
%%capture
# plot data to look for what features correlate with the model's predictions
df2 = df.merge(df_predictions, on=['artist','album'])
sns.scatterplot(data=df2, x='shuffle', y='test_y')

## Logistic Regression Results

In [36]:
# lets try a logistic regression. Seems to work as well as the NN


model_logitstic_regression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
model_logitstic_regression.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

model_logitstic_regression.fit(X_train, Y, epochs=200)

# statistics about the prediction
# pd.DataFrame(model.predict(A.T).T[0]).describe()

dfgrouped_merge_LR = pd.DataFrame(dfgrouped.size())
dfgrouped_merge_LR.columns = ['test_y']
dfgrouped_merge_LR['test_y'] = model_logitstic_regression.predict(A.T).T[0]
dfgrouped_merge_LR = df.merge(dfgrouped_merge_LR, how='left', on=['artist', 'album'])

#df[df.test_y == 1].groupby(['artist','album','test_y']).size().head(55)
def f(x):
    k = f['test_y'].mean()
    return pd.Series(k, index='test_y')

df_predictions_LR = pd.DataFrame(dfgrouped_merge.groupby(['artist','album']).apply(
    lambda x: pd.Series([x.test_y.mean(),x.predict.mean()], index=['test_y','predict']))).sort_values('test_y',ascending=0)

df_predictions_LR[df_predictions_LR.predict!=1].head(25)


# create a column of the same predictions, but permuted randomly
df_predictions_LR['test_y_scramble'] = list(df_predictions_LR.test_y.sample(frac=1).reset_index(drop=1))
# look at scrambled prediction values to see if the neural network is doing anything productive
# so far looks more effective than randomness... but there's a good chance of confirmation bias


Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200


Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78/200
Epoch 79/200
Epoch 80/200
Epoch 81/200
Epoch 82/200
Epoch 83/200
Epoch 84/200
Epoch 85/200
Epoch 86/200
Epoch 87/200
Epoch 88/200
Epoch 89/200
Epoch 90/200
Epoch 91/200
Epoch 92/200
Epoch 93/200
Epoch 94/200
Epoch 95/200
Epoch 96/200
Epoch 97/200
Epoch 98/200
Epoch 99/200
Epoch 100/200
Epoch 101/200
Epoch 102/200
Epoch 103/200
Epoch 104/200
Epoch 105/200
Epoch 106/200
Epoch 107/200
Epoch 108/200
Epoch 109/200
Epoch 110/200
Epoch 111/200
Epoch 112/200
Epoch 113/200
Epoch 114/200
Epoch 115/200
Epoch 116/200
Epoch 117/200
Epoch 118/200
Epoch 119/200
Epoch 120/200
Epoch 121/200
Epoch 122/200
Epoch 123/200
Epoch 124/200
Epoch 125/200
Epoch 126/200
Epoch 127/200
Epoch 128/200
Epoch 129/200
Epoch 130/200
Epoch 131/200
Epoch 132/200
Epoch 133/200
Epoch 134/200
Epoch 135/200
Epoch 136/200
Epoch 137/200
Epoch 138/200
Epoch 139/200
Epoch 140/200
Epoch 141/200
Epoch 142/200
Epoch 143/200
Epoch 144/200


Epoch 145/200
Epoch 146/200
Epoch 147/200
Epoch 148/200
Epoch 149/200
Epoch 150/200
Epoch 151/200
Epoch 152/200
Epoch 153/200
Epoch 154/200
Epoch 155/200
Epoch 156/200
Epoch 157/200
Epoch 158/200
Epoch 159/200
Epoch 160/200
Epoch 161/200
Epoch 162/200
Epoch 163/200
Epoch 164/200
Epoch 165/200
Epoch 166/200
Epoch 167/200
Epoch 168/200
Epoch 169/200
Epoch 170/200
Epoch 171/200
Epoch 172/200
Epoch 173/200
Epoch 174/200
Epoch 175/200
Epoch 176/200
Epoch 177/200
Epoch 178/200
Epoch 179/200
Epoch 180/200
Epoch 181/200
Epoch 182/200
Epoch 183/200
Epoch 184/200
Epoch 185/200
Epoch 186/200
Epoch 187/200
Epoch 188/200
Epoch 189/200
Epoch 190/200
Epoch 191/200
Epoch 192/200
Epoch 193/200
Epoch 194/200
Epoch 195/200
Epoch 196/200
Epoch 197/200
Epoch 198/200
Epoch 199/200
Epoch 200/200


In [38]:
df_predictions_LR[df_predictions.predict!=1].head(25)

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict,test_y_scramble
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tomppabeats,Harbor,0.924204,,0.072582
BROCKHAMPTON,GINGER,0.896664,,0.367004
Kendrick Lamar,To Pimp A Butterfly,0.896664,,0.449296
Kanye West,The Life Of Pablo,0.890843,,0.444893
Madvillain,Madvillainy,0.880098,,0.498583
Rae Sremmurd,SremmLife 2,0.869153,,0.543826
XXXTENTACION,17,0.869153,,0.057449
"Tyler, The Creator",IGOR,0.86642,,0.012642
Charly Bliss,Guppy,0.861609,,0.040976
Flying Lotus,You're Dead!,0.847708,,0.287854
