# Spotify Data Case Study
I listen to a lot of music on Spotify, as do many other people. I promised a friend a list of my top 50 albums but found this very challenging. I was able to list ~25 of them, but after that I had a lot of trouble deciding. This case study is a demonstration that supervised machine learning can solve pressing real-world issues. 

For this challenge I decided on a neural network since my music taste is quite sophistocated and therefore contains many complex patterns. Specifically, it will address the problem of varying feature length better than other methods (since I listen to some albums much more than others). I won't be working with the Spotify API or any song metadata, nor will I reference album, artist, or track ID as features in the learning. These would be very useful inclusions in a deeper analysis, but require much more training data to avoid overfitting. 

The results were more positive than I expected. The neural network and logistic regression each performed noticeably better than random assignment. However the measurement is admittedly a little subjective: I counted how many of their album picks I would even consider for my top 50. 

**Process Overview**

- Contact spotify support to download full listening history. You can only download the last year from the account page.
- Create a small .csv of training data (~50 album-artist pairs with a rating of 1 or 0)
- Clean the dataset. Drop unnecessary columns and rows, address NA values, and create new features.
- Exploratory data analysis to get a general feeling of the dataset.
- Extract key features into matrices.
- Create neural network and logistic regression models in tensorflow. Train them, then predict on all albums in the dataset.
- Combine predicted values with original dataset, arrange by confidence, and evaluate the prediected top 50.

## Get To Know The Data

In [1]:
import json
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import time
import datetime

In [2]:
# read json files to df
for i in range(0,5):
    if i == 0:
        df = pd.read_json('MyData/endsong_' + str(i) + '.json')
    else:
        # ignore_index makes sure they fully combine
        df = df.append(pd.read_json('MyData/endsong_' + str(i) + '.json'), ignore_index=True)
y_train = pd.read_csv('y_train.csv')

**Clean**

In [3]:
%%capture
# drop podcasts
df.drop(df[~df['episode_show_name'].isnull()].index) 
# drop empty observations
df.drop(df[df['spotify_track_uri'].isnull()].index)
# drop useless columns
renames = {"master_metadata_track_name":"track","master_metadata_album_artist_name":"artist","master_metadata_album_album_name":"album"}
drops = ["username","conn_country","ip_addr_decrypted","city","region","metro_code","longitude","latitude","offline","offline_timestamp","incognito_mode","user_agent_decrypted","platform","episode_name","episode_show_name","spotify_episode_uri","spotify_track_uri"]
df.rename(columns = renames, inplace = True)
df = df.drop(drops, axis = 1)
# drop Sleepy John
df = df.drop(df[df['artist'].str.match("Sleepy John",case=False,na=False)].index)
# if I do not have over 20 observations of an album I assume it can't be one of my favorites
df = df.groupby(['artist', 'album']).filter(lambda x: len(x) > 20).reset_index()
# merge df with y_train file
df = df.merge(y_train, how='left', on=['artist', 'album'])

In [4]:
# create a date column
df['date'] = df['ts'].str.split("T", expand=True)[0]
# create a timestamp for number of days since unix epoch 
# technically it's the number of seconds as a time delta, but this makes no difference 
df['date_ts'] = (pd.to_datetime(df['date']) - np.datetime64('1970-01-01T00:00:00'))

**Preview**

In [5]:
# here's what the data looks like
df.head(6)

Unnamed: 0,index,ts,ms_played,track,artist,album,reason_start,reason_end,shuffle,skipped,predict,date,date_ts
0,0,2018-07-15T17:15:45Z,153351,4th Dimension,KIDS SEE GHOSTS,KIDS SEE GHOSTS,trackdone,trackdone,False,,,2018-07-15,17727 days
1,4,2019-07-25T23:58:38Z,44912,I'm in Love Again,Tomppabeats,Harbor,trackdone,trackdone,False,,,2019-07-25,18102 days
2,7,2018-11-08T18:54:12Z,46359,Shimmy,System Of A Down,Toxicity,trackdone,endplay,False,,,2018-11-08,17843 days
3,8,2019-02-15T06:12:55Z,245098,Kids See Ghosts,KIDS SEE GHOSTS,KIDS SEE GHOSTS,trackdone,trackdone,False,,,2019-02-15,17942 days
4,9,2017-04-17T15:17:46Z,8904,"Sing About Me, I'm Dying Of Thirst",Kendrick Lamar,"good kid, m.A.A.d city",clickrow,endplay,False,,1.0,2017-04-17,17273 days
5,11,2017-03-24T19:19:26Z,493400,Holiday / Boulevard of Broken Dreams,Green Day,American Idiot,trackdone,trackdone,False,,,2017-03-24,17249 days


In [6]:
%%capture
# describe all features
df.describe(include="all")

In [7]:
%%capture
# see what the frequency of trackdone and trackstart messages are
display(df["reason_end"].value_counts())
display(df["reason_start"].value_counts())

In [8]:
%%capture
# Observe the album I have listened to most: Blonde
album_grp = df.groupby(['album'])
album_grp.get_group('Blonde').head(4)

In [9]:
# 20 albums that I have the most observations for
df.groupby(['artist', 'album']).size().sort_values(ascending=False).head(20)

artist                 album                 
Frank Ocean            Blonde                    498
Chon                   Homey                     442
Flying Lotus           You're Dead!              388
Radiohead              Kid A                     324
Porter Robinson        Worlds                    317
Nujabes                Modal Soul                315
Jon Bellion            The Human Condition       301
Kanye West             The Life Of Pablo         299
Chon                   Grow                      282
Tokyo Police Club      A Lesson In Crime         270
Taylor Swift           Lover                     258
BROCKHAMPTON           SATURATION II             258
Red Hot Chili Peppers  Stadium Arcadium          254
Taylor Swift           1989                      253
Sufjan Stevens         Carrie & Lowell           244
Red Hot Chili Peppers  Californication           243
                       By the Way                242
Kendrick Lamar         good kid, m.A.A.d city    229


In [10]:
%%capture
# string matching for artist
df[df['artist'].str.match("ecco",case=False,na=False)].head(3)

## Feature and Neural Net Construction

In [11]:
# group by artist-album pair; each group will be one observation
dfgrouped = df.groupby(['artist','album'])
n_obs_max = dfgrouped.size().max()
n_groups = dfgrouped.ngroups

A = []
Y = []
# construct features, zero padding to deal with input length variation.
for name, group in dfgrouped:
    # timestamp (in days)
    a = np.array(group['date_ts'])
    a.resize(n_obs_max)
    # length of play
    b = np.array(group['ms_played'])
    b.resize(n_obs_max)
    # whether shuffle was on
    c = np.array(group['shuffle'])
    c.resize(n_obs_max)
    # standard deviation of song played. 
    # this feature distinguishes between spamming a single song and listening to many songs on the album
    e = np.array(pd.Series(group.track, dtype='category').index).std()
    A.append(a + b)
    Y.append(group['predict'].max())

A = np.vstack(A).T
A = A.astype(np.int64)
# training X
X_train = A[:,(~np.isnan(Y)).T].T
# training Y
Y = np.array(Y)[~np.isnan(Y)]

In [12]:
%%capture
# the structure of this NN is extremely arbitrary. It's more of a proof of concept. 
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh', kernel_regularizer=tf.keras.regularizers.L2(0.1)),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
# binary crossentropy is fine for sigmoid activation
loss_fn = tf.keras.losses.BinaryCrossentropy()

# run model, didn't seem to get much after 20 epochs
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(X_train, Y, epochs=20)

# statistics about the prediction
# pd.DataFrame(model.predict(A.T).T[0]).describe()

In [13]:
# create grouped df to add predicted Y values. Inappropriately named 'test_y'
dfgrouped_merge = pd.DataFrame(dfgrouped.size())
dfgrouped_merge.columns = ['test_y']
dfgrouped_merge['test_y'] = model.predict(A.T).T[0]
dfgrouped_merge = df.merge(dfgrouped_merge, how='left', on=['artist', 'album'])

In [14]:
#df[df.test_y == 1].groupby(['artist','album','test_y']).size().head(55)
def f(x):
    k = f['test_y'].mean()
    return pd.Series(k, index='test_y')

df_predictions = pd.DataFrame(dfgrouped_merge.groupby(['artist','album']).apply(
    lambda x: pd.Series([x.test_y.mean(),x.predict.mean()], index=['test_y','predict']))).sort_values('test_y',ascending=0)

## Neural Net Results
Here I ran through a couple different ideas. Theres some variation due to the random starting parameters with gradient descent, so it's good to test out a couple things. I decided scrambling the predicted values was a good way to check the neural net against randomly chosen songs. It's a start at seeing how overfit the model is. I was also curious how a simple logistic regression would perform versus the "deep" neural net. Honestly the 'logistic regression' (single feature sigmoid) works pretty well, and it seems to catch a slightly different trend in the data, pulling up a couple good picks that the neural net missed. 

In [15]:
df_predictions[df_predictions.predict!=1].head(25)
# 1. ) NN set: 12/25 are considerations in my mind; Random set: 9/50 I barely recognized ~20 of them
# 2. ) ADDED "SONG SPAMMING FEATURE":    NN set: 10/25; Random set: 11/47  not sure what to make of the difference if any
# 3. )^ again:    NN set: 19/25. There are albums it consistently likes, that I like too. 
# The top 5-10 are consistently some of my considerations
# Random set: 10/44; Seems worse than the NN, but its pretty helpful in its own way. Randomly listing albums jogs my memory.
# 4. ) "logistic" regression (1 feature sigmoid NN). Performs exceptional on train set. 14/25 on test set !!!
# Random Set: 7/47; again some good picks that don't show up on the NN or LR.
# 5. ) Logistic regression again. 7/25   Random assignment: 10/45
# 6. ) Logistic 13/25, NN: 12/25, Random: not much different from other times
# 7. ) Added much more training data. 

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1
Tomppabeats,Harbor,0.912141,
Flying Lotus,You're Dead!,0.890063,
Punch Brothers,The Phosphorescent Blues,0.888555,
Big K.R.I.T.,4eva Is A Mighty Long Time,0.874911,
Daft Punk,Discovery,0.84783,
Pink Floyd,The Dark Side of the Moon,0.846025,
Kendrick Lamar,DAMN.,0.84491,
"Tyler, The Creator",IGOR,0.84491,
Red Hot Chili Peppers,Californication,0.842219,
Death Cab for Cutie,Plans,0.839557,


## Random Assignment Results

In [16]:
# create a column of the same predictions, but permuted randomly
df_predictions['test_y_scramble'] = list(df_predictions.test_y.sample(frac=1).reset_index(drop=1))
# look at scrambled prediction values to see if the neural network is doing anything productive
# so far looks more effective than randomness... but there's a good chance of confirmation bias
df_predictions.sort_values('test_y_scramble',ascending=0).head(25)

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict,test_y_scramble
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Luke Bryan,Kill The Lights,0.364171,,0.943545
Losing Teeth,Houses,0.059687,,0.936516
Ka,Honor Killed the Samurai,0.046216,,0.917595
Polyphia,New Levels New Devils,0.027959,,0.912141
Eric Clapton,Complete Clapton,0.05797,,0.912141
BROCKHAMPTON,SATURATION III,0.105813,0.0,0.904056
The Antlers,Burst Apart,0.205704,,0.90074
Chance the Rapper,Coloring Book,0.457322,0.0,0.895892
Pop Smoke,Shoot For The Stars Aim For The Moon,0.084309,,0.890063
Aesop Rock,None Shall Pass,0.047706,,0.888555


In [17]:
%%capture
# plot data to look for what features correlate with the model's predictions
df2 = df.merge(df_predictions, on=['artist','album'])
sns.scatterplot(data=df2, x='shuffle', y='test_y')

## Logistic Regression Results

In [19]:
%%capture
# lets try a logistic regression. Seems to work as well as the NN


model_logitstic_regression = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
model_logitstic_regression.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

model_logitstic_regression.fit(X_train, Y, epochs=200)

# statistics about the prediction
# pd.DataFrame(model.predict(A.T).T[0]).describe()

dfgrouped_merge_LR = pd.DataFrame(dfgrouped.size())
dfgrouped_merge_LR.columns = ['test_y']
dfgrouped_merge_LR['test_y'] = model_logitstic_regression.predict(A.T).T[0]
dfgrouped_merge_LR = df.merge(dfgrouped_merge_LR, how='left', on=['artist', 'album'])

#df[df.test_y == 1].groupby(['artist','album','test_y']).size().head(55)
def f(x):
    k = f['test_y'].mean()
    return pd.Series(k, index='test_y')

df_predictions_LR = pd.DataFrame(dfgrouped_merge.groupby(['artist','album']).apply(
    lambda x: pd.Series([x.test_y.mean(),x.predict.mean()], index=['test_y','predict']))).sort_values('test_y',ascending=0)

df_predictions_LR[df_predictions_LR.predict!=1].head(25)


# create a column of the same predictions, but permuted randomly
df_predictions_LR['test_y_scramble'] = list(df_predictions_LR.test_y.sample(frac=1).reset_index(drop=1))
# look at scrambled prediction values to see if the neural network is doing anything productive
# so far looks more effective than randomness... but there's a good chance of confirmation bias


In [20]:
df_predictions_LR[df_predictions.predict!=1].head(25)

Unnamed: 0_level_0,Unnamed: 1_level_0,test_y,predict,test_y_scramble
artist,album,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Tomppabeats,Harbor,0.912141,,0.07861
Flying Lotus,You're Dead!,0.890063,,0.048789
Punch Brothers,The Phosphorescent Blues,0.888555,,0.084309
Big K.R.I.T.,4eva Is A Mighty Long Time,0.874911,,0.084309
Daft Punk,Discovery,0.84783,,0.31461
Pink Floyd,The Dark Side of the Moon,0.846025,,0.197716
Kendrick Lamar,DAMN.,0.84491,,0.051984
"Tyler, The Creator",IGOR,0.84491,,0.125366
Red Hot Chili Peppers,Californication,0.842219,,0.11249
Death Cab for Cutie,Plans,0.839557,,0.07861
