Spotify Streaming History Machine Learning Pet Project

In this project I:

1) Collect my Spotify streaming history (17/11/2020 - 17/11/2021)

2) Carry out data wrangling to produce usable Pandas dataframes containing:
    UniqueID, track_uri, streamCount etc for each track in my personal '2021' playlist

3) Issue requests to the Spotify API to obtain main audio features for each track

4) Fit a model with y = streamCount and X = {audio features}, training it on my streaming history from 1/1/2021 to 31/10/2021

5) Make streamCount predictions using the model on the tracks in my '2021 Favourite 50' playlist (which were removed from the training data)

In [212]:
import pandas as pd
import numpy as np
import requests

In [213]:
# read StreamingHistory files into pandas dataframes

df_stream0 = pd.read_json('./MyData/StreamingHistory0.json')
df_stream1 = pd.read_json('./MyData/StreamingHistory1.json')
df_stream2 = pd.read_json('./MyData/StreamingHistory2.json')
df_stream3 = pd.read_json('./MyData/StreamingHistory3.json')
df_stream4 = pd.read_json('./MyData/StreamingHistory4.json')


# merge streaming dataframes
df_stream = pd.concat([df_stream0, df_stream1, df_stream2, df_stream3, df_stream4])

# create a 'UniqueID' for each song by combining the fields 'artistName' and 'trackName'
df_stream['UniqueID'] = df_stream['artistName'] + ":" + df_stream['trackName']

df_stream.head()

# df_stream: 1 year of streaming history (every instance of a track being played) 


Unnamed: 0,endTime,artistName,trackName,msPlayed,UniqueID
0,2020-11-17 01:16,BROCKHAMPTON,SUGAR,7560,BROCKHAMPTON:SUGAR
1,2020-11-17 01:19,Jaden,I'm Ready,189749,Jaden:I'm Ready
2,2020-11-17 01:19,12AM,come over,31310,12AM:come over
3,2020-11-17 01:22,The Weeknd,Final Lullaby - Bonus Track,149194,The Weeknd:Final Lullaby - Bonus Track
4,2020-11-17 01:22,Lil Nas X,HOLIDAY,154997,Lil Nas X:HOLIDAY
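
The five near-identical read_json calls in the cell above could also be collapsed into one helper. The following is a minimal sketch, not the code used in this notebook: load_streaming_history is a hypothetical name, and it assumes every export file sits in a single directory and matches the pattern StreamingHistory*.json.

```python
from pathlib import Path

import pandas as pd


def load_streaming_history(data_dir):
    """Read every StreamingHistory*.json file in data_dir into one dataframe."""
    # sorted() gives a stable, lexicographic file order (StreamingHistory0, 1, 2, ...)
    paths = sorted(Path(data_dir).glob("StreamingHistory*.json"))
    frames = [pd.read_json(path) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1; this is a choice made here,
    # the cell above keeps the per-file indexes instead
    df = pd.concat(frames, ignore_index=True)
    # same UniqueID convention as above: artistName and trackName joined by ":"
    df["UniqueID"] = df["artistName"] + ":" + df["trackName"]
    return df
```

Under those assumptions, `df_stream = load_streaming_history('./MyData')` would stand in for the read and concat steps, and would pick up additional export files without further edits.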


In [214]:
# read playlist data into dataframe

all_playlists = pd.read_json('./MyData/Playlist1.json')
df_playlists = pd.DataFrame.from_records(all_playlists.playlists.values.tolist(), index = 'name')

In [215]:
# get '2021' playlist
df_2021 = df_playlists.loc['2021']

# create dataframe of playlist items
df_2021_items= pd.DataFrame(df_2021[1])

# create dataframe of track info
df_2021_tracks = pd.DataFrame(df_2021_items['track'].values.tolist())

# add UniqueID column
df_2021_tracks['UniqueID'] = df_2021_tracks['artistName'] + ":" + df_2021_tracks['trackName']

# add column with track URI stripped of 'spotify:track:'
new = df_2021_tracks["trackUri"].str.split(":", expand = True)
df_2021_tracks['track_uri'] = new[2]

df_2021_tracks.head()
# df_2021_tracks: '2021' (499) tracks with UniqueID, track_uri


Unnamed: 0,trackName,artistName,albumName,trackUri,UniqueID,track_uri
0,Good Days,SZA,Good Days,spotify:track:3YJJjQPAbDT7mGpX3WtQ9A,SZA:Good Days,3YJJjQPAbDT7mGpX3WtQ9A
1,Anyone,Justin Bieber,Anyone,spotify:track:31qCy5ZaophVA81wtlwLc4,Justin Bieber:Anyone,31qCy5ZaophVA81wtlwLc4
2,Adrenaline,You Me At Six,Adrenaline,spotify:track:29PHjd3lImwA6U5mizZbde,You Me At Six:Adrenaline,29PHjd3lImwA6U5mizZbde
3,Lose Your Head,London Grammar,Lose Your Head,spotify:track:0lTNcrrVOlHJSuDXYNSkOH,London Grammar:Lose Your Head,0lTNcrrVOlHJSuDXYNSkOH
4,Vibez,ZAYN,Vibez,spotify:track:709F3MwiVvLD0LQXeKs5Cz,ZAYN:Vibez,709F3MwiVvLD0LQXeKs5Cz


In [1223]:
# get '2021 Favourite 50' playlist
df_2021_favourite_50 = df_playlists.loc['2021 Favourite 50']

# create dataframe of playlist items
df_2021_favourite_50_items = pd.DataFrame(df_2021_favourite_50[1])

# create dataframe of track info
df_50_tracks = pd.DataFrame(df_2021_favourite_50_items['track'].values.tolist())

# add UniqueID column
df_50_tracks['UniqueID'] = df_50_tracks['artistName'] + ":" + df_50_tracks['trackName']

# add column with track URI stripped of 'spotify:track:'
new = df_50_tracks["trackUri"].str.split(":", expand = True)
df_50_tracks['track_uri'] = new[2]

df_50_tracks.head()
# df_50_tracks: '2021 Favourite 50' (54) tracks with UniqueID, track_uri


Unnamed: 0,trackName,artistName,albumName,trackUri,UniqueID,track_uri
0,STAY (with Justin Bieber),The Kid LAROI,F*CK LOVE 3: OVER YOU,spotify:track:5PjdY0CKGZdEuoNab3yDmX,The Kid LAROI:STAY (with Justin Bieber),5PjdY0CKGZdEuoNab3yDmX
1,Higher (feat. iann dior),Clean Bandit,Higher (feat. iann dior),spotify:track:4OoYfejHABzYe2mG8p5s8b,Clean Bandit:Higher (feat. iann dior),4OoYfejHABzYe2mG8p5s8b
2,Take My Breath,The Weeknd,Take My Breath,spotify:track:6OGogr19zPTM4BALXuMQpF,The Weeknd:Take My Breath,6OGogr19zPTM4BALXuMQpF
3,BYE,Jaden,BYE,spotify:track:3OUyyDN7EZrL7i0Sbi5SVd,Jaden:BYE,3OUyyDN7EZrL7i0Sbi5SVd
4,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,INDUSTRY BABY (feat. Jack Harlow),spotify:track:27NovPIUIRrOZoCHxABJwK,Lil Nas X:INDUSTRY BABY (feat. Jack Harlow),27NovPIUIRrOZoCHxABJwK


In [1237]:
# Filtering df_stream (removing the 2021 Favourite 50 tracks)

mask = df_stream['UniqueID'].isin(df_50_tracks['UniqueID'])
df_filtered = df_stream[~mask]

# select 2021 streaming history
df_2021_stream = df_filtered[df_filtered['endTime'].str.startswith('2021')]

# select 2021 streaming history up to the end of October
df_2021_to_oct = df_2021_stream[~df_2021_stream['endTime'].str.startswith('2021-11')]
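
The prefix checks above work because endTime is stored as a 'YYYY-MM-DD HH:MM' string. An alternative, sketched here under the assumption that every endTime value parses cleanly, is to convert the column to datetimes and filter on an explicit range. select_period is a hypothetical helper name, and the example rows are made up for illustration.

```python
import pandas as pd


def select_period(df, start, end):
    """Keep rows whose endTime falls in the half-open interval [start, end)."""
    end_time = pd.to_datetime(df["endTime"])
    mask = (end_time >= pd.Timestamp(start)) & (end_time < pd.Timestamp(end))
    return df[mask]


# made-up rows straddling both ends of the training window
example = pd.DataFrame({
    "endTime": ["2020-12-31 23:59", "2021-01-01 00:10",
                "2021-10-31 23:58", "2021-11-01 00:03"],
    "trackName": ["a", "b", "c", "d"],
})

jan_to_oct = select_period(example, "2021-01-01", "2021-11-01")
print(jan_to_oct["trackName"].tolist())  # ['b', 'c']
```

Stating both bounds explicitly keeps the window correct even if the export later contains plays from December 2021 onwards, which the prefix approach would let through.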



In [1069]:
# create the working dataframe as a copy of df_2021_to_oct
df_merge = df_2021_to_oct.copy()

# add column checking if streamed song is in 2021 playlist
df_merge['In2021'] = np.where(df_merge['UniqueID'].isin(df_2021_tracks['UniqueID'].tolist()),1,0)

# left join with df_2021_tracks on UniqueID to bring in album and track_uri
df_merge = pd.merge(df_merge, df_2021_tracks[['albumName','UniqueID','track_uri']],how='left',on=['UniqueID'])

# selecting only the streams that were of tracks contained in 2021 playlist
df_merge_in_2021 = df_merge[df_merge['In2021'] == 1]

# selecting streams played for over a minute
df_final = df_merge_in_2021[df_merge_in_2021['msPlayed'] > 60000]


In [1098]:
# get uniqueIDs 
unique_IDs = df_final['UniqueID'].unique()
# convert to list
unique_IDs_list = unique_IDs.tolist()

# build dataframe that will contain only the relevant information for the analysis
df_relevant = pd.DataFrame(unique_IDs_list, columns=['UniqueID'])

# get artistNames
artistName_list = [unique_ID.split(":")[0] for unique_ID in unique_IDs_list]
# append artistName column to dataframe
df_relevant['artistName'] = artistName_list


In [1105]:
# code not relevant anymore since I have now removed duplicate track_uri rows

# retrieved the duplicate UniqueID since there was one more unique track_uri than unique UniqueIDs
# if length of the track_uri.unique() column > 1 then get UniqueID
j = 0
for i in range(len(unique_IDs_list)):
    temp_df = df_final[df_final.UniqueID == unique_IDs_list[i]]
    if temp_df.track_uri.nunique() != 1:
        j = i
        break

df_final[df_final.UniqueID == unique_IDs_list[j]]

# 6P2Y4KnF2x8uwZV2cZWA8t
# 0jH7gF7KCk2Lom9gimaKms

# so I will remove one of these duplicate track_uris (done below)

index,endTime,artistName,trackName,msPlayed,UniqueID,In2021,albumName,track_uri
0,2021-01-01 00:10,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
1,2021-01-01 00:17,You Me At Six,Adrenaline,209130,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
4,2021-01-01 11:15,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
7,2021-01-01 19:18,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
9,2021-01-01 20:53,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
11,2021-01-01 23:48,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
14,2021-01-02 09:02,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
16,2021-01-02 11:43,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
18,2021-01-02 13:10,You Me At Six,Adrenaline,209984,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
20,2021-01-02 14:14,You Me At Six,Adrenaline,210080,You Me At Six:Adrenaline,1,Adrenaline,29PHjd3lImwA6U5mizZbde
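
The search for a UniqueID that maps to more than one track_uri can also be done without an explicit loop. This is a sketch of one way to do it with groupby; ids_with_multiple_uris is a hypothetical helper name and the rows below are invented placeholders, not values from my data.

```python
import pandas as pd


def ids_with_multiple_uris(df):
    """Return the UniqueIDs that appear with more than one distinct track_uri."""
    # nunique() counts distinct non-null track_uri values per UniqueID
    uri_counts = df.groupby("UniqueID")["track_uri"].nunique()
    return uri_counts[uri_counts > 1].index.tolist()


# invented rows: one UniqueID listed under two different URIs
example = pd.DataFrame({
    "UniqueID": ["Artist A:Song", "Artist A:Song", "Artist B:Tune"],
    "track_uri": ["uri_1", "uri_2", "uri_3"],
})

print(ids_with_multiple_uris(example))  # ['Artist A:Song']
```

Unlike the loop, this reports every affected UniqueID at once rather than stopping at the first one.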


In [1161]:
# add column for indexing the dataframe
# (copy first so the new column is not written to a slice of df_merge_in_2021)
df_final = df_final.copy()
nrow = len(df_final)
df_final['index'] = range(0, nrow)

df_coldplay_color = df_final[df_final['UniqueID'] == 'Coldplay:Coloratura']
df_color_duplicate = df_coldplay_color[df_coldplay_color['albumName'] == 'Coloratura']

# getting indexes for duplicate rows to remove
duplicate_uri = '6P2Y4KnF2x8uwZV2cZWA8t'
index_store = df_final.loc[df_final['track_uri'] == duplicate_uri, 'index'].tolist()

# set index of dataframe to 'index'
df_final = df_final.set_index('index')
# drop duplicate rows from dataframe
df_final = df_final.drop(index_store)

# get track_uris
my_track_uris = df_final['track_uri'].unique()
# convert to list
track_uris_list = my_track_uris.tolist()
# append track_uri column to dataframe
df_relevant['track_uri'] = track_uris_list

df_relevant # UniqueID, artistName, track_uri

Unnamed: 0,UniqueID,artistName,track_uri,streamCount
0,You Me At Six:Adrenaline,You Me At Six,29PHjd3lImwA6U5mizZbde,26
1,SZA:Good Days,SZA,3YJJjQPAbDT7mGpX3WtQ9A,40
2,Justin Bieber:Anyone,Justin Bieber,31qCy5ZaophVA81wtlwLc4,93
3,London Grammar:Lose Your Head,London Grammar,0lTNcrrVOlHJSuDXYNSkOH,23
4,ZAYN:Vibez,ZAYN,709F3MwiVvLD0LQXeKs5Cz,70
...,...,...,...,...
425,Digga D:Red Light Green Light,Digga D,5Qq3zFalngksq1frZ7idHt,6
426,DPR IAN:So Beautiful,DPR IAN,6syar8JKCt3R9ZBl11zmgI,3
427,ILLENIUM:Superhero,ILLENIUM,1dVn0YP06AMySJXG6hdC55,2
428,Ed Sheeran:Overpass Graffiti,Ed Sheeran,4btFHqumCO31GksfuBLLv3,6


In [None]:
# add a column to df_relevant that stores the number of streams for each UniqueID, i.e. the number of matching rows in df_final
stream_counts = []

# count number of streams for each UniqueID
for ID in unique_IDs_list:
    temp_df2 = df_final[df_final['UniqueID'] == ID]
    stream_count = len(temp_df2['endTime'])
    stream_counts.append(stream_count)


df_relevant['streamCount'] = stream_counts
df_relevant

Unnamed: 0,UniqueID,artistName,track_uri,streamCount
0,You Me At Six:Adrenaline,You Me At Six,29PHjd3lImwA6U5mizZbde,26
1,SZA:Good Days,SZA,3YJJjQPAbDT7mGpX3WtQ9A,40
2,Justin Bieber:Anyone,Justin Bieber,31qCy5ZaophVA81wtlwLc4,93
3,London Grammar:Lose Your Head,London Grammar,0lTNcrrVOlHJSuDXYNSkOH,23
4,ZAYN:Vibez,ZAYN,709F3MwiVvLD0LQXeKs5Cz,70
...,...,...,...,...
425,Digga D:Red Light Green Light,Digga D,5Qq3zFalngksq1frZ7idHt,6
426,DPR IAN:So Beautiful,DPR IAN,6syar8JKCt3R9ZBl11zmgI,3
427,ILLENIUM:Superhero,ILLENIUM,1dVn0YP06AMySJXG6hdC55,2
428,Ed Sheeran:Overpass Graffiti,Ed Sheeran,4btFHqumCO31GksfuBLLv3,6
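
The per-track counting loop above can be expressed as a single groupby. The following is only a sketch of that alternative; count_streams is a hypothetical name and the example rows are invented.

```python
import pandas as pd


def count_streams(df_final):
    """Count the rows (streams) per UniqueID, in order of first appearance."""
    return (
        df_final.groupby("UniqueID", sort=False)  # sort=False keeps first-seen order
        .size()                                   # number of rows in each group
        .rename("streamCount")
        .reset_index()
    )


# invented rows: "B:two" is streamed three times, "A:one" once
example = pd.DataFrame({"UniqueID": ["B:two", "A:one", "B:two", "B:two"]})

counts = count_streams(example)
print(counts["UniqueID"].tolist())     # ['B:two', 'A:one']
print(counts["streamCount"].tolist())  # [3, 1]
```

Because each count stays attached to its UniqueID, this avoids relying on two separately built lists being in the same order.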


In [None]:
import spotify

In [1167]:

CLIENT_ID = 'hidden'
CLIENT_SECRET = 'hidden'

access_token = spotify.get_access_token(CLIENT_ID=CLIENT_ID, CLIENT_SECRET=CLIENT_SECRET)
print(access_token)

BQBBlEBzs0YK9Od2ov8QjNABW7Uf0Mujlabkr1SvdydCzEsvzM1RT1K1tTwe1Coc1xjmX98TaBr-Mv-qGeU
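
The spotify module imported above is a local helper whose contents are not shown here. As a rough idea of what such a helper might do, here is a sketch of Spotify's client credentials flow using requests. The function names, the lowercase parameter names and the injectable post argument are my own assumptions for illustration, not the helper's real interface.

```python
import base64

import requests

TOKEN_URL = "https://accounts.spotify.com/api/token"


def basic_auth_header(client_id, client_secret):
    """Build the HTTP Basic Authorization header from the app credentials."""
    raw = f"{client_id}:{client_secret}".encode("ascii")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def get_access_token(client_id, client_secret, post=requests.post):
    """Request an app-level access token (assumed shape of the hidden helper)."""
    response = post(
        TOKEN_URL,
        headers=basic_auth_header(client_id, client_secret),
        data={"grant_type": "client_credentials"},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on bad credentials or server errors
    return response.json()["access_token"]
```

Tokens obtained this way expire after a limited time, so a long-running notebook session may need to request a fresh one before the feature requests below.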


In [1168]:
headers = {'Authorization': 'Bearer {token}'.format(token=access_token)}
BASE_URL = 'https://api.spotify.com/v1/'

# create blank dictionary to store audio features
feature_dict = {}

# convert track_uri column to an iterable list
track_uris = df_relevant['track_uri'].to_list()

# loop through track URIs and pull audio features using the API,
# store all these in a dictionary
for t_uri in track_uris:
    
    feature_dict[t_uri] = {'danceability': 0,
                           'energy': 0,
                           'speechiness': 0,
                           'instrumentalness': 0,
                           'tempo': 0}
    
    #r = requests.get(BASE_URL + 'tracks/' + t_uri, headers=headers)
    #r = r.json()
    #feature_dict[t_uri]['popularity'] = r['popularity']
    
    s = requests.get(BASE_URL + 'audio-features/' + t_uri, headers=headers)
    s = s.json()
    feature_dict[t_uri]['danceability'] = s['danceability']
    feature_dict[t_uri]['energy'] = s['energy']
    feature_dict[t_uri]['speechiness'] = s['speechiness']
    feature_dict[t_uri]['instrumentalness'] = s['instrumentalness']
    feature_dict[t_uri]['tempo'] = s['tempo']
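
The loop above issues one request per track. The audio-features endpoint also accepts a comma-separated ids parameter (up to 100 IDs per call), so the same data could be fetched in a handful of requests. This is a sketch under that assumption, not the code that produced the results below; chunked, fetch_audio_features and the injectable get argument are hypothetical names.

```python
import requests

AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features"
FEATURES = ["danceability", "energy", "speechiness", "instrumentalness", "tempo"]


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_audio_features(track_ids, headers, get=requests.get, batch_size=100):
    """Return {track_id: {feature: value}} using batched requests."""
    features = {}
    for batch in chunked(track_ids, batch_size):
        response = get(
            AUDIO_FEATURES_URL,
            headers=headers,
            params={"ids": ",".join(batch)},
            timeout=10,
        )
        response.raise_for_status()
        for item in response.json()["audio_features"]:
            if item is None:  # the API returns null for IDs it cannot resolve
                continue
            features[item["id"]] = {name: item[name] for name in FEATURES}
    return features
```

The result has the same shape as feature_dict, so it could be passed to pd.DataFrame.from_dict(..., orient='index') in the same way.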

In [1170]:
feature_dict

# convert dictionary into dataframe with track_uri as the first column
df_features = pd.DataFrame.from_dict(feature_dict, orient='index')
df_features.insert(0, 'track_uri', df_features.index)
df_features.reset_index(inplace=True, drop=True)

df_features.head()

Unnamed: 0,track_uri,danceability,energy,speechiness,instrumentalness,tempo
0,29PHjd3lImwA6U5mizZbde,0.347,0.719,0.0999,8e-06,137.178
1,3YJJjQPAbDT7mGpX3WtQ9A,0.436,0.655,0.0583,8e-06,121.002
2,31qCy5ZaophVA81wtlwLc4,0.686,0.538,0.0345,3e-06,115.884
3,0lTNcrrVOlHJSuDXYNSkOH,0.556,0.712,0.0604,0.185,98.008
4,709F3MwiVvLD0LQXeKs5Cz,0.635,0.659,0.116,0.0285,96.855


In [1171]:
# create the working dataframe as a copy of df_relevant
df_streams_and_feats = df_relevant.copy()

# left join with df_features on track_uri to bring in the audio features
df_streams_and_feats = pd.merge(df_streams_and_feats, df_features[['track_uri','danceability','energy','speechiness', 'instrumentalness','tempo']],how='left',on=['track_uri'])

df_streams_and_feats.head()

Unnamed: 0,UniqueID,artistName,track_uri,streamCount,danceability,energy,speechiness,instrumentalness,tempo
0,You Me At Six:Adrenaline,You Me At Six,29PHjd3lImwA6U5mizZbde,26,0.347,0.719,0.0999,8e-06,137.178
1,SZA:Good Days,SZA,3YJJjQPAbDT7mGpX3WtQ9A,40,0.436,0.655,0.0583,8e-06,121.002
2,Justin Bieber:Anyone,Justin Bieber,31qCy5ZaophVA81wtlwLc4,93,0.686,0.538,0.0345,3e-06,115.884
3,London Grammar:Lose Your Head,London Grammar,0lTNcrrVOlHJSuDXYNSkOH,23,0.556,0.712,0.0604,0.185,98.008
4,ZAYN:Vibez,ZAYN,709F3MwiVvLD0LQXeKs5Cz,70,0.635,0.659,0.116,0.0285,96.855


In [1188]:
# repeat above feature requesting process for the tracks in '2021 Favourite 50'

# create blank dictionary to store audio features
pred_feature_dict = {}

# convert track_uri column to an iterable list
pred_track_uris = df_50_tracks['track_uri'].to_list()

# loop through track URIs and pull audio features using the API,
# store all these in a dictionary
for t_uri in pred_track_uris:
    
    pred_feature_dict[t_uri] = {'danceability': 0,
                                'energy': 0,
                                'speechiness': 0,
                                'instrumentalness': 0,
                                'tempo': 0}
    
    #r = requests.get(BASE_URL + 'tracks/' + t_uri, headers=headers)
    #r = r.json()
    #feature_dict[t_uri]['popularity'] = r['popularity']
    
    s = requests.get(BASE_URL + 'audio-features/' + t_uri, headers=headers)
    s = s.json()
    pred_feature_dict[t_uri]['danceability'] = s['danceability']
    pred_feature_dict[t_uri]['energy'] = s['energy']
    pred_feature_dict[t_uri]['speechiness'] = s['speechiness']
    pred_feature_dict[t_uri]['instrumentalness'] = s['instrumentalness']
    pred_feature_dict[t_uri]['tempo'] = s['tempo']

In [1198]:
pred_feature_dict

# convert dictionary into dataframe with track_uri as the first column
df_pred_features = pd.DataFrame.from_dict(pred_feature_dict, orient='index')
df_pred_features.insert(0, 'track_uri', df_pred_features.index)
df_pred_features.reset_index(inplace=True, drop=True)

df_pred_features.head()

Unnamed: 0,track_uri,danceability,energy,speechiness,instrumentalness,tempo
0,5PjdY0CKGZdEuoNab3yDmX,0.591,0.764,0.0483,0.0,169.928
1,4OoYfejHABzYe2mG8p5s8b,0.716,0.736,0.0363,1.4e-05,104.018
2,6OGogr19zPTM4BALXuMQpF,0.748,0.74,0.0484,2.2e-05,121.004
3,3OUyyDN7EZrL7i0Sbi5SVd,0.635,0.51,0.0344,0.0127,119.986
4,27NovPIUIRrOZoCHxABJwK,0.736,0.704,0.0615,0.0,149.995


In [1203]:
# create the working dataframe as a copy of df_50_tracks
df_pred_streams_and_feats = df_50_tracks.copy()

# left join with df_pred_features on track_uri to bring in the audio features
df_pred_streams_and_feats = pd.merge(df_pred_streams_and_feats, df_pred_features[['track_uri','danceability','energy','speechiness', 'instrumentalness','tempo']],how='left',on=['track_uri'])

df_pred_streams_and_feats.head()

Unnamed: 0,trackName,artistName,albumName,trackUri,UniqueID,track_uri,danceability,energy,speechiness,instrumentalness,tempo
0,STAY (with Justin Bieber),The Kid LAROI,F*CK LOVE 3: OVER YOU,spotify:track:5PjdY0CKGZdEuoNab3yDmX,The Kid LAROI:STAY (with Justin Bieber),5PjdY0CKGZdEuoNab3yDmX,0.591,0.764,0.0483,0.0,169.928
1,Higher (feat. iann dior),Clean Bandit,Higher (feat. iann dior),spotify:track:4OoYfejHABzYe2mG8p5s8b,Clean Bandit:Higher (feat. iann dior),4OoYfejHABzYe2mG8p5s8b,0.716,0.736,0.0363,1.4e-05,104.018
2,Take My Breath,The Weeknd,Take My Breath,spotify:track:6OGogr19zPTM4BALXuMQpF,The Weeknd:Take My Breath,6OGogr19zPTM4BALXuMQpF,0.748,0.74,0.0484,2.2e-05,121.004
3,BYE,Jaden,BYE,spotify:track:3OUyyDN7EZrL7i0Sbi5SVd,Jaden:BYE,3OUyyDN7EZrL7i0Sbi5SVd,0.635,0.51,0.0344,0.0127,119.986
4,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,INDUSTRY BABY (feat. Jack Harlow),spotify:track:27NovPIUIRrOZoCHxABJwK,Lil Nas X:INDUSTRY BABY (feat. Jack Harlow),27NovPIUIRrOZoCHxABJwK,0.736,0.704,0.0615,0.0,149.995


In [1204]:
# Begin Machine Learning process

# set prediction target
y = df_streams_and_feats.streamCount

# choose features
features = ['danceability', 'energy', 'speechiness', 'instrumentalness', 'tempo']

# call features data X
X = df_streams_and_feats[features]

# quickly reviewing the data
X.describe()
X.head()

Unnamed: 0,danceability,energy,speechiness,instrumentalness,tempo
0,0.347,0.719,0.0999,8e-06,137.178
1,0.436,0.655,0.0583,8e-06,121.002
2,0.686,0.538,0.0345,3e-06,115.884
3,0.556,0.712,0.0604,0.185,98.008
4,0.635,0.659,0.116,0.0285,96.855


In [1205]:
import sklearn


In [1206]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
streams_model = DecisionTreeRegressor(random_state=1)

# Fit model
streams_model.fit(X, y)

DecisionTreeRegressor(random_state=1)
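
One way to gauge how well a tree like this generalises is to hold back part of the training tracks and measure the error on them. The sketch below does that on synthetic stand-in data (random numbers, not my listening history), so the printed errors say nothing about the real model; X_demo, y_demo and demo_model are names made up for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (NOT my listening data):
# 200 "tracks" with 5 random features each and a random stream count
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = rng.integers(1, 100, size=200)

# hold back 25% of the rows (the train_test_split default) for validation
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(train_X, train_y)

# mean absolute error on rows the tree has seen and on rows it has not
train_mae = mean_absolute_error(train_y, demo_model.predict(train_X))
val_mae = mean_absolute_error(val_y, demo_model.predict(val_X))
print(train_mae, val_mae)
```

An unrestricted decision tree reproduces its training rows exactly, so the training error is not informative; the validation error is the figure to watch.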

In [1236]:
# set up predictions X
pred_X = df_pred_streams_and_feats[features]

df_pred_streams_and_feats.trackName

# make predictions
print(streams_model.predict(pred_X))

# At this point I realise that, by removing my favourite 54 tracks from the model training,
# I have withheld from the model the most valuable information about the tracks that I like the most.
# Hence the predictions are very far from the truth.


[ 5. 10. 10.  6.  3. 66.  3. 16.  7. 16.  2. 29.  4.  7. 30. 25.  6.  6.
 15.  9. 33. 33. 23.  7. 10. 11. 53.  2. 31.  1. 10.  6.  5.  1.  4.  4.
 23.  5.  6.  7. 53.  7.  2.  8. 16. 22.  1.  4.  3. 15.  5. 19.  6. 30.
  4.]
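
To put a number on how far the predictions are from the truth, they can be lined up against the actual stream counts. This is a sketch only: compare_predictions is a hypothetical helper, the rows are invented, and for a fair comparison the streaming dataframe passed in would need the same filters as the training data (January to October 2021, plays over a minute).

```python
import numpy as np
import pandas as pd


def compare_predictions(df_tracks, predictions, df_streams):
    """Line up predicted and actual stream counts per UniqueID."""
    actual = df_streams.groupby("UniqueID").size().rename("actualCount")
    comparison = df_tracks[["UniqueID"]].copy()
    comparison["predicted"] = np.asarray(predictions, dtype=float)
    comparison = comparison.join(actual, on="UniqueID")
    # tracks with no streams at all get a count of 0
    comparison["actualCount"] = comparison["actualCount"].fillna(0).astype(int)
    comparison["absError"] = (comparison["predicted"] - comparison["actualCount"]).abs()
    return comparison


# invented rows: "A:x" streamed three times, "B:y" once, "C:z" never
tracks = pd.DataFrame({"UniqueID": ["A:x", "B:y", "C:z"]})
streams = pd.DataFrame({"UniqueID": ["A:x", "A:x", "B:y", "A:x"]})

comparison = compare_predictions(tracks, [5.0, 1.0, 2.0], streams)
print(comparison["actualCount"].tolist())  # [3, 1, 0]
print(comparison["absError"].tolist())     # [2.0, 0.0, 2.0]
```

The mean of the absError column then gives a single mean absolute error for the whole playlist.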


In [1222]:
# Next idea: train the model with the most-listened-to tracks included, and try to predict streamCount for tracks closer to the middle of my listening habits
