# Song Recommendation Engine for Groups

There are many recommendation models that take a list of songs a particular user has liked and output recommended songs for that user. Our objective is different: we want a machine learning model that recommends the songs most likely to garner positive feedback from a *group* of individuals at a party, each with unique tastes in music.

In order to do this, we need to develop a pipeline that takes a set of users and the songs they interacted with and then outputs a list of recommended songs for the entire group.


Note that we are building a pipeline rather than a single pre-trained model. Every group has a different mix of tastes, so any clusters we find in the upvoted songs are specific to that group. The model will therefore have to train live, during the party.

## Loading the Spotify Dataset

To build the pipeline, we will use the Spotify dataset of a million playlists and the songs in them. Real parties will likely have around 10 people, so we will consider 10 random playlists from the dataset. Each playlist is treated as a unique person, and the songs in it stand in for that person's upvotes.

In [7]:
# import some necessary libraries

import sys
import spotipy
import spotipy.util as util

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# dataframe settings
pd.set_option('display.max_columns', 50)

In [9]:
# load the data
original_df = pd.read_csv("0-999_playlists.csv")

In [10]:
original_df

Unnamed: 0,trackid,artist_name,track_name,pid
0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),0
1,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,Britney Spears,Toxic,0
2,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,Beyoncé,Crazy In Love,0
3,spotify:track:1AWQoqb9bSvzTjaLralEkT,Justin Timberlake,Rock Your Body,0
4,spotify:track:1lzr43nnXAijIGYnCT8M8H,Shaggy,It Wasn't Me,0
5,spotify:track:0XUfyU2QviPAs6bxSpXYG4,Usher,Yeah!,0
6,spotify:track:68vgtRHr7iZHpzGpon6Jlo,Usher,My Boo,0
7,spotify:track:3BxWKCI06eQ5Od8TY2JBeA,The Pussycat Dolls,Buttons,0
8,spotify:track:7H6ev70Weq6DdpZyyTmUXk,Destiny's Child,Say My Name,0
9,spotify:track:2PpruBYCo4H7WOBJ7Q2EwM,OutKast,Hey Ya! - Radio Mix / Club Mix,0


It is important to keep in mind that data from the app will be coming into this pipeline with the following two relevant columns:

$\text{username | upvoted_song_trackid}$

as the upvotes are cast. The app will only send each user's 5 most recently upvoted tracks (or fewer, if the user has not yet cast 5 upvotes).
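
As a rough illustration of that input contract, the sketch below keeps only the five most recent upvotes per user as they stream in. The function names `record_upvote` and `pipeline_rows` and the in-memory `recent_upvotes` store are hypothetical and not part of the app; the sketch also assumes that upvotes arrive in chronological order.

```python
from collections import defaultdict, deque

MAX_RECENT = 5  # the app sends at most this many tracks per user

# hypothetical in-memory store: username -> that user's most recent upvoted track ids
recent_upvotes = defaultdict(lambda: deque(maxlen=MAX_RECENT))

def record_upvote(username, trackid):
    """Register one upvote; a full deque drops its oldest entry automatically."""
    recent_upvotes[username].append(trackid)

def pipeline_rows():
    """Flatten the store into (username, upvoted_song_trackid) rows, oldest first per user."""
    return [(user, track) for user, tracks in recent_upvotes.items() for track in tracks]

# made-up track ids for illustration: user "a" casts 7 upvotes, user "b" casts 2
for i in range(7):
    record_upvote("a", f"track_{i}")
for i in range(2):
    record_upvote("b", f"other_{i}")

rows = pipeline_rows()
```

Here user "a" contributes only the last five of their seven upvotes, while user "b" contributes both of theirs, which matches the "5 or fewer" shape the pipeline expects.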

Hence, we should manipulate this data frame to match our pipeline input.

In [11]:
pipeline_df = original_df[['pid', 'trackid']]

In [12]:
pipeline_df

Unnamed: 0,pid,trackid
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H
5,0,spotify:track:0XUfyU2QviPAs6bxSpXYG4
6,0,spotify:track:68vgtRHr7iZHpzGpon6Jlo
7,0,spotify:track:3BxWKCI06eQ5Od8TY2JBeA
8,0,spotify:track:7H6ev70Weq6DdpZyyTmUXk
9,0,spotify:track:2PpruBYCo4H7WOBJ7Q2EwM


Since we are treating each playlist as a unique user, let's rename the pid column to username.

In [13]:
pipeline_df.columns = ['username', 'trackid']

In [14]:
pipeline_df

Unnamed: 0,username,trackid
0,0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI
1,0,spotify:track:6I9VzXrHxO9rA9A5euc8Ak
2,0,spotify:track:0WqIKmW4BTrj3eJFmnCKMv
3,0,spotify:track:1AWQoqb9bSvzTjaLralEkT
4,0,spotify:track:1lzr43nnXAijIGYnCT8M8H
5,0,spotify:track:0XUfyU2QviPAs6bxSpXYG4
6,0,spotify:track:68vgtRHr7iZHpzGpon6Jlo
7,0,spotify:track:3BxWKCI06eQ5Od8TY2JBeA
8,0,spotify:track:7H6ev70Weq6DdpZyyTmUXk
9,0,spotify:track:2PpruBYCo4H7WOBJ7Q2EwM


Now, we will choose 10 random usernames to consider as the people in our party.

In [15]:
from random import sample

number_of_playlists_to_select = 10
# pick 10 distinct playlist ids and keep them in ascending order
random_usernames = sorted(sample(range(0, 1000), number_of_playlists_to_select))

print(random_usernames)

# keep only the rows that belong to the selected users (vectorized, no row-by-row loop)
pipeline_df_shortened = pipeline_df[pipeline_df['username'].isin(random_usernames)].reset_index(drop=True)

[5, 167, 192, 306, 467, 472, 516, 648, 731, 797]


In [16]:
pipeline_df_shortened

Unnamed: 0,username,trackid
0,5,spotify:track:61LtVmmkGr8P9I2tSPvdpf
1,5,spotify:track:5Q0Nhxo0l2bP3pNjpGJwV1
2,5,spotify:track:1V4jC0vJ5525lEF1bFgPX2
3,5,spotify:track:3XVozq1aeqsJwpXrEZrDJ9
4,5,spotify:track:6eE6akYdf9w2rZUOIMwSgw
5,5,spotify:track:0CAfXk7DXMnon4gLudAp7J
6,5,spotify:track:5bitEcj72xFL3yv9ZS5fkE
7,5,spotify:track:4Ow5x7P5NAAR1jPoskudoA
8,5,spotify:track:3S2R0EVwBSAVMd5UMgKTL0
9,5,spotify:track:7MCNnnmwm7TXMh7xyNGohi


Next, for each user, we will randomly keep between 1 and 5 of their songs in the dataframe, mimicking the fact that this pipeline will receive only the five (or fewer) most recently upvoted tracks of each user.

In [17]:
from random import randrange

sampled_frames = []

for user in random_usernames:
    mini_df = pipeline_df_shortened[pipeline_df_shortened.username == user]
    # randomly choose between 1 and 5 (inclusive) of the user's tracks,
    # capped at the playlist length so that sample() cannot fail on short playlists
    num_tracks_user = min(randrange(1, 6), len(mini_df))
    sampled_frames.append(mini_df.sample(num_tracks_user))

# DataFrame.append was removed in pandas 2.0, so concatenate all pieces once
input_df = pd.concat(sampled_frames, ignore_index=True)

In [18]:
input_df

Unnamed: 0,username,trackid
0,5,spotify:track:5xS9hkTGfxqXyxX6wWWTt4
1,5,spotify:track:70cTMpcgWMcR18t9MRJFjB
2,167,spotify:track:6U8EShYiwNNrGogCxTeFm2
3,167,spotify:track:5LxQohFfm9A4V1VSTS1RDG
4,167,spotify:track:1dD1aarWotVIiFo5gGdMc2
5,192,spotify:track:0ESJlaM8CE1jRWaNtwSNj8
6,192,spotify:track:5w1vhNA2OEWUQ371QzyMmM
7,192,spotify:track:2ZBfTcQM9S3yTLKhHrvCnQ
8,192,spotify:track:2Fe6gDE0mCZz0g98i5QpVL
9,192,spotify:track:3a1lNhkSLSkpJE4MSHpDu9


## Feature Extraction

Note: input_df needs to have at least 6 rows. With $\le 5$ rows, the tracks can be passed directly to the Spotify API's recommendation engine as seed tracks (it takes a maximum of 5 seeds), and it is not possible to identify structure in the music tastes of a group from so few upvoted tracks.
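
The rule in this note can be written as a small guard. This is only a sketch: `MAX_SEEDS` mirrors the 5-seed limit mentioned above, while the function name `choose_strategy` and the returned labels are made up for illustration.

```python
MAX_SEEDS = 5  # the recommendation engine accepts at most 5 seed tracks

def choose_strategy(track_ids):
    """Decide how to handle the upvoted tracks based on how many there are.

    Returns 'seed_directly' when the tracks fit into a single recommendation
    request, and 'cluster' when there are enough rows to look for structure.
    """
    if len(track_ids) <= MAX_SEEDS:
        return "seed_directly"
    return "cluster"

# made-up track ids for illustration
small_party = choose_strategy(["t1", "t2", "t3"])
large_party = choose_strategy([f"t{i}" for i in range(8)])
```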

Now, for each track, we need the Spotify audio features that can be useful for clustering:

- danceability
- energy
- speechiness
- acousticness
- instrumentalness
- valence
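
A small helper can pull just these six fields out of an audio-features record. The sketch below assumes each record is a dict keyed by feature name, as in the loop used later in this notebook. The sample record reuses the values printed for the first track and adds a made-up `tempo` value to show that extra keys are ignored. `extract_features` is an illustrative name, not a Spotipy function.

```python
FEATURES = ["danceability", "energy", "speechiness",
            "acousticness", "instrumentalness", "valence"]

def extract_features(record):
    """Return the six clustering features from one audio-features dict, in FEATURES order."""
    return [record[name] for name in FEATURES]

sample_record = {
    "danceability": 0.451,
    "energy": 0.669,
    "speechiness": 0.0651,
    "acousticness": 0.0267,
    "instrumentalness": 0,
    "valence": 0.523,
    "tempo": 120.0,  # made-up value; not one of the six features, so it is skipped
}

feature_vector = extract_features(sample_record)
```

Keeping the feature names in one list means the DataFrame columns, the extraction step, and any later averaging can all share the same ordering.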

In [28]:
# set up Spotipy object

import os
from json.decoder import JSONDecodeError

scope = 'user-read-email'
username = 'musicrg1'

# read the app credentials from the environment instead of hard-coding them in the notebook
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']
redirect_uri = 'google.com'  # must match the redirect URI registered for the Spotify app

try:
    token = util.prompt_for_user_token(username, scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
except (AttributeError, JSONDecodeError):
    # the cached token file may be unreadable; delete it and prompt again
    os.remove(f".cache-{username}")
    token = util.prompt_for_user_token(username, scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)

if token:
    spot = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)


In [29]:
# create new dataframe with columns including the audio features
df_final = pd.DataFrame(columns = ['username', 'trackid', 'danceability', 'energy', 'speechiness', 'acousticness', 'instrumentalness', 'valence'])

In [30]:
df_final

Unnamed: 0,username,trackid,danceability,energy,speechiness,acousticness,instrumentalness,valence


In [31]:
# Now, we can actually get the audio features for the various tracks
feature_names = list(df_final.columns[2:]) # the six audio-feature columns, in order
count = 0
for index, row in input_df.iterrows():
    # audio_features returns a list; take the first element (a dict) since we pass a single track
    track_audio_features = spot.audio_features(row['trackid'])[0]
    new_row = [row['username'], row['trackid']] + [track_audio_features[name] for name in feature_names]
    print(new_row)
    df_final.loc[count] = new_row
    count = count + 1

[5, 'spotify:track:5xS9hkTGfxqXyxX6wWWTt4', 0.451, 0.669, 0.0651, 0.0267, 0, 0.523]
[5, 'spotify:track:70cTMpcgWMcR18t9MRJFjB', 0.743, 0.766, 0.0265, 0.0873, 0, 0.61]
[167, 'spotify:track:6U8EShYiwNNrGogCxTeFm2', 0.468, 0.139, 0.214, 0.629, 0, 0.337]
[167, 'spotify:track:5LxQohFfm9A4V1VSTS1RDG', 0.466, 0.326, 0.0411, 0.642, 0, 0.288]
[167, 'spotify:track:1dD1aarWotVIiFo5gGdMc2', 0.554, 0.335, 0.0281, 0.618, 0, 0.298]
[192, 'spotify:track:0ESJlaM8CE1jRWaNtwSNj8', 0.743, 0.571, 0.145, 0.24, 0, 0.495]
[192, 'spotify:track:5w1vhNA2OEWUQ371QzyMmM', 0.791, 0.634, 0.0639, 0.00243, 5.43e-05, 0.323]
[192, 'spotify:track:2ZBfTcQM9S3yTLKhHrvCnQ', 0.699, 0.869, 0.0744, 0.0408, 0, 0.406]
[192, 'spotify:track:2Fe6gDE0mCZz0g98i5QpVL', 0.554, 0.481, 0.329, 0.0139, 0, 0.456]
[192, 'spotify:track:3a1lNhkSLSkpJE4MSHpDu9', 0.63, 0.804, 0.0363, 0.215, 0, 0.492]
[306, 'spotify:track:7wGoVu4Dady5GV0Sv4UIsx', 0.577, 0.522, 0.0984, 0.13, 9.03e-05, 0.119]
[306, 'spotify:track:77IAeEz8LEchPN8UNjaTJ2', 0.713, 0.4

In [23]:
df_final

Unnamed: 0,username,trackid,danceability,energy,speechiness,acousticness,instrumentalness,valence
0,5,spotify:track:5xS9hkTGfxqXyxX6wWWTt4,0.451,0.669,0.0651,0.0267,0.0,0.523
1,5,spotify:track:70cTMpcgWMcR18t9MRJFjB,0.743,0.766,0.0265,0.0873,0.0,0.61
2,167,spotify:track:6U8EShYiwNNrGogCxTeFm2,0.468,0.139,0.214,0.629,0.0,0.337
3,167,spotify:track:5LxQohFfm9A4V1VSTS1RDG,0.466,0.326,0.0411,0.642,0.0,0.288
4,167,spotify:track:1dD1aarWotVIiFo5gGdMc2,0.554,0.335,0.0281,0.618,0.0,0.298
5,192,spotify:track:0ESJlaM8CE1jRWaNtwSNj8,0.743,0.571,0.145,0.24,0.0,0.495
6,192,spotify:track:5w1vhNA2OEWUQ371QzyMmM,0.791,0.634,0.0639,0.00243,5.43e-05,0.323
7,192,spotify:track:2ZBfTcQM9S3yTLKhHrvCnQ,0.699,0.869,0.0744,0.0408,0.0,0.406
8,192,spotify:track:2Fe6gDE0mCZz0g98i5QpVL,0.554,0.481,0.329,0.0139,0.0,0.456
9,192,spotify:track:3a1lNhkSLSkpJE4MSHpDu9,0.63,0.804,0.0363,0.215,0.0,0.492


Now that we have our data in a nice format, we can go ahead and build some models that recommend songs to the party. 

## Model 1

This model takes all of the songs that have been upvoted by all users, averages them (in terms of the 6 audio features), and then uses the Spotify API recommendation engine, seeded with 5 randomly chosen upvoted tracks and targeting those averages, to recommend 100 songs to the group.
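
The averaging step can also be sketched without pandas. The block below is a minimal illustration, assuming each track is a plain dict of feature values; `target_kwargs` is a hypothetical helper name, and the two rows reuse feature values printed earlier in this notebook. Building the `target_<feature>` names from one list is a design choice that avoids misspelling an individual keyword argument.

```python
from statistics import fmean

FEATURES = ["danceability", "energy", "speechiness",
            "acousticness", "instrumentalness", "valence"]

def target_kwargs(rows):
    """Average each feature over all rows and key it as target_<feature>."""
    return {f"target_{name}": fmean(row[name] for row in rows) for name in FEATURES}

rows = [
    {"danceability": 0.451, "energy": 0.669, "speechiness": 0.0651,
     "acousticness": 0.0267, "instrumentalness": 0.0, "valence": 0.523},
    {"danceability": 0.743, "energy": 0.766, "speechiness": 0.0265,
     "acousticness": 0.0873, "instrumentalness": 0.0, "valence": 0.61},
]

targets = target_kwargs(rows)
# the dict could then be unpacked into the request, for example:
# spot.recommendations(seed_tracks=seeds, limit=100, **targets)
```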

In [24]:
# get averages for different audio features

avg_danceability = df_final['danceability'].mean()
avg_energy = df_final['energy'].mean()
avg_speechiness = df_final['speechiness'].mean()
avg_acousticness = df_final['acousticness'].mean()
avg_instrumentalness = df_final['instrumentalness'].mean()
avg_valence = df_final['valence'].mean()

In [25]:
# randomly select 5 tracks as seeds from the dataframe
seeds = df_final['trackid'].sample(5).tolist()

In [26]:
# get the recommendations
num_recs = 100
recommendations_model_1 = spot.recommendations(seed_tracks = seeds, target_danceability = avg_danceability, target_energy = avg_energy, target_speechiness = avg_speechiness, target_acousticness = avg_acousticness, target_instrumentalness = avg_instrumentalness, target_valence = avg_valence,  limit = num_recs)
# the API may return fewer than num_recs tracks, so iterate over what actually came back
recommendations_model_1_output = [track['uri'] for track in recommendations_model_1['tracks']]

In [27]:
recommendations_model_1_output

['spotify:track:75yUmYDFb9tqmeXni8bJ69',
 'spotify:track:72gv4zhNvRVdQA0eOenCal',
 'spotify:track:0IUsGxWP18PBAUMhSEyRLO',
 'spotify:track:62vpWI1CHwFy7tMIcSStl8',
 'spotify:track:6NUiDZQALrNiDfqDB6ZBaF',
 'spotify:track:7D4ur1zrmGhTMV0mX25Pbd',
 'spotify:track:0VhgEqMTNZwYL1ARDLLNCX',
 'spotify:track:4KW1lqgSr8TKrvBII0Brf8',
 'spotify:track:7jZ4UZAmg006Qx3rVuF7JI',
 'spotify:track:7FOJvA3PxiIU0DN3JjQ7jT',
 'spotify:track:06FCvd7rrRcF3DdvWH5Isp',
 'spotify:track:5ZiL1WC0SPgNjEYV4eP0I2',
 'spotify:track:6Bqn71zg1dznO7Ck8ykEWc',
 'spotify:track:3lSDIJ2abCrOdDJ6pshUap',
 'spotify:track:5JdSkfCEPPXPRU2QgYZgh6',
 'spotify:track:0I20rLT2MJDhcF96AjbNYo',
 'spotify:track:2OEKdLpIhPT11FR746kOoQ',
 'spotify:track:3ZLyt2ndLFBh148XRYjYYZ',
 'spotify:track:1wIQtB3UQ1TfjNMZZqO6eh',
 'spotify:track:6FzjhVjXDoBGfq1sSdNq7S',
 'spotify:track:1TpeT2PWnAv9NDbqK1qy6J',
 'spotify:track:4k77gN6nozNqbsFGpAr6ol',
 'spotify:track:6PmnGYDsruYLBNY4Rpx4t9',
 'spotify:track:12KG3DCYkxoMUAPWq1uFnw',
 'spotify:track:

However, there is a big downside to this model: it is biased towards users who have upvoted more songs and thus have more rows in the DataFrame than other users. We can fix this issue in our next model.
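
A toy calculation makes the bias concrete. The numbers below are invented: one user upvoted five high-energy tracks and another upvoted a single low-energy track. Pooling all rows, as Model 1 does, lets the first user dominate, while averaging per user first weights both equally. The per-user average is shown only for comparison and is not the approach Model 2 takes.

```python
from statistics import fmean

# invented energy values: username -> energies of that user's upvoted tracks
upvotes = {
    "heavy_voter": [0.9, 0.9, 0.9, 0.9, 0.9],
    "light_voter": [0.1],
}

# Model 1 style: pool every row, so users with more rows carry more weight
pooled_mean = fmean(e for energies in upvotes.values() for e in energies)

# equal-weight comparison: average within each user, then across users
per_user_mean = fmean(fmean(energies) for energies in upvotes.values())

print(round(pooled_mean, 3), round(per_user_mean, 3))
```

The pooled average lands at roughly 0.77, far closer to the heavy voter's taste than the equal-weight value of 0.5.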

## Model 2

This model is not very different from Model 1; all it really does is fix the bias problem. For each user, it takes their upvoted songs and replaces them with 5 songs recommended for that user, so every user contributes the same number of rows. It then takes all of the $5k$ songs that have been recommended to the $k$ users, averages them (in terms of the 6 audio features), and uses the Spotify API recommendation engine to recommend 100 songs to the group.
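
The standardization step can be sketched independently of the Spotify client by passing the recommender in as a function. Everything named here (`standardize`, `recommend`, `fake_recommend`) is hypothetical; in this notebook the recommender role is played by the Spotify API call.

```python
RECS_PER_USER = 5

def standardize(upvotes_by_user, recommend, n=RECS_PER_USER):
    """Replace each user's upvotes with the tracks returned by the recommender.

    upvotes_by_user: dict mapping username -> list of upvoted track ids
    recommend: callable (seed_tracks, n) -> list of track ids (an assumed interface)
    """
    rows = []
    for user, tracks in upvotes_by_user.items():
        for rec in recommend(tracks, n):
            rows.append((user, rec))
    return rows

def fake_recommend(seed_tracks, n):
    """Stand-in recommender for the demo: fabricates n ids from the first seed."""
    return [f"{seed_tracks[0]}_rec{i}" for i in range(n)]

# made-up data: one user with four upvotes, one with a single upvote
upvotes_by_user = {"u1": ["a", "b", "c", "d"], "u2": ["z"]}
rows = standardize(upvotes_by_user, fake_recommend)
counts = {user: sum(1 for u, _ in rows if u == user) for user in upvotes_by_user}
```

With a recommender that returns exactly `n` tracks, both users end up with five rows each, regardless of how many songs they originally upvoted.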

In [162]:
# Data frame for storing all the recommended songs for each user
df_final_standardized = pd.DataFrame(columns = ['username', 'trackid', 'danceability', 'energy', 'speechiness', 'acousticness', 'instrumentalness', 'valence'])

In [163]:
count = 0
feature_names = list(df_final_standardized.columns[2:]) # the six audio-feature columns, in order
for user in random_usernames:
    user_tracks = df_final[df_final.username == user]['trackid'].tolist() # get the user's tracks (at most 5, the seed limit)
    user_5_recs = spot.recommendations(seed_tracks = user_tracks, limit = 5) # get 5 recommendations for the tracks
    user_5_recs_output = [track['uri'] for track in user_5_recs['tracks']] # get the recommendations' track URIs
    user_5_recs_audio_features = spot.audio_features(user_5_recs_output)
    # pair each URI with its features; zip also copes with fewer than 5 results
    for uri, features in zip(user_5_recs_output, user_5_recs_audio_features):
        new_standard_row = [user, uri] + [features[name] for name in feature_names]
        print(new_standard_row)
        df_final_standardized.loc[count] = new_standard_row
        count = count + 1

[94, 'spotify:track:0RCgSTkAbohhqEXVxkwBI0', 0.607, 0.737, 0.0263, 0.109, 0, 0.509]
[94, 'spotify:track:00tB8c71eTcG5jV7PhuF4Q', 0.612, 0.698, 0.0357, 0.248, 3.09e-06, 0.199]
[94, 'spotify:track:34PsixEmIceg39NpaYxBsH', 0.715, 0.815, 0.0576, 0.0689, 0, 0.804]
[94, 'spotify:track:6t7Qd7wnEdrNxj1QFUoiss', 0.489, 0.783, 0.0413, 0.0445, 0.000647, 0.542]
[94, 'spotify:track:081t95JRuDUrQYSS3h8iKk', 0.477, 0.672, 0.0306, 0.049, 0, 0.263]
[100, 'spotify:track:2dyQtNorA8TdLVxUa947hL', 0.362, 0.512, 0.033, 0.432, 0.955, 0.0805]
[100, 'spotify:track:0VUXPRlZlAItwGhN4zs2aK', 0.693, 0.642, 0.17, 0.587, 0, 0.903]
[100, 'spotify:track:1t1ZHTjbDxmwrwtw6CJYaB', 0.779, 0.764, 0.105, 0.0977, 0, 0.765]
[100, 'spotify:track:4gpTPLGLrKdE1JJVZ2McYD', 0.64, 0.764, 0.066, 0.489, 0, 0.963]
[100, 'spotify:track:2NW5VMjtynzMn5r4NCTxLY', 0.763, 0.158, 0.166, 0.978, 0.561, 0.885]
[219, 'spotify:track:29tIhq8ByVaG5GVlnS4XRL', 0.643, 0.744, 0.029, 0.0213, 0.287, 0.194]
[219, 'spotify:track:2oSK5tH6d1HPVWyPHCMfr7', 0

In [164]:
df_final_standardized

Unnamed: 0,username,trackid,danceability,energy,speechiness,acousticness,instrumentalness,valence
0,94,spotify:track:0RCgSTkAbohhqEXVxkwBI0,0.607,0.737,0.0263,0.109,0.0,0.509
1,94,spotify:track:00tB8c71eTcG5jV7PhuF4Q,0.612,0.698,0.0357,0.248,3.09e-06,0.199
2,94,spotify:track:34PsixEmIceg39NpaYxBsH,0.715,0.815,0.0576,0.0689,0.0,0.804
3,94,spotify:track:6t7Qd7wnEdrNxj1QFUoiss,0.489,0.783,0.0413,0.0445,0.000647,0.542
4,94,spotify:track:081t95JRuDUrQYSS3h8iKk,0.477,0.672,0.0306,0.049,0.0,0.263
5,100,spotify:track:2dyQtNorA8TdLVxUa947hL,0.362,0.512,0.033,0.432,0.955,0.0805
6,100,spotify:track:0VUXPRlZlAItwGhN4zs2aK,0.693,0.642,0.17,0.587,0.0,0.903
7,100,spotify:track:1t1ZHTjbDxmwrwtw6CJYaB,0.779,0.764,0.105,0.0977,0.0,0.765
8,100,spotify:track:4gpTPLGLrKdE1JJVZ2McYD,0.64,0.764,0.066,0.489,0.0,0.963
9,100,spotify:track:2NW5VMjtynzMn5r4NCTxLY,0.763,0.158,0.166,0.978,0.561,0.885


In [165]:
# get averages for different audio features

avg_danceability = df_final_standardized['danceability'].mean()
avg_energy = df_final_standardized['energy'].mean()
avg_speechiness = df_final_standardized['speechiness'].mean()
avg_acousticness = df_final_standardized['acousticness'].mean()
avg_instrumentalness = df_final_standardized['instrumentalness'].mean()
avg_valence = df_final_standardized['valence'].mean()

In [166]:
# randomly select 5 tracks as seeds from the dataframe
seeds = df_final_standardized['trackid'].sample(5).tolist()

In [167]:
# get the recommendations
num_recs = 100
recommendations_model_2 = spot.recommendations(seed_tracks = seeds, target_danceability = avg_danceability, target_energy = avg_energy, target_speechiness = avg_speechiness, target_acousticness = avg_acousticness, target_instrumentalness = avg_instrumentalness, target_valence = avg_valence,  limit = num_recs)
# the API may return fewer than num_recs tracks, so iterate over what actually came back
recommendations_model_2_output = [track['uri'] for track in recommendations_model_2['tracks']]

In [171]:
recommendations_model_2_output

['spotify:track:1LeqCcX0nGK0X0QzxkcYXe',
 'spotify:track:5i2bkKprKcLgcYb07g7I5u',
 'spotify:track:53cRE3WlbO3mH0f9npC1FP',
 'spotify:track:6HCeuEE5gftm4IYw2w5JHL',
 'spotify:track:6vD7QL274xLOMhFkPtJ0w4',
 'spotify:track:4WsD6mlUDPyDKNb73bowaT',
 'spotify:track:3JP5l8RE8Dj6PGO9Gzlc2s',
 'spotify:track:19wNTwojzecTDeZJQCKIAo',
 'spotify:track:4yVHgph1fQvcV5xZY2XGa8',
 'spotify:track:3XQY8kDjI8LARMIC9xkxQk',
 'spotify:track:18OlU3yF0SF8Vc0TtnU116',
 'spotify:track:3VvHSQ2x6PqKYe4MBgxV0a',
 'spotify:track:12wjCdJC8WgKNCfP1UKK1Z',
 'spotify:track:4YcUWt45LSmWt53aQMJP1e',
 'spotify:track:7FGMkUmNWHKeUOUqxGe0iB',
 'spotify:track:75DzpCyabjU1ljXQbMJjFC',
 'spotify:track:0nUHIerjaRmHJxjYPmu4qq',
 'spotify:track:4LXvQMaXoJIdBxhrYqGiWI',
 'spotify:track:4M0w8DMX8GwH9OskuCjrwk',
 'spotify:track:6lTQXJcbOYIg5fYivZou9R',
 'spotify:track:1egqZEHFjIngSwlHxOfv98',
 'spotify:track:4iegB2vnO5BZXxFYYw6tdT',
 'spotify:track:36zrB7SN7Hizi24wtellYv',
 'spotify:track:1F5u1Pgff0uoHOaL099i4b',
 'spotify:track: