# Predict Songs Using Personal Playlists

#### Notebook overview

The goal of this notebook is to let users upload a personal playlist and receive recommendations based on it. Once a playlist is uploaded, we compute an "average" vector for it and compare that vector to each song in our master database. The resulting similarity score for each song lets us rank the database against the playlist and recommend the highest matches.

## Spotify API

The Spotify API will be used to load a personal playlist into the application. Users supply their playlist URI, and the API returns the songs within that playlist. We can then look up these songs in the large, cleaned Spotify dataset to get the song vector for each item in the playlist.

## Playlist Vector

For songs in the custom playlist that are also in our master dataset, we already have a full set of features. We then need to derive a single vector that summarizes the playlist. To start, this can be as simple as taking the mean of each feature column.

After seeing the performance of this method, we will introduce certain weighting factors to see how this impacts suggestions.

#### Import dependencies

In [1]:
import spotipy
import requests
from spotipy.oauth2 import SpotifyClientCredentials
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import dot
from numpy.linalg import norm

In [2]:
# Redefine our numeric feature function

def return_numeric_features(df):
    # A column counts as numeric only if every value parses as a number
    numeric_features = df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())

    non_numeric_features = []
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for feat, value in numeric_features.items():
        if not value:
            non_numeric_features.append(feat)
            
    print('Non-numeric features: ', non_numeric_features)
    numeric_df = df.drop(non_numeric_features, axis=1)
    
    return numeric_df

#### Load in secrets and establish a connection with the API

In [3]:
# Use a context manager so the secrets file is closed after reading
with open('spotify_keys.json') as f:
    spotify_keys = json.load(f)
client_id = spotify_keys['client_id']
client_secret = spotify_keys['client_secret']

In [4]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id,client_secret=client_secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

#### Copy a Spotify playlist URI and read in playlist metadata

For this section, I picked a playlist of popular songs that are likely to be in our master dataset. This API call should return a large JSON object containing playlist data.

The playlist used is titled "Classic Rock" and contains a small collection of popular classic rock songs from well-known artists.

In [5]:
uri = "spotify:playlist:5BygwTQ3OrbiwVsQhXFHMz"
user = spotify_keys['username']
playlist = uri.split(":")[2]

results = sp.user_playlist(user, playlist, 'tracks')
print(results['tracks'])

{'href': 'https://api.spotify.com/v1/playlists/5BygwTQ3OrbiwVsQhXFHMz/tracks?offset=0&limit=100&additional_types=track', 'items': [{'added_at': '2020-10-07T05:43:47Z', 'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/sonymusicfinland'}, 'href': 'https://api.spotify.com/v1/users/sonymusicfinland', 'id': 'sonymusicfinland', 'type': 'user', 'uri': 'spotify:user:sonymusicfinland'}, 'is_local': False, 'primary_color': None, 'track': {'album': {'album_type': 'single', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/711MCceyCBcFnzjGY4Q7Un'}, 'href': 'https://api.spotify.com/v1/artists/711MCceyCBcFnzjGY4Q7Un', 'id': '711MCceyCBcFnzjGY4Q7Un', 'name': 'AC/DC', 'type': 'artist', 'uri': 'spotify:artist:711MCceyCBcFnzjGY4Q7Un'}], 'available_markets': ['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CD', 'CG', 'CH', 'CI', 'CL', 'C

#### Navigate this object to collect the Spotify ID of each song

In [6]:
playlist_song_uris = []

for song in results['tracks']['items']:
    playlist_song_uris.append(song['track']['id'])

In [7]:
print(playlist_song_uris)

['6ZtrGCcn38kGImt2GPFbJB', '39shmbIHICJ2Wxnk1fPSdz', '2wO8aOvN1ogLy1N8XT1WJE', '07KHJvlYBeQVqrmifTEqEp', '6QewNVIDKdSl8Y3ycuHIei', '395C2pn0PdOYPzM4B1jLoO', '2Cdvbe2G4hZsnhNMKyGrie', '6tunhVGD8C05MZNjSVIsjw', '5UWwZ5lm5PKu6eKsHAGxOk', '7bxon8K9DP6stYx5ZO9WlK', '3L60Vu9qmY6fg2QroRIxgi', '679zqcQuKakOGI93NPCqB8', '59WN2psjkt1tyaxjspN8fp', '77NNZQSqzLNqh2A9JhLRkg', '13L9jEt1IfmZQx77bzAxBp', '0i1RTnH2Lj5gTDRU5wtyT2', '0cO7JEo8deKuQMWpDyjenY', '57bgtoPSgt236HzfBOd8kj', '3KwsuUstyHS3a5z2GGYEST', '4xanWVQIzdCf51mg8cd1cQ', '0qRR9d89hIS0MHRkQ0ejxX', '6eEYGGFfFbtKHCgJM4uh9v', '7xdLNxZCtY68x5MAOBEmBq', '5OQsiBsky2k2kDKy2bX2eT', '5sMIFZaagXcwKiSfl95zIW', '4f3RDq9nYPBeR1yMSgnmBm', '7dQC53NiYOY9gKg3Qsu2Bs', '1Cr0L9EsOePPOAoXRTxo1p', '64UioB4Nmwgn2f4cbIpAkl', '6QZo2TgclkUMwJgggi8QSQ', '1hKdDCpiI9mqz1jVHRKG0E', '72ahyckBJfTigJCFCviVN7', '29AqPjeqZcXpGvdxLchZoP', '0tZ3mElWcr74OOhKEiNz1x', '2SiXAy7TuUkycRVbbWDEpo', '1QEEqeFIZktqIpPI4jSVSF', '6NTqBHONQqmud0ONBzsLfZ', '4WmjWLZh3YpJ88n66nEjgV', '52MmMUuyjO
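
The response URL above includes `limit=100`, so a longer playlist would arrive in several pages, and local or removed tracks may come back without an ID. A minimal sketch of a more defensive collector follows. It assumes the client exposes a `next(page)` method that fetches the following page (spotipy's `Spotify.next` behaves this way); the helper name `collect_track_ids` is hypothetical.

```python
def collect_track_ids(client, page):
    """Collect track IDs from a paged playlist result, following `next` links.

    `client` is assumed to offer `next(page)`, as spotipy's Spotify object does.
    `page` is a paging object such as results['tracks'] in this notebook.
    """
    track_ids = []
    while page:
        for item in page["items"]:
            track = item.get("track")
            # Local files and removed tracks can lack a track object or an ID.
            if track and track.get("id"):
                track_ids.append(track["id"])
        # Stop when the current page has no link to a following page.
        page = client.next(page) if page.get("next") else None
    return track_ids
```

In this notebook the call would look like `playlist_song_uris = collect_track_ids(sp, results['tracks'])`.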

#### Import our cleaned Spotify dataset from the spotify_visualization_feature_engineering notebook

In [8]:
cleaned_spotify_df = pd.read_csv('clean_df.csv').iloc[:, 1:]
cleaned_spotify_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020
0,0.998996,['Carl Woitschach'],0.716599,0.028442,0.195,0.0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,0.151,0.745,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.997992,"['Robert Schumann', 'Vladimir Horowitz']",0.383603,0.051316,0.0135,0.0,6KuQTIu1KoTTkLXKrwlLPV,0.901,0.0763,0.494026,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.606426,['Seweryn Goszczyński'],0.758097,0.018374,0.22,0.0,6L63VW0PibdM1HDSBoqnoM,0.0,0.119,0.627609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.998996,['Francisco Canaro'],0.790486,0.032538,0.13,0.0,6M94FkXd15sOAOQYRnWPN8,0.887,0.111,0.708887,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.993976,"['Frédéric Chopin', 'Vladimir Horowitz']",0.212551,0.12645,0.204,0.0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,0.098,0.676079,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Generate a new dataframe containing song vectors for songs in our custom playlist

Most, but not all, of the songs in our playlist are also in our dataset: as the length checks below show, 75 of the 100 playlist tracks were matched. A high match rate may be more common for older songs.

In [9]:
custom_df = cleaned_spotify_df[cleaned_spotify_df['id'].isin(playlist_song_uris)]

In [10]:
custom_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020
29292,3.3e-05,['Nirvana'],0.441296,0.046305,0.876,0.0,3sKLf8SmgbyIikCTBZty9F,0.000104,0.205,0.862861,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55026,0.322289,['The Edgar Winter Group'],0.696356,0.034042,0.739,0.0,52MmMUuyjO64Y1EiF7Y8KH,0.011,0.174,0.790447,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
82642,0.008795,['Jimi Hendrix'],0.539474,0.030695,0.905,0.0,0wJoRiX5K5BxlqZTolB2LD,0.578,0.0698,0.857098,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
82762,0.264056,"['Big Brother & The Holding Company', 'Janis J...",0.448381,0.045981,0.727,0.0,1xKQbqQtQWrtQS47fUJBtl,0.000141,0.169,0.815112,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
82847,0.048594,['Led Zeppelin'],0.417004,0.060904,0.902,0.0,0hCB0YR03f6AmQaHbwWDe8,0.131,0.405,0.757967,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
len(custom_df) == len(playlist_song_uris)

False

In [12]:
len(custom_df)

75

In [13]:
len(playlist_song_uris)

100
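
Since only 75 of the 100 playlist songs matched, it can help to see exactly which IDs are missing from the dataset. The sketch below uses toy IDs rather than the real data, and the helper name `split_matched_ids` is hypothetical.

```python
import pandas as pd


def split_matched_ids(playlist_ids, dataset_ids):
    """Split playlist IDs into those present in the dataset and those missing."""
    dataset_ids = set(dataset_ids)  # set lookup keeps each membership test cheap
    matched = [i for i in playlist_ids if i in dataset_ids]
    missing = [i for i in playlist_ids if i not in dataset_ids]
    return matched, missing


# Toy stand-ins for cleaned_spotify_df['id'] and playlist_song_uris
toy_dataset = pd.DataFrame({"id": ["a", "b", "c"]})
matched, missing = split_matched_ids(["a", "x", "c"], toy_dataset["id"])
print(f"{len(matched)} matched, {len(missing)} missing: {missing}")
```

On the real data, the call would be `split_matched_ids(playlist_song_uris, cleaned_spotify_df['id'])`, run before the matched rows are dropped from the master dataframe.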

#### Drop songs from master database that are already in the playlist

In [14]:
cleaned_spotify_df = cleaned_spotify_df.drop(index=custom_df.index)

In [15]:
cleaned_spotify_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020
0,0.998996,['Carl Woitschach'],0.716599,0.028442,0.195,0.0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,0.151,0.745,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.997992,"['Robert Schumann', 'Vladimir Horowitz']",0.383603,0.051316,0.0135,0.0,6KuQTIu1KoTTkLXKrwlLPV,0.901,0.0763,0.494026,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.606426,['Seweryn Goszczyński'],0.758097,0.018374,0.22,0.0,6L63VW0PibdM1HDSBoqnoM,0.0,0.119,0.627609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.998996,['Francisco Canaro'],0.790486,0.032538,0.13,0.0,6M94FkXd15sOAOQYRnWPN8,0.887,0.111,0.708887,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.993976,"['Frédéric Chopin', 'Vladimir Horowitz']",0.212551,0.12645,0.204,0.0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,0.098,0.676079,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
cleaned_numeric_spotify_df = return_numeric_features(cleaned_spotify_df)
cleaned_numeric_spotify_df.head()

Non-numeric features:  ['artists', 'id', 'name']


Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,liveness,loudness,popularity,speechiness,...,year_2011,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020
0,0.998996,0.716599,0.028442,0.195,0.0,0.563,0.151,0.745,0.0,0.052219,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.997992,0.383603,0.051316,0.0135,0.0,0.901,0.0763,0.494026,0.0,0.047678,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.606426,0.758097,0.018374,0.22,0.0,0.0,0.119,0.627609,0.0,0.95872,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.998996,0.790486,0.032538,0.13,0.0,0.887,0.111,0.708887,0.0,0.095562,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.993976,0.212551,0.12645,0.204,0.0,0.908,0.098,0.676079,0.01,0.043756,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Generate a playlist vector

For this vector, we will experiment by first averaging over all numeric columns.

In [17]:
numeric_cleaned_df = return_numeric_features(custom_df)

Non-numeric features:  ['artists', 'id', 'name']


In [18]:
averaged_playlist_vector = numeric_cleaned_df.mean(axis=0)

In [19]:
averaged_playlist_vector.shape

(112,)

In [20]:
averaged_playlist_vector

acousticness    0.146179
danceability    0.516019
duration_ms     0.044638
energy          0.734440
explicit        0.026667
                  ...   
year_2016       0.000000
year_2017       0.000000
year_2018       0.000000
year_2019       0.000000
year_2020       0.000000
Length: 112, dtype: float64

#### Using cosine similarity, compare this vector to all song vectors

In [21]:
def cos_sim(row, playlist_vector):
    return dot(row, playlist_vector)/(norm(row)*norm(playlist_vector))

In [22]:
song_similarity_to_playlist = cleaned_numeric_spotify_df.apply(cos_sim, axis=1, args=(averaged_playlist_vector,))

In [23]:
song_similarity_to_playlist.sort_values()

145844    0.000000
99092     0.000004
108452    0.000006
48566     0.000007
137188    0.002098
            ...   
83974     0.890871
83988     0.891942
83967     0.894423
165894    0.894652
83976     0.896605
Length: 169834, dtype: float64
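
Applying `cos_sim` row by row works, but on roughly 170,000 rows it can be slow, and an all-zero row would divide by zero. A vectorized sketch of the same cosine similarity follows. It assumes the playlist vector lists its features in the same order as the dataframe columns, and the function name `cosine_similarity_to_vector` is hypothetical.

```python
import numpy as np
import pandas as pd


def cosine_similarity_to_vector(features, vector):
    """Cosine similarity between every row of `features` and one vector."""
    matrix = features.to_numpy(dtype=float)
    target = np.asarray(vector, dtype=float)
    dots = matrix @ target  # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    # Rows with a zero norm keep a similarity of 0 instead of dividing by zero.
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    return pd.Series(sims, index=features.index)


# Toy example: three songs, two features, one all-zero row
toy = pd.DataFrame(
    {"energy": [1.0, 0.0, 0.0], "danceability": [0.0, 1.0, 0.0]},
    index=[10, 20, 30],
)
scores = cosine_similarity_to_vector(toy, [1.0, 0.0])
```

The returned Series keeps the dataframe index, so it can be assigned to a new column in the same way as the row-wise result.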

#### Add this column to our starting cleaned dataframe

In [24]:
cleaned_spotify_df['cosine_similarity'] = song_similarity_to_playlist

In [25]:
cleaned_spotify_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020,cosine_similarity
0,0.998996,['Carl Woitschach'],0.716599,0.028442,0.195,0.0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,0.151,0.745,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.599957
1,0.997992,"['Robert Schumann', 'Vladimir Horowitz']",0.383603,0.051316,0.0135,0.0,6KuQTIu1KoTTkLXKrwlLPV,0.901,0.0763,0.494026,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.34527
2,0.606426,['Seweryn Goszczyński'],0.758097,0.018374,0.22,0.0,6L63VW0PibdM1HDSBoqnoM,0.0,0.119,0.627609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.588367
3,0.998996,['Francisco Canaro'],0.790486,0.032538,0.13,0.0,6M94FkXd15sOAOQYRnWPN8,0.887,0.111,0.708887,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.549497
4,0.993976,"['Frédéric Chopin', 'Vladimir Horowitz']",0.212551,0.12645,0.204,0.0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,0.098,0.676079,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.392858


In [26]:
cleaned_spotify_df.sort_values('cosine_similarity', ascending=False)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,liveness,loudness,...,year_2012,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018,year_2019,year_2020,cosine_similarity
83976,0.175703,['Billy Joel'],0.573887,0.045957,0.945,0.0,7gMOe0gXYcELUoVugfMmHP,0.000003,0.3500,0.826200,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.896605
165894,0.164659,['Cold Chisel'],0.689271,0.037053,0.889,0.0,3EkEomllpfXPPIGVFvZcEq,0.000000,0.2070,0.883564,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.894652
83967,0.074699,['DEVO'],0.789474,0.028902,0.869,0.0,4sscDOZCkbLSlDqcCgUJnX,0.004000,0.0621,0.826012,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.894423
83988,0.283133,['The Romantics'],0.520243,0.031520,0.942,0.0,4ebcE2SmkG7nplvzFAWRu7,0.000066,0.1700,0.807564,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.891942
83974,0.086044,['Bruce Springsteen'],0.648785,0.036001,0.894,0.0,1KsI8NEeAna8ZIdojI3FiT,0.009000,0.1570,0.823585,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.890871
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137188,0.000000,['Future Rapper'],0.000000,0.076855,0.000,0.0,0Rd7eiAZGayLT8TmrVpQzG,0.000000,0.0000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002098
48566,0.000000,['Sarah Vaughan'],0.000000,0.000252,0.000,0.0,3lRVIn6D6EUbvkOgPZAU1H,0.000000,0.0000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000007
108452,0.000000,['Benny Goodman'],0.000000,0.000232,0.000,0.0,523qs4UcGlQ6ycdha1VGqs,0.000000,0.0000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000006
99092,0.000000,['Benny Goodman'],0.000000,0.000164,0.000,0.0,3IcXTeq9O2dpsSXsDj9naH,0.000000,0.0000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000004


#### Print out the top 30 recommended songs from our model

In [27]:
cleaned_spotify_df[['name', 'artists', 'cosine_similarity']].sort_values('cosine_similarity', ascending=False).head(30)

Unnamed: 0,name,artists,cosine_similarity
83976,You May Be Right,['Billy Joel'],0.896605
165894,Cheap Wine - 2011 Remastered,['Cold Chisel'],0.894652
83967,Whip It,['DEVO'],0.894423
83988,What I Like About You,['The Romantics'],0.891942
83974,Hungry Heart,['Bruce Springsteen'],0.890871
83952,Once in a Lifetime - 2005 Remaster,['Talking Heads'],0.888774
93797,Givin the Dog a Bone,['AC/DC'],0.886996
83973,Biggest Part of Me,['Ambrosia'],0.886498
93760,Private Idaho,"[""The B-52's""]",0.886494
83983,Living After Midnight,['Judas Priest'],0.886476


# Initial Conclusions

The first model performed surprisingly well. Our playlist of choice contained popular music from the classic rock genre.

Our model returned mostly well-known songs from popular artists from a similar time period. These artists included AC/DC, Billy Joel, Michael Jackson, and Bruce Springsteen. The genres of the returned songs seem to extend past classic rock in certain cases. This might be an indication that a "genre" feature could be very powerful.

Even so, the success of the initial model suggests that the existing feature set holds enough information to generate useful recommendations.

#### Next steps

To improve this model, I see three promising paths forward:

1. Use the Spotify API to pull song genres. This could be used as a categorical variable that may hold a lot of weight, as playlists are often centered around a specific genre.
2. Create a categorical feature for "decade". We are currently turning year into a categorical variable for all unique values. This is an effective strategy, but when thinking of music, "decade" is often a powerful descriptor for type of song. Including decade as an additional feature could add correlation between songs released in the same decade, but different years.
3. Provide a weighting factor to reduce the weight of popular songs in our playlist vector. When we create our playlist vector, all songs are weighted equally. It might be interesting to reduce the impact of very popular songs when initially calculating this vector. From our results above, popular artists seem to dominate our recommendations, which suggests that the popularity feature might be strongly influencing the similarity score. By reducing the weight of popular songs in the initial feature vector, the output songs might be less popular, and more likely to introduce new artists and music.
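
Steps 2 and 3 could be prototyped roughly as follows. This is only a sketch under stated assumptions: it assumes a raw integer `year` column is still available (the cleaned dataset here stores year as one-hot columns), that `popularity` is scaled to the range 0 to 1, and that an inverse weighting of the form `1 / (1 + strength * popularity)` is a reasonable starting point. The function names and the `strength` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def add_decade(df, year_col="year"):
    """Return a copy of df with a 'decade' column, e.g. 1975 -> 1970."""
    out = df.copy()
    out["decade"] = (out[year_col] // 10) * 10
    return out


def popularity_weighted_vector(features, popularity, strength=1.0):
    """Average feature rows, giving less weight to more popular songs."""
    # Assumed weighting scheme: weight falls as popularity rises.
    weights = 1.0 / (1.0 + strength * np.asarray(popularity, dtype=float))
    averaged = np.average(features.to_numpy(dtype=float), axis=0, weights=weights)
    return pd.Series(averaged, index=features.columns)


# Toy playlist: the middle song is the most popular
toy = pd.DataFrame({
    "year": [1969, 1975, 1980],
    "energy": [0.0, 1.0, 0.5],
    "popularity": [0.0, 1.0, 0.0],
})
with_decade = add_decade(toy)
playlist_vector = popularity_weighted_vector(toy[["energy"]], toy["popularity"])
```

With `strength=0` every weight is 1, which reduces to the plain column mean used earlier, so the parameter gives a simple way to compare weighted and unweighted recommendations.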