# Recommendations algorithm

Now that we have our ML model saved (the birch.obj file), we can move on to creating recommendations based on the user's playlist.

The system first fetches the playlist we want to inspect, uses the trained BIRCH model to assign each of its songs to a cluster, and then compares those clusters to the dataset we saved (tracks_clustered.csv).
It takes the songs from the dataset that fall in the playlist's most common cluster and randomly picks five of them as recommendations. The user then has to add the recommended songs to the playlist themselves.

In [19]:
## Get all libraries needed
from requests import Timeout
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spotipy as spy
import pickle
import csv

In [6]:
## Again, as we will be using the Spotify API, we need the client secret
## This time, we will use a burner secret
## that will only be in use for a couple of weeks
client_s = '0c4d44681c9f4a3fbf82594a6308b6af'
client_id = 'bd889025bc034cadab61bc6a169617b4'
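
Hard-coding a secret is tolerable for a short-lived burner key, but the same values can also be read from the environment. The sketch below assumes the variables are named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy itself looks for); `load_credentials` is a hypothetical helper, not part of this notebook, and the demo values are placeholders.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping of environment variables.

    Hypothetical helper: fails with a clear message instead of passing None on.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing.args[0]))


# Demonstration with a stand-in mapping so the example runs without real keys.
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id, demo_secret)
```

Calling `load_credentials()` without arguments reads the real environment, so the keys never need to appear in the notebook itself.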

In [7]:
## Initialize the credentials manager
credentials_manager = spy.oauth2.SpotifyClientCredentials(client_id, client_s)
sp = spy.Spotify(client_credentials_manager=credentials_manager)

In [4]:
## Get the playlist we want to inspect
url = input("Enter URL for the PUBLIC playlist you want to inspect: ")
user = input("Enter the username of the playlist's owner: ")
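## The playlist ID is the last path segment; drop whitespace and any "?si=..." suffix
splitted = url.strip().split("?")[0].rstrip("/").split("/")
URI = splitted[-1]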

Enter URL for the PUBLIC playlist you want to inspect:  https://open.spotify.com/playlist/4b9mSnY53QnAf6KxFffw2h
Enter the username of the playlist's owner:  Riku Tuukkala


In [8]:
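## Same as before, get the IDs from the playlist as we can't get the data directly
pl_t = sp.user_playlist_tracks(user, URI)
tmp = pl_t['items']
ids = []
## Follow the 'next' links until every page of the playlist has been collected
while pl_t['next']:
    pl_t = sp.next(pl_t)
    tmp.extend(pl_t['items'])
## Collect the IDs once, after all pages are in, so short playlists work and nothing is added twice
for s in tmp:
    ## Skip removed or local tracks, which have no track object or no ID
    if s["track"] and s["track"]["id"]:
        ids.append(s["track"]["id"])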

In [9]:
## Remove duplicates
i = len(ids)
print(i)
if len(ids) != len(set(ids)):
    ids = list(set(ids))
    print("Duplicates found, {} rows cleaned.".format(i-len(ids)))
    print(len(ids))
else:
    print("No duplicates")

154
Duplicates found, 2 rows cleaned.
152
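
A `set` does not keep the playlist order, so the rows built from `ids` can come out in a different order from run to run. If the order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each ID. A minimal sketch with stand-in IDs:

```python
ids_demo = ["a", "b", "a", "c", "b"]  # stand-in IDs, not real Spotify track IDs

# dict keys are unique and keep insertion order (Python 3.7+)
unique_ids = list(dict.fromkeys(ids_demo))

print("Duplicates found, {} rows cleaned.".format(len(ids_demo) - len(unique_ids)))
print(unique_ids)  # ['a', 'b', 'c']
```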


In [10]:
## Add the data to a dataframe
## As we saw before, if there is a ton of data, it will take a while to complete.
feature_keys = ["acousticness", "danceability", "energy", "instrumentalness",
                "liveness", "loudness", "speechiness", "tempo"]
rows = []
max_retries = 3  ## A for-loop index can't be rewound, so retry explicitly

for track_id in ids:
    for attempt in range(max_retries):
        try:
            metadata = sp.track(track_id)
            features = sp.audio_features(track_id)[0]
            if features is None:
                print("No audio features for id", track_id)
                break
            ## Build the whole row first so a failure never leaves columns with unequal lengths
            row = {"name": metadata['name'], "length": metadata['duration_ms'],
                   "popularity": metadata['popularity']}
            row.update({key: features[key] for key in feature_keys})
            row["time_signiature"] = features['time_signature']
            rows.append(row)
            break
        except spy.SpotifyException:
            print("Spotify error at id", track_id)
            break
        except Timeout:
            print("API not responding, trying again.")
    else:
        ## Only reached when every attempt timed out
        print("Giving up on id", track_id)
ddf = pd.DataFrame(rows)
## Check out what the data looks like
ddf.head()
     

Unnamed: 0,name,length,popularity,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signiature
0,Viisaus ei asu meissä,188440,28,0.000178,0.61,0.973,0.0,0.0278,-4.435,0.148,133.014,4
1,Laura Jenna Ellinoora Alexandra Camilla Jurvanen,176133,40,0.0462,0.68,0.744,0.0,0.201,-8.547,0.0314,112.596,4
2,Lähtisitkö,274266,42,0.00884,0.644,0.677,1.3e-05,0.0601,-6.493,0.032,91.218,4
3,Kaljaa ainiaan,190876,29,0.066,0.56,0.794,0.0,0.0352,-5.04,0.0266,133.154,4
4,Syksyn Sävel,212666,41,0.483,0.528,0.513,0.0719,0.112,-9.598,0.0288,126.108,4
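
Most of the waiting comes from making two requests per track. spotipy's `audio_features` accepts a list of IDs (the Web API caps it at 100 per request), so the requests can be batched. The sketch below shows the batching pattern only: `chunked` and `fetch_in_batches` are hypothetical helpers, and a fake function stands in for the API so the example runs offline.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(ids, fetch, size):
    """Call `fetch` once per batch and flatten the results into one list."""
    results = []
    for batch in chunked(ids, size):
        results.extend(fetch(batch))
    return results


# Stand-in for sp.audio_features so the sketch runs without network access.
calls = []


def fake_audio_features(batch):
    calls.append(len(batch))
    return [{"id": track_id, "tempo": 120.0} for track_id in batch]


demo_ids = ["id{}".format(n) for n in range(152)]
features = fetch_in_batches(demo_ids, fake_audio_features, size=100)
print(len(features), calls)  # 152 [100, 52]
```

With a real client the call would look like `fetch_in_batches(ids, sp.audio_features, 100)`. Individual entries may come back as `None` for tracks without audio features, so they would still need to be filtered out.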


In [20]:
## Clean the data as we did before
Adf = ddf.drop(columns=["time_signiature", "name", "instrumentalness", "length", "speechiness"])
Adf.describe().round(2)

Unnamed: 0,popularity,acousticness,danceability,energy,liveness,loudness,tempo
count,152.0,152.0,152.0,152.0,152.0,152.0,152.0
mean,37.1,0.19,0.57,0.68,0.19,-8.28,126.63
std,7.77,0.23,0.11,0.17,0.13,3.25,27.88
min,0.0,0.0,0.31,0.08,0.03,-21.66,61.41
25%,34.0,0.02,0.5,0.59,0.08,-10.28,104.96
50%,39.0,0.08,0.58,0.69,0.15,-7.68,126.71
75%,42.0,0.28,0.65,0.8,0.26,-5.8,141.12
max,53.0,0.92,0.8,0.99,0.79,-2.81,204.42


In [None]:
## Inspect the data some more to get a better feeling for what it is made of
hist = Adf.hist(figsize=(16, 18))
plt.show()

In [21]:
##Scale the data with standard scaler
columns_ = Adf.columns
scaled = StandardScaler().fit_transform(Adf.to_numpy())
sDF = pd.DataFrame(scaled, columns=columns_)
sDF.head()

Unnamed: 0,popularity,acousticness,danceability,energy,liveness,loudness,tempo
0,-1.174924,-0.820709,0.313542,1.748131,-1.201371,1.186872,0.229775
1,0.37465,-0.620875,0.930535,0.382905,0.096472,-0.082478,-0.504909
2,0.632913,-0.783098,0.613224,-0.016528,-0.959337,0.55158,-1.274136
3,-1.045793,-0.534901,-0.127168,0.680989,-1.145921,1.000112,0.234813
4,0.503781,1.275773,-0.409222,-0.994244,-0.570434,-0.406916,-0.018718
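
`StandardScaler` replaces each value with (x - mean) / std, computed per column. A scaler fitted here uses the playlist's own mean and standard deviation. Assuming the training notebook scaled its data the same way, the BIRCH model saw values standardized with the training set's statistics, and the two only agree when the playlist is distributed like the training data. The numbers below are made up and `zscore` is a hypothetical helper; the sketch only illustrates how the choice of statistics moves the same values.

```python
from statistics import mean, pstdev


def zscore(values, center, spread):
    """Standardize values with an explicit mean and (population) standard deviation."""
    return [(v - center) / spread for v in values]


training_tempo = [60.0, 90.0, 120.0, 150.0, 180.0]  # made-up "training" column
playlist_tempo = [120.0, 150.0, 180.0]              # made-up playlist column

# Option 1: statistics taken from the playlist itself (a freshly fitted scaler)
own = zscore(playlist_tempo, mean(playlist_tempo), pstdev(playlist_tempo))

# Option 2: statistics taken from the training data (a saved, reused scaler)
reused = zscore(playlist_tempo, mean(training_tempo), pstdev(training_tempo))

print([round(z, 2) for z in own])
print([round(z, 2) for z in reused])
```

In practice, the second option would mean pickling the scaler fitted in the training notebook next to birch.obj and calling its `transform` method here instead of `fit_transform`.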


In [22]:
## Get the BIRCH model we created
with open("birch.obj", "rb") as model_file:
    birch = pickle.load(model_file)



In [23]:
## Get the labels and some information about their distribution
labels = birch.predict(sDF.to_numpy())
## Get the most frequent label in the array and use it as a basis of the recommendations
most_freq = np.argmax(np.bincount(labels))
analysis = [np.mean(labels), np.median(labels), np.std(labels)]
print("Mean: {}".format(analysis[0]))
print("Median: {}".format(analysis[1]))
print("Standard deviation: {}".format(analysis[2]))

Mean: 42.03947368421053
Median: 39.5
Standard deviation: 29.35263853892314
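
Cluster labels are category names rather than quantities, so their mean and median say little on their own; the number of songs per label is what drives the recommendation. A minimal sketch with made-up labels, using `collections.Counter` as an alternative to `np.bincount`:

```python
from collections import Counter

labels_demo = [40, 47, 40, 75, 26, 40, 47]  # made-up cluster labels

counts = Counter(labels_demo)
most_common_label, n_songs = counts.most_common(1)[0]

print(counts.most_common(3))
print("Most frequent cluster: {} ({} of {} songs)".format(
    most_common_label, n_songs, len(labels_demo)))
```

The two approaches can differ on ties: `np.argmax(np.bincount(labels))` returns the smallest of the tied labels, while `Counter` orders tied labels by first appearance.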


In [25]:
## Get the stored data for generating the recommendations
comparison = pd.read_csv("Datasets/tracks_clustered.csv")
comparison = comparison[["name", "clusters"]]
comparison.head()

Unnamed: 0,name,clusters
0,Major Minus,40
1,Kings And Queens,47
2,Tainted Love,50
3,Portal,75
4,Day One,26


# The Recommendations

In the cell below, you can see which songs the algorithm recommends adding.

In [26]:
## Finally, randomly pick five songs from the same cluster as recommendations
mDF = comparison.loc[comparison["clusters"] == most_freq, ["name"]]
## Never ask for more songs than the cluster contains
mDF = mDF.sample(n=min(5, len(mDF)))
mDF

Unnamed: 0,name
6847,Tear You Apart
338,Jesus' Tod
3189,First
3403,Back Then
6365,Champagne Showers
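
The selection step can also be sketched without pandas. This variant caps the sample at the cluster size and skips songs that are already on the playlist. The catalog rows and song names are made up, and `recommend` is a hypothetical helper rather than part of the notebook.

```python
import random


def recommend(catalog, cluster, already_in_playlist, n=5, seed=None):
    """Pick up to `n` random song names from `cluster`, skipping known songs."""
    candidates = [row["name"] for row in catalog
                  if row["clusters"] == cluster
                  and row["name"] not in already_in_playlist]
    rng = random.Random(seed)  # pass a seed for reproducible picks
    return rng.sample(candidates, min(n, len(candidates)))


catalog_demo = [
    {"name": "Song A", "clusters": 40},
    {"name": "Song B", "clusters": 40},
    {"name": "Song C", "clusters": 47},
    {"name": "Song D", "clusters": 40},
]

picks = recommend(catalog_demo, cluster=40, already_in_playlist={"Song B"}, n=5)
print(sorted(picks))  # ['Song A', 'Song D']
```

Matching on the song name alone is a simplification; two different songs can share a name, so a track ID would be a safer key if the dataset stored one.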


# The results

As we can see from the cell above, our system now produces some rough recommendations for the user. The quality of the recommendations depends on the quality of the BIRCH model we trained earlier.
We could now use spotipy to add the listed songs to the playlist, run the algorithm again, and continue adding more and more songs. At this point, however, the user needs to add the songs manually.
This is deliberate, because the pool of songs we can recommend from is limited in both size and variety. For example, if your playlist consists only of a specific, niche type of music, we end up recommending songs that would,
in theory, fit the playlist but could still be in a different language.

The quality of the training data greatly affects which songs we can recommend. The best results could be achieved by also taking the language and genre of each song into account, but Spotify's API currently does not provide this information for individual tracks.