# Spotify Recommender System

Dataset: <https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs>

You may want to subset this dataset to about 100k rows right away so that it is easier to work with. Do all of your modeling on the 100k-row version, then, once everything works the way you want, run the notebook once on the entire dataset.
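
As a rough sketch of that subsetting step, the snippet below draws a fixed number of rows instead of a fraction. The frame `full_df` and the value of `target_rows` are stand-ins for illustration only; with the real data the frame would come from `pd.read_csv` and the target would be around 100,000.

```python
import numpy as np
import pandas as pd

# Stand-in for the full dataset (assumption: the real frame is loaded from CSV).
full_df = pd.DataFrame({"danceability": np.linspace(0, 1, 1000)})

# Hypothetical target size: 100 rows here, 100_000 for the real data.
target_rows = 100

# Cap n at the frame length so sample() cannot fail on small inputs.
subset = full_df.sample(n=min(target_rows, len(full_df)), random_state=42)

# Renumber rows 0..n-1 so positions and labels agree after sampling.
subset = subset.reset_index(drop=True)

print(subset.shape)
```

Fixing `random_state` keeps the subset identical between runs, which makes results easier to compare while iterating.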

In [15]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from joblib import dump, load
import pickle

In [17]:
from platform import python_version

print(python_version())

3.9.7


In [2]:
# Load dataset and sample it down to 8% of the original size
# Reset the index after sampling to make indices easier to reason about;
# drop=True discards the old index so it can't be confused with the new one
df = pd.read_csv('tracks_features.csv')
df = df.sample(frac=.08, random_state=42).reset_index(drop=True)

### Usable Columns?

These are the columns I can use with minimal data cleaning to build a simple recommender system. Creating more columns or doing some feature engineering would likely improve it, but I want to get to a working prototype as fast as possible.

- explicit
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- time_signature
- year 

Todo:

- Check for null values (none found)
- Encode the boolean `explicit` column as 0/1

In [19]:
df.shape

(1204025, 24)

In [206]:
df.columns

Index(['id', 'name', 'album', 'album_id', 'artists', 'artist_ids',
       'track_number', 'disc_number', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms',
       'time_signature', 'year', 'release_date'],
      dtype='object')

In [207]:
df.head()

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
0,1aGS6nf2xgv3Xzdob4eOO3,Smokin' Sticky Sticky,Beat'n Down Yo Block,5ZO72kl3xMRRzlpod55k1Q,['Unk'],['0PGtMx1bsqoCHCy3MB3gXA'],15,1,True,0.623,...,0.402,0.0021,0.0,0.0691,0.422,87.988,380427,4.0,2006,2006-10-03
1,0fJfoqHIIiET2EcgjOfntG,Holding Back the Years,Holding Back The Years,7sV4kCqQYt8agM5TjkdOYU,['Norm Douglas'],['4kxKyoiYhldUlnfeCZtD0D'],1,1,False,0.585,...,0.0333,0.316,0.775,0.0993,0.88,170.082,266520,4.0,2008,2008-06-13
2,0V2R2LC8dR7S0REieXRaGt,All Along The Watchtower - Live - 1991,"Back On The Bus, Y'All",3jmmx4jRkul3POEhn1cgwF,['Indigo Girls'],['4wM29TDTr3HI0qFY3KoSFG'],7,1,False,0.331,...,0.0379,0.709,0.0,0.939,0.43,90.648,383773,4.0,1991,1991-06-04
3,4VUHYLocWOJ2GfvP78AmSs,Windmills,Total Folklore,5PyLkzuxmT6EoVNZCg8Iya,['Dan Friel'],['4HKTPJw50BFASrfhJEHIVP'],2,1,False,0.193,...,0.109,4.9e-05,0.838,0.285,0.594,113.345,82493,4.0,2013,2013-02-19
4,4m8a1AtmCnoeRzSYoQ0oX0,Overnite Flite,Normal Human Feelings,623VIdYR6Y0NCN9yPbMAC6,['Little Suns'],['5OLcAqMbHpecNOIQyTduQ7'],2,1,False,0.546,...,0.0323,0.427,0.000105,0.197,0.424,127.941,230667,1.0,2013,2013-10-08


In [3]:
df.describe()

Unnamed: 0,track_number,disc_number,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year
count,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0,96322.0
mean,7.644557,1.057744,0.493364,0.508325,5.187081,-11.819708,0.66721,0.084782,0.447697,0.28149,0.201646,0.427116,117.659471,248898.9,3.829696,2007.355578
std,5.971186,0.301655,0.18939,0.294787,3.537164,6.985058,0.471215,0.116789,0.385227,0.375692,0.180777,0.270253,30.890501,161519.7,0.563752,10.611802
min,1.0,1.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2400.0,0.0,1900.0
25%,3.0,1.0,0.357,0.25,2.0,-15.28,0.0,0.0351,0.0376,8e-06,0.0969,0.19,94.074,174133.0,4.0,2002.0
50%,7.0,1.0,0.501,0.523,5.0,-9.806,1.0,0.0445,0.39,0.008035,0.125,0.402,116.8955,224173.0,4.0,2009.0
75%,10.0,1.0,0.63275,0.765,8.0,-6.713,1.0,0.0726,0.863,0.714,0.245,0.643,137.10775,286000.0,4.0,2015.0
max,50.0,10.0,0.988,1.0,11.0,6.798,1.0,0.966,0.996,1.0,0.999,1.0,247.996,4995315.0,5.0,2020.0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204025 entries, 0 to 1204024
Data columns (total 24 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   id                1204025 non-null  object 
 1   name              1204025 non-null  object 
 2   album             1204025 non-null  object 
 3   album_id          1204025 non-null  object 
 4   artists           1204025 non-null  object 
 5   artist_ids        1204025 non-null  object 
 6   track_number      1204025 non-null  int64  
 7   disc_number       1204025 non-null  int64  
 8   explicit          1204025 non-null  bool   
 9   danceability      1204025 non-null  float64
 10  energy            1204025 non-null  float64
 11  key               1204025 non-null  int64  
 12  loudness          1204025 non-null  float64
 13  mode              1204025 non-null  int64  
 14  speechiness       1204025 non-null  float64
 15  acousticness      1204025 non-null  float64
 16  

In [208]:
# no null values
df.isnull().sum()

id                  0
name                0
album               0
album_id            0
artists             0
artist_ids          0
track_number        0
disc_number         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
year                0
release_date        0
dtype: int64

In [16]:
df.isnull().sum().sum()

0

In [3]:
# any column that contains True and False will automatically
# change to 1s and 0s when cast to the `int` datatype
df['explicit'] = df['explicit'].astype(int)

df.head()

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
0,1aGS6nf2xgv3Xzdob4eOO3,Smokin' Sticky Sticky,Beat'n Down Yo Block,5ZO72kl3xMRRzlpod55k1Q,['Unk'],['0PGtMx1bsqoCHCy3MB3gXA'],15,1,1,0.623,...,0.402,0.0021,0.0,0.0691,0.422,87.988,380427,4.0,2006,2006-10-03
1,0fJfoqHIIiET2EcgjOfntG,Holding Back the Years,Holding Back The Years,7sV4kCqQYt8agM5TjkdOYU,['Norm Douglas'],['4kxKyoiYhldUlnfeCZtD0D'],1,1,0,0.585,...,0.0333,0.316,0.775,0.0993,0.88,170.082,266520,4.0,2008,2008-06-13
2,0V2R2LC8dR7S0REieXRaGt,All Along The Watchtower - Live - 1991,"Back On The Bus, Y'All",3jmmx4jRkul3POEhn1cgwF,['Indigo Girls'],['4wM29TDTr3HI0qFY3KoSFG'],7,1,0,0.331,...,0.0379,0.709,0.0,0.939,0.43,90.648,383773,4.0,1991,1991-06-04
3,4VUHYLocWOJ2GfvP78AmSs,Windmills,Total Folklore,5PyLkzuxmT6EoVNZCg8Iya,['Dan Friel'],['4HKTPJw50BFASrfhJEHIVP'],2,1,0,0.193,...,0.109,4.9e-05,0.838,0.285,0.594,113.345,82493,4.0,2013,2013-02-19
4,4m8a1AtmCnoeRzSYoQ0oX0,Overnite Flite,Normal Human Feelings,623VIdYR6Y0NCN9yPbMAC6,['Little Suns'],['5OLcAqMbHpecNOIQyTduQ7'],2,1,0,0.546,...,0.0323,0.427,0.000105,0.197,0.424,127.941,230667,1.0,2013,2013-10-08
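
To see the boolean-to-integer cast in isolation, here is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy boolean column standing in for df['explicit'].
flags = pd.Series([True, False, True], name="explicit")

# Casting bool to int maps True -> 1 and False -> 0.
as_int = flags.astype(int)

print(as_int.tolist())
```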


## Create X Matrix of numeric song attributes

In [7]:
df['index'] = df.index
usable_columns = ['explicit', 'danceability', 'energy', 'key', 'loudness',
                  'mode', 'speechiness', 'acousticness', 'time_signature',
                  'year']

X = df[usable_columns]

X.head()

Unnamed: 0,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,time_signature,year
0,1,0.623,0.736,11,-3.657,0,0.402,0.0021,4.0,2006
1,0,0.585,0.639,2,-9.641,0,0.0333,0.316,4.0,2008
2,0,0.331,0.466,9,-14.287,0,0.0379,0.709,4.0,1991
3,0,0.193,0.856,4,-2.97,1,0.109,4.9e-05,4.0,2013
4,0,0.546,0.373,3,-13.929,1,0.0323,0.427,1.0,2013


In [22]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


## Use Nearest Neighbors to get the 5 most similar songs



In [8]:
neigh = NearestNeighbors(n_neighbors=5, algorithm='auto', n_jobs=-1)
neigh.fit(X)

NearestNeighbors(n_jobs=-1)
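
One thing to keep in mind: the model above is fit on raw feature values, and `NearestNeighbors` uses Euclidean distance by default, so columns with large ranges (such as `year` or `loudness`) can outweigh columns bounded between 0 and 1. The sketch below is not part of the original notebook. It uses a toy matrix and `StandardScaler` (an added assumption, one of several possible scalers) to show how scaling can change which neighbor comes back.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy matrix: column 0 behaves like `danceability` (0 to 1),
# column 1 behaves like `year`.
X_toy = np.array([
    [0.1, 2000.0],
    [0.9, 2001.0],
    [0.1, 2003.0],
    [0.9, 2020.0],
])

# Unscaled: the year gap dominates, so row 0's nearest other row is row 1
# even though their danceability values are far apart.
raw = NearestNeighbors(n_neighbors=2).fit(X_toy)
_, raw_idx = raw.kneighbors(X_toy[:1])

# Scaled: each column gets zero mean and unit variance, so both features
# contribute on comparable terms and row 2 becomes the nearest other row.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled = NearestNeighbors(n_neighbors=2).fit(X_scaled)
_, scaled_idx = scaled.kneighbors(X_scaled[:1])

print(raw_idx[0], scaled_idx[0])
```

If a scaler is used, the same fitted scaler would also need to be applied to any query vector before calling `kneighbors`.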

In [10]:
filename = 'model.sav'
dump(neigh, filename)

['model.sav']

In [16]:
filename = 'pickle_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(neigh, f)
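
A quick way to check that a saved model is usable is to load it back and compare its output with the original. The sketch below does this with a tiny stand-in model and a temporary directory (both are assumptions for illustration); the notebook's `neigh` could be checked the same way.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in model fit on made-up points.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
model = NearestNeighbors(n_neighbors=2).fit(X_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pickle_model.sav")

    # Write the model; the context manager closes the file afterwards.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should return the same neighbors as the original.
_, before = model.kneighbors([[0.9, 0.9]])
_, after = restored.kneighbors([[0.9, 0.9]])
print(np.array_equal(before, after))
```

Pickle files can execute arbitrary code when loaded, so only load files from a source you trust, ideally with the same scikit-learn version that wrote them.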

In [15]:
# Track name must exactly match the spelling, punctuation, and capitalization
track_name = "Holding Back the Years"

# Look at the song that we want to find recommendations for
df[df['name'] == track_name]

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
1,0fJfoqHIIiET2EcgjOfntG,Holding Back the Years,Holding Back The Years,7sV4kCqQYt8agM5TjkdOYU,['Norm Douglas'],['4kxKyoiYhldUlnfeCZtD0D'],1,1,0,0.585,...,0.0333,0.316,0.775,0.0993,0.88,170.082,266520,4.0,2008,2008-06-13
82635,5F9WGLNnZRRwVyiCt1nHDr,Holding Back the Years,The Lost and Found,4fZx2cNk1Vod8jZkPSWBpv,['Gretchen Parlato'],['76Gi1qoWLrIerL5FcL0TZb'],1,1,0,0.541,...,0.0427,0.778,0.13,0.115,0.127,92.624,226587,4.0,2011,2011-04-05
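
Because the lookup relies on an exact string match, a small typo or a different capitalization returns no rows. One possible way to loosen this is sketched below with a hypothetical helper, `find_track_indexes`, that ignores case and surrounding whitespace. The toy frame and the helper name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame; the third title is a made-up variant with different
# capitalization and a trailing space.
toy = pd.DataFrame({
    "name": ["Holding Back the Years", "Windmills", "holding back the years "],
})


def find_track_indexes(frame, title):
    """Return row labels whose `name` matches `title`, ignoring case
    and leading/trailing whitespace."""
    normalized = frame["name"].str.strip().str.casefold()
    return frame.index[normalized == title.strip().casefold()].tolist()


print(find_track_indexes(toy, "HOLDING BACK THE YEARS"))
```

Punctuation differences would still cause a miss; handling those would need fuzzier matching than this sketch attempts.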


In [16]:
# Multiple tracks may match this title, so we just select the first one.
# We grab only its row index and then use that to select the corresponding
# song's data from our X matrix.
track_index = df[df['name'] == track_name].index[0]

track_data = X.iloc[track_index]

print(type(track_data))
track_data

<class 'pandas.core.series.Series'>


explicit               0.0000
danceability           0.5850
energy                 0.6390
key                    2.0000
loudness              -9.6410
mode                   0.0000
speechiness            0.0333
acousticness           0.3160
time_signature         4.0000
year                2008.0000
tempo                170.0820
liveness               0.0993
valence                0.8800
instrumentalness       0.7750
Name: 1, dtype: float64

In [17]:
# Input to model must be a 2D array
# .reshape(1,-1) turns a 1D array into a 2D array
# (basically just adds an extra set of square brackets at
# the beginning and end of the array.)
track_data = track_data.values.reshape(1,-1)


print(type(track_data))
track_data

<class 'numpy.ndarray'>


array([[ 0.00000e+00,  5.85000e-01,  6.39000e-01,  2.00000e+00,
        -9.64100e+00,  0.00000e+00,  3.33000e-02,  3.16000e-01,
         4.00000e+00,  2.00800e+03,  1.70082e+02,  9.93000e-02,
         8.80000e-01,  7.75000e-01]])
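
The effect of `.reshape(1, -1)` is easiest to see from the array shapes. The sketch below uses a short made-up row; it also shows that indexing a DataFrame with a list (`iloc[[0]]`) keeps the result two-dimensional, which is an alternative way to get a single-row 2D input.

```python
import numpy as np
import pandas as pd

# A 1D array of three feature values (made up for illustration).
row = np.array([0.585, 0.639, 2.0])
print(row.shape)

# reshape(1, -1) means "one row, infer the number of columns".
as_2d = row.reshape(1, -1)
print(as_2d.shape)

# Alternative: select the row with a list so pandas returns a DataFrame.
frame = pd.DataFrame([row], columns=["danceability", "energy", "key"])
single_row = frame.iloc[[0]]
print(single_row.shape)
```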

In [10]:
# Since the selected song is also in the training data,
# the most similar song is itself 
# We will ask for 6 songs to get back 5 songs in addition to the one provided
distances, song_indexes = neigh.kneighbors(track_data, 6)

song_indexes

array([[    1, 95209, 44437, 89654, 81690, 83078]])
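
Since the query song comes back as its own nearest neighbor, it usually needs to be removed before showing recommendations. A minimal sketch on toy data follows; filtering by index, rather than always dropping the first position, is a design choice made here on the assumption that exact duplicates of the query could tie at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four toy points on a line.
X_toy = np.array([[0.0], [1.0], [3.0], [7.0]])
model = NearestNeighbors(n_neighbors=2).fit(X_toy)

query_index = 1

# Ask for one more neighbor than needed, because the query itself is included.
distances, indexes = model.kneighbors(X_toy[[query_index]], 3)

# Drop the query's own index from the result.
keep = indexes[0] != query_index
neighbors = indexes[0][keep]

print(distances[0], neighbors)
```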

In [13]:
# Hand-written feature vector, in the same column order as `usable_columns`
usa = [[False, 0.499, 0.899, 4, -8.478, 1, 0.2769999999999999, 2.13e-05, 4.0, 2006]]
print(type(usa))
usa

<class 'list'>


[[False, 0.499, 0.899, 4, -8.478, 1, 0.2769999999999999, 2.13e-05, 4.0, 2006]]

In [14]:
loaded_model = load('model.sav')
distances, song_indexes = loaded_model.kneighbors(usa, 6)

song_indexes

array([[   68, 39814, 82507, 60910, 95647, 74202]])

In [11]:
# The query song itself, followed by its 5 most similar songs
for index in song_indexes:
  print(df.iloc[index][['name', 'artists']])

                                      name                          artists
1                   Holding Back the Years                 ['Norm Douglas']
95209             Footprints In The Hamlet                  ['Masterpiece']
44437                        Maasina Tooro                   ['Afrissippi']
89654               Deathstriker6666 Theme                ['Geoff Lapaire']
81690     Komponent - Telefon Tel Aviv RMX  ['Apparat', 'Telefon Tel Aviv']
83078  Face #2 [Decider mix] by Chris Gill                   ['Slave Unit']
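
It can also help to show each neighbor's distance next to its name. The sketch below is one way to do that on a toy frame; the track names and the `energy` values are placeholders, not rows from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy catalogue with a single numeric feature.
toy = pd.DataFrame({
    "name": ["Track A", "Track B", "Track C"],
    "energy": [0.1, 0.2, 0.9],
})
model = NearestNeighbors(n_neighbors=2).fit(toy[["energy"]].to_numpy())

# kneighbors returns 2D arrays with one row per query.
distances, indexes = model.kneighbors([[0.1]], 3)

# Take row 0 of each array and attach the distances as a new column.
result = toy.iloc[indexes[0]][["name"]].assign(distance=distances[0])

print(result)
```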


In [15]:
# Save model to disk using joblib
filename = 'test.sav'
dump(neigh, filename)

['test.sav']