# S.a.M: Music Recommender Feature


Hello! In this notebook we will implement the music recommendation system for our AI, S.a.M. S.a.M lets the user input their favorite songs and get recommendations based on them. We will do this with the help of K-Means clustering and content-based filtering.

Let's begin!


## Step 1: Loading the data

In order to start, we will have to import the necessary libraries and resources.

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


We will now read the CSV file containing our dataset and load it into a pandas DataFrame. Once the data is in a DataFrame, we will be able to show important information such as the layout of the dataset and its shape.

In [53]:
df = pd.read_csv('dataset.csv', index_col=0)  # index_col=0 uses the CSV's unnamed first column as the index instead of loading it as a data column
df.head(10)  # head() shows the first 5 rows by default; here we show the first 10

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
5,01MVOl9KtVTNfFiBU9I7dc,Tyrone Wells,Days I Will Remember,Days I Will Remember,58,214240,False,0.688,0.481,6,-8.807,1,0.105,0.289,0.0,0.189,0.666,98.017,4,acoustic
6,6Vc5wAMmXdKIAM7WUoEb7N,A Great Big World;Christina Aguilera,Is There Anybody Out There?,Say Something,74,229400,False,0.407,0.147,2,-8.822,1,0.0355,0.857,3e-06,0.0913,0.0765,141.284,3,acoustic
7,1EzrEOXmMH3G43AXT1y7pA,Jason Mraz,We Sing. We Dance. We Steal Things.,I'm Yours,80,242946,False,0.703,0.444,11,-9.331,1,0.0417,0.559,0.0,0.0973,0.712,150.96,4,acoustic
8,0IktbUcnAGrvD03AWnz3Q8,Jason Mraz;Colbie Caillat,We Sing. We Dance. We Steal Things.,Lucky,74,189613,False,0.625,0.414,0,-8.7,1,0.0369,0.294,0.0,0.151,0.669,130.088,4,acoustic
9,7k9GuJYLp2AzqokyEdwEw2,Ross Copperman,Hunger,Hunger,56,205594,False,0.442,0.632,1,-6.77,1,0.0295,0.426,0.00419,0.0735,0.196,78.899,4,acoustic
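
To see what `index_col=0` changes, here is a minimal sketch that reads a tiny in-memory CSV both ways. The two rows and the `csv_text` variable are made up for illustration and are not taken from `dataset.csv`; the only assumption is that the real file, like this one, starts with an unnamed index column.

```python
import io

import pandas as pd

# Hypothetical stand-in for dataset.csv: the first header field is empty,
# which is how a saved DataFrame index typically appears on disk.
csv_text = ",track_id,popularity\n0,abc,73\n1,def,55\n"

# With index_col=0 the unnamed column becomes the row index.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the column as data and names it "Unnamed: 0".
without_index_col = pd.read_csv(io.StringIO(csv_text))

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

Reading with `index_col=0` keeps the extra column out of the feature count, which matters later when we count and scale features.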


In [54]:
print("The dimensions of our dataset are:", df.shape)

The dimensions of our dataset are: (114000, 20)


As seen above, our dataset consists of 114,000 songs with 20 features. These features include basic information about each song, such as track name, artist, and album. The dataset also has more interesting features that might be useful for us, such as loudness, acousticness, tempo, and genre. These can be useful when it comes to creating clusters.

## Step 2: Creating the clusters

For our recommendation system, we want to take in the user's input and output songs from a cluster that is most similar to our user's taste. However, before being able to do this, we need to figure out how exactly we should cluster this data. This is where **Principal Component Analysis** (PCA) comes in.
> PCA will help us figure out which combinations of features in our dataset account for the most variability. This means that rather than working with all 20 features, we will only work with a small number of components that capture most of that variability.
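
To make the idea concrete, here is a minimal sketch of PCA on synthetic data (not the song dataset). The array `X` and its three made-up features are assumptions for illustration only: two features are strongly correlated and the third is independent noise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features: the first two move together, the third is unrelated.
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize first so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X_scaled)

# Share of the total variance captured by each principal component.
ratios = pca.explained_variance_ratio_
print(ratios)

# Absolute weight of each original feature in the first component.
print(np.abs(pca.components_[0]))
```

Because the first two features carry almost the same information, the first component should capture roughly two thirds of the total variance, while the last component contributes almost nothing. That is the kind of redundancy PCA lets us remove.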

### Step 2a. Implementing PCA

Before implementing PCA, we first need to remove unnecessary columns from our data. PCA operates on numerical values, so we will keep only numerical features. This will also make our recommendation system more interesting, as we will be recommending songs that sound similar rather than relying on something simple such as genre. To achieve this, we will drop the columns that do not contain numerical data. We will also drop NaN columns, which are columns that contain missing or incomplete data.
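
The two kinds of column removal described here behave differently, which a small sketch on a made-up DataFrame can show. The `toy` frame and its values are hypothetical, and `select_dtypes` is shown only as a possible alternative to dropping columns by name, not as the approach this notebook takes.

```python
import pandas as pd

# Hypothetical miniature of the song data.
toy = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],         # text, one missing value
    "track_genre": ["acoustic", "acoustic", "pop"],   # text, nothing missing
    "explicit": [False, True, False],                 # boolean
    "tempo": [87.9, 77.5, 120.0],                     # numeric
    "energy": [0.46, 0.17, 0.44],                     # numeric
})

# dropna(axis=1) removes only columns that contain at least one missing value.
no_nan_cols = toy.dropna(axis=1)

# select_dtypes(include="number") keeps only numeric columns,
# so text and boolean columns are excluded regardless of missing values.
numeric_only = toy.select_dtypes(include="number")

print(list(no_nan_cols.columns))
print(list(numeric_only.columns))
```

Note that `dropna(axis=1)` leaves complete non-numeric columns such as `track_genre` in place, which is why a second, explicit drop is still needed afterwards.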

In [55]:
df = df.dropna(axis=1)  # first, drop any columns that contain NaN values
df.head(10)

Unnamed: 0,track_id,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
5,01MVOl9KtVTNfFiBU9I7dc,58,214240,False,0.688,0.481,6,-8.807,1,0.105,0.289,0.0,0.189,0.666,98.017,4,acoustic
6,6Vc5wAMmXdKIAM7WUoEb7N,74,229400,False,0.407,0.147,2,-8.822,1,0.0355,0.857,3e-06,0.0913,0.0765,141.284,3,acoustic
7,1EzrEOXmMH3G43AXT1y7pA,80,242946,False,0.703,0.444,11,-9.331,1,0.0417,0.559,0.0,0.0973,0.712,150.96,4,acoustic
8,0IktbUcnAGrvD03AWnz3Q8,74,189613,False,0.625,0.414,0,-8.7,1,0.0369,0.294,0.0,0.151,0.669,130.088,4,acoustic
9,7k9GuJYLp2AzqokyEdwEw2,56,205594,False,0.442,0.632,1,-6.77,1,0.0295,0.426,0.00419,0.0735,0.196,78.899,4,acoustic


In [56]:
# now we manually drop the remaining unnecessary columns: 'track_id', 'explicit', and 'track_genre'
df = df.drop(['track_id', 'explicit', 'track_genre'], axis=1)
df.head(10)

Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,73,230666,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4
1,55,149610,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4
2,57,210826,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4
3,71,201933,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3
4,82,198853,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4
5,58,214240,0.688,0.481,6,-8.807,1,0.105,0.289,0.0,0.189,0.666,98.017,4
6,74,229400,0.407,0.147,2,-8.822,1,0.0355,0.857,3e-06,0.0913,0.0765,141.284,3
7,80,242946,0.703,0.444,11,-9.331,1,0.0417,0.559,0.0,0.0973,0.712,150.96,4
8,74,189613,0.625,0.414,0,-8.7,1,0.0369,0.294,0.0,0.151,0.669,130.088,4
9,56,205594,0.442,0.632,1,-6.77,1,0.0295,0.426,0.00419,0.0735,0.196,78.899,4


In [57]:
print("New dimensions of dataset:", df.shape)

New dimensions of dataset: (114000, 14)


The last step before implementing PCA is to normalize our data. If we did not, the PCA results would be skewed to favor features with larger values, such as duration.
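
The effect of skipping this step can be sketched with two synthetic features on very different scales. The values below are randomly generated stand-ins for `duration_ms` and `danceability`, not rows from the dataset, and the chosen distributions are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Two independent synthetic features on very different scales.
duration_ms = rng.normal(200_000, 50_000, size=n)   # values in the hundreds of thousands
danceability = rng.uniform(0, 1, size=n)            # values between 0 and 1
X = np.column_stack([duration_ms, danceability])

# PCA on the raw values: the large-scale feature dominates the variance.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On the raw data the first component explains essentially all of the variance simply because durations are large numbers. After scaling, the two independent features share the variance roughly evenly.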

In [59]:
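df = StandardScaler().fit_transform(df)  # rescale each column to zero mean and unit variance; returns a NumPy array
df = pd.DataFrame(df)  # wrap the array in a DataFrame again (column names become 0 to 13)
df.head()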

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.782627,0.024575,0.629244,-0.717148,-1.210442,0.300828,-1.326281,0.551848,-0.850202,-0.504109,0.758743,0.929306,-1.141863,0.221823
1,0.975633,-0.730859,-0.845908,-1.88998,-1.210442,-1.784744,0.753988,-0.078993,1.831732,-0.504094,-0.591211,-0.79869,-1.489717,0.221823
2,1.065299,-0.160332,-0.742186,-1.122669,-1.491343,-0.293288,0.753988,-0.273826,-0.315499,-0.504112,-0.507167,-1.365688,-1.528312,0.221823
3,1.692961,-0.243214,-1.733304,-2.312994,-1.491343,-2.039252,0.753988,-0.457309,1.774593,-0.503883,-0.428376,-1.276974,1.987859,-2.08968
4,2.186125,-0.271919,0.29503,-0.788711,-0.929541,-0.28275,0.753988,-0.303145,0.463399,-0.504112,-0.686285,-1.184403,-0.073348,0.221823


With our new data, we can now implement PCA!