# Machine Learning Music Recommender

This notebook is a small demonstration of my in-progress model for a web application that recommends a few songs based on a user's answers to a quiz. The model is trained on a large Kaggle dataset of songs obtained from the Spotify API (more than 100k entries).

The model itself is scikit-learn's K-means. The idea is to first train it on the dataset, then take the labels from the fitted model and attach them to the dataset. This creates "pools" of tracks whose feature values are close to one another. To make a recommendation, the model predicts which cluster (pool) the input song belongs to, and five (or more) songs are drawn at random from that cluster and returned.
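
As a rough illustration of the pooling idea, here is a minimal sketch on toy data. The two synthetic groups of points and the names `toy_tracks`, `toy_kmeans`, and `picks` are placeholders for illustration only and are not part of the actual model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the song features: two well-separated groups of points.
rng = np.random.default_rng(0)
toy_tracks = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),  # rows 0-19
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),  # rows 20-39
])

# 1. Train k-means; labels_ tells us which pool each track landed in.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_tracks)
pools = toy_kmeans.labels_

# 2. Predict the pool of a new "song".
query = np.array([[4.9, 5.1]])
predicted_pool = toy_kmeans.predict(query)[0]

# 3. Pick a few tracks at random from that pool (never more than it holds).
pool_members = np.flatnonzero(pools == predicted_pool)
picks = rng.choice(pool_members, size=min(5, len(pool_members)), replace=False)
print(picks)
```

The printed values are row indices into `toy_tracks`; in the real notebook they would correspond to rows of the song dataset.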

***
#### Now for the demonstration. 

First, we import our libraries: os (for building the path to the dataset), pandas (for data management), and scikit-learn's PCA, StandardScaler, and KMeans classes.

In [1]:
import os
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Next, we read the data from the CSV file, drop an unused column, and initialize our K-means model.

Note that I have chosen a fairly large value for k (500). The pools of songs need to be small enough that the tracks in each one share similar traits.

Note also that k was picked by hand; it was not derived from the silhouette or elbow method.

In [3]:
path_to_data = os.path.join("..", "data","dataset.csv")
music_df = pd.read_csv(path_to_data).drop('Unnamed: 0', axis=1) # drop useless column
kmeans = KMeans(n_clusters=500, init='k-means++', n_init="auto", random_state=42)
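
Since k was not derived from the silhouette or elbow method here, the following is only a sketch of what such a check could look like. It runs on synthetic blobs rather than the song data, and `demo_points`, `scores`, and `best_k` are illustrative names, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs (a stand-in for real features).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(demo_points)
    # inertia_ is the quantity an elbow plot tracks; the silhouette score lies in [-1, 1].
    scores[k] = silhouette_score(demo_points, model.labels_)
    print(k, round(model.inertia_, 1), round(scores[k], 3))

# The candidate k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On a dataset with more than 100k rows the full silhouette computation is expensive; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.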

The next step is to select our features and train the model.

In [None]:
# These features were chosen as I feel that they will be the easiest to translate into quiz questions.
features = [
    'danceability',
    'energy',
    'loudness',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo'
]

# Standardize the features so that large-valued ones (e.g. tempo, loudness)
# do not dominate the distance calculations used by K-means
scaler = StandardScaler()
features_scaled = scaler.fit_transform(music_df[features])

# Principal Component Analysis is also done for dimensionality reduction.
# A float n_components keeps just enough components to explain that share of the variance (70% here).
pca = PCA(n_components=0.70)
X_pca = pca.fit_transform(features_scaled)

# We then fit the K-means model to the reduced data
kmeans.fit(X_pca)
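
Passing a float to `n_components` leaves the number of retained components up to PCA, so it can be worth inspecting what was kept. The sketch below does this on random stand-in data (`demo_X` is synthetic and only assumed to resemble a nine-feature table); on the real pipeline the same two attributes could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 9 features, with one correlated pair of columns.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 9))
demo_X[:, 1] = demo_X[:, 0] + 0.1 * rng.normal(size=200)

demo_scaled = StandardScaler().fit_transform(demo_X)
demo_pca = PCA(n_components=0.70).fit(demo_scaled)

# Number of components PCA decided to keep, and the variance share they cover.
kept = demo_pca.n_components_
explained = demo_pca.explained_variance_ratio_.sum()
print(f"kept {kept} of 9 components, explaining {explained:.1%} of the variance")
```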

Next, we can make a prediction. The input has to pass through the same "scaler" and "pca" objects that were fitted on the training data so that it lands in the same transformed space as the rest of the dataset.

We first wrap the input values in a pandas dataframe with the same column names as the training features, because the scaler was fitted on a dataframe and expects matching feature names.

We will also assign each observation in the original dataset a "cluster id" that records which cluster it belongs to.



In [25]:
# data format
# danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo (in that order)

# Drugs 3iiDWuaIzuGKZezHvQY4GA
# pred_data_values = [0.77, 0.649, -6.824, 0.194, 0.108, 0.000683, 0.134, 0.522, 84.012]

# I beg you 5kKSQULHCPFE7CKMPrkAtP
pred_data_values = [0.456, 0.893, -2.825, 0.0813, 0.00321, 0.0, 0.121, 0.478, 127.884]

# fly me to the moon pt 2 5V0kQxkQeXNTnGNLRGZ6bX
# pred_data_values = [0.322, 0.00207, -35.061, 0.0523, 0.996, 0.889, 0.0822, 0.149, 75.769]
    
# Transform into pandas dataframe
prediction_data = pd.DataFrame([pred_data_values], columns=features)

# Scale and conduct pca on input
prediction_data = scaler.transform(prediction_data)
prediction_data = pca.transform(prediction_data)

# Assign each track in the original dataset its cluster label
cluster_labels = kmeans.labels_
music_df['cluster_id'] = cluster_labels

# Make cluster prediction
cluster_prediction = kmeans.predict(prediction_data)


Finally, we use our cluster prediction to grab five random samples from that cluster and return them.

In [27]:
# Get all observations belonging to the predicted cluster
prediction_result = music_df[music_df['cluster_id'] == cluster_prediction[0]]

# Grab up to 5 observations at random (sample() raises an error if the cluster has fewer than n rows)
prediction_result.sample(n=min(5, len(prediction_result)))

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,cluster_id
71198,3jcMkuNGMExEorJSML33D9,Rob Zombie,Halloween Metal Nights,Demon Speeding,0,224306,False,0.548,0.809,11,...,0,0.0328,0.000298,3e-06,0.156,0.409,133.984,4,metal,178
28757,4RdZfdkgW1OmkajhvcOxA1,Tritonal;HALIENE;SCHALA;Jorza,Long Way Home,Long Way Home,45,214153,False,0.511,0.795,6,...,0,0.0509,0.0461,0.0,0.138,0.415,129.855,4,dub,178
49524,25jyi2W1s3ZF0aWuAK0Mb9,RWND,Hardstyle The Ultimate Collection Best Of 2019,M.A.R.C.,21,267204,False,0.523,0.777,8,...,1,0.0687,0.000363,0.0874,0.0803,0.417,149.882,4,hardstyle,178
110116,6T0e1wKkONiIZQ1FTjMlGZ,Armin van Buuren;Gareth Emery;Owl City,"Feel Again, Pt. 2",Forever & Always,4,194002,False,0.58,0.921,8,...,1,0.0463,0.00787,8.1e-05,0.097,0.37,144.98,4,trance,178
49166,5SL8PsuleYXVqmCTI2UP7o,RWND;Soulspeaker,Remote Control,Remote Control,26,212606,False,0.519,0.964,7,...,1,0.106,0.00891,0.0639,0.0929,0.418,150.05,4,hardstyle,178
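
For the web application, the scale, PCA, predict, and sample steps could be wrapped in a single function. The version below is a sketch, not the final implementation: `recommend` is a hypothetical helper, and the small random dataframe with three features stands in for the real dataset so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def recommend(values, df, feature_names, scaler, pca, kmeans, n=5, seed=None):
    """Return up to n random rows of df from the cluster predicted for values.

    Assumes df already has a 'cluster_id' column taken from kmeans.labels_.
    """
    query = pd.DataFrame([values], columns=feature_names)
    query_pca = pca.transform(scaler.transform(query))
    cluster = kmeans.predict(query_pca)[0]
    pool = df[df["cluster_id"] == cluster]
    # Never ask for more rows than the pool contains.
    return pool.sample(n=min(n, len(pool)), random_state=seed)


# Synthetic stand-in for the song dataset (random values, three features only).
demo_features = ["danceability", "energy", "valence"]
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.random((300, 3)), columns=demo_features)

# Same fitting steps as in the notebook, on the stand-in data.
demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo_df[demo_features])
demo_pca = PCA(n_components=2).fit(demo_scaled)
demo_kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
demo_kmeans.fit(demo_pca.transform(demo_scaled))
demo_df["cluster_id"] = demo_kmeans.labels_

songs = recommend([0.5, 0.8, 0.4], demo_df, demo_features,
                  demo_scaler, demo_pca, demo_kmeans, seed=0)
print(songs)
```

Passing the fitted objects in as arguments keeps the function free of notebook globals, which should make it easier to move into the web application later.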


In the future, I would like to explore options beyond random sampling. My current idea is to use something like k-nearest neighbors to select the five songs in the cluster that are most similar to the prediction.
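
One way this could work is a plain brute-force nearest-neighbor search inside the predicted pool, sketched below with NumPy. The helper `closest_in_pool` and the toy vectors are hypothetical; in this notebook's terms, `pool_vecs` would be the rows of `X_pca` belonging to the predicted cluster and `query_vec` would be the transformed prediction.

```python
import numpy as np


def closest_in_pool(query_vec, pool_vecs, n=5):
    """Return indices (into pool_vecs) of the n rows closest to query_vec."""
    # Euclidean distance from the query to every track in the pool.
    distances = np.linalg.norm(pool_vecs - query_vec, axis=1)
    n = min(n, len(pool_vecs))
    return np.argsort(distances)[:n]


# Toy pool of seven tracks in a 2-D transformed space.
pool_vecs = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.0, 0.2],
    [2.0, 2.0],
    [0.3, 0.3],
])
query_vec = np.array([0.0, 0.0])

nearest = closest_in_pool(query_vec, pool_vecs, n=3)
print(nearest)  # indices of the three closest tracks: [0 2 4]
```

Unlike random sampling, this always returns the same songs for the same input, so some randomness might still be wanted to keep recommendations varied.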