# KNN Model Training

This notebook trains and evaluates a k-nearest-neighbours (KNN) model that recommends songs similar to a given input song.

In [22]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

In [None]:
# Import the aggregated data that was generated in explorative_data_analysis.ipynb
songs_df = pd.read_csv("data/spotify_knn_features.csv")
songs_df

,acousticness,artists_id,country,danceability,duration_ms,energy,id,instrumentalness,key,liveness,...,speechiness,tempo,time_signature,valence,mean_syllables_word,mean_words_sentence,n_sentences,n_words,sentence_similarity,vocabulary_wealth
0,0.294000,['3mxJuHRn2ZWD5OofvJtDZY'],BE,0.698,235584.0,0.606,5qljLQuKnNJf4F4vfxQB0V,0.000003,10.0,0.1510,...,0.0262,115.018,4.0,0.6220,1.39,3.13,39.0,208.0,0.028340,0.64
1,0.863000,['4xWMewm6CYMstu0sPgd9jJ'],BE,0.719,656960.0,0.308,3VAX2MJdmdqARLSU5hPMpm,0.000000,6.0,0.2530,...,0.9220,115.075,3.0,0.5890,1.44,25.56,106.0,5106.0,0.000180,0.57
2,0.750000,['3hYaK5FF3YAglCj5HZgBnP'],BE,0.466,492840.0,0.931,1L3YAhsEMrGVvCgDXj2TYn,0.000000,4.0,0.9380,...,0.9440,79.565,4.0,0.0850,1.22,3.59,71.0,396.0,0.143260,0.37
3,0.763000,['2KQsUB9DRBcJk17JWX1eXD'],BE,0.719,316578.0,0.126,6aCe9zzoZmCojX7bbgKKtf,0.000000,3.0,0.1130,...,0.9380,112.822,3.0,0.5330,1.90,11.14,211.0,3930.0,0.000135,0.71
4,0.770000,['3hYaK5FF3YAglCj5HZgBnP'],BE,0.460,558880.0,0.942,1Vo802A38tPFHmje1h91um,0.000000,7.0,0.9170,...,0.9430,81.260,4.0,0.0906,1.22,3.59,71.0,396.0,0.143260,0.37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101934,0.005640,['6n3YUZcayLRuAunJUUelvz'],AR,0.602,178893.0,0.904,4e5wI6VC4eVDTtpyZ409Pw,0.000000,11.0,0.0875,...,0.0327,130.186,4.0,0.7870,1.20,2.76,41.0,250.0,0.326829,0.37
101935,0.000406,['4iudEcmuPlYNdbP3e1bdn1'],AR,0.177,213133.0,0.823,58nHFSWj5N5JxNtWgS85TL,0.005370,7.0,0.2420,...,0.0604,184.260,4.0,0.3630,1.17,2.33,15.0,71.0,0.047619,0.64
101936,0.004510,['4iudEcmuPlYNdbP3e1bdn1'],AR,0.539,226107.0,0.883,2RDgs05sg2vrpwiAEUkWd0,0.000001,6.0,0.0606,...,0.0653,118.043,4.0,0.4060,1.16,3.21,33.0,188.0,0.147727,0.47
101937,0.333000,['023YMawCG3OvACmRjWxLWC'],AR,0.716,224133.0,0.748,1pXtUVmSS3Aky3j6nQ4sQT,0.000007,9.0,0.0899,...,0.1510,110.015,4.0,0.7600,1.38,3.01,79.0,361.0,0.041220,0.62


Now let's start by selecting the features we want to use to identify similar songs.

In [None]:
# select features to use for training the KNN
feature_columns = [
    'acousticness', 'danceability', 'duration_ms', 'energy', 
    'instrumentalness', 'liveness', 'loudness', 'popularity', 
    'speechiness', 'tempo', 'time_signature', 'valence', 
    #'mean_syllables_word', 'mean_words_sentence', 'n_sentences', 
    #'n_words', 'sentence_similarity', 'vocabulary_wealth'
]
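
Since `loudness` and `popularity` are hidden behind the `...` in the preview above, it can be worth checking that every selected column exists, is numeric, and has no missing values before scaling. The following is a minimal sketch on made-up data; `toy_df`, `candidate_columns`, and the helper `check_features` are hypothetical names that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for songs_df (assumption: not the real data)
toy_df = pd.DataFrame({
    'energy': [0.6, 0.3, np.nan],
    'tempo': [115.0, 79.5, 130.2],
    'country': ['BE', 'BE', 'AR'],
})
candidate_columns = ['energy', 'tempo', 'country', 'loudness']

def check_features(df, columns):
    """Hypothetical helper: report columns that are missing, non-numeric, or contain NaN."""
    missing = [c for c in columns if c not in df.columns]
    present = [c for c in columns if c in df.columns]
    non_numeric = [c for c in present if not pd.api.types.is_numeric_dtype(df[c])]
    with_nan = [c for c in present if df[c].isna().any()]
    return {'missing': missing, 'non_numeric': non_numeric, 'with_nan': with_nan}

report = check_features(toy_df, candidate_columns)
print(report)
```

On the real data, the analogous call would be `check_features(songs_df, feature_columns)`.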

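Next, we standardize the features to zero mean and unit variance. Without this step, features with large numeric ranges, such as `duration_ms`, would dominate the Euclidean distance between songs.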

In [30]:
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(songs_df[feature_columns])
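
To illustrate why scaling matters for a distance-based model, here is a small sketch on made-up values (not taken from the dataset). The array `toy` and the helper `nearest_to_first` are hypothetical and only serve this example. On the raw values, the nearest neighbour of song A is decided almost entirely by `duration_ms`; after standardization, `danceability` carries comparable weight and the nearest neighbour changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values; columns are danceability and duration_ms
toy = np.array([
    [0.20, 200_000.0],  # song A (the query)
    [0.90, 200_500.0],  # song B: very different danceability, almost the same length
    [0.22, 215_000.0],  # song C: almost the same danceability, somewhat longer
    [0.55, 300_000.0],  # song D: different in both features
])

def nearest_to_first(X):
    """Return the row index of the nearest neighbour of row 0, excluding row 0 itself."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(toy)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(toy))

print("nearest on raw features:   row", raw_nearest)     # song B
print("nearest on scaled features: row", scaled_nearest)  # song C
```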

With the features scaled, the next step is to fit the model on them.

In [31]:
# Train knn model
k = 5
knn = NearestNeighbors(n_neighbors=k, metric='euclidean')
knn.fit(X_scaled)
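
Before using the fitted model, it helps to see what `kneighbors` returns. The sketch below uses made-up 2-D points in place of the scaled song features; `points` and `toy_knn` are hypothetical names. When the query point is itself part of the fitted data, it comes back as its own nearest neighbour with distance 0, which is why the recommendation step later asks for one neighbour more than it needs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2-D points standing in for scaled song features
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])

toy_knn = NearestNeighbors(n_neighbors=2, metric='euclidean')
toy_knn.fit(points)

# Query with the first fitted point; the result has one row per query point
distances, indices = toy_knn.kneighbors(points[[0]])

print(indices)    # positions of the neighbours in `points`, nearest first
print(distances)  # matching Euclidean distances
```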

With the model fitted, we can test it by asking for recommendations for an input song. To make that easier, let's wrap the lookup in a function:

In [39]:
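def recommend_songs_for(song_id: str, knn: NearestNeighbors, k: int = 5):
    """Return the k songs closest to `song_id` in the scaled feature space."""
    # Positional row numbers of the matching song(s); kneighbors also returns positions
    song_positions = np.flatnonzero((songs_df['id'] == song_id).to_numpy())

    # If no matching song was found, return early
    if len(song_positions) == 0:
        return "Song not found!"
    song_position = song_positions[0]  # use the first match if the ID occurs more than once

    # Scale the input features with the scaler that was fitted on the training data
    song_features = songs_df.iloc[[song_position]][feature_columns]
    input_song = scaler.transform(song_features)

    # Find the nearest neighbours; request one extra because the input song is in the training data
    distances, indices = knn.kneighbors(input_song, n_neighbors=k + 1)
    # Drop the input song by position instead of assuming it is ranked first (ties are possible)
    neighbour_positions = [i for i in indices[0] if i != song_position][:k]

    return songs_df.iloc[neighbour_positions][['id', 'name']]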


After that is implemented, let's test it!

In [40]:
recommendations = recommend_songs_for("58nHFSWj5N5JxNtWgS85TL", knn, k)
recommendations

,id,name
6686,17bsX0becPLtW4eUwGBU6o,Entrevue Séduction (feat. Pierre Niney)
70640,1mjWhM7GQTTxJxg2F0iCRS,"Dil Mein Ho Tum (From ""Cheat India"")"
22322,3Dmvzf6OvkDQsNDsZQSCty,Kiss the Sun
54279,4QmVwOTAzBcK6NvRzohOnS,Painkiller
4812,5hZ7N5EWJWmvDJyDPMmWU5,Polly-Wolly-Doodle
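
A list of names alone says little about how close the recommendations are. One possible sanity check is to return the neighbour distances as well, since `kneighbors` already computes them. The sketch below is self-contained and runs on made-up songs; `toy_songs`, `toy_features`, and `recommend_with_distances` are hypothetical names, and this is not the evaluation used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for songs_df with two features
toy_songs = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'name': ['Song A', 'Song B', 'Song C', 'Song D'],
    'energy': [0.10, 0.15, 0.80, 0.90],
    'valence': [0.20, 0.25, 0.70, 0.95],
})
toy_features = ['energy', 'valence']

toy_scaled = StandardScaler().fit_transform(toy_songs[toy_features])
toy_knn = NearestNeighbors(n_neighbors=3, metric='euclidean').fit(toy_scaled)

def recommend_with_distances(song_id, k=2):
    """Hypothetical helper: k nearest songs plus their distance to the input song."""
    # Assumes song_id exists; take the position of its first occurrence
    position = np.flatnonzero((toy_songs['id'] == song_id).to_numpy())[0]
    distances, indices = toy_knn.kneighbors(toy_scaled[[position]], n_neighbors=k + 1)
    keep = indices[0] != position  # mask out the input song itself
    result = toy_songs.iloc[indices[0][keep]][['id', 'name']].copy()
    result['distance'] = distances[0][keep]
    return result

recs = recommend_with_distances('a')
print(recs)
```

Small distances indicate songs that are close to the input in the scaled feature space; a large jump between consecutive distances can hint that the later recommendations are much weaker matches.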
