Lab | Clustering songs
Introduction
Now it's time to cluster the songs of the hot_songs and not_hot_songs databases according to the song's audio features. For this purpose, you need to consider the following questions:

Are you going to use all the audio features? I
If not, which ones do you think that makes more sense to be used?
It might make sense to use a dimensionality reduction technique to visualize the songs with only two features?
What is the optimal number of clusters (for methods that need to know this beforehand)?
What is the best distance to use?
What clustering method provides better results?
Does the clustering method need a transformer?
Considerations
Be aware that this process is extremely time-consuming!!! (it might take several hours on your laptop). Therefore, when testing different options, save the models into your disk in order to be able to use the best model later. You don't want to retrain the best model again when you know what are the optimal parameters for each.

To determine which clustering method performs best, you need to be practical and think about how many clusters you might want to have alongside with a clustering metric to evaluate how good or bad the songs were clustered. If the number of clusters is small, each cluster will be too big and generic. On the contrary, if the number of clusters is too big then each cluster will be too specific and it will be poorly populated (this also depends on how heterogeneous is your dataset).

On the other hand, when you train your clustering model make sure to concatenate both databases together (ie: hot_songs and not_hot_songs) before. If you don't combine both datasets, the clusters obtained with the hot_songs will be different than the ones obtained with the not_hot_songs database even though they might have the same label because they will contain different songs. However, after this, you will not know to which original dataframe belongs each song. To prevent this problem, before the concatenation you can add a new column named "dataset" with a "flag" to remind yourself in which dataset was included ("Hot", "Not hot") each song.

Finally, add a new column to the full dataset for each clustering method with the cluster membership of each song

#STEPS


#1 add hot and not_hot column with H/N
#2 merge both dataframes
#3 use a dimensionality reduction technique: PCA or UMAP
#4 use a clustering method: KMEANS, [DBSCAN, HDBSCAN]
#(add a new column to the full dataset for each clustering method with the cluster membership of each song)
(#5evaluate the better)


In [7]:
#!conda install -c conda-forge dbcv

In [38]:
#import librairies
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import matplotlib.pyplot as plt
import seaborn as sns
import colorcet as cc
import time
import pickle


In [39]:
#songs datasets with ids + features
hot_songs = pd.read_csv(r"C:\Users\priya\Documents\IRONHACK\Week_6\Day_2\Afternoon\lab-spotify-api\extended_hot_songs.csv")
not_hot_songs = pd.read_csv(r"C:\Users\priya\Documents\IRONHACK\Week_6\Day_2\Afternoon\lab-spotify-api\extended_not_hot_songs.csv")

In [40]:
#1 add hot and not_hot column with H/N
H = 'hot'
N = 'not_hot'

hot_songs['label'] = H
not_hot_songs['label'] = N

In [41]:
#check
hot_songs

Unnamed: 0.1,Unnamed: 0,rank,title,artist,id,label
0,5,6,Snooze,SZA,,hot
1,6,7,Water,Tyla,,hot
2,7,8,Last Night,Morgan Wallen,,hot
3,8,9,Fast Car,Luke Combs,,hot
4,9,10,Agora Hills,Doja Cat,,hot
...,...,...,...,...,...,...
75,95,96,Tourniquet,Zach Bryan,,hot
76,96,97,Y Lloro,Junior H,,hot
77,97,98,Murder On The Dancefloor,Sophie Ellis-Bextor,,hot
78,98,99,Amargura,Karol G,,hot


In [42]:
#check
not_hot_songs

Unnamed: 0,artist,title,id,label
0,Britney Spears,Oops!...I Did It Again,,not_hot
1,blink-182,All The Small Things,,not_hot
2,Faith Hill,Breathe,,not_hot
3,Bon Jovi,It's My Life,,not_hot
4,*NSYNC,Bye Bye Bye,,not_hot
...,...,...,...,...
1995,Jonas Brothers,Sucker,,not_hot
1996,Taylor Swift,Cruel Summer,,not_hot
1997,Blanco Brown,The Git Up,,not_hot
1998,Sam Smith,Dancing With A Stranger (with Normani),,not_hot


In [43]:
#2 merge both dataframes
merged_df = pd.concat([hot_songs, not_hot_songs], ignore_index=True)

In [44]:
#check
merged_df

Unnamed: 0.1,Unnamed: 0,rank,title,artist,id,label
0,5.0,6.0,Snooze,SZA,,hot
1,6.0,7.0,Water,Tyla,,hot
2,7.0,8.0,Last Night,Morgan Wallen,,hot
3,8.0,9.0,Fast Car,Luke Combs,,hot
4,9.0,10.0,Agora Hills,Doja Cat,,hot
...,...,...,...,...,...,...
2075,,,Sucker,Jonas Brothers,,not_hot
2076,,,Cruel Summer,Taylor Swift,,not_hot
2077,,,The Git Up,Blanco Brown,,not_hot
2078,,,Dancing With A Stranger (with Normani),Sam Smith,,not_hot


In [45]:
merged_df.to_csv('full_data.csv', index=False)

In [47]:
merged_df1 = pd.read_csv(r"C:\Users\priya\Documents\IRONHACK\Week_6\Day_3\Afternoon\lab-clustering-songs\full_data1.csv")
merged_df1

Unnamed: 0,artist,song,id,danceability,energy,key,loudness,mode,speechiness,acousticness,...,valence,tempo,type,id.1,uri,track_href,analysis_url,duration_ms,time_signature,H_or_N
0,Jack Harlow,Lovin On Me,spotify:track:4xhsWYTOGcal8zt0J161CU,0.943,0.558,2,-4.911,1,0.0568,0.00260,...,0.606,104.983,audio_features,4xhsWYTOGcal8zt0J161CU,spotify:track:4xhsWYTOGcal8zt0J161CU,https://api.spotify.com/v1/tracks/4xhsWYTOGcal...,https://api.spotify.com/v1/audio-analysis/4xhs...,138411,4,H
1,Taylor Swift,Cruel Summer,spotify:track:1BxfuPKGuaTgP7aM0Bbdwr,0.552,0.702,9,-5.707,1,0.1570,0.11700,...,0.564,169.994,audio_features,1BxfuPKGuaTgP7aM0Bbdwr,spotify:track:1BxfuPKGuaTgP7aM0Bbdwr,https://api.spotify.com/v1/tracks/1BxfuPKGuaTg...,https://api.spotify.com/v1/audio-analysis/1Bxf...,178427,4,H
2,Tate McRae,Greedy,spotify:track:3rUGC1vUpkDG9CZFHMur1t,0.750,0.733,6,-3.180,0,0.0319,0.25600,...,0.844,111.018,audio_features,3rUGC1vUpkDG9CZFHMur1t,spotify:track:3rUGC1vUpkDG9CZFHMur1t,https://api.spotify.com/v1/tracks/3rUGC1vUpkDG...,https://api.spotify.com/v1/audio-analysis/3rUG...,131872,1,H
3,Doja Cat,Paint The Town Red,spotify:track:2IGMVunIBsBLtEQyoI1Mu7,0.868,0.538,5,-8.603,1,0.1740,0.26900,...,0.732,99.968,audio_features,2IGMVunIBsBLtEQyoI1Mu7,spotify:track:2IGMVunIBsBLtEQyoI1Mu7,https://api.spotify.com/v1/tracks/2IGMVunIBsBL...,https://api.spotify.com/v1/audio-analysis/2IGM...,231750,4,H
4,Zach Bryan Featuring Kacey Musgraves,I Remember Everything,spotify:track:4KULAymBBJcPRpk1yO4dOG,0.429,0.453,0,-7.746,1,0.0459,0.55400,...,0.155,77.639,audio_features,4KULAymBBJcPRpk1yO4dOG,spotify:track:4KULAymBBJcPRpk1yO4dOG,https://api.spotify.com/v1/tracks/4KULAymBBJcP...,https://api.spotify.com/v1/audio-analysis/4KUL...,227196,4,H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2095,Jonas Brothers,Sucker,spotify:track:22vgEDb5hykfaTwLuskFGD,0.775,0.598,2,-7.274,1,0.0535,0.00175,...,0.356,129.988,audio_features,0lnIJmgcUpEpe4AZACjayW,spotify:track:0lnIJmgcUpEpe4AZACjayW,https://api.spotify.com/v1/tracks/0lnIJmgcUpEp...,https://api.spotify.com/v1/audio-analysis/0lnI...,232560,4,N
2096,Taylor Swift,Cruel Summer,spotify:track:1BxfuPKGuaTgP7aM0Bbdwr,0.773,0.747,5,-4.061,1,0.0885,0.02420,...,0.801,126.014,audio_features,0ErK6K0kYr0Ow2RkPMhmMs,spotify:track:0ErK6K0kYr0Ow2RkPMhmMs,https://api.spotify.com/v1/tracks/0ErK6K0kYr0O...,https://api.spotify.com/v1/audio-analysis/0ErK...,246240,4,N
2097,Blanco Brown,The Git Up,spotify:track:01tA4XmJ4fGQNwti6b2hPm,0.664,0.573,5,-6.519,1,0.0277,0.61300,...,0.566,76.023,audio_features,3d8y0t70g7hw2FOWl9Z4Fm,spotify:track:3d8y0t70g7hw2FOWl9Z4Fm,https://api.spotify.com/v1/tracks/3d8y0t70g7hw...,https://api.spotify.com/v1/audio-analysis/3d8y...,160097,4,N
2098,Sam Smith,Dancing With A Stranger (with Normani),spotify:track:3xgT3xIlFGqZjYW9QlhJWp,0.275,0.238,9,-13.119,1,0.0389,0.97700,...,0.257,88.980,audio_features,0fF9YHMCuv1mAQv4z5SU7L,spotify:track:0fF9YHMCuv1mAQv4z5SU7L,https://api.spotify.com/v1/tracks/0fF9YHMCuv1m...,https://api.spotify.com/v1/audio-analysis/0fF9...,155167,4,N


In [48]:
#check
merged_df1.columns

Index(['artist', 'song', 'id', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id.1', 'uri', 'track_href', 'analysis_url',
       'duration_ms', 'time_signature', 'H_or_N'],
      dtype='object')

In [49]:
columns_to_drop = ['type', 'id.1', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature', 'mode',  'valence']

# Check if columns exist in the DataFrame
columns_to_drop_existing = list(set(columns_to_drop).intersection(merged_df.columns))

# Drop the existing columns
merged_df1 = merged_df1.drop(columns=columns_to_drop_existing)

merged_df1.columns

Index(['artist', 'song', 'id', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id.1', 'uri', 'track_href', 'analysis_url',
       'duration_ms', 'time_signature', 'H_or_N'],
      dtype='object')

In [50]:
#3 use a dimensionality reduction technique: PCA
#X needs to  be scaled before
X_features = ['danceability', 'energy', 'key', 'loudness',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'tempo']

In [51]:
X_features = merged_df1[X_features]
X_features

Unnamed: 0,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,tempo
0,0.943,0.558,2,-4.911,0.0568,0.00260,0.000002,0.0937,104.983
1,0.552,0.702,9,-5.707,0.1570,0.11700,0.000021,0.1050,169.994
2,0.750,0.733,6,-3.180,0.0319,0.25600,0.000000,0.1140,111.018
3,0.868,0.538,5,-8.603,0.1740,0.26900,0.000003,0.0901,99.968
4,0.429,0.453,0,-7.746,0.0459,0.55400,0.000002,0.1020,77.639
...,...,...,...,...,...,...,...,...,...
2095,0.775,0.598,2,-7.274,0.0535,0.00175,0.000004,0.2530,129.988
2096,0.773,0.747,5,-4.061,0.0885,0.02420,0.000009,0.1090,126.014
2097,0.664,0.573,5,-6.519,0.0277,0.61300,0.000363,0.0857,76.023
2098,0.275,0.238,9,-13.119,0.0389,0.97700,0.912000,0.1450,88.980


In [52]:
#check
X_features.describe()

Unnamed: 0,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,tempo
count,2100.0,2100.0,2100.0,2100.0,2100.0,2100.0,2100.0,2100.0,2100.0
mean,0.6651,0.707474,5.41,-5.783525,0.102778,0.146547,0.027393,0.181648,120.097899
std,0.142213,0.166388,3.615381,2.652679,0.096477,0.200268,0.131568,0.142002,27.21062
min,0.0,5e-05,0.0,-39.264,0.0,1e-05,0.0,0.0215,0.0
25%,0.574,0.608,2.0,-6.73025,0.0395,0.014975,0.0,0.0901,98.9935
50%,0.674,0.7305,6.0,-5.368,0.05935,0.06015,0.0,0.125,120.0215
75%,0.763,0.834,8.0,-4.2055,0.126,0.193,8.4e-05,0.237,135.02775
max,0.975,0.997,11.0,0.836,0.53,0.989,0.985,0.971,210.857


In [53]:
#check
X_features.info

<bound method DataFrame.info of       danceability  energy  key  loudness  speechiness  acousticness  \
0            0.943   0.558    2    -4.911       0.0568       0.00260   
1            0.552   0.702    9    -5.707       0.1570       0.11700   
2            0.750   0.733    6    -3.180       0.0319       0.25600   
3            0.868   0.538    5    -8.603       0.1740       0.26900   
4            0.429   0.453    0    -7.746       0.0459       0.55400   
...            ...     ...  ...       ...          ...           ...   
2095         0.775   0.598    2    -7.274       0.0535       0.00175   
2096         0.773   0.747    5    -4.061       0.0885       0.02420   
2097         0.664   0.573    5    -6.519       0.0277       0.61300   
2098         0.275   0.238    9   -13.119       0.0389       0.97700   
2099         0.794   0.653    7    -7.839       0.1040       0.04890   

      instrumentalness  liveness    tempo  
0             0.000002    0.0937  104.983  
1             0

In [54]:
import pickle

scaler = StandardScaler()
scaler.fit(X_features)
X_scaled = scaler.transform(X_features)
filename = "features_scaler.pickle" # Path with filename

with open(filename, "wb") as file:
        pickle.dump(scaler,file)

X_scaled_df = pd.DataFrame(X_scaled, columns = X_features.columns)
display(X_features.head())
print()
display(X_scaled_df.head()) #all the columns will have mean = 0 and sd = 1

Unnamed: 0,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,tempo
0,0.943,0.558,2,-4.911,0.0568,0.0026,2e-06,0.0937,104.983
1,0.552,0.702,9,-5.707,0.157,0.117,2.1e-05,0.105,169.994
2,0.75,0.733,6,-3.18,0.0319,0.256,0.0,0.114,111.018
3,0.868,0.538,5,-8.603,0.174,0.269,3e-06,0.0901,99.968
4,0.429,0.453,0,-7.746,0.0459,0.554,2e-06,0.102,77.639





Unnamed: 0,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,tempo
0,1.954576,-0.898561,-0.943417,0.329001,-0.476684,-0.718942,-0.208238,-0.619492,-0.55561
1,-0.795475,-0.032907,0.993216,0.028855,0.562148,-0.147571,-0.208098,-0.539897,1.834136
2,0.597134,0.15345,0.163231,0.981704,-0.734837,0.546665,-0.208255,-0.476502,-0.333769
3,1.427072,-1.018791,-0.113431,-1.063131,0.738397,0.611594,-0.20823,-0.64485,-0.739957
4,-1.660581,-1.529768,-1.496741,-0.739985,-0.589691,2.035028,-0.20824,-0.561028,-1.560751


In [55]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_scaled_df)
principal_components = pca.transform(X_scaled_df)
principal_components_df = pd.DataFrame(principal_components, columns=['PCA_'+ str(i) for i in range(1,X_scaled_df.shape[1]+1)]) 
#let's set the option "n_components" to a given integer number
principal_components_df.head()

Unnamed: 0,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5,PCA_6,PCA_7,PCA_8,PCA_9
0,-0.07435,-1.895594,-0.66469,0.676747,1.022623,0.235777,-1.031004,0.326677,-0.579963
1,-0.285888,0.83007,0.90297,-1.706133,0.116136,-1.011508,0.012445,0.326083,-0.133038
2,-0.331692,-0.622462,-0.989349,-0.162083,-0.136723,-0.230015,-0.264017,-0.936058,-0.266621
3,1.450978,-1.877292,0.500704,0.249409,0.203748,-0.055556,-0.200579,0.083571,0.330609
4,2.652758,0.478202,-1.165714,1.851899,-0.296796,-1.364102,0.971367,-0.342504,-0.323604


In [56]:
#if the 2 first PCA_component represent at least 80%, we can use this method
# 0.2514264890029777 +  0.39486977572454984 = 0.6462962647275275 : it is not good
cumulated_explained_variance_ratio = [sum(pca.explained_variance_ratio_[0:i+1]) for i,value in enumerate(pca.explained_variance_ratio_)]
cumulated_explained_variance_ratio

[0.25908195721841726,
 0.3986545152543034,
 0.5197750311269332,
 0.6338629699413055,
 0.7409561301674833,
 0.8409351103702284,
 0.9145527674212305,
 0.9725056799562947,
 1.0000000000000002]

In [58]:
conda install -c conda-forge umap-learn


Note: you may need to restart the kernel to use updated packages.




  current version: 23.7.4
  latest version: 23.11.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.11.0




Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\priya\anaconda3

  added / updated specs:
    - umap-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pynndescent-0.5.11         |     pyhca7485f_0          48 KB  conda-forge
    python_abi-3.11            |          2_cp311           5 KB  conda-forge
    umap-learn-0.5.5           |  py311h1ea47a8_0         186 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         239 KB

The following NEW packages will be INSTALLED:

  pynndescent        conda-forge/noarch::pynndescent-0.5.11-pyhca7485f_0 
  python_abi         conda-forge/win-64::python_abi-3.11-2_cp311 
  umap-learn         conda-forge/win-64::umap-learn-0.5.5-py311h1ea47a8_0 




In [59]:
#3 use a dimensionality reduction technique: UMAP
#UMAP, n_components=2

from umap import UMAP

reducer = UMAP(n_components=2,random_state=42) #test with different numbers of n_components
reducer.fit(X_features)

X_umap_transformed = reducer.transform(X_features)
X_umap_transformed_df = pd.DataFrame(X_umap_transformed, columns=["UMAP_1","UMAP_2"])
X_umap_transformed_df.head()

  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")


Unnamed: 0,UMAP_1,UMAP_2
0,-4.715898,2.757841
1,10.702214,7.594029
2,-2.254171,-0.301227
3,-3.746928,4.808707
4,-3.948433,13.550782
