<a href="https://colab.research.google.com/github/Calaside/GTM-Setup---Challenge-2/blob/main/D_Remy_Spotify_clustering_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify clustering

In this challenge, we'll be using a dataset from Spotify that contains metadata for songs on the platform.

By metadata we mean information about each song, such as its name, artists, metrics describing its sound, and other musical attributes.

We will use this dataset to try to cluster closely related songs together! Grouping similar items is one of the ideas behind recommender algorithms on sites such as Spotify, Netflix, etc.

## Data Exploration

Please run the cell below to load the Spotify song data!

In [None]:
import pandas as pd

spotify_df = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_spotify_data.csv')
spotify_df.head()

Unnamed: 0,name,artists,popularity,danceability,valence,energy,explicit,key,liveness,loudness,speechiness,tempo
0,We're For The Dark - Remastered 2010,['Badfinger'],22,0.678,0.559,0.432,0,3,0.0727,-12.696,0.0334,117.674
1,Sixty Years On - Piano Demo,['Elton John'],25,0.456,0.259,0.368,0,6,0.156,-10.692,0.028,143.783
2,Got to Find Another Way,['The Guess Who'],21,0.433,0.833,0.724,0,0,0.17,-9.803,0.0378,84.341
3,Feelin' Alright - Live At The Fillmore East/1970,['Joe Cocker'],22,0.436,0.87,0.914,0,5,0.855,-6.955,0.061,174.005
4,Caravan - Take 7,['Van Morrison'],23,0.669,0.564,0.412,0,7,0.401,-13.095,0.0679,78.716


For the purposes of our analyses, we will only need the numeric features from our dataset. Select only these and save them in a variable called `spotify_numeric`.

In [None]:
# your code here
spotify_numeric=spotify_df.select_dtypes(include="number")
spotify_numeric


Unnamed: 0,popularity,danceability,valence,energy,explicit,key,liveness,loudness,speechiness,tempo
0,22,0.678,0.559,0.432,0,3,0.0727,-12.696,0.0334,117.674
1,25,0.456,0.259,0.368,0,6,0.1560,-10.692,0.0280,143.783
2,21,0.433,0.833,0.724,0,0,0.1700,-9.803,0.0378,84.341
3,22,0.436,0.870,0.914,0,5,0.8550,-6.955,0.0610,174.005
4,23,0.669,0.564,0.412,0,7,0.4010,-13.095,0.0679,78.716
...,...,...,...,...,...,...,...,...,...,...
9995,72,0.786,0.608,0.808,0,7,0.0822,-3.702,0.0881,105.029
9996,68,0.717,0.734,0.753,0,7,0.1010,-6.020,0.0605,137.936
9997,76,0.634,0.637,0.858,0,4,0.2580,-2.226,0.0809,91.688
9998,70,0.671,0.195,0.623,1,2,0.6430,-7.161,0.3080,75.055


Have a read through your features and try to understand what they are related to!

Spotify generate their own features that relate to abstract characteristics that can be attributed to a piece of music (e.g. 'valence' or 'danceability'). You don't need to worry about how these are calculated!

Then we also have some information that is more literal such as the 'key', 'tempo' and whether a song is 'explicit' or not.

Investigate the distributions of some of your variables below:

- What is the ratio of explicit vs non-explicit songs?
- How is popularity distributed?
- How are Spotify's internal song metrics distributed?
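
For the first question, shares are often easier to read than raw counts. Here is a minimal sketch, assuming a small made-up frame called `toy` (hypothetical, standing in for `spotify_df`):

```python
import pandas as pd

# Hypothetical stand-in for spotify_df['explicit']; these are not the real values
toy = pd.DataFrame({"explicit": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# normalize=True returns the share of each value instead of the raw count
explicit_share = toy["explicit"].value_counts(normalize=True)
print(explicit_share)
```

The same call on `spotify_df['explicit']` should give the explicit vs non-explicit proportions directly.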

In [None]:
spotify_df['explicit'].value_counts()

explicit,count
0,8968
1,1032


In [None]:
import plotly.express as px

fig = px.histogram(spotify_df, x = 'popularity')
fig.show()

In [None]:
spotify_df['popularity'].values

array([22, 25, 21, ..., 76, 70, 74])

In [None]:
fig = px.histogram(spotify_df, x = 'danceability')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'valence')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'energy')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'key')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'liveness')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'speechiness')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'tempo')
fig.show()

In [None]:
fig = px.histogram(spotify_df, x = 'loudness')
fig.show()

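The features sit on very different scales (compare `tempo`, in the tens to hundreds, with `danceability`, between 0 and 1), so we will need a scaler.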

The cell below will visualize three of your features in 3D space. Feel free to switch up the variables that are being used for the *x*, *y*, *z* axes.

Because we are using plotly express, you can use your cursor to move around / zoom in & out of the chart.

In [None]:
import plotly.express as px

fig = px.scatter_3d(spotify_numeric,
                    x='danceability',
                    y='energy',
                    z='speechiness',
                    opacity=0.7,
                    width=500,
                    height=500
           )
fig.show()

## First model

Our goal in this challenge is to cluster our songs into similar groups! The plot above may or may not reveal things that look like clusters, but remember! We can only visualise three of our variables here at a time.

When we train a clustering model it will cluster our songs in n-dimensional space, where n is the number of features being fed into the model.
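
To make the "n-dimensional" point concrete, here is a small sketch on random data. The array `X_toy` and the name `toy_model` are hypothetical and only for illustration; the point is that each cluster centre has one coordinate per feature.

```python
import numpy as np
from sklearn.cluster import KMeans

# 200 hypothetical songs described by 10 numeric features (random, not real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))

toy_model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_toy)

# One centre per cluster, one coordinate per feature
print(toy_model.cluster_centers_.shape)  # (8, 10)
# One label per song
print(toy_model.labels_.shape)           # (200,)
```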

Let's start by instantiating a simple KMeans model, with 8 clusters.

Fit this to your numeric spotify data and save the labels that your model has stored in a variable called `labels_simple`.

<details>
    <summary><i>Hint</i></summary>

To get the labels, have a look at the attributes your model has once it has been fitted to your data.
</details>

In [None]:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
spotify_numeric_scaled = scale.fit_transform(spotify_numeric)

In [None]:
from sklearn.cluster import KMeans

# Fit on the unscaled numeric features first, as the instructions above ask; scaling is revisited below
kmeans = KMeans(n_clusters = 8).fit(spotify_numeric)

In [None]:
labels_simple = kmeans.labels_
labels_simple

array([0, 4, 0, ..., 2, 5, 6], dtype=int32)

What is the distribution of our labels? How many songs do we have in each cluster?

In [None]:
import plotly.express as px

fig = px.histogram(x = labels_simple)
fig.show()

We can also now visualise our songs in 3D space again, but this time colour them by our new labels to see what clusters we have created! Run the cell below to see how it's looking.

In [None]:
fig = px.scatter_3d(spotify_numeric,
                    x='danceability',
                    y='energy',
                    z='speechiness',
                    color=labels_simple,
                    width=500,
                    height=500)
fig.show()

It looks a little bit chaotic, doesn't it... I'm not sure I'd be forking out the monthly subscription costs if my Discover Weekly was as all over the place as this chart is.

Do you have any intuitions as to why our labels might look so poorly clustered?

<details>
    <summary><i>Answer</i></summary>

Remember that KMeans (like many unsupervised learning algorithms) is distance-based. We have **not** scaled our numeric features yet. Perhaps doing this will make things look a bit clearer?
</details>
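
As a rough illustration of why scale matters for distances, the sketch below takes `danceability` and `tempo` for three songs from the `head()` output above. The helper `dist` and the array names are hypothetical. Before scaling, the Euclidean distance is almost entirely the tempo difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: danceability, tempo (rows 0, 1 and 4 of the head() output)
X_toy = np.array([[0.678, 117.674],
                  [0.456, 143.783],
                  [0.669, 78.716]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_d = dist(X_toy[0], X_toy[1])
tempo_only = abs(X_toy[0, 1] - X_toy[1, 1])
print(raw_d, tempo_only)  # nearly identical: tempo swamps danceability

# After standardising, every column has unit standard deviation,
# so both features can contribute to the distance
X_scaled_toy = StandardScaler().fit_transform(X_toy)
print(X_scaled_toy.std(axis=0))
```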

In the cell below, perform the preprocessing on our dataset that you deem necessary!

## Preprocessing

In [None]:
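# Wrap the scaled array from above in a DataFrame so columns can be referenced by name
spotify_scaled = pd.DataFrame(spotify_numeric_scaled, columns = spotify_numeric.columns)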

## Modelling with preprocessed data

Now, let's train and fit a model in the same way that we did above. However, this time we will use the scaled data! Save the labels in a variable called `labels_scaled`

In [None]:
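# Fit the same 8-cluster KMeans on the scaled data and keep its labels
labels_scaled = KMeans(n_clusters = 8).fit(spotify_numeric_scaled).labels_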

Run the cell below to see how our clusters look in 3D space, but with our newly scaled data.

In [None]:
fig_scaled = px.scatter_3d(spotify_scaled,
                           x='danceability',
                           y='energy',
                           z='speechiness',
                           color=labels_scaled,
                           width=500,
                           height=500)
fig_scaled.show()

## Finding the right value for *K*

It should look a bit more tidy, maybe a bit more stratified! Progress!

**However, it still doesn't look perfect**. Remember though, we are only looking at 3 dimensions out of the 10 dimensions that our model is trained on.

It might be that, if we could visualise 10-dimensional space, we would see some much more intuitively shaped clusters!

So far we have been using 8 clusters for our models, but we haven't tested whether this makes sense.

Let's use *the elbow method* to check how many clusters we should ideally be using for this dataset. Do this below. Remember to use a plot to visualise your results.
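
To show what an elbow looks like when the answer is known, here is a sketch on synthetic data with three well-separated blobs. The centres, `X_blobs` and `inertias` are assumptions made up for this illustration. The inertia falls sharply up to *k* = 3 and only marginally afterwards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three hypothetical, well-separated groups of points in 2D
X_blobs, _ = make_blobs(n_samples=300,
                        centers=[[0, 0], [10, 10], [-10, 10]],
                        cluster_std=0.5,
                        random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = model.inertia_  # sum of squared distances to the nearest centre

for k, value in inertias.items():
    print(k, round(value, 1))
```

On real data the bend is usually less clear-cut, so reading the elbow involves some judgment.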



In [None]:
distorsion = {}

for k in range (1, 11):
  kmeans = KMeans(n_clusters = k, random_state = 42).fit(spotify_numeric_scaled)
  distorsion[k] = kmeans.inertia_

In [None]:
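# Plot inertia against k; the bend ("elbow") suggests a reasonable number of clusters
fig = px.line(x = list(distorsion.keys()), y = list(distorsion.values()),
              labels = {'x': 'k', 'y': 'inertia'})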
fig.show()

Reading the elbow plot, the best number of clusters *k* would be around 4.

## Creating a model with the ideal number of clusters

It looks as though having around 6 clusters makes sense for our dataset. Create a new KMeans model using 6 clusters and plot it in 3D space using the same process as above.

Or 6, as the challenge text suggests.

In [None]:
best_model = KMeans(n_clusters = 6).fit(spotify_numeric_scaled)

In [None]:
df_spotify_numeric_scaled = pd.DataFrame(spotify_numeric_scaled, columns = spotify_numeric.columns)

In [None]:
df_spotify_numeric_scaled['label'] = best_model.labels_

In [None]:
fig_scaled = px.scatter_3d(df_spotify_numeric_scaled,
                           x='danceability',
                           y='energy',
                           z='speechiness',
                           color='label',
                           width=500,
                           height=500)
fig_scaled.show()

The chart doesn't reveal a whole lot more, but perhaps we can create some theoretical playlists based on our clusters?

Add the new labels from our 6-cluster model to our original Spotify dataframe as a column called 'label'.

In [None]:
spotify_df['label'] = best_model.labels_

In [None]:
fig = px.histogram(x = spotify_df['label'])
fig.show()

## Generating Spotify playlists based on our clusters!

We should now see the original metadata for our Spotify songs, but **with the added label of which cluster they are located in** based on our KMeans algorithm.

Let's generate 6 playlists (one for each cluster), each containing 15 random songs from that cluster.

Below we have created a dictionary called `daily_mixes`. Inside this dictionary we want the cluster labels as keys and, as values, dataframes that contain only songs from that specific cluster.

Finish the for loop below to obtain this dictionary!
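
An alternative to filtering inside a loop is `groupby`, which yields one sub-dataframe per label. This is a sketch on a made-up frame; `toy_songs` and `toy_mixes` are hypothetical names and the sample size of 2 is chosen only to fit the toy data.

```python
import pandas as pd

# Hypothetical songs with cluster labels already attached
toy_songs = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 0, 1],
})

# One playlist per label: sample up to 2 songs and renumber the rows from 0
toy_mixes = {
    label: songs.sample(n=min(2, len(songs)), random_state=0).reset_index(drop=True)
    for label, songs in toy_songs.groupby("label")
}

for label, playlist in toy_mixes.items():
    print(label, playlist["name"].tolist())
```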

In [None]:
daily_mixes = {}

for cluster in range(6):
  # Draw 15 random songs from this cluster and renumber their rows from 0
  music = spotify_df[spotify_df['label'] == cluster].sample(n = 15)
  daily_mixes[cluster] = music.reset_index(drop = True)

Run the cell below to print out our 6 playlists!!!

In [None]:
for key,value in daily_mixes.items():
  print("-" * 50)
  print(f"Here are some songs for playlist {key}")
  print("-" * 50)
  display(value.sample(5)[['name', 'artists']])

--------------------------------------------------
Here are some songs for playlist 0
--------------------------------------------------


Unnamed: 0,name,artists
13,I've Got a Testimony,"['Rev. Clay Evans', 'The AARC Mass Choir']"
10,Brother and Sister,['The Gun Club']
2,Whistle Down the Wind,['Nick Heyward']
5,The Big Crash,['Eddie Money']
1,Mendocino County Line,"['Willie Nelson', 'Lee Ann Womack']"


--------------------------------------------------
Here are some songs for playlist 1
--------------------------------------------------


Unnamed: 0,name,artists
8,"La Bohème / Act 1: ""Questo Mar Rosso""","['Giacomo Puccini', 'Rolando Panerai', 'Lucian..."
14,Blistered,['Johnny Cash']
12,A Light In The Black,['Rainbow']
0,El Caballo Blanco,['José Alfredo Jimenez']
9,Trust in Him,['The Clark Sisters']


--------------------------------------------------
Here are some songs for playlist 2
--------------------------------------------------


Unnamed: 0,name,artists
7,Tango del Pecado (feat. Bajofondo Tango Club &...,"['Calle 13', 'Bajofondo Tango Club', 'Panasuyo']"
5,Team,['Iggy Azalea']
3,Spit These Bars,"['Drag-On', 'Kasseem Dean']"
10,Newsboy,['Robin Williams']
4,Freaky,['Tory Lanez']


--------------------------------------------------
Here are some songs for playlist 3
--------------------------------------------------


Unnamed: 0,name,artists
4,Affair In San Miguel,"['The Rippingtons', 'Steve Reid', 'Brandon Fie..."
13,Bigmouth Strikes Again - Demo,['The Smiths']
7,Wild Child - Remastered,['Lou Reed']
11,Runnin From The Devil,['Ohio Players']
9,Sing-Along Song,['Stryper']


--------------------------------------------------
Here are some songs for playlist 4
--------------------------------------------------


Unnamed: 0,name,artists
6,Piedra,['Caifanes']
5,Te Quiero Así,['Valentín Elizalde']
8,"Are You Ready - 12"" Version",['Billy Ocean']
3,The Itsy Bitsy Spider,['Nursery Rhymes']
2,"Tear it Down - From ""Camp Rock 2: The Final Jam""","['Meaghan Martin', 'Matthew ""Mdot"" Finley']"


--------------------------------------------------
Here are some songs for playlist 5
--------------------------------------------------


Unnamed: 0,name,artists
7,The Chill Of An Early Fall,['George Strait']
14,Giant Steps,['Joe Pass']
10,Éjszakai,['Vas Bela']
12,Pouring Rain,"['Rain Sounds', 'Rain Sounds & White Noise', '..."
9,Getting To Know You,"['Richard Rodgers', 'Julie Andrews', 'John Mau..."


### Running clustering with DBSCAN

As a bonus, let's try and run a clustering analysis using [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)!

Remember, with `DBSCAN` we don't need to select the number of clusters *a priori*.

Instantiate and fit a `DBSCAN` model. Read the documentation and be sure to try out different values for `eps` (epsilon) and `min_samples` - **this is essential to return reasonable results!** [This article](https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd) has some helpful tips on how to pick reasonable values.
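
One common heuristic for choosing `eps` is to sort every point's distance to its k-th nearest neighbour and look for a bend in that curve. Below is a minimal sketch on random data; `X_toy`, `k_dist` and the choice of `min_samples = 20` (roughly twice the number of features) are assumptions for illustration, not recommended values for this dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 500 hypothetical songs with 10 scaled features (random, not real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 10))

min_samples = 20  # assumption: about 2 x the number of features

# Querying with the training data itself returns each point as its own
# first neighbour, so the last column is the distance to the
# (min_samples - 1)-th other point
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)

# Sorted k-distances: plotting this curve and reading off the bend gives a candidate eps
k_dist = np.sort(distances[:, -1])
print(k_dist[[0, len(k_dist) // 2, -1]])
```

If most k-distances sit well above the chosen `eps`, DBSCAN will tend to mark most points as noise (label `-1`).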

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
spotify_numeric_scaled = scale.fit_transform(spotify_numeric)

In [None]:
X = spotify_numeric_scaled
clusters = DBSCAN(eps = 0.5, min_samples=3).fit(X)

How many clusters has the model created? What is their distribution? Save your labels in a variable called `dbscan_labels`. Is this the same as what we came up with using the Elbow Method?
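
A short sketch of how the label array can be summarised, using a made-up array `toy_labels` (hypothetical). DBSCAN marks noise points with `-1`, so that value is excluded from the cluster count.

```python
import numpy as np

# Hypothetical DBSCAN output: -1 means "noise", other values are cluster ids
toy_labels = np.array([-1, 0, 0, 1, -1, 1, 1, -1])

values, counts = np.unique(toy_labels, return_counts=True)
n_clusters = len(set(toy_labels.tolist()) - {-1})
n_noise = int((toy_labels == -1).sum())

print(dict(zip(values.tolist(), counts.tolist())))  # {-1: 3, 0: 2, 1: 3}
print(n_clusters, n_noise)                          # 2 3
```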

In [None]:
clusters.labels_

array([-1, -1, -1, ..., -1, -1, -1])

In [None]:
# A histogram shows how many songs fall under each label (-1 is noise)
fig = px.histogram(x = clusters.labels_)
fig.show()

Run the cell below to plot your clusters using the DBSCAN labels.

In [None]:
dbscan_labels = clusters.labels_

In [None]:
df_spotify_numeric_scaled = pd.DataFrame(spotify_numeric_scaled, columns = spotify_numeric.columns)

In [None]:
fig_dbscan = px.scatter_3d(df_spotify_numeric_scaled,
                           x='danceability',
                           y='energy',
                           z='speechiness',
                           color=dbscan_labels,
                           width=500,
                           height=500)
fig_dbscan.show()

Using your fitted model, add your predicted cluster labels for each song to the Spotify dataframe in a new column called 'label_dbscan'.

<details>
    <summary><i>Hint</i></summary>

Your number of clusters will be very dependent on the parameters you specified when instantiating your model!
</details>

In [None]:
import numpy as np

In [None]:
spotify_df['label_dbscan'] = dbscan_labels

The cell below will generate some new playlists using the DBSCAN clusters!

In [None]:
daily_mixes_dbscan = {}

for num_cluster in np.unique(dbscan_labels):

  daily_mixes_dbscan[num_cluster] = spotify_df[spotify_df['label_dbscan'] == num_cluster]


for key,value in daily_mixes_dbscan.items():
  print("-" * 50)
  print(f"Here are some songs for playlist {key}")
  print("-" * 50)
  display(value.sample(5)[['name', 'artists']])

--------------------------------------------------
Here are some songs for playlist -1
--------------------------------------------------


Unnamed: 0,name,artists
1496,Platinum Jazz,['War']
9559,Apollo,['St. Paul & The Broken Bones']
9877,F&MU,['Kehlani']
5861,Wrong Idea,['Snoop Dogg']
5044,Tearjerker,['Red Hot Chili Peppers']


--------------------------------------------------
Here are some songs for playlist 0
--------------------------------------------------


ValueError: Cannot take a larger sample than population when 'replace=False'
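
The `ValueError` above appears because at least one DBSCAN cluster holds fewer than 5 songs, and `sample(5)` cannot draw more rows than exist without replacement. One possible guard is sketched below; the helper name `sample_up_to` and the frame `tiny_cluster` are hypothetical.

```python
import pandas as pd

def sample_up_to(df, n=5, random_state=None):
    """Return n random rows, or every row (shuffled) if df has fewer than n."""
    return df.sample(n=min(n, len(df)), random_state=random_state)

# Hypothetical cluster with only 3 songs
tiny_cluster = pd.DataFrame({"name": ["a", "b", "c"], "artists": ["x", "y", "z"]})

print(len(sample_up_to(tiny_cluster)))  # 3
```

Calling `sample_up_to(value)` in place of `value.sample(5)` in the loop above should let the small clusters print as well.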

You've just completed your first unsupervised clustering! **Congrats**! This is a *very* commonplace methodology, especially in recommender systems.

By no means is the example we have gone through meant to be perfect (especially with a subjective topic such as music + limited features), and it can churn out some pretty chaotic results, but **the principles will very much hold true for all clustering tasks**.

Importantly, *never forget to scale your data if you are using a distance-based algorithm*!

Finally, here are some links to more information about Spotify data / the Spotify API (perhaps some project inspiration):

- [Audio Analysis theory with the Spotify Web API](https://www.youtube.com/watch?v=goUzHd7cTuA)
- Spotify API [docs](https://developer.spotify.com/documentation/web-api/)
- Spotify API Wrappers [Tekore](https://github.com/felix-hilden/tekore) and [Spotipy](https://github.com/plamere/spotipy)