# Not Just Another Genre Predictor - Supervised and Unsupervised Learning Algorithms for Genre Classification

**<span style="color:CornflowerBlue">by Samantha Garcia, Brainstation Data Science Student (May - August 2022) </span>**

## Notebook 5: Findings

### Table of Contents

1. Recap<br>
    1.1 [Business and Machine Learning Questions](#1.1)<br>
    1.2 [Modeling Expectations](#1.2)<br>
    1.3 [Feature Dictionary](#1.3)<br>
    1.4 [Imports and Scaling](#1.4)<br>
    
    
2. Supervised vs Unsupervised Learning and Genre Prediction<br>
    2.1 [Intention](#2.1)<br>
    2.2 [Expectation](#2.2)<br>
    2.3 [Outcome](#2.3)<br>
    
    
3. Interpretation of Clustering Result<br>
    3.1 [Visualising the Clusters](#3.1)<br>
    3.2 [Understanding the Clusters](#3.2)<br>
    3.2.1 [Cluster 0](#3.2.1)<br>
    3.2.2 [Cluster 1](#3.2.2)<br>
    3.2.3 [Cluster 2](#3.2.3)<br>
    3.2.4 [Cluster 3](#3.2.4)<br>


4. [Conclusion](#4)<br>
        
---

## 1. Recap

### 1.1 Business and Machine Learning Questions  <a id="1.1"></a>

#### Business Question

>Can we predict a song's popularity (and therefore value) based on its attributes?

This is a limited measurement without access to more granular information such as geographical consumption metrics, social media penetration, etc.

We will work on a starter model focusing initially on song attribute clustering and genre analysis with a view to further development in the future to layer on value metrics, given access to the more granular information mentioned above and below.

**Creating a tool that can classify any given song into a genre or attribute-based cluster is a good first step in this direction. Which leads us to the Machine Learning Question:**

#### Machine Learning Question

>Can we derive song FEATURE PROFILES (by inspecting the song's audio attributes derived by Spotify) and ascribe VALUE to individual song features or FEATURE PROFILES?

Often, songs are given a genre subjectively, based on music managers' opinions, or they carry a genre from the artist's genre identity. Developing an in-house model for categorising genre based on attributes, then layering on a popularity or value measure, can be of value, particularly when tailored for use alongside private data held by the song copyright owners.

**We will implement 2 learning algorithms as follows:**

1. a supervised learning model where we tell the algorithm what genres the songs belong to, based on subjective 'human' genre labels - the aim here is to build a tool whereby any user can upload a song and be told what genre the song belongs to


2. an unsupervised learning model where we ask the algorithm to cluster songs based on their attributes, without seeing the songs' existing genre labels - the aim here is to ignore conventional genre labels and create a tool where songs are grouped based on machine-derived feature profiles and are given VALUE IMPORTANCE based on clearly identifiable clusters

**A secondary goal from implementing the two models will be:**

3. Can we define the song clusters in a way that is understandable to the average human music listener? We could cross reference the cluster outputs from the unsupervised algorithm against the genre classes from the supervised learning algorithm to determine whether any of the genres are 'natural' genres or not.

### 1.2 Modeling Expectations  <a id="1.2"></a>

For the supervised learning algorithm we will perform hyperparameter optimisation and model selection to choose a best fit model for multiclass classification. We will build a pipeline and perform grid search, applying scaling to transform the data within the pipeline. Grid search implements 5 fold cross validation by default, and we will optimise for that within the pipeline also.

We will NOT use PCA as we believe the dataset has a small enough number of features. Runtime may be more efficient with PCA, but feature importance will be easier to measure from our best estimators without applying PCA. If we find our models take prohibitively long to run then we will reconsider.

The models we will look at are:

- Logistic Regression (baseline)
- KNN
- SVM (multiclass)
- Random Forests
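
The pipeline and grid search described above could look roughly like the sketch below. It is illustrative only: the data is synthetic (generated with `make_classification`), and the step names, candidate models and parameter grids are assumptions rather than the settings used in the modelling notebooks.

```python
# Sketch only: synthetic data stands in for the song features and genre labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each dict swaps in a different estimator for the "model" step
# (hypothetical grids, chosen small so the sketch runs quickly).
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0, 10.0]},
    {"model": [KNeighborsClassifier()], "model__n_neighbors": [5, 15]},
]

# cv=5 is also the GridSearchCV default; it is spelled out here for clarity.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

Keeping the scaler inside the pipeline means the held-out fold in each cross validation split never influences the scaling parameters.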

For the unsupervised learning algorithm we will look at 3 clustering methods to see which is the best fit for our dataset:

- K-Means
- DBSCAN
- Hierarchical Clustering
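
As a rough illustration of how the three clustering methods can be compared side by side, the sketch below fits each one to synthetic blobs and reports the number of clusters found and a silhouette score. The data, the min-max scaling, `eps`, `min_samples` and the choice of 4 clusters are all assumptions for the example, not the settings used in the clustering notebooks.

```python
# Sketch only: synthetic blobs stand in for the scaled song features.
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

X, _ = make_blobs(n_samples=400, centers=4, n_features=4, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

models = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=0.15, min_samples=5),  # hypothetical settings
    "hierarchical": AgglomerativeClustering(n_clusters=4),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # DBSCAN marks noise points with the label -1, which is not a cluster.
    n_found = len(set(labels) - {-1})
    # The silhouette score is only defined when there are at least 2 labels.
    score = silhouette_score(X_scaled, labels) if len(set(labels)) > 1 else float("nan")
    results[name] = (n_found, score)
    print(name, n_found, score)
```

K-Means and hierarchical clustering need the number of clusters up front, whereas DBSCAN infers it from density, which is why its result depends heavily on `eps`.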

If we have time we will also look at how Neural Networks perform at classifying our songs.

**Expected outcomes:**

We expect that supervised learning will be challenged to correctly predict genre as there is much crossover across contemporary genres - we expect the unsupervised models to provide more interesting results and find clusters that can be interpreted and defined based on song types.

These clusters can then be combined with our metrics in future iterations of this project to:

- understand how value is created across song clusters
- help in deciding how to exploit any given song (by seeing which cluster it falls into and placing it in suitable mediums for exploitation)

... and much more.

We expect classical music, EDM and other contemporary genres to be identifiable separately as they are conceptually quite different from each other as groups.


### 1.3 Feature Dictionary  <a id="1.3"></a>

Let's revisit our data dictionary, focusing on the features we will analyse for the above models.

<table>
  <tr>
    <th style="text-align: left; background: lightgrey">Column Name</th>
    <th style="text-align: left; background: lightgrey">Column Contents</th>
  </tr>
  <tr>
    <td style="text-align: left"> <code>song_name</code> </td>
    <td style="text-align: left">Name of Song</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>danceability</code></td>
    <td style="text-align: left">Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>energy</code></td>
    <td style="text-align: left">Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud and noisy. For example, death metal has high energy while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate and general entropy.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>key</code></td>
    <td style="text-align: left">The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>loudness</code></td>
    <td style="text-align: left">The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>mode</code></td>
    <td style="text-align: left">Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>speechiness</code></td>
    <td style="text-align: left">Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>acousticness</code></td>
    <td style="text-align: left">A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>instrumentalness</code></td>
    <td style="text-align: left">Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>liveness</code></td>
    <td style="text-align: left">Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>valence</code></td>
    <td style="text-align: left">A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>tempo</code></td>
    <td style="text-align: left">The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. </td>
  </tr>
  <tr>
    <td style="text-align: left"><code>id</code></td>
    <td style="text-align: left">The Spotify ID for the track. This is our unique identifier, consider making this the index.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>uri</code></td>
    <td style="text-align: left">The Spotify URI for the track. A Spotify URI is a uniform resource identifier (and link) for music on their platform. It is a link to directly share your songs to fans.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>track_href</code></td>
    <td style="text-align: left">A link to the Web API endpoint providing full details of the track.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>analysis_url</code></td>
    <td style="text-align: left">A URL to access the full audio analysis of this track. An access token is required to access this data.</td>
  </tr>
  <tr>
    <td style="text-align: left"><code>duration_ms</code></td>
    <td style="text-align: left">The duration of the track in milliseconds.</td>
  </tr>
      <tr>
    <td style="text-align: left"><code>time_signature</code></td>
    <td style="text-align: left">An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of "3/4", to "7/4".</td>
  </tr>
          <tr>
    <td style="text-align: left"><code>genre</code></td>
    <td style="text-align: left">Subjective genre given to each group of songs based on user perception.</td>
  </tr>
</table>

The dataset currently contains columns we won't be using for analysis, which we will drop when we perform feature selection below.
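
The threshold descriptions in the dictionary can be turned into simple labels. As a minimal sketch, the hypothetical helper below maps a raw `speechiness` value to the three bands described above. It assumes unscaled Spotify values, and assigning the exact boundary values 0.33 and 0.66 to the middle band is an assumption, since the dictionary does not say which side they fall on.

```python
def speechiness_band(value: float) -> str:
    """Map a raw Spotify speechiness value (0.0 to 1.0) to a descriptive band.

    Hypothetical helper for illustration; boundary handling is an assumption.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness must be between 0.0 and 1.0")
    if value > 0.66:
        return "probably entirely spoken words"
    if value >= 0.33:
        return "may contain both music and speech"
    return "most likely music"


for example in (0.05, 0.40, 0.90):
    print(example, speechiness_band(example))
```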

---

## 2. Supervised vs Unsupervised Learning and Genre Prediction

### 2.1 Intention   <a id="2.1"></a>

The project started with a wider Machine Learning Question that goes beyond this project:

Can we derive song FEATURE PROFILES (by inspecting the song's audio attributes derived by Spotify) **(THIS PROJECT)** and ascribe VALUE to individual song features or FEATURE PROFILES **(FUTURE ITERATION USING PRIVATE DATA)**?

**This one question led to 2 parts to be looked at within this project:**

1. a supervised learning model where we tell the algorithm what genres the songs belong to, based on subjective 'human' genre labels - the aim here is to build a tool whereby any user can upload a song and be told what genre the song belongs to


2. an unsupervised learning model where we ask the algorithm to cluster songs based on their attributes, without seeing the songs' existing genre labels - the aim here is to ignore conventional genre labels and create a tool where songs are grouped based on machine-derived feature profiles and are given VALUE IMPORTANCE based on clearly identifiable clusters

**And a secondary goal from implementing the two models:**

3. Can we define the song clusters in a way that is understandable to the average human music listener? We could cross reference the cluster outputs from the unsupervised algorithm against the genre classes from the supervised learning algorithm to determine whether any of the genres are 'natural' genres or not.

### 2.2 Expectation   <a id="2.2"></a>

We expect that supervised learning will be challenged to correctly predict genre as there is much crossover across contemporary genres - we expect the unsupervised models to provide more interesting results and find clusters that can be interpreted and defined based on song types.

We expect classical music, EDM and other contemporary genres to be identifiable separately as they are conceptually quite different from each other as groups.

There was also an expectation that more features would be required to truly get interesting results.

### 2.3 Outcome   <a id="2.3"></a>

1. **Supervised learning model** - these models were indeed challenging to get good results from. They performed very similarly to each other, with overall accuracy maxing out at 63%.


<img src="../images/supervised_results.png" width="400" height="400">

    Precision and recall were good for classical music, and EDM was also predicted better than the rest, **which met our expectation**.
    
    Pop, rock, rap/hiphop and latin were confused with each other in all of the supervised model types.
    
<img src="../images/confusion_matrix.png" width="400" height="400">

We found 4 features that were more important than others in predicting genre, which is in line with the important features we found in our unsupervised modeling, discussed in detail in the sections below. The following graph shows the feature importance from the Random Forests model:

<img src="../images/feature_importance.png" width="400" height="400">
    
    Supervised learning found it difficult to predict human genre tags. The conclusion is that something cultural defines these contemporary genres outside of song features. Features that capture this cultural aspect could be used to enhance this modelling in the future, and would ideally include indicators such as typical audience demographics (perhaps geographical location and average age), social media metrics, synch information (games, films and TV shows the songs are used in), vintage, etc.
    
2. **Unsupervised Learning** - the unsupervised learning algorithms performed better at categorising the songs. We found 4 distinct clusters, which are analysed in more detail below.

Question 3 above asked whether we would be able to define the clusters in a way that is understandable to the average human music listener, and whether we could cross reference the clusters against 'human' genre classes. We manage to do both in the sections below.
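
For readers unfamiliar with how a Random Forests feature importance chart like the one above is produced, the sketch below reads `feature_importances_` from a fitted scikit-learn forest. It uses synthetic data and generic feature names, so the numbers it prints say nothing about the song features; it only illustrates the mechanics.

```python
# Sketch only: synthetic data with generic feature names, not the song dataset.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
importances = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```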

In [101]:
# import required packages
# packages will be added here as they come up during coding
# this is therefore a complete list of all packages used within this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import webbrowser
from pivottablejs import pivot_ui
import qgrid

In [102]:
# import dfs from last notebook (4.a. clustering part 1)
# notebook couldn't handle everything in one so have had to split
feature_means_kmeans = pd.read_csv('../data/feature_means_kmeans.csv', index_col=0)
kmeans_df = pd.read_csv('../data/kmeans_df.csv', index_col=0)

In [103]:
feature_means_kmeans

Unnamed: 0,kmeans_labels,acousticness,danceability,instrumentalness,key,liveness,mode,speechiness,tempo,time_signature,valence,EAL
0,0,0.288034,0.564322,0.03957,0.555005,0.211254,0.0,0.092548,0.409418,0.729124,0.517556,0.734248
1,1,0.147872,0.568631,0.189231,0.432278,0.21415,1.0,0.077771,0.427052,0.736055,0.524864,0.71972
2,2,0.850929,0.392597,0.385575,0.448382,0.178991,1.0,0.037475,0.365865,0.688073,0.329352,0.777741
3,3,0.424023,0.472853,0.787659,0.542847,0.170446,0.0,0.041861,0.399268,0.713479,0.33235,0.726969


## 3. Interpretation of Clustering Result

### 3.1 Visualising the Clusters   <a id="3.1"></a>

In [104]:
# Convert from wide data to long data to plot radar chart
df = pd.melt(
    feature_means_kmeans, 
    id_vars=['kmeans_labels'], 
    var_name='category', 
    value_name='score',
    value_vars=['acousticness', 'danceability', 'instrumentalness', 'key', 'liveness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence', 'EAL']
)
# check conversion for the first cluster
df[df.kmeans_labels == 0]

Unnamed: 0,kmeans_labels,category,score
0,0,acousticness,0.288034
4,0,danceability,0.564322
8,0,instrumentalness,0.03957
12,0,key,0.555005
16,0,liveness,0.211254
20,0,mode,0.0
24,0,speechiness,0.092548
28,0,tempo,0.409418
32,0,time_signature,0.729124
36,0,valence,0.517556


In [105]:
# plot on a single chart

import warnings

import plotly.express as px

warnings.simplefilter(action='ignore', category=FutureWarning)

fig = px.line_polar(df,
        r='score',
        theta='category', 
        color='kmeans_labels', 
        line_close=True,
        line_shape='spline',  # or linear
        hover_name='kmeans_labels',
        hover_data={'kmeans_labels':False},
        markers=True,
        labels= {'category':'Feature', 'score':'Score'},
        # text='kmeans_labels',   
        range_r=[0,1], 
        direction='clockwise',  # or counterclockwise
        start_angle=90

)

# fill in the area between the line and the axis
fig.update_traces(fill='toself') 

fig.show()

In the previous notebook we saw that several features have very low range and seem to contribute little to cluster difference:

In [207]:
# replicate the min max table from previous notebook with feature ranges (variability)
min_max_df = feature_means_kmeans.agg(['min', 'max'], axis=0).T
min_max_df['feature_range'] = min_max_df['max'] - min_max_df['min']
min_max_df

Unnamed: 0,min,max,feature_range
kmeans_labels,0.0,3.0,3.0
acousticness,0.147872,0.850929,0.703057
danceability,0.392597,0.568631,0.176034
instrumentalness,0.03957,0.787659,0.748089
key,0.432278,0.555005,0.122727
liveness,0.170446,0.21415,0.043704
mode,0.0,1.0,1.0
speechiness,0.037475,0.092548,0.055073
tempo,0.365865,0.427052,0.061188
time_signature,0.688073,0.736055,0.047982


The columns with narrow range all have less than 0.1 variability across the clusters:

- liveness 
- speechiness
- tempo
- time_signature 
- EAL
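
The same list can be derived programmatically rather than read off the table. The sketch below rebuilds the cluster feature means shown earlier in this notebook by hand (so that it runs on its own) and keeps the features whose range across clusters is below 0.1. The 0.1 cut-off is the one used in the text; the variable names are illustrative.

```python
import pandas as pd

# Cluster feature means copied from the feature_means_kmeans output above.
feature_means = pd.DataFrame({
    "acousticness": [0.288034, 0.147872, 0.850929, 0.424023],
    "danceability": [0.564322, 0.568631, 0.392597, 0.472853],
    "instrumentalness": [0.03957, 0.189231, 0.385575, 0.787659],
    "key": [0.555005, 0.432278, 0.448382, 0.542847],
    "liveness": [0.211254, 0.21415, 0.178991, 0.170446],
    "mode": [0.0, 1.0, 1.0, 0.0],
    "speechiness": [0.092548, 0.077771, 0.037475, 0.041861],
    "tempo": [0.409418, 0.427052, 0.365865, 0.399268],
    "time_signature": [0.729124, 0.736055, 0.688073, 0.713479],
    "valence": [0.517556, 0.524864, 0.329352, 0.33235],
    "EAL": [0.734248, 0.71972, 0.777741, 0.726969],
})

# Range of each feature's mean across the 4 clusters.
feature_range = feature_means.max() - feature_means.min()

# Features that barely differ between clusters.
narrow_features = feature_range[feature_range < 0.1].index.tolist()
print(narrow_features)
```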

We have plotted all features in the polar plot above. Compare it now with the one below, which uses only 4 features. Those 4 features alone are enough to describe the differences between the clusters.

We are determining **FEATURE IMPORTANCE** while looking at our unsupervised cluster model results.

#### A Note on Mode

For readers' understanding of the 'mode' feature:

Description from audio features dictionary above: "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0."

Although the Valence feature is an indicator of perceptual positiveness or negativeness of a song, Mode can also contribute to that perception. Major songs USUALLY sound more cheerful/ happy/ positive whereas Minor songs USUALLY sound more sombre/ sad/ negative. That isn't to say a song can't be Major and sad.

> "The difference between major and minor chords and scales boils down to a difference of one essential note – the third. The third is what gives major-sounding scales and chords their brighter, cheerier sound, and what gives minor scales and chords their darker, sadder sound" (SOURCE:https://www.studybass.com/lessons/bass-scales/the-difference-between-major-and-minor/#:~:text=The%20difference%20between%20major%20and%20minor%20chords%20and%20scales%20boils,chords%20their%20darker%2C%20sadder%20sound).

**In addition to removing the above low variability features, we will remove Mode when looking at our clusters, using Valence as a proxy for mood.**

#### Finally - Key

We saw in section 2.3 of Notebook 2, while preparing for the supervised learning models, that key had low variability in the reduced dataset, so we removed the feature there. In this bigger sample it has slightly more variability; however, songs in all genres are generally found in all keys. Mode has already been removed, and key and mode build on top of each other to determine the full expression of the song's sound signature: C Major and C Minor sound quite different, but that is more because of the mode than the key. C Major does sound quite different to D Major, but as we can see from the scaled range, there isn't much variability in this feature across clusters or genres (as seen in Notebook 2).

**In addition to removing mode, we will remove key, leaving us with 4 features in total**.

In [106]:
# keep only the 4 features that separate the clusters:
# drop the low variability features identified above, plus key and mode
# (mode and time_signature were also the least important features for
# predicting genre in the best fit supervised model, random forests)
feature_means_kmeans_reduced = feature_means_kmeans.drop(
    columns=['EAL', 'time_signature', 'tempo', 'speechiness', 'liveness', 'key', 'mode'])
feature_means_kmeans_reduced

Unnamed: 0,kmeans_labels,acousticness,danceability,instrumentalness,valence
0,0,0.288034,0.564322,0.03957,0.517556
1,1,0.147872,0.568631,0.189231,0.524864
2,2,0.850929,0.392597,0.385575,0.329352
3,3,0.424023,0.472853,0.787659,0.33235


In [107]:
# Convert from wide data to long data to plot radar chart
df_reduced = pd.melt(
    feature_means_kmeans_reduced, 
    id_vars=['kmeans_labels'], 
    var_name='category', 
    value_name='score',
    value_vars=['acousticness', 'danceability', 'instrumentalness', 'valence']
)
# check conversion for the first cluster
df_reduced[df_reduced.kmeans_labels == 0]

Unnamed: 0,kmeans_labels,category,score
0,0,acousticness,0.288034
4,0,danceability,0.564322
8,0,instrumentalness,0.03957
12,0,valence,0.517556


In [108]:
# plot on a single chart

import warnings

import plotly.express as px

warnings.simplefilter(action='ignore', category=FutureWarning)

fig = px.line_polar(df_reduced,
        r='score',
        theta='category', 
        color='kmeans_labels', 
        line_close=True,
        line_shape='spline',  # or linear
        hover_name='kmeans_labels',
        hover_data={'kmeans_labels':False},
        markers=True,
        labels= {'category':'Feature', 'score':'Score'},
        # text='kmeans_labels',   
        range_r=[0,1], 
        direction='clockwise',  # or counterclockwise
        start_angle=90

)

# fill in the area between the line and the axis
fig.update_traces(fill='toself') 

fig.show()

### 3.2 Understanding the Clusters   <a id="3.2"></a>

We have calculated feature means for each cluster, now we will bring in the track_ids and 'human' genre tags from notebook 1 so that we can conceptualise and understand the clusters.

Clusters are analysed below and song examples provided. Note that clusters are analysed based on MEANS, so individual songs may differ on some features from the description of the cluster they are in, but they should tend towards the mean and largely sound like the cluster profile indicates.
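
To make the point about means concrete, the sketch below uses a handful of made-up songs (the values are hypothetical, chosen only for illustration) and measures how far each song sits from the mean of its own cluster. The column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy songs: two clusters, two features.
songs = pd.DataFrame({
    "kmeans_labels": [0, 0, 0, 2, 2, 2],
    "acousticness": [0.20, 0.30, 0.40, 0.80, 0.85, 0.90],
    "valence": [0.40, 0.50, 0.60, 0.30, 0.35, 0.40],
})
features = ["acousticness", "valence"]

# Mean and spread of each feature within each cluster.
summary = songs.groupby("kmeans_labels")[features].agg(["mean", "std"])
print(summary)

# Absolute distance of every song from its own cluster mean, feature by feature.
cluster_means = songs.groupby("kmeans_labels")[features].transform("mean")
deviation = (songs[features] - cluster_means).abs()
print(deviation)
```

A large within-cluster standard deviation on a feature is a hint that the cluster description for that feature will fit some of its songs poorly.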

We have provided example songs to listen to in the codeblocks below, using the following process (which can be replicated by the reader):

1. Filter the dataframe pivot table provided below on 1. the desired cluster and 2. the desired 'human genre'
2. Double click on the track_id column and copy any track_id from the list of tracks returned by the chosen filters
3. Paste the track_id into the end of the Spotify URL in the webbrowser codeblock and run the codeblock to listen
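
The same steps can also be done in code instead of through the interactive table. The sketch below is one possible version: `spotify_track_url` is a hypothetical helper, and the small dataframe holds the five rows shown in the head of the joined dataframe in this notebook, so that the example runs on its own.

```python
import pandas as pd


def spotify_track_url(track_id: str) -> str:
    """Build the track URL used in the webbrowser codeblocks (hypothetical helper)."""
    return f"http://open.spotify.com/track/{track_id}"


# The first five rows of kmeans_df_w_genres, as displayed in this notebook.
tracks = pd.DataFrame({
    "kmeans_labels": [1, 1, 1, 1, 0],
    "genre_group": ["rap/hiphop"] * 5,
    "track_id": [
        "5CMN9BOEdo8EWoEnMuxvfs",
        "122VGKCzTh6G0QIu8e4lka",
        "4iAusuifPGTnYbxgdINuDE",
        "3OmiK2O1NtFXbY18CZC3r2",
        "1T8Qcl9NYcbIQ4CUYfnzGO",
    ],
})

# Step 1: filter on the desired cluster and 'human' genre.
mask = (tracks["kmeans_labels"] == 1) & (tracks["genre_group"] == "rap/hiphop")
candidates = tracks.loc[mask, "track_id"]

# Steps 2 and 3: pick a track_id and build the URL to open.
url = spotify_track_url(candidates.iloc[0])
print(url)
# webbrowser.open_new_tab(url)  # uncomment to listen in the browser
```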




In [109]:
# read back in tracks_genres_df_cleaned for genre and track labels
genres_df = pd.read_csv('../data/tracks_genres_df_cleaned.csv', index_col=0)

# join the genres onto the kmeans results dataframe
# (joins on row position: assumes both dataframes hold the same songs in the same order)
genre_col = genres_df[['genre_group', 'track_id']].reset_index(drop=True)
kmeans_df_w_genres = kmeans_df.join(genre_col)
kmeans_df_w_genres[['kmeans_labels', 'genre_group', 'track_id']].head()

Unnamed: 0,kmeans_labels,genre_group,track_id
0,1,rap/hiphop,5CMN9BOEdo8EWoEnMuxvfs
1,1,rap/hiphop,122VGKCzTh6G0QIu8e4lka
2,1,rap/hiphop,4iAusuifPGTnYbxgdINuDE
3,1,rap/hiphop,3OmiK2O1NtFXbY18CZC3r2
4,0,rap/hiphop,1T8Qcl9NYcbIQ4CUYfnzGO


In [110]:
kmeans_df_w_genres_grouped = kmeans_df_w_genres.groupby(['kmeans_labels'])['genre_group'].value_counts().unstack(fill_value=0)
kmeans_df_w_genres_grouped.T

kmeans_labels,0,1,2,3
genre_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
asian pop,1103,1762,989,46
ballroom,937,1063,1101,398
blues,482,704,592,75
caribbean,1642,2168,109,95
classical,1139,92,9798,3306
country,306,2033,751,19
dance,1342,2851,185,1473
disco,205,293,18,95
edm,1063,2824,76,1783
electro,1141,2191,307,1039


#### The following is for track selection within clusters, as described above

In [158]:
# for use in this notebook, to pivot clusters and tracks to listen in spotify
qgrid_widget = qgrid.show_grid(kmeans_df_w_genres[['kmeans_labels', 'genre_group', 'track_id']], show_toolbar=True)
qgrid_widget

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

In [127]:
# look at feature means across clusters for 4 most important features
feature_means_kmeans_reduced

Unnamed: 0,kmeans_labels,acousticness,danceability,instrumentalness,valence
0,0,0.288034,0.564322,0.03957,0.517556
1,1,0.147872,0.568631,0.189231,0.524864
2,2,0.850929,0.392597,0.385575,0.329352
3,3,0.424023,0.472853,0.787659,0.33235


#### A Note on Acousticness and Instrumentalness

**Acousticness** can be thought of in terms of "plugged" vs "unplugged". An acoustic musical session is "unplugged", i.e. played with acoustic instruments and therefore having a more 'natural' analogue sound. Conversely, a "plugged" musical session is one played with electric/ digital instruments/ methods and has a more electric sound.

**Instrumentalness** (as described in the features dictionary at the beginning of this notebook) is defined on a 0-1 scale as follows:

- 'vocal' (low instrumentalness)
- 'instrumental' (medium to high instrumentalness)

According to the feature dictionary, an instrumental track is defined as anything from 0.5 upwards on the 0-1 scale. 

Tracks with instrumentalness in the mid-range may contain some vocals. A fully instrumental track with no vocals has value 1. A vocal track with no instrumentals has value 0.

In [128]:
# remind ourselves of feature ranges
min_max_df_reduced = feature_means_kmeans_reduced.agg(['min', 'max'], axis=0).T
min_max_df_reduced

Unnamed: 0,min,max
kmeans_labels,0.0,3.0
acousticness,0.147872,0.850929
danceability,0.392597,0.568631
instrumentalness,0.03957,0.787659
valence,0.329352,0.524864


In [194]:
# create bins of lows and highs for wider range features (acousticness and instrumentalness)
bins = [0.0, 0.15, 0.45, 1.0]
labels = ["LOW", "MEDIUM", "HIGH"]

# bin each cluster mean into LOW / MEDIUM / HIGH, one column per feature
clusters_features_df = pd.DataFrame(index=feature_means_kmeans_reduced.index)
for f in ['acousticness', 'instrumentalness']:
    clusters_features_df[f] = pd.cut(feature_means_kmeans_reduced[f], bins=bins, labels=labels)
clusters_features_df

Unnamed: 0,acousticness,instrumentalness
0,MEDIUM,LOW
1,LOW,MEDIUM
2,HIGH,MEDIUM
3,MEDIUM,HIGH


In [195]:
# create bins of lows and highs for narrow range features (danceability and valence)
bins = [0.0, 0.4, 0.5, 1.0]
labels = ["LOW", "MEDIUM", "HIGH"]

# add the binned columns alongside the wider range features binned above
for f in ['danceability', 'valence']:
    feature_df = pd.cut(feature_means_kmeans_reduced[f], bins=bins, labels=labels)
    clusters_features_df = pd.concat([clusters_features_df, feature_df], axis=1, join='inner')
clusters_features_df

Unnamed: 0,acousticness,instrumentalness,danceability,valence
0,MEDIUM,LOW,HIGH,HIGH
1,LOW,MEDIUM,HIGH,HIGH
2,HIGH,MEDIUM,LOW,LOW
3,MEDIUM,HIGH,MEDIUM,LOW


In [196]:
# for use in presentation, to pivot clusters and tracks to listen in spotify
pivot_ui(kmeans_df_w_genres[['kmeans_labels', 'genre_group', 'track_id']],outfile_path='pivottablejs.html')
webbrowser.open_new_tab('pivottablejs.html')

True

### 3.2.1 Cluster 0   <a id="3.2.1"></a>

In [197]:
clusters_features_df[clusters_features_df.index == 0]

Unnamed: 0,acousticness,instrumentalness,danceability,valence
0,MEDIUM,LOW,HIGH,HIGH


In [189]:
kmeans_df_w_genres_grouped[kmeans_df_w_genres_grouped.index == 0].T.sort_values(0, ascending=False).head(7)

kmeans_labels,0
genre_group,Unnamed: 1_level_1
rap/hiphop,2634
rock,2543
folk,1929
pop,1896
hardcore_rock,1643
caribbean,1642
latin,1499


#### Cluster 0 Description

- 'Human' genres profile: mainly rap/hiphop, rock, folk and pop
- High danceability and valence (energetic and happy sounding)
- These are happy tracks that are danceable and where vocals are a key feature (low instrumentalness)
- they sound plugged but may retain some acousticness (medium)

Example songs:

In [204]:
# rap/hiphop:
webbrowser.open_new_tab('http://open.spotify.com/track/5XzYBtUD8WW5S9LfrL13IQ')

True

In [201]:
# rock:
webbrowser.open_new_tab('http://open.spotify.com/track/0L9k4wQpKCER4MqH33vWJd')

True

In [51]:
# folk:
webbrowser.open_new_tab('http://open.spotify.com/track/2l913EwGQOhT08wZQwKHCD')

True

### 3.2.2 Cluster 1  <a id="3.2.2"></a>

In [198]:
clusters_features_df[clusters_features_df.index == 1]

Unnamed: 0,acousticness,instrumentalness,danceability,valence
1,LOW,MEDIUM,HIGH,HIGH


In [188]:
kmeans_df_w_genres_grouped[kmeans_df_w_genres_grouped.index == 1].T.sort_values(1, ascending=False).head(7)

kmeans_labels,1
genre_group,Unnamed: 1_level_1
rock,6789
indie,3676
rap/hiphop,3645
hardcore_rock,3089
latin,3061
dance,2851
pop,2841


#### Cluster 1 Description

- 'Human' genres profile: rock, rap/hiphop, indie and latin
- High danceability and valence (happy songs to dance to), same as Cluster 0
- BUT sound more plugged (Lower acousticness) and higher instrumentalness (less vocal) than Cluster 0
- unsurprisingly, most of the EDM, Dance and Electro tracks fall into this Cluster (and Cluster 3)

Example songs:

In [203]:
# rock:
webbrowser.open_new_tab('http://open.spotify.com/track/1yJkF5R6pjUWjaJBTwYySe')

True

In [71]:
# indie:
webbrowser.open_new_tab('http://open.spotify.com/track/2dVm7zrInA0V0RD9RHCOGZ')

True

In [205]:
# rap/hiphop:
webbrowser.open_new_tab('http://open.spotify.com/track/4LAS8LPhxEJU4ZCIPmgdZN')

True

In [72]:
# latin:
webbrowser.open_new_tab('http://open.spotify.com/track/3SgTm2eBByF3vdh5G6HDrH')

True

### 3.2.3 Cluster 2  <a id="3.2.3"></a>

In [199]:
clusters_features_df[clusters_features_df.index == 2]

Unnamed: 0,acousticness,instrumentalness,danceability,valence
2,HIGH,MEDIUM,LOW,LOW


In [190]:
kmeans_df_w_genres_grouped[kmeans_df_w_genres_grouped.index == 2].T.sort_values(2, ascending=False).head(6)

kmeans_labels,2
genre_group,Unnamed: 1_level_1
classical,9798
jazz,2910
folk,2590
rock,1455
indie,1260
pop,1201


#### Cluster 2 Description

- 'Human' genres profile: classical, jazz, folk
- Low danceability and valence (more moody and sombre than clusters 0 and 1)
- high acousticness (sound less plugged)

Example songs:

In [76]:
# classical:
webbrowser.open_new_tab('http://open.spotify.com/track/0zoqG48Lr338BxBHm46Rjg')

True

In [77]:
# jazz:
webbrowser.open_new_tab('http://open.spotify.com/track/2nrcTS39YW440mZy4Btbkj')

True

In [78]:
# folk:
webbrowser.open_new_tab('http://open.spotify.com/track/73FjZmy9Abj8rTVTQzXDrn')

True

### 3.2.4 Cluster 3  <a id="3.2.4"></a>

In [200]:
clusters_features_df[clusters_features_df.index == 3]

Unnamed: 0,acousticness,instrumentalness,danceability,valence
3,MEDIUM,HIGH,MEDIUM,LOW


In [192]:
kmeans_df_w_genres_grouped[kmeans_df_w_genres_grouped.index == 3].T.sort_values(3, ascending=False).head(5)

kmeans_labels,3
genre_group,Unnamed: 1_level_1
classical,3306
edm,1783
jazz,1492
dance,1473
electro,1039


#### Cluster 3 Description

- 'Human' genres profile: classical, edm, jazz, dance, electro
- somewhat plugged, highly instrumental but moody sounding

Example songs:

In [83]:
# classical:
webbrowser.open_new_tab('http://open.spotify.com/track/5zKN05WUUcCVTg14hghwjv')

True

In [84]:
# edm:
webbrowser.open_new_tab('http://open.spotify.com/track/4ybPUs477Vq8ZdQeYVEZfF')

True

## 4. Conclusion  <a id="4"></a>

The key takeaways from our unsupervised clustering model are:

1. 4 features are important in determining song clusters: acousticness, danceability, instrumentalness and valence


2. We could have run the KMeans model with higher n_clusters (5 or 6), but the 1 or 2 additional clusters would likely have had lots of similarities with 2 of those seen above. It would be interesting to see if running that would yield similar results in terms of songs and cluster feature profiles (i.e. see if the clusters combine in the same way as above).


3. The 4 cluster song groups can be described conceptually as follows:

<img src="../images/cluster_groups_final.png" width="1000" height="1000">

**Note that the 'human' genre indications are INDICATIVE of the cluster group and do not define it. Those genres are given only to establish familiarity for the reader. It is more important to focus on the descriptions, bearing in mind that a sombre/ moody rap/hiphop or pop track would fall into cluster 0 or 2, depending on how vocal it is.**
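
The follow-up suggested in takeaway 2 (checking how clusters from a higher n_clusters relate to the 4 found here) could be approached with a cross-tabulation of the two sets of labels. The sketch below uses synthetic blobs in place of the song features, so the output only illustrates the method; the values of k and the KMeans settings are assumptions.

```python
# Sketch only: synthetic blobs stand in for the scaled song features.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=0)

# Fit KMeans once with 4 clusters and once with 6.
labels = {}
for k in (4, 6):
    labels[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Rows: clusters from k=4. Columns: clusters from k=6.
# A k=6 column concentrated in a single row is a sub-group of that k=4 cluster.
overlap = pd.crosstab(
    pd.Series(labels[4], name="k=4"),
    pd.Series(labels[6], name="k=6"),
)
print(overlap)
```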