# **Machine Learning Project: Clustering and Classifying Spotify Tracks**
## Introduction

Music is more than just sound: it is a complex combination of patterns, emotions, and measurable characteristics.
By analyzing thousands of Spotify tracks with machine learning, our project aims to group tracks with similar musical structure, so that people can better understand which genre their favorite songs belong to.

In this project, we analyze a large Kaggle dataset of Spotify tracks that includes various audio features such as danceability, energy, acousticness, instrumentalness, tempo, and more.


The main goal is to begin with unsupervised learning to create groups (clusters) of tracks that share similar musical characteristics. These clusters are then used as new labels for a supervised learning phase.

## PHASE 1 — UNSUPERVISED LEARNING

### A. Data Exploration

We first import the libraries we will use:

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

We load the dataset and print its first rows.

In [2]:
df = pd.read_csv("dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


Our dataset has 114,000 rows and 21 columns:

In [7]:
df.shape

(114000, 21)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liveness          11

In [11]:
df.describe()


Unnamed: 0.1,Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0
mean,56999.5,33.238535,228029.2,0.5668,0.641383,5.30914,-8.25896,0.637553,0.084652,0.31491,0.15605,0.213553,0.474068,122.147837,3.904035
std,32909.109681,22.305078,107297.7,0.173542,0.251529,3.559987,5.029337,0.480709,0.105732,0.332523,0.309555,0.190378,0.259261,29.978197,0.432621
min,0.0,0.0,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,28499.75,17.0,174066.0,0.456,0.472,2.0,-10.013,0.0,0.0359,0.0169,0.0,0.098,0.26,99.21875,4.0
50%,56999.5,35.0,212906.0,0.58,0.685,5.0,-7.004,1.0,0.0489,0.169,4.2e-05,0.132,0.464,122.017,4.0
75%,85499.25,50.0,261506.0,0.695,0.854,8.0,-5.003,1.0,0.0845,0.598,0.049,0.273,0.683,140.071,4.0
max,113999.0,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


In this project, we will focus on the characteristics of each song, such as its energy, danceability, and loudness. We therefore won't consider columns such as popularity, artists, or album_name. We will justify these choices in more detail during the data preprocessing.

In [14]:
df.isnull().sum()

Unnamed: 0          0
track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

--> Only three columns (artists, album_name, track_name) have missing data, with 1 missing value each.
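
Before deciding how to handle missing values, it can help to look at the affected rows themselves. The sketch below shows one way to do this with `isnull().any(axis=1)`. It uses a small made-up frame named `toy` (an assumption for illustration), not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "artists": ["X", np.nan, "Z"],
    "track_name": ["s1", np.nan, "s3"],
    "tempo": [120.0, 98.5, 140.2],
})

# Rows where at least one column is missing.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

On the real data, the same filter applied to `df` would show whether the missing values all sit in the same row or in different rows.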

In [30]:
print(df.duplicated().sum())

0


-->  There are no fully duplicated rows.

In [23]:
(df["track_id"].value_counts() > 1).sum()


np.int64(16641)

--> However, 16,641 track IDs appear in more than one row. We will remove these duplicates during the data preprocessing.
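
Three different counts are easy to confuse when looking at repeated IDs: the number of IDs that repeat, the number of rows involved, and the number of rows that deduplication would drop. The sketch below separates them on a made-up frame (`toy` is an assumption for illustration, not the real dataset).

```python
import pandas as pd

# Toy frame (invented values): "a" appears 3 times, "b" twice, "c" once.
toy = pd.DataFrame({"track_id": ["a", "a", "a", "b", "b", "c"]})

# Number of distinct IDs that occur more than once (the quantity computed above).
repeated_ids = int((toy["track_id"].value_counts() > 1).sum())

# Number of rows that share their ID with at least one other row.
rows_involved = int(toy.duplicated(subset="track_id", keep=False).sum())

# Number of rows that drop_duplicates(keep="first") would remove.
rows_removed = int(toy.duplicated(subset="track_id", keep="first").sum())

print(repeated_ids, rows_involved, rows_removed)
```

The `value_counts() > 1` expression therefore counts repeated IDs, not rows.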

Let's now look for outliers in the columns we plan to use:

#### Duration_ms column :

In [31]:
df["duration_ms"].describe()


count    1.140000e+05
mean     2.280292e+05
std      1.072977e+05
min      0.000000e+00
25%      1.740660e+05
50%      2.129060e+05
75%      2.615060e+05
max      5.237295e+06
Name: duration_ms, dtype: float64

In [3]:
df = df[df["duration_ms"] > 60000]

We keep only tracks longer than 1 minute, because the audio characteristics of a very short track cannot be analyzed reliably.

In [9]:
df["duration_ms"].describe()

count    1.131270e+05
mean     2.294231e+05
std      1.065234e+05
min      6.000900e+04
25%      1.752000e+05
50%      2.135000e+05
75%      2.620750e+05
max      5.237295e+06
Name: duration_ms, dtype: float64

#### Speechiness column :

In [13]:
df["speechiness"].describe()

count    113127.000000
mean          0.084196
std           0.104942
min           0.000000
25%           0.035800
50%           0.048800
75%           0.084100
max           0.963000
Name: speechiness, dtype: float64

In [4]:
df = df[df["speechiness"] < 0.5]

Here we remove the tracks with a speechiness of 0.5 or more, i.e. tracks that are considered to be dominated by spoken rather than sung words. This gets rid of most podcasts and audiobooks.

In [17]:
df["speechiness"].describe()

count    111983.000000
mean          0.076738
std           0.073155
min           0.000000
25%           0.035700
50%           0.048500
75%           0.081950
max           0.499000
Name: speechiness, dtype: float64

#### Loudness column :


In [18]:
df["loudness"].describe()


count    111983.000000
mean         -8.202857
std           4.977790
min         -49.531000
25%          -9.932000
50%          -6.970000
75%          -4.985000
max           4.532000
Name: loudness, dtype: float64

In [5]:
df = df[df["loudness"]>-35]
df = df[df["loudness"]<-3]

We keep the values strictly between -35 dB and -3 dB: on Spotify's normalised loudness scale, tracks below -35 dB are mostly ambient or field recordings, while tracks above -3 dB are mostly anomalies or very heavily compressed songs.
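
The two-step loudness filter can also be written as a single range test. This is a minimal sketch on made-up values (`toy` is an assumption for illustration). It relies on `Series.between` with `inclusive="neither"`, which needs pandas 1.3 or later.

```python
import pandas as pd

# Toy loudness values in dB (invented for illustration).
toy = pd.DataFrame({"loudness": [-40.0, -35.0, -12.5, -3.0, -1.2]})

# inclusive="neither" keeps the strict inequalities used above (-35 < loudness < -3).
mask = toy["loudness"].between(-35, -3, inclusive="neither")
filtered = toy[mask]
print(filtered)
```

The boundary values -35.0 and -3.0 are excluded, which matches the behaviour of the `>` and `<` comparisons.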

In [23]:
df["loudness"].describe()


count    105989.000000
mean         -8.475499
std           4.736174
min         -34.991000
25%         -10.142000
50%          -7.205000
75%          -5.277000
max          -3.001000
Name: loudness, dtype: float64

#### Tempo column :

In [24]:
df["tempo"].describe()


count    105989.000000
mean        121.723567
std          29.562619
min           0.000000
25%          99.098000
50%         121.967000
75%         139.990000
max         243.372000
Name: tempo, dtype: float64

In [6]:
df = df[df["tempo"]>40]
df = df[df["tempo"]<200]


Most music falls between roughly 40 BPM (R&B / chill) and 200 BPM (punk rock / EDM). This is why we treat values outside this range as errors.

In [26]:
df["tempo"].describe()


count    105314.000000
mean        121.468664
std          28.727789
min          41.858000
25%          99.059000
50%         121.904000
75%         139.956000
max         199.998000
Name: tempo, dtype: float64

#### Liveness column:

In [3]:
df["liveness"].describe()

count    114000.000000
mean          0.213553
std           0.190378
min           0.000000
25%           0.098000
50%           0.132000
75%           0.273000
max           1.000000
Name: liveness, dtype: float64

In [7]:
df = df[df["liveness"]<0.8]

We keep the tracks with a liveness below 0.8, because above this value a track is very likely a live performance with an audible audience. A live recording could distort the track's other audio features compared with its studio version.

In [7]:
df["liveness"].describe()

count    110612.000000
mean          0.192250
std           0.148295
min           0.000000
25%           0.097000
50%           0.128000
75%           0.255000
max           0.799000
Name: liveness, dtype: float64
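
Because the filters above are spread over several cells, it can be useful to apply them in one place and log how many rows each one removes. The sketch below does this on a made-up frame. The `toy` data and the `rules` dictionary are assumptions for illustration, while the thresholds are the ones used above.

```python
import pandas as pd

# Toy frame (invented values); column names match the dataset.
toy = pd.DataFrame({
    "duration_ms": [30000, 200000, 210000, 220000, 230000, 240000],
    "speechiness": [0.05, 0.90, 0.04, 0.06, 0.05, 0.03],
    "loudness":    [-8.0, -7.0, -1.0, -9.0, -6.0, -5.0],
    "tempo":       [120.0, 110.0, 100.0, 0.0, 125.0, 130.0],
    "liveness":    [0.10, 0.20, 0.10, 0.15, 0.95, 0.12],
})

# Each rule mirrors one of the thresholds applied above.
rules = {
    "duration_ms > 60000": lambda d: d["duration_ms"] > 60000,
    "speechiness < 0.5": lambda d: d["speechiness"] < 0.5,
    "-35 < loudness < -3": lambda d: (d["loudness"] > -35) & (d["loudness"] < -3),
    "40 < tempo < 200": lambda d: (d["tempo"] > 40) & (d["tempo"] < 200),
    "liveness < 0.8": lambda d: d["liveness"] < 0.8,
}

# Apply the rules in order and report the effect of each one.
kept = toy
for name, rule in rules.items():
    before = len(kept)
    kept = kept[rule(kept)]
    print(f"{name}: removed {before - len(kept)} rows, {len(kept)} left")
```

Running all filters from a single cell also keeps the reported row counts consistent, regardless of the order in which notebook cells were executed.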

### B. Data Preprocessing

We will apply a Principal Component Analysis (PCA) to our dataset, since many columns are highly correlated, for example energy and loudness. Such redundant features would skew the KMeans clustering, so we will combine them into new columns that are less correlated with each other.
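
To illustrate the idea, the sketch below measures the correlation between two related columns and then projects standardized data onto principal components. It runs on synthetic data (`toy` is built so that loudness depends on energy), and the choice of three components is an assumption for illustration, not the setting used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (not the Spotify dataset): "loudness" is built from "energy"
# plus a little noise, so the two columns are strongly correlated.
energy = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -30 + 25 * energy + rng.normal(0, 1.0, 500),
    "valence": rng.uniform(0, 1, 500),
})

# Pearson correlation between the two related columns.
corr = toy["energy"].corr(toy["loudness"])

# Standardize first so that no column dominates because of its units.
scaled = StandardScaler().fit_transform(toy)

# Project onto principal components, which are uncorrelated by construction.
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

print(f"energy/loudness correlation: {corr:.2f}")
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

On the real data, `df[selected_columns].corr()` gives the full correlation matrix, which helps confirm which features are redundant before running PCA.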

Let's first delete the missing values and the duplicated rows before selecting the columns:

In [11]:
df = df.dropna()
df.isnull().sum()

Unnamed: 0          0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64

Even though the missing values are in columns that we won't use, the affected songs could be erroneous entries, so we prefer to delete these rows outright. Since there are only a few of them, the loss of data is negligible.

In [13]:
df = df.drop_duplicates(subset='track_id', keep='first')
print("Remaining rows after removing duplicate track IDs:", len(df))

Remaining rows after removing duplicate track IDs: 80358


The dataset contained 16,641 track IDs that appeared more than once.
Since track_id should be unique, we kept only the first occurrence of each ID so that every track appears exactly once in the dataset.
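
Keeping the first occurrence is only safe if the copies of a track carry the same audio features. One way to check which columns differ between copies is sketched below. The `toy` frame is invented, and it assumes the copies differ only by genre, which would need to be verified on the real dataset.

```python
import pandas as pd

# Toy frame (invented values): track "a" is listed under two genres.
toy = pd.DataFrame({
    "track_id": ["a", "a", "b"],
    "energy": [0.8, 0.8, 0.3],
    "track_genre": ["rock", "punk", "jazz"],
})

# Keep only rows whose ID appears more than once.
dupes = toy[toy.duplicated(subset="track_id", keep=False)]

# For each repeated ID, count the distinct values per column;
# a count above 1 marks a column that differs between the copies.
varying = dupes.groupby("track_id").nunique().gt(1).sum()
print(varying)
```

A column with a non-zero count in `varying` differs between copies of at least one track, so dropping duplicates discards information from that column.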


Let's now select the columns that will be useful for our project. We only want musical features, which is why we will only consider the following:

*danceability: represents how suitable a track is for dancing, capturing rhythm stability and beat strength.*

*energy: measures intensity and activity, useful for separating calm acoustic tracks from loud electronic or rock music.*

*loudness: reflects overall volume and mastering style, which varies strongly across genres.*

*speechiness: detects the presence of spoken words, helping identify rap, hip-hop, podcasts, and spoken-word recordings.*

*acousticness: indicates whether a track is acoustic, helping distinguish folk/classical from electronic styles.*

*instrumentalness: measures the likelihood that a track contains no vocals, useful for identifying instrumental genres.*

*liveness: detects live recordings, which form natural clusters distinct from studio tracks.*

*valence: describes the emotional positivity of a track, separating sad/depressing songs from happy/energetic ones.*

*tempo: the BPM can strongly influence musical style (e.g. techno vs R&B).*

*duration_ms: track length helps identify interludes, extended live tracks, and ambient music.*

In [16]:
selected_features = [
    'danceability',
    'energy',
    'loudness',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'duration_ms'
]

df_features = df[selected_features].copy()

print("Selected features:")
df_features.head()


Selected features:


Unnamed: 0,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,0.676,0.461,-6.746,0.143,0.0322,1e-06,0.358,0.715,87.917,230666
1,0.42,0.166,-17.235,0.0763,0.924,6e-06,0.101,0.267,77.489,149610
2,0.438,0.359,-9.734,0.0557,0.21,0.0,0.117,0.12,76.332,210826
3,0.266,0.0596,-18.515,0.0363,0.905,7.1e-05,0.132,0.143,181.74,201933
4,0.618,0.443,-9.681,0.0526,0.469,0.0,0.0829,0.167,119.949,198853


We now have a new dataframe called df_features that contains only the useful columns. We dropped columns such as track_id, artists, album_name, track_name, and popularity because they are irrelevant to the musical analysis of a song.

In [17]:
print("Final number of rows:", len(df_features))
print("Final number of columns:", df_features.shape[1])

Final number of rows: 80358
Final number of columns: 10
