# Spotify - All Time Top 2000s Mega Dataset
by: Bondoc, Alyana; Dalisay, Andres; To, Justin

We will be using the dataset [Spotify - All Time Top 2000s Mega Dataset](https://www.kaggle.com/datasets/iamsumat/spotify-top-2000s-mega-dataset) for this project. 

# Phase 1

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import norm
from scipy.stats import chi2_contingency
from scipy.stats import f_oneway

%load_ext autoreload
%autoreload 2

## The Dataset

The dataset lists top tracks released between 1956 and 2019 that are available on Spotify, along with information about each track and how it scores on several sound features. It was extracted from the Spotify playlist “Top 2000s”. The playlist was passed to PlaylistMachinery (@plamere), which retrieved the attributes of each song. Sumat Singh then scraped the data from that site using Python with Selenium to form the dataset that the group uses for this analysis.

The dataset contains 15 columns, each representing an attribute, and 1994 rows, each representing one observation. The `.csv` file itself has 1995 lines because its first line holds the column names, so once loaded the DataFrame has a shape of (1994, 15).

The columns are:

- **`Index`**: Row identifier of the track.
- **`Title`**: Title of the track.
- **`Artist`**: Artist/group who made the track.
- **`Top Genre`**: Genre of the track.
- **`Year`**: Year the track was released.
- **`Length (Duration)`**: Duration of the track in seconds.
- **`Beats Per Minute (BPM)`**: Tempo of the track, in beats per minute.
- **`Energy`**: Scale measuring how energetic and upbeat the track is. The higher the value, the more energetic.
- **`Danceability`**: Scale measuring how suitable the track is for dancing. The higher the value, the more danceable.
- **`Loudness (dB)`**: How loud the track is, measured in decibels. The higher the value, the louder.
- **`Valence`**: Scale measuring the mood of the song, whether it be positive or negative. The higher the value, the more positive.
- **`Acousticness`**: Scale measuring how acoustic the track is. The higher the value, the more acoustic.
- **`Speechiness`**: Scale measuring how much of the track consists of spoken words. The higher the value, the more spoken-word content.
- **`Liveness`**: Scale measuring the likelihood that the track is a live recording. The higher the value, the more likely it is that the track was recorded live.
- **`Popularity`**: Scale measuring how popular the song is. The higher the value, the more popular the song is as of 2019.

All of this data was taken by PlaylistMachinery from the Spotify API. More detailed descriptions of the audio features can be seen in [Spotify's Web API Documentation.](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

The dataset was provided as a `.csv` file.

In [None]:
df = pd.read_csv("Spotify-2000.csv")
df

In [None]:
df.describe()

Let's check the data types of the variables.

In [None]:
print(df.dtypes);

### Cleaning the Dataset

Before we can start exploring our dataset, we first have to preprocess the data by cleaning it and formatting it into a more usable form.

First, let's check if there are any null values in the dataset that need to be taken care of.

In [None]:
df.isnull().values.any()

We can conclude there aren't any missing or null values in our dataset.

Let us then look for duplicate values in the dataset and, if there are any, understand the context behind them and decide how to deal with them.

The field to investigate is `Title`, as it is the one most commonly used to identify a record.

In [None]:
df['Title'].value_counts()

As we can see here, some track titles appear more than once. Let's investigate further by including the artist's name in the value count.

In [None]:
(df['Title']+" "+df['Artist']).value_counts()

Thus, we conclude that the repeated titles actually belong to different artists. With that said, we can also say that there are no duplicate song entries in the dataset.
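The same check can also be phrased with `DataFrame.duplicated`, which flags repeated rows directly instead of relying on reading the value counts. The sketch below uses a small made-up frame (the tracks are illustrative stand-ins, not rows taken from the dataset).

```python
import pandas as pd

# Toy stand-in for the real dataset: two different artists share one title.
toy = pd.DataFrame({
    "Title": ["Hallelujah", "Hallelujah", "Imagine"],
    "Artist": ["Jeff Buckley", "Leonard Cohen", "John Lennon"],
})

# Rows whose title appears more than once (keep=False flags every copy).
repeated_titles = toy[toy.duplicated(subset=["Title"], keep=False)]

# Rows where the (title, artist) pair itself repeats.
# An empty result means there are no true duplicate songs.
true_duplicates = toy[toy.duplicated(subset=["Title", "Artist"], keep=False)]

print(len(repeated_titles), len(true_duplicates))
```

If `true_duplicates` is empty, repeated titles are covers or namesakes rather than duplicated records.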

Next, we look at the categorical columns to see whether the same category appears under several different spellings.

The two columns that contain categorical values are `Year` and `Top Genre`.

In [None]:
print(df['Year'].value_counts())

We can see that there are no duplicate or inconsistently written categories in the `Year` field.

In [None]:
print(df['Top Genre'].value_counts())

We can see that there are also no duplicate or inconsistently written categories in the `Top Genre` field.

Next, we want to convert each column name into snake_case format so that it'll be much easier to refer to them in our code.

In [None]:
list(df.columns)

In [None]:
df.rename(columns = {
    'Index':'index',
    'Title': 'title',
    'Artist': 'artist',
    'Top Genre': 'top_genre',
    'Year': 'year',
    'Beats Per Minute (BPM)': 'bpm',
    'Energy': 'energy',
    'Danceability': 'danceability',
    'Loudness (dB)': 'loudness',
    'Liveness': 'liveness',
    'Valence': 'valence',
    'Length (Duration)': 'length',
    'Acousticness': 'acousticness',
    'Speechiness': 'speechiness',
    'Popularity': 'popularity'
}, inplace=True)

list(df.columns)

Looking at the datatypes of each column, the `title`, `artist`, `top_genre`, and `length` columns all had the Object datatype, while the rest of the columns had proper datatypes as integers.

In [None]:
print(df.dtypes);

The former three mentioned were converted into Strings, while the `length` column had to be converted into integers. However, the `length` column had some values with 4 digits that were formatted with a comma as seen in the example below:

$$ 1,412 $$

The column couldn’t be converted into an integer datatype with the commas in its values, so the commas were removed before being converted into an integer.

In [None]:
df['title'] = df['title'].astype('string');
df['artist'] = df['artist'].astype('string');
df['top_genre'] = df['top_genre'].astype('string');

# Some length values contain a thousands separator (e.g. "1,412"),
# so remove the commas first and then convert to int.
df['length'] = df['length'].str.replace(',', '', regex=False).astype('int64')
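As a quick illustration of the comma handling, the following sketch applies the same cleaning step to a few made-up duration strings (only `1,412` comes from the example above; the other values are hypothetical).

```python
import pandas as pd

# Made-up durations in the same text format as the raw column.
raw_length = pd.Series(["201", "1,412", "259"])

# regex=False treats the comma as a literal character, not a pattern.
clean_length = raw_length.str.replace(",", "", regex=False).astype("int64")

print(clean_length.tolist())
```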

To check that the proper datatypes were used, the datatypes were printed out.

In [None]:
print(df.dtypes);

Next, we thought that specific years wouldn't be too useful in our analysis, so we decided to bin them into decades; that is, year ranges such as 1960-1969, 2000-2009, etc. The only bin that wouldn't be complete would be the 1950-1959 bin, since the earliest year recorded in the data was 1956. We decided to make it into a 1956-1959 bin instead of dropping all records with those years altogether because we thought the song attributes for those songs could still be useful.

In [None]:
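# right = False makes each bin include its lower edge and exclude its upper edge, so 1960 lands in '1960-1969' instead of '1956-1959'.
df['period'] = pd.cut(x = df['year'], bins = [1956, 1960, 1970, 1980, 1990, 2000, 2010, 2020], labels = ['1956-1959', '1960-1969', '1970-1979', '1980-1989', '1990-1999', '2000-2009','2010-2019'], right = False)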
df = df[['index', 'title', 'artist', 'top_genre', 'period', 'year', 'bpm', 'energy', 'danceability', 'loudness', 'liveness', 'valence', 'length', 'acousticness', 'speechiness', 'popularity']]
df
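One way to confirm how boundary years such as 1960 or 2010 are assigned is to run the binning on a handful of years. The sketch below assumes half-open bins (`right=False`, with a final edge of 2020), which is one way to make the labels match the decades described above.

```python
import pandas as pd

# A few years that sit on or next to the bin edges.
years = pd.Series([1956, 1959, 1960, 1969, 1970, 2009, 2010, 2019])

edges = [1956, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = ["1956-1959", "1960-1969", "1970-1979", "1980-1989",
          "1990-1999", "2000-2009", "2010-2019"]

# right=False: each bin is [lower, upper), so a decade starts at its lower edge.
period = pd.cut(years, bins=edges, labels=labels, right=False)

check = dict(zip(years.tolist(), period.astype(str).tolist()))
print(check)
```

With the default `right=True`, each bin would instead include its upper edge, so a year like 1960 would be grouped with the earlier decade.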

## Exploratory Data Analysis

### What is the average danceability of each genre?

To answer this question, the variables of interest are:
- **`top_genre`**: Genre of the track.
- **`danceability`**: Scale measuring how usable the track is for dancing. The higher the value, the more danceable.

First, we make a sliced copy of the main DataFrame with only the `top_genre` and `danceability` columns.

In [None]:
genre_dance = df[['top_genre', 'danceability']].copy()
genre_dance

Let's take a look at the distribution of danceability values.

In [None]:
genre_dance.hist('danceability', edgecolor='w', figsize=(8,4))

In [None]:
genre_dance['danceability'].skew()

As we can see from the histogram and the skew value, the distribution is roughly symmetric and close to normal. This means the mean is a reasonable measure of the average danceability across all the genres.

In [None]:
genre_dance.agg({'danceability' : ['mean', 'median', 'std']})

The average danceability of all the songs in the dataset is a value of **53.24**.

Next, let's group all the records by genre and take some summary statistics from them.

In [None]:
genre_dance_summary = genre_dance.groupby('top_genre').agg({'danceability': ['skew', 'mean', 'median', 'std', 'count']})

genre_dance_summary.sort_values(('danceability', 'mean'), ascending=False).head(10)

As there are some genres in the dataset that are represented by only one or two songs, they may not be a very good representation of other songs in that genre. So let's remove those first by using `.dropna()` to take out all the records with `NaN` values in the `std` and `skew` columns. (They're `NaN` because the standard deviation needs at least two values and the skew needs at least three.)

In [None]:
genre_dance_summary.sort_values(('danceability', 'mean'), ascending=False).dropna().head(10)

Here, we can see the ten genres with the highest mean `danceability` among the genres that have enough records for every statistic to be computed. Note that the `skew` value was included in order to determine whether it would be more accurate to use the `mean` or `median` as the measure of central tendency.

From the results, we can see that `reggae fusion` has the highest average `danceability` value. However, it is only represented by a count of 4 songs. 

The next largest `danceability` average with a genre represented by >=10 songs would be `disco`. Because of its `skew` value of **-0.52**, the `median` may be more accurate to look at, with a value of **69.0**.
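Because `.dropna()` removes small genres only indirectly (through their `NaN` statistics), an explicit minimum-count filter can make the cutoff easier to read and adjust. The sketch below uses made-up values and a hypothetical `MIN_SONGS` threshold; it is an alternative phrasing, not the filter used in this notebook.

```python
import pandas as pd

# Made-up danceability values for three genres of different sizes.
toy = pd.DataFrame({
    "top_genre": ["disco"] * 4 + ["chanson"] * 3 + ["one-off genre"],
    "danceability": [70, 68, 72, 66, 30, 35, 40, 90],
})

summary = toy.groupby("top_genre")["danceability"].agg(
    ["mean", "median", "std", "skew", "count"]
)

# Hypothetical cutoff: keep genres represented by at least this many songs.
MIN_SONGS = 3
filtered = summary[summary["count"] >= MIN_SONGS].sort_values("mean", ascending=False)

print(filtered)
```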

Now, let's take a look at the genre with the lowest average `danceability` value.

In [None]:
genre_dance_summary.sort_values(('danceability', 'mean'), ascending=True).dropna().head(10)

From the DataFrame above, we can see that `chanson`, a lyric-driven French song genre, has the lowest mean `danceability`, with a value of **35.00**.

Let's take a look at the first 10 values sorted by `count` and see what the averages for the high-count genres are.

In [None]:
genre_dance_summary.sort_values(('danceability', 'count'), ascending=False).dropna().head(10)

As we can see, `album rock` with the highest song count has a mean of **51.41**. `dance pop`, rather appropriately, has a relatively high danceability median of **64.0**.

### Which Attributes and Sound Features are Correlated?

To answer this, we first slice the DataFrame to keep only the sound features, then compute the mean of each feature per `top_genre`. Afterwards, we apply pandas' **.corr()** function to get the correlation between each pair of features.


In [None]:
genre_attributes = df.copy()
genre_attributes.drop(['index','period','artist','popularity','title','year'], axis=1, inplace=True)
genre_attributes_summary = genre_attributes.groupby('top_genre').agg('mean')
# dropna() returns a new DataFrame, so assign the result back.
genre_attributes_summary = genre_attributes_summary.dropna()
genre_attributes_summary.corr()
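To pick out the strongest pairs from a correlation matrix without scanning it by eye, the pairs can be listed and sorted by absolute correlation. The helper below, `strongest_pairs`, is a hypothetical name introduced for illustration, and the feature values are made up.

```python
import pandas as pd

def strongest_pairs(frame, threshold=0.5):
    """Return feature pairs whose |correlation| is at least `threshold`."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = {}
    # Visit each unordered pair once (upper triangle of the matrix).
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs[(first, second)] = r
    return pd.Series(pairs, dtype=float).sort_values(
        key=lambda s: s.abs(), ascending=False
    )

# Made-up per-genre means, chosen so that some pairs are strongly related.
toy = pd.DataFrame({
    "energy": [10, 20, 30, 40, 50],
    "loudness": [-12, -10, -8, -6, -4],
    "acousticness": [90, 70, 50, 30, 10],
    "valence": [50, 20, 60, 10, 55],
})

result = strongest_pairs(toy)
print(result)
```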

Several pairs of sound features turn out to be correlated. For example, `energy` and `loudness` are positively correlated. Let's take a look at the scatterplot for these features.

In [None]:
genre_attributes[['energy', 'loudness']].plot.scatter(x='energy', y='loudness', alpha=0.5)

As you can see, there is a distinct upwards trend, denoting a positive correlation. This means that, usually, the louder the song is, the more energy it's perceived to have.

Another one is `acousticness` and `energy` with a negative correlation. Let's see the scatterplot for these two.

In [None]:
genre_attributes[['acousticness', 'energy']].plot.scatter(x='acousticness', y='energy', alpha=0.5)

Given this negative correlation, we can say that the perceived energy of a song tends to decrease as it relies more on acoustic elements or instruments and less on electronic ones.

`acousticness` and `loudness` also have a negative correlation between them. Let's see the scatterplot.

In [None]:
genre_attributes[['acousticness', 'loudness']].plot.scatter(x='acousticness', y='loudness', alpha=0.5)

As with the relationship between `acousticness` and `energy`, the more acoustic elements there are in a song, the quieter it tends to be. This could be because most acoustic instruments are limited in how loud a sound they can produce.

Lastly, there is a slightly positive correlation between `valence` and `danceability`.

In [None]:
genre_attributes[['valence', 'danceability']].plot.scatter(x='valence', y='danceability', alpha=0.5)

We can infer from this that danceable songs tend to be "happier" sounding.

### What are the energy level intervals of all the songs in the dataset?

To answer this question, the variables of interest are:
- **`energy`**: Scale measuring how energetic and upbeat the track is. The higher the value, the more energetic.

Since the observations are independent and the sample size is well above 30, the group will compute a confidence interval for the mean energy level of the songs in the dataset.

First, let's take a look if the distribution of values in energy is normal.

In [None]:
df['energy'].skew()

Since the value is negative but still close to zero, it can be said that the distribution is slightly negatively skewed. Let's validate this using a histogram.

In [None]:
df.hist('energy', edgecolor='w', figsize=(8,4))

With this, it can be said that the sample is approximately symmetric.

Now, let's take a look at the mean, median, and standard deviation of `energy`, which we need for the confidence interval.

In [None]:
energy_agg = df.agg({"energy": ["mean", "median", "std"]})

# Select by row and column label instead of by position.
sample_mean = energy_agg.loc["mean", "energy"]
sample_median = energy_agg.loc["median", "energy"]
sample_std = energy_agg.loc["std", "energy"]

energy_agg

The mean, median, and standard deviation are: **59.68**, **61.00**, and **22.15**.

#### Confidence Interval

A confidence interval for a population mean is of the following form:

$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}$$

Where $z^*$, also known as the **critical value**, is the z-score that bounds the middle 95% of the standard normal distribution.

The critical value for 95% confidence is **1.96**. Let's validate this using the code below.

In [None]:
z_star_95 = norm.ppf(0.975)
print('{:.2f}'.format(z_star_95))

#### Margin of Error

We can compute the **margin of error** using the formula

$$z^* \frac{s}{\sqrt{n}}$$

Compute and display the margin of error given a 95% confidence level.

In [None]:
n = df['index'].count()

margin_of_error = z_star_95 * (sample_std / np.sqrt(n))
print('Margin of Error: {:.2f}'.format(margin_of_error))

Now, let's compute the 95% confidence interval of the dataset. The 95% confidence interval is the sample mean $\pm$ the margin of error.

In [None]:
minimum_value = sample_mean - margin_of_error
maximum_value = sample_mean + margin_of_error
print('({:.2f},'.format(minimum_value), '{:.2f})'.format(maximum_value))

With this, even though we do not have the complete list of popular songs on Spotify, we are 95% confident that the true average energy level is between **58.71** and **60.65**.
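The steps above can be collected into one small function. This is a minimal sketch that plugs in the rounded statistics reported earlier (mean 59.68, standard deviation 22.15, n = 1994); `mean_confidence_interval` is a hypothetical helper name, and it uses the standard library's `NormalDist().inv_cdf` in place of `norm.ppf` so that it runs without SciPy.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """z-based confidence interval for a population mean."""
    # Critical value: for 95% confidence this is the 97.5th percentile.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sample_std / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(59.68, 22.15, 1994)
print(f"({low:.2f}, {high:.2f})")
```

Raising `confidence` widens the interval, since a larger critical value multiplies the same standard error.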

## Research Question

Based on the results of our EDAs, we found that the `danceability` feature had varying averages across different genres, with some genres having similar averages. This leads us to question whether we could possibly find similar average values for other song features across genres, and find out how many features between certain genres are similar.

We also found evidence that a couple of sound features had relationships between them. These correlations lead us to believe that these sound features could possibly be utilized to find relationships and similarities between the different genres as well.

We also discovered that the 95% confidence interval for the mean `energy` of all the songs across all genres in the dataset is between 58.71 and 60.65. To find out whether certain genres are similar to one another, it would be useful to look at their confidence intervals. If the intervals of two genres overlap, this suggests that the difference between them may not be statistically significant.
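The overlap check itself is simple to express in code. In the sketch below, `intervals_overlap` is a hypothetical helper and the genre intervals are made-up numbers. Overlap is only a rough screen: two intervals can overlap slightly while a formal test still finds a significant difference.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(low, high) and b=(low, high) overlap."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high

# Made-up 95% confidence intervals for the mean energy of three genres.
genre_a = (58.0, 63.0)
genre_b = (61.5, 66.0)
genre_c = (70.0, 75.0)

print(intervals_overlap(genre_a, genre_b))
print(intervals_overlap(genre_a, genre_c))
```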

Given all of this information we've acquired about our dataset, we have decided on the research question, **"Based on the attributes and features of songs, which music genres are similar (or dissimilar) to one another?"**

Music is arguably one of mankind's most significant innovations, as it transcends visual barriers and limitations by stimulating our imagination and emotions through sound. From Beethoven's Waldstein, to Tupac's All Eyez on Me, to Taylor Swift's Midnight Rain, each of these works, from completely different genres, tells a different story and provokes different emotions. With that said, given the number of songs and genres available, it can be quite difficult to find the right songs for a certain taste in music.

Knowing which genres are similar to one another can help with that selection and filtering process. The group plans to answer this question using cosine similarity and data mining techniques such as clustering, applied to the sound feature data per genre. The results could also be used in recommender systems for music platforms such as Spotify, making them more efficient by having the algorithm look for suggestions in the same or closest clusters first, instead of searching for recommendations at random across all genres.
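As a rough illustration of the cosine similarity idea, each genre can be represented by a vector of its mean feature values, and two genres compared by the angle between their vectors. The vectors below are made-up numbers in the order (energy, danceability, valence, acousticness, speechiness), not values computed from the dataset.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical mean feature vectors for three genres.
album_rock = [70, 50, 55, 20, 5]
alternative_metal = [85, 45, 45, 5, 7]
adult_standards = [35, 50, 55, 70, 4]

rock_vs_metal = cosine_similarity(album_rock, alternative_metal)
rock_vs_standards = cosine_similarity(album_rock, adult_standards)
print(round(rock_vs_metal, 3), round(rock_vs_standards, 3))
```

With these made-up vectors, the two rock-leaning genres score closer to 1 than the rock and standards pair, which is the kind of ordering the research question is after.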

# Phase 2

## Data Pre-processing

### Limiting Genres

Since the number of songs per genre in the dataset varies widely, we limit the analysis to the 20 genres with the most songs.

We chose the 20 largest genres rather than an arbitrary minimum song count because we also wanted to limit the number of genres we'd have to analyze once we clustered our dataset. In this dataset, those 20 genres are exactly the ones with more than 20 songs each, which is the condition the filter checks.

In [None]:
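genre_df = df.copy()

# Keep only the genres that have more than 20 songs.
# Reading the genre names from the index works across pandas versions.
genre_counts = genre_df['top_genre'].value_counts()
genre_df = genre_df[genre_df['top_genre'].isin(genre_counts[genre_counts > 20].index)]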

genre_df.info()

Here are the 20 genres left after filtering.

In [None]:
genre_df['top_genre'].unique()

### Choosing Columns

In order to answer our research question, we will employ the use of k-means clustering.

But before clustering our data, we want to check which song attributes can be considered factors for data modelling. Let's take a look at the columns available in our DataFrame:

In [None]:
for col in df.columns:
    print(col)

From these columns, we want to get the ones that refer to each song's musical attributes. These will be `energy`, `danceability`, `liveness`, `valence`, `acousticness`, and `speechiness`. We chose these attributes because these are values that are computed from Spotify's algorithms based on things like a song's key, beats per minute (BPM), loudness, etc., and are good metrics for representing a song's musical and emotional qualities. 

There are a few columns that we considered to add, but ultimately decided not to for various reasons:

>`bpm`: We decided not to add this because this value is already used to compute the other attributes mentioned above. Thus, we feel it would be redundant to consider this as a factor for data modelling by itself.
>
>`loudness`: Similar to `bpm`, this is also a value already used to compute the attributes mentioned above.
>
>`length`: We believe that the length of a song isn't necessarily indicative of a song's musical qualities. In addition, there are some outlier songs with abnormally long durations that may affect our data modelling results.
>
>`popularity`: The popularity of a song is a more subjective metric that isn't really indicative of a song's musical qualities.

In [None]:
genre_df = genre_df[["energy", "danceability", "liveness", "valence", "acousticness", "speechiness", "top_genre"]]
genre_df.reset_index(drop = True, inplace = True)
genre_df.info()

In [None]:
genre_df['top_genre'].value_counts()

In order for our KMeans code to work later for our data modelling, we have to set `top_genre`'s datatype to an object.

In [None]:
genre_df['top_genre'] = genre_df['top_genre'].astype('object')
genre_df.info()

In addition, because these attributes are already set on a scale from 0 to 100, normalization is not needed anymore.

Before using `energy`, `danceability`, `liveness`, `valence`, `acousticness`, and `speechiness` for clustering though, we first perform statistical inference on these attributes in order to see if they're actually significant enough to be considered factors for data modelling.

## Data Cleaning for Statistical Inference

First, let's check whether all factors to be used, `energy`, `danceability`, `liveness`, `valence`, `acousticness`, and `speechiness`, are normally distributed so that we know whether we can perform ANOVA on them.

In [None]:
genre_df['energy'].skew()

`energy` is fairly symmetrical.

In [None]:
genre_df['danceability'].skew()

`danceability` is fairly symmetrical.

In [None]:
genre_df['liveness'].skew()

`liveness` has a highly skewed distribution. Let's normalize it using the Box-Cox transformation, with code adapted from [Geeks for Geeks](https://www.geeksforgeeks.org/box-cox-transformation-using-python/).

In [None]:
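# Work on a copy so that genre_df keeps the original, untransformed values.
genre_df_std = genre_df.copy()

# Box-Cox requires strictly positive inputs. It returns the transformed
# values together with the fitted lambda.
fitted_data, fitted_lambda = stats.boxcox(genre_df_std['liveness'])

# Overwrite the column in place, which keeps the column order unchanged.
genre_df_std['liveness'] = fitted_data
genre_df_std['liveness'].skew()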

Now `liveness` is fairly symmetrical.
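To see the effect of the transformation in isolation, the sketch below applies `stats.boxcox` to synthetic right-skewed data (drawn from a log-normal distribution, an assumption made purely for illustration) and compares the skew before and after. Note that Box-Cox only accepts strictly positive values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed data (not from the dataset).
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=1000))

# boxcox returns the transformed values and the lambda it fitted.
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = skewed.skew()
skew_after = pd.Series(transformed).skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```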

In [None]:
genre_df['valence'].skew()

`valence` is fairly symmetrical.

In [None]:
genre_df['acousticness'].skew()

`acousticness` is only moderately skewed.

In [None]:
genre_df['speechiness'].skew()

`speechiness` has a highly skewed distribution. Let's normalize this using the Box-Cox transformation.

In [None]:
fitted_data, fitted_lambda = stats.boxcox(genre_df_std['speechiness'])

genre_df_std = genre_df_std.drop(columns=["speechiness"])
genre_df_std.insert(5, "speechiness", fitted_data)
genre_df_std['speechiness'].skew()

Now `speechiness` is fairly symmetrical.

## Statistical Inference

Here are the means of the different attributes for each genre.

In [None]:
genre_df.groupby('top_genre').agg("mean")

In [None]:
genre_df_std.groupby('top_genre').agg("mean")

For statistical testing, ANOVA will be used. 

### First Statistical Test - Energy

For the first statistical test, let's check whether there is a significant difference in the mean of `energy`.

**Null Hypothesis**: There is no difference in energy per genre.

**Alternative Hypothesis**: There is a difference in energy per genre.

Let's first visualize the distribution of `energy` per genre using a box plot.

In [None]:
genre_df.boxplot("energy", by="top_genre", figsize=(15,10))
plt.show()

The box plots suggest that the genres differ from one another, but let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['energy'].values)
    
f_oneway(*values)

Since the p-value **5.3529e-30** is less than the significance level of 0.05, we can reject the null hypothesis. Therefore, energy has a **significant difference** and will be a factor in clustering.
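Since the same grouping-and-testing pattern is repeated for every attribute, it could also be wrapped in a small helper that returns one row per feature. `anova_by_group` is a hypothetical name, and the toy data is constructed so that one feature clearly differs between groups while the other is identical in every group.

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_by_group(frame, group_col, features):
    """Run one-way ANOVA for each feature, splitting rows by `group_col`."""
    rows = {}
    for feature in features:
        # One array of values per group.
        samples = [g[feature].to_numpy() for _, g in frame.groupby(group_col)]
        result = f_oneway(*samples)
        rows[feature] = {"F": result.statistic, "p_value": result.pvalue}
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data: three groups of five songs each (made-up values).
toy = pd.DataFrame({
    "top_genre": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "separated": list(range(1, 6)) + list(range(11, 16)) + list(range(21, 26)),
    "identical": [1, 2, 3, 4, 5] * 3,
})

table = anova_by_group(toy, "top_genre", ["separated", "identical"])
print(table)
```

A feature whose p-value falls below the chosen significance level would then be kept as a clustering factor, following the same decision rule as above.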

### Second Statistical Test - Danceability

For the second statistical test, let's check whether there is a significant difference in the mean of `danceability`.

**Null Hypothesis**: There is no difference in danceability per genre.

**Alternative Hypothesis**: There is a difference in danceability per genre.

Let's first visualize the distribution of `danceability` per genre using a box plot.

In [None]:
genre_df_std.boxplot("danceability", by="top_genre", figsize=(15,10))
plt.show()

The box plots suggest that the genres differ from one another, but let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['danceability'].values)
    
f_oneway(*values)

Since the p-value **8.8769e-16** is less than the significance level of 0.05, we can reject the null hypothesis. Therefore, danceability has a **significant difference** and will be a factor in clustering.

### Third Statistical Test - Liveness

For the third statistical test, let's check whether there is a significant difference in the mean of `liveness`.

**Null Hypothesis**: There is no difference in liveness per genre.

**Alternative Hypothesis**: There is a difference in liveness per genre.

Let's first visualize the distribution of `liveness` per genre using a box plot.

In [None]:
genre_df_std.boxplot("liveness", by="top_genre", figsize=(15,10))
plt.show()

In this case, the genres appear to be close to one another. Let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['liveness'].values)
    
f_oneway(*values)

Since the p-value **0.7180** is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we found **no significant difference** in liveness between genres, and it will not be a factor in clustering.

### Fourth Statistical Test - Valence

For the fourth statistical test, let's check whether there is a significant difference in the mean of `valence`.

**Null Hypothesis**: There is no difference in valence per genre.

**Alternative Hypothesis**: There is a difference in valence per genre.

Let's first visualize the distribution of `valence` per genre using a box plot.

In [None]:
genre_df_std.boxplot("valence", by="top_genre", figsize=(15,10))
plt.show()

The box plots suggest that the genres differ from one another, but let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['valence'].values)
    
f_oneway(*values)

Since the p-value **4.6986e-13** is less than the significance level of 0.05, we can reject the null hypothesis. Therefore, valence has a **significant difference** and will be a factor in clustering.

### Fifth Statistical Test - Acousticness

For the fifth statistical test, let's check whether there is a significant difference in the mean of `acousticness`.

**Null Hypothesis**: There is no difference in acousticness per genre.

**Alternative Hypothesis**: There is a difference in acousticness per genre.

Let's first visualize the distribution of `acousticness` per genre using a box plot.

In [None]:
genre_df_std.boxplot("acousticness", by="top_genre", figsize=(15,10))
plt.show()

The box plots suggest that the genres differ from one another, but let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['acousticness'].values)
    
f_oneway(*values)

Since the p-value **9.5081e-39** is less than the significance level of 0.05, we can reject the null hypothesis. Therefore, acousticness has a **significant difference** and will be a factor in clustering.

### Sixth Statistical Test - Speechiness

For the sixth statistical test, let's check whether there is a significant difference in the mean of `speechiness`.

**Null Hypothesis**: There is no difference in speechiness per genre.

**Alternative Hypothesis**: There is a difference in speechiness per genre.

Let's first visualize the distribution of `speechiness` per genre using a box plot.

In [None]:
genre_df_std.boxplot("speechiness", by="top_genre", figsize=(15,10))
plt.show()

In this case, it is unclear from the box plots whether there is a significant difference between genres, so let's validate this using one-way ANOVA.

In [None]:
values = []
for i in genre_df_std['top_genre'].unique():
    _df = genre_df_std[genre_df_std['top_genre']==i]
    values.append(_df['speechiness'].values)
    
f_oneway(*values)

Since the p-value **7.5578e-09** is less than the significance level of 0.05, we can reject the null hypothesis. Therefore, speechiness has a **significant difference** and will be a factor in clustering.

With this, only the 5 factors `energy`, `danceability`, `valence`, `acousticness`, and `speechiness` are significant for clustering. `liveness` is not significant, so it will be excluded from clustering.

In [None]:
genre_df = genre_df.drop(columns=["liveness"])
genre_df.info()

## Data Modelling

As stated earlier at the start of the Phase 2 section, we will employ the use of k-means clustering in order to answer our research question. Through clustering, we'll be able to see which genres are clustered together, along with how many songs of each genre there are in each cluster.

First, let's take a look at our DataFrame again.

In [None]:
genre_df.info()

Next, let's check the number of songs per genre.

In [None]:
genre_df['top_genre'].value_counts()

Now, in order for us to select a good value for `k`, we'll use the elbow method. The following code for visualizing the line graph for the elbow method was adapted from [Geeks for Geeks](https://www.geeksforgeeks.org/determining-the-number-of-clusters-in-data-mining/).

In [None]:
import sys
!{sys.executable} -m pip install scikit-learn
import sklearn
from sklearn.cluster import KMeans


# Keep only the numeric features; k-means cannot use the genre label.
dataset = genre_df.drop(columns=["top_genre"]).copy()

# Elbow method: record the within-cluster sum of squared
# distances (WCSS) for each candidate value of k.
wcss = {}

for k in range(2, 10):
    # Fix the seed so the elbow plot is reproducible between runs.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(dataset)
    wcss[k] = model.inertia_

# Plot the WCSS values to find the elbow.
plt.plot(list(wcss.keys()), list(wcss.values()), 'gs-')
plt.xlabel('Values of "k"')
plt.ylabel('WCSS')
plt.show()

From the plot, we can see that the within-cluster sum of squared distances (WCSS) starts to level off around `k = 5`. As such, we'll use `k = 5` as our number of clusters.
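Reading the elbow off a plot is a judgment call. One simple heuristic, offered here only as a cross-check and not as the method used in this notebook, is to pick the `k` where the curve bends the most, measured by the second difference of the WCSS values. The WCSS numbers below are made up, and the helper assumes evenly spaced values of `k`.

```python
def elbow_by_second_difference(wcss):
    """Return the k with the largest second difference in WCSS.

    Assumes the keys of `wcss` are evenly spaced values of k.
    """
    ks = sorted(wcss)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Large positive value: steep drop before k, flat curve after k.
        bend = wcss[prev_k] - 2 * wcss[k] + wcss[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up WCSS values shaped like a curve that flattens after k = 5.
example_wcss = {2: 1000, 3: 800, 4: 600, 5: 400, 6: 380, 7: 365, 8: 355, 9: 350}
elbow_k = elbow_by_second_difference(example_wcss)
print(elbow_k)
```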

We then import our own `KMeans` class, which we also used for our CSMODEL notebooks. Note that this import replaces the scikit-learn `KMeans` name used in the previous cell.

In [None]:
from kmeans import KMeans

In [None]:
kmeans = KMeans(5, 0, 5, 1465, genre_df)

We first have to initialize the centroids for our dataset.

In [None]:
kmeans.initialize_centroids(genre_df)

Then, we use the `train()` function from our Python file, which is the main function for processing the clusters.

In [None]:
groups = kmeans.train(genre_df, 300)
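The contents of our `kmeans` module are not shown in this notebook. For readers who want a picture of what a training loop of this kind typically does, the following is a minimal sketch of the standard k-means (Lloyd's) iteration in NumPy. It is not the code in `kmeans.py`: the function name, its parameters, and the toy points are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iters=300):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for iteration in range(1, max_iters + 1):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                new_centroids[j] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids, iteration

# Two well-separated toy groups of points (made-up values).
X = np.array([[10, 10], [12, 11], [11, 12],
              [80, 85], [82, 88], [85, 83]], dtype=float)

labels, centroids, n_iters = lloyd_kmeans(X, init_centroids=X[[0, 3]])
print(labels, n_iters)
```

The iteration cap plays the same role as the `300` passed to `train()`: the loop normally stops earlier, as soon as the centroids no longer change.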


Now, using `print_results()`, we can see the final clusters that were generated.

In [None]:
test=kmeans.print_results(groups[0], -1, genre_df)

After running the k-means clustering algorithm on the dataset for 29 iterations, we ended up with 5 different clusters. Let's explore each of them individually.

Normally, we would visualize the clustered result on a plotted axis. However, because the data has more dimensions than a plot can show, we found it very difficult and inefficient to do so. Instead, we've decided to analyze the clustering results both per cluster and per genre.

### Analysis Per Cluster


Let's first analyze the different genres per cluster and see which genres dominate which clusters.
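A compact way to get an overview of all clusters at once is a genre-by-cluster count table. The sketch below builds one with `pd.crosstab` from a made-up frame of per-song cluster labels. The column names `top_genre` and `cluster` are assumptions and do not reflect the actual structure returned by `print_results()`.

```python
import pandas as pd

# Made-up cluster assignments for six songs.
toy = pd.DataFrame({
    "top_genre": ["album rock", "album rock", "dutch pop",
                  "dutch pop", "album rock", "adult standards"],
    "cluster": [0, 0, 1, 1, 1, 1],
})

# Rows are genres, columns are clusters, cells are song counts.
genre_by_cluster = pd.crosstab(toy["top_genre"], toy["cluster"])

# For each cluster (column), the genre with the highest count.
dominant_genre = genre_by_cluster.idxmax(axis=0)

print(genre_by_cluster)
print(dominant_genre)
```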

#### Cluster 0

In [None]:
import matplotlib.pyplot as plt

cluster_0 = test[test['Cluster'] == 0].sort_values(by='Count', ascending=False).copy()

plt.barh(cluster_0['Genre'],cluster_0['Count'])
plt.show()

In [None]:
cluster_0

In [None]:
cluster_0['Count'].sum()

As we can see from the results, Cluster 0 is mostly dominated by `Rock` and `Metal` genres such as `Album Rock`, `Alternative Metal`, `Modern Rock`, and `Alternative Rock`, to name a few. Cluster 0 has a total of 313 songs.

#### Cluster 1

In [None]:
cluster_1 = test[test['Cluster'] == 1].sort_values(by='Count', ascending=False).copy()

plt.barh(cluster_1['Genre'],cluster_1['Count'])
plt.show()

In [None]:
cluster_1

In [None]:
cluster_1['Count'].sum()

As we can see from the results, Cluster 1 is mostly dominated by `Rock` and `Pop` genres such as `Dutch Pop` and `Album Rock`. `Adult Standards` is also prominent in this cluster.

#### Cluster 2

In [None]:
cluster_2 = test[test['Cluster'] == 2].sort_values(by='Count', ascending=False).copy()

plt.barh(cluster_2['Genre'],cluster_2['Count'])
plt.show()

In [None]:
cluster_2

In [None]:
cluster_2['Count'].sum()

Cluster 2 was dominated by `album rock` followed by `adult standards` and `dutch cabaret`.

---

#### Cluster 3

In [None]:
cluster_3 = test[test['Cluster'] == 3].sort_values(by='Count', ascending=False).copy()

plt.barh(cluster_3['Genre'],cluster_3['Count'])
plt.show()

In [None]:
cluster_3

In [None]:
cluster_3['Count'].sum()

A majority of `Cluster 3` also consisted of `album rock` and `dance pop`.

---

#### Cluster 4

In [None]:
cluster_4 = test[test['Cluster'] == 4].sort_values(by='Count', ascending=False).copy()

plt.barh(cluster_4['Genre'],cluster_4['Count'])
plt.show()

In [None]:
cluster_4

In [None]:
cluster_4['Count'].sum()

`Cluster 4` was dominated by `album rock` and `dutch pop`.

In [None]:
groups[1]

We can see here that `Cluster 0` has the highest `energy` and `speechiness` out of all the clusters, while also having the lowest `acousticness`. The prominence of `rock` and `metal` tracks in the cluster makes sense, as songs from these genres tend to use a lot of electric and electronic instruments, which lowers their `acousticness`.

Songs from `Cluster 1`, on the other hand, had the lowest `energy` and `speechiness`, as well as relatively low `valence`. It also had the highest `acousticness`. Low `valence` indicates more negative-sounding songs, which can convey emotions such as sadness, depression, or anger. Because the cluster's `energy` is also very low relative to the others, we can speculate that its songs lean toward a sad and depressive mood, on the assumption that sadder, more melancholic songs tend to have less `energy`. Its high `acousticness` also supports this assumption, as a lot of sad and emotional songs tend to be sung acoustically.

`Cluster 2`'s songs had relatively high `danceability`, `acousticness`, and `valence`. This would explain the prominence of songs from the `Dutch Cabaret` genre, which tends to be played with non-electric instruments and to have a more positive tune. It is also often danced to.

Songs from `Cluster 3` had the highest `danceability` and `valence`, as well as relatively high `energy` and `speechiness`, while having relatively low `acousticness`. High `valence` tends to go hand in hand with high `energy` and `danceability`, since together they create a more upbeat mood and environment. The low `acousticness` implies the use of electronic sounds and instruments, which would explain the prominence of `rock` and `dance pop` in the cluster.

`Cluster 4`'s songs had the lowest `danceability` and `valence`, as well as relatively low `energy` and `speechiness`.
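Statements such as "highest `energy`" or "lowest `acousticness`" can be read off the centroid table by eye, but they can also be derived programmatically. The sketch below assumes the centroids are available as a DataFrame with one row per cluster and one column per sound feature; the values in `toy_centroids` are invented for illustration and are not our actual centroids.

```python
import pandas as pd

# Made-up centroid values for three hypothetical clusters.
toy_centroids = pd.DataFrame(
    {
        "energy": [78.0, 31.0, 55.0],
        "danceability": [48.0, 44.0, 66.0],
        "acousticness": [8.0, 70.0, 35.0],
        "valence": [45.0, 30.0, 75.0],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# For every feature, which cluster has the highest and lowest centroid value.
extremes = pd.DataFrame({
    "highest": toy_centroids.idxmax(),
    "lowest": toy_centroids.idxmin(),
})

# Rank of each cluster within every feature (1 = highest value).
ranks = toy_centroids.rank(ascending=False).astype(int)

print(extremes)
print(ranks)
```

The rank table makes the "relatively high" and "relatively low" descriptions explicit, since a middle rank can be distinguished from an extreme one at a glance.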

### Analysis Per Genre

Due to the wide variety of genres, we'll only be looking at and analyzing the top 3 genres with the highest count of tracks.

In [None]:
print(genre_df['top_genre'].value_counts())

Let's first look at the genre `Album Rock`.

In [None]:
test.loc[test["Genre"]=="album rock"]

In [None]:
plt.bar(test.loc[test["Genre"]=="album rock"]['Cluster'],test.loc[test["Genre"]=="album rock"]['Count'])
plt.show()

122 of the `album rock` tracks went to `Cluster 3`, making it the most prominent cluster for the genre. The genre was least prominent in `Cluster 1`.

Next, let's look at `Adult Standards`

In [None]:
test.loc[test["Genre"]=="adult standards"]

In [None]:
plt.bar(test.loc[test["Genre"]=="adult standards"]['Cluster'],test.loc[test["Genre"]=="adult standards"]['Count'])
plt.show()

The majority of songs belonging to the `adult standards` genre were grouped into `Cluster 1`. Songs from this genre had no presence whatsoever in `Cluster 0`.

Lastly, let's look at `dutch pop`.

In [None]:
test.loc[test["Genre"]=="dutch pop"]

In [None]:
plt.bar(test.loc[test["Genre"]=="dutch pop"]['Cluster'],test.loc[test["Genre"]=="dutch pop"]['Count'])
plt.show()

`dutch pop` was most prominent in `Cluster 0` with 24 entries, and least prominent in `Cluster 3` and `Cluster 2`.
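Raw counts favor large genres, so it can also help to look at what share of each genre lands in each cluster. The sketch below pivots a results table into a genre-by-cluster matrix and normalizes each row to sum to 1. It assumes the same `Cluster`, `Genre`, and `Count` columns as the results table above; the numbers are made up and do not reproduce our results.

```python
import pandas as pd

# Toy stand-in for the results table; the counts are invented.
toy_results = pd.DataFrame({
    "Cluster": [0, 1, 2, 1, 2, 0, 1],
    "Genre": ["album rock", "album rock", "album rock",
              "adult standards", "adult standards",
              "dutch pop", "dutch pop"],
    "Count": [50, 20, 30, 30, 10, 15, 5],
})

# Genre-by-cluster count matrix; missing combinations become 0.
counts = toy_results.pivot_table(
    index="Genre", columns="Cluster", values="Count",
    aggfunc="sum", fill_value=0,
)

# Divide each row by its total to get the share of the genre per cluster.
shares = counts.div(counts.sum(axis=1), axis=0)

# Cluster holding the largest share of each genre.
top_cluster = shares.idxmax(axis=1)

print(shares.round(2))
print(top_cluster)
```

Read this way, a statement like "over 40% of a genre sits in one cluster" corresponds to a single cell of the `shares` table.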

## Insights and Conclusion

Based on our statistical inference and clustering analysis, we've been able to draw tentative conclusions about which genres are similar based on some of the sound features used by Spotify, such as `energy`, `danceability`, `acousticness`, `valence`, and `speechiness`.

We found it quite difficult to define and highlight a relationship of similarity between two or more genres, due to the variety of songs each of them covers. However, we believe we've found sufficient evidence that certain genres can be grouped together based on these sound attributes.

Cluster 0 was heavily dominated by `rock` and `metal` genres, as well as genres that tend to have high `energy` and low `acousticness`. This included album rock and alternative metal. 

On the other hand, slower and more emotional songs were most prominent in Cluster 1, which holds over 40% of the songs from the `adult standards` genre. It's also home to a large share of the `dutch indie` and `dutch pop` tracks.

Cluster 2 can be described as more "musical-ish", with very upbeat and acoustic tones that make its songs easy to dance to. A large portion of the `dutch cabaret` and `british invasion` tracks can also be found here.

In Cluster 3, you can expect to find more electronic, upbeat, and danceable genres such as `dance pop`, `album rock`, and `dance rock`.

Lastly, Cluster 4 is where you can find genres with songs that tend to be slower and moodier in nature, such as `dutch pop` and `adult standards`.

Among all the genres, though, `album rock` topped nearly all of the clusters. The fact that it is the genre with the most tracks in the dataset may have contributed to that. However, based on our research, `album rock` is itself a loosely bound genre. It doesn't necessarily refer to a specific style of rock, but rather to a format: rock songs that are part of an album. Because of this, the attributes of `album rock` tracks naturally vary, so some are more similar to heavier rock-oriented genres such as `alternative metal`, while others are closer to `pop` and its subgenres.

Another prominent genre among the clusters was `adult standards`. According to [Wikipedia](https://en.wikipedia.org/wiki/Adult_standards), the adult standards genre "is aimed at 'mature' adults, meaning mainly those people over 50 years of age, but it is mostly targeted for senior citizens." Similar to `album rock`, it can encompass a fairly wide range of genres as well, and is more indicative of its format than its musical qualities. However, artists commonly associated with `adult standards` such as Frank Sinatra and Ella Fitzgerald do have music which can be described as less "energetic" than music in later decades. As such, it's no wonder the genre was absent in `Cluster 0`, which was dominated by heavy rock genres.

Throughout the analysis, we realized that songs under a single genre can vary considerably in their sound features, a lot more than we initially expected. This affected both the results and the steps we needed to take to reach them. That said, although no genre is fully captured by a single cluster, the clustering process still helped us identify which genres are loosely similar. We did this by taking a closer look at the sound features of each cluster, which we obtained from the centroids produced during clustering. These sound features helped us better understand the relationships between the genres and gave us insight into how they could be perceived as similar.