# Spotify - All Time Top 2000s Mega Dataset
by: Bondoc, Alyana; Dalisay, Andres; To, Justin

We will be using the dataset [Spotify - All Time Top 2000s Mega Dataset](https://www.kaggle.com/datasets/iamsumat/spotify-top-2000s-mega-dataset) for this project. 

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import norm

%load_ext autoreload
%autoreload 2

## The Dataset

The dataset consists of a list of the top tracks released between 1956 and 2019 that are available on Spotify. It contains information about each track and how it scores across multiple sound features. The dataset was extracted from the “Top 2000s” playlist on Spotify. The data was then passed to PlaylistMachinery (@plamere), which retrieved the attributes of each song in the playlist. Sumat Singh then scraped the data from the site using Python with Selenium to form the dataset that the group will be using for the analysis.

The CSV file has 15 columns, each representing an attribute, and 1995 lines: one header line with the column names and 1994 lines that each represent an observation. Once loaded, the DataFrame therefore has a shape of (1994, 15), since the header is not counted as a row.

The columns/variables are:

- **`Index`**: Row identifier of the track.
- **`Title`**: Title of the track.
- **`Artist`**: Artist/group who made the track.
- **`Top Genre`**: Genre of the track.
- **`Year`**: Year the track was released.
- **`Length (Duration)`**: Duration of the track in seconds.
- **`Beats Per Minute (BPM)`**: Average number of beats per minute of the track.
- **`Energy`**: Scale measuring how energetic and upbeat the track is. The higher the value, the more energetic.
- **`Danceability`**: Scale measuring how suitable the track is for dancing. The higher the value, the more danceable.
- **`Loudness (dB)`**: How loud the track is, measured in decibels. The higher the value, the louder.
- **`Valence`**: Scale measuring the mood of the song, whether it be positive or negative. The higher the value, the more positive.
- **`Acousticness`**: Scale measuring how acoustic the track is. The higher the value, the more acoustic.
- **`Speechiness`**: Scale measuring how much spoken word the track contains. The higher the value, the more spoken word it contains.
- **`Liveness`**: Scale measuring the likelihood that the track is a live recording. The higher the value, the more likely it is that the track was recorded live.
- **`Popularity`**: Scale measuring how popular the song is. The higher the value, the more popular the song is as of 2019.

All of this data was taken by PlaylistMachinery from the Spotify API. More detailed descriptions of the audio features can be seen in [Spotify's Web API Documentation.](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

The dataset was provided as a `.csv` file.

In [None]:
df = pd.read_csv("Spotify-2000.csv")
df

In [None]:
df.describe()

Let's check the data types of the variables.

In [None]:
print(df.dtypes)

## Cleaning the Dataset

Before we can start exploring our dataset, we first have to preprocess the data by cleaning and formatting it into a more usable form.

First, we want to convert each column name into snake_case format so that it'll be much easier to refer to them in our code.

In [None]:
list(df.columns)

In [None]:
df.rename(columns = {
    'Index':'index',
    'Title': 'title',
    'Artist': 'artist',
    'Top Genre': 'top_genre',
    'Year': 'year',
    'Beats Per Minute (BPM)': 'bpm',
    'Energy': 'energy',
    'Danceability': 'danceability',
    'Loudness (dB)': 'loudness',
    'Liveness': 'liveness',
    'Valence': 'valence',
    'Length (Duration)': 'length',
    'Acousticness': 'acousticness',
    'Speechiness': 'speechiness',
    'Popularity': 'popularity'
}, inplace=True)

list(df.columns)

Looking at the datatypes of each column, the `title`, `artist`, `top_genre`, and `length` columns all had the `object` datatype, while the rest of the columns were already stored as integers.

In [None]:
print(df.dtypes)

The former three mentioned were converted into Strings, while the `length` column had to be converted into integers. However, the `length` column had some values with 4 digits that were formatted with a comma as seen in the example below:

$$ 1,412 $$

The column couldn’t be converted into an integer datatype with the commas in its values, so the commas were removed before being converted into an integer.

In [None]:
df['title'] = df['title'].astype('string')
df['artist'] = df['artist'].astype('string')
df['top_genre'] = df['top_genre'].astype('string')

# Some length values have commas in them (e.g. 1,412), so remove the commas first before converting to int.
df['length'] = df['length'].str.replace(',', '', regex=False)
df['length'] = df['length'].astype('int64')

To check that the proper datatypes were used, the datatypes were printed out.

In [None]:
print(df.dtypes)
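
As a quick sanity check of the comma-removal step, the sketch below applies the same idea to a small, made-up Series. The values in `raw_length` are hypothetical and only mimic the format of the raw `length` column; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values mimicking the raw `length` column.
raw_length = pd.Series(['201', '1,412', '365'])

# Strip the thousands separator as a literal string, then cast to integers.
clean_length = raw_length.str.replace(',', '', regex=False).astype('int64')

print(clean_length.tolist())
print(clean_length.dtype)
```

An alternative that could be considered is passing `thousands=','` to `pd.read_csv`, which asks pandas to handle the separator while parsing the file.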

Next, we thought that specific years wouldn't be too useful in our analysis, so we decided to bin them into decades; that is, year ranges such as 1960-1969, 2000-2009, etc. The only bin that wouldn't be complete would be the 1950-1959 bin, since the earliest year recorded in the data was 1956. We decided to make it into a 1956-1959 bin instead of dropping all records with those years altogether because we thought the song attributes for those songs could still be useful.

In [None]:
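# right = False makes each bin include its first year and exclude its upper edge, e.g. [1960, 1970),
# so that 1960 falls into '1960-1969' rather than '1956-1959'. The last edge is 2020 so that 2019 is included.
df['period'] = pd.cut(x = df['year'], bins = [1956, 1960, 1970, 1980, 1990, 2000, 2010, 2020], labels = ['1956-1959', '1960-1969', '1970-1979', '1980-1989', '1990-1999', '2000-2009','2010-2019'], right = False)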
df = df[['index', 'title', 'artist', 'top_genre', 'period', 'year', 'bpm', 'energy', 'danceability', 'loudness', 'liveness', 'valence', 'length', 'acousticness', 'speechiness', 'popularity']]
df
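
Because decade labels such as `1960-1969` are meant to include their first year, it is worth checking how `pd.cut` treats values that fall exactly on a bin edge. The sketch below assumes left-closed bins (`right=False`) and uses a handful of hypothetical years chosen to sit on or next to the edges; the names `years`, `edges`, and `period` are illustrative only.

```python
import pandas as pd

# Hypothetical years chosen to sit on or next to the bin edges.
years = pd.Series([1956, 1959, 1960, 1969, 1970, 2009, 2010, 2019])

edges = [1956, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = ['1956-1959', '1960-1969', '1970-1979', '1980-1989',
          '1990-1999', '2000-2009', '2010-2019']

# right=False makes every bin left-closed: [1956, 1960), [1960, 1970), ...
period = pd.cut(years, bins=edges, labels=labels, right=False)

print(pd.DataFrame({'year': years, 'period': period}))
```

With the default `right=True`, each bin would instead include its upper edge, so a year such as 1960 would be assigned to the bin ending at 1960 rather than the one starting there.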

# Exploratory Data Analysis

## What is the average danceability of each genre?

To answer this question, the variables of interest are:
- **`top_genre`**: Genre of the track.
- **`danceability`**: Scale measuring how usable the track is for dancing. The higher the value, the more danceable.

First, we make a sliced copy of the main DataFrame with only the `top_genre` and `danceability` columns.

In [None]:
genre_dance = df[['top_genre', 'danceability']].copy()
genre_dance

Let's take a look at the distribution of danceability values.

In [None]:
genre_dance.hist('danceability', edgecolor='w', figsize=(8,4))

In [None]:
genre_dance['danceability'].skew()

As we can see from the histogram and the skew value, the distribution is roughly symmetric and close to normal. This means the mean is a reasonable measure of the average danceability across all the songs in the dataset.

In [None]:
genre_dance.agg({'danceability' : ['mean', 'median', 'std']})

The average danceability of all the songs in the dataset is a value of **53.24**.

Next, let's group all the records by genre and take some summary statistics from them.

In [None]:
genre_dance_summary = genre_dance.groupby('top_genre').agg({'danceability': ['skew', 'mean', 'median', 'std', 'count']})

genre_dance_summary.sort_values(('danceability', 'mean'), ascending=False).head(10)

As there are some genres in the dataset that are represented by very few songs, they may not be a very good representation of other songs in that genre. So let's remove those first by using `.dropna()` to take out all the records with `NaN` values in the `std` and `skew` columns. (They're `NaN` because `std` cannot be computed from fewer than two values, and `skew` cannot be computed from fewer than three.)

In [None]:
genre_dance_summary.sort_values(('danceability', 'mean'), ascending=False).dropna().head(10)
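
To see which genres `.dropna()` actually removes, the sketch below builds a tiny, hypothetical DataFrame (`toy`) with genres of one, two, and three songs and computes the same kind of summary. The genre names and danceability values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical genres with 1, 2, and 3 songs respectively.
toy = pd.DataFrame({
    'top_genre': ['a', 'b', 'b', 'c', 'c', 'c'],
    'danceability': [50, 40, 60, 30, 50, 90],
})

summary = toy.groupby('top_genre')['danceability'].agg(['skew', 'std', 'count'])
print(summary)

# Only genres where every statistic could be computed survive dropna().
print(summary.dropna().index.tolist())
```

In this toy case, the single-song genre has no `std`, and both the one-song and two-song genres have no `skew`, so only the three-song genre remains after `.dropna()`.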

Here, we can see the top 10 song genres among those with enough records to compute every statistic. Note that the `skew` value was included in order to determine whether it would be more accurate to use the `mean` or `median` as the measure of central tendency.

From the results, we can see that `reggae fusion` has the highest average `danceability` value. However, it is only represented by a count of 4 songs. 

The next largest `danceability` average with a genre represented by >=10 songs would be `disco`. Because of its `skew` value of **-0.52**, the `median` may be more accurate to look at, with a value of **69.0**.

Now, let's take a look at the genre with the lowest average `danceability` value.

In [None]:
genre_dance_summary.sort_values(('danceability', 'mean'), ascending=True).dropna().head(10)

From the DataFrame above, we can see that `chanson`, a lyric-driven French song genre, has the lowest mean `danceability`, with a value of **35.00**.

Let's take a look at the first 10 values sorted by `count` and see what the averages for the high-count genres are.

In [None]:
genre_dance_summary.sort_values(('danceability', 'count'), ascending=False).dropna().head(10)

As we can see, `album rock` with the highest song count has a mean of **51.41**. `dance pop`, rather appropriately, has a relatively high danceability median of **64.0**.

## Which Attributes and Sound Features are Correlated?

To answer this, we first slice the DataFrame to keep only the sound features, then compute the mean of each feature per `top_genre`. Afterwards, we apply pandas' **.corr()** function to get the correlation value between each pair of features.


In [None]:
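genre_attributes = df.copy()
# Keep only top_genre and the sound features.
genre_attributes.drop(['index','period','artist','popularity','title','year'], axis=1, inplace=True)
# Average every feature within each genre.
genre_attributes_summary = genre_attributes.groupby('top_genre').agg('mean')
# dropna() returns a new DataFrame, so the result has to be assigned back.
genre_attributes_summary = genre_attributes_summary.dropna()
genre_attributes_summary.corr()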
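
A full correlation matrix can be hard to scan by eye, so one possible way to surface the strongest relationships is to flatten it into a ranked list of feature pairs. The sketch below does this on a small, hypothetical table of per-genre means (`toy`); the numbers are invented and only the technique is meant to carry over.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for per-genre feature means.
toy = pd.DataFrame({
    'energy': [80, 60, 40, 20],
    'loudness': [-5, -7, -10, -14],
    'acousticness': [5, 20, 45, 80],
})
corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by the strength of the correlation, regardless of sign.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.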

The correlation matrix shows that several pairs of sound features are correlated, such as `energy` and `loudness`, which have a positive correlation. Let's take a look at the scatterplot for these features.

In [None]:
genre_attributes[['energy', 'loudness']].plot.scatter(x='energy', y='loudness', alpha=0.5)

As you can see, there is a distinct upwards trend, denoting a positive correlation. This means that, usually, the louder the song is, the more energy it's perceived to have.

Another one is `acousticness` and `energy` with a negative correlation. Let's see the scatterplot for these two.

In [None]:
genre_attributes[['acousticness', 'energy']].plot.scatter(x='acousticness', y='energy', alpha=0.5)

Given this negative correlation, we can say that the perceived energy in a song tends to decrease the more it relies on acoustic elements or instruments rather than electronic ones.

`acousticness` and `loudness` also have a negative correlation between them. Let's see the scatterplot.

In [None]:
genre_attributes[['acousticness', 'loudness']].plot.scatter(x='acousticness', y='loudness', alpha=0.5)

From this, similar to the relationship between `acousticness` and `energy`, the more acoustic elements there are in a song, the quieter it tends to be. This could be because most acoustic instruments are usually limited in terms of how loud a sound they can produce.

Lastly, there is a slightly positive correlation between `valence` and `danceability`.

In [None]:
genre_attributes[['valence', 'danceability']].plot.scatter(x='valence', y='danceability', alpha=0.5)

We can infer from this that danceable songs tend to be "happier" sounding.

## What are the energy level intervals of all the songs in the dataset?

To answer this question, the variables of interest are:
- **`energy`**: Scale measuring how energetic and upbeat the track is. The higher the value, the more energetic.

Since the observations are independent and the sample size is at least 30, the group will find the confidence interval for the mean energy level of the songs in the dataset.

First, let's check whether the distribution of values in energy is approximately normal.

In [None]:
df['energy'].skew()

Since the value is negative but still close to zero, it can be said that the distribution is slightly negatively skewed. Let's validate this using a histogram.

In [None]:
df.hist('energy', edgecolor='w', figsize=(8,4))

With this, it can be said that the sample is approximately symmetric.

Now, let's take a look at energy's mean, median, and standard deviation to get the confidence interval.

In [None]:
energy_agg = df.agg({"energy": ["mean", "median", "std"]})

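# Select by row and column label; positional [0] on a labeled Series is deprecated in recent pandas.
sample_mean = energy_agg.loc["mean", "energy"]
sample_median = energy_agg.loc["median", "energy"]
sample_std = energy_agg.loc["std", "energy"]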

energy_agg

The mean, median, and standard deviation are: **59.68**, **61.00**, and **22.15**.

## Confidence Interval

A confidence interval for a population mean is of the following form:

$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}$$

Where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $z^*$, also known as the **critical value**, is the z-score that bounds the middle 95% of the standard normal distribution.

The critical value for a 95% confidence level is **1.96**. Let's validate this using the code below.

In [None]:
z_star_95 = norm.ppf(0.975)
print('{:.2f}'.format(z_star_95))

## Margin of Error

We can compute the **margin of error** using the formula

$$z^* \frac{s}{\sqrt{n}}$$

Compute and display the margin of error given a 95% confidence level.

In [None]:
n = df['index'].count()

margin_of_error = z_star_95 * (sample_std / np.sqrt(n))
print('Margin of Error: {:.2f}'.format(margin_of_error))

Now, let's compute the 95% confidence interval for the mean energy of the dataset. The 95% confidence interval is the sample mean $\pm$ the margin of error.

In [None]:
minimum_value = sample_mean - margin_of_error
maximum_value = sample_mean + margin_of_error
print('({:.2f},'.format(minimum_value), '{:.2f})'.format(maximum_value))

With this, even though we do not have the complete list of popular songs on Spotify, we are 95% confident that the true average energy level is between **58.71** and **60.65**.
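
The steps above can be collected into one small helper. The sketch below is a minimal, standard-library version of the same z-based interval; the function name `mean_confidence_interval` is hypothetical, and `NormalDist().inv_cdf` is used here in place of `norm.ppf` only to keep the example free of third-party imports. The inputs are the rounded mean and standard deviation reported earlier, together with the 1994 observations in the dataset.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, std, n, confidence=0.95):
    """Return (lower, upper) for a z-based confidence interval of a mean."""
    # Two-sided critical value, e.g. the 0.975 quantile for 95% confidence.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Margin of error: z* times the standard error of the mean.
    margin = z_star * std / math.sqrt(n)
    return mean - margin, mean + margin


# Rounded summary statistics for energy, as reported above.
low, high = mean_confidence_interval(59.68, 22.15, 1994)
print('({:.2f}, {:.2f})'.format(low, high))  # prints (58.71, 60.65)
```

Passing a different `confidence` value, such as `0.99`, would widen the interval because the critical value grows.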

# Research Question

Based on the results of our EDAs, we found that the `danceability` feature had varying averages across different genres, with some genres having similar averages. This leads us to question whether we could possibly find similar average values for other song features across genres, and find out how many features between certain genres are similar.

We also found evidence that a couple of sound features had relationships between them. These correlations lead us to believe that these sound features could possibly be utilized to find relationships and similarities between the different genres as well.

We also discovered that the 95% confidence interval for the mean `energy` of all the songs across all genres in the dataset was between 58.71 and 60.65. To find out whether certain genres are similar to one another, it would be important to look at their confidence intervals. If the intervals of two genres overlap, this suggests that the difference between them may not be statistically significant.

Given all of this information we've acquired about our dataset, we have decided on the research question, **"Based on the attributes and features of songs, which music genres are similar (or dissimilar) to one another?"**

Music is arguably one of mankind's most significant innovations, as it is able to transcend visual barriers and limitations by stimulating our imagination and emotions through sound. From Beethoven's Waldstein, to Tupac's All Eyez on Me, to Taylor Swift's Midnight Rain, each of these works from completely different genres is able to tell a different story and provoke different emotions. With that said, given the number of songs and genres available, it can be quite difficult to find the right songs for a certain taste in music.

Being able to know which genres go well with or are similar to others can help the selection and filtering process. The group plans to answer this question with the use of cosine similarity and data mining techniques, such as clustering, using the given sound feature data per genre. The resulting data may also be used in recommender systems for music platforms, much like Spotify, to boost their efficiency by having the algorithm look for suggestions from genres in the same or closest clusters first, instead of searching for recommendations randomly across the different genres.
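
To illustrate what comparing genres with cosine similarity could look like, the sketch below computes it between a few hypothetical genre feature vectors. The genre names and values in `genre_features` are made up, and the helper `cosine_similarity` is a plain NumPy implementation written for this example rather than the group's final method.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical per-genre means for (energy, danceability, acousticness).
genre_features = {
    'genre_a': [80, 65, 10],
    'genre_b': [75, 60, 15],
    'genre_c': [25, 40, 85],
}

sim_ab = cosine_similarity(genre_features['genre_a'], genre_features['genre_b'])
sim_ac = cosine_similarity(genre_features['genre_a'], genre_features['genre_c'])

print('a vs b: {:.3f}'.format(sim_ab))
print('a vs c: {:.3f}'.format(sim_ac))
```

Values closer to 1 indicate feature vectors pointing in a similar direction. Since the real features are on different scales (for example, `loudness` is in decibels), standardizing them before computing similarities may be worth considering.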