# Spotify Dataset 1922-2021, ~600k Tracks
## by Abdulrahman Yaseen

## Investigation Overview

I wanted to find out which features best predict the popularity of the tracks in the dataset.

## Dataset Overview

The dataset contains 586,672 tracks with 19 features.

### Features:
- acousticness (Ranges from 0 to 1)
- danceability (Ranges from 0 to 1)
- energy (Ranges from 0 to 1)
- duration_ms (Integer typically ranging from 200k to 300k)
- instrumentalness (Ranges from 0 to 1)
- valence (Ranges from 0 to 1)
- popularity (Ranges from 0 to 100)
- tempo (Float typically ranging from 50 to 150)
- liveness (Ranges from 0 to 1)
- loudness (Float typically ranging from -60 to 0)
- speechiness (Ranges from 0 to 1)
- mode (0 = Minor, 1 = Major)
- explicit (0 = No explicit content, 1 = Explicit content)
- key (All keys of the octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, and so on…)
- timesignature (The predicted time signature, most typically 4)
- artists (List of artists mentioned)
- release_date (Date of release, mostly in yyyy-mm-dd format; the precision of the date may vary)
- name (Name of the song)

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [None]:
# load the dataset into a pandas DataFrame
df = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/tracks.csv")

In [None]:
# Work on a copy so the raw data stays untouched
df_clean = df.copy()
# Replace the placeholder "['']" in the artists column with NaN
df_clean['artists'] = df_clean['artists'].replace("['']", np.nan)
# Convert the release date column to datetime
df_clean['release_date'] = pd.to_datetime(df_clean['release_date'])
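
Since the precision of `release_date` varies, here is a minimal sketch of one way to get the release year without relying on date parsing. The helper name `release_year` is hypothetical. It assumes the raw column holds strings that always start with a four-digit year and has no missing values.

```python
import pandas as pd

def release_year(dates: pd.Series) -> pd.Series:
    """Extract the year from date strings of varying precision (hypothetical helper)."""
    # The first four characters hold the year in "yyyy", "yyyy-mm"
    # and "yyyy-mm-dd" strings alike.
    return dates.astype(str).str.slice(0, 4).astype(int)

# Toy values for illustration only, not rows from the dataset
sample = pd.Series(["1922", "1985-06", "2020-03-27"])
years = release_year(sample)
print(years.tolist())
```

In the notebook this would be applied to the raw string column, for example `release_year(df['release_date'])`, before the column is converted to datetime.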


## Popularity Distribution

> Popularity is right-skewed. It is calculated from the total number of plays a track has had and how recent those plays are, so it is reasonable that most tracks have a popularity of zero and that no tracks have a popularity of 100.

In [None]:
# start with a standard-scaled plot
binsize = 1
bins = np.arange(0, df_clean['popularity'].max()+binsize, binsize)

plt.figure(figsize=[14.70, 8.27])
plt.hist(data = df_clean, x = 'popularity', bins = bins)
plt.title('Popularity Distribution')
plt.xlabel('popularity value between 0 and 100, with 100 being the most popular.')
plt.ylabel('Count')
plt.show()
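
The skew seen in the histogram can also be checked numerically. The sketch below is an assumption about how one might do that, not part of the original analysis. The helper `popularity_summary` is hypothetical and the toy values are made up.

```python
import pandas as pd

def popularity_summary(popularity: pd.Series) -> dict:
    """Summarise the shape of a popularity column (hypothetical helper)."""
    return {
        # Positive skewness indicates a long tail to the right
        "skew": popularity.skew(),
        # Fraction of tracks with a popularity of exactly zero
        "share_zero": (popularity == 0).mean(),
        # Highest popularity value present
        "max": popularity.max(),
    }

# Toy values for illustration only
toy_popularity = pd.Series([0, 0, 0, 0, 10, 20, 60])
summary = popularity_summary(toy_popularity)
print(summary)
```

On the real data, the call would be `popularity_summary(df_clean['popularity'])`.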

## Duration Distribution

> Durations range from about 100,000 ms to 500,000 ms, with a peak around 200,000 ms (3.3 minutes).
>
> Fun fact: the dataset holds 454.06 days of music, or 1.24 years of continuous listening. In other words, it would take you 1.24 years to listen to all the songs.

In [None]:
# start with a standard-scaled plot
bins = np.arange(0, 1000000, 10000)

plt.figure(figsize=[14.70, 8.27])
plt.hist(data = df_clean, x = 'duration_ms', bins = bins)
plt.title('Duration Distribution')
plt.xlabel('duration_ms')
plt.ylabel('Count')
plt.show()
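
A total listening time can be derived from `duration_ms` by summing the column and converting milliseconds to days. The sketch below shows the arithmetic on toy values. The helper name `total_listening_time` and the 365-day year are assumptions.

```python
import pandas as pd

MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day

def total_listening_time(duration_ms: pd.Series):
    """Return (days, years) of continuous listening (hypothetical helper)."""
    days = duration_ms.sum() / MS_PER_DAY
    years = days / 365  # assumes a 365-day year
    return days, years

# Toy values for illustration only: one full day plus half a day
toy_durations = pd.Series([86_400_000, 43_200_000])
days, years = total_listening_time(toy_durations)
print(f"{days:.2f} days, {years:.4f} years")
```

Running `total_listening_time(df_clean['duration_ms'])` on the loaded data is a quick way to recompute the figure for whichever version of the dataset is in use.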

## Track Releases Over Years

> The number of tracks released per year has increased over time

In [None]:
# Extract the release year from the datetime column
df_clean['year'] = df_clean['release_date'].dt.year
base_color = sb.color_palette()[0]
plt.figure(figsize = [20, 8.27])
plt.xticks(rotation=90)
plt.title('Track Releases Over Years')
# Use the `color` argument
sb.countplot(data=df_clean, x='year', color=base_color);
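
The counts behind the bar chart can be tabulated directly, which makes it easier to quote a typical number of releases per year. This is a sketch on toy values. The helper name `tracks_per_year` is hypothetical.

```python
import pandas as pd

def tracks_per_year(years: pd.Series) -> pd.Series:
    """Count tracks per release year, ordered by year (hypothetical helper)."""
    return years.value_counts().sort_index()

# Toy values for illustration only
toy_years = pd.Series([1999, 2000, 2000, 2001, 2001, 2001])
counts = tracks_per_year(toy_years)
print(counts)
print("median tracks per year:", counts.median())
```

On the real data, `tracks_per_year(df_clean['year'])` gives the same numbers that `sb.countplot` draws.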

## The First Feature Affecting Popularity

> The more energetic a track is, the more popular it tends to be

In [None]:
# Scatter plot of energy against popularity
plt.subplots(1,1,figsize=(14.70, 8.27))
ax1 = sb.regplot(data = df_clean.sample(500), x = 'energy', y = 'popularity');
ax1.set_title('Correlation between energy and popularity');
ax1.set_xlabel('energy (Ranges from 0 to 1)');

## The Second Feature Affecting Popularity

> The louder a track is, the more popular it tends to be

In [None]:
# Scatter plot of loudness against popularity
plt.subplots(1,1,figsize=(14.70, 8.27))
ax2 = sb.regplot(data = df_clean.sample(500), x = 'loudness', y = 'popularity');
ax2.set_title('Correlation between loudness and popularity');
ax2.set_xlabel('loudness (Float typically ranging from -60 to 0)');
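
The regression plots show the direction of each relationship. A correlation coefficient puts a number on its strength. The sketch below uses `DataFrame.corrwith` on synthetic data generated only for illustration, so the printed values say nothing about the real dataset. The helper name `correlation_with_popularity` is hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_with_popularity(df: pd.DataFrame, features: list) -> pd.Series:
    """Pearson correlation of each feature with popularity (hypothetical helper)."""
    return df[features].corrwith(df["popularity"])

# Synthetic data for illustration only: popularity and loudness are
# both constructed to rise with energy, plus some random noise.
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -60 + 60 * energy + rng.normal(0, 5, 200),
    "popularity": 100 * energy + rng.normal(0, 10, 200),
})

r = correlation_with_popularity(toy, ["energy", "loudness"])
print(r)
```

On the real data, the call would be `correlation_with_popularity(df_clean, ['energy', 'loudness'])`. Using the full frame rather than a sample of 500 rows avoids results that change from run to run.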

## Audio Characteristics Over Years

> Tracks have become more energetic and danceable in recent years, and less acoustic. Loudness and tempo have also increased. Also, the more energetic and louder a track is, the more popular it tends to be.

In [None]:
# Audio characteristics over year
plt.figure(figsize=(14.70, 8.27))
sb.set(style="whitegrid")
columns = ["acousticness","danceability","energy","speechiness","liveness","valence"]
for col in columns:
    x = df_clean.groupby("year")[col].mean()
    ax= sb.lineplot(x=x.index,y=x,label=col);
ax.set_title('Audio characteristics over year');
ax.set_ylabel('Measure');
ax.set_xlabel('Year');

### A closer look at the effect of energy, taking loudness range and year into account
> "popularity" vs. "energy", colored by "year" and sized by "loudness_range"

In [None]:
# Bin loudness into four quantile-based ranges
df_clean['loudness_range'] = pd.qcut(df_clean['loudness'], q=4)
# "popularity" vs. "energy", colored by "year" and sized by "loudness_range"
plt.subplots(1,1,figsize=(14.70, 8.27))
sb.scatterplot(x="energy", y="popularity",
                hue="year", size="loudness_range",
                palette="ch:r=-.2,d=.3_r",
                sizes=(20, 200), linewidth=0,
                data=df_clean.sample(500)).set(title='...energy...');
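
The `loudness_range` column comes from `pd.qcut`, which splits the values at quantiles so that each bin holds roughly the same number of tracks. The small sketch below shows this on made-up loudness values, which are not taken from the dataset.

```python
import pandas as pd

# Made-up loudness values in dB, for illustration only
loudness = pd.Series([-40.0, -25.0, -18.0, -12.0, -10.0, -8.0, -6.0, -3.0])

# q=4 cuts at the 25th, 50th and 75th percentiles, giving four intervals
quartile = pd.qcut(loudness, q=4)

# Each interval ends up with the same number of values
bin_counts = quartile.value_counts().sort_index()
print(bin_counts)
```

Because the bins are intervals rather than numbers, the scatter plot treats `loudness_range` as a category with four marker sizes.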

### A closer look at the effect of danceability, taking loudness range and year into account
> "popularity" vs. "danceability", colored by "year" and sized by "loudness_range"

In [None]:

# "popularity" vs. "danceability", colored by "year" and sized by "loudness_range"
plt.subplots(1,1,figsize=(14.70, 8.27))
sb.scatterplot(x="danceability", y="popularity",
                hue="year", size="loudness_range",
                palette="ch:r=-.2,d=.3_r",
                sizes=(20, 200), linewidth=0,
                data=df_clean.sample(500)).set(title='...danceability...');

### Observations
- For most years, roughly 2,000 songs were added to Spotify

- Tracks have become more energetic and danceable in recent years. Loudness and tempo have also increased.

- The tracks have become less "Acoustic"

- The dataset holds 454.06 days of music, or 1.24 years of continuous listening. In other words, it would take you 1.24 years to listen to all the songs

- Valence and danceability are highly related, and so are speechiness and danceability
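
One way to check which audio features move together is to rank feature pairs by the absolute value of their correlation. The sketch below does this on synthetic data, where only valence and danceability are built to be related, so its output does not describe the real dataset. The helper name `strongest_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df: pd.DataFrame, features: list) -> pd.Series:
    """Rank feature pairs by absolute Pearson correlation (hypothetical helper)."""
    corr = df[features].corr()
    # Keep only the upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=abs, ascending=False)

# Synthetic data for illustration only
rng = np.random.default_rng(1)
valence = rng.uniform(0, 1, 300)
toy = pd.DataFrame({
    "valence": valence,
    "danceability": 0.6 * valence + rng.normal(0, 0.1, 300),
    "speechiness": rng.uniform(0, 1, 300),
})

pairs = strongest_pairs(toy, ["valence", "danceability", "speechiness"])
print(pairs)
```

On the real data, the call would be `strongest_pairs(df_clean, ['valence', 'danceability', 'speechiness'])`.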