## Basic Setup and Cleaning

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set_theme() # set the default Seaborn style for graphics

In [None]:
# Import dataset and check basic information
full_dataset = pd.read_csv('datasets/spotify_songs.csv')
print(full_dataset.shape)
# print(full_dataset.dtypes)
# full_dataset.head()
# full_dataset.describe()
full_dataset.info()

The `.info()` output above shows that some records in `full_dataset` have missing (`NA`) values in certain columns. Since these are few compared to the size of the dataset, we drop those records from `full_dataset`.

In [None]:
# Apply the dropna() function to remove records with missing values
# then check the information of the cleaned dataset
full_dataset.dropna(inplace=True)
full_dataset.info()
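
To see which columns are responsible for the dropped records, the missing values can be counted per column before calling `dropna()`. The following is a minimal sketch on a toy frame; the rows are made up for illustration and are not taken from `spotify_songs.csv`.

```python
import pandas as pd

# Toy frame standing in for the Spotify data; the values are made up.
toy = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "track_artist": ["x", "y", None, "w"],
    "track_popularity": [10, 20, 30, 40],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace=True returns a new frame;
# any row with at least one missing value is removed
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Running the same two checks on `full_dataset` would show how many records the cleaning step removes and from which columns.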

## Define the Popularity Levels

Since our goal is to roughly predict the popularity, we need to divide `track_popularity` into several levels. To do this, let's first perform some EDA to get a general picture of how this variable is distributed.

In [None]:
# Extract the track_popularity column and check its distribution
popularity = pd.DataFrame(full_dataset["track_popularity"])
print(popularity.describe())
print(popularity.value_counts())

f, axes = plt.subplots(3, 1, figsize=(24, 12), sharex=True)
sb.boxplot(data = popularity, orient = "h", ax=axes[0])
sb.histplot(data = popularity, kde = True, ax=axes[1])
sb.violinplot(data = popularity, orient = "h", ax=axes[2])

From the plots and `.value_counts()` above, we can see that there is an unexpectedly high number of records with `track_popularity <= 1`. Let's assume this is some sort of anomaly and remove these records from `full_dataset`, so that the mean and standard deviation we calculate are more credible.

In [None]:
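# Remove records whose popularity is lower than or equal to 1
# .copy() gives an independent DataFrame, so adding a column later
# does not trigger a SettingWithCopyWarning
full_dataset = full_dataset[full_dataset['track_popularity'] > 1].copy()
popularity = full_dataset["track_popularity"]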

# Check the skewness and distribution of the track_popularity column
from scipy.stats import skew

print("Skewness of popularity:", skew(popularity))
print(popularity.value_counts())
f, axes = plt.subplots(3, 1, figsize=(24, 12), sharex=True)
sb.boxplot(data = popularity, orient = "h", ax=axes[0])
sb.histplot(data = popularity, kde = True, ax=axes[1])
sb.violinplot(data = popularity, orient = "h", ax=axes[2])
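
For reading the skewness printed by the cell above: a value near 0 indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail. The sketch below computes the biased Fisher-Pearson coefficient $g_1 = m_3 / m_2^{3/2}$ by hand, which to my understanding is what `scipy.stats.skew` returns with its default `bias=True`. The helper name `sample_skewness` and the two toy lists are hypothetical.

```python
import numpy as np

def sample_skewness(values):
    """Biased Fisher-Pearson skewness: g1 = m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # mirror image around its mean
right_tailed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```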

From the calculated skewness and the plots above, we can see that **`track_popularity` is roughly normally distributed** if we ignore the clustering at the low end.

Thus, we can divide `track_popularity` into **6 levels** -- `very_low`, `low`, `somewhat_low`, `somewhat_high`, `high`, and `very_high` -- with cut points at the mean and at one and two standard deviations on either side of it.

In [None]:
# Calculate the mean and standard deviation of track_popularity
mean = popularity.mean()
std = popularity.std()

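# Define the level divisions (each name is the upper edge of a bin,
# except very_high, which is the lower edge of the top bin)
very_low = mean - 2 * std
low = mean - std
medium = mean
high = mean + std
very_high = mean + 2 * std

# Create a new column "popularity_level" based on the level divisions
# The outer edges are open-ended, so the bins stay increasing even if
# mean - 2 * std falls below 0
full_dataset["popularity_level"] = pd.cut(
    full_dataset["track_popularity"],
    bins=[float('-inf'), very_low, low, medium, high, very_high, float('inf')],
    labels=["very_low", "low", "somewhat_low", "somewhat_high", "high", "very_high"],
)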

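# Check the distribution of popularity levels
# rename() + to_frame() gives a "count" column regardless of how the
# pandas version in use names the value_counts() result
popularity_level = full_dataset["popularity_level"].value_counts(sort=False).rename("count").to_frame()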
popularity_level["density"] = popularity_level["count"] / len(full_dataset)
popularity_level
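
The binning above can be tried in isolation on synthetic data. The following sketch assumes normally distributed scores (mean 50, standard deviation 15, both chosen arbitrarily) and uses open-ended outer edges, which keeps the bin edges increasing even if `mean - 2 * std` were to fall below the smallest score. For a normal distribution, the six shares should come out close to 2%, 14%, 34%, 34%, 14%, and 2%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic scores, clipped to the 2..100 range left after the filter above
scores = pd.Series(rng.normal(50, 15, size=1000).clip(2, 100))

mean, std = scores.mean(), scores.std()

# Interior edges at mean - 2*std, ..., mean + 2*std; outer bins are open-ended
edges = [-np.inf] + [mean + k * std for k in (-2, -1, 0, 1, 2)] + [np.inf]
labels = ["very_low", "low", "somewhat_low", "somewhat_high", "high", "very_high"]

# pd.cut requires strictly increasing edges and one label per bin
levels = pd.cut(scores, bins=edges, labels=labels)

# Share of records per level, in label order
share = levels.value_counts(sort=False, normalize=True)
print(share)
```

Large deviations from those shares on the real data would suggest that the normality assumption is too rough for the outer levels.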

## Reduce Sample Size

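Currently the sample size is over 30k, and we choose to reduce it to 5k by simple random sampling. The last line of the cell below prints the share of each popularity level in the sample, so it can be compared with the densities computed for the full dataset.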

In [None]:
sampled_dataset = full_dataset.sample(n=5000)
sampled_dataset.to_csv('datasets/cleaned_dataset.out.csv', index=False)
print(sampled_dataset.shape)
print(sampled_dataset["popularity_level"].value_counts(sort=False) / len(sampled_dataset))
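
Note that `sample()` without a seed draws a different subset on every run, so the saved CSV changes each time the notebook is re-executed. A minimal sketch of a repeatable draw follows, assuming a fixed `random_state` is acceptable for this project. The toy frame and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame standing in for full_dataset; the values are made up.
toy = pd.DataFrame({
    "track_popularity": range(2, 102),
    "popularity_level": ["low"] * 30 + ["somewhat_low"] * 50 + ["high"] * 20,
})

# A fixed random_state makes the draw repeatable across runs
first = toy.sample(n=20, random_state=42)
second = toy.sample(n=20, random_state=42)
print(first.index.equals(second.index))

# Compare level shares in the sample against the full frame
full_share = toy["popularity_level"].value_counts(normalize=True)
sample_share = first["popularity_level"].value_counts(normalize=True)
comparison = pd.DataFrame({"full": full_share, "sample": sample_share}).fillna(0)
print(comparison)
```

If the shares in the sample drift noticeably from those in the full dataset, sampling within each level separately would be one alternative to consider.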