# Spotify Global Music Dataset

## 1. Installing & Importing Required Libraries

We first install and then import the libraries needed for data manipulation, visualization, and dataset downloading.

```pip install numpy```  
```pip install pandas```  
```pip install seaborn```  
```pip install matplotlib```  

```pip install kagglehub```  
```pip install kagglehub[pandas-datasets]```

In [None]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

## 2. Setting File Path

We specify the name of the data file in order to import it.

In [None]:
file_path = "track_data_final.csv"

## 3. Loading the Dataset

### 3.1. Loading the Dataset with KaggleHub

We first try KaggleHub, which loads the file directly from the online dataset _(code obtained from the Kaggle website)_.

In [None]:
# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "wardabilal/spotify-global-music-dataset-20092025",
  file_path,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

### 3.2. Loading the CSV Locally

If the KaggleHub import doesn't work, we can also import the dataset from a local CSV file.

In [None]:
df = pd.read_csv(f"./data/{file_path}")
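
The two loading options above can be combined so that the local CSV is only read when the download fails. The sketch below is one possible way to do that; `load_with_fallback` is a hypothetical helper (not part of KaggleHub or pandas), and the demo uses stand-in loaders so it runs without network access.

```python
import pandas as pd


def load_with_fallback(primary, fallback):
    """Return primary(); if it raises, report the error and return fallback()."""
    try:
        return primary()
    except Exception as error:
        # Broad on purpose: network, authentication, and missing-file
        # errors should all trigger the fallback loader.
        print(f"Primary loader failed ({error!r}); using the fallback.")
        return fallback()


def failing_loader():
    # Stand-in for a download that does not succeed
    raise ConnectionError("simulated download failure")


def local_loader():
    # Stand-in for reading a local CSV file
    return pd.DataFrame({"track_popularity": [50, 70]})


demo_df = load_with_fallback(failing_loader, local_loader)
print(demo_df)
```

In this notebook, `primary` could wrap the `kagglehub.dataset_load(...)` call in a `lambda`, and `fallback` could wrap the `pd.read_csv(...)` call.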

## 4. Cleaning the Data

We remove columns that add no value to our analysis. For instance, the track number and the album a track belongs to give no insight into how well the song performs (the album benefits from the track, not the other way around). We also remove the album's total track count, the artist genres (half of the artists don't have any), and the track name and ID.
<br>
We keep the release date, as it gives us a chronological indicator.

In [None]:
# Dropping columns that do not contribute to song performance analysis
df = df.drop(columns=[
    "track_number", "track_name", "track_id", "artist_genres",
    "album_id", "album_name", "album_total_tracks",
])

### 4.1. Checking for Missing Values

In [None]:
# Checking for missing values
df.isnull().sum()

### 4.2. Handling Missing Artist Information

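Some rows may have missing artist names or popularity metrics. <br>
Missing artist names and release dates are replaced by ```"unknown"```; missing artist popularity and follower counts are replaced by the most common value (the mode) of their column.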

In [None]:
# Identify missing artist names
missing_artist_name = df[df["artist_name"].isna()]
print(missing_artist_name)

In [None]:
# Filling missing numerical fields with their most common value
mode_artist_popularity = df["artist_popularity"].mode()[0]
df["artist_popularity"] = df["artist_popularity"].fillna(mode_artist_popularity)

mode_artist_followers = df["artist_followers"].mode()[0]
df["artist_followers"] = df["artist_followers"].fillna(mode_artist_followers)

# Replace missing string fields
df["artist_name"] = df["artist_name"].fillna("unknown")
df["album_release_date"] = df["album_release_date"].fillna("unknown")
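
To make the effect of the mode-based fill concrete, here is a small sketch on made-up data. The `toy` DataFrame and its values are purely illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing popularity value and one missing name
toy = pd.DataFrame({
    "artist_popularity": [60, 60, np.nan, 85],
    "artist_name": ["A", None, "C", "D"],
})

# mode() returns the most frequent value(s) in sorted order;
# [0] picks the smallest one if several values tie
mode_value = toy["artist_popularity"].mode()[0]
toy["artist_popularity"] = toy["artist_popularity"].fillna(mode_value)

# String fields get a placeholder instead of a statistic
toy["artist_name"] = toy["artist_name"].fillna("unknown")

print(toy)
```

Because the mode is taken from the observed values, the filled rows reinforce whatever value was already most common.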

### 4.3. Converting to Numeric Format

We need to convert artist names, album types, and explicit content to numerical categories.

In [None]:
# Inspect the unique string values before mapping them to identifiers
print(df["album_type"].unique())
print(df["artist_name"].unique())

In [None]:
# Mapping artist names into unique identifiers
conversion_artist = {}
unique_id = 0

for artist in df["artist_name"]:
    if artist not in conversion_artist:
        conversion_artist[artist] = unique_id
        unique_id += 1

df["artist_name"] = df["artist_name"].map(conversion_artist)
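
As a possible shortcut, `pd.factorize` assigns integer codes in order of first appearance, which is the same numbering the loop above produces. The sketch below compares both approaches on a made-up list of artist names (the names are illustrative only).

```python
import pandas as pd

# Illustrative artist names, not taken from the dataset
artists = pd.Series(["Adele", "Drake", "Adele", "unknown", "Drake"])

# Loop-based mapping: each new name gets the next free identifier
conversion = {}
for name in artists:
    if name not in conversion:
        conversion[name] = len(conversion)
loop_ids = artists.map(conversion)

# factorize returns the codes and the unique values in order of first appearance
codes, uniques = pd.factorize(artists)

print(list(loop_ids))
print(list(codes))
print(list(uniques))
```

Note that `pd.factorize` encodes missing values as -1 by default, so missing names should still be filled beforehand if every row needs a non-negative identifier.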

In [None]:
# Mapping album_type
df["album_type"] = df["album_type"].map({"compilation": 0, "single": 1, "album": 2})

In [None]:
# Mapping explicit
df["explicit"] = df["explicit"].map({False: 0, True: 1})

### 4.4. Decomposing the Date

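We split the release date into **year**, **month**, and **day** columns, padding any missing part with 0 (a date may contain only a year, or only a year and a month), and then remove the original date column.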

In [None]:
# Initialize year/month/day columns
df["year"] = [None] * len(df)
df["month"] = [None] * len(df)
df["day"] = [None] * len(df)

date_order = ["year", "month", "day"]

In [None]:
# Split each date ("YYYY", "YYYY-MM" or "YYYY-MM-DD") into up to three string parts
date_parts = df["album_release_date"].str.split("-", expand=True).reindex(columns=range(3))

# Store each part as an integer; missing parts and non-numeric values such as "unknown" become 0
for i, part in enumerate(date_order):
    df[part] = pd.to_numeric(date_parts[i], errors="coerce").fillna(0).astype(int)

df = df.drop("album_release_date", axis=1)
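
The padding rule can be illustrated on a few sample strings. The sketch below uses a hypothetical helper, `split_release_date`, and assumes that any non-numeric part (such as the `"unknown"` placeholder) should be stored as 0; the sample dates are made up.

```python
def split_release_date(value):
    """Return (year, month, day) as integers, using 0 for any missing or non-numeric part."""
    parts = str(value).split("-")[:3]
    parts += ["0"] * (3 - len(parts))  # pad partial dates such as "1975" or "2019-05"
    return tuple(int(part) if part.isdigit() else 0 for part in parts)


examples = ["2019-05-17", "2019-05", "1975", "unknown"]
results = {date: split_release_date(date) for date in examples}

for date, parts in results.items():
    print(date, "->", parts)
```

Keeping 0 as the marker for an unknown part makes it easy to filter those rows out later, for example with `year > 0`.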

______________________________________________________________________________________________________________________________________________________

1. Outlier Removal (remove extreme values)
2. Visualize data:
    1. &#x2611; Histograms
        1. Track Popularity
        2. Track Duration (ms)
        3. Artist Popularity
    2. &#x2610; Boxplots for outliers
    3. &#x2611; Bar graphs
        1. Number of tracks released per year
        2. Average track popularity per album type
    4. &#x2611; Scatter plots
        1. Track popularity vs. artist popularity
        2. Track popularity vs. artist followers
        3. Track popularity vs. duration
        4. Track popularity vs. release year
    5. &#x2611; Pie chart
        1. Percentage of explicit lyrics
        2. Distribution of album types
    6. &#x2611; Correlation Heatmap
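
The first item of the plan, outlier removal, does not name a method. One common option is the 1.5 × IQR rule, sketched below under that assumption; `remove_outliers_iqr` is a hypothetical helper and the durations are made-up values, not rows from the dataset.

```python
import pandas as pd


def remove_outliers_iqr(frame, column, factor=1.5):
    """Keep rows whose value lies within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return frame[frame[column].between(lower, upper)]


# Illustrative durations in milliseconds; the last one (an hour) is the extreme value
toy = pd.DataFrame({
    "track_duration_ms": [180_000, 200_000, 210_000, 220_000, 240_000, 3_600_000],
})

filtered = remove_outliers_iqr(toy, "track_duration_ms")
print(filtered)
```

The `factor` parameter controls how strict the filter is; a larger value keeps more of the extreme rows.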

In [None]:
# Histogram of track popularity
plt.figure(figsize=(13, 6))
plt.hist(df.track_popularity, edgecolor='black')
plt.xlabel("Track Popularity", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Distribution of Track Popularity", fontsize=20)
plt.show()

In [None]:
# Histogram of artist popularity
plt.figure(figsize=(13, 6))
plt.hist(df.artist_popularity, edgecolor='black', color='green')
plt.xlabel("Artist Popularity", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Distribution of Artist Popularity", fontsize=20)
plt.show()

In [None]:
# Histogram of track duration
plt.figure(figsize=(13, 6))
plt.hist(df.track_duration_ms, edgecolor='black', color='red')
plt.xlabel("Track Duration (ms)", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Distribution of Track Duration", fontsize=20)
plt.show()

In [None]:
# Bar chart of number of tracks released per year
plt.figure(figsize=(13, 6))
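years = pd.to_numeric(df.year, errors="coerce")  # Handles years stored as strings or integers
tracks_per_year = years[years > 0].astype(int).value_counts().sort_index()  # Exclude unknown years (0 or non-numeric)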
plt.bar(tracks_per_year.index, tracks_per_year.values, color='green', alpha=0.8)
plt.xticks(tracks_per_year.index[::5], rotation=45)
plt.xlabel("Release Year", fontsize=15)
plt.ylabel("Number of Tracks", fontsize=15)
plt.title("Number of Tracks Released per Year", fontsize=20)
plt.show()

In [None]:
# Bar chart of average track popularity per album type
plt.figure(figsize=(13, 6))
avg_popularity_per_album_type = df.groupby("album_type").track_popularity.mean()
plt.bar(avg_popularity_per_album_type.index, avg_popularity_per_album_type.values)
plt.xticks(avg_popularity_per_album_type.index, ['Compilation', 'Single', 'Album'])
plt.xlabel("Album Type", fontsize=15)
plt.ylabel("Average Track Popularity", fontsize=15)
plt.title("Average Track Popularity per Album Type", fontsize=20)
plt.show()

In [None]:
# Scatter plot of track popularity vs. artist popularity
plt.figure(figsize=(13, 6))
plt.scatter(df.artist_popularity, df.track_popularity, color='blue', marker='+')
plt.xlabel("Artist Popularity", fontsize=15)
plt.ylabel("Track Popularity", fontsize=15)
plt.title("Track Popularity vs. Artist Popularity", fontsize=20)
plt.show()

In [None]:
# Scatter plot of track popularity vs. artist followers
plt.figure(figsize=(13, 6))
plt.scatter(df.artist_followers, df.track_popularity, color='green', marker='+')
plt.xscale('log')
plt.xlabel("Artist Followers", fontsize=15)
plt.ylabel("Track Popularity", fontsize=15)
plt.title("Track Popularity vs. Artist Followers", fontsize=20)
plt.show()

In [None]:
# Scatter plot of track popularity vs. duration
plt.figure(figsize=(13, 6))
plt.scatter(df.track_duration_ms, df.track_popularity, color='red', marker='+')
plt.xlabel("Duration (ms)", fontsize=15)
plt.ylabel("Track Popularity", fontsize=15)
plt.title("Track Popularity vs. Duration", fontsize=20)
plt.show()

In [None]:
# Scatter plot of track popularity vs. release year
plt.figure(figsize=(13, 6))
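years = pd.to_numeric(df.year, errors="coerce")   # Handles years stored as strings or integers
known = years > 0                                  # Skip unknown years (0 or non-numeric)
# Plot each track's own year against its popularity (sorting the years alone would break the pairing)
plt.scatter(years[known], df.track_popularity[known], color='purple', marker='+')
plt.xticks(sorted(years[known].astype(int).unique())[::10])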
plt.xlabel("Release Year", fontsize=15)
plt.ylabel("Track Popularity", fontsize=15)
plt.title("Track Popularity vs. Release Year", fontsize=20)
plt.show()

In [None]:
# Pie chart of explicit tracks
plt.figure(figsize=(13, 6))
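# Order the counts by encoded value (0 = non-explicit, 1 = explicit) so they line up with the labels
data = df.explicit.value_counts().reindex([0, 1], fill_value=0).tolist()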
labels = ['Non-Explicit', 'Explicit']
plt.pie(data, labels=labels, autopct='%1.1f%%', startangle=135)
plt.title("Distribution of Explicit Tracks", fontsize=20)
plt.show()

In [None]:
# Pie chart of album types
plt.figure(figsize=(13, 6))
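# Order the counts by encoded value (0 = compilation, 1 = single, 2 = album) so they line up with the labels
data = df.album_type.value_counts().reindex([0, 1, 2], fill_value=0).tolist()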
labels = ['Compilation', 'Single', 'Album']
plt.pie(data, labels=labels, autopct='%1.1f%%', startangle=148)
plt.title("Distribution of Album Types", fontsize=20)
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
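# Coerce every column to numeric first so that any string-typed column does not break corr()
sns.heatmap(df.apply(pd.to_numeric, errors="coerce").corr(), annot=True, fmt=".2f", linewidths=0.5, cmap="coolwarm")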
plt.title("Correlation Heatmap", fontsize=20)
plt.show()
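
A heatmap can be hard to read when the question is simply which columns relate most strongly to track popularity. The sketch below, on made-up numbers, extracts one column of the correlation matrix and ranks it by absolute value; the `toy` DataFrame is illustrative and does not reflect the dataset's actual correlations.

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "track_popularity": [10, 20, 30, 40],
    "artist_popularity": [12, 18, 33, 41],
    "track_duration_ms": [400, 300, 350, 100],
})

# Correlation of every other column with the target column
target_corr = toy.corr()["track_popularity"].drop("track_popularity")

# Rank by strength of the relationship, regardless of sign
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The sign is kept in the output, so a strong negative relationship still appears near the top of the ranking.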