# EDA

### Source of data
- [Kaggle Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download)

# Column Description

### `track_id`
The Spotify ID for the track.

### `artists`
The names of the artists who performed the track. If there is more than one artist, the names are separated by a `;`.
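
A minimal sketch of turning the `artists` field into a list, assuming only the `;` separator described above. `split_artists` is a hypothetical helper name, not part of the dataset or of any library.

```python
def split_artists(artists: str) -> list[str]:
    """Split a `;`-separated artists string into a list of names."""
    # Strip surrounding whitespace and drop empty fragments
    # (for example, those produced by a trailing `;`).
    return [name.strip() for name in artists.split(";") if name.strip()]


print(split_artists("Artist A;Artist B"))
print(split_artists("Solo Artist"))
```

On the full DataFrame, `df["artists"].str.split(";")` should give the same per-row lists.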

### `album_name`
The name of the album in which the track appears.

### `track_name`
Name of the track.

### `popularity`
The popularity of a track is a value between 0 and 100, with 100 being the most popular.  
- The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are.  
- Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.  
- Duplicate tracks (e.g., the same track from a single and an album) are rated independently.  
- Artist and album popularity is derived mathematically from track popularity.

### `duration_ms`
The track length in milliseconds.

### `explicit`
Whether or not the track has explicit lyrics:  
- `true` = yes, it does.  
- `false` = no, it does not OR unknown.

### `danceability`
Describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.  
- A value of 0.0 is least danceable.  
- A value of 1.0 is most danceable.

### `energy`
A value from 0.0 to 1.0 representing a perceptual measure of intensity and activity.  
- Energetic tracks feel fast, loud, and noisy.  
- For example, death metal has high energy, while a Bach prelude scores low on the scale.

### `key`
Represents the musical key of the track, encoded as an integer:  
- Values range from `0` (C) to `11` (B) based on the chromatic scale:  
  - `0` = C (Do), `1` = C♯/D♭, `2` = D (Re), ..., `11` = B (Si).  
- `-1` indicates no key detected.  

This feature helps understand the harmonic structure and can be used for tasks like creating smooth transitions between tracks or analyzing musical patterns.

### `loudness`
The overall loudness of a track in decibels (dB).

### `mode`
Indicates the mode of the track:  
- `1` = Major (associated with a happy or bright mood).  
- `0` = Minor (associated with a sad or dark mood).  

This feature helps to understand the emotional tone of the track and analyze musical patterns.
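
The `key` and `mode` encodings described above can be combined into a readable label. The sketch below is illustrative only: `PITCH_CLASSES` and `describe_key` are hypothetical names, and the sharp/flat spellings are one common convention.

```python
# Pitch-class names indexed by the integer `key` value (0 = C ... 11 = B).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a label such as 'C major' for a (key, mode) pair."""
    if key == -1:
        # -1 means that no key was detected for the track.
        return "no key detected"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 (minor) or 1 (major), got {mode}")
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


print(describe_key(0, 1))
print(describe_key(11, 0))
print(describe_key(-1, 1))
```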

### `speechiness`
Detects the presence of spoken words in a track.  
- The more exclusively speech-like the recording (e.g., talk show, audio book, poetry), the closer to 1.0 the attribute value.  
- Values above `0.66` describe tracks that are probably made entirely of spoken words.  
- Values between `0.33` and `0.66` describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music.  
- Values below `0.33` most likely represent music and other non-speech-like tracks.
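
The thresholds above can be sketched as a small bucketing function. The category labels and the name `speechiness_category` are assumptions made for illustration, and the boundary values `0.33` and `0.66` are assigned to the middle bucket by choice.

```python
def speechiness_category(value: float) -> str:
    """Map a speechiness score in [0.0, 1.0] to a coarse category."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"speechiness must be in [0.0, 1.0], got {value}")
    if value > 0.66:
        return "spoken word"       # probably made entirely of spoken words
    if value >= 0.33:
        return "music and speech"  # may mix music and speech, e.g. rap
    return "music"                 # most likely music or other non-speech audio


for score in (0.9, 0.5, 0.1):
    print(score, speechiness_category(score))
```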

### `acousticness`
A confidence measure from 0.0 to 1.0 of whether the track is acoustic.  
- `1.0` represents high confidence the track is acoustic.

### `instrumentalness`
Predicts whether a track contains no vocals:  
- "Ooh" and "aah" sounds are treated as instrumental in this context.  
- Rap or spoken word tracks are clearly "vocal."  
- The closer the instrumentalness value is to `1.0`, the greater the likelihood the track contains no vocal content.

### `liveness`
Detects the presence of an audience in the recording.  
- Higher liveness values represent an increased probability that the track was performed live.  
- A value above `0.8` indicates a strong likelihood that the track is live.

### `valence`
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.  
- Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric).  
- Tracks with low valence sound more negative (e.g., sad, depressed, angry).

### `tempo`
The overall estimated tempo of a track in beats per minute (BPM).  
- In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

### `time_signature`
An estimated time signature:  
- The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).  
- The time signature ranges from `3` to `7`, indicating time signatures of 3/4 to 7/4.

### `track_genre`
The genre to which the track belongs.


In [None]:
import pandas as pd

dataset_path = "../data/raw/dataset.csv"
# Load data
df = pd.read_csv(dataset_path, low_memory=False)

In [None]:
# Shape
df.shape

In [None]:
# Describe
df.describe()

In [None]:
# Column names
df.columns

In [None]:
# Numerical columns
df.select_dtypes(include=['int64', 'float64']).columns

In [None]:
# Store names of numerical columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

In [None]:
# Categorical columns
df.select_dtypes(include=['object']).columns

In [None]:
# Store names of categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

In [None]:
df[numerical_columns].head()

In [None]:
df[categorical_columns].head()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Show the percentage of missing values
df.isnull().sum() / df.shape[0] * 100

In [None]:
# Drop missing values
df.dropna(inplace=True)

In [None]:
# Confirm that no missing values remain
df.isnull().sum()

## Data Visualization

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
# Bar plot for track_genre
plt.figure(figsize=(50, 6))
sns.countplot(x='track_genre', data=df)
plt.title('Track Genre Distribution')
plt.xticks(rotation=90)
plt.show()

The dataset is extremely well balanced: each genre contributes roughly the same number of tracks.
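
One way to quantify the balance seen in the count plot is to compare the smallest and largest class sizes. The sketch below uses a tiny made-up label series; `class_balance` is a hypothetical helper, and on the real data it would be called as `class_balance(df["track_genre"])`.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.Series:
    """Summarize how evenly the classes in `labels` are represented."""
    counts = labels.value_counts()
    return pd.Series({
        "n_classes": len(counts),
        "min_count": counts.min(),
        "max_count": counts.max(),
        # A ratio of 1.0 means every class has exactly the same size.
        "imbalance_ratio": counts.max() / counts.min(),
    })


# Toy labels standing in for the `track_genre` column.
toy_labels = pd.Series(["rock"] * 3 + ["jazz"] * 3 + ["pop"] * 3)
summary = class_balance(toy_labels)
print(summary)
```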

In [None]:
# Which genre is the most popular? Sort by popularity and show the top 10
df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(10)

In [None]:
# Which genre is the least popular? Sort by popularity and show the bottom 10
df.groupby('track_genre')['popularity'].mean().sort_values(ascending=True).head(10)

In [None]:
# Subplot
fig, ax = plt.subplots(1, 2, figsize=(20, 6))

# Most popular genres
df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(10).plot(kind='bar', ax=ax[0])
ax[0].set_title('Top 10 Most Popular Genres')
ax[0].set_ylabel('Popularity')
ax[0].set_xlabel('Genre')
ax[0].tick_params(axis='x', rotation=90)

# Least popular genres
df.groupby('track_genre')['popularity'].mean().sort_values(ascending=True).head(10).plot(kind='bar', ax=ax[1])
ax[1].set_title('10 Least Popular Genres')
ax[1].set_ylabel('Popularity')
ax[1].set_xlabel('Genre')
ax[1].tick_params(axis='x', rotation=90)

plt.show()

## Correlation matrix

In [None]:
columns = ['popularity', 'duration_ms', 'danceability', 'energy', 'loudness', 
           'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)

In [None]:
# Describe
df_scaled.describe()

In [None]:
correlation_matrix = df_scaled.corr()

In [None]:
# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()

In [None]:
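# Filter the correlation values
correlation_values = correlation_matrix.unstack().sort_values(ascending=False)
# Remove self-correlations by label rather than by value: floating-point
# rounding can leave the diagonal slightly different from exactly 1
is_self_pair = correlation_values.index.get_level_values(0) == correlation_values.index.get_level_values(1)
correlation_values = correlation_values[~is_self_pair]
# Keep pairs with |r| >= 0.40 (each pair appears twice, as (a, b) and (b, a))
correlated_features = correlation_values[correlation_values.abs() >= 0.40]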

In [None]:
correlated_features
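
Because the unstacked matrix is symmetric, every pair is listed twice. A possible alternative, sketched here on toy data, keeps only the strict upper triangle so that each pair appears once. `strong_pairs` is a hypothetical helper, and the `0.40` default mirrors the cutoff used above.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr: pd.DataFrame, threshold: float = 0.40) -> pd.Series:
    """Return each feature pair with |r| >= threshold exactly once."""
    # Boolean mask that is True strictly above the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked-out cells become NaN; dropna() removes them after stacking.
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data: `b` is a linear function of `a`; `c` is only weakly related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [3, 5, 7, 9, 11, 13],
    "c": [1, -1, -1, 1, 1, -1],
})
pairs = strong_pairs(toy.corr())
print(pairs)
```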

In [None]:
# Scatter plot
plt.figure(figsize=(20, 6))
sns.scatterplot(x='energy', y='loudness', data=df_scaled)
plt.title('Energy vs Loudness')
plt.show()

In [None]:
# Important features : ['danceability', 'energy', 'loudness', 'instrumentalness', 'valence']

In [None]:
# Scatter plot them
plt.figure(figsize=(20, 6))
sns.scatterplot(x='danceability', y='energy', data=df_scaled)
plt.title('Danceability vs Energy')
plt.show()

In [None]:
import plotly.express as px

# Select the important features
important_features = ['danceability', 'energy', 'loudness', 'instrumentalness', 'valence']

# Create a pairplot
fig = px.scatter_matrix(df_scaled, dimensions=important_features, title='Pairplot of Important Features')
fig.update_layout(width=1000, height=800)
fig.show()

In [None]:
# Min Max Scaler test
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)

In [None]:
# correlation matrix
correlation_matrix = df_scaled.corr()

In [None]:
# Plot
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
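
Pearson correlation is unchanged by any positive linear rescaling of a column, so this heatmap is expected to match the one computed from the standardized data. The sketch below checks this on synthetic data, applying by hand the same per-column transforms that `StandardScaler` and `MinMaxScaler` perform. The toy columns are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "y": 3 * x + rng.normal(size=500),      # correlated with x
    "z": rng.uniform(size=500) * 1000.0,    # unrelated, much larger scale
})

# Standardization: zero mean, unit variance (population std, ddof=0).
standardized = (toy - toy.mean()) / toy.std(ddof=0)
# Min-max scaling: map each column onto [0, 1].
min_maxed = (toy - toy.min()) / (toy.max() - toy.min())

same_after_standardizing = np.allclose(toy.corr(), standardized.corr())
same_after_min_max = np.allclose(toy.corr(), min_maxed.corr())
print(same_after_standardizing, same_after_min_max)
```

Scaling still matters for scale-sensitive steps such as PCA or distance-based models, just not for the correlation matrix itself.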

In [None]:
import plotly.express as px

# Select the important features
important_features = ['danceability', 'energy', 'loudness', 'instrumentalness', 'valence']

# Create a pairplot
fig = px.scatter_matrix(df_scaled, dimensions=important_features, title='Pairplot of Important Features')
fig.update_layout(width=1000, height=800)
fig.show()

In [None]:
# Polynomial Features test
from sklearn.preprocessing import PolynomialFeatures

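# include_bias=False skips the constant "1" column, whose correlations would be NaN
poly = PolynomialFeatures(degree=2, include_bias=False)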
df_poly = poly.fit_transform(df[columns])
df_poly = pd.DataFrame(df_poly, columns=poly.get_feature_names_out(columns))

In [None]:
df_poly.head()

In [None]:
# Correlation matrix with plotly
correlation_matrix = df_poly.corr()

In [None]:
# Plot it with plotly
fig = px.imshow(correlation_matrix, title='Correlation Matrix')
fig.update_layout(width=1500, height=1300)
fig.show()

In [None]:
corr_matrix = correlation_matrix.abs()

In [None]:
import numpy as np

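# Keep only the values strictly above the diagonal (np.bool was removed in NumPy 1.24; use the built-in bool)
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))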

In [None]:
upper_triangle

In [None]:
# Features whose absolute correlation with an earlier feature is higher than 0.8 are redundant, so we can drop them
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.8)]

In [None]:
# Drop them
df_poly_processed = df_poly.drop(columns=to_drop)
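
The drop rule used here can be checked on a small example where the redundant column is known in advance. This is a sketch under the same `0.8` threshold; `drop_highly_correlated` is a hypothetical wrapper around the steps above, and the toy columns are invented.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.8):
    """Drop every column that correlates above `threshold` with an earlier column."""
    corr = frame.corr().abs()
    # Strict upper triangle: each column is compared only with columns before it.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "a_copy": [2, 4, 6, 8, 10, 12],   # perfectly correlated with `a`
    "c": [1, -1, -1, 1, 1, -1],       # only weakly correlated with `a`
})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Of two highly correlated columns, the later one in column order is dropped, so the result depends on column order.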

In [None]:
df_poly_processed.shape

In [None]:
df_poly_processed.head()

In [None]:
# Corr matrix and plot
correlation_matrix = df_poly_processed.corr()

In [None]:
# Plot
fig = px.imshow(correlation_matrix, title='Correlation Matrix')
fig.update_layout(width=1500, height=1300)
fig.show()

In [None]:
# PCA test
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
df_pca = pca.fit_transform(df_poly_processed)

print(f"Original shape: {df_poly_processed.shape}")
print(f"Reduced shape: {df_pca.shape}")
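
PCA is sensitive to feature scale, and the polynomial features here mix terms built from `duration_ms` (milliseconds) with products of 0-1 audio features. A hedged sketch on synthetic data shows how one large-scale column can absorb nearly all of the variance unless the features are standardized first. The toy matrix and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent toy features; the third has a much larger scale.
X = np.column_stack([
    rng.normal(size=300),
    rng.normal(size=300),
    rng.normal(scale=1e5, size=300),
])

# PCA on raw values: the large-scale column dominates the variance.
raw_pca = PCA(n_components=0.95).fit(X)
# PCA after standardization: every feature contributes comparably.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95)).fit(X)

n_raw = raw_pca.n_components_
n_scaled = scaled_pca[-1].n_components_
print(f"Components kept without scaling: {n_raw}")
print(f"Components kept with scaling: {n_scaled}")
```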

In [None]:
# Random Forest classifier on the de-correlated polynomial features, evaluated by accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

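X = df_poly_processed
# Reset the index so that y lines up with X, which was rebuilt with a fresh RangeIndex
y = df['track_genre'].reset_index(drop=True)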

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

params = {
    'n_estimators': 600,
    'random_state': 42
}

clf = RandomForestClassifier(**params)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

In [None]:
print(f"Accuracy: {accuracy}")
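
With many balanced genres, raw accuracy is easier to interpret next to a chance-level baseline. The sketch below uses synthetic data with four balanced classes, a stratified split, and scikit-learn's `DummyClassifier`. The toy data and the choice of the `most_frequent` strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 3))
y_toy = np.repeat(["a", "b", "c", "d"], 100)  # four perfectly balanced classes

# stratify keeps the class proportions identical in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)

# Baseline that ignores the features and always predicts a single class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))
print(f"Baseline accuracy: {baseline_accuracy}")
```

For balanced classes this baseline sits at 1 divided by the number of classes, which gives a floor against which the Random Forest accuracy can be compared.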