Q1: predicting Spotify track popularity via logistic regression

In this notebook, we test whether a traditional machine learning model can identify popular Spotify tracks from their audio features.
We frame this as a binary classification problem, where tracks are labelled as popular or not popular based on their Spotify popularity score.

Dataset Overview

The dataset used in this project is the "Spotify Global Music Dataset (2009-2025)" sourced from Kaggle.

Each track includes a set of audio features such as danceability, energy, tempo, valence and a popularity score, valued between 0 and 100.

These features are pulled directly from Spotify and provide a quantitative value for each musical characteristic.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [10]:
df = pd.read_csv("track_data_final.csv")
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'track_data_final.csv'

Feature Selection
We select a subset of numeric audio features that describe musical properties of each track.

These features are commonly used in music analysis and are suitable for machine learning models.

In [None]:
features = ["danceability", "energy", "loudness", "speechiness",
            "acousticness", "instrumentalness", "liveness",
            "valence", "tempo"]


Creating the binary target variable

Spotify provides a popularity score ranging from 0 to 100. To convert this into a classification task, we label each song as "popular" or "not popular" using a binary value: a popularity score of 70 or above is "popular" (1), and anything below 70 is "not popular" (0).

In [None]:
# 1 if the popularity score is at least 70, otherwise 0
df["popular"] = (df["popularity"] >= 70).astype(int)
df["popular"].value_counts()
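To make the threshold concrete, here is a minimal sketch on a made-up frame (`toy` and its scores are illustrative, not taken from the dataset). It shows that a score of exactly 70 falls on the popular side and how the class share can be read off.

```python
import pandas as pd

# Hypothetical popularity scores, chosen to sit around the threshold
toy = pd.DataFrame({"popularity": [35, 69, 70, 71, 90]})

# Same rule as in the notebook: at least 70 means popular
toy["popular"] = (toy["popularity"] >= 70).astype(int)

labels = toy["popular"].tolist()        # [0, 0, 1, 1, 1]
popular_share = toy["popular"].mean()   # fraction of rows labelled 1
```

On the real data, the share of popular tracks is worth checking before modelling, since a high threshold such as 70 may leave the positive class small.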

Data Cleaning

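We remove any rows with missing values in the selected features or in the popularity score that defines the target.

This ensures that the model is trained on complete and consistent data.


In [None]:
# Drop on the raw "popularity" column: the derived "popular" flag is never
# missing, because a missing score compares as False and becomes 0.
df = df.dropna(subset=features + ["popularity"])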
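One subtlety worth checking when the target is derived before cleaning: a missing popularity score does not produce a missing label. The sketch below uses a made-up three-row frame (`toy` is hypothetical) to show that the comparison turns `NaN` into 0, so only a drop on the raw score removes such rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing popularity score
toy = pd.DataFrame({"popularity": [82.0, np.nan, 15.0]})

# NaN >= 70 evaluates to False, so the missing row is labelled 0
toy["popular"] = (toy["popularity"] >= 70).astype(int)
labels = toy["popular"].tolist()  # [1, 0, 0]

# Dropping on the derived flag keeps all rows; dropping on the raw score does not
rows_kept_by_flag = len(toy.dropna(subset=["popular"]))      # 3
rows_kept_by_score = len(toy.dropna(subset=["popularity"]))  # 2
```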

In [None]:
from sklearn.model_selection import train_test_split

X = df[features]
y = df["popular"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
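If popular tracks turn out to be a small minority, a plain random split can give the training and test sets different class shares. As a sketch of one possible alternative (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The arrays `X_toy` and `y_toy` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 tracks, 10 of them popular
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% popular share in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_share = y_tr.mean()  # share of popular tracks in the training set
test_share = y_te.mean()   # share of popular tracks in the test set
```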


Feature Scaling

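Logistic regression in scikit-learn applies L2 regularisation by default, which makes the fitted coefficients sensitive to the scale of the input features.

We therefore standardise the data so that each feature has a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.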


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
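For intuition, the standardisation step can be sketched by hand with NumPy. The values below are made up (a tempo-like and a valence-like column). The point is that the mean and standard deviation come from the training rows only and are reused for the test rows.

```python
import numpy as np

# Hypothetical training and test rows: [tempo-like, valence-like]
train = np.array([[100.0, 0.2],
                  [120.0, 0.4],
                  [140.0, 0.9]])
test = np.array([[130.0, 0.5]])

# Statistics are estimated on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, applied with the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

After this step each training column has mean 0 and standard deviation 1. The test columns generally do not, which is expected.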


Model Training

We use **logistic regression** as our traditional machine learning model.

It is simple, interpretable, and provides a strong baseline for classification problems.


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
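As a rough illustration of what the fitted model computes, logistic regression passes a weighted sum of the scaled features through the sigmoid function to obtain a probability, then applies a 0.5 cut-off. The weights, bias and input below are hypothetical values picked for the example, not coefficients learned from this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two scaled features
weights = np.array([0.8, -0.5])
bias = -1.0

# One hypothetical scaled track
x = np.array([1.0, 0.0])

log_odds = bias + weights @ x   # linear part: -1.0 + 0.8 = -0.2
prob = sigmoid(log_odds)        # probability of the "popular" class
label = int(prob >= 0.5)        # 0, because the log-odds are negative
```

In the notebook itself, the learned values are available after fitting as `model.coef_` and `model.intercept_`, and the per-class probabilities via `model.predict_proba`.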


Model Evaluation

We evaluate the model using accuracy, a confusion matrix, and a classification report.


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
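Accuracy on its own can look good when one class dominates, which is why the confusion matrix and per-class report matter here. The sketch below uses made-up labels (10 popular tracks out of 100) and a trivial predictor that always answers "not popular" to show the gap between accuracy and recall.

```python
import numpy as np

# Hypothetical ground truth: 10 popular tracks, 90 not popular
y_true = np.array([1] * 10 + [0] * 90)

# A trivial predictor that never predicts "popular"
y_hat = np.zeros(100, dtype=int)

accuracy = (y_true == y_hat).mean()  # 0.9, despite finding no popular track

# Confusion-matrix cells for the positive class
tp = int(((y_true == 1) & (y_hat == 1)).sum())  # true positives: 0
fn = int(((y_true == 1) & (y_hat == 0)).sum())  # false negatives: 10

recall = tp / (tp + fn)  # 0.0
```

When reading the classification report for the real model, the precision and recall of the popular class are therefore more informative than the headline accuracy.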

Discussion

The logistic regression model provides a reasonable baseline for predicting track popularity.

However, it assumes a linear relationship between the features and the log-odds of a track being popular, which may limit its ability to capture more complex patterns in the data.


Conclusion

In this notebook, we demonstrated that a traditional machine learning model can predict
Spotify track popularity with reasonable performance. This baseline will be used as a
comparison point for more complex models in the following notebooks.
