Q1: Predicting Spotify track popularity via logistic regression

In this notebook, we test whether a traditional machine learning model can identify popular Spotify tracks.
We frame this as a binary classification problem, where tracks are labelled as popular or not popular based on their Spotify popularity score.

Why a traditional machine learning model?

Before using neural networks, it is good practice to establish a strong baseline using a traditional machine learning approach.

The reasons for choosing a traditional model in this project are as follows:
- The dataset is of moderate size, meaning complex models may overfit.
- Traditional models are faster to train and easier to interpret.
- They provide a strong baseline to compare against neural networks in Q2.
- They require fewer hyperparameters and are easier for beginners to understand.

Dataset overview

The dataset used in this project is the "Spotify Global Music Dataset (2009-2025)" sourced from Kaggle.

Each track includes metadata such as the track duration, artist popularity, the album's release date, the track's name and a popularity score between 0 and 100.

These fields are pulled directly from Spotify. Several of them, such as duration, artist popularity and follower count, are numeric and can be passed to a model without further encoding.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
df = pd.read_csv("track_data_final.csv")
df.head()

Unnamed: 0,track_id,track_name,track_number,track_popularity,track_duration_ms,explicit,artist_name,artist_popularity,artist_followers,artist_genres,album_id,album_name,album_release_date,album_total_tracks,album_type
0,6pymOcrCnMuCWdgGVTvUgP,3,57,61,213173,False,Britney Spears,80.0,17755451.0,['pop'],325wcm5wMnlfjmKZ8PXIIn,The Singles Collection,2009-11-09,58,compilation
1,2lWc1iJlz2NVcStV5fbtPG,Clouds,1,67,158760,False,BUNT.,69.0,293734.0,['stutter house'],2ArRQNLxf9t0O0gvmG5Vsj,Clouds,2023-01-13,1,single
2,1msEuwSBneBKpVCZQcFTsU,Forever & Always (Taylor’s Version),11,63,225328,False,Taylor Swift,100.0,145396321.0,[],4hDok0OAJd57SGIT8xuWJH,Fearless (Taylor's Version),2021-04-09,26,album
3,7bcy34fBT2ap1L4bfPsl9q,I Didn't Change My Number,2,72,158463,True,Billie Eilish,90.0,118692183.0,[],0JGOiO34nwfUdDrD612dOp,Happier Than Ever,2021-07-30,16,album
4,0GLfodYacy3BJE7AI3A8en,Man Down,7,57,267013,False,Rihanna,90.0,68997177.0,[],5QG3tjE5L9F6O2vCAPph38,Loud,2010-01-01,13,album


Feature Selection

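We select a subset of columns that describe each track, its artist and its album.

The list mixes numeric columns (duration, artist popularity, follower count, album track count) with text and date columns (track name, genres, release date). Only the numeric columns are passed to the model later in the notebook.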
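
To check which of the selected columns pandas treats as numeric, `select_dtypes` can be used. A minimal sketch on two rows copied from the preview above (`sample`, `numeric_cols` and `other_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Two rows copied from the df.head() preview, restricted to the selected columns.
sample = pd.DataFrame({
    "track_name": ["Clouds", "Man Down"],
    "track_duration_ms": [158760, 267013],
    "artist_popularity": [69.0, 90.0],
    "artist_followers": [293734.0, 68997177.0],
    "artist_genres": ["['stutter house']", "[]"],
    "album_total_tracks": [1, 13],
    "album_release_date": ["2023-01-13", "2010-01-01"],
})

# select_dtypes keeps only the columns whose dtype is numeric (int or float).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in sample.columns if c not in numeric_cols]

print("numeric:", numeric_cols)
print("other:  ", other_cols)
```

The text and date columns would need encoding (or parsing into a year, for example) before a linear model could use them.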

In [None]:
features = ["track_name", "track_duration_ms", "artist_popularity", "artist_followers",
            "artist_genres", "album_total_tracks", "album_release_date"]


Creating the binary target variable

Spotify provides a popularity score ranging from 0 to 100. To convert this into a classification task, we label each track as "popular" or "not popular" using binary values: a popularity score of 70 or above makes a track "popular" (1), and anything below 70 makes it "not popular" (0).

In [None]:
df["popular"] = (df["track_popularity"] >= 70).astype(int)
df["popular"].value_counts()

popular,count
0,6328
1,2450
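
The counts show that the two classes are imbalanced. As a quick worked check using only the printed counts, the share of each class can be computed like this:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 6328, 1: 2450}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
```

Roughly 72% of tracks are labelled not popular and 28% popular, so a classifier that always predicts "not popular" would already score about 72% accuracy on data with this balance.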


Data Cleaning

We remove any rows with missing values in the selected features or target variable.

This ensures that the model is trained on complete and consistent data.


In [None]:
df = df.dropna(subset=features + ["popular"])
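
To illustrate what `dropna(subset=...)` does, here is a minimal sketch on a made-up three-row frame (the frame `toy` and its values are invented for this example). Only the columns named in `subset` are checked, so a missing value elsewhere does not cause a row to be dropped:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({
    "artist_followers": [293734.0, np.nan, 68997177.0],
    "album_type": ["single", None, None],
    "popular": [0, 1, 1],
})

# Row 1 is dropped (missing artist_followers); row 2 is kept because
# album_type is not listed in subset.
cleaned = toy.dropna(subset=["artist_followers", "popular"])

print(len(toy), "->", len(cleaned))
```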

In [None]:
from sklearn.model_selection import train_test_split

X = df[features]
y = df["popular"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
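
The split above samples rows at random, so the share of popular tracks in the training and test sets can drift slightly apart. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows with 28% positives, mimicking an imbalanced target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 280 + [0] * 720)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification both parts keep a positive rate close to 28%.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Adopting this in the notebook would change which rows land in the test set, so the reported metrics would need to be recomputed.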


Feature Scaling

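Logistic regression is sensitive to the scale of input features, both because scikit-learn applies L2 regularisation by default and because the solver converges more reliably on features of similar magnitude.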

We therefore standardise the data so that each feature has a mean of 0 and a standard
deviation of 1.


In [None]:
from sklearn.preprocessing import StandardScaler

# Identify numerical features from the 'features' list that are suitable for scaling
numerical_features = ["track_duration_ms", "artist_popularity", "artist_followers", "album_total_tracks"]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numerical_features])
X_test_scaled = scaler.transform(X_test[numerical_features])
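
Standardisation replaces each value x with (x - mean) / std, where the mean and standard deviation come from the training data only. The sketch below reproduces this by hand on the five track durations from the preview (the train/test assignment here is arbitrary and only for illustration) and compares the result with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Durations in milliseconds copied from the preview rows; the split is arbitrary.
train = np.array([[158760.0], [213173.0], [225328.0], [267013.0]])
test = np.array([[158463.0]])

demo_scaler = StandardScaler().fit(train)

# Manual version: statistics are taken from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (test - mean) / std

# The manual result matches the scaler, and the scaled training column
# has mean 0 and standard deviation 1 up to floating-point error.
print(np.allclose(demo_scaler.transform(test), manual_test))
scaled_train = demo_scaler.transform(train)
print(scaled_train.mean(), scaled_train.std())
```

Fitting the scaler on the training rows alone, as the cell above does, keeps information about the test set out of the preprocessing step.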

Model Training

We use **logistic regression** as our traditional machine learning model.

It is simple, interpretable, and provides a strong baseline for classification problems.


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
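
One reason logistic regression counts as interpretable is that each standardised feature gets a single coefficient whose sign and size can be read directly. A minimal sketch on synthetic data (the feature names are borrowed from `numerical_features`, but the data and the resulting coefficients are made up and say nothing about the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

names = ["track_duration_ms", "artist_popularity", "artist_followers", "album_total_tracks"]

# Synthetic standardised features; the label depends only on the second column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem; sort by absolute size.
coefs = pd.Series(demo_model.coef_[0], index=names).sort_values(key=abs, ascending=False)
print(coefs)
```

In the notebook, the same pattern would apply to `model.coef_[0]` paired with `numerical_features`.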


Model Evaluation

We evaluate the model using accuracy, a confusion matrix, and a classification report.


In [None]:
y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"model Accuracy: {accuracy: .2%}")
print("\nConfusion Matrix:\n [[Actual \\ Predicted] \n [Not Popular \\ Popular]]\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n \n", classification_report(y_test, y_pred))



model Accuracy:  71.68%

Confusion Matrix:
 [[Actual \ Predicted] 
 [Not Popular \ Popular]]
 [[1187   71]
 [ 426   71]]

Classification Report:
 
               precision    recall  f1-score   support

           0       0.74      0.94      0.83      1258
           1       0.50      0.14      0.22       497

    accuracy                           0.72      1755
   macro avg       0.62      0.54      0.52      1755
weighted avg       0.67      0.72      0.66      1755
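
Every figure in the report can be recomputed from the four cells of the confusion matrix. The worked check below uses the printed counts and also computes the accuracy of a trivial classifier that labels every track as not popular:

```python
# Confusion matrix cells copied from the output above
# (rows = actual class, columns = predicted class).
tn, fp = 1187, 71
fn, tp = 426, 71

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the tracks predicted popular, how many are
recall = tp / (tp + fn)      # of the popular tracks, how many are found
f1 = 2 * precision * recall / (precision + recall)

# A classifier that always predicts class 0 gets every actual negative right.
majority_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"majority-class accuracy={majority_accuracy:.4f}")
```

Both accuracies come out at 71.68%, because the number of false positives (71) happens to equal the number of true positives (71). Accuracy alone therefore does not separate this model from the trivial one; the recall and precision for the popular class are the more informative figures.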



Discussion

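The logistic regression model reaches 71.68% accuracy on the test set, but this figure needs context: 1,258 of the 1,755 test tracks are not popular, so always predicting "not popular" would score the same 71.68%. The model finds only 14% of the popular tracks (recall 0.14) at a precision of 0.50, so it is a modest baseline for predicting track popularity.

It also assumes a linear relationship between the features and the log-odds of the target, which may limit its ability to capture more complex patterns in the data.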


Conclusion

In this notebook, we trained a traditional machine learning model to predict
Spotify track popularity. It reaches 71.68% test accuracy but identifies few of the
popular tracks. This baseline will be used as a comparison point for more complex
models in the following notebooks.
