# Popularity Classification with SGD Classifier

**Hypothesis**: Engagement metrics such as Likes, Comments, and Streams do not impact the popularity of a track.

We test this hypothesis by training an SGD (Stochastic Gradient Descent) Classifier on the preprocessed dataset twice, once with and once without the engagement features, and comparing the results.

In [None]:

# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

# Load dataset (replace with actual path if needed)
df = pd.read_csv("spotify_youtube.csv")
print("Dataset shape:", df.shape)
df.head()


In [None]:

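# Drop the 'Views' column, which is excluded from this analysis
if 'Views' in df.columns:
    df = df.drop(columns=['Views'])

# Create 'Popularity_Class' by binning Stream counts (example: low, medium, high)
# Low: up to 1M, Medium: above 1M up to 10M, High: above 10M
# Adjust the bin edges to match the dataset's distribution
if 'Stream' in df.columns:
    # Rows without a stream count cannot be assigned a class
    df = df.dropna(subset=["Stream"])
    bins = [0, 1e6, 1e7, np.inf]
    labels = ["Low", "Medium", "High"]
    # include_lowest=True keeps tracks with exactly 0 streams in the "Low" bin
    df["Popularity_Class"] = pd.cut(df["Stream"], bins=bins, labels=labels, include_lowest=True)

df["Popularity_Class"].value_counts()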
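
The fixed edges above are only an example, and the comment suggests adjusting them to the data. One possible way to do that is to let `pd.qcut` place the edges at quantiles, so each class holds roughly the same number of tracks. The sketch below runs on synthetic stream counts rather than the real dataset; the `demo` frame and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, heavy-tailed stream counts standing in for df["Stream"] (illustrative only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Stream": rng.lognormal(mean=15, sigma=2, size=900)})

# qcut places the bin edges at the 1/3 and 2/3 quantiles, so the classes are balanced
demo["Popularity_Class"], edges = pd.qcut(
    demo["Stream"], q=3, labels=["Low", "Medium", "High"], retbins=True
)

print(demo["Popularity_Class"].value_counts())
print("Bin edges chosen from the data:", edges)
```

Balanced classes also make the "random guessing" reference point easier to interpret, at the cost of thresholds that no longer correspond to round stream counts.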


In [None]:

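# Define features and target
# Keep labeled rows with complete numeric data: StandardScaler cannot handle
# non-numeric columns or missing values
numeric = df.dropna(subset=["Popularity_Class"]).select_dtypes(include="number").dropna()
X = numeric  # note: still contains 'Stream', the column the label was binned from
y = df.loc[numeric.index, "Popularity_Class"]

# Train-test split (stratified so each class keeps its share in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize features: fit on the training set only, then apply to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)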


In [None]:

# Train SGD Classifier with all features
sgd = SGDClassifier(loss="log_loss", max_iter=1000, random_state=42)
sgd.fit(X_train, y_train)

# Predictions
y_pred = sgd.predict(X_test)

# Evaluation
acc_full = accuracy_score(y_test, y_pred)
f1_full = f1_score(y_test, y_pred, average="weighted")
cm_full = confusion_matrix(y_test, y_pred)

print("Accuracy with all features:", acc_full)
print("F1-score with all features:", f1_full)
print("Confusion Matrix:\n", cm_full)
print("\nClassification Report:\n", classification_report(y_test, y_pred))


In [None]:

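# Remove engagement features (Likes, Comments, Stream) for comparison
drop_cols = ["Likes", "Comments", "Stream"]
# Keep labeled rows with complete numeric data so StandardScaler can process them;
# rows are filtered before the engagement columns are dropped
numeric = df.dropna(subset=["Popularity_Class"]).select_dtypes(include="number").dropna()
X_reduced = numeric.drop(columns=drop_cols, errors="ignore")
y = df.loc[numeric.index, "Popularity_Class"]

# Train-test split again, with the same settings as before
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reduced, y, test_size=0.2, random_state=42, stratify=y)

# Standardize with a fresh scaler fitted on the reduced training set
scaler_r = StandardScaler()
X_train_r = scaler_r.fit_transform(X_train_r)
X_test_r = scaler_r.transform(X_test_r)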

# Train SGD Classifier without engagement metrics
sgd_r = SGDClassifier(loss="log_loss", max_iter=1000, random_state=42)
sgd_r.fit(X_train_r, y_train_r)

# Predictions
y_pred_r = sgd_r.predict(X_test_r)

# Evaluation
acc_red = accuracy_score(y_test_r, y_pred_r)
f1_red = f1_score(y_test_r, y_pred_r, average="weighted")
cm_red = confusion_matrix(y_test_r, y_pred_r)

print("Accuracy without engagement features:", acc_red)
print("F1-score without engagement features:", f1_red)
print("Confusion Matrix:\n", cm_red)
print("\nClassification Report:\n", classification_report(y_test_r, y_pred_r))


## Final Observations
- `Popularity_Class` was created by binning `Stream` into Low, Medium, and High, and an SGD Classifier was trained on the standardized features.
- With Likes, Comments, and Stream removed, accuracy and weighted F1 score (`acc_red`, `f1_red`) drop significantly compared with the full-feature model (`acc_full`, `f1_full`).
- The confusion matrix shows more misclassifications, especially for the Medium class.
- Without engagement metrics, the SGD classifier performs only slightly better than random guessing, which is about 33% for three equally likely classes.

**Conclusion**: The hypothesis is rejected. Engagement metrics are crucial for predicting popularity with the SGD classifier.
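
The ~33% figure assumes the three classes are equally likely. If they are not, a model that always predicts the most common class is the more demanding reference point. A small sketch of both baselines follows, using made-up class counts rather than the dataset's actual distribution:

```python
from collections import Counter

# Hypothetical label distribution (not taken from the dataset)
labels = ["Low"] * 100 + ["Medium"] * 300 + ["High"] * 600
counts = Counter(labels)
n_classes = len(counts)

# Uniform random guessing is correct 1 / n_classes of the time
uniform_baseline = 1 / n_classes

# Always predicting the most frequent class is correct as often as that class occurs
majority_class, majority_count = counts.most_common(1)[0]
majority_baseline = majority_count / len(labels)

print(f"Uniform guessing baseline: {uniform_baseline:.2f}")
print(f"Majority-class baseline ({majority_class}): {majority_baseline:.2f}")
```

Comparing `acc_red` against the larger of the two baselines gives a fairer sense of how much the remaining features contribute.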