# Session 13-14 Recommender System

# Exercise: Music Recommendation System

You are a data scientist working at a tech company.

Your task is to design a music recommendation system, similar to Spotify, using the following dataset:

🔗 Spotify Recommendation Dataset (Kaggle)

https://www.kaggle.com/datasets/bricevergnou/spotify-recommendation

The goal of this system is to improve user experience by recommending songs or artists based on users' listening history, preferences, and behavior.

# Step 1. Model Selection

## Choice: Content-Based Filtering + Supervised Learning

## Why?

* We have song features (audio characteristics)

* We have user preference labels (liked)

* No need for other users' data

* Works well for personal taste modeling

## We'll use:

* Logistic Regression (simple, interpretable)

* Can later extend to Random Forest / XGBoost
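
As a rough sketch of that later extension, the cell below compares a scaled logistic regression with a random forest using 5-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in for the song table, so the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are illustrative assumptions and the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the song table: 200 "songs" with 13 numeric features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=42
)

# Two candidate models; tree ensembles do not need feature scaling
candidates = {
    "logistic_regression": Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean ROC-AUC over 5 cross-validation folds for each candidate
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, est in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean ROC-AUC = {score:.3f}")
```

On the real data, the same loop could be run with `X_train` and `y_train` in place of the synthetic arrays.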

# Step 2. Train-Test Split

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset (example path)
df = pd.read_csv("data.csv")

# Display basic information
print(df.head())
df.info()  # info() prints its summary directly and returns None

In [None]:
# Features used for recommendation
features = [
    'acousticness', 'danceability', 'duration_ms', 'energy',
    'instrumentalness', 'key', 'liveness', 'loudness',
    'mode', 'speechiness', 'tempo', 'time_signature', 'valence'
]

X = df[features]        # Input features (song characteristics)
y = df['liked']         # Target: 1 = liked, 0 = disliked

# Split data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
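
To see what `stratify=y` does, here is a small check on made-up labels (60 liked, 40 disliked). The frame `demo` and the variable names are hypothetical and used only for this illustration; the point is that the share of liked songs is preserved in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 songs, 60 liked (1) and 40 disliked (0)
demo = pd.DataFrame({"energy": range(100), "liked": [1] * 60 + [0] * 40})

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo[["energy"]], demo["liked"],
    test_size=0.3, random_state=42, stratify=demo["liked"]
)

# With stratification, both splits keep the original share of liked songs
train_share = yd_train.mean()
test_share = yd_test.mean()
print("Liked share in train:", train_share)
print("Liked share in test: ", test_share)
```

Without `stratify`, the two shares could drift apart by chance, which matters most when one class is rare.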

# Step 3. Model Development

We:

1. Scale numeric features

2. Train a classifier

3. Predict whether a song will be liked

In [None]:
# Step 3: Model development

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline:
# 1. Scale features
# 2. Train logistic regression classifier
model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Train the model
model.fit(X_train, y_train)

print("Model training completed.")

In scikit-learn, `Pipeline` is a way to bundle multiple data-processing steps and a model into one single object, so they always run in the correct order.

`model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])`

This means:

`StandardScaler()`
→ First, scale all song features (so loudness, tempo, etc. are on similar ranges)

`LogisticRegression()`
→ Then, use the scaled features to learn whether a song is liked or not

### Why this is useful (in plain words):

* You don't forget to scale data during training or prediction

* Training and testing use exactly the same preprocessing

* The whole system behaves like one clean model

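* It helps prevent data leakage (a very common ML mistake): the scaler's mean and standard deviation are learned from the training data only, then reused unchanged on the test data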
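
The points above can be checked directly. This is a minimal sketch on synthetic data (the names `X_demo`, `pipe`, `Xtr`, `Xte` are assumptions for the example, not part of the exercise): after fitting, the scaler inside the pipeline holds the training-set means, and calling `pipe.predict` gives the same result as scaling and classifying by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(Xtr, ytr)

# The scaler was fitted on the training rows only
scaler_means = pipe.named_steps["scaler"].mean_
print("Means match training data:", np.allclose(scaler_means, Xtr.mean(axis=0)))

# predict() applies the same two steps, in order, to new data
manual = pipe.named_steps["classifier"].predict(
    pipe.named_steps["scaler"].transform(Xte)
)
print("Pipeline equals manual steps:", bool((manual == pipe.predict(Xte)).all()))
```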

# Step 4. Evaluation Metrics

We evaluate using:

* Accuracy

* Precision

* Recall

* F1-score

* ROC-AUC
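
The first four metrics all come from the same four counts in the confusion matrix. As a small worked example on ten made-up labels (`y_true_demo` and `y_pred_demo` are invented for illustration), the cell below computes them by hand and compares the results with scikit-learn.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels: 1 = liked, 0 = disliked
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of songs predicted liked, how many were liked
recall = tp / (tp + fn)                      # of songs actually liked, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# The hand-computed values agree with scikit-learn
print(accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

For a recommender, precision is often the metric to watch first, since a false positive is a song shown to the user that they do not like.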

In [None]:
# Step 4: Evaluation metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)

# Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

print("\nDetailed classification report:\n")
print(classification_report(y_test, y_pred))

## Drawing ROC curve

In [None]:
# (1): Import required functions
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# (2): Compute False Positive Rate (FPR) and True Positive Rate (TPR)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# (3): Compute Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

# (4): Plot ROC curve

plt.figure()
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")

# Diagonal line = random classifier
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")

# Labels and title
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression Song Like Prediction")
plt.legend()
plt.show()
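
One way to read the AUC value: it is the probability that a randomly chosen liked song receives a higher score than a randomly chosen disliked song (ties count as half). The sketch below checks this on six made-up scores; `y_true_demo`, `scores`, and `pairwise_auc` are names invented for the example.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true_demo = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

liked = [s for s, y in zip(scores, y_true_demo) if y == 1]
disliked = [s for s, y in zip(scores, y_true_demo) if y == 0]

# Compare every (liked, disliked) pair: 1 point for a win, 0.5 for a tie
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(liked, disliked)
)
pairwise_auc = wins / (len(liked) * len(disliked))

print("Pairwise AUC:", pairwise_auc)
print("sklearn AUC: ", roc_auc_score(y_true_demo, scores))
```

Here 8 of the 9 pairs are ranked correctly, so both values equal 8/9.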

# Step 5. Testing, Validation & Recommendation

## Recommend New Songs

We recommend songs that:

* Are not in the user's training set

* Have a high predicted probability of being liked

In [None]:
# Step 5: Recommendation generation

def recommend_songs(model, df, features, top_n=10):
    """
    Recommend top N songs based on predicted liking probability
    """
    df = df.copy()

    # Predict probability of liking each song
    df['like_probability'] = model.predict_proba(df[features])[:, 1]

    # Recommend songs with highest probability
    recommendations = (
        df.sort_values('like_probability', ascending=False)
          .head(top_n)
    )

    return recommendations[['like_probability'] + features]

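# Get top 10 recommended songs from the held-out (test) songs only,
# so that nothing the model was trained on is recommended back to the user
candidates = df.loc[X_test.index]
top_recommendations = recommend_songs(model, candidates, features, top_n=10)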

print("Top Recommended Songs:")
print(top_recommendations)

## How This Mimics Spotify

| Spotify Concept        | Your Model Equivalent                              |
|------------------------|----------------------------------------------------|
| User taste             | Learned from liked and disliked songs              |
| Audio features         | Spotify audio features (danceability, energy, etc.)|
| Personalization        | Model trained only on your own listening data      |
| Cold start handling    | Works immediately using song features              |
| Explainability         | Feature coefficients show *why* a song is liked    |
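
To illustrate the explainability row, the sketch below reads the coefficients out of a fitted pipeline. It uses synthetic data, and the four column names are only labels borrowed from the feature list (an assumption for readability), so the values carry no musical meaning. Because the features are standardized first, each coefficient is the change in log-odds of "liked" per one standard deviation of that feature, which makes the magnitudes comparable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the column names are illustrative labels only
feature_names = ["danceability", "energy", "valence", "tempo"]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_demo, y_demo)

# One coefficient per feature; positive pushes toward "liked", negative away
coefs = pd.Series(pipe.named_steps["classifier"].coef_[0], index=feature_names)

# Order by absolute size so the most influential features come first
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

With the model from Step 3, the same lookup would be `model.named_steps['classifier'].coef_[0]` paired with the `features` list.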