# Spotify Recommendation System - Baseline Model

This notebook provides a baseline collaborative filtering model for song recommendations using the Spotify Million Playlist Dataset.

**Model:** Matrix Factorization using Truncated SVD (Singular Value Decomposition).

**Objective:** Given a set of songs from a user's playlist, recommend new songs they might like.

## 1. Setup

Install and import the necessary libraries. `scikit-learn` provides the SVD and nearest-neighbor implementations, `scipy` provides the sparse matrix type, and `tqdm` shows progress bars.

In [None]:
!pip install numpy scipy scikit-learn tqdm

In [None]:
import numpy as np
import json
import os
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from tqdm import tqdm

## 2. Data Loading

**Action Required:** Download the Spotify Million Playlist Dataset and place its `data` directory in the same folder as this notebook, or set `data_path` below to its location.

You can get the data here: [https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)

We'll load a single slice (1,000 playlists) to keep memory usage manageable for this baseline.

In [None]:
def load_playlist_slice(path, slice_num=0):
    """Loads a single JSON slice file (1,000 playlists) from the dataset."""
    # Slice files are named by the playlist ID range they cover, e.g. mpd.slice.0-999.json
    filename = f'mpd.slice.{slice_num*1000}-{(slice_num+1)*1000-1}.json'
    with open(os.path.join(path, filename), encoding='utf-8') as f:
        data = json.load(f)
    return data['playlists']

# CONFIGURATION: Point this to the 'data' directory of the dataset
data_path = './data/' 

# Load the first 1000 playlists as a sample
playlists = load_playlist_slice(data_path, slice_num=0)

print(f"Loaded {len(playlists)} playlists.")

## 3. Data Preprocessing & Train-Test Split

We will now build two structures:
1. `interaction_matrix_train`: A sparse playlist-by-track matrix the model is trained on. For playlists with more than 5 tracks, 20% of the tracks are hidden from it.
2. `test_set`: A dictionary mapping each such playlist's index to its hidden tracks, which we'll use to evaluate our recommendations.
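
The split described above can be sketched on a toy playlist. This is a minimal illustration rather than the notebook's exact code: the helper `split_playlist`, its `holdout_frac`, `min_tracks`, and `seed` parameters, and the made-up URIs are all assumptions introduced here.

```python
import random

def split_playlist(tracks, holdout_frac=0.2, min_tracks=6, seed=0):
    """Return (train, test) track lists. Hypothetical helper for illustration."""
    # Drop duplicate tracks (keeping first-seen order) so that a held-out
    # track cannot also appear in the training part.
    unique = list(dict.fromkeys(tracks))
    if len(unique) < min_tracks:
        return unique, []  # too short: keep everything for training
    shuffled = unique[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_holdout = int(len(shuffled) * holdout_frac)
    return shuffled[:-n_holdout], shuffled[-n_holdout:]

toy = [f"spotify:track:{i}" for i in range(10)]  # made-up URIs
train, test = split_playlist(toy)
print(len(train), len(test))
```

Deduplicating before the split and seeding the shuffle are design choices of this sketch: the first avoids leaking a held-out track into training, the second makes repeated runs comparable.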

In [None]:
all_tracks = []
for p in playlists:
    all_tracks.extend([track['track_uri'] for track in p['tracks']])

unique_tracks = sorted(set(all_tracks))
track_to_idx = {track: i for i, track in enumerate(unique_tracks)}
idx_to_track = {i: track for track, i in track_to_idx.items()}
pid_to_idx = {p['pid']: i for i, p in enumerate(playlists)}

n_playlists = len(playlists)
n_tracks = len(unique_tracks)

print(f"Number of unique playlists (users): {n_playlists}")
print(f"Number of unique tracks (items): {n_tracks}")

# Create training matrix and test set
rows, cols = [], []
test_set = {}
rng = np.random.default_rng(42)  # Fixed seed so the split is reproducible

for p in playlists:
    playlist_idx = pid_to_idx[p['pid']]
    # Deduplicate (keeping order) so a held-out track cannot also appear in training
    tracks = list(dict.fromkeys(t['track_uri'] for t in p['tracks']))
    
    # For playlists with enough tracks, hold out some for testing
    if len(tracks) > 5:
        rng.shuffle(tracks)
        num_holdout = int(len(tracks) * 0.2) # Hold out 20% of tracks (at least 1, since len > 5)
        train_tracks = tracks[:-num_holdout]
        test_tracks = tracks[-num_holdout:]
        test_set[playlist_idx] = test_tracks
    else:
        train_tracks = tracks
        
    for track_uri in train_tracks:
        rows.append(playlist_idx)
        cols.append(track_to_idx[track_uri])

values = np.ones(len(rows), dtype=np.float32)
interaction_matrix_train = csr_matrix((values, (rows, cols)), shape=(n_playlists, n_tracks))

print(f"Training matrix created with shape: {interaction_matrix_train.shape}")
print(f"Test set created for {len(test_set)} playlists.")

## 4. Model Training (Matrix Factorization)

We train the SVD model on the **training matrix** only.

**Hyperparameter to Tweak:**
- `n_components`: The number of latent factors (embedding size). A larger number can capture more complex patterns but may overfit. Common values are between 50 and 200.
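
To make "latent factors" concrete, here is a small sketch that factorizes a toy playlist-by-track matrix with NumPy's full SVD and keeps the top `k` factors. The matrix values and the names `X`, `k`, `playlist_emb`, and `track_emb` are made up for illustration. `TruncatedSVD` computes an approximation of the same decomposition without forming the full one, so its embeddings may differ in sign and numerical detail.

```python
import numpy as np

# Toy playlist-by-track matrix (3 playlists, 4 tracks); values are made up.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=np.float32)

k = 2  # number of latent factors kept (plays the role of n_components)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

playlist_emb = U[:, :k] * s[:k]  # one k-dimensional row per playlist
track_emb = Vt[:k].T             # one k-dimensional row per track

# Rank-k reconstruction of the interaction matrix
X_hat = playlist_emb @ track_emb.T
print(playlist_emb.shape, track_emb.shape)
print(np.round(X_hat, 2))
```

A larger `k` reproduces `X` more closely, which is the sense in which more components can capture more detail and, eventually, noise.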

In [None]:
# HYPERPARAMETER
N_COMPONENTS = 100 # Embedding size

svd = TruncatedSVD(n_components=N_COMPONENTS, random_state=42)

# Create playlist (user) embeddings
playlist_embeddings = svd.fit_transform(interaction_matrix_train)

# Create track (item) embeddings
track_embeddings = svd.components_.T

print(f"Playlist embeddings shape: {playlist_embeddings.shape}")
print(f"Track embeddings shape: {track_embeddings.shape}")

## 5. Generating Recommendations

To recommend for a playlist, we average the embeddings of the tracks in its *training* set and return the nearest tracks (by cosine distance) that are not already in the playlist.
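
The idea can be sketched without the fitted model, using plain NumPy and made-up 2-D embeddings. The function `recommend` and the array `track_emb` below are illustrative assumptions; the notebook itself uses `NearestNeighbors` with `metric='cosine'` for the lookup.

```python
import numpy as np

# Made-up 2-D embeddings for five tracks (one row per track).
track_emb = np.array([
    [1.0, 0.0],   # track 0
    [0.9, 0.1],   # track 1
    [0.8, 0.3],   # track 2
    [0.0, 1.0],   # track 3
    [-1.0, 0.0],  # track 4
])

def recommend(input_idx, n_recs=2):
    """Rank tracks by cosine similarity to the mean of the input embeddings."""
    query = track_emb[input_idx].mean(axis=0)
    norms = np.linalg.norm(track_emb, axis=1) * np.linalg.norm(query)
    sims = track_emb @ query / norms
    ranked = np.argsort(-sims)  # most similar first
    seen = set(input_idx)
    # Skip tracks that are already in the input playlist
    return [int(i) for i in ranked if int(i) not in seen][:n_recs]

print(recommend([0, 1]))  # -> [2, 3]
```

Track 2 ranks first because it points in nearly the same direction as the average of tracks 0 and 1; track 4 points the opposite way and ranks last.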

In [None]:
# Fit a KNN model on the track embeddings for fast lookups
knn = NearestNeighbors(n_neighbors=20, metric='cosine', algorithm='brute')
knn.fit(track_embeddings)

def recommend_from_playlist(playlist_idx, train_matrix, n_recs=10):
    """Generates song recommendations for a given playlist index."""
    
    # Get the indices of tracks in the user's training playlist
    input_track_indices = train_matrix[playlist_idx].indices
    
    if len(input_track_indices) == 0:
        return []
    
    # Calculate the average embedding for the input playlist
    playlist_vector = np.mean(track_embeddings[input_track_indices], axis=0)
    
    # Request enough neighbors that n_recs remain after dropping the input tracks,
    # without asking for more neighbors than there are tracks
    n_neighbors = min(n_recs + len(input_track_indices), track_embeddings.shape[0])
    distances, indices = knn.kneighbors(playlist_vector.reshape(1, -1), n_neighbors=n_neighbors)
    
    # Build the exclusion set once instead of rebuilding a list per candidate
    input_set = set(input_track_indices.tolist())
    
    recommendations = []
    for idx in indices.flatten():
        # Add to recommendations if it's not in the original input
        if int(idx) not in input_set:
            recommendations.append(idx_to_track[int(idx)])
    
    return recommendations[:n_recs]


### Test the Recommendation Function

Let's see what the model recommends for a sample playlist from our dataset.

In [None]:
test_playlist_idx = list(test_set.keys())[0]

print(f"Recommendations for playlist index {test_playlist_idx}:")
recs = recommend_from_playlist(test_playlist_idx, interaction_matrix_train, n_recs=10)
for i, uri in enumerate(recs):
    print(f"{i+1}. {uri}")

print("\nHeld-out songs (ground truth):")
for i, uri in enumerate(test_set[test_playlist_idx]):
    print(f"{i+1}. {uri}")

## 6. Evaluating the Model

Now we'll formalize the evaluation using Precision@k and Recall@k. We calculate these metrics for every playlist in the test set and then average the results.

- **Precision@k**: What proportion of our top-k recommendations are relevant (i.e., in the holdout set)?
- **Recall@k**: What proportion of the relevant items in the holdout set did we successfully recommend in our top-k?
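
A small worked example may help before the full evaluation. The lists and the helper `precision_recall_at_k` below are made up for illustration. With five recommendations, two of which are among three held-out tracks, precision is 2/5 and recall is 2/3.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one playlist. Illustrative helper."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["a", "b", "c", "d", "e"]  # ranked recommendations (made up)
relevant = ["b", "e", "z"]               # held-out tracks (made up)

p, r = precision_recall_at_k(recommended, relevant, k=5)
print(f"precision@5 = {p:.2f}, recall@5 = {r:.2f}")  # precision@5 = 0.40, recall@5 = 0.67
```

Note that precision@k can never exceed `len(relevant) / k`, so playlists with fewer than `k` held-out tracks cap the achievable average.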

In [None]:
def precision_at_k(k, recommendations, holdout_items):
    recs_at_k = recommendations[:k]
    hits = len(set(recs_at_k) & set(holdout_items))
    return hits / k

def recall_at_k(k, recommendations, holdout_items):
    recs_at_k = recommendations[:k]
    hits = len(set(recs_at_k) & set(holdout_items))
    return hits / len(holdout_items) if len(holdout_items) > 0 else 0

def evaluate_model(k=10):
    avg_precision = 0
    avg_recall = 0
    test_user_count = len(test_set)

    for user_idx in tqdm(test_set.keys(), desc="Evaluating"):
        recommendations = recommend_from_playlist(user_idx, interaction_matrix_train, n_recs=k)
        holdout = test_set[user_idx]
        
        avg_precision += precision_at_k(k, recommendations, holdout)
        avg_recall += recall_at_k(k, recommendations, holdout)

    avg_precision /= test_user_count
    avg_recall /= test_user_count

    print(f"\nEvaluation Results (k={k}):")
    print(f"- Average Precision@{k}: {avg_precision:.4f}")
    print(f"- Average Recall@{k}: {avg_recall:.4f}")

# Run the evaluation
evaluate_model(k=10)

## 7. Next Steps & Tweaking

To improve your model, you can now:

1.  **Use More Data:** Train on more slices of the dataset and see how the evaluation metrics change.
2.  **Tune Hyperparameters:** Change `N_COMPONENTS` and `k` in the evaluation to see their effect on precision and recall. Which `k` gives the best balance?
3.  **Try Different Models:** The `implicit` library is a great next step for a more advanced ALS model, which is often better for this type of implicit feedback data.
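
For the first item, one possible way to load several slices at once is sketched below. The helper name `load_slices` and its `n_slices` parameter are assumptions introduced here; the file naming pattern is the one used by the loader in Section 2.

```python
import json
import os

def load_slices(path, n_slices):
    """Concatenate the playlists from the first n_slices slice files.

    Hypothetical helper; assumes files are named like mpd.slice.0-999.json,
    mpd.slice.1000-1999.json, and so on.
    """
    playlists = []
    for i in range(n_slices):
        filename = f"mpd.slice.{i * 1000}-{(i + 1) * 1000 - 1}.json"
        with open(os.path.join(path, filename), encoding="utf-8") as f:
            playlists.extend(json.load(f)["playlists"])
    return playlists
```

Calling `load_slices(data_path, n_slices=5)` would then stand in for the single-slice call in Section 2. Memory use and the number of unique tracks grow with each added slice, so it is worth increasing `n_slices` gradually.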