# Collaborative Filtering with LightFM

This notebook implements a **collaborative filtering (CF)** approach using LightFM to leverage historical interaction data.

## Why Collaborative Filtering?
- Discovers hidden patterns from user-program interactions
- Can recommend programs based on what similar users liked
- Complements content-based filtering for a hybrid system
- Effective for personalization as interaction data grows

## Why LightFM?
- **Hybrid architecture**: Supports both CF and content features
- **WARP loss**: Optimizes ranking directly, which suits top-k recommendations
- **Implicit feedback**: Works with like/view data (not just ratings)
- **Cold-start handling**: Can incorporate user/item features
- **Speed**: Model training is multithreaded, which keeps it fast as the data grows

## Import Libraries
LightFM provides matrix factorization with ranking-oriented loss functions such as WARP and BPR.

In [None]:
import pandas as pd
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset
import joblib


In [None]:
users = pd.read_csv("../data/raw/users.csv")
programs = pd.read_csv("../data/raw/programs.csv")
interactions = pd.read_csv("../data/raw/interactions.csv")


## Load Data
Load users, programs, and historical interactions (likes) generated in notebook 01.

In [None]:
dataset = Dataset()
dataset.fit(
    users=users["user_id"],
    items=programs["program_id"]
)


## Initialize LightFM Dataset
Create a dataset object that maps user and program IDs to internal indices.

This step establishes the vocabulary for both users and items before building the interaction matrix.
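
As a rough illustration of what "mapping IDs to internal indices" means, the sketch below assigns each distinct ID the next free integer. The IDs and the `build_index` helper are hypothetical and only mimic the idea; they are not LightFM's own code.

```python
# Hypothetical external IDs; the real ones come from users.csv and programs.csv.
user_ids = ["u_0", "u_1", "u_2"]
program_ids = ["p_0", "p_1"]


def build_index(ids):
    """Assign each distinct ID the next free integer, in first-seen order."""
    index = {}
    for external_id in ids:
        # len(index) is evaluated before insertion, so indices start at 0.
        index.setdefault(external_id, len(index))
    return index


user_index = build_index(user_ids)
program_index = build_index(program_ids)

print(user_index)     # {'u_0': 0, 'u_1': 1, 'u_2': 2}
print(program_index)  # {'p_0': 0, 'p_1': 1}
```

Contiguous integer indices are what allow users and programs to be addressed as rows and columns of a matrix.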

In [None]:
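# build_interactions returns (interactions, weights); only the first is used here.
(interactions_matrix, _) = dataset.build_interactions(
    interactions[["user_id", "program_id", "interaction"]]
    .itertuples(index=False, name=None)  # plain tuples, faster than iterrows()
)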


## Build Interaction Matrix
Convert interaction data into a sparse user-item matrix format required by LightFM.

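**Format:** (user_id, program_id, interaction_value)
- Interaction value = 1 indicates a positive signal (liked/viewed)
- `build_interactions` treats the third element as a weight and returns it in a separate weights matrix, which this notebook discards, so every listed pair counts as a positive interaction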
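
To make the format concrete, here is a minimal sketch that turns a few triples into a user-item matrix. The triples and index dictionaries are made up for illustration, and a dense NumPy array is used only for readability; LightFM itself works with SciPy sparse matrices.

```python
import numpy as np

# Hypothetical triples in the (user_id, program_id, interaction_value) format.
triples = [("u_0", "p_1", 1), ("u_1", "p_0", 1), ("u_1", "p_1", 1)]

# Hypothetical ID-to-index mappings (rows are users, columns are programs).
user_index = {"u_0": 0, "u_1": 1}
program_index = {"p_0": 0, "p_1": 1, "p_2": 2}

# Dense array for readability only; real interaction matrices are sparse.
matrix = np.zeros((len(user_index), len(program_index)), dtype=np.int32)
for user_id, program_id, value in triples:
    matrix[user_index[user_id], program_index[program_id]] = value

print(matrix)
# [[0 1 0]
#  [1 1 0]]
```

Program `p_2` has no interactions, so its column stays at zero; with real data most of the matrix looks like this, which is why a sparse format is used.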

In [None]:
model = LightFM(
    no_components=10,
    loss="warp",
    random_state=42
)

model.fit(
    interactions_matrix,
    epochs=20,
    num_threads=4
)


## Train LightFM Model

**Hyperparameters:**
- `no_components=10`: Latent factor dimension (user/item embeddings)
  - Lower = faster, less overfitting
  - Higher = more capacity to learn complex patterns
- `loss="warp"`: **WARP** (Weighted Approximate-Rank Pairwise)
  - Optimizes for **ranking quality** (top-k recommendations)
  - Targets precision at the top of the list, whereas BPR targets ROC AUC and logistic loss expects both positive and negative interactions
  - Focuses on getting the most relevant items at the top
- `epochs=20`: Number of training passes over the interaction data
- `num_threads=4`: Number of parallel training threads (takes effect only if LightFM was built with OpenMP support)
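
The core idea behind WARP can be sketched in a few lines: for one positive item, draw negatives until one scores within a margin of the positive, then weight the update by how quickly that violation was found. The function below is a simplified illustration, not LightFM's implementation. The name `warp_sample`, the margin of 1.0, the log-of-rank weight, sampling without replacement, and treating every other item as a negative are all assumptions made for the sketch.

```python
import numpy as np


def warp_sample(scores, positive, rng, margin=1.0):
    """Sample negatives for one positive item until one violates the margin.

    Returns (negative_item, weight), or None if no negative violates it.
    """
    n_items = len(scores)
    # Simplification: every other item is treated as a negative.
    negatives = rng.permutation([i for i in range(n_items) if i != positive])
    for n_sampled, negative in enumerate(negatives, start=1):
        if scores[negative] > scores[positive] - margin:
            # A violation found after few samples suggests the positive item
            # is ranked low, so the estimated rank (and the weight) is large.
            estimated_rank = (n_items - 1) // n_sampled
            weight = float(np.log(max(1.0, estimated_rank)))
            return int(negative), weight
    return None


rng = np.random.default_rng(42)
scores = np.array([0.2, 3.0, 0.1, 0.4, 0.3, 0.5])  # hypothetical model scores

# Item 1 beats every other item by more than the margin: nothing to correct.
print(warp_sample(scores, positive=1, rng=rng))  # None

# Item 2 has the lowest score, so the very first sampled negative violates.
print(warp_sample(scores, positive=2, rng=rng))
```

Because well-ranked positives produce no update and badly ranked ones produce large updates, training effort concentrates on the top of the list.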

**Why 10 components?**
With only 6 programs and 100 users, 10 dimensions already provides ample capacity. A smaller value would likely capture the same patterns and further reduce the risk of overfitting.
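
A rough parameter count helps put this choice in context. The arithmetic below assumes no side features, so each user and each program gets one embedding vector and one bias term.

```python
n_users, n_programs, no_components = 100, 6, 10

# One embedding vector plus one bias per user and per program
# (assumes no side features, so each ID has its own row).
embedding_params = (n_users + n_programs) * no_components
bias_params = n_users + n_programs
total_params = embedding_params + bias_params

# Upper bound on observed interactions: every user liking every program.
max_interactions = n_users * n_programs

print(total_params)      # 1166
print(max_interactions)  # 600
```

The model has more parameters than there can be interactions, so it is worth checking held-out performance before increasing `no_components`.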

In [None]:
user_id_map, _, item_id_map, _ = dataset.mapping()
reverse_item_map = {v: k for k, v in item_id_map.items()}


## Extract ID Mappings
Retrieve the internal index mappings to convert between external IDs (e.g., "u_0", "p_1") and internal indices.

This is needed to generate predictions and convert results back to original IDs.

In [None]:
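def recommend_cf(user_id, k=3):
    """Return the top-k program IDs for a known user."""
    # Raises KeyError for users absent from the fitted dataset.
    user_idx = user_id_map[user_id]
    # Score every program for this user.
    scores = model.predict(
        user_idx,
        np.arange(len(item_id_map), dtype=np.int32)
    )
    # Highest scores first; programs the user already liked are not filtered out.
    top_items = np.argsort(scores)[::-1][:k]
    return [reverse_item_map[i] for i in top_items]


recommend_cf("u_0")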


## Recommendation Function
Generate top-k program recommendations for a given user based on CF predictions.

**Process:**
1. Convert user ID to internal index
2. Predict scores for all programs
3. Sort by score (descending)
4. Return top-k program IDs

**Note:** Unlike content-based filtering, which scores programs by feature similarity, CF scores them using latent factors learned from interactions.
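
The scoring and ranking steps can be sketched with plain NumPy. A score is the dot product of a user vector and an item vector plus bias terms. The vectors below are random stand-ins for learned parameters, and `top_k` with its `exclude` argument is a hypothetical helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
no_components, n_programs = 10, 6

# Random stand-ins for learned parameters; a trained model would supply these.
user_vector = rng.normal(size=no_components)
item_vectors = rng.normal(size=(n_programs, no_components))
user_bias = 0.0
item_biases = rng.normal(size=n_programs)

# Score = dot product of latent factors plus bias terms.
scores = item_vectors @ user_vector + item_biases + user_bias


def top_k(scores, k=3, exclude=()):
    """Return indices of the k highest scores, skipping excluded indices."""
    excluded = set(exclude)
    order = np.argsort(scores)[::-1]  # highest score first
    return [int(i) for i in order if int(i) not in excluded][:k]


print(top_k(scores))               # three best program indices for this user
print(top_k(scores, exclude=[0]))  # same, but never recommending program 0
```

Passing the indices of programs a user has already liked as `exclude` is one way to avoid recommending them again.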

In [None]:
joblib.dump(model, "../models/lightfm.pkl")
joblib.dump(dataset, "../models/lightfm_dataset.pkl")


## Save Models
Persist the trained LightFM model and dataset mappings for deployment.

**Why save the dataset?**
The dataset contains the ID mappings needed to convert between external user/program IDs and internal indices during inference.
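
The save-and-reload round trip can be sketched as follows. A plain dictionary of made-up ID mappings stands in for the fitted dataset, and a temporary directory replaces `../models/`, so the example runs without a trained model.

```python
import os
import tempfile

import joblib

# Stand-in for the fitted Dataset: a plain dict of hypothetical ID mappings.
artifact = {
    "user_id_map": {"u_0": 0, "u_1": 1},
    "item_id_map": {"p_0": 0, "p_1": 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mappings.pkl")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# After reloading, external IDs resolve to the same internal indices.
print(restored["user_id_map"]["u_1"])  # 1
```

At inference time the same pattern applies to the real files: load the model and the dataset, rebuild the mappings, then look up indices before predicting.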