## Classical Deep Learning for Recommendation

Exploration of two neural recommender models on MovieLens 1M:
- **NeuralMF** (explicit rating prediction)
- **TwoTowerBPR** (implicit pairwise ranking)

Both models use the same temporal data split and evaluation metrics as the classical MF/BPR baselines in this repository.

In [1]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

In [2]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

from src.data_reading import read_ratings_file
from src.evaluation import (
    temporal_split,
    evaluate_rmse,
    evaluate_mae,
    evaluate_precision_at_k,
    evaluate_recall_at_k,
    evaluate_ndcg_at_k,
)
from src.models.neural_mf import NeuralMF
from src.models.two_tower_bpr import TwoTowerBPR

np.random.seed(42)
torch.manual_seed(42)


## Data loading and temporal split

We follow the same **temporal split** strategy as the ALS experiment:
train on the oldest interactions, validate on the middle slice, and test on
the most recent interactions to mimic realistic deployment conditions.
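
As a rough illustration of the idea, the sketch below performs a global chronological split on a toy frame. The real `temporal_split` lives in `src.evaluation` and is not shown in this notebook, so the function name `temporal_split_sketch`, the `timestamp` column name, and the truncating size computation are all assumptions.

```python
import pandas as pd


def temporal_split_sketch(ratings: pd.DataFrame, test_ratio: float = 0.2,
                          val_ratio: float = 0.1, time_col: str = "timestamp"):
    """Hypothetical global chronological split: oldest rows train, newest rows test."""
    ordered = ratings.sort_values(time_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    n_test = int(n * test_ratio)   # assumption: sizes are truncated, not rounded
    n_val = int(n * val_ratio)
    n_train = n - n_val - n_test
    train = ordered.iloc[:n_train]
    val = ordered.iloc[n_train:n_train + n_val]
    test = ordered.iloc[n_train + n_val:]
    return train, val, test


# Toy data with shuffled timestamps (made up for illustration).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 1, 2, 3, 1],
    "movie_id": [10, 11, 10, 12, 11, 12, 12, 11, 10, 13],
    "rating": [5, 3, 4, 2, 5, 4, 1, 3, 4, 5],
    "timestamp": [5, 1, 9, 3, 7, 2, 8, 4, 6, 10],
})
train_s, val_s, test_s = temporal_split_sketch(toy)
print(len(train_s), len(val_s), len(test_s))
```

Because the split is global rather than per user, some users and movies in the later slices never appear in the training slice, which is why the cell below filters `val` and `test` down to known IDs.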

In [None]:
ratings = read_ratings_file()
print(f"Loaded {len(ratings)} ratings")

train, val, test = temporal_split(ratings, test_ratio=0.2, val_ratio=0.1)
train_users = train.user_id.unique()
train_movies = train.movie_id.unique()

val = val[(val.user_id.isin(train_users)) & (val.movie_id.isin(train_movies))]
test = test[(test.user_id.isin(train_users)) & (test.movie_id.isin(train_movies))]

print("Train size:", train.shape)
print("Val size:", val.shape)
print("Test size:", test.shape)

Loaded 1000209 ratings
Train set size: (700148, 4)
Validation set size: (100020, 4)
Test set size: (200041, 4)
Train timeframe: 2000-04-25 23:05:32 - 2000-11-22 03:06:26
Val timeframe: 2000-11-22 03:06:30 - 2000-12-02 14:52:18
Test timeframe: 2000-12-02 14:52:28 - 2003-02-28 17:49:50
Train size: (700148, 4)
Val size: (26246, 4)
Test size: (84804, 4)


## ID reindexing and PyTorch datasets

We map `user_id` and `movie_id` to contiguous integer indices for efficient
embedding lookups and define simple `Dataset` wrappers for training.
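
A small toy example (made-up IDs, hypothetical `*_toy` names) may help show what the mapping does: indices follow order of first appearance in the training data, and any ID unseen in training maps to `NaN`, which is why the cell below drops missing values and casts back to integers.

```python
import pandas as pd

train_toy = pd.DataFrame({"user_id": [10, 10, 42, 7], "movie_id": [500, 3, 3, 99]})
test_toy = pd.DataFrame({"user_id": [42, 8], "movie_id": [500, 3]})

# Contiguous indices in order of first appearance: 10 -> 0, 42 -> 1, 7 -> 2.
user_map_toy = {uid: idx for idx, uid in enumerate(train_toy["user_id"].unique())}

# User 8 never appears in training, so its index is NaN and the column becomes float.
test_toy = test_toy.assign(u_idx=test_toy["user_id"].map(user_map_toy))

# Drop unmapped rows, then restore an integer dtype suitable for embedding lookups.
kept = test_toy.dropna(subset=["u_idx"]).astype({"u_idx": "int64"})
print(kept)
```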

In [None]:

user_ids = train['user_id'].unique()
item_ids = train['movie_id'].unique()

user_map = {uid: i for i, uid in enumerate(user_ids)}
item_map = {mid: i for i, mid in enumerate(item_ids)}

# Copy first so the new columns are not written into a slice of `ratings`.
train = train.copy()
train['u_idx'] = train['user_id'].map(user_map)
train['i_idx'] = train['movie_id'].map(item_map)
val = val.copy()
val['u_idx'] = val['user_id'].map(user_map)
val['i_idx'] = val['movie_id'].map(item_map)
test = test.copy()
test['u_idx'] = test['user_id'].map(user_map)
test['i_idx'] = test['movie_id'].map(item_map)

train = train.dropna(subset=['u_idx', 'i_idx'])
val = val.dropna(subset=['u_idx', 'i_idx'])
test = test.dropna(subset=['u_idx', 'i_idx'])

n_users = len(user_map)
n_items = len(item_map)
print(f"n_users={n_users}, n_items={n_items}")


class ExplicitRatingsDataset(Dataset):
    def __init__(self, df):
        self.u = df['u_idx'].astype(int).values
        self.i = df['i_idx'].astype(int).values
        self.r = df['rating'].astype(float).values

    def __len__(self):
        return len(self.r)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.u[idx], dtype=torch.long),
            torch.tensor(self.i[idx], dtype=torch.long),
            torch.tensor(self.r[idx], dtype=torch.float32),
        )

n_users=4870, n_items=3633


## Model 1: NeuralMF (explicit rating prediction)

A non-linear generalization of matrix factorization that replaces the
single dot-product with an MLP over concatenated user/item embeddings.
We optimize **MSE loss** on explicit ratings and evaluate with RMSE/MAE
and ranking metrics (Precision@K, Recall@K, NDCG@K).
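
The actual `NeuralMF` class is imported from `src.models.neural_mf` and is not shown here. The following is only a minimal sketch of the architecture described above, assuming ReLU activations and a single linear output unit; the class name `NeuralMFSketch` and its attribute names are hypothetical, while the constructor arguments mirror the call used in the training cell.

```python
import torch
from torch import nn


class NeuralMFSketch(nn.Module):
    """Hypothetical MLP over concatenated user/item embeddings."""

    def __init__(self, n_users, n_items, embed_dim=32, hidden_dims=(64, 32), dropout=0.1):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, embed_dim)
        self.item_emb = nn.Embedding(n_items, embed_dim)

        layers = []
        in_dim = 2 * embed_dim  # user and item embeddings are concatenated
        for hidden in hidden_dims:
            layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))  # scalar rating prediction
        self.mlp = nn.Sequential(*layers)

    def forward(self, u, i):
        x = torch.cat([self.user_emb(u), self.item_emb(i)], dim=-1)
        # Squeeze to shape (batch,) so predictions line up with the rating tensor.
        return self.mlp(x).squeeze(-1)


sketch = NeuralMFSketch(n_users=5, n_items=7)
u = torch.tensor([0, 1, 4])
i = torch.tensor([6, 2, 0])
preds = sketch(u, i)
print(preds.shape)
```

Replacing the MLP with an element-wise product followed by a sum would recover the plain dot-product score of classical MF, which is the sense in which this model generalizes it.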

In [None]:
batch_size = 4096
lr = 1e-3
n_epochs = 5

train_ds = ExplicitRatingsDataset(train)
val_ds = ExplicitRatingsDataset(val)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_nmf = NeuralMF(n_users=n_users, n_items=n_items, embed_dim=32, hidden_dims=(64, 32), dropout=0.1)
model_nmf.to(device)

optimizer = torch.optim.Adam(model_nmf.parameters(), lr=lr)
criterion = torch.nn.MSELoss()

train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)

for epoch in range(1, n_epochs + 1):
    model_nmf.train()
    total_loss = 0.0
    for u, i, r in train_loader:
        u = u.to(device)
        i = i.to(device)
        r = r.to(device)

        optimizer.zero_grad()
        preds = model_nmf(u, i)
        loss = criterion(preds, r)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(r)

    avg_train_loss = total_loss / len(train_ds)

    model_nmf.eval()
    with torch.no_grad():
        sq_errors = []
        for u, i, r in val_loader:
            u = u.to(device)
            i = i.to(device)
            r = r.to(device)
            preds = model_nmf(u, i)
            sq_errors.append((preds - r).pow(2).cpu().numpy())
        val_rmse = np.sqrt(np.concatenate(sq_errors).mean())

    print(f"Epoch {epoch}/{n_epochs} - train MSE: {avg_train_loss:.4f}, val RMSE: {val_rmse:.4f}")

Epoch 1/5 - train MSE: 3.3552, val RMSE: 0.9555
Epoch 2/5 - train MSE: 0.9673, val RMSE: 0.9450
Epoch 3/5 - train MSE: 0.9548, val RMSE: 0.9424
Epoch 4/5 - train MSE: 0.9413, val RMSE: 0.9385
Epoch 5/5 - train MSE: 0.9282, val RMSE: 0.9362


In [None]:
model_nmf.eval()

def nmf_predict_rating(user_id: int, movie_id: int, model, user_map, item_map) -> float:
    if user_id not in user_map or movie_id not in item_map:
        return float("nan")

    u_idx = torch.tensor([user_map[user_id]], dtype=torch.long, device=device)
    i_idx = torch.tensor([item_map[movie_id]], dtype=torch.long, device=device)
    with torch.no_grad():
        pred = model(u_idx, i_idx).cpu().item()
    return float(pred)

rmse_nmf = evaluate_rmse(test, nmf_predict_rating, model=model_nmf, user_map=user_map, item_map=item_map)
mae_nmf = evaluate_mae(test, nmf_predict_rating, model=model_nmf, user_map=user_map, item_map=item_map)

print(f"NeuralMF - Test RMSE: {rmse_nmf:.4f}")
print(f"NeuralMF - Test MAE:  {mae_nmf:.4f}")

NeuralMF - Test RMSE: 0.9371
NeuralMF - Test MAE:  0.7358


In [None]:
train_user_items = train.groupby('user_id')['movie_id'].apply(set).to_dict()
all_candidate_items = np.array(list(item_map.keys()))

def nmf_recommend_k(
    user_id: int,
    k: int,
    model,
    user_map,
    item_map,
    test: pd.DataFrame | None = None,
) -> np.ndarray:
    if user_id not in user_map:
        return np.array([], dtype=int)

    u_idx = user_map[user_id]

    if test is not None:
        candidate_items = np.intersect1d(test.movie_id.unique(), all_candidate_items)
    else:
        candidate_items = all_candidate_items

    if len(candidate_items) == 0:
        return np.array([], dtype=int)

    seen_items = train_user_items.get(user_id, set())
    candidate_items = np.setdiff1d(candidate_items, np.array(list(seen_items), dtype=int))
    if len(candidate_items) == 0:
        return np.array([], dtype=int)

    item_indices = [item_map[mid] for mid in candidate_items if mid in item_map]
    if not item_indices:
        return np.array([], dtype=int)

    u_tensor = torch.full((len(item_indices),), u_idx, dtype=torch.long, device=device)
    i_tensor = torch.tensor(item_indices, dtype=torch.long, device=device)

    model.eval()
    with torch.no_grad():
        scores = model(u_tensor, i_tensor).cpu().numpy()

    top_k_idx = np.argsort(scores)[::-1][:k]
    return candidate_items[top_k_idx]


prec_nmf = evaluate_precision_at_k(test, nmf_recommend_k, k=10, model=model_nmf, user_map=user_map, item_map=item_map)
recall_nmf = evaluate_recall_at_k(test, nmf_recommend_k, k=10, model=model_nmf, user_map=user_map, item_map=item_map)
ndcg_nmf = evaluate_ndcg_at_k(test, nmf_recommend_k, k=10, model=model_nmf, user_map=user_map, item_map=item_map)

print(f"NeuralMF - Precision@10: {prec_nmf:.4f}")
print(f"NeuralMF - Recall@10:    {recall_nmf:.4f}")
print(f"NeuralMF - NDCG@10:      {ndcg_nmf:.4f}")

NeuralMF - Precision@10: 0.0252
NeuralMF - Recall@10:    0.0044
NeuralMF - NDCG@10:      0.0160
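
For reference, the three ranking metrics can be computed per user roughly as sketched below. The `evaluate_*_at_k` helpers in `src.evaluation` are not shown in this notebook, so the function name `precision_recall_ndcg_at_k`, the binary relevance assumption, and the log2 discount are assumptions rather than the repository's exact definitions.

```python
import numpy as np


def precision_recall_ndcg_at_k(recommended, relevant, k=10):
    """Hypothetical per-user metrics with binary relevance."""
    recommended = list(recommended)[:k]
    relevant = set(relevant)
    hits = np.array([1.0 if item in relevant else 0.0 for item in recommended])

    precision = hits.sum() / k
    recall = hits.sum() / len(relevant) if relevant else 0.0

    # DCG discounts a hit at rank r (1-based) by 1 / log2(r + 1).
    discounts = 1.0 / np.log2(np.arange(2, len(hits) + 2))
    dcg = float((hits * discounts).sum())

    # Ideal DCG places all relevant items (up to k) at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = float((1.0 / np.log2(np.arange(2, ideal_hits + 2))).sum())
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return float(precision), float(recall), ndcg


# Two of the four recommended items (9 and 1) are relevant; item 7 is missed.
p, r, n = precision_recall_ndcg_at_k([3, 9, 5, 1], relevant={9, 1, 7}, k=4)
print(p, r, n)
```

Averaging these per-user values over all test users would give dataset-level figures comparable in spirit to the ones printed above.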


## Model 2: TwoTowerBPR (implicit pairwise ranking)

A two-tower embedding model optimized with a **BPR-style pairwise loss** on
implicit positive feedback (ratings ≥ 4). Negatives are sampled uniformly from
all items the user has not rated ≥ 4, so they include both unobserved and
low-rated items. This directly optimizes ranking quality instead of absolute
rating accuracy.
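
The `TwoTowerBPR` class comes from `src.models.two_tower_bpr` and is not shown here. As a hedged sketch of what a BPR-style loss looks like, the hypothetical `TwoTowerBPRSketch` below scores pairs with a dot product and minimizes `-log sigmoid(s_ui - s_uj)` plus an L2 penalty on the embeddings in the batch; the exact form of the regularizer in the real model is an assumption.

```python
import torch
from torch import nn
import torch.nn.functional as F


class TwoTowerBPRSketch(nn.Module):
    """Hypothetical embedding towers with a dot-product score and BPR loss."""

    def __init__(self, n_users, n_items, embed_dim=32, l2_reg=1e-4):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, embed_dim)
        self.item_emb = nn.Embedding(n_items, embed_dim)
        self.l2_reg = l2_reg

    def score(self, u, i):
        return (self.user_emb(u) * self.item_emb(i)).sum(dim=-1)

    def bpr_loss(self, u, i, j):
        # Positive item i should score higher than sampled negative item j.
        diff = self.score(u, i) - self.score(u, j)
        ranking = -F.logsigmoid(diff).mean()
        # Assumed regularizer: mean squared norm of the embeddings in the batch.
        reg = (
            self.user_emb(u).pow(2).sum(-1)
            + self.item_emb(i).pow(2).sum(-1)
            + self.item_emb(j).pow(2).sum(-1)
        ).mean()
        return ranking + self.l2_reg * reg


torch.manual_seed(0)
model = TwoTowerBPRSketch(n_users=4, n_items=6, embed_dim=8)
u = torch.tensor([0, 1, 2])
i = torch.tensor([1, 3, 5])
j = torch.tensor([2, 0, 4])
loss = model.bpr_loss(u, i, j)
loss.backward()
print(loss.item())
```

With uninformative embeddings (all scores equal) the ranking term is `log 2 ≈ 0.693`, which is a useful chance-level reference when reading a BPR loss curve.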

In [None]:
# Treat ratings >= 4 as implicit positives; copy so we do not write into a slice of `train`.
train_pos = train[train['rating'] >= 4].copy()
train_pos['u_idx'] = train_pos['user_id'].map(user_map)
train_pos['i_idx'] = train_pos['movie_id'].map(item_map)
train_pos = train_pos.dropna(subset=['u_idx', 'i_idx'])

user_pos_items = train_pos.groupby('u_idx')['i_idx'].apply(lambda x: list(set(x))).to_dict()


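class BPRTripletDataset(Dataset):
    """Samples (user, positive item, negative item) triplets on the fly."""

    def __init__(self, n_items: int, user_pos_items: dict[int, list[int]], n_samples: int = 1):
        self.n_items = n_items
        self.user_pos_items = user_pos_items
        # Sets give O(1) membership checks during negative rejection sampling.
        self.user_pos_sets = {u: set(items) for u, items in user_pos_items.items()}
        self.users = list(user_pos_items.keys())
        self.n_samples = n_samples

    def __len__(self):
        # One epoch draws as many triplets as there are positive interactions.
        return sum(len(items) for items in self.user_pos_items.values()) * self.n_samples

    def __getitem__(self, idx):
        # `idx` is ignored: every call draws a fresh random triplet.
        u = np.random.choice(self.users)
        i = np.random.choice(self.user_pos_items[u])

        # Rejection-sample a negative item the user has not rated positively.
        pos_set = self.user_pos_sets[u]
        j = np.random.randint(0, self.n_items)
        while j in pos_set:
            j = np.random.randint(0, self.n_items)

        return (
            torch.tensor(u, dtype=torch.long),
            torch.tensor(i, dtype=torch.long),
            torch.tensor(j, dtype=torch.long),
        )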

In [11]:
bpr_batch_size = 4096
bpr_lr = 1e-3
bpr_epochs = 5

bpr_ds = BPRTripletDataset(n_items=n_items, user_pos_items=user_pos_items, n_samples=1)
bpr_loader = DataLoader(bpr_ds, batch_size=bpr_batch_size, shuffle=True)

model_bpr = TwoTowerBPR(n_users=n_users, n_items=n_items, embed_dim=32, l2_reg=1e-4).to(device)
optimizer_bpr = torch.optim.Adam(model_bpr.parameters(), lr=bpr_lr)

for epoch in range(1, bpr_epochs + 1):
    model_bpr.train()
    total_loss = 0.0
    for u, i, j in bpr_loader:
        u = u.to(device)
        i = i.to(device)
        j = j.to(device)

        optimizer_bpr.zero_grad()
        loss = model_bpr.bpr_loss(u, i, j)
        loss.backward()
        optimizer_bpr.step()

        total_loss += loss.item() * len(u)

    avg_loss = total_loss / len(bpr_ds)
    print(f"Epoch {epoch}/{bpr_epochs} - BPR loss: {avg_loss:.4f}")

Epoch 1/5 - BPR loss: 0.6896
Epoch 2/5 - BPR loss: 0.5691
Epoch 3/5 - BPR loss: 0.3619
Epoch 4/5 - BPR loss: 0.2907
Epoch 5/5 - BPR loss: 0.2699


In [None]:
model_bpr.eval()

train_user_items_idx = train.groupby('user_id')['movie_id'].apply(
    lambda items: [item_map[m] for m in items if m in item_map]
).to_dict()

candidate_item_indices = torch.arange(n_items, dtype=torch.long, device=device)

def bpr_recommend_k(
    user_id: int,
    k: int,
    model,
    user_map,
    item_map,
    test: pd.DataFrame | None = None,
) -> np.ndarray:
    if user_id not in user_map:
        return np.array([], dtype=int)

    u_idx = user_map[user_id]

    if test is not None:
        test_items = test['movie_id'].unique()
    else:
        test_items = list(item_map.keys())
        
    test_items_idx = [item_map[m] for m in test_items if m in item_map]
    if not test_items_idx:
        return np.array([], dtype=int)

    test_items_idx = torch.tensor(test_items_idx, dtype=torch.long, device=device)

    seen_items_idx = set(train_user_items_idx.get(user_id, []))

    with torch.no_grad():
        topk_indices = model.recommend_for_user(
            user_index=u_idx,
            candidate_item_indices=test_items_idx,
            seen_item_indices=seen_items_idx,
            k=k,
        )

    # Use distinct loop names so the comprehension does not reuse the parameter name `k`.
    inv_item_map = {idx: mid for mid, idx in item_map.items()}
    topk_indices_cpu = topk_indices.cpu().numpy().tolist()
    rec_movie_ids = [inv_item_map[idx] for idx in topk_indices_cpu if idx in inv_item_map]
    return np.array(rec_movie_ids, dtype=int)


prec_bpr = evaluate_precision_at_k(test, bpr_recommend_k, k=10, model=model_bpr, user_map=user_map, item_map=item_map)
recall_bpr = evaluate_recall_at_k(test, bpr_recommend_k, k=10, model=model_bpr, user_map=user_map, item_map=item_map)
ndcg_bpr = evaluate_ndcg_at_k(test, bpr_recommend_k, k=10, model=model_bpr, user_map=user_map, item_map=item_map)

print(f"TwoTowerBPR - Precision@10: {prec_bpr:.4f}")
print(f"TwoTowerBPR - Recall@10:    {recall_bpr:.4f}")
print(f"TwoTowerBPR - NDCG@10:      {ndcg_bpr:.4f}")

TwoTowerBPR - Precision@10: 0.1847
TwoTowerBPR - Recall@10:    0.0540
TwoTowerBPR - NDCG@10:      0.2428


## Critical discussion

### Representational differences vs MF/BPR
- **NeuralMF vs SVD/ALS (MF)**: MF uses a single dot product between user and item factors, which assumes a linear interaction in the shared latent space. NeuralMF replaces this with an MLP over concatenated embeddings, allowing non-linear feature interactions and user–item specific effects (e.g., saturation, cross-features). This can capture more complex patterns (e.g., genre combinations or user taste shifts) at the cost of higher capacity and risk of overfitting.
- **TwoTowerBPR vs classical BPR-Opt**: Both optimize a pairwise ranking objective, but classical BPR-Opt uses a purely linear score (dot product of latent factors learned via SGD in NumPy), while TwoTowerBPR is implemented as a neural two-tower model in PyTorch. Architecturally they are very similar (embeddings + dot product); the key difference is the training framework, which makes it easy to extend TwoTowerBPR with deeper towers or side features if needed.

### Optimization and compute trade-offs
- **Classical MF (SVD/ALS)**: the joint factorization problem is non-convex, but with one factor matrix held fixed each ALS subproblem is a convex least-squares problem with a closed-form solution, and SGD variants touch only a small number of parameters per step. Both tend to converge quickly and stably. NeuralMF adds MLP weights and non-linearities on top of the embeddings; it is trained purely with first-order methods (Adam here) and can be slower and more sensitive to hyperparameters such as learning rate, dropout, and batch size. It also requires batching and careful regularization.
- **BPR-Opt vs TwoTowerBPR**: Both rely on negative sampling and SGD over triplets. The NumPy BPR-Opt implementation is lightweight and CPU-friendly; the PyTorch TwoTowerBPR can leverage GPUs and batched operations, but its per-step overhead is higher. In practice, if we keep the embedding dimension small, the compute cost remains manageable, but the neural version becomes more appealing when we want to scale to larger models or add complex features.
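
To make the "closed-form update" mentioned above concrete, here is a minimal NumPy sketch of a single ALS step for one user with the item factors held fixed. It is a textbook ridge-regression solve, not the repository's ALS code; the function name `als_update_user` and the toy data are hypothetical.

```python
import numpy as np


def als_update_user(item_factors: np.ndarray, ratings: np.ndarray, reg: float = 0.1) -> np.ndarray:
    """Solve (V^T V + reg * I) u = V^T r for one user's latent factors."""
    k = item_factors.shape[1]
    gram = item_factors.T @ item_factors + reg * np.eye(k)
    rhs = item_factors.T @ ratings
    return np.linalg.solve(gram, rhs)


rng = np.random.default_rng(0)
V = rng.normal(size=(6, 3))           # factors of the 6 items this user rated
true_u = np.array([1.0, -0.5, 2.0])   # made-up ground-truth user factors
r = V @ true_u                        # noiseless ratings generated from them
u_hat = als_update_user(V, r, reg=1e-8)
print(np.round(u_hat, 4))
```

With noiseless data and negligible regularization the solve recovers the generating factors in one step, whereas a neural model like NeuralMF has no such closed form and has to take many gradient steps.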

### Why performance improves or degrades
- On rating prediction, **NeuralMF** reaches a test RMSE of 0.9371 and MAE of 0.7358 after 5 epochs. It can slightly improve over MF when the data contains non-linear interaction patterns and the model is tuned carefully. On smaller or noisy datasets the extra capacity can hurt, giving similar or worse RMSE than a well-regularized MF baseline at higher compute cost.
- On ranking metrics, **TwoTowerBPR** is far stronger than NeuralMF in this notebook (Precision@10 0.1847 vs 0.0252, Recall@10 0.0540 vs 0.0044, NDCG@10 0.2428 vs 0.0160) because its pairwise objective is directly aligned with ranking, whereas MSE on ratings is not. Compared to classical BPR-Opt, we expect similar behaviour when using the same embedding size; any gains usually come from better regularization and the ability to scale up or extend the architecture, not from “depth” alone.
- Overall, the experiments here illustrate that **neural recommenders are not automatically better**: they trade analytical simplicity and fast convergence (MF/BPR) for additional representational power. Whether this pays off depends on data scale, noise level, and how well we tune the models. In small-to-medium regimes like MovieLens 1M, modest non-linear models tend to be competitive but not dramatically better than strong MF/BPR baselines.