# MovieLens Recommender – Ensemble for Ranking (Precision@K / Recall@K)

In this notebook we move from **rating prediction (RMSE)** to
**ranking-oriented evaluation** and ensembles.

We will:

1. Use the MovieLens `ml-latest-small` dataset (`ratings.csv`).
2. Train three **base models** on explicit ratings:
   - Bias model (global + user + item effects).
   - Item-based k-NN collaborative filtering.
   - Matrix factorisation with biases.
3. Define **relevance** as `rating ≥ threshold` (e.g. 4.0).
4. Evaluate each base model with **ranking metrics**:
   - Hit-rate@K.
   - Precision@K.
   - Recall@K.
5. Build a **stacked ensemble for ranking**:
   - Use base model scores as features.
   - Train a **logistic regression** meta-model to predict relevance.
   - Rank by predicted probability of relevance.
6. Compare base models, simple average, and stacked ensemble on ranking
   metrics, and interpret the meta-model weights.

This gives a template for **blending recommender models** when your goal
is top-N recommendation quality rather than rating RMSE.


## 1. Imports and configuration

We use:

- `pandas`, `numpy` – data handling.
- `matplotlib`, `seaborn` – plots.
- `scikit-learn` – splits, similarity, and meta-learning.

All recommender logic (bias, item-kNN, MF) is implemented directly in
Python/NumPy for transparency; the only library helper used inside a
model is scikit-learn's `cosine_similarity` in item-kNN.


In [None]:
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Set

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_DIR: Path = Path("data") / "ml-latest-small"
RATINGS_PATH: Path = DATA_DIR / "ratings.csv"

if not RATINGS_PATH.exists():
    raise FileNotFoundError(
        f"Ratings file not found at {RATINGS_PATH.resolve()}. "
        "Please ensure MovieLens 'ml-latest-small' is under data/ml-latest-small/."
    )

ratings_df = pd.read_csv(RATINGS_PATH)
print("Ratings shape:", ratings_df.shape)
ratings_df.head()


### 1.1 Utility metrics

We keep RMSE just for reference, but our target metrics will be
hit-rate@K, precision@K, recall@K.


In [None]:
def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute root mean squared error.

    Args:
        y_true: True ratings.
        y_pred: Predicted ratings.

    Returns:
        RMSE value.
    """
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


sns.histplot(ratings_df["rating"], bins=10)
plt.title("Rating distribution")
plt.xlabel("Rating")
plt.show()


## 2. Train / meta / test splits

We reuse the three-way split structure suitable for stacking:

1. Split into `train_full` (80%) and `test` (20%).
2. Split `train_full` into `base_train` (80%) and `meta_train` (20%),
   i.e. about 64% and 16% of all ratings.

- Base models are fitted on `base_train`.
- Meta-model is trained on `meta_train` **using base model scores**.
- `test` is held out until final evaluation.
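
The nested 80/20 splits can be checked with a little arithmetic. The sketch below uses plain NumPy index shuffling as a stand-in for `train_test_split`; the helper `three_way_split` is hypothetical and not part of this notebook.

```python
import numpy as np


def three_way_split(n_rows: int, test_frac: float = 0.2, meta_frac: float = 0.2, seed: int = 42):
    """Return disjoint index arrays (base_train, meta_train, test)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)

    # First cut: hold out the test rows.
    n_test = int(round(n_rows * test_frac))
    test_idx, train_full_idx = order[:n_test], order[n_test:]

    # Second cut: carve meta_train out of what remains.
    n_meta = int(round(len(train_full_idx) * meta_frac))
    meta_idx, base_idx = train_full_idx[:n_meta], train_full_idx[n_meta:]
    return base_idx, meta_idx, test_idx


base_idx, meta_idx, test_idx = three_way_split(1000)
print(len(base_idx), len(meta_idx), len(test_idx))  # 640 160 200
```

The three index sets are disjoint, so no rating seen by a base model is reused to train the meta-model or to evaluate it.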


In [None]:
# Step 1: main train / test split

train_full_df, test_df = train_test_split(
    ratings_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

# Step 2: split train_full into base_train and meta_train

base_train_df, meta_train_df = train_test_split(
    train_full_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print("Base train size:", base_train_df.shape[0])
print("Meta train size:", meta_train_df.shape[0])
print("Test size:      ", test_df.shape[0])


## 3. Base models (same as previous ensemble notebook)

We define three base models:

1. `BiasModel` – global + user + item biases.
2. `ItemKNNModel` – item-based k-NN CF.
3. `MatrixFactorizationModel` – latent factor model.
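
To make the simplest of the three concrete, here is a worked toy example of the bias prediction: global mean plus the user's and the item's deviation from that mean. The data and the `predict` helper are illustrative only; the notebook's own `BiasModel` follows below.

```python
from collections import defaultdict

# Toy (user, movie, rating) triples; purely illustrative data.
ratings = [("A", "m1", 5.0), ("A", "m2", 4.0), ("B", "m1", 3.0), ("B", "m3", 2.0)]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean = 3.5

by_user, by_item = defaultdict(list), defaultdict(list)
for user, movie, rating in ratings:
    by_user[user].append(rating)
    by_item[movie].append(rating)

# Bias = mean rating of that user (or item) minus the global mean.
user_bias = {u: sum(v) / len(v) - mu for u, v in by_user.items()}
item_bias = {m: sum(v) / len(v) - mu for m, v in by_item.items()}


def predict(user: str, movie: str) -> float:
    """Global mean plus user and item deviations; unseen ids contribute 0."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(movie, 0.0)


print(predict("B", "m2"))  # 3.0  (3.5 - 1.0 + 0.5)
print(predict("Z", "m1"))  # 4.0  (unseen user: 3.5 + 0.0 + 0.5)
```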

Each has a consistent interface:

- `fit(df)` – train on a ratings DataFrame.
- `predict_df(df)` – predict scores for `(userId, movieId)` pairs.

The *scores* they output are used for both rating and ranking tasks.
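
The shared interface can be written down as a structural type. This is a sketch only: the names `ScoreModel` and `GlobalMeanModel` are hypothetical and do not appear elsewhere in the notebook.

```python
from typing import Protocol, runtime_checkable

import numpy as np
import pandas as pd


@runtime_checkable
class ScoreModel(Protocol):
    """Hypothetical structural type for the shared fit / predict_df interface."""

    def fit(self, df: pd.DataFrame) -> None: ...

    def predict_df(self, df: pd.DataFrame) -> np.ndarray: ...


class GlobalMeanModel:
    """Trivial stand-in model: predicts the training mean for every pair."""

    def __init__(self) -> None:
        self.mu = 0.0

    def fit(self, df: pd.DataFrame) -> None:
        self.mu = float(df["rating"].mean())

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        return np.full(len(df), self.mu, dtype=float)


toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]})
model = GlobalMeanModel()
model.fit(toy)

print(isinstance(model, ScoreModel))  # True: both methods are present
print(model.predict_df(toy))          # [4. 4. 4.]
```

Any object with these two methods can be plugged into the evaluation and stacking code without changes, which is what makes adding further base models cheap.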


### 3.1 BiasModel


In [None]:
class BiasModel:
    """Global + user + item bias recommender.

    Predicts ratings as global mean plus user-specific and item-specific
    deviations from that mean.
    """

    def __init__(self) -> None:
        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit bias terms from a ratings DataFrame.

        Args:
            df: DataFrame with columns `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        self.mu = float(df["rating"].mean())

        user_mean = df.groupby("userId")["rating"].mean()
        item_mean = df.groupby("movieId")["rating"].mean()

        self.user_bias = (user_mean - self.mu).to_dict()
        self.item_bias = (item_mean - self.mu).to_dict()

    def predict_row(self, user_id: int, movie_id: int) -> float:
        """Predict rating for a single user–item pair.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Predicted rating.
        """
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id, 0.0)
        bi = self.item_bias.get(movie_id, 0.0)
        return float(self.mu + bu + bi)

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Predict ratings for multiple user–item pairs.

        Args:
            df: DataFrame with `userId`, `movieId`.

        Returns:
            Array of predictions aligned with `df` rows.
        """
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_row(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


bias_model = BiasModel()
bias_model.fit(base_train_df)

bias_meta_preds = bias_model.predict_df(meta_train_df)
print("Bias model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), bias_meta_preds))


### 3.2 ItemKNNModel


In [None]:
class ItemKNNModel:
    """Item-based k-NN collaborative filtering model.

    Uses cosine similarity between item rating vectors and a similarity-
    weighted average over neighbours.
    """

    def __init__(self, k: int = 40, default_rating: float = 3.5) -> None:
        self.k = k
        self.default_rating = float(default_rating)

        self.user_id_to_index: Dict[int, int] = {}
        self.item_id_to_index: Dict[int, int] = {}
        self.R: np.ndarray | None = None
        self.item_sim: np.ndarray | None = None

    def fit(self, df: pd.DataFrame) -> None:
        """Fit the k-NN model from a ratings DataFrame.

        Args:
            df: DataFrame with `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        unique_users = df["userId"].unique()
        unique_items = df["movieId"].unique()

        self.user_id_to_index = {uid: idx for idx, uid in enumerate(unique_users)}
        self.item_id_to_index = {iid: idx for idx, iid in enumerate(unique_items)}

        n_users = len(unique_users)
        n_items = len(unique_items)

        R = np.zeros((n_users, n_items), dtype=np.float32)
        for row in df.itertuples(index=False):
            u_idx = self.user_id_to_index[row.userId]
            i_idx = self.item_id_to_index[row.movieId]
            R[u_idx, i_idx] = row.rating

        self.R = R
        self.item_sim = cosine_similarity(R.T)

    def _predict_single(self, user_id: int, movie_id: int) -> float:
        if self.R is None or self.item_sim is None:
            raise RuntimeError("Model has not been fitted.")

        u_idx = self.user_id_to_index.get(user_id)
        i_idx = self.item_id_to_index.get(movie_id)
        if u_idx is None or i_idx is None:
            return self.default_rating

        user_ratings = self.R[u_idx, :]
        sims = self.item_sim[i_idx, :]

        rated_mask = user_ratings > 0
        rated_indices = np.where(rated_mask)[0]
        if rated_indices.size == 0:
            return self.default_rating

        sims_rated = sims[rated_indices]
        ratings_rated = user_ratings[rated_indices]

        k_use = min(self.k, rated_indices.size)
        top_idx = np.argsort(sims_rated)[-k_use:]

        neigh_sims = sims_rated[top_idx]
        neigh_ratings = ratings_rated[top_idx]

        if np.all(neigh_sims == 0):
            return float(neigh_ratings.mean())

        pred = float(np.dot(neigh_sims, neigh_ratings) / np.sum(np.abs(neigh_sims)))
        return pred

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self._predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


bias_global_mean = float(base_train_df["rating"].mean())

item_knn_model = ItemKNNModel(k=40, default_rating=bias_global_mean)
item_knn_model.fit(base_train_df)

item_meta_preds = item_knn_model.predict_df(meta_train_df)
print("Item-kNN model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), item_meta_preds))


### 3.3 MatrixFactorizationModel


In [None]:
@dataclass
class MFConfig:
    n_factors: int = 30
    n_epochs: int = 12
    lr: float = 0.01
    reg: float = 0.05


class MatrixFactorizationModel:
    """Matrix factorisation with biases trained via SGD.

    Predicts ratings using a global mean, user/item biases and latent
    factors for users and items.
    """

    def __init__(self, config: MFConfig, random_state: int = 42) -> None:
        self.config = config
        self.random_state = random_state

        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}
        self.P: Dict[int, np.ndarray] = {}
        self.Q: Dict[int, np.ndarray] = {}

    def fit(self, df: pd.DataFrame) -> None:
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        rng = np.random.default_rng(self.random_state)

        user_ids = df["userId"].unique()
        item_ids = df["movieId"].unique()

        self.mu = float(df["rating"].mean())

        self.user_bias = {u: 0.0 for u in user_ids}
        self.item_bias = {i: 0.0 for i in item_ids}

        k = self.config.n_factors
        self.P = {u: 0.1 * rng.standard_normal(k) for u in user_ids}
        self.Q = {i: 0.1 * rng.standard_normal(k) for i in item_ids}

        lr = self.config.lr
        reg = self.config.reg

        user_arr = df["userId"].to_numpy()
        item_arr = df["movieId"].to_numpy()
        rating_arr = df["rating"].to_numpy()

        n_obs = len(df)

        for epoch in range(self.config.n_epochs):
            idx = rng.permutation(n_obs)
            se = 0.0
            for t in idx:
                u = int(user_arr[t])
                i = int(item_arr[t])
                r_ui = float(rating_arr[t])

                bu = self.user_bias[u]
                bi = self.item_bias[i]
                pu = self.P[u]
                qi = self.Q[i]

                pred = self.mu + bu + bi + float(np.dot(pu, qi))
                err = r_ui - pred
                se += err * err

                # Bias updates
                self.user_bias[u] = bu + lr * (err - reg * bu)
                self.item_bias[i] = bi + lr * (err - reg * bi)

                # Latent factors updates
                pu_new = pu + lr * (err * qi - reg * pu)
                qi_new = qi + lr * (err * pu - reg * qi)

                self.P[u] = pu_new
                self.Q[i] = qi_new

            train_rmse = float(np.sqrt(se / n_obs))
            print(f"Epoch {epoch+1}/{self.config.n_epochs} - train RMSE: {train_rmse:.4f}")

    def predict_single(self, user_id: int, movie_id: int) -> float:
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id)
        bi = self.item_bias.get(movie_id)
        pu = self.P.get(user_id)
        qi = self.Q.get(movie_id)

        if bu is None or bi is None or pu is None or qi is None:
            return float(self.mu)

        return float(self.mu + bu + bi + float(np.dot(pu, qi)))

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


mf_config = MFConfig(n_factors=30, n_epochs=10, lr=0.01, reg=0.05)
mf_model = MatrixFactorizationModel(config=mf_config, random_state=RANDOM_STATE)

mf_model.fit(base_train_df)

mf_meta_preds = mf_model.predict_df(meta_train_df)
print("MF model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), mf_meta_preds))


We now have three base models with predictions on `meta_train_df`.
Next we define relevance and ranking metrics, and build a logistic
stacking model for ranking.


## 4. Relevance definition and ranking metrics

We transform explicit ratings into a binary **relevance** signal:

- relevant if `rating ≥ REL_THRESHOLD`.
- not relevant otherwise.

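Then we compute per-user ranking metrics (hit-rate@K, precision@K,
recall@K) by ranking the items each user rated in the evaluation set
according to model scores. Candidates are therefore limited to items
with a held-out rating, not the full catalogue, and for users with fewer
than K such items precision is normalised by the number of items they
do have.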
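
A worked toy example for a single user may help before the full function. The item ids, ratings and scores are made up; the precision denominator `min(k, n_items)` mirrors the convention used in `ranking_metrics_for_model_on_test`.

```python
# One user's held-out items: (movieId, true rating, model score). Toy values.
user_items = [(1, 5.0, 4.6), (2, 3.0, 4.4), (3, 4.5, 4.1), (4, 2.0, 3.0), (5, 4.0, 2.5)]
threshold, k = 4.0, 3

# Relevant items are those rated at or above the threshold.
relevant = {m for m, rating, _ in user_items if rating >= threshold}  # {1, 3, 5}

# Rank by model score, highest first, and keep the top k.
ranked = sorted(user_items, key=lambda row: row[2], reverse=True)
top_k = {m for m, _, _ in ranked[:k]}                                 # {1, 2, 3}

n_hits = len(top_k & relevant)                                        # 2
hit = int(n_hits > 0)
precision_at_k = n_hits / min(k, len(user_items))                     # 2 / 3
recall_at_k = n_hits / len(relevant)                                  # 2 / 3

print(hit, round(precision_at_k, 3), round(recall_at_k, 3))  # 1 0.667 0.667
```

Item 5 is relevant but scored low, so it falls outside the top 3 and costs recall; item 2 is scored high but not relevant, so it costs precision.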


In [None]:
REL_THRESHOLD: float = 4.0
K_EVAL: int = 10


def get_user_relevant_items(df: pd.DataFrame, user_id: int, threshold: float = REL_THRESHOLD) -> Set[int]:
    """Return set of relevant movieIds for a user.

    Args:
        df: Ratings DataFrame.
        user_id: User identifier.
        threshold: Rating threshold to consider relevant.

    Returns:
        Set of movieIds.
    """
    mask = (df["userId"] == user_id) & (df["rating"] >= threshold)
    return set(df.loc[mask, "movieId"].unique())


def ranking_metrics_for_model_on_test(
    test_df: pd.DataFrame,
    scores: np.ndarray,
    k: int = K_EVAL,
    threshold: float = REL_THRESHOLD,
) -> Dict[str, float]:
    """Compute hit-rate, precision@K, recall@K from per-row scores.

    Args:
        test_df: Test ratings DataFrame.
        scores: Score array aligned with `test_df` rows.
        k: Cutoff for top-K.
        threshold: Rating threshold for relevance.

    Returns:
        Dict with hit_rate, precision_at_k, recall_at_k, n_eval_users.
    """
    df_scores = test_df.copy()
    df_scores["score"] = scores

    users = df_scores["userId"].unique()

    hits = 0
    sum_precision = 0.0
    sum_recall = 0.0
    n_eval_users = 0

    for u in users:
        user_rows = df_scores[df_scores["userId"] == u]
        relevant_items = get_user_relevant_items(test_df, u, threshold=threshold)
        if not relevant_items:
            continue  # skip users without positives in test

        n_eval_users += 1

        user_rows_sorted = user_rows.sort_values("score", ascending=False)
        top_k = user_rows_sorted.head(k)

        recommended_items = set(top_k["movieId"].tolist())
        n_relevant_in_top = len(recommended_items & relevant_items)

        if n_relevant_in_top > 0:
            hits += 1

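        # Normalise by the number of items actually recommended, so users
        # with fewer than k test items are not penalised for short lists.
        precision_u = n_relevant_in_top / min(k, len(user_rows_sorted))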
        recall_u = n_relevant_in_top / len(relevant_items)

        sum_precision += precision_u
        sum_recall += recall_u

    if n_eval_users == 0:
        raise ValueError("No users with relevant items in test for ranking evaluation.")

    hit_rate = hits / n_eval_users
    precision_at_k = sum_precision / n_eval_users
    recall_at_k = sum_recall / n_eval_users

    return {
        "hit_rate": hit_rate,
        "precision_at_k": precision_at_k,
        "recall_at_k": recall_at_k,
        "n_eval_users": float(n_eval_users),
    }


## 5. Base models as rankers on test set


In [None]:
y_test = test_df["rating"].to_numpy()

bias_test_scores = bias_model.predict_df(test_df)
item_test_scores = item_knn_model.predict_df(test_df)
mf_test_scores = mf_model.predict_df(test_df)

metrics_bias = ranking_metrics_for_model_on_test(test_df, bias_test_scores, k=K_EVAL)
metrics_item = ranking_metrics_for_model_on_test(test_df, item_test_scores, k=K_EVAL)
metrics_mf = ranking_metrics_for_model_on_test(test_df, mf_test_scores, k=K_EVAL)

metrics_bias, metrics_item, metrics_mf


In [None]:
# Simple untrained average of scores

avg_test_scores = (bias_test_scores + item_test_scores + mf_test_scores) / 3.0
metrics_avg = ranking_metrics_for_model_on_test(test_df, avg_test_scores, k=K_EVAL)
metrics_avg


In [None]:
rows = []
for name, m in [
    ("BiasModel", metrics_bias),
    ("ItemKNNModel", metrics_item),
    ("MFModel", metrics_mf),
    ("SimpleAverage", metrics_avg),
]:
    rows.append(
        {
            "model": name,
            "hit_rate": m["hit_rate"],
            "precision_at_k": m["precision_at_k"],
            "recall_at_k": m["recall_at_k"],
        }
    )

base_ranking_df = pd.DataFrame(rows)
base_ranking_df


In [None]:
base_ranking_melt = base_ranking_df.melt(id_vars="model", var_name="metric", value_name="value")

sns.barplot(data=base_ranking_melt, x="metric", y="value", hue="model")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title(f"Base models – ranking metrics (K={K_EVAL}, threshold={REL_THRESHOLD})")
plt.show()


## 6. Stacked ensemble for ranking (logistic regression)

We now train a **logistic regression** meta-model that learns to map
base scores to a probability of relevance.

For each row in `meta_train_df` we have:

- Base features: `[bias_score, item_knn_score, mf_score]`.
- Label: `1` if `rating ≥ REL_THRESHOLD`, else `0`.

The meta-model learns to combine base scores in a way that better
separates relevant from non-relevant items.


In [None]:
# Build meta-training features and labels for ranking

X_meta_rank = np.vstack([
    bias_meta_preds,
    item_meta_preds,
    mf_meta_preds,
]).T

y_meta_rank = (meta_train_df["rating"].to_numpy() >= REL_THRESHOLD).astype(int)

print("X_meta_rank shape:", X_meta_rank.shape)
print("Positive rate in meta labels:", y_meta_rank.mean())


In [None]:
meta_rank_model = LogisticRegression(
    penalty="l2",
    C=1.0,
    solver="lbfgs",
    max_iter=1000,
    random_state=RANDOM_STATE,
)

meta_rank_model.fit(X_meta_rank, y_meta_rank)

print("Meta-ranking coefficients (Bias, ItemKNN, MF):", meta_rank_model.coef_)
print("Meta-ranking intercept:", meta_rank_model.intercept_)


### 6.1 Evaluate stacked ensemble on test set

We apply the logistic meta-model to the test set:

1. Compute base scores on test (already done).
2. Form `X_test_rank`.
3. Predict `p(relevant)` using `predict_proba`.
4. Rank items per user by this probability.
5. Compute hit-rate@K, precision@K, recall@K.
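
One property worth keeping in mind: the logistic function is strictly increasing, so ranking by predicted probability gives the same order as ranking by the weighted sum of base scores, and the intercept has no effect on the order. The sketch below uses made-up weights and scores, not values fitted in this notebook.

```python
import numpy as np

# Hypothetical meta-model parameters and base scores (not fitted values).
weights = np.array([0.4, 0.9, 1.3])  # Bias, ItemKNN, MF
intercept = -9.0
base_scores = np.array([
    [3.6, 3.9, 4.2],
    [3.1, 2.8, 3.0],
    [4.0, 4.4, 3.7],
    [3.5, 3.5, 3.5],
])

linear = base_scores @ weights                       # weighted sum of base scores
proba = 1.0 / (1.0 + np.exp(-(linear + intercept)))  # logistic link

order_by_linear = np.argsort(-linear)
order_by_proba = np.argsort(-proba)
print(order_by_linear, order_by_proba)  # [0 2 3 1] [0 2 3 1]
```

For ranking purposes the stacked model therefore behaves like a weighted average whose weights were learned from relevance labels. With the fitted scikit-learn model, `decision_function` returns the linear score (including the intercept) directly.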


In [None]:
X_test_rank = np.vstack([
    bias_test_scores,
    item_test_scores,
    mf_test_scores,
]).T

stacked_test_proba = meta_rank_model.predict_proba(X_test_rank)[:, 1]

metrics_stacked_rank = ranking_metrics_for_model_on_test(
    test_df,
    stacked_test_proba,
    k=K_EVAL,
    threshold=REL_THRESHOLD,
)

metrics_stacked_rank


In [None]:
rows_full = rows.copy()
rows_full.append(
    {
        "model": "StackedLogistic",
        "hit_rate": metrics_stacked_rank["hit_rate"],
        "precision_at_k": metrics_stacked_rank["precision_at_k"],
        "recall_at_k": metrics_stacked_rank["recall_at_k"],
    }
)

ranking_compare_df = pd.DataFrame(rows_full)
ranking_compare_df


In [None]:
ranking_compare_melt = ranking_compare_df.melt(id_vars="model", var_name="metric", value_name="value")

sns.barplot(data=ranking_compare_melt, x="metric", y="value", hue="model")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title(f"Base models vs stacked ensemble – ranking (K={K_EVAL})")
plt.show()


## 7. Interpreting meta-ranking weights

We inspect logistic regression coefficients as **learned weights** over
base scores.


In [None]:
coef_names = ["BiasModel", "ItemKNNModel", "MFModel"]
coef_values = meta_rank_model.coef_[0]

coef_rank_df = pd.DataFrame({"base_model": coef_names, "weight": coef_values})
coef_rank_df


In [None]:
sns.barplot(data=coef_rank_df, x="base_model", y="weight")
plt.title("Meta-ranking weights (logistic regression coefficients)")
plt.ylabel("Coefficient")
plt.show()


Interpretation:

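- Larger positive coefficient → that base model's score strongly
  increases the probability of relevance.
- Near-zero coefficient → little contribution beyond what the other
  scores already provide.
- Negative coefficient → holding the other scores fixed, a higher score
  from that model lowers the predicted probability. This usually means
  the model acts as a correction to the others, not that it is
  anti-predictive on its own.

Because base scores are on the rating scale (~0.5–5.0), each coefficient
is approximately the change in the log-odds of relevance for a one-point
increase in that base score, holding the other two fixed. The base
scores are typically correlated with each other, so the coefficients are
best read together rather than one at a time.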
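
To turn a log-odds coefficient into something more tangible, exponentiate it to get an odds ratio. The coefficient below is a hypothetical value chosen for illustration, not one fitted in this notebook.

```python
import math

# Hypothetical coefficient for one base score (not a fitted value).
coef = 1.2

# Multiplicative change in the odds of relevance per +1.0 score point.
odds_ratio = math.exp(coef)


def shifted_probability(p_before: float, log_odds_shift: float) -> float:
    """Probability after adding `log_odds_shift` to the log-odds of `p_before`."""
    log_odds = math.log(p_before / (1.0 - p_before)) + log_odds_shift
    return 1.0 / (1.0 + math.exp(-log_odds))


print(round(odds_ratio, 2))                      # 3.32
print(round(shifted_probability(0.5, coef), 3))  # 0.769
```

So with a coefficient of 1.2, an item that starts at even odds moves to roughly a 77% predicted probability when that base score rises by one point and the others stay fixed.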


## 8. Design rationale

### 8.1 Why treat ranking as classification over base scores?

Ranking metrics (precision@K, recall@K) are piecewise-constant in the
model scores, so they give no useful gradient and are hard to optimise
directly. Instead, we:

- Train base models on ratings as usual.
- Use their scores as features in a **probabilistic classifier**.
- Let the classifier learn to map scores → probability of relevance.

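The classifier is trained with log-loss over all `(user, item)` rows
pooled together, not with a per-user ranking objective. Ranking by its
probabilities rewards separating positives from negatives, which is a
surrogate for the ranking metrics, not a direct optimisation of them.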

### 8.2 Why logistic regression?

- It is simple and interpretable.
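- Is a reasonable baseline even when labels are unbalanced (few relevant
  items), although predicted probabilities then skew low; check the
  positive rate printed when building `y_meta_rank`.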
- Produces probabilistic scores.
- Coefficients tell us how much each base model matters.
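
If the relevance threshold leaves few positives, one option is scikit-learn's `class_weight="balanced"` setting. The sketch below compares it with the default on synthetic data; the data-generating numbers are assumptions made for illustration and say nothing about MovieLens itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, deliberately imbalanced data: about 10% positives, one feature
# that plays the role of a single base-model score.
n = 2000
y = (rng.random(n) < 0.10).astype(int)
X = (3.5 + 0.8 * y + rng.normal(0.0, 0.5, n)).reshape(-1, 1)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

p_plain = plain.predict_proba(X)[:, 1]
p_balanced = balanced.predict_proba(X)[:, 1]

print("mean p(relevant), plain:   ", round(float(p_plain.mean()), 3))
print("mean p(relevant), balanced:", round(float(p_balanced.mean()), 3))
```

Reweighting mainly shifts the probability scale upwards. Whether it also changes the ranking depends on the data, so it is worth comparing both variants on the ranking metrics before adopting it.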

### 8.3 Why a single global meta-model?

We train a **global model** across all users and items. This is a good
starting point and often performs well. More complex designs include:

- Adding user features (e.g. user activity, segments).
- Adding item features (e.g. popularity, genre-based scores).
- Using different meta-models per user cluster.


## 9. Extensions

1. **More base models**
   - Add LightFM, Surprise SVD, content-based models.
   - Use their scores as additional columns in `X_meta_rank`.

2. **Cross-validated stacking**
   - Generate out-of-fold base scores for a more robust meta-training
     dataset.

3. **Alternative meta-learners**
   - Gradient boosting or random forests over base scores.

4. **User-level calibration**
   - Learn user-specific or segment-specific meta-models if you suspect
     behaviour differs strongly across groups.
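
The cross-validated stacking idea from item 2 can be sketched as follows. The function `out_of_fold_scores` is hypothetical, and the "base model" is just the mean rating of the other folds, a stand-in for any recommender with a fit and predict step.

```python
import numpy as np


def out_of_fold_scores(ratings: np.ndarray, n_folds: int = 5, seed: int = 42) -> np.ndarray:
    """Score every row with a model that never saw that row during fitting."""
    rng = np.random.default_rng(seed)
    # Assign each row to a fold; the modulo keeps fold sizes near-equal.
    fold_of_row = rng.permutation(len(ratings)) % n_folds

    oof = np.empty(len(ratings), dtype=float)
    for fold in range(n_folds):
        held_out = fold_of_row == fold
        # "Fit" on the other folds, then "predict" the held-out rows.
        oof[held_out] = ratings[~held_out].mean()
    return oof


toy_ratings = np.array([5.0, 4.0, 3.0, 4.5, 2.0, 3.5, 4.0, 1.0, 5.0, 3.0])
oof = out_of_fold_scores(toy_ratings, n_folds=5)
print(oof.shape)  # (10,)
```

With real base models, each fold's scores would come from `fit` on the remaining folds followed by `predict_df` on the held-out fold, so every training row gets a meta-feature and no separate `meta_train` split is needed.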

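This notebook gives you an end-to-end ranking ensemble pipeline,
parallel to the RMSE-focused stacking notebook but aligned with top-N
recommendation metrics. The metrics here rank only items with a held-out
rating, so they measure how well rated items are ordered, not retrieval
from the full catalogue.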
