# MovieLens Recommender – Ensemble & Stacked Models

In this notebook we focus on **ensembles** for recommender systems on the
MovieLens `ml-latest-small` dataset.

So far we have built single models:

- Baselines (global mean, biases).
- Item-based collaborative filtering.
- Matrix factorisation.

Here we:

1. Define several **base recommenders**:
   - Bias model (global + user + item effects).
   - Item-based k-NN collaborative filtering.
   - Matrix factorisation (latent factors).
2. Split the data into:
   - Base training set (to fit base models).
   - Meta training set (to fit a **stacking meta-model**).
   - Test set (final evaluation).
3. Build a **stacked model**:
   - Use base model predictions as features.
   - Train a **linear regression meta-learner** to combine them.
4. Compare:
   - Individual base models.
   - Simple weighted average ensembles.
   - Stacked ensemble with a learned combiner.
5. Visualise and interpret:
   - RMSE comparison.
   - True vs predicted plots.
   - Learned weights of the meta-model.

This gives you a realistic pattern for building **blended recommenders**
where multiple modelling approaches contribute to a final score.


## 1. Imports and configuration

We keep dependencies simple:

- `pandas`, `numpy` – data wrangling.
- `matplotlib`, `seaborn` – plots.
- `scikit-learn` – splitting and meta-learning.

All recommender models are implemented in **plain Python / NumPy** so you
can fully inspect and adapt them.


In [None]:
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LinearRegression

sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_DIR: Path = Path("data") / "ml-latest-small"
RATINGS_PATH: Path = DATA_DIR / "ratings.csv"
MOVIES_PATH: Path = DATA_DIR / "movies.csv"

if not RATINGS_PATH.exists():
    raise FileNotFoundError(
        f"Ratings file not found at {RATINGS_PATH.resolve()}. "
        "Please ensure MovieLens 'ml-latest-small' is under data/ml-latest-small/."
    )

ratings_df = pd.read_csv(RATINGS_PATH)
movies_df: pd.DataFrame | None = None
if MOVIES_PATH.exists():
    movies_df = pd.read_csv(MOVIES_PATH)

print("Ratings shape:", ratings_df.shape)
ratings_df.head()


### 1.1 Utility: RMSE

We use RMSE as the primary error metric on ratings.
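
For reference, with `r_ui` the true rating, `r̂_ui` the prediction, and `T` the set of evaluated (user, item) pairs (the symbol `T` is notation introduced here for convenience):

\begin{align}
\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(r_{ui} - \hat r_{ui}\right)^2}
\end{align}

Because errors are squared before averaging, a few large misses weigh more than many small ones.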


In [None]:
def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute root mean squared error.

    Args:
        y_true: True ratings.
        y_pred: Predicted ratings.

    Returns:
        RMSE value.
    """
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


sns.histplot(ratings_df["rating"], bins=10)
plt.title("Rating distribution")
plt.xlabel("Rating")
plt.show()


## 2. Train / validation / test splitting for stacking

We want **three** disjoint sets:

- `base_train` – used to fit base models.
- `meta_train` – used to fit the meta-learner on base model predictions.
- `test` – held-out set for final evaluation only.

Procedure:

1. First split the full data into `train_full` (80%) and `test` (20%).
2. Then split `train_full` into `base_train` (80%) and `meta_train` (20%),
   i.e. roughly 64% and 16% of the full data.

In practice, you could use cross-validation or time-based splits. Here we
keep it simple and random.


In [None]:
# Step 1: main train / test split

train_full_df, test_df = train_test_split(
    ratings_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

# Step 2: split train_full into base_train and meta_train

base_train_df, meta_train_df = train_test_split(
    train_full_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print("Base train size:", base_train_df.shape[0])
print("Meta train size:", meta_train_df.shape[0])
print("Test size:      ", test_df.shape[0])


The intuition:

- Base models never see `meta_train` or `test` when fitting.
- Meta-model sees base predictions on `meta_train` only.
- Test set is used only once at the end.
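
A quick way to convince yourself that nothing leaks between the three sets is to check that they share no rows. The sketch below is a minimal illustration on row positions only; `two_stage_split` is a hypothetical helper that mimics the two-stage split with a NumPy permutation and is not the `train_test_split` call used in this notebook.

```python
import numpy as np


def two_stage_split(n_rows, test_frac=0.2, meta_frac=0.2, seed=42):
    """Hypothetical helper: shuffle row positions once, then cut them into
    base_train / meta_train / test blocks."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    n_meta = int(round(meta_frac * (n_rows - n_test)))
    test_idx = order[:n_test]
    meta_idx = order[n_test:n_test + n_meta]
    base_idx = order[n_test + n_meta:]
    return base_idx, meta_idx, test_idx


base_idx, meta_idx, test_idx = two_stage_split(n_rows=1000)

# No row may appear in more than one split.
assert np.intersect1d(base_idx, meta_idx).size == 0
assert np.intersect1d(base_idx, test_idx).size == 0
assert np.intersect1d(meta_idx, test_idx).size == 0

# Together the three splits cover every row exactly once.
assert base_idx.size + meta_idx.size + test_idx.size == 1000
print(base_idx.size, meta_idx.size, test_idx.size)  # 640 160 200
```

On the real DataFrames the same idea can be expressed with the row index, for example `base_train_df.index.intersection(meta_train_df.index).empty`.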


## 3. Base models

We implement three base recommenders:

1. **BiasModel** – global + user + item biases.
2. **ItemKNNModel** – item-based k-NN collaborative filtering.
3. **MatrixFactorizationModel** – latent factors with biases trained via SGD.

Each model follows a small, consistent interface:

- `fit(df)` – train from a DataFrame with `userId`, `movieId`, `rating`.
- `predict_df(df)` – predict ratings for a DataFrame of pairs.

This makes it easy to plug them into stacking.
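
One way to make this shared interface explicit is a `typing.Protocol`. The sketch below is illustrative only: the names `RatingModel` and `ConstantModel` are hypothetical and do not appear elsewhere in this notebook, and the argument types are left as `Any` to keep the example free of third-party imports.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RatingModel(Protocol):
    """Hypothetical name for the shared fit / predict_df interface."""

    def fit(self, df: Any) -> None:
        ...

    def predict_df(self, df: Any) -> Any:
        ...


class ConstantModel:
    """Toy model that always predicts one value; only here to exercise the protocol."""

    def __init__(self, value: float) -> None:
        self.value = value

    def fit(self, df: Any) -> None:
        # Nothing to learn for a constant predictor.
        return None

    def predict_df(self, df: Any) -> list:
        # One prediction per input row.
        return [self.value for _ in df]


model = ConstantModel(3.5)

# isinstance only checks that both methods exist, not their signatures.
assert isinstance(model, RatingModel)
print(model.predict_df([("u1", "i1"), ("u2", "i2")]))  # [3.5, 3.5]
```

Any class with these two methods, including the three base models below, would satisfy the protocol without inheriting from it.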


### 3.1 BiasModel

The bias model uses:

\begin{align}
\hat r_{ui} = \mu + b_u + b_i
\end{align}

Where:

- `μ` is the global mean rating.
- `b_u` is user bias.
- `b_i` is item bias.

It is simple, fast and often a very strong baseline.
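
To make the formula concrete, here is a small worked example in plain Python. The four ratings are made up for illustration, and the biases are computed as simple mean deviations from the global mean, which is the same estimate the class below uses.

```python
from collections import defaultdict

# Toy ratings as (user, item, rating); the values are made up for illustration.
ratings = [
    ("u1", "i1", 4.0),
    ("u1", "i2", 5.0),
    ("u2", "i1", 2.0),
    ("u2", "i2", 3.0),
]

mu = sum(r for _, _, r in ratings) / len(ratings)

by_user = defaultdict(list)
by_item = defaultdict(list)
for user, item, rating in ratings:
    by_user[user].append(rating)
    by_item[item].append(rating)

# Bias = mean rating of the user (or item) minus the global mean.
user_bias = {u: sum(rs) / len(rs) - mu for u, rs in by_user.items()}
item_bias = {i: sum(rs) / len(rs) - mu for i, rs in by_item.items()}


def predict(user, item):
    # Unknown users or items contribute a bias of 0.0.
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


print(mu)                   # 3.5
print(predict("u1", "i2"))  # 5.0
print(predict("u2", "i1"))  # 2.0
print(predict("u3", "i2"))  # 4.0 (unseen user: mu + item bias only)
```

Because the user and item biases are estimated independently, their sum can push a prediction outside the 0.5 to 5.0 rating range on real data; clipping predictions is a common remedy.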


In [None]:
class BiasModel:
    """Global + user + item bias recommender.

    This model estimates a baseline rating as the sum of:
    - global mean
    - user-specific deviation
    - item-specific deviation
    """

    def __init__(self) -> None:
        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit bias terms from a ratings DataFrame.

        Args:
            df: DataFrame with columns `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        self.mu = float(df["rating"].mean())

        # User and item average deviations from global mean
        user_mean = df.groupby("userId")["rating"].mean()
        item_mean = df.groupby("movieId")["rating"].mean()

        self.user_bias = (user_mean - self.mu).to_dict()
        self.item_bias = (item_mean - self.mu).to_dict()

    def predict_row(self, user_id: int, movie_id: int) -> float:
        """Predict rating for a single user–item pair.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Predicted rating.
        """
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id, 0.0)
        bi = self.item_bias.get(movie_id, 0.0)
        return float(self.mu + bu + bi)

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Predict ratings for a DataFrame of pairs.

        Args:
            df: DataFrame with columns `userId` and `movieId`.

        Returns:
            Array of predictions aligned with df rows.
        """
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_row(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


bias_model = BiasModel()
bias_model.fit(base_train_df)

# Quick check on meta_train
bias_meta_preds = bias_model.predict_df(meta_train_df)
print("Bias model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), bias_meta_preds))


### 3.2 ItemKNNModel – item-based collaborative filtering

This model:

1. Builds a user–item rating matrix on `base_train`.
2. Computes cosine similarity between item vectors.
3. Predicts a rating for `(u, i)` as a weighted average of `u`'s ratings
   on the **k most similar items** they have rated.

We use the global mean rating of `base_train` (the same `μ` as in the bias
model) as a fallback when the user or the item does not appear in `base_train`.
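
Written out, step 3 corresponds to the following weighted average, where `s_ij` is the cosine similarity between items `i` and `j`, and `N_k(i; u)` (notation introduced here) is the set of the `k` items rated by `u` that are most similar to `i`:

\begin{align}
\hat r_{ui} = \frac{\sum_{j \in N_k(i;u)} s_{ij} \, r_{uj}}{\sum_{j \in N_k(i;u)} |s_{ij}|}
\end{align}

If all selected similarities are zero, the implementation below returns the plain mean of those neighbours' ratings instead.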


In [None]:
class ItemKNNModel:
    """Item-based k-NN collaborative filtering recommender.

    Uses cosine similarity between item rating vectors and a weighted
    average over a user's rated neighbours.
    """

    def __init__(self, k: int = 40, default_rating: float | None = None) -> None:
        self.k = k
        self.default_rating = default_rating

        self.user_id_to_index: Dict[int, int] = {}
        self.item_id_to_index: Dict[int, int] = {}
        self.R: np.ndarray | None = None  # user-item matrix
        self.item_sim: np.ndarray | None = None  # item-item similarity

    def fit(self, df: pd.DataFrame) -> None:
        """Fit the k-NN model from a ratings DataFrame.

        Args:
            df: DataFrame with `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        unique_users = df["userId"].unique()
        unique_items = df["movieId"].unique()

        self.user_id_to_index = {uid: idx for idx, uid in enumerate(unique_users)}
        self.item_id_to_index = {iid: idx for idx, iid in enumerate(unique_items)}

        n_users = len(unique_users)
        n_items = len(unique_items)

        R = np.zeros((n_users, n_items), dtype=np.float32)
        for row in df.itertuples(index=False):
            u_idx = self.user_id_to_index[row.userId]
            i_idx = self.item_id_to_index[row.movieId]
            R[u_idx, i_idx] = row.rating

        self.R = R
        # Item-item cosine similarity
        self.item_sim = cosine_similarity(R.T)

    def _predict_single(self, user_id: int, movie_id: int) -> float:
        """Predict rating for a single (user, item) pair.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Predicted rating.
        """
        if self.R is None or self.item_sim is None:
            raise RuntimeError("Model has not been fitted.")

        # Fallback default
        default = float(self.default_rating if self.default_rating is not None else 3.5)

        u_idx = self.user_id_to_index.get(user_id)
        i_idx = self.item_id_to_index.get(movie_id)
        if u_idx is None or i_idx is None:
            return default

        user_ratings = self.R[u_idx, :]
        sims = self.item_sim[i_idx, :]

        rated_mask = user_ratings > 0
        rated_indices = np.where(rated_mask)[0]
        if rated_indices.size == 0:
            return default

        sims_rated = sims[rated_indices]
        ratings_rated = user_ratings[rated_indices]

        k_use = min(self.k, rated_indices.size)
        top_idx = np.argsort(sims_rated)[-k_use:]

        neigh_sims = sims_rated[top_idx]
        neigh_ratings = ratings_rated[top_idx]

        if np.all(neigh_sims == 0):
            return float(neigh_ratings.mean())

        pred = float(np.dot(neigh_sims, neigh_ratings) / np.sum(np.abs(neigh_sims)))
        return pred

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Predict ratings for many (userId, movieId) pairs.

        Args:
            df: DataFrame with userId, movieId.

        Returns:
            Predictions as a numpy array.
        """
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self._predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


# Use bias global mean as default rating for ItemKNN
bias_global_mean = float(base_train_df["rating"].mean())

item_knn_model = ItemKNNModel(k=40, default_rating=bias_global_mean)
item_knn_model.fit(base_train_df)

item_meta_preds = item_knn_model.predict_df(meta_train_df)
print("Item-kNN model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), item_meta_preds))


### 3.3 MatrixFactorizationModel – latent factors + biases

We use a simple matrix factorisation model:

\begin{align}
\hat r_{ui} = \mu + b_u + b_i + p_u^T q_i
\end{align}

- `μ` – global mean.
- `b_u`, `b_i` – biases.
- `p_u`, `q_i` – latent vectors of dimension `n_factors` (called `k` in the
  code, unrelated to the `k` of k-NN).

Training is done with **stochastic gradient descent** on observed ratings.
We keep the implementation simple and reasonably small so it is easy to
read.
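
For each observed rating, one SGD step can be written as follows, with `η` the learning rate (`lr`) and `λ` the regularisation strength (`reg`). These are the gradient steps for the regularised squared error of a single rating, with the constant factor 2 absorbed into `η`:

\begin{align}
e_{ui} &= r_{ui} - \hat r_{ui} \\
b_u &\leftarrow b_u + \eta \, (e_{ui} - \lambda \, b_u) \\
b_i &\leftarrow b_i + \eta \, (e_{ui} - \lambda \, b_i) \\
p_u &\leftarrow p_u + \eta \, (e_{ui} \, q_i - \lambda \, p_u) \\
q_i &\leftarrow q_i + \eta \, (e_{ui} \, p_u - \lambda \, q_i)
\end{align}

Both factor updates use the values of `p_u` and `q_i` from before the step, which is why the code computes `pu_new` and `qi_new` before assigning either of them.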


In [None]:
@dataclass
class MFConfig:
    n_factors: int = 30
    n_epochs: int = 15
    lr: float = 0.01
    reg: float = 0.05


class MatrixFactorizationModel:
    """Matrix factorisation with biases trained via SGD.

    This is a simple reference implementation suitable for small datasets.
    """

    def __init__(self, config: MFConfig, random_state: int = 42) -> None:
        self.config = config
        self.random_state = random_state

        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}
        self.P: Dict[int, np.ndarray] = {}
        self.Q: Dict[int, np.ndarray] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit the MF model on a ratings DataFrame.

        Args:
            df: DataFrame with userId, movieId, rating.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        rng = np.random.default_rng(self.random_state)

        user_ids = df["userId"].unique()
        item_ids = df["movieId"].unique()

        self.mu = float(df["rating"].mean())

        self.user_bias = {u: 0.0 for u in user_ids}
        self.item_bias = {i: 0.0 for i in item_ids}

        k = self.config.n_factors

        self.P = {u: 0.1 * rng.standard_normal(k) for u in user_ids}
        self.Q = {i: 0.1 * rng.standard_normal(k) for i in item_ids}

        lr = self.config.lr
        reg = self.config.reg

        user_arr = df["userId"].to_numpy()
        item_arr = df["movieId"].to_numpy()
        rating_arr = df["rating"].to_numpy()

        n_obs = len(df)

        for epoch in range(self.config.n_epochs):
            idx = rng.permutation(n_obs)
            se = 0.0

            for t in idx:
                u = int(user_arr[t])
                i = int(item_arr[t])
                r_ui = float(rating_arr[t])

                bu = self.user_bias[u]
                bi = self.item_bias[i]
                pu = self.P[u]
                qi = self.Q[i]

                pred = self.mu + bu + bi + float(np.dot(pu, qi))
                err = r_ui - pred

                se += err * err

                # Update biases
                self.user_bias[u] = bu + lr * (err - reg * bu)
                self.item_bias[i] = bi + lr * (err - reg * bi)

                # Update latent factors
                pu_new = pu + lr * (err * qi - reg * pu)
                qi_new = qi + lr * (err * pu - reg * qi)

                self.P[u] = pu_new
                self.Q[i] = qi_new

            train_rmse = float(np.sqrt(se / n_obs))
            print(f"Epoch {epoch+1}/{self.config.n_epochs} - train RMSE: {train_rmse:.4f}")

    def predict_single(self, user_id: int, movie_id: int) -> float:
        """Predict rating for a single user–item pair.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Predicted rating.
        """
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id)
        bi = self.item_bias.get(movie_id)
        pu = self.P.get(user_id)
        qi = self.Q.get(movie_id)

        if bu is None or bi is None or pu is None or qi is None:
            return float(self.mu)

        return float(self.mu + bu + bi + float(np.dot(pu, qi)))

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Predict ratings for a DataFrame of pairs.

        Args:
            df: DataFrame with userId and movieId.

        Returns:
            Predictions as a numpy array.
        """
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


mf_config = MFConfig(n_factors=30, n_epochs=12, lr=0.01, reg=0.05)
mf_model = MatrixFactorizationModel(config=mf_config, random_state=RANDOM_STATE)

mf_model.fit(base_train_df)

mf_meta_preds = mf_model.predict_df(meta_train_df)
print("MF model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), mf_meta_preds))


We now have three base models, all trained on `base_train_df`, and all
producing predictions on `meta_train_df`.

Next step: **stacking**.


## 4. Stacking setup – meta-features and meta-target

For stacking we build a new supervised learning problem:

- **Input features**: predictions from base models.
- **Target**: the true rating.

For each row `(userId, movieId, rating)` in `meta_train_df` we compute:

```text
x = [bias_pred, item_knn_pred, mf_pred]
y = rating
```

We then train a **regression model** that learns how to combine the base
models. Here we start with a simple **LinearRegression**, which learns a
linear combination of the base predictions plus an intercept.


In [None]:
# Construct meta-training matrix X_meta and target y_meta

X_meta = np.vstack([
    bias_meta_preds,
    item_meta_preds,
    mf_meta_preds,
]).T

y_meta = meta_train_df["rating"].to_numpy()

print("Meta feature matrix shape:", X_meta.shape)
print("First row (bias, itemKNN, MF):", X_meta[0])


### 4.1 Meta-learner: LinearRegression

We choose `LinearRegression` as a simple, interpretable meta-learner:

- It learns a combination:

\begin{align}
\hat r_{ui}^{\text{ensemble}} = w_0 + w_1 \hat r_{ui}^{\text{bias}} + w_2 \hat r_{ui}^{\text{item}} + w_3 \hat r_{ui}^{\text{mf}}
\end{align}

- Coefficients tell us how much each base model contributes.

You could also use more complex meta-models (e.g. GradientBoostingRegressor)
if you suspect non-linear interactions between base models.
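
As a sanity check on what the linear meta-learner does, the sketch below fits the same kind of model with ordinary least squares in NumPy. The data is synthetic and noise-free, and the "true" weights are assumptions chosen only for this illustration, so the fit should recover them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic base-model predictions (columns: bias, item-kNN, MF); made-up data.
n = 200
X = rng.uniform(0.5, 5.0, size=(n, 3))

# Assumed ground-truth combination, chosen only for this illustration.
true_w = np.array([0.2, 0.1, 0.7])
true_w0 = 0.05
y = true_w0 + X @ true_w

# Ordinary least squares with an intercept: prepend a column of ones.
X_aug = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

w0, w = coef[0], coef[1:]
print("intercept:", np.round(w0, 3))
print("weights:  ", np.round(w, 3))
```

With real base predictions the columns are noisy and strongly correlated, so the recovered weights will be far less clean than in this toy case.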


In [None]:
meta_model = LinearRegression()
meta_model.fit(X_meta, y_meta)

print("Meta-model coefficients (w1, w2, w3):", meta_model.coef_)
print("Meta-model intercept (w0):", meta_model.intercept_)


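The coefficients can be interpreted as **weights** assigned to each base
model. Because all three inputs are on the same rating scale, a larger
magnitude means a stronger influence. Base predictions are usually highly
correlated, though, so individual coefficients can be unstable and should
be read with some care.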


## 5. Evaluation on the test set

We now evaluate:

1. The three base models separately.
2. A simple **uniform average** ensemble.
3. The **stacked ensemble** using the learned meta-model.

All of these are evaluated on **the same test set**, which was not used
in fitting either base models or the meta-model.


In [None]:
# Base model predictions on test set

y_test = test_df["rating"].to_numpy()

bias_test_preds = bias_model.predict_df(test_df)
item_test_preds = item_knn_model.predict_df(test_df)
mf_test_preds = mf_model.predict_df(test_df)

# Simple uniform average ensemble
simple_ensemble_test = (bias_test_preds + item_test_preds + mf_test_preds) / 3.0

# Stacked ensemble predictions on test set
X_test_meta = np.vstack([
    bias_test_preds,
    item_test_preds,
    mf_test_preds,
]).T

stacked_test_preds = meta_model.predict(X_test_meta)

# Compute RMSE for all
results = {
    "BiasModel": rmse(y_test, bias_test_preds),
    "ItemKNNModel": rmse(y_test, item_test_preds),
    "MFModel": rmse(y_test, mf_test_preds),
    "SimpleAverageEnsemble": rmse(y_test, simple_ensemble_test),
    "StackedEnsemble": rmse(y_test, stacked_test_preds),
}

results


In [None]:
results_df = pd.DataFrame(
    {
        "model": list(results.keys()),
        "rmse": list(results.values()),
    }
)

results_df.sort_values("rmse")


In [None]:
sns.barplot(data=results_df, x="model", y="rmse")
plt.xticks(rotation=20)
plt.ylabel("RMSE (lower is better)")
plt.title("Base models vs ensembles on test set")
plt.show()


Typically, you will see:

- One **single model** (often matrix factorisation) ahead of the other two.
- The **simple average ensemble** is often competitive with, or slightly
  better than, most single models.
- The **stacked ensemble** usually matches or beats the best base model,
  sometimes noticeably. This is not guaranteed: the weights are fitted on
  `meta_train`, so the test RMSE can occasionally be slightly worse.

The exact numbers depend on random seeds and hyperparameters.


## 6. Visual diagnostics

We visualise how the stacked ensemble compares to a strong base model
(usually matrix factorisation) in rating space:

- True vs predicted scatter.
- Error distribution.


In [None]:
# Choose MF as a strong base model for comparison

plt.scatter(y_test, mf_test_preds, alpha=0.2, label="MF")
plt.scatter(y_test, stacked_test_preds, alpha=0.2, label="Stacked", marker="x")
plt.plot([0.5, 5], [0.5, 5], linestyle="--", color="black")
plt.xlabel("True rating")
plt.ylabel("Predicted rating")
plt.xlim(0.5, 5.0)
plt.ylim(0.5, 5.0)
plt.title("True vs predicted: MF vs Stacked Ensemble")
plt.legend()
plt.show()

# Error distributions
mf_errors = y_test - mf_test_preds
stacked_errors = y_test - stacked_test_preds

sns.histplot(mf_errors, color="tab:blue", label="MF", kde=False, bins=30, alpha=0.6)
sns.histplot(stacked_errors, color="tab:orange", label="Stacked", kde=False, bins=30, alpha=0.6)
plt.title("Prediction error distribution (MF vs Stacked)")
plt.xlabel("True - predicted")
plt.legend()
plt.show()


You should see that the stacked ensemble often has slightly smaller
absolute errors and a narrower error distribution.


## 7. Interpreting the meta-model weights

Since we used `LinearRegression`, we can inspect the learned coefficients
as a **data-driven weighting scheme** over base models.

Example interpretation:

- If `w2` (item-kNN) is small, the meta-model is effectively ignoring it.
- If `w3` (MF) is large, the stack heavily trusts the MF model.
- If `w1` (bias model) is non-zero, it acts as a stabiliser / shrinkage
  term for cold or uncertain regions.


In [None]:
coef_names = ["BiasModel", "ItemKNNModel", "MFModel"]
coef_values = meta_model.coef_

coef_df = pd.DataFrame({"base_model": coef_names, "weight": coef_values})
coef_df


In [None]:
sns.barplot(data=coef_df, x="base_model", y="weight")
plt.title("Meta-model weights for base models")
plt.ylabel("Linear coefficient")
plt.show()


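These weights are not constrained to sum to 1 or be positive, so a negative
weight does not necessarily mean a model is harmful; it may simply offset a
correlated model. They still give a rough sense of relative importance.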

If you want a **convex combination** (non-negative weights that sum to 1),
common choices include:

- Non-negative least squares (this enforces non-negativity only; the
  sum-to-1 constraint has to be added separately).
- Simple grid search over weight triplets.
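
The grid-search option can be sketched in a few lines of NumPy. `convex_grid_search` is a hypothetical helper, not part of this notebook, and the demo data is synthetic: the target is built as an exact convex blend so the search has a known answer.

```python
import numpy as np


def convex_grid_search(X, y, step=0.05):
    """Hypothetical helper: try every weight triplet on a grid over the simplex
    (w >= 0, sum(w) == 1) and keep the one with the lowest RMSE."""
    n_steps = int(round(1.0 / step))
    best_w, best_rmse = None, np.inf
    for a in range(n_steps + 1):
        for b in range(n_steps + 1 - a):
            c = n_steps - a - b  # remaining mass, so the weights sum to 1
            w = np.array([a, b, c], dtype=float) / n_steps
            err = float(np.sqrt(np.mean((X @ w - y) ** 2)))
            if err < best_rmse:
                best_w, best_rmse = w, err
    return best_w, best_rmse


# Synthetic check: the target is an exact convex blend of three made-up columns.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.5, 5.0, size=(500, 3))
y_demo = X_demo @ np.array([0.1, 0.3, 0.6])

weights, score = convex_grid_search(X_demo, y_demo, step=0.1)
print("weights:", weights, "RMSE:", score)
```

In this notebook the same function could be called as `convex_grid_search(X_meta, y_meta)`. Unlike `LinearRegression`, this has no intercept, so it cannot correct a constant offset shared by all base models.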


## 8. Design choices – why this ensemble setup?

### 8.1 Why these base models?

We chose three base recommenders with **different inductive biases**:

1. **BiasModel**
   - Very stable, low-variance, captures global structure.
   - Handles sparse users/items gracefully.

2. **ItemKNNModel**
   - Local, memory-based model.
   - Good at finding highly similar items based on overlapping ratings.
   - Can exploit neighbourhood structure that MF might smooth over.

3. **MatrixFactorizationModel**
   - Latent-factor model that captures global interaction patterns.
   - Often best single model in practice on explicit feedback.

By combining them, the ensemble can:

- Lean on MF when enough data is available.
- Use k-NN when a user has strong local neighbourhood signals.
- Fall back to the bias model in sparse regions.

### 8.2 Why stacking instead of just averaging?

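Uniform averaging gives every base model the same weight, regardless of
how accurate it is. Stacking lets the meta-model learn:

- How much weight each base model deserves overall.
- How to correct systematic offset or scale errors in the base predictions
  (through the intercept and the size of the weights).

For example, the meta-model may learn that:

- MF predictions are slightly too extreme and should be shrunk.
- Item-kNN is noisy overall and should receive a small weight.

A linear meta-model applies one set of weights everywhere. Learning
different weights for sparse items, or for different parts of the rating
range, requires a non-linear meta-model or extra meta-features such as
rating counts.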

### 8.3 Why a linear meta-model?

- Simple, fast, easy to inspect.
- Coefficients are directly interpretable as weights.
- Often a good baseline; you can replace it with non-linear models later.

If you want more power, you can swap `LinearRegression` for e.g. a
`GradientBoostingRegressor` while keeping the rest of the pipeline.


## 9. Extensions and production considerations

Some natural extensions of this notebook:

1. **More base models**
   - Add a content-based model (e.g. using genres or tags).
   - Add external models such as Surprise SVD or LightFM.
   - Treat their predictions as additional meta-features.

2. **Cross-validated stacking**
   - Instead of a single `meta_train` split, use K-fold CV:
     - For each fold, get out-of-fold predictions from base models.
     - Train meta-model on all out-of-fold predictions.

3. **Ranking-focused ensembles**
   - Here we optimised RMSE (rating prediction).
   - For top-N recommendation, you could:
     - Compute ranking metrics (precision@K, recall@K) per model.
     - Learn a meta-model over ranking losses or per-user calibration.

4. **Serving ensembles**
   - In production, you may:
     - Precompute base-model scores offline.
     - Store them as features in a feature store.
     - Use a lightweight online model (e.g. linear layer) to combine them
       at request time.
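
Extension 2 (cross-validated stacking) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two base models are toy group-mean predictors rather than the models in this notebook, the data is synthetic, and `group_mean_model` and `out_of_fold_predictions` are hypothetical helper names.

```python
import numpy as np


def group_mean_model(col):
    """Toy base model factory: predict the training mean rating of the id in
    column `col` (0 = user, 1 = item), falling back to the global mean."""
    def predict(train_ids, train_y, test_ids):
        fallback = float(train_y.mean())
        totals, counts = {}, {}
        for key, rating in zip(train_ids[:, col].tolist(), train_y.tolist()):
            totals[key] = totals.get(key, 0.0) + rating
            counts[key] = counts.get(key, 0) + 1
        return np.array([
            totals[key] / counts[key] if key in totals else fallback
            for key in test_ids[:, col].tolist()
        ])
    return predict


def out_of_fold_predictions(ids, y, base_models, n_folds=5, seed=42):
    """Each row's meta-features come from models fitted without that row."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.zeros((len(y), len(base_models)))
    for held_out in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[held_out] = False
        for j, model in enumerate(base_models):
            oof[held_out, j] = model(ids[train_mask], y[train_mask], ids[held_out])
    return oof


# Synthetic ratings: 20 users and 15 items with made-up effects plus noise.
rng = np.random.default_rng(0)
ids = np.column_stack([rng.integers(0, 20, size=400), rng.integers(0, 15, size=400)])
user_effect = rng.normal(0.0, 0.6, size=20)
item_effect = rng.normal(0.0, 0.6, size=15)
noise = rng.normal(0.0, 0.4, size=400)
y = np.clip(3.5 + user_effect[ids[:, 0]] + item_effect[ids[:, 1]] + noise, 0.5, 5.0)

base_models = [group_mean_model(0), group_mean_model(1)]
X_oof = out_of_fold_predictions(ids, y, base_models)

# Meta-learner: least squares with an intercept on the out-of-fold features.
X_aug = np.column_stack([np.ones(len(y)), X_oof])
meta_coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print("OOF feature matrix:", X_oof.shape)
print("intercept and weights:", meta_coef)
```

Compared with a single `meta_train` split, every training row contributes to the meta-model. A common follow-up is to refit each base model on the full training data before predicting on the test set.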

This notebook provides a concrete, end-to-end template for **stacked
recommender ensembles**, showing the full path from design to training,
combination and evaluation.
