# MovieLens Recommender – Hybrid Content + Ranking Ensemble

This notebook extends the earlier ranking ensemble with a **content-based
base model** and blends it with classic collaborative filtering.

We will:

1. Use MovieLens `ml-latest-small` (`ratings.csv`, `movies.csv`).
2. Train 3 collaborative base models on explicit ratings:
   - Bias model (global + user + item effects).
   - Item-based k-NN collaborative filtering.
   - Matrix factorisation with biases.
3. Build a **genre-based content model** from `movies.csv`:
   - Represent each movie by a multi-hot genre vector.
   - Represent each user by the average genre vector of the movies they
     rated at or above the relevance threshold.
   - Score `(user, item)` by cosine similarity between user and item
     genre vectors.
4. Define relevance as `rating ≥ threshold` (e.g. 4.0).
5. Evaluate all base models plus a simple average on:
   - Hit-rate@K.
   - Precision@K.
   - Recall@K.
6. Train a **logistic regression stacking model** on top of **four base
   scores** (3 CF + 1 content).
7. Compare ranking performance and inspect learned weights for
   collaborative vs content model.

The result is a more diverse ensemble that mixes two kinds of signal:

- **Collaborative** signals: patterns in who rated what.
- **Content** signals: what the items are about (genres).


## 1. Imports and configuration

Dependencies:

- `pandas`, `numpy` – data.
- `sklearn` – splitting, similarity, logistic regression.
- `matplotlib`, `seaborn` – plots.

Models are implemented in plain Python/NumPy with small, typed classes.


In [None]:
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Set

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer, normalize

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_DIR: Path = Path("data") / "ml-latest-small"
RATINGS_PATH: Path = DATA_DIR / "ratings.csv"
MOVIES_PATH: Path = DATA_DIR / "movies.csv"

for p in [RATINGS_PATH, MOVIES_PATH]:
    if not p.exists():
        raise FileNotFoundError(
            f"Required file not found: {p.resolve()}\n"
            "Please ensure MovieLens 'ml-latest-small' is under data/ml-latest-small/."
        )

ratings_df = pd.read_csv(RATINGS_PATH)
movies_df = pd.read_csv(MOVIES_PATH)

print("Ratings shape:", ratings_df.shape)
print("Movies shape: ", movies_df.shape)
ratings_df.head()


In [None]:
def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute root mean squared error (RMSE).

    Args:
        y_true: True rating values.
        y_pred: Predicted rating values.

    Returns:
        RMSE as a float.
    """
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


sns.histplot(ratings_df["rating"], bins=10)
plt.title("Rating distribution")
plt.xlabel("Rating")
plt.show()


## 2. Train / meta / test splits

We use a 3-way split suitable for stacking:

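1. `train_full` (80%) vs `test` (20%).
2. `train_full` → `base_train` (80% of `train_full`, i.e. 64% of all
   ratings) + `meta_train` (20% of `train_full`, i.e. 16% of all ratings).

- Base models are trained on `base_train`.
- The meta-model is trained on base-model predictions for `meta_train`.
- `test` is never seen during training.
- The split is by rating rows, so some users or items in `meta_train` and
  `test` never appear in `base_train`; the base models then fall back to
  defaults (zero bias terms, a default rating, or a zero content score).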
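The effective shares of the nested split, and how many held-out rows involve a user or item that the base models never saw, can be checked with a small helper. This is a sketch on a toy frame; `unseen_fraction` is a hypothetical name, not part of the notebook.

```python
import pandas as pd


def unseen_fraction(train: pd.DataFrame, held_out: pd.DataFrame) -> float:
    """Fraction of held-out rows whose userId or movieId never appears in train."""
    seen_users = set(train["userId"])
    seen_items = set(train["movieId"])
    unseen = ~held_out["userId"].isin(seen_users) | ~held_out["movieId"].isin(seen_items)
    return float(unseen.mean())


# Effective shares of all ratings under the nested 80/20 + 80/20 split.
base_share = 0.8 * 0.8  # base_train
meta_share = 0.8 * 0.2  # meta_train
test_share = 0.2        # test

# Toy data: user 3 and movie 12 are absent from the training frame.
train = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 10], "rating": [4.0, 3.0, 5.0]})
held = pd.DataFrame({"userId": [1, 3, 2, 2], "movieId": [11, 10, 12, 10], "rating": [4.0, 2.0, 3.0, 4.0]})
frac = unseen_fraction(train, held)
print(f"base={base_share:.2f} meta={meta_share:.2f} test={test_share:.2f} unseen={frac:.2f}")
```

In the notebook itself this would be called as `unseen_fraction(base_train_df, meta_train_df)` or `unseen_fraction(base_train_df, test_df)`.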


In [None]:
train_full_df, test_df = train_test_split(
    ratings_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

base_train_df, meta_train_df = train_test_split(
    train_full_df,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print("Base train size:", len(base_train_df))
print("Meta train size:", len(meta_train_df))
print("Test size:      ", len(test_df))


## 3. Collaborative base models

We reuse three models:

1. **BiasModel** – global mean + user and item biases.
2. **ItemKNNModel** – item-based k-NN CF using cosine similarity.
3. **MatrixFactorizationModel** – latent factor model with SGD training.

All expose a common API:

- `fit(df)`
- `predict_df(df)`
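The shared API can be made explicit with a `typing.Protocol`, so that helper functions accept any base model. This is a minimal sketch; `RatingModel`, `GlobalMeanModel` and `fit_and_score` are hypothetical names. The notebook's own classes would satisfy the protocol structurally because they define both methods.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable

import numpy as np
import pandas as pd


@runtime_checkable
class RatingModel(Protocol):
    """Structural type for the shared fit / predict_df API (hypothetical name)."""

    def fit(self, df: pd.DataFrame) -> None: ...

    def predict_df(self, df: pd.DataFrame) -> np.ndarray: ...


class GlobalMeanModel:
    """Toy model that satisfies RatingModel by predicting the training mean."""

    def __init__(self) -> None:
        self.mu: float | None = None

    def fit(self, df: pd.DataFrame) -> None:
        self.mu = float(df["rating"].mean())

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")
        return np.full(len(df), self.mu, dtype=float)


def fit_and_score(model: RatingModel, train: pd.DataFrame, evaluate: pd.DataFrame) -> np.ndarray:
    """Fit any conforming model and return scores aligned with `evaluate` rows."""
    model.fit(train)
    return model.predict_df(evaluate)


train = pd.DataFrame({"userId": [1, 2], "movieId": [10, 10], "rating": [3.0, 5.0]})
scores = fit_and_score(GlobalMeanModel(), train, train)
print(scores)
```

`runtime_checkable` only checks that the methods exist, not their signatures, so it is a convenience rather than a strict contract.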


### 3.1 BiasModel


In [None]:
class BiasModel:
    """Global + user + item bias recommender.

    Rating estimate: mu + b_u + b_i
    where mu is the global mean and b_u / b_i are the user's and item's
    mean deviations from mu (unregularised).
    """

    def __init__(self) -> None:
        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit bias terms from a ratings DataFrame.

        Args:
            df: DataFrame with columns `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        self.mu = float(df["rating"].mean())

        user_mean = df.groupby("userId")["rating"].mean()
        item_mean = df.groupby("movieId")["rating"].mean()

        # Deviations from global mean
        self.user_bias = (user_mean - self.mu).to_dict()
        self.item_bias = (item_mean - self.mu).to_dict()

    def predict_row(self, user_id: int, movie_id: int) -> float:
        """Predict rating for a single user–item pair.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Predicted rating as float.
        """
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id, 0.0)
        bi = self.item_bias.get(movie_id, 0.0)
        return float(self.mu + bu + bi)

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Predict ratings for many user–item pairs.

        Args:
            df: DataFrame with `userId`, `movieId`.

        Returns:
            Array of predictions aligned with df rows.
        """
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_row(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


bias_model = BiasModel()
bias_model.fit(base_train_df)

bias_meta_preds = bias_model.predict_df(meta_train_df)
print("Bias model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), bias_meta_preds))
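`BiasModel` uses raw mean deviations, so an item rated once by a single enthusiast receives that rating's full deviation as its bias. A common alternative, swapped in here as an assumption rather than what the notebook does, shrinks each bias toward zero with a damping term λ and estimates item biases before user biases. The sketch also vectorises prediction with `Series.map` instead of a per-row loop. `fit_shrunk_biases`, `predict_biases` and the λ values are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_shrunk_biases(df: pd.DataFrame, lam_item: float = 10.0, lam_user: float = 10.0):
    """Estimate the global mean plus damped item and user biases.

    b_i = sum(r - mu) / (lam_item + n_i); b_u = sum(r - mu - b_i) / (lam_user + n_u).
    With lam_item = 0 the item bias reduces to the plain mean deviation.
    """
    mu = float(df["rating"].mean())
    resid = df["rating"] - mu
    item_grp = resid.groupby(df["movieId"])
    item_bias = item_grp.sum() / (lam_item + item_grp.count())

    # User biases are fitted on what the item biases leave unexplained.
    resid_u = resid - df["movieId"].map(item_bias)
    user_grp = resid_u.groupby(df["userId"])
    user_bias = user_grp.sum() / (lam_user + user_grp.count())
    return mu, user_bias, item_bias


def predict_biases(mu: float, user_bias: pd.Series, item_bias: pd.Series, df: pd.DataFrame) -> np.ndarray:
    """Vectorised prediction; unseen users or items get a zero bias."""
    bu = df["userId"].map(user_bias).fillna(0.0).to_numpy()
    bi = df["movieId"].map(item_bias).fillna(0.0).to_numpy()
    return mu + bu + bi


toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 10], "rating": [5.0, 3.0, 4.0]})
mu, user_bias, item_bias = fit_shrunk_biases(toy, lam_item=1.0, lam_user=1.0)
preds = predict_biases(mu, user_bias, item_bias, pd.DataFrame({"userId": [1, 3], "movieId": [10, 99]}))
print(preds)
```

The damping values would normally be tuned on `meta_train`; larger λ pulls rarely rated items and users closer to the global mean.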


### 3.2 ItemKNNModel – item-based CF


In [None]:
class ItemKNNModel:
    """Item-based k-NN collaborative filtering.

    Uses cosine similarity between item rating columns (unrated entries
    count as 0) and predicts a similarity-weighted average of the user's
    ratings on the k most similar items they have rated.
    """

    def __init__(self, k: int = 40, default_rating: float = 3.5) -> None:
        self.k = k
        self.default_rating = float(default_rating)

        self.user_id_to_index: Dict[int, int] = {}
        self.item_id_to_index: Dict[int, int] = {}
        self.R: np.ndarray | None = None
        self.item_sim: np.ndarray | None = None

    def fit(self, df: pd.DataFrame) -> None:
        """Fit k-NN model from a ratings DataFrame.

        Args:
            df: DataFrame with `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        unique_users = df["userId"].unique()
        unique_items = df["movieId"].unique()

        self.user_id_to_index = {uid: idx for idx, uid in enumerate(unique_users)}
        self.item_id_to_index = {iid: idx for idx, iid in enumerate(unique_items)}

        n_users = len(unique_users)
        n_items = len(unique_items)

        R = np.zeros((n_users, n_items), dtype=np.float32)
        for row in df.itertuples(index=False):
            u_idx = self.user_id_to_index[row.userId]
            i_idx = self.item_id_to_index[row.movieId]
            R[u_idx, i_idx] = row.rating

        self.R = R
        # Dense item-item cosine matrix: memory grows as O(n_items^2).
        self.item_sim = cosine_similarity(R.T)

    def _predict_single(self, user_id: int, movie_id: int) -> float:
        if self.R is None or self.item_sim is None:
            raise RuntimeError("Model has not been fitted.")

        u_idx = self.user_id_to_index.get(user_id)
        i_idx = self.item_id_to_index.get(movie_id)
        if u_idx is None or i_idx is None:
            return self.default_rating

        user_ratings = self.R[u_idx, :]
        sims = self.item_sim[i_idx, :]

        rated_mask = user_ratings > 0
        rated_indices = np.where(rated_mask)[0]
        if rated_indices.size == 0:
            return self.default_rating

        sims_rated = sims[rated_indices]
        ratings_rated = user_ratings[rated_indices]

        k_use = min(self.k, rated_indices.size)
        top_idx = np.argsort(sims_rated)[-k_use:]

        neigh_sims = sims_rated[top_idx]
        neigh_ratings = ratings_rated[top_idx]

        if np.all(neigh_sims == 0):
            return float(neigh_ratings.mean())

        pred = float(np.dot(neigh_sims, neigh_ratings) / np.sum(np.abs(neigh_sims)))
        return pred

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self._predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


bias_global_mean = float(base_train_df["rating"].mean())
item_knn_model = ItemKNNModel(k=40, default_rating=bias_global_mean)
item_knn_model.fit(base_train_df)

item_meta_preds = item_knn_model.predict_df(meta_train_df)
print("Item-kNN model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), item_meta_preds))
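A hand-checkable example of the prediction rule in `ItemKNNModel`: on a 3×3 toy matrix, the score for an unrated item is the similarity-weighted average of the user's ratings on the most similar items they did rate. `knn_predict` is a hypothetical stand-alone helper that omits the class's fallbacks for unknown ids and all-zero similarities.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def knn_predict(R: np.ndarray, u: int, i: int, k: int) -> float:
    """Similarity-weighted average of user u's ratings on the k rated items most similar to i.

    Follows ItemKNNModel._predict_single: 0 means "unrated", cosine is taken over item columns.
    """
    item_sim = cosine_similarity(R.T)                # (n_items, n_items)
    rated = np.where(R[u] > 0)[0]                    # items user u has rated
    sims = item_sim[i, rated]
    top = np.argsort(sims)[-min(k, rated.size):]     # indices of the k largest similarities
    w, r = sims[top], R[u, rated][top]
    return float(np.dot(w, r) / np.sum(np.abs(w)))


# Rows are users, columns are items; user 0 has not rated item 2.
R = np.array([[5, 3, 0],
              [4, 0, 4],
              [1, 1, 5]], dtype=float)
print(knn_predict(R, u=0, i=2, k=2))  # blend of ratings 5 and 3, weighted toward item 0
print(knn_predict(R, u=0, i=2, k=1))  # only the nearest neighbour (item 0) is used
```

Item 2's column `[0, 4, 5]` is closer to item 0's `[5, 4, 1]` (cosine 21/√(41·42)) than to item 1's `[3, 0, 1]` (cosine 5/√(41·10)), so the prediction leans toward the rating of 5.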


### 3.3 MatrixFactorizationModel – latent factors


In [None]:
@dataclass
class MFConfig:
    """Configuration for matrix factorisation model."""

    n_factors: int = 30
    n_epochs: int = 10
    lr: float = 0.01
    reg: float = 0.05


class MatrixFactorizationModel:
    """Matrix factorisation with biases trained via SGD.

    Rating estimate: mu + b_u + b_i + p_u^T q_i
    """

    def __init__(self, config: MFConfig, random_state: int = 42) -> None:
        self.config = config
        self.random_state = random_state

        self.mu: float | None = None
        self.user_bias: Dict[int, float] = {}
        self.item_bias: Dict[int, float] = {}
        self.P: Dict[int, np.ndarray] = {}
        self.Q: Dict[int, np.ndarray] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit MF model on a ratings DataFrame.

        Args:
            df: DataFrame with `userId`, `movieId`, `rating`.
        """
        if df.empty:
            raise ValueError("Training DataFrame is empty.")

        rng = np.random.default_rng(self.random_state)

        user_ids = df["userId"].unique()
        item_ids = df["movieId"].unique()

        self.mu = float(df["rating"].mean())
        self.user_bias = {u: 0.0 for u in user_ids}
        self.item_bias = {i: 0.0 for i in item_ids}

        k = self.config.n_factors
        self.P = {u: 0.1 * rng.standard_normal(k) for u in user_ids}
        self.Q = {i: 0.1 * rng.standard_normal(k) for i in item_ids}

        lr = self.config.lr
        reg = self.config.reg

        user_arr = df["userId"].to_numpy()
        item_arr = df["movieId"].to_numpy()
        rating_arr = df["rating"].to_numpy()

        n_obs = len(df)

        for epoch in range(self.config.n_epochs):
            idx = rng.permutation(n_obs)
            se = 0.0

            for t in idx:
                u = int(user_arr[t])
                i = int(item_arr[t])
                r_ui = float(rating_arr[t])

                bu = self.user_bias[u]
                bi = self.item_bias[i]
                pu = self.P[u]
                qi = self.Q[i]

                pred = self.mu + bu + bi + float(np.dot(pu, qi))
                err = r_ui - pred
                se += err * err

                # Bias updates
                self.user_bias[u] = bu + lr * (err - reg * bu)
                self.item_bias[i] = bi + lr * (err - reg * bi)

                # Latent factor updates
                pu_new = pu + lr * (err * qi - reg * pu)
                qi_new = qi + lr * (err * pu - reg * qi)

                self.P[u] = pu_new
                self.Q[i] = qi_new

            train_rmse = float(np.sqrt(se / n_obs))
            print(f"Epoch {epoch+1}/{self.config.n_epochs} - train RMSE: {train_rmse:.4f}")

    def predict_single(self, user_id: int, movie_id: int) -> float:
        if self.mu is None:
            raise RuntimeError("Model has not been fitted.")

        bu = self.user_bias.get(user_id)
        bi = self.item_bias.get(movie_id)
        pu = self.P.get(user_id)
        qi = self.Q.get(movie_id)

        if bu is None or bi is None or pu is None or qi is None:
            return float(self.mu)

        return float(self.mu + bu + bi + float(np.dot(pu, qi)))

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        preds: List[float] = []
        for row in df.itertuples(index=False):
            preds.append(self.predict_single(int(row.userId), int(row.movieId)))
        return np.array(preds, dtype=float)


mf_config = MFConfig(n_factors=30, n_epochs=8, lr=0.01, reg=0.05)
mf_model = MatrixFactorizationModel(config=mf_config, random_state=RANDOM_STATE)

mf_model.fit(base_train_df)

mf_meta_preds = mf_model.predict_df(meta_train_df)
print("MF model RMSE on meta_train:", rmse(meta_train_df["rating"].to_numpy(), mf_meta_preds))
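For reference, the updates in `MatrixFactorizationModel.fit` are plain SGD steps on the regularised squared error of one observed rating, with learning rate η (`lr`) and regularisation λ (`reg`). The derivation below is written to match the code, not taken from an external specification. μ stays fixed at the training mean, and both factor updates use the pre-update p_u and q_i, as the code does.

```latex
\hat r_{ui} = \mu + b_u + b_i + p_u^\top q_i, \qquad e_{ui} = r_{ui} - \hat r_{ui}

\ell_{ui} = \tfrac{1}{2}\Big(e_{ui}^2 + \lambda\big(b_u^2 + b_i^2 + \lVert p_u\rVert^2 + \lVert q_i\rVert^2\big)\Big)

b_u \leftarrow b_u + \eta\,(e_{ui} - \lambda\, b_u), \qquad
b_i \leftarrow b_i + \eta\,(e_{ui} - \lambda\, b_i)

p_u \leftarrow p_u + \eta\,(e_{ui}\, q_i - \lambda\, p_u), \qquad
q_i \leftarrow q_i + \eta\,(e_{ui}\, p_u - \lambda\, q_i)
```

Each update is θ ← θ − η ∂ℓ_ui/∂θ; for example, ∂ℓ_ui/∂b_u = −e_ui + λ b_u.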


## 4. Genre-based content model

We now build a **content-based base model** from genres in `movies.csv`.

Steps:

1. Parse `genres` column into a list of genre tokens per movie.
2. Build a multi-hot genre matrix using `MultiLabelBinarizer`.
3. Create **movie genre vectors** (rows of that matrix).
4. For each user, build a **user profile vector** as the average of genre
   vectors of movies they rated above a threshold.
5. Score `(user, item)` as the cosine similarity between the user vector
   and the item genre vector.

This is a classic, interpretable content-based model.
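The five steps can be traced on a toy catalogue of four movies. This sketch uses hypothetical movie ids and a user who liked movies 1 and 2. It follows the same recipe as `build_genre_matrix` and `GenreContentModel`, but it is not the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, normalize

# Step 1: split pipe-separated genres; "(no genres listed)" becomes an empty list.
genres = pd.Series(["Action|Comedy", "Comedy", "Drama", "(no genres listed)"],
                   index=[1, 2, 3, 4], name="genres")
tokens = genres.apply(lambda s: [] if s == "(no genres listed)" else s.split("|"))

# Steps 2-3: multi-hot movie vectors (one row per movie, columns sorted by genre name).
mlb = MultiLabelBinarizer()
item_mat = mlb.fit_transform(tokens)
item_vecs = pd.DataFrame(item_mat, index=genres.index, columns=mlb.classes_)

# Step 4: user profile = mean vector of the liked movies (1 and 2), L2-normalised.
profile = normalize(item_vecs.loc[[1, 2]].mean().to_numpy().reshape(1, -1))[0]

# Step 5: cosine similarity = dot product of unit-length vectors (zero rows stay zero).
unit_items = normalize(item_vecs.to_numpy(dtype=float))
scores = pd.Series(unit_items @ profile, index=genres.index)
print(mlb.classes_)
print(scores.round(3))
```

The Action+Comedy movie scores highest because the profile `[0.5, 1, 0]` points mostly toward Comedy with some Action; Drama and the genre-less movie score 0.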


In [None]:
def build_genre_matrix(movies: pd.DataFrame) -> Tuple[pd.DataFrame, MultiLabelBinarizer]:
    """Build a multi-hot genre matrix for movies.

    Args:
        movies: DataFrame with columns `movieId`, `genres`.

    Returns:
        Tuple of (genre_df, mlb) where genre_df has movieId as index and
        one column per genre.
    """
    movies = movies.copy()
    movies["genres_list"] = movies["genres"].fillna("(no genres listed)")

    def split_genres(s: str) -> List[str]:
        if s == "(no genres listed)" or s == "No Genres Listed":
            return []
        return [g.strip() for g in s.split("|") if g.strip()]

    movies["genres_list"] = movies["genres_list"].apply(split_genres)

    mlb = MultiLabelBinarizer()
    genre_mat = mlb.fit_transform(movies["genres_list"])

    genre_df = pd.DataFrame(
        genre_mat,
        columns=[f"genre:{g}" for g in mlb.classes_],
        index=movies["movieId"].to_numpy(),
    )
    return genre_df, mlb


genre_df, mlb = build_genre_matrix(movies_df)
print("Genre matrix shape:", genre_df.shape)
genre_df.head()


In [None]:
class GenreContentModel:
    """Content-based model using movie genres.

    For each user, builds a genre profile based on liked movies and
    scores items by cosine similarity between user and item genre
    vectors.
    """

    def __init__(self, genre_df: pd.DataFrame, rel_threshold: float = 4.0) -> None:
        self.genre_df = genre_df
        self.rel_threshold = rel_threshold
        # userId -> genre vector (np.ndarray)
        self.user_profiles: Dict[int, np.ndarray] = {}

    def fit(self, df: pd.DataFrame) -> None:
        """Fit user profiles from ratings.

        Args:
            df: Ratings DataFrame with `userId`, `movieId`, `rating`.
        """
        # Filter to liked movies
        liked = df[df["rating"] >= self.rel_threshold]

        # Join genres for liked items
        joined = liked.merge(self.genre_df, left_on="movieId", right_index=True, how="left")

        genre_cols = self.genre_df.columns

        # Build profiles as average genre vector of liked movies
        profiles = joined.groupby("userId")[genre_cols].mean().fillna(0.0)

        # L2-normalise for cosine similarity
        profiles_mat = normalize(profiles.to_numpy(), norm="l2")

        for uid, vec in zip(profiles.index.to_numpy(), profiles_mat):
            self.user_profiles[int(uid)] = vec

    def _score_single(self, user_id: int, movie_id: int) -> float:
        """Score a single user–item pair by cosine similarity.

        Args:
            user_id: User identifier.
            movie_id: Movie identifier.

        Returns:
            Cosine similarity in [0, 1] (both vectors are non-negative), or
            0.0 if no profile / item vector is available.
        """
        user_vec = self.user_profiles.get(user_id)
        if user_vec is None:
            return 0.0

        if movie_id not in self.genre_df.index:
            return 0.0

        item_vec = self.genre_df.loc[movie_id].to_numpy(dtype=float)
        if not np.any(item_vec):
            return 0.0

        # Normalise item vector to unit length
        item_vec_norm = item_vec / (np.linalg.norm(item_vec) + 1e-12)
        score = float(np.dot(user_vec, item_vec_norm))
        return score

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        """Score multiple user–item pairs.

        Args:
            df: DataFrame with `userId`, `movieId`.

        Returns:
            Array of similarity scores aligned with df rows.
        """
        scores: List[float] = []
        for row in df.itertuples(index=False):
            scores.append(self._score_single(int(row.userId), int(row.movieId)))
        return np.array(scores, dtype=float)


genre_model = GenreContentModel(genre_df=genre_df, rel_threshold=4.0)
genre_model.fit(base_train_df)

genre_meta_scores = genre_model.predict_df(meta_train_df)
print("Genre-content scores on meta_train: min=", genre_meta_scores.min(), "max=", genre_meta_scores.max())
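`_score_single` performs one `.loc` lookup per row, which becomes slow on tens of thousands of pairs. Below is a possible vectorised replacement. It assumes that `user_profiles` is the dict built by `GenreContentModel.fit` (unit-length vectors keyed by `userId`) and that `genre_df` comes from `build_genre_matrix`. `score_pairs_vectorised` is a hypothetical helper.

```python
from typing import Dict

import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize


def score_pairs_vectorised(user_profiles: Dict[int, np.ndarray],
                           genre_df: pd.DataFrame,
                           pairs: pd.DataFrame) -> np.ndarray:
    """Cosine scores for many (userId, movieId) rows without per-row .loc lookups.

    Users without a profile, and movies missing from genre_df or with no genres,
    score 0.0, matching GenreContentModel._score_single.
    """
    if pairs.empty:
        return np.zeros(0)
    zero = np.zeros(genre_df.shape[1])
    # One (already unit-length) profile row per pair, or a zero vector if unknown.
    U = np.vstack([user_profiles.get(int(u), zero) for u in pairs["userId"]])
    # Align item genre rows to the pairs; unknown movies become all-zero rows.
    items = genre_df.reindex(pairs["movieId"].to_numpy()).fillna(0.0).to_numpy(dtype=float)
    I = normalize(items)  # zero rows stay zero
    return np.einsum("ij,ij->i", U, I)  # row-wise dot products


genre_df = pd.DataFrame([[1, 1, 0], [0, 1, 0], [0, 0, 0]], index=[10, 20, 30],
                        columns=["genre:Action", "genre:Comedy", "genre:Drama"])
profiles = {1: normalize(np.array([[1.0, 2.0, 0.0]]))[0]}
pairs = pd.DataFrame({"userId": [1, 1, 1, 2], "movieId": [10, 30, 99, 10]})
print(score_pairs_vectorised(profiles, genre_df, pairs))
```

In the notebook it could be called as `score_pairs_vectorised(genre_model.user_profiles, genre_df, meta_train_df)`, and its output should match `genre_model.predict_df(meta_train_df)` up to floating-point error.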


## 5. Relevance and ranking metrics

We now switch focus to **ranking quality**.

- Define relevance: rating ≥ `REL_THRESHOLD`.
- For each user, rank that user's own test items (the held-out ratings,
  not the full catalogue) by a model's scores.
- Compute:
  - Hit-rate@K.
  - Precision@K.
  - Recall@K.


In [None]:
REL_THRESHOLD: float = 4.0
K_EVAL: int = 10


def get_user_relevant_items(df: pd.DataFrame, user_id: int, threshold: float = REL_THRESHOLD) -> Set[int]:
    """Return set of relevant movieIds for a user.

    Args:
        df: Ratings DataFrame.
        user_id: User identifier.
        threshold: Rating threshold for relevance.

    Returns:
        Set of movieIds considered relevant.
    """
    mask = (df["userId"] == user_id) & (df["rating"] >= threshold)
    return set(df.loc[mask, "movieId"].unique())


def ranking_metrics_for_model_on_test(
    test_df: pd.DataFrame,
    scores: np.ndarray,
    k: int = K_EVAL,
    threshold: float = REL_THRESHOLD,
) -> Dict[str, float]:
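    """Compute hit-rate@K, precision@K and recall@K from per-row scores.

    Each user's candidates are only their own rows in test_df (not the full
    catalogue). Precision@K divides by min(k, number of candidates), and users
    with no relevant test item are skipped.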

    Args:
        test_df: Test ratings DataFrame.
        scores: Score array aligned with test_df.
        k: Cutoff for top-K recommendations.
        threshold: Rating threshold for relevance.

    Returns:
        Dict with hit_rate, precision_at_k, recall_at_k, n_eval_users.
    """
    df_scores = test_df.copy()
    df_scores["score"] = scores

    users = df_scores["userId"].unique()

    hits = 0
    sum_precision = 0.0
    sum_recall = 0.0
    n_eval_users = 0

    for u in users:
        user_rows = df_scores[df_scores["userId"] == u]
        relevant_items = get_user_relevant_items(test_df, u, threshold=threshold)
        if not relevant_items:
            continue

        n_eval_users += 1

        user_rows_sorted = user_rows.sort_values("score", ascending=False)
        top_k = user_rows_sorted.head(k)

        recommended_items = set(top_k["movieId"].tolist())
        n_rel_top = len(recommended_items & relevant_items)

        if n_rel_top > 0:
            hits += 1

        precision_u = n_rel_top / min(k, len(user_rows_sorted))
        recall_u = n_rel_top / len(relevant_items)

        sum_precision += precision_u
        sum_recall += recall_u

    if n_eval_users == 0:
        raise ValueError("No users with relevant items in test for ranking evaluation.")

    hit_rate = hits / n_eval_users
    precision_at_k = sum_precision / n_eval_users
    recall_at_k = sum_recall / n_eval_users

    return {
        "hit_rate": hit_rate,
        "precision_at_k": precision_at_k,
        "recall_at_k": recall_at_k,
        "n_eval_users": float(n_eval_users),
    }
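A worked example of the per-user quantities that `ranking_metrics_for_model_on_test` averages, using the same conventions (in particular, precision divided by `min(k, number of candidates)`). `user_metrics` is a hypothetical helper, and the item ids are made up.

```python
from typing import List, Set, Tuple


def user_metrics(ranked_items: List[int], relevant: Set[int], k: int) -> Tuple[int, float, float]:
    """Hit, precision@K and recall@K for one user's ranked candidate list."""
    top_k = set(ranked_items[:k])
    n_rel_top = len(top_k & relevant)
    hit = int(n_rel_top > 0)
    precision = n_rel_top / min(k, len(ranked_items))
    recall = n_rel_top / len(relevant)
    return hit, precision, recall


# One user with 5 test items, already sorted by score; items 2 and 5 have rating >= 4.0.
ranked = [3, 2, 4, 1, 5]
relevant = {2, 5}
print(user_metrics(ranked, relevant, k=3))   # (1, 0.3333333333333333, 0.5)
print(user_metrics(ranked, relevant, k=10))  # (1, 0.4, 1.0)
```

The second call shows a property of this protocol: when a user has no more than K test items, every relevant item lands in the top K, so hit-rate and recall are 1 regardless of the model's scores. Users with small test sets therefore inflate the averages for all models equally.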


## 6. Base models (collaborative + content) as rankers

We now evaluate 4 base models and a simple average ensemble on the test
set using ranking metrics.


In [None]:
y_test = test_df["rating"].to_numpy()

bias_test_scores = bias_model.predict_df(test_df)
item_test_scores = item_knn_model.predict_df(test_df)
mf_test_scores = mf_model.predict_df(test_df)
genre_test_scores = genre_model.predict_df(test_df)

metrics_bias = ranking_metrics_for_model_on_test(test_df, bias_test_scores, k=K_EVAL)
metrics_item = ranking_metrics_for_model_on_test(test_df, item_test_scores, k=K_EVAL)
metrics_mf = ranking_metrics_for_model_on_test(test_df, mf_test_scores, k=K_EVAL)
metrics_genre = ranking_metrics_for_model_on_test(test_df, genre_test_scores, k=K_EVAL)

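# Simple unweighted average of all four raw scores. The CF scores are on the
# rating scale (~0.5-5) while the genre score is a cosine in [0, 1], so the
# genre model has comparatively little influence on this average.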
avg4_test_scores = (bias_test_scores + item_test_scores + mf_test_scores + genre_test_scores) / 4.0
metrics_avg4 = ranking_metrics_for_model_on_test(test_df, avg4_test_scores, k=K_EVAL)

metrics_bias, metrics_item, metrics_mf, metrics_genre, metrics_avg4
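Because the raw average mixes scales, one option (an assumption, not what the notebook does) is to z-score each model's scores over the evaluated rows before averaging. Standardising is an affine map per model, so it leaves each model's own ranking unchanged and only rebalances their weight in the average. `zscore` and `standardised_average` are hypothetical helpers, and the toy scores are made up.

```python
import numpy as np


def zscore(x: np.ndarray) -> np.ndarray:
    """Standardise one model's scores to mean 0, std 1 (constant scores map to 0)."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x, dtype=float)


def standardised_average(*score_arrays: np.ndarray) -> np.ndarray:
    """Average several score arrays after putting each on a common scale."""
    return np.mean([zscore(np.asarray(s, dtype=float)) for s in score_arrays], axis=0)


# Toy scores for 4 rows: a rating-scale model and a [0, 1] cosine model disagree on order.
cf = np.array([4.5, 3.5, 2.5, 1.5])
genre = np.array([0.0, 0.1, 0.9, 1.0])
raw_avg = (cf + genre) / 2
std_avg = standardised_average(cf, genre)
print(np.argsort(-raw_avg))  # follows the CF order; genre is effectively ignored
print(np.argsort(-std_avg))  # both models now influence the order
```

In the notebook this would be `standardised_average(bias_test_scores, item_test_scores, mf_test_scores, genre_test_scores)`.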


In [None]:
rows = []
for name, m in [
    ("BiasModel", metrics_bias),
    ("ItemKNNModel", metrics_item),
    ("MFModel", metrics_mf),
    ("GenreContent", metrics_genre),
    ("SimpleAverage4", metrics_avg4),
]:
    rows.append(
        {
            "model": name,
            "hit_rate": m["hit_rate"],
            "precision_at_k": m["precision_at_k"],
            "recall_at_k": m["recall_at_k"],
        }
    )

base_rank_df = pd.DataFrame(rows)
base_rank_df


In [None]:
base_rank_melt = base_rank_df.melt(id_vars="model", var_name="metric", value_name="value")

sns.barplot(data=base_rank_melt, x="metric", y="value", hue="model")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title(f"Collaborative vs content vs simple average (K={K_EVAL})")
plt.show()


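The genre-based content model may not beat MF on its own. Its intended
value is as a **complementary signal**: it can still score items that are
missing from `base_train` (as long as the user has a genre profile), and it
may help for users with clear genre tastes. This evaluation does not isolate
those cases, so treat that benefit as something to verify.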


## 7. Stacked ensemble with collaborative + content scores

We now build a **logistic regression meta-model** for ranking, using
four base scores as features:

- BiasModel score.
- ItemKNNModel score.
- MFModel score.
- GenreContent score.

Meta-training data comes from `meta_train_df`.


In [None]:
# Build meta features and labels on meta_train_df.
# Short aliases for readability; genre_meta_scores was computed in section 4.
bias_meta_scores = bias_meta_preds
item_meta_scores = item_meta_preds
mf_meta_scores = mf_meta_preds

X_meta_rank = np.vstack([
    bias_meta_scores,
    item_meta_scores,
    mf_meta_scores,
    genre_meta_scores,
]).T

y_meta_rank = (meta_train_df["rating"].to_numpy() >= REL_THRESHOLD).astype(int)

print("X_meta_rank shape:", X_meta_rank.shape)
print("Positive rate (meta):", y_meta_rank.mean())


In [None]:
meta_rank_model = LogisticRegression(
    penalty="l2",
    C=1.0,
    solver="lbfgs",
    max_iter=1000,
    random_state=RANDOM_STATE,
)

meta_rank_model.fit(X_meta_rank, y_meta_rank)

print("Meta-ranking coefficients (Bias, ItemKNN, MF, Genre):", meta_rank_model.coef_)
print("Meta-ranking intercept:", meta_rank_model.intercept_)


### 7.1 Evaluate stacked ensemble on test

We now apply the meta-model to test rows:

- Build `X_test_rank` from the four base scores on test.
- Compute `p(relevant)` for each row.
- Rank per user by that probability.
- Compute hit-rate@K, precision@K, recall@K.
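Ranking each user's rows by `p(relevant)` is equivalent to ranking them by the meta-model's linear score, because the predicted probability is a strictly increasing function of `decision_function`; the intercept therefore does not change the per-user order. A quick check on synthetic data (random features standing in for the four base scores, not MovieLens):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for four base-model scores and binary relevance labels.
X = rng.normal(size=(200, 4))
y = (X @ np.array([1.0, 0.5, 2.0, 0.3]) + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
logit = clf.decision_function(X)

# Sorting rows by the logit must also sort the probabilities (non-decreasing).
order = np.argsort(logit)
monotone = bool(np.all(np.diff(proba[order]) >= 0))
print(monotone)
```

This is why the stacked ranking depends only on the relative coefficients, not on how well the probabilities are calibrated.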


In [None]:
X_test_rank = np.vstack([
    bias_test_scores,
    item_test_scores,
    mf_test_scores,
    genre_test_scores,
]).T

stacked_test_proba = meta_rank_model.predict_proba(X_test_rank)[:, 1]

metrics_stacked = ranking_metrics_for_model_on_test(
    test_df,
    stacked_test_proba,
    k=K_EVAL,
    threshold=REL_THRESHOLD,
)

metrics_stacked


In [None]:
rows_full = rows.copy()
rows_full.append(
    {
        "model": "StackedLogistic4",
        "hit_rate": metrics_stacked["hit_rate"],
        "precision_at_k": metrics_stacked["precision_at_k"],
        "recall_at_k": metrics_stacked["recall_at_k"],
    }
)

rank_compare_df = pd.DataFrame(rows_full)
rank_compare_df


In [None]:
rank_compare_melt = rank_compare_df.melt(id_vars="model", var_name="metric", value_name="value")

sns.barplot(data=rank_compare_melt, x="metric", y="value", hue="model")
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title(f"Ranking ensemble with collaborative + content (K={K_EVAL})")
plt.show()


Depending on the split, the stacked ensemble (`StackedLogistic4`) may
match or slightly exceed the best single CF model and the simple average.
This is not guaranteed, so check the table above. Any gain is likely to be
modest on this small dataset; stacking tends to pay off more as the base
models become stronger and more diverse.


## 8. Interpreting collaborative vs content weights

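We inspect the logistic regression coefficients for the four base
models to see how each contributes to predicted relevance. The features
are unscaled (rating-scale CF scores vs. a cosine score in [0, 1]), and
the three CF scores are likely correlated, so raw coefficient magnitudes
are not directly comparable across models.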


In [None]:
coef_names = ["BiasModel", "ItemKNNModel", "MFModel", "GenreContent"]
coef_values = meta_rank_model.coef_[0]

coef_df = pd.DataFrame({"base_model": coef_names, "weight": coef_values})
coef_df


In [None]:
sns.barplot(data=coef_df, x="base_model", y="weight")
plt.title("Meta-ranking weights: collaborative vs content")
plt.ylabel("Logistic regression coefficient")
plt.show()


Possible readings (check them against the actual coefficients above):

- **MFModel** may receive a large positive weight if it carries the
  strongest CF signal.
- **ItemKNNModel** may receive a smaller or moderate weight (local CF
  signal).
- **BiasModel** can act as a stabiliser.
- **GenreContent** may receive a small positive weight, which would suggest
  that genre similarity helps resolve some cases (for example items absent
  from `base_train`, or users with clear genre tastes).

A negative weight does not by itself mean that a base model hurts. When
base scores are correlated, logistic regression can give one of them a
negative weight so that it acts as a correction term on the others.
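To make the weights easier to compare, the features can be standardised before fitting, so that each coefficient is the change in log-odds per standard deviation of that base score. Below is a sketch with scikit-learn's `make_pipeline(StandardScaler(), LogisticRegression())` on synthetic correlated scores rather than the notebook's data; with the real notebook, `X_meta_rank` and `y_meta_rank` would take the place of `X` and `y`. Standardising does not remove collinearity, so correlated CF scores can still split or trade weight.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-ins: three correlated rating-scale scores and one [0, 1] cosine score.
latent = rng.normal(size=n)
X = np.column_stack([
    3.5 + 0.5 * latent + 0.3 * rng.normal(size=n),  # "BiasModel"
    3.5 + 0.6 * latent + 0.4 * rng.normal(size=n),  # "ItemKNNModel"
    3.5 + 0.7 * latent + 0.2 * rng.normal(size=n),  # "MFModel"
    rng.uniform(0.0, 1.0, size=n),                  # "GenreContent"
])
y = (latent + 0.5 * (X[:, 3] - 0.5) + 0.5 * rng.normal(size=n) > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

names = ["BiasModel", "ItemKNNModel", "MFModel", "GenreContent"]
# Each coefficient: change in log-odds per one standard deviation of that score.
std_coef = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=names)
print(std_coef.round(3))
```

Because the scaler is part of the pipeline, `pipe.predict_proba(X_test_rank)` would apply the same training-set scaling to test scores.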


## 9. Design summary and extensions

### 9.1 Why add a content model?

Collaborative models only use **who rated what**. They struggle when:

- An item is new (few or no ratings).
- A user has very few interactions.

Content-based models use **what the item is** (genres, tags, text):

- They can recommend items that are similar in content even if
  interactions are sparse.
- They can make ensembles more robust to sparse interactions and easier
  to interpret.

### 9.2 Why still use logistic regression on top?

- We want a **single final score** that reflects relevance.
- Base models can have different scales and biases.
- Logistic regression fits one weight per base score plus an intercept by
  maximising the log-likelihood of the binary relevance labels, which
  reweights and rescales the base scores into a single probability of
  relevance.

### 9.3 Easy extensions

1. **Richer content**
   - Use tags (from `tags.csv`) and build TF-IDF features.
   - Use embeddings from descriptions / plots if available.

2. **More base models**
   - Add LightFM, Surprise SVD, etc., and treat their scores as
     additional features in `X_meta_rank`.

3. **User segments**
   - Add user-level features (e.g. number of ratings, average rating)
     as meta-features.

4. **Cross-validated stacking**
   - Use out-of-fold base predictions instead of a single meta split for
     more robust meta-training.
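Out-of-fold stacking could look like the sketch below. Each row of the meta-training data is scored by a base model fitted on the other folds, so every row, not just a 20% slice, contributes meta-features. `out_of_fold_scores` and the toy `MeanItemModel` are hypothetical. Any class with the notebook's `fit(df)` / `predict_df(df)` API could be passed in (e.g. `lambda: ItemKNNModel(k=40)`), though refitting the k-NN and MF models once per fold multiplies training time. The base models used to score `test` would then be refit on all of `train_full`.

```python
from typing import Any, Callable

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


class MeanItemModel:
    """Hypothetical toy base model: predicts each item's mean training rating."""

    def fit(self, df: pd.DataFrame) -> None:
        self.mu = float(df["rating"].mean())
        self.item_mean = df.groupby("movieId")["rating"].mean()

    def predict_df(self, df: pd.DataFrame) -> np.ndarray:
        # Items unseen in this fold's training part fall back to the global mean.
        return df["movieId"].map(self.item_mean).fillna(self.mu).to_numpy(dtype=float)


def out_of_fold_scores(df: pd.DataFrame, make_model: Callable[[], Any],
                       n_splits: int = 5, seed: int = 42) -> np.ndarray:
    """Score every row of df with a model that never saw that row during fitting."""
    oof = np.empty(len(df), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, hold_idx in kf.split(df):
        model = make_model()
        model.fit(df.iloc[fit_idx])
        oof[hold_idx] = model.predict_df(df.iloc[hold_idx])
    return oof


ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 3.5, 4.5, 2.5],
})
oof = out_of_fold_scores(ratings, MeanItemModel, n_splits=4)
print(oof)
```

Stacking one such column per base model gives an `X_meta_rank` that covers all of `train_full`.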

This notebook gives a full, concrete example of **hybrid collaborative +
content ensembles aimed at ranking metrics**, with explanations of why
each design choice is made.
