# Hybrid MovieLens Recommender – LightFM with Genres & Tags

This notebook extends the previous MovieLens recommenders by building a
**hybrid model**:

- We still use **collaborative signals** (user–item interactions).
- We now add **content features** from `movies.csv` and `tags.csv`:
  - Genres as item features.
  - User-generated tags as item features.

We will:

1. Load MovieLens ratings, movies and tags.
2. Build an **implicit feedback** interaction matrix from ratings.
3. Construct item features from **genres + tags**.
4. Train two LightFM models:
   - Model A: **interactions only**.
   - Model B: **interactions + item features (genres + tags)**.
5. Compare their performance with **precision@K** and **recall@K**.
6. Generate recommendations for a sample user and inspect the content
   features of a recommended movie.

Assumed files (MovieLens `ml-latest-small`):

```text
data/ml-latest-small/ratings.csv
data/ml-latest-small/movies.csv
data/ml-latest-small/tags.csv
```

## 0. Environment setup (outside this notebook)

Before running this notebook you need to install LightFM:

```bash
pip install lightfm
```

Or with conda:

```bash
conda install -c conda-forge lightfm
```

We only rely on `lightfm` and standard scientific Python libraries.


## 1. Imports and configuration

We will use:

- `pandas`, `numpy` – data handling.
- `scipy.sparse` – sparse matrices.
- `matplotlib`, `seaborn` – visualisations.
- `lightfm` – hybrid recommender model.

We treat MovieLens ratings as **implicit feedback** by thresholding the
rating: rating ≥ 4.0 → positive interaction.


In [None]:
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Set, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import sparse

from lightfm import LightFM
from lightfm.data import Dataset as LFMDataset
from lightfm.evaluation import (
    precision_at_k as lfm_precision_at_k,
    recall_at_k as lfm_recall_at_k,
)
from lightfm.cross_validation import random_train_test_split

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_DIR: Path = Path("data") / "ml-latest-small"
RATINGS_PATH: Path = DATA_DIR / "ratings.csv"
MOVIES_PATH: Path = DATA_DIR / "movies.csv"
TAGS_PATH: Path = DATA_DIR / "tags.csv"

for p in [RATINGS_PATH, MOVIES_PATH, TAGS_PATH]:
    if not p.exists():
        raise FileNotFoundError(
            f"Required file not found: {p.resolve()}. "
            "Please ensure you have the 'ml-latest-small' MovieLens dataset under data/ml-latest-small/."
        )


## 2. Load data and quick EDA

We load:

- `ratings.csv` – userId, movieId, rating, timestamp.
- `movies.csv` – movieId, title, genres.
- `tags.csv` – userId, movieId, tag, timestamp.

We keep EDA light here, since detailed analysis already exists in previous
notebooks.


In [None]:
ratings_df = pd.read_csv(RATINGS_PATH)
movies_df = pd.read_csv(MOVIES_PATH)
tags_df = pd.read_csv(TAGS_PATH)

print("Ratings:", ratings_df.shape)
print("Movies:", movies_df.shape)
print("Tags:   ", tags_df.shape)

ratings_df.head()

In [None]:
n_users: int = ratings_df["userId"].nunique()
n_items: int = ratings_df["movieId"].nunique()

print(f"Users: {n_users}, Movies (in ratings): {n_items}, Ratings: {len(ratings_df)}")
print(f"Density: {len(ratings_df) / (n_users * n_items):.6f}")

sns.histplot(ratings_df["rating"], bins=10)
plt.title("Rating distribution")
plt.xlabel("Rating")
plt.show()


## 3. Build implicit feedback interactions

We convert ratings into implicit interactions:

- If `rating >= 4.0` → positive interaction (1).
- Otherwise → no interaction (0).

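This is a simple heuristic, but a common one in MovieLens examples with
LightFM. Note that it discards the difference between a low rating and no
rating at all: both end up as "no interaction".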
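
As a small illustration of what the threshold keeps and what it drops, the
sketch below applies the same rule to a handful of made-up ratings. The toy
data and the helper name `to_positive_pairs` are assumptions for
illustration only; they are not part of the notebook's pipeline.

```python
from typing import List, Tuple

# Hypothetical toy ratings: (userId, movieId, rating)
toy_ratings: List[Tuple[int, int, float]] = [
    (1, 10, 5.0),
    (1, 20, 3.5),
    (2, 10, 4.0),
    (2, 30, 2.0),
]


def to_positive_pairs(
    ratings: List[Tuple[int, int, float]], threshold: float = 4.0
) -> List[Tuple[int, int]]:
    """Keep only (user, movie) pairs whose rating meets the threshold."""
    return [(u, m) for u, m, r in ratings if r >= threshold]


toy_positives = to_positive_pairs(toy_ratings)
print(toy_positives)  # [(1, 10), (2, 10)]
```

The 3.5 and 2.0 ratings vanish entirely, so the model cannot tell a
disliked movie from one the user never rated.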


In [None]:
POS_THRESH: float = 4.0

implicit_df = ratings_df.copy()
implicit_df["interaction"] = (implicit_df["rating"] >= POS_THRESH).astype(int)

print(implicit_df["interaction"].value_counts(normalize=True))

# Filter only positive interactions for LightFM interactions matrix
pos_df = implicit_df.loc[implicit_df["interaction"] == 1, ["userId", "movieId"]].copy()
print("Positive interactions:", pos_df.shape[0])
pos_df.head()

## 4. Build item features from genres and tags

We want **string features** per item so that LightFM can treat them as
high-cardinality sparse features.

### 4.1 Genres as features

From `movies.csv` we have a `genres` column with pipe-separated genres,
for example:

```text
Adventure|Comedy|Sci-Fi
```

We turn each genre into a feature token of the form `"genre:Comedy"`.
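
For the example string above, the tokenisation can be sketched in one line.
The variable names here are illustrative assumptions. The `genre:` prefix
keeps a genre such as `Comedy` distinct from a user tag with the same text.

```python
raw_genres = "Adventure|Comedy|Sci-Fi"

# Split on the pipe, drop empty pieces, and prefix each genre
genre_tokens = [f"genre:{g.strip()}" for g in raw_genres.split("|") if g.strip()]

print(genre_tokens)  # ['genre:Adventure', 'genre:Comedy', 'genre:Sci-Fi']
```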


In [None]:
def extract_genre_features(movies: pd.DataFrame) -> Dict[int, List[str]]:
    """Extract genre-based feature tokens per movie.

    Args:
        movies: DataFrame with at least columns movieId, genres.

    Returns:
        Mapping from movieId to list of genre feature tokens.
    """
    features: Dict[int, List[str]] = {}
    for row in movies.itertuples(index=False):
        movie_id = int(row.movieId)
        genres_str: str = row.genres if isinstance(row.genres, str) else ""
        if genres_str == "(no genres listed)" or genres_str == "No Genres Listed":
            genres_list: List[str] = []
        else:
            genres_list = [g.strip() for g in genres_str.split("|") if g.strip()]
        tokens = [f"genre:{g}" for g in genres_list]
        features[movie_id] = tokens
    return features


genre_features = extract_genre_features(movies_df)

# Quick sanity check
sample_movie = movies_df.sample(1, random_state=RANDOM_STATE).iloc[0]
print("Sample movie:", sample_movie["movieId"], sample_movie["title"], sample_movie["genres"])
print("Genre feature tokens:", genre_features[int(sample_movie["movieId"])] )


### 4.2 Tags as features

`tags.csv` contains user-generated free text tags per `(userId, movieId)`.

We:

1. Lowercase and strip tags.
2. Aggregate all tags per movie.
3. Turn them into tokens like `"tag:space-travel"`.

To keep the feature space under control, we apply two limits:

- A global frequency filter (`min_tag_freq`) that drops very rare tags.
- A cap on the number of tags kept per movie (`max_tags_per_movie`).
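
To see the effect of cleaning plus a frequency filter on a small scale, here
is a plain-Python sketch on made-up tags. The toy data, the `min_freq` value
and all variable names are assumptions for illustration; the notebook's own
function works on the real `tags.csv` with pandas.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy tags: (movieId, raw tag text)
toy_tags: List[Tuple[int, str]] = [
    (1, "Space Travel "),
    (1, "space travel"),
    (2, "space travel"),
    (2, "Boring"),
    (3, "twist ending"),
    (3, "Twist Ending"),
]

# Step 1: normalise case and surrounding whitespace.
cleaned = [(movie_id, tag.lower().strip()) for movie_id, tag in toy_tags]

# Step 2: count how often each cleaned tag occurs overall.
tag_counts = Counter(tag for _, tag in cleaned)

# Step 3: keep tags seen at least `min_freq` times, deduplicated per movie.
min_freq = 2
toy_tag_tokens: Dict[int, List[str]] = {}
for movie_id, tag in cleaned:
    if tag_counts[tag] < min_freq:
        continue
    token = f"tag:{tag}"
    tokens = toy_tag_tokens.setdefault(movie_id, [])
    if token not in tokens:
        tokens.append(token)

print(toy_tag_tokens)
```

The tag `boring` occurs only once and is dropped, while the differently
cased variants of `space travel` and `twist ending` collapse into one token
each.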


In [None]:
def extract_tag_features(
    tags: pd.DataFrame,
    min_tag_freq: int = 5,
    max_tags_per_movie: int = 20,
) -> Dict[int, List[str]]:
    """Extract tag-based feature tokens per movie.

    Args:
        tags: DataFrame with columns userId, movieId, tag.
        min_tag_freq: Minimum global frequency to keep a tag.
        max_tags_per_movie: Maximum number of tags per movie to retain.

    Returns:
        Mapping from movieId to list of tag feature tokens.
    """
    tags = tags.copy()
    tags["tag_clean"] = tags["tag"].astype(str).str.lower().str.strip()

    # Filter extremely rare tags
    tag_counts = tags["tag_clean"].value_counts()
    frequent_tags = set(tag_counts[tag_counts >= min_tag_freq].index)

    tags = tags[tags["tag_clean"].isin(frequent_tags)]

    movie_to_tags: Dict[int, List[str]] = {}

    for movie_id, grp in tags.groupby("movieId"):
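        # Deduplicate while preserving first-seen order. The cap below therefore
        # keeps the first tags encountered for a movie, not its most frequent ones.
        unique_tags = list(dict.fromkeys(grp["tag_clean"]))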
        limited_tags = unique_tags[:max_tags_per_movie]
        tokens = [f"tag:{t}" for t in limited_tags]
        movie_to_tags[int(movie_id)] = tokens

    return movie_to_tags


tag_features = extract_tag_features(tags_df, min_tag_freq=3, max_tags_per_movie=20)

print("Number of movies with at least one tag feature:", len(tag_features))

# Inspect features for the same sample movie (if present)
movie_id_sample = int(sample_movie["movieId"])
print("Tag feature tokens for sample (if any):", tag_features.get(movie_id_sample, []))


### 4.3 Combine genre and tag features per movie

For each `movieId` we combine:

- `genre:*` tokens.
- `tag:*` tokens (when available).

We will later pass these as item features into the LightFM `Dataset`.


In [None]:
def combine_item_features(
    movies: pd.DataFrame,
    genre_feat: Dict[int, List[str]],
    tag_feat: Dict[int, List[str]],
) -> Dict[int, List[str]]:
    """Combine genre and tag features for each movie.

    Args:
        movies: Movies DataFrame with movieId.
        genre_feat: Genre feature tokens per movieId.
        tag_feat: Tag feature tokens per movieId.

    Returns:
        Mapping from movieId to combined feature token list.
    """
    combined: Dict[int, List[str]] = {}
    for row in movies.itertuples(index=False):
        movie_id = int(row.movieId)
        g_tokens = genre_feat.get(movie_id, [])
        t_tokens = tag_feat.get(movie_id, [])
        tokens = g_tokens + t_tokens
        # Movies with no genres and no frequent tags share one fallback token.
        # (LightFM's Dataset also adds a per-item identity feature by default.)
        if not tokens:
            tokens = ["bias:item"]
        combined[movie_id] = tokens
    return combined


item_features_tokens = combine_item_features(movies_df, genre_features, tag_features)

# Feature counts per movie for a quick visual
feat_counts = pd.Series({mid: len(feats) for mid, feats in item_features_tokens.items()})

sns.histplot(feat_counts, bins=20)
plt.title("Number of features per movie (genres + tags)")
plt.xlabel("# features")
plt.show()

feat_counts.describe()

## 5. Build LightFM Dataset and interactions

We now create a `Dataset` for LightFM and build:

- User and item ID mappings.
- Interaction matrix from positive events.
- Item feature matrix from our tokens.

We will train **two models** on the same interactions:

- Model A: without item features.
- Model B: with item features.
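
To make the shape of the item feature matrix easier to interpret, the sketch
below emulates in plain Python what LightFM's `Dataset` does under its
default settings, as far as we understand them: every item gets its own
identity column in addition to the shared token columns, and each row is
normalised to sum to 1. The toy catalogue and the column ordering are
assumptions for illustration; this is not LightFM's actual code.

```python
from typing import Dict, List

# Hypothetical toy catalogue: movieId -> feature tokens
toy_item_tokens: Dict[int, List[str]] = {
    10: ["genre:Comedy", "tag:funny"],
    20: ["genre:Comedy"],
    30: ["bias:item"],
}

item_ids = sorted(toy_item_tokens)
token_vocab = sorted({t for tokens in toy_item_tokens.values() for t in tokens})

# Assumed column layout: one identity column per item, then one per token.
n_cols = len(item_ids) + len(token_vocab)
token_col = {t: len(item_ids) + j for j, t in enumerate(token_vocab)}

toy_matrix: List[List[float]] = []
for row_idx, movie_id in enumerate(item_ids):
    row = [0.0] * n_cols
    row[row_idx] = 1.0  # identity feature, unique to this item
    for token in toy_item_tokens[movie_id]:
        row[token_col[token]] = 1.0  # content feature, shared across items
    total = sum(row)
    toy_matrix.append([v / total for v in row])  # row-normalise to sum to 1

for movie_id, row in zip(item_ids, toy_matrix):
    print(movie_id, [round(v, 2) for v in row])
```

Under these defaults the matrix width should equal the number of items plus
the number of feature tokens, which is why the printed shape has more
columns than there are tokens.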


In [None]:
# Collect full sets of user and item ids
all_users: np.ndarray = ratings_df["userId"].unique()
all_items: np.ndarray = movies_df["movieId"].unique()

# Collect all unique feature tokens
all_tokens: Set[str] = set()
for feats in item_features_tokens.values():
    all_tokens.update(feats)

print("# users:", len(all_users), "# items:", len(all_items), "# feature tokens:", len(all_tokens))

lfm_dataset = LFMDataset()
lfm_dataset.fit(
    users=all_users,
    items=all_items,
    item_features=list(all_tokens),
)

# Build interactions (positive events only)

(interactions, weights) = lfm_dataset.build_interactions(
    (int(row.userId), int(row.movieId)) for row in pos_df.itertuples(index=False)
)

print("Interactions shape:", interactions.shape)
print("# positive interactions:", interactions.getnnz())


In [None]:
# Build item feature matrix

item_features_list: List[Tuple[int, List[str]]] = [
    (int(mid), feats) for mid, feats in item_features_tokens.items()
]

item_features_matrix = lfm_dataset.build_item_features(item_features_list)

print("Item features matrix shape:", item_features_matrix.shape)


## 6. Train/test split for implicit data

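We hold out 20% of the positive interactions at random using LightFM's
`random_train_test_split` helper. The split ignores timestamps, so it
measures how well the models fill in missing interactions rather than how
well they predict future ones.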
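
Conceptually, a random split shuffles the individual `(user, item)` pairs
and cuts off a fraction as the test set. The following plain-Python sketch
shows the idea on toy data. The function `random_split` and the toy pairs
are assumptions for illustration and do not reproduce LightFM's sparse
matrix implementation.

```python
import random
from typing import List, Tuple

Pair = Tuple[int, int]


def random_split(
    pairs: List[Pair], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle (user, item) pairs and hold out a fraction as the test set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy interactions: 5 users x 4 items, all positive
toy_pairs: List[Pair] = [(u, i) for u in range(5) for i in range(4)]
toy_train, toy_test = random_split(toy_pairs)

print(len(toy_train), len(toy_test))  # 16 4
```

Each pair lands in exactly one of the two sets, so a user's history is
usually spread across both train and test.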


In [None]:
train_interactions, test_interactions = random_train_test_split(
    interactions,
    test_percentage=0.2,
    random_state=np.random.RandomState(RANDOM_STATE),
)

print("Train interactions:", train_interactions.getnnz())
print("Test interactions: ", test_interactions.getnnz())


## 7. Train LightFM models

We train two LightFM models with the same hyperparameters:

- **Model A (no features)** – only interactions.
- **Model B (with features)** – interactions + item feature matrix.

We use the **BPR** loss, a common choice for implicit feedback ranking.
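
For intuition, BPR (Bayesian Personalised Ranking) works on triplets of a
user, an item they interacted with, and a sampled item they did not. The
per-triplet loss is `-log(sigmoid(score_pos - score_neg))`, which is small
when the positive item is scored above the negative one. The sketch below
evaluates this formula for hand-picked scores; it illustrates the loss only
and is not LightFM's optimised implementation.

```python
import math


def bpr_loss(score_pos: float, score_neg: float) -> float:
    """Per-triplet BPR loss: -log(sigmoid(score_pos - score_neg))."""
    diff = score_pos - score_neg
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch on the sign of diff
    # so that exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


well_ranked = bpr_loss(score_pos=2.0, score_neg=-1.0)  # positive scored higher
mis_ranked = bpr_loss(score_pos=-1.0, score_neg=2.0)   # positive scored lower
tie = bpr_loss(score_pos=0.5, score_neg=0.5)           # equal scores

print(f"{well_ranked:.4f} {tie:.4f} {mis_ranked:.4f}")
```

The loss depends only on the score difference, so BPR optimises the
relative ordering of items per user rather than the absolute score values.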


In [None]:
@dataclass
class LightFMConfig:
    no_components: int = 40
    learning_rate: float = 0.05
    loss: str = "bpr"
    epochs: int = 25
    num_threads: int = 4


config = LightFMConfig()

# Model A: interactions only

model_no_features = LightFM(
    no_components=config.no_components,
    learning_rate=config.learning_rate,
    loss=config.loss,
    random_state=RANDOM_STATE,
)

model_no_features.fit(
    train_interactions,
    epochs=config.epochs,
    num_threads=config.num_threads,
    verbose=True,
)

# Model B: interactions + item features

model_with_features = LightFM(
    no_components=config.no_components,
    learning_rate=config.learning_rate,
    loss=config.loss,
    random_state=RANDOM_STATE,
)

model_with_features.fit(
    train_interactions,
    item_features=item_features_matrix,
    epochs=config.epochs,
    num_threads=config.num_threads,
    verbose=True,
)


## 8. Evaluate with precision@K and recall@K

We compare the two models using LightFM's evaluation functions.

We pass `train_interactions` so that items a user already interacted with
in the training set are excluded from the ranking before the metrics are
computed.
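
To make the two metrics concrete, here is a minimal single-user sketch in
plain Python. The helper `precision_recall_at_k` and the toy ranking are
assumptions for illustration; LightFM computes the same quantities per user
on sparse matrices, and the notebook then averages them with `.mean()`.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranked_items: List[int],
    train_items: Set[int],
    test_items: Set[int],
    k: int,
) -> Tuple[float, float]:
    """Precision@k and recall@k for one user, skipping train items."""
    # Drop items already seen in training, then keep the first k.
    top_k = [i for i in ranked_items if i not in train_items][:k]
    hits = sum(1 for i in top_k if i in test_items)
    precision = hits / k
    recall = hits / len(test_items) if test_items else 0.0
    return precision, recall


# Hypothetical ranking for one user, best item first
ranked = [7, 3, 9, 1, 4, 8]
p_at_3, r_at_3 = precision_recall_at_k(
    ranked_items=ranked, train_items={7}, test_items={9, 4}, k=3
)
print(round(p_at_3, 3), r_at_3)  # 0.333 0.5
```

Precision@K is bounded by the number of held-out positives: a user with two
test items can reach at most 0.2 for K = 10, so low absolute values are
expected.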


In [None]:
def evaluate_lightfm_models(
    model_a: LightFM,
    model_b: LightFM,
    train: sparse.spmatrix,
    test: sparse.spmatrix,
    item_features: sparse.spmatrix,
    k: int = 10,
    num_threads: int = 4,
) -> pd.DataFrame:
    """Evaluate two LightFM models with and without item features.

    Args:
        model_a: Model trained without item features.
        model_b: Model trained with item features.
        train: Train interaction matrix.
        test: Test interaction matrix.
        item_features: Item feature matrix.
        k: Cutoff for precision@k and recall@k.
        num_threads: Number of threads for LightFM evaluation.

    Returns:
        DataFrame with precision@k and recall@k for both models.
    """
    prec_a = lfm_precision_at_k(
        model_a,
        test,
        train_interactions=train,
        k=k,
        num_threads=num_threads,
    ).mean()

    rec_a = lfm_recall_at_k(
        model_a,
        test,
        train_interactions=train,
        k=k,
        num_threads=num_threads,
    ).mean()

    prec_b = lfm_precision_at_k(
        model_b,
        test,
        train_interactions=train,
        k=k,
        num_threads=num_threads,
        item_features=item_features,
    ).mean()

    rec_b = lfm_recall_at_k(
        model_b,
        test,
        train_interactions=train,
        k=k,
        num_threads=num_threads,
        item_features=item_features,
    ).mean()

    results = pd.DataFrame(
        {
            "model": ["LightFM (no features)", "LightFM (genres+tags)"],
            f"precision@{k}": [prec_a, prec_b],
            f"recall@{k}": [rec_a, rec_b],
        }
    )
    return results


k_eval = 10
results_df = evaluate_lightfm_models(
    model_a=model_no_features,
    model_b=model_with_features,
    train=train_interactions,
    test=test_interactions,
    item_features=item_features_matrix,
    k=k_eval,
    num_threads=config.num_threads,
)

results_df


In [None]:
# Visual comparison

results_melt = results_df.melt(id_vars="model", var_name="metric", value_name="value")

sns.barplot(data=results_melt, x="metric", y="value", hue="model")
plt.ylim(0, 1)
plt.title("LightFM – impact of item features (genres + tags)")
plt.ylabel("Score")
plt.show()


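If the item features are informative, the hybrid model can improve
precision@K and/or recall@K, with the largest gains expected for **cold or
sparse items** that have few training interactions. With a random split,
test items tend to appear in training as well, so the difference between
the two models may be small here.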


## 9. Inspecting recommendations and features for a sample user

We now:

1. Pick a random user.
2. Generate top-N recommendations with the **feature-aware** model.
3. Join with movie titles for readability.
4. Show the content features for a recommended movie.
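
The core of step 2 is "score everything, drop what the user has already
seen, keep the best N". A compact plain-Python sketch of that selection is
shown below. The helper `top_n_unseen` and the toy scores are assumptions
for illustration; the notebook's own function does the same on NumPy arrays
returned by `model.predict`.

```python
import heapq
from typing import Dict, List, Set, Tuple


def top_n_unseen(
    scores: Dict[int, float], seen: Set[int], n: int
) -> List[Tuple[int, float]]:
    """Return the n highest-scoring items that are not in `seen`."""
    candidates = ((item, s) for item, s in scores.items() if item not in seen)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Hypothetical scores for one user: movieId -> predicted score
toy_scores = {10: 0.9, 20: 0.4, 30: 1.3, 40: -0.2, 50: 0.7}
toy_recs = top_n_unseen(toy_scores, seen={30}, n=2)

print(toy_recs)  # [(10, 0.9), (50, 0.7)]
```

Movie 30 has the highest score but is skipped because the user has already
interacted with it. The scores are unbounded ranking values, so only their
order within one user is meaningful.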


In [None]:
def get_id_mappings(dataset: LFMDataset) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Return user and item external→internal id mappings from a Dataset.

    Args:
        dataset: LightFM Dataset instance.

    Returns:
        Tuple of (user_id_to_internal, item_id_to_internal).
    """
    user_id_mapping, _, item_id_mapping, _ = dataset.mapping()
    # mapping() returns dicts: user_id_mapping, user_feature_mapping,
    # item_id_mapping, item_feature_mapping.
    return user_id_mapping, item_id_mapping


user_id_map, item_id_map = get_id_mappings(lfm_dataset)

# Reverse mappings
internal_to_user = {internal: uid for uid, internal in user_id_map.items()}
internal_to_item = {internal: mid for mid, internal in item_id_map.items()}


def recommend_for_user(
    model: LightFM,
    dataset: LFMDataset,
    user_id: int,
    item_features: sparse.spmatrix | None,
    n: int = 10,
) -> pd.DataFrame:
    """Generate top-N recommendations for a given user.

    Args:
        model: Trained LightFM model.
        dataset: LightFM Dataset.
        user_id: External userId.
        item_features: Item feature matrix, or None.
        n: Number of recommendations.

    Returns:
        DataFrame with movieId, score.
    """
    user_id_map, item_id_map = get_id_mappings(dataset)
    if user_id not in user_id_map:
        raise ValueError(f"User {user_id} not in dataset.")

    user_internal = user_id_map[user_id]

    n_items_internal = len(item_id_map)
    item_internal_ids = np.arange(n_items_internal, dtype=np.int64)

    # Predict scores for all items
    scores = model.predict(
        user_ids=np.repeat(user_internal, n_items_internal),
        item_ids=item_internal_ids,
        item_features=item_features,
        num_threads=config.num_threads,
    )

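    # Exclude items with a known positive interaction (train + test).
    # Movies the user rated below POS_THRESH are absent from `interactions`,
    # so they remain eligible for recommendation.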
    user_rows = interactions.tocsr()[user_internal]
    already_interacted = set(user_rows.indices)

    candidate_indices = [i for i in item_internal_ids if i not in already_interacted]
    candidate_scores = scores[candidate_indices]

    top_idx = np.argsort(candidate_scores)[-n:][::-1]
    top_internal = [candidate_indices[i] for i in top_idx]
    top_scores = candidate_scores[top_idx]

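    # Map internal indices back to external movieIds using this dataset's mapping
    internal_to_movie = {internal: mid for mid, internal in item_id_map.items()}
    movie_ids = [internal_to_movie[i] for i in top_internal]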
    recs_df = pd.DataFrame({"movieId": movie_ids, "score": top_scores})
    return recs_df


sample_user_id: int = int(ratings_df["userId"].sample(1, random_state=RANDOM_STATE).iloc[0])
print("Sample user:", sample_user_id)

recs_df = recommend_for_user(
    model=model_with_features,
    dataset=lfm_dataset,
    user_id=sample_user_id,
    item_features=item_features_matrix,
    n=10,
)

recs_with_titles = recs_df.merge(movies_df, on="movieId", how="left")
recs_with_titles[["movieId", "title", "score"]]


### 9.1 Inspect content features for one recommended movie

For one of the top recommendations, we print its genres and tag features
so you can see what the model had available as side-information.


In [None]:
if not recs_with_titles.empty:
    top_movie_id = int(recs_with_titles.iloc[0]["movieId"])
    top_movie_row = movies_df.loc[movies_df["movieId"] == top_movie_id].iloc[0]
    print("Top recommended movie:")
    print(top_movie_row["movieId"], top_movie_row["title"], top_movie_row["genres"])

    feats = item_features_tokens.get(top_movie_id, [])
    print("\nContent feature tokens (genres + tags):")
    for f in feats:
        print("-", f)
else:
    print("No recommendations to inspect.")


## 10. Summary

In this notebook we built a **hybrid recommender** on MovieLens:

1. Converted ratings to implicit feedback and built an interactions matrix.
2. Created **item feature tokens** from `movies.csv` and `tags.csv`:
   - Genres → `genre:*` features.
   - Frequent tags → `tag:*` features.
3. Constructed a LightFM `Dataset` with both interactions and item features.
4. Trained two LightFM models:
   - Interactions only.
   - Interactions + item features.
5. Compared their ranking performance (**precision@10**, **recall@10**).
6. Generated recommended movies for a sample user and inspected the
   content features used by the hybrid model.

This gives you a realistic template for **content-aware collaborative
filtering** and a natural extension path towards:

- Adding **user features** (demographics, segments).
- Using richer text features (plot summaries via language models).
- Deploying the trained model in a real-time inference service.
