# Model Evaluation – Product Bundle Recommender

## Objective
This notebook evaluates the offline performance of the product bundle
recommender using historical transaction data.

Evaluation is performed using:
- Time-based train/test split
- Hit@K metric
- Basket reconstruction logic

The time-based split mirrors production use, where recommendations are built
from past transactions and served against future ones.


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from itertools import combinations

In [2]:
BASE_DIR = Path().resolve().parent
CLEAN_FILE = BASE_DIR / "data" / "processed" / "clean_transactions.parquet"

df = pd.read_parquet(CLEAN_FILE)
df.head()


|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |


In [3]:
FEATURE_PATH = BASE_DIR / "data" / "features"

weighted_co_df = pd.read_parquet(FEATURE_PATH / "weighted_co_occurrence.parquet")
popularity_df = pd.read_parquet(FEATURE_PATH / "product_popularity.parquet")

In [4]:
# Popularity lookup
popularity_map = dict(
    zip(popularity_df["StockCode"], popularity_df["popularity"])
)

# Co-occurrence lookup
co_map = {}

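# itertuples avoids building a Series per row, which iterrows does
pairs = weighted_co_df[["product_a", "product_b", "weighted_score"]]
for a, b, score in pairs.itertuples(index=False, name=None):
    # Store the score in both directions so lookups are symmetric
    co_map.setdefault(a, {})[b] = score
    co_map.setdefault(b, {})[a] = score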

In [5]:
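def score_product(target_product, candidate_product):
    """Score a candidate by co-occurrence with the target, damped by popularity."""
    co_score = co_map.get(target_product, {}).get(candidate_product, 0)
    # Candidates missing from the popularity table default to 1,
    # so the denominator log1p(1) is non-zero for them.
    pop = popularity_map.get(candidate_product, 1)
    return co_score / np.log1p(pop)

def recommend_bundle(product_id, top_k=5):
    """Return up to top_k products ranked by score for product_id."""
    # Products with no co-occurrence history get no recommendations.
    if product_id not in co_map:
        return []

    scored = [
        (prod, score_product(product_id, prod))
        for prod in co_map[product_id]
    ]
    # Highest score first.
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:top_k]]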

## Time-Based Split

To avoid data leakage, the data is split at a cutoff date, set to the 80th
percentile of `InvoiceDate`. Transactions at or before the cutoff form the
training set; transactions after it form the test set.
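
The split only prevents leakage if the lookup tables used for scoring are also built from pre-cutoff rows. This notebook loads precomputed feature tables and does not show how they were built, so the following is a minimal sketch of restricting pair counts to the training window. The toy data, the `toy_*` names, the fixed cutoff, and the use of plain (unweighted) pair counts are all assumptions for illustration, not the notebook's actual feature pipeline.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical toy transactions: three invoices on three dates.
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "2", "2", "3", "3"],
    "StockCode": ["A", "B", "A", "B", "C", "B", "C"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01"] * 2 + ["2011-02-01"] * 3 + ["2011-03-01"] * 2
    ),
})

# Assumed fixed cutoff for the toy data (the notebook uses a quantile).
toy_cutoff = pd.Timestamp("2011-02-15")
toy_train = toy[toy["InvoiceDate"] <= toy_cutoff]
toy_test = toy[toy["InvoiceDate"] > toy_cutoff]

# Count unordered product pairs per invoice, using training rows only.
pair_counts = Counter()
for _, items in toy_train.groupby("InvoiceNo")["StockCode"]:
    for a, b in combinations(sorted(set(items)), 2):
        pair_counts[(a, b)] += 1

print(dict(pair_counts))
```

Invoice 3 falls after the cutoff, so its (B, C) pair does not contribute to the counts; only the pair from invoice 2 does.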

In [6]:
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

cutoff_date = df["InvoiceDate"].quantile(0.8)

train_df = df[df["InvoiceDate"] <= cutoff_date]
test_df = df[df["InvoiceDate"] > cutoff_date]

train_df.shape, test_df.shape


((424122, 8), (105982, 8))

In [7]:
test_baskets = (
    test_df.groupby("InvoiceNo")["StockCode"]
           .apply(lambda x: list(set(x)))
)

test_baskets = test_baskets[test_baskets.apply(len) > 1]

len(test_baskets)

3225

## Hit@K Metric

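Each product in a test basket is treated in turn as the target, and the
remaining products in that basket are its actual co-purchases. A hit is
counted if at least one actual co-purchase appears in the top-K
recommendations for the target. Hit@K is the fraction of (basket, target)
pairs that score a hit.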
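
As a worked illustration of this definition, the sketch below computes Hit@K on two toy baskets. The recommendation lists in `toy_recs` and the helper names are hypothetical and stand in for the real recommender; only the counting logic follows the definition.

```python
# Hypothetical ranked recommendations per target product.
toy_recs = {
    "A": ["B", "C", "D"],
    "B": ["A", "D", "C"],
    "C": ["D", "A", "B"],
}

def toy_recommend(target, top_k):
    # Unknown targets (such as "E") get an empty list.
    return toy_recs.get(target, [])[:top_k]

def toy_hit_at_k(baskets, k):
    hits = total = 0
    for basket in baskets:
        for target in basket:
            actual = set(basket) - {target}
            if not actual:
                continue
            # A hit needs at least one actual co-purchase in the top-k list.
            if set(toy_recommend(target, k)) & actual:
                hits += 1
            total += 1
    return hits / total if total else 0.0

toy_baskets = [["A", "B"], ["C", "B", "E"]]
toy_hit_1 = toy_hit_at_k(toy_baskets, k=1)
toy_hit_3 = toy_hit_at_k(toy_baskets, k=3)
print(toy_hit_1, toy_hit_3)
```

With K=1, two of the five (basket, target) pairs are hits; with K=3, four of five are. Target "E" has no recommendations, so it counts as a miss at every K.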

In [8]:
def hit_at_k(baskets, k=5):
    hits = 0
    total = 0

    for basket in baskets:
        for target in basket:
            actual = set(basket) - {target}
            if not actual:
                continue

            recommended = recommend_bundle(target, top_k=k)
            if set(recommended) & actual:
                hits += 1
            total += 1

    return hits / total if total > 0 else 0

In [9]:
for k in [3, 5, 10]:
    score = hit_at_k(test_baskets, k=k)
    print(f"Hit@{k}: {score:.4f}")

Hit@3: 0.6912
Hit@5: 0.7672
Hit@10: 0.8537


## Baseline Comparison

We compare the model against a popularity-only baseline, which recommends
the same globally most popular products for every target, to check that the
co-occurrence signal adds value.

In [10]:
popular_products = sorted(
    popularity_map.items(),
    key=lambda x: x[1],
    reverse=True
)

popular_products = [p for p, _ in popular_products]

def popularity_recommendation(top_k=5):
    return popular_products[:top_k]

In [11]:
def baseline_hit_at_k(baskets, k=5):
    hits = 0
    total = 0

    for basket in baskets:
        for target in basket:
            actual = set(basket) - {target}
            if not actual:
                continue

            recommended = popularity_recommendation(k)
            if set(recommended) & actual:
                hits += 1
            total += 1

    return hits / total if total > 0 else 0

In [12]:
for k in [3, 5, 10]:
    model_score = hit_at_k(test_baskets, k)
    baseline_score = baseline_hit_at_k(test_baskets, k)
    
    print(f"Hit@{k} — Model: {model_score:.4f}, Baseline: {baseline_score:.4f}")

Hit@3 — Model: 0.6912, Baseline: 0.3627
Hit@5 — Model: 0.7672, Baseline: 0.4711
Hit@10 — Model: 0.8537, Baseline: 0.6794
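
One way to read these numbers is as relative lift over the baseline. The short sketch below uses only the scores printed above; the `reported` and `lift` names are introduced here for illustration.

```python
# Scores copied from the comparison output above: K -> (model, baseline).
reported = {
    3: (0.6912, 0.3627),
    5: (0.7672, 0.4711),
    10: (0.8537, 0.6794),
}

# Relative lift: how many times higher the model's Hit@K is than the baseline's.
lift = {k: model / baseline for k, (model, baseline) in reported.items()}

for k, ratio in lift.items():
    print(f"Hit@{k} lift over baseline: {ratio:.2f}x")
```

The lift shrinks as K grows, from roughly 1.9x at K=3 to roughly 1.3x at K=10, since a longer list of popular products is more likely to overlap any given basket.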


## Evaluation Summary

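- The bundle recommender outperforms the popularity-only baseline at every K
  tested (Hit@3: 0.6912 vs. 0.3627; Hit@5: 0.7672 vs. 0.4711; Hit@10: 0.8537
  vs. 0.6794).
- The time-based split keeps all test baskets after the cutoff date. This
  avoids data leakage only if the co-occurrence and popularity tables loaded
  in cell [3] were built from pre-cutoff transactions; the notebook does not
  show how they were built.
- Hit@K measures how often the model recovers at least one co-purchased item
  within its top-K recommendations.

These offline results support deploying the model as a baseline
recommendation system in production.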