# Feature Engineering – Product Bundle Recommendation

## Objective
The goal of this notebook is to transform cleaned transactional data into
model-ready features for a product bundle recommender system.

This includes:
- Basket construction
- Product co-occurrence computation
- Time-aware weighting
- Product popularity extraction (used later for normalization)

No recommendation logic or evaluation is performed here.


In [1]:
import pandas as pd
import numpy as np

from pathlib import Path
from itertools import combinations

In [2]:
BASE_DIR = Path().resolve().parent
CLEAN_FILE = BASE_DIR / "data" / "processed" / "clean_transactions.parquet"

print("Loading clean data from:", CLEAN_FILE)

df = pd.read_parquet(CLEAN_FILE)

Loading clean data from: D:\Code Playground\ML_Ops\product-bundle-recommender-system\data\processed\clean_transactions.parquet


In [3]:
df.head(), df.shape

(  InvoiceNo StockCode                          Description  Quantity  \
 0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
 1    536365     71053                  WHITE METAL LANTERN         6   
 2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
 3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
 4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
 
           InvoiceDate  UnitPrice  CustomerID         Country  
 0 2010-12-01 08:26:00       2.55       17850  United Kingdom  
 1 2010-12-01 08:26:00       3.39       17850  United Kingdom  
 2 2010-12-01 08:26:00       2.75       17850  United Kingdom  
 3 2010-12-01 08:26:00       3.39       17850  United Kingdom  
 4 2010-12-01 08:26:00       3.39       17850  United Kingdom  ,
 (530104, 8))

## Basket Construction

Each basket corresponds to one InvoiceNo and contains
a list of unique products (StockCode) purchased together.


In [4]:
baskets = (
    df.groupby("InvoiceNo")["StockCode"]
      .apply(lambda x: list(set(x)))
)

baskets.head()


InvoiceNo
536365    [71053, 84029G, 22752, 84029E, 84406B, 21730, ...
536366                                       [22632, 22633]
536367    [21777, 21755, 84879, 21754, 22745, 84969, 226...
536368                         [22960, 22912, 22913, 22914]
536369                                              [21756]
Name: StockCode, dtype: object

Single-item baskets do not contribute to co-purchase learning
and are excluded from co-occurrence computation.


In [5]:
baskets = baskets[baskets.apply(len) > 1]

print("Number of multi-item baskets:", len(baskets))


Number of multi-item baskets: 18319


## Product Co-Occurrence

For each basket, every unordered product pair is generated. The two product
codes are sorted first, so (a, b) and (b, a) share a single key.
Each pair counts as one co-purchase event per basket.


In [None]:
co_occurrence = {}

for basket in baskets:
    for prod_a, prod_b in combinations(sorted(basket), 2):
        co_occurrence[(prod_a, prod_b)] = co_occurrence.get((prod_a, prod_b), 0) + 1
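
To make the pair counting above concrete, here is a minimal sketch on made-up baskets. The basket contents and the names `toy_baskets` and `pair_counts` are illustrative assumptions, not part of the dataset or the pipeline. It uses `collections.Counter` in place of the plain dictionary, which is one possible alternative rather than the notebook's method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (not taken from the dataset); items are unique per basket.
toy_baskets = [
    ["B", "A", "C"],
    ["A", "B"],
    ["C", "A"],
]

pair_counts = Counter()
for basket in toy_baskets:
    # Sorting first means ("A", "B") and ("B", "A") map to the same key.
    pair_counts.update(combinations(sorted(basket), 2))

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

A basket with n distinct products yields n(n-1)/2 pairs, so very large baskets dominate both the runtime and the size of the resulting dictionary.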

In [8]:
co_df = (
    pd.DataFrame(
        [(k[0], k[1], v) for k, v in co_occurrence.items()],
        columns=["product_a", "product_b", "co_count"]
    )
)

co_df.head()

index,product_a,product_b,co_count
0,21730,22752,26
1,21730,71053,29
2,21730,84029E,24
3,21730,84029G,26
4,21730,84406B,23


## Time-Aware Weighting

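More recent co-purchases receive a higher weight to reflect current buying trends.
Each transaction is weighted by `exp(-0.3 * m)`, where `m` is the number of months
between its invoice month and the latest month in the data.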


In [9]:
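# Bucket each transaction by calendar month.
df["month"] = df["InvoiceDate"].dt.to_period("M")

latest_month = df["month"].max()

# Exponential decay over the month gap; the latest month gets weight 1.0.
df["time_weight"] = (
    (latest_month - df["month"]).apply(lambda x: np.exp(-0.3 * x.n))
)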
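
To get a feel for how quickly the 0.3 decay rate discounts older purchases, the sketch below tabulates the weight for a few month gaps and derives the half-life. The names `DECAY_RATE`, `time_weight`, and `half_life_months` are introduced here for illustration only. The month gaps are arbitrary sample values.

```python
import math

DECAY_RATE = 0.3  # same constant as in the weighting cell above


def time_weight(months_ago: int) -> float:
    """Weight of a purchase made `months_ago` months before the latest month."""
    return math.exp(-DECAY_RATE * months_ago)


# Number of months after which a purchase counts half as much.
half_life_months = math.log(2) / DECAY_RATE

for months_ago in (0, 1, 3, 6, 12):
    print(f"{months_ago:>2} months ago -> weight {time_weight(months_ago):.3f}")
print(f"half-life: {half_life_months:.2f} months")
```

With this rate the half-life is roughly 2.3 months, so a purchase from a year earlier contributes only a few percent of what a current one does. A smaller rate would keep older history relevant for longer.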

In [10]:
weighted_pairs = {}

for invoice, group in df.groupby("InvoiceNo"):
    products = list(set(group["StockCode"]))
    # One weight per invoice: the mean of its rows' time weights.
    weight = group["time_weight"].mean()

    if len(products) > 1:
        # Add the invoice weight (instead of 1) for every product pair.
        for a, b in combinations(sorted(products), 2):
            weighted_pairs[(a, b)] = weighted_pairs.get((a, b), 0) + weight
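
As a small worked example of the accumulation above, the sketch below compares raw counts with time-weighted scores for two made-up invoices. The invoices and the names `toy_invoices`, `raw_counts`, and `weighted_scores` are assumptions for illustration. The month gap is passed in directly rather than derived from `InvoiceDate`.

```python
import math
from itertools import combinations

DECAY_RATE = 0.3  # same constant as the notebook's time_weight column

# Hypothetical invoices: (months before the latest month, products bought).
toy_invoices = [
    (0, ["A", "B"]),
    (2, ["A", "B", "C"]),
]

raw_counts = {}
weighted_scores = {}
for months_ago, products in toy_invoices:
    weight = math.exp(-DECAY_RATE * months_ago)
    for pair in combinations(sorted(set(products)), 2):
        raw_counts[pair] = raw_counts.get(pair, 0) + 1
        weighted_scores[pair] = weighted_scores.get(pair, 0.0) + weight

for pair in sorted(raw_counts):
    print(pair, raw_counts[pair], round(weighted_scores[pair], 4))
```

The pair (A, B) appears in both invoices, so its raw count is 2, but its weighted score is 1 + exp(-0.6), since the older invoice counts for less. Pairs seen only in the older invoice score exp(-0.6) each.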

In [11]:
weighted_co_df = pd.DataFrame(
    [(k[0], k[1], v) for k, v in weighted_pairs.items()],
    columns=["product_a", "product_b", "weighted_score"]
)

weighted_co_df.head()


index,product_a,product_b,weighted_score
0,21730,22752,5.792119
1,21730,71053,5.049617
2,21730,84029E,4.028111
3,21730,84029G,4.095442
4,21730,84406B,1.165169


## Product Popularity

Popularity is measured as the total quantity sold per product (StockCode).
It is used later to reduce bias toward extremely popular items.


In [12]:
popularity = (
    df.groupby("StockCode")["Quantity"]
      .sum()
      .rename("popularity")
      .reset_index()
)

popularity.head()


index,StockCode,popularity
0,10002,860
1,10080,303
2,10120,193
3,10123C,5
4,10124A,16
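
This notebook does not define how popularity will be applied downstream. As one possible illustration, the sketch below divides a pair's co-occurrence count by the geometric mean of the two products' popularity. The formula, the toy numbers, and the name `normalized_score` are assumptions for illustration, not the project's actual method.

```python
import math

# Hypothetical toy values, not taken from the dataset.
co_counts = {("A", "B"): 20, ("A", "C"): 20}
popularity = {"A": 1000, "B": 1000, "C": 40}


def normalized_score(pair, co_count):
    """Co-occurrence count divided by the geometric mean of both popularities."""
    a, b = pair
    return co_count / math.sqrt(popularity[a] * popularity[b])


scores = {pair: normalized_score(pair, n) for pair, n in co_counts.items()}

for pair, score in scores.items():
    print(pair, round(score, 4))
```

Both pairs have the same raw count, but (A, C) scores higher because C is a niche product, so 20 co-purchases are a stronger signal for it than for the best-seller B. Note that `popularity` here counts units sold while `co_count` counts baskets, so the resulting score is only meaningful for ranking, not as a probability.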


In [14]:
FEATURE_PATH = BASE_DIR / "data" / "features"
FEATURE_PATH.mkdir(parents=True, exist_ok=True)

co_df.to_parquet(FEATURE_PATH / "co_occurrence.parquet", index=False)
weighted_co_df.to_parquet(FEATURE_PATH / "weighted_co_occurrence.parquet", index=False)
popularity.to_parquet(FEATURE_PATH / "product_popularity.parquet", index=False)

print("Feature artifacts saved successfully.")


Feature artifacts saved successfully.


## Feature Engineering Summary

- Transactional data converted into baskets
- Product co-occurrence computed
- Time-aware weighted co-purchase scores created
- Product popularity extracted
- Features saved for downstream modeling

The dataset is now ready for recommendation logic.
