### H&M Recommendation System — 02 Recall (Candidate Generation)

In this notebook, I design and evaluate several **candidate generation (recall)** strategies
for the H&M Personalized Fashion Recommendations task.

Based on the EDA from `01_EDA.ipynb`, I observed that:

- Customer activity follows a strong **long-tail distribution**, and individual customers show
  clear personal preferences, often repurchasing similar product types.
- Article popularity is highly skewed, with a small subset of items dominating sales.
- The transaction timeline exhibits a strong **weekly pattern**, as well as several large spikes
  likely corresponding to promotional events.
- Articles have rich and well-structured **categorical metadata** (e.g. product groups, sections,
  garment groups), which provide useful semantic information.
- Customers frequently buy multiple items in a single day, indicating strong **co-purchase patterns**.

These observations motivate the following recall strategies:

1. **User-history-based recall**  
   - Use each customer's historical purchases (from the training period) as personalized candidates.

2. **Recent popularity recall**  
   - Use globally trending items from a recent time window (e.g. the last 28 days before the
     validation period) as a popularity-based candidate pool.

3. **Category-based recall**  
   - Use item metadata (e.g. product_type or product_group) to recommend popular items from
     the categories that a user frequently buys.

4. **Co-purchase-based recall**  
   - Use simple co-purchase statistics ("customers who bought A also bought B") to find items
     related to a user's previously purchased products.
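
As a rough illustration of strategies 2 and 4, here is a minimal sketch on toy data. The helper names `recent_popular_items` and `co_purchase_counts`, the `top_n` default, and the choice to count a co-purchase as "same customer, same day" are my own assumptions for illustration, not the final implementation used later in this notebook.

```python
from collections import Counter, defaultdict
from itertools import permutations

import pandas as pd

# Toy transactions; column names follow transactions_train.csv.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime([
        "2020-08-01", "2020-09-01", "2020-09-01",
        "2020-09-10", "2020-09-10", "2020-09-12",
    ]),
    "customer_id": ["a", "a", "a", "b", "b", "c"],
    "article_id": [1, 2, 3, 2, 3, 2],
})


def recent_popular_items(df, cutoff, window_days=28, top_n=12):
    """Most-purchased articles in the window_days before cutoff (exclusive)."""
    start = cutoff - pd.Timedelta(days=window_days)
    recent = df[(df["t_dat"] >= start) & (df["t_dat"] < cutoff)]
    return recent["article_id"].value_counts().head(top_n).index.tolist()


def co_purchase_counts(df):
    """Count ordered pairs (a, b) bought by the same customer on the same day."""
    pair_counts = defaultdict(Counter)
    for _, items in df.groupby(["customer_id", "t_dat"])["article_id"]:
        unique_items = sorted(set(items))
        for a, b in permutations(unique_items, 2):
            pair_counts[a][b] += 1
    return pair_counts


cutoff = pd.Timestamp("2020-09-16")  # hypothetical validation start
popular = recent_popular_items(toy, cutoff)
pairs = co_purchase_counts(toy)

print(popular)         # articles ranked by purchase count inside the window
print(dict(pairs[2]))  # articles co-purchased with article 2, with counts
```

On the real data the pair counting would likely need to be restricted (for example to a recent window or to the top few neighbours per article) to keep memory manageable.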

Together, these recall pools aim to cover:
- **Personal preferences** (user history)
- **Global trends** (recent popularity)
- **Semantic similarity** (category-based)
- **Behavioral complementarity** (co-purchase)

The final candidate set for each customer (roughly the top 100 to 500 items) will be built by combining these pools.
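
One simple way to combine the pools is sketched below. It assumes the pools are concatenated in a fixed priority order and deduplicated; the function name `merge_candidate_pools` and the priority ordering are hypothetical, and the actual merging logic may differ (for example, by scoring items across pools).

```python
from itertools import chain


def merge_candidate_pools(pools, max_candidates=100):
    """Concatenate pools in priority order, keep the first occurrence of
    each item, and truncate to max_candidates."""
    # dict.fromkeys preserves insertion order, so earlier pools win ties.
    deduped = dict.fromkeys(chain.from_iterable(pools))
    return list(deduped)[:max_candidates]


# Hypothetical pools for one customer, highest priority first.
history_pool = [101, 102]
co_purchase_pool = [102, 103]
popular_pool = [103, 104, 105]

merged = merge_candidate_pools(
    [history_pool, co_purchase_pool, popular_pool], max_candidates=4
)
print(merged)
```

Keeping the order stable matters because the evaluation later truncates each candidate list to its first `k` items.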


#### Step 1: Load Dataset

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 5)
pd.set_option("display.max_rows", 5)

DATA_DIR = "../hm_data"  # change this to your actual data folder
os.listdir(DATA_DIR)

['sample_submission.csv',
 'articles.csv',
 'customers.csv',
 'transactions_train.csv']

In [2]:
# Load transactions
transactions = pd.read_csv(
    os.path.join(DATA_DIR, "transactions_train.csv"),
    parse_dates=["t_dat"]
)
print("transactions:", transactions.shape)

# Load articles metadata (for category-based recall)
articles = pd.read_csv(os.path.join(DATA_DIR, "articles.csv"))
print("articles:", articles.shape)

transactions: (31788324, 5)
articles: (105542, 25)


In [4]:
transactions["t_dat"].min(), transactions["t_dat"].max()

(Timestamp('2018-09-20 00:00:00'), Timestamp('2020-09-22 00:00:00'))

In [5]:
# Split Train / Valid: hold out the final 7 days (2020-09-16 to 2020-09-22) for validation
VALID_START = pd.to_datetime("2020-09-16")
print("VALID_START:", VALID_START)

train_df = transactions[transactions["t_dat"] < VALID_START].copy()
valid_df = transactions[transactions["t_dat"] >= VALID_START].copy()

print("Train:", train_df.shape, "Valid:", valid_df.shape)


VALID_START: 2020-09-16 00:00:00
Train: (31548013, 5) Valid: (240311, 5)


#### Step 2: Validation Ground Truth

In [6]:
valid_gt = (
    valid_df.groupby("customer_id")["article_id"]
    .apply(lambda x: set(x.astype(int)))
    .to_dict()
)

len(valid_gt)

68984

#### Step 3: Evaluation Metrics

In [None]:
# Evaluate Recall@K & Hit Rate
def evaluate_recall(candidates_dict, gt_dict, k=100, return_recalls=False):
    """
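    Compute mean Recall@K and hit rate over the validation customers.

    candidates_dict: {customer_id: [candidate_item_ids]}, best candidates first
    gt_dict: {customer_id: set(ground_truth_items)}

    Customers missing from candidates_dict are skipped, so both metrics
    cover only customers that have candidates (reported as "num_users").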
    """
    recalls = []
    hits = 0
    total_users = 0

    for cust, gt_items in gt_dict.items():
        if cust not in candidates_dict:
            continue
        total_users += 1

        cand = candidates_dict[cust][:k]
        cand_set = set(cand)
        inter = cand_set & gt_items

        if len(gt_items) > 0:
            recall_u = len(inter) / len(gt_items)
            recalls.append(recall_u)

        if len(inter) > 0:
            hits += 1

    mean_recall = float(np.mean(recalls)) if recalls else 0.0
    hit_rate = hits / total_users if total_users > 0 else 0.0

    result = {
        f"mean_recall@{k}": mean_recall,
        f"hit_rate@{k}": hit_rate,
        "num_users": total_users,
    }
    if return_recalls:
        return result, recalls
    return result
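
To make the two metrics concrete, here is a small worked example on made-up data. It recomputes the same per-user quantities by hand (recall is the share of a customer's ground-truth items found in the top `k` candidates, and a hit is any overlap at all); the customer IDs and item IDs are invented for illustration.

```python
# Toy ground truth and candidate lists (hypothetical IDs).
gt = {"u1": {1, 2, 3, 4}, "u2": {5}}
cands = {"u1": [1, 2, 9], "u2": [7, 8]}
k = 2

per_user_recall = {}
for user, truth in gt.items():
    top_k = set(cands[user][:k])  # only the first k candidates count
    per_user_recall[user] = len(top_k & truth) / len(truth)

mean_recall = sum(per_user_recall.values()) / len(per_user_recall)
hit_rate = sum(r > 0 for r in per_user_recall.values()) / len(per_user_recall)

print(per_user_recall)        # u1 recovers 2 of 4 items, u2 recovers none
print(mean_recall, hit_rate)  # 0.25 0.5
```

Because customers without any candidates are skipped rather than counted as zero, it is worth reading `num_users` alongside the metrics when comparing recall strategies with different coverage.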
