### H&M Recommendation System — 02 Recall (Candidate Generation)

In this notebook, I design and evaluate several **candidate generation (recall)** strategies
for the H&M Personalized Fashion Recommendations task.

Based on the EDA from `01_eda.ipynb`, I observed that:

- Customer behavior shows a strong **long-tail distribution** with clear personal preferences
  and repeated purchases of similar product types.
- Article popularity is highly skewed, with a small subset of items dominating sales.
- The transaction timeline exhibits a strong **weekly pattern**, as well as several large spikes
  likely corresponding to promotional events.
- Articles have rich, well-structured **categorical metadata** (e.g. product groups, sections,
  garment groups), which provides useful semantic information.
- Customers frequently buy multiple items in a single day, indicating strong **co-purchase patterns**.

These observations motivate the following recall strategies:

1. **User-history-based recall**  
   - Use each customer's historical purchases (from the training period) as personalized candidates.

2. **Recent popularity recall**  
   - Use globally trending items from a recent time window (e.g. last 28 days before validation)
     as a popularity-based candidate pool.

3. **Category-based recall**  
   - Use item metadata (e.g. product_type or product_group) to recommend popular items from
     the categories that a user frequently buys.

4. **Co-purchase-based recall**  
   - Use simple co-purchase statistics ("customers who bought A also bought B") to find items
     related to a user's previously purchased products.

Together, these recall pools aim to cover:
- **Personal preferences** (user history)
- **Global trends** (recent popularity)
- **Semantic similarity** (category-based)
- **Behavioral complementarity** (co-purchase)

The final candidate set for each customer (roughly 100 to 500 items) is built by combining these pools.


#### Step 1: Load Dataset

In [10]:
import os
import gc
from datetime import datetime, timedelta
from collections import defaultdict, Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pickle

pd.set_option("display.max_columns", 5)
pd.set_option("display.max_rows", 5)

DATA_DIR = "../hm_data"
RES_DIR = "../data/recall"
TRAIN_DIR = "../data/train"

In [2]:
# Load transactions
transactions = pd.read_csv(
    os.path.join(DATA_DIR, "transactions_train.csv"),
    parse_dates=["t_dat"]
)
print("transactions:", transactions.shape)

# Load articles metadata (for category-based recall)
articles = pd.read_csv(os.path.join(DATA_DIR, "articles.csv"))
print("articles:", articles.shape)

transactions: (31788324, 5)
articles: (105542, 25)


In [3]:
transactions["t_dat"].min(), transactions["t_dat"].max()

(Timestamp('2018-09-20 00:00:00'), Timestamp('2020-09-22 00:00:00'))

In [6]:
# Helpers for saving and loading intermediate results (splits, recall pools) as pickle files
def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

In [11]:
# Split Train / Valid
VALID_START = pd.to_datetime("2020-09-16")
print("VALID_START:", VALID_START)

train_df = transactions[transactions["t_dat"] < VALID_START].copy()
valid_df = transactions[transactions["t_dat"] >= VALID_START].copy()

print("Train:", train_df.shape, "Valid:", valid_df.shape)

save_pickle(train_df, os.path.join(TRAIN_DIR, "train_df.pkl"))
save_pickle(valid_df, os.path.join(TRAIN_DIR, "valid_df.pkl"))


VALID_START: 2020-09-16 00:00:00
Train: (31548013, 5) Valid: (240311, 5)


#### Step 2: Validation Ground Truth

In [None]:
valid_gt = (
    valid_df.groupby("customer_id")["article_id"]
    .apply(lambda x: set(x.astype(int)))
    .to_dict()
)

print("validation customers:", len(valid_gt))
save_pickle(valid_gt, os.path.join(TRAIN_DIR, "validation_groundtruth.pkl"))


#### Step 3: Evaluation Metrics

In [6]:
# Evaluate Recall@K & Hit Rate
def evaluate_recall(candidates_dict, gt_dict, k=100, return_recalls=False):
    """
    candidates_dict: {customer_id: [candidate_item_ids]}
    gt_dict: {customer_id: set(ground_truth_items)}
    """
    recalls = []
    hits = 0
    total_users = 0

    for cust, gt_items in gt_dict.items():
        if cust not in candidates_dict:
            continue
        total_users += 1

        cand = candidates_dict[cust][:k]
        cand_set = set(cand)
        inter = cand_set & gt_items

        if len(gt_items) > 0:
            recall_u = len(inter) / len(gt_items)
            recalls.append(recall_u)

        if len(inter) > 0:
            hits += 1

    mean_recall = float(np.mean(recalls)) if recalls else 0.0
    hit_rate = hits / total_users if total_users > 0 else 0.0

    result = {
        f"mean_recall@{k}": mean_recall,
        f"hit_rate@{k}": hit_rate,
        "num_users": total_users,
    }
    if return_recalls:
        return result, recalls
    return result
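
As a sanity check on the two metrics, here is a minimal pure-Python sketch on toy data. The helper `recall_at_k` and its `skip_missing` flag are hypothetical names, not part of the notebook. With `skip_missing=True` it follows the same rule as `evaluate_recall` (customers without candidates are ignored, which is why `num_users` differs between strategies in the outputs). With `skip_missing=False` it scores those customers as zero, which is an assumption about how one might make strategies directly comparable.

```python
def recall_at_k(candidates, ground_truth, k, skip_missing=True):
    """Return (mean recall@k, hit rate@k, number of evaluated users).

    skip_missing=True ignores users without candidates, like evaluate_recall.
    skip_missing=False (an assumption, not the notebook's behaviour) scores them as 0.
    Users with an empty ground-truth set are always skipped in this sketch.
    """
    recalls, hits, users = [], 0, 0
    for cust, gt_items in ground_truth.items():
        if not gt_items:
            continue
        if cust not in candidates and skip_missing:
            continue
        users += 1
        top_k = set(candidates.get(cust, [])[:k])
        overlap = top_k & gt_items
        recalls.append(len(overlap) / len(gt_items))
        hits += bool(overlap)
    mean_recall = sum(recalls) / len(recalls) if recalls else 0.0
    hit_rate = hits / users if users else 0.0
    return mean_recall, hit_rate, users


toy_gt = {"u1": {1, 2}, "u2": {3}, "u3": {4}}
toy_cands = {"u1": [1, 9, 2], "u2": [8, 7, 3]}  # u3 has no candidates

print(recall_at_k(toy_cands, toy_gt, k=2))                      # (0.25, 0.5, 2)
print(recall_at_k(toy_cands, toy_gt, k=2, skip_missing=False))  # mean over 3 users
```

In the toy data, `u1` recovers one of two purchases within the first two candidates (recall 0.5), `u2` recovers none, and `u3` only enters the average when missing customers are counted.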


#### 1. User-history-based recall

Idea:
- For each customer, use their historical purchases in the **training period**
  as personalized candidate items.
- We keep only the most recent N items per user to control candidate size.

In [None]:
# Sort by customer and date
train_df_sorted = train_df.sort_values(["customer_id", "t_dat"])

MAX_HISTORY_ITEMS = 150

user_hist_items = (
    train_df_sorted.groupby("customer_id")["article_id"]
    .apply(lambda x: list(x.astype(int).tail(MAX_HISTORY_ITEMS)))
    .to_dict()
)

print(len(user_hist_items))

user_history_candidates = user_hist_items

for k in [12, 50, 100]:
    metrics = evaluate_recall(user_history_candidates, valid_gt, k=k)
    print(f"[User-history] Recall@{k}:", metrics)

save_pickle(user_history_candidates, os.path.join(RES_DIR, "recall_user_history.pkl"))

1356709
[User-history] Recall@12: {'mean_recall@12': 0.010695502032411243, 'hit_rate@12': 0.019050022077840158, 'num_users': 63412}
[User-history] Recall@50: {'mean_recall@50': 0.029676517897718507, 'hit_rate@50': 0.054374566328139785, 'num_users': 63412}
[User-history] Recall@100: {'mean_recall@100': 0.042171081995056175, 'hit_rate@100': 0.0793225257049139, 'num_users': 63412}
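
Because `evaluate_recall` scores only the first `k` entries of each list, the order of the history list matters. The lists built above are chronological (oldest first within the last `MAX_HISTORY_ITEMS` purchases), so `[:k]` picks the oldest of those purchases, and an article bought several times occupies several slots. A hedged sketch of an alternative ordering, newest first with duplicates removed, is shown below. `recent_first_unique` is a hypothetical helper, and the numbers reported above were not produced with it.

```python
def recent_first_unique(chronological_items, max_items=150):
    """Return items newest-first, keeping only the latest occurrence of each."""
    seen = set()
    ordered = []
    # Walk the history backwards so the most recent purchase comes first.
    for item in reversed(chronological_items):
        if item not in seen:
            seen.add(item)
            ordered.append(item)
            if len(ordered) >= max_items:
                break
    return ordered


toy_history = [101, 102, 101, 103, 104, 103]  # oldest -> newest
print(recent_first_unique(toy_history, max_items=3))  # [103, 104, 101]
```

Applied to the notebook, this would amount to mapping the function over the values of `user_hist_items` before evaluation; whether it improves Recall@12 on this data would need to be measured.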


#### 2. Recent Popularity Recall

Idea:
- Compute globally popular items based on a **recent time window** before validation
  (e.g., last 28 days).
- Use the same set of trending items as candidates for all customers.

In [16]:
RECENT_DAY = 28
recent_start = VALID_START - timedelta(days=RECENT_DAY)

recent_df = train_df[train_df["t_dat"] >= recent_start]
print("Recent window:", recent_df["t_dat"].min(), "to", recent_df["t_dat"].max())
print("Recent df shape:", recent_df.shape)

recent_popularity = (
    recent_df.groupby("article_id")["customer_id"]
    .count()
    .sort_values(ascending=False)
)

print(recent_popularity.head(10))

TOPK_RECENT = 200
top_recent_items = list(recent_popularity.index.astype(int)[:TOPK_RECENT])

recent_pop_candidates = {cust: top_recent_items for cust in valid_gt.keys()}

for k in [12, 50, 100]:
    metrics = evaluate_recall(recent_pop_candidates, valid_gt, k=k)
    print(f"[Recent popularity] Recall@{k}:", metrics)

save_pickle(recent_pop_candidates, os.path.join(RES_DIR, "recall_recent_popularity.pkl"))

Recent window: 2020-08-19 00:00:00 to 2020-09-15 00:00:00
Recent df shape: (1059723, 5)
article_id
751471001    2758
706016001    2408
             ... 
916468003    1822
448509014    1807
Name: customer_id, Length: 10, dtype: int64
[Recent popularity] Recall@12: {'mean_recall@12': 0.0193215469061964, 'hit_rate@12': 0.04970717847616839, 'num_users': 68984}
[Recent popularity] Recall@50: {'mean_recall@50': 0.061846482469658876, 'hit_rate@50': 0.14835324133132322, 'num_users': 68984}
[Recent popularity] Recall@100: {'mean_recall@100': 0.10520628280566931, 'hit_rate@100': 0.23590977617998377, 'num_users': 68984}


In [20]:
for col in articles.columns:
    print(col)

article_id
product_code
prod_name
product_type_no
product_type_name
product_group_name
graphical_appearance_no
graphical_appearance_name
colour_group_code
colour_group_name
perceived_colour_value_id
perceived_colour_value_name
perceived_colour_master_id
perceived_colour_master_name
department_no
department_name
index_code
index_name
index_group_no
index_group_name
section_no
section_name
garment_group_no
garment_group_name
detail_desc


#### 3. Category-based Recall

Idea:
- Use article metadata (e.g., `product_type_name` or `product_group_name`) to build
  category-level popularity tables.
- For each user, identify their **favorite categories** based on training history.
- Recommend **popular items within those categories** as candidates.

This helps:
- Cover items that are semantically similar to what the user likes.
- Go beyond exact repeats of previously purchased items.

In [None]:
cat_cols = ["article_id", "product_type_name"]
articles_cat = articles[cat_cols].copy()
print(articles_cat.head())

# build article_id -> category dict
article_to_ptype = dict(zip(articles_cat["article_id"].astype(int),
                            articles_cat["product_type_name"]))

train_cat = train_df[["customer_id", "article_id", "t_dat"]].copy()
train_cat["article_id"] = train_cat["article_id"].astype(int)
train_cat["product_type_name"] = train_cat["article_id"].map(article_to_ptype)

print(train_cat.head())


   article_id product_type_name  product_group_name
0   108775015          Vest top  Garment Upper body
1   108775044          Vest top  Garment Upper body
2   108775051          Vest top  Garment Upper body
3   110065001               Bra           Underwear
4   110065002               Bra           Underwear
                                         customer_id  article_id      t_dat  \
0  000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...   663713001 2018-09-20   
1  000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...   541518023 2018-09-20   
2  00007d2de826758b65a93dd24ce629ed66842531df6699...   505221004 2018-09-20   
3  00007d2de826758b65a93dd24ce629ed66842531df6699...   685687003 2018-09-20   
4  00007d2de826758b65a93dd24ce629ed66842531df6699...   685687004 2018-09-20   

  product_type_name  product_group_name  
0    Underwear body           Underwear  
1               Bra           Underwear  
2           Sweater  Garment Upper body  
3           Sweater  Garment Upper body  
4  

In [24]:
# Number of purchases per user in each product_type_name
user_type_counts = (
    train_cat.groupby(["customer_id", "product_type_name"])["article_id"]
    .count()
    .reset_index(name="cnt")
)
user_type_counts.head()

                                         customer_id product_type_name  cnt
0  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...            Blazer    5
1  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...             Dress    1
2  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...            Gloves    1
3  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...            Hoodie    1
4  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...            Jacket    3


In [25]:
# Get the top N categories for each user
TOP_CATEGORIES_PER_USER = 3

def top_categories_for_user(df):
    df_sorted = df.sort_values("cnt", ascending=False)
    return list(df_sorted["product_type_name"].head(TOP_CATEGORIES_PER_USER))

user_top_types = (
    user_type_counts.groupby("customer_id")
    .apply(top_categories_for_user)
    .to_dict()
)

print(len(user_top_types))

1356709
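
`groupby(...).apply` with a Python function runs once per customer, which is slow for about 1.36 million groups. A possible vectorized equivalent, sketched here on toy data, sorts once and takes the first rows of each group. `top_categories_vectorized` is a hypothetical helper, and ties in `cnt` may be broken differently from the cell above.

```python
import pandas as pd


def top_categories_vectorized(counts, top_n=3):
    """Top-n categories per customer by purchase count: {customer_id: [category, ...]}."""
    # Sort so that, within each customer, the most-purchased categories come first.
    ranked = counts.sort_values(["customer_id", "cnt"], ascending=[True, False])
    # head(top_n) keeps the first top_n rows of every group in sorted order.
    top = ranked.groupby("customer_id").head(top_n)
    return top.groupby("customer_id")["product_type_name"].agg(list).to_dict()


toy_counts = pd.DataFrame({
    "customer_id": ["a", "a", "a", "a", "b"],
    "product_type_name": ["Blazer", "Dress", "Jacket", "Hoodie", "Bra"],
    "cnt": [5, 1, 3, 2, 4],
})
toy_top = top_categories_vectorized(toy_counts, top_n=3)
print(toy_top)  # {'a': ['Blazer', 'Jacket', 'Hoodie'], 'b': ['Bra']}
```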


In [26]:
# Item popularity within each product_type_name over the training period
type_item_pop = (
    train_cat.groupby(["product_type_name", "article_id"])["customer_id"]
    .count()
    .reset_index(name="cnt")
)

type_item_pop.head()

  product_type_name  article_id  cnt
0   Accessories set   755356001   11
1   Accessories set   858306002    7
2   Accessories set   858306003   12
3   Accessories set   858306005    1
4   Accessories set   858306006   11


In [32]:
TYPE_TOP_ITEMS = 50  # The top 50 most popular items in each category

type_to_top_items = {}
for ptype, subdf in type_item_pop.groupby("product_type_name"):
    sub_sorted = subdf.sort_values("cnt", ascending=False)
    type_to_top_items[ptype] = list(sub_sorted["article_id"].astype(int).head(TYPE_TOP_ITEMS))

len(type_to_top_items)

130

In [33]:
# Create Category-based candidates for each user
MAX_PER_USER = 200
category_candidates = {}
for cust, types in user_top_types.items():
    cand_list = []
    for t in types:
        cand_list.extend(type_to_top_items.get(t, []))
    seen = set()
    final_cand_list = []
    for item in cand_list:
        if item not in seen:
            seen.add(item)
            final_cand_list.append(item)
    category_candidates[cust] = final_cand_list[:MAX_PER_USER]
print(len(category_candidates))
    

1356709


In [34]:
for k in [12, 50, 100]:
    metrics = evaluate_recall(category_candidates, valid_gt, k=k)
    print(f"[Category-based] Recall@{k}:", metrics)

save_pickle(category_candidates, os.path.join(RES_DIR, "recall_category.pkl"))


[Category-based] Recall@12: {'mean_recall@12': 0.007771040348258594, 'hit_rate@12': 0.019665047625055193, 'num_users': 63412}
[Category-based] Recall@50: {'mean_recall@50': 0.016429647358185674, 'hit_rate@50': 0.03926701570680628, 'num_users': 63412}
[Category-based] Recall@100: {'mean_recall@100': 0.026399832303020003, 'hit_rate@100': 0.0640415063394941, 'num_users': 63412}


#### 4. Co-purchase-based Recall

Idea:
- Treat all items purchased by the same customer on the same day as a "basket".
- For each item A, count how many times it co-occurs with other items B in the same basket.
- For each user, look at their purchased items and recommend items that are frequently
  co-purchased with them.
- To keep the computation tractable, restrict the statistics to a subset of transactions (e.g. recent months or popular items).
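
The basket idea can be illustrated on toy data before running it on real transactions. The sketch below is a minimal version under simple assumptions: baskets are plain lists of item ids, and `count_copurchases` is a hypothetical helper. The work per basket grows quadratically with basket size, which is one reason to limit the transaction window.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_copurchases(baskets):
    """Map each item to a Counter of the items that share a basket with it."""
    co_counts = defaultdict(Counter)
    for basket in baskets:
        unique_items = set(basket)
        # permutations(..., 2) yields every ordered pair (a, b) with a != b,
        # so counts are symmetric and single-item baskets contribute nothing.
        for a, b in permutations(unique_items, 2):
            co_counts[a][b] += 1
    return co_counts


toy_baskets = [[1, 2, 3], [1, 2], [4], [2, 3]]
toy_co = count_copurchases(toy_baskets)
print(toy_co[1].most_common(1))  # [(2, 2)]: item 2 shares two baskets with item 1
```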


In [35]:
COPURCHASE_RECENT_DAYS = 90
cop_start = VALID_START - timedelta(days=COPURCHASE_RECENT_DAYS)
cop_train = train_df[train_df["t_dat"] >= cop_start].copy()

basket_df = (
    cop_train.groupby(["customer_id", "t_dat"])["article_id"]
    .apply(lambda x: list(set(x.astype(int))))
    .reset_index(name="basket")
)

basket_df.head(), basket_df.shape

(                                         customer_id      t_dat  \
 0  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... 2020-09-05   
 1  0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... 2020-07-08   
 2  000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... 2020-09-15   
 3  00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... 2020-08-12   
 4  0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d... 2020-09-14   
 
                                          basket  
 0                                   [568601043]  
 1                                   [826211002]  
 2                                   [794321007]  
 3  [730683050, 896152002, 927530004, 791587015]  
 4                        [719530003, 448509014]  ,
 (1214591, 3))

In [None]:
# Calculate item → co-purchased items
co_counts = defaultdict(Counter)

for row in basket_df["basket"]:
    # skip baskets with fewer than two items: there are no pairs to count
    if len(row) <= 1:
        continue
    for i in row:
        for j in row:
            if i == j:
                continue
            co_counts[i][j] += 1

len(co_counts)


41627

In [41]:
# For each item, take the top N co-purchased items.
TOP_CO_ITEMS = 10

item_co_candidates = {}
for item, counter in co_counts.items():
    item_co_candidates[int(item)] = [j for j, c in counter.most_common(TOP_CO_ITEMS)]

len(item_co_candidates)


41627

In [42]:
def build_copurchase_candidates(user_hist_dict, item_co_dict, max_per_user=200):
    user_cands = {}
    for cust, hist_items in user_hist_dict.items():
        related = []
        for it in hist_items:
            related.extend(item_co_dict.get(it, []))
        if not related:
            continue
        counter = Counter(related)
        sorted_items = [i for i, c in counter.most_common(max_per_user)]
        user_cands[cust] = sorted_items
    return user_cands

copurchase_candidates = build_copurchase_candidates(
    user_hist_items, item_co_candidates, max_per_user=200
)

len(copurchase_candidates)


1245226

In [43]:
for k in [12, 50, 100]:
    metrics = evaluate_recall(copurchase_candidates, valid_gt, k=k)
    print(f"[Co-purchase] Recall@{k}:", metrics)

save_pickle(copurchase_candidates, os.path.join(RES_DIR, "recall_copurchase.pkl"))

[Co-purchase] Recall@12: {'mean_recall@12': 0.017769848957451848, 'hit_rate@12': 0.04253899672776948, 'num_users': 62954}
[Co-purchase] Recall@50: {'mean_recall@50': 0.041809497027143885, 'hit_rate@50': 0.09500587730724021, 'num_users': 62954}
[Co-purchase] Recall@100: {'mean_recall@100': 0.05823925536719977, 'hit_rate@100': 0.13036502843345935, 'num_users': 62954}


#### 5. Combine Recall Pools

Now combine different recall strategies into a single candidate set per user:

- User-history-based candidates
- Recent popularity candidates
- Category-based candidates
- Co-purchase candidates

The pools are concatenated in the order listed, duplicates are dropped, and each customer's list is
truncated to `max_per_user`, so earlier pools take priority over later ones.
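
A toy illustration of an order-dependent merge with a cap is sketched below. `merge_in_order` is a hypothetical single-customer helper. The point it makes is that once the cap is reached, later pools contribute nothing. With up to 150 history items and 200 popular items against a cap of 300, the category and co-purchase pools can be cut off entirely for customers with long histories.

```python
def merge_in_order(pools, max_per_user):
    """Concatenate candidate lists in pool order, drop duplicates, stop at the cap."""
    merged, seen = [], set()
    for pool in pools:
        for item in pool:
            if item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) >= max_per_user:
                return merged
    return merged


toy_history = [10, 11, 12]
toy_popular = [11, 20, 21]
toy_category = [30, 31]
print(merge_in_order([toy_history, toy_popular, toy_category], max_per_user=5))
# [10, 11, 12, 20, 21]: the category pool is never reached
```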

In [44]:
def merge_candidates(*cand_dicts, max_per_user=300):
    merged = {}
    all_customers = set()
    for d in cand_dicts:
        all_customers.update(d.keys())

    for cust in all_customers:
        seen = set()
        merged_list = []
        for d in cand_dicts:
            items = d.get(cust, [])
            for it in items:
                if it not in seen:
                    seen.add(it)
                    merged_list.append(it)
                    if len(merged_list) >= max_per_user:
                        break
            if len(merged_list) >= max_per_user:
                break
        merged[cust] = merged_list
    return merged

final_candidates = merge_candidates(
    user_history_candidates,
    recent_pop_candidates,
    category_candidates,
    copurchase_candidates,
    max_per_user=300
)

len(final_candidates)


1362281

In [47]:
for k in [50, 100, 200, 300]:
    metrics = evaluate_recall(final_candidates, valid_gt, k=k)
    print(f"[Final merged] Recall@{k}:", metrics)
    
save_pickle(final_candidates, os.path.join(RES_DIR, "recall_final_merged.pkl"))

[Final merged] Recall@50: {'mean_recall@50': 0.05731768677567743, 'hit_rate@50': 0.11478023889597588, 'num_users': 68984}
[Final merged] Recall@100: {'mean_recall@100': 0.1078970009319, 'hit_rate@100': 0.21726777223704047, 'num_users': 68984}
[Final merged] Recall@200: {'mean_recall@200': 0.1783868461271448, 'hit_rate@200': 0.3498347442885307, 'num_users': 68984}
[Final merged] Recall@300: {'mean_recall@300': 0.20418784585142544, 'hit_rate@300': 0.3909892148904094, 'num_users': 68984}
