# Baseline Popularity Model for Recommendation System
This notebook implements several baseline recommendation models for the H&M Personalized Fashion Recommendations challenge. It loads the data, explores user history, and builds three types of recommenders:
- **Global Popularity**: Recommends the most popular items overall.
- **Personal Popularity (Recency-Weighted)**: Recommends items based on each user's purchase history, with more recent purchases weighted higher.
- **Category-Affinity Baseline**: Recommends popular items within a user's favorite product categories.

The notebook also generates submission files for each model.

## Importing Libraries and Loading Data
We start by importing essential libraries for data manipulation and visualization. The main datasets (`articles.csv`, `customers.csv`, `transactions.parquet`, and `sample_submission.csv`) are loaded for further processing.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
sample = pd.read_csv("sample_submission.csv")

In [3]:
articles = pd.read_csv('articles.csv')
customers = pd.read_csv('customers.csv')
transactions_parquet = pd.read_parquet('transactions.parquet')

## Date Conversion and Transaction Preparation
Parsing every row of `t_dat` individually is slow, so we define a memoized conversion function that parses each distinct date string once and maps the results back onto the column. The `t_dat` column is then stored as a pandas datetime for time-based analysis.

In [4]:
import datetime

def convert_to_date(s):
    """
    Memoization technique - very fast conversion to pure python dates
    """
    dates = {date:datetime.datetime.strptime(date,'%Y-%m-%d') for date in s.unique()}
    return s.map(dates)

In [5]:
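# Parse each distinct date string once, then map the results back onto the column
transactions_parquet["t_dat"] = convert_to_date(transactions_parquet["t_dat"])
# Make sure the column has a pandas datetime dtype
transactions_parquet["t_dat"] = pd.to_datetime(transactions_parquet["t_dat"])
transactions = transactions_parquet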
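To see why the dictionary lookup helps, here is a minimal sketch on a toy series (the values are made up for illustration). `strptime` runs once per distinct date rather than once per row; the names `toy_dates`, `lookup`, and `parsed` are placeholders, not part of the notebook.

```python
import datetime

import pandas as pd

# Toy data (made up): four rows but only two distinct dates.
toy_dates = pd.Series(["2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01"])

# Parse each distinct string once, then reuse the result for every row.
lookup = {
    d: datetime.datetime.strptime(d, "%Y-%m-%d") for d in toy_dates.unique()
}
parsed = toy_dates.map(lookup)

print(len(lookup))  # number of strptime calls: 2
print(parsed.tolist())
```

On real transaction data the saving should grow with the ratio of rows to distinct dates.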

In [6]:
tx = transactions
arts = articles
cust = customers

## User History Extraction
We extract each user's purchase history by sorting transactions by date, grouping them by customer, and collecting the purchased article IDs in chronological order. This history is used later for personalized recommendations.

In [7]:
user_hist = (
    tx.sort_values("t_dat")
      .groupby("customer_id")["article_id"]
      .apply(list)
      .to_dict()
)
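As a small illustration of what this produces, the sketch below runs the same chain on a made-up three-row frame (`toy_tx` and `toy_hist` are hypothetical names).

```python
import pandas as pd

# Toy transactions (made up), deliberately out of date order.
toy_tx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-03", "2020-09-01", "2020-09-02"]),
    "customer_id": ["u1", "u1", "u2"],
    "article_id": [111, 222, 333],
})

# Sort by date, then collect each customer's articles as a list.
toy_hist = (
    toy_tx.sort_values("t_dat")
          .groupby("customer_id")["article_id"]
          .apply(list)
          .to_dict()
)
print(toy_hist)  # {'u1': [222, 111], 'u2': [333]}
```

Note that `sort_values` does not use a stable sort by default, so the relative order of purchases made on the same day is not guaranteed unless `kind="stable"` is passed.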

In [8]:
# Pad to 10 digits (e.g., 706016001 -> "0706016001")
fmt = lambda a: str(a).zfill(10)

## Global Popularity Model
The global popularity model recommends the top-N most purchased items to every user, regardless of their individual history. This serves as a simple but strong baseline.

In [None]:
# Precompute global top-N
N = 12
global_topN = tx['article_id'].value_counts().index[:N].tolist()

In [10]:
def recommend_global(uid):
    return global_topN
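The top-N computation relies on `value_counts` returning counts in descending order. A minimal sketch on made-up purchases (`toy_purchases` and `toy_top2` are hypothetical names):

```python
import pandas as pd

# Toy purchases (made up): article 10 bought 3 times, 20 twice, 30 once.
toy_purchases = pd.Series([10, 20, 10, 30, 10, 20], name="article_id")

# value_counts sorts by frequency, so the first N index entries are the top-N.
toy_top2 = toy_purchases.value_counts().index[:2].tolist()
print(toy_top2)  # [10, 20]
```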

## Personal Popularity (Recency-Weighted) Model
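This model recommends items from each user's own purchase history, weighting recent purchases more heavily. Each purchase receives a weight of `decay ** (days_ago / 7)`, so with `decay=0.9` a purchase loses 10% of its weight for every week of age. Weights are summed per (user, item) pair and the top-N items are kept for each user. Users with no transactions fall back to the global top-N.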

In [None]:
import pandas as pd
from tqdm.auto import tqdm

def build_user_profiles(tx: pd.DataFrame, decay: float = 0.9, N: int = 12) -> dict:
    # 1. Compute weights
    now = tx['t_dat'].max()
    df = tx.copy()
    df['days_ago'] = (now - df['t_dat']).dt.days
    df['weight'] = decay ** (df['days_ago'] / 7)

    # 2. Aggregate weights per (user, item)
    user_item = df.groupby(['customer_id', 'article_id'], as_index=False)['weight'].sum()

    # 3. Build profiles with tqdm
    user_profiles = {}
    total_users = user_item['customer_id'].nunique()
    for uid, grp in tqdm(user_item.groupby('customer_id'), 
                        desc='Building user profiles', 
                        total=total_users, 
                        unit='user'):
        top_articles = grp.nlargest(N, 'weight')['article_id'].astype(str).tolist()
        user_profiles[uid] = top_articles

    return user_profiles

user_profiles = build_user_profiles(tx, decay=0.9, N=12)
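The per-user loop can be slow when there are many customers. The sketch below is one possible loop-free variant, not the notebook's method: `build_user_profiles_vectorized` is a hypothetical name, and it ranks with `sort_values` plus `groupby(...).head(N)` instead of `nlargest`, so tie ordering may differ from the original. The toy data is made up to make the weights easy to check by hand.

```python
import pandas as pd

def build_user_profiles_vectorized(tx, decay=0.9, N=12):
    """Hypothetical loop-free variant of build_user_profiles."""
    # Same weighting as above: decay ** (age in weeks).
    days_ago = (tx["t_dat"].max() - tx["t_dat"]).dt.days
    weighted = tx[["customer_id", "article_id"]].assign(
        weight=decay ** (days_ago / 7)
    )
    # Sum weights per (user, item).
    user_item = weighted.groupby(
        ["customer_id", "article_id"], as_index=False
    )["weight"].sum()
    # Rank globally by weight, then keep the first N rows of each user.
    top = (
        user_item.sort_values("weight", ascending=False)
                 .groupby("customer_id")
                 .head(N)
    )
    return (
        top.groupby("customer_id")["article_id"]
           .apply(lambda s: s.astype(str).tolist())
           .to_dict()
    )

# Toy data (made up): ages of 0, 7, 14 and 14 days relative to the latest date.
toy_tx = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-09-22", "2020-09-15", "2020-09-08", "2020-09-08"]
    ),
    "customer_id": ["u1", "u1", "u1", "u1"],
    "article_id": [1, 2, 3, 3],
})
profiles = build_user_profiles_vectorized(toy_tx, decay=0.9, N=2)
print(profiles)  # {'u1': ['3', '1']}
```

Article 3 was bought twice two weeks ago (2 × 0.81 = 1.62), which outranks the single purchase of article 1 today (1.0) and article 2 a week ago (0.9).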


In [13]:
def recommend_personal(uid):
    return user_profiles.get(uid, global_topN)
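A user with fewer than 12 distinct purchases gets a shorter list from `recommend_personal`. One possible way to fill the remaining slots, assuming topping up from the global list is acceptable, is sketched below. `fill_to_n` is a hypothetical helper and is not used elsewhere in this notebook; it normalizes IDs to strings because the profiles hold strings while `global_topN` may hold integers.

```python
def fill_to_n(personal, fallback, n=12):
    """Hypothetical helper: top up a short list from a fallback, skipping duplicates."""
    seen = set()
    recs = []
    for item in list(personal) + list(fallback):
        key = str(item)  # compare IDs as strings regardless of input type
        if key not in seen:
            seen.add(key)
            recs.append(key)
        if len(recs) == n:
            break
    return recs

print(fill_to_n(["5", "7"], [7, 8, 9, 10], n=4))  # ['5', '7', '8', '9']
```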

## Category-Affinity Baseline Model
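This model recommends the most popular items within each user's two most-purchased product categories (`product_type_name`), combining user history with global category trends. Each category contributes up to 20 items and the result is truncated to `N`, so with `N=12` the second category only appears when the first has fewer than 12 popular items. Users with no history fall back to the global top-N.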

In [15]:
pop_by_cat = (
    tx
    .merge(arts[['article_id','product_type_name']], on='article_id')
    .groupby('product_type_name')['article_id']
    .value_counts()
    .groupby(level=0)
    .head(20)
    .reset_index(name='count')
    .groupby('product_type_name')['article_id']
    .apply(list)
    .to_dict()
)
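To make the shape of `pop_by_cat` concrete, the sketch below builds the same kind of mapping on made-up data. It uses `groupby(...).size()` and an explicit sort rather than the chained `value_counts` above, so treat it as an equivalent illustration rather than the cell's exact code. All names prefixed with `toy_` are hypothetical.

```python
import pandas as pd

# Toy data (made up): two categories with two articles each.
toy_tx = pd.DataFrame({"article_id": [1, 1, 2, 3, 3, 3, 4]})
toy_arts = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_type_name": ["Dress", "Dress", "Socks", "Socks"],
})

# Count purchases per (category, article), most purchased first within a category.
counts = (
    toy_tx.merge(toy_arts, on="article_id")
          .groupby(["product_type_name", "article_id"])
          .size()
          .reset_index(name="count")
          .sort_values(["product_type_name", "count"], ascending=[True, False])
)

# Keep the top-2 articles of each category as a ranked list.
toy_pop_by_cat = (
    counts.groupby("product_type_name")["article_id"]
          .apply(lambda s: s.head(2).tolist())
          .to_dict()
)
print(toy_pop_by_cat)  # {'Dress': [1, 2], 'Socks': [3, 4]}
```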

In [19]:
from collections import Counter

id_to_cat = dict(zip(
  arts['article_id'],
  arts['product_type_name']
))

def recommend_by_category(uid, N=12):
    hist = user_hist.get(uid, [])
    # Map each article to its category via dict lookup, skipping unknown
    # articles so that None is never counted as a category
    cats = Counter(
        id_to_cat[a] for a in hist if a in id_to_cat
    ).most_common(2)

    recs = []
    for cat, _ in cats:
        recs.extend(pop_by_cat.get(cat, []))
    return recs[:N] if recs else global_topN
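Because the category lists are concatenated before truncation, the first category can fill every slot. If both favorite categories should contribute, one option is a round-robin merge, sketched below under that assumption. `interleave_categories` is a hypothetical helper, not part of the notebook.

```python
from itertools import zip_longest

def interleave_categories(cat_lists, n=12):
    """Hypothetical helper: round-robin merge of per-category ranked lists."""
    recs = []
    seen = set()
    # Take the 1st item of every list, then the 2nd of every list, and so on.
    for group in zip_longest(*cat_lists):
        for item in group:
            if item is not None and item not in seen:
                seen.add(item)
                recs.append(item)
    return recs[:n]

print(interleave_categories([[1, 2, 3], [10, 20]], n=4))  # [1, 10, 2, 20]
```

Inside `recommend_by_category`, this would be called with `[pop_by_cat.get(cat, []) for cat, _ in cats]` in place of the `extend` loop.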


## Generating Submission Files
For each model, we generate a submission file in the required format. Each file contains one row per customer ID with that customer's recommended articles as a space-separated string of article IDs, each padded to 10 digits, ready for submission to Kaggle.

In [None]:
models = {
    "global"   : recommend_global,
    "personal" : recommend_personal,
    "category" : recommend_by_category
}

# pad any id to 10 digits
pad10 = lambda a: str(a).zfill(10)

from tqdm.auto import tqdm

for name, func in models.items():
    sub = sample[['customer_id']].copy()
    customer_ids = sub['customer_id'].tolist()

    preds = []
    for uid in tqdm(customer_ids, desc=f"Model: {name}", unit="cust"):
        preds.append(" ".join(pad10(x) for x in func(uid)))  # <- pad here

    sub['prediction'] = preds
    fname = f"submission_{name}.csv"
    sub.to_csv(fname, index=False)
    print(f"✏️  Wrote {fname} ({len(sub)} rows)")
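Before uploading, it can be worth checking that every prediction follows the format described above. The sketch below is a hypothetical sanity check (`check_submission` is not part of the notebook) that assumes at most 12 IDs per row, matching `N = 12`, and exactly 10 digits per ID.

```python
import re

import pandas as pd

def check_submission(sub: pd.DataFrame, max_items: int = 12) -> bool:
    """Hypothetical check: each row has at most max_items ten-digit IDs."""
    ten_digits = re.compile(r"\d{10}")
    for pred in sub["prediction"]:
        ids = pred.split()
        if len(ids) > max_items:
            return False
        if not all(ten_digits.fullmatch(i) for i in ids):
            return False
    return True

# Toy submission (made up) in the expected shape.
toy_sub = pd.DataFrame({
    "customer_id": ["u1", "u2"],
    "prediction": ["0706016001 0706016002", "0372860001"],
})
print(check_submission(toy_sub))  # True
```

When reading a written file back for this check, pass `dtype=str` to `pd.read_csv`; otherwise a row holding a single ID would be parsed as an integer and lose its leading zero.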

