# 02: Data Embeddings & Baseline Models

This notebook covers:

- **Configuration & Imports**: define data paths and evaluation settings
- **Load Processed Data**: sequences, maps, and interaction tables
- **Data Inspection**: peek at samples
- **Create Interaction Matrices**:
  - Build sparse user–item matrices from sequences
  - Generate train/test splits for evaluation
- **Define Baseline Models**:
  - Popularity
  - Item-based KNN
  - User-based KNN
  - Matrix Factorization (NMF)
- **Evaluation Metrics**: Recall@K
- **Train & Evaluate** each baseline
- **Summarize & Save** results to JSON

## 1. Configuration & Imports

- Set file paths for processed data
- Define evaluation parameters (K values, sample sizes)

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Paths to processed data
out_dir = Path('../data/processed/jarir/')
seq_train_path = out_dir / 'sequences_train.parquet'
seq_val_path   = out_dir / 'sequences_val.parquet'
seq_test_path  = out_dir / 'sequences_test.parquet'
item_map_path  = out_dir / 'item_id_map.parquet'
cust_map_path  = out_dir / 'customer_id_map.parquet'

# Evaluation config
K_VALUES = [5, 10, 20]
EVAL_SAMPLE_SIZE = 1000  # intended user sample size for faster evaluation (None = all); not applied by evaluate() below

# Random seed for reproducibility
np.random.seed(42)

# Placeholder for baseline results
baseline_results = {}

## 2. Load Processed Data

- Read train/val/test sequence tables
- Read user & item mappings

In [2]:
print("Loading sequences and maps...")
seq_train = pd.read_parquet(seq_train_path, engine='fastparquet')
seq_val   = pd.read_parquet(seq_val_path,   engine='fastparquet')
seq_test  = pd.read_parquet(seq_test_path,  engine='fastparquet')
item_map  = pd.read_parquet(item_map_path, engine='fastparquet')
cust_map  = pd.read_parquet(cust_map_path, engine='fastparquet')
print(f"Train seq: {len(seq_train)} rows")
print(f"Val seq:   {len(seq_val)} rows")
print(f"Test seq:  {len(seq_test)} rows")
print(f"Items:     {len(item_map)}")
print(f"Users:     {len(cust_map)}")

Loading sequences and maps...
Train seq: 1108 rows
Val seq:   169 rows
Test seq:  160 rows
Items:     1735
Users:     929


## 3. Data Inspection

- View a few example rows from sequences, item_map, and cust_map


In [3]:
print("Sample sequence row:")
print(seq_train.head(3))
print("\nSample item map:")
print(item_map.head(3))
print("\nSample customer map:")
print(cust_map.head(3))

Sample sequence row:
   customer_id  user_idx         ts history_idx  pos_item_idx     country
0     10018322         5 2024-03-07        9 10            11  0103-PLAZA
1     10018322         5 2024-03-24     9 10 11            12  0103-PLAZA
2     10018322         5 2024-04-30  9 10 11 12            13  0103-PLAZA

Sample item map:
      stock_code  item_idx
0      RQ-CHB002         0
1  ZQ-F27318BGLD         1
2     NE-0230059         2

Sample customer map:
   customer_id  user_idx
0        11949         0
1        24811         1
2        33097         2


## 4. Create Sparse Interaction Matrix

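- Build a user–item matrix from the sequences: each positive (target) item gets weight 1.0 and each history item 0.5; repeated (user, item) pairs are summed
- Also build `full_mat` from the train, validation, and test sequences combined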

In [4]:
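# csr_matrix is already imported in the first cell

def build_matrix_from_sequences(seq_df, n_users, n_items):
    """Build a sparse user-item matrix from sequence rows.

    The positive (target) item gets weight 1.0 and each history item 0.5.
    Repeated (user, item) pairs are summed by the csr_matrix constructor.
    """
    rows, cols, vals = [], [], []
    for _, r in seq_df.iterrows():
        u = int(r['user_idx'])
        p = int(r['pos_item_idx'])
        # positive event
        rows.append(u); cols.append(p); vals.append(1.0)
        # history events: space-separated item indices, e.g. "9 10 11"
        if pd.notna(r['history_idx']) and r['history_idx']:
            h = [int(x) for x in r['history_idx'].split()]
            for item in h:
                rows.append(u); cols.append(item); vals.append(0.5)
    return csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))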

n_users = len(cust_map)
n_items = len(item_map)
train_mat = build_matrix_from_sequences(seq_train, n_users, n_items)
full_mat  = build_matrix_from_sequences(pd.concat([seq_train, seq_val, seq_test]), n_users, n_items)
print(f"Train matrix: {train_mat.shape}, nz={train_mat.nnz}")

Train matrix: (929, 1735), nz=1623
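
Because the same (user, item) pair can appear in several sequence rows, it helps to see how the sparse constructor combines repeats. The sketch below replays the three sample rows for `user_idx` 5 shown in Section 3, using the same 1.0 / 0.5 weights; the tiny matrix shape (6 × 14) is an assumption chosen only to fit these indices. SciPy sums duplicate coordinates when a CSR matrix is built from `(data, (row, col))`, so the stored values are accumulated weights rather than 0/1 flags.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The three sample rows for user_idx 5: (history_idx string, pos_item_idx)
sample_rows = [("9 10", 11), ("9 10 11", 12), ("9 10 11 12", 13)]

rows, cols, vals = [], [], []
for history, pos in sample_rows:
    rows.append(5); cols.append(pos); vals.append(1.0)       # positive event
    for item in map(int, history.split()):
        rows.append(5); cols.append(item); vals.append(0.5)  # history event

# Shape (6, 14) is a toy assumption, just large enough for these indices
toy = csr_matrix((vals, (rows, cols)), shape=(6, 14))

# Accumulated weight per item for user 5
weights = {int(j): float(toy[5, j]) for j in toy[5].nonzero()[1]}
print(weights)
```

Item 11, for example, accumulates 2.0 because it is the positive item once (1.0) and a history item twice (0.5 each).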


## 5. Prepare Train/Test Split

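- Hold out the validation positives (`seq_val`) as evaluation cases and zero them out of the training matrix so that no held-out pair leaks into training
- Collect the held-out pairs as `test_interactions`, a list of `(user, item)` tuples; the `seq_test` split is not used for evaluation in this notebook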

In [5]:
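test_interactions = []
for _, r in seq_val.iterrows():
    u, p = int(r['user_idx']), int(r['pos_item_idx'])
    test_interactions.append((u, p))
    train_mat[u, p] = 0  # remove the held-out pair from training
# Assigning 0 can leave explicitly stored zeros; drop them so nnz stays accurate
train_mat.eliminate_zeros()
print(f"Prepared {len(test_interactions)} held-out test cases")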

Prepared 169 held-out test cases
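
One detail of zeroing entries in a sparse matrix: assigning 0 to an existing entry of a CSR matrix keeps the slot as an explicitly stored zero, so `nnz` does not drop until `eliminate_zeros()` is called. A minimal sketch on a toy matrix (the values are illustrative, not taken from the dataset):

```python
from scipy.sparse import csr_matrix

toy = csr_matrix([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])
stored_before = toy.nnz              # number of stored entries

toy[0, 2] = 0                        # overwrite an existing entry with zero
stored_after_assign = toy.nnz        # the slot is still stored, now holding 0.0
true_nonzeros = toy.count_nonzero()  # counts only values that are != 0

toy.eliminate_zeros()                # drop explicitly stored zeros in place
stored_after_cleanup = toy.nnz

print(stored_before, stored_after_assign, true_nonzeros, stored_after_cleanup)
```

Comparing `nnz` with `count_nonzero()` is a quick way to check whether a matrix is carrying stored zeros.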


## 6. Define Baseline Models

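- **Popularity**: rank items by total interaction weight (column sums)
- **ItemKNN**: cosine similarity between item columns; an item's score is the summed similarity to the top-k neighbors of each seen item
- **UserKNN**: cosine similarity between user rows; an item's score is the unweighted sum of the interaction rows of the most similar users
- **Matrix Factorization**: NMF fitted on the densified matrix; scores are `W[u] · H`
- All models exclude items the user has already interacted with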


In [6]:
class Popularity:
    """Rank items by total interaction weight (column sums)."""
    def fit(self, mat):
        self.mat = mat  # kept so recommend() can mask seen items
        self.pop = np.array(mat.sum(axis=0)).flatten()
        return self
    def recommend(self, u, k=10):
        seen = self.mat[u].nonzero()[1]
        scores = self.pop.copy(); scores[seen] = -1
        return np.argsort(scores)[-k:][::-1]

class ItemKNN:
    """Score items by similarity to the top-k neighbors of each seen item."""
    def __init__(self, k=50): self.k = k
    def fit(self, mat):
        self.mat = mat
        self.sim = cosine_similarity(mat.T)  # dense item-item similarity
        return self
    def recommend(self, u, k=10):
        user_vec = self.mat[u].toarray().flatten()
        seen = user_vec.nonzero()[0]
        scores = np.zeros(self.mat.shape[1])
        for i in seen:
            top = np.argsort(self.sim[i])[-self.k:]
            scores[top] += self.sim[i, top]
        scores[seen] = -1
        return np.argsort(scores)[-k:][::-1]

class UserKNN:
    """Score items by summing the interaction rows of the most similar users."""
    def __init__(self, k=50): self.k = k
    def fit(self, mat):
        self.mat = mat
        self.sim = cosine_similarity(mat)  # dense user-user similarity
        return self
    def recommend(self, u, k=10):
        # k+1 neighbors because u is normally its own nearest neighbor
        top_users = np.argsort(self.sim[u])[-(self.k+1):][::-1]
        scores = np.zeros(self.mat.shape[1])
        for v in top_users:
            scores += self.mat[v].toarray().flatten()
        seen = self.mat[u].nonzero()[1]; scores[seen] = -1
        return np.argsort(scores)[-k:][::-1]

class MFBaseline:
    """Non-negative matrix factorization (NMF) on the densified matrix."""
    def __init__(self, n_f=50): self.n_f = n_f
    def fit(self, mat):
        self.mat = mat
        dense = mat.toarray()
        self.model = NMF(n_components=self.n_f, random_state=42)
        self.W = self.model.fit_transform(dense)
        self.H = self.model.components_
        return self
    def recommend(self, u, k=10):
        scores = self.W[u].dot(self.H)
        seen = self.mat[u].nonzero()[1]; scores[seen] = -1
        return np.argsort(scores)[-k:][::-1]
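
To make the item-based scoring concrete, here is a minimal NumPy-only sketch on a toy 3 × 4 interaction matrix (the matrix and the helper name `item_cosine` are illustrative assumptions). It follows the same idea as `ItemKNN` above but, for brevity, sums similarities over all items instead of truncating to each item's top-k neighbors.

```python
import numpy as np

def item_cosine(mat):
    """Cosine similarity between the columns (items) of a dense matrix."""
    norms = np.linalg.norm(mat, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for empty items
    normalized = mat / norms
    return normalized.T @ normalized

# Toy interactions: rows are users, columns are items
toy = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0, 0.0],
                [1.0, 1.0, 0.0, 1.0]])
sim = item_cosine(toy)

u, k = 0, 2
seen = toy[u].nonzero()[0]             # items user 0 already interacted with
scores = sim[seen].sum(axis=0)         # add up similarity to every seen item
scores[seen] = -1                      # never recommend seen items
recs = np.argsort(scores)[-k:][::-1]   # top-k indices, best first
print(recs)
```

Item 3 ranks ahead of item 2 here because it co-occurs with both of user 0's items (through user 2), while item 2 co-occurs only with item 1.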

## 7. Evaluation Function (Recall@K)

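- Compute Recall@K over the held-out `test_interactions`; with one held-out item per case, this is the fraction of cases whose true item appears in the top K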

In [7]:
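def recall_at_k(recs, true_item):
    """Return 1 if the single held-out item is among the recommendations, else 0."""
    return int(true_item in recs)

def evaluate(model, k=10):
    """Mean Recall@k of `model` over all held-out (user, item) pairs."""
    hits = []
    for u, true in test_interactions:
        recs = model.recommend(u, k)
        hits.append(recall_at_k(recs, true))
    return np.mean(hits)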
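
A small worked example may help here. Since each held-out case has exactly one true item, Recall@K reduces to a hit rate: the fraction of cases whose true item appears in the top K. The sketch below redefines the two helpers so that it runs on its own (taking the interactions as an argument rather than reading a global) and uses a hypothetical `FixedModel` that returns the same ranking for every user; the interactions are made up for illustration.

```python
import numpy as np

def recall_at_k(recs, true_item):
    return int(true_item in recs)

def evaluate(model, interactions, k=10):
    hits = [recall_at_k(model.recommend(u, k), true) for u, true in interactions]
    return float(np.mean(hits))

class FixedModel:
    """Hypothetical model: the same ranking for every user."""
    ranking = np.array([3, 1, 4, 0, 2])
    def recommend(self, u, k=10):
        return self.ranking[:k]

toy_interactions = [(0, 1), (1, 4), (2, 2), (3, 3)]  # (user, held-out item)
model = FixedModel()
recall_2 = evaluate(model, toy_interactions, k=2)  # top-2 is [3, 1]
recall_3 = evaluate(model, toy_interactions, k=3)  # top-3 is [3, 1, 4]
print(recall_2, recall_3)
```

Raising K from 2 to 3 adds item 4 to the list, which turns the second case into a hit, so the score can only stay equal or grow as K increases.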

## 8. Train & Evaluate Baselines

- Fit each model on `train_mat`
- Evaluate Recall@K for K in `K_VALUES`

In [8]:
results = {}

# Popularity
pop = Popularity().fit(train_mat)
results['Popularity'] = {f'Recall@{k}': evaluate(pop, k) for k in K_VALUES}

# ItemKNN
itemknn = ItemKNN(k=50).fit(train_mat)
results['ItemKNN'] = {f'Recall@{k}': evaluate(itemknn, k) for k in K_VALUES}

# UserKNN
userknn = UserKNN(k=50).fit(train_mat)
results['UserKNN'] = {f'Recall@{k}': evaluate(userknn, k) for k in K_VALUES}

# Matrix Factorization
mf = MFBaseline(n_f=50).fit(train_mat)
results['MatrixFactorization'] = {f'Recall@{k}': evaluate(mf, k) for k in K_VALUES}

print("Baseline Results:", results)

Baseline Results: {'Popularity': {'Recall@5': 0.0650887573964497, 'Recall@10': 0.09467455621301775, 'Recall@20': 0.1242603550295858}, 'ItemKNN': {'Recall@5': 0.0, 'Recall@10': 0.0, 'Recall@20': 0.011834319526627219}, 'UserKNN': {'Recall@5': 0.029585798816568046, 'Recall@10': 0.029585798816568046, 'Recall@20': 0.047337278106508875}, 'MatrixFactorization': {'Recall@5': 0.0, 'Recall@10': 0.005917159763313609, 'Recall@20': 0.005917159763313609}}
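
The nested dict printed above is hard to scan. One option is to view it as a table; the sketch below copies the reported values (rounded to four decimals) into a pandas DataFrame and looks up the best model at Recall@20. The variable names are illustrative.

```python
import pandas as pd

# Values copied from the output above, rounded to 4 decimals
reported = {
    'Popularity':          {'Recall@5': 0.0651, 'Recall@10': 0.0947, 'Recall@20': 0.1243},
    'ItemKNN':             {'Recall@5': 0.0,    'Recall@10': 0.0,    'Recall@20': 0.0118},
    'UserKNN':             {'Recall@5': 0.0296, 'Recall@10': 0.0296, 'Recall@20': 0.0473},
    'MatrixFactorization': {'Recall@5': 0.0,    'Recall@10': 0.0059, 'Recall@20': 0.0059},
}

# Outer keys become columns, so transpose to get one row per model
table = pd.DataFrame(reported).T
best_model = table['Recall@20'].idxmax()
print(table)
print("Best at Recall@20:", best_model)
```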


## 9. Summarize & Save Results

- Pick the model with the highest Recall@K at the largest K in `K_VALUES` (here Recall@20)
- Write the summary to `baseline_results.json` in `out_dir`

In [9]:
best = max(results.items(), key=lambda x: x[1][f'Recall@{K_VALUES[-1]}'])
baseline_summary = {
    'held_out_interactions': results,
    'best_model': best[0],
    'best_recall': best[1][f'Recall@{K_VALUES[-1]}']
}
with open(out_dir / 'baseline_results.json', 'w') as f:
    json.dump(baseline_summary, f, indent=2)

print(f"Saved baseline results, best={best[0]} Recall@{K_VALUES[-1]}={best[1][f'Recall@{K_VALUES[-1]}']:.4f}")

Saved baseline results, best=Popularity Recall@20=0.1243
