# Graph Augmentation for Recommender Systems  
## From Pure Collaborative Filtering to Content-Aware Graphs

This notebook continues the development of a graph-based recommender system
on the **GoodBooks-10k** dataset.

In the previous step, we built a **pure collaborative filtering baseline**
using a bipartite **user–item graph** and the **LightGCN** model.
That experiment gives us a **reference level of performance** for
collaborative signals alone.

## Motivation

Pure collaborative filtering captures **co-consumption patterns**,
but ignores **semantic relations between items**.

As a result:
- sparse items receive little signal,
- cold-start behavior is limited,
- improvements plateau even with more complex architectures.

Instead of increasing model complexity,  
we explore a different direction: **graph augmentation**.

## Graph Augmentation Idea

We enrich the original user–item graph with **content-based nodes**:

- **Books** are connected to **tags** (genres, themes, descriptors)
- Tags act as **semantic bridges** between otherwise weakly connected items

This results in a multi-hop structure:

user → book → tag → book

Information can now propagate not only through shared users,
but also through shared semantic attributes.
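
The effect of the tag bridge can be illustrated on a toy graph. The sketch below uses hypothetical nodes (two users, three books, one tag) and a small helper, `reachable_within`, that is not part of the notebook. It checks whether a user can reach a book that nobody has read yet, with and without the tag node.

```python
import numpy as np

# Toy graph (hypothetical data): 2 users, 3 books, 1 tag.
names = ["u0", "u1", "b0", "b1", "b2", "t0"]
edges = [
    ("u0", "b0"),  # user 0 read book 0
    ("u1", "b1"),  # user 1 read book 1
    ("b0", "t0"),  # book 0 carries tag 0
    ("b2", "t0"),  # book 2 carries tag 0 but has no readers
]

def adjacency(edge_list, node_names):
    """Dense symmetric adjacency matrix for an undirected edge list."""
    index = {name: k for k, name in enumerate(node_names)}
    A = np.zeros((len(node_names), len(node_names)), dtype=np.int64)
    for a, b in edge_list:
        A[index[a], index[b]] = 1
        A[index[b], index[a]] = 1
    return A

def reachable_within(A, src, dst, hops):
    """True if dst can be reached from src in at most `hops` steps."""
    n = A.shape[0]
    reach = np.eye(n, dtype=np.int64)
    step = np.eye(n, dtype=np.int64)
    for _ in range(hops):
        step = (step @ A > 0).astype(np.int64)       # walks one step longer
        reach = ((reach + step) > 0).astype(np.int64)
    return bool(reach[src, dst])

cf_only = adjacency([e for e in edges if "t0" not in e], names)
augmented = adjacency(edges, names)

u0, b2 = names.index("u0"), names.index("b2")
print("CF only, 3 hops:  ", reachable_within(cf_only, u0, b2, 3))
print("Augmented, 3 hops:", reachable_within(augmented, u0, b2, 3))
```

In the CF-only graph, `b2` is isolated, so no number of propagation layers can move signal to it. With the tag node, it is three hops away from `u0`, which matches the three LightGCN layers used later in this notebook.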

## Key Principles of This Experiment

- **Same data splits** as the CF baseline (no leakage)
- **Same training protocol** (loss, sampling, hyperparameters)
- **Same evaluation metrics** (Hit@K, NDCG@K)
- **Only the graph structure is changed**

This ensures a **fair and controlled comparison**.

## Goals

1. Test whether **graph structure matters** more than architectural novelty
2. Quantify gains from content-aware message passing
3. Build an interpretable and extensible foundation for hybrid GNN recommenders

## What This Notebook Is — and Is Not

✔ This is a **controlled research step**  
✔ This is an **engineering-quality experiment**  
✖ This is not a hyperparameter sweep  
✖ This is not a SOTA benchmark

## Outcome

By the end of this notebook, we will compare:

- **Pure CF LightGCN**
- **Augmented LightGCN (user–book–tag graph)**

under identical conditions and evaluate whether
semantic graph augmentation leads to meaningful ranking improvements.

In [1]:
# Cell 2: Imports and global configuration
# - set seeds, device
# - define project paths
# - centralize filenames to avoid "magic strings"

from __future__ import annotations  # must precede all other imports

import copy
import json
import math
import os
import random
import time
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


SEED = 42
set_seed(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("DEVICE:", DEVICE)

# Project paths 
PROJECT_ROOT = Path(r"D:/ML/GNN/graph_recsys")  
DATA_PROCESSED = PROJECT_ROOT / "data_processed" / "v2_proper"
DATA_RAW = PROJECT_ROOT / "data_raw"
ARTIFACTS = PROJECT_ROOT / "artifacts" / "v2_proper"

for p in [PROJECT_ROOT, DATA_PROCESSED, DATA_RAW, ARTIFACTS]:
    p.mkdir(parents=True, exist_ok=True)

print("DATA_PROCESSED:", DATA_PROCESSED)
print("ARTIFACTS:", ARTIFACTS)

# ---- Expected files in v2_proper ----
FILES = {
    "train": "train_interactions.csv",
    "val": "val_interactions.csv",
    "test": "test_interactions.csv",
    "user_map": "user2idx.csv",   # columns: user_id, user_idx
    "book_map": "book2idx.csv",   # columns: book_id, book_idx
}

# ---- Raw content files ----
RAW_FILES = {
    "tags": "tags.csv",           # tag_id, tag_name
    "book_tags": "book_tags.csv", # goodbooks format: goodreads_book_id, tag_id, count
}

print("Ready.")

DEVICE: cuda
DATA_PROCESSED: D:\ML\GNN\graph_recsys\data_processed\v2_proper
ARTIFACTS: D:\ML\GNN\graph_recsys\artifacts\v2_proper
Ready.


In [6]:
# Cell 3: Load splits and mappings (in the format they were actually saved in)
# - splits_ui.npz contains train_ui/val_ui/test_ui arrays of shape (N,2) with (u,i) indices
# - user2idx.csv and book2idx.csv were saved via pd.Series(...).to_csv()
#   => CSV has two columns: [index, value] i.e. [id, idx] but without headers

def load_series_mapping(path: Path) -> dict[int, int]:
    """
    Loads mapping saved by:
        pd.Series(mapping_dict).to_csv(path)
    The resulting CSV usually has columns like: ['Unnamed: 0', '0'].
    Returns: {original_id: mapped_idx}
    """
    if not path.exists():
        raise FileNotFoundError(f"Missing file: {path}")

    df = pd.read_csv(path)  # will have 2 columns: index + value

    if df.shape[1] < 2:
        # if something really weird happened
        df = pd.read_csv(path, header=None)
        if df.shape[1] < 2:
            raise ValueError(f"Mapping file {path} must have 2 columns (id, idx). Got: {df.shape}")

    id_col = df.columns[0]
    idx_col = df.columns[1]

    ids = pd.to_numeric(df[id_col], errors="raise").astype(int).to_numpy()
    idx = pd.to_numeric(df[idx_col], errors="raise").astype(int).to_numpy()

    mapping = dict(zip(ids, idx))
    return mapping

# Paths
splits_path   = DATA_PROCESSED / "splits_ui.npz"
user_map_path = DATA_PROCESSED / "user2idx.csv"
book_map_path = DATA_PROCESSED / "book2idx.csv"

for p in [splits_path, user_map_path, book_map_path]:
    if not p.exists():
        raise FileNotFoundError(f"Missing file: {p}")

# Load splits
z = np.load(splits_path, allow_pickle=True)
assert "train_ui" in z and "val_ui" in z and "test_ui" in z, f"Unexpected keys in splits: {list(z.keys())}"

train_ui = z["train_ui"].astype(np.int32)
val_ui   = z["val_ui"].astype(np.int32)
test_ui  = z["test_ui"].astype(np.int32)

n_users = int(z["n_users"]) if "n_users" in z else None
n_items = int(z["n_items"]) if "n_items" in z else None

print("train_ui:", train_ui.shape, train_ui.dtype)
print("val_ui:", val_ui.shape, val_ui.dtype)
print("test_ui:", test_ui.shape, test_ui.dtype)
print("n_users:", n_users, "n_items:", n_items)

assert train_ui.ndim == 2 and train_ui.shape[1] == 2, f"train_ui must be (N,2), got {train_ui.shape}"
assert val_ui.ndim == 2 and val_ui.shape[1] == 2, f"val_ui must be (N,2), got {val_ui.shape}"
assert test_ui.ndim == 2 and test_ui.shape[1] == 2, f"test_ui must be (N,2), got {test_ui.shape}"

# Load mappings (Series to CSV format)
user2idx = load_series_mapping(user_map_path)   # {user_id: u}
book2idx = load_series_mapping(book_map_path)   # {book_id: i}

# Build inverse mappings (idx -> original id) for readability/debug
# Important: this works because ids were saved as keys in the Series
idx2user = {u: user_id for user_id, u in user2idx.items()}
idx2book = {i: book_id for book_id, i in book2idx.items()}

# Infer counts from mapping (more reliable than npz if mismatch)
num_users = max(idx2user.keys()) + 1 if len(idx2user) else 0
num_items = max(idx2book.keys()) + 1 if len(idx2book) else 0

print("num_users(from mapping):", num_users, "num_items(from mapping):", num_items)

# Optional: consistency check with npz counts
if n_users is not None:
    assert n_users == num_users, f"Mismatch n_users: npz={n_users}, mapping={num_users}"
if n_items is not None:
    assert n_items == num_items, f"Mismatch n_items: npz={n_items}, mapping={num_items}"

# Build DataFrames for convenience (both idx and original ids)
def ui_to_df(ui: np.ndarray, name: str) -> pd.DataFrame:
    u = ui[:, 0].astype(int)
    i = ui[:, 1].astype(int)
    df = pd.DataFrame({"user_idx": u, "book_idx": i})
    df["user_id"] = df["user_idx"].map(idx2user).astype(int)
    df["book_id"] = df["book_idx"].map(idx2book).astype(int)
    return df[["user_id", "book_id", "user_idx", "book_idx"]]

train_df = ui_to_df(train_ui, "train")
val_df   = ui_to_df(val_ui, "val")
test_df  = ui_to_df(test_ui, "test")

print("train_df:", train_df.shape, "val_df:", val_df.shape, "test_df:", test_df.shape)
train_df.head()

train_ui: (4926384, 2) int32
val_ui: (53398, 2) int32
test_ui: (53398, 2) int32
n_users: 53398 n_items: 9999
num_users(from mapping): 53398 num_items(from mapping): 9999
train_df: (4926384, 4) val_df: (53398, 4) test_df: (53398, 4)


   user_id  book_id  user_idx  book_idx
0        1      258         0       257
1        1     1796         0      1795
2        1     4691         0      4690
3        1     2063         0      2062
4        1       11         0        10
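
The loader above relies on how `pd.Series(...).to_csv()` lays out a mapping on disk. As a minimal sketch with a hypothetical three-entry mapping and a temporary file, the round trip looks roughly like this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping: original id -> contiguous index
mapping = {101: 0, 205: 1, 317: 2}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy2idx.csv"
    pd.Series(mapping).to_csv(path)   # index column = ids, value column = idx
    raw_text = path.read_text()
    df = pd.read_csv(path)            # first row is treated as the header

print(raw_text)
print(df.columns.tolist())

# Rebuild the dict positionally, without relying on the column names
restored = {int(k): int(v) for k, v in zip(df.iloc[:, 0], df.iloc[:, 1])}
print(restored)
```

Reading the columns by position rather than by name keeps the loader independent of the auto-generated header labels, which is the same design choice `load_series_mapping` makes.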


In [37]:
# Cell 3a: Load raw tags relations (tags.csv, book_tags.csv)
# - read the raw files from data_raw
# - keep only the books that are present in our book2idx (9999 books)
# - cast types and check the column schema

def read_csv(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Missing file: {path}")
    return pd.read_csv(path)

tags_path = DATA_RAW / RAW_FILES["tags"]
book_tags_path = DATA_RAW / RAW_FILES["book_tags"]

tags_df = read_csv(tags_path)
book_tags_df = read_csv(book_tags_path)

print("tags_df:", tags_df.shape, "| columns:", tags_df.columns.tolist())
print("book_tags_df:", book_tags_df.shape, "| columns:", book_tags_df.columns.tolist())
print("book_tags dtypes:", book_tags_df.dtypes.to_dict())

# schema normalization (defensive)
rename_map = {}
if "goodreads_book_id" not in book_tags_df.columns and "book_id" in book_tags_df.columns:
    rename_map["book_id"] = "goodreads_book_id"
if rename_map:
    book_tags_df = book_tags_df.rename(columns=rename_map)

assert {"goodreads_book_id", "tag_id"}.issubset(book_tags_df.columns), \
    "book_tags.csv must contain goodreads_book_id and tag_id"
assert {"tag_id"}.issubset(tags_df.columns), "tags.csv must contain tag_id"
assert "count" in book_tags_df.columns, "Expected 'count' in book_tags.csv"

# types
book_tags_df["goodreads_book_id"] = book_tags_df["goodreads_book_id"].astype(int)
book_tags_df["tag_id"] = book_tags_df["tag_id"].astype(int)
book_tags_df["count"] = book_tags_df["count"].astype(int)

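# IMPORTANT: keep only books that exist in our mapping (book2idx keys are original book_id).
# NOTE: book_tags.csv is keyed by goodreads_book_id; in the public GoodBooks-10k release this
# is a different ID space from book_id (books.csv lists both), so check the match count below.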
book_ids_in_mapping = set(book2idx.keys())
book_tags_df = book_tags_df[book_tags_df["goodreads_book_id"].isin(book_ids_in_mapping)].copy()

print("book_tags_df filtered to mapped books:", book_tags_df.shape)
print("unique books with tags (raw, before any filtering):", book_tags_df["goodreads_book_id"].nunique())
print("unique tags in relations (raw, before filtering):", book_tags_df["tag_id"].nunique())

tags_df: (34252, 2) | columns: ['tag_id', 'tag_name']
book_tags_df: (999912, 3) | columns: ['goodreads_book_id', 'tag_id', 'count']
book_tags dtypes: {'goodreads_book_id': dtype('int64'), 'tag_id': dtype('int64'), 'count': dtype('int64')}
book_tags_df filtered to mapped books: (81200, 3)
unique books with tags (raw, before any filtering): 812
unique tags in relations (raw, before filtering): 6734
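
Only 812 of the 9,999 mapped books survive this filter. One likely explanation is an ID-space mismatch: in the public GoodBooks-10k release, `book_tags.csv` is keyed by `goodreads_book_id`, while the interaction files use the compact `book_id`, and `books.csv` carries both columns. The sketch below uses toy stand-in frames with hypothetical values to show how translating IDs before filtering changes the match count. It assumes a `books.csv` with `book_id` and `goodreads_book_id` columns, and that `book2idx` is keyed by `book_id`.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for books.csv and book_tags.csv
books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "goodreads_book_id": [2767052, 3, 41865],
})
book_tags = pd.DataFrame({
    "goodreads_book_id": [2767052, 2767052, 3, 41865],
    "tag_id": [11, 22, 11, 33],
    "count": [50, 40, 30, 20],
})
book2idx_toy = {1: 0, 2: 1, 3: 2}  # book_id -> book_idx

# Direct filter: compares goodreads ids against book_id keys
direct = book_tags[book_tags["goodreads_book_id"].isin(list(book2idx_toy))]

# Translate goodreads_book_id -> book_id first, then filter
translated = book_tags.merge(
    books[["book_id", "goodreads_book_id"]],
    on="goodreads_book_id",
    how="inner",
)
translated = translated[translated["book_id"].isin(list(book2idx_toy))]

print("books matched directly:     ", direct["goodreads_book_id"].nunique())
print("books matched via books.csv:", translated["book_id"].nunique())
```

In the toy data the direct filter keeps a single book, and attaches its tags to the wrong `book_id` (goodreads id 3 belongs to `book_id` 2 here). If the same mismatch applies to the real files, the tag coverage reported in this notebook is a lower bound rather than a property of the dataset.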


In [7]:
# Cell 3b: Leakage and index-range checks (in u/i space)

def check_split(df: pd.DataFrame, name: str, n_users: int, n_items: int) -> None:
    assert df.user_idx.min() >= 0 and df.user_idx.max() < n_users, f"{name}: user_idx out of range"
    assert df.book_idx.min() >= 0 and df.book_idx.max() < n_items, f"{name}: book_idx out of range"

check_split(train_df, "train", num_users, num_items)
check_split(val_df, "val", num_users, num_items)
check_split(test_df, "test", num_users, num_items)

train_pairs = set(zip(train_df.user_idx, train_df.book_idx))
val_pairs   = set(zip(val_df.user_idx, val_df.book_idx))
test_pairs  = set(zip(test_df.user_idx, test_df.book_idx))

print("train∩val:", len(train_pairs & val_pairs))
print("train∩test:", len(train_pairs & test_pairs))
print("val∩test:", len(val_pairs & test_pairs))

assert len(train_pairs & val_pairs) == 0, "Leakage: train overlaps val"
assert len(train_pairs & test_pairs) == 0, "Leakage: train overlaps test"
assert len(val_pairs & test_pairs) == 0, "Leakage: val overlaps test"

print("Splits are consistent ✔")

train_u = train_df["user_idx"].to_numpy()
train_b = train_df["book_idx"].to_numpy()

train∩val: 0
train∩test: 0
val∩test: 0
Splits are consistent ✔
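
Building Python sets of tuples for roughly five million pairs works but is memory-hungry. A possible alternative, sketched here with hypothetical helper names (`encode_pairs`, `count_overlap`), encodes each `(user_idx, book_idx)` pair as one integer and intersects the arrays with NumPy. It assumes every item index is smaller than `n_items`, which the range check above already enforces.

```python
import numpy as np

def encode_pairs(ui: np.ndarray, n_items: int) -> np.ndarray:
    """Map each (user_idx, item_idx) row to a single int64 key."""
    ui = np.asarray(ui, dtype=np.int64)
    return ui[:, 0] * n_items + ui[:, 1]

def count_overlap(a_ui: np.ndarray, b_ui: np.ndarray, n_items: int) -> int:
    """Number of distinct (user, item) pairs present in both splits."""
    common = np.intersect1d(encode_pairs(a_ui, n_items), encode_pairs(b_ui, n_items))
    return int(common.size)

# Toy splits (hypothetical data) with 4 items
toy_train = np.array([[0, 1], [0, 2], [1, 0]])
toy_val = np.array([[0, 3], [1, 2]])
toy_leaky = np.array([[0, 2], [1, 3]])   # shares (0, 2) with train

print(count_overlap(toy_train, toy_val, n_items=4))    # 0
print(count_overlap(toy_train, toy_leaky, n_items=4))  # 1
```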


In [10]:
# Cell 4: Baseline reference metrics (from Notebook 02 / handoff)
# - we didn't persist baseline_metrics.json, so we store the key numbers here
# - these values are used only for comparison tables (no training/eval uses them)

baseline_ref = {
    "model": "LightGCN (user–book) + hard negatives",
    "Hit@10": 0.084,
    "NDCG@10": 0.045,
    "Hit@20": 0.129,
    "NDCG@20": 0.057,
    "Hit@50": 0.221,
    "NDCG@50": 0.075,
}

print("Baseline reference metrics:")
for k, v in baseline_ref.items():
    print(f"  {k}: {v}")

Baseline reference metrics:
  model: LightGCN (user–book) + hard negatives
  Hit@10: 0.084
  NDCG@10: 0.045
  Hit@20: 0.129
  NDCG@20: 0.057
  Hit@50: 0.221
  NDCG@50: 0.075


In [9]:
# Cell 5: Load content relations (book-tags) in a RAM-safe way
# - read only needed columns
# - enforce dtypes to reduce memory
# - filter to our book_id universe early
# - aggregate duplicates: (book_id, tag_id) -> count_sum

# Paths
tags_path = DATA_RAW / RAW_FILES["tags"]
book_tags_path = DATA_RAW / RAW_FILES["book_tags"]

if not tags_path.exists():
    raise FileNotFoundError(f"Missing file: {tags_path}")
if not book_tags_path.exists():
    raise FileNotFoundError(f"Missing file: {book_tags_path}")

# ---- tags.csv (small) ----
# GoodBooks: tag_id, tag_name
tags_df = pd.read_csv(tags_path)
assert "tag_id" in tags_df.columns, f"tags.csv must contain tag_id, got {tags_df.columns}"
tag_name_col = "tag_name" if "tag_name" in tags_df.columns else None

print("tags_df:", tags_df.shape, "| columns:", tags_df.columns.tolist())

# ---- book_tags.csv (large) ----
# GoodBooks: goodreads_book_id, tag_id, count
# Load only necessary columns with small dtypes
bt = pd.read_csv(
    book_tags_path,
    usecols=["goodreads_book_id", "tag_id", "count"],  
    dtype={"goodreads_book_id": np.int32, "tag_id": np.int32, "count": np.int32},
)

print("book_tags raw:", bt.shape, "| dtypes:", bt.dtypes.to_dict())

# Filter to books that exist in our mapping (early!)
# book2idx is {book_id(original): book_idx}
bt = bt[bt["goodreads_book_id"].isin(book2idx)].copy()
print("book_tags filtered to mapped books:", bt.shape)

# Aggregate duplicates: for graph we want one edge per (book,tag)
bt_agg = (
    bt.groupby(["goodreads_book_id", "tag_id"], as_index=False)["count"]
      .sum()
      .rename(columns={"goodreads_book_id": "book_id"})
)

print("book_tags aggregated:", bt_agg.shape)
print("unique books with tags:", bt_agg["book_id"].nunique())
print("unique tags in relations:", bt_agg["tag_id"].nunique())

# Quick stats
tags_per_book = bt_agg.groupby("book_id")["tag_id"].nunique()
print("tags per book: mean=", float(tags_per_book.mean()),
      "median=", float(tags_per_book.median()),
      "p90=", float(tags_per_book.quantile(0.9)))

tags_df: (34252, 2) | columns: ['tag_id', 'tag_name']
book_tags raw: (999912, 3) | dtypes: {'goodreads_book_id': dtype('int32'), 'tag_id': dtype('int32'), 'count': dtype('int32')}
book_tags filtered to mapped books: (81200, 3)
book_tags aggregated: (81199, 3)
unique books with tags: 812
unique tags in relations: 6734
tags per book: mean= 99.9987684729064 median= 100.0 p90= 100.0


In [11]:
# Cell 6: Filter tags to control noise/density + convert to idx-space
# - filter tags by min book frequency
# - optionally keep global TOP_K tags
# - keep TOP_TAGS_PER_BOOK per book by count
# - output: filtered_bt with [book_id, tag_id, count, book_idx, tag_idx]

MIN_BOOK_FREQ = 50        # tag must appear in >= this many books
TOP_K_TAGS = None         # e.g. 5000, or None
TOP_TAGS_PER_BOOK = 20    # keep top tags per book by count

# Tag frequency across books
tag_book_freq = (
    bt_agg.groupby("tag_id")["book_id"]
          .nunique()
          .sort_values(ascending=False)
)

eligible_tags = tag_book_freq[tag_book_freq >= MIN_BOOK_FREQ].index
filtered_bt = bt_agg[bt_agg["tag_id"].isin(eligible_tags)].copy()
print("After MIN_BOOK_FREQ:", filtered_bt.shape, "| tags:", filtered_bt["tag_id"].nunique())

if TOP_K_TAGS is not None:
    top_tags = tag_book_freq.loc[eligible_tags].head(TOP_K_TAGS).index
    filtered_bt = filtered_bt[filtered_bt["tag_id"].isin(top_tags)].copy()
    print("After TOP_K_TAGS:", filtered_bt.shape, "| tags:", filtered_bt["tag_id"].nunique())

# Keep top tags per book by count
filtered_bt = (
    filtered_bt.sort_values(["book_id", "count"], ascending=[True, False])
               .groupby("book_id", as_index=False)
               .head(TOP_TAGS_PER_BOOK)
               .copy()
)

print("After TOP_TAGS_PER_BOOK:", filtered_bt.shape, "| tags:", filtered_bt["tag_id"].nunique())

# Convert to idx-space for graph building
filtered_bt["book_idx"] = filtered_bt["book_id"].map(book2idx).astype(np.int32)

# Create tag2idx based on filtered tags set
unique_tag_ids = np.sort(filtered_bt["tag_id"].unique())
tag2idx_local = {int(tid): i for i, tid in enumerate(unique_tag_ids)}
filtered_bt["tag_idx"] = filtered_bt["tag_id"].map(tag2idx_local).astype(np.int32)

# Stats after filtering
filtered_tags_per_book = filtered_bt.groupby("book_id")["tag_id"].nunique()
print("Filtered tags/book: mean=", float(filtered_tags_per_book.mean()),
      "median=", float(filtered_tags_per_book.median()),
      "p90=", float(filtered_tags_per_book.quantile(0.9)))

T = len(tag2idx_local)
print("Final tag vocabulary size T:", T)

filtered_bt.head()

After MIN_BOOK_FREQ: (55262, 3) | tags: 307
After TOP_TAGS_PER_BOOK: (16239, 3) | tags: 265
Filtered tags/book: mean= 19.998768472906406 median= 20.0 p90= 20.0
Final tag vocabulary size T: 265


    book_id  tag_id   count  book_idx  tag_idx
90        1   30574  167697         0      244
31        1   11305   37174         0      105
37        1   11557   34173         0      112
26        1    8717   12986         0       86
97        1   33114   12716         0      262


In [12]:
# Cell 7: Unified node index space (users + books + tags)
# - users: [0 .. U)
# - books: [U .. U+B)
# - tags : [U+B .. U+B+T)

U = num_users
B = num_items
T = len(tag2idx_local)

user_offset = 0
book_offset = U
tag_offset  = U + B
num_nodes   = U + B + T

print(f"U={U} | B={B} | T={T} | num_nodes={num_nodes}")
print("Offsets:", {"user": user_offset, "book": book_offset, "tag": tag_offset})
print("Unified indexing ready ✔")

U=53398 | B=9999 | T=265 | num_nodes=63662
Offsets: {'user': 0, 'book': np.int64(53398), 'tag': np.int64(63397)}
Unified indexing ready ✔
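
When debugging the unified graph it helps to convert between global node ids and `(type, local index)` pairs. The helpers below are a small sketch of that bookkeeping. The names `to_global` and `to_local` are hypothetical and not used elsewhere in the notebook.

```python
from typing import Tuple

def to_global(kind: str, local_idx: int, U: int, B: int, T: int) -> int:
    """Global node id for a local user/book/tag index."""
    sizes = {"user": U, "book": B, "tag": T}
    offsets = {"user": 0, "book": U, "tag": U + B}
    if not 0 <= local_idx < sizes[kind]:
        raise IndexError(f"{kind} index {local_idx} out of range [0, {sizes[kind]})")
    return offsets[kind] + local_idx

def to_local(node: int, U: int, B: int, T: int) -> Tuple[str, int]:
    """Inverse of to_global: (node type, local index) for a global node id."""
    if node < 0 or node >= U + B + T:
        raise IndexError(f"node {node} out of range [0, {U + B + T})")
    if node < U:
        return "user", node
    if node < U + B:
        return "book", node - U
    return "tag", node - U - B

# Sizes printed by the cell above
U, B, T = 53398, 9999, 265
print(to_global("book", 0, U, B, T))   # 53398
print(to_global("tag", 0, U, B, T))    # 63397
print(to_local(63661, U, B, T))        # ('tag', 264)
```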


In [13]:
# Cell 8: Build user–book edges from TRAIN interactions only (idx-space)
# - no val/test edges (avoid leakage)
# - bidirectional edges (undirected graph)

train_u = train_df["user_idx"].to_numpy(dtype=np.int64)
train_b = train_df["book_idx"].to_numpy(dtype=np.int64)

src_ub = torch.from_numpy(train_u + user_offset).long()
dst_ub = torch.from_numpy(train_b + book_offset).long()

edge_src = torch.cat([src_ub, dst_ub], dim=0)
edge_dst = torch.cat([dst_ub, src_ub], dim=0)

print("User–Book edges (directed count):", edge_src.numel())
print("Unique TRAIN interactions:", len(train_df))

User–Book edges (directed count): 9852768
Unique TRAIN interactions: 4926384


In [14]:
# Cell 9: Build book–tag edges (book ↔ tag)
# - uses filtered_bt with columns: book_idx, tag_idx
# - bidirectional edges

bt_books = filtered_bt["book_idx"].to_numpy(dtype=np.int64)
bt_tags  = filtered_bt["tag_idx"].to_numpy(dtype=np.int64)

src_bt = torch.from_numpy(bt_books + book_offset).long()
dst_bt = torch.from_numpy(bt_tags  + tag_offset).long()

edge_src = torch.cat([edge_src, src_bt, dst_bt], dim=0)
edge_dst = torch.cat([edge_dst, dst_bt, src_bt], dim=0)

print("Total edges after adding book–tag (directed count):", edge_src.numel())

Total edges after adding book–tag (directed count): 9885246
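
The printed totals can be cross-checked by hand: every undirected relation is stored as two directed edges, so the final count should be twice the number of train interactions plus twice the number of filtered book–tag pairs. The short check below only reuses numbers printed earlier in this notebook.

```python
# Numbers taken from the outputs above
n_train_interactions = 4_926_384   # rows in train_df
n_book_tag_pairs = 16_239          # rows in filtered_bt after TOP_TAGS_PER_BOOK

user_book_directed = 2 * n_train_interactions
book_tag_directed = 2 * n_book_tag_pairs
total_directed = user_book_directed + book_tag_directed

print(user_book_directed)  # 9852768
print(book_tag_directed)   # 32478
print(total_directed)      # 9885246
```

Book–tag edges make up about 0.3% of all edges, so any effect of the augmentation has to come from where those edges sit in the graph, not from their volume.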


In [15]:
# Cell 10: Combine edges into edge_index (+ optional weights)
# - edge_index: [2, E]
# - for now: uniform edge_weight=1

edge_index = torch.stack([edge_src, edge_dst], dim=0)
edge_weight = torch.ones(edge_index.size(1), dtype=torch.float32)

print("edge_index shape:", edge_index.shape)
print("num_nodes:", num_nodes)

assert edge_index.min().item() >= 0
assert edge_index.max().item() < num_nodes
print("Graph build sanity ✔")

edge_index shape: torch.Size([2, 9885246])
num_nodes: 63662
Graph build sanity ✔


In [16]:
# Cell 11: Build normalized adjacency for LightGCN
# - create sparse adjacency A (undirected)
# - apply LightGCN normalization: D^{-1/2} A D^{-1/2}
# - store as torch.sparse_coo_tensor on DEVICE

import torch

E = edge_index.size(1)
idx = edge_index.long()

# values are 1 for all edges
val = torch.ones(E, dtype=torch.float32)

A = torch.sparse_coo_tensor(idx, val, (num_nodes, num_nodes)).coalesce()

deg = torch.sparse.sum(A, dim=1).to_dense()  # [N]
deg_inv_sqrt = torch.pow(deg.clamp(min=1.0), -0.5)

# normalized values: v_ij = 1 / sqrt(deg_i * deg_j)
row, col = A.indices()
norm_val = deg_inv_sqrt[row] * A.values() * deg_inv_sqrt[col]

A_norm = torch.sparse_coo_tensor(
    A.indices(), norm_val, A.size()
).coalesce().to(DEVICE)

print("A_norm:", A_norm.shape, "nnz:", A_norm._nnz())
print("deg stats:", float(deg.min()), float(deg.mean()), float(deg.max()))

A_norm: torch.Size([63662, 63662]) nnz: 9885246
deg stats: 1.0 155.2770233154297 19472.0
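
To make the normalization concrete, here is a dense NumPy stand-in for the sparse code above, applied to a three-node path graph (toy data). Each entry of `D^{-1/2} A D^{-1/2}` equals `1 / sqrt(deg_i * deg_j)` for an existing edge and zero otherwise.

```python
import numpy as np

# Path graph 0 - 1 - 2 (toy data)
A = np.array(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    dtype=np.float64,
)

deg = A.sum(axis=1)                                # [1, 2, 1]
deg_inv_sqrt = np.clip(deg, 1.0, None) ** -0.5     # clamp guards isolated nodes
A_norm = deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :]

print(A_norm.round(4))
# Edge (0, 1): 1 / sqrt(1 * 2) ≈ 0.7071
```

The result stays symmetric, and high-degree nodes contribute less per edge. The same logic explains the spread of `A_norm` values reported in the next cell.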


In [19]:
# Cell 11b: Sanity check for normalization
# - ensure no NaN/Inf in A_norm values
# - run one sparse mm to verify it works on DEVICE

vals = A_norm.values()
print("A_norm values: min/mean/max =", float(vals.min()), float(vals.mean()), float(vals.max()))
print("NaN in values:", torch.isnan(vals).any().item(), "| Inf in values:", torch.isinf(vals).any().item())

# Use a random feature matrix here: `model` is only defined in Cell 12
with torch.no_grad():
    x = torch.randn(num_nodes, 64, device=DEVICE)
    y = torch.sparse.mm(A_norm, x)
    print("propagate output:", y.shape, "NaN:", torch.isnan(y).any().item(), "Inf:", torch.isinf(y).any().item())

A_norm values: min/mean/max = 0.00025242200354114175 0.003876851871609688 0.09622504562139511
NaN in values: False | Inf in values: False
propagate output: torch.Size([63662, 64]) NaN: False Inf: False


In [20]:
# Cell 12: Define LightGCN model
# - learn embeddings for all nodes (users+books+tags)
# - propagate K layers using normalized adjacency
# - final embedding is mean of layer-wise embeddings

class LightGCN(nn.Module):
    def __init__(self, num_nodes: int, emb_dim: int = 64, num_layers: int = 3, dropout: float = 0.0):
        super().__init__()
        self.num_nodes = num_nodes
        self.emb_dim = emb_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.emb = nn.Embedding(num_nodes, emb_dim)
        nn.init.xavier_uniform_(self.emb.weight)

    def propagate(self, A_norm: torch.Tensor) -> torch.Tensor:
        x0 = self.emb.weight
        xs = [x0]
        x = x0
        for _ in range(self.num_layers):
            x = torch.sparse.mm(A_norm, x)
            if self.dropout > 0:
                x = F.dropout(x, p=self.dropout, training=self.training)
            xs.append(x)
        x_out = torch.stack(xs, dim=0).mean(dim=0)  # [N, D]
        return x_out

model = LightGCN(num_nodes=num_nodes, emb_dim=64, num_layers=3, dropout=0.0).to(DEVICE)
print(model)

LightGCN(
  (emb): Embedding(63662, 64)
)
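
For reference, the `propagate` method above corresponds to the following update, written with $K$ = `num_layers`, $E^{(0)}$ the learned embedding table, and $\tilde{A}$ the normalized adjacency from Cell 11. The scoring rule in the last line is the dot product used later in evaluation and in the BPR loss. Dropout is omitted because it is set to 0.0 here.

```latex
\begin{aligned}
\tilde{A} &= D^{-1/2} A \, D^{-1/2}, \\
E^{(k+1)} &= \tilde{A} \, E^{(k)}, \qquad k = 0, \dots, K-1, \\
E &= \frac{1}{K+1} \sum_{k=0}^{K} E^{(k)}, \\
\hat{y}_{ui} &= e_u^{\top} e_i .
\end{aligned}
```

There are no weight matrices or nonlinearities: the only trainable parameters are the rows of $E^{(0)}$, one per user, book, and tag node.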


In [21]:
# Cell 13: Build user->positive sets
# - train_pos: for negative sampling (avoid sampling seen positives)
# - val_gt/test_gt: ground-truth one item per user for evaluation (leave-one-out)

train_pos = defaultdict(set)
for u, i in zip(train_df["user_idx"].to_numpy(), train_df["book_idx"].to_numpy()):
    train_pos[int(u)].add(int(i))

# Ground truth: exactly 1 item per user (by construction)
val_gt = val_df.set_index("user_idx")["book_idx"].to_dict()
test_gt = test_df.set_index("user_idx")["book_idx"].to_dict()

print("train_pos users:", len(train_pos))
print("val_gt users:", len(val_gt), "test_gt users:", len(test_gt))

# sanity: each user should exist in val/test
assert len(val_gt) == num_users and len(test_gt) == num_users, "Expected 1 val and 1 test item per user"

train_pos users: 53398
val_gt users: 53398 test_gt users: 53398


In [30]:
# Cell 14: Ranking evaluation (leave-one-out) on validation
# - compute scores over ALL items (B=9999) batched by users
# - filter already-seen train positives
# - metrics: Hit@K, NDCG@K

@torch.no_grad()
def eval_ranking(all_emb: torch.Tensor,
                 gt: dict[int, int],
                 train_pos: dict[int, set],
                 K_list=(10, 20, 50),
                 user_batch_size: int = 512) -> dict[str, float]:
    """
    all_emb: [N, D] final node embeddings
    gt: {user_idx: true_item_idx}
    train_pos: {user_idx: set(seen_item_idx)} used for filtering
    """
    model.eval()

    # Extract user and item embeddings (books only)
    user_emb = all_emb[user_offset:user_offset + U]          # [U, D]
    item_emb = all_emb[book_offset:book_offset + B]          # [B, D]

    # Prepare tensors on DEVICE
    user_emb = user_emb.to(DEVICE)
    item_emb = item_emb.to(DEVICE)

    users = np.array(sorted(gt.keys()), dtype=np.int32)
    hits = {K: 0 for K in K_list}
    ndcgs = {K: 0.0 for K in K_list}

    for start in range(0, len(users), user_batch_size):
        batch_users = users[start:start + user_batch_size]
        bu = torch.from_numpy(batch_users.astype(np.int64)).to(DEVICE)

        # scores: [batch, B]
        scores = user_emb[bu] @ item_emb.t()

        # filter training positives: push their scores to a large negative value
        for row, u in enumerate(batch_users):
            seen = train_pos.get(int(u), None)
            if seen:
                seen_idx = torch.tensor(list(seen), device=DEVICE, dtype=torch.long)
                scores[row, seen_idx] = -1e9

        # get top maxK
        maxK = max(K_list)
        topk_items = torch.topk(scores, k=maxK, dim=1).indices.cpu().numpy()  # [batch, maxK]

        for row, u in enumerate(batch_users):
            true_i = int(gt[int(u)])
            ranking = topk_items[row]  # top maxK item indices

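            # find the position of the true item in the top-maxK list, if present
            # pos is 0-based, so the 1-based rank is pos + 1 and the gain is 1 / log2(rank + 1)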
            pos = np.where(ranking == true_i)[0]
            for K in K_list:
                if len(pos) > 0 and pos[0] < K:
                    hits[K] += 1
                    ndcgs[K] += 1.0 / np.log2(pos[0] + 2.0)

    n = len(users)
    metrics = {}
    for K in K_list:
        metrics[f"Hit@{K}"] = hits[K] / n
        metrics[f"NDCG@{K}"] = ndcgs[K] / n

    return metrics

print("Eval function ready.")


Eval function ready.
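
With a single held-out item per user, both metrics reduce to simple per-user quantities: Hit@K is 1 if the item is in the top K, and NDCG@K is `1 / log2(pos + 2)` for a 0-based position `pos` (the ideal DCG is 1). The helper below is a standalone sketch with a hypothetical name, `hit_and_ndcg_at_k`, and toy rankings.

```python
import math

def hit_and_ndcg_at_k(ranked_items, true_item, k):
    """Leave-one-out Hit@K and NDCG@K for one user with one relevant item."""
    top_k = list(ranked_items[:k])
    if true_item not in top_k:
        return 0.0, 0.0
    pos = top_k.index(true_item)          # 0-based position in the ranking
    return 1.0, 1.0 / math.log2(pos + 2)

ranked = [42, 7, 13, 99, 5]               # toy ranking, best first
print(hit_and_ndcg_at_k(ranked, 13, 3))   # (1.0, 0.5): position 2, 1/log2(4)
print(hit_and_ndcg_at_k(ranked, 99, 3))   # (0.0, 0.0): outside the top 3
print(hit_and_ndcg_at_k(ranked, 42, 3))   # (1.0, 1.0): ranked first
```

Averaging these per-user values over all evaluated users gives the numbers that `eval_ranking` reports.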


In [32]:
# Cell 15: Optimized LightGCN training with block-wise propagation
# What we do:
# - Train LightGCN with BPR loss
# - Speed up negative sampling: draw K_try candidates per user at once and keep the first non-positive
# - Main speed-up: call propagate(A_norm) once per block of ACC_STEPS batches
#   instead of once per batch (which would be extremely slow)
# - Average the BPR losses of all batches in a block on that single graph,
#   then call backward() once and take one optimizer step per block
# - After each epoch, compute ranking metrics on validation (Hit/NDCG@K)

# ---------------------------
# Fast negative sampling
# ---------------------------
def sample_negatives_fast(users_np: np.ndarray, train_pos: dict[int, set], B: int, K_try: int = 20) -> np.ndarray:
    users_np = users_np.astype(np.int32)
    out = np.empty(len(users_np), dtype=np.int32)
    cand = np.random.randint(0, B, size=(len(users_np), K_try), dtype=np.int32)

    for r, u in enumerate(users_np):
        seen = train_pos[int(u)]
        chosen = None
        row = cand[r]
        for neg in row:
            neg = int(neg)
            if neg not in seen:
                chosen = neg
                break
        if chosen is None:
            while True:
                neg = int(np.random.randint(0, B))
                if neg not in seen:
                    chosen = neg
                    break
        out[r] = chosen
    return out

def bpr_loss(u_emb, pos_emb, neg_emb):
    pos_scores = (u_emb * pos_emb).sum(dim=1)
    neg_scores = (u_emb * neg_emb).sum(dim=1)
    return -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-8).mean()

# ---------------------------
# Hyperparams (safe + fast)
# ---------------------------
EMB_DIM = 64
NUM_LAYERS = 3
LR = 2e-3            
WEIGHT_DECAY = 0.0  

BATCH_SIZE = 4096
ACC_STEPS = 16
EPOCHS = 20

USER_BATCH_EVAL = 1024

# Early stopping
PATIENCE = 4
MIN_DELTA = 1e-5

print("Training config:",
      {"EMB_DIM": EMB_DIM, "NUM_LAYERS": NUM_LAYERS, "LR": LR,
       "WEIGHT_DECAY": WEIGHT_DECAY,
       "BATCH_SIZE": BATCH_SIZE, "ACC_STEPS": ACC_STEPS, "EPOCHS": EPOCHS,
       "USER_BATCH_EVAL": USER_BATCH_EVAL,
       "PATIENCE": PATIENCE, "MIN_DELTA": MIN_DELTA})

# ---------------------------
# Model + optimizer
# ---------------------------
model = LightGCN(num_nodes=num_nodes, emb_dim=EMB_DIM, num_layers=NUM_LAYERS, dropout=0.0).to(DEVICE)
opt = Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

# Train arrays (idx-space)
train_users = train_df["user_idx"].to_numpy(dtype=np.int32)
train_items = train_df["book_idx"].to_numpy(dtype=np.int32)
num_inter = len(train_users)
perm = np.arange(num_inter)

history = []
best_ndcg10 = -1.0
best_epoch = 0
pat_cnt = 0

# ---------------------------
# Training
# ---------------------------
for epoch in range(1, EPOCHS + 1):
    t0 = time.time()
    model.train()
    np.random.shuffle(perm)

    epoch_loss = 0.0
    n_batches = 0
    n_steps = 0

    block_size = BATCH_SIZE * ACC_STEPS

    for block_start in range(0, num_inter, block_size):
        # 1) propagate once per block (graph attached)
        all_emb = model.propagate(A_norm)  # [N, D]

        # 2) build a single loss tensor for the whole block
        loss_block = 0.0

        # figure out how many inner batches
        remaining = num_inter - block_start
        n_inner = min(ACC_STEPS, int(np.ceil(remaining / BATCH_SIZE)))

        for inner in range(n_inner):
            start = block_start + inner * BATCH_SIZE
            batch_idx = perm[start:start + BATCH_SIZE]
            bu = train_users[batch_idx]
            bi = train_items[batch_idx]
            bn = sample_negatives_fast(bu, train_pos, B, K_try=20)

            u_nodes = torch.from_numpy(bu.astype(np.int64) + user_offset).to(DEVICE)
            p_nodes = torch.from_numpy(bi.astype(np.int64) + book_offset).to(DEVICE)
            n_nodes = torch.from_numpy(bn.astype(np.int64) + book_offset).to(DEVICE)

            u_emb = all_emb[u_nodes]
            p_emb = all_emb[p_nodes]
            n_emb = all_emb[n_nodes]

            loss_block = loss_block + bpr_loss(u_emb, p_emb, n_emb)
            n_batches += 1

        loss_block = loss_block / n_inner

        opt.zero_grad(set_to_none=True)
        loss_block.backward()
        opt.step()

        epoch_loss += float(loss_block.item())
        n_steps += 1

        del all_emb, loss_block

    # ---- Validation ----
    model.eval()
    with torch.no_grad():
        all_emb_eval = model.propagate(A_norm)

    val_metrics = eval_ranking(
        all_emb_eval,
        val_gt,
        train_pos,
        K_list=(10, 20, 50),
        user_batch_size=USER_BATCH_EVAL,
    )

    dt = time.time() - t0
    row = {
        "epoch": epoch,
        "loss": epoch_loss / max(n_steps, 1),
        "time_sec": dt,
        **val_metrics
    }
    history.append(row)

    print(
        f"Epoch {epoch:02d} | loss={row['loss']:.4f} | steps={n_steps} | time={dt:.1f}s | "
        f"Hit@10={row['Hit@10']:.4f} NDCG@10={row['NDCG@10']:.4f} | "
        f"Hit@50={row['Hit@50']:.4f} NDCG@50={row['NDCG@50']:.4f}"
    )

    # ---- Early stopping on NDCG@10 ----
    cur = row["NDCG@10"]
    if cur > best_ndcg10 + MIN_DELTA:
        best_ndcg10 = cur
        best_epoch = epoch
        pat_cnt = 0
    else:
        pat_cnt += 1

    print(f"Best NDCG@10 so far: {best_ndcg10:.5f} (epoch {best_epoch}), patience={pat_cnt}/{PATIENCE}")

    if pat_cnt >= PATIENCE:
        print("Early stopping triggered.")
        break

hist_df = pd.DataFrame(history)
hist_df


Training config: {'EMB_DIM': 64, 'NUM_LAYERS': 3, 'LR': 0.002, 'WEIGHT_DECAY': 0.0, 'BATCH_SIZE': 4096, 'ACC_STEPS': 16, 'EPOCHS': 20, 'USER_BATCH_EVAL': 1024, 'PATIENCE': 4, 'MIN_DELTA': 1e-05}
Epoch 01 | loss=0.5740 | steps=76 | time=35.1s | Hit@10=0.0462 NDCG@10=0.0262 | Hit@50=0.1294 NDCG@50=0.0442
Best NDCG@10 so far: 0.02619 (epoch 1), patience=0/4
Epoch 02 | loss=0.4058 | steps=76 | time=36.6s | Hit@10=0.0488 NDCG@10=0.0273 | Hit@50=0.1345 NDCG@50=0.0458
Best NDCG@10 so far: 0.02729 (epoch 2), patience=0/4
Epoch 03 | loss=0.3694 | steps=76 | time=35.2s | Hit@10=0.0548 NDCG@10=0.0297 | Hit@50=0.1434 NDCG@50=0.0487
Best NDCG@10 so far: 0.02968 (epoch 3), patience=0/4
Epoch 04 | loss=0.3368 | steps=76 | time=34.9s | Hit@10=0.0583 NDCG@10=0.0311 | Hit@50=0.1515 NDCG@50=0.0511
Best NDCG@10 so far: 0.03115 (epoch 4), patience=0/4
Epoch 05 | loss=0.3149 | steps=76 | time=35.6s | Hit@10=0.0596 NDCG@10=0.0318 | Hit@50=0.1564 NDCG@50=0.0526
Best NDCG@10 so far: 0.03184 (epoch 5), patience

Unnamed: 0,epoch,loss,time_sec,Hit@10,NDCG@10,Hit@20,NDCG@20,Hit@50,NDCG@50
0,1,0.574,35.127548,0.046163,0.026194,0.074628,0.03335,0.129443,0.044178
1,2,0.405779,36.56183,0.048785,0.027295,0.078224,0.034695,0.134537,0.045826
2,3,0.369434,35.176867,0.054796,0.029682,0.083561,0.036877,0.143414,0.048676
3,4,0.33682,34.910926,0.058298,0.031146,0.087794,0.038543,0.151485,0.051065
4,5,0.314924,35.648708,0.059553,0.031837,0.090415,0.039579,0.156392,0.052573
5,6,0.296569,35.744465,0.061313,0.032875,0.093281,0.040883,0.162497,0.054505
6,7,0.277572,35.580034,0.063935,0.034419,0.098,0.042961,0.170044,0.057144
7,8,0.258024,35.498808,0.067943,0.036415,0.102906,0.045183,0.179464,0.060258
8,9,0.241962,36.143474,0.070639,0.037942,0.10757,0.047196,0.186861,0.062818
9,10,0.229626,37.095323,0.072063,0.038801,0.111334,0.048664,0.192498,0.064634


In [33]:
# Cell 16: Final evaluation on TEST (leave-one-out)
# - same ranking metrics as on val
# - ground truth: test_gt (1 item per user)
# - filter: train_pos (do NOT recommend items seen in train)

model.eval()
with torch.no_grad():
    all_emb_test = model.propagate(A_norm)

test_metrics = eval_ranking(
    all_emb_test,
    test_gt,
    train_pos,
    K_list=(10, 20, 50),
    user_batch_size=USER_BATCH_EVAL,
)

print("TEST metrics (augmented graph):")
for k, v in test_metrics.items():
    print(f"  {k}: {v:.6f}")

TEST metrics (augmented graph):
  Hit@10: 0.082775
  NDCG@10: 0.044579
  Hit@20: 0.127439
  NDCG@20: 0.055769
  Hit@50: 0.220982
  NDCG@50: 0.074206


In [34]:
# Cell 17: Compare augmented TEST metrics vs baseline reference
# - baseline_ref: numbers from your baseline (user–book LightGCN + hard negatives)
# - show absolute deltas

baseline_ref = {
    "Hit@10": 0.084,
    "NDCG@10": 0.045,
    "Hit@20": 0.129,
    "NDCG@20": 0.057,
    "Hit@50": 0.221,
    "NDCG@50": 0.075,
}

rows = []
for m in ["Hit@10","NDCG@10","Hit@20","NDCG@20","Hit@50","NDCG@50"]:
    aug = float(test_metrics.get(m, np.nan))
    base = float(baseline_ref.get(m, np.nan))
    rows.append({
        "metric": m,
        "baseline": base,
        "augmented_test": aug,
        "delta_abs": aug - base
    })

cmp_df = pd.DataFrame(rows)
cmp_df

Unnamed: 0,metric,baseline,augmented_test,delta_abs
0,Hit@10,0.084,0.082775,-0.001225
1,NDCG@10,0.045,0.044579,-0.000421
2,Hit@20,0.129,0.127439,-0.001561
3,NDCG@20,0.057,0.055769,-0.001231
4,Hit@50,0.221,0.220982,-1.8e-05
5,NDCG@50,0.075,0.074206,-0.000794


In [35]:
# Cell 18: Diagnose tag coverage (why augmentation may not help)
# - how many books have >=1 tag edge?
# - what fraction of catalog is covered?
# - distribution of tags per book (after filtering)

books_with_tags = filtered_bt["book_idx"].nunique()
coverage = books_with_tags / B

tags_per_book = filtered_bt.groupby("book_idx")["tag_idx"].nunique()

print(f"Books with >=1 tag edge: {books_with_tags} / {B}  ({coverage*100:.2f}%)")
print("Tags per tagged-book: mean=", float(tags_per_book.mean()),
      "median=", float(tags_per_book.median()),
      "p90=", float(tags_per_book.quantile(0.9)),
      "max=", int(tags_per_book.max()))

Books with >=1 tag edge: 812 / 9999  (8.12%)
Tags per tagged-book: mean= 19.998768472906406 median= 20.0 p90= 20.0 max= 20


## Stage Summary: Graph Augmentation v1 (user–book–tag)

### Goal of this stage
To test whether **augmenting a pure user–item graph with content-based relations**
(book–tag edges) can **break the performance ceiling of collaborative filtering**
on the **GoodBooks-10k** dataset.

## What was done

### Baseline graph (reference)
- A **bipartite user–book graph** trained with **LightGCN + BPR loss**
- Proper **leave-one-out split per user**:
  - train: all remaining interactions
  - validation: 1 held-out item per user
  - test: 1 held-out item per user
- Evaluation with **ranking metrics only** (Hit@K, NDCG@K)
- Previously established CF ceiling:
  - **NDCG@10 ≈ 0.045**
  - **Hit@10 ≈ 0.084**

This baseline serves as a **fixed reference point** for all subsequent experiments.
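
For reference, the sketch below shows how Hit@K and NDCG@K behave under leave-one-out. It is not the notebook's `eval_ranking` (whose implementation is not shown in this section); the function names and toy scores are illustrative assumptions. With a single held-out item per user, NDCG@K reduces to `1 / log2(rank + 1)` when the item's 1-indexed rank is at most K, and 0 otherwise.

```python
import math

def rank_of_target(scores, target, seen):
    """1-indexed rank of `target` among items not in `seen` (higher score = better)."""
    target_score = scores[target]
    better = sum(
        1
        for item, s in enumerate(scores)
        if item != target and item not in seen and s > target_score
    )
    return better + 1

def hit_at_k(rank, k):
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    # One relevant item: the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

scores = [0.9, 0.1, 0.8, 0.7, 0.3]  # toy scores for 5 items
seen = {0}                          # item 0 is a train positive and is filtered out
rank = rank_of_target(scores, target=3, seen=seen)
print(rank, hit_at_k(rank, 2), round(ndcg_at_k(rank, 2), 4))
```

Filtering `seen` before ranking matters: without it, the train positive (item 0) would push the held-out item one position down.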

### Graph Augmentation v1
- Introduced **content nodes (tags)** into the graph
- Constructed an augmented topology:

user — book — tag — book


- Data sources:
  - `tags.csv`
  - `book_tags.csv`
- Tag filtering strategy:
  - `MIN_BOOK_FREQ = 50`
  - `TOP_TAGS_PER_BOOK = 20`
- Final graph statistics:
  - users: 53,398
  - books: 9,999
  - tags: 265
  - total nodes: ~63k
- Training setup:
  - LightGCN on the augmented graph
  - same BPR loss
  - **train-only edges** (no validation/test leakage)

## Results (TEST set)

**Augmented graph (v1):**
- Hit@10 = **0.0828**
- NDCG@10 = **0.0446**
- Hit@50 = **0.2210**
- NDCG@50 = **0.0742**

**Comparison with baseline:**

| Metric   | Baseline | Augmented v1 | Δ |
|----------|----------|--------------|---|
| Hit@10   | 0.0840   | 0.0828       | −0.0012 |
| NDCG@10  | 0.0450   | 0.0446       | −0.0004 |
| Hit@50   | 0.2210   | 0.2210       | ≈ 0 |
| NDCG@50  | 0.0750   | 0.0742       | −0.0008 |

The augmented model performs **on par with the baseline**: every difference is below 0.002
in absolute terms, which is plausibly within run-to-run noise for a single training run.
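
One rough way to put a scale on these differences is the binomial standard error of a hit rate. The sketch below is an approximation under stated assumptions: the baseline reference is assumed to have been measured on the same 53,398 test users, and the two runs are treated as independent samples (a paired per-user comparison would be more precise). The helper name is hypothetical.

```python
import math

def hit_rate_se(p, n):
    """Binomial standard error of a hit rate p measured over n users."""
    return math.sqrt(p * (1.0 - p) / n)

n_users = 53_398             # test users, one held-out item each
base, aug = 0.084, 0.082775  # Hit@10: baseline reference vs. augmented v1

# Unpaired approximation of the standard error of the difference.
se_diff = math.sqrt(hit_rate_se(base, n_users) ** 2 + hit_rate_se(aug, n_users) ** 2)
z = (aug - base) / se_diff
print(f"SE(diff)={se_diff:.4f}  z={z:.2f}")
```

Under this approximation the Hit@10 gap is smaller than one standard error of the difference, which is in line with reading v1 as a tie rather than a regression.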

## Why augmentation v1 did not improve performance

### A key diagnostic finding:

- Only **812 out of 9,999 books** are connected to at least one tag
- Tag coverage of the catalog: **≈ 8.1%**

```text
Books with ≥1 tag edge: 812 / 9999 (8.12%)
Tags per tagged book: mean ≈ 20
```

## As a result:

- For over 90% of items, the graph remains purely user–book
- LightGCN relies almost entirely on collaborative signals
- The content signal is too sparse to affect global ranking

This empirically confirms the baseline conclusion:

> Architectural complexity alone does not break the CF ceiling
> unless the graph is enriched with sufficiently dense semantic relations.

## Key takeaway

Graph augmentation as a concept is valid, but:

- ❌ the current tag configuration is too sparse
- ❌ content edges affect only a small fraction of the catalog
- ✔ the training and evaluation pipeline is leakage-free
- ✔ the CF ceiling is faithfully reproduced

This stage validates the engineering setup, but is not the final solution.

## Next steps: Graph Augmentation v2

The next experiment focuses on strengthening the graph itself, not the model.

## Plan for v2:

- Increase tag coverage
  - reduce `MIN_BOOK_FREQ` (e.g., 50 → 10)
  - increase `TOP_TAGS_PER_BOOK` (20 → 50)
  - target: ≥ 50% of books connected to tags
- (Optional) Weighted book–tag edges
  - use the `count` field from `book_tags.csv`
  - edge weights such as `log1p(count)` or `sqrt(count)`
- Re-train the same LightGCN
  - identical splits
  - identical metrics
  - clean comparison with baseline and v1

Only after this step does it make sense to explore:

- heterogeneous or inductive GNNs (e.g., R-GCN, GraphSAGE)
- or hybrid content-aware recommenders

In [43]:
# Cell 5c: Reload raw book_tags.csv WITHOUT any filtering and map goodreads_book_id -> book_id
# Why:
# - book_tags.csv uses goodreads_book_id
# - our interactions/mapping use book_id
# - the earlier filter compared goodreads_book_id values against book_id keys,
#   so only 812 books kept any tag edge

books_path = DATA_RAW / "books.csv"
book_tags_path = DATA_RAW / RAW_FILES["book_tags"]
tags_path = DATA_RAW / RAW_FILES["tags"]

books_df = pd.read_csv(books_path)
tags_df = pd.read_csv(tags_path)

assert {"book_id", "goodreads_book_id"}.issubset(books_df.columns), \
    "books.csv must contain book_id and goodreads_book_id"
assert {"tag_id", "tag_name"}.issubset(tags_df.columns), \
    "tags.csv must contain tag_id and tag_name"

# load RAW (no filtering!)
book_tags_raw = pd.read_csv(book_tags_path)

# schema checks
assert {"goodreads_book_id", "tag_id", "count"}.issubset(book_tags_raw.columns), \
    "book_tags.csv must contain goodreads_book_id, tag_id, count"

book_tags_raw["goodreads_book_id"] = book_tags_raw["goodreads_book_id"].astype(int)
book_tags_raw["tag_id"] = book_tags_raw["tag_id"].astype(int)
book_tags_raw["count"] = book_tags_raw["count"].astype(int)

print("book_tags_raw:", book_tags_raw.shape)
print("unique goodreads_book_id in raw:", book_tags_raw["goodreads_book_id"].nunique())

# mapping: goodreads_book_id -> book_id
g2b = dict(zip(books_df["goodreads_book_id"].astype(int), books_df["book_id"].astype(int)))

book_tags_raw["book_id"] = book_tags_raw["goodreads_book_id"].map(g2b)
missing = book_tags_raw["book_id"].isna().sum()
print("missing g2b mappings:", missing)

book_tags_raw = book_tags_raw.dropna(subset=["book_id"]).copy()
book_tags_raw["book_id"] = book_tags_raw["book_id"].astype(int)

print("book_tags_raw after mapping:", book_tags_raw.shape)
print("unique book_id after mapping:", book_tags_raw["book_id"].nunique())


book_tags_raw: (999912, 3)
unique goodreads_book_id in raw: 10000
missing g2b mappings: 0
book_tags_raw after mapping: (999912, 4)
unique book_id after mapping: 10000


In [44]:
# Cell 5d: Correct filtering to our model catalog (book_id space)
# - keep only rows where mapped book_id is present in our book2idx keys (book_id)
# Result: book_tags_df_fixed should cover ~all books (near 9999)

book_ids_in_mapping = set(book2idx.keys())

book_tags_df_fixed = book_tags_raw[book_tags_raw["book_id"].isin(book_ids_in_mapping)].copy()

print("book_tags_df_fixed:", book_tags_df_fixed.shape)
print("unique books with tags (fixed):", book_tags_df_fixed["book_id"].nunique())
print("unique tags in relations (fixed):", book_tags_df_fixed["tag_id"].nunique())

book_tags_df_fixed: (999856, 4)
unique books with tags (fixed): 9999
unique tags in relations (fixed): 34225


In [45]:
# Cell 19 (FIXED v2): Re-filter book_tags using CORRECT book_id space (full coverage)
# What we do:
# 1) Use book_tags_df_fixed (already mapped goodreads_book_id -> book_id and filtered to our catalog)
# 2) Keep tags that appear in >= MIN_BOOK_FREQ books
# 3) For each book, keep TOP_TAGS_PER_BOOK strongest tags by 'count'
# 4) Build local tag vocabulary (tag2idx_local2) and a compact edge table filtered_bt2
#
# Output:
# - filtered_bt2 with columns: [book_idx, tag_idx, count]
# - tag2idx_local2 dict + stats (coverage should be ~9999 books)

MIN_BOOK_FREQ = 10        # tag must appear in >= this many books (tunable)
TOP_K_TAGS = None         # optional: keep only top-K tags globally by freq (start with None)
TOP_TAGS_PER_BOOK = 50    # keep top tags per book by count

assert "book_id" in book_tags_df_fixed.columns, "book_tags_df_fixed must contain book_id"
assert "tag_id" in book_tags_df_fixed.columns and "count" in book_tags_df_fixed.columns, \
    "book_tags_df_fixed must contain tag_id and count"

# ---- tag frequency by number of distinct books ----
tag_book_freq = (
    book_tags_df_fixed.groupby("tag_id")["book_id"]
    .nunique()
    .sort_values(ascending=False)
)

eligible_tags = tag_book_freq[tag_book_freq >= MIN_BOOK_FREQ].index
filtered2 = book_tags_df_fixed[book_tags_df_fixed["tag_id"].isin(eligible_tags)].copy()
print("After MIN_BOOK_FREQ:", filtered2.shape, "| tags:", filtered2["tag_id"].nunique())

if TOP_K_TAGS is not None:
    top_tags = tag_book_freq.loc[eligible_tags].head(TOP_K_TAGS).index
    filtered2 = filtered2[filtered2["tag_id"].isin(top_tags)].copy()
    print("After TOP_K_TAGS:", filtered2.shape, "| tags:", filtered2["tag_id"].nunique())

# ---- per-book top tags by count ----
filtered2 = (
    filtered2.sort_values(["book_id", "count"], ascending=[True, False])
             .groupby("book_id", as_index=False)
             .head(TOP_TAGS_PER_BOOK)
             .copy()
)

print("After TOP_TAGS_PER_BOOK:", filtered2.shape, "| tags:", filtered2["tag_id"].nunique())

# ---- map to idx space (book_id -> book_idx) ----
filtered2["book_idx"] = filtered2["book_id"].map(book2idx)

# Safety: drop any unmapped (shouldn't happen, but keep robust)
filtered2 = filtered2.dropna(subset=["book_idx"]).copy()
filtered2["book_idx"] = filtered2["book_idx"].astype(np.int64)

# ---- build local tag vocabulary ----
unique_tag_ids2 = np.sort(filtered2["tag_id"].unique())
tag2idx_local2 = {int(tid): i for i, tid in enumerate(unique_tag_ids2)}
filtered2["tag_idx"] = filtered2["tag_id"].map(tag2idx_local2).astype(np.int64)

# final compact edges table
filtered_bt2 = filtered2[["book_idx", "tag_idx", "count"]].copy()

# ---- diagnostics ----
books_with_tags2 = int(filtered_bt2["book_idx"].nunique())
coverage2 = books_with_tags2 / num_items

tags_per_book2 = filtered_bt2.groupby("book_idx")["tag_idx"].nunique()
print("Final tag vocab size T2:", len(tag2idx_local2))
print(f"Books with >=1 tag edge: {books_with_tags2} / {num_items} ({coverage2*100:.2f}%)")
print("Tags per tagged-book: mean=", float(tags_per_book2.mean()),
      "median=", float(tags_per_book2.median()),
      "p90=", float(tags_per_book2.quantile(0.9)),
      "max=", int(tags_per_book2.max()))

filtered_bt2.head()

After MIN_BOOK_FREQ: (938635, 4) | tags: 5503
After TOP_TAGS_PER_BOOK: (499860, 4) | tags: 5014
Final tag vocab size T2: 5014
Books with >=1 tag edge: 9999 / 9999 (100.00%)
Tags per tagged-book: mean= 49.99069906990699 median= 50.0 p90= 50.0 max= 50


Unnamed: 0,book_idx,tag_idx,count
619294,0,1688,50755
619295,0,1271,35418
619296,0,4937,25968
619297,0,1725,13819
619298,0,1455,12985


In [46]:
# Cell 20 (FIXED): Build augmented graph v2 (full tag coverage)
# - Nodes: users [0..U), books [U..U+B), tags [U+B..U+B+T2)
# - Edges:
#   * user<->book from TRAIN only (no leakage)
#   * book<->tag from filtered_bt2
# - Output: A_norm2 (normalized sparse adjacency on DEVICE), plus offsets and sizes

U = num_users
B = num_items
T2 = len(tag2idx_local2)

user_offset2 = 0
book_offset2 = U
tag_offset2  = U + B
num_nodes2   = U + B + T2

print(f"U={U} | B={B} | T2={T2} | num_nodes2={num_nodes2}")
print("Offsets:", {"user": user_offset2, "book": book_offset2, "tag": tag_offset2})

# ---- user-book edges from TRAIN (idx-space) ----
train_u = train_df["user_idx"].to_numpy(dtype=np.int64)
train_b = train_df["book_idx"].to_numpy(dtype=np.int64)

src_ub = torch.from_numpy(train_u + user_offset2).long()
dst_ub = torch.from_numpy(train_b + book_offset2).long()

edge_src = torch.cat([src_ub, dst_ub], dim=0)
edge_dst = torch.cat([dst_ub, src_ub], dim=0)

# ---- book-tag edges from filtered_bt2 ----
bt_books = filtered_bt2["book_idx"].to_numpy(dtype=np.int64)
bt_tags  = filtered_bt2["tag_idx"].to_numpy(dtype=np.int64)

src_bt = torch.from_numpy(bt_books + book_offset2).long()
dst_bt = torch.from_numpy(bt_tags  + tag_offset2).long()

edge_src = torch.cat([edge_src, src_bt, dst_bt], dim=0)
edge_dst = torch.cat([edge_dst, dst_bt, src_bt], dim=0)

edge_index2 = torch.stack([edge_src, edge_dst], dim=0)
E2 = edge_index2.size(1)

print("edge_index2:", edge_index2.shape, "| directed edges:", E2)

# ---- build normalized adjacency ----
val = torch.ones(E2, dtype=torch.float32)
A2 = torch.sparse_coo_tensor(edge_index2, val, (num_nodes2, num_nodes2)).coalesce()

deg2 = torch.sparse.sum(A2, dim=1).to_dense()
deg_inv_sqrt2 = torch.pow(deg2.clamp(min=1.0), -0.5)

row, col = A2.indices()
norm_val = deg_inv_sqrt2[row] * A2.values() * deg_inv_sqrt2[col]

A_norm2 = torch.sparse_coo_tensor(A2.indices(), norm_val, A2.size()).coalesce().to(DEVICE)

print("A_norm2:", tuple(A_norm2.shape), "nnz:", A_norm2._nnz())
print("deg stats:", float(deg2.min()), float(deg2.mean()), float(deg2.max()))
print("Graph2 build sanity ✔")

U=53398 | B=9999 | T2=5014 | num_nodes2=68411
Offsets: {'user': 0, 'book': np.int64(53398), 'tag': np.int64(63397)}
edge_index2: torch.Size([2, 10852488]) | directed edges: 10852488
A_norm2: (68411, 68411) nnz: 10852482
deg stats: 1.0 158.6365966796875 19502.0
Graph2 build sanity ✔


In [48]:
# Cell 22: Train on Graph2 (full tag coverage) with early stopping (val NDCG@10)
# - runs up to EPOCHS epochs but stops if no improvement
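# - keeps a deep copy of the best model state in memory (on the model's device)
# - at the end: evaluates on TEST using the best epoch

import copy  # used by copy.deepcopy(model2.state_dict()) below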

EMB_DIM = 64
NUM_LAYERS = 3
LR = 2e-3
WEIGHT_DECAY = 0.0

BATCH_SIZE = 4096
ACC_STEPS = 16
EPOCHS = 30

USER_BATCH_EVAL = 1024

PATIENCE = 5
MIN_DELTA = 1e-4

print("Training config:", {
    "EMB_DIM": EMB_DIM, "NUM_LAYERS": NUM_LAYERS, "LR": LR,
    "BATCH_SIZE": BATCH_SIZE, "ACC_STEPS": ACC_STEPS, "EPOCHS": EPOCHS,
    "USER_BATCH_EVAL": USER_BATCH_EVAL,
    "PATIENCE": PATIENCE, "MIN_DELTA": MIN_DELTA
})

model2 = LightGCN(num_nodes=num_nodes2, emb_dim=EMB_DIM, num_layers=NUM_LAYERS, dropout=0.0).to(DEVICE)
opt2 = Adam(model2.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

train_users = train_df["user_idx"].to_numpy(dtype=np.int32)
train_items = train_df["book_idx"].to_numpy(dtype=np.int32)
num_inter = len(train_users)
perm = np.arange(num_inter)

block_size = BATCH_SIZE * ACC_STEPS

best_ndcg10 = -1.0
best_epoch = 0
best_state = None
pat = 0

history = []

for epoch in range(1, EPOCHS + 1):
    t0 = time.time()
    model2.train()
    np.random.shuffle(perm)

    epoch_loss = 0.0
    n_steps = 0

    for block_start in range(0, num_inter, block_size):
        all_emb = model2.propagate(A_norm2)
        loss_block = 0.0

        remaining = num_inter - block_start
        n_inner = min(ACC_STEPS, int(np.ceil(remaining / BATCH_SIZE)))

        for inner in range(n_inner):
            start = block_start + inner * BATCH_SIZE
            batch_idx = perm[start:start + BATCH_SIZE]
            bu = train_users[batch_idx]
            bi = train_items[batch_idx]
            bn = sample_negatives_fast(bu, train_pos, B, K_try=20)

            u_nodes = torch.from_numpy(bu.astype(np.int64) + user_offset2).to(DEVICE)
            p_nodes = torch.from_numpy(bi.astype(np.int64) + book_offset2).to(DEVICE)
            n_nodes = torch.from_numpy(bn.astype(np.int64) + book_offset2).to(DEVICE)

            u_emb = all_emb[u_nodes]
            p_emb = all_emb[p_nodes]
            n_emb = all_emb[n_nodes]

            loss_block = loss_block + bpr_loss(u_emb, p_emb, n_emb)

        loss_block = loss_block / n_inner

        opt2.zero_grad(set_to_none=True)
        loss_block.backward()
        opt2.step()

        epoch_loss += float(loss_block.item())
        n_steps += 1

        del all_emb, loss_block

    # ---- val ----
    model2.eval()
    with torch.no_grad():
        all_emb_val = model2.propagate(A_norm2)

    val_metrics = eval_ranking(
        all_emb_val,
        val_gt,
        train_pos,
        K_list=(10, 20, 50),
        user_batch_size=USER_BATCH_EVAL,
    )

    dt = time.time() - t0
    row = {"epoch": epoch, "loss": epoch_loss / max(n_steps, 1), "time_sec": dt, **val_metrics}
    history.append(row)

    print(
        f"[Graph2] Epoch {epoch:02d} | loss={row['loss']:.4f} | time={dt:.1f}s | "
        f"Hit@10={row['Hit@10']:.4f} NDCG@10={row['NDCG@10']:.4f} | "
        f"Hit@50={row['Hit@50']:.4f} NDCG@50={row['NDCG@50']:.4f}"
    )

    cur = row["NDCG@10"]
    if cur > best_ndcg10 + MIN_DELTA:
        best_ndcg10 = cur
        best_epoch = epoch
        best_state = copy.deepcopy(model2.state_dict())
        pat = 0
        print(f"  ✅ New best NDCG@10={best_ndcg10:.5f} at epoch {best_epoch}")
    else:
        pat += 1
        print(f"  patience={pat}/{PATIENCE} (best {best_ndcg10:.5f} @ epoch {best_epoch})")

    if pat >= PATIENCE:
        print("Early stopping triggered.")
        break

hist_df2 = pd.DataFrame(history)
print("\nBest epoch:", best_epoch, "| best val NDCG@10:", best_ndcg10)

# ---- restore best and evaluate on TEST ----
if best_state is not None:
    model2.load_state_dict(best_state)

model2.eval()
with torch.no_grad():
    all_emb_test = model2.propagate(A_norm2)

test_metrics2 = eval_ranking(
    all_emb_test,
    test_gt,
    train_pos,
    K_list=(10, 20, 50),
    user_batch_size=USER_BATCH_EVAL,
)

print("\nTEST metrics (Graph2, best epoch):")
for k, v in test_metrics2.items():
    print(f"  {k}: {v:.6f}")

hist_df2

Training config: {'EMB_DIM': 64, 'NUM_LAYERS': 3, 'LR': 0.002, 'BATCH_SIZE': 4096, 'ACC_STEPS': 16, 'EPOCHS': 30, 'USER_BATCH_EVAL': 1024, 'PATIENCE': 5, 'MIN_DELTA': 0.0001}
[Graph2] Epoch 01 | loss=0.5785 | time=36.9s | Hit@10=0.0461 NDCG@10=0.0260 | Hit@50=0.1293 NDCG@50=0.0440
  ✅ New best NDCG@10=0.02604 at epoch 1
[Graph2] Epoch 02 | loss=0.4066 | time=37.5s | Hit@10=0.0485 NDCG@10=0.0272 | Hit@50=0.1337 NDCG@50=0.0456
  ✅ New best NDCG@10=0.02721 at epoch 2
[Graph2] Epoch 03 | loss=0.3710 | time=37.1s | Hit@10=0.0541 NDCG@10=0.0297 | Hit@50=0.1420 NDCG@50=0.0485
  ✅ New best NDCG@10=0.02967 at epoch 3
[Graph2] Epoch 04 | loss=0.3380 | time=37.8s | Hit@10=0.0574 NDCG@10=0.0310 | Hit@50=0.1493 NDCG@50=0.0507
  ✅ New best NDCG@10=0.03099 at epoch 4
[Graph2] Epoch 05 | loss=0.3150 | time=37.6s | Hit@10=0.0584 NDCG@10=0.0317 | Hit@50=0.1548 NDCG@50=0.0524
  ✅ New best NDCG@10=0.03174 at epoch 5
[Graph2] Epoch 06 | loss=0.2962 | time=37.5s | Hit@10=0.0608 NDCG@10=0.0328 | Hit@50=0.160

Unnamed: 0,epoch,loss,time_sec,Hit@10,NDCG@10,Hit@20,NDCG@20,Hit@50,NDCG@50
0,1,0.578549,36.884287,0.046144,0.026038,0.074572,0.03318,0.129275,0.043979
1,2,0.406551,37.491725,0.048485,0.027212,0.077681,0.034556,0.133694,0.045624
2,3,0.370962,37.104557,0.054141,0.029675,0.082793,0.036843,0.141953,0.04851
3,4,0.338048,37.763722,0.057362,0.03099,0.086614,0.038337,0.149275,0.050683
4,5,0.315009,37.614854,0.05841,0.031737,0.089067,0.039446,0.1548,0.052386
5,6,0.296218,37.519017,0.060751,0.032822,0.092588,0.040792,0.160193,0.054115
6,7,0.275092,36.928283,0.063898,0.034529,0.098431,0.043179,0.170624,0.057372
7,8,0.253508,37.334149,0.06783,0.036599,0.104161,0.045712,0.180868,0.060809
8,9,0.236324,38.984444,0.070883,0.038135,0.108712,0.047636,0.187779,0.06321
9,10,0.22353,37.697372,0.072681,0.039156,0.111952,0.049021,0.193378,0.065038


In [49]:
# Cell 23: Final evaluation on TEST (leave-one-out)
# - compute ranking metrics on FULL TEST (1 GT item per user)
# - filter already-seen train positives (train_pos)
# - uses Graph2 best model and Graph2 adjacency A_norm2

model2.eval()
with torch.no_grad():
    all_emb_test = model2.propagate(A_norm2)

test_metrics = eval_ranking(
    all_emb_test,
    test_gt,
    train_pos,
    K_list=(10, 20, 50),
    user_batch_size=USER_BATCH_EVAL,
)

print("TEST metrics (Graph2 augmented user–book–tag):")
for k, v in test_metrics.items():
    print(f"  {k}: {v:.6f}")

TEST metrics (Graph2 augmented user–book–tag):
  Hit@10: 0.090340
  NDCG@10: 0.048598
  Hit@20: 0.136971
  NDCG@20: 0.060308
  Hit@50: 0.237612
  NDCG@50: 0.080158


In [50]:
# Cell 24: Compare Graph2 TEST metrics vs baseline reference
# - baseline_ref: your baseline user–book LightGCN + hard negatives
# - show absolute deltas

baseline_ref = {
    "Hit@10": 0.084,
    "NDCG@10": 0.045,
    "Hit@20": 0.129,
    "NDCG@20": 0.057,
    "Hit@50": 0.221,
    "NDCG@50": 0.075,
}

rows = []
for m in ["Hit@10","NDCG@10","Hit@20","NDCG@20","Hit@50","NDCG@50"]:
    aug = float(test_metrics.get(m, np.nan))
    base = float(baseline_ref.get(m, np.nan))
    rows.append({
        "metric": m,
        "baseline": base,
        "graph2_test": aug,
        "delta_abs": aug - base,
        "delta_rel_%": (aug - base) / base * 100 if base and not np.isnan(aug) else np.nan
    })

cmp_df = pd.DataFrame(rows)
cmp_df

Unnamed: 0,metric,baseline,graph2_test,delta_abs,delta_rel_%
0,Hit@10,0.084,0.09034,0.00634,7.548169
1,NDCG@10,0.045,0.048598,0.003598,7.996178
2,Hit@20,0.129,0.136971,0.007971,6.179397
3,NDCG@20,0.057,0.060308,0.003308,5.804379
4,Hit@50,0.221,0.237612,0.016612,7.516695
5,NDCG@50,0.075,0.080158,0.005158,6.877812


## Conclusions — Graph2 (User–Book–Tag LightGCN)

### What was done
We extended the baseline **pure collaborative filtering** setup (user–book bipartite graph) by introducing **content-aware graph augmentation** using book tags from the GoodBooks dataset.

The final graph structure is:

user — book — tag


Key properties:
- **100% book coverage** via correct `goodreads_book_id → book_id` mapping
- Tags filtered by minimum book frequency and capped per book
- Homogeneous graph trained with **LightGCN + BPR loss**
- Strict **leave-one-out split per user** (no leakage)
- Ranking-based evaluation (Hit@K, NDCG@K)
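
The first property is easy to get wrong because a filter in the wrong ID space raises no error. The sketch below reproduces the failure mode on toy data; the three records are illustrative values, not rows read from the dataset.

```python
# Toy catalog: book_id is a dense internal key, goodreads_book_id an external key.
books = [
    {"book_id": 1, "goodreads_book_id": 2767052},
    {"book_id": 2, "goodreads_book_id": 3},
    {"book_id": 3, "goodreads_book_id": 41865},
]
# Tag rows are keyed by goodreads_book_id, as in book_tags.csv.
book_tags = [
    {"goodreads_book_id": 2767052, "tag_id": 11},
    {"goodreads_book_id": 3, "tag_id": 11},
    {"goodreads_book_id": 41865, "tag_id": 27},
]
catalog_ids = {b["book_id"] for b in books}

# Wrong: compares goodreads_book_id against book_id. Nothing fails;
# only accidental numeric overlaps survive (here, the value 3).
wrong = [r for r in book_tags if r["goodreads_book_id"] in catalog_ids]

# Right: translate to book_id first, then filter against the catalog.
g2b = {b["goodreads_book_id"]: b["book_id"] for b in books}
right = [
    {**r, "book_id": g2b[r["goodreads_book_id"]]}
    for r in book_tags
    if g2b.get(r["goodreads_book_id"]) in catalog_ids
]

print(len({r["goodreads_book_id"] for r in wrong}), "of", len(books), "books covered (wrong)")
print(len({r["book_id"] for r in right}), "of", len(books), "books covered (right)")
```

A coverage check like the last two lines, run right after filtering, is a cheap guard against this class of bug.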

### Results (TEST set)

| Metric     | Baseline (user–book) | Graph2 (user–book–tag) | Δ Absolute | Δ Relative |
|------------|----------------------|-------------------------|------------|------------|
| Hit@10     | 0.0840               | **0.0903**              | +0.0063    | +7.5%      |
| NDCG@10    | 0.0450               | **0.0486**              | +0.0036    | +8.0%      |
| Hit@20     | 0.1290               | **0.1370**              | +0.0080    | +6.2%      |
| NDCG@20    | 0.0570               | **0.0603**              | +0.0033    | +5.8%      |
| Hit@50     | 0.2210               | **0.2376**              | +0.0166    | +7.5%      |
| NDCG@50    | 0.0750               | **0.0802**              | +0.0052    | +6.9%      |

### Key observations
- The **baseline ceiling was broken**: pure CF saturated at NDCG@10 ≈ 0.045, while Graph2 reached **≈ 0.049**.
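- Gains are **consistent across all K** (+5.8% to +8.0% relative), which suggests the improvement is not an artifact of one cutoff; repeating the run with several seeds would be needed to confirm this.
- The model continued improving up to ~30 epochs, which is consistent with **effective information propagation** rather than overfitting.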
- Crucially, the improvement comes **not from architectural complexity**, but from **adding meaningful semantic relations to the graph**.

### Why it worked
- Tags act as **shared semantic hubs** connecting related books.
- LightGCN efficiently propagates signals across `user → book → tag → book` paths.
- Correct ID mapping and full coverage were essential: with only 8.1% of books tagged, the earlier run showed no gain. (The filter thresholds also changed between the two runs, so the two effects are not separated.)
- This supports the core hypothesis:  
  **graph enrichment is more impactful than model complexity for recommender systems.**
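
The hub effect can be seen on a four-edge toy graph. In the sketch below (node names are made up), a user and a book that share no co-readers are unreachable in the user–book graph, but become three hops apart once a common tag node is added. Three hops is within reach of the `NUM_LAYERS = 3` configuration used in this notebook.

```python
from collections import deque

# u1 read b1, u2 read b2; b1 and b2 share the tag t_fantasy.
edges = [("u1", "b1"), ("u2", "b2"), ("b1", "t_fantasy"), ("b2", "t_fantasy")]

def build_adj(edge_list):
    """Undirected adjacency sets."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hops(adj, src, dst):
    """Shortest path length from src to dst via BFS, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

cf_only = build_adj(edges[:2])  # user-book edges only
augmented = build_adj(edges)    # plus book-tag edges

print(hops(cf_only, "u1", "b2"))    # unreachable without tags
print(hops(augmented, "u1", "b2"))  # reachable via u1 -> b1 -> t_fantasy -> b2
```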

### Limitations
- Tag edges are currently **unweighted** (tag `count` is ignored).
- The model remains **homogeneous** (node and edge types are not distinguished).
- Textual and numerical book metadata from `books.csv` are not yet used.

### Next steps
In the next iteration, we will move toward a **fully hybrid recommender** by:
1. Adding **weighted book–tag edges** based on tag frequency.
2. Introducing **book–book similarity edges** derived from textual metadata (TF-IDF / SBERT).
3. Exploring **heterogeneous GNN architectures** with typed relations.

These steps aim to further improve ranking quality while preserving a clean, reproducible research pipeline.

## Graph3 — Fully Augmented Hybrid Graph Recommender

### Motivation
Previous experiments demonstrated that **graph enrichment is the primary driver of recommendation quality**, not architectural complexity alone.  
After successfully improving over the pure collaborative filtering baseline using **user–book–tag augmentation (Graph2)**, the next step is to build a **fully hybrid graph** that incorporates *all available structured and textual signals* from the dataset.

The goal of Graph3 is to move from:
> *“collaborative + one semantic signal”*  
to  
> **“collaborative + multi-source content-aware reasoning.”**

### Graph3: What changes compared to Graph2

Graph3 integrates **all available information sources** into a single unified graph.

#### 1. Interaction signal (unchanged)
- **user ↔ book**
- Derived only from **TRAIN interactions**
- Leave-one-out per user is preserved
- Prevents information leakage
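
A minimal sketch of what "TRAIN only" means in practice. It assumes interactions arrive in chronological order per user and that the last two items are held out; the notebook's actual split logic lives in an earlier stage and may differ in detail. The function name and the handling of short histories are assumptions.

```python
from collections import defaultdict

def leave_one_out(interactions):
    """Split (user, item) pairs, given in chronological order per user.
    Last item -> test, second-to-last -> validation, the rest -> train.
    Users with fewer than 3 interactions stay entirely in train (an assumption)."""
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)
    train, val, test = [], {}, {}
    for user, items in by_user.items():
        if len(items) < 3:
            train += [(user, i) for i in items]
            continue
        train += [(user, i) for i in items[:-2]]
        val[user], test[user] = items[-2], items[-1]
    return train, val, test

log = [(0, 10), (0, 11), (0, 12), (0, 13), (1, 10), (1, 14), (1, 15)]
train, val, test = leave_one_out(log)

# Graph edges come from `train` only, so held-out items never enter message passing.
train_edges = set(train)
assert all((u, i) not in train_edges for u, i in val.items())
assert all((u, i) not in train_edges for u, i in test.items())
print(train, val, test)
```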

#### 2. Tag-based semantic signal (enhanced)
- **book ↔ tag**
- Edges are **weighted** using `log1p(tag_count)`
- Provides stronger propagation through highly representative tags
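
The sketch below illustrates the weighting on toy counts, followed by a weighted symmetric normalization `w_ij / sqrt(d_i * d_j)`, where `d` is the weighted degree. It is a dictionary-based illustration under those assumptions, not the notebook's sparse-tensor implementation.

```python
import math
from collections import defaultdict

# (book, tag, count) rows; the counts are toy values.
book_tag_counts = [("b1", "t1", 50_000), ("b1", "t2", 12), ("b2", "t1", 300)]

# log1p compresses the heavy-tailed counts into a narrow range.
weights = {(b, t): math.log1p(c) for b, t, c in book_tag_counts}

# Weighted degree of every node (the graph is undirected).
deg = defaultdict(float)
for (b, t), w in weights.items():
    deg[b] += w
    deg[t] += w

# Symmetric normalization: w_ij / sqrt(d_i * d_j).
norm = {(b, t): w / math.sqrt(deg[b] * deg[t]) for (b, t), w in weights.items()}

for edge, w in norm.items():
    print(edge, round(weights[edge], 3), round(w, 3))
```

The raw counts differ by more than three orders of magnitude, while the `log1p` weights differ by less than a factor of five, so popular tags still dominate but no longer drown out the rest.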

#### 3. Text-based semantic similarity (new)
- **book ↔ book** edges
- Built using **TF-IDF vectors** of `title + authors`
- Top-K nearest neighbors per book (cosine similarity)
- Captures latent thematic similarity beyond explicit tags
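
A dependency-free toy version of this step is sketched below. In the notebook, scikit-learn's `TfidfVectorizer` and `NearestNeighbors` (both imported in Cell 1) would be the practical choice. The four documents, the whitespace tokenizer, and the helper names are assumptions made for illustration.

```python
import math
from collections import Counter

docs = {
    "b1": "harry potter and the chamber of secrets j k rowling",
    "b2": "harry potter and the goblet of fire j k rowling",
    "b3": "the hobbit j r r tolkien",
    "b4": "the lord of the rings j r r tolkien",
}

def tfidf_vectors(corpus):
    """L2-normalized TF-IDF vectors with a smoothed IDF, as dict[token -> weight]."""
    tokenized = {key: text.split() for key, text in corpus.items()}
    n = len(tokenized)
    df = Counter(tok for toks in tokenized.values() for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = {}
    for key, toks in tokenized.items():
        raw = {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors[key] = {tok: w / length for tok, w in raw.items()}
    return vectors

def cosine(a, b):
    # Vectors are unit length, so the dot product equals the cosine similarity.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def top_k_neighbors(vectors, k):
    """For each document, the k most similar other documents."""
    out = {}
    for key, vec in vectors.items():
        sims = [(cosine(vec, other), o) for o, other in vectors.items() if o != key]
        out[key] = [o for _, o in sorted(sims, reverse=True)[:k]]
    return out

neighbors = top_k_neighbors(tfidf_vectors(docs), k=1)
print(neighbors)
```

Each `(book, neighbor)` pair would become one book ↔ book edge. With `title + authors` as the only text, similarity is driven largely by shared author names and series titles, which is worth keeping in mind when comparing this signal against the explicit author edges.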

#### 4. Author signal (new)
- **book ↔ author**
- Allows recommendation propagation between books by the same or related authors

#### 5. Language signal (new)
- **book ↔ language**
- Helps separate and propagate preferences across language clusters

#### 6. Temporal signal (new)
- **book ↔ year_bin**
- Books are connected to coarse publication-era nodes
- Enables temporal preference smoothing
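
One possible binning scheme is sketched below. The 20-year width, the 1800 floor, and the treatment of missing or very old years are all assumptions chosen for illustration; the section does not specify how `year_bin` nodes are defined.

```python
def year_bin(year, width=20, floor=1800):
    """Map a publication year to a coarse era label.
    Years before `floor` share one bin; missing years (None or NaN) get 'unknown'."""
    if year is None or year != year:  # NaN is the only value not equal to itself
        return "unknown"
    year = int(year)
    if year < floor:
        return f"pre-{floor}"
    start = floor + ((year - floor) // width) * width
    return f"{start}-{start + width - 1}"

years = {"b1": 1997, "b2": 2003.0, "b3": 1813, "b4": -720, "b5": None}
book_year_edges = [(book, year_bin(y)) for book, y in years.items()]
print(book_year_edges)
```

Whether books with a missing year should all attach to one shared `unknown` node is a design choice: a shared node connects unrelated books, so dropping those edges may be the safer default.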

### Unified graph structure

    user ─ book ─ tag
            │
            ├── book (similarity)
            ├── author
            ├── language
            └── year_bin


All nodes are embedded in a **single shared latent space** using LightGCN.
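
A single shared space means every node type occupies a contiguous index range of one embedding table, extending the offset scheme used for Graph2. In the sketch below, the user, book, and tag counts are the Graph2 values; the author, language, and year-bin counts are placeholders, and the helper names are hypothetical.

```python
def build_offsets(sizes):
    """Assign each node type a contiguous index range in one shared table."""
    offsets, start = {}, 0
    for node_type, n in sizes.items():
        offsets[node_type] = start
        start += n
    return offsets, start

# user/book/tag sizes come from the Graph2 run; the rest are placeholder counts.
sizes = {"user": 53_398, "book": 9_999, "tag": 5_014,
         "author": 100, "language": 10, "year_bin": 12}
offsets, num_nodes = build_offsets(sizes)

def global_id(node_type, local_idx):
    """Translate a per-type index into a row of the shared embedding table."""
    assert 0 <= local_idx < sizes[node_type]
    return offsets[node_type] + local_idx

print(offsets["tag"], global_id("book", 0), num_nodes)
```

Keeping users and books first preserves the Graph2 offsets, so training and evaluation code that indexes users and books stays unchanged when new node types are appended.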

### Model and training
- **Model:** LightGCN
- **Loss:** Bayesian Personalized Ranking (BPR)
- **Negative sampling:** uniform with collision filtering
- **Edge normalization:** weighted symmetric normalization
- **Evaluation:** leave-one-out ranking metrics (Hit@K, NDCG@K)
- **Early stopping:** validation NDCG@10

This setup allows the model to **propagate user preference signals through multiple semantic paths** without introducing attention or heavy parameterization.
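
The two training ingredients listed above can be sketched in plain Python. This is a toy illustration, not the notebook's `sample_negatives_fast` or `bpr_loss` (which operate on NumPy arrays and PyTorch tensors); the retry cap mirrors the `K_try=20` argument seen in the training loop, and the fallback after the cap is an assumption.

```python
import math
import random

def sample_negative(user, train_pos, num_items, rng, max_tries=20):
    """Draw a uniform random item the user has not interacted with in train.
    After max_tries collisions the last draw is returned as-is (a simplification)."""
    item = rng.randrange(num_items)
    for _ in range(max_tries):
        if item not in train_pos[user]:
            break
        item = rng.randrange(num_items)
    return item

def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)), computed as softplus(neg - pos) for stability."""
    def softplus(x):
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return sum(softplus(n - p) for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)

rng = random.Random(0)
train_pos = {0: {1, 2, 3}}
negs = [sample_negative(0, train_pos, num_items=10, rng=rng) for _ in range(5)]
print(negs)
print(round(bpr_loss([2.0, 0.5], [0.0, 0.5]), 4))
```

The loss is small when the positive outscores the negative by a wide margin and equals `log(2)` when the two scores tie, which is why the second pair dominates the mean in this example.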

### What we expect to learn from Graph3
- Whether **multi-source graph enrichment** yields further improvements over Graph2
- Which content signals contribute positively (tags vs similarity vs metadata)
- Whether adding signals introduces noise or consistently improves ranking quality

Graph3 also serves as a **strong foundation** for future work:
- heterogeneous GNNs (typed edges)
- SBERT-based similarity edges
- feature-aware or attention-based models

### Next steps
Depending on Graph3 results, we will:
1. Tune edge construction (weights, thresholds, top-K)
2. Replace TF-IDF similarity with **SBERT embeddings**
3. Move to **heterogeneous GNN architectures**
4. Perform ablation studies to quantify each signal’s contribution
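
For the ablation in step 4, one simple design is leave-one-signal-out: train once on the full graph, once per removed content signal, and once on user–book edges alone. The sketch below only enumerates these configurations; the signal names follow the Graph3 description and the training run itself is left abstract.

```python
# Edge-type names follow the Graph3 description; user-book edges are always kept.
CONTENT_SIGNALS = ["tag", "book_sim", "author", "language", "year_bin"]

def ablation_configs(signals):
    """Full graph, each leave-one-signal-out variant, and the collaborative-only graph."""
    configs = {"full": list(signals)}
    for s in signals:
        configs[f"minus_{s}"] = [x for x in signals if x != s]
    configs["cf_only"] = []
    return configs

configs = ablation_configs(CONTENT_SIGNALS)
for name, kept in configs.items():
    print(f"{name:>16}: user-book + {kept}")
```

This needs one run per signal plus two reference runs, which is far cheaper than testing every subset and still attributes a marginal effect to each signal.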

At this stage, the project transitions from *collaborative filtering* to a **full hybrid graph recommender system**.

In [1]:
# Cell 1: Imports + reproducibility
import os, gc, json, time, math
from pathlib import Path
from collections import defaultdict

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.optim import Adam

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("DEVICE:", DEVICE)

DEVICE: cuda


In [2]:
# Cell 2: Project paths
PROJECT_ROOT = Path(r"D:/ML/GNN/graph_recsys")
DATA_RAW = PROJECT_ROOT / "data_raw"
DATA_PROCESSED = PROJECT_ROOT / "data_processed" / "v2_proper"
ARTIFACTS = PROJECT_ROOT / "artifacts" / "v2_proper"

for p in [PROJECT_ROOT, DATA_RAW, DATA_PROCESSED, ARTIFACTS]:
    p.mkdir(parents=True, exist_ok=True)

print("DATA_RAW:", DATA_RAW)
print("DATA_PROCESSED:", DATA_PROCESSED)
print("ARTIFACTS:", ARTIFACTS)

RAW_FILES = {
    "books": "books.csv",
    "tags": "tags.csv",
    "book_tags": "book_tags.csv",
}

DATA_RAW: D:\ML\GNN\graph_recsys\data_raw
DATA_PROCESSED: D:\ML\GNN\graph_recsys\data_processed\v2_proper
ARTIFACTS: D:\ML\GNN\graph_recsys\artifacts\v2_proper


In [3]:
# Cell 3: Load LOO splits (idx-space) + mappings user2idx/book2idx (saved as pd.Series.to_csv)
def load_series_mapping(path: Path) -> dict[int, int]:
    """
    Loads mapping saved by: pd.Series(dict).to_csv("user2idx.csv")
    This produces 2 columns without headers. We parse robustly.
    """
    df = pd.read_csv(path, header=None)
    # Typical shape: (n+1, 2) with header row if index name saved; handle both
    # Remove rows where key is not int-like
    df = df.dropna()
    # If first row is non-numeric (e.g. index name), drop it
    def is_intlike(x):
        try:
            int(x)
            return True
        except (TypeError, ValueError, OverflowError):
            return False
    df = df[df[0].apply(is_intlike) & df[1].apply(is_intlike)].copy()
    df[0] = df[0].astype(int)
    df[1] = df[1].astype(int)
    return dict(zip(df[0].values.tolist(), df[1].values.tolist()))

splits_path = DATA_PROCESSED / "splits_ui.npz"
user_map_path = DATA_PROCESSED / "user2idx.csv"
book_map_path = DATA_PROCESSED / "book2idx.csv"

for p in [splits_path, user_map_path, book_map_path]:
    if not p.exists():
        raise FileNotFoundError(f"Missing file: {p}")

z = np.load(splits_path, allow_pickle=True)
train_ui = z["train_ui"].astype(np.int32)
val_ui   = z["val_ui"].astype(np.int32)
test_ui  = z["test_ui"].astype(np.int32)

user2idx = load_series_mapping(user_map_path)   # {user_id: user_idx}
book2idx = load_series_mapping(book_map_path)   # {book_id: book_idx}

idx2user = {u: uid for uid, u in user2idx.items()}
idx2book = {i: bid for bid, i in book2idx.items()}

U = max(idx2user.keys()) + 1
B = max(idx2book.keys()) + 1

print("train_ui:", train_ui.shape, train_ui.dtype)
print("val_ui:", val_ui.shape, val_ui.dtype)
print("test_ui:", test_ui.shape, test_ui.dtype)
print("U:", U, "B:", B)

def ui_to_df(ui: np.ndarray) -> pd.DataFrame:
    u = ui[:, 0].astype(int)
    i = ui[:, 1].astype(int)
    df = pd.DataFrame({"user_idx": u, "book_idx": i})
    df["user_id"] = df["user_idx"].map(idx2user).astype(int)
    df["book_id"] = df["book_idx"].map(idx2book).astype(int)
    return df[["user_id", "book_id", "user_idx", "book_idx"]]

train_df = ui_to_df(train_ui)
val_df   = ui_to_df(val_ui)
test_df  = ui_to_df(test_ui)

print(train_df.shape, val_df.shape, test_df.shape)
train_df.head()

train_ui: (4926384, 2) int32
val_ui: (53398, 2) int32
test_ui: (53398, 2) int32
U: 53398 B: 9999
(4926384, 4) (53398, 4) (53398, 4)


Unnamed: 0,user_id,book_id,user_idx,book_idx
0,1,258,0,257
1,1,1796,0,1795
2,1,4691,0,4690
3,1,2063,0,2062
4,1,11,0,10


In [4]:
# Cell 4: Build train positives + ground truth dicts (LOO)

train_pos = defaultdict(set)
for u, i in train_ui:
    train_pos[int(u)].add(int(i))

val_gt = {int(u): int(i) for u, i in val_ui}
test_gt = {int(u): int(i) for u, i in test_ui}

print("train_pos users:", len(train_pos))
print("val_gt users:", len(val_gt), "test_gt users:", len(test_gt))

train_pos users: 53398
val_gt users: 53398 test_gt users: 53398


In [5]:
# Cell 5: Load books.csv and prepare content nodes (author/lang/year_bin)
books_path = DATA_RAW / RAW_FILES["books"]
books_df = pd.read_csv(books_path)

# Required for ID alignment with our mapping
assert {"book_id", "goodreads_book_id"}.issubset(books_df.columns)

# Keep only books in our catalog (book_id space)
books_df["book_id"] = books_df["book_id"].astype(int)
books_df = books_df[books_df["book_id"].isin(book2idx.keys())].copy()
books_df["book_idx"] = books_df["book_id"].map(book2idx).astype(int)

# Text for similarity edges
def safe_str(x): 
    return "" if pd.isna(x) else str(x)

title_col = "title" if "title" in books_df.columns else None
authors_col = "authors" if "authors" in books_df.columns else None

books_df["text"] = (
    books_df[title_col].map(safe_str) if title_col else ""
).astype(str) + " " + (
    books_df[authors_col].map(safe_str) if authors_col else ""
).astype(str)
books_df["text"] = books_df["text"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()

# Authors: split by comma
if authors_col:
    books_df["authors_list"] = books_df["authors"].fillna("").apply(
        lambda s: [a.strip() for a in str(s).split(",") if a.strip()]
    )
else:
    books_df["authors_list"] = [[] for _ in range(len(books_df))]

# Language (optional)
lang_col = "language_code" if "language_code" in books_df.columns else None
if lang_col:
    books_df["language_code"] = books_df["language_code"].fillna("unk").astype(str)
else:
    books_df["language_code"] = "unk"

# Year (optional)
year_col = "original_publication_year" if "original_publication_year" in books_df.columns else None
if year_col:
    books_df["year"] = pd.to_numeric(books_df[year_col], errors="coerce")
else:
    books_df["year"] = np.nan

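# Year bins (coarse, robust)
# Missing years and years outside [0, 2030] get no bin; label them "unknown".
# fillna must run before the string cast, because astype(str) turns NaN into "nan".
books_df["year_bin"] = pd.cut(
    books_df["year"],
    bins=[0, 1950, 1970, 1990, 2005, 2015, 2030],
    labels=["<=1950", "1951-1970", "1971-1990", "1991-2005", "2006-2015", "2016+"],
    include_lowest=True
).astype(object).fillna("unknown").astype(str)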

print("books_df:", books_df.shape)
books_df[["book_id","book_idx","text","language_code","year_bin"]].head()

books_df: (9999, 28)


Unnamed: 0,book_id,book_idx,text,language_code,year_bin
0,1,0,"the hunger games (the hunger games, #1) suzann...",eng,2006-2015
1,2,1,harry potter and the sorcerer's stone (harry p...,eng,1991-2005
2,3,2,"twilight (twilight, #1) stephenie meyer",en-US,1991-2005
3,4,3,to kill a mockingbird harper lee,eng,1951-1970
4,5,4,the great gatsby f. scott fitzgerald,eng,<=1950
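
To see how the year binning behaves on edge cases, here is a minimal sketch on made-up years (the values are illustrative and not taken from `books.csv`). Years below 0 and missing years fall outside every bin, so `pd.cut` returns NaN for them; the sketch maps those to an explicit `unknown` label before converting to strings.

```python
import numpy as np
import pandas as pd

# Illustrative years only: one ancient work (negative year) and one missing value
years = pd.Series([1925, 2008, -720, np.nan, 1950, 1951])

binned = pd.cut(
    years,
    bins=[0, 1950, 1970, 1990, 2005, 2015, 2030],
    labels=["<=1950", "1951-1970", "1971-1990", "1991-2005", "2006-2015", "2016+"],
    include_lowest=True,
)

# Fill NaN while the values are still Python objects; once cast with astype(str),
# NaN becomes the literal string "nan" and fillna no longer matches it.
year_bin = binned.astype(object).fillna("unknown").astype(str)
print(year_bin.tolist())
```

Bins are right-inclusive, so 1950 lands in `<=1950` and 1951 in `1951-1970`.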


In [10]:
# Cell 6: Load tags/book_tags and build weighted book-tag edges (book_id space)
# - map goodreads_book_id -> book_id (via books.csv)
# - filter tags by MIN_BOOK_FREQ
# - keep TOP_TAGS_PER_BOOK per book by count
# - edge weight = log1p(count) (safe)

tags_df = pd.read_csv(DATA_RAW / RAW_FILES["tags"])
book_tags_raw = pd.read_csv(DATA_RAW / RAW_FILES["book_tags"])

assert {"goodreads_book_id","tag_id","count"}.issubset(book_tags_raw.columns)
assert {"tag_id","tag_name"}.issubset(tags_df.columns)

# Robust numeric parsing (avoid hidden NaNs/strings issues)
book_tags_raw["goodreads_book_id"] = pd.to_numeric(book_tags_raw["goodreads_book_id"], errors="coerce").astype("Int64")
book_tags_raw["tag_id"] = pd.to_numeric(book_tags_raw["tag_id"], errors="coerce").astype("Int64")
book_tags_raw["count"] = pd.to_numeric(book_tags_raw["count"], errors="coerce")

book_tags_raw = book_tags_raw.dropna(subset=["goodreads_book_id","tag_id","count"]).copy()
book_tags_raw["goodreads_book_id"] = book_tags_raw["goodreads_book_id"].astype(int)
book_tags_raw["tag_id"] = book_tags_raw["tag_id"].astype(int)

# Safety: ensure non-negative counts (avoid log1p issues)
book_tags_raw["count"] = book_tags_raw["count"].fillna(0)
book_tags_raw.loc[book_tags_raw["count"] < 0, "count"] = 0
book_tags_raw["count"] = book_tags_raw["count"].astype(int)

# map goodreads_book_id -> book_id using books_df
g2b = dict(zip(books_df["goodreads_book_id"].astype(int), books_df["book_id"].astype(int)))
book_tags_raw["book_id"] = book_tags_raw["goodreads_book_id"].map(g2b)
book_tags_raw = book_tags_raw.dropna(subset=["book_id"]).copy()
book_tags_raw["book_id"] = book_tags_raw["book_id"].astype(int)

# filter to our catalog book_ids
book_tags_df = book_tags_raw[book_tags_raw["book_id"].isin(book2idx.keys())].copy()

print("book_tags_df:", book_tags_df.shape)
print("unique books:", book_tags_df["book_id"].nunique(), "unique tags:", book_tags_df["tag_id"].nunique())

# Filtering + top tags per book
MIN_BOOK_FREQ = 10
TOP_TAGS_PER_BOOK = 50

tag_book_freq = book_tags_df.groupby("tag_id")["book_id"].nunique().sort_values(ascending=False)
eligible_tags = tag_book_freq[tag_book_freq >= MIN_BOOK_FREQ].index
bt = book_tags_df[book_tags_df["tag_id"].isin(eligible_tags)].copy()

bt = (
    bt.sort_values(["book_id","count"], ascending=[True, False])
      .groupby("book_id", as_index=False)
      .head(TOP_TAGS_PER_BOOK)
      .copy()
)

# map to idx
bt["book_idx"] = bt["book_id"].map(book2idx).astype(int)

# local tag vocab
unique_tag_ids = np.sort(bt["tag_id"].unique())
tag2idx = {int(t): i for i, t in enumerate(unique_tag_ids)}
T = len(tag2idx)
bt["tag_idx"] = bt["tag_id"].map(tag2idx).astype(int)

# weight: log1p(count)
bt["w"] = np.log1p(bt["count"].astype(float))
assert np.isfinite(bt["w"]).all(), "Non-finite weights detected in tag edges"

print("Final tags T:", T)
print("Coverage books with tag:", bt["book_idx"].nunique(), "/", B)
bt[["book_idx","tag_idx","count","w"]].head()

book_tags_df: (999856, 4)
unique books: 9999 unique tags: 34225
Final tags T: 5014
Coverage books with tag: 9999 / 9999


Unnamed: 0,book_idx,tag_idx,count,w
619294,0,1688,50755,10.834785
619295,0,1271,35418,10.475004
619296,0,4937,25968,10.164659
619297,0,1725,13819,9.533872
619298,0,1455,12985,9.471627
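
The filtering pipeline in this cell (tag frequency filter, then top tags per book, then `log1p` weights) can be traced on a toy table. This is only a sketch: the counts are invented and the thresholds are shrunk so that the effect is visible on six rows.

```python
import numpy as np
import pandas as pd

# Toy book-tag counts (hypothetical values, not from GoodBooks-10k)
toy = pd.DataFrame({
    "book_id": [1, 1, 1, 2, 2, 3],
    "tag_id":  [10, 11, 12, 10, 11, 10],
    "count":   [50, 5, 7, 20, 3, 9],
})

MIN_BOOK_FREQ = 2       # keep tags attached to at least 2 distinct books
TOP_TAGS_PER_BOOK = 2   # keep at most 2 tags per book, by count

# 1) Drop rare tags: tag 12 appears on a single book and is removed
freq = toy.groupby("tag_id")["book_id"].nunique()
eligible = freq[freq >= MIN_BOOK_FREQ].index
kept = toy[toy["tag_id"].isin(eligible)]

# 2) Keep the highest-count tags per book
kept = (
    kept.sort_values(["book_id", "count"], ascending=[True, False])
        .groupby("book_id")
        .head(TOP_TAGS_PER_BOOK)
        .copy()
)

# 3) Compress heavy-tailed counts into edge weights
kept["w"] = np.log1p(kept["count"].astype(float))
print(kept)
```

Note the order of operations: the frequency filter runs before the per-book cap, so a rare tag never occupies one of a book's top slots.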


In [11]:
# Cell 7: Book–Book similarity edges via TF-IDF (title+authors) -> kNN
# - Adds semantic edges between books: book <-> book
# - Weight = cosine similarity

TOPK_SIM = 30  # per book
MIN_SIM = 0.10  # prune weak similarities

corpus = books_df.sort_values("book_idx")["text"].tolist()
assert len(corpus) == B, f"Expected {B} books in corpus, got {len(corpus)}"

tfidf = TfidfVectorizer(
    min_df=2,
    max_features=200000,
    ngram_range=(1, 2),
)
X = tfidf.fit_transform(corpus)  # sparse [B, V]

# cosine kNN in sparse space
nnbrs = NearestNeighbors(n_neighbors=TOPK_SIM + 1, metric="cosine", algorithm="brute")
nnbrs.fit(X)
dist, idx = nnbrs.kneighbors(X, return_distance=True)

# build edges excluding self (first neighbor)
src = []
dst = []
w = []
for i in range(B):
    for j, d in zip(idx[i, 1:], dist[i, 1:]):
        sim = 1.0 - float(d)
        if sim >= MIN_SIM:
            src.append(i)
            dst.append(int(j))
            w.append(sim)

src = np.array(src, dtype=np.int64)
dst = np.array(dst, dtype=np.int64)
w = np.array(w, dtype=np.float32)

print("book-book sim edges:", len(src), "| mean sim:", w.mean() if len(w) else None)

book-book sim edges: 265581 | mean sim: 0.27009127
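
The edge loop in this cell drops the first returned neighbor on the assumption that it is the book itself. That usually holds, but when two books have identical text the tie at distance 0 can be returned in either order. A more defensive variant, sketched here on a toy corpus with made-up thresholds, filters on `j == i` explicitly instead of relying on position.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus (illustrative title + author strings, lowercased like books_df["text"])
texts = [
    "dune frank herbert",
    "dune messiah frank herbert",
    "emma jane austen",
    "persuasion jane austen",
]
TOPK_TOY, MIN_SIM_TOY = 2, 0.10  # toy values, not the notebook's settings

X = TfidfVectorizer().fit_transform(texts)
knn = NearestNeighbors(n_neighbors=TOPK_TOY + 1, metric="cosine", algorithm="brute").fit(X)
dist, idx = knn.kneighbors(X)

edges = []
for i in range(X.shape[0]):
    kept = 0
    for j, d in zip(idx[i], dist[i]):
        sim = 1.0 - float(d)
        # Skip self explicitly, prune weak links, and cap at TOPK_TOY neighbors
        if int(j) == i or sim < MIN_SIM_TOY or kept == TOPK_TOY:
            continue
        edges.append((i, int(j), sim))
        kept += 1

print(edges)
```

On this corpus each book keeps only its same-author neighbor, because the cross-author similarities are zero and fall below the threshold.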


In [12]:
# Cell 8: Unified node index space for Graph3 (all signals)
# users:   [0 .. U)
# books:   [U .. U+B)
# tags:    [U+B .. U+B+T)
# authors: [..]
# langs:   [..]
# yearbin: [..]

user_offset = 0
book_offset = U
tag_offset = U + B
author_offset = tag_offset + T

# author vocab
all_authors = sorted({a for lst in books_df["authors_list"] for a in lst})
author2idx = {a: i for i, a in enumerate(all_authors)}
A = len(author2idx)

lang_offset = author_offset + A
langs = sorted(books_df["language_code"].astype(str).unique().tolist())
lang2idx = {l: i for i, l in enumerate(langs)}
L = len(lang2idx)

year_offset = lang_offset + L
years = sorted(books_df["year_bin"].astype(str).unique().tolist())
year2idx = {y: i for i, y in enumerate(years)}
Y = len(year2idx)

num_nodes = U + B + T + A + L + Y

print(f"U={U} B={B} T={T} A={A} L={L} Y={Y} | num_nodes={num_nodes}")
print("Offsets:", dict(user=user_offset, book=book_offset, tag=tag_offset,
                      author=author_offset, lang=lang_offset, year=year_offset))

U=53398 B=9999 T=5014 A=5841 L=26 Y=7 | num_nodes=74285
Offsets: {'user': 0, 'book': 53398, 'tag': 63397, 'author': 68411, 'lang': 74252, 'year': 74278}
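
Because every node type occupies one contiguous block of the unified index space, a global node id can be mapped back to its type and local index with a binary search over the block starts. The helper below is hypothetical (it is not part of the notebook) and hard-codes the block sizes from the printed output above.

```python
from bisect import bisect_right

# Block sizes in index order, copied from the printed output above
sizes = {"user": 53398, "book": 9999, "tag": 5014, "author": 5841, "lang": 26, "year": 7}

names, starts, total = [], [], 0
for name, size in sizes.items():
    names.append(name)
    starts.append(total)   # first global id of this block
    total += size

def decode_node(node_id: int) -> tuple[str, int]:
    """Map a global node id to (node type, local index). Hypothetical helper."""
    if not 0 <= node_id < total:
        raise IndexError(f"node id {node_id} outside [0, {total})")
    k = bisect_right(starts, node_id) - 1
    return names[k], node_id - starts[k]

print(decode_node(53398), decode_node(74284))
```

A helper like this is handy when inspecting nearest neighbors in embedding space, where results come back as global ids.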


In [13]:
# Cell 9: Build Graph3 edges (weighted)
# Edges:
# - user<->book from TRAIN: weight=1
# - book<->tag: weight=log1p(count)
# - book<->book_sim: weight=cosine sim
# - book<->author: weight=1
# - book<->lang: weight=1
# - book<->yearbin: weight=1

edge_src = []
edge_dst = []
edge_w = []

def add_undirected(u, v, w=1.0):
    edge_src.append(u); edge_dst.append(v); edge_w.append(w)
    edge_src.append(v); edge_dst.append(u); edge_w.append(w)

# 1) user-book (TRAIN only)
for u, i in train_ui:
    un = user_offset + int(u)
    bn = book_offset + int(i)
    add_undirected(un, bn, 1.0)

print("Added user-book edges.")

# 2) book-tag weighted
for r in bt.itertuples(index=False):
    bidx = int(r.book_idx)
    tidx = int(r.tag_idx)
    wgt = float(r.w)
    bn = book_offset + bidx
    tn = tag_offset + tidx
    add_undirected(bn, tn, wgt)

print("Added book-tag edges.")

# 3) book-book similarity weighted
for bi, bj, ww in zip(src, dst, w):
    b1 = book_offset + int(bi)
    b2 = book_offset + int(bj)
    add_undirected(b1, b2, float(ww))

print("Added book-book similarity edges.")

# 4) book-author
for r in books_df.itertuples(index=False):
    bidx = int(r.book_idx)
    bn = book_offset + bidx
    for a in r.authors_list:
        an = author_offset + author2idx[a]
        add_undirected(bn, an, 1.0)

print("Added book-author edges.")

# 5) book-lang
for r in books_df.itertuples(index=False):
    bidx = int(r.book_idx)
    bn = book_offset + bidx
    ln = lang_offset + lang2idx[str(r.language_code)]
    add_undirected(bn, ln, 1.0)

print("Added book-lang edges.")

# 6) book-yearbin
for r in books_df.itertuples(index=False):
    bidx = int(r.book_idx)
    bn = book_offset + bidx
    yn = year_offset + year2idx[str(r.year_bin)]
    add_undirected(bn, yn, 1.0)

print("Added book-yearbin edges.")

edge_src = torch.tensor(edge_src, dtype=torch.long)
edge_dst = torch.tensor(edge_dst, dtype=torch.long)
edge_w = torch.tensor(edge_w, dtype=torch.float32)

edge_index = torch.stack([edge_src, edge_dst], dim=0)
print("edge_index:", edge_index.shape, "edge_w:", edge_w.shape)
print("node id range:", int(edge_index.min()), int(edge_index.max()))
assert edge_index.min() >= 0 and edge_index.max() < num_nodes

Added user-book edges.
Added book-tag edges.
Added book-book similarity edges.
Added book-author edges.
Added book-lang edges.
Added book-yearbin edges.
edge_index: torch.Size([2, 11450076]) edge_w: torch.Size([11450076])
node id range: 0 74284


In [14]:
# Cell 10: Build weighted normalized adjacency A_norm for Graph3
# - adj is the weighted sparse adjacency (coalesce() sums duplicate edges)
# - Normalize with symmetric norm: D^{-1/2} A D^{-1/2}
# Note: the raw adjacency is named `adj` so it does not overwrite A (the author count from Cell 8)

adj = torch.sparse_coo_tensor(edge_index, edge_w, (num_nodes, num_nodes)).coalesce()
deg = torch.sparse.sum(adj, dim=1).to_dense()
deg_inv_sqrt = torch.pow(deg.clamp(min=1e-12), -0.5)

row, col = adj.indices()
norm_val = deg_inv_sqrt[row] * adj.values() * deg_inv_sqrt[col]
A_norm = torch.sparse_coo_tensor(adj.indices(), norm_val, adj.size()).coalesce().to(DEVICE)

print("A_norm:", tuple(A_norm.shape), "nnz:", A_norm._nnz())
print("deg stats:", float(deg.min()), float(deg.mean()), float(deg.max()))

A_norm: (74285, 74285) nnz: 11260793
deg stats: 0.6931471824645996 185.75030517578125 83970.375
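
The edge list built in Cell 9 has 11,450,076 entries, while the coalesced adjacency reports 11,260,793 non-zeros, so 189,283 entries were merged. `coalesce()` sums the values of duplicate `(row, col)` pairs. One plausible source of duplicates is mutual nearest-neighbor pairs in the similarity edges, where both `(i, j)` and `(j, i)` are found and each is then added in both directions. The toy example below (invented weights) shows the effect.

```python
import torch

# Toy graph: the pair (0, 1) is added twice in each direction, as would happen
# for a mutual kNN pair; the pair (1, 2) is added once in each direction.
edge_index = torch.tensor([[0, 1, 1, 0, 1, 2],
                           [1, 0, 0, 1, 2, 1]])
edge_w = torch.tensor([0.5, 0.5, 0.5, 0.5, 1.0, 1.0])

adj = torch.sparse_coo_tensor(edge_index, edge_w, (3, 3)).coalesce()

print("edge entries:", edge_index.shape[1], "| nnz after coalesce:", adj._nnz())
print(adj.to_dense())
```

The duplicated pair ends up with weight 1.0 instead of 0.5. Whether that doubling is desirable for mutual neighbors is a modeling choice; deduplicating before building the tensor would avoid it.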


In [15]:
# Sanity check for A_norm values
vals = A_norm.values()
print("A_norm values: min/mean/max =", float(vals.min()), float(vals.mean()), float(vals.max()))
print("NaN:", torch.isnan(vals).any().item(), "| Inf:", torch.isinf(vals).any().item())

A_norm values: min/mean/max = 0.0 0.0032131632324308157 0.117741659283638
NaN: False | Inf: False


In [16]:
# Cell 11: LightGCN + BPR + negative sampler + ranking eval

class LightGCN(nn.Module):
    def __init__(self, num_nodes: int, emb_dim: int = 64, num_layers: int = 3, dropout: float = 0.0):
        super().__init__()
        self.num_nodes = num_nodes
        self.emb_dim = emb_dim
        self.num_layers = num_layers
        self.emb = nn.Embedding(num_nodes, emb_dim)
        nn.init.normal_(self.emb.weight, std=0.01)

    # No @torch.no_grad() here: training needs gradients to flow through propagation.
    def propagate(self, A_norm: torch.Tensor) -> torch.Tensor:
        x0 = self.emb.weight
        out = x0
        x = x0
        for _ in range(self.num_layers):
            x = torch.sparse.mm(A_norm, x)
            out = out + x
        out = out / (self.num_layers + 1)
        return out

def bpr_loss(u, p, n):
    pos = (u * p).sum(dim=1)
    neg = (u * n).sum(dim=1)
    return -torch.log(torch.sigmoid(pos - neg) + 1e-12).mean()

def sample_negatives_fast(users_np: np.ndarray, train_pos: dict, num_items: int, K_try: int = 20) -> np.ndarray:
    # uniform negatives with retries; fast enough for B=9999
    neg = np.random.randint(0, num_items, size=len(users_np), dtype=np.int32)
    for _ in range(K_try):
        bad = np.array([neg[i] in train_pos[int(u)] for i, u in enumerate(users_np)], dtype=bool)
        if not bad.any():
            break
        neg[bad] = np.random.randint(0, num_items, size=bad.sum(), dtype=np.int32)
    return neg

@torch.no_grad()
def eval_ranking(all_emb: torch.Tensor, gt: dict[int, int], train_pos: dict, K_list=(10,20,50), user_batch_size=1024):
    # Only user/book embeddings are used for scoring
    user_emb = all_emb[user_offset:user_offset + U].to(DEVICE)
    item_emb = all_emb[book_offset:book_offset + B].to(DEVICE)

    users = np.array(sorted(gt.keys()), dtype=np.int32)
    hits = {K: 0 for K in K_list}
    ndcgs = {K: 0.0 for K in K_list}
    maxK = max(K_list)

    for start in range(0, len(users), user_batch_size):
        batch_users = users[start:start + user_batch_size]
        bu = torch.from_numpy(batch_users.astype(np.int64)).to(DEVICE)

        scores = user_emb[bu] @ item_emb.t()  # [batch, B]

        # filter train positives
        for row, u in enumerate(batch_users):
            seen = train_pos.get(int(u), None)
            if seen:
                seen_idx = torch.tensor(list(seen), device=DEVICE, dtype=torch.long)
                scores[row, seen_idx] = -1e9

        topk = torch.topk(scores, k=maxK, dim=1).indices.cpu().numpy()

        for row, u in enumerate(batch_users):
            true_i = int(gt[int(u)])
            rank = topk[row]
            pos = np.where(rank == true_i)[0]
            for K in K_list:
                if len(pos) > 0 and pos[0] < K:
                    hits[K] += 1
                    ndcgs[K] += 1.0 / np.log2(pos[0] + 2.0)

    n = len(users)
    return {f"Hit@{K}": hits[K]/n for K in K_list} | {f"NDCG@{K}": ndcgs[K]/n for K in K_list}

print("Model + losses + eval ready.")

Model + losses + eval ready.
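
The BPR loss above guards the logarithm with a small epsilon. An equivalent formulation uses the identity -log σ(x) = softplus(-x), which needs no epsilon. The sketch below compares the two on random inputs; it is an alternative worth considering, not the notebook's implementation.

```python
import torch
import torch.nn.functional as F

def bpr_loss_eps(u, p, n):
    # Same form as in the notebook: epsilon inside the log
    diff = (u * p).sum(dim=1) - (u * n).sum(dim=1)
    return -torch.log(torch.sigmoid(diff) + 1e-12).mean()

def bpr_loss_softplus(u, p, n):
    # -log(sigmoid(x)) == softplus(-x)
    diff = (u * p).sum(dim=1) - (u * n).sum(dim=1)
    return F.softplus(-diff).mean()

torch.manual_seed(0)
u, p, n = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)
loss_eps = bpr_loss_eps(u, p, n)
loss_sp = bpr_loss_softplus(u, p, n)
print(float(loss_eps), float(loss_sp))

# Extreme case: a pair that is ranked very wrongly (pos - neg = -200)
bad = torch.tensor([-200.0])
eps_bad = -torch.log(torch.sigmoid(bad) + 1e-12)   # saturates near -log(1e-12)
sp_bad = F.softplus(-bad)                          # stays linear in the margin
print(float(eps_bad), float(sp_bad))
```

For typical score differences the two agree. They differ only for extreme negative margins, where the epsilon version saturates and its gradient vanishes, while the softplus version keeps a useful gradient.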


In [22]:
# Cell 12: Train LightGCN on Graph3 — propagate per update step (most stable)
# - No caching of all_emb across many backward calls
# - No retain_graph needed

import time, math
import numpy as np
import torch
import torch.nn as nn
import pandas as pd

EMB_DIM = 64
NUM_LAYERS = 3

LR = 2e-3
WEIGHT_DECAY = 0.0

BATCH_SIZE = 4096
ACC_STEPS = 16
EPOCHS = 40

USER_BATCH_EVAL = 1024
PATIENCE = 6
MIN_DELTA = 1e-4

SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

print("Training config:", {
    "EMB_DIM": EMB_DIM, "NUM_LAYERS": NUM_LAYERS,
    "LR": LR, "WEIGHT_DECAY": WEIGHT_DECAY,
    "BATCH_SIZE": BATCH_SIZE, "ACC_STEPS": ACC_STEPS,
    "EPOCHS": EPOCHS, "USER_BATCH_EVAL": USER_BATCH_EVAL,
    "PATIENCE": PATIENCE, "MIN_DELTA": MIN_DELTA
})

# -------------------------
# Model
# -------------------------
class LightGCN(nn.Module):
    def __init__(self, num_nodes: int, emb_dim: int = 64, num_layers: int = 3):
        super().__init__()
        self.num_layers = num_layers
        self.emb = nn.Embedding(num_nodes, emb_dim)
        nn.init.normal_(self.emb.weight, std=0.01)

    def propagate(self, A_norm):
        x0 = self.emb.weight
        out = x0
        x = x0
        for _ in range(self.num_layers):
            x = torch.sparse.mm(A_norm, x)
            out = out + x
        return out / (self.num_layers + 1)

def bpr_loss(u, p, n):
    pos = (u * p).sum(dim=1)
    neg = (u * n).sum(dim=1)
    return -torch.log(torch.sigmoid(pos - neg) + 1e-12).mean()

model = LightGCN(num_nodes=num_nodes, emb_dim=EMB_DIM, num_layers=NUM_LAYERS).to(DEVICE)
opt = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

A_norm = A_norm.to(DEVICE)

# -------------------------
# Data
# -------------------------
train_users = train_ui[:, 0].astype(np.int32)
train_pos_items = train_ui[:, 1].astype(np.int32)
n_train = len(train_users)
n_batches = math.ceil(n_train / BATCH_SIZE)

def sample_negatives(users_np: np.ndarray, n_try: int = 8) -> np.ndarray:
    neg = np.random.randint(0, B, size=len(users_np), dtype=np.int32)
    for _ in range(n_try):
        bad = np.zeros(len(users_np), dtype=bool)
        for i, u in enumerate(users_np):
            seen = train_pos.get(int(u), None)
            if seen and int(neg[i]) in seen:
                bad[i] = True
        if not bad.any():
            return neg
        neg[bad] = np.random.randint(0, B, size=bad.sum(), dtype=np.int32)
    return neg

# -------------------------
# Train + early stop
# -------------------------
history = []
best_ndcg10, best_epoch = -1.0, -1
best_state = None
pat = 0

for epoch in range(1, EPOCHS + 1):
    t0 = time.time()
    model.train()

    perm = np.random.permutation(n_train)
    epoch_loss = 0.0

    opt.zero_grad(set_to_none=True)

    for bi in range(n_batches):
        sl = slice(bi * BATCH_SIZE, min((bi + 1) * BATCH_SIZE, n_train))
        idx = perm[sl]

        u = train_users[idx]
        p = train_pos_items[idx]
        n = sample_negatives(u)

        u_t = torch.from_numpy(u.astype(np.int64)).to(DEVICE) + user_offset
        p_t = torch.from_numpy(p.astype(np.int64)).to(DEVICE) + book_offset
        n_t = torch.from_numpy(n.astype(np.int64)).to(DEVICE) + book_offset

        # key point: propagate at every step, so each backward() gets a fresh autograd graph and is safe
        all_emb = model.propagate(A_norm)

        loss = bpr_loss(all_emb[u_t], all_emb[p_t], all_emb[n_t]) / ACC_STEPS
        loss.backward()

        epoch_loss += float(loss.item()) * ACC_STEPS

        if (bi + 1) % ACC_STEPS == 0 or (bi + 1) == n_batches:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
            opt.zero_grad(set_to_none=True)

    # ---- Validation
    model.eval()
    with torch.no_grad():
        all_emb_eval = model.propagate(A_norm)

    val_metrics = eval_ranking(
        all_emb_eval, val_gt, train_pos,
        K_list=(10, 20, 50),
        user_batch_size=USER_BATCH_EVAL,
    )

    dt = time.time() - t0
    row = {"epoch": epoch, "loss": epoch_loss / max(n_batches, 1), "time_sec": dt, **val_metrics}
    history.append(row)

    print(
        f"[Graph3-STABLE] Epoch {epoch:02d} | loss={row['loss']:.4f} | time={dt:.1f}s | "
        f"Hit@10={val_metrics['Hit@10']:.4f} NDCG@10={val_metrics['NDCG@10']:.4f} | "
        f"Hit@50={val_metrics['Hit@50']:.4f} NDCG@50={val_metrics['NDCG@50']:.4f}"
    )

    ndcg10 = float(val_metrics["NDCG@10"])
    if ndcg10 > best_ndcg10 + MIN_DELTA:
        best_ndcg10 = ndcg10
        best_epoch = epoch
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        pat = 0
        print(f"New best NDCG@10={best_ndcg10:.5f} at epoch {best_epoch}")
    else:
        pat += 1
        print(f"  patience={pat}/{PATIENCE} (best {best_ndcg10:.5f} @ epoch {best_epoch})")
        if pat >= PATIENCE:
            print("Early stopping triggered.")
            break

# restore best
if best_state is not None:
    model.load_state_dict(best_state)

hist_df = pd.DataFrame(history)
display(hist_df.tail(10))
print(f"\nBest epoch: {best_epoch} | best val NDCG@10: {best_ndcg10:.6f}")

# ---- TEST
model.eval()
with torch.no_grad():
    all_emb_test = model.propagate(A_norm)

test_metrics = eval_ranking(
    all_emb_test, test_gt, train_pos,
    K_list=(10, 20, 50),
    user_batch_size=USER_BATCH_EVAL,
)

print("\nTEST metrics (Graph3-STABLE, best epoch):")
for k, v in test_metrics.items():
    print(f"  {k}: {v:.6f}")

Training config: {'EMB_DIM': 64, 'NUM_LAYERS': 3, 'LR': 0.002, 'WEIGHT_DECAY': 0.0, 'BATCH_SIZE': 4096, 'ACC_STEPS': 16, 'EPOCHS': 40, 'USER_BATCH_EVAL': 1024, 'PATIENCE': 6, 'MIN_DELTA': 0.0001}
[Graph3-STABLE] Epoch 01 | loss=0.5915 | time=343.6s | Hit@10=0.0457 NDCG@10=0.0257 | Hit@50=0.1290 NDCG@50=0.0437
New best NDCG@10=0.02571 at epoch 1
[Graph3-STABLE] Epoch 02 | loss=0.4094 | time=343.8s | Hit@10=0.0484 NDCG@10=0.0268 | Hit@50=0.1324 NDCG@50=0.0449
New best NDCG@10=0.02684 at epoch 2
[Graph3-STABLE] Epoch 03 | loss=0.3775 | time=341.9s | Hit@10=0.0530 NDCG@10=0.0289 | Hit@50=0.1396 NDCG@50=0.0475
New best NDCG@10=0.02891 at epoch 3
[Graph3-STABLE] Epoch 04 | loss=0.3455 | time=341.6s | Hit@10=0.0566 NDCG@10=0.0306 | Hit@50=0.1474 NDCG@50=0.0500
New best NDCG@10=0.03058 at epoch 4
[Graph3-STABLE] Epoch 05 | loss=0.3191 | time=337.7s | Hit@10=0.0593 NDCG@10=0.0319 | Hit@50=0.1539 NDCG@50=0.0521
New best NDCG@10=0.03190 at epoch 5
[Graph3-STABLE] Epoch 06 | loss=0.2958 | time=338

Unnamed: 0,epoch,loss,time_sec,Hit@10,Hit@20,Hit@50,NDCG@10,NDCG@20,NDCG@50
30,31,0.143408,343.682516,0.091726,0.142908,0.243099,0.049546,0.062406,0.082171
31,32,0.141633,344.392361,0.0924,0.143376,0.244372,0.049864,0.062682,0.082625
32,33,0.139377,345.608465,0.093505,0.144519,0.245889,0.050496,0.063297,0.083307
33,34,0.137745,343.653309,0.093955,0.145511,0.247013,0.050797,0.063747,0.083777
34,35,0.136299,346.263726,0.094517,0.146391,0.248118,0.051158,0.064186,0.084266
35,36,0.134399,347.004332,0.095078,0.147084,0.249766,0.051622,0.064697,0.084963
36,37,0.13331,347.347634,0.095715,0.14832,0.250197,0.051966,0.065179,0.085298
37,38,0.131222,348.231507,0.096689,0.150193,0.251582,0.05231,0.065714,0.085714
38,39,0.130075,346.39015,0.097644,0.150305,0.253399,0.052674,0.065872,0.086215
39,40,0.12839,345.166613,0.098094,0.150998,0.25471,0.053054,0.066328,0.0868



Best epoch: 40 | best val NDCG@10: 0.053054

TEST metrics (Graph3-STABLE, best epoch):
  Hit@10: 0.097213
  Hit@20: 0.147496
  Hit@50: 0.254785
  NDCG@10: 0.052499
  NDCG@20: 0.065097
  NDCG@50: 0.086235
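
The gradient-accumulation schedule used in training can be checked with a few lines of arithmetic. The constants below are taken from the training config and the `train_ui` shape printed earlier; the variable names `optimizer_steps`, `last_window` and `effective_batch` are introduced here only for illustration.

```python
import math

n_train = 4_926_384   # rows in train_ui (Cell 3 output)
BATCH_SIZE = 4096
ACC_STEPS = 16

n_batches = math.ceil(n_train / BATCH_SIZE)             # micro-batches per epoch
optimizer_steps = math.ceil(n_batches / ACC_STEPS)      # opt.step() calls per epoch
last_window = n_batches - (optimizer_steps - 1) * ACC_STEPS
effective_batch = BATCH_SIZE * ACC_STEPS

print(n_batches, optimizer_steps, last_window, effective_batch)
```

Two consequences follow. Each micro-batch runs a full graph propagation, so an epoch performs 1203 propagations, which is consistent with the roughly 344 s per epoch in the log (about 0.29 s per micro-batch). And since every micro-batch loss is divided by `ACC_STEPS`, the final optimizer step of an epoch accumulates only 3 of 16 micro-batches and therefore applies a proportionally smaller gradient.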


## Graph3: Enriched Graph Recommender (LightGCN)

**Goal.**
Move beyond a pure user–book bipartite graph and test whether adding structured content and metadata relations improves ranking quality for book recommendation on GoodBooks-10k.

**Graph construction (Graph3).**
We built a single unified node index space and added multiple edge types:

- user ↔ book: train interactions only; val/test edges are excluded to avoid leakage
- book ↔ tag: weighted by `log1p(count)`, with tags filtered by `MIN_BOOK_FREQ` and capped at `TOP_TAGS_PER_BOOK` per book
- book ↔ book similarity: cosine similarity over TF-IDF vectors of title + authors, keeping up to `TOPK_SIM` neighbors per book with similarity of at least `MIN_SIM`
- book ↔ author: author nodes extracted from book metadata
- book ↔ language: `language_code` nodes
- book ↔ year_bin: binned publication-year nodes

All edges are added in both directions and combined into a single weighted sparse adjacency matrix, which is then symmetrically normalized as D^{-1/2} A D^{-1/2} (`A_norm`) for LightGCN propagation.
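
As a worked example of the symmetric normalization, consider a three-node path graph with made-up weights. Each entry is divided by the square roots of the weighted degrees of its two endpoints, so an edge into a high-degree node is damped on both sides. The sketch uses dense NumPy arrays for readability; the notebook itself works with sparse tensors.

```python
import numpy as np

# Toy weighted, undirected graph: 0 --(2.0)-- 1 --(1.0)-- 2
adj = np.array([[0.0, 2.0, 0.0],
                [2.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])

deg = adj.sum(axis=1)                                   # weighted degrees: [2, 3, 1]
d_inv_sqrt = 1.0 / np.sqrt(np.clip(deg, 1e-12, None))   # clamp guards isolated nodes

# D^{-1/2} A D^{-1/2}, written with broadcasting
adj_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
print(adj_norm.round(4))
```

The entry for edge (0, 1) is 2 / sqrt(2 · 3) and the entry for edge (1, 2) is 1 / sqrt(3 · 1); the result stays symmetric.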

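**Model.**
We trained a LightGCN encoder (3 propagation layers, 64-dimensional embeddings) on the enriched graph using BPR loss with uniformly sampled negatives, rejecting items the user has already interacted with in train. Scoring uses only the user and book embeddings; the other node types influence the result through propagation. Validation and test use leave-one-out ranking over the full catalog with train positives masked, reported as Hit@K and NDCG@K.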
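
With a single held-out item per user, both metrics reduce to simple functions of that item's rank: Hit@K is 1 if the item appears in the top K, and NDCG@K is 1 / log2(rank + 2) for a 0-based rank inside the top K, otherwise 0. The helper below is a hypothetical single-user sketch with invented scores, not the batched evaluation used in the notebook.

```python
import numpy as np

def loo_metrics(scores, true_item, seen, k):
    """Hit@k and NDCG@k for one user with one held-out item (hypothetical helper)."""
    s = scores.astype(float).copy()
    if seen:
        s[list(seen)] = -np.inf          # mask items already seen in train
    topk = np.argsort(-s, kind="stable")[:k]
    pos = np.where(topk == true_item)[0]
    if len(pos) == 0:
        return 0.0, 0.0
    return 1.0, 1.0 / np.log2(pos[0] + 2.0)

# Five items; item 0 was seen in train, item 3 is the held-out positive
scores = np.array([0.9, 0.8, 0.1, 0.7, 0.3])
hit2, ndcg2 = loo_metrics(scores, true_item=3, seen={0}, k=2)
hit1, ndcg1 = loo_metrics(scores, true_item=3, seen={0}, k=1)
print(hit2, ndcg2, hit1, ndcg1)
```

After masking item 0, the held-out item sits at 0-based rank 1, so it counts as a hit at K=2 with NDCG 1 / log2(3), and as a miss at K=1.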

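**Key results.**
Training was stable and validation metrics improved steadily up to epoch 40. Because the best epoch is also the last one (`EPOCHS = 40`) and early stopping never triggered, the model had likely not converged, so these numbers probably underestimate what this configuration can reach.

Best validation (epoch 40):

- Hit@10 = 0.0981
- NDCG@10 = 0.05305
- Hit@50 = 0.2547
- NDCG@50 = 0.0868

Test (best epoch):

- Hit@10 = 0.09721
- NDCG@10 = 0.05250
- Hit@20 = 0.14750
- NDCG@20 = 0.06510
- Hit@50 = 0.25479
- NDCG@50 = 0.08624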

**Takeaways.**

- Adding heterogeneous, meaningful relations (tags, metadata, similarity) improves ranking quality compared to the simpler graphs tested earlier.
- Validation and test metrics are close (NDCG@10 of 0.05305 vs. 0.05250), which suggests the gains generalize rather than reflecting overfitting to the validation split.
- Even with a simple homogeneous LightGCN, the enriched graph provides a strong boost. This suggests the next step is a heterogeneous GNN that can treat edge types differently, rather than mixing all relations through one normalized adjacency.

**Next steps.**

- Run ablation studies (remove one relation group at a time) to quantify which edges drive the improvement (tags vs. authors vs. similarity vs. language/year).
- Upgrade the model family to heterogeneous architectures (e.g., R-GCN / HGT / HAN / HeteroConv) to leverage edge-type structure more effectively.
- Improve the content layer by incorporating stronger text embeddings (e.g., SBERT), either as node features or as similarity edges.
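
For the ablation studies, one simple approach is to tag every edge with a relation id and mask out a relation group before rebuilding the normalized adjacency. The sketch below assumes edge arrays shaped like the `edge_index`, `edge_w`, `edge_type` and `rel2id` produced in the next cell; `drop_relations` is a hypothetical helper and the toy graph is invented.

```python
import numpy as np

def drop_relations(edge_index, edge_w, edge_type, rel2id, drop):
    """Return edge arrays without the relations named in `drop` (hypothetical helper)."""
    drop_ids = np.array([rel2id[name] for name in drop], dtype=edge_type.dtype)
    keep = ~np.isin(edge_type, drop_ids)
    return edge_index[:, keep], edge_w[keep], edge_type[keep]

# Toy graph: two user-book pairs and one book-tag pair, each stored in both directions
rel2id = {"user_book": 0, "book_user": 1, "book_tag": 2, "tag_book": 3}
edge_index = np.array([[0, 2, 2, 4, 1, 2],
                       [2, 0, 4, 2, 2, 1]])
edge_w = np.array([1.0, 1.0, 0.7, 0.7, 1.0, 1.0], dtype=np.float32)
edge_type = np.array([0, 1, 2, 3, 0, 1], dtype=np.int16)

# Ablate the tag relation in both directions
ei, ew, et = drop_relations(edge_index, edge_w, edge_type, rel2id,
                            drop=["book_tag", "tag_book"])
print(ei.shape[1], sorted(set(et.tolist())))
```

Both directions of a relation should be dropped together, and the normalized adjacency has to be rebuilt afterwards because node degrees change.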

In [28]:
# ============================
# Cell: Rebuild Graph3 edges WITH relation types (edge_type/rel2id)
# - produces: edge_index, edge_w, edge_type, rel2id
# - optionally rebuilds: A_norm
# ============================

import json
import numpy as np
import pandas as pd
import time
import torch

from pathlib import Path

# ----------------------------
# Helpers
# ----------------------------
def ensure_tensor(x, dtype, device="cpu"):
    if isinstance(x, torch.Tensor):
        return x.to(device=device, dtype=dtype)
    return torch.tensor(x, dtype=dtype, device=device)

def add_edges(store, src, dst, w, rel_name, rel2id):
    """
    store: dict with lists
    src/dst: 1D Long tensor or array-like
    w: 1D float tensor or array-like (same length)
    rel_name: str
    """
    if rel_name not in rel2id:
        rel2id[rel_name] = len(rel2id)
    rid = rel2id[rel_name]

    src = ensure_tensor(src, torch.long, device="cpu").view(-1)
    dst = ensure_tensor(dst, torch.long, device="cpu").view(-1)
    w   = ensure_tensor(w, torch.float32, device="cpu").view(-1)

    assert src.numel() == dst.numel() == w.numel(), f"{rel_name}: src/dst/w size mismatch"

    store["src"].append(src)
    store["dst"].append(dst)
    store["w"].append(w)
    store["t"].append(torch.full((src.numel(),), rid, dtype=torch.int16))

def build_sparse_norm(edge_index, edge_w, num_nodes):
    """
    Weighted LightGCN normalization: D^{-1/2} A D^{-1/2}
    where A is undirected (we already add both directions)
    """
    # COO
    row = edge_index[0]
    col = edge_index[1]
    val = edge_w

    # degree = sum of weights on outgoing edges
    deg = torch.zeros(num_nodes, dtype=torch.float32)
    deg.scatter_add_(0, row, val)
    deg = torch.clamp(deg, min=1e-12)

    inv_sqrt = torch.pow(deg, -0.5)
    norm_val = inv_sqrt[row] * val * inv_sqrt[col]

    A = torch.sparse_coo_tensor(
        indices=edge_index,
        values=norm_val,
        size=(num_nodes, num_nodes),
        dtype=torch.float32
    ).coalesce()
    return A

# ----------------------------
# Preconditions
# ----------------------------
need = ["U","B","T","num_nodes",
        "user_offset","book_offset","tag_offset","author_offset","lang_offset","year_offset",
        "train_ui","books_df","bt",
        "author2idx","lang2idx","year2idx"]
missing = [v for v in need if v not in globals()]
if missing:
    raise NameError(f"Missing required variables for rebuild: {missing}")

U_cnt, B_cnt, T_cnt = int(U), int(B), int(T)
num_nodes_cnt = int(num_nodes)

print("Rebuild target:",
      f"U={U_cnt} B={B_cnt} T={T_cnt} num_nodes={num_nodes_cnt}")
print("Offsets:",
      {"user": int(user_offset), "book": int(book_offset), "tag": int(tag_offset),
       "author": int(author_offset), "lang": int(lang_offset), "year": int(year_offset)})

# ----------------------------
# Start building
# ----------------------------
rel2id = {}
store = {"src": [], "dst": [], "w": [], "t": []}

# ---- 1) user-book (TRAIN only), weight=1, bidirectional
train_u = train_ui[:, 0].astype(np.int64)
train_b = train_ui[:, 1].astype(np.int64)

# Distinct names keep the Cell 7 similarity arrays `src`/`dst` from being overwritten
src_ub = torch.from_numpy(train_u) + int(user_offset)
dst_ub = torch.from_numpy(train_b) + int(book_offset)
w1     = torch.ones(src_ub.numel(), dtype=torch.float32)

add_edges(store, src_ub, dst_ub, w1, "user_book", rel2id)
add_edges(store, dst_ub, src_ub, w1, "book_user", rel2id)
print("[OK] added user-book edges")

# ---- 2) book-tag edges (from bt), bidirectional
# bt must contain: book_idx, tag_idx, w
assert {"book_idx","tag_idx","w"}.issubset(bt.columns), f"bt columns missing, got {bt.columns.tolist()}"
bt_books = bt["book_idx"].to_numpy(dtype=np.int64)
bt_tags  = bt["tag_idx"].to_numpy(dtype=np.int64)
bt_w     = bt["w"].to_numpy(dtype=np.float32)

src_bt = torch.from_numpy(bt_books) + int(book_offset)
dst_bt = torch.from_numpy(bt_tags) + int(tag_offset)
w_bt   = torch.from_numpy(bt_w)

add_edges(store, src_bt, dst_bt, w_bt, "book_tag", rel2id)
add_edges(store, dst_bt, src_bt, w_bt, "tag_book", rel2id)
print("[OK] added book-tag edges")

# ---- 3) book-book similarity edges (optional)
# Prefer using existing sim dataframe if present, else recompute
sim_df = None
for cand in ["sim_df", "bb_sim_df", "book_book_sim_df", "book_sim_df"]:
    if cand in globals():
        sim_df = globals()[cand]
        break

if sim_df is not None:
    # expected columns: i (book_idx), j (book_idx), sim (float)
    cols = set(sim_df.columns)
    if {"i","j","sim"}.issubset(cols):
        i = sim_df["i"].to_numpy(np.int64)
        j = sim_df["j"].to_numpy(np.int64)
        s = sim_df["sim"].to_numpy(np.float32)
    elif {"book_i","book_j","sim"}.issubset(cols):
        i = sim_df["book_i"].to_numpy(np.int64)
        j = sim_df["book_j"].to_numpy(np.int64)
        s = sim_df["sim"].to_numpy(np.float32)
    else:
        raise ValueError(f"Unknown sim_df schema: {sim_df.columns.tolist()}")

    src_bb = torch.from_numpy(i) + int(book_offset)
    dst_bb = torch.from_numpy(j) + int(book_offset)
    w_bb   = torch.from_numpy(s)

    add_edges(store, src_bb, dst_bb, w_bb, "book_book_sim", rel2id)
    add_edges(store, dst_bb, src_bb, w_bb, "book_book_sim", rel2id)
    print("[OK] added book-book sim edges from existing sim_df:", len(s))
else:
    # Fallback: recompute TF-IDF cosine topk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    TOPK_SIM = int(globals().get("TOPK_SIM", 30))
    MIN_SIM  = float(globals().get("MIN_SIM", 0.2))
    TFIDF_MIN_DF = int(globals().get("TFIDF_MIN_DF", 2))
    TFIDF_NGRAMS = tuple(globals().get("TFIDF_NGRAMS", (1,2)))

    texts = books_df["text"].fillna("").astype(str).tolist()
    tfidf = TfidfVectorizer(min_df=TFIDF_MIN_DF, ngram_range=TFIDF_NGRAMS)
    X = tfidf.fit_transform(texts)

    # cosine sim row-by-row topk 
    rows_i, rows_j, sims = [], [], []
    for i in range(X.shape[0]):
        # sim to all
        sim = cosine_similarity(X[i], X).ravel()
        sim[i] = 0.0
        # topk (clamp k: np.argpartition requires kth < len(sim))
        k = min(TOPK_SIM, sim.shape[0] - 1)
        idx = np.argpartition(-sim, k)[:k]
        idx = idx[sim[idx] >= MIN_SIM]
        for j in idx:
            rows_i.append(i); rows_j.append(int(j)); sims.append(float(sim[j]))

    i = np.array(rows_i, dtype=np.int64)
    j = np.array(rows_j, dtype=np.int64)
    s = np.array(sims, dtype=np.float32)

    src_bb = torch.from_numpy(i) + int(book_offset)
    dst_bb = torch.from_numpy(j) + int(book_offset)
    w_bb   = torch.from_numpy(s)

    add_edges(store, src_bb, dst_bb, w_bb, "book_book_sim", rel2id)
    add_edges(store, dst_bb, src_bb, w_bb, "book_book_sim", rel2id)
    print("[OK] added book-book sim edges via TF-IDF:", len(s))

# ---- 4) book-author edges (bidirectional)
# books_df must contain 'authors' and 'book_idx'
if "authors" not in books_df.columns:
    # a missing column is a key problem, not an undefined variable
    raise KeyError("books_df must have 'authors' column for book-author edges")

def parse_authors(x):
    if pd.isna(x):
        return []
    # goodbooks: "Suzanne Collins" (sometimes multiple authors separated by ',')
    parts = [p.strip() for p in str(x).split(",")]
    return [p for p in parts if p]

ba_book = []
ba_auth = []
for _, r in books_df[["book_idx","authors"]].iterrows():
    bidx = int(r["book_idx"])
    for a in parse_authors(r["authors"]):
        if a in author2idx:
            ba_book.append(bidx)
            ba_auth.append(int(author2idx[a]))

ba_book = np.array(ba_book, dtype=np.int64)
ba_auth = np.array(ba_auth, dtype=np.int64)

src_ba = torch.from_numpy(ba_book) + int(book_offset)
dst_ba = torch.from_numpy(ba_auth) + int(author_offset)
w_ba   = torch.ones(src_ba.numel(), dtype=torch.float32)

add_edges(store, src_ba, dst_ba, w_ba, "book_author", rel2id)
add_edges(store, dst_ba, src_ba, w_ba, "author_book", rel2id)
print("[OK] added book-author edges:", len(ba_book))

# ---- 5) book-lang edges (bidirectional)
if "language_code" not in books_df.columns:
    # a missing column is a key problem, not an undefined variable
    raise KeyError("books_df must have 'language_code' column for book-lang edges")

bl_book = []
bl_lang = []
for _, r in books_df[["book_idx","language_code"]].iterrows():
    bidx = int(r["book_idx"])
    lang = r["language_code"]
    if pd.isna(lang): 
        continue
    lang = str(lang)
    if lang in lang2idx:
        bl_book.append(bidx)
        bl_lang.append(int(lang2idx[lang]))

bl_book = np.array(bl_book, dtype=np.int64)
bl_lang = np.array(bl_lang, dtype=np.int64)

src_bl = torch.from_numpy(bl_book) + int(book_offset)
dst_bl = torch.from_numpy(bl_lang) + int(lang_offset)
w_bl   = torch.ones(src_bl.numel(), dtype=torch.float32)

add_edges(store, src_bl, dst_bl, w_bl, "book_lang", rel2id)
add_edges(store, dst_bl, src_bl, w_bl, "lang_book", rel2id)
print("[OK] added book-lang edges:", len(bl_book))

# ---- 6) book-yearbin edges (bidirectional)
if "year_bin" not in books_df.columns:
    # a missing column is a key problem, not an undefined variable
    raise KeyError("books_df must have 'year_bin' column for book-year edges")

by_book = []
by_year = []
for _, r in books_df[["book_idx","year_bin"]].iterrows():
    bidx = int(r["book_idx"])
    yb = r["year_bin"]
    if pd.isna(yb):
        continue
    yb = str(yb)
    if yb in year2idx:
        by_book.append(bidx)
        by_year.append(int(year2idx[yb]))

by_book = np.array(by_book, dtype=np.int64)
by_year = np.array(by_year, dtype=np.int64)

src_by = torch.from_numpy(by_book) + int(book_offset)
dst_by = torch.from_numpy(by_year) + int(year_offset)
w_by   = torch.ones(src_by.numel(), dtype=torch.float32)

add_edges(store, src_by, dst_by, w_by, "book_year", rel2id)
add_edges(store, dst_by, src_by, w_by, "year_book", rel2id)
print("[OK] added book-year edges:", len(by_book))

# ----------------------------
# Final concat
# ----------------------------
edge_src = torch.cat(store["src"], dim=0)
edge_dst = torch.cat(store["dst"], dim=0)
edge_w   = torch.cat(store["w"], dim=0).float()
edge_type= torch.cat(store["t"], dim=0)

edge_index = torch.stack([edge_src, edge_dst], dim=0).long()

print("\nRebuilt:")
print(" edge_index:", tuple(edge_index.shape))
print(" edge_w:", tuple(edge_w.shape), "| finite:", bool(torch.isfinite(edge_w).all().item()))
print(" edge_type:", tuple(edge_type.shape), "| rels:", rel2id)

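# sanity: node ids in range, per-edge arrays aligned, weights finite
assert int(edge_index.min()) >= 0, "negative node id in edge_index"
assert int(edge_index.max()) < int(num_nodes_cnt), "node id exceeds num_nodes_cnt"
assert edge_w.numel() == edge_type.numel() == edge_index.shape[1], "edge arrays differ in length"
assert bool(torch.isfinite(edge_w).all()), "edge_w contains NaN or inf"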

# ----------------------------
# (Optional) rebuild A_norm for consistency in bundle
# ----------------------------
REBUILD_A_NORM = True
if REBUILD_A_NORM:
    A_norm = build_sparse_norm(edge_index, edge_w, num_nodes_cnt)
    print(" A_norm:", tuple(A_norm.shape), "nnz:", int(A_norm._nnz()))

print("\n[OK] edge_type + rel2id are ready ✔")

Rebuild target: U=53398 B=9999 T=5014 num_nodes=74285
Offsets: {'user': 0, 'book': 53398, 'tag': 63397, 'author': 68411, 'lang': 74252, 'year': 74278}
[OK] added user-book edges
[OK] added book-tag edges
[OK] added book-book sim edges via TF-IDF: 265581
[OK] added book-author edges: 13215
[OK] added book-lang edges: 9999
[OK] added book-year edges: 9999

Rebuilt:
 edge_index: (2, 11450076)
 edge_w: (11450076,) | finite: True
 edge_type: (11450076,) | rels: {'user_book': 0, 'book_user': 1, 'book_tag': 2, 'tag_book': 3, 'book_book_sim': 4, 'book_author': 5, 'author_book': 6, 'book_lang': 7, 'lang_book': 8, 'book_year': 9, 'year_book': 10}
 A_norm: (74285, 74285) nnz: 11260518

[OK] edge_type + rel2id are ready ✔
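
The output above reports `E = 11450076` raw edges but only `nnz = 11260518` entries in `A_norm`, a gap of 189558. A plausible explanation (an assumption, since `build_sparse_norm` is defined elsewhere in the notebook) is that duplicate `(src, dst)` pairs are merged when the sparse matrix is built. Similarity edges are added in both directions, so two books that appear in each other's top-k lists are emitted twice. The sketch below uses a hypothetical NumPy helper, `sym_norm_edges`, to show duplicate merging followed by symmetric degree normalization on a toy graph; it is not the notebook's actual implementation.

```python
import numpy as np

def sym_norm_edges(src, dst, w, num_nodes):
    """Hypothetical helper: merge duplicate (src, dst) pairs by summing their
    weights, then scale each merged edge by 1 / sqrt(deg[src] * deg[dst])."""
    # encode each (src, dst) pair as one integer so duplicates can be grouped
    keys = src.astype(np.int64) * num_nodes + dst.astype(np.int64)
    uniq, inverse = np.unique(keys, return_inverse=True)

    merged_w = np.zeros(uniq.shape[0], dtype=np.float64)
    np.add.at(merged_w, inverse, w)            # sum weights of duplicate edges

    m_src, m_dst = uniq // num_nodes, uniq % num_nodes
    deg = np.zeros(num_nodes, dtype=np.float64)
    np.add.at(deg, m_src, merged_w)            # weighted out-degree per node
    deg = np.maximum(deg, 1e-12)               # guard isolated nodes

    norm_w = merged_w / np.sqrt(deg[m_src] * deg[m_dst])
    return m_src, m_dst, norm_w

# toy graph with 3 nodes; the pair 0<->1 is listed twice in each direction
toy_src = np.array([0, 1, 0, 1, 1, 2])
toy_dst = np.array([1, 0, 1, 0, 2, 1])
toy_w = np.array([0.5, 0.5, 0.5, 0.5, 1.0, 1.0])

m_src, m_dst, norm_w = sym_norm_edges(toy_src, toy_dst, toy_w, num_nodes=3)
print("E:", toy_src.shape[0], "nnz:", norm_w.shape[0])
```

On the toy graph six raw edges collapse into four stored entries, which is the same kind of gap seen between `edge_index` and `A_norm`. If the raw and merged counts must match, deduplicating the similarity pairs before calling `add_edges` would be one option.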


In [29]:
# ============================
# SAVE Graph3 bundle (final) — with edge_type + rel2id
# ============================

# ---------- Paths ----------
PROJECT_ROOT = Path(r"D:/ML/GNN/graph_recsys")
DATA_PROCESSED = PROJECT_ROOT / "data_processed" / "v2_proper"
ARTIFACTS = PROJECT_ROOT / "artifacts" / "v2_proper"

BUNDLE_DIR = ARTIFACTS / "graph3_bundle"
MAP_DIR = BUNDLE_DIR / "mappings"
BUNDLE_DIR.mkdir(parents=True, exist_ok=True)
MAP_DIR.mkdir(parents=True, exist_ok=True)

print("Saving to:", BUNDLE_DIR)

# ---------- Guardrails: required objects ----------
required = [
    "A_norm", "edge_index", "edge_w", "edge_type", "rel2id",
    "train_ui", "val_ui", "test_ui",
    "user2idx", "book2idx",
    "U", "B", "T",
    "user_offset", "book_offset", "tag_offset",
    "author_offset", "lang_offset", "year_offset",
    "tag2idx", "author2idx", "lang2idx", "year2idx",
    "num_nodes",
]
missing = [k for k in required if k not in globals()]
if missing:
    raise NameError(f"Missing variables in notebook scope: {missing}")

# ---------- Counts ----------
U_cnt = int(U)
B_cnt = int(B)
T_cnt = int(T)
A_cnt = int(len(author2idx))
L_cnt = int(len(lang2idx))
Y_cnt = int(len(year2idx))

# ---------- 1) Save graph tensors (CPU) ----------
A_norm_cpu = A_norm.coalesce().to("cpu")

edge_index_cpu = edge_index.to("cpu")
edge_w_cpu = edge_w.to("cpu")
edge_type_cpu = edge_type.to("cpu")

graph_state = {
    # sparse normalized adjacency for full-batch LightGCN
    "A_norm": A_norm_cpu,                  # torch sparse COO [num_nodes, num_nodes]
    "num_nodes": int(num_nodes),

    # counts (per node type in unified index space)
    "U": U_cnt,
    "B": B_cnt,
    "T": T_cnt,
    "A_authors": A_cnt,
    "L": L_cnt,
    "Y": Y_cnt,

    # offsets
    "offsets": {
        "user_offset": int(user_offset),
        "book_offset": int(book_offset),
        "tag_offset": int(tag_offset),
        "author_offset": int(author_offset),
        "lang_offset": int(lang_offset),
        "year_offset": int(year_offset),
    },

    # vocabularies (node-local ids for each type)
    "tag2idx": tag2idx,
    "author2idx": author2idx,
    "lang2idx": lang2idx,
    "year2idx": year2idx,

    # raw edges (useful for hetero models / typed message passing)
    "edge_index": edge_index_cpu,          # [2, E]
    "edge_w": edge_w_cpu,                  # [E]
    "edge_type": edge_type_cpu,            # [E] int rel ids
    "rel2id": rel2id,                      # dict[str,int]
}

torch.save(graph_state, BUNDLE_DIR / "graph3_state.pt")
print("[OK] saved:", BUNDLE_DIR / "graph3_state.pt")

# ---------- 2) Save splits ----------
np.savez_compressed(
    BUNDLE_DIR / "splits_ui.npz",
    train_ui=train_ui.astype(np.int64),
    val_ui=val_ui.astype(np.int64),
    test_ui=test_ui.astype(np.int64),
    U=U_cnt,
    B=B_cnt,
)
print("[OK] saved:", BUNDLE_DIR / "splits_ui.npz")

# ---------- 3) Save mappings (dict -> Series CSV) ----------
pd.Series(user2idx).to_csv(MAP_DIR / "user2idx.csv", header=False)
pd.Series(book2idx).to_csv(MAP_DIR / "book2idx.csv", header=False)

idx2user = {v: k for k, v in user2idx.items()}
idx2book = {v: k for k, v in book2idx.items()}
pd.Series(idx2user).to_csv(MAP_DIR / "idx2user.csv", header=False)
pd.Series(idx2book).to_csv(MAP_DIR / "idx2book.csv", header=False)
print("[OK] saved mappings:", MAP_DIR)

# ---------- 4) Save meta/config ----------
# Hyperparameters are read via globals() because some cells may have been skipped;
# -1 marks a value that was never set in this session.
build_config = dict(
    graph_name="Graph3",
    notes=(
        "Unified graph: user-book(train only) + book-tag(log1p count) + "
        "book-book tfidf cosine + book-author + book-lang + book-year_bin"
    ),
    MIN_BOOK_FREQ=int(globals().get("MIN_BOOK_FREQ", -1)),
    TOP_TAGS_PER_BOOK=int(globals().get("TOP_TAGS_PER_BOOK", -1)),
    TOPK_SIM=int(globals().get("TOPK_SIM", -1)),
    MIN_SIM=float(globals().get("MIN_SIM", -1.0)),
    tfidf_min_df=int(globals().get("TFIDF_MIN_DF", 2)),
    tfidf_ngrams=list(globals().get("TFIDF_NGRAMS", [1, 2])),
)

meta = dict(
    created_at_utc=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    counts=dict(U=U_cnt, B=B_cnt, T=T_cnt, A_authors=A_cnt, L=L_cnt, Y=Y_cnt),
    num_nodes=int(num_nodes),
    nnz=int(A_norm_cpu._nnz()),
    E=int(edge_index_cpu.shape[1]),
    rels=int(len(rel2id)),
    train_edges=int(train_ui.shape[0]),
    val_edges=int(val_ui.shape[0]),
    test_edges=int(test_ui.shape[0]),
    edge_index_shape=list(edge_index_cpu.shape),
    edge_w_shape=list(edge_w_cpu.shape),
    edge_type_shape=list(edge_type_cpu.shape),
    offsets=graph_state["offsets"],
)

(BUNDLE_DIR / "build_config.json").write_text(
    json.dumps(build_config, ensure_ascii=False, indent=2), encoding="utf-8"
)
(BUNDLE_DIR / "meta.json").write_text(
    json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
)

print("[OK] saved meta/config:", BUNDLE_DIR)
print("Bundle ready ✅", BUNDLE_DIR)

Saving to: D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle
[OK] saved: D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle\graph3_state.pt
[OK] saved: D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle\splits_ui.npz
[OK] saved mappings: D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle\mappings
[OK] saved meta/config: D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle
Bundle ready ✅ D:\ML\GNN\graph_recsys\artifacts\v2_proper\graph3_bundle
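
The mappings are written as header-less two-column CSV files, so both keys and values come back as strings unless they are cast on load. A minimal sketch of reading one back, assuming the raw ids are integers; `load_mapping` is a hypothetical helper, and the toy file is written with the standard `csv` module only to mimic the `key,value` layout produced by `Series.to_csv(header=False)`:

```python
import csv
import tempfile
from pathlib import Path

def load_mapping(path, key_type=int, value_type=int):
    """Hypothetical helper: read a header-less key,value CSV into a dict,
    casting both columns explicitly because CSV stores everything as text."""
    with open(path, newline="", encoding="utf-8") as f:
        return {key_type(k): value_type(v) for k, v in csv.reader(f)}

# toy raw-id -> contiguous-index mapping (values are illustrative only)
toy_user2idx = {314: 0, 2718: 1, 4242: 2}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "user2idx.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(toy_user2idx.items())
    restored = load_mapping(path)

print(restored == toy_user2idx)
```

For mappings keyed by strings (author names, language codes), passing `key_type=str` keeps the keys unchanged.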


In [30]:
bundle_dir = Path(r"D:/ML/GNN/graph_recsys/artifacts/v2_proper/graph3_bundle")
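# weights_only=False is set explicitly: the bundle is a trusted, locally written file,
# and stating the flag avoids the FutureWarning about the changing default
g = torch.load(bundle_dir / "graph3_state.pt", map_location="cpu", weights_only=False)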
z = np.load(bundle_dir / "splits_ui.npz")

print("A_norm:", g["A_norm"].shape, "nnz:", g["A_norm"]._nnz())
print("edge_index:", g["edge_index"].shape, "edge_w:", g["edge_w"].shape, "edge_type:", g["edge_type"].shape)
print("rels:", g["rel2id"])
print("U,B:", z["U"], z["B"], "train:", z["train_ui"].shape, "val:", z["val_ui"].shape, "test:", z["test_ui"].shape)

# sanity
assert g["A_norm"].shape[0] == g["A_norm"].shape[1] == g["num_nodes"]
assert g["edge_index"].shape[1] == g["edge_w"].numel() == g["edge_type"].numel()
assert int(g["edge_index"].max()) < int(g["num_nodes"])
print("[OK] bundle sanity passed ✅")

A_norm: torch.Size([74285, 74285]) nnz: 11260518
edge_index: torch.Size([2, 11450076]) edge_w: torch.Size([11450076]) edge_type: torch.Size([11450076])
rels: {'user_book': 0, 'book_user': 1, 'book_tag': 2, 'tag_book': 3, 'book_book_sim': 4, 'book_author': 5, 'author_book': 6, 'book_lang': 7, 'lang_book': 8, 'book_year': 9, 'year_book': 10}
U,B: 53398 9999 train: (4926384, 2) val: (53398, 2) test: (53398, 2)
[OK] bundle sanity passed ✅
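
Because the bundle stores `edge_type` alongside `rel2id`, a single relation can be pulled out of the unified edge list with a boolean mask. The sketch below shows this on toy NumPy arrays; `edges_for_relation` is a hypothetical helper, and the same indexing pattern should apply to the `torch` tensors loaded above.

```python
import numpy as np

def edges_for_relation(edge_index, edge_w, edge_type, rel2id, rel_name):
    """Hypothetical helper: return the [2, E_r] edges and weights of one relation."""
    mask = edge_type == rel2id[rel_name]
    return edge_index[:, mask], edge_w[mask]

# toy unified graph: node 0 is a user, nodes 3 and 4 are books, node 5 is a tag
toy_rel2id = {"user_book": 0, "book_user": 1, "book_tag": 2}
toy_edge_index = np.array([[0, 3, 3, 4],
                           [3, 0, 5, 5]])
toy_edge_w = np.array([1.0, 1.0, 0.7, 0.3], dtype=np.float32)
toy_edge_type = np.array([0, 1, 2, 2])

bt_index, bt_w = edges_for_relation(
    toy_edge_index, toy_edge_w, toy_edge_type, toy_rel2id, "book_tag"
)

# per-relation edge counts, ordered by relation id
rel_counts = np.bincount(toy_edge_type, minlength=len(toy_rel2id))
print(bt_index.tolist(), rel_counts.tolist())
```

Per-relation counts like `rel_counts` are a quick way to check which relations dominate the graph before trying typed message passing.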
