# Graph-Based Recommender System — LightGCN Baseline

In this notebook, we build and evaluate a **proper graph-based recommender system baseline**
using the **GoodBooks-10k** dataset and a **LightGCN** architecture.

The goal of this stage is **not** to achieve the best possible recommendation quality,
but to:

1. Build a **correct and scalable pipeline** for graph-based recommendation.
2. Use **proper data splitting** suitable for recommender systems.
3. Evaluate the model with **ranking metrics** (Hit@K, NDCG@K), not classification metrics.
4. Understand the **performance ceiling of pure collaborative filtering** on a user–item graph.

This notebook serves as a **reference baseline** for further experiments with
graph augmentation and hybrid recommender models.

## Dataset

We use the original **GoodBooks-10k** dataset and leverage the following interaction signals:

- `ratings.csv` → positive user–book interactions (`rating >= threshold`)
- `to_read.csv` → implicit positive feedback

Other files (`books.csv`, `tags.csv`, `book_tags.csv`) are intentionally **not used here**
and will be incorporated in the next stage when we move to **content-aware graphs**.

## Graph Construction

We construct a **bipartite graph**:

- User nodes
- Book nodes
- Undirected edges representing positive interactions

Node indices:
- users: `[0 … n_users - 1]`
- books: `[n_users … n_users + n_items - 1]`

The adjacency matrix is **symmetrically normalized** and used by LightGCN.

## Data Splitting

We apply a **leave-one-out split per user**:

- **Train**: all but the last interaction
- **Validation**: 1 interaction per user
- **Test**: 1 interaction per user

This setup:
- avoids information leakage
- matches real-world recommendation scenarios
- enables correct ranking-based evaluation

## Model

We use **LightGCN**, a simplified GCN designed specifically for collaborative filtering:

- No feature transformation
- No nonlinearities
- Message passing only through normalized adjacency
- Final embeddings are averaged across layers

To improve training stability and ranking quality, we use:
- **BPR loss**
- **Popularity-biased (hard) negative sampling**

## Evaluation

We evaluate the model using **ranking metrics**:

- Hit@K
- Recall@K
- NDCG@K

Metrics are computed on:
- a validation set during training
- the **full test set** after training

All seen items from the training set are masked during evaluation.

In [1]:
import json
import numpy as np
import pandas as pd
import os
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from collections import defaultdict
from datetime import datetime
from pathlib import Path

SEED = 42
rng = np.random.default_rng(SEED)

PROJECT_DIR = Path(r"D:/ML/GNN/graph_recsys")
DATA_RAW = PROJECT_DIR / "data_raw"
DATA_PROCESSED = PROJECT_DIR / "data_processed" / "v2_proper"
ARTIFACTS = PROJECT_DIR / "artifacts" / "v2_proper"

DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
ARTIFACTS.mkdir(parents=True, exist_ok=True)

print("DATA_RAW:", DATA_RAW)
print("DATA_PROCESSED:", DATA_PROCESSED)
print("ARTIFACTS:", ARTIFACTS)

DATA_RAW: D:\ML\GNN\graph_recsys\data_raw
DATA_PROCESSED: D:\ML\GNN\graph_recsys\data_processed\v2_proper
ARTIFACTS: D:\ML\GNN\graph_recsys\artifacts\v2_proper


## Build implicit positive interactions

We convert the dataset into a *binary implicit-feedback* format:

- **ratings.csv** → keep only strong positive feedback (`rating >= threshold`)
- **to_read.csv** → treat as an additional positive signal ("user intends to read")

Then we concatenate both sources into a single interaction table and **drop duplicates**
so that each `(user_id, book_id)` pair becomes a single edge in the user–ite graph.


In [2]:
books = pd.read_csv(DATA_RAW / "books.csv")
ratings = pd.read_csv(DATA_RAW / "ratings.csv")
to_read = pd.read_csv(DATA_RAW / "to_read.csv")
tags = pd.read_csv(DATA_RAW / "tags.csv")
book_tags = pd.read_csv(DATA_RAW / "book_tags.csv")

print("books:", books.shape, "| columns:", list(books.columns))
print("ratings:", ratings.shape, "| columns:", list(ratings.columns))
print("to_read:", to_read.shape, "| columns:", list(to_read.columns))
print("tags:", tags.shape, "| columns:", list(tags.columns))
print("book_tags:", book_tags.shape, "| columns:", list(book_tags.columns))

print("\nratings nunique users:", ratings["user_id"].nunique())
print("ratings nunique books:", ratings["book_id"].nunique())
print("to_read nunique users:", to_read["user_id"].nunique())
print("to_read nunique books:", to_read["book_id"].nunique())

books: (10000, 23) | columns: ['book_id', 'goodreads_book_id', 'best_book_id', 'work_id', 'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year', 'original_title', 'title', 'language_code', 'average_rating', 'ratings_count', 'work_ratings_count', 'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'image_url', 'small_image_url']
ratings: (5976479, 3) | columns: ['user_id', 'book_id', 'rating']
to_read: (912705, 2) | columns: ['user_id', 'book_id']
tags: (34252, 2) | columns: ['tag_id', 'tag_name']
book_tags: (999912, 3) | columns: ['goodreads_book_id', 'tag_id', 'count']

ratings nunique users: 53424
ratings nunique books: 10000
to_read nunique users: 48871
to_read nunique books: 9986


## Build implicit positive interactions

We convert raw GoodBooks data into an **implicit-feedback** interaction table.

- From `ratings.csv` we keep only **strong positive** signals (`rating >= threshold`), treating them as "user liked the book".
- From `to_read.csv` we treat entries as **implicit interest** signals ("user wants to read").

Then we concatenate both sources into a single `(user_id, book_id)` table and **drop duplicates** to avoid duplicate edges and evaluation leakage.

In [3]:
RATING_POS_THRESHOLD = 4

ratings_pos = ratings.loc[ratings["rating"] >= RATING_POS_THRESHOLD, ["user_id", "book_id"]].copy()
ratings_pos["source"] = "rating_pos"

to_read_pos = to_read[["user_id", "book_id"]].copy()
to_read_pos["source"] = "to_read"

interactions = pd.concat([ratings_pos, to_read_pos], ignore_index=True)

# убираем дубли (если юзер и так поставил рейтинг, и есть to_read)
interactions = interactions.drop_duplicates(subset=["user_id", "book_id"], keep="first")

print("interactions:", interactions.shape)
print(interactions["source"].value_counts())
interactions.head()

interactions: (5033236, 3)
source
rating_pos    4122111
to_read        911125
Name: count, dtype: int64


Unnamed: 0,user_id,book_id,source
0,1,258,rating_pos
1,2,4081,rating_pos
2,2,260,rating_pos
3,2,9296,rating_pos
4,2,26,rating_pos


## Filter sparse users and items (minimum counts)

Graph recommenders are sensitive to extreme sparsity.  
We iteratively filter out:

- users with fewer than `MIN_INTERACTIONS_PER_USER` interactions
- books with fewer than `MIN_INTERACTIONS_PER_BOOK` interactions

This is an *iterative* process because removing sparse items can make some users sparse again (and vice versa).  
The result is a cleaner interaction graph with fewer degenerate nodes and more stable training/evluation.


In [4]:
MIN_INTERACTIONS_PER_USER = 5
MIN_INTERACTIONS_PER_BOOK = 5

def filter_min_counts(df, user_col="user_id", item_col="book_id",
                      min_user=5, min_item=5):
    before = len(df)
    while True:
        user_cnt = df[user_col].value_counts()
        item_cnt = df[item_col].value_counts()

        good_users = user_cnt[user_cnt >= min_user].index
        good_items = item_cnt[item_cnt >= min_item].index

        df2 = df[df[user_col].isin(good_users) & df[item_col].isin(good_items)]
        if len(df2) == len(df):
            break
        df = df2
    after = len(df)
    return df, before, after

interactions_f, before, after = filter_min_counts(
    interactions,
    min_user=MIN_INTERACTIONS_PER_USER,
    min_item=MIN_INTERACTIONS_PER_BOOK
)

print(f"Filtered interactions: {before} -> {after}")
print("users:", interactions_f["user_id"].nunique(), "books:", interactions_f["book_id"].nunique())

Filtered interactions: 5033236 -> 5033180
users: 53398 books: 9999


## Build contiguous node indices and save mappings

For efficient graph construction, we remap raw IDs:

- `user_id` → `u` in `[0 .. n_users-1]`
- `book_id` → `i` in `[0 .. n_items-1]`

This makes tensors compact, speeds up training, and simplifies saving/loading.  
We also save `user2idx` and `book2idx` so we can later:
- reconstruct recommendations for real `user_id` / `book_id`
- join predictions back to metadata (`boks.csv`)


In [7]:
unique_users = np.sort(interactions_f["user_id"].unique())
unique_books = np.sort(interactions_f["book_id"].unique())

user2idx = {u: i for i, u in enumerate(unique_users)}
book2idx = {b: i for i, b in enumerate(unique_books)}

interactions_f = interactions_f.copy()

interactions_f.loc[:, "u"] = interactions_f["user_id"].map(user2idx).astype(np.int32)
interactions_f.loc[:, "i"] = interactions_f["book_id"].map(book2idx).astype(np.int32)

n_users = len(unique_users)
n_items = len(unique_books)

print("n_users:", n_users, "n_items:", n_items, "n_interactions:", len(interactions_f))

# сохраняем маппинги
pd.Series(user2idx).to_csv(DATA_PROCESSED / "user2idx.csv")
pd.Series(book2idx).to_csv(DATA_PROCESSED / "book2idx.csv")

n_users: 53398 n_items: 9999 n_interactions: 5033180


## Proper recommender split: Leave-One-Out per user

We create a **recsys-style** split: for each user:

- **Train**: all but 2 interactions
- **Validation**: 1 held-out interaction
- **Test**: 1 held-out interaction

This is a realistic evaluation protocol:
- avoids leakage (no future items in train)
- guarantees that every user in val/test exists in train
- enables ranking-based metrics (Hit@K, NDCG@K)

Note: users with extremely small histories are handled safely (fallback totrain-only).


In [8]:
def leave_one_out_split(df, user_col="u", item_col="i", seed=42):
    rng = np.random.default_rng(seed)
    train_rows, val_rows, test_rows = [], [], []

    for u, g in df.groupby(user_col):
        items = g[item_col].to_numpy()
        if len(items) < 3:
            # на всякий: если мало — всё в train
            train_rows.append(g)
            continue

        perm = rng.permutation(len(items))
        test_idx = perm[0]
        val_idx = perm[1]

        mask = np.ones(len(items), dtype=bool)
        mask[[test_idx, val_idx]] = False

        g_train = g.iloc[mask.nonzero()[0]]
        g_val = g.iloc[[val_idx]]
        g_test = g.iloc[[test_idx]]

        train_rows.append(g_train)
        val_rows.append(g_val)
        test_rows.append(g_test)

    train = pd.concat(train_rows, ignore_index=True)
    val = pd.concat(val_rows, ignore_index=True)
    test = pd.concat(test_rows, ignore_index=True)
    return train, val, test

train_df, val_df, test_df = leave_one_out_split(interactions_f, seed=SEED)

print("train:", train_df.shape, "val:", val_df.shape, "test:", test_df.shape)
print("check unique users:", train_df["u"].nunique(), val_df["u"].nunique(), test_df["u"].nunique())

train: (4926384, 5) val: (53398, 5) test: (53398, 5)
check unique users: 53398 53398 53398


## Build fast lookup structures for evaluation

We prepare helper structures used in ranking evaluation:

- `train_user_items[u]`: set of items already seen in training (to be **masked** in ranking)
- `val_pos[u]`: the single positive target item for validation
- `test_pos[u]`: the single positive target item for test

This matches the standard "leave-one-out + ranking" evaluation setup:
**recommend new items excluding those the user already interacted with intrain**.


In [9]:
train_user_items = defaultdict(set)
for u, i in zip(train_df["u"].to_numpy(), train_df["i"].to_numpy()):
    train_user_items[u].add(i)

val_pos = dict(zip(val_df["u"].to_numpy(), val_df["i"].to_numpy()))
test_pos = dict(zip(test_df["u"].to_numpy(), test_df["i"].to_numpy()))

n_users, n_items

(53398, 9999)

## Define ranking metrics (Hit@K, Recall@K, NDCG@K)

We evaluate recommendations with **ranking metrics**:

- **Hit@K**: whether the true held-out item is in top-K
- **Recall@K**: with one positive item per user, recall@K == hit@K
- **NDCG@K**: position-aware metric (higher if the item is ranked earlier)

These metrics reflect recommender performance better than classification metrics (AUC/AP)
because recommendation is fundamentally a *ranking*problem.


In [10]:
def hit_at_k(rank, k):
    return 1.0 if rank < k else 0.0

def recall_at_k(rank, k):
    # при одном positive это то же самое, что hit@k
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    if rank >= k:
        return 0.0
    return 1.0 / np.log2(rank + 2)  # rank 0 -> log2(2)=1

def evaluate_ranking(ranks, ks=(10, 20, 50)):
    out = {}
    ranks = np.asarray(ranks, dtype=np.int32)
    for k in ks:
        out[f"hit@{k}"] = float(np.mean([hit_at_k(r, k) for r in ranks]))
        out[f"recall@{k}"] = float(np.mean([recall_at_k(r, k) for r in ranks]))
        out[f"ndcg@{k}"] = float(np.mean([ndcg_at_k(r, k) for r in ranks]))
    return out

## Popularity baseline (sanity check)

Before training any model, we compute a simple baseline:

- rank items by training popularity
- for each user, recommend the most popular items **excluding seen items**
- compute the rank of the held-out positive item

This baseline provides a reference point:
- if our model cannot beat popularity, something is wrong (split/leakage/training)
- it also quantifies how much of the signal is "global ppularity"


In [11]:
item_pop = train_df["i"].value_counts().sort_values(ascending=False)
popular_items = item_pop.index.to_numpy()

def rank_for_user(u, pos_item, train_user_items, popular_items, max_k=1000):
    # возвращаем позицию pos_item в ранжированном списке (меньше = лучше)
    rank = 0
    for it in popular_items[:max_k]:
        if it in train_user_items[u]:
            continue
        if it == pos_item:
            return rank
        rank += 1
    return max_k  # не найден в топе

def evaluate_popularity(pos_dict, ks=(10, 20, 50), max_k=5000):
    ranks = []
    for u, pos_item in pos_dict.items():
        r = rank_for_user(u, pos_item, train_user_items, popular_items, max_k=max_k)
        ranks.append(r)
    return evaluate_ranking(ranks, ks=ks)

print("VAL popularity:", evaluate_popularity(val_pos))
print("TEST popularity:", evaluate_popularity(test_pos))

VAL popularity: {'hit@10': 0.04436495748904453, 'recall@10': 0.04436495748904453, 'ndcg@10': 0.025656646790484853, 'hit@20': 0.07284917038091314, 'recall@20': 0.07284917038091314, 'ndcg@20': 0.03284832700329946, 'hit@50': 0.1260159556537698, 'recall@50': 0.1260159556537698, 'ndcg@50': 0.04338053655791091}
TEST popularity: {'hit@10': 0.04548859507846736, 'recall@10': 0.04548859507846736, 'ndcg@10': 0.025908327415101125, 'hit@20': 0.07309262519195475, 'recall@20': 0.07309262519195475, 'ndcg@20': 0.032838528907749166, 'hit@50': 0.12687741113899398, 'recall@50': 0.12687741113899398, 'ndcg@50': 0.043492052391389875}


## Save processed splits and artifacts

We save processed arrays for reproducibility and fast iteration:

- `train_ui`, `val_ui`, `test_ui` as compact NumPy arrays of `(u, i)`
- `n_users`, `n_items`
- `popular_items` list for baseline/evaluation

We use `.npz` / `.npy` to avoid optional parquet dependencies and to keep the pipeline lighweight.


In [13]:
# сохраним пары (u,i) как numpy
train_ui = train_df[["u","i"]].to_numpy(dtype=np.int32)
val_ui   = val_df[["u","i"]].to_numpy(dtype=np.int32)
test_ui  = test_df[["u","i"]].to_numpy(dtype=np.int32)

np.savez_compressed(
    DATA_PROCESSED / "splits_ui.npz",
    train_ui=train_ui,
    val_ui=val_ui,
    test_ui=test_ui,
    n_users=np.int32(n_users),
    n_items=np.int32(n_items),
)

# популярность тоже туда же
np.save(DATA_PROCESSED / "popular_items.npy", popular_items.astype(np.int32))

print("Saved:", DATA_PROCESSED / "splits_ui.npz")

Saved: D:\ML\GNN\graph_recsys\data_processed\v2_proper\splits_ui.npz


## Load processed splits (UI format)

We load the processed interaction splits saved earlier:

- `train_ui`, `val_ui`, `test_ui` contain `(u, i)` pairs with **contiguous indices**
- `n_users`, `n_items` define the bipartite graph size

This keeps the training notebook lightweight: we don’t re-run heavy preprocessing ever time.


In [None]:
z = np.load(DATA_PROCESSED / "splits_ui.npz")
train_ui = z["train_ui"]
val_ui = z["val_ui"]
test_ui = z["test_ui"]
n_users = int(z["n_users"])
n_items = int(z["n_items"])

## Build normalized bipartite adjacency matrix A_norm

We build a **bipartite user–item graph** from training interactions only.

Node indexing:
- Users: `0 .. n_users-1`
- Items:  `n_users .. n_users+n_items-1`

We add edges in **both directions** (undirected graph) because LightGCN performs symmetric message passing.

Normalization:
- compute node degree `deg`
- use symmetric normalization `D^{-1/2} A D^{-1/2}`
- store it as a sparse matrix `A_norm` for efficient propagation

Important: we normalize using only **train edges** to avoid evalation leakage.


In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

u = torch.from_numpy(train_ui[:, 0].astype(np.int64))
i = torch.from_numpy(train_ui[:, 1].astype(np.int64)) + n_users

row = torch.cat([u, i], dim=0)
col = torch.cat([i, u], dim=0)

num_nodes = n_users + n_items
edge_index = torch.stack([row, col], dim=0)

deg = torch.bincount(edge_index[0], minlength=num_nodes).float()
deg_inv_sqrt = deg.pow(-0.5)
deg_inv_sqrt[torch.isinf(deg_inv_sqrt)] = 0.0

edge_weight = deg_inv_sqrt[edge_index[0]] * deg_inv_sqrt[edge_index[1]]

edge_index = edge_index.to(device)
edge_weight = edge_weight.to(device)

A_norm = torch.sparse_coo_tensor(
    edge_index, edge_weight,
    size=(num_nodes, num_nodes),
    device=device
).coalesce()

print("A_norm:", A_norm.shape, "| nnz:", A_norm._nnz())

device: cuda
A_norm: torch.Size([63397, 63397]) | nnz: 9852768


## Define LightGCN (embedding + pure graph propagation)

LightGCN is a strong baseline for collaborative filtering on bipartite graphs.

Key properties:
- each node has a learnable embedding (users + items)
- propagation is pure neighborhood aggregation: `X_{k+1} = A_norm @ X_k`
- final embedding is the mean of embeddings across layers (layer-wise averaging)

Unlike classic GCN:
- no feature MLP / weight matrices per layer
- no nonlinearities
This makes LightGCN simple, fast, and often very competitive for recmmendation.
ation leakage.


## BPR loss (Bayesian Personalized Ranking)

We optimize a **pairwise ranking** objective:

- positive interaction `(u, i+)` should score higher than
- negative (unobserved) item `(u, i-)`

BPR encourages correct ranking directly, which aligns with recommender metrics (Hit@K / NDCG@K)
better than binary classificatin loss.


In [20]:
class LightGCN(nn.Module):
    def __init__(self, num_users, num_items, emb_dim=64, num_layers=3):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.num_nodes = num_users + num_items
        self.emb_dim = emb_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(self.num_nodes, emb_dim)
        nn.init.xavier_uniform_(self.embedding.weight)

    def propagate(self, A_norm):
        x0 = self.embedding.weight
        xs = [x0]
        x = x0
        for _ in range(self.num_layers):
            x = torch.sparse.mm(A_norm, x)
            xs.append(x)
        return torch.stack(xs, dim=0).mean(dim=0)

    def forward(self, A_norm, users, pos_items, neg_items):
        all_emb = self.propagate(A_norm)
        user_emb = all_emb[:self.num_users]
        item_emb = all_emb[self.num_users:]

        u_e = user_emb[users]
        p_e = item_emb[pos_items]
        n_e = item_emb[neg_items]

        pos_scores = (u_e * p_e).sum(dim=1)
        neg_scores = (u_e * n_e).sum(dim=1)
        return pos_scores, neg_scores

def bpr_loss(pos_scores, neg_scores):
    return -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-12).mean()

## Sample training mini-batches (positive + negative items)

We train with mini-batches:

- sample random users
- for each user pick a **positive** item from their train history
- sample a **negative** item not seen in train for that user

This is standard implicit-feedback training.
Note: pure uniform negative sampling is simple but not optimal; later we can try
popularity-biased negatives or hard egatives.


In [21]:
def sample_batch(train_user_items, n_users, n_items, batch_size=4096, rng=None):
    if rng is None:
        rng = np.random.default_rng(42)

    users = rng.integers(0, n_users, size=batch_size, endpoint=False)
    pos_items = np.empty(batch_size, dtype=np.int64)
    neg_items = np.empty(batch_size, dtype=np.int64)

    for idx, u in enumerate(users):
        u = int(u)
        seen = train_user_items[u]
        # pos
        pos_items[idx] = rng.choice(list(seen))
        # neg
        while True:
            ni = int(rng.integers(0, n_items, endpoint=False))
            if ni not in seen:
                neg_items[idx] = ni
                break

    return users.astype(np.int64), pos_items, neg_items

## Ranking evaluation via "rank of the true held-out item"

For each user we have exactly **one positive target** in val/test (leave-one-out protocol).

We compute:
- score for all candidate items
- mask items already seen in train (`train_user_items[u]`)
- rank = number of items with score higher than the positive item

Then we convert ranks into Hit@K / NDCG@K metrics.

This is a direct, transparent evaluation approach:
we measure if the model places the true item in the top-K recomendations.


In [22]:
@torch.no_grad()
def get_all_user_item_emb(model, edge_index, edge_weight):
    model.eval()
    all_emb = model.propagate(edge_index, edge_weight)
    user_emb = all_emb[:model.num_users]
    item_emb = all_emb[model.num_users:]
    return user_emb, item_emb

@torch.no_grad()
def compute_ranks_for_users(user_emb, item_emb, pos_dict, train_user_items, k_max=50, user_subset=None):
    """
    Возвращает rank (0..inf) для positive item среди всех items (исключая seen).
    Мы не строим полный sort — считаем rank через сравнение со score(pos).
    """
    if user_subset is None:
        users = list(pos_dict.keys())
    else:
        users = list(user_subset)

    ranks = []
    item_emb_T = item_emb.t()  # [d, n_items]

    for u in users:
        u = int(u)
        pos_item = int(pos_dict[u])

        # scores for all items
        scores = (user_emb[u] @ item_emb_T).detach().cpu().numpy()  # [n_items]

        # exclude seen (train items)
        seen = train_user_items[u]
        scores[list(seen)] = -1e9  # чтобы не попадали в top
        pos_score = scores[pos_item]

        # rank = сколько items имеют score > pos_score
        rank = int(np.sum(scores > pos_score))
        ranks.append(rank)

    return ranks

## Train LightGCN with BPR + evaluate on validation users

Training setup:
- LightGCN embeddings (dimension `EMB_DIM`)
- `NUM_LAYERS` propagation steps
- Adam optimizer (small weight decay for stability)

Each epoch:
1) run `STEPS` mini-batches of BPR updates
2) compute validation ranks for a subset of users (`EVAL_USERS`) to keep evaluation fast
3) track best model by **val NDCG@10**

Notes:
- evaluation subset speeds up iteration; final reporting should run on full val/test
- epoch time depends mostly on sparse propagation + atch sampling


In [23]:
EMB_DIM = 64
NUM_LAYERS = 3
LR = 2e-3
BATCH_SIZE = 4096
EPOCHS = 20
STEPS = 300
EVAL_USERS = 5000
KS = (10, 20, 50)

model = LightGCN(n_users, n_items, emb_dim=EMB_DIM, num_layers=NUM_LAYERS).to(device)

# вместо l2_reg — безопасный weight_decay
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-6)

all_val_users = np.array(list(val_pos.keys()), dtype=np.int64)
val_users_subset = rng.choice(all_val_users, size=min(EVAL_USERS, len(all_val_users)), replace=False)

best_val_ndcg10 = -1.0
best_state = None

for epoch in range(1, EPOCHS + 1):
    t0 = time.time()
    model.train()
    losses = []

    for _ in range(STEPS):
        users, pos_items, neg_items = sample_batch(train_user_items, n_users, n_items, BATCH_SIZE, rng=rng)

        users = torch.from_numpy(users).to(device)
        pos_items = torch.from_numpy(pos_items).to(device)
        neg_items = torch.from_numpy(neg_items).to(device)

        optimizer.zero_grad()
        pos_scores, neg_scores = model(A_norm, users, pos_items, neg_items)
        loss = bpr_loss(pos_scores, neg_scores)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    # validation
    model.eval()
    with torch.no_grad():
        all_emb = model.propagate(A_norm)
        user_emb = all_emb[:n_users]
        item_emb = all_emb[n_users:]
        ranks = compute_ranks_for_users(user_emb, item_emb, val_pos, train_user_items, user_subset=val_users_subset)
        val_metrics = evaluate_ranking(ranks, ks=KS)

    dt = time.time() - t0
    print(f"[Epoch {epoch:03d}] loss={np.mean(losses):.4f} | val {val_metrics} | time={dt:.1f}s")

    if val_metrics["ndcg@10"] > best_val_ndcg10:
        best_val_ndcg10 = val_metrics["ndcg@10"]
        best_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}

print("Best val ndcg@10:", best_val_ndcg10)

[Epoch 001] loss=0.4723 | val {'hit@10': 0.0504, 'recall@10': 0.0504, 'ndcg@10': 0.02793142752854441, 'hit@20': 0.0816, 'recall@20': 0.0816, 'ndcg@20': 0.035806972737550596, 'hit@50': 0.1414, 'recall@50': 0.1414, 'ndcg@50': 0.047699980305317885} | time=104.2s
[Epoch 002] loss=0.3851 | val {'hit@10': 0.0566, 'recall@10': 0.0566, 'ndcg@10': 0.029975346021028623, 'hit@20': 0.0912, 'recall@20': 0.0912, 'ndcg@20': 0.03867671333302165, 'hit@50': 0.1534, 'recall@50': 0.1534, 'ndcg@50': 0.0509281504193389} | time=105.5s
[Epoch 003] loss=0.3535 | val {'hit@10': 0.0598, 'recall@10': 0.0598, 'ndcg@10': 0.03138965625545384, 'hit@20': 0.0924, 'recall@20': 0.0924, 'ndcg@20': 0.039586826956874224, 'hit@50': 0.1616, 'recall@50': 0.1616, 'ndcg@50': 0.05324169363819383} | time=104.5s
[Epoch 004] loss=0.3393 | val {'hit@10': 0.0614, 'recall@10': 0.0614, 'ndcg@10': 0.03273073024259129, 'hit@20': 0.0968, 'recall@20': 0.0968, 'ndcg@20': 0.041675894311533054, 'hit@50': 0.1614, 'recall@50': 0.1614, 'ndcg@50':

## Save best model checkpoint and experiment config

We save the best model (by validation NDCG@10) into a timestamped run folder:

- `lightgcn_best_state.pt`: model weights
- `config.json`: all hyperparameters and best validation metrics

This makes experiments reproducible and keeps results organized in `artifacts/`.

In [25]:
ARTIFACTS.mkdir(parents=True, exist_ok=True)

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
run_dir = ARTIFACTS / f"lightgcn_{run_id}"
run_dir.mkdir(parents=True, exist_ok=True)

# сохраняем best веса
torch.save(best_state, run_dir / "lightgcn_best_state.pt")

# сохраняем конфиг
config = {
    "emb_dim": EMB_DIM,
    "num_layers": NUM_LAYERS,
    "lr": LR,
    "weight_decay": 1e-6,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "steps_per_epoch": STEPS,
    "eval_users": EVAL_USERS,
    "best_val_ndcg10": float(best_val_ndcg10),
    "best_val_metrics": {k: float(v) for k, v in val_metrics.items()},
}
with open(run_dir / "config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

print("Saved to:", run_dir)

Saved to: D:\ML\GNN\graph_recsys\artifacts\v2_proper\lightgcn_20251213_145944


## Second model: alternative graph-based recommender (same pipeline, different architecture)

After restarting the kernel, we reuse **exactly the same data pipeline and evaluation protocol**:

- identical interaction construction (ratings ≥ threshold + to-read)
- identical user/item filtering and indexing
- the same leave-one-out split (train / validation / test)
- the same ranking metrics (Hit@K, NDCG@K)
- the same popularity baseline for reference

This is intentional:  
**the goal is to compare models under identical conditions**, not to re-optimize the pipeline each time.

### What changes compared to the previous model

Only the **model architecture and training objective** differ:

- a different graph-based recommender is used
- embeddings are learned and propagated through the same user–item graph
- optimization may use a different loss (e.g. BPR / contrastive / pairwise)
- training dynamics (loss scale, convergence speed) differ accordingly

All other components — data splits, negative sampling logic, ranking evaluation — remain unchanged.

### Why we do not repeat all code explanations here

The following blocks reuse:
- the same graph construction logic
- the same batching and negative sampling ideas
- the same evaluation functions

Therefore, detailed explanations are provided **once** (in the first model section)  
to avoid redundancy and keep the notebook readable.

### Purpose of this section

This section serves to:
- verify that the pipeline is **model-agnostic**
- compare architectures fairly on the same data
- observe how different graph models behave under identical conditions

Further improvements (graph enrichment, heterogeneity, edge weighing) are addressed in the next notebook.


In [1]:
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
from collections import defaultdict
import time
import json
from datetime import datetime
from tqdm.auto import tqdm



DATA_PROCESSED = Path(r"D:\ML\GNN\graph_recsys\data_processed\v2_proper")
ARTIFACTS = Path(r"D:\ML\GNN\graph_recsys\artifacts\v2_proper")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

rng = np.random.default_rng(42)

device: cuda


In [2]:
z = np.load(DATA_PROCESSED / "splits_ui.npz")
train_ui = z["train_ui"].astype(np.int32)
val_ui   = z["val_ui"].astype(np.int32)
test_ui  = z["test_ui"].astype(np.int32)
n_users = int(z["n_users"])
n_items = int(z["n_items"])

popular_items = np.load(DATA_PROCESSED / "popular_items.npy").astype(np.int32)

train_user_items = defaultdict(set)
for u, i in train_ui:
    train_user_items[int(u)].add(int(i))

val_pos  = {int(u): int(i) for u, i in val_ui}
test_pos = {int(u): int(i) for u, i in test_ui}

print("Loaded:")
print(" train:", train_ui.shape, "val:", val_ui.shape, "test:", test_ui.shape)
print(" n_users:", n_users, "n_items:", n_items)
print(" popular_items:", popular_items.shape, "unique:", len(np.unique(popular_items)))

Loaded:
 train: (4926384, 2) val: (53398, 2) test: (53398, 2)
 n_users: 53398 n_items: 9999
 popular_items: (9999,) unique: 9999


In [3]:
u = torch.from_numpy(train_ui[:, 0].astype(np.int64))
i = torch.from_numpy(train_ui[:, 1].astype(np.int64)) + n_users

row = torch.cat([u, i], dim=0)
col = torch.cat([i, u], dim=0)

num_nodes = n_users + n_items
edge_index = torch.stack([row, col], dim=0)  # [2, 2E]

deg = torch.bincount(edge_index[0], minlength=num_nodes).float()
deg_inv_sqrt = deg.pow(-0.5)
deg_inv_sqrt[torch.isinf(deg_inv_sqrt)] = 0.0

edge_weight = deg_inv_sqrt[edge_index[0]] * deg_inv_sqrt[edge_index[1]]

edge_index = edge_index.to(device)
edge_weight = edge_weight.to(device)

A_norm = torch.sparse_coo_tensor(
    edge_index, edge_weight,
    size=(num_nodes, num_nodes),
    device=device
).coalesce()

print("A_norm:", A_norm.shape, "| nnz:", A_norm._nnz())

A_norm: torch.Size([63397, 63397]) | nnz: 9852768


In [4]:
def evaluate_ranking(ranks, ks=(10, 20, 50)):
    # ranks: list[int], where rank=0 means best item, etc.
    ranks = np.asarray(ranks, dtype=np.int64)
    out = {}
    for k in ks:
        hit = (ranks < k).mean()
        out[f"hit@{k}"] = float(hit)
        out[f"recall@{k}"] = float(hit)  # single positive per user => recall==hit
        # ndcg for single positive
        ndcg = np.where(ranks < k, 1.0 / np.log2(ranks + 2.0), 0.0).mean()
        out[f"ndcg@{k}"] = float(ndcg)
    return out

@torch.no_grad()
def compute_ranks_for_users(user_emb, item_emb, pos_dict, train_user_items, user_subset=None):
    if user_subset is None:
        users = list(pos_dict.keys())
    else:
        users = list(user_subset)

    item_emb_T = item_emb.t()  # [d, n_items]
    ranks = []

    for u in users:
        u = int(u)
        pos_item = int(pos_dict[u])

        scores = (user_emb[u] @ item_emb_T).detach().cpu().numpy()  # [n_items]
        seen = train_user_items[u]
        if seen:
            scores[list(seen)] = -1e9
        pos_score = scores[pos_item]
        rank = int(np.sum(scores > pos_score))
        ranks.append(rank)

    return ranks

In [5]:
class LightGCN(nn.Module):
    def __init__(self, num_users, num_items, emb_dim=64, num_layers=3):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.num_nodes = num_users + num_items
        self.emb_dim = emb_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(self.num_nodes, emb_dim)
        nn.init.xavier_uniform_(self.embedding.weight)

    def propagate(self, A_norm):
        x0 = self.embedding.weight
        xs = [x0]
        x = x0
        for _ in range(self.num_layers):
            x = torch.sparse.mm(A_norm, x)
            xs.append(x)
        return torch.stack(xs, dim=0).mean(dim=0)

    def forward(self, A_norm, users, pos_items, neg_items):
        all_emb = self.propagate(A_norm)
        user_emb = all_emb[:self.num_users]
        item_emb = all_emb[self.num_users:]

        u_e = user_emb[users]
        p_e = item_emb[pos_items]
        n_e = item_emb[neg_items]

        pos_scores = (u_e * p_e).sum(dim=1)
        neg_scores = (u_e * n_e).sum(dim=1)
        return pos_scores, neg_scores

def bpr_loss(pos_scores, neg_scores):
    return -torch.log(torch.sigmoid(pos_scores - neg_scores) + 1e-12).mean()

In [6]:
# настройки hard negatives
TOPK_NEG = 2000          # пул популярных для negatives
P_POP_NEG = 0.7          # вероятность взять neg из popular pool

popular_pool = popular_items[:min(TOPK_NEG, len(popular_items))].astype(np.int64)

def sample_batch_hard(train_user_items, n_users, n_items, batch_size=4096, rng=None):
    if rng is None:
        rng = np.random.default_rng(42)

    users = rng.integers(0, n_users, size=batch_size, endpoint=False)
    pos_items = np.empty(batch_size, dtype=np.int64)
    neg_items = np.empty(batch_size, dtype=np.int64)

    for idx, u in enumerate(users):
        u = int(u)
        seen = train_user_items[u]

        # positive
        pos_items[idx] = rng.choice(list(seen))

        # negative (hard)
        while True:
            if rng.random() < P_POP_NEG:
                ni = int(rng.choice(popular_pool))
            else:
                ni = int(rng.integers(0, n_items, endpoint=False))
            if ni not in seen:
                neg_items[idx] = ni
                break

    return users.astype(np.int64), pos_items, neg_items

In [8]:
EMB_DIM = 64
NUM_LAYERS = 3
LR = 2e-3
WEIGHT_DECAY = 1e-6

BATCH_SIZE = 4096
EPOCHS = 20
STEPS = 300
EVAL_USERS = 5000
KS = (10, 20, 50)

model = LightGCN(n_users, n_items, emb_dim=EMB_DIM, num_layers=NUM_LAYERS).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

all_val_users = np.array(list(val_pos.keys()), dtype=np.int64)
val_users_subset = rng.choice(all_val_users, size=min(EVAL_USERS, len(all_val_users)), replace=False)

best_val_ndcg10 = -1.0
best_state = None
best_metrics = None

for epoch in range(1, EPOCHS + 1):
    t0 = time.time()
    model.train()
    losses = []

    for _ in range(STEPS):
        users, pos_items, neg_items = sample_batch_hard(
            train_user_items, n_users, n_items,
            batch_size=BATCH_SIZE, rng=rng
        )
        users = torch.from_numpy(users).to(device)
        pos_items = torch.from_numpy(pos_items).to(device)
        neg_items = torch.from_numpy(neg_items).to(device)

        optimizer.zero_grad()
        pos_scores, neg_scores = model(A_norm, users, pos_items, neg_items)
        loss = bpr_loss(pos_scores, neg_scores)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    # validation
    model.eval()
    with torch.no_grad():
        all_emb = model.propagate(A_norm)
        user_emb = all_emb[:n_users]
        item_emb = all_emb[n_users:]
        ranks = compute_ranks_for_users(user_emb, item_emb, val_pos, train_user_items, user_subset=val_users_subset)
        val_metrics = evaluate_ranking(ranks, ks=KS)

    dt = time.time() - t0
    print(f"[Epoch {epoch:03d}] loss={np.mean(losses):.4f} | val {val_metrics} | time={dt:.1f}s")

    if val_metrics["ndcg@10"] > best_val_ndcg10:
        best_val_ndcg10 = val_metrics["ndcg@10"]
        best_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
        best_metrics = val_metrics.copy()

print("Best val ndcg@10:", best_val_ndcg10, "| best metrics:", best_metrics)

[Epoch 001] loss=0.5759 | val {'hit@10': 0.0546, 'recall@10': 0.0546, 'ndcg@10': 0.028216020367364956, 'hit@20': 0.0864, 'recall@20': 0.0864, 'ndcg@20': 0.036172434707701685, 'hit@50': 0.157, 'recall@50': 0.157, 'ndcg@50': 0.050158361440504215} | time=107.4s
[Epoch 002] loss=0.4706 | val {'hit@10': 0.0666, 'recall@10': 0.0666, 'ndcg@10': 0.03628392834314441, 'hit@20': 0.105, 'recall@20': 0.105, 'ndcg@20': 0.04590641587142754, 'hit@50': 0.1814, 'recall@50': 0.1814, 'ndcg@50': 0.06101243575589649} | time=108.6s
[Epoch 003] loss=0.4253 | val {'hit@10': 0.0694, 'recall@10': 0.0694, 'ndcg@10': 0.038178567919614076, 'hit@20': 0.1136, 'recall@20': 0.1136, 'ndcg@20': 0.04929299762774982, 'hit@50': 0.1912, 'recall@50': 0.1912, 'ndcg@50': 0.06460546630653431} | time=107.5s
[Epoch 004] loss=0.4051 | val {'hit@10': 0.0734, 'recall@10': 0.0734, 'ndcg@10': 0.039736806478453755, 'hit@20': 0.1194, 'recall@20': 0.1194, 'ndcg@20': 0.05131958004273017, 'hit@50': 0.1986, 'recall@50': 0.1986, 'ndcg@50': 0.

In [10]:
@torch.no_grad()
def compute_ranks_for_users_batched(user_emb, item_emb, pos_dict, train_user_items, users, batch_size=512):
    """
    users: np.array of user ids (int)
    pos_dict: dict u -> pos_item
    train_user_items: dict u -> set(seen_items)  (только train!)
    """
    user_emb = user_emb.to(device)
    item_emb = item_emb.to(device)
    item_emb_T = item_emb.t().contiguous()  # [d, n_items]

    ranks = np.empty(len(users), dtype=np.int64)

    for start in tqdm(range(0, len(users), batch_size), desc="Full TEST ranks"):
        end = min(start + batch_size, len(users))
        batch_users = users[start:end]
        bu = torch.from_numpy(batch_users).to(device)  # [B]

        # scores: [B, n_items]
        scores = (user_emb[bu] @ item_emb_T)  # GPU matmul

        # переносим на CPU для удобного маскинга питоном по sets
        scores_cpu = scores.detach().cpu().numpy()

        for j, u in enumerate(batch_users):
            u = int(u)
            pos_item = int(pos_dict[u])

            # mask seen items from TRAIN
            seen = train_user_items[u]
            if seen:
                scores_cpu[j, list(seen)] = -1e9

            pos_score = scores_cpu[j, pos_item]
            rank = int(np.sum(scores_cpu[j] > pos_score))
            ranks[start + j] = rank

    return ranks

# --- FULL TEST evaluation (all users) ---
model.eval()
with torch.no_grad():
    all_emb = model.propagate(A_norm)
    user_emb = all_emb[:n_users]
    item_emb = all_emb[n_users:]

full_test_users = np.array(list(test_pos.keys()), dtype=np.int64)

ranks_full_test = compute_ranks_for_users_batched(
    user_emb=user_emb,
    item_emb=item_emb,
    pos_dict=test_pos,
    train_user_items=train_user_items,
    users=full_test_users,
    batch_size=512  # можно 256 если память/скорость будут капризничать
)

full_test_metrics = evaluate_ranking(ranks_full_test, ks=(10,20,50))
print("FULL TEST metrics:", full_test_metrics)

Full TEST ranks:   0%|          | 0/105 [00:00<?, ?it/s]

FULL TEST metrics: {'hit@10': 0.08421663732724072, 'recall@10': 0.08421663732724072, 'ndcg@10': 0.04552879810597425, 'hit@20': 0.12886250421364095, 'recall@20': 0.12886250421364095, 'ndcg@20': 0.056736608678710554, 'hit@50': 0.22058878609685756, 'recall@50': 0.22058878609685756, 'ndcg@50': 0.07481866735724439}


In [11]:
ARTIFACTS.mkdir(parents=True, exist_ok=True)

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
run_dir = ARTIFACTS / f"lightgcn_hardneg_{run_id}"
run_dir.mkdir(parents=True, exist_ok=True)

torch.save(best_state, run_dir / "lightgcn_best_state.pt")

config = {
    "model": "LightGCN",
    "variant": "hardneg_popularity",
    "emb_dim": EMB_DIM,
    "num_layers": NUM_LAYERS,
    "lr": LR,
    "weight_decay": WEIGHT_DECAY,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "steps_per_epoch": STEPS,
    "topk_neg_pool": int(TOPK_NEG),
    "p_pop_neg": float(P_POP_NEG),
    "eval_users_val": int(EVAL_USERS),
    "eval_users_test": int(EVAL_TEST_USERS),
    "best_val_ndcg10": float(best_val_ndcg10),
    "best_val_metrics": {k: float(v) for k, v in (best_metrics or {}).items()},
    "test_subset_metrics": {k: float(v) for k, v in test_metrics.items()},
}

with open(run_dir / "config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

print("Saved to:", run_dir)

Saved to: D:\ML\GNN\graph_recsys\artifacts\v2_proper\lightgcn_hardneg_20251213_160028


# Results and Conclusions

## Quantitative Results (Full Test Set)

The final evaluation on the **full test set** (≈53k users, ≈10k items) yields:

- **Hit@10** ≈ 0.084  
- **NDCG@10** ≈ 0.045  
- **Hit@20** ≈ 0.129  
- **NDCG@20** ≈ 0.057  
- **Hit@50** ≈ 0.221  
- **NDCG@50** ≈ 0.075  

Compared to a popularity-based baseline, this represents a **1.5–2× improvement**
across ranking metrics.

## Interpretation

These results confirm several important points:

1. **The training and evaluation pipeline is correct**
   - Proper split
   - Proper metrics
   - No data leakage

2. **LightGCN effectively captures collaborative signals**
   from the user–item interaction graph.

3. At the same time, the model reaches a **clear performance plateau**.

This plateau is **not caused by model capacity or optimization issues**, but by
the **limited information content of the pure user–item graph**.

## Key Insight

> Pure collaborative filtering on GoodBooks-10k saturates around  
> **NDCG@10 ≈ 0.04–0.05**.

Further architectural complexity (e.g. GAT, Graph Transformers) **will not lead to
significant gains** without introducing additional signals.

## Why This Notebook Matters

This notebook is intentionally positioned as:

- a **baseline**
- a **pipeline validation**
- a **reference point**

It is **not** meant to be the final recommender system, but rather a solid foundation
for further work.

## Next Steps

In the next notebook, we move beyond pure collaborative filtering and **augment the graph**
with additional information:

- Book–tag relationships
- Content-aware connections
- Hybrid graph structures

The goal is to inject **new semantic signals** into the graph and push recommendation
quality beyond the collaborative filtering ceiling.