# Final Evaluation & Research Report
## Graph-based Recommender System (Goodbooks-10k, Graph3)

This notebook concludes the research cycle for the **Graph3-based recommender system**.

### What this notebook does
- Collects results from all completed experiments.
- Builds a unified comparison table across models.
- Identifies the strongest approaches under a **fixed evaluation protocol**.
- Summarizes key findings and trade-offs.
- Records future research directions (without implementing them yet).

### What this notebook does NOT do
- No new model training.
- No hyperparameter tuning.
- No graph rebuilding.

The goal is **closure and clarity**, not exploration:
summarizing which model won and why.


⚠️ Note on experiment logging

Experiments in this project were conducted iteratively during research,
so their results are stored in heterogeneous formats
(CSV summaries, JSON files, and notebook outputs).

For the final report, we rely on the **validated test metrics**
reported in each experiment notebook.

## Final Comparison (Test set, 10k users)

| Model | Encoder | Sampling | NDCG@10 (C=1000) | Hit@10 (C=1000) | Notes |
|------|--------|----------|------------------|----------------|-------|
| LightGCN (baseline) | LightGCN | full-batch | ~0.0298 | ~0.054 | Reference baseline |
| GraphSAGE | SAGEConv | neighbor | ~0.029–0.030 | ~0.058 | Ranking loss, scalable |
| GAT | GATConv | neighbor | **~0.0426** | **~0.086** | Best attention-based |
| TransformerConv | TransformerConv | neighbor | ~0.033 | ~0.067 | Slower, less stable |
| PinSAGE-style | GraphSAGE | random-walk | **~0.0412** | ~0.076 | Strong walk-based signal |

## Engineering follow-up

A natural next step is to refactor the experimental code into:
- a unified graph builder (`build_graph3.py`),
- a common training/evaluation script,
- standardized run metadata (`run_meta.json`).
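
As an illustration of what standardized run metadata could look like, here is a minimal sketch. The `RunMeta` class and all of its field names are hypothetical and not part of the existing code; the example values are the approximate GAT figures from the comparison table above.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RunMeta:
    """Hypothetical schema for one experiment run (field names are assumptions)."""
    model: str
    encoder: str
    sampling: str
    split: str
    num_candidates: int
    metrics: dict = field(default_factory=dict)


# Example record; metric values are the approximate GAT numbers from the table.
meta = RunMeta(
    model="GAT",
    encoder="GATConv",
    sampling="neighbor",
    split="leave-one-out",
    num_candidates=1000,
    metrics={"ndcg@10": 0.0426, "hit@10": 0.086},
)

# Serialize with sorted keys so that files from different runs diff cleanly.
payload = json.dumps(asdict(meta), indent=2, sort_keys=True)
print(payload)
```

Writing `payload` to a `run_meta.json` file next to each run's outputs would let a later notebook build the comparison table by reading one format instead of several.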

This refactor is intentionally postponed to a future iteration,
after completing the exploratory research phase.

# Final Research Summary

## Project scope
In this project we explored **graph-based recommendation models** on the Goodbooks-10k dataset using a fixed and reproducible **Graph3** setup.

Graph3 is a heterogeneous graph that includes:
- users and books (core interaction signal),
- auxiliary relations: book–tag, book–author, book–language, book–year_bin,
- optional book–book similarity edges.
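
The structure listed above can be pictured as a mapping from typed relations to edge lists. The sketch below is only illustrative: the relation names (`rated`, `has_tag`, and so on) and the toy IDs are assumptions, not the identifiers used in the actual Graph3 builder.

```python
from collections import defaultdict

# (source type, relation, destination type) -> list of (src_id, dst_id) pairs.
# Relation names and IDs are hypothetical placeholders.
graph3 = defaultdict(list)


def add_edge(src_type, relation, dst_type, src_id, dst_id):
    """Append one typed edge to the toy heterogeneous graph."""
    graph3[(src_type, relation, dst_type)].append((src_id, dst_id))


# Core interaction signal.
add_edge("user", "rated", "book", 0, 10)
# Auxiliary relations.
add_edge("book", "has_tag", "tag", 10, 3)
add_edge("book", "written_by", "author", 10, 7)
add_edge("book", "in_language", "language", 10, 0)
add_edge("book", "in_year_bin", "year_bin", 10, 4)
# Optional book-book similarity edge.
add_edge("book", "similar_to", "book", 10, 11)

# Collect every node type that appears as a source or destination.
node_types = sorted({t for (src, _, dst) in graph3 for t in (src, dst)})
print(node_types)
```

Keying edges by a `(source, relation, destination)` triple mirrors the convention used by common heterogeneous graph libraries, which should make a later port to a true heterogeneous GNN easier.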

All experiments were evaluated under the same protocol:
- leave-one-out (LOO) split,
- candidate-based ranking (C = 1000 and C = 2000),
- ranking metrics (Hit@K, NDCG@K).
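
To make the protocol concrete, the following sketch computes Hit@K and NDCG@K for a single user under leave-one-out evaluation. It assumes the held-out item is scored separately from the sampled candidates and that ties count against the held-out item; both are assumptions of this sketch, and the function names are illustrative.

```python
import math


def rank_of_target(target_score, candidate_scores):
    """1-based rank of the held-out item among the sampled candidates.

    Ties are resolved pessimistically (a tied candidate ranks above the target).
    """
    return 1 + sum(1 for s in candidate_scores if s >= target_score)


def hit_at_k(rank, k):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy example: one candidate outscores the held-out item, so its rank is 2.
rank = rank_of_target(0.9, [0.95, 0.5, 0.1])
print(rank, hit_at_k(rank, 10), round(ndcg_at_k(rank, 10), 4))  # 2 1.0 0.6309
```

The reported metrics would then be the averages of these per-user values over all test users.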

The goal was not to find a single “best” architecture, but to understand **what actually drives ranking quality in GNN-based recommenders**.

---

## What worked

### 1. Objective alignment is critical
The strongest and most consistent result across all experiments is that
**ranking-aligned objectives (BPR)** are essential.

Models trained with binary link prediction (BCE + negative sampling) —
such as a full-batch R-GCN setup — showed decreasing loss but **failed catastrophically on ranking metrics**.
This confirmed a fundamental mismatch between:
- local edge classification,
- and global ranking evaluation.

Once BPR loss was introduced, even relatively simple architectures
(GraphSAGE) became competitive.

**Conclusion:**  
> The choice of objective function has a larger impact on ranking quality than the choice of GNN layer.
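
One way to see the mismatch is to compare the two losses on a single (user, positive item, negative item) triple. BPR depends only on the difference between the two scores, while BCE judges each score against an absolute threshold. The sketch below uses plain Python rather than the project's training code, and the function names are illustrative.

```python
import math


def softplus(x):
    """log(1 + exp(x)); note that -log(sigmoid(x)) equals softplus(-x)."""
    return math.log1p(math.exp(x))


def bpr_loss(pos_score, neg_score):
    """BPR: -log sigmoid(s_pos - s_neg). Only the score gap matters."""
    return softplus(-(pos_score - neg_score))


def bce_loss(pos_score, neg_score):
    """BCE with one negative: each edge is classified independently."""
    return softplus(-pos_score) + softplus(neg_score)


# Shifting both scores by the same constant keeps the ranking identical.
print(bpr_loss(2.0, 1.0), bpr_loss(5.0, 4.0))  # same value: the gap is 1.0 in both cases
print(bce_loss(2.0, 1.0), bce_loss(5.0, 4.0))  # different values despite identical ranking
```

Because BCE rewards pushing scores toward absolute targets, its loss can move without any change in the relative order of items, which is the only thing Hit@K and NDCG@K measure.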

---

### 2. Sampling is not a compromise — it is an advantage
Mini-batch training with neighborhood sampling consistently produced:
- faster training,
- better scalability,
- and equal or better ranking performance compared to full-batch baselines.

Neighbor sampling allowed the models to focus on **local collaborative structure**,
which turned out to be more useful for ranking than global message passing.

**Conclusion:**  
> For recommendation, sampling-based GNNs are not just scalable — they are often better.
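
For readers unfamiliar with neighborhood sampling, the idea can be sketched in a few lines: starting from a batch of seed nodes, each hop keeps at most a fixed number of uniformly sampled neighbors per node. This is a simplified stand-in, not the loader used in the experiments; the adjacency dictionary, node names, and fan-out values are toy assumptions.

```python
import random


def sample_neighborhood(adj, seeds, fanouts, rng):
    """Layer-wise uniform neighbor sampling.

    adj: dict mapping node -> list of neighbors.
    fanouts: maximum number of neighbors sampled per node at each hop.
    Returns the set of all nodes in the sampled subgraph.
    """
    visited = set(seeds)
    frontier = list(seeds)
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            k = min(fanout, len(neighbors))
            for nb in rng.sample(neighbors, k):
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return visited


# Toy bipartite user-book graph (hypothetical IDs).
adj = {
    "u1": ["b1", "b2", "b3"],
    "u2": ["b1"],
    "b1": ["u1", "u2"],
    "b2": ["u1"],
    "b3": ["u1"],
}
nodes = sample_neighborhood(adj, ["u1"], fanouts=[2, 2], rng=random.Random(0))
print(sorted(nodes))
```

With fan-outs of `[2, 2]`, the cost per seed is bounded regardless of how many books a user has rated, which is where the scalability benefit comes from.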

---

### 3. Attention helps, but is not mandatory
Attention-based models performed strongly under the same training setup:
GAT achieved the best NDCG@10 and Hit@10 in the final comparison, and TransformerConv also outperformed the LightGCN baseline.

However, their advantage over non-attention models was **not universal**:
well-designed GraphSAGE models, especially those with improved neighborhood construction,
were able to match or closely approach attention-based performance.

**Conclusion:**  
> Attention is a powerful tool, but neighborhood quality can compensate for architectural simplicity.

---

### 4. Neighborhood construction matters as much as the encoder
PinSAGE-style random-walk sampling produced one of the strongest and most stable models.

Random walks implicitly capture:
- co-visitation patterns,
- collaborative proximity,
- and higher-order structure that is difficult to encode with simple K-hop sampling.

This led to ranking quality comparable to attention-based models,
without explicitly using attention.

**Conclusion:**  
> How neighborhoods are built can be as important as how they are processed.
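
The random-walk idea can be sketched as follows: run several short walks from a node, count how often each other node is visited, and keep the most visited ones as its neighborhood. This is a simplified illustration of the PinSAGE-style approach, not the code used in the experiments; the `keep` filter, the parameter values, and the toy graph are assumptions.

```python
import random
from collections import Counter


def random_walk_neighbors(adj, start, num_walks, walk_length, top_t, rng,
                          keep=lambda node: True):
    """Return up to top_t nodes most frequently visited by short random walks.

    adj: dict mapping node -> list of neighbors.
    keep: predicate selecting which visited nodes are eligible as neighbors.
    """
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            neighbors = adj.get(node)
            if not neighbors:
                break  # dead end: stop this walk early
            node = rng.choice(neighbors)
            if node != start and keep(node):
                visits[node] += 1
    return [node for node, _ in visits.most_common(top_t)]


# Toy bipartite user-book graph (hypothetical IDs).
adj = {
    "u1": ["b1", "b2", "b3"],
    "u2": ["b1"],
    "b1": ["u1", "u2"],
    "b2": ["u1"],
    "b3": ["u1"],
}
# Book-to-book neighborhood of b1: only count visits to other books.
neighbors = random_walk_neighbors(
    adj, "b1", num_walks=50, walk_length=4, top_t=2,
    rng=random.Random(0), keep=lambda n: n.startswith("b"),
)
print(neighbors)
```

Visit counts act as an importance score, so frequently co-visited books rank above books that are merely within K hops.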

---

## What did not work (and why)

### 1. Heterogeneous structure alone is not sufficient
Simply adding more node and edge types did not automatically improve results.

Without a ranking-aligned objective and a sampling strategy consistent with the evaluation,
heterogeneous models underperformed simpler baselines.

This was clearly observed in the R-GCN experiment.

**Conclusion:**  
> Heterogeneity increases model capacity, but only helps when the training signal is aligned with the task.

---

### 2. Increasing complexity without control yields diminishing returns
More complex architectures:
- trained slower,
- were harder to stabilize,
- and did not always outperform simpler models.

This reinforced the idea that **controlled experiments and fair comparisons**
are more valuable than stacking complexity.

---

## Final outcomes

- A fixed **Graph3 reference pipeline** was established.
- Multiple GNN architectures were compared under the same protocol.
- Strong baselines were identified:
  - sampling-based GraphSAGE,
  - GAT with neighbor sampling,
  - PinSAGE-style random-walk sampling.
- Clear design principles for graph recommenders were extracted.

Rather than producing a single “winner”, the project resulted in a **set of validated insights** that can guide future work.

---

## Key takeaways

1. Ranking objectives > architecture choice.
2. Sampling is essential for both scalability and quality.
3. Attention helps, but well-constructed neighborhoods can largely compensate for its absence.
4. Heterogeneous graphs require careful training design.
5. Research clarity matters more than raw model complexity.

---

## Limitations and future directions

This project intentionally focused on architectural and training aspects.
Several promising directions are left for future iterations:
- systematic ablation of relation types,
- true heterogeneous GNNs (HeteroConv, HGT),
- text-augmented node features for cold-start handling,
- self-supervised pretraining (masked edges, contrastive learning).

These extensions build naturally on the current reference pipeline
and can be explored without revisiting earlier design decisions.