Significant performance gap on cross-domain evaluation #39

## 🐛 Reproduction Issue: Significant performance gap on cross-domain evaluation

### Summary

I tried to reproduce SubgraphRAG with the official code and default settings, but observed a **significant performance gap**, especially on **cross-domain** evaluation (WebQSP ↔ CWQ). In-domain answer recall is relatively close (0.02 to 0.07 below the reported numbers), whereas cross-domain answer recall is 0.33 to 0.57 below them.

---

### Results Comparison (K=100)

| Scenario | Metric | My Result | Paper Result | Gap |
|----------|--------|-----------|--------------|-----|
| **WebQSP → WebQSP** | Answer Recall | 0.871 | 0.944 | **-0.073** |
| | Shortest Path Triple Recall | 0.794 | 0.883 | -0.089 |
| | GPT-4o Triple Recall | 0.732 | 0.865 | -0.133 |
| **CWQ → CWQ** | Answer Recall | 0.895 | 0.914 | **-0.019** |
| | Shortest Path Triple Recall | 0.759 | 0.811 | -0.052 |
| | GPT-4o Triple Recall | 0.739 | 0.840 | -0.101 |
| **WebQSP → CWQ** | Answer Recall | 0.560 | 0.887 | **-0.327** |
| | Shortest Path Triple Recall | 0.374 | 0.794 | -0.420 |
| | GPT-4o Triple Recall | 0.363 | 0.776 | -0.413 |
| **CWQ → WebQSP** | Answer Recall | 0.203 | 0.773 | **-0.570** |
| | Shortest Path Triple Recall | 0.114 | 0.622 | -0.508 |
| | GPT-4o Triple Recall | 0.124 | 0.623 | -0.499 |
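
For anyone who wants to double-check the Gap column or look at the drop in relative terms, here is a small sketch that recomputes it from the table. The numbers are copied from the rows above; the names `RESULTS` and `compute_gaps` are only illustrative and do not come from the SubgraphRAG repository.

```python
# (scenario, metric) -> (reproduced, reported), copied from the table above.
RESULTS = {
    ("WebQSP → WebQSP", "Answer Recall"): (0.871, 0.944),
    ("WebQSP → WebQSP", "Shortest Path Triple Recall"): (0.794, 0.883),
    ("WebQSP → WebQSP", "GPT-4o Triple Recall"): (0.732, 0.865),
    ("CWQ → CWQ", "Answer Recall"): (0.895, 0.914),
    ("CWQ → CWQ", "Shortest Path Triple Recall"): (0.759, 0.811),
    ("CWQ → CWQ", "GPT-4o Triple Recall"): (0.739, 0.840),
    ("WebQSP → CWQ", "Answer Recall"): (0.560, 0.887),
    ("WebQSP → CWQ", "Shortest Path Triple Recall"): (0.374, 0.794),
    ("WebQSP → CWQ", "GPT-4o Triple Recall"): (0.363, 0.776),
    ("CWQ → WebQSP", "Answer Recall"): (0.203, 0.773),
    ("CWQ → WebQSP", "Shortest Path Triple Recall"): (0.114, 0.622),
    ("CWQ → WebQSP", "GPT-4o Triple Recall"): (0.124, 0.623),
}


def compute_gaps(results):
    """Map each row to (absolute gap, gap relative to the reported value)."""
    gaps = {}
    for key, (mine, paper) in results.items():
        diff = mine - paper
        gaps[key] = (round(diff, 3), round(diff / paper, 3))
    return gaps


for (scenario, metric), (abs_gap, rel_gap) in compute_gaps(RESULTS).items():
    print(f"{scenario:<16} {metric:<28} {abs_gap:+.3f}  ({rel_gap:+.1%})")
```

On these numbers the largest drop is CWQ → WebQSP answer recall, which is about 74% below the reported value in relative terms.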

---

### Training Results (Validation Set)

| Dataset | loss | val/ans_recall@100 | val/triple_recall@100 | Best Epoch | Total Epochs | Patience |
|---------|------|--------------------|------------------------|------------|--------------|----------|
| WebQSP | 0.01183 | 0.89683 | 0.82729 | 111 | 126 | 15 |
| CWQ | 0.00567 | 0.91499 | 0.77503 | 130 | 145 | 15 |

**Observation**: Training converged properly with reasonable validation metrics. The model seems to learn well on the training/validation sets.
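
The epoch counts in the table are consistent with patience-based early stopping: in both runs Total Epochs = Best Epoch + Patience (111 + 15 = 126 and 130 + 15 = 145). Below is a minimal sketch of that stopping rule. It assumes the monitored metric is maximised, that epochs are counted from 1, and that training stops after `patience` epochs without improvement. The function name and the toy metric history are hypothetical and are not taken from the repository's `train.py`.

```python
def early_stop_epoch(history, patience):
    """Return (best_epoch, last_epoch), both 1-indexed, for a maximised metric."""
    best_value = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(history, start=1):
        if value > best_value:
            # New best: remember it and reset the patience window.
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(history)


# Toy history: the metric improves for 111 epochs, then stays flat.
toy_history = [i / 1000 for i in range(1, 112)] + [0.0] * 50
best_epoch, last_epoch = early_stop_epoch(toy_history, patience=15)
print(best_epoch, last_epoch)
```

With this toy history the rule reproduces the WebQSP pattern from the table (best epoch 111, stop at epoch 126).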

---

### Experimental Setup

#### Hyperparameters
- `seed = 42` (default in source code)
- `lr = 1e-4`
- `patience = 15`
- `K = 100` (for evaluation)
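
To make clear what `K` controls, here is a minimal sketch of how answer recall and triple recall at K could be computed from a ranked list of triples. This reflects my understanding of the metrics, not the repository's `eval.py`; the function name, argument names, and toy data are all hypothetical.

```python
def recall_at_k(ranked_triples, answer_entities, target_triples, k=100):
    """Return (answer recall, triple recall) over the top-k ranked triples.

    ranked_triples: list of (head, relation, tail) tuples, best score first.
    answer_entities: set of gold answer entities.
    target_triples: set of gold (head, relation, tail) tuples.
    """
    top_k = ranked_triples[:k]
    # An answer counts as recalled if it appears as head or tail of a top-k triple.
    covered_entities = {entity for head, _, tail in top_k for entity in (head, tail)}
    retrieved = set(top_k)

    def fraction(gold, found):
        # Avoid division by zero for samples without gold labels.
        return len(gold & found) / len(gold) if gold else 0.0

    return fraction(answer_entities, covered_entities), fraction(target_triples, retrieved)


# Toy example with k=2: only the first two triples are kept.
ranked = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
ans_recall, triple_recall = recall_at_k(
    ranked,
    answer_entities={"c", "d"},
    target_triples={("a", "r1", "b"), ("c", "r3", "d")},
    k=2,
)
print(ans_recall, triple_recall)
```

In the toy example one of the two answers and one of the two gold triples fall inside the top 2, so both recalls are 0.5.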

#### Run Commands
```bash
# Training
python train.py -d webqsp
python train.py -d cwq

# Evaluation (in-domain)
python eval.py -d webqsp -p /path/to/webqsp_retrieval_result.pth
python eval.py -d cwq -p /path/to/cwq_retrieval_result.pth

# Evaluation (cross-domain)
python eval.py -d cwq -p /path/to/webqsp_retrieval_result.pth
python eval.py -d webqsp -p /path/to/cwq_retrieval_result.pth
```

**Training (val set):**
- WebQSP: loss=0.01183, ans_recall=0.8968, triple_recall=0.8273 @ epoch 111
- CWQ: loss=0.00567, ans_recall=0.9150, triple_recall=0.7750 @ epoch 130

**Setup:** seed=42, lr=1e-4, patience=15, PyTorch 2.5.1+cu121, CUDA 12.4, GPU NVIDIA A40

**Tried:** the default seed, patience tuning, and additional seeds (32, 52). The gap persists in every case.

Any suggestions? Thanks!

