🐛 Reproduction Issue: Significant performance gap on cross-domain evaluation
Summary
I tried to reproduce SubgraphRAG using the official code and default settings, but my results fall short of the reported numbers, especially on cross-domain evaluation (WebQSP ↔ CWQ). In-domain results are 0.02 to 0.13 below the paper, while cross-domain results are 0.33 to 0.57 below.
Results Comparison (K=100)
| Scenario | Metric | My Result | Paper Result | Gap |
|---|---|---|---|---|
| WebQSP → WebQSP | Answer Recall | 0.871 | 0.944 | -0.073 |
| | Shortest Path Triple Recall | 0.794 | 0.883 | -0.089 |
| | GPT-4o Triple Recall | 0.732 | 0.865 | -0.133 |
| CWQ → CWQ | Answer Recall | 0.895 | 0.914 | -0.019 |
| | Shortest Path Triple Recall | 0.759 | 0.811 | -0.052 |
| | GPT-4o Triple Recall | 0.739 | 0.840 | -0.101 |
| WebQSP → CWQ | Answer Recall | 0.560 | 0.887 | -0.327 |
| | Shortest Path Triple Recall | 0.374 | 0.794 | -0.420 |
| | GPT-4o Triple Recall | 0.363 | 0.776 | -0.413 |
| CWQ → WebQSP | Answer Recall | 0.203 | 0.773 | -0.570 |
| | Shortest Path Triple Recall | 0.114 | 0.622 | -0.508 |
| | GPT-4o Triple Recall | 0.124 | 0.623 | -0.499 |
Training Results (Validation Set)
| Dataset | loss | val/ans_recall@100 | val/triple_recall@100 | Best Epoch | Total Epochs | Patience |
|---|---|---|---|---|---|---|
| WebQSP | 0.01183 | 0.89683 | 0.82729 | 111 | 126 | 15 |
| CWQ | 0.00567 | 0.91499 | 0.77503 | 130 | 145 | 15 |
Observation: Training converged properly with reasonable validation metrics. The model seems to learn well on the training/validation sets.
Experimental Setup
Hyperparameters
- seed = 42 (default in source code)
- lr = 1e-4
- patience = 15
- K = 100 (for evaluation)
Run Commands

    # Training
    python train.py -d webqsp
    python train.py -d cwq

    # Evaluation (in-domain)
    python eval.py -d webqsp -p /path/to/webqsp_retrieval_result.pth
    python eval.py -d cwq -p /path/to/cwq_retrieval_result.pth

    # Evaluation (cross-domain)
    python eval.py -d cwq -p /path/to/webqsp_retrieval_result.pth
    python eval.py -d webqsp -p /path/to/cwq_retrieval_result.pth

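
For clarity, this is how I map the four evaluation commands onto the scenarios in the results table. It is only a sketch that prints the commands and does not run them. It assumes the file name prefix (`webqsp_` or `cwq_`) denotes the dataset the retriever was trained on and that `-d` selects the dataset being evaluated. The helpers `eval_command` and `eval_matrix` are hypothetical names of mine, not part of the repository, and `/path/to/` is the same placeholder as in the commands above.

```python
from itertools import product

DATASETS = ["webqsp", "cwq"]


def eval_command(train_ds: str, eval_ds: str) -> str:
    """Build the eval.py call for a retriever trained on train_ds, scored on eval_ds.

    Assumption: the result file is named after the training dataset and
    `-d` names the evaluation dataset, as the commands above suggest.
    """
    return f"python eval.py -d {eval_ds} -p /path/to/{train_ds}_retrieval_result.pth"


def eval_matrix() -> dict:
    """All (train, eval) pairs: two in-domain and two cross-domain runs."""
    return {
        (train, test): eval_command(train, test)
        for train, test in product(DATASETS, repeat=2)
    }


if __name__ == "__main__":
    for (train, test), cmd in eval_matrix().items():
        kind = "in-domain" if train == test else "cross-domain"
        print(f"{train} -> {test} ({kind}): {cmd}")
```

If this mapping is not how the cross-domain numbers in the paper were produced, that alone might explain part of the gap, so please correct me if the intended pairing of `-d` and `-p` is different.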
**Training (val set):**
- WebQSP: loss=0.01183, ans_recall=0.8968, triple_recall=0.8273 @ epoch 111
- CWQ: loss=0.00567, ans_recall=0.9150, triple_recall=0.7750 @ epoch 130
**Setup:** seed=42, lr=1e-4, patience=15, PyTorch 2.5.1+cu121, CUDA 12.4, GPU NVIDIA A40
**Tried:** the default seed (42), additional seeds (32, 52), and different patience values. The gap persists in all cases.
Any suggestions? Thanks!