Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
NJU-LINK Team, Nanjing University · JIUTIAN Research
DRIFT is a claim-centric auditing framework for diagnosing deep-research agent trajectories. Instead of only checking whether the final answer is correct, DRIFT localizes the semantic spans where an agent first makes, reuses, or finalizes harmful unsupported claims.
We release:
- TELBench, a 1,000-instance benchmark for span-level error localization in real deep-research trajectories.
- DRIFT, a clean runner with two comparable settings: `bare` full-context prediction and the full claim-centric `drift` pipeline.
- Project page, figures, prompts, and reproducibility utilities.
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, hypothesis testing, and answer synthesis. Final-answer accuracy shows whether the agent succeeded, but not which step made the trajectory unreliable. TELBench turns that gap into a process-level task: given only the question and the ordered raw semantic spans, predict the ids of the harmful error spans.
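One way to picture the task output is as a set of predicted span ids compared against a gold set. The sketch below scores a single trajectory with set-based precision, recall, and F1. The metric choice, the function name `score_span_ids`, and the span id format (`s7`, `s12`) are assumptions for illustration only; the evaluation actually performed by drift-eval is documented in docs/USAGE.md.

```python
from __future__ import annotations


def score_span_ids(gold: set[str], pred: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 over span ids for one trajectory.

    Illustrative only: not necessarily the metric that drift-eval reports.
    """
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ids: one of two gold error spans is found, plus one false alarm.
example = score_span_ids(gold={"s7", "s12"}, pred={"s7", "s9"})
```

Here one of the two predicted ids is correct and one of the two gold ids is recovered, so precision, recall, and F1 all come out to 0.5.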
DRIFT audits trajectories as claim graphs rather than independent spans:
- Claim Keeper records decision-critical claims and when they become consequential.
- Support Seeker uses graph-grep raw-span access to check whether those claims are supported by trajectory evidence.
- Dependency Tracer localizes spans where unsupported claims become harmful commitments.
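The division of labor among the three components can be illustrated with a toy, model-free sketch. Everything here is hypothetical: the `Claim` fields, the regex search standing in for graph-grep span access, and the rule that a claim inherits unsupported status from its dependencies. The released pipeline relies on model prompts rather than regex matching, so treat this only as a picture of the data flow.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A decision-critical claim, as a Claim Keeper might record it (toy schema)."""
    claim_id: str
    made_at: int                          # index of the span that first states it
    evidence_pattern: str | None = None   # regex for supporting evidence; None = derived claim
    used_at: list[int] = field(default_factory=list)      # spans that reuse or finalize it
    depends_on: list[str] = field(default_factory=list)   # ids of upstream claims


def has_support(claim: Claim, spans: list[str]) -> bool:
    """Support Seeker stand-in: grep the raw spans preceding the claim."""
    if claim.evidence_pattern is None:
        return True  # derived claims are judged through their dependencies only
    pattern = re.compile(claim.evidence_pattern, re.IGNORECASE)
    return any(pattern.search(text) for text in spans[: claim.made_at])


def trace_harmful_spans(claims: list[Claim], spans: list[str]) -> list[int]:
    """Dependency Tracer stand-in: spans where unsupported claims are made or relied on."""
    by_id = {c.claim_id: c for c in claims}
    unsupported: set[str] = set()

    def is_unsupported(claim_id: str, seen: frozenset[str] = frozenset()) -> bool:
        if claim_id in unsupported:
            return True
        if claim_id in seen:  # guard against cyclic dependencies
            return False
        claim = by_id[claim_id]
        bad = not has_support(claim, spans) or any(
            is_unsupported(dep, seen | {claim_id}) for dep in claim.depends_on
        )
        if bad:
            unsupported.add(claim_id)
        return bad

    flagged: set[int] = set()
    for claim in claims:
        if is_unsupported(claim.claim_id):
            flagged.add(claim.made_at)
            flagged.update(claim.used_at)
    return sorted(flagged)


spans = [
    "Search: who founded Acme Corp and when?",
    "Result: Acme Corp was founded in 1950 in Ohio.",
    "Reasoning: the founder must be John Smith.",
    "Reasoning: the company dates from 1950.",
    "Final answer: John Smith founded Acme Corp in 1950.",
]
claims = [
    Claim("c_year", made_at=3, evidence_pattern=r"founded in 1950"),
    Claim("c_founder", made_at=2, evidence_pattern=r"John Smith"),
    Claim("c_answer", made_at=4, depends_on=["c_year", "c_founder"]),
]
flagged = trace_harmful_spans(claims, spans)  # [2, 4]
```

In this toy trajectory the founder claim has no earlier evidence, so the span that introduces it (index 2) and the final answer that builds on it (index 4) are flagged, while the supported year claim is not.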
Model prompts receive only the task question and ordered raw span text. Gold labels, annotations, judge results, span types, manual notes, and generated summaries are stripped before prompting.
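A minimal sketch of this stripping step, written as an allowlist so that any field not explicitly kept is dropped. The field names (`question`, `spans`, `id`, `text`, `gold_error_spans`, `type`, `note`) and the function `build_model_input` are assumptions for illustration; the actual TELBench schema is described in data/README.md. Span ids are retained here on the assumption that the model needs them to name its predictions.

```python
from __future__ import annotations


def build_model_input(instance: dict) -> dict:
    """Keep only the question and the ordered raw span text (plus span ids).

    An allowlist is used instead of deleting known label fields, so that
    a newly added annotation field cannot leak into the prompt by accident.
    """
    return {
        "question": instance["question"],
        "spans": [
            {"id": span["id"], "text": span["text"]}
            for span in instance["spans"]  # list order is preserved
        ],
    }


# Hypothetical instance carrying labels that must not reach the model.
raw_instance = {
    "question": "Who founded Acme Corp?",
    "gold_error_spans": ["s2"],
    "spans": [
        {"id": "s1", "text": "Search: Acme Corp founder", "type": "search"},
        {"id": "s2", "text": "The founder is John Smith.", "type": "reasoning", "note": "unsupported"},
    ],
}
model_input = build_model_input(raw_instance)
```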
git clone https://github.com/NJU-LINK/DRIFT.git
cd DRIFT
python -m pip install -e .
python -m pip install -U huggingface_hub
hf download NJU-LINK/TELBench \
--repo-type dataset \
--local-dir data \
--include "TELBench.jsonl.enc" \
--include "TELBench.jsonl.enc.sha256" \
--include "TELBench.jsonl.sha256" \
--include "TELBench.passphrase.txt"
bash scripts/decrypt_telbench.sh
cat > .env <<'EOF'
API_URL=https://example.com/codex
API_KEY=your_api_key_here
EOF
drift \
--setting drift \
--input data/TELBench.jsonl \
--model gpt-5.4 \
--api-type responses \
--env-file .env \
--outdir runs/telbench_gpt54 \
--workers 8
drift-eval \
--gold data/TELBench.jsonl \
--pred runs/telbench_gpt54/drift/gpt-5.4/summary.json \
--output runs/telbench_gpt54/drift/gpt-5.4/eval.json
For complete API, data, and evaluation details, see docs/USAGE.md. For the TELBench JSONL schema, see data/README.md.
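Because the download includes TELBench.jsonl.sha256, the decrypted file can also be checked independently before a run. The sketch below assumes the digest file begins with the hex digest, as `sha256sum` writes it; that format, and the helper names `sha256_of` and `matches_published_digest`, are assumptions and not part of the released tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large JSONL need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(data_path: Path, digest_path: Path) -> bool:
    """Compare the file's SHA-256 with the first token of the digest file.

    Assumption: the digest file looks like '<hex digest>  <filename>'.
    """
    expected = digest_path.read_text().split()[0].lower()
    return sha256_of(data_path) == expected
```

After decryption, `matches_published_digest(Path("data/TELBench.jsonl"), Path("data/TELBench.jsonl.sha256"))` should return `True` if the assumed digest format holds.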
DRIFT/
├── data/
│ └── README.md # download TELBench artifacts from Hugging Face
├── docs/
│ ├── index.html
│ ├── styles.css
│ ├── USAGE.md
│ └── assets/
├── scripts/
│ └── decrypt_telbench.sh
├── src/drift_open/
│ ├── cli.py
│ ├── client.py
│ ├── data.py
│ ├── evaluate.py
│ ├── prompts.py
│ ├── runner.py
│ └── span_store.py
├── tests/
└── pyproject.toml
- Project page: https://nju-link.github.io/DRIFT/
- Usage guide: docs/USAGE.md
- Data guide: data/README.md
- Code package: src/drift_open
@misc{wang2026drift,
title = {Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories},
author = {Wang, Jiaming and Feng, Ziteng and Wu, Jiangtao and others},
year = {2026},
note = {DRIFT project}
}
