DRIFT

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

NJU-LINK Team, Nanjing University · JIUTIAN Research

DRIFT is a claim-centric auditing framework for diagnosing deep-research agent trajectories. Instead of only checking whether the final answer is correct, DRIFT localizes the semantic spans where an agent first makes, reuses, or finalizes harmful unsupported claims.

We release:

TELBench, a 1,000-instance benchmark for span-level error localization in real deep-research trajectories.
DRIFT, a clean runner with two comparable settings: bare full-context prediction and the full claim-centric drift pipeline.
Project page, figures, prompts, and reproducibility utilities.

What We Study

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, hypothesis testing, and answer synthesis. Final-answer accuracy tells us whether the agent succeeded, but not which step made the trajectory unreliable. TELBench turns this into a process-level task: given only the question and the ordered raw semantic spans, predict the ids of the harmful error spans.
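Because the task output is a set of span ids, one natural way to compare a prediction with the gold labels is set-based precision, recall, and F1. The sketch below is illustrative only: the function name `span_prf` and the span ids are hypothetical, and the metric actually computed by `drift-eval` may differ (see docs/USAGE.md).

```python
from typing import Iterable


def span_prf(gold_ids: Iterable[str], pred_ids: Iterable[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 over span ids for one trajectory.

    Assumption: this is a generic set-overlap score, not necessarily the
    metric implemented by drift-eval.
    """
    gold, pred = set(gold_ids), set(pred_ids)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical span ids: one of two predictions hits one of two gold spans.
scores = span_prf(gold_ids=["s4", "s9"], pred_ids=["s4", "s7"])
```

In this toy case precision, recall, and F1 are all 0.5, since exactly one predicted id overlaps with the two gold ids.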

DRIFT in Brief

DRIFT audits trajectories as claim graphs rather than independent spans:

Claim Keeper records decision-critical claims and when they become consequential.
Support Seeker uses graph-grep raw-span access to check whether those claims are supported by trajectory evidence.
Dependency Tracer localizes spans where unsupported claims become harmful commitments.
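To make the claim-graph idea concrete, here is a minimal sketch of how claims, their support status, and their dependencies could be represented, and how spans that introduce or reuse an ungrounded claim could be flagged. It is not the DRIFT implementation: the `Claim` class, its fields, and both functions are hypothetical names chosen for illustration, and the real components are LLM-driven rather than rule-based.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical claim record; field names are assumptions."""
    claim_id: str
    introduced_at: int   # index of the span that first states the claim
    supported: bool      # whether trajectory evidence backs the claim
    depends_on: list[str] = field(default_factory=list)
    used_at: list[int] = field(default_factory=list)  # spans relying on it


def is_grounded(claim_id: str, claims: dict[str, Claim], seen=None) -> bool:
    """A claim is grounded only if it and all of its dependencies are supported."""
    seen = set() if seen is None else seen
    if claim_id in seen:  # already checked, or part of a dependency cycle
        return True
    seen.add(claim_id)
    claim = claims[claim_id]
    return claim.supported and all(
        is_grounded(dep, claims, seen) for dep in claim.depends_on
    )


def flagged_spans(claims: dict[str, Claim]) -> list[int]:
    """Sorted span indices that introduce or reuse an ungrounded claim."""
    flagged: set[int] = set()
    for claim_id, claim in claims.items():
        if not is_grounded(claim_id, claims):
            flagged.add(claim.introduced_at)
            flagged.update(claim.used_at)
    return sorted(flagged)


# Toy trajectory: c2 lacks support, and c3 builds on c2.
claims = {
    "c1": Claim("c1", introduced_at=2, supported=True),
    "c2": Claim("c2", introduced_at=5, supported=False, used_at=[8]),
    "c3": Claim("c3", introduced_at=9, supported=True,
                depends_on=["c2"], used_at=[11]),
}
result = flagged_spans(claims)
```

Here `result` is `[5, 8, 9, 11]`: the unsupported claim c2 taints the spans that state and reuse it, and also the spans of c3, which depends on it. Treating spans independently would miss span 9, whose own claim looks supported in isolation.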

Model prompts receive only the task question and ordered raw span text. Gold labels, annotations, judge results, span types, manual notes, and generated summaries are stripped before prompting.
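The stripping step can be sketched with an allow-list, which keeps only the permitted fields instead of trying to enumerate everything that must be removed. The record layout and key names below (`question`, `spans`, `span_id`, `text`, and the label fields) are assumptions for illustration; the actual TELBench schema is documented in data/README.md.

```python
# Hypothetical field names; see data/README.md for the real schema.
ALLOWED_SPAN_KEYS = ("span_id", "text")


def build_prompt_input(instance: dict) -> dict:
    """Keep only the question and the ordered raw span text."""
    spans = [
        {key: span[key] for key in ALLOWED_SPAN_KEYS}
        for span in instance["spans"]  # list order is preserved as given
    ]
    return {"question": instance["question"], "spans": spans}


# Toy record with label-bearing fields that must not reach the model.
instance = {
    "question": "Which year was the observatory founded?",
    "gold_error_spans": ["s2"],
    "judge_result": "incorrect",
    "spans": [
        {"span_id": "s1", "text": "Search: observatory founding year",
         "span_type": "search"},
        {"span_id": "s2", "text": "It was founded in 1871.",
         "span_type": "claim", "manual_note": "no source"},
    ],
}
prompt_input = build_prompt_input(instance)
```

An allow-list fails safe: a newly added annotation field is dropped by default rather than leaking into the prompt.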

Quick Start

git clone https://github.com/NJU-LINK/DRIFT.git
cd DRIFT
python -m pip install -e .
python -m pip install -U huggingface_hub

hf download NJU-LINK/TELBench \
  --repo-type dataset \
  --local-dir data \
  --include "TELBench.jsonl.enc" \
  --include "TELBench.jsonl.enc.sha256" \
  --include "TELBench.jsonl.sha256" \
  --include "TELBench.passphrase.txt"

bash scripts/decrypt_telbench.sh

cat > .env <<'EOF'
API_URL=https://example.com/codex
API_KEY=your_api_key_here
EOF

drift \
  --setting drift \
  --input data/TELBench.jsonl \
  --model gpt-5.4 \
  --api-type responses \
  --env-file .env \
  --outdir runs/telbench_gpt54 \
  --workers 8

drift-eval \
  --gold data/TELBench.jsonl \
  --pred runs/telbench_gpt54/drift/gpt-5.4/summary.json \
  --output runs/telbench_gpt54/drift/gpt-5.4/eval.json
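The download step also fetches `.sha256` files alongside the encrypted and decrypted data. scripts/decrypt_telbench.sh may already verify them; if you want to check a file by hand, the following is a minimal sketch. It assumes each `.sha256` file starts with the hex digest, as in `sha256sum` output, which this README does not confirm. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(data_path: Path, checksum_path: Path) -> bool:
    """Compare a file's digest with the first token of its checksum file.

    Assumption: the checksum file begins with the hex digest, optionally
    followed by whitespace and a file name.
    """
    expected = checksum_path.read_text().split()[0].lower()
    return sha256_of(data_path) == expected
```

For example, `matches_checksum(Path("data/TELBench.jsonl"), Path("data/TELBench.jsonl.sha256"))` should return `True` after a successful decryption, provided the checksum file follows the assumed format.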

For complete API, data, and evaluation details, see docs/USAGE.md. For the TELBench JSONL schema, see data/README.md.

Project Structure

DRIFT/
├── data/
│   └── README.md              # download TELBench artifacts from Hugging Face
├── docs/
│   ├── index.html
│   ├── styles.css
│   ├── USAGE.md
│   └── assets/
├── scripts/
│   └── decrypt_telbench.sh
├── src/drift_open/
│   ├── cli.py
│   ├── client.py
│   ├── data.py
│   ├── evaluate.py
│   ├── prompts.py
│   ├── runner.py
│   └── span_store.py
├── tests/
└── pyproject.toml

Links

Project page: https://nju-link.github.io/DRIFT/
Usage guide: docs/USAGE.md
Data guide: data/README.md
Code package: src/drift_open

Citation

@misc{wang2026drift,
  title  = {Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories},
  author = {Wang, Jiaming and Feng, Ziteng and Wu, Jiangtao and others},
  year   = {2026},
  note   = {DRIFT project}
}
