Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Preprint: arXiv:2606.03657
NovelAPIBench is a dynamic, model-conditional benchmark for studying how
code-LLMs acquire genuinely novel APIs — calls exposed by a library version
released after the model's training cutoff. Given any base model M and target
library L, the pipeline (i) discovers novel APIs, (ii) extracts a decomposed
knowledge bundle {S_name, S_param, E, M_prose, M_code}, (iii) generates
execute-then-assert coding tasks, and (iv) filters them to a model-conditional
set that probes only what M cannot already solve. Each failure is then
classified into one of six diagnostic categories (WrongAPISelection,
WrongImport, WrongSyntax, WrongParam, WrongShapeDtype, WrongLogic)
so the benchmark reports why a model failed.
- Highlights
- Knowledge Components
- Repository Layout
- Quick Start
- Installation
- Pipeline Usage
- Benchmark Groups, Models, Conditions
- Reproducing the Paper
- Data Layout
- Development
- Authors
- License
- Citation
- Dynamic, model-conditional benchmark. Library versions and the task pool are re-derived per base model so what stays in the benchmark is genuinely novel relative to that model's pretraining horizon.
- Decomposed knowledge bundle. Every novel API is represented by five paper-aligned components (S_name, S_param, E, M_prose, M_code) that can be combined freely to isolate which component closes which failure.
- Execute-then-assert harnesses. Every task ships with a structural, comparative, and mock-based assertion layer plus a target-API spy, so solutions cannot pass without invoking the API of interest.
- Six-class failure taxonomy. Each failed pass@1 sample is automatically labelled by the deterministic fast-path classifier plus a strong-model judge, turning aggregate pass rates into mechanism-level diagnostics.
- Adaptation matrix. First-class support for RAG, SFT/LoRA, RAFT, GRACE, MEMIT, and AlphaEdit on the same task pool, so external and parametric adaptation paradigms are evaluated under identical conditions.
- ~1.9k tasks across 19 libraries × 4 backbones (Qwen2.5-Coder-7B, Seed-Coder-8B, OpenCoder-8B, R1-Distill-Qwen-7B).
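To make the six-class taxonomy concrete, the sketch below encodes the labels as an enum and shows what a deterministic fast path could look like. The exception-to-label rules and the names `FailureClass` and `fast_path_label` are assumptions for illustration, not the repository's classifier; anything the rules cannot decide is returned as `None`, standing in for the hand-off to the strong-model judge.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    """The six diagnostic categories named in the benchmark description."""
    WRONG_API_SELECTION = "WrongAPISelection"
    WRONG_IMPORT = "WrongImport"
    WRONG_SYNTAX = "WrongSyntax"
    WRONG_PARAM = "WrongParam"
    WRONG_SHAPE_DTYPE = "WrongShapeDtype"
    WRONG_LOGIC = "WrongLogic"


def fast_path_label(exc: Optional[BaseException], spy_called: bool) -> Optional[FailureClass]:
    """Hypothetical deterministic rules; None means a judge would be needed."""
    if isinstance(exc, SyntaxError):
        return FailureClass.WRONG_SYNTAX
    if isinstance(exc, ImportError):  # also covers ModuleNotFoundError
        return FailureClass.WRONG_IMPORT
    if not spy_called:  # the target API was never invoked
        return FailureClass.WRONG_API_SELECTION
    if isinstance(exc, TypeError):  # assumed proxy for a bad argument list
        return FailureClass.WRONG_PARAM
    return None


# Illustrative (exception, spy_called) pairs, not real benchmark data.
failures = [
    (SyntaxError("invalid syntax"), False),
    (ModuleNotFoundError("no module"), False),
    (None, False),
    (TypeError("unexpected keyword argument"), True),
    (AssertionError("wrong value"), True),
]
labels = [fast_path_label(exc, called) for exc, called in failures]
counts = Counter(label.value if label else "needs_judge" for label in labels)
print(dict(counts))
```

Tallying labels this way is what turns a single pass rate into a per-mechanism breakdown.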
NovelAPIBench mirrors the paper's S / E / M decomposition end-to-end:
| Paper notation | What it captures | Code identifier (KnowledgeBundle) |
|---|---|---|
| S_name | fully qualified API name | bundle.surface.s_name |
| S_param | parameter signature + types + defaults | bundle.surface.parameters |
| E | validated usage exemplars (executed / static) | bundle.exemplars |
| M_prose | natural-language mechanism description | bundle.m_prose |
| M_code | implementation source (docstrings stripped) | bundle.m_code |
The 12 paper conditions are exposed as KnowledgeCondition values:
baseline, s_name, e, m_prose, m_code, s, s_m_prose, s_m_code,
s_m_prose_m_code, s_e, s_e_m_code, full.
NovelAPIBench/
├── configs/ Hydra-style YAML configs
│ ├── base.yaml models, hardware, paths
│ ├── stage{1..4}_*.yaml per-stage pipeline configs
│ ├── experiment.yaml condition grid + SFT/RAG hyperparams
│ ├── libraries/ 19 paper libraries (one YAML each)
│ ├── groups/ benchmark group definitions
│ └── experiments/ per-(group × model) experiment specs
├── src/
│ ├── common/ config, IO, LLM client, runtime env, sandbox
│ ├── stage1_discovery/ API-surface diff + novelty probes
│ ├── stage2_extraction/ S + E + M_prose + M_code extractors
│ ├── stage3_generation/ task generation + execute-then-assert harness
│ ├── stage4_filtering/ C1/C2/C3 gates
│ ├── experiments/ conditions, RAG indexer, SFT/RAFT/editing
│ └── evaluation/ pass@k, failure classifier, analysis (RQ1-3)
├── scripts/ CLI entry points (see `make help`)
├── tests/ pytest suite (82 tests)
├── docs/
│ ├── pipeline_overview.md what each stage produces
│ └── reproducibility.md exact paper-reproduction commands
├── environment-local.yml CPU-only env (laptop / CI smoke)
├── environment-gpu.yml CUDA + vLLM env (paper runs)
├── requirements{,-gpu}.txt
└── Makefile make help
A minimal smoke run that exercises every renamed code path without GPU or API credits (good for verifying a fresh checkout):
# 1. Create the lightweight env (no CUDA, no vLLM)
make install-local
conda activate novel-api-local
# 2. Lint + 82 unit tests + import every entry-point script
make check
make smoke
For real benchmark generation you need a GPU machine, an OPENAI_API_KEY, and
a GITHUB_TOKEN. See Installation and Pipeline Usage below.
conda env create -f environment-local.yml
conda activate novel-api-local
pip install -e .
make check
conda env create -f environment-gpu.yml
conda activate novel-api
pip install -e .
The GPU path has been tested with torch 2.10.0+cu128, vllm 0.19.0,
transformers 4.57.6, and trl 1.1.0.
Copy .env.example to .env and fill in:
GITHUB_TOKEN=... # for changelog / source fetching
OPENAI_API_KEY=... # for surface + M_prose extraction + strong-judge
NOVEL_API_TEMP_ENV_DIR=... # scratch dir for per-library versioned envs
CONDA_PKGS_DIRS=... # shared conda package cache (recommended)
The benchmark is built per library, then aggregated into groups, then run as condition × delivery experiments.
python scripts/run_pipeline.py --library torch --stages 1 2 3 4 --model qwen2.5-coder-7b
# or:
make pipeline LIBRARY=torch MODEL=qwen2.5-coder-7b
This runs:
- Stage 1 (Novel API discovery): resolves the library version pair around M's cutoff, installs both into versioned conda envs, diffs the public surface, and applies the novelty probe.
- Stage 2 (Knowledge extraction): extracts S + E (via web-search-grounded LLM), M_prose (tier-routed), and M_code.
- Stage 3 (Task generation): emits easy/medium/hard tasks per API with an execute-then-assert harness that wraps a target-API spy.
- Stage 4 (Filtering): applies the three model-conditional gates (C1 reference validity, C2 empirical novelty, C3 informativeness).
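The target-API spy used by the Stage 3 harness can be illustrated with the standard library alone. The following is a minimal sketch, assuming a spy built on `unittest.mock.patch.object(..., wraps=...)`; the helper `run_with_spy` and the use of `math.fsum` as a stand-in "target API" are hypothetical and not taken from the repository's harness.

```python
import math
from unittest import mock


def run_with_spy(solution, module, attr, args, expected):
    """Execute a candidate, then assert on the output AND on the spy.

    The spy wraps the real function, so behaviour is unchanged while
    calls are recorded.
    """
    original = getattr(module, attr)
    with mock.patch.object(module, attr, wraps=original) as spy:
        result = solution(*args)
    return result == expected and spy.called


def uses_target(xs):
    # Calls the API of interest.
    return math.fsum(xs)


def bypasses_target(xs):
    # Produces the same value without touching the target API.
    return float(sum(xs))


data = ([1.0, 2.0, 3.0],)
passed_with_target = run_with_spy(uses_target, math, "fsum", data, 6.0)
passed_without_target = run_with_spy(bypasses_target, math, "fsum", data, 6.0)
print(passed_with_target, passed_without_target)
```

Both candidates return the correct value, but only the one that actually calls the spied function passes, which is the property the harness relies on.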
Per-stage outputs land in data/raw/{library}/stage{N}_*.{model_short}.jsonl.
python scripts/build_group.py --group agent_tool --model qwen2.5-coder-7b
make build-group GROUP=agent_tool MODEL=qwen2.5-coder-7b
python scripts/run_experiment.py --experiment agent-tool-qwen2.5-coder-7b --all
python scripts/run_eval.py --experiment agent-tool-qwen2.5-coder-7b
python scripts/run_analysis.py --experiment agent-tool-qwen2.5-coder-7b --output data/analysis/
Or chain everything through:
python scripts/run_all_experiments.py --steps run eval analysis
Step-by-step debug runners (skip the heavy install/LLM cost) live behind
--debug --mock-llm:
python scripts/run_stage.py --stage 1 --library pytorch --debug --mock-llm
python scripts/run_stage.py --stage 2 --library pytorch --debug --mock-llm

| Group | Libraries |
|---|---|
| swe | fastapi, flask, django, sqlalchemy, pydantic |
| ai4science | rdkit, MDAnalysis, deepchem, pymatgen, astropy, ase |
| data_science | numpy, pandas, scipy |
| dl | diffusers, transformers, torch |
| agent_tool | langgraph, fastmcp |
| all_pooled_paper | pooled Section 5 split, 80/20 train/test |
| lolo_diffusers | Section 5 held-out-diffusers split |
qwen2.5-coder-7b # primary (Section 4)
seed-coder-8b-instruct
r1-distill-qwen-7b
opencoder-8b-instruct
| Condition | Components prepended at inference |
|---|---|
| baseline | none |
| s_name | S_name |
| e | E |
| m_prose | M_prose |
| m_code | M_code |
| s | S_name + S_param |
| s_m_prose | S + M_prose |
| s_m_code | S + M_code |
| s_m_prose_m_code | S + M_prose + M_code |
| s_e | S + E |
| s_e_m_code | S + E + M_code |
| full | S + E + M_prose + M_code |
Combined with three delivery mechanisms — rag, sft, rag_sft — this is the
full 1 + 11 × 3 = 34-cell main grid (build_experiment_grid in
src/experiments/conditions.py).
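The cell count can be checked with a short sketch. It transcribes the condition table into a dictionary and enumerates the cells; `sketch_grid` and the `(condition, delivery)` tuple layout are assumptions for illustration and do not reproduce the actual `build_experiment_grid` implementation.

```python
from itertools import product

# Condition -> components, transcribed from the condition table (S = S_name + S_param).
CONDITION_COMPONENTS = {
    "baseline": (),
    "s_name": ("S_name",),
    "e": ("E",),
    "m_prose": ("M_prose",),
    "m_code": ("M_code",),
    "s": ("S_name", "S_param"),
    "s_m_prose": ("S_name", "S_param", "M_prose"),
    "s_m_code": ("S_name", "S_param", "M_code"),
    "s_m_prose_m_code": ("S_name", "S_param", "M_prose", "M_code"),
    "s_e": ("S_name", "S_param", "E"),
    "s_e_m_code": ("S_name", "S_param", "E", "M_code"),
    "full": ("S_name", "S_param", "E", "M_prose", "M_code"),
}
DELIVERIES = ("rag", "sft", "rag_sft")


def sketch_grid():
    """Enumerate cells: one baseline cell plus every other condition x delivery."""
    cells = [("baseline", None)]  # assumed: baseline has no delivery mechanism
    non_baseline = [c for c in CONDITION_COMPONENTS if c != "baseline"]
    cells.extend(product(non_baseline, DELIVERIES))
    return cells


grid = sketch_grid()
print(len(grid))  # 1 + 11 * 3 = 34
```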
The exact end-to-end commands for every figure and table are in
docs/reproducibility.md. At a high level:
# Section 4 main grid (4 backbones × 5 groups)
for m in qwen2.5-coder-7b seed-coder-8b-instruct r1-distill-qwen-7b opencoder-8b-instruct; do
python scripts/run_pipeline.py --all-libraries --stages 1 2 3 4 --model "$m"
done
python scripts/build_group.py --paper-grid
python scripts/run_all_experiments.py --steps run eval analysis
# Section 5 parametric adaptation (SFT / RAFT / GRACE / MEMIT / AlphaEdit)
make pooled-paper # see docs/reproducibility.md for the rest
A single full pipeline run takes on the order of hours on an 8×A6000 node and consumes proportional OpenAI / GitHub API quota; the analysis scripts are CPU-only and cheap to re-run.
All runtime artefacts are gitignored and land under data/:
data/
├── raw/{library}/ per-library stage outputs
│ ├── stage1_discovery.{model}.jsonl
│ ├── stage2_knowledge.{model}.jsonl
│ ├── stage3_tasks.{model}.jsonl
│ └── stage4_filtered.{model}.jsonl
├── processed/groups/{group}/{model}/ group benchmark + 80/20 split
├── rag/{group}/ FAISS index per RAG cell
├── sft/{group}/ SFT training JSONLs per condition
├── edits/ MEMIT / AlphaEdit / GRACE artefacts
├── results/{experiment_id}/ predictions + pass@k
└── analysis/{experiment_id}/ per-RQ figures + summary CSVs
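For readers unfamiliar with the pass@k numbers stored under results/, the sketch below implements the unbiased estimator commonly used for code-generation benchmarks, 1 - C(n-c, k) / C(n, k) for n samples of which c pass. Whether the repository's evaluation module uses exactly this form is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: samples generated per task, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain pass fraction c / n.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(round(p1, 4), round(p5, 4))
```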
The on-disk knowledge bundle uses paper field names directly (S_name,
S_param, E, M_prose, M_code) so it can be consumed without the Python
package — see KnowledgeBundle.to_paper_dict() in
src/stage2_extraction/schemas.py.
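Because the on-disk field names match the paper, a bundle file can be read with nothing but the standard library. The sketch below writes one hypothetical record to a temporary JSONL file and reads it back; the record contents, the shape of `S_param` and `E`, and the helper `load_bundles` are assumptions for illustration, and real files may nest these fields differently.

```python
import json
import tempfile
from pathlib import Path

PAPER_FIELDS = ("S_name", "S_param", "E", "M_prose", "M_code")

# Hypothetical record for a made-up library; not real benchmark data.
record = {
    "S_name": "examplelib.transforms.new_op",
    "S_param": [{"name": "x", "type": "float", "default": None}],
    "E": ["examplelib.transforms.new_op(1.0)"],
    "M_prose": "Applies a placeholder transform to x.",
    "M_code": "def new_op(x):\n    return x",
}


def load_bundles(path):
    """Read one JSON object per line and check the five paper fields exist."""
    bundles = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [f for f in PAPER_FIELDS if f not in row]
            if missing:
                raise KeyError(f"bundle is missing fields: {missing}")
            bundles.append(row)
    return bundles


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stage2_knowledge.demo-model.jsonl"
    path.write_text(json.dumps(record) + "\n", encoding="utf-8")
    bundles = load_bundles(path)

print(bundles[0]["S_name"])
```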
make check # ruff + 82 pytest
make lint
make format
make test
make smoke # import-and-help every entry-point script
make clean
Pull requests and issues are welcome.
The pipeline code, curated benchmark records, task descriptions, test harnesses, and released failure labels are released under Apache-2.0.
Knowledge bundles may include M_code excerpts copied from upstream library
source. Those excerpts inherit the license of the source library — see
LICENSES.md for the complete per-library matrix and
redistribution notes (in particular, MDAnalysis- and ase-derived M_code
carry copyleft terms).
If you use NovelAPIBench in your research, please cite the preprint:
@misc{novelapibench2026,
title = {Diagnosing Knowledge Gaps in {LLM} Tool Use:
An Agentic Benchmark for Novel {API} Acquisition},
author = {Jinnuo Liu and Yue Peng and Jinhan Niu and Hongyi Wen},
year = {2026},
eprint = {2606.03657},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}