NovelAPIBench

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Preprint: arXiv:2606.03657

NovelAPIBench is a dynamic, model-conditional benchmark for studying how code-LLMs acquire genuinely novel APIs — calls exposed by a library version released after the model's training cutoff. Given any base model M and target library L, the pipeline (i) discovers novel APIs, (ii) extracts a decomposed knowledge bundle {S_name, S_param, E, M_prose, M_code}, (iii) generates execute-then-assert coding tasks, and (iv) filters them to a model-conditional set that probes only what M cannot already solve. Each failure is then classified into one of six diagnostic categories (WrongAPISelection, WrongImport, WrongSyntax, WrongParam, WrongShapeDtype, WrongLogic) so the benchmark reports why a model failed.
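To make the taxonomy concrete, here is a minimal sketch of how a rule-based first pass over a failed sample might assign some of the six labels from the raised exception alone. The six category names come from this README; the `fast_path_label` helper and its rules are assumptions for illustration, not the repository's classifier. Anything the rules cannot decide returns `None` and would be left to a judge.

```python
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    # Names taken from the README; the rules below are hypothetical.
    WRONG_API_SELECTION = "WrongAPISelection"
    WRONG_IMPORT = "WrongImport"
    WRONG_SYNTAX = "WrongSyntax"
    WRONG_PARAM = "WrongParam"
    WRONG_SHAPE_DTYPE = "WrongShapeDtype"
    WRONG_LOGIC = "WrongLogic"


def fast_path_label(
    exc: BaseException, target_api_called: bool
) -> Optional[FailureCategory]:
    """Assign a label from cheap signals only; None means 'undecided'."""
    if isinstance(exc, SyntaxError):
        return FailureCategory.WRONG_SYNTAX
    if isinstance(exc, ImportError):  # also covers ModuleNotFoundError
        return FailureCategory.WRONG_IMPORT
    if not target_api_called:
        # The harness never saw the target API invoked.
        return FailureCategory.WRONG_API_SELECTION
    if isinstance(exc, TypeError):
        # Heuristic: bad keyword or arity usually surfaces as TypeError.
        return FailureCategory.WRONG_PARAM
    return None  # ambiguous (e.g. AssertionError): defer to a stronger judge
```

Cases such as `WrongShapeDtype` and `WrongLogic` typically need the assertion message or the output itself, which is why this sketch leaves them undecided.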

Highlights

Dynamic, model-conditional benchmark. Library versions and the task pool are re-derived per base model so what stays in the benchmark is genuinely novel relative to that model's pretraining horizon.
Decomposed knowledge bundle. Every novel API is represented by five paper-aligned components (S_name, S_param, E, M_prose, M_code) that can be combined freely to isolate which component closes which failure.
Execute-then-assert harnesses. Every task ships with a structural, comparative, and mock-based assertion layer plus a target-API spy, so solutions cannot pass without invoking the API of interest.
Six-class failure taxonomy. Each failed pass@1 sample is automatically labelled by the deterministic fast-path classifier plus a strong-model judge, turning aggregate pass rates into mechanism-level diagnostics.
Adaptation matrix. First-class support for RAG, SFT/LoRA, RAFT, GRACE, MEMIT, and AlphaEdit on the same task pool, so external and parametric adaptation paradigms are evaluated under identical conditions.
~1.9k tasks across 19 libraries × 4 backbones (Qwen2.5-Coder-7B, Seed-Coder-8B, OpenCoder-8B, R1-Distill-Qwen-7B).
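The target-API spy idea can be illustrated with the standard library alone. The sketch below is not the repository's harness: it uses `math.fsum` as a stand-in for a novel API, and `run_with_spy` and `passes` are hypothetical helper names. It shows why a solution that computes the right value without calling the target API still fails.

```python
import math
from unittest import mock


def run_with_spy(solution, module, api_name, *args):
    """Run `solution` while counting calls to `module.<api_name>`."""
    real = getattr(module, api_name)
    # wraps=real keeps the original behaviour but records every call.
    with mock.patch.object(module, api_name, wraps=real) as spy:
        result = solution(*args)
    return result, spy.call_count


def good_solution(xs):
    return math.fsum(xs)  # goes through the target API


def bypass_solution(xs):
    return sum(xs)  # same value here, but never touches the target API


def passes(solution, xs, expected):
    """Execute, then assert on both the value and the spy."""
    result, calls = run_with_spy(solution, math, "fsum", xs)
    return result == expected and calls > 0
```

With `xs = [1.0, 2.0, 3.0]` and `expected = 6.0`, `good_solution` passes and `bypass_solution` does not, even though both return the same number.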

Knowledge Components

NovelAPIBench mirrors the paper's S / E / M decomposition end-to-end:

Paper notation	What it captures	Code identifier (`KnowledgeBundle`)
`S_name`	fully qualified API name	`bundle.surface.s_name`
`S_param`	parameter signature + types + defaults	`bundle.surface.parameters`
`E`	validated usage exemplars (`executed` / `static`)	`bundle.exemplars`
`M_prose`	natural-language mechanism description	`bundle.m_prose`
`M_code`	implementation source (docstrings stripped)	`bundle.m_code`
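As a rough picture of how the identifiers in the table relate to the paper field names, here is a hypothetical dataclass sketch. The attribute names mirror the table above, but the class names `SurfaceSketch` and `BundleSketch`, the field types, and the body of `to_paper_dict` are assumptions; the real schema lives in src/stage2_extraction/schemas.py and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SurfaceSketch:
    s_name: str                       # fully qualified API name
    parameters: List[Dict[str, Any]]  # one entry per parameter (assumed shape)


@dataclass
class BundleSketch:
    surface: SurfaceSketch
    exemplars: List[str] = field(default_factory=list)  # usage exemplars
    m_prose: str = ""                                   # mechanism description
    m_code: str = ""                                    # implementation source

    def to_paper_dict(self) -> Dict[str, Any]:
        """Flatten to the paper's field names (layout assumed, not verified)."""
        return {
            "S_name": self.surface.s_name,
            "S_param": self.surface.parameters,
            "E": self.exemplars,
            "M_prose": self.m_prose,
            "M_code": self.m_code,
        }


# `pkg.new_fn` is a made-up API name used only for illustration.
example = BundleSketch(
    surface=SurfaceSketch("pkg.new_fn", [{"name": "x", "type": "int", "default": None}]),
    exemplars=["pkg.new_fn(1)"],
)
```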

The 12 paper conditions are exposed as KnowledgeCondition values:

baseline, s_name, e, m_prose, m_code, s, s_m_prose, s_m_code, s_m_prose_m_code, s_e, s_e_m_code, full.

Repository Layout

NovelAPIBench/
├── configs/                       Hydra-style YAML configs
│   ├── base.yaml                  models, hardware, paths
│   ├── stage{1..4}_*.yaml         per-stage pipeline configs
│   ├── experiment.yaml            condition grid + SFT/RAG hyperparams
│   ├── libraries/                 19 paper libraries (one YAML each)
│   ├── groups/                    benchmark group definitions
│   └── experiments/               per-(group × model) experiment specs
├── src/
│   ├── common/                    config, IO, LLM client, runtime env, sandbox
│   ├── stage1_discovery/          API-surface diff + novelty probes
│   ├── stage2_extraction/         S + E + M_prose + M_code extractors
│   ├── stage3_generation/         task generation + execute-then-assert harness
│   ├── stage4_filtering/          C1/C2/C3 gates
│   ├── experiments/               conditions, RAG indexer, SFT/RAFT/editing
│   └── evaluation/                pass@k, failure classifier, analysis (RQ1-3)
├── scripts/                       CLI entry points (see `make help`)
├── tests/                         pytest suite (82 tests)
├── docs/
│   ├── pipeline_overview.md       what each stage produces
│   └── reproducibility.md         exact paper-reproduction commands
├── environment-local.yml          CPU-only env (laptop / CI smoke)
├── environment-gpu.yml            CUDA + vLLM env (paper runs)
├── requirements{,-gpu}.txt
└── Makefile                       make help

Quick Start

A minimal smoke run that lints the code, runs the unit tests, and imports every entry-point script, without a GPU or API credits (good for verifying a fresh checkout):

# 1. Create the lightweight env (no CUDA, no vLLM)
make install-local
conda activate novel-api-local

# 2. Lint + 82 unit tests + import every entry-point script
make check
make smoke

For real benchmark generation you need a GPU machine, an OPENAI_API_KEY, and a GITHUB_TOKEN. See Installation and Pipeline Usage below.

Installation

Local checks (CPU-only)

conda env create -f environment-local.yml
conda activate novel-api-local
pip install -e .
make check

Full reproduction (CUDA + vLLM)

conda env create -f environment-gpu.yml
conda activate novel-api
pip install -e .

The GPU path has been tested with torch 2.10.0+cu128, vllm 0.19.0, transformers 4.57.6, and trl 1.1.0.

Environment variables

Copy .env.example to .env and fill in:

GITHUB_TOKEN=...                # for changelog / source fetching
OPENAI_API_KEY=...              # for surface + M_prose extraction + strong-judge
NOVEL_API_TEMP_ENV_DIR=...      # scratch dir for per-library versioned envs
CONDA_PKGS_DIRS=...             # shared conda package cache (recommended)
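Before launching a long run it can help to confirm that these variables are actually set in the current process. The helper below is a small sketch, not part of the repository: `missing_vars` is a hypothetical name, and treating only the two credentials as required follows the Quick Start note rather than any documented contract. It inspects a mapping such as `os.environ` and does not parse `.env` itself.

```python
import os
from typing import List, Mapping, Sequence

# Credentials the Quick Start says are needed for real benchmark generation.
REQUIRED = ("GITHUB_TOKEN", "OPENAI_API_KEY")
# Listed in .env.example; whether they are mandatory is not stated.
OPTIONAL = ("NOVEL_API_TEMP_ENV_DIR", "CONDA_PKGS_DIRS")


def missing_vars(env: Mapping[str, str], names: Sequence[str] = REQUIRED) -> List[str]:
    """Return the names that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


def describe(env: Mapping[str, str] = os.environ) -> str:
    """One-line summary suitable for printing before a pipeline run."""
    required = missing_vars(env, REQUIRED)
    optional = missing_vars(env, OPTIONAL)
    if required:
        return "missing required: " + ", ".join(required)
    if optional:
        return "ok (optional unset: " + ", ".join(optional) + ")"
    return "ok"
```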

Pipeline Usage

The benchmark is built per library, then aggregated into groups, then run as condition × delivery experiments.

Build one library

python scripts/run_pipeline.py --library torch --stages 1 2 3 4 --model qwen2.5-coder-7b
# or:
make pipeline LIBRARY=torch MODEL=qwen2.5-coder-7b

This runs:

Stage 1 — Novel API discovery: resolves the library version pair around M's cutoff, installs both into versioned conda envs, diffs the public surface, and applies the novelty probe.
Stage 2 — Knowledge extraction: extracts S + E (via web-search-grounded LLM), M_prose (tier-routed), and M_code.
Stage 3 — Task generation: emits easy/medium/hard tasks per API with an execute-then-assert harness that wraps a target-API spy.
Stage 4 — Filtering: applies the three model-conditional gates (C1 reference validity, C2 empirical novelty, C3 informativeness).
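The surface diff in Stage 1 can be pictured as a set difference over public names. The sketch below is an illustration under simplifying assumptions: it compares two in-memory module objects, whereas the pipeline installs each version into its own conda env, and the helper names `public_surface`, `added_apis`, and `make_module` are hypothetical.

```python
import types
from typing import Set


def public_surface(module: types.ModuleType) -> Set[str]:
    """Public names: `__all__` if declared, else every non-underscore name."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return set(declared)
    return {name for name in vars(module) if not name.startswith("_")}


def added_apis(old: types.ModuleType, new: types.ModuleType) -> Set[str]:
    """Names exposed by `new` that `old` did not expose."""
    return public_surface(new) - public_surface(old)


def make_module(name: str, **attrs) -> types.ModuleType:
    """Build a throwaway module standing in for one library version."""
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    return mod


# Two fake versions of a made-up library: v2 adds `stream` and a private helper.
old_version = make_module("lib_v1", load=lambda: None, _helper=lambda: None)
new_version = make_module(
    "lib_v2", load=lambda: None, stream=lambda: None, _helper=lambda: None, _cache={}
)
candidates = added_apis(old_version, new_version)
```

Here `candidates` contains only `stream`; private names are ignored. A name appearing in the diff is only a candidate, since the novelty probe still has to confirm the model does not already know it.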

Per-stage outputs land in data/raw/{library}/stage{N}_*.{model_short}.jsonl.

Build a benchmark group

python scripts/build_group.py --group agent_tool --model qwen2.5-coder-7b
make build-group GROUP=agent_tool MODEL=qwen2.5-coder-7b

Run an experiment

python scripts/run_experiment.py --experiment agent-tool-qwen2.5-coder-7b --all
python scripts/run_eval.py        --experiment agent-tool-qwen2.5-coder-7b
python scripts/run_analysis.py    --experiment agent-tool-qwen2.5-coder-7b --output data/analysis/

Or chain everything through:

python scripts/run_all_experiments.py --steps run eval analysis
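For readers unfamiliar with the pass@k metric reported by the evaluation step, the snippet below sketches the widely used unbiased estimator introduced with HumanEval (Chen et al., 2021). It is offered as background; whether the repository's evaluation code computes pass@k in exactly this way is not confirmed here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.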

Debug helpers

Step-by-step debug runners (skip the heavy install/LLM cost) live behind --debug --mock-llm:

python scripts/run_stage.py --stage 1 --library torch --debug --mock-llm
python scripts/run_stage.py --stage 2 --library torch --debug --mock-llm

Benchmark Groups, Models, Conditions

Groups

Group	Libraries
`swe`	`fastapi`, `flask`, `django`, `sqlalchemy`, `pydantic`
`ai4science`	`rdkit`, `MDAnalysis`, `deepchem`, `pymatgen`, `astropy`, `ase`
`data_science`	`numpy`, `pandas`, `scipy`
`dl`	`diffusers`, `transformers`, `torch`
`agent_tool`	`langgraph`, `fastmcp`
`all_pooled_paper`	pooled Section 5 split, 80/20 train/test
`lolo_diffusers`	Section 5 held-out-`diffusers` split
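The two Section 5 splits differ in what is held out: a random 20% of pooled tasks versus every task from one library. The sketch below illustrates both under stated assumptions. The task dictionaries, the `library` key, the fixed seed, and the helper names are hypothetical, and the paper's actual split may be stratified or grouped differently.

```python
import random
from typing import Dict, List, Sequence, Tuple

Task = Dict[str, str]


def pooled_split(
    tasks: Sequence[Task], train_frac: float = 0.8, seed: int = 0
) -> Tuple[List[Task], List[Task]]:
    """Seeded shuffle, then cut at `train_frac` (80/20 by default)."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def leave_one_library_out(
    tasks: Sequence[Task], held_out: str
) -> Tuple[List[Task], List[Task]]:
    """Train on every library except `held_out`; test only on `held_out`."""
    train = [t for t in tasks if t["library"] != held_out]
    test = [t for t in tasks if t["library"] == held_out]
    return train, test


# Ten toy tasks: six tagged torch, four tagged diffusers.
toy = [{"id": str(i), "library": "torch" if i < 6 else "diffusers"} for i in range(10)]
pooled_train, pooled_test = pooled_split(toy)
lolo_train, lolo_test = leave_one_library_out(toy, "diffusers")
```

The held-out-library split tests transfer to a library never seen during adaptation, which the pooled split cannot measure.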

Models

qwen2.5-coder-7b           # primary (Section 4)
seed-coder-8b-instruct
r1-distill-qwen-7b
opencoder-8b-instruct

Knowledge Conditions

Condition	Components prepended at inference
`baseline`	none
`s_name`	`S_name`
`e`	`E`
`m_prose`	`M_prose`
`m_code`	`M_code`
`s`	`S_name + S_param`
`s_m_prose`	`S + M_prose`
`s_m_code`	`S + M_code`
`s_m_prose_m_code`	`S + M_prose + M_code`
`s_e`	`S + E`
`s_e_m_code`	`S + E + M_code`
`full`	`S + E + M_prose + M_code`
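The table can be read as a lookup from condition name to the bundle fields that get prepended, with `S` expanded to `S_name + S_param`. The sketch below encodes that lookup; the mapping follows the table, while the `build_context` helper and its section-header formatting are assumptions rather than the repository's prompt template.

```python
from typing import Dict, Mapping, Tuple

S = ("S_name", "S_param")  # the table's shorthand for the full surface

CONDITION_COMPONENTS: Dict[str, Tuple[str, ...]] = {
    "baseline": (),
    "s_name": ("S_name",),
    "e": ("E",),
    "m_prose": ("M_prose",),
    "m_code": ("M_code",),
    "s": S,
    "s_m_prose": S + ("M_prose",),
    "s_m_code": S + ("M_code",),
    "s_m_prose_m_code": S + ("M_prose", "M_code"),
    "s_e": S + ("E",),
    "s_e_m_code": S + ("E", "M_code"),
    "full": S + ("E", "M_prose", "M_code"),
}


def build_context(bundle: Mapping[str, str], condition: str) -> str:
    """Concatenate the selected fields; the header format is hypothetical."""
    keys = CONDITION_COMPONENTS[condition]
    return "\n\n".join(f"[{key}]\n{bundle[key]}" for key in keys)


# Placeholder bundle with dummy field values.
demo_bundle = {k: k.lower() + " text" for k in ("S_name", "S_param", "E", "M_prose", "M_code")}
```

Under `baseline` the context is empty, so the model sees only the task description.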

The baseline is run once, and each of the other 11 conditions is combined with three delivery mechanisms (rag, sft, rag_sft), giving the full 1 + 11 × 3 = 34-cell main grid (build_experiment_grid in src/experiments/conditions.py).
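The cell count can be checked with a few lines of Python. This is a sketch of the arithmetic only: `sketch_grid` is a hypothetical stand-in, and the real `build_experiment_grid` may return richer objects than these tuples.

```python
from itertools import product
from typing import List, Optional, Tuple

CONDITIONS = [
    "baseline", "s_name", "e", "m_prose", "m_code", "s",
    "s_m_prose", "s_m_code", "s_m_prose_m_code", "s_e", "s_e_m_code", "full",
]
DELIVERIES = ["rag", "sft", "rag_sft"]


def sketch_grid() -> List[Tuple[str, Optional[str]]]:
    """One baseline cell plus every non-baseline condition x delivery pair."""
    cells: List[Tuple[str, Optional[str]]] = [("baseline", None)]
    non_baseline = [c for c in CONDITIONS if c != "baseline"]
    cells.extend(product(non_baseline, DELIVERIES))
    return cells


grid = sketch_grid()
```

`len(grid)` is 34: the baseline has no knowledge to deliver, so it is not multiplied by the delivery mechanisms.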

Reproducing the Paper

The exact end-to-end commands for every figure and table are in docs/reproducibility.md. At a high level:

# Section 4 main grid (4 backbones × 5 groups)
for m in qwen2.5-coder-7b seed-coder-8b-instruct r1-distill-qwen-7b opencoder-8b-instruct; do
    python scripts/run_pipeline.py --all-libraries --stages 1 2 3 4 --model "$m"
done
python scripts/build_group.py --paper-grid
python scripts/run_all_experiments.py --steps run eval analysis

# Section 5 parametric adaptation (SFT / RAFT / GRACE / MEMIT / AlphaEdit)
make pooled-paper                              # see docs/reproducibility.md for the rest

A single full pipeline run takes on the order of hours on an 8×A6000 node and consumes a proportional amount of OpenAI and GitHub API quota; the analysis scripts are CPU-only and cheap to re-run.

Data Layout

All runtime artefacts are gitignored and land under data/:

data/
├── raw/{library}/                   per-library stage outputs
│   ├── stage1_discovery.{model}.jsonl
│   ├── stage2_knowledge.{model}.jsonl
│   ├── stage3_tasks.{model}.jsonl
│   └── stage4_filtered.{model}.jsonl
├── processed/groups/{group}/{model}/   group benchmark + 80/20 split
├── rag/{group}/                     FAISS index per RAG cell
├── sft/{group}/                     SFT training JSONLs per condition
├── edits/                           MEMIT / AlphaEdit / GRACE artefacts
├── results/{experiment_id}/         predictions + pass@k
└── analysis/{experiment_id}/        per-RQ figures + summary CSVs
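Because the stage outputs are plain JSONL files, they can be inspected without the package. The sketch below builds a path from the layout above and iterates over records. The helper names are hypothetical, the record schema is not assumed, and the `model` argument should be whatever short model name the pipeline actually writes into the file name.

```python
import json
from pathlib import Path
from typing import Dict, Iterator

# File stems as shown in the data/raw/{library}/ listing.
STAGE_FILES: Dict[int, str] = {
    1: "stage1_discovery",
    2: "stage2_knowledge",
    3: "stage3_tasks",
    4: "stage4_filtered",
}


def stage_path(root: Path, library: str, stage: int, model: str) -> Path:
    """Path of one stage output, e.g. <root>/raw/<library>/stage4_filtered.<model>.jsonl."""
    return root / "raw" / library / f"{STAGE_FILES[stage]}.{model}.jsonl"


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `sum(1 for _ in read_jsonl(stage_path(Path("data"), "torch", 4, model)))` counts the tasks that survived filtering for one library and model.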

The on-disk knowledge bundle uses paper field names directly (S_name, S_param, E, M_prose, M_code) so it can be consumed without the Python package — see KnowledgeBundle.to_paper_dict() in src/stage2_extraction/schemas.py.

Development

make check     # ruff + 82 pytest tests
make lint
make format
make test
make smoke     # import-and-help every entry-point script
make clean

Pull requests and issues are welcome.

License

The pipeline code, curated benchmark records, task descriptions, test harnesses, and released failure labels are released under Apache-2.0.

Knowledge bundles may include M_code excerpts copied from upstream library source. Those excerpts inherit the license of the source library — see LICENSES.md for the complete per-library matrix and redistribution notes (in particular, MDAnalysis- and ase-derived M_code carry copyleft terms).

Citation

If you use NovelAPIBench in your research, please cite the preprint:

@misc{novelapibench2026,
  title         = {Diagnosing Knowledge Gaps in {LLM} Tool Use:
                   An Agentic Benchmark for Novel {API} Acquisition},
  author        = {Jinnuo Liu and Yue Peng and Jinhan Niu and Hongyi Wen},
  year          = {2026},
  eprint        = {2606.03657},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
