JimmmmmL/NovelAPIBench

NovelAPIBench

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Preprint: arXiv:2606.03657

NovelAPIBench is a dynamic, model-conditional benchmark for studying how code-LLMs acquire genuinely novel APIs — calls exposed by a library version released after the model's training cutoff. Given any base model M and target library L, the pipeline (i) discovers novel APIs, (ii) extracts a decomposed knowledge bundle {S_name, S_param, E, M_prose, M_code}, (iii) generates execute-then-assert coding tasks, and (iv) filters them to a model-conditional set that probes only what M cannot already solve. Each failure is then classified into one of six diagnostic categories (WrongAPISelection, WrongImport, WrongSyntax, WrongParam, WrongShapeDtype, WrongLogic) so the benchmark reports why a model failed.
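
To make the six categories concrete, here is a minimal sketch of what a deterministic first pass over a failed sample could look like. The exception-to-category rules and the names `fast_path_label` and `run_and_label` are illustrative assumptions, not the repository's classifier; anything the rules cannot decide returns `None`, standing in for a hand-off to a slower judge.

```python
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    WRONG_API_SELECTION = "WrongAPISelection"
    WRONG_IMPORT = "WrongImport"
    WRONG_SYNTAX = "WrongSyntax"
    WRONG_PARAM = "WrongParam"
    WRONG_SHAPE_DTYPE = "WrongShapeDtype"
    WRONG_LOGIC = "WrongLogic"


def fast_path_label(exc: Exception, target_called: bool) -> Optional[FailureCategory]:
    """Hypothetical rules: label only failures that are unambiguous from the exception."""
    if isinstance(exc, SyntaxError):
        return FailureCategory.WRONG_SYNTAX
    if isinstance(exc, ImportError):  # also covers ModuleNotFoundError
        return FailureCategory.WRONG_IMPORT
    if not target_called:
        # The target API was never invoked, so the solution chose something else.
        return FailureCategory.WRONG_API_SELECTION
    if isinstance(exc, TypeError) and "argument" in str(exc):
        return FailureCategory.WRONG_PARAM
    return None  # ambiguous (shape/dtype or logic): defer to a judge


def run_and_label(source: str, target_called: bool = True) -> Optional[FailureCategory]:
    """Compile and execute a candidate; return a label if it fails unambiguously."""
    try:
        exec(compile(source, "<solution>", "exec"), {})
    except Exception as exc:
        return fast_path_label(exc, target_called)
    return None
```

For example, `run_and_label("import no_such_module_xyz")` is labelled `WrongImport` under these rules, while a failed value assertion would fall through to `None`.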

Highlights

  • Dynamic, model-conditional benchmark. Library versions and the task pool are re-derived per base model so what stays in the benchmark is genuinely novel relative to that model's pretraining horizon.
  • Decomposed knowledge bundle. Every novel API is represented by five paper-aligned components (S_name, S_param, E, M_prose, M_code) that can be combined freely to isolate which component closes which failure.
  • Execute-then-assert harnesses. Every task ships with a structural, comparative, and mock-based assertion layer plus a target-API spy, so solutions cannot pass without invoking the API of interest.
  • Six-class failure taxonomy. Each failed pass@1 sample is automatically labelled by a deterministic fast-path classifier, backed by a strong-model judge, turning aggregate pass rates into mechanism-level diagnostics.
  • Adaptation matrix. First-class support for RAG, SFT/LoRA, RAFT, GRACE, MEMIT, and AlphaEdit on the same task pool, so external and parametric adaptation paradigms are evaluated under identical conditions.
  • ~1.9k tasks across 19 libraries × 4 backbones (Qwen2.5-Coder-7B, Seed-Coder-8B, OpenCoder-8B, R1-Distill-Qwen-7B).
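
The target-API spy mentioned in the harness bullet can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not the repository's harness: `run_with_spy` and `passes` are hypothetical names, `math.isqrt` stands in for a novel API, and candidate solutions are assumed to expose a `solve()` function.

```python
import math
from unittest import mock


def run_with_spy(solution_src: str, module, attr: str):
    """Execute a candidate solution and count how often module.attr was invoked."""
    original = getattr(module, attr)
    # wraps= keeps the real behaviour while the mock records every call.
    with mock.patch.object(module, attr, wraps=original) as spy:
        namespace = {}
        exec(solution_src, namespace)
        result = namespace["solve"]()
        calls = spy.call_count
    return result, calls


def passes(solution_src: str, expected) -> bool:
    """Pass only if the output is right AND the target API was actually used."""
    result, calls = run_with_spy(solution_src, math, "isqrt")
    return result == expected and calls >= 1


GOOD = "import math\ndef solve():\n    return math.isqrt(49)\n"
HARDCODED = "def solve():\n    return 7\n"
```

Under this sketch `passes(GOOD, 7)` holds, while `passes(HARDCODED, 7)` fails even though the returned value is correct, because the spy recorded no call.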

Knowledge Components

NovelAPIBench mirrors the paper's S / E / M decomposition end-to-end:

| Paper notation | What it captures | Code identifier (KnowledgeBundle) |
| --- | --- | --- |
| S_name | fully qualified API name | bundle.surface.s_name |
| S_param | parameter signature + types + defaults | bundle.surface.parameters |
| E | validated usage exemplars (executed / static) | bundle.exemplars |
| M_prose | natural-language mechanism description | bundle.m_prose |
| M_code | implementation source (docstrings stripped) | bundle.m_code |

The 12 paper conditions are exposed as KnowledgeCondition values:

baseline, s_name, e, m_prose, m_code, s, s_m_prose, s_m_code, s_m_prose_m_code, s_e, s_e_m_code, full.

Repository Layout

NovelAPIBench/
├── configs/                       Hydra-style YAML configs
│   ├── base.yaml                  models, hardware, paths
│   ├── stage{1..4}_*.yaml         per-stage pipeline configs
│   ├── experiment.yaml            condition grid + SFT/RAG hyperparams
│   ├── libraries/                 19 paper libraries (one YAML each)
│   ├── groups/                    benchmark group definitions
│   └── experiments/               per-(group × model) experiment specs
├── src/
│   ├── common/                    config, IO, LLM client, runtime env, sandbox
│   ├── stage1_discovery/          API-surface diff + novelty probes
│   ├── stage2_extraction/         S + E + M_prose + M_code extractors
│   ├── stage3_generation/         task generation + execute-then-assert harness
│   ├── stage4_filtering/          C1/C2/C3 gates
│   ├── experiments/               conditions, RAG indexer, SFT/RAFT/editing
│   └── evaluation/                pass@k, failure classifier, analysis (RQ1-3)
├── scripts/                       CLI entry points (see `make help`)
├── tests/                         pytest suite (82 tests)
├── docs/
│   ├── pipeline_overview.md       what each stage produces
│   └── reproducibility.md         exact paper-reproduction commands
├── environment-local.yml          CPU-only env (laptop / CI smoke)
├── environment-gpu.yml            CUDA + vLLM env (paper runs)
├── requirements{,-gpu}.txt
└── Makefile                       make help

Quick Start

A minimal smoke run that lints the code, runs the unit tests, and imports every entry-point script, all without a GPU or API credits (useful for verifying a fresh checkout):

# 1. Create the lightweight env (no CUDA, no vLLM)
make install-local
conda activate novel-api-local

# 2. Lint + 82 unit tests + import every entry-point script
make check
make smoke

For real benchmark generation you need a GPU machine, an OPENAI_API_KEY, and a GITHUB_TOKEN. See Installation and Pipeline Usage below.

Installation

Local checks (CPU-only)

conda env create -f environment-local.yml
conda activate novel-api-local
pip install -e .
make check

Full reproduction (CUDA + vLLM)

conda env create -f environment-gpu.yml
conda activate novel-api
pip install -e .

The GPU path has been tested with torch 2.10.0+cu128, vllm 0.19.0, transformers 4.57.6, and trl 1.1.0.

Environment variables

Copy .env.example to .env and fill in:

GITHUB_TOKEN=...                # for changelog / source fetching
OPENAI_API_KEY=...              # for surface + M_prose extraction + strong-judge
NOVEL_API_TEMP_ENV_DIR=...      # scratch dir for per-library versioned envs
CONDA_PKGS_DIRS=...             # shared conda package cache (recommended)
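
Before launching a long run it can help to fail fast on missing credentials. The check below is a small sketch, not part of the repository: `missing_vars` is a hypothetical helper, and treating the two directory variables as optional is an assumption based on the "recommended" note above. It inspects the process environment only and does not load `.env` itself.

```python
import os
from typing import List, Mapping

# Credentials named above as needed for real benchmark generation.
REQUIRED = ("GITHUB_TOKEN", "OPENAI_API_KEY")
# Assumed optional here: scratch and cache directories.
OPTIONAL = ("NOVEL_API_TEMP_ENV_DIR", "CONDA_PKGS_DIRS")


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```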

Pipeline Usage

The benchmark is built per library, then aggregated into groups, then run as condition × delivery experiments.

Build one library

python scripts/run_pipeline.py --library torch --stages 1 2 3 4 --model qwen2.5-coder-7b
# or:
make pipeline LIBRARY=torch MODEL=qwen2.5-coder-7b

This runs:

  1. Stage 1 — Novel API discovery: resolves the library version pair around M's cutoff, installs both into versioned conda envs, diffs the public surface, and applies the novelty probe.
  2. Stage 2 — Knowledge extraction: extracts S + E (via web-search-grounded LLM), M_prose (tier-routed), and M_code.
  3. Stage 3 — Task generation: emits easy/medium/hard tasks per API with an execute-then-assert harness that wraps a target-API spy.
  4. Stage 4 — Filtering: applies the three model-conditional gates (C1 reference validity, C2 empirical novelty, C3 informativeness).
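
The surface diff in Stage 1 amounts to a set difference between the public names of two library versions. The sketch below shows that idea on two stand-in module objects; `public_surface` and `novel_candidates` are hypothetical names, it only looks one level deep, and the real pipeline would have to collect each surface inside its own versioned environment before comparing.

```python
import types
from typing import List, Set


def public_surface(module: types.ModuleType) -> Set[str]:
    """Qualified names of public callables on a module (top level only)."""
    names = getattr(module, "__all__", None) or [
        n for n in dir(module) if not n.startswith("_")
    ]
    return {
        f"{module.__name__}.{n}"
        for n in names
        if callable(getattr(module, n, None))
    }


def novel_candidates(old: Set[str], new: Set[str]) -> List[str]:
    """Names present in the newer surface but absent from the older one."""
    return sorted(new - old)


# Two fake "versions" of a made-up library, for illustration only.
old = types.ModuleType("examplelib")
old.load = lambda path: path

new = types.ModuleType("examplelib")
new.load = lambda path: path
new.stream = lambda path: iter(())   # added in the newer version
new._helper = lambda: None           # private: ignored
new.VERSION = "2.0"                  # not callable: ignored

candidates = novel_candidates(public_surface(old), public_surface(new))
```

A name surviving this diff is only a candidate; per the stage description, the novelty probe is still applied afterwards.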

Per-stage outputs land in data/raw/{library}/stage{N}_*.{model_short}.jsonl.

Build a benchmark group

python scripts/build_group.py --group agent_tool --model qwen2.5-coder-7b
# or:
make build-group GROUP=agent_tool MODEL=qwen2.5-coder-7b

Run an experiment

python scripts/run_experiment.py --experiment agent-tool-qwen2.5-coder-7b --all
python scripts/run_eval.py        --experiment agent-tool-qwen2.5-coder-7b
python scripts/run_analysis.py    --experiment agent-tool-qwen2.5-coder-7b --output data/analysis/

Or chain everything through:

python scripts/run_all_experiments.py --steps run eval analysis
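
For readers unfamiliar with pass@k, a commonly used formulation is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per task and c of them pass. The sketch below implements that formula; whether `run_eval.py` uses exactly this estimator is not stated here, so treat it as background rather than a description of the script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n is among the c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain pass fraction c / n, which is the pass@1 quantity the failure taxonomy is applied to.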

Debug helpers

To step through a single stage without the heavy environment installs or LLM cost, pass --debug --mock-llm to the per-stage runner:

python scripts/run_stage.py --stage 1 --library torch --debug --mock-llm
python scripts/run_stage.py --stage 2 --library torch --debug --mock-llm

Benchmark Groups, Models, Conditions

Groups

| Group | Libraries |
| --- | --- |
| swe | fastapi, flask, django, sqlalchemy, pydantic |
| ai4science | rdkit, MDAnalysis, deepchem, pymatgen, astropy, ase |
| data_science | numpy, pandas, scipy |
| dl | diffusers, transformers, torch |
| agent_tool | langgraph, fastmcp |
| all_pooled_paper | pooled Section 5 split, 80/20 train/test |
| lolo_diffusers | Section 5 held-out-diffusers split |

Models

qwen2.5-coder-7b           # primary (Section 4)
seed-coder-8b-instruct
r1-distill-qwen-7b
opencoder-8b-instruct

Knowledge Conditions

| Condition | Components prepended at inference |
| --- | --- |
| baseline | none |
| s_name | S_name |
| e | E |
| m_prose | M_prose |
| m_code | M_code |
| s | S_name + S_param |
| s_m_prose | S + M_prose |
| s_m_code | S + M_code |
| s_m_prose_m_code | S + M_prose + M_code |
| s_e | S + E |
| s_e_m_code | S + E + M_code |
| full | S + E + M_prose + M_code |

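Combined with three delivery mechanisms (rag, sft, rag_sft), this gives the full main grid of 1 + 11 × 3 = 34 cells: the baseline once, plus each of the 11 non-baseline conditions under each delivery mechanism (build_experiment_grid in src/experiments/conditions.py).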
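
The cell count can be checked with a few lines of Python. This is a standalone sketch of the arithmetic, not the repository's `build_experiment_grid`: `sketch_grid` and the `CONDITIONS` mapping are written here from the condition table, and pairing the baseline with no delivery mechanism is an assumption read off the "1 +" term.

```python
from itertools import product

S = ["S_name", "S_param"]

# Condition name -> knowledge components prepended at inference.
CONDITIONS = {
    "baseline": [],
    "s_name": ["S_name"],
    "e": ["E"],
    "m_prose": ["M_prose"],
    "m_code": ["M_code"],
    "s": S,
    "s_m_prose": S + ["M_prose"],
    "s_m_code": S + ["M_code"],
    "s_m_prose_m_code": S + ["M_prose", "M_code"],
    "s_e": S + ["E"],
    "s_e_m_code": S + ["E", "M_code"],
    "full": S + ["E", "M_prose", "M_code"],
}
DELIVERIES = ("rag", "sft", "rag_sft")


def sketch_grid():
    """One baseline cell plus every non-baseline condition under each delivery."""
    non_baseline = [name for name in CONDITIONS if name != "baseline"]
    return [("baseline", None)] + list(product(non_baseline, DELIVERIES))


grid = sketch_grid()
print(len(grid))
```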

Reproducing the Paper

The exact end-to-end commands for every figure and table are in docs/reproducibility.md. At a high level:

# Section 4 main grid (4 backbones × 5 groups)
for m in qwen2.5-coder-7b seed-coder-8b-instruct r1-distill-qwen-7b opencoder-8b-instruct; do
    python scripts/run_pipeline.py --all-libraries --stages 1 2 3 4 --model "$m"
done
python scripts/build_group.py --paper-grid
python scripts/run_all_experiments.py --steps run eval analysis

# Section 5 parametric adaptation (SFT / RAFT / GRACE / MEMIT / AlphaEdit)
make pooled-paper                              # see docs/reproducibility.md for the rest

A single full pipeline run takes on the order of hours on an 8×A6000 node and consumes a proportional amount of OpenAI and GitHub API quota; the analysis scripts are CPU-only and cheap to re-run.

Data Layout

All runtime artefacts are gitignored and land under data/:

data/
├── raw/{library}/                   per-library stage outputs
│   ├── stage1_discovery.{model}.jsonl
│   ├── stage2_knowledge.{model}.jsonl
│   ├── stage3_tasks.{model}.jsonl
│   └── stage4_filtered.{model}.jsonl
├── processed/groups/{group}/{model}/   group benchmark + 80/20 split
├── rag/{group}/                     FAISS index per RAG cell
├── sft/{group}/                     SFT training JSONLs per condition
├── edits/                           MEMIT / AlphaEdit / GRACE artefacts
├── results/{experiment_id}/         predictions + pass@k
└── analysis/{experiment_id}/        per-RQ figures + summary CSVs

The on-disk knowledge bundle uses paper field names directly (S_name, S_param, E, M_prose, M_code) so it can be consumed without the Python package — see KnowledgeBundle.to_paper_dict() in src/stage2_extraction/schemas.py.
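
Because the records use the paper field names, a consumer needs nothing beyond a JSON parser. The sketch below reads JSONL text and keeps the five components; `read_bundles` is a hypothetical helper, and the sample record (library name, value types, and the assumption that the five fields sit at the top level of each line) is invented for illustration, so check a real stage 2 file before relying on it.

```python
import io
import json
from typing import Dict, Iterable, Iterator

PAPER_FIELDS = ("S_name", "S_param", "E", "M_prose", "M_code")


def read_bundles(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per non-empty JSONL line, restricted to the paper fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {field: record.get(field) for field in PAPER_FIELDS}


# Made-up record for a made-up library; real files may carry extra keys.
SAMPLE = json.dumps({
    "S_name": "examplelib.new_call",
    "S_param": "new_call(x: int = 0) -> int",
    "E": ["examplelib.new_call(1)"],
    "M_prose": "Returns x unchanged.",
    "M_code": "def new_call(x=0):\n    return x",
    "extra_key": "ignored",
}) + "\n\n"

bundles = list(read_bundles(io.StringIO(SAMPLE)))
```

In practice `read_bundles` would be handed an open file object rather than a `StringIO`.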

Development

make check     # ruff + the 82-test pytest suite
make lint
make format
make test
make smoke     # import-and-help every entry-point script
make clean

Pull requests and issues are welcome.

License

The pipeline code, curated benchmark records, task descriptions, test harnesses, and released failure labels are released under Apache-2.0.

Knowledge bundles may include M_code excerpts copied from upstream library source. Those excerpts inherit the license of the source library — see LICENSES.md for the complete per-library matrix and redistribution notes (in particular, MDAnalysis- and ase-derived M_code carry copyleft terms).

Citation

If you use NovelAPIBench in your research, please cite the preprint:

@misc{novelapibench2026,
  title         = {Diagnosing Knowledge Gaps in {LLM} Tool Use:
                   An Agentic Benchmark for Novel {API} Acquisition},
  author        = {Jinnuo Liu and Yue Peng and Jinhan Niu and Hongyi Wen},
  year          = {2026},
  eprint        = {2606.03657},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
