A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang
Scientific figures are structured compositions of discrete semantic components, so the localized errors image generators make on such layouts call not for a stronger backbone but for a harness around it. We instantiate this idea in two complementary systems that share one design:
- Crafter — a multi-agent harness for figure generation that generalizes across figure types (academic figures, posters, infographics) and input conditions (text-to-image, mask completion, key-element composition, sketch refinement) without architectural changes.
- CraftEditor — applies the same harness pattern to convert raster outputs into coordinate-faithful editable SVGs.
Figure 1. The Crafter generation harness.
Figure 2. The CraftEditor raster-to-SVG pipeline.
We also release CraftBench — 279 samples spanning three figure types and four input conditions, each with a human-drawn target.
git clone https://github.com/HaozheZhao/Crafter.git
cd Crafter
pip install -e .
export OPENROUTER_API_KEY="sk-or-..."All chat / VLM / image calls go through a single OpenAI-compatible endpoint
(OpenRouter). The role mapping lives in
configs/default.yaml.
CraftEditor additionally needs a text-prompted SAM3 grounding server. Start one on any machine with a CUDA-capable GPU:
# 1. Install the official SAM3 package
git clone https://github.com/facebookresearch/sam3 && cd sam3
pip install -e . && pip install timm ftfy iopath portalocker flask
# 2. Run a small Flask wrapper that exposes /health, /segment_text, /segment_points
python sam3_server.py --port 8765 --host 0.0.0.0
# 3. Point Crafter at the server
export SAM3_SERVER_URL="http://<host>:8765"CraftEditor requires the SAM3 server. If you do not run one, use only the generation half (the commands below).
The bundled examples/ folder has inputs for three end-to-end
runs. Cases #1 and #3 share a SceneSelect figure (CraftEditor's top-scoring
case in Figure 3); case #2 is the NC-TTT poster inpaint case from Figure 3.
All three commands use the same task templates the CraftBench evaluation script feeds the model, so end-to-end behaviour matches benchmark runs:
# 1. Text-to-image — generate the method figure from text only.
python demo.py --paper examples/sample_paper.txt \
--instruction-file examples/sample_instruction_t2i.txt \
--out examples_out/figure.png
# 2. Mask completion (inpaint) — fill the blanked-out 'Methodology' column of the poster.
python demo.py --paper examples/sample_inpaint_paper.txt \
--instruction-file examples/sample_instruction_inpaint.txt \
--reference examples/sample_inpaint_input.png \
--out examples_out/figure_inpainted.png
# 3. Convert a raster figure into an editable SVG.
python convert.py --img examples/sample_figure.png --out-dir examples_out/editable/crafter generate --caption "Figure 1: Overall workflow of our method." \
--paper-text-file paper.txt --out figure.pngAdd --reference sketch.png to condition on a sketch, partial figure, or icon
collage.
Use a paper PDF instead of a plain-text extract (beta).
demo.py also accepts a PDF as the --paper argument; text is extracted via
pypdf. LaTeX-rendered PDFs work cleanly; scanned PDFs and dense two-column
layouts may need manual text extraction first. We recommend the plain-text path
above for reproducible runs.
python demo.py --paper paper.pdf --instruction "..." --out figure.pngDefault pipeline: extraction (gpt-image-2) → grounding (SAM3) → composition.
# the bundled figure CraftEditor scores highest on in Figure 3.
python convert.py --img examples/sample_figure.png --out-dir examples_out/editable/
Figure 3. CraftEditor (rightmost column) versus Edit-Banana and AutoFigure-Edit on five representative cases. examples/sample_figure.png is the input raster of the top row (academic / t2i, the highest-scoring case).
Skip the gpt-image-2 extraction phase (SAM-only).
--sam-only passes the raster straight to SAM3 grounding, bypassing
gpt-image-2 icon extraction. Trades quality for speed and skips one external
provider dependency.
python convert.py --img figure.png --out-dir editable/ --sam-onlyCraftBench — 279 samples spanning three figure types and four input
conditions, each with a human-drawn target. The dataset lives on the
HuggingFace Hub
and is downloaded automatically by both inference.py and run_eval. The
craftbench/ folder in this repo bundles three illustrative
samples (one per task) plus the evaluation scripts.
Figure 4. Sample tasks from CraftBench.
Figure 5. CraftBench distribution by figure type and input condition.
# 1. Generate Crafter outputs over the bench (writes <id>.png per sample).
python inference.py --bench craftbench --out runs/crafter_cb
# 2. Score against the human-drawn targets (referenced VLM judge via OpenRouter).
python -m craftbench.evaluation.run_eval --runs runs/crafter_cb --out cb.jsonrun_eval reports an overall win-rate and a per-task breakdown.
Three model slots in configs/default.yaml:
| Slot | Default |
|---|---|
llm |
anthropic/claude-opus-4.6 |
vlm |
google/gemini-3.1-pro-preview |
generator |
google/gemini-3-pro-image-preview (Nano Banana Pro) |
OPENROUTER_API_KEY is the only required secret; the YAML never holds keys.
Use gpt-image-2 instead of Nano Banana Pro.
gpt-image-2 produces sharper text and supports arbitrary pixel resolutions,
but on OpenRouter it is rate-limited and clamped to a small enum
(aspect_ratio ∈ {1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9},
image_size ∈ {1K, 2K}).
We recommend deploying gpt-image-2 on your own Azure OpenAI resource and
exporting the four standard variables — when all four are set, gpt-image-2
calls bypass OpenRouter and go straight to Azure (everything else keeps
using OpenRouter):
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
export AZURE_OPENAI_API_KEY="<your-key>"
export AZURE_OPENAI_DEPLOYMENT="<your-deployment-name>"
# optional: override the api version (default: 2025-04-01-preview)
# export AZURE_OPENAI_API_VERSION="2025-04-01-preview"
# optional: force an exact pixel size (overrides the aspect map)
# export CRAFTER_AZURE_IMAGE_SIZE="1024x512"Then point the generator slot at gpt-image-2 in
configs/default.yaml:
generator: openai/gpt-5.4-image-2If you do not have Azure, you can still use OpenRouter for gpt-image-2 by
swapping the generator slot to openai/gpt-5.4-image-2 — outputs will be
clamped to the enum above and may rate-limit under load.
Crafter/
├── crafter/{generation, editor, shared}/ # the package
├── craftbench/ # 279-sample bench + self-contained eval
├── configs/default.yaml # 3-slot model config
├── demo.py · convert.py · inference.py # entry-point scripts
├── examples/ # sample paper PDF + sketch ref
├── assets/ # paper figures
└── README · pyproject · requirements · LICENSE
@article{zhao_crafter,
title = {Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs},
author = {Zhao, Haozhe and Si, Shuzheng and Wang, Zhenhailong and Wang, Zheng
and Chen, Liang and Li, Xiaotong and Liang, Zhixiang and Sun, Maosong
and Zhang, Minjia},
}MIT — see LICENSE.