AEON vLLM Ultimate — DGX Spark / Blackwell

One container, the whole fleet. A single image — ghcr.io/aeon-7/aeon-vllm-ultimate:latest — serves every AEON model on NVIDIA DGX Spark (GB10, sm_121a) and other consumer-Blackwell GPUs (RTX 50 series): Gemma-4-26B-A4B, Qwen3.6-27B, and Qwen3.6-35B-A3B all run on the same build, with DFlash speculative decoding, NVFP4 weights, NVFP4/FP8 KV cache, and the OpenAI-compatible gateway intact.

Built on vLLM v0.24.0 compiled from source for sm_121a, merged with the AEON speculative-decoding stack: Triton software NVFP4 KV cache (PR #44389) + DFlash SWA / high-concurrency / prefix-cache fixes (PR #40898, #41703, #43982-port) + the AEON DGX Spark runtime patches + TurboQuant + DFlash speculative decoding.

🆕 2026-07-02 — :latest is now the v0.24.0 sm_121a build (:2026-07-01-v0.24.0). Rebuilt from source on vLLM v0.24.0 as a 3-way merge that preserves the AEON spec-decode tree. Still carries the three open upstream PRs (#44389 NVFP4-KV, #40898 DFlash SWA, #41703 prefix-cache corruption — all re-verified still unmerged), now bakes the runtime patches into the source (DFlash block-table unpad, spec-decode cudagraph alignment / open twin PR #46324), and adds three post-tag fixes: the UMA negative-cudagraph-estimate clamp (port of open PR #46932 — negative estimates on unified-memory GPUs silently inflated the KV budget), the tied-embedding fix for ModelOpt checkpoints (cherry-pick of merged-post-tag PR #45544 — without it every tied Gemma-4 crashes at load), and a use_mm_prefix signature fix for the carried NVFP4-KV backend overrides. New in the v0.24.0 base for Spark users: DFlash on the FlashInfer backend (#43081, non-causal prefill), UMA memory-pressure release during weight loading (#45179), --moe-backend/--linear-backend selection incl. the SM12x flashinfer_b12x CuteDSL backends, Dynamic SD per-batch-size draft lengths (#32374), async scheduling default-on, FlashInfer 0.6.12, pinned transformers 5.12.1 (replaces git-HEAD). ⚠️ Breaking: v0.24.0 removed VLLM_NVFP4_GEMM_BACKEND and the VLLM_USE_FLASHINFER_MOE_* env vars — use --linear-backend flashinfer_cutlass / --moe-backend cutlass instead (recipes below updated). Validated on the full fleet before push: 35B A/B at throughput parity with v0.23.0, DFlash concurrency clean through c=64, Triton NVFP4-KV boot + generation, 26B voice stack healthy (DFlash pos0 acceptance 60–86%). Rollback tag: :2026-06-18-v0.23.0-dflashfix.

🆕 2026-06-18 — :latest is now the v0.23.0 sm_121a build (:2026-06-18-v0.23.0-dflashfix). Rebuilt from source on vLLM v0.23.0 as a 3-way merge that preserves the AEON spec-decode tree, and adds the DFlash high-concurrency fix (port of upstream PR #43982): the drafter previously crashed at ≥32 concurrent requests under speculative decoding (padded-vs-unpadded KV block-table shape mismatch) and now scales cleanly to c=64. Carries the still-open PR #44389 (NVFP4-KV), #40898 (DFlash SWA), #41703 (prefix-cache corruption). See What we fixed for the DGX Spark and the v0.23.0 fleet benchmarks. Rollback tag: :2026-06-11-pr41703.

Quickstart (DGX Spark, copy-paste)

The canonical Spark recipe: Qwen3.6-27B Multimodal-NVFP4-MTP body + z-lab DFlash drafter + FP8 KV — the measured-best daily-driver config (parity speed with the smaller XS body, higher quality-eval scores). One block pulls the container, the model, and the drafter, then serves on :8000. (Full deployment matrix — MTP/NVFP4-KV, TurboQuant, Gemma-4-26B, dedicated-VRAM Blackwell — is in Deployment recipes further down.)

# 1) Pull the unified container (vLLM 0.24.0 + sm_121a + DFlash high-concurrency fix)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull the recommended body — Multimodal-NVFP4-MTP (modelopt NVFP4, image+video capable;
#    parity speed with the XS variant + higher quality-eval scores) — fresh clone
GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  /models/Qwen3.6-27B-AEON-MM-MTP
( cd /models/Qwen3.6-27B-AEON-MM-MTP && git lfs pull )

# 3) Pull the DFlash drafter (z-lab 5-layer, ~3.3 GB) — fresh clone (DFlash only)
GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://huggingface.co/z-lab/Qwen3.6-27B-DFlash \
  /models/Qwen3.6-27B-DFlash-drafter
( cd /models/Qwen3.6-27B-DFlash-drafter && git lfs pull )

# 4) Serve — DFlash drafter + FP8 KV (mounts body at /model, drafter at /drafter)
docker run -d --name aeon-vllm \
    --restart unless-stopped \
    --gpus all --ipc=host --shm-size=16g \
    --net=host \
    -e VLLM_USE_FLASHINFER_SAMPLER=1 \
    -v /models/Qwen3.6-27B-AEON-MM-MTP:/model:ro \
    -v /models/Qwen3.6-27B-DFlash-drafter:/drafter:ro \
    --entrypoint vllm \
    ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model \
        --served-model-name aeon aeon-fast aeon-deep aeon-ultimate qwen36-ultimate aeon-ultimate-xs \
        --dtype auto \
        --quantization modelopt \
        --kv-cache-dtype fp8_e4m3 \
        --attention-backend TRITON_ATTN \
        --max-model-len 229376 \
        --max-num-seqs 16 \
        --max-num-batched-tokens 32768 \
        --gpu-memory-utilization 0.60 \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --generation-config vllm \
        --reasoning-parser qwen3 \
        --tool-call-parser qwen3_coder \
        --enable-auto-tool-choice \
        --mm-encoder-tp-mode data \
        --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12,"attention_backend":"TRITON_ATTN"}' \
        --trust-remote-code

# 5) Smoke test
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"aeon","messages":[{"role":"user","content":"Hello!"}],"max_tokens":64,"temperature":0.0}' \
  | jq .choices[0].message.content

Why these flags: --quantization modelopt matches the recommended Multimodal-NVFP4-MTP body; --kv-cache-dtype fp8_e4m3 is the stable DFlash pairing on GB10; --gpu-memory-utilization 0.60 leaves room for Qwen3-ASR/Qwen3-TTS sidecars on the same Spark; and --attention-backend TRITON_ATTN must be set both on the target model and inside the DFlash JSON because vLLM does not inherit target attention-backend settings into speculative drafters. --max-model-len 229376 gives one near-full-context session while still leaving KV headroom for output and smaller concurrent agents; --max-num-seqs 16 and --max-num-batched-tokens 32768 keep the agent/gateway burst path usable. Leave --mamba-block-size unset and let vLLM derive the hybrid GDN geometry. If git clone leaves LFS pointer files, re-run git lfs pull in the model dir so vLLM sees real weights.

What's inside

Component	Version	Why
vLLM	0.23.0 + sm_121a build, AEON spec-decode 3-way merge	Built from source for GB10; carries PR #44389 (Triton NVFP4 KV) + #40898 (DFlash SWA) + #41703 (prefix-cache corruption) + #43982-port (DFlash high-concurrency fix, new 2026-06-18)
PyTorch	2.11.0+cu130	CUDA 13.0 with sm_121a (DGX Spark / GB10) compute capability
transformers	5.10.0.dev0 (HEAD)	Recognizes `gemma4_unified`, `qwen3_5`, all bleeding-edge model classes
flashinfer	0.6.12	NVFP4 GEMM kernels, sliding-window attention, MLA, custom attention
TurboQuant	0.2.0 (AEON-7 fork)	CUDA-graph-safe QJL — 4-bit KV compression on top of vLLM's native KV cache
modelopt	available via pip if needed	Quantization framework (not bundled — image stays small for serving)

v0.23.0 fleet benchmarks — one image, three models

The whole point of this container is that a single build runs the entire AEON fleet on a DGX Spark. The three charts below are the same ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 + AEON sm_121a + DFlash) serving three very different architectures — a Gemma-4 MoE, a Qwen3.6 hybrid GDN+attention dense model, and a Qwen3.6 A3B MoE — each scaling cleanly from 1 to 64 concurrent requests with no crash (the pre-fix image died at c≥32 under speculative decoding).

Numbers are measured on DGX Spark GB10 (sm_121a) with DFlash speculative decoding, NVFP4 weights, FP8 KV cache, prefix caching on, p50 across ≥ samples per point.

Gemma-4-26B-A4B-it-Uncensored (NVFP4)

Single-stream (c=1), by category, on aeon-vllm-ultimate:latest:

Category	🟢 Decode tok/s	TTFT p50	TPOT p50	Prefill (PP)	DFlash accept
Coding	155.8	83 ms	6.4 ms	601 tok/s	58.9%
Math	127.8	145 ms	7.8 ms	420 tok/s	48.7%
Reasoning	118.9	105 ms	8.4 ms	439 tok/s	43.9%
Prose	49.8	105 ms	20.1 ms	324 tok/s	11.1%
Natural language	67.3	97 ms	14.9 ms	393 tok/s	20.0%
Extraction / JSON	202.4	85 ms	4.9 ms	602 tok/s	77.5%

Long-context hold (DFlash acceptance does not collapse as histories grow): at ~16k tokens (c=1) Coding draft acceptance is 58.7% (128 tok/s decode); at ~33k tokens it holds 46.7% (93 tok/s decode). That long-context acceptance hold is the SWA-fix win (PR #40898) — earlier images collapsed past ~2k tokens.

Stock-vs-optimized single-stream contrast on this build:

Provisional contrast. The stock / un-optimized bars are from stock vanilla vLLM (default settings, no DFlash, no DGX-Spark / sm_121a optimizations) and are provisional, pending a fresh fully-vanilla re-bench on the current v0.23.0 version.

Qwen3.6-35B-A3B-heretic (NVFP4)

Single-stream (c=1), by category:

Category	🟢 Decode tok/s	TTFT p50	TPOT p50	Prefill (PP)	DFlash accept
Coding	91.7	88 ms	10.9 ms	509 tok/s	32.5%
Math	123.6	113 ms	8.1 ms	494 tok/s	47.7%
Reasoning	120.6	120 ms	8.3 ms	359 tok/s	46.3%
Prose	75.2	137 ms	13.3 ms	234 tok/s	23.7%
Natural language	91.8	104 ms	10.9 ms	326 tok/s	32.3%
Extraction / JSON	79.8	103 ms	12.5 ms	468 tok/s	28.1%

Long-context hold: Coding draft acceptance is 40.8% at ~16k (90.8 tok/s decode) and 42.8% at ~33k (79.3 tok/s decode) — the A3B drafter holds acceptance flat across context.

Qwen3.6-27B-AEON-Ultimate (NVFP4 MTP-XS body + DFlash n=12)

Single-stream (c=1), by category:

Category	🟢 Decode tok/s	TTFT p50	TPOT p50	Prefill (PP)	DFlash accept
Coding	41.8	140 ms	23.9 ms	322 tok/s	34.5%
Math	47.3	244 ms	21.1 ms	229 tok/s	41.7%
Reasoning	56.1	234 ms	17.8 ms	183 tok/s	50.0%
Prose	34.1	146 ms	29.4 ms	220 tok/s	27.3%
Natural language	38.3	137 ms	26.1 ms	248 tok/s	31.3%
Extraction / JSON	44.2	246 ms	22.6 ms	195 tok/s	37.2%

vs a stock vanilla vllm/vllm-openai:nightly baseline of ~10.5 tok/s (no DFlash, no sm_121a optimizations) → optimized hits ~38–56 tok/s by category ≈ 4–5× single-stream decode.

Long-context hold: Coding draft acceptance is 49.5% at ~16k and 29.1% at ~33k — long histories stay drafted on the SWA-fixed drafter.

About the stock baseline: the "stock / un-optimized" comparison figure is from stock vanilla vLLM (default settings, no DFlash, no DGX-Spark / sm_121a optimizations — vllm/vllm-openai:nightly eager). It is provisional and will be refreshed once a fresh fully-vanilla benchmark completes on the current version. The optimized figures above are measured on the new aeon-vllm-ultimate:latest (vLLM 0.23.0) build. There is no published vanilla baseline for the 35B-A3B yet (pending re-bench).

What we fixed for the DGX Spark

All three models above run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM v0.23.0 built from source for GB10 / sm_121a and merged with the AEON speculative-decoding stack. This is the centerpiece of the build: a set of fixes that take the default "it runs, but it crashes under load and drafting collapses on long context" behavior and turn it into a stable, long-context, high-concurrency local-agent server.

Fix	What it does	Why it matters on GB10
DFlash high-concurrency fix (new 2026-06-18)	Slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`)	The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention varlen — the engine died at c=64 with `block_table must have shape …`). Now scales cleanly to c=64. A port of upstream PR #43982, which fixed this for MTP but never for DFlash — present and unfixed even in the prior image.
Triton NVFP4 KV cache (PR #44389)	Software NVFP4 KV-cache path	The only 4-bit KV path on sm_121a (upstream's is hard-gated to B200) → ~3× KV capacity / longer context per GB of unified memory.
DFlash sliding-window attention (PR #40898)	Runs the drafter's SWA layers as true sliding-window	Long-context draft acceptance holds as agent histories grow (e.g. Gemma-26B Coding ≈ 59% at ~16k, ≈ 47% at ~33k) instead of collapsing past ~2k tokens.
Prefix-cache corruption immunity (PR #41703)	Masks rejected/invalid context KV slots so they are never written	Without it, `--enable-prefix-caching` + DFlash silently decays draft acceptance to 0% over minutes-to-hours of traffic (engine-global, ~6× slowdown that only a restart healed). With it, prefix caching is safe again under sustained production load.
sm_121a-native build	`TORCH_CUDA_ARCH_LIST=12.1a`, `ENABLE_NVFP4_SM100=0`	Compiles the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to — true 4-bit tensor-core throughput, no dead B200-only kernels.
sm_121a boot + CUDA-graph patches	RTLD-lazy `_C_stable_libtorch` load; spec-decode CUDA-graph capture-size alignment	Boots past MXFP4 (SM100-only) symbols absent on GB10; prevents `cudaErrorIllegalAddress` on partial-acceptance decode steps under speculative decoding.
Unified-memory tuning	`--gpu-memory-utilization ≤0.70–0.88`, FULL CUDA graphs, async scheduling, z-lab DFlash drafters	GB10 shares one LPDDR5X pool across CPU + GPU; conservative KV headroom avoids page-thrash while keeping FULL-graph + speculative-decode throughput.

The result

Scales to 64 concurrent requests with no crash — the same image, on all three fleet models (the prior image crashed at c≥32 under speculative decoding).
Native NVFP4 4-bit compute on Blackwell tensor cores — the speed of 4-bit with near-16-bit accuracy.
Speculative decoding (DFlash) holds high draft acceptance from short prompts through long (16k–32k) agent histories.
Roughly 4–5× faster single-stream decode vs a stock un-optimized vanilla vLLM baseline (Qwen3.6-27B: ~10.5 → ~38–56 tok/s by category; provisional pending a fresh vanilla re-bench).

Why this container for Blackwell + DGX Spark users

🚀 NVFP4 KV cache — up to 3× KV capacity (Triton software path)

PR #44389 (lesj0610/vllm) adds a Triton software path that packs the KV cache as E2M1 FP4 + E4M3 block scales. Enable per-serve via --kv-cache-dtype nvfp4. Independent of native FP4 conversion instructions — works on any sm_120 / sm_121 / sm_100 / sm_90 GPU.

When activated:

3× KV cache capacity on Qwen3.6-27B and Qwen3.6-35B-A3B (per PR author benchmarks)
MRCR quality comparable to auto KV baseline — closer than TurboQuant 4bit_nc

Not activated by default. Pass --kv-cache-dtype nvfp4 to opt in.

🛠️ AEON DGX Spark patches (sm_121a runtime fixes)

The container ships with our 3 idempotent runtime patches that ensure correctness on GB10 hardware until upstream fixes land:

Patch	What it fixes
patch_cuda_optional_import	Wraps `import vllm._C_stable_libtorch` in `RTLD_LAZY` so the SM100-only `mxfp4_experts_quant` and `silu_and_mul_mxfp4_experts_quant` symbols are tolerated as unresolved until first call (they never fire on sm_121a workloads)
patch_cudagraph_align	Drops the `cudagraph_mode==FULL`-only gate on the spec-decode capture-size alignment filter in `config/compilation.py` so PIECEWISE mode also rounds capture sizes to multiples of `(1 + num_speculative_tokens)` — eliminates `cudaErrorIllegalAddress` mid-decode on partial-acceptance steps

All patches are idempotent — they no-op when upstream merges the equivalent fix.

🩹 DFlash drafter correctness fixes (PR #40898 + #41703, merged ahead of upstream)

Both PRs are open upstream but required for correct DFlash operation (the z-lab drafter README pins them); the v0.23.0 :latest build carries them in-tree (3-way merged), alongside the new DFlash high-concurrency fix (PR #43982 port). They fix three real defects we root-caused in production on DGX Spark:

Defect	Symptom	Fix
Rejected-token context-KV writes — the `copy_and_expand_dflash_inputs_kernel` stored slot mappings for rejected draft tokens, writing garbage K/V into the drafter's paged KV cache (incl. shared blocks). With `--enable-prefix-caching` the corruption was persistent and self-accelerating	Draft acceptance decays 34–56% → 0.0% over minutes-to-hours of traffic (scales with volume); sticky engine-global; ~6× decode slowdown (144 → 24 tok/s) that only a restart healed	#41703 masks rejected/invalid context slots (`-1`) so they are never written
Drafter sliding-window ignored — SWA drafters (e.g. the Gemma-4-26B drafter: 4 of 5 layers SWA-2048) ran all layers as full attention	Long-context requests (>2048 tok history) got ~0% acceptance per-request even on a healthy server	#40898 adds DFlash SWA support (per-layer sliding-window wiring + causal SWA drafting metadata)
Missing Gemma-4 adapter pieces — no sqrt(hidden) embedding normalizer or final-logit softcapping in the draft path; `flash_attn` drafter rejected on multimodal Gemma targets	Depressed acceptance ceiling (MAL 4.4–6.6 vs z-lab's published 6.1–8.6); forced onto `flex_attention`	#41703 adds both + `use_mm_prefix=False`, enabling the upstream-tested `flash_attn` drafter on Gemma-4

⚠️ Recipe note: Qwen3.6 DFlash on the v0.24.0 Spark image uses "attention_backend":"TRITON_ATTN" in the speculative config, matching the top quickstart. Gemma-4 DFlash recipes remain a separate path and may specify "attention_backend":"flash_attn" where called out. With these fixes, --enable-prefix-caching is safe again with DFlash — soak-validated under production fleet traffic.

🧠 TurboQuant K8V4 — 4-bit KV cache compression

0xSero/turboquant with the AEON-7 fork applying our fix/cuda-graph-safe-qjl-powers patch — caches the [1, 2, 4, 8, 16, 32, 64, 128] constant per-device once at module load instead of re-allocating per call. Without this fix, TurboQuant crashes at boot during CUDA graph capture; the lazy workaround --enforce-eager costs ~30% throughput.

Enable per-serve via --kv-cache-dtype tq_k8v4.

⚡ DFlash speculative decoding (native via `--speculative-config`)

DFlash and EAGLE3 drafters are supported natively via vLLM's --speculative-config flag — no extra package needed since vLLM 0.21. Pair with our aeon-7 DFlash drafters on HF for 1.5-2.5× throughput on the Qwen3.x family.

🔬 Native Blackwell SM 12.1 sm_121a compute

Built for TORCH_CUDA_ARCH_LIST="12.1a" — the sm_121a target for the GB10 in DGX Spark. Also runs on RTX 5090 / RTX 5080 / RTX PRO 6000 Blackwell (sm_120) thanks to the same family matcher in vLLM main.

Deployment recipes

The Quickstart at the top is the canonical daily-driver path (Recipe A). This section is the full deployment matrix: the canonical target is the AEON-7 Qwen3.6 family — see Validated models below. Pick the variant that matches your hardware, then follow the matching recipe.

Pull the image

docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest
# or pin the current build (vLLM 0.23.0 + DFlash high-concurrency fix)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:2026-06-18-v0.23.0-dflashfix
# previous build (pre-v0.23.0 / pre-concurrency-fix) kept for rollback
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:2026-06-11-pr41703

Recipe A — DGX Spark, DFlash drafter + FP8 KV (recommended for daily-driver)

This is the measured-best config for DGX Spark per the AEON-7 Qwen3.6 routing memo: on the Qwen3.6-27B MTP-XS body, the DFlash drafter beats the MTP-method by +56 % median / +150 % peak on Spark's unified-memory GB10 (measured 2026-04-28). Note this is a 27B-XS-body result — Qwen3.6-35B-A3B is at parity (no DFlash win; its 8-layer all-full-attention drafter draws even with MTP-style decoding).

⚠️ DFlash + NVFP4 KV is not yet compatible on sm_121a (still true on the v0.23.0 build). The DFlash drafter uses non-causal attention (parallel candidate generation), and none of the currently-built backends pair non-causal with NVFP4 KV on Spark:

FLASH_ATTN — doesn't support NVFP4 KV

FLASHINFER — supports NVFP4 KV but requires SM100 (we're on SM121)

TRITON_ATTN — supports NVFP4 KV but is causal-only

Use --kv-cache-dtype fp8_e4m3 with DFlash. NVFP4 KV works cleanly with causal speculators (mtp, qwen3_5_mtp, eagle3, ngram) — see Recipe B.

This is the recipe in the top Quickstart — the full pull-container + pull-model + pull-drafter + serve block is there; don't duplicate it here. The serve flags below are the same ones; this section just explains each.

Key flags:

--quantization modelopt — the recommended Multimodal-NVFP4-MTP body is a modelopt NVFP4 checkpoint. (The older -NVFP4 production body is compressed-tensors format: nvfp4-pack-quantized → use --quantization compressed-tensors for that one.)
--kv-cache-dtype fp8_e4m3 — DFlash is non-causal and incompatible with NVFP4 KV on Spark today (see Recipe B for NVFP4 KV with MTP).
--speculative-config '{"method":"dflash",...}' — method: "dflash" is the native vLLM speculator (not "speculators").
--attention-backend TRITON_ATTN plus "attention_backend":"TRITON_ATTN" inside the DFlash JSON — vLLM does not inherit target attention-backend settings into speculative drafters.
--max-num-batched-tokens 32768 — must accommodate long agent startup prompts plus num_speculative_tokens × max_num_seqs headroom.
Leave --mamba-block-size unset — vLLM now derives the hybrid GatedDeltaNet + attention cache geometry correctly.
--gpu-memory-utilization — keep this ≤ 0.88 on DGX Spark. The quickstart uses 0.60 so Qwen3-ASR/Qwen3-TTS sidecars can share the GPU without forcing unified-memory pressure. Raise toward 0.75-0.85 only when the LLM is the dominant workload.

💡 Drafter materialization note. vLLM bind-mounts the drafter dir but can't follow symlinks that point outside the mount (e.g. into the HF cache blobs/ dir). After huggingface-cli download, either pass --local-dir-use-symlinks=False or cp -L $HF_CACHE/snapshots/<hash>/* /models/Qwen3.6-27B-DFlash-drafter/ so the files are real, not symlinks. This pitfall cost us 4 startup failures.

Recipe B — MTP self-speculation + NVFP4 KV (capacity-bound workloads)

For workloads where KV capacity is the bottleneck (long context, many concurrent streams), use the modelopt MTP-XS body with NVFP4 KV cache — the smaller body maximizes KV headroom. This is the only Spark recipe that exercises PR #44389's ~3× KV capacity gain today. (If you have VRAM to spare and want higher-quality output, the full Multimodal-NVFP4-MTP body drops in here too, at a small KV-headroom cost.)

docker run -d --name aeon-vllm \
    --gpus all --ipc=host --shm-size=16g --net=host \
    -v /models/Qwen3.6-27B-AEON-MTP-XS:/model:ro \
    --entrypoint vllm \
    ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model \
        --served-model-name aeon \
        --quantization modelopt \
        --kv-cache-dtype nvfp4 \
        --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
        --max-model-len 32768 --max-num-seqs 8 \
        --gpu-memory-utilization 0.60 \
        --enable-chunked-prefill --enable-prefix-caching \
        --trust-remote-code

⚠️ MTP throughput is lower than DFlash on Spark (Qwen3.6-27B). Measured 2026-04-28 on the Qwen3.6-27B MTP-XS body: DFlash beats MTP by +56 % median / +150 % peak with the same XS body. (On Qwen3.6-35B-A3B the two are at parity — its 8-layer all-full-attention drafter has no DFlash win.) Use MTP only when you need NVFP4 KV's ~3× capacity (long contexts or higher batch sizes) and can accept the lower throughput. For pure throughput on Spark, use Recipe A. For dedicated-VRAM Blackwell (RTX PRO 6000, B100/B200), MTP is the right choice everywhere.

Recipe C — TurboQuant K8V4 (4-bit KV, extreme capacity)

docker run -d --name aeon-vllm \
    --gpus all --ipc=host --shm-size=16g --net=host \
    -e VLLM_USE_TURBOQUANT=1 \
    -e TURBOQUANT_KV_BITS=4 \
    -v /models/Qwen3.6-27B-AEON-NVFP4:/model:ro \
    --entrypoint vllm \
    ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model \
        --quantization compressed-tensors \
        --kv-cache-dtype fp8 \
        --max-num-seqs 16 \
        ...

⚠️ Cannot mix TurboQuant K8V4 with --kv-cache-dtype nvfp4. Pick one. K8V4 wins on raw capacity (4-bit K + 4-bit V); NVFP4 KV wins on quality at ~3× capacity.

Smoke test

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aeon",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
    "temperature": 0.0
  }' | jq .choices[0].message.content

Benchmarks

For the current v0.23.0 per-model decode tables and 1→64 concurrency charts, see v0.23.0 fleet benchmarks above. The benchmarks below are prior-image validation gates and config-selection A/Bs (2026-06-11-pr41703 and 2026-06-04 era) — kept for the DFlash-correctness story and the KV/speculator config comparison, which still hold on the v0.23.0 build.

Gemma-4-26B-A4B + DFlash — DFlash-correctness validation gates (image `2026-06-11-pr41703`)

AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 + the z-lab gemma-4-26B-A4B-it-DFlash drafter, production profile (--gpu-memory-utilization 0.68 --max-model-len 184320 --max-num-seqs 32 --max-num-batched-tokens 32768 --enable-prefix-caching, body triton_attn, drafter flash_attn, num_speculative_tokens 10). Validation gates measured before/after the PR #40898+#41703 fixes:

Gate	pre-fix image	`2026-06-11-pr41703`
Long-context (~9k sys prompt) draft acceptance	~0–7% (SWA defect)	43.3% / MAL 5.3
Prefix-caching ON + fleet-burst + 10-min production soak	acceptance collapses to 0% in ~25 min (corruption)	52.0% / MAL 6.20 — improves under load
Single-stream coding (c=1, greedy)	144 tok/s fresh-boot best, decaying to ~24	149–150 tok/s, sustained
Long-context throughput	~46 tok/s (APC unusable)	78 tok/s (APC accelerates the cached prefix)
Live production probe (voice fleet, post-deploy)	—	60% acceptance / MAL 7.0

Mean acceptance length now lands in z-lab's published 6.1–8.6 range. KV at this profile: 726k tokens / 3.94× concurrency at 180k ctx. Serve command:

docker run -d --name gemma26b --gpus all --ipc=host --net=host --shm-size=16g \
  -v /models/Gemma-4-26B-A4B-it-Uncensored-NVFP4:/model:ro \
  -v /models/gemma-4-26B-A4B-it-DFlash:/drafter:ro \
  -e TORCH_CUDA_ARCH_LIST=12.1a \
  --entrypoint bash ghcr.io/aeon-7/aeon-vllm-ultimate:latest -lc 'exec vllm serve /model \
    --quantization compressed-tensors --trust-remote-code \
    --attention-backend triton_attn \
    --linear-backend flashinfer_cutlass \
    --max-model-len 184320 --max-num-seqs 32 --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.68 --enable-chunked-prefill --enable-prefix-caching \
    --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
    --speculative-config "{\"method\":\"dflash\",\"model\":\"/drafter\",\"num_speculative_tokens\":10,\"attention_backend\":\"flash_attn\"}"'

Qwen3.6 benchmarks (image `2026-06-04` era)

Measured on DGX Spark GB10 (sm_121a) with the historical 2026-06-04-era recipe: --max-num-seqs 8 --max-model-len 8192 --gpu-memory-utilization 0.78 --enable-chunked-prefill --enable-prefix-caching --mamba-block-size 256 --quantization {compressed-tensors|modelopt}. Current v0.24 quickstarts use --gpu-memory-utilization 0.60 for ASR/TTS sidecar headroom and leave --mamba-block-size unset.

🏆 Production-style: greedy + n_spec=15, by prompt category

The headline single-stream config — MTP-XS body + DFlash drafter (n_spec=15) + BF16 KV + greedy sampling — on 24 prompts (4 per category), max_tokens=400:

Category	n	TTFT median	TPOT median	decode tok/s mean	decode tok/s median	peak
math	4	243 ms	22.3 ms	44.6	44.9	45.7 ⚡
code	4	243 ms	24.1 ms	40.4	41.6	44.4
reasoning	4	195 ms	28.4 ms	35.9	35.2	40.1
summary	4	242 ms	33.1 ms	31.3	30.4	37.5
dialogue	4	243 ms	33.4 ms	30.0	30.1	36.6
prose	4	132 ms	37.5 ms	26.2	26.9	29.6
OVERALL	24	242 ms	29.3 ms	34.7	34.1	45.7

Concurrent ×4 streams (mixed categories):

Round	Wall	Agg tok/s	TTFT mean
1 (cold)	19.05 s	71.5	1222 ms
2 (steady)	17.57 s	84.4	276 ms

Key findings:

Math and code hit 41–46 tok/s because token sequences are predictable — DFlash's n=15 acceptance window stays full.
Prose is slowest at ~26 tok/s — high-entropy creative text means fewer drafter tokens accepted.
Per-category headline matches the v3 production card (38.5 median / 71.3 peak, thinking-on) — math/code peak ~45 tok/s aligns with field reports.
n_spec=15 cuts KV concurrency in half (146k tokens at 8k ctx, 17.9× max concurrent vs ~37× at n=4). Trade per-stream peak throughput for concurrency.

Apples-to-apples 4-config comparison (sampled, n_spec=4 — same settings for all)

Same 8 generic prompts, temperature=0.7, max_tokens=200, n_spec=4. Use this when comparing speculator method or KV dtype at identical settings.

Provenance / status: this 4-config A/B was measured on the earlier 2026-06-04 era image and is preserved here as the canonical config-selection baseline (speculator method × KV dtype, all else equal): the FP8-E4M3 KV config (17.35 tok/s median single-stream) vs the winning DFlash + XS-body + BF16 KV config (24.27 tok/s, +40%). These are config-vs-config on the AEON container, not a stock-vanilla comparison. The vanilla-vLLM baselines used in the fleet section above are provisional and pending a fresh fully-vanilla re-benchmark on the current v0.23.0 version.

Single-stream (mean of 5 rounds)

Config	Body	KV cache	TTFT mean	TTFT median	TPOT mean	tok/s mean	tok/s median
MTP self-spec (n=1)	XS (modelopt, 21 GB)	NVFP4 (PR #44389)	139 ms	121 ms	57.76 ms/tok	17.26	16.64
MTP self-spec (n=1)	XS (modelopt, 21 GB)	FP8-E4M3	182 ms	214 ms	57.05 ms/tok	17.35	17.40
DFlash drafter (n=4)	NVFP4 (compressed-tensors, 26 GB)	BF16 (auto)	299 ms	298 ms	50.21 ms/tok	19.44	20.10
🏆 DFlash drafter (n=4)	XS (modelopt, 21 GB)	BF16 (auto)	174 ms	131 ms	40.84 ms/tok	24.27	23.73

Concurrent × 4 streams (mean of 12 streams over 3 rounds)

Config	Body	KV cache	TTFT median (steady)	TPOT mean	per-stream tok/s	aggregate peak
MTP self-spec (n=1)	XS body	NVFP4	286 ms	61.10 ms/tok	15.71	~64 tok/s
MTP self-spec (n=1)	XS body	FP8-E4M3	239 ms	60.17 ms/tok	15.84	~66 tok/s
DFlash drafter (n=4)	NVFP4 body	BF16 (auto)	328 ms	55.39 ms/tok	15.98	~68 tok/s
🏆 DFlash drafter (n=4)	XS body	BF16 (auto)	476 ms¹ / 259 ms²	44.21 ms/tok	19.59	~87 tok/s

¹round 2 (warm) ²round 3 (fully steady)

Headlines

🏆 The winning speed pattern on Spark is an MTP-format body + DFlash drafter (n=4) + BF16 KV. Even though the body name says "MTP", it works great with an external DFlash drafter — and an MTP body leaves more compute and KV headroom than the 26 GB compressed-tensors body. Results vs the FP8-KV baseline: +40% single-stream tok/s and +24% concurrent throughput; aggregate peak ~87 tok/s on 4 concurrent streams. (Speed rows below were measured on the smaller -MTP-XS body.)
✅ Recommended daily-driver body: Multimodal-NVFP4-MTP (the full, non-XS body). In AEON-7's follow-up evals it runs at parity speed with the XS body while scoring materially higher on quality benchmarks — so it's the default when you have the VRAM (it's only slightly larger than XS). Keep the -MTP-XS body when footprint is tight. Both use --quantization modelopt and pair with the same z-lab DFlash drafter.
DFlash on the NVFP4 (compressed-tensors) body is also a big win (+12% single, +0.9% concurrent) but the heavier 26 GB body loses ground to the same drafter on the lighter MTP bodies.
MTP + NVFP4 KV is the only path to PR #44389's ~3× KV capacity gain. Use when capacity (long context, more streams) outweighs the ~30-40% lower throughput vs DFlash. NVFP4 KV is within ±1% of FP8 on throughput at this prompt size; the real benefit is ~3× more KV blocks at the same memory budget.
TPOT story is the cleanest signal. DFlash + XS-body hits 40.8 ms/tok single-stream, which is 28% faster than MTP (57 ms) and 18% faster than DFlash on the heavier NVFP4 body (50 ms). The drafter's n=4 acceptance and the smaller body's bandwidth advantage compound.
Round-1 concurrent TTFT (~1.5–4.6 s) is cold-cache + spec-decode warm-up. Steady-state TTFT is rounds 2–3 (typically ~250–500 ms).

KV cache capacity by body

Body	GPU KV cache size at 8k ctx	Max concurrency
NVFP4 (compressed-tensors, 26 GB) + DFlash + BF16 KV	264,922 tokens	32.3×
XS (modelopt, 21 GB) + DFlash + BF16 KV	300,966 tokens	36.7×

Raw JSON summaries: bench_mtp_fp8kv.json, bench_mtp_nvfp4kv.json, bench_dflash_bf16kv.json, bench_xs_dflash_bf16kv.json. Methodology + plotting in bench_summary.md.

Validated models

This image is purpose-built around the AEON-7 Qwen3.6 family for DGX Spark. Other Blackwell-class models work but are not the canonical target.

Model	Quant format	Spec method	Status	Notes
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP	modelopt NVFP4 (GDN preserved BF16)	DFlash drafter (or native MTP)	✅ Recommended Spark daily-driver — parity speed with XS, higher quality-eval scores	`--quantization modelopt`; pair with `z-lab/Qwen3.6-27B-DFlash`; image+video capable
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4	compressed-tensors `nvfp4-pack-quantized`	DFlash drafter	✅ Benchmarked in this card (heavier 26 GB body)	Prefer the Multimodal-NVFP4-MTP body above; pair with `z-lab/Qwen3.6-27B-DFlash`
AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4	compressed-tensors `nvfp4-pack-quantized`	DFlash drafter	✅ Fleet-benchmarked in the v0.23.0 section above	Drafter must use `attention_backend: flash_attn` on this image; pair with z-lab `gemma-4-26B-A4B-it-DFlash`
AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4	compressed-tensors `nvfp4-pack-quantized`	DFlash drafter	✅ Fleet-benchmarked in the v0.23.0 section above	A3B MoE; 8-layer all-full-attn drafter (no SWA/`--mamba-block-size` needed); optimal n≈10–11
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS	modelopt NVFP4	DFlash drafter (or native MTP)	✅ Smallest footprint — speed rows benchmarked below	Min-VRAM option; step up to the full Multimodal-NVFP4-MTP for higher quality if VRAM allows. Native MTP-method underperforms DFlash on Spark
z-lab/Qwen3.6-27B-DFlash	BF16 5-layer drafter (3.3 GB)	—	✅ Pairs with `…-NVFP4` above	Drafter for DFlash recipe
AEON-7/Step-3.7-Flash-AEON-Ultimate-Abliterated-NVFP4	NVFP4 (modelopt)	—	🟡 Expected to work	198B MoE — not yet smoke-tested in this image

Known issues (upstream vLLM)

These are upstream PR #44389 or core-vLLM bugs that we didn't introduce and can't fix without substantial patching. They're documented here so users don't think the container is broken:

NVFP4 KV cache requires a causal attention backend on SM121

PR #44389 lights up --kv-cache-dtype nvfp4 via the Triton software path, but the Triton backend is causal-only. The FlashInfer NVFP4 KV path requires SM100 — on SM121 it falls back to FP8.

Practical impact: NVFP4 KV pairs cleanly with causal speculators (mtp, qwen3_5_mtp, eagle3, ngram, ngram_gpu) but not with non-causal drafters like DFlash. If you pick --kv-cache-dtype nvfp4 + method:"dflash", vLLM raises:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(...
  kv_cache_dtype=nvfp4, ..., use_non_causal=True). Reasons:
    FLASH_ATTN: [kv_cache_dtype not supported],
    FLASHINFER: [non-causal attention not supported, nvfp4 KV cache in FlashInfer requires SM100],
    TRITON_ATTN: [non-causal attention not supported],
    FLEX_ATTENTION: [kv_cache_dtype not supported],
    TURBOQUANT: [kv_cache_dtype not supported, non-causal attention not supported]

Workaround for DFlash: use --kv-cache-dtype auto (BF16). FP8 KV also fails for DFlash in this build because FLASHINFER and TRITON_ATTN both lost their non-causal kernel path in PR #44389's refactor:

ValueError: ... kv_cache_dtype=fp8_e4m3, ..., use_non_causal=True. Reasons:
  FLASH_ATTN: [kv_cache_dtype not supported]   (BF16 only)
  FLASHINFER:  [non-causal attention not supported]
  TRITON_ATTN: [non-causal attention not supported]
  FLEX_ATTENTION: [kv_cache_dtype not supported]
  TURBOQUANT: [kv_cache_dtype not supported, non-causal attention not supported]

This is a current-state limitation of the v0.23.0 build on sm_121a: DFlash's non-causal (parallel candidate) attention has no FP8/NVFP4 KV kernel partner on GB10 today — FLASH_ATTN is BF16-KV only, and both FLASHINFER and TRITON_ATTN dropped their non-causal path in PR #44389's refactor (FlashInfer's NVFP4 KV also needs SM100). NVFP4/FP8 KV will pair with DFlash once either (a) the Triton backend gains a non-causal kernel or (b) FLASHINFER's non-causal + FP8 path returns. Until then, run DFlash with --kv-cache-dtype auto (BF16), or use a causal speculator (Recipe B) for NVFP4 KV.

Workaround for NVFP4 KV: use a causal speculator (mtp, qwen3_5_mtp, eagle3, ngram, ngram_gpu) — see Recipe B. The Triton NVFP4-KV path supports those.

Gemma-4-12B-AEON variants

Variant	Issue
`Gemma-4-12B-AEON-Abliterated-K4-BF16`	vLLM's `TransformersMultiModalForCausalLM` fallback hits a shape mismatch on `Gemma4UnifiedForConditionalGeneration`. `RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x4096 and 8192x3840)` in a linear projection during graph capture. Suspect a multimodal-fused QKV layer not handled by the fallback path.
`Gemma-4-12B-AEON-Abliterated-K4-NVFP4-SVDQuant`	vLLM only knows `NVFP4 / NVFP4_FP8_MHA / W4A16_NVFP4 / MXFP8 / MIXED_PRECISION`. Our model's `quant_algo=NVFP4_SVD` (ModelOpt's newer SVD+low-rank variant) isn't yet recognized. Awaiting a deserializer PR in vLLM's `model_executor.layers.quantization.modelopt`.

Gemma-4-26B-A4B-NVFP4 — badly-quantized vision-embedder variant only

This entry is specific to the variant whose vision embedder was quantized. The correctly-quantized Gemma-4-26B-A4B-it-Uncensored-NVFP4 (vision embedder excluded as BF16) is fleet-benchmarked and works on the v0.23.0 :latest image — see the fleet section.

For the badly-quantized variant, vLLM creates the embed_vision.embedding_projection as a quantized ReplicatedLinear, but the checkpoint has only the unquantized embed_vision.embedding_projection.weight (because embed_vision* was excluded during quantization). Weight-loading mismatch. Likely an exclude_modules wildcard handling bug in PR #44389's refactor.

For Gemma-4 production today

The correctly-quantized Gemma-4-26B-A4B-it-Uncensored-NVFP4 body + z-lab gemma-4-26B-A4B-it-DFlash drafter is fleet-benchmarked on the current v0.23.0 :latest image (see the fleet section). The variants in the table above (Gemma-4-12B-AEON-*, Gemma-4-26B-A4B-NVFP4 with a quantized vision embedder) fail for model-side reasons that are independent of this container — they fail on any vLLM build.

Build provenance

Current :latest (= :2026-06-18-v0.23.0-dflashfix) built 2026-06-18 on DGX Spark (GB10, 128 GB unified memory) — vLLM v0.23.0 compiled from source for sm_121a as a 3-way merge that preserves the AEON spec-decode tree (TORCH_CUDA_ARCH_LIST=12.1a, full CUDA compile). Carries the still-open upstream PRs #44389 (Triton NVFP4 KV), #40898 (DFlash SWA), #41703 (prefix-cache corruption immunity), plus the new in-tree DFlash high-concurrency fix (port of upstream PR #43982). Rollback tag: :2026-06-11-pr41703 (vLLM 0.22.1 era). Earlier :2026-06-04-pr44389 source pin was lesj0610/vllm@lesj/triton-nvfp4-kv-fork-20260602 commit e8c77b85.

Dockerfile + patches + verify script live in this repo (AEON-7/vllm-ultimate-dgx-spark).

License

vLLM is Apache-2.0. PyTorch BSD-3-Clause. TurboQuant Apache-2.0. AEON patches MIT.

This container is provided "AS IS" — see the legal section below.

Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this container or its outputs, you acknowledge and agree to the following:

Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt issued to any model served by this container, (b) every response produced, (c) every downstream action taken in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results.
No Warranty. This container is provided strictly "AS IS", without warranty of any kind, express or implied, including warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, performance, or legal compliance in any jurisdiction.
Legal Compliance. You are responsible for ensuring your use complies with all applicable laws, regulations, terms of service, and organizational policies in every jurisdiction in which you operate.
Operational Safety. When serving uncensored or abliterated models with this container, you are expected to implement appropriate downstream safety layers: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows.
No Endorsement. The authors, contributors, and publishers do not endorse, adopt, or take responsibility for any specific output produced by models served via this container.
Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this container shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding.
Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the container or your breach of this clause.
Severability. If any provision is held unenforceable in a given jurisdiction, the remaining provisions remain in full force.
Acceptance. Your use of this container constitutes your acceptance of this clause in full. If you do not accept, do not use the container.

☕ Support the work

If this container saves you days of vLLM compile-and-patch on Spark, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)

_{bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4}

Ξ Ethereum (ETH)

_{0x1512667F6D61454ad531d2E45C0a5d1fd82D0500}

◎ Solana (SOL)

_{DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t}

ⓜ Monero (XMR)

_{836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd}

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
assets/perf		assets/perf
gemma4-nvfp4		gemma4-nvfp4
humming-stub		humming-stub
patches		patches
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Dockerfile		Dockerfile
Dockerfile.pr41703-layer		Dockerfile.pr41703-layer
README.md		README.md
SOURCE.md		SOURCE.md
bench_cat_xs_dflash_n15_greedy.json		bench_cat_xs_dflash_n15_greedy.json
bench_categories.py		bench_categories.py
bench_dflash_bf16kv.json		bench_dflash_bf16kv.json
bench_gemma4_k4_bf16_transformers.json		bench_gemma4_k4_bf16_transformers.json
bench_gemma4_k4_bf16_vllm.json		bench_gemma4_k4_bf16_vllm.json
bench_gemma4_k4_bf16_vllm_conc8.json		bench_gemma4_k4_bf16_vllm_conc8.json
bench_gemma4_k4_bf16_vllm_fp8kv_conc16.json		bench_gemma4_k4_bf16_vllm_fp8kv_conc16.json
bench_gemma4_k4_bf16_vllm_fp8kv_conc8.json		bench_gemma4_k4_bf16_vllm_fp8kv_conc8.json
bench_mtp_fp8kv.json		bench_mtp_fp8kv.json
bench_mtp_nvfp4kv.json		bench_mtp_nvfp4kv.json
bench_summary.md		bench_summary.md
bench_vllm.py		bench_vllm.py
bench_xs_dflash_bf16kv.json		bench_xs_dflash_bf16kv.json
prepare_gemma4_for_vllm.py		prepare_gemma4_for_vllm.py
sweep_conc_v0240.json		sweep_conc_v0240.json
sweep_prodcfg_v0240.json		sweep_prodcfg_v0240.json
sweep_v0230_ab.json		sweep_v0230_ab.json
validate_sweep.py		validate_sweep.py
verify.py		verify.py

Folders and files

Latest commit

History

Repository files navigation

AEON vLLM Ultimate — DGX Spark / Blackwell

Quickstart (DGX Spark, copy-paste)

What's inside

v0.23.0 fleet benchmarks — one image, three models

Gemma-4-26B-A4B-it-Uncensored (NVFP4)

Qwen3.6-35B-A3B-heretic (NVFP4)

Qwen3.6-27B-AEON-Ultimate (NVFP4 MTP-XS body + DFlash n=12)

What we fixed for the DGX Spark

The result

Why this container for Blackwell + DGX Spark users

🚀 NVFP4 KV cache — up to 3× KV capacity (Triton software path)

🛠️ AEON DGX Spark patches (sm_121a runtime fixes)

🩹 DFlash drafter correctness fixes (PR #40898 + #41703, merged ahead of upstream)

🧠 TurboQuant K8V4 — 4-bit KV cache compression

⚡ DFlash speculative decoding (native via --speculative-config)

🔬 Native Blackwell SM 12.1 sm_121a compute

Deployment recipes

Pull the image

Recipe A — DGX Spark, DFlash drafter + FP8 KV (recommended for daily-driver)

Recipe B — MTP self-speculation + NVFP4 KV (capacity-bound workloads)

Recipe C — TurboQuant K8V4 (4-bit KV, extreme capacity)

Smoke test

Benchmarks

Gemma-4-26B-A4B + DFlash — DFlash-correctness validation gates (image 2026-06-11-pr41703)

Qwen3.6 benchmarks (image 2026-06-04 era)

🏆 Production-style: greedy + n_spec=15, by prompt category

Apples-to-apples 4-config comparison (sampled, n_spec=4 — same settings for all)

Single-stream (mean of 5 rounds)

Concurrent × 4 streams (mean of 12 streams over 3 rounds)

Headlines

KV cache capacity by body

Validated models

Known issues (upstream vLLM)

NVFP4 KV cache requires a causal attention backend on SM121

Gemma-4-12B-AEON variants

Gemma-4-26B-A4B-NVFP4 — badly-quantized vision-embedder variant only

For Gemma-4 production today

Build provenance

License

Arbitration Clause

☕ Support the work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

⚡ DFlash speculative decoding (native via `--speculative-config`)

Gemma-4-26B-A4B + DFlash — DFlash-correctness validation gates (image `2026-06-11-pr41703`)

Qwen3.6 benchmarks (image `2026-06-04` era)

Packages