DeepSeek V4 Flash DSpark C12 NVFP4 KV on 2x DGX Spark

Self-contained two-node DGX Spark recipe for serving DeepSeek-V4-Flash-DSpark with vLLM TP=2, DSpark speculative decoding, and a 1M-token max model length using the experimental nvfp4_ds_mla KV-cache path.

This repo includes Keys' DSpark concurrency patch in the vLLM overlay. That patch makes DSpark's persistent draft KV follow request identity instead of condensed batch-row position, and adds ragged mixed prefill/decode handling for real independent sessions.

The current local run profile is configured for:

max_model_len=1000000
max_num_seqs=12
kv_cache_dtype=nvfp4_ds_mla
gpu_memory_utilization=0.85
API bind address 0.0.0.0:8888

Important

This profile is meant for real deep-context agent serving: up to 1M tokens per separate session with MAX_NUM_SEQS=12. The KV cache is a shared pool, so twelve sessions do not each reserve 1M tokens up front. Normal agent sessions can run concurrently while retaining the 1M ceiling for unusually long requests.

Important

For long coding tasks and big prompts, use:

MAX_MODEL_LEN=1000000
MAX_NUM_SEQS=4
MAX_NUM_BATCHED_TOKENS=16384
GPU_MEMORY_UTILIZATION=0.87
GENERATION_MAX_TOKENS=384000

This repo captures the validated Stage C NVFP4 runtime, the 2026-06-30 agent-stability refresh, and the 2026-07-02 Keys C12 checkpoint:

max_model_len=1000000
max_num_seqs=12
kv_cache_dtype=nvfp4_ds_mla
reported KV pool: 3,225,280 tokens
configured active sequence slots: 12
single-stream decode stayed above 50 tok/s
deterministic direct prompts completed with no Chinese drift or repeated junk
2/4/6/12 concurrent code-gate prompts completed cleanly
DSpark in-server concurrency patch validated at max_model_len=200000, max_num_seqs=16, with static C16 at 315.1 tok/s aggregate and staggered C16 at 205.0 tok/s aggregate

If you already deployed an older copy and saw agent garble, loops, Chinese drift, or prompt/tool XML leaking into replies, keep the C12 NVFP4 profile and validate direct API behavior before changing agent harness settings. The fix path does not switch production to fp8 or a smaller fallback model.

Warning

If direct vLLM prompts are clean but an agent harness still garbles, check the harness session replay, fallback model list, and prompt/tool XML handling before changing DSpark weights or falling back to fp8.

Result

2026-07-02 Keys C12 NVFP4 Checkpoint

The current high-concurrency lane keeps Tony's known-good Stage C NVFP4 image and applies Keys' C12 serving profile.

Runtime:

endpoint tested: http://100.90.25.78:8888/v1
served model: deepseek-v4-flash-dspark
image: vllm-dspark-runtime:dspark-nvfp4-stage-c
model path: /cache/huggingface/fraserprice/DeepSeek-V4-Flash-DSpark
kv_cache_dtype=nvfp4_ds_mla
max_model_len=1000000
max_num_seqs=12
max_num_batched_tokens=8192
gpu_memory_utilization=0.85
MTP_NUM_TOKENS=5
VLLM_USE_B12X_WO_PROJECTION=1
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1
thinking=false
--generation-config vllm
--override-generation-config '{"temperature":0.0,"top_p":1.0,"top_k":40,"repetition_penalty":1.05}'

Boot evidence:

GPU KV cache size: 3,225,280 tokens
Maximum concurrency for 1,000,000 tokens per request: ~3.2x
Application startup complete.

Code-gate validation:

concurrency	success	server generation tok/s	acceptance
1	1/1	52.79	0.585
2	2/2	79.76	0.600
4	4/4	134.70	0.602
6	6/6	127.78	0.615
12	12/12	230.10	0.602

The upstream checkpoint note for this run was not imported into this checkout; this repo keeps the runtime changes and validation summary without the upstream benchmark artifact folder.

Do not enable VLLM_USE_B12X_FP8_GEMM=1 on this Stage C image. That flag hit a DeepGEMM layout assertion during DSpark drafter warmup in testing.

2026-06-30 Clean Agent-Serving Checkpoint

The prior conservative clean endpoint was reproduced on Asusi/Spark4 before sending the model back through Hermes/OpenClaw-style harnesses.

Runtime:

endpoint tested: http://100.90.25.78:8888/v1
served model: deepseek-v4-flash-dspark
image used on that lane: vllm-dspark-runtime:mia-raf-pr1-nvfp4-keys-c
model path: /cache/huggingface/fraserprice/DeepSeek-V4-Flash-DSpark
kv_cache_dtype=nvfp4_ds_mla
max_model_len=1048576
max_num_seqs=6
max_num_batched_tokens=8192
gpu_memory_utilization=0.80
MTP_NUM_TOKENS=5
thinking=false
--generation-config vllm
--override-generation-config '{"temperature":0.0,"top_p":1.0}'
explicit per-node VLLM_HOST_IP values

Boot evidence:

GPU KV cache size: 1,990,142 tokens
Maximum concurrency for 1,048,576 tokens per request: 1.90x
Application startup complete.

Direct validation:

/v1/models reported "max_model_len": 1048576
deterministic sanity prompt returned NVFP4 DSPARK OK
five longer English prompts completed with no CJK drift and no repeated junk
code-gate server decode mean: 54.22 tok/s
2/4/6 concurrent direct prompts all succeeded cleanly

Concurrency:

concurrency	success	aggregate tok/s	stability
2	2/2	60.95	no CJK/repeat junk
4	4/4	83.21	no CJK/repeat junk
6	6/6	104.11	no CJK/repeat junk

The upstream checkpoint note for this run was not imported into this checkout.

1M NVFP4 Profile

Validated on 2x DGX Spark, one GPU per node, TP=2, single stream.

Case	server tok/s	TTFC	acceptance	accepted/draft
p256/g64	54.46	0.506s	0.667	3.33
p256/g256	65.38	0.324s	0.718	3.59
p512/g64	56.26	2.738s	0.625	3.13
p512/g256	54.41	0.422s	0.550	2.75
p512/g256 warmup1	56.73	0.417s	0.585	2.92
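
The accepted/draft column in the table above appears to be the acceptance rate multiplied by the draft length (MTP_NUM_TOKENS=5). That relationship is inferred from the numbers, not stated by the benchmark, so treat the sketch below as a consistency check under that assumption. The helper name `accepted_per_draft` is hypothetical.

```python
# Assumption: "accepted/draft" = acceptance * MTP_NUM_TOKENS.
MTP_NUM_TOKENS = 5

# (case, acceptance, reported accepted/draft) copied from the table above.
ROWS = [
    ("p256/g64", 0.667, 3.33),
    ("p256/g256", 0.718, 3.59),
    ("p512/g64", 0.625, 3.13),
    ("p512/g256", 0.550, 2.75),
    ("p512/g256 warmup1", 0.585, 2.92),
]


def accepted_per_draft(acceptance, num_draft_tokens=MTP_NUM_TOKENS):
    """Expected accepted draft tokens per speculative step."""
    return acceptance * num_draft_tokens


for case, acceptance, reported in ROWS:
    derived = accepted_per_draft(acceptance)
    print(f"{case:<18} derived={derived:.3f} reported={reported:.2f}")
```

If a rerun at a different MTP_NUM_TOKENS stops matching this product, the column is being computed some other way and the assumption should be dropped.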

Boot logs reported:

GPU KV cache size: 2,044,166 tokens
Maximum concurrency for 1,048,576 tokens per request: 1.95x

The API reported:

{"max_model_len":1048576}

The upstream checkpoint note for this run was not imported into this checkout.

DSpark Concurrency Profile

Validated on the same 2x DGX Spark TP=2 deployment using Keys' DSpark concurrency patch, kv_cache_dtype=nvfp4_ds_mla, max_model_len=200000, max_num_seqs=16, MTP_NUM_TOKENS=5, and VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1.

Patch source:

drowzeys/Keys-Concurrency-Patch-for-DSpark-DeepSeek-V4-Flash
Tested patch commit: 7e4d94bbcec95223550517c0fa9244e59f9f6483

The live fix documented here keeps kv_cache_dtype=nvfp4_ds_mla and refreshes the repo's already-vendored Keys overlay with the path-adjusted Patch 2b update from that commit. In Patch 2b, ragged query_start_loc detection no longer depends on num_rejected_tokens_gpu. Treat the service as validated only after the built-in OpenAI-compatible chat smoke request plus agent-client validation pass on the live service.

Static simultaneous batch, one TP=2 replica:

concurrency	best aggregate tok/s	per-stream tok/s	acceptance
1	57.6	57.6	0.635
4	140.8	35.2	0.619
8	252.6	31.6	0.635
16	315.1	19.7	0.609
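
The per-stream column above is the aggregate rate divided by the number of concurrent streams. A small sketch of that arithmetic follows; the scaling figure (aggregate relative to N times the single-stream rate) is an added illustration, not a metric reported by the benchmark.

```python
# (concurrency, best aggregate tok/s) copied from the static-batch table above.
STATIC = [(1, 57.6), (4, 140.8), (8, 252.6), (16, 315.1)]


def per_stream(aggregate_tok_s, concurrency):
    """Average decode rate seen by each stream."""
    return aggregate_tok_s / concurrency


def scaling_efficiency(aggregate_tok_s, concurrency, single_stream_tok_s):
    """Aggregate rate as a fraction of ideal linear scaling (illustrative)."""
    return aggregate_tok_s / (concurrency * single_stream_tok_s)


single = STATIC[0][1]
for n, agg in STATIC:
    print(f"C{n:<2} per-stream={per_stream(agg, n):.1f} tok/s "
          f"scaling={scaling_efficiency(agg, n, single):.2f}")
```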

Staggered independent arrivals, one TP=2 replica:

concurrency	success	aggregate tok/s	acceptance
4	4/4	109.2	0.544
8	8/8	147.3	0.534
16	16/16	205.0	0.567

Correctness sanity check: deterministic victim output remained byte-identical under churn. A medium-churn condense test measured 0.529 acceptance and 99.7 tok/s across the churn window.
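
A byte-identical check of this kind can be sketched as follows. This is not the repo's benchmarks/correctness_test.py; the function names and the sample strings are hypothetical, and the sketch assumes you have already collected the victim's outputs from repeated temperature-0 requests while other sessions arrive and leave.

```python
import hashlib


def digest(text):
    """SHA-256 of the UTF-8 bytes, so the comparison is byte-level."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def mismatches(baseline, outputs):
    """Return the indices of outputs whose bytes differ from the baseline."""
    want = digest(baseline)
    return [i for i, out in enumerate(outputs) if digest(out) != want]


# Hypothetical sample: the third output has an extra space and must be flagged.
baseline = "NVFP4 DSPARK OK"
under_churn = ["NVFP4 DSPARK OK", "NVFP4 DSPARK OK", "NVFP4 DSPARK  OK"]
print(mismatches(baseline, under_churn))
```

An empty result means every output collected under churn matched the quiet-server baseline.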

The upstream checkpoint note for this run was not imported into this checkout.

Historical 60 tok/s DSpark Baseline

The older ~60 tok/s number was reproduced, but it is a separate diagnostic profile, not this repo's default 1M NVFP4 deployment:

image rebuilt from rafaelcaricio/vllm#1 commit 3519c3b88
max_model_len=262144
max_num_seqs=1
kv_cache_dtype=fp8
MTP_NUM_TOKENS=5
thinking=false
temperature=0.0, top_p=1.0
measured 63.97 tok/s on the code_completion gate with 67.9% DSpark acceptance

Use this to diagnose image/runtime drift. Do not confuse it with the production 1M NVFP4 path. The upstream checkpoint note for this run was not imported into this checkout.

2026-06-29 Full-1M Concurrency Microbench

The 200K/16 profile above maximizes raw concurrency. For agent fleets that want the full 1M context ceiling AND concurrency, run max_model_len=1048576 with max_num_seqs=6. Every request can still grow to 1M while up to 6 sessions run at once, because the shared KV pool — not a per-slot reservation — is the real limit (see How the KV cache works).

Validated on the 2026-06-29 code-completion microbench deployment (NVFP4, max_model_len=1048576, max_num_seqs=6, VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1, VLLM_USE_B12X_WO_PROJECTION=1):

Boot: GPU KV cache size: 1,901,239 tokens, Maximum concurrency for 1,048,576 tokens per request: 1.81x
6 concurrent requests: 6/6 success, ~182 tok/s aggregate (~30 tok/s per stream), no OOM / no preemption failures
Single-stream decode on this same profile: ~67 tok/s (code)

This is the right shape when most sessions sit far below 1M (typical agent turns) but you still want the 1M ceiling available. The newer 2026-06-30 agent-stability checkpoint above is the safer number to cite for Hermes/OpenClaw harness validation.

Higher concurrency is not free: under sustained pressure you can see added scheduler churn, prefill contention, and KV fragmentation. 1M/6 is validated for normal-length agent traffic; for guaranteed deep-context work under load, 1M/2 is conservative and 500K/4 is a balanced middle.

How the KV cache works (why 1M + concurrency is safe)

Note

max_model_len and max_num_seqs are ceilings, not reservations. The real limit is the sum of live tokens across active requests fitting inside the shared KV pool.

Three independent knobs, often confused:

knob	what it is	this build
KV cache pool	total shared KV memory in tokens, sized from `gpu_memory_utilization` after weights load	about 3.2M tokens in the C12 checkpoint
`max_model_len`	per-request ceiling: how long any one request may grow	1,000,000 (1M)
`max_num_seqs`	concurrency cap — max active sequences the scheduler runs at once	12

The pool is shared and allocated on demand: PagedAttention hands KV blocks to each request as its context grows (prompt prefill plus generated tokens) and frees them when the request finishes. vLLM does NOT pre-allocate max_num_seqs × max_model_len of KV, so the real constraint is:

sum(live tokens across all active requests) <= KV pool

Worked examples at 1M ceiling / 12 slots:

12 requests x  50k tokens =  600k   fits easily
12 requests x 200k tokens =  2.4M   fits in the C12 checkpoint pool
12 requests x 270k tokens =  3.2M   ~at the C12 checkpoint pool
3 requests  x 1M   tokens =  3.0M   ~near the C12 checkpoint pool
12 requests x 1M   tokens = 12.0M   impossible — excess requests queue/preempt

The boot log's Maximum concurrency for 1,000,000 tokens per request: ~3.2x only means about three simultaneous full-1M requests fit. But agent turns are almost never near 1M, so 12 normal-length sessions share the pool while the 1M ceiling stays available for the rare long one. That is exactly why 1M + max_num_seqs=12 is useful: you are not reserving 12×1M, you are sharing one pool across short requests under a high ceiling.
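
The constraint and worked examples above reduce to simple arithmetic. The sketch below checks a set of live requests against the three limits using the C12 numbers from this README. It is a simplification: it ignores block rounding (--block-size 256) and any max-length-dependent workspaces, and it does not model how the vLLM scheduler queues or preempts. The function name `admission_check` is hypothetical.

```python
KV_POOL_TOKENS = 3_225_280  # "GPU KV cache size" from the C12 boot evidence
MAX_MODEL_LEN = 1_000_000   # per-request ceiling
MAX_NUM_SEQS = 12           # concurrency cap


def admission_check(live_tokens, pool=KV_POOL_TOKENS,
                    max_len=MAX_MODEL_LEN, max_seqs=MAX_NUM_SEQS):
    """Check per-request live token counts against slots, ceiling, and pool."""
    total = sum(live_tokens)
    return {
        "within_slots": len(live_tokens) <= max_seqs,
        "within_ceiling": all(t <= max_len for t in live_tokens),
        "within_pool": total <= pool,
        "total_live_tokens": total,
    }


examples = {
    "12 x 50k": [50_000] * 12,
    "12 x 200k": [200_000] * 12,
    "3 x 1M": [1_000_000] * 3,
    "12 x 1M": [1_000_000] * 12,
    "1 x 1M + 11 x 100k": [1_000_000] + [100_000] * 11,
}
for name, live in examples.items():
    result = admission_check(live)
    verdict = "fits" if all(v for k, v in result.items() if k.startswith("within")) else "does not fit"
    print(f"{name:<20} total={result['total_live_tokens']:>10,} -> {verdict}")
```

The last example is the case this profile is built for: one session near the ceiling alongside eleven ordinary ones still totals well under the pool.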

Gotcha: gibberish, loops, Chinese drift, or prompt/XML leakage

Warning

This failure mode is often caused by stale runtime images, inherited sampling defaults, or agent orchestration state. Validate the direct OpenAI-compatible API path first, then test the agent harness.

If the model boots and basic prompts such as "hi" work, but real agent traffic randomly turns into repeated characters, Chinese drift, leaked tool/schema XML, or Telegram-visible junk, do not assume the weights are bad.

On this deployment there are three checks to make before blaming the weights:

Runtime concurrency safety: make sure the Keys Patch 2b logic is present in recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py. The important behavior is that ragged query_start_loc handling does not depend on num_rejected_tokens_gpu, and the no-rejection path creates a zero rejected token tensor instead of falling through to unsafe request reshaping. Without this, concurrent DSpark requests can mix context.
Runtime image provenance: make sure the image really contains the current DSpark overlay. A reused local tag named vllm-dspark-runtime:clean caused misleading failures even though a nearby PR-head image worked. Rebuild from the intended overlay commit when in doubt.
Decode/fallback safety: for long OpenAI-compatible agent prompts, avoid unstable sampling and hidden fallback transitions. The server default should ignore the model card's sampling defaults and apply a small sampling floor:

{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 40,
  "repetition_penalty": 1.05,
  "include_reasoning": false,
  "reasoning_effort": "none",
  "chat_template_kwargs": {
    "thinking": false,
    "enable_thinking": false
  }
}

The compose launcher now includes --generation-config vllm, builds --override-generation-config from the GENERATION_* env values, and sets thinking=false so default requests do not inherit unstable model-card sampling. Explicit client request parameters still win. For exact deterministic curl checks, send temperature: 0 in the request body.
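
As an illustration of how the GENERATION_* values map onto the --override-generation-config JSON, here is a minimal sketch. It is not the compose launcher's actual code; the function name `build_override_generation_config` is hypothetical, and the mapping assumes the four env names listed in this README correspond to the four JSON keys shown in the runtime profile.

```python
import json

# Assumed mapping: JSON key -> (env name, type used when parsing the value).
FIELDS = {
    "temperature": ("GENERATION_TEMPERATURE", float),
    "top_p": ("GENERATION_TOP_P", float),
    "top_k": ("GENERATION_TOP_K", int),
    "repetition_penalty": ("GENERATION_REPETITION_PENALTY", float),
}


def build_override_generation_config(env):
    """Build the compact JSON string from a dict of env values.

    Unset or empty variables are skipped so vLLM keeps its own default.
    """
    config = {}
    for key, (name, cast) in FIELDS.items():
        raw = env.get(name, "")
        if raw != "":
            config[key] = cast(raw)
    return json.dumps(config, separators=(",", ":"))


env = {
    "GENERATION_TEMPERATURE": "0.0",
    "GENERATION_TOP_P": "1.0",
    "GENERATION_TOP_K": "40",
    "GENERATION_REPETITION_PENALTY": "1.05",
}
print(build_override_generation_config(env))
```

With the agent-serving defaults from this README, the output matches the override string shown in the C12 runtime listing. Pass `os.environ` instead of the sample dict to read the live environment.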

Also clear agent fallback lists during validation. A model that looks fixed in direct vLLM tests can still appear poisoned if the orchestration layer silently falls back, reboots a session, or replays a stale prompt/tool transcript into the visible message stream. Keep OpenClaw/Hermes changes separate from model runtime validation unless you are deliberately testing that harness.

Validation gates to run after a live fix:

direct vLLM prompts: clean
direct concurrent vLLM prompts: clean
agent harness prompts: clean, DeepSeek, no fallback
MTP5 accepted-token positions 0..4 active

This keeps NVFP4 KV and MTP5. Do not switch to fp8 or drop to a smaller fallback model just to hide the symptom unless you intentionally accept the context and quality tradeoff.
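
For the "clean" gates above, a rough automated check for CJK drift and repeated junk might look like the sketch below. It is not scripts/agent_sanity_bench.py; the function name, Unicode ranges, and thresholds are assumptions chosen for English-only validation prompts, and it does not try to detect leaked prompt/tool XML.

```python
import re

# CJK ranges: kana, CJK extension A, CJK unified ideographs, Hangul syllables.
CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")
# The same character repeated 20 or more times in a row (assumed threshold).
REPEAT_CHAR_RE = re.compile(r"(.)\1{19,}")


def garble_reasons(text, ngram=6, max_repeats=4):
    """Return a list of reasons the text looks garbled; empty means clean."""
    reasons = []
    if CJK_RE.search(text):
        reasons.append("cjk")
    if REPEAT_CHAR_RE.search(text):
        reasons.append("repeated_char")
    # Loop detection: any word n-gram that occurs max_repeats times or more.
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        gram = tuple(words[i:i + ngram])
        counts[gram] = counts.get(gram, 0) + 1
    if counts and max(counts.values()) >= max_repeats:
        reasons.append("loop")
    return reasons


print(garble_reasons("NVFP4 DSPARK OK"))
print(garble_reasons("The answer is 你好"))
print(garble_reasons("the quick brown fox jumps over " * 5))
```

The CJK check is only meaningful when the prompt asks for English output; drop it for prompts that legitimately expect Chinese, Japanese, or Korean text.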

Important Caveat

Caution

This is the Stage C padded NVFP4 path. It keeps DeepSeek V4's known-good 584-byte sparse-MLA cache envelope while routing the runtime through nvfp4_ds_mla. It is not the unresolved true-layout 416-byte NVFP4 kernel fix. The true-layout experiments were useful for diagnosis but failed past roughly 411 real prompt tokens, so they are intentionally not presented here as the reproducible recipe.

Credits

See CREDITS.md for the full attribution and license notes.

This recipe stands on prior public work:

Keys / drowzeys' DSpark in-server concurrency patch: drowzeys/Keys-Concurrency-Patch-for-DSpark-DeepSeek-V4-Flash. This patch fixes the request-stable DSpark main-KV slot mapping and the ragged query_start_loc path needed for real independent-arrival continuous batching. The concurrency results in this repo depend directly on that work.
Rafael Caricio's DSpark vLLM integration: rafaelcaricio/vllm#1
Rafael Caricio's DSpark deployment/runbook PR: rafaelcaricio/spark_vllm_docker#1
Fraser Price's DeepSeek V4 Flash DSpark model/runtime work: fraserprice/DeepSeek-V4-Flash-DSpark and fraserprice/dspark-vllm
MiaAI-Lab's two-node DGX Spark packaging and worker-first launch runbook: MiaAI-Lab/DeepSeek-v4-Flash-DSpark-2x-DGX-Spark
drowzeys ("Keys") — origin of wiring the nvfp4_ds_mla KV-cache dtype into a DGX Spark launch recipe (Keys---Full-GLM-5.2-Quantrio…). This build's 1M NVFP4 KV path descends from that nvfp4_ds_mla work.
Upstream vLLM, FlashInfer, NVIDIA Blackwell/CUDA/NCCL tooling, and DeepSeek V4 Flash.
DeepSeek-AI's DeepSpec work as the public DSpark/speculative decoding foundation.

Our contribution here is the 1M NVFP4-KV checkpoint recipe, the Stage A/B/C runtime patches, sanitized two-node launch config, applying and validating Keys' concurrency patch on the NVFP4 profile, and measured benchmark artifacts from the validated runs.

License Notes

Repo scripts and docs are published under this repo's LICENSE. The vLLM overlay/runtime files are vLLM-derived and retain their Apache-2.0 lineage and SPDX headers where present. Base images, FlashInfer/TileLang/Triton/CUDA/NCCL, and model weights are separate upstream artifacts with their own licenses and usage terms.

Files

path	purpose
`recipe/overlay/`	base DSpark vLLM overlay files
`recipe/Dockerfile.dspark-runtime-overlay`	builds the base DSpark runtime overlay
`recipe/nvfp4/Dockerfile.stage-a`	adds `nvfp4_ds_mla` dtype plumbing
`recipe/nvfp4/Dockerfile.stage-b`	enables DeepSeek V4 `nvfp4_ds_mla` probe path
`recipe/nvfp4/Dockerfile.stage-c`	switches DeepSeek V4 NVFP4 to the validated 584-byte padded envelope
`docker-compose.dspark.yml`	two-node vLLM/DSpark service
`.env.dspark.example`	sanitized cluster configuration template
`build-dspark-vllm-runtime.sh`	builds the Stage C image locally and on the worker
`prepare-dspark-model-cache.sh`	downloads/verifies the model cache
`start-deepseek-v4-flash-dspark.sh`	worker-first launch and smoke test; honors worker path/cache/IP overrides
`stop-deepseek-v4-flash-dspark.sh`	stops head and worker services
`status-deepseek-v4-flash-dspark.sh`	shows head/worker container state
`logs-deepseek-v4-flash-dspark.sh`	tails head/worker DSpark logs
`smoke-deepseek-v4-flash-dspark.sh`	direct concurrent OpenAI-compatible smoke test
`validate-dspark-config.sh`	renders and checks the local DSpark compose/env config
`patches/keys-concurrency.patch`	full path-adjusted Keys concurrency patch reference
`docs/PATCHES.md`	plain-English Patch 1 / Patch 2 / Patch 2b concurrency explanation
`UPSTREAM_V024_STATUS.md`	current vLLM v0.24.0 vs DSpark PR #46995 upgrade notes
`scripts/agent_sanity_bench.py`	direct OpenAI-compatible 1/2/4/6 concurrency and garble check
`scripts/capture_runtime.sh`	captures head/worker Docker inspect, ps, and log tails before/after changes
`benchmarks/`	local benchmark scripts retained from this checkout; upstream benchmark artifacts were intentionally not imported

Quick Start

Run from the head node.

cp .env.dspark.example .env.dspark

Edit these values for your cluster:

WORKER_HOST
WORKER_SCRIPT_DIR if the worker checkout/deployment path differs from the head
MASTER_ADDR
NCCL_IB_HCA
NCCL_SOCKET_IFNAME
NCCL_IB_GID_INDEX
HF_CACHE
WORKER_HF_CACHE if the worker cache path differs from the head
VLLM_HOST_IP and WORKER_VLLM_HOST_IP for each node's fabric IP

For this local setup the key values are:

WORKER_HOST=10.0.0.2
MASTER_ADDR=10.0.0.1
VLLM_HOST_IP=10.0.0.1
WORKER_VLLM_HOST_IP=10.0.0.2
MASTER_PORT=25000
NCCL_IB_HCA=rocep1s0f1
NCCL_SOCKET_IFNAME=enp1s0f1np1

Keep these agent-serving defaults unless you are deliberately experimenting:

VLLM_HOST=0.0.0.0 if Hermes/OpenClaw or another machine must reach the API
MAX_MODEL_LEN=1000000
MAX_NUM_SEQS=12
GPU_MEMORY_UTILIZATION=0.85
MTP_NUM_TOKENS=5
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1
VLLM_USE_B12X_WO_PROJECTION=1
GENERATION_TEMPERATURE=0.0
GENERATION_TOP_P=1.0
GENERATION_TOP_K=40
GENERATION_REPETITION_PENALTY=1.05
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

Build the base overlay and Stage C NVFP4 image:

./build-dspark-vllm-runtime.sh

Prepare the model cache:

./prepare-dspark-model-cache.sh

Start the service:

./start-deepseek-v4-flash-dspark.sh

The start script prints the resolved non-secret runtime profile, syncs the compose/env files to the worker path, validates rendered Docker Compose on both nodes, starts the worker first, then starts the head and follows startup logs while waiting for the API. If startup fails, it prints recent head and worker logs before exiting.

The API serves at:

http://HEAD_NODE_IP:8888/v1

For head-node-only tests, set VLLM_HOST=127.0.0.1. For Hermes/OpenClaw or another machine to use the endpoint, keep VLLM_HOST=0.0.0.0 and control access at the network/firewall layer.

Runtime Profile

C12 Agent-Serving Profile

Core vLLM flags:

--tensor-parallel-size 2
--distributed-executor-backend mp
--nnodes 2
--kv-cache-dtype nvfp4_ds_mla
--block-size 256
--max-model-len 1000000
--max-num-seqs 12
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.85
--speculative-config '{"method":"dspark","num_speculative_tokens":${MTP_NUM_TOKENS:-5}}'
--generation-config vllm
--override-generation-config '{"temperature":0.0,"top_p":1.0,"top_k":40,"repetition_penalty":1.05}'

Key runtime env:

VLLM_USE_B12X_MOE=1
VLLM_USE_B12X_WO_PROJECTION=1
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1
VLLM_DSPARK_CONFIDENCE_SCHEDULER=off
VLLM_DSPARK_LOCAL_ARGMAX=1
VLLM_DSPARK_REPLICATE_MARKOV_W1=1
VLLM_DSPARK_FUSED_MARKOV_ARGMAX=0
VLLM_DSPARK_REFERENCE_KV_QUANT_DEQUANT=0
VLLM_DSV4_B12X_COMPRESSED_MLA=0
VLLM_DSV4_DSPARK_DEFER_TARGET_CAPTURE=0
B12X_W4A16_TC_DECODE=0

200k Concurrency Profile

For DSpark concurrency, use the included overlay files with Keys' concurrency patch and set:

MAX_MODEL_LEN=200000
MAX_NUM_SEQS=16
VLLM_USE_B12X_WO_PROJECTION=1
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1

This checkout intentionally does not import upstream's benchmarks/keys-concurrency/ folder. Use the retained local benchmark scripts if you need ad hoc checks:

python3 benchmarks/bench_concurrent.py http://127.0.0.1:8888 1,4,8,16
python3 benchmarks/staggered_bench.py http://127.0.0.1:8888 16 0.4
python3 benchmarks/correctness_test.py http://127.0.0.1:8888

1M Single-Stream Legacy Profile

For conservative single-stream testing, set MAX_NUM_SEQS=1 and VLLM_USE_B12X_WO_PROJECTION=0. Keep MTP_NUM_TOKENS=5 unless you are deliberately running an experiment; upstream Mia and Keys both validate the DSpark path at MTP5.

Verify

After launch:

curl -fsS http://127.0.0.1:8888/v1/models

Confirm the returned model entry reports:

"max_model_len": 1000000

Then check logs:

docker compose --env-file .env.dspark -f docker-compose.dspark.yml logs vllm-dspark \
  | grep -E "GPU KV cache size|Maximum concurrency"

Expected C12 checkpoint values are around:

GPU KV cache size: 3.2M tokens
Maximum concurrency for 1,000,000 tokens per request: ~3.2x
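
Both checks above can be scripted. The sketch below parses a /v1/models response body and the two boot-log lines, then derives the concurrency ratio as KV pool divided by max_model_len. The function names are hypothetical, the sample /v1/models body assumes an OpenAI-style list with a max_model_len field on each entry, and the sample log uses the lines quoted in the C12 boot evidence rather than a real capture.

```python
import json
import re

KV_RE = re.compile(r"GPU KV cache size:\s*([\d,]+)\s*tokens")
CONC_RE = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request:\s*~?([\d.]+)x"
)


def max_model_len_from_models(models_json, served_name):
    """Return max_model_len for served_name, or None if it is not listed."""
    payload = json.loads(models_json)
    for entry in payload.get("data", []):
        if entry.get("id") == served_name:
            return entry.get("max_model_len")
    return None


def parse_boot_log(text):
    """Extract pool size and concurrency from boot logs; None if not found."""
    kv = KV_RE.search(text)
    conc = CONC_RE.search(text)
    if kv is None or conc is None:
        return None
    pool = int(kv.group(1).replace(",", ""))
    max_len = int(conc.group(1).replace(",", ""))
    return {
        "kv_pool_tokens": pool,
        "max_model_len": max_len,
        "reported_concurrency": float(conc.group(2)),
        "derived_concurrency": pool / max_len,
    }


# Sample inputs (assumed shapes, see the note above).
SAMPLE_MODELS = (
    '{"object": "list", "data": '
    '[{"id": "deepseek-v4-flash-dspark", "max_model_len": 1000000}]}'
)
SAMPLE_LOG = (
    "GPU KV cache size: 3,225,280 tokens\n"
    "Maximum concurrency for 1,000,000 tokens per request: ~3.2x\n"
)

print(max_model_len_from_models(SAMPLE_MODELS, "deepseek-v4-flash-dspark"))
print(parse_boot_log(SAMPLE_LOG))
```

In practice, feed the functions the output of the curl and docker compose commands shown above. A derived ratio that drifts far from the reported one suggests the log lines came from different boots.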

Before pointing an agent harness at the endpoint, run the direct sanity bench:

DSPARK_BASE_URL=http://HEAD_NODE_IP:8888/v1 \
CONCURRENCY=1,2,4,6 \
python3 scripts/agent_sanity_bench.py

Every row should report bad_outputs: 0. If this direct test is clean but an agent still garbles, investigate the agent session, fallback list, or harness prompt replay before blaming the DSpark weights.

Capture runtime evidence before and after any fix:

scripts/capture_runtime.sh runtime-before-change
scripts/capture_runtime.sh runtime-after-change

Notes

The old speed checkpoint is single stream, not aggregate throughput.
The high-concurrency benchmark is aggregate throughput and was validated at max_model_len=200000, not full 1M context.
Full context and high concurrency compete for the same KV pool. The C12 1M profile is intended for normal agent traffic where most sessions sit far below the 1M ceiling; it is not twelve simultaneous full-1M requests.
To combine DSpark concurrency with longer context, pick a lower context target first, then raise concurrency slowly while watching boot logs, KV allocation, acceptance, and request errors.
1M was validated as booted/advertised max_model_len with KV headroom and short-prompt speed probes. This repo does not claim a full 1M-token retrieval or correctness benchmark.
The measured probes were p256/p512 with g64/g256. Rebenchmark if you change sampling, batching, context length, WO projection, compressed MLA, or the confidence scheduler.
The current configured agent-serving profile is MAX_MODEL_LEN=1000000, MAX_NUM_SEQS=12, GPU_MEMORY_UTILIZATION=0.85, MTP_NUM_TOKENS=5, VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1, VLLM_USE_B12X_WO_PROJECTION=1, deterministic generation overrides, and VLLM_DSV4_B12X_COMPRESSED_MLA=0.
Worker-first startup avoids a race during multi-node mp initialization and now validates rendered compose on both nodes before starting containers.
Requires matching images on both nodes, correct NCCL/RoCE settings, and a two-node Blackwell-class/DGX Spark setup.
The API binds to 127.0.0.1 by default; exposing it with VLLM_HOST=0.0.0.0, as the current local run profile does, is a deliberate security choice.
The next max_model_len ladder to try is approximately 1.25M, 1.5M, then 1.75M, with the same boot/log/speed gates. Raw KV math alone is not enough because DeepSeek V4 sparse MLA also allocates max-length-dependent workspaces.
