GitHub - Luce-Org/lucebox-hub: Fast LLM speculative inference server for consumer hardware.

Local LLM inference server built for speed. Custom kernels, speculative prefill & decoding.
Each optimization in our engine is for specific model family and hardware target.

Inference Engine Optimizations

Each one is self-contained with setup instructions and benchmark notes.

Supported Models & Drafters

All speedups measured vs vendored llama.cpp (-fa 1, matching KV quant). Combined = geometric mean √(TTFT × decode) where both phases benched; otherwise the single-phase speedup. Drafters published on huggingface.co/Lucebox.

Model	Speedup
Qwen 3.5-0.8B (Megakernel)	~2×
Qwen 3.6-27B + PFlash	~5.6×
Qwen 3.6-27B + DDTree	4.84×
Laguna-XS.2 33B + PFlash	5.4× @128K
Qwen 3.6-27B HIP	~2.6×
Gemma-4-26B-A4B	1.31×

Drafter	Phase
`Qwen3.6-27B`	decode
`gemma-4-26B-A4B`	decode
`gemma-4-31B`	decode
`Qwen3-0.6B`	prefill

Tested Machines (GPU/APU)

Reference target: RTX 3090 (Ampere sm_86) — all headline numbers. Other NVIDIA archs auto-detected by CMake / setup.py; AMD HIP backend separate (Strix Halo section).

	Arch	GPU	Min CUDA / ROCm	Status	Bench
	Ampere `sm_86`	RTX 3090, A-series	CUDA 12.0	✅ reference	megakernel · dflash
	Blackwell `sm_120`	RTX 5090	CUDA 12.8	✅ 205 tok/s, 4.84×	↗
	Blackwell `sm_121`	DGX Spark / GB10	CUDA 12.9	✅ megakernel NVFP4	↗
	Turing `sm_75`	RTX 2080 Ti	CUDA 12.0	✅ 53 tok/s DFlash	↗
	Ada `sm_89`	RTX 40xx	CUDA 12.0	🟡 community WSL2 bench	↗
—	Blackwell `sm_110`	Jetson AGX Thor	CUDA 13.0	🟡 builds, unbenched	—
	Volta `sm_70` / Pascal `sm_61`	V100, P40	CUDA 12.0	🟡 fallback paths, unbenched	—
	RDNA3.5 `gfx1151`	Ryzen AI MAX+ 395 / Strix Halo	ROCm 6+	✅ 37 tok/s HIP	↗
	RDNA3 `gfx1100`	Radeon RX 7900 XTX	ROCm 6+	✅ 50 tok/s HIP	↗
—	RDNA4 `gfx1201`	Radeon AI PRO R9700	ROCm 6.4+	✅ 55 tok/s HIP	↗

server/ (DFlash) builds with CMake 3.18+ and --recurse-submodules for Luce-Org/llama.cpp@luce-dflash — no PyTorch needed. optimizations/megakernel/ is the only component requiring PyTorch 2.0+ (CUDAExtension links against torch C++ libs). Power-tune: sudo nvidia-smi -pl 220 (3090 sweet spot, re-sweep for other cards).

Quick Start On Harnesses

harness/ contains RTX 3090 client launchers and regression tests for Lucebox server compatibility. Run Lucebox inside Claude Code, Codex, OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or check if a server change still works with those clients.

Client	Launcher
Claude Code	`run_claude_code.sh`
Codex	`run_codex.sh`
OpenCode	`run_opencode.sh`
Hermes	`run_hermes.sh`
Pi	`run_pi.sh`
OpenClaw	`run_openclaw.sh`
Open WebUI	`run_openwebui.sh`

All launchers spawn the native C++ HTTP server (dflash_server). Override defaults via env vars:

DFLASH_SERVER_BIN=server/build/dflash_server \
DFLASH_TARGET=server/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=server/models/draft/dflash-draft-3.6-q4_k_m.gguf \
MAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree \
harness/clients/run_codex.sh

For no-draft targets such as Gemma, set only DFLASH_TARGET or pass DRAFT=none; the harness will not attach the default Qwen draft to a custom target.

Launcher scripts install missing real-client CLIs automatically under .harness-work/. To preinstall them yourself:

python3 harness/client_test_runner.py install --clients codex,hermes,openwebui

For direct TPS/TTFT numbers against a running server:

python3 harness/client_test_runner.py bench \
  --url http://127.0.0.1:8000 \
  --suite he,agent \
  --n-sample 3

Quick Start With Docker

Prebuilt images on GHCR track main. No CUDA toolkit or build needed. Pull the image, mount weights and serve. OpenAI-compatible API on :8000.

GPU	Image tag
NVIDIA (CUDA 12+)	`:cuda12`
AMD (ROCm 6+)	`:rocm`

Drop a GGUF model target into server/models/ first, then :8000/v1/chat/completions. Full tutorial in the Docker blog.

Install and run:

# 1. Pull the image for your GPU
docker pull ghcr.io/luce-org/lucebox-hub:cuda12   # NVIDIA
docker pull ghcr.io/luce-org/lucebox-hub:rocm     # AMD

# 2. Download a target model into server/models/ and the DFlash draft
#    into server/models/draft/ (the entrypoint only auto-discovers the
#    draft there; without it the server runs slower, target-only)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf \
  --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf \
  --local-dir server/models/draft/

# 3a. NVIDIA (CUDA 12+)
docker run --rm --gpus all -p 8000:8080 \
  -v "$PWD/server/models:/opt/lucebox-hub/server/models" \
  ghcr.io/luce-org/lucebox-hub:cuda12

# 3b. AMD (ROCm 6+, Strix Halo / RX 7900)
docker run --rm --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render --security-opt seccomp=unconfined \
  -p 8000:8080 -v "$PWD/server/models:/opt/lucebox-hub/server/models" \
  ghcr.io/luce-org/lucebox-hub:rocm

Then hit :8000/v1/chat/completions (OpenAI-compatible).

Run the Server

Default: Qwen 3.6-27B Q4_K_M target + Lucebox Q4_K_M DFlash drafter on RTX 3090. DDTree budget=22, TQ3_0 KV cache, full attention. OpenAI-compatible HTTP on :8000.

# build (CUDA 12+, CMake 3.18+)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub
cmake -B server/build -S server -DCMAKE_BUILD_TYPE=Release
cmake --build server/build --target dflash_server -j

# default weights (~18 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf --local-dir server/models/draft/

# run (TQ3_0 KV auto-enabled; set =0 to disable)
DFLASH27B_KV_TQ3=1 \
./server/build/dflash_server server/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft server/models/draft/dflash-draft-3.6-q4_k_m.gguf \
  --ddtree --ddtree-budget 22 --port 8000

Making requests

For the fastest, deterministic responses send temperature: 0 (greedy decoding gives the highest spec-decode acceptance):

curl :8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "dflash",
  "messages": [{"role": "user", "content": "Write quicksort in Python."}],
  "temperature": 0
}'

Requests that omit temperature use the model card's sampling (Qwen3.6: temperature: 1.0, top_p: 0.95, top_k: 20).

Server flags

Core

Flag	Default	Effect
`--draft <path>`	—	DFlash draft GGUF, required for speculative decode
`--port N`	`8000`	HTTP port
`--host H`	`127.0.0.1`	Bind address
`--max-ctx N`	auto-fit	KV cache size; oversizing slows prefill (FA stride over unused KV)
`--max-tokens N`	model-card	Generation cap
`--model-name S`	filename	OpenAI `model` field
`--chat-template-file <path>`	autodetect	Override Jinja template

Decode (DFlash + DDTree)

Flag	Default	Effect
`--ddtree`	off (chain)	Enable tree verify
`--ddtree-budget N`	`22`	Tree size. 22 on 3090 (default), 40 on 5090, re-sweep on GB10
`--fa-window N`	`0` / `2048` (full attention)	Sliding FA window. Leave at 0: a finite window breaks tool calls (the full-attention layers lose the system prompt/tools).
`--draft-residency {auto,persistent,request-scoped}`	`auto`	When draft weights are evicted from VRAM. `request-scoped` parks/frees them after each request's draft work (frees VRAM for the target on tight GPUs); `persistent` keeps them resident across requests; `auto` preserves current behavior while honoring the low-VRAM / `--lazy-draft` hint. Reported at `/props.runtime.draft_residency`.
`--lazy-draft`	off	Legacy alias for `--draft-residency=request-scoped` (defer draft load until first request, release after)

GPU draft top-K & verify-argmax (DFlash)

The draft-token top-K extraction and the per-step verify argmax used to run on the CPU, each requiring a full vocab × n_tokens logits copy from device to host (D2H) every speculation step. These two env flags move both onto the GPU, reading the logits in place on the device buffer and skipping the bulk D2H. Both are on by default and only take effect on CUDA builds (the kernel is CUDA-only — on HIP/ROCm builds the flags are no-ops and the CPU path always runs). Each path validates its result and falls back to the legacy CPU computation automatically on any failure (e.g. an out-of-range index), so disabling them is only needed for debugging or A/B comparison.

Env	Default	Effect
`DFLASH_GPU_DRAFT_TOPK=1`	`1` (on)	Compute the draft model's top-K vocab indices (K in 1–8) and log-sum-exp directly on the logits device buffer. `=0` forces the legacy CPU top-K (full-vocab D2H + CPU heap extract). Use `=0` to isolate the kernel when debugging or to baseline its speedup.
`DFLASH_GPU_VERIFY_ARGMAX`	on (server) / `0` (test harness)	Per-step verify argmax. In the server it is on by default and a simple on/off; `=0` forces the legacy CPU path. In `test_dflash` it is a tri-state with these values: • `0` — legacy CPU path: full `vocab × N` D2H + CPU argmax (default in the test harness). • `1` — GPU fast-path: read the in-graph batched GPU argmax (N int32s), no bulk D2H. • `2` — run both the CPU and GPU paths and report any per-step mismatches (validation mode; guards against the historical tree-verify `-1`/tie regression).

To reproduce the benchmark: baseline DFLASH_GPU_DRAFT_TOPK=0 DFLASH_GPU_VERIFY_ARGMAX=0, optimized DFLASH_GPU_DRAFT_TOPK=1 DFLASH_GPU_VERIFY_ARGMAX=1, both via python server/scripts/bench_llm.py --bench HumanEval.

Prefill compression (PFlash)

Flag / env	Default	Effect
`--prefill-compression {off,auto,always}`	`off`	When to score+compress the prompt
`--prefill-threshold N`	`32000`	In `auto`, the prompt-token count above which a single-shot prompt is compressed. Also the per-message minimum that an aged message must exceed before FlowKV compresses it on multi-turn requests. Lower it (e.g. `1024`) if you want FlowKV to act on shorter history.
`--prefill-keep-ratio F`	`0.05`	Fraction of source tokens kept (0.02 @128K, 0.10 @32K)
`--prefill-curve T:R [T:R ...]`	off (flat keep-ratio)	Piecewise keep-ratio curve, linear-interpolated over `(tokens, ratio)` breakpoints, e.g. `10000:0.5 40000:0.2 100000:0.1` (2× compression @10K, 5× @40K, 10× @100K+). Overrides `--prefill-keep-ratio`; per-session bandit override still wins.
`--prefill-drafter <gguf>`	required if on	Drafter weights (Qwen3-0.6B BF16 GGUF)
`--prefill-skip-park`	off	Keep drafter resident across requests (more VRAM, faster)
`PFLASH_FREEZE_HOT_WINDOW=N`	`2`	FlowKV: how many of the most recent messages stay verbatim. Everything older than this window (but after the system prompt) is compressed once and cached. Larger = more recent context kept uncompressed.
`DFLASH_FP_USE_BSA=1`	`0`	Dispatch sparse FA through BSA (sm_80+); required for headline 10.4×
`DFLASH_FP_ALPHA=0.85`	`0.12`	Block-selection threshold; higher = stricter = fewer K-blocks
`DFLASH_FP_PROFILE=1`	`0`	Per-stage timing log

When compression is on, the request path picks one of three modes automatically, so they never stack: the first turn is sent verbatim (the system prompt stays as a stable cache anchor), multi-turn continuations use FlowKV (only the aged history is compressed, recent turns kept verbatim, so the disk prefix cache from --prefix-cache-slots keeps hitting), and a single oversized prompt with no prior turns uses whole-prompt PFlash. With --prefill-compression off the request path is identical to a build without compression.

KV cache

Flag / env	Default	Effect
`--cache-type-k <t>` / `--cache-type-v <t>`	env-driven	Per-side quant override: `f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0`
`DFLASH27B_KV_TQ3=1`	(default)	Preset TQ3_0 K+V (3.5 bpv, fits 256K @ 24 GB)
`DFLASH27B_KV_Q4=1`	off	Q4_0 K+V (4.5 bpv, legacy, ~128K ceiling)
`--prefix-cache-slots N`	—	Live prefix-cache slot count
`--kv-cache-dir <path>`	—	Persist prefix cache to disk
`--kv-cache-budget N`	—	On-disk cache size cap

Bounded KV residency (KVFlash)

Pages the attention KV cache through a fixed pool of GPU slots; cold 64-token chunks live in host RAM, bit-exact and recallable. Decode speed stops depending on context length and resident KV stays pool-sized at any context. Off by default; works on every model family. Drafter-scored residency is the default on every family: the server finds the Qwen3-0.6B drafter next to the model (or via --prefill-drafter) and lazy-loads it as the relevance scorer that decides which chunks stay resident — non-qwen targets (laguna, gemma4) bridge the tokenizer gap by re-tokenizing the context text for the drafter. LRU is the fallback when no drafter is present, or the explicit choice via --kvflash-policy lru. Per-model numbers in Luce KVFlash →.

Flag / env	Default	Effect
`--kvflash <tokens\|auto>`	off	Resident pool size. `auto` sizes from the GPU: half of free VRAM after weights and reserves, at the model's KV density, capped where decode speed stays near the flat optimum (default 16384, override `DFLASH_KVFLASH_MAX_POOL`) and at `--max-ctx`. Explicit values are rounded to 256, clamped to `--max-ctx`, floored at the protected minimum so eviction always has a victim.
`--kvflash-policy {drafter,lru,qk}`	`drafter`	Residency policy. `lru` opts out of the drafter probe/load (recency-only paging, no extra VRAM). `qk` (qwen35 only) scores residency from the target model's own pooled keys against the decode query, matching drafter-grade recall at a fraction of the rescore cost with no extra model resident.
`--kvflash-tau N`	`64`	Reselect interval floor (drafter policy only); the effective interval grows with history to cap rescore overhead.
`DFLASH_KVFLASH=N`	off	Env equivalent of `--kvflash`.
`DFLASH_KVFLASH_TAU=N`	`64`	Env equivalent of `--kvflash-tau`.

Thinking budget

Flag	Default	Effect
`--think-max-tokens N`	model-card	Max tokens inside `<think>…</think>`
`--default-max-tokens N`	model-card	Default response cap
`--hard-limit-reply-budget N`	`4096`	Hard ceiling; injects `</think>` close near limit
`--reasoning-effort-{low,medium,high,x-high,max} N`	model-card	OpenAI-style effort tiers

Multi-GPU / IPC

Flag / env	Default	Effect
`--target-device <dev>`	`cuda:0`	Target backend (e.g. `cuda:0`, `hip:0`)
`--draft-device <dev>`	same as target	Draft backend; mixed backend needs `--draft-ipc-bin`
`--target-gpu N`	`0`	Target GPU index
`--draft-gpu N`	same as target	Draft GPU index; offload draft to a second GPU
`--target-devices <list>` / `--target-layer-split`	single GPU	Layer-split target across GPUs
`--draft-ipc-bin <path>`	—	Out-of-process draft binary (mixed CUDA/HIP)
`--peer-access`	off	Enable P2P between target GPUs
`--chunk N`	backend default	Prefill ubatch size
`--no-cors`	CORS on	Disable CORS headers
`DFLASH_TARGET_GPU=N`	`0`	Env var equivalent of `--target-gpu`
`DFLASH_DRAFT_GPU=N`	same as target	Env var equivalent of `--draft-gpu`
`DFLASH_MODEL_NAME=<name>`	`dflash`	Env var equivalent of `--model-name`; sets the `/v1/models` id and selects the matching `share/model_cards/<name>.json`

MoE expert offload (Spark)

For MoE targets (laguna, qwen35/qwen36) whose experts don't fit in VRAM. --spark self-tunes the hot/cold expert split, a bounded GPU cache, and the placement profile from live traffic; decode stays near the all-GPU ceiling via the default single-graph fused path. See Luce Spark →.

Flag / env	Default	Effect
`--spark`	off	One-flag autotune: enable the bounded expert cache, size it from the VRAM target, auto-load and keep persisting a placement profile (`<model>.gguf.spark.csv`).
`--spark-vram <GiB>`	whole card	Total VRAM Spark may use; it sizes the hot tier + cache + KV under this cap.
`DFLASH_SPARK=1`	off	Env equivalent of `--spark`.
`DFLASH_SPARK_VRAM_MB=N`	—	Env equivalent of `--spark-vram` (in MB).
`DFLASH_<ARCH>_EXPERT_CACHE=1`	off	Bounded GPU expert cache (`<ARCH>` = `LAGUNA` or `QWEN35MOE`); cold-miss falls toward 0 after warmup.
`DFLASH_<ARCH>_CACHE_SLOTS=N`	auto	Cache slots per layer.
`DFLASH_LAGUNA_NO_SINGLE_GRAPH=1`	off	Fall back to per-layer decode instead of the default single-graph fused hybrid.

DFlash benchmarks → · DFlash blog → · PFlash benchmarks → · PFlash blog → · Per-machine quick starts (DGX Spark, Jetson Thor, HIP) →

Run Megakernel Bench (Qwen 3.5-0.8B)

Separate Python bench; 24 layers fused into one persistent CUDA dispatch. 413 tok/s decode, 21,347 prefill, 1.87 tok/J @220W vs llama.cpp BF16.

uv sync --extra megakernel
uv run --directory megakernel python final_bench.py

Method	Prefill pp520	Decode tg128	tok/J
Megakernel `@220W`	21,347	413	1.87
llama.cpp BF16 `@350W`	11,247	267	0.76
PyTorch HF	7,578	108	n/a

Setup → · Bench → · Blog →

Blackwell (RTX 5090, DGX Spark / GB10): auto-detected by setup; NVFP4 decode path lands ~194 tok/s on GB10. See optimizations/megakernel/README.md#blackwell-sm_120--sm_121a.

Tutorials

Video tutorials for each optimization and the harness setup.


Luce Spark ▶ YouTube	Luce DFlash ▶ YouTube	Luce Turboquant ▶ YouTube
Luce Harness setup ▶ YouTube	Luce PFlash ▶ YouTube	Luce Megakernel ▶ YouTube
Luce KVFlash ▶ YouTube

Why this exists

Local AI should be the default, not a privilege. Private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to get real throughput out of it does not.

Nothing was built for local AI inference. Most machines bolt a stock GPU onto a desktop CPU and run a stock runtime, never tuning the kernels to the silicon underneath. On the same 27B model, a DGX Spark or Mac Studio leaves four to six times the real throughput on the table. General-purpose frameworks won the last decade because hand-tuning per chip cost more than it returned: one stack, decent on everything, great on nothing. Speculative decoding, speculative prefill, fused megakernels, and calibrated MoE expert offload turn idle silicon into 3-10× speedups, but they stay locked to BF16 weights on data-center GPUs. Consumer cards inherit the leftovers.

See the benchmarks and the machine at lucebox.com.

Request for Contributions

  ▮▮▮▮▮▮▮▮▮▮    HIP/CUDA kernel optimizations
  ▮▮▮▮▮▮▮▮▮▯    Speculative inference optimizations
  ▮▮▮▮▮▮▮▯▯▯    Support to new GPU/APU consumer cards
  ▮▮▮▮▮▮▮▯▯▯    Inference engine debugging
  ▮▮▮▮▮▮▯▯▯▯    Add new performance benchmarks
  ▮▮▮▮▮▯▯▯▯▯    Improvements for harnesses integration

Citation

@software{lucebox_2026,
  title  = {Fast LLM speculative inference server for specific consumer hardware.},
  author = {Lucebox},
  url    = {https://github.com/Luce-Org/lucebox-hub},
  year   = {2026}
}

Community

Discord: discord.gg/yHfswqZmJQ
Website: lucebox.com
Issues: github.com/Luce-Org/lucebox-hub/issues
Blog: lucebox.com/blog

_{Apache 2.0 · Lucebox.com}

Name		Name	Last commit message	Last commit date
Latest commit History 1,155 Commits
.github		.github
assets		assets
docs/specs		docs/specs
harness		harness
optimizations		optimizations
scripts		scripts
server		server
share/model_cards		share/model_cards
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
Dockerfile.rocm		Dockerfile.rocm
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-bake.hcl		docker-bake.hcl
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Inference Engine Optimizations

Supported Models & Drafters

Tested Machines (GPU/APU)

Quick Start On Harnesses

Quick Start With Docker

Run the Server

Making requests

Server flags

Run Megakernel Bench (Qwen 3.5-0.8B)

Tutorials

Why this exists

Request for Contributions

Citation

Community

About

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Inference Engine Optimizations

Supported Models & Drafters

Tested Machines (GPU/APU)

Quick Start On Harnesses

Quick Start With Docker

Run the Server

Making requests

Server flags

Run Megakernel Bench (Qwen 3.5-0.8B)

Tutorials

Why this exists

Request for Contributions

Citation

Community

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages