Lucebox

lucebox.com · Discord · Blog

MIT · CUDA 12+ · C++17

Open LLM inference, rewritten by hand for one specific chip at a time.
Kernels, speculative decoding, and quantization, tailored per target.
We don't wait for better silicon. We rewrite the software.


Inside the box

Two projects today, more coming. Each one is a self-contained release with its own benchmarks and paper-style writeup.



01 · Megakernel

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers of Qwen 3.5-0.8B run in a single CUDA dispatch: 1.87 tok/J on a 2020 GPU, matching Apple's latest silicon on efficiency at twice the throughput.

git clone https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/megakernel
pip install -e .
python final_bench.py
                         Prefill pp520 (tok/s)   Decode tg128 (tok/s)   tok/J
  Megakernel @220W                      37,800                    413    1.87
  llama.cpp BF16 @350W                  11,247                    267    0.76
  PyTorch HF                             7,578                    108     n/a
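The tok/J column follows directly from decode throughput over board power. A quick sanity check, assuming decode ran near each configuration's power cap (measured draw can differ slightly, hence the tolerance):

```python
# Sanity-check tokens-per-joule from the benchmark table:
# tok/J = (decode tok/s) / (watts), since J/s = W.
runs = {
    "Megakernel @220W": (413, 220),   # (decode tok/s, power cap in W)
    "llama.cpp BF16 @350W": (267, 350),
}

for name, (tok_s, watts) in runs.items():
    print(f"{name}: {tok_s / watts:.2f} tok/J")
```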

What makes it work: 82 blocks, 512 threads, one persistent kernel. No CPU round-trips between layers. Weights streamed straight from HuggingFace. Cooperative grid sync instead of ~100 kernel launches per token. Power ceiling hit before compute ceiling, so DVFS converts tight execution straight into saved watts.
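The launch-overhead point can be seen with a toy analogy: plain Python threads standing in for CUDA blocks, a `threading.Barrier` standing in for cooperative grid sync. The real kernel is CUDA; this only mirrors the control-flow shape of one persistent launch replacing per-layer launches.

```python
import threading

NUM_BLOCKS, NUM_LAYERS = 4, 24          # toy sizes; real run: 82 blocks, 24 layers
barrier = threading.Barrier(NUM_BLOCKS) # stands in for cooperative grid sync
trace = []                              # records (layer, block) completion order
lock = threading.Lock()

def persistent_block(block_id):
    # One "launch" runs every layer; a barrier between layers replaces
    # the ~100 per-token kernel launches a layer-per-kernel design needs.
    for layer in range(NUM_LAYERS):
        with lock:
            trace.append((layer, block_id))
        barrier.wait()                  # all blocks finish layer L before L+1

threads = [threading.Thread(target=persistent_block, args=(b,))
           for b in range(NUM_BLOCKS)]
for t in threads: t.start()
for t in threads: t.join()

# The barrier guarantees layers complete in order across all blocks.
assert [l for l, _ in trace] == sorted(l for l, _ in trace)
```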

Full writeup → · Benchmarks → · Blog post →


02 · DFlash 27B

First GGUF port of DFlash speculative decoding. Qwen3.5-27B at 130 tok/s on a single RTX 3090 (Q4_K_M target + BF16 draft). 128K context in 24 GB. 3.5× faster than chain speculative decoding, 2.9× faster than SGLang AWQ on the same hardware.

                     AR (tok/s)   DFlash+DDTree (tok/s)   Speedup
  HumanEval              37.4           130.7                3.49×
  Math500                37.4           111.2                2.97×
  GSM8K                  37.6            97.0                2.58×
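The speedup column is just the ratio of the two throughput columns; reproducing it with the numbers from the table above:

```python
# Recompute the speedup column from the table's throughput numbers:
# speedup = DFlash+DDTree tok/s divided by autoregressive tok/s.
results = {
    "HumanEval": (37.4, 130.7),   # (AR tok/s, DFlash+DDTree tok/s)
    "Math500":   (37.4, 111.2),
    "GSM8K":     (37.6, 97.0),
}

for task, (ar, dflash) in results.items():
    print(f"{task}: {dflash / ar:.2f}x")
```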

The constraint that shaped the project. AWQ INT4 of Qwen3.5-27B plus the BF16 draft doesn't leave room for the DDTree verify state on a 24 GB card. Q4_K_M GGUF (14.9 GB target) is the largest format that fits target + 3.46 GB draft + budget=22 tree state + KV cache in 24 GB on the RTX 3090. Picking it forced a new port on top of ggml, since no public DFlash runtime supports a GGUF target.
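The fit claim above can be sketched as a budget. The target and draft sizes are from the text; the tree-state and KV-cache figures below are illustrative placeholders, not measured numbers:

```python
# VRAM budget sketch for the RTX 3090 (24 GB), all sizes in GB.
# target and draft sizes come from the writeup; tree_state and
# kv_cache are ILLUSTRATIVE placeholders, not measured values.
VRAM_GB = 24.0

budget = {
    "target Q4_K_M":     14.9,   # from the writeup
    "draft BF16":         3.46,  # from the writeup
    "tree state (b=22)":  0.5,   # placeholder
    "KV cache Q4_0":      3.5,   # placeholder (128K context)
}

used = sum(budget.values())
print(f"{used:.2f} GB used, {VRAM_GB - used:.2f} GB headroom")
```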

What we built vs what we didn't. The algorithms are not ours:

  • DFlash (z-lab, 2025): block-diffusion draft conditioned on target hidden states.
  • DDTree (Ringel et al., 2025): tree-structured verify that beats chain verify at the same compute budget.

What we ported and tuned:

  • C++/CUDA decode engine on top of ggml (no libllama, no Python runtime, Q4_K_M target path).
  • Three custom CUDA kernels for tree-aware SSM state rollback: ggml_ssm_conv_tree, ggml_gated_delta_net_tree, ggml_gated_delta_net_tree_persist.
  • DDTree budget swept for RTX 3090 + Q4_K_M target: budget=22 is the sweet spot.
  • Q4_0 KV cache + sliding target_feat ring to fit 128K context in 24 GB with a ~3% acceptance-length (AL) hit.
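The tree-aware rollback those custom kernels implement can be illustrated with a toy recurrent state. This is a conceptual Python sketch of snapshot-and-restore over a draft tree, not the actual ggml_*_tree ops; `step` is a stand-in for one DeltaNet/SSM update:

```python
# Toy illustration of tree-aware recurrent-state rollback:
# snapshot the SSM state at each draft-tree node, then restore
# the state of the deepest ACCEPTED node after verification.

def step(state, token):
    # stand-in for one recurrent (DeltaNet/SSM) state update
    return state + token

def explore(state, tree, path=()):
    """Depth-first walk of the draft tree; snapshot state per node."""
    snapshots = {path: state}
    for i, (token, children) in enumerate(tree):
        snapshots.update(explore(step(state, token), children, path + (i,)))
    return snapshots

# Draft tree: root branches into tokens 1 and 2; token 1 continues with 3.
tree = [(1, [(3, [])]), (2, [])]
snaps = explore(0, tree)

accepted = (0, 0)          # verifier accepted the 1 -> 3 branch
state = snaps[accepted]    # roll back to that node's state; siblings discarded
assert state == 0 + 1 + 3
```

The real kernels do the same bookkeeping in-place on GPU state tensors instead of copying full snapshots per node.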

Full writeup → · Benchmarks → · Blog post →


Why this exists

Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't.

General-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Most of the silicon's capability stays on the floor.

AI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox is where we publish them, one chip and one model family at a time. MIT source, full writeup, reproducible benchmarks.


Quickstart

git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub

# megakernel (Qwen 3.5-0.8B, batch 1)
cd megakernel && pip install -e . && python final_bench.py && cd ..

# dflash 27B (Qwen 3.5-27B Q4_K_M + z-lab draft, RTX 3090)
cd dflash
cmake -B build -S . -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j
huggingface-cli download unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/
python3 scripts/run.py --prompt "def fibonacci(n):"

Requirements: NVIDIA GPU (Ampere+), CUDA 12+, PyTorch 2.0+. Tested on RTX 3090 (2020). Use --recurse-submodules to pull the pinned Luce-Org/llama.cpp@luce-dflash fork that carries the three tree-mode ggml ops.

Optional: find your GPU's efficiency sweet spot by capping power, e.g. sudo nvidia-smi -pl 220


Repository layout

lucebox-hub/
├── megakernel/    · fused forward pass for Qwen 3.5-0.8B
├── dflash/        · DFlash speculative decoding port for Qwen 3.5-27B on RTX 3090
└── assets/        · banners, cards, diagrams

Roadmap

  Q1 2026    ▮▮▮▮▮▮▮▮▮▮    RTX 3090 kernels & optimizations
  Q2 2026    ▮▮▮▮▮▯▯▯▯▯    Ryzen AI MAX+ 395 optimizations
  Q2 2026    ▮▮▯▯▯▯▯▯▯▯    Heterogeneous CPU + GPU latency optimizations

Citation

@software{lucebox_2026,
  title  = {Lucebox: Open LLM Inference, Rewritten by Hand for One Specific Chip at a Time},
  author = {Lucebox},
  url    = {https://github.com/Luce-Org/lucebox-hub},
  year   = {2026}
}

Per-project citations live in each subproject's README.


Inspired by


Community


MIT · Lucebox.com
