Skulk 1.2 - Clustered Speculative Decoding #313
Foxlight Foundation Admin
announced in
Announcements
We're excited to ship Skulk 1.2, which brings multi-token-prediction (MTP) speculative decoding to single-node, pipeline-sharded, and tensor-parallel placements, including heterogeneous multi-node rings. It ships alongside the first release of the Skulk Weights Publisher, tooling that makes MTP weight sidecars available. By decoupling MTP weights from the models and offloading them to sidecars, Skulk lets you use virtually any quantization, fine-tune, or abliteration that your local needs dictate. You don't need specially prepared MTP builds: if the underlying model supports MTP, Skulk should support it.
Skulk's measured speedups match or exceed those reported for other inference platforms, and Skulk runs that same speculation across a cluster.
Compares Favorably with Datacenter Gains
The headline result is where these numbers land given the hardware they run on, compared with what the literature reports for much more capable hardware.
Production native-MTP serving on datacenter GPUs is commonly reported in the +30% to +80% range. Skulk's measured gains on consumer Apple Silicon sit inside that band across most configurations - and at the top end, exceed it. Gemma 4 12B across a 2-node pipeline reaches +81%. Gemma 4 26B MoE on a single machine reaches +120%.
The full picture, per configuration
The speedup depends on model family and placement topology, and we'd rather show the spread than average it into a single misleading number.
A few things worth reading out of this:
Every benchmark is greedy decoding, 200-token completions, median of three runs, and verification-exact.
These benchmarks were not taken with RDMA - we would love to see some community benchmarks on hardware with RDMA enabled.
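For greedy decoding, "verification-exact" is usually taken to mean that the speculative output matches, token for token, what the target model would have produced on its own. The sketch below illustrates that property with a toy model. It is a minimal sketch, not Skulk's implementation: `verify_greedy`, `generate`, `target_next`, and `draft_fn` are hypothetical names, and the target is called once per position for clarity, where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: maps a token context to the target model's argmax token.
TargetFn = Callable[[Sequence[int]], int]
# Hypothetical signature: maps (context, depth) to a list of drafted tokens.
DraftFn = Callable[[Sequence[int], int], List[int]]


def verify_greedy(prefix: Sequence[int], draft: Sequence[int],
                  target_next: TargetFn) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    Always returns at least one token: either the target's correction at the
    first mismatch, or a bonus token after a fully accepted draft.
    """
    context = list(prefix)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)  # replace the wrong draft token and stop
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next(context))  # every draft token matched
    return accepted


def generate(prefix: Sequence[int], n_new: int, target_next: TargetFn,
             draft_fn: Optional[DraftFn] = None, depth: int = 3) -> List[int]:
    """Greedy generation, with or without a drafter."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        if draft_fn is None:
            out.append(target_next(out))
        else:
            out.extend(verify_greedy(out, draft_fn(out, depth), target_next))
    return out[: len(prefix) + n_new]
```

Because every emitted token is either confirmed or supplied by the target, the drafter can only change how many tokens arrive per round, never which tokens arrive.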
Pre-tuned
A speculative decoder's speedup depends heavily on how many tokens it drafts per round. Draft too shallow and you leave throughput on the table; draft too deep and acceptance falls off, so you pay for parallel rows that get rejected. The peak is different for every model, and we measured it rather than picking a default and hoping. That said, the best draft depth may differ on your hardware; for this reason, draft depth is a property on the model card and you can change it. Future releases will add a model card editor to the dashboard.
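The shallow-versus-deep trade-off can be illustrated with a common textbook simplification: assume each drafted token is accepted independently with probability `alpha`, and that each extra drafted token adds a fixed fraction `cost_per_draft` of a normal decode step to the round. Both assumptions, the function names, and the numbers are illustrative only; they are not Skulk measurements, and real acceptance rates are neither constant nor independent, which is why measuring is preferable to modelling.

```python
def expected_tokens(alpha: float, depth: int) -> float:
    """Expected tokens emitted per round when drafting `depth` tokens.

    Geometric series 1 + alpha + ... + alpha**depth: the round always yields
    one token from the target, plus each draft token that survives.
    """
    if alpha == 1.0:
        return float(depth + 1)
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def speedup(alpha: float, depth: int, cost_per_draft: float) -> float:
    """Tokens per round divided by the round's cost relative to one plain step."""
    return expected_tokens(alpha, depth) / (1.0 + cost_per_draft * depth)


def best_depth(alpha: float, cost_per_draft: float, max_depth: int = 8) -> int:
    """Depth in [0, max_depth] with the highest modelled speedup."""
    return max(range(max_depth + 1),
               key=lambda k: speedup(alpha, k, cost_per_draft))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        k = best_depth(alpha, cost_per_draft=0.1)
        print(f"alpha={alpha}: best depth {k}, "
              f"modelled speedup {speedup(alpha, k, 0.1):.2f}x")
```

Under this model a low acceptance rate peaks at a shallow depth and then declines, while a high acceptance rate keeps rewarding deeper drafts, which is consistent with the peak differing per model.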
Workload transparency
Short greedy benchmarks are an easy case. The two questions a skeptic asks next are: does it survive long generations, and does it survive sampling?
Skulk's gains soften at length and under sampling - that's expected because longer runs and stochastic acceptance both cost a little - but they stay substantial. Gemma 4 12B across a pipeline holds +60% at 1000 tokens and +54% at temperature 0.7. And the sampling path preserves the output distribution exactly.
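One standard way to keep the output distribution exact under sampling is the rejection rule from the speculative sampling literature: accept a drafted token `x` with probability `min(1, p(x) / q(x))`, where `p` is the target distribution and `q` the draft distribution, and on rejection resample from the normalized residual `max(0, p - q)`. The announcement does not say which scheme Skulk uses, so the following is a sketch of that general technique with hypothetical names, shown for a single token position.

```python
import random
from collections import Counter
from typing import List, Sequence


def speculative_sample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Return one token index distributed exactly according to p.

    p: target model probabilities; q: draft model probabilities.
    """
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]  # the drafter proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # draft accepted
    # Draft rejected: sample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]


def empirical(p: Sequence[float], q: Sequence[float],
              n: int, seed: int = 0) -> List[float]:
    """Empirical token frequencies over n draws, for comparison against p."""
    rng = random.Random(seed)
    counts = Counter(speculative_sample(p, q, rng) for _ in range(n))
    return [counts[i] / n for i in range(len(p))]
```

With a poor drafter more proposals are rejected and throughput drops, but the frequencies returned by `empirical` still converge to `p` rather than `q`.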
Sidecar notes
Standard quantization pipelines strip the MTP tensors to save download size. As a result, the quantized checkpoint you pull from Hugging Face usually doesn't have the heads that Skulk or other inference frameworks need. Most providers respond by baking the heads into their GGUFs, but this strongly restricts which models and quantizations you can use with MTP. Skulk takes a different route, handled by the Skulk Weights Publisher (SWP) shipping alongside 1.2: extract the heads once from the original BF16 checkpoint and publish them as a standalone sidecar - one per base model, shared across every quantization of it. Skulk loads the sidecar next to whatever quant you're running, and the two are versioned independently. For Gemma 4, there's nothing to extract - Google already publishes the assistant model, so SWP just records the pairing in its catalog and Skulk resolves it at runtime.
The Foxlight Hugging Face organization, which we created for this release, ships sidecars today for the MTP-capable open-weight models we've identified: the Qwen3.5 and Qwen3.6 families, Qwen3-Next 80B-A3B, DeepSeek V3.2 Exp Base, DeepSeek V4 Flash and Pro, and NVIDIA Nemotron-3-Super 120B. Seventeen in total, each with a provenance card pinning the exact source SHA and preserving the upstream license unchanged. More are being added daily. We've also made SWP available so that you can publish your own sidecars to your own repos.
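The extract-once, load-alongside idea can be sketched with plain dictionaries standing in for checkpoints. This is only an illustration of the concept: the `mtp.` name prefix, the function names, and the data layout are all hypothetical, real MTP tensor names differ per model family, and none of this reflects SWP's or Skulk's actual file formats or APIs.

```python
from typing import Any, Dict, Tuple

# Hypothetical naming convention, for illustration only.
MTP_PREFIX = "mtp."

Checkpoint = Dict[str, Any]  # tensor name -> tensor (any placeholder value here)


def split_checkpoint(tensors: Checkpoint,
                     prefix: str = MTP_PREFIX) -> Tuple[Checkpoint, Checkpoint]:
    """Split a full-precision checkpoint into (base, sidecar) by tensor name."""
    sidecar = {k: v for k, v in tensors.items() if k.startswith(prefix)}
    base = {k: v for k, v in tensors.items() if not k.startswith(prefix)}
    return base, sidecar


def load_with_sidecar(quantized: Checkpoint, sidecar: Checkpoint) -> Checkpoint:
    """Combine any quantization of the base model with the shared sidecar."""
    overlap = quantized.keys() & sidecar.keys()
    if overlap:
        raise ValueError(f"sidecar would overwrite base tensors: {sorted(overlap)}")
    return {**quantized, **sidecar}
```

The point of the split is that the sidecar depends only on the base model, so the same sidecar dictionary can be merged with any quantized variant whose tensor names do not collide with it.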
What else is in 1.2
The speculation work is the headline, but Skulk 1.2 also substantially hardens Skulk against a wide variety of failure classes, with work on failover and seamless recovery, memory protection and recovery, crash-loop protection, and much more. See the release notes for details.
What's next
SWP 1.0 also lays the groundwork for LARQL vindex publication. LARQL is a weight format that allows Skulk to split inference across GPU nodes and CPU/high-memory servers, decoupling model attention from its FFN, and in so doing decoupling the parts of the model that need GPU resources from the parts that just need memory. The tooling is ready on the SWP side; LARQL itself is next on the Skulk roadmap.
Links
Skulk: github.com/Foxlight-Foundation/Skulk
Release notes: foxlight-foundation.github.io/Skulk/release-notes/1.2.0
SWP: foxlight-foundation.github.io/skulk-weights-publisher
MTP Sidecars: huggingface.co/FoxlightAI