Skulk 1.2 - Clustered Speculative Decoding #313

2026-06-18T01:35:42Z

Foxlight Foundation Admin
Jun 18, 2026

We're excited to ship Skulk 1.2, which brings multi-token-prediction (MTP) speculative decoding to single-node, pipeline-sharded, and tensor-parallel deployments, including heterogeneous multi-node rings. It ships alongside the first release of the Skulk Weights Publisher, tooling that makes MTP weight sidecars available. By decoupling MTP weights from the models and offloading them to sidecars, Skulk lets you use virtually any quantization, fine-tune, or abliteration that your local needs dictate. You don't need specially prepared MTP builds: if the underlying model supports MTP, Skulk should support it.

Skulk's benchmarks align with or exceed those of other inference platforms, while running that same speculation across the cluster.

Compares Favorably with Datacenter Gains

The headline result is where these numbers land, and on what hardware, compared with what the literature reports for much more powerful hardware.

Production native-MTP serving on datacenter GPUs is commonly reported in the +30% to +80% range. Skulk's measured gains on consumer Apple Silicon sit inside that band across most configurations - and at the top end, exceed it. Gemma 4 12B across a 2-node pipeline reaches +81%. Gemma 4 26B MoE on a single machine reaches +120%.

The full picture, per configuration

The speedup depends on model family and placement topology, and we'd rather show the spread than average it into a single misleading number.

A few things worth reading out of this:

Gemma 4 MoE, single node is the biggest win — cheap rounds, a strong companion drafter, no cross-node coordination cost.
Dense models across a pipeline (Gemma 4 12B, Qwen3.5 27B) come next. A dense model pays the full forward pass on every rank, and chained drafting amortizes the ring coordination cost well enough to come out clearly ahead.
Embedded-head families (the smaller Qwen3.5 configs) land lower but still solidly positive. The multi-node case for these is where the current ceiling sits.

Every benchmark is greedy decoding, 200-token completions, median of three runs, and verification-exact.
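To make the greedy case concrete, here is a minimal sketch of one verification round, reading "verification-exact" as "the emitted tokens are identical to what plain greedy decoding would produce". The `target_argmax` callable and the toy model are hypothetical stand-ins for illustration, not Skulk's actual code. A real engine would score all draft positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    target_argmax: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one speculative round under greedy decoding.

    Accepts the longest prefix of `draft` that matches the target model's own
    greedy choice, then appends one token from the target: the correction at
    the first mismatch, or a bonus token when every draft token was accepted.
    """
    emitted: List[int] = []
    for token in draft:
        expected = target_argmax(list(context) + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, stop
            return emitted
        emitted.append(token)
    # Every draft token matched, so the target's next prediction comes for free.
    emitted.append(target_argmax(list(context) + emitted))
    return emitted


# Hypothetical toy target model: always predicts "last token + 1".
def toy_target(tokens: Sequence[int]) -> int:
    return tokens[-1] + 1


partial = verify_greedy([1], [2, 3, 9], toy_target)  # third draft token is wrong
full = verify_greedy([1], [2, 3, 4], toy_target)     # every draft token is right
```

Because every emitted token is either confirmed by or taken from the target model, a wrong draft costs throughput but never changes the output.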

These benchmarks were not taken with RDMA - we would love to see some community benchmarks on hardware with RDMA enabled.

Pre-tuned

A speculative decoder's speedup depends heavily on how many tokens it drafts per round. Draft too shallow and you leave throughput on the table; draft too deep and acceptance falls off, so you pay for parallel rows that get rejected. The peak is different for every model, and we measured it rather than picking a default and hoping. That said, the best draft depth may differ on your hardware; for this reason, draft depth is a property on the model card and you can change it. Future releases will add a model card editor to the dashboard.
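The shape of that tradeoff can be sketched with a simple analytic model. It is a rough illustration under stated assumptions, not how the shipped depths were chosen (those were measured): each draft token is assumed to be accepted independently with probability `alpha`, and each extra draft position is assumed to add a fixed relative cost `cost_per_draft` to the round. Both parameters and all function names are hypothetical.

```python
def expected_tokens_per_round(alpha: float, depth: int) -> float:
    """Expected tokens emitted per round: accepted drafts plus one target token.

    Under the independence assumption this is the geometric sum
    1 + alpha + alpha**2 + ... + alpha**depth.
    """
    if alpha >= 1.0:
        return float(depth + 1)
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, depth: int, cost_per_draft: float) -> float:
    """Tokens per round divided by the round's cost relative to one plain step."""
    round_cost = 1.0 + cost_per_draft * depth
    return expected_tokens_per_round(alpha, depth) / round_cost


def best_depth(alpha: float, cost_per_draft: float, max_depth: int = 8) -> int:
    """Draft depth in [0, max_depth] that maximizes the estimated speedup."""
    return max(
        range(max_depth + 1),
        key=lambda k: estimated_speedup(alpha, k, cost_per_draft),
    )


# Illustrative inputs only: a well-matched drafter versus a weaker one.
deep = best_depth(alpha=0.8, cost_per_draft=0.1)
shallow = best_depth(alpha=0.5, cost_per_draft=0.1)
```

In this model a drafter with higher acceptance supports a deeper draft before the extra rows stop paying for themselves, which is consistent with the peak differing per model and per machine.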

Workload transparency

Short greedy benchmarks are an easy case. The two questions a skeptic asks next are: does it survive long generations, and does it survive sampling?

Skulk's gains soften at length and under sampling - that's expected because longer runs and stochastic acceptance both cost a little - but they stay substantial. Gemma 4 12B across a pipeline holds +60% at 1000 tokens and +54% at temperature 0.7. And the sampling path preserves the output distribution exactly.
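For readers wondering how a sampling path can preserve the output distribution exactly, the standard speculative sampling rule does it with an accept-or-resample step. The sketch below shows that textbook rule for a single token position; we are not claiming it mirrors Skulk's internals, and the function names are ours. `p` is the target model's distribution and `q` is the drafter's.

```python
import random
from typing import List, Sequence


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # total is zero only when p == q, in which case rejection never happens.
    return [r / total for r in raw] if total > 0 else list(p)


def speculative_sample(p: Sequence[float], q: Sequence[float], rng: random.Random) -> int:
    """Draw one token: propose from q, accept with probability min(1, p/q)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def output_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact distribution of speculative_sample's result, computed analytically."""
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]  # propose x and accept it
    reject_prob = 1.0 - sum(accept_mass)                 # any proposal rejected
    res = residual(p, q)
    return [a + reject_prob * r for a, r in zip(accept_mass, res)]
```

The analytic version makes the guarantee visible: the accepted mass is min(p, q), the rejected mass is redistributed as max(0, p - q), and the two add back up to p for every token, however poor the drafter is.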

Sidecar notes

Standard quantization pipelines strip the MTP tensors to save download size. Accordingly, the quantized checkpoint you pull from Hugging Face usually doesn't have the heads Skulk or other inference frameworks need. The answer for most providers is to bake the heads into their GGUFs. But this strongly restricts your access to models and quantizations that support MTP. Skulk takes a different route, handled by the Skulk Weights Publisher (SWP) shipping alongside 1.2: extract the heads once from the original BF16 checkpoint and publish them as a standalone sidecar - one per base model, shared across every quantization of it. Skulk loads the sidecar next to whatever quant you're running, and the two are versioned independently. For Gemma 4, there's nothing to extract - Google already publishes the assistant model, so SWP just records the pairing in its catalog and Skulk resolves it at runtime.
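As an illustration of the "one sidecar per base model, shared across every quantization" idea, here is a small sketch of a catalog lookup keyed by base model. Everything in it is hypothetical: the class names, fields, and the `example-org/...` repository names are placeholders, and the real SWP catalog format may look quite different.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SidecarEntry:
    base_model: str    # upstream model the draft weights belong to
    sidecar_repo: str  # where the draft weights are published
    kind: str          # "extracted" (heads pulled from BF16) or "paired" (upstream assistant model)


@dataclass(frozen=True)
class QuantizedModel:
    repo: str          # the quantized checkpoint being run
    base_model: str    # the base model it was derived from


# Placeholder catalog: keyed by base model, never by individual quantization.
CATALOG: Dict[str, SidecarEntry] = {
    "example-org/base-a": SidecarEntry(
        "example-org/base-a", "example-org/base-a-mtp-sidecar", "extracted"
    ),
    "example-org/base-b": SidecarEntry(
        "example-org/base-b", "example-org/base-b-assistant", "paired"
    ),
}


def resolve_sidecar(
    model: QuantizedModel, catalog: Dict[str, SidecarEntry]
) -> Optional[SidecarEntry]:
    """Look up the sidecar by base model; None means no MTP sidecar is known."""
    return catalog.get(model.base_model)


# Two different quantizations of the same base resolve to the same sidecar.
four_bit = QuantizedModel("example-org/base-a-4bit", "example-org/base-a")
eight_bit = QuantizedModel("example-org/base-a-8bit", "example-org/base-a")
shared = resolve_sidecar(four_bit, CATALOG) == resolve_sidecar(eight_bit, CATALOG)
```

Keying the lookup on the base model is what lets a new quantization or fine-tune pick up speculation without anyone republishing the draft weights.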

We've also created a Foxlight Hugging Face organization, which ships sidecars today for the MTP-capable open-weight models we've identified: the Qwen3.5 and Qwen3.6 families, Qwen3-Next 80B-A3B, DeepSeek V3.2 Exp Base, DeepSeek V4 Flash and Pro, and NVIDIA Nemotron-3-Super 120B. Seventeen in total, each with a provenance card pinning the exact source SHA and preserving the upstream license unchanged. More are being added daily. And we've made SWP available so that you can publish your own sidecars to your own repos.

What else is in 1.2

The speculation work is the headline, but Skulk 1.2 also substantially hardens Skulk against a wide variety of failure classes, with work on failover and seamless recovery, memory protection and recovery, crash-loop protection, and much more. See the release notes for details.

What's next

SWP 1.0 also lays the groundwork for LARQL vindex publication. LARQL is a weight format that allows Skulk to split inference across GPU nodes and CPU/high-memory servers, decoupling model attention from its FFN, and in so doing decoupling the parts of the model that need GPU resources from the parts that just need memory. The tooling is ready on the SWP side; LARQL itself is next on the Skulk roadmap.

Links

Skulk: github.com/Foxlight-Foundation/Skulk

Release notes: foxlight-foundation.github.io/Skulk/release-notes/1.2.0

SWP: foxlight-foundation.github.io/skulk-weights-publisher

MTP Sidecars: huggingface.co/FoxlightAI
