Mixed-precision quantization for LLMs. Every layer refracts into a different format based on its sensitivity.
prismaquant measures the actual per-layer curvature of the loss and the
actual per-(layer, format) quantization error, then runs a proper
multi-choice knapsack to choose each layer's format under a total-bit
budget. It produces a compressed-tensors checkpoint that vLLM serves
natively — no custom runtime, no vLLM patches, no auto_round or
llmcompressor dependency at serve time.
- Formats out of the box: NVFP4 (W4A4), MXFP8 / FP8 (W8A8 dynamic
per-channel), BF16 passthrough. Extensible via
format_registry.py. - MoE: packed-expert tensors (Qwen3.5 / 3.6
gate_up_proj/down_proj, Mixtralw1/w2/w3) are handled first-class with a custom_GradNormCaptureautogradFunction— no weight-surrogate gradient accumulation, noauto_roundunfuse step. - MTP (Multi-Token-Prediction) heads are quantized end-to-end, then
exercised via vLLM's
--speculative-config method=mtpat serve time. - Qwen3.5 / 3.6 fused-sibling groups (q/k/v, gate/up,
in_proj_qkv+z,in_proj_a+b) are promoted to share aweight_global_scaleso vLLM's fused loader doesn't warn about scale divergence. - Zero vLLM patches. The output of prismaquant is a standard
compressed-tensorscheckpoint, parsed by vLLM's stock loader. No custom kernels, no forked serving stack.vllm servethe artifact and go.
Naive post-training quantization has two failure modes, both caused by the same underlying problem: the quantizer doesn't know which parts of the model are sensitive.
-
Over-preservation. Tools without a sensitivity model leave large chunks in BF16 or FP8 "to be safe" — usually
lm_head, the attention Q/K/V,down_proj, and any Linear the authors saw fail once. Every one of those is a conservative guess, not a measured decision. The result is a 5 – 6 bpp artifact that could have been 4.2 – 4.5 bpp with no quality loss. -
Over-aggression. The same tools, run with a tighter budget, cast those same genuinely-sensitive Linears to NVFP4 anyway because they have no way to tell "this 4-bit op will cost you 3 points of MMLU" apart from "this 4-bit op will cost you 0.1 points of MMLU." Quality collapses before the bit-budget savings are realized.
Both failure modes waste the most precious resource in local-LLM serving: DRAM capacity and bandwidth. On a DGX Spark (128 GB unified memory, Blackwell SM121 with NVFP4/MXFP8 tensor cores), every byte the KV cache doesn't spend on weights is a byte available for context length, batch size, or the next model in the rotation. Getting the weight format right on a per-Linear basis is the difference between running one model at high quality and running three.
prismaquant replaces those heuristics with a closed-form per-Linear cost estimate:
Δloss ≈ 0.5 · H_trace · MSE_W
where H_trace is the empirical Fisher diagonal trace (captured once
via hooks during a short calibration pass) and MSE_W is the measured
per-format round-trip error on that Linear's actual weights. The
allocator solves a standard multi-choice knapsack over those estimates
under a total-bit budget. Each bit goes where it buys the most
likelihood. The result on Qwen3.6-35B-A3B at 4.75 bpp: 124 Linears in
NVFP4, 26 in MXFP8, 252 in BF16 — a distribution no hand-crafted
heuristic would produce, and one that serves natively in vLLM with
coherent generation and valid MTP speculative decoding.
Intel's AutoRound is a strong baseline for weight-only INT4
quantization. We ran it on the same model class earlier in this
project, and the quality gap we observed wasn't because AutoRound's
rounding algorithm was worse — it's a principled sign-gradient-descent
search over rounding decisions and it produces genuinely good per-layer
integer packings. The gap was structural: AutoRound quantizes the
whole model to the same format. Every Linear becomes INT4, because
that's the API. There's no way to tell it "this specific down_proj in
layer 23 is curvature-sensitive; leave it in BF16 and spend those 12
bits somewhere less sensitive."
prismaquant operates one level up. It's not a rounding algorithm — it's
a format allocator that composes on top of RTN, AutoRound, GPTQ, or
any other per-Linear quantizer you want to plug into
format_registry.py. The FormatSpec for each format carries its own
quantize_dequantize function, so you can drop an AutoRound-generated
codebook into the pipeline and still get per-Linear mixed-precision
selection. Combined with the measured-vs-analytical cost model, that
makes the bit budget go farther at the same point on the
quality-vs-size Pareto curve.
vLLM's compressed-tensors loader accepts a specific set of
format/strategy combinations (NVFP4 tensor_group g=16, FP8 channel
W8A8, BF16 passthrough, and a few more). prismaquant's allocator is
explicitly aware of those constraints — you can't ask for a format
vLLM can't serve. That's why the initial release targets exactly the
subset vLLM supports natively (NVFP4 + FP8 W8A8 + BF16) and requires
zero vLLM changes. If vLLM's constraints change, the
format_registry.py adapts; until they do, every prismaquant artifact
is a drop-in --quantization compressed-tensors target.
The bigger opportunity — and the immediate next roadmap item — is
targeting a fixed memory footprint rather than a fixed bits-per-parameter.
Currently you ask the allocator for --target-bits 4.75. What you
actually want to ask it is "make the artifact fit 24 GB with the KV
budget I need at 8K context." The Pareto curve prismaquant already
computes has exactly this information (bits vs. Δloss vs. disk size).
Wiring a size-target mode on top is a day of work, not a research
project, and unlocks direct comparisons like:
- 3090 (24 GB) — what's the best 24 B-parameter MoE we can ship?
- DGX Spark (128 GB) — how large can an MoE get before the KV budget dominates?
Once that lands, prismaquant goes from "allocator that optimizes bpp" to "allocator that fits your hardware."
Qwen3.6-35B-A3B-MoE at target 4.75 bpp:
| Metric | Source BF16 | prismaquant | Delta |
|---|---|---|---|
| Size on disk | 70 GB | 22 GB | −69 % |
| Body format mix | 100 % BF16 | 124 × NVFP4 + 26 × MXFP8 + 252 × BF16 | |
| MTP head size | 1.7 GB (BF16) | 0.5 GB (NVFP4 experts + BF16 attn) | −68 % |
| Generation | ✓ | ✓ (coherent) | |
| Serves in vLLM | ✓ | ✓ (compressed-tensors, no patches) |
|
| MTP spec-decoding | ✓ | ✓ (n=3 draft tokens) |
vLLM backend at serve time: FLASHINFER_CUTLASS for NVFP4 MoE, CutlassFP8ScaledMMLinearKernel for the FP8 W8A8 bucket, FLASH_ATTN v2 for attention, prefix caching + FP8 KV cache enabled.
Three-way comparison against the BF16 source and RedHatAI's
Qwen3.6-35B-A3B-NVFP4 — a uniform-NVFP4 quantization of the same
base model produced by llm-compressor. All three artifacts evaluated
on the same vLLM server config (FP8 KV cache, FlashInfer backend,
prefix caching, num_concurrent=16), zero-shot, loglikelihood
scoring via lm-evaluation-harness.
| Task | Metric | BF16 (70 GB) | prismaquant 4.75 bpp (22 GB) | RedHat NVFP4 (24 GB) | Δ PQ−BF16 | Δ RH−BF16 | Δ PQ−RH |
|---|---|---|---|---|---|---|---|
| arc_easy | acc | 81.23 ± 0.80 | 80.72 ± 0.81 | 77.61 ± 0.86 | −0.51 | −3.62 | +3.11 |
| arc_easy | acc_norm | 71.76 ± 0.92 | 72.26 ± 0.92 | 69.49 ± 0.94 | +0.51 | −2.27 | +2.78 |
| arc_challenge | acc | 54.86 ± 1.45 | 54.35 ± 1.46 | 51.79 ± 1.46 | −0.51 | −3.07 | +2.56 |
| arc_challenge | acc_norm | 54.95 ± 1.45 | 55.03 ± 1.45 | 53.50 ± 1.46 | +0.08 | −1.45 | +1.54 |
| piqa | acc | 82.21 ± 0.89 | 81.94 ± 0.90 | 80.79 ± 0.92 | −0.27 | −1.41 | +1.14 |
| piqa | acc_norm | 82.97 ± 0.88 | 82.10 ± 0.89 | 81.77 ± 0.90 | −0.87 | −1.20 | +0.33 |
| hellaswag | acc | 63.43 ± 0.48 | 62.68 ± 0.48 | 62.70 ± 0.48 | −0.76 | −0.74 | −0.02 |
| hellaswag | acc_norm | 83.47 ± 0.37 | 82.91 ± 0.38 | 82.21 ± 0.38 | −0.56 | −1.25 | +0.70 |
| winogrande | acc | 75.69 ± 1.21 | 73.48 ± 1.24 | 70.80 ± 1.28 | −2.21 | −4.89 | +2.68 |
Headline numbers:
- Mean Δ vs BF16: prismaquant −0.56 pp, RedHat −2.21 pp (~4× closer to BF16 on average).
- prismaquant wins 8 / 9 metrics vs RedHat (hellaswag-acc is a 0.02 pp tie). Sign test p < 0.02.
- Biggest single-task gap: arc_easy −3.62 pp on RedHat vs −0.51 pp on prismaquant — a 3.11 pp prismaquant advantage at a combined stderr of 1.18 pp → 2.6σ, statistically significant.
- arc_easy also shows the pathology the Motivation section warned about: when every Linear gets NVFP4 with only a hand-picked ignore list, the ~5 % of genuinely sensitive Linears collapse the whole task. prismaquant's sensitivity-driven allocation keeps them in BF16 or MXFP8 and recovers 3 pp of accuracy for 2 GB less on disk.
What this says about the allocator: RedHat's checkpoint is one
format group with one regex target and 342 hand-picked ignores. At
equivalent size, prismaquant's 252 BF16 + 26 MXFP8 + 124 NVFP4 mix
(all decisions driven by measured 0.5·H·MSE_W) is strictly better
on 8 out of 9 zero-shot metrics and tied on the ninth. The
sensitivity-driven allocator is doing measurable work beyond what
uniform quantization + a curated ignore list produces.
Worth dwelling on: RedHat's artifact preserves more Linears in
BF16 than prismaquant does, yet is simultaneously larger on disk
and lower quality. The RedHat quantization_config's ignore list
contains 342 entries — the entire visual encoder (110 Linears),
all 40 MoE routers, the linear_attn.* GatedDeltaNet projections on
every body layer (~150 entries), every shared_expert.*, lm_head,
MTP head. prismaquant's artifact keeps only 252 Linears in BF16 —
90 fewer.
| BF16 Linears | MXFP8 | NVFP4 (dense) | NVFP4 (per-expert MoE) | Disk | |
|---|---|---|---|---|---|
| RedHat NVFP4 | 342 | 0 | 0 | ~20.5 k (via one regex) | 24 GB |
| prismaquant 4.75 bpp | 252 | 26 | 44 | ~20.5 k | 22 GB |
The apparent paradox — RedHat has more BF16 yet larger disk and
lower accuracy — is exactly the over-preservation failure mode the
Motivation section warned about. RedHat's quantizer can't measure
sensitivity, so it hedges: "leave everything that might be sensitive
in BF16." The 90-Linear gap between 342 and 252 is Linears RedHat
conservatively preserved but prismaquant's 0.5·H·MSE_W measurements
showed were safe to drop to MXFP8 or NVFP4. That's where the 2 GB
disk savings come from.
Crucially, the other direction holds too: of the Linears RedHat quantized to NVFP4, prismaquant's measurements flagged 26 of them as too sensitive for NVFP4 and promoted them to MXFP8 (and 252 of what RedHat quantized... wait, those are already in RedHat's BF16 list). The interesting case is really the 26 MXFP8 Linears — these are genuinely sensitive ones where NVFP4 would hurt but BF16 is overkill. RedHat can't express that distinction with a one-format scheme; prismaquant can.
The resulting accuracy pattern matches the theory exactly:
- arc_easy / arc_challenge are knowledge-recall tasks that lean heavily on the Linears prismaquant kept in BF16/MXFP8 that RedHat left NVFP4. prismaquant − RedHat ≈ +3 pp on both, 2.6σ on arc_easy.
- hellaswag / piqa lean on the bulk MoE body, which both quants put in NVFP4 identically. Margin narrows to < 1 pp, tied on hellaswag-acc.
- winogrande stresses pronoun resolution through attention — sensitive to attention-projection quantization. Both quants drop (−2.21 / −4.89), but RedHat drops >2× more because prismaquant's sensitivity measurements nudged those Linears up to MXFP8/BF16.
Over-preservation burns disk and accuracy when you can't
discriminate which Linears actually need it. Under-aggression
(quantizing genuinely sensitive Linears to recover budget) burns
accuracy worse. Both are failure modes of sensitivity-blind
quantizers. Measuring 0.5·H·MSE_W per Linear turns them into
separable decisions.
The five tasks above are commonsense / world-knowledge benchmarks that a 35 B MoE saturates. They test whether quantization catastrophically broke the model; they do not stress reasoning chains, in-context learning, or generation fluency.
What's not yet in the table and what the pre-publication roadmap calls for:
- MMLU 5-shot (especially MMLU-STEM) — reasoning-heavy, standard benchmark for PTQ papers, where compounding quantization error shows up most visibly.
- GSM8K 5-shot (generative, strict-match) — arithmetic + multi-step reasoning; the canonical "did quantization break the model?" test.
- HumanEval / MBPP — code generation, tests generation quality directly rather than loglikelihood scoring.
- MT-Bench or AlpacaEval — open-ended generation quality.
- Per-budget Pareto sweep — we've only measured 4.75 bpp. The Pareto curve in the next section is predicted Δloss from the allocator. Genuine validation requires quantizing at 4.5 and 5.0 and measuring.
- Seed sweep — single run per artifact. winogrande's −2.21 / −4.89 deltas are real-looking but should be averaged over 2-3 seeds.
Until those land, the defensible claim is the headline above: prismaquant beats uniform NVFP4 on commonsense zero-shot at a smaller size. The reasoning-heavy and generative-quality story is still pending.
The 4.75 bpp target in this artifact was a first-shot proof of concept, not a tuned optimum. It works remarkably well. This section unpacks why, and why tiny bpp bumps above pure NVFP4 buy massive quality.
NVFP4 packs 4-bit weight + 4-bit activation values into 16-element
groups sharing one FP8 scale factor. Per-parameter storage cost
works out to 4 + 8/16 = 4.5 bits/param for weights — not the
nominal 4.0 bits. A "pure NVFP4" model on disk is already at 4.5 bpp;
you don't save anything by restricting yourself to one format.
prismaquant's 4.75 bpp target is only +0.25 bpp over pure NVFP4 — roughly a 6% bit-budget increase, allocated intelligently rather than uniformly.
Predicted Δloss vs. bit budget, from pareto.csv
(allocator.py):
| Target bpp | Achieved | Format mix | Predicted Δloss | Ratio vs 4.5 |
|---|---|---|---|---|
| 4.5 | 4.643 | 173 NVFP4 + 1 MXFP8 + 228 BF16 | 4.282 | 1.00× |
| 4.75 ← | 4.758 | 124 NVFP4 + 26 MXFP8 + 252 BF16 (shipped) | 2.355 | 0.55× |
| 5.0 | 5.010 | 82 NVFP4 + 9 MXFP8 + 311 BF16 | 1.184 | 0.28× |
| 5.5 | 5.496 | 79 NVFP4 + 0 MXFP8 + 323 BF16 | 0.923 | 0.22× |
| 6.0 | 5.938 | 75 NVFP4 + 0 MXFP8 + 327 BF16 | 0.860 | 0.20× |
The curve is extremely steep near pure-NVFP4 and flattens out past ~5 bpp. Key takeaways:
- 4.5 → 4.75 bpp (+0.25 bpp, +5.6% size): predicted Δloss drops
45%. That's the quarter-bit that lets the allocator move 49
Linears out of NVFP4 and into MXFP8/BF16 — the ones where
0.5·H·MSE_Wflagged NVFP4 as expensive. - 4.75 → 5.0 bpp (+0.25 bpp, +5.3% size): another 50% drop.
- 5.0 → 6.0 bpp (+1 bpp, +20% size): only a 27% drop. Severely diminishing returns.
The Kneedle knee-detection the allocator suggests (Satopaa et al. 2011) lands right around 5.0 bpp. 4.75 sits below the knee — it's on the aggressive side of the curve, which is exactly what makes the result interesting: prismaquant at 4.75 bpp outperforms stock NVFP4 at a similar or larger effective size, because the allocator pulls the quarter-bit from the least-sensitive 90% of the model and concentrates it on the most-sensitive 10%.
The 4.75 bpp artifact's 124 NVFP4 Linears include the 80 per-expert MoE projections (representing ~805 M params × 256 experts = the model's dominant compute path at inference time). Those still go through the FLASHINFER_CUTLASS NVFP4 kernel. The 26 MXFP8 Linears go through CutlassFP8 (same family). The 252 BF16 Linears are almost entirely small norm/bias/attention-projection weights that contribute a negligible fraction of matmul FLOPs.
Decode throughput on a DGX Spark (Blackwell SM121) for our 4.75 bpp artifact is within noise of a hypothetical 4.5 bpp pure-NVFP4 artifact — you pay ~6% extra DRAM bandwidth for ~50% less quality degradation. That trade favors the quarter-bit bump almost every time.
If +0.25 bpp over pure NVFP4 halves predicted Δloss, and the next
quarter-bit halves it again, then the interesting bit budgets for
production are pure-NVFP4-equivalent + 0.25 to 0.75 bpp — not
the pure-NVFP4 point that the field has anchored on. We'll sweep
this space more thoroughly in a follow-up once MiniMax M2.7 and
3-bit support land (the same logic applies below 4 bpp: tiny bumps
above a uniform 3-bit baseline unlock large quality recovery at
~zero serving cost).
Three commands on a machine with the model cached locally:
export MODEL_PATH=/path/to/Qwen3.6-35B-A3B
export WORK_DIR=./dq-runs/qwen36
export FORMATS=NVFP4,MXFP8_E4M3,BF16
export TARGET_BITS=4.75
./prismaquant/run-pipeline.shThat runs probe → cost → allocator → native export. MTP heads, when
present in the source model (e.g. Qwen3.5 / 3.6), are covered
automatically as a built-in shard kind inside incremental_probe and
incremental_measure_quant_cost — no separate command needed. Serve:
vllm serve $WORK_DIR/exported \
--quantization compressed-tensors \
--trust-remote-code \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'sensitivity_probe ──────► probe.pkl (Fisher g² per Linear + packed expert)
│
measure_quant_cost ─────────┼─────► cost.pkl (per-(Linear, format) MSE)
│
(optional) local_reconstruct — elite-candidate clipping sweep
│
allocator ◄─────────────────┘
│
▼ layer_config.json, pareto.csv
│
(optional) measure_interactions + quadratic_refine_allocator
(optional) calibrate_allocator — KL-fit per-format gain
│
▼
export_native_compressed ► exported/ (compressed-tensors checkpoint)
│
▼
validate_native_export ► vLLM forward + greedy decode
The allocator uses a closed-form per-Linear loss proxy:
Δloss ≈ 0.5 · H_trace · MSE_W · gain_per_format
where H_trace is the Fisher diagonal trace measured in stage 1,
MSE_W is the measured per-(Linear, format) weight MSE from stage 2,
and gain_per_format is the NNLS-fit calibration gain from
calibrate_allocator (defaults to 1.0).
Streaming backward with forward + backward hooks so no full parameter gradient ever materializes. Sum-reduction CE so per-token gradients aggregate cleanly into a per-layer Fisher diagonal trace. Route-aware weighting for MoE experts (divide by observed routing probability so sparse experts' Fisher is comparable to dense Linears'). Per-token importance weighting (hard tokens count more).
Packed experts use _GradNormCapture — a torch.autograd.Function that
captures the squared Frobenius norm of the incoming gradient through an
identity forward and returns None for the weight gradient so autograd
doesn't accumulate a full 5 GB .grad on the leaf parameter. This is
what makes 40-layer × 2-param × BF16 MoE backwards tractable in 128 GB.
Incremental mode (incremental_probe.py) shards the work so only one
shard's worth of hooks is live at a time — makes a 35B run resumable
and keeps peak memory bounded.
For each tracked Linear and each registered format, apply the native
weight/activation round-trip and measure weight_mse and
output_mse = ‖Wx − Ŵx̂‖² against saved activations from stage 1.
Batched GPU path groups Linears by shape and runs one torch.bmm per
(group, format) — converts ~31 000 tiny kernel launches on a 35B MoE
down to ~360 batched ones. Unbatched CPU path is slower but uses no
extra VRAM.
Incremental mode (incremental_measure_quant_cost.py) shards the
measurements the same way the probe does and merges the per-shard
pickles at the end.
Multi-choice knapsack DP. Per-Linear Δloss = 0.5 · H_trace · MSE_W;
total constraint Σ bits ≤ target. Fused-projection siblings
(q/k/v, gate/up, Qwen3.6-specific linear_attn.{in_proj_qkv+z} and
{in_proj_a+b}) are promoted to the highest-precision sibling so vLLM's
fused loader sees consistent formats.
Outputs:
layer_config.json— per-Linear format assignment (also readable byllmcompressor/auto_round)pareto.csv— Δloss vs bits sweep across--pareto-targets- Printed Kneedle-style knee suggestion
local_reconstruct.py— grid-searches symmetric clipping on weights and activations for a small set of elite frontier-critical layers. Memory-safe (one layer at a time), intentionally slow.measure_interactions.py+quadratic_refine_allocator.py— sparse pairwise KL probe near the knee, then local quadratic refinement. The additive frontier stays fast; interactions only matter where they matter.calibrate_allocator.py— validate a few frontier points against actual KL on held-out data so the predicted frontier can be trusted or corrected on the current model. Emits per-formatgainfactors that the allocator re-reads via--calibration.
Transformers v5 drops mtp.* weights on load (they're in the class's
_keys_to_ignore_on_load_unexpected) because MTP is a vLLM-only
feature. prismaquant instantiates a standalone MtpModule (HF
Qwen3_5MoeDecoderLayer plus the pre_fc_norm_*, fc, and norm
exactly per vLLM's Qwen3_5MultiTokenPredictor forward), loads the
MTP weights directly from safetensors, then runs the standard Fisher
probe against the MTP auxiliary objective
loss = CE( lm_head( MTP(embed_{t+1}, body_hidden_t) ), ids_{t+2} )
The allocator treats MTP Linears identically to body Linears — same
cost model, same knapsack, same fused-sibling rules. At export time,
MTP per-expert tensors are split and emitted with mtp.layers.X.mlp.experts.Y.{gate|up|down}_proj.*
naming, which vLLM's MTP loader picks up via its own
mtp. → model. weight-name remap.
Turns layer_config.json into a compressed-tensors checkpoint that
vLLM serves natively.
- Quantizes each
nn.Linearper the recipe (NVFP4, MXFP8 → vLLM'sCompressedTensorsW8A8Fp8dynamic per-channel, or BF16 passthrough). - Packed MoE experts (
gate_up_proj/down_proj3D) are split into per-expert per-projection tensors (experts.{e}.{gate|up|down}_proj.weight_packed) to match vLLM's loader convention. Jointweight_global_scaleis promoted across fused siblings so vLLM's loader doesn't warn about scale divergence. - Writes
weight_global_scalein compressed-tensors' divisor convention (1/scale); vLLM inverts on load. Emitsinput_global_scale = 1.0for every NVFP4 Linear so vLLM's1/input_global_scaleinitialization stays finite — otherwise the kernel produces degenerate output (!!!!!!!!). - Generates
quantization_configwithquant_method = compressed-tensors,format = mixed-precision, per-formatconfig_groupswith explicit per-Linear regex targets in vLLM's internal naming (language_model.model.layers.X.*for the qwen3_5hf_to_vllm_mapper), plusPER_EXPERT_MOE_REGEXandMTP_PER_EXPERT_REGEXcatch-alls for the per-expert MoE tensors. - MTP weights go through the same quantize-and-emit path, then are
named with the source
mtp.prefix that vLLM's MTP loader accepts verbatim. - Visual encoder weights pass through as BF16 from source (real calibration is deferred — see "what's deferred" below).
Binary check: load the checkpoint in vLLM and do a single greedy decode.
Optionally upgrades the container's flashinfer to a known-good version
before loading (pass --no-flashinfer-upgrade to skip).
Extend with --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
to actually exercise the quantized MTP heads during the decode.
Built-in (register more via format_registry.py):
| Family | Formats |
|---|---|
| NV | NVFP4, NVFP4A16 |
| MX | MXFP4, MXFP6_E3M2, MXFP6_E2M3, MXFP8, MXFP8A16 |
| Int | INT8_W8A16, INT4_W4A16_g128 |
| Float | BF16 (passthrough) |
NVFP4 and MXFP4 are alternatives for the same 4-bit tier, not
separate precision levels. Include at most one format per bit tier
— otherwise the allocator picks between them based on per-layer RTN
measurement noise and you end up with a serving mess (two kernel
paths for 4-bit quant). Allocator warns by default, errors with
--enforce-family-coherence.
Everything in this section assumes serving with vLLM. Microscaling formats (NVFP4, MX*) require NVIDIA Blackwell-era hardware (SM100+) for native kernel support; on older Ampere/Ada you get Marlin emulation, which works at a significant speed penalty.
| Blackwell ISA | vLLM serving today | |
|---|---|---|
| NVFP4 | ✓ | ✓ (FlashInfer CUTLASS) |
| MXFP4 | ✓ | ✓ (FlashInfer CUTLASS) |
| MXFP6_E3M2 | ✓ | ✗ (kernel not yet integrated) |
| MXFP6_E2M3 | ✓ | ✗ (same) |
| MXFP8 | ✓ | ✓ (as W8A8 FP8 via Cutlass) |
| INT4 / INT8 | all NV HW | ✓ (Marlin) |
Until vLLM picks up MXFP6 serving kernels, including MXFP6_* in a
bundle means the allocator can pick it, the checkpoint will contain
it, but vLLM won't know how to load it at serve time. Safe to
experiment with on the quantization side; do not ship until the
kernels land.
| Use case | --formats |
|---|---|
| Ship today on Blackwell via vLLM | NVFP4,MXFP8_E4M3 (validated) |
| MX-pure on Blackwell | MXFP4,MXFP8 |
| Experimental with MXFP6 | NVFP4,MXFP6_E3M2,MXFP8 |
| Legacy INT pipelines | INT4_W4A16_g128,INT8_W8A16 |
requires_grad_(False) on all parameters. Backward runs only to push
gradient signal through autograd so the Fisher hooks can read it;
nothing is written back. It's a sensitivity measurement, not an
optimizer.
Hutchinson on a Linear's weights via vHv probes requires a different hook architecture than we use (hooks see activation gradients, not parameter gradients). Fisher (g²) is the natural fit for hooks and gives a first-order proxy for curvature that correlates well with quantization sensitivity when combined with measured RTN error (which removes the need for Fisher to predict anything — it only needs to rank layers).
The uniform-quantization MSE formula overweights max-magnitude outliers
and doesn't model non-uniform FP codebooks. Running RTN once and
measuring ‖Wx − Ŵx‖² captures the actual distribution of the weight
tensor and the actual functional perturbation at the layer output — no
tuning constants, no assumption about weight distributions.
The earlier formula output_mse · d_out was dimensionally unsound
(mixed output-space error with fan-out). Under the Gauss-Newton
approximation, local Δloss from a weight perturbation δW is
0.5·δWᵀ H δW, and the Fisher diagonal trace H_trace = Σ g² is a
well-defined proxy for the trace of H under the standard
independence-across-tokens assumption. Measuring weight MSE directly
is a sharper estimator than inferring it from activation propagation.
The frontier builder remains additive because that is the only practical way to sweep the whole model cheaply. prismaquant addresses the missing cross-layer terms by:
- measuring sparse pairwise interactions only for the most important
units near the knee (
measure_interactions.py) - refining the knee locally with those terms
(
quadratic_refine_allocator.py) - calibrating the refined frontier against actual KL
(
calibrate_allocator.py)
This keeps memory bounded while still capturing the interaction structure that recent MPQ literature shows matters. Framing we owe to Gemini's review: use the blunt proxy to get to the Pareto frontier fast, then measure true pairwise interactions only at the margin where the allocator is borderline.
Three places where the current 0.5 · H_trace · MSE_W proxy cuts
corners. We know about them, we can quantify most of them, and all
three have explicit follow-up work on the roadmap.
1. Scalar per-layer curvature. H_trace is a single number per
Linear — the sum of ‖∂L/∂W‖²_F across all channels. That collapses
any intra-layer curvature structure. If a given up_proj has 90 %
of its channels fine under NVFP4 but 10 % catastrophically sensitive,
the trace averages them and the allocator sees an average-pressure
layer. Per-output-channel H diagonal + per-channel weight MSE
preserves the knapsack's optimal substructure (the DP stays
separable), costs out_features floats per layer (single-digit MB
on a 35B model), and is the single most likely place the proxy
underperforms an informed allocator. On the roadmap as the "next
rigor pass." Open question: does the resulting allocation actually
differ meaningfully from the scalar-trace version on our validated
tasks? We haven't ablated yet.
2. Empirical Fisher vs true Hessian. g · gᵀ equals the true
Hessian at a strict local minimum; post-training LLMs on our
calibration distribution are close but not exact. The resulting
bias is roughly layer-uniform — the absolute Fisher magnitudes are
slightly off but the relative ranking across layers (which is all
the knapsack needs) is preserved. Hutchinson's estimator via
forward-over-reverse autodiff would fix this cleanly at ~2× probe
time, but the measured-ranking invariance means it probably doesn't
change allocation decisions in practice. Worth the experiment for a
preprint; not a production blocker.
3. Zero cross-terms. The additive formula Σᵢ 0.5 · Hᵢ · MSEᵢ
assumes every Linear's quantization error is uncorrelated with
every other's. False in general — K-FAC style off-diagonals would
capture it, but they destroy the knapsack's optimal substructure
(the problem becomes NP-hard Quadratic Knapsack). prismaquant's
existing measure_interactions.py + quadratic_refine_allocator.py
handles this the pragmatic way: measure sparse pairwise interactions
only for the most-sensitive units near the knee, then refine
locally. Additive gets us 95 % of the way to optimal; the margin
refinement handles the borderline cases where interactions actually
flip decisions.
The pragmatic TL;DR: the scalar-diagonal-additive proxy is a feature, not a bug — it keeps the global bit-routing problem cleanly solvable in O(N·bits) time. Each of the three corners cut above has a defined off-ramp that doesn't force us to abandon the knapsack.
| Stage | Peak RAM | Peak VRAM (GB10) |
|---|---|---|
| incremental_probe (35B) | 90 GB | 90 GB (unified) |
| incremental_cost (35B) | 60 GB | 60 GB |
| allocator | < 1 GB | n/a |
| export_native_compressed | 60 GB | 20 GB |
Fits 128 GB unified systems (DGX Spark GB10, etc.) for models up to ~48 B parameters. The incremental-mode watchdog in the probe and cost scripts aborts cleanly on swap pressure rather than OOM-killing the host.
prismaquant's pipeline is model-agnostic by default, with
architecture-specific knowledge concentrated in a ModelProfile
subclass. Most of what a profile needs is already encoded by vLLM on
its model class and auto-derived — a new architecture's profile
typically just names the vLLM class and optionally supplies an
MTP-probe helper.
Every place prismaquant needs to make an architecture-specific
decision — fused-sibling promotion, vLLM's weight-loader naming
convention, packed-expert parameter names, MTP module construction,
source passthrough prefixes — is routed through ModelProfile.
Profiles are looked up once at runtime from the HF config's
model_type and architectures fields; no other file in the
pipeline cares which architecture it's looking at.
Default implementations in ModelProfile read two attributes off
the vLLM model class registered for this architecture:
packed_modules_mapping(e.g.{'qkv_proj': ['q_proj', 'k_proj', 'v_proj'], 'gate_up_proj': ['gate_proj', 'up_proj'], ...}). Drivesprofile.fused_sibling_group()— the allocator promotes all members of a fused group to the highest-precision sibling so vLLM's fused-loader sees consistent formats.hf_to_vllm_mapper.orig_to_new_prefix(e.g.{'model.language_model.': 'language_model.model.', 'model.visual.': 'visual.', 'lm_head.': 'language_model.lm_head.'}). Drivesprofile.to_vllm_internal_name()— the config_groups targets and vLLM's scheme-dispatch names stay in sync without prismaquant duplicating the mapping.
Both are vLLM's source of truth. When vLLM adds a new fused pattern or naming quirk on an existing architecture, prismaquant picks it up on the next probe run with no prismaquant-side code change.
matches(model_type, architectures)— pattern-match on HF config.name— profile identifier string.vllm_architecture_class()— one string: the HFarchitectures[0]entry whose vLLM class carries the metadata above.build_mtp_module(text_config)(only if the arch has MTP) — an HF-module replica of vLLM's MTP forward so prismaquant's Fisher probe can backward through it. vLLM's MTP class isn't HF-autograd-friendly, which is why this one piece can't auto-derive.source_passthrough_prefixes()(optional) — which top-level prefixes get copied from source as BF16 (visual encoder if we defer calibration, MTP residue, etc.).to_vllm_internal_name()override (optional) — for arch quirks that the vLLM prefix map doesn't capture. Qwen3.5/3.6 overrides to preserve themtp.prefix at scheme-dispatch (vLLM'smtp.→model.remap runs at weight-load, not scheme-dispatch).
For a hypothetical WhateverForCausalLM:
- Write the profile:
# prismaquant/model_profiles/whatever.py
from .base import ModelProfile
class WhateverProfile(ModelProfile):
@classmethod
def matches(cls, model_type, architectures):
return model_type == "whatever" or any(
a.startswith("Whatever") for a in architectures)
@property
def name(self): return "whatever"
def vllm_architecture_class(self):
return "WhateverForCausalLM"
# ... only if has_mtp / visual / quirks:
# def has_mtp(self): return True
# def build_mtp_module(self, text_config): ...- Register it:
from prismaquant.model_profiles import register_profile
register_profile(WhateverProfile)Or add to prismaquant/model_profiles/registry.py's _REGISTERED list.
- Validate it against a real checkpoint — this is the load-bearing step. The validator catches 90% of "I forgot to update this" bugs:
python -m prismaquant.model_profiles.validate \
--model /path/to/Whatever-7B
Output is a list of ✓/✗ checks:
Validating profile: WhateverProfile (auto-detected)
Model: /path/to/Whatever-7B
vllm class: WhateverForCausalLM
✓ matches() returns True for this model
✓ vllm_architecture_class() resolves
✓ fused-sibling groups consistent
5 fused groups × multiple siblings map to the same canonical key
✓ to_vllm_internal_name() obeys vLLM's prefix map
3 prefix rewrites agree
✓ packed_expert_param_names() cover actual 3D params
2/2 declared names found
✓ source_passthrough_prefixes() cover real tensors
2 prefixes, all match at least one tensor
✓ MTP module constructs + loads weights
7 / 7 checks passed
The validator cross-checks against vLLM's class metadata + the model's safetensors index + (if MTP) an actual MTP module instantiation. Passing checks don't guarantee the export will work end-to-end, but they catch every class of consistency bug I've found so far across three architecture families.
-
Run the pipeline. No other file in prismaquant knows which architecture you're on. The probe, cost, allocator, and export all pull from the profile when they need an arch-specific decision.
-
Report. Add a line to the README's validated-models list, ship a recipe, and open a PR with the probe/cost artifacts and eval numbers so the next person can see precedent.
Qwen3_5Profile— covers Qwen3.5, Qwen3.6. Tested end-to-end on Qwen3.6-35B-A3B. Validator passes 7/7.DefaultProfile— generic fallback. No fused-sibling promotion, no MTP, identity name remap. Will do the right thing for simple architectures whose vLLM class haspacked_modules_mappingbut won't handle MTP or multimodal naming quirks.
Planned profiles (should all be ~50-line files each since the vLLM registry gives most of what we need):
DeepseekV3Profile— covers DeepSeek-V3 family; has MTP (deepseek_mtp), has MoE (different packed convention than Qwen).GLM4MoEProfile— has MTP (glm4_moe_mtp), has MoE.MinimaxProfile— MiniMax M1/M2.7; novel architecture, may need more than vLLM exposes.LlamaMoEProfile— Llama 3 / 4 MoE variants (no MTP).
Contributions welcome; the validator should give clear feedback on what's missing for any profile submission.
- Visual encoder calibration: weights pass through as BF16 from
source. The probe ran the regex for visual blocks, but
stage_text_onlystripsvision_configfrom the staged config so the loaded model never instantiates visual modules. Real multimodal calibration (loadForConditionalGeneration+ useAutoProcessoron an image dataset) is straightforward follow-up but not yet in the pipeline. - Non-Qwen MTP forward construction: The HF-side MTP module
replica used for Fisher probing is profile-provided. Qwen3.5/3.6
has one (
Qwen3_5Profile.build_mtp_module). Other MTP architectures need their own — see "Adding a new architecture" above.
Immediate priorities (next few weeks):
- MiniMax M2.7 on a single DGX Spark. MiniMax M2.7 is ~280 B
parameters — it does not fit in 128 GB at BF16, at FP8, or at a
uniform 4-bit quant with meaningful KV headroom. prismaquant's
mixed-format allocator gets close: if we can land an average
~3.4 bpp recipe with the sensitive paths in NVFP4 and most experts
in 3-bit, the model fits and the KV cache has room to breathe. The
blockers are
3-bit supportand a size-targeting allocator mode (below). - 3-bit format support. Requires registering a 3-bit
FormatSpecinformat_registry.pyand picking an underlying kernel path that vLLM can serve. Candidates: GPTQ 3-bit via Marlin (available now; slower), or native 3-bit W3A16 when/if compressed-tensors ships a dispatcher. - MXFP6 (E3M2 and E2M3). Blackwell tensor cores support MXFP6
natively — the hardware path is free. vLLM doesn't yet ship a
serving kernel for it, but the quantization side works fine today
(
format_registry.pyhas both variants registered). Enable in the serving path once the vLLM dispatcher lands. - Size-targeting allocator mode.
--target-bytes 24_000_000_000 --kv-budget 8192instead of--target-bits 4.75. The allocator already computes the full Pareto curve; this is a thin wrapper that picks the knee matching a hardware constraint. Unlocks direct comparisons at 3090 (24 GB), 5090 (32 GB), and Spark (128 GB) memory targets.
Broader research directions:
- Per-channel Fisher + per-channel weight MSE. The next rigor
pass on the cost model. Keep
H_diagas a vector of lengthout_featuresinstead of summing toH_trace, pair with per-channelMSE_Walready produced bymeasure_quant_cost.py's batched path, and compute Δloss asΣᵢ 0.5 · H_diag,i · MSE_W,i. Preserves the knapsack's optimal substructure, costs < 10 MB of extra storage per 35B model, and is the single most likely place the current scalar proxy underperforms. Cheap intermediate: keep the scalarH_tracebut use per-channelMSE_Wright away — a half-day of work that captures ~50 % of the full win without changing the probe. Would settle the empirical question "does per-channel sensitivity actually differ meaningfully from per-layer sensitivity for our 4.75 bpp allocation?" (hypothesis: yes, on attention Linears anddown_projwhere outlier channels are the norm). - Per-MTP-Linear acceptance-rate tuning — jointly optimize target-model quality and MTP-draft token acceptance rate. Today MTP Linears get assigned using the same body loss model; a more sophisticated allocator would weight MTP sensitivity by the draft's contribution to speculative-decoding throughput.
- Multimodal calibration — load
ForConditionalGeneration, drive image + text pairs through the multimodal processor so visual encoder blocks get real Fisher stats instead of BF16 passthrough. - Model-profile registry completion — the
model_profiles/package is in place and handles Qwen3.5/3.6 end-to-end; add first-class profiles for DeepSeek-V3 / V3.1, GLM-4, MiniMax, and Llama-family MoE so prismaquant works out of the box on those architectures. - Sparse-outlier co-quantization. Borrow from SpQR / SqueezeLLM: extract the top-k outlier weights per Linear, serve them as a sparse BF16 side-table, quantize the dense remainder more aggressively. Would push the 4.75 bpp frontier down to ~4.0 bpp with the same quality.
- Interaction-aware refinement at scale. We already have sparse
pairwise interaction measurement + local quadratic refinement
(
measure_interactions.py+quadratic_refine_allocator.py). Scaling these to cover the whole model without the memory cost of a dense pairwise probe is open. - Serving-aware calibration. The calibration set right now is generic ultrachat; domain-targeted calibration (code, reasoning, multilingual) should Pareto-dominate for task-specific deployments.
prismaquant stands on the shoulders of a decade of mixed-precision quantization research. The closed-form cost model, the Fisher-diagonal sensitivity estimator, the multi-choice knapsack formulation, and the format registry are all assembled from published ideas. Key influences:
Mixed-precision bit allocation
- Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. CVPR 2019.
- Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., & Keutzer, K. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. ICCV 2019.
- Dong, Z., Yao, Z., Arfeen, D., Gholami, A., Mahoney, M. W., & Keutzer, K. HAWQ-V2: Hessian Aware Trace-Weighted Quantization of Neural Networks. NeurIPS 2020.
- Yao, Z., Dong, Z., Zheng, Z., Gholami, A., Yu, J., Tan, E., et al. HAWQ-V3: Dyadic Neural Network Quantization. ICML 2021.
- Chen, W., Wang, P., & Cheng, J. Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization. ICCV 2021.
Post-training quantization and rounding
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
- Frantar, E. & Alistarh, D. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. NeurIPS 2022.
- Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., & Lv, K. Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (AutoRound). arXiv 2309.05516, 2023.
- Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024.
- Nagel, M., Amjad, R., van Baalen, M., Louizos, C., & Blankevoort, T. Up or Down? Adaptive Rounding for Post-Training Quantization (AdaRound). ICML 2020.
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., et al. BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ICLR 2021.
Rotation / pre-conditioning
- Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., et al. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024.
- Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., et al. SpinQuant: LLM Quantization with Learned Rotations. 2024.
- Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.
- Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022.
Low-bit and learned-codebook quantization
- Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. QuIP: 2-Bit Quantization of Large Language Models With Guarantees. NeurIPS 2023.
- Tseng, A., Chee, J., Sun, Q., Kuleshov, V., & De Sa, C. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024.
- Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., et al. SqueezeLLM: Dense-and-Sparse Quantization. ICML 2024.
- Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. ICLR 2024.
- Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., et al. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. ICLR 2024.
MoE quantization
- Kim, Y., Henry, R., Imani, H., Saberian, S., Sefidgaran, M., et al. MoQE: Mixture of Quantized Experts for Efficient Mixture-of-Experts. 2023.
MTP / speculative decoding
- DeepSeek-AI. DeepSeek-V3 Technical Report. 2024 (MTP auxiliary objective, adopted here for the MTP Fisher probe).
- Leviathan, Y., Kalman, M., & Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML 2023.
- Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. 2024.
KV cache quantization
- Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., et al. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. 2024.
- Agarwal, V. (vLLM). TurboQuant: 2-bit KV Cache Compression with Hadamard Rotation + Lloyd-Max Scalar Quantization. vLLM PR #38479, 2026. (Waiting on hybrid-attention/mamba support for Qwen3.6.)
Formats
- Open Compute Project (OCP). Microscaling Formats (MX) Specification, Rev 1.0. 2023.
- NVIDIA. Blackwell GPU Architecture and NVFP4 Format. 2024.
- Neural Magic / vLLM Project. compressed-tensors: A Universal Format for Quantized Model Serving. 2024.
Pareto-knee detection
- Satopaa, V., Albrecht, J., Irwin, D., & Raghavan, B. Finding a 'Kneedle' in a Haystack: Detecting Knee Points in System Behavior. ICDCS Workshops 2011.
Classical bit allocation (water-filling)
- Cover, T. M., & Thomas, J. A. Elements of Information Theory, Chapter 13 (Rate-Distortion), 2nd ed. Wiley, 2006.
If you use prismaquant in research, please cite this repository. A
preprint covering the closed-form allocator math, the
_GradNormCapture MoE Fisher estimator, and the MTP quantization path
is forthcoming.
@software{prismaquant2026,
title = {prismaquant: Mixed-Precision Quantization via
Fisher-Weighted Bit Allocation},
author = {Tand, Rob and contributors},
year = {2026},
url = {https://github.com/RobTand/prismaquant},
}