InstinctRazor

Sub-4-bit quantization of a frontier MoE — and a reproducible way to prove it didn't lose anything that matters.

📦 Ready-to-run weights: the deployable 48 GB IQ3_XXS GGUF (runs on one 80 GB GPU, or a small card + CPU expert-offload) is on the Hugging Face Hub — General-Instinct/InstinctRazor-Qwen3.5-122B-A10B-GGUF. The framework in this repo reproduces it from the original BF16. (The Python package/module namespace is moe-lowbit.)

We take Qwen3.5-122B-A10B (a 122B hybrid Gated-DeltaNet MoE, ~245 GB BF16) down to a ~47 GB 3-bit-expert / 4-bit-backbone checkpoint (q122_ptq3b_clip), and show that on a shared, validation-gated harness it beats the footprint-matched Gemma-4-26B-A4B (~52 GB) on knowledge, reasoning, multilingual, and multimodal-MMMU — tracking the uncompressed BF16 teacher within ~1 point, with no training. Where it doesn't win (LiveCodeBench v6, MATH-Vision — both truncation-driven, not capability loss), there's an optional Lightning-OPD on-policy-distillation path.

The headline isn't just "we compressed it." It's: a 122B MoE inverts the dense low-bit regime — at ~3 effective bits, PTQ is near-lossless because only 8/256 experts fire per token and the ~6% always-on path is protected. A 122B model that needed 4×80 GB now fits on one 80 GB GPU and out-scores a same-size dense-ish MoE.

Headline result

~47 GB clip vs the ~52 GB A4B baseline, same harness (vLLM 0.22, TP=4, thinking mode, seed 0). Full provenance + caveats in results/RESULTS.md.

Benchmark	teacher BF16	clip (~47 GB)	A4B (~52 GB)	verdict
MMLU-Pro	87.6	88.5	85.6	✅ clip ≥ A4B
GPQA-Diamond	83.8	84.8	79.3	✅ clip ≥ A4B
MMMLU	88.8	87.2	85.4	✅ clip ≥ A4B
MMMU-Pro	—	80.8	73.8†	✅ clip ≥ A4B
LiveCodeBench v6	65.5	57.0 (75.7 finished)	66.0	⚠️ recoverable gap (clip truncates 30%)
MATH-Vision	—	70.0 → 77.5 hi-budget	82.4†	⚠️ recoverable gap (truncation)
HLE (no-tools)	18.0	13.3	12.3	✅ clip ≥ A4B (below teacher)
τ²-Bench	—	—	—	⏳ no in-tree harness

†A4B official figure (a same-harness A4B multimodal run was not done). All comparisons are subject to the validation gate — see caveats below and docs/EVAL_PROTOCOL.md.

Quickstart

A. Verify the whole framework on ONE GPU (~15 min, no 122B, no 4×H100)

Runs the exact same load → PTQ → dequant-save → vLLM-eval code path on a small dense model:

python3.12 -m venv vllm_venv && source vllm_venv/bin/activate && pip install -r requirements.txt
MOE_LOWBIT_VENV=$PWD/vllm_venv source env.sh
bash pipelines/smoke.sh        # quantizes Qwen3.5-2B to 4-bit, evals it; prints "SMOKE OK"

B. Reproduce the headline (4×H100-80GB)

Three commands: set up env → build the 47 GB clip from BF16 → eval clip and the A4B baseline.

# 0. env (once): create the eval venv, then source env.sh (sets PYTHONPATH, HF token, caches)
python3.12 -m venv vllm_venv && source vllm_venv/bin/activate && pip install -r requirements.txt
HF_TOKEN=hf_xxx MOE_LOWBIT_VENV=$PWD/vllm_venv source env.sh

# 1. BF16 -> ~47 GB clip  (recipe in configs/clip_122b.env)
bash pipelines/quantize.sh

# 2. eval clip vs the footprint-matched A4B on the gated axes
bash pipelines/eval.sh --model models/q122_ptq3b_clip --tag clip --benchmarks mmlu_pro,gpqa,mmmlu
bash pipelines/eval.sh --model google/gemma-4-26b-a4b-it --tag a4b  --benchmarks mmlu_pro,gpqa,mmmlu

Then read results/vllm_eval/{clip,a4b}.json (or regenerate results/RESULTS.md). For the long-context axes add --budget 64k (math/code/HLE). Optional on-policy distillation for the LCB code gap:

bash pipelines/distill.sh --smoke   # validate the 4-GPU FSDP path first
bash pipelines/distill.sh           # gen -> score -> FSDP train -> merge+requant -> ~47 GB

Repo layout

InstinctRazor/           # (Python module namespace: moe-lowbit)
  env.sh                 # source first: PYTHONPATH(quant:eval:distill) + venv + HF/caches (overridable)
  requirements.txt       # EVAL/quantize venv (torch 2.11, vllm 0.22, transformers 5.9)
  requirements-train.txt # OPD FSDP train venv (torch 2.7 + flash-linear-attention) — separate by necessity
  src/
    quant/   moe_quant.py (fake-quant/STE/clip-search) · quant_save.py (the builder) ·
             moe_probe.py + moe_alloc.py + moe_study.py (the salience research path) · moe_ptq.py
    eval/    vllm_eval8.py (text harness) · mm_eval.py (multimodal) · vllm_eval.py · bench8_loaders.py ·
             moe_eval.py · bench_mm.py
    distill/ opd_gen.py · opd_score.py · opd_train_fsdp.py · merge_adapter.py · moe_lora.py ·
             cache_teacher.py · fsdp_setup.py (reference; live logic is inline in opd_train_fsdp.py)
  pipelines/ smoke.sh · quantize.sh · eval.sh · distill.sh
  configs/   clip_122b.env · smoke_2b.env · eval.env · opd_r1.env (+ README explaining each)
  results/   the result JSONs + RESULTS.md (each number -> its source JSON + command)
  docs/      EVAL_PROTOCOL.md + MOE_FINDINGS/MOE_PIPELINE/LADDER_RESULTS/OPD_INTEGRATION/POSITIONING/SURVEY_MOE
  archive/   quarantined dead-ends (device_map OPD, EP/FSDP smokes, diagnostics) — kept, marked deprecated

Note on imports: the modules cross-import by bare name (import moe_quant, import vllm_eval, …), so env.sh puts all three src/ subdirs on PYTHONPATH. That is the only packaging change — no source logic was rewritten.

The recipe (what actually ships)

expert_bits = 3.0   (int3, all 256 routed experts, both gate_up + down blocks)
linear_bits = 4.0   (int4, all non-expert nn.Linear + embed/lm_head)
group = 128 · per-block symmetric · MSE clip-search 16 steps
protected @ bf16:  vision tower · router/gate · all norms
=>  ~3.05 effective bits  ~=  ~47 GB packed  (under A4B's ~52 GB)

This is uniform quantization, deliberately. We built per-expert salience allocation (moe_probe+moe_alloc+moe_study) and the study (finding F4/F6, docs/MOE_FINDINGS.md) showed that on a load-balanced router (0/256 cold experts) allocation is a real lever for distribution fidelity but second-order for downstream capability — uniform 3b already lands near-lossless. So the probe/alloc path is the research that justified uniform, run optionally via PROBE=1 bash pipelines/quantize.sh.

Deployment: GGUF/llama.cpp (the packed, single-GPU-runnable form) — VALIDATED

The fake-quant checkpoint above is the capability ceiling (what eval.sh measures), stored dequantized. The shipped deployable artifact is a real low-bit GGUF, quantized from the original BF16 (not the dequant clip — that would double-quantize) with an imatrix, via llama.cpp (qwen3_5_moe support merged upstream, PR #19468):

Don't want to rebuild? Download it. The built artifact (+ the mmproj vision projector) is on the Hub: General-Instinct/InstinctRazor-Qwen3.5-122B-A10B-GGUF. To rebuild it from the original BF16 instead:

BASE_TYPE=IQ3_XXS bash pipelines/pack_gguf.sh   # convert orig BF16 → imatrix → IQ3_XXS protected

Validated artifact q122_orig-IQ3XXS-protected.gguf — 48.04 GiB (IQ3_XXS 3.06 bpw experts; protected path shared int8 / attn int4 / router+SSM f16 / embed+lm_head q8_0; recipe configs/gguf_tensor_types_iq3.txt):

metric	value	vs A4B	vs fake-quant clip
MMLU-Pro	90.7 (n=150, 0 trunc)	≥ 85.6 ✅	tracks (88.5–90)
GPQA-D	80.8 (n=198, 0 trunc)	≥ 79.3 ✅	~4 pt under (84.8)
decode tok/s, 1×80 GB H100	115.9 (prefill 2541)	—	—
decode tok/s, expert-offload (peak ~7.6 GiB VRAM, fits 8 GB to 8k ctx)	45.7 ± 0.5 (prefill ~154)	—	runs on an 8 GB card + ~48 GiB RAM

The deployable GGUF preserves the win (≥ A4B on both axes) and runs on one 80 GB GPU at ~116 tok/s (or a small card + CPU at ~47 tok/s with all routed experts offloaded). GPQA is ~4 pt below the fake-quant ceiling — an honest, small i-quant loss, not a collapse. Accuracy is measured by src/eval/llamacpp_eval8.py (llama-server + our exact prompt-build/grading — vLLM can't load this arch's GGUF). Pipeline notes baked in: --no-mtp (no mtp.* tensors despite config); # comments stripped before --tensor-type-file; vision tower → separate *-mmproj-f16.gguf.

Superseded fallback: an earlier q122_clip-Q3K-protected.gguf (57 GB) was built from the dequant clip (double-quantized, Q3_K 3.4 bpw, no imatrix). The IQ3_XXS-from-original build above replaces it as the shipped artifact.

Hardware

4× NVIDIA H100-80 GB (NVLink), driver 580.105.08, CUDA 13.0, ~885 GB CPU RAM. Quantize + eval use device_map="auto" / vLLM TP=4; OPD training uses FSDP2 across all 4 GPUs via torchrun.
The packed ~47 GB recipe is designed to run on a single 80 GB GPU; the BF16 teacher needs 4.
The one-GPU smoke (path A) runs on any ≥16 GB GPU.

Methodology & roadmap

Eval numbers are weight-only fake-quant — a capability ceiling, measured at full fidelity. quant_save.py bakes the 3b/4b quantization, then saves a dequantized BF16 checkpoint so vLLM measures capability exactly (no low-bit eval kernel). The deployable artifact is the GGUF/llama.cpp pack (docs/MOE_PIPELINE.md) — the ~47 GB on-disk, single-GPU-runnable form, and the one published on the Hub.
Same-harness deltas are the comparison. Some axes (e.g. LiveCodeBench v6 extraction) read below their official absolutes for all models; the gated quantity is the same-harness delta vs A4B, not the absolute. See docs/EVAL_PROTOCOL.md for the validation gate.
Remaining gaps are an OPD target, not a wall. Where the clip still trails A4B — code (LCB v6) and math / multimodal-math — the loss is largely token-inefficiency introduced by quantization. We close it with OPD (on-policy distillation), a separate framework we'll open-source later; pipelines/distill.sh is the in-tree reference path (gen → score → FSDP train → merge + requant), trained on the gap's own domain (e.g. code CoT for LCB).

Where the numbers come from

Every value in results/RESULTS.md maps to a JSON in results/ and the eval.sh command that produced it. Start with docs/EVAL_PROTOCOL.md (the validation gate), then docs/MOE_FINDINGS.md (why the recipe is what it is).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

InstinctRazor

Headline result

Quickstart

A. Verify the whole framework on ONE GPU (~15 min, no 122B, no 4×H100)

B. Reproduce the headline (4×H100-80GB)

Repo layout

The recipe (what actually ships)

Deployment: GGUF/llama.cpp (the packed, single-GPU-runnable form) — VALIDATED

Hardware

Methodology & roadmap

Where the numbers come from

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
archive		archive
configs		configs
docs		docs
pipelines		pipelines
results		results
src		src
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
env.sh		env.sh
requirements-train.txt		requirements-train.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

InstinctRazor

Headline result

Quickstart

A. Verify the whole framework on ONE GPU (~15 min, no 122B, no 4×H100)

B. Reproduce the headline (4×H100-80GB)

Repo layout

The recipe (what actually ships)

Deployment: GGUF/llama.cpp (the packed, single-GPU-runnable form) — VALIDATED

Hardware

Methodology & roadmap

Where the numbers come from

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages