
Spiral

Geometric compression of rotated transformers.

Spiral exploits the geometric structure of transformer activations to achieve state-of-the-art INT3 weight compression and INT2 KV cache compression, with no calibration data and no fine-tuning. Two results:

  1. SOTA calibration-free INT3 weights at +0.14 nats — 101× quality improvement over naive 3-bit, competitive with calibration-based approaches (GPTQ, AWQ, QuIP#) that require representative data.

  2. INT2 PQ KV cache at 7.1× K compression — product quantization reduces per-token KV memory from 56 KB to 32 KB (K+V combined), scaling context capacity by 1.75× at any memory budget. With full K+V PQ (in progress), this reaches 7.1× total compression.

INT3 Weight Quality

Measured eval perplexity gap vs fp16:

Qwen2.5-Coder-7B-Instruct (dense):

| Method | Bits | Gap (nats) | Calibration Data Required |
|---|---|---|---|
| Naive round-to-nearest | 3 | +14.2 | No |
| GPTQ | 3 | ~+0.8 | Yes (128 samples) |
| AWQ | 3 | ~+0.6 | Yes (calibration set) |
| QuIP# | 3 | ~+0.3 | Yes (calibration set) |
| Spiral | 3 | +0.141 | No |

GPTQ/AWQ/QuIP# gaps are approximate values from published literature at comparable model scales, not measured on this specific model.
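The "nats" in the table are differences in per-token cross-entropy, so a gap of g nats multiplies perplexity by e^g. A quick sanity check of the headline numbers (a sketch, not project code):

```python
import math

def ppl_ratio(gap_nats: float) -> float:
    """A cross-entropy gap of g nats multiplies perplexity by e**g."""
    return math.exp(gap_nats)

spiral = ppl_ratio(0.141)   # ~1.151: about a 15% perplexity increase over fp16
naive = ppl_ratio(14.2)     # ~1.47 million-fold: naive 3-bit perplexity explodes
ratio = 14.2 / 0.141        # ~101: the "101x quality improvement" figure
```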

Qwen3-Coder-30B-A3B-Instruct (MoE, 128 experts):

| Method | Size | vs Spiral |
|---|---|---|
| Q4_K_M (GGUF) | 18.6 GB | 60% larger |
| Q3_K_M (GGUF) | 15.3 GB | 32% larger |
| Q3_K_S (GGUF) | 14.2 GB | 22% larger |
| Q2_K_M (GGUF) | 11.8 GB | Similar size, higher quality loss |
| Spiral INT3 + PQ KV | 11.6 GB | +0.228 nats, plus 7.1× KV compression |

Spiral achieves Q2-level model size while maintaining Q3-level quality — measured at +0.228 nats vs fp16 baseline (2.212 nats). No standard GGUF method includes KV cache compression; Spiral adds 7.1× K compression on top, enabling 75% more context at any memory budget.

The rotation is a deterministic, seeded orthonormal transform that works on any architecture — dense or MoE, any head dimension, any RoPE frequency. No calibration data, no gradient updates, no fine-tuning.

KV Cache Compression

Per-token KV memory comparison for a 7B model (28 layers, 4 KV heads, 128 head_dim):

| KV Method | K bits/dim | V bits/dim | Per-token KV | Compression |
|---|---|---|---|---|
| F16 (standard) | 16 | 16 | 56.0 KB | — |
| Q8_0 | 8 | 8 | 28.0 KB | — |
| Q4_0 | 4 | 4 | 14.0 KB | — |
| Spiral PQ (K only) | 2.1 | 16 | 31.9 KB | 7.1× (K) |
| Spiral PQ (K+V, planned) | 2.1 | 2.1 | 7.9 KB | 7.1× (K+V) |
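The table's baseline follows directly from the model shape. A sketch of the bookkeeping for the 7B config (28 layers, 4 KV heads, head_dim 128), using the 34-byte-per-K-vector PQ code described in "How It Works" (32 one-byte subspace indices plus per-vector metadata):

```python
# Per-token KV cache bookkeeping for the 7B config above.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
FP16_BYTES = 2

def kv_bytes_per_token(k_bytes_per_vec: float, v_bytes_per_vec: float) -> float:
    vecs = LAYERS * KV_HEADS   # one K and one V vector per KV head per layer
    return vecs * (k_bytes_per_vec + v_bytes_per_vec)

fp16_vec = HEAD_DIM * FP16_BYTES   # 256 bytes per fp16 vector
pq_vec = 34                        # PQ-coded K vector (assumed layout from the text)

full_f16 = kv_bytes_per_token(fp16_vec, fp16_vec)   # 57344 B = 56.0 KB
k_only_pq = kv_bytes_per_token(pq_vec, fp16_vec)    # ~31.7 KB, close to the table's 31.9 KB
```

The small difference from the table's 31.9 KB presumably comes from additional per-block metadata not spelled out here.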

Total Memory — What Actually Fits

Model size alone doesn't determine whether a model runs on your hardware. Total memory — weights + KV cache + compute buffers — is what matters. Spiral compresses all of it.

Qwen2.5-Coder-7B at 32K context:

| | Spiral INT3 + PQ KV | Q4_K_M + F16 KV | Q4_K_M + Q4_0 KV |
|---|---|---|---|
| Weights | 3.0 GB | 4.7 GB | 4.7 GB |
| KV cache (32K) | 0.98 GB | 1.7 GB | 0.43 GB |
| Compute + buffers | 1.5 GB | 1.5 GB | 1.5 GB |
| Total | 5.5 GB | 7.9 GB | 6.6 GB |
| Fits 8 GB? | Yes | No | Tight |

Qwen3-Coder-30B-A3B at 32K context:

| | Spiral INT3 + PQ KV | Q4_K_M + F16 KV | Q4_K_M + Q4_0 KV |
|---|---|---|---|
| Weights | 11.6 GB | 18.6 GB | 18.6 GB |
| KV cache (32K) | 0.11 GB | 1.5 GB | 0.75 GB |
| Compute + buffers | 1.6 GB | 1.5 GB | 1.5 GB |
| Total | 13.3 GB | 21.6 GB | 20.9 GB |
| Fits 16 GB? | Yes | No | No |
| Fits 24 GB? | Yes | Tight | Tight |

At 32K context, Q4_K_M needs 21.6 GB total for the 30B MoE — it doesn't fit on 16GB and barely fits on 24GB. Spiral needs 13.3 GB. That's the difference between running and not running.

Context capacity at each memory tier (Qwen2.5-Coder-7B):

| Hardware | Spiral PQ Context | Q4_K_M + F16 KV Context |
|---|---|---|
| 8 GB Mac | 113K tokens | 18K tokens |
| 16 GB Mac | 360K tokens | 186K tokens |
| 24 GB Mac | 606K tokens | 355K tokens |

For long-horizon agent tasks — multi-file code generation, repository-scale analysis, extended conversations — context capacity is the binding constraint. PQ KV trades ~34% decode speed for 75% more context at every memory tier.

How It Works

The Geometry

Trained transformer weights are not random matrices. They exhibit structure that compression can exploit:

Observation 1: Hypersphere concentration. Weight rows concentrate near a thin shell on the unit hypersphere (norm CV ≈ 0.02). Direction carries the information; amplitude is nearly constant. This enables sign/amplitude decoupling.

Observation 2: Rotated Gaussianity. Applying a random orthonormal rotation (Walsh-Hadamard transform) to any trained weight row produces nearly Gaussian marginals with equalized variance across all dimensions. Outlier channels — the primary source of quantization error — vanish under rotation.
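A toy illustration of the outlier-spreading effect, using a seeded random-sign fast Walsh-Hadamard transform (a sketch of the idea, not Spiral's kernel): a single extreme channel is smeared evenly across every dimension, while the orthonormal transform preserves the vector's norm exactly.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform of a 1-D vector of power-of-2 length, O(d log d)."""
    x = x.astype(float).copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)   # normalize: the transform becomes orthonormal

rng = np.random.default_rng(0)            # seeded, so the same transform is
d = 256                                   # reproducible at quantize and inference time
signs = rng.choice([-1.0, 1.0], size=d)   # random sign flips randomize the rotation

row = np.zeros(d)
row[3] = 10.0                             # one extreme outlier channel
rotated = fwht(row * signs)
# max |entry| drops from 10.0 to 10/sqrt(256) = 0.625, spread over all 256 dims
```

With real weight rows (many nonzero channels), the rotated marginals also become near-Gaussian; the single-spike input here just isolates the outlier-spreading behavior.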

Observation 3: PQ subspace adaptation. Product quantization with 256 learned codewords per 4-dimensional subspace captures 68.5% of the scalar-to-Shannon compression gap for KV activations. Natural-space codebooks (no rotation needed for KV) add only +0.02 nats — learned codebooks adapt to non-uniform dimensional importance inherently.

Unified Rotation

Spiral applies the same mathematical primitive — multi-pass block Walsh-Hadamard rotation — to both weights and activations:

Weights (offline): Rotate → quantize to INT3 with Lloyd-Max optimal centroids → store. At inference, rotate the input activation by the same transform before matmul. Cost: O(d log d) per token via fast WHT.
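The quantize step can be sketched with Lloyd's algorithm finding the eight MSE-optimal centroids for a (rotated, hence near-Gaussian) weight distribution. This is an illustrative one-dimensional sketch under that Gaussian assumption, not the project's implementation:

```python
import numpy as np

def lloyd_max(samples: np.ndarray, levels: int = 8, iters: int = 25) -> np.ndarray:
    """Lloyd's algorithm: MSE-optimal scalar centroids (levels=8 for INT3)."""
    # warm start at the distribution's evenly spaced quantiles
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # assign each sample to its nearest centroid, then move centroids to cell means
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return centroids

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)   # stand-in for rotated weights (Observation 2)

cent = lloyd_max(w)
q = cent[np.abs(w[:, None] - cent[None, :]).argmin(axis=1)]
mse_lloyd = np.mean((w - q) ** 2)

# naive uniform 3-bit grid over the same range, for comparison
grid = np.linspace(w.min(), w.max(), 8)
qn = grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]
mse_naive = np.mean((w - qn) ** 2)
```

Because the grid must stretch to cover the extreme samples, the uniform quantizer wastes levels in the tails; the Lloyd-Max centroids concentrate where the Gaussian mass is.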

KV cache (online): K vectors are compressed via product quantization into 32 codebook indices (34 bytes per 128-dim vector). A fused Metal kernel decodes PQ codes, applies RoPE, and computes attention in a single pass — no intermediate tensor materialized.
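The encode/decode round trip can be sketched as follows. The codebooks here are random stand-ins (the real ones are learned offline per subspace); the split into 32 four-dimensional subspaces with 256 codewords each follows the text:

```python
import numpy as np

D, SUB = 128, 4          # head_dim and subspace dimension
M = D // SUB             # 32 subspaces -> 32 one-byte indices per K vector
K_CODES = 256            # codewords per subspace

rng = np.random.default_rng(0)
# Stand-in codebooks; real ones are learned offline per subspace.
codebooks = rng.standard_normal((M, K_CODES, SUB))

def pq_encode(k_vec: np.ndarray) -> np.ndarray:
    """L2 nearest-neighbor search per 4-dim subspace -> 32 uint8 indices."""
    sub = k_vec.reshape(M, 1, SUB)
    dists = ((sub - codebooks) ** 2).sum(axis=-1)   # (M, 256) squared distances
    return dists.argmin(axis=-1).astype(np.uint8)

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Table lookup: reassemble the 128-dim vector from 32 codewords."""
    return codebooks[np.arange(M), codes].reshape(D)

k = rng.standard_normal(D)
codes = pq_encode(k)     # 32 bytes of index payload
k_hat = pq_decode(codes)
# 32 indices * 8 bits / 128 dims = 2.0 bits/dim; per-vector metadata brings it to ~2.1
```

The fused Metal kernel performs the `pq_decode` lookup inline during attention rather than materializing `k_hat` as a tensor.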

Custom Metal Kernels

Spiral includes purpose-built GPU kernels for Apple Silicon:

  • Fused flash attention with inline PQ decode — one kernel launch for codebook lookup + RoPE + Q·K scoring + softmax + V accumulation. Reduces compute buffer from 2 GB (graph-level decode) to 304 MB. RoPE frequency base is parameterized from the GGUF (supports 10K for Qwen2.5, 10M for Qwen3).
  • Multi-pass Walsh-Hadamard rotation — seeded random orthonormal transform at O(d log d) per token, matching rotated weight basis. Adapts to any dimension (768, 2048, 3584, 4096, 18944).
  • Online PQ encode — compresses incoming K vectors to codebook indices during inference using L2 nearest-neighbor search.
  • MoE expert dispatch — rotation applied before expert gate/up projections and before down projections inside the MoE FFN, with type-guarded checks so non-Spiral models are unaffected.

Performance

Measured on Apple M2 Pro (16 GB):

| Mode | Decode | Prefill |
|---|---|---|
| F16 KV | 29 tok/s | 140 tok/s |
| PQ KV | 19 tok/s | 190 tok/s |

Install

```shell
brew install reinforceai/spiral/spiral
```

Quick Start

```shell
spiral-chat                               # interactive chat
spiral-chat --prompt "explain quicksort"  # single response
spiral-serve --port 8080                  # OpenAI-compatible API
```

Available Models

| Model | Size | Base | Architecture | Min RAM |
|---|---|---|---|---|
| qwen-25-7b-spiral | 3.02 GB | Qwen2.5-Coder-7B-Instruct | Dense | 8 GB |
| qwen3-coder-30b-spiral | 11.61 GB | Qwen3-Coder-30B-A3B-Instruct | MoE (128 experts, 8 active) | 24 GB |

```shell
spiral-chat --model qwen-25-7b-spiral
spiral-download --model qwen-25-7b-spiral
```

Compression Breakdown

Per-component quality cost:

Qwen2.5-Coder-7B (dense, 3.02 GB):

| Component | Method | Compression | Quality Cost |
|---|---|---|---|
| Weights | Rotated Lloyd-Max INT3 | 4.2× | +0.141 nats |
| KV cache (K) | Natural-space PQ INT2 | 7.1× | +0.090 nats |
| Embeddings | Asymmetric affine INT4 | 4.0× | +0.017 nats |
| Full pipeline | | 4.8× model, 7.1× KV | +0.184 nats |

Qwen3-Coder-30B-A3B (MoE, 11.61 GB):

| Component | Method | Compression | Quality Cost |
|---|---|---|---|
| Weights (12,480 matrices) | Rotated Lloyd-Max INT3 | 5.3× | ~+0.16 nats† |
| KV cache (K) | Natural-space PQ INT2 | 7.1× | ~+0.07 nats† |
| Embeddings | Asymmetric affine INT4 | 4.0× | ~+0.02 nats† |
| Full pipeline | | 5.3× model, 7.1× KV | +0.228 nats |

†Per-component estimates based on 7B component ratios. End-to-end gap (+0.228) is measured directly.

The same compression physics applies to both dense and MoE architectures. Each expert's weight matrix is compressed independently — the rotation adapts to any input dimension (768, 2048, 4096). Router weights stay at fp16 for full-precision expert selection.

Acknowledgments

Spiral builds on open-source foundations:

  • llama.cpp by Georgi Gerganov — inference engine, GGUF format, Metal backend. Spiral's deployment infrastructure inherits directly from this project.

  • TurboQuant by Eric Kryski — fused asymmetric attention kernels and two-pass flash attention on Metal. The TurboFlash architecture directly inspired Spiral's fused PQ attention kernel.

  • llama-cpp-turboquant by TheTom — llama.cpp integration of TurboQuant, providing the foundation for Spiral's Metal kernel dispatch, GGUF type registration, and graph-level quantized inference pipeline.

  • Qwen Team — Qwen2.5-Coder under Apache 2.0.

  • The broader open-source ML community — researchers contributing to quantization theory (GPTQ, AWQ, QuIP#, AQLM), rotation methods (QuIP, SliceGPT, SpinQuant), and product quantization (Jégou et al., 2011) laid the groundwork that Spiral builds upon.

This work would not be possible without the remarkable researchers and engineers who contribute to open source.

Citation

```bibtex
@misc{spiral2026,
  title={Spiral: Geometric Compression of Rotated Transformers},
  author={Deshwal, Viraj},
  year={2026},
  publisher={ReinforceAI},
  url={https://github.com/ReinforceAI/spiral}
}
```

License

  • Inference engine: based on llama.cpp (MIT)
  • Spiral compression framework: ReinforceAI
  • Model weights: subject to the base model license (e.g., Apache 2.0 for Qwen2.5-Coder)
