SVSK: Structured Vector Sidecar


Overview

| | |
| --- | --- |
| **What** | Q4 quantization + low-rank sidecar for LLMs |
| **Budget** | ~4.64 bpw at rank 16 |
| **Quality** | Better ΔNLL than Q4_K_M GGUF (on Qwen3-4B) |
| **Speed** | 34 tok/s on RTX PRO 4000 (Qwen3-4B, alpha runtime, Triton kernel) |
| **Status** | Research freeze, runtime in progress |

SVSK is a post-training quantization method for LLMs that keeps a strong 4-bit base and adds a small tile-local low-rank sidecar to recover the most harmful quantization residual.

Hello! SVSK starts from a simple observation: 4-bit quantization is already good enough to make modern LLMs practical on local hardware, but it still damages the model in uneven ways. Some quantization errors barely matter. Others hit channels and directions that are heavily used during inference and cause visible quality loss. The goal of SVSK is not to replace 4-bit quantization with a large adapter, nor to hide the problem behind a heavy model-specific tuning pipeline. The goal is narrower: keep the model in a Q4-class storage budget (or close to Q4-class, like 4.5 bpw), but recover part of the error that actually matters for the layer output.

The current method is $\text{AA-NativeQ4} + \text{SVSK-}r$,

where r is the sidecar rank. Higher rank gives more correction capacity, but also increases storage and compute linearly.
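To make the shape of the method concrete, here is a minimal NumPy sketch of the reconstruction idea: a stored tile is a packed Q4 base plus small rank-r factors, and the dequantized weight is the base plus the low-rank correction. The names, layout, and the symmetric dequantization rule here are illustrative assumptions, not the repo's actual storage format (which also carries fp16 gain vectors and int8 factor scales):

```python
import numpy as np

def dequant_tile(q4_codes, scales, U, V, block_size=64):
    """Illustrative sketch (NOT the actual SVSK container layout):
    reconstruct one weight tile as Q4 base + rank-r sidecar.

    q4_codes : (m, n) uint8 codes in [0, 15]
    scales   : (m, n // block_size) per-block fp16 scales
    U, V     : (m, r) and (n, r) sidecar factors, already rescaled to float
    """
    # Dequantize the 4-bit base: center the code, apply per-block scales.
    per_col_scales = np.repeat(scales.astype(np.float32), block_size, axis=1)
    base = (q4_codes.astype(np.float32) - 8.0) * per_col_scales
    # Add the low-rank correction fitted to the quantization residual.
    return base + U.astype(np.float32) @ V.astype(np.float32).T
```

Both the extra storage and the extra compute of the sidecar grow linearly with r, which is why rank is the main knob.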


Why SVSK Exists

Most practical 4-bit formats focus on representing weights compactly. That is necessary, but it is not the full problem. A neural layer does not use weights in isolation; it uses them through matrix multiplication, y = Wx. This distinction is important. A weight error in a rarely used input channel may have little effect, while the same numerical error in a high-traffic channel can produce much larger output distortion.

SVSK uses this idea twice. First, it builds a better 4-bit base with activation-aware clipping. This is the AA-NativeQ4 part. Second, it looks at the remaining residual and fits a compact low-rank correction locally, tile by tile. This is the Structured Vector Sidecar part. The result is still a Q4-class representation, but with a small structured correction that targets the quantization error more directly.

SVSK is a PTQ method. It does not train the model, and it does not introduce a large LoRA-style adapter. It also does not try to specialize the model for a specific benchmark or prompt distribution.
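One common way to operationalize "high-traffic channels matter more" in activation-aware PTQ is to weight the residual by per-channel second moments of the calibration activations. The snippet below is a generic sketch of that idea, not SVSK's exact formulation (the build flags suggest an importance exponent alpha and clipping, both omitted here):

```python
import numpy as np

def channel_importance(calib_acts):
    """Per input channel j: E[x_j^2] over calibration tokens.
    calib_acts: (num_tokens, in_features) activations."""
    return (calib_acts.astype(np.float32) ** 2).mean(axis=0)

def weighted_error(W, W_q, importance):
    """Importance-weighted quantization error: errors in heavily used
    input channels count more, mirroring the size of ||(W - W_q) x||."""
    return float(np.sum((W - W_q) ** 2 * importance[None, :]))
```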


Design Philosophy

SVSK is intentionally constrained. The method should not win by inflating the model. It should not rely on secret per-model tuning. It should not require a large dense adapter that destroys the memory advantage of 4-bit quantization. The main design rules (hardcoded into both the idea and the implementation) are:

  1. Stay close to a Q4 memory budget;
  2. Use calibration activations, not training;
  3. Make rank the main quality/storage/compute knob;
  4. Avoid model-specific overfitting.

This is why the project focuses on bounded post-training quantization rather than fine-tuning. The calibration set is used to measure how the model uses its channels, not to train new behavior into the model.


Math

I will show all of the main math here, but if you need more detail, please let me know.
NB! This picture was generated by ChatGPT, because I don't like using LaTeX :)
(image: main goal description)
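The equation image itself is not reproduced here, so the following is a hedged reconstruction of the per-tile sidecar objective from the surrounding description (importance-weighted residual, rank-r factors, ridge regularization via `--ridge-ratio`); the actual formula in the picture may differ:

```latex
% Hedged reconstruction -- NOT copied from the repo's equation image.
% W        : fp weight tile (m x n)
% \hat{W}  : its AA-NativeQ4 dequantization
% d_j      : calibration importance of input channel j (e.g. E[x_j^2]^\alpha, clipped)
% \lambda  : ridge term, cf. --ridge-ratio
\min_{U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r}}
  \Bigl\| \bigl( W - \hat{W} - U V^{\top} \bigr)\, \mathrm{diag}(d)^{1/2} \Bigr\|_F^2
  + \lambda \bigl( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \bigr)
```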


Storage Budget

With the current reference settings, SVSK stays in a Q4-class storage regime.

For example, with:

  • block_size = 64
  • tile_m = 512
  • tile_n = 1024
  • packed Q4 base weights
  • int8 U/V sidecar factors
  • small fp16 gain vectors

the estimated effective bpw is approximately:

AA-NativeQ4 base: 4.25 bpw
AA-NativeQ4 + SVSK r8: 4.44 bpw
AA-NativeQ4 + SVSK r16: 4.6 bpw

These figures do not include embeddings or activations. So the method is not winning by turning a 4-bit model into a hidden 6-bit or 8-bit model; it remains close to the intended Q4-class budget. For Qwen3-4B, the current standalone SVSK artifact is already in the same practical storage class as strong Q4 GGUF baselines. The exact runtime memory footprint still depends on the loader and execution path, but the artifact size itself is not the main blocker.
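As a sanity check on these numbers, here is a back-of-the-envelope calculator under the assumptions listed above (packed 4-bit codes with one fp16 scale per 64-value block, int8 U/V per tile). It ignores the small fp16 gain vectors, which is presumably why it lands at 4.625 for r16 rather than the 4.64 quoted in the overview:

```python
def effective_bpw(rank, block_size=64, tile_m=512, tile_n=1024):
    """Estimated bits per weight: Q4 base + int8 low-rank sidecar.
    Ignores the small fp16 gain vectors."""
    base = 4 + 16 / block_size                 # 4-bit code + fp16 scale per block
    sidecar_params = rank * (tile_m + tile_n)  # int8 U (m x r) and V (n x r)
    sidecar = sidecar_params * 8 / (tile_m * tile_n)
    return base + sidecar

for r in (0, 8, 16):
    print(f"rank {r:2d}: {effective_bpw(r):.4f} bpw")
# rank  0: 4.2500 bpw
# rank  8: 4.4375 bpw
# rank 16: 4.6250 bpw
```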


Tests and results

Environment:
CPU: Ryzen 9 7945HX
RAM: 2x32GB DDR5 SO-DIMM 5200MHz Crucial
GPU: RTX PRO 4000 Blackwell
SSD: 1TB Kingston Fury Renegade

OK, I have run some tests of SVSK on the following models.

  1. Qwen3-4B - the most representative validated model for the method. It has been tested through the HF / PyTorch / SVSK stack and compared against practical llama.cpp GGUF baselines. By the way, the most important point is that raw PPL values across different stacks should not be compared directly: HF/PyTorch and llama.cpp can produce different absolute reference PPL even for the same model. The safer comparison is local degradation relative to each stack's own baseline.
  2. Qwen3.5-9B - a multimodal local LLM for home tasks. It has been tested in the same way as Qwen3-4B.

And the results:

Qwen3-4B local degradation comparison

| Variant | Stack | Baseline | Avg NLL baseline | Avg NLL variant | ΔNLL | PPL baseline | PPL variant | PPL ratio | Eval setup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVSK dense-restore | HF / SVSK | FP reference | 2.8273395039 | 2.8551430255 | 0.0278035216 | 16.9004374112 | 17.3769223732 | 1.0281936467 | 128 × 512 |
| Q4_K_M GGUF | llama.cpp | BF16 GGUF | 2.6350851653 | 2.6672282066 | 0.0321430413 | 13.9445 | 14.4000 | 1.0326652085 | 128 × 512 |
| Q4_K_XL GGUF | llama.cpp | BF16 GGUF | 2.6350851653 | 2.6714899458 | 0.0364047805 | 13.9445 | 14.4615 | 1.0370755495 | 128 × 512 |

On this comparison, SVSK dense-restore shows lower local NLL degradation than the tested Q4_K_M and Q4_K_XL GGUF baselines. This is a meaningful quality signal, but it should not be overstated. It proves that the reconstructed SVSK weights are competitive in the offline evaluation stack. It does not yet prove that the compressed runtime is faster, smaller, or production-ready. The correct interpretation is: SVSK has already reached the quality band where it is worth engineering further. The next bottleneck is no longer the basic mathematical idea. The next bottleneck is runtime execution.
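A quick sanity check on the table, and on why ΔNLL is the stack-safe metric: perplexity is just exp(avg NLL), so the PPL ratio within one stack equals exp(ΔNLL) and cancels the stack's absolute baseline offset. The numbers below are taken directly from the table:

```python
import math

nll_base, nll_svsk = 2.8273395039, 2.8551430255   # HF / SVSK row above
d_nll = nll_svsk - nll_base

print(math.exp(nll_base))   # 16.9004... = PPL baseline
print(math.exp(nll_svsk))   # 17.3769... = PPL variant
print(math.exp(d_nll))      # 1.02819... = PPL ratio, comparable across stacks
```

This is why the tables compare ΔNLL and PPL ratio within each stack rather than raw PPL across stacks.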


Qwen3.5-9B Results

The intended final comparison is:

  • BF16 / FP16 reference;
  • GGUF Q4_K_M;
  • SVSK r16.

This should be evaluated not only on perplexity, but also on task-level behavior. In particular, code-generation tasks are more useful than another isolated offline table, because the main practical question is whether the NLL improvement survives in real outputs.

Qwen3.5-9B benchmark table

| Variant | Stack | Baseline | Avg NLL baseline | Avg NLL variant | ΔNLL | PPL baseline | PPL variant | PPL ratio | Eval setup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVSK r16 dense-restore | HF / SVSK | FP reference | 2.3243889492 | 2.3415247332 | 0.0171357840 | 10.2204329736 | 10.3970772524 | 1.0172834438 | 128 × 512 token chunks |
| Q4_K_M GGUF | llama.cpp | BF16 GGUF | 2.0828856041 | 2.0837572157 | 0.0008716117 | 8.0276 | 8.0346 | 1.0008719916 | 128 chunks, wiki.test.raw |

The results for Q4_K_M GGUF look very suspicious, but I am leaving them here for reference.


How to test SVSK yourself

This repo contains all of the scripts you need to test the SVSK quantization method. One remark: this is not a production-ready system. It does not have ultra-fast inference like llama.cpp, but the speed is enough for testing. Right now it runs on small Triton kernels; there is not even a CUDA implementation yet. All of the pipeline steps below use Qwen3-4B as an example.

First, clone this repo to a local folder on your PC:

git clone https://github.com/Dookoo2/SVSK.git

Install requirements

pip install -r requirements.txt

Then download the wikitext2 evaluation and calibration set

hf download Salesforce/wikitext \
  --repo-type dataset \
  --include "wikitext-2-raw-v1/*" \
  --local-dir /path/to/your/dir/wikitext2_raw_hf

Next, create the .arrow files. Just copy/paste this command.

python3 - <<'PY'
from datasets import load_dataset

ds = load_dataset(
    "Salesforce/wikitext",
    "wikitext-2-raw-v1",
)

print("Train:")
for f in ds["train"].cache_files:
    print(f["filename"])

print("\nValidation:")
for f in ds["validation"].cache_files:
    print(f["filename"])

print("\nTest:")
for f in ds["test"].cache_files:
    print(f["filename"])
PY

And you will get output like this

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Train:
/path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-train.arrow
Validation:
/path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-validation.arrow
Test:
/path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-test.arrow

After that, build the evaluation and calibration sets

mkdir -p /path/to/your/dir/Calibration_end_evaluation_texts

python3 build_wikitext_token_chunks.py \
  --arrow-file /path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-train.arrow \
  --tokenizer /path/to/your/dir/Qwen3-4B \
  --out-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_train_512tok_qwen3_4b.pt \
  --tokens-per-chunk 512 \
  --max-chunks 128
  
python3 build_wikitext_token_chunks.py \
  --arrow-file /path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-validation.arrow \
  --tokenizer /path/to/your/dir/Qwen3-4B \
  --out-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt \
  --tokens-per-chunk 512 \
  --max-chunks 128

Next, get the PPL of the reference model, Qwen3-4B fp16

python3 eval_perplexity_rewritten_aa_progressive_fixed.py \
  --model /path/to/your/dir/Qwen3-4B \
  --variant fp_reference \
  --device cuda \
  --dtype float32 \
  --calib-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_train_512tok_qwen3_4b.pt \
  --eval-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt \
  --seq-len 512 \
  --eval-samples 128 \
  --out-dir /path/to/your/dir/fp_reference_qwen3_4b_eval128

And you will get something like

=== Full-model perplexity summary ===
{
  "source": "/path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt",
  "source_type": "token_chunks_pt",
  "eval_samples_requested": 128,
  "eval_samples_used": 128,
  "eval_token_subsample": 4096,
  "seq_len": 512,
  "dataset_num_chunks": 128,
  "dataset_meta": {
    "arrow_file": "/path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-validation.arrow",
    "tokenizer": "/path/to/your/dir/Qwen3-4B",
    "tokens_per_chunk": 512,
    "num_chunks": 128,
    "total_tokens": 65536
  },
  "avg_nll": 2.8273521177470684,
  "perplexity": 16.900650591847338,
  "n_pred_tokens": 65408,
  "n_eval_sequences": 128
}

Next, quantize Qwen3-4B from the reference to Q4 format with the SVSK r16 algorithm. The command below gives the full set of parameters I found optimal for my quantization tool. Afterwards you will get a .svsk container file, for example: svsk_artifact_qwen3_4B_embed8k_mixed.svsk

python3 svsk_build.py \
  --model /path/to/your/dir/Qwen3-4B \
  --calib-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_train_512tok_qwen3_4b.pt \
  --out-dir /path/to/your/dir/svsk_artifact_qwen3_4B_embed8k_mixed \
  --device cuda \
  --dtype float32 \
  --extra-dtype float16 \
  --quant-backend cuda \
  --quant-execution layer_stream \
  --capture-mode block_stream \
  --capture-block-size-layers 1 \
  --quant-workspace-mib 1024 \
  --fit-samples 16 \
  --fit-token-subsample 4096 \
  --seq-len 512 \
  --block-size 64 \
  --aa-importance-alpha 0.5 \
  --aa-importance-clip-min 0.25 \
  --aa-importance-clip-max 4.0 \
  --aa-clip-grid 1.00,0.95,0.90,0.85,0.80,0.75,0.70 \
  --tile-m 1024 \
  --tile-n 2048 \
  --rank 16 \
  --ridge-ratio 0.0001 \
  --svd-damping 0.0 \
  --layers-prefix model.layers. \
  --embed-quant-mode mixed_int8 \
  --embed-hot-rows 8000 \
  --embed-group-size 64

After some time (about an hour) you can get the PPL of the quantized model

python3 svsk_restore_eval.py \
  --artifact-dir /path/to/your/dir/svsk_artifact_qwen3_4B_embed8k_mixed \
  --eval-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt \
  --out-dir /path/to/your/dir/eval_qwen3_49b_dense_restore \
  --device cuda \
  --dtype float16 \
  --seq-len 512 \
  --eval-samples 128 \
  --trust-remote-code

And you will get something like

=== Dense-restore model perplexity summary ===
{
  "source": "/path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt",
  "seq_len": 512,
  "eval_samples_requested": 128,
  "eval_samples_used": 128,
  "dataset_num_chunks": 128,
  "dataset_meta": {
    "arrow_file": "/path/to/your/dir/.cache/huggingface/datasets/Salesforce___wikitext/wikitext-2-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext-validation.arrow",
    "tokenizer": "/path/to/your/dir/Qwen3-4B",
    "tokens_per_chunk": 512,
    "num_chunks": 128,
    "total_tokens": 65536
  },
  "avg_nll": 2.855582034215331,
  "perplexity": 17.384552668069706,
  "n_pred_tokens": 65408,
  "n_eval_sequences": 128
}

The PPL gap between the reference model (fp16) and SVSK r16 is 0.476484962. For comparison, the PPL for Q4_K_M GGUF and BF16 GGUF under llama.cpp is

/path/to/your/dir/llama.cpp/build/bin/llama-perplexity   -m /path/to/your/dir/llama.cpp/models/Qwen3-4B-BF16.gguf   --file /path/to/your/dir/llama.cpp/wikitext-2-raw/wiki.test.raw   -ngl 0   -t 16 --chunks 128
Final estimate: PPL = 14.2530 +/- 0.27491
/path/to/your/dir/llama.cpp/build/bin/llama-perplexity   -m /path/to/your/dir/llama.cpp/models/Qwen3-4B-Q4_K_M.gguf --file /path/to/your/dir/llama.cpp/wikitext-2-raw/wiki.test.raw   -ngl 0   -t 16 --chunks 128
Final estimate: PPL = 14.7501 +/- 0.28745

The PPL gap is 0.4971.

And the last step: inference of the quantized model

python3 svsk_runtime_chat.py \
  --artifact /path/to/your/dir/qwen3_4B_r16.svsk \
  --device cuda \
  --dtype float16 \
  --linear-compute-dtype float32 \
  --linear-output-dtype float16 \
  --embedding-compute-dtype float32 \
  --embedding-output-dtype float16 \
  --linear-decode-backend triton \
  --use-chat-template \
  --trust-remote-code \
  --max-new-tokens 8192
`torch_dtype` is deprecated! Use `dtype` instead!
`use_return_dict` is deprecated! Use `return_dict` instead!
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
=== SVSK Runtime Chat ===
Artifact source          : /path/to/your/dir/svsk_artifact_qwen3.5_9B_embed8k_mixed
Artifact mode            : directory
Resolved artifact dir    : /path/to/your/dir/svsk_artifact_qwen3.5_9B_embed8k_mixed
Device                   : cuda
Skeleton dtype           : float16
Requested decode backend : triton
Effective runtime mode   : triton
Linear modules on Triton : 248 / 248
Linear torch fallback    : 0
Embedding logits Triton  : 1 / 1
Embedding torch fallback : 0
Unsupported SVSKLinear   : none
Unsupported embeddings   : none
Linear compute dtype     : float32
Linear output dtype      : float16
Embedding compute dtype  : float32
Embedding output dtype   : float16
Base out chunk rows      : 1024
Base in block chunk      : 16
Cold logits chunk rows   : 32768
Use chat template        : 1
Disable thinking         : 1
Commands                 : /reset, /exit

[INFO] Model loaded in 168.21 sec

You> Write a script to calculate prime numbers from 1 to a billion using the Sieve of Eratosthenes algorithm.
Assistant> ```python
def sieve_of_eratosthenes(limit):
    if limit < 2:
        return []
    
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    
    for number in range(2, int(limit ** 0.5) + 1):
        if is_prime[number]:
            for multiple in range(number * number, limit + 1, number):
                is_prime[multiple] = False
    
    primes = [num for num in range(2, limit + 1) if is_prime[num]]
    return primes

# Calculating prime numbers from 1 to 1 billion
primes_up_to_1b = sieve_of_eratosthenes(1_000_000_000)
print(f"The number of prime numbers up to 1 billion: {len(primes_up_to_1b)}")

You can use the same pipeline to quantize Qwen3.5-9B and other models. Inference speed on my system is about 34 tok/s: not that fast, but this is only an alpha version and an MVP of the algorithm.

Autotune script

Also, I have added an autotune script for finding the best parameters for svsk_build. It works very simply:

python3 svsk_autotune.py \
  --model /path/to/your/dir/Qwen3-4B \
  --calib-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_train_512tok_qwen3_4b.pt \
  --eval-token-chunks-pt /path/to/your/dir/Calibration_end_evaluation_texts/wikitext2_valid_128x512_qwen3_4b.pt \
  --out-dir /path/to/your/dir/svsk_artifact_qwen3_4B_embed8k_mixed_autotune \
  --device cuda \
  --dtype float32 \
  --seq-len 512 \
  --fit-samples 16 \
  --fit-token-subsample 4096 \
  --proxy-eval-samples 4 \
  --tune-block-count 6 \
  --search-profile narrow12 \
  --max-candidates 12 \
  --top-k-mini-ppl 3 \
  --layers-prefix model.layers. \
  --block-size 64 \
  --tile-m 1024 \
  --tile-n 2048 \
  --rank 16 \
  --embed-quant-mode mixed_int8 \
  --embed-hot-rows 8000 \
  --embed-group-size 64 \
  --quant-backend cuda \
  --quant-execution layer_stream \
  --capture-mode block_stream \
  --trust-remote-code \
  --resume

For me it worked worse (the PPL was worse than with the hand-picked parameters above), but I have tested it on only one model.

Roadmap

| Milestone | Status |
| --- | --- |
| AA-NativeQ4 base quantizer | ✅ Done |
| SVSK sidecar fitting | ✅ Done |
| Dense-restore evaluation | ✅ Done |
| Standalone artifact format | ✅ Done |
| Compressed runtime (Triton) | 🔄 Alpha (34 tok/s) |
| Robust runtime PPL harness | ⏳ Planned |
| CUDA kernels | ⏳ Planned |
| llama.cpp integration | ⏳ Planned |
