
kakeyaturbo: monomorphic Rust implementation of the KakeyaTurbo RDO codec with ≥99.76% UT coverage #4

Merged: cursor[bot] merged 2 commits into main from cursor/kakeyaturbo-rust-12f5 on Apr 18, 2026

@FluffyAIcode
Owner

Summary

Rust crate at kakeyaturbo/ implementing the KakeyaTurbo codec designed in the earlier conversation rounds: a single monomorphic encode_block / decode_block kernel parametrised by

  • a compile-time type parameter R: Distortion (the loss ρ)
  • a runtime &[f32] weights array w
  • a runtime CodecParams struct (variance_ratio, K, bit_width, rotation_seed)

No plugins, no dyn, no extension points. Every (R, n, d, d_eff, K, B) combination compiles to its own specialised machine-code function.
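
The monomorphisation idea can be illustrated with a minimal sketch. The trait below is a hypothetical stand-in (the real `Distortion` trait's methods are not shown in this PR), and `Mse`, `LInf` and `nearest` are illustrative names only:

```rust
/// Hypothetical stand-in for the crate's `Distortion` trait; the real
/// trait's methods and signatures are not shown in this PR.
pub trait Distortion {
    /// Per-scalar loss between an original value and its reconstruction.
    fn d(a: f32, b: f32) -> f32;
}

/// Zero-sized marker types: they carry no data, only a type identity.
pub struct Mse;
pub struct LInf;

impl Distortion for Mse {
    #[inline(always)]
    fn d(a: f32, b: f32) -> f32 {
        (a - b) * (a - b)
    }
}

impl Distortion for LInf {
    #[inline(always)]
    fn d(a: f32, b: f32) -> f32 {
        (a - b).abs()
    }
}

/// Index of the centroid that minimises `R::d` against `x`.
/// The compiler emits one copy of this function per `R`, with `R::d`
/// resolved statically, so there is no vtable and no runtime dispatch.
pub fn nearest<R: Distortion>(x: f32, centroids: &[f32]) -> usize {
    let mut best = 0;
    let mut best_cost = f32::INFINITY;
    for (i, &c) in centroids.iter().enumerate() {
        let cost = R::d(x, c);
        if cost < best_cost {
            best = i;
            best_cost = cost;
        }
    }
    best
}
```

Calling `nearest::<Mse>(..)` and `nearest::<LInf>(..)` yields two separate specialised functions, which is the property the summary relies on.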

Modules

File               Responsibility                                          LOC
---------------------------------------------------------------------------------
src/lib.rs         entry + design-contract doc + re-exports                54
src/distortion.rs  Distortion trait + MSE, InnerProduct, LInf ZSTs         280
src/wht.rs         Walsh-Hadamard transform + seeded sign flips            307
src/quantize.rs    Lloyd-Max codebooks for N(0,1) + LSB-first bit packing  355
src/pca.rs         weighted PCA truncated at d_eff                         371
src/kmeans.rs      weighted spherical K-means with sign-aware update       436
src/skeleton.rs    block-level metadata container                          97
src/codec.rs       the single encode_block / decode_block kernel           495
---------------------------------------------------------------------------------
Total library code                                                         ~2 400
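
As a rough illustration of what src/quantize.rs is described as doing, the sketch below pairs a nearest-centroid quantiser with LSB-first bit packing. The 2-bit levels are the commonly tabulated approximate Lloyd-Max values for N(0,1); the function names and the exact byte layout are assumptions, not the crate's actual API:

```rust
/// Approximate 2-bit (4-level) Lloyd-Max reconstruction levels for N(0,1).
/// Textbook values; the crate's own tables may differ in precision.
const LLOYD_MAX_2BIT: [f32; 4] = [-1.5104, -0.4528, 0.4528, 1.5104];

/// Nearest-centroid index by absolute distance (ties go to the lower index).
fn quantize(x: f32, codebook: &[f32]) -> u8 {
    let mut best = 0usize;
    for i in 1..codebook.len() {
        if (x - codebook[i]).abs() < (x - codebook[best]).abs() {
            best = i;
        }
    }
    best as u8
}

/// Pack `bits`-wide codes LSB-first: code 0 occupies the lowest bits of byte 0.
fn pack(codes: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits), "bit width must be in 1..=8");
    let width = bits as usize;
    let mut out = vec![0u8; (codes.len() * width + 7) / 8];
    for (i, &c) in codes.iter().enumerate() {
        assert!(u32::from(c) < (1u32 << bits), "code does not fit in bit width");
        for b in 0..width {
            if (c >> b) & 1 == 1 {
                let pos = i * width + b;
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

/// Inverse of `pack`: read back `n` codes of `bits` bits each.
fn unpack(bytes: &[u8], bits: u32, n: usize) -> Vec<u8> {
    let width = bits as usize;
    (0..n)
        .map(|i| {
            let mut c = 0u8;
            for b in 0..width {
                let pos = i * width + b;
                c |= ((bytes[pos / 8] >> (pos % 8)) & 1) << b;
            }
            c
        })
        .collect()
}
```

With this layout, the codes `[0, 1, 2, 3, 3]` at 2 bits pack into the two bytes `0xE4, 0x03`, and `unpack` recovers them exactly.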

Quality gates

Gate                               Status
-----------------------------------------------------------------------------------------
cargo build                        ✅ clean
cargo test                         136 passing (125 unit + 5 integration + 6 proptest)
cargo clippy --all-targets         ✅ 0 errors, remaining warnings are style preferences in numerical loops
cargo llvm-cov line coverage       99.76 % — see table below
cargo llvm-cov function coverage   100.00 %
#![forbid(unsafe_code)]            ✅ enforced at crate root
grep -rn "dyn " src/               ✅ zero occurrences (comments/test-names aside)
grep -rn "Box<" src/               ✅ zero occurrences

Per-module coverage (cargo llvm-cov --summary-only)

Filename      Regions  Missed    Cover | Functions Missed Executed | Lines Missed   Cover
-----------------------------------------------------------------------------------------------
codec.rs          180       8   95.56% |      45       0  100.00% |   445      2   99.55%
distortion.rs      87       2   97.70% |      24       0  100.00% |   113      0  100.00%
kmeans.rs         189       5   97.35% |      35       0  100.00% |   356      0  100.00%
pca.rs            150       7   95.33% |      26       0  100.00% |   300      1   99.67%
quantize.rs       150       8   94.67% |      36       0  100.00% |   239      1   99.58%
skeleton.rs        15       0  100.00% |       7       0  100.00% |    49      0  100.00%
wht.rs            115       2   98.26% |      33       0  100.00% |   181      0  100.00%
-----------------------------------------------------------------------------------------------
TOTAL             886      32   96.39% |     206       0  100.00% |  1683      4   99.76%

The only uncovered source lines are 2 assert!(cond, "...") format-string messages inside test code (by definition unreachable while tests pass):

  • src/pca.rs:338: "captured variance not monotone: prev={prev} new={}"
  • src/quantize.rs:228: "MSE must decrease with more bits: prev={prev} new={mse}..."

Together with continuation lines, these account for the 4 uncovered line slots. Every line of production code is covered.

Test inventory

Unit tests (125)

  • distortion (17): closed-form values, symmetry, non-negativity, numerical-gradient match, Huber-continuity at the breakpoint, NORM_MODE correctness, zero-size verification.
  • wht (17): size-1 identity, explicit H₂/H₄ comparison, WHT² = N·I, power-of-2 panic paths, deterministic sign patterns, seed-determinism, zero-input, linearity, rotate/inverse round-trip, seed mismatch detection.
  • quantize (25): centroid count/symmetry/ordering for each bit-width, nearest-centroid selection, out-of-range clamping, MSE monotonicity in bits, metric-agreement (MSE vs IP on the scalar level), bit-pack round-trip, byte layout check, error paths.
  • pca (19): weighted mean with uniform/zero/skewed weights, 2D-ellipse major-axis recovery, variance-ratio truncation, round-trip under full rank, constant-data handling, variance-monotonicity, weighted emphasis, zero-weight-row skip, NaN/out-of-range ratio clipping, dimension mismatch panics.
  • kmeans (21): k = n identity, unit-norm centres, seed determinism, two-cluster recovery, zero-norm row skip, zero-weight-row skip, misshaped input panics, assign_and_project correctness (including anti-aligned input), residual subtraction, centre view helpers.
  • skeleton (3): byte-size accounting, dim accessors, clone equivalence.
  • codec (23): MSE / InnerProduct / LInf round-trips, 4-bit vs 1-bit MSE ordering, weight-driven PCA rotation, shape preservation, byte accounting, determinism, seed-variation, 7 panic paths, zero-coefficient edge case, all-zero-block handling, next_pow2/pad_zero/l2_norm helpers.
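
Two of the wht properties listed above (WHT² = N·I and the rotate/inverse round-trip) can be sketched as follows. This is a minimal illustration under stated assumptions: the sign pattern here comes from a plain xorshift32 generator as a stand-in for the crate's `SmallRng`-based pattern, and the function names are hypothetical:

```rust
/// In-place unnormalised fast Walsh-Hadamard transform.
/// Panics unless the length is a power of two.
fn wht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "WHT length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in x.chunks_mut(2 * h) {
            let (lo, hi) = block.split_at_mut(h);
            for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
                let (s, d) = (*a + *b, *a - *b);
                *a = s;
                *b = d;
            }
        }
        h *= 2;
    }
}

/// Deterministic ±1 pattern from a seed (xorshift32; a stand-in, so the
/// signs will not match what the crate generates for the same seed).
fn signs(seed: u32, n: usize) -> Vec<f32> {
    let mut s = seed | 1; // avoid the all-zero xorshift state
    (0..n)
        .map(|_| {
            s ^= s << 13;
            s ^= s >> 17;
            s ^= s << 5;
            if s & 1 == 0 { 1.0 } else { -1.0 }
        })
        .collect()
}

/// Forward rotation: flip signs, transform, scale by 1/sqrt(N).
fn rotate(x: &mut [f32], seed: u32) {
    let sg = signs(seed, x.len());
    for (v, s) in x.iter_mut().zip(&sg) {
        *v *= *s;
    }
    wht(x);
    let scale = 1.0 / (x.len() as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Inverse: the scaled WHT is its own inverse, then undo the sign flips.
fn inverse(x: &mut [f32], seed: u32) {
    wht(x);
    let scale = 1.0 / (x.len() as f32).sqrt();
    let sg = signs(seed, x.len());
    for (v, s) in x.iter_mut().zip(&sg) {
        *v *= scale * *s;
    }
}
```

Applying `wht` twice to `[1, 2, 3, 4]` gives `[4, 8, 12, 16]`, that is N times the input, and `inverse` after `rotate` with the same seed returns the original vector up to floating-point rounding.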

Integration tests (5)

Real end-to-end blocks: 64×32 MSE reconstruction, compression-ratio bound, weighted-row priority, inner-product preservation, multi-shape robustness.

Property tests (6, via proptest)

Random blocks + weights: shape invariance, determinism, finite reconstruction, all-bit-width handling, all-K handling, uniform-weight-scale invariance.

Design contract verification

The philosophy "one kernel, one code path, no dispatch" is grep-verifiable:

grep -rn "dyn "  src/     # → 0 matches
grep -rn "Box<"  src/     # → 0 matches
grep -rn "unsafe " src/   # → 0 matches (trailing space skips #![forbid(unsafe_code)])

The design-contract block in src/lib.rs makes this explicit, and #![forbid(unsafe_code)] at the crate root turns any use of unsafe into a compile error.

Example

use kakeyaturbo::{encode_block, decode_block, CodecParams, MSE, InnerProduct, LInf};

let params = CodecParams {
    variance_ratio: 0.95, k: 8, bit_width: 3,
    rotation_seed: 0xCAFE_BABE, kmeans_max_iter: 32,
};

// Same function, three different specialised machine-code emissions:
let (sk, codes) = encode_block::<MSE>(&block, &weights, d, &params);          // V cache, time series
let (sk, codes) = encode_block::<InnerProduct>(&block, &weights, d, &params); // K cache, retrieval
let (sk, codes) = encode_block::<LInf>(&block, &weights, d, &params);         // scientific data

// Boundary-V / sparse-V / attention-weighted: just pass different weights,
// no codec changes needed — this is the L1–L5 unification under RDO.
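
One way to read the role of the weights, assuming they scale each row's contribution to the objective (the PR does not spell out the exact formula), is a weighted mean squared error. The helper below is hypothetical and not part of the kakeyaturbo API:

```rust
/// Weighted mean squared error over the rows of a row-major `n x d` block.
/// Hypothetical helper for illustration; not part of the kakeyaturbo API.
fn weighted_mse(orig: &[f32], recon: &[f32], weights: &[f32], d: usize) -> f32 {
    assert_eq!(orig.len(), recon.len());
    assert_eq!(orig.len(), weights.len() * d);
    let mut num = 0.0f32;
    let mut den = 0.0f32;
    for (i, &w) in weights.iter().enumerate() {
        let row = i * d..(i + 1) * d;
        // Squared error of row i, summed over its d coordinates.
        let err: f32 = orig[row.clone()]
            .iter()
            .zip(&recon[row])
            .map(|(a, b)| (a - b) * (a - b))
            .sum();
        num += w * err;
        den += w;
    }
    // An all-zero weight vector is treated as zero loss.
    if den > 0.0 { num / den } else { 0.0 }
}
```

Under this reading, a zero weight removes a row from the objective entirely and a large weight makes its error dominate, so boundary, sparse or attention-weighted variants differ only in the weights array that is passed in.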

Repro

cd kakeyaturbo
cargo test                          # 136 tests
cargo llvm-cov --summary-only       # the coverage table above
cargo llvm-cov --html --output-dir coverage && open coverage/html/index.html
cargo clippy --all-targets
grep -rn "dyn \|Box<dyn\|unsafe " src/

Environment

Rust 1.83.0 stable. Dependencies: nalgebra 0.33 (for weighted SVD), half 2.4 (fp16 scalars in Codes), rand 0.8 (SmallRng seeded sign patterns). Dev deps: approx 0.5, proptest 1.5.0.


cursoragent and others added 2 commits April 18, 2026 06:47
First three modules of the Rust monomorphic kernel for KakeyaTurbo:

- distortion: trait + zero-sized MSE / InnerProduct / LInf types.
  All methods are #[inline(always)] so R::d is inlined to raw
  arithmetic at each call site, eliminating any runtime dispatch.
  NormMode is a compile-time constant per metric.

- wht: Walsh-Hadamard transform + deterministic sign-flip pattern
  derived from a u32 seed. Rotates input to Gaussianise residuals;
  inverse is a single WHT with the same seed.

- quantize: Lloyd-Max codebooks for N(0,1) at 1..=4 bits, generic
  nearest-centroid quantiser parametrised by Distortion, and a
  bit-packing layer (LSB-first) for the 1..=8 bit range.

All three modules ship with dense unit tests: correctness of basic
math, boundary / panic cases, round-trips, and monomorphisation
contract (zero-size types, no dyn).

59 tests, all passing.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full monomorphic-kernel implementation of KakeyaTurbo following the
RDO unification design from the previous conversation rounds.

New modules:
  - pca.rs: weighted PCA truncated by variance_ratio (nalgebra SVD)
  - kmeans.rs: weighted spherical K-means with farthest-first init,
    sign-aware update (supports anti-aligned rows)
  - skeleton.rs: block-level metadata container
  - codec.rs: the single encode_block / decode_block kernel,
    parametrised by <R: Distortion> and runtime CodecParams

Tests:
  - 125 unit tests covering every public function, panic path, and
    numerical invariant
  - 5 integration tests end-to-end (realistic synthetic blocks,
    weight effects, inner-product preservation, multi-shape robustness)
  - 6 property-based tests via proptest (shape invariants, determinism,
    finiteness, uniform-scale invariance)

Test & coverage totals:
  - 136 tests passing
  - cargo llvm-cov: 99.76% line coverage (100% of production code;
    the 2 uncovered lines are assertion-failure messages inside tests
    that never fire), 100% function coverage
  - cargo clippy: 0 errors, remaining warnings are cosmetic style
    choices in numerical loops
  - grep verifies no 'dyn', no 'Box<dyn>', no 'unsafe' in src/

Also removes 765 tracked build artefacts from target/ that were
accidentally committed in the previous stub commit.

Contract verified as grep-able: see src/lib.rs design-contract
section and the distortion_trait_is_object_unsafe_as_intended test.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot marked this pull request as ready for review April 18, 2026 07:05
@cursor cursor Bot merged commit d20e13f into main Apr 18, 2026
cursor Bot pushed a commit that referenced this pull request Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:

  #1  Engine baseline shift            ~10 pp (clean-model PPL
                                        disagreement; 0.145 KL;
                                        18% top-1 disagreement)
  #2  Codec residual magnitude         ~0    (codec is engine-
                                        agnostic; mse ratio 1.01)
  #3  Noise-sensitivity curve          HF MORE sensitive per σ in
                                        linear regime; not the cause
  #4  Boundary layers already skipped  +69 pp saved by SPRINT_CLOSEOUT
                                        boundary policy
  #5  Cross-layer non-linear compound  +39 pp (joint-cell - Σ
                                        singletons over 22 quiet
                                        layers)

Localised root cause: vLLM's single-forward bf16 residual-stream
accumulation through Flash-Attention compounds per-layer codec
residuals ~39 pp above their sum, while HF eager's f32-accumulate
+ teacher-force over DynamicCache compounds them less aggressively.
Each per-layer residual is small on both engines (Phase 4 matched);
what differs is the accumulation path.

Deployment recommendations:
  1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing
     {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Delta-ppl.
  2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3
     elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused production harness); the HF per-
layer curve is left as a follow-up if someone wants to confirm
that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode deleted the cursor/kakeyaturbo-rust-12f5 branch April 23, 2026 15:52