Skip to content

feat: R&D codec bench framework — upstream sync, probes P5/P7, InferenceBackend, measurement model#189

Merged
AdaWorldAPI merged 6 commits into
mainfrom
claude/codec-rnd-bench
Apr 17, 2026
Merged

feat: R&D codec bench framework — upstream sync, probes P5/P7, InferenceBackend, measurement model#189
AdaWorldAPI merged 6 commits into
mainfrom
claude/codec-rnd-bench

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

R&D framework for codec psychometric benchmarking. Upstream sync, probe results, InferenceBackend trait, agent tooling, measurement model.

What's on this branch (9 commits)

Upstream sync

  • Stale snapshot removal — deleted AdaWorldAPI-lance-graph-d9df43b/ (182 files, 3 MB). Full audit: zero content loss, our src is a strict superset. Eliminates GitHub path confusion.
  • Cherry-pick spark_dialect.rs from upstream PR DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests) #150 — the ONE file upstream has that we didn't (107 LOC Spark SQL dialect + 293 LOC test).

Python reference headers

  • scripts/tts_inference.py and scripts/bake_hhtld_codebooks.sh now have "REFERENCE ONLY — Rust is canonical" headers pointing to the Rust equivalents.

InferenceBackend trait (crates/thinking-engine/src/inference_backend.rs)

  • Runtime-switchable dispatch across ALL codec/inference paths. Nothing killed.
  • Two classification axes: (full-path QJL vs leaf-only I8 hybrid vs passthrough) × (reconstruction vs signature vs hybrid grade).
  • 7 backend structs: Passthrough, RaBitQ, Spiral, I8Hybrid, HhtlF32, Cascade, Base17Signature.
  • Designed for the EmbedAnything runtime-addressing pattern — switch backends without killing any path.

Probe results measured on real Qwen3-TTS-0.6B

Probe Tensor Result
P5 TurboQuant k_proj [2048,1024] All 4 correction methods ρ≥0.997 at L=1; ALL collapse to ρ=0.000 by L=5. Chain kills all — variance, not bias.
P7 PolarQuant HIP k_proj [1024,1024] PolarQuant-normalized families WORSE than Base17 L1 (-9%). Stripping magnitude before clustering loses informative coupling.

ADK behavior monitor agent

All agents → Opus 4.7

  • All 29 agent cards across both repos pinned to model: opus. Zero sonnet.

Invariants doc extended (470 LOC)

New invariants:

  • I9 BF17 shapeshifting — same bits carry different semantics per HHTL level (float at HEEL, signed coefficient at LEAF). Explains WHY PolarQuant-only splitting hurts.

New probes specified:

  • P8 Cronbach's α bench — psychometric measurement model. Codec candidates as test items, internal consistency (α) discovers factor structure. Epiphany × population correlation matrix ties every session lesson to testable predictions across 6 data populations (attention k_proj, MLP gate, vocab embed, Jina v5, audio codec, BGE-M3).
  • P9 Mixed bit-width per HHTL level — tests whether wider HIP (finer structure) × shorter LEAF beats narrow HIP × longer LEAF at same total bit budget. 6 variants from 38 to 102 bits/row. Core question: does accuracy at the address level compound through layers enough to pay off vs brute-force leaf precision?

Design principle

Nothing retired. Every research path coexists as an InferenceBackend variant. The bench runs all against all, Cronbach's α tells us factor structure, and deprecation is data-driven. Python is prep-only (HF download, ONNX export); Rust is the canonical inference runtime.

Test plan

  • cargo build --release --example polarquant_hip_probe — clean
  • cargo build --release --example turboquant_correction_probe — clean
  • P5 TurboQuant run — chain collapse measured
  • P7 PolarQuant HIP run — refuted (-9%)
  • P8 Cronbach's α bench implementation (next session — measurement model specified)
  • P9 resolution variants implementation (next session)
  • InferenceBackend impls for each existing path (next session)

Next session entry point

docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md § P8 has the full measurement model: 7 codecs × 6 populations × 9 metrics × 6 resolution variants. The epiphany × population correlation matrix maps every invariant (I1-I9) to its testable prediction per population. Start by implementing cronbach_alpha in bgz_tensor::quality, then the bench fills the matrix.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

claude added 6 commits April 17, 2026 10:41
## InferenceBackend trait (crates/thinking-engine/src/inference_backend.rs)

Runtime-switchable dispatch across all codec/inference paths. Nothing
killed — every research path coexists as a backend variant.

Two key axes documented in the trait module:

Axis 1 — full-path vs leaf-only quantization:
  Full-path QJL/PolarQuant: entire row → JL sign+magnitude (~20 B/row)
  Leaf-only I8 hybrid: HEEL+HIP location (6b) + i8 JLQ residual (9 B/row)
  Passthrough: exact (2×n_cols B/row)

Axis 2 — reconstruction-grade vs signature-grade:
  Reconstruction: SafetensorsRaw, BurnFwd, CandleFwd, HhtlF32+SlotL
  Signature: RaBitQ, SpiralEncoding, CodecCascade, Base17
  Hybrid: I8Hybrid (location + JLQ leaf)

7 backend structs registered in all_backends(). EncodedState enum
carries opaque per-backend state. Trait methods: encode, score,
reconstruct, bytes_per_row, shared_overhead_bytes, grade.

## TurboQuant P5 results (run on Qwen3-TTS-0.6B k_proj [2048,1024])

CRITICAL FINDING: all 4 correction methods (direct i8, Fisher z,
QJL corrected, TurboQuant) hit rho >= 0.997 at single-layer, but
ALL collapse to rho = 0.000 by layer 5 in a 33-layer chain.

  Single layer: Fisher z best (rho=0.999), all >= 0.997
  Chain L=5:    ALL 0.000
  Drift/layer:  QJL 6x lower bias than direct i8 (doesn't help)

Root cause: variance, not bias. Repeated multiplication of quantized
score matrices amplifies noise beyond recovery. QJL bias correction
is correct but irrelevant when variance dominates.

Implications:
  - Path B (cascade inference through 33 layers) NOT VIABLE as
    chained score multiplication
  - Single-layer cascade IS viable (rho >= 0.997)
  - I8 hybrid (HEEL+HIP + JLQ leaf) does f32 reconstruction, not
    chained scoring — different quality model, not refuted by this
  - Hybrid strategy: cascade per-layer, f32 GEMM between layers

P5 status updated in docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md:
  MEASURED — chain collapses, single-layer passes.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
Codifies 7 anti-patterns (AP1-AP7) learned from PRs #176-#188 into
an agent card that fires flags when the session repeats them:

  AP1: "225/225 feels like success" without gate 2 (#178)
  AP2: Projecting quality from docs instead of measuring (#177)
  AP3: Building new codec before benching existing ones (#184)
  AP4: Centroid-residual framing on near-orthogonal data (#177/#183)
  AP5: Python in the inference hot path
  AP6: Chained score multiplication without chain-collapse check (P5)
  AP7: Modifying ndarray without explicit permission (#176)

Invoked by adk-coordinator when pattern repetition is suspected, or
by human directly. Output: list of fired flags, max 7 lines.

Also audited all 29 agent cards across both repos:
  - All pin model: opus or model: sonnet (no hardcoded versions)
  - opus → Opus 4.7 automatically, sonnet → Sonnet 4.6
  - 3 ndarray agents on sonnet (l3-strategist, migration-tracker,
    product-engineer) — intentional for speed-over-depth roles
  - adk-coordinator missing Bash tool (by design — delegates)
  - sentinel-qa missing Edit/Write (by design — audit-only)

No agent changes needed for Opus 4.7 compatibility — model: opus
resolves correctly.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
## P7: PolarQuant HIP family probe — REFUTED for pure direction split

Measured on Qwen3-TTS-0.6B k_proj [2048,1024], 256 rows:

  Base17 L1 (current):   16.8% within-family NN recall  (16/16 families)
  PolarQuant normalized:  7.8% within-family NN recall  (16/16 families)
  Delta: -9.0%  ← PolarQuant is WORSE

Root cause: stripping magnitude before clustering loses informative
signal. For k_proj rows, magnitude variation correlates with NN
structure — rows with similar magnitudes tend to be nearest neighbors.
Base17 L1 already encodes a JOINT direction+magnitude opinion through
the golden-step fold. Pure-direction families throw away half the
coupling.

Insight: the "opinion as address" framing is correct, but the opinion
must be JOINT direction+magnitude (like BF16's mantissa+exponent),
not direction alone. This confirms the logarithmic-scale bgz17
philosophy: u8 encodes both axes simultaneously.

Status: P7 REFUTED for PolarQuant-only normalization on k_proj.
Base17 L1 families are already sufficient for this tensor shape.
May differ for other roles (gate, up, down) — per-role probing is
a follow-up.

## InferenceBackend trait (inference_backend.rs)

Runtime-switchable dispatch design. 7 backend variants documented
with two classification axes:
  Axis 1: full-path QJL vs leaf-only I8 hybrid vs passthrough
  Axis 2: reconstruction-grade vs signature-grade vs hybrid

Trait: encode → EncodedState, score(i,j), reconstruct(i), grade().
Not yet wired into lib.rs (needs feature gate design for heavy deps).

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
## I9: BF17 shapeshifting

Same 16-17 bit wire width carries different constructs at different
HHTL levels: BF17 float at HEEL (joint direction+magnitude opinion),
4-bit partition at HIP, 8×i8 PolarQuant coefficients at LEAF. The
"shapeshifting" is: exponent bits at HEEL become direction bits at
LEAF; mantissa bits at HEEL become magnitude bits at LEAF. Explains
WHY PolarQuant-only splitting hurts (P7 result): the coupling between
direction and magnitude IS the information at HEEL/HIP level.

## P8: Cronbach's α codec bench — psychometric measurement model

Reframes the R&D bench from "horse race" to "psychometric instrument
validation." Codec candidates are test items; we measure internal
consistency (α) to discover factor structure.

### Epiphany × population correlation matrix

Cross-tabulates every invariant (I1-I9) and probe finding (P1-P7)
against 6 data populations: attention k_proj, MLP gate, vocab
embedding, Jina v5 output, audio codec embeddings, BGE-M3 output.
Each cell predicts what should happen if the invariant holds on
that population. The bench FILLS the cells.

### Populations chosen for cross-validation

Different distribution signatures (near-orthogonal vs unit-normalized
vs vocab-sparse vs SiLU-gated vs discrete-latent) ensure the factor
structure is real, not artifact of one tensor's shape.

### Metrics

9 metrics per (codec × population) cell. 4 already in
bgz_tensor::quality (pearson, spearman, top_k_recall, mae/rmse).
4 NEW to implement (Cronbach's α, Cohen's κ, bias, ICC).

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
@AdaWorldAPI AdaWorldAPI merged commit b9b973d into main Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants