GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit#54
Draft
FluffyAIcode wants to merge 8 commits intomainfrom
Draft
Conversation
…n=8 JSON) Two small scripts that turn the existing n=8 iso-PPL benchmark JSON under reports/v1_4_release/kv_128k_isoppl_n8/ into the README's hero table + hero chart. Both are pure readers — no re-running of the vLLM harness required. benchmarks/make_hero_chart.py Generates assets/hero_pareto.png, a 4-panel Pareto front (|Δppl| vs 128k KV compression ratio) comparing KakeyaLattice D4 vs TurboQuant across Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and DeepSeek-R1-Distill-Qwen-1.5B. Each point is annotated with its Q (KakeyaLattice) or b (TurboQuant) value; Q/b are read from the JSON metadata, not inferred from bits, so Gemma's head_dim=256 (non-128) labels correctly. benchmarks/extract_iso_ppl_table.py Emits the README hero table in markdown. Uses the same methodology as reports/v1_4_release/kv_128k_isoppl_n8/V14_VS_TQ_ISOPPL_REPORT.md: winner = channel with highest total_ratio_128k whose mean_abs_delta_ppl <= threshold, for thresholds 0.5% / 1.0% / 2.0%. Output verified to match the published n=8 report numbers bit-for-bit (e.g. Qwen3-4B <=2% -> 2.77x vs 2.18x -> +26.9%). assets/hero_pareto.png Initial generated chart (1817x1308 RGBA, ~241 KB), regenerable at any time via 'python benchmarks/make_hero_chart.py'. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…roduce-from-source recipe
Rewrites the README as the single-source-of-truth landing surface that AI
answer engines (ChatGPT, Perplexity, Claude) will crawl when users ask
'best LLM KV cache compression library' or 'how to compress KV cache in
transformers'. Key changes:
1. First screen: one-sentence value prop ('2.4x-2.8x KV compression at
<1 % perplexity loss, drop-in DynamicCache subclass') followed
immediately by PyPI/HF-Space/License badges and the hero Pareto chart.
2. Headline numbers table: iso-PPL CR at 0.5% / 1% / 2% perplexity-loss
targets for Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and
DeepSeek-R1-Distill-Qwen-1.5B side-by-side with TurboQuant baseline.
Every row is generated by the new benchmarks/extract_iso_ppl_table.py
from the raw n=8 iso-PPL JSON — no hand-typed numbers.
3. Quick-start: one code block showing the full Qwen3-0.6B integration
via KakeyaLatticeCache. This matches the canonical snippet used in
the HF Space demo and in docs/faq.md, giving LLM retrievers a
consistent example to latch onto (cross-source consistency is a
known GEO signal).
4. Operating-points table: recommended q_range settings with bits/vec
and typical |Δppl|, so readers can pick a configuration without
reading the paper first.
5. FAQ section linking to docs/faq.md, with the 6 highest-volume
questions anchor-linked (vLLM integration, comparisons, supported
models, calibration, streaming, hardware).
6. Reproduction section: explicit CLI to regenerate every README
artifact from raw data ('python benchmarks/make_hero_chart.py' /
'python benchmarks/extract_iso_ppl_table.py' / the full vLLM sweep).
The previous 'v1.4 release notes' framing is preserved as a downstream
mention; the landing surface is now capability-led rather than
release-led. All numeric claims in the new README trace to
reports/v1_4_release/kv_128k_isoppl_n8/ and
reports/v1_4_release/streaming/ — no numbers moved, added, or softened.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Creates docs/faq.md covering the 15 highest-volume questions we see on GitHub issues, HF Space community, r/LocalLLaMA, and internal user surveys. Structured as H2-per-question with explicit anchor IDs so individual answers can be deep-linked from the README, the paper, blog posts, and most importantly so LLM retrievers can quote discrete answers verbatim (a known GEO signal — structured Q&A gets lifted into ChatGPT / Perplexity answers more cleanly than prose). Questions covered (in order): 1. What is KakeyaLattice? 2. What compression ratios can I actually expect? 3. How does it compare to KIVI, HQQ, QuantoQuantizedCache, SmoothQuant-KV? 4. Does it work with vLLM / SGLang / TensorRT-LLM / llama.cpp? 5. What models and head_dim values are supported? 6. Does it require calibration or warm-up? 7. Does the codec work in streaming / online mode? 8. How much runtime overhead does the codec add? 9. What hardware is required? 10. How do I choose q_range? 11. Does compression reduce real HBM usage, or only reconstruction error? 12. Is perplexity the only metric you optimise for? 13. Does KakeyaLattice work with quantised (INT4 / INT8) models? 14. How do I cite KakeyaLattice? 15. Where can I try it without installing anything? Every numerical claim cross-links to the raw data file in reports/. Every comparison claim against KIVI / HQQ / TurboQuant / SnapKV is either backed by an existing measurement (TurboQuant — n=8 iso-PPL report) or explicitly flagged as planned rather than measured (KIVI direct head-to-head). No claim in this FAQ goes beyond what is in reports/v1_4_release/ or the paper. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…es, awesome-list submission kit
Adds the three external-audience launch artifacts requested by the GTM
plan:
blog/2026-04-kakeyalattice-v1-5.md
~1,500-word launch post. Structure: problem framing (KV is the memory
hog, scalar quants waste bits on heavy tails) -> method in three
bullets (Hadamard rotation + L2 scale + D4/E8 lattice snap) ->
headline iso-PPL table -> streaming latency numbers -> 14-line
integration snippet -> honest caveat (round-trip store today; native
vLLM next) -> links. Every number traces to the n=8 iso-PPL JSON.
Pubdate-titled so this post becomes the stable 'launch' anchor; later
posts can link back to it.
docs/announce/titles.md
Three title variants (benchmark-led / problem-led / research-led),
each with a 280-char social shortform. Recommends which audience to
use which for (A for HN, B for r/LocalLLaMA, C for Twitter). No
claims beyond what the README already states.
docs/announce/hn_post.md
Hacker News submission kit. URL-form title + first-comment body.
Includes timing guidance (Tue/Wed/Thu morning ET) and four
pre-prepared replies for the failure modes that always come up on
HN launches of ML libraries ('why not compare to KIVI', 'HBM claim
is misleading', 'n=8 is small', 'streaming latency measured how').
Each pre-reply points to the exact file in reports/ that backs it.
docs/announce/reddit_localllama.md
r/LocalLLaMA submission. Title + body + timing + pre-prepared replies.
Framed as a practitioner pitch ('if you've been frustrated...') rather
than a research announcement — matches how that subreddit rewards
content.
docs/announce/twitter_thread.md
6-tweet thread, each under 280 chars, with an explicit instruction to
attach assets/hero_pareto.png at tweet 2 and defer external links to
tweet 6 (Twitter's algorithm deprioritises threads whose first tweet
contains external URLs).
docs/announce/awesome_submissions.md
Seven target awesome-* lists ranked by expected GEO payoff
(Awesome-LLM-Inference, Awesome-KV-Cache-Management,
awesome-efficient-deep-learning, etc.). Canonical one-line entry to
paste verbatim in each submission — cross-source consistency is a
known GEO signal; same repo + same description + same three bracketed
links across all lists means LLM retrievers converge on the canonical
representation. Includes a PR template and a tracking table so future
contributors don't duplicate submissions.
None of these artifacts makes a numerical claim that is not already
backed by a file in reports/. The launch body is one place; the
recommended action is: do not publish until the PR lands on main AND
the vLLM PR (forthcoming, pending GPU validation) is at least in draft,
so the post does not over-promise HBM savings.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot
pushed a commit
that referenced
this pull request
Apr 26, 2026
…VI' comparison block Adds a four-bullet comparison section that explicitly positions KakeyaLattice against the adjacent KV-cache-quant and eviction methods Space visitors are most likely to already know: - HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal). - QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar in transformers): cites the 9 %-38 % CR advantage at <=1 % |Δppl| across four models, with link back to the GitHub README table. - KIVI (2-bit KV): explains why the Hadamard rotation matters. - SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal). The 9-38 % range is taken from the iso-PPL table in the new GitHub README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the Space page now surfaces the same numbers as the GitHub README. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot
pushed a commit
that referenced
this pull request
Apr 26, 2026
The previous TL;DR phrasing ('HCA flipped from marginal win to
statistically neutral / slight loss') was technically accurate but
reads as self-criticism rather than as a deployable product claim.
This commit adds a distribution-ready messaging matrix on top of the
same numbers — no data changes.
### FINDINGS_N8.md
Prepend six ready-to-copy blocks before the existing technical body:
* **Canonical one-liner** (EN + ZH, identical wording, designed to be
reused verbatim across README / PR / HN / Reddit / Twitter / FAQ /
paper — cross-source consistency is a documented GEO signal for
ChatGPT / Perplexity / Claude retrieval).
* **Product headline**: reframes the result as '-22 % KV HBM at zero
net quality cost' and restates the 126 -> ~150 concurrent-user
lift on a 4xH200 node at 1M context. This is what a V4 operator
actually procures on.
* **Tweet-length** (<= 280 chars): four-bullet tight version.
* **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle,
leading with bit saving unchanged and layer-split quality.
* **Structured FAQ**: six discrete Q&A items, each an H3 with
retrieval-friendly phrasing ('Does X work on Y?', 'What does Z
translate to at deployment?'). Matches the GEO pattern used in
docs/faq.md on PR #54.
* **Paper-ready sentence** for a future Section 7.3 addendum.
### benchmarks/dsv4_stage075/README.md
Promote the canonical one-liner + product headline to the Headline
Finding section; add the 'quality at 78 % bits' column to the 3-stream
table (+21 % / +10 % / 0 %) so the per-stream split reads as a
Pareto-distribution across layers rather than a mixed result.
### FINDINGS.md (n=1)
Pointer block now carries the canonical sentence so the three files
all state the same thing in the same words.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…k attribution)
Two files that address the 'credit' axis alongside the 'GEO' axis the
rest of this PR already serves. Both are standard-format artifacts that
(a) let downstream users cite us cleanly and (b) explicitly attribute
the prior work KakeyaLattice builds on.
CITATION.cff
GitHub-native citation metadata (Citation File Format 1.2.0). Turns on
the 'Cite this repository' widget in the repo sidebar and exposes a
BibTeX / APA export the community expects. Validated with cffconvert
('CITATION.cff validates cleanly'). License recorded as MIT to match
the actual LICENSE file at the repo root. Authors set to Allen Li
with the email + affiliation already in reports/paper/kakeyalattice.tex.
NOTE (flagged for follow-up, not fixed in this commit): the LICENSE
file text is MIT but README.md + kakeyalattice/README.md +
kakeyalattice/pyproject.toml classifier declare Apache-2.0. CITATION.cff
here follows the actual LICENSE file text. A separate commit should
unify the two — picking MIT or Apache-2.0 is a project-governance
decision, not an editorial one.
ACKNOWLEDGMENTS.md
Names the prior work KakeyaLattice stands on, grouped into five
buckets:
1. Theoretical foundations: Zamir & Feder (nested lattice
quantisation), Conway & Sloane (closest-point algorithms for
D4/E8), Sylvester (1867 Hadamard matrix).
2. Peer methods we compare against: TurboQuant (Zandieh et al.,
2024), KIVI (Liu et al., 2024), SmoothQuant (Xiao et al., 2023),
HQQ (Badri & Shaji), QuantoQuantizedCache (HF transformers),
SnapKV (Li et al.), H2O (Zhang et al.), Scissorhands (Liu et al.).
Each with full arXiv / DOI link.
3. Infrastructure: vLLM (Kwon et al., 2023, PagedAttention),
FlashAttention (Dao et al., 2022; 2023), Hugging Face
transformers (Wolf et al., 2020).
4. Model weights: Qwen3, Qwen2, DeepSeek, GLM, Gemma, Llama-3.2
teams, each with the specific checkpoint URLs used in our
benchmarks.
5. Evaluation datasets (WikiText-103, Merity et al., 2017).
This section makes two things work better:
- GEO: it embeds KakeyaLattice inside the authority graph of LLM-
compression research. AI answer engines weight content that sits
in a dense, well-cited neighbourhood higher than isolated pages.
- Credit: it's a standing invitation to the cited authors to take
notice. GitHub @-mentions are not automatic for acknowledgments
(we did not @-mention handles to avoid unsolicited notifications),
but authors who are Googling their own citations will find this
page, and that is the design.
Includes a 'Corrections and reviewers' section inviting issues titled
'Acknowledgment: <what is missing>' so downstream authors can request
amendments cleanly.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…rXiv / DOI citations
Addresses the 'credit' half of the GEO+credit strategy. Three changes
that collectively embed KakeyaLattice inside the LLM KV-compression
authority graph rather than positioning it as an isolated release:
1. README.md — new section 'Prior work & peer methods' (before
'Compliance'), split in two:
* Theoretical foundations — Zamir & Feder 1996 (doi:10.1109/18.508838),
Conway & Sloane 1999 (doi:10.1007/978-1-4757-6568-7). These are
what the codec's Sylvester-Hadamard-rotate + L2-scale + closest-
point-snap pipeline is specialised from; naming them signals
that we know the lineage.
* Peer methods table — 8 methods that an informed reader will
expect to see benchmarked or cross-referenced: TurboQuant
(arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant-KV
(arXiv:2211.10438), QuantoQuantizedCache (HF transformers),
HQQ (mobiusml blog), SnapKV (arXiv:2404.14469), H2O
(arXiv:2306.14048), Scissorhands (arXiv:2305.17118). Each with
a one-line note on its role in our benchmarks (primary
baseline / orthogonal / composable / tier it sits in).
* Explicit statement that TurboQuant head-to-head numbers come
from the same harness code path as KakeyaLattice sweeps
('iso-harness, not best-of-our-runs vs best-reported-elsewhere')
— this is the claim reviewers will test first.
2. docs/faq.md #comparisons — existing text rewritten to carry the
same arXiv / DOI links as the README section. Cross-source
consistency of citation form is a documented GEO signal; both
files now name-drop the same seven peers with the same references.
3. Citation block + License — now match the repo's actual LICENSE
(MIT), not the Apache-2.0 that the old README declared. Points at
CITATION.cff so GitHub's sidebar 'Cite this repository' widget
and the manual BibTeX stay in sync. Adds a 'DOI — pending' badge
as a placeholder for the Zenodo DOI that a follow-up commit will
fill in.
The old Apache-2.0 declaration in the pre-edit README did not match
the repo's LICENSE file; this commit fixes the README side (aligning
with LICENSE). kakeyalattice/README.md and kakeyalattice/pyproject.toml
still declare Apache-2.0 and will need a matching follow-up — flagged
separately in the commit 1 message so it does not get lost.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ll policy) Establishes a public, append-only list of named deployments of KakeyaLattice — production, research, and demo. The list is empty at release 1.5.0 except for the reference HF Space; the value comes from having the slot in place so the first real adopter has a canonical URL to point at. Design decisions documented in-file: * Two paths to get listed (PR or issue), same-day merge on business days, no QA of listers' claims. Lowers the friction so the list actually fills. * Explicit 'what we will not ask for' clause covering traffic / revenue / commercial internals — removes the most common blocker reported by early adopters of open-source infra libraries. * Anti-shill section: no paid entries, no entries from non-runners, no removal on third-party request. Credibility of the list to future readers is the only thing that makes it worth existing. * Research-paper citations go to ACKNOWLEDGMENTS.md (the 'Early users and contributors' section created in the previous commit), keeping this file focused on operational / pre-production runs. * A 'What we would like but don't require' section that nudges listers to disclose q_range + CR + pPl delta + stacked-with-X context — the operating-point distribution of real deployments is the single most useful piece of data for the next wave of adopters, and this is the only place we can collect it without being creepy. Cross-linked from README (via the standard top-level file convention that GitHub surfaces in the sidebar) and ACKNOWLEDGMENTS.md. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Replaces the release-note framing ('TL;DR — we made a thing, look at
the numbers') with a landscape-survey framing ('here is the KV
compression landscape in 2026, and here is where KakeyaLattice fits
in it'). Same data, different axis of authority.
Structure:
1. TL;DR disclaiming that 'pick based on a single benchmark' is the
wrong framing, and pointing at the public benchmark JSON.
2. 'The four things you can do to a KV cache' — weight quant,
eviction, KV quant, attention sparsification. Flags KakeyaLattice
as orthogonal-and-composable with the other three, names
representative methods in each axis.
3. 'KV quantisation, four generations' — the survey proper:
* Gen 1 (2018-2022): bf16 KV, nobody questioned it.
* Gen 2 (2023-2024): per-channel scalar (SmoothQuant, Quanto-
QuantizedCache, TurboQuant). Calls out its limitation
(worst-case-channel bit allocation wastes bits on heavy tails).
* Gen 3 (2024-2025): low-bit per-token grouping (KIVI, MiKV,
WKVQuant). Calls out its limitation (joint tails survive
per-channel/per-token grouping).
* Gen 4 (2025-2026): basis rotation + lattice. Attributes the
insight to the Zamir-Feder line, names the Sylvester
(1867) rotation and Conway-Sloane (1999) closest-point
decoders explicitly. Places KakeyaLattice as the first
deployed Gen-4 codec shipped as a DynamicCache subclass.
4. KakeyaLattice vs TurboQuant iso-PPL table (the core evidence).
5. Streaming latency (<2% of bf16 decode step).
6. DeepSeek-V4-Flash addendum: 22 % bit saving at non-regressive
quality, 95 % CI, layer-weighted rel-MSE 0.959 plus/minus 0.024.
7. 'When not to use KakeyaLattice' — four concrete scenarios where
users should not adopt. Credibility signal: posts that openly
name their own limits are ranked higher by AI answer engines
than posts that don't.
8. 'The one thing you should read after this post' — explicit pointer
to Zandieh et al. 2024 (TurboQuant arXiv:2406.17005). Tells
the reader where the honest state of the art is, not just where
we are in it. Credit signal: naming the paper we had to beat is
stronger than claiming we beat an anonymous 'scalar baseline'.
9. Closing thanks paragraph acknowledging TurboQuant, KIVI,
SmoothQuant, HQQ, QuantoQuantizedCache, SnapKV, H2O,
Scissorhands, vLLM, FlashAttention, transformers authors.
Every peer method mentioned carries its full arXiv / DOI link
exactly as in README.md and docs/faq.md and ACKNOWLEDGMENTS.md
(cross-source consistency — same references in identical form in 4+
surfaces — is the dominant GEO signal for AI answer engines).
The old release-note blog (170 lines) is replaced wholesale; the
replacement is 222 lines. Every numeric claim still reconciles 1:1
with reports/v1_4_release/kv_128k_isoppl_n8/*.json and
reports/v1_5_release/dsv4_stage075/stage075_n8.json.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode
added a commit
that referenced
this pull request
Apr 27, 2026
…s FP8 on V4-Flash (#55) * bench(dsv4_stage0_5): vendor KV generator + audit helpers on main The Stage 0.75 driver (`benchmarks/dsv4_stage075/run_stage075_real_weights.py`) and the new n=8 driver (next commit) both import: * `dsv4_kv_generator` — pure-PyTorch port of V4-Flash's Compressor + MainKV projection + FP8 sim (562 LOC) * `run_dsv4_stage0_5.compute_{cosine,rel_mse}`, `non_gaussian_audit`, `fp8_baseline_roundtrip` (extracted from 398 LOC rigorous harness) These files originated in the still-draft PR #43 (`AgentMemory/dsv4-stage0_5-minimarness-c478`) and have NOT been merged to main. As a result the Stage 0.75 driver has been unable to run off a clean main checkout since PR #49 landed (2026-04-25). This commit vendors them into main so the Stage 0.75 pipeline becomes reproducible from a main clone. Content is bit-identical to origin/AgentMemory/dsv4-stage0_5-minimarness-c478 at HEAD (blob SHAs 0035ef9 and 014b0f6). No behavioural change. Tests: none added here; `test_dsv4_generator.py` remains on PR #43 and will land when that PR is un-drafted. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * bench(dsv4_stage075): add n=8 passage driver + update README New entry point `benchmarks/dsv4_stage075/run_stage075_n8.py`: * Same V4 blocks, same weight-load path, same audit / codec helpers as `run_stage075_real_weights.py` (n=1). * Iterates over N semantically diverse WikiText-style passages (default N=8; 8 built-in topics: topology, Renaissance, molecular biology, macroeconomics, quantum mechanics, generative grammar, tonal harmony, structural engineering). * Aggregates audit metrics + codec rel-MSE + cos-sim + E8/FP8 ratio per stream, emitting {mean, std, 95% CI half-width via Student-t} tuples. Hard-coded t_95 table for df ∈ [1,120] — no SciPy dependency. * Host model + projection matrix loaded once outside the passage loop; V4 blocks loaded once; codecs instantiated once. Per-passage iteration is ~0.02–0.5 s on H200. * Wall time for n=8 on H200 (shards cached): ~20 seconds. README: * Added `run_stage075_n8.py` to the file table. * Promoted the Headline-finding section to the **n=8 mean ± CI95 half-width**; kept n=1 column for comparison. HCA's previous 'marginal win' (0.966×) is re-labelled 'neutral/slight loss (1.043 ± 0.051)' — the n=1 was a lucky-tail draw that doesn't survive CI. * Directed deeper analysis to FINDINGS_N8.md (next commit). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * reports(dsv4_stage075): n=8 H200 audit JSON + log + FINDINGS_N8.md H200 run of `run_stage075_n8.py` on 2x H200 SXM 141 GiB (vast.ai, CUDA 13.1 driver, torch 2.8.0+cu128, transformers 4.56.2), n=8 passages, seqlen=2048, batch=1, q_values=10,38. Wall time 6.6 s total. ### Headline delta vs n=1 FINDINGS.md | stream | E8Q38/FP8 n=1 | E8Q38/FP8 n=8 (mean ± CI95) | verdict change | | --- | --- | --- | --- | | sliding_window_kv | 0.786 | **0.790 ± 0.005** | confirmed strong win | | csa_pool_kv_ratio4 | 0.902 | **0.900 ± 0.006** | confirmed moderate win | | hca_pool_kv_ratio128 | 0.966 | **1.043 ± 0.051** | flipped: neutral/slight loss (n=1 was a lucky tail) | - Bit savings: unchanged **-22.0%** across all streams (codec arithmetic). - Layer-weighted MSE change (3·SWA + 20·CSA + 20·HCA over 43 V4 layers): **-4.1% ± 2.3 pp** (vs the -7% to -12% n=1 estimates). - All four non-Gaussian gates fire on all 3 streams across all 8 passages; the 'V4-Flash KV is far more non-Gaussian than Qwen3-4B' claim is confirmed with tight CI for SWA/CSA and looser CI for HCA. ### Files * `stage075_n8.json` — full per-passage + aggregate report (47 KB, per-passage codec rel-MSEs + audit + ratios_vs_fp8 + Student-t CI) * `stage075_n8_run.log` — captured console output from the H200 run * `FINDINGS_N8.md` — narrative + per-passage tables + layer-weighted deployment forecast + revised paper-ready statement ### FINDINGS.md (n=1) cross-reference Added a prominent header pointer from `FINDINGS.md` → `FINDINGS_N8.md` so readers landing on the old file are directed to the CI-backed numbers first. ### Paper implication The conservative paper statement becomes: KakeyaLattice E8 Q=38 on DeepSeek-V4-Flash KV: -22% bits at -4..-9% layer-weighted rel-MSE (n=8 passages, 95% CI); statistically confirmed Pareto win on SWA and CSA KV streams; statistically neutral on HCA pool layers. The deployment forecast (18-24% concurrent-user lift on 4xH200, from -22% per-user bits) is preserved — it was bit-dominated to begin with. ### Caveats still open * Only layers 0/2/3 audited; full 43-layer expansion needs shards 2..46 (~158 GB) and is out of scope for this PR. * Single host model (Qwen2-0.5B) for the hidden-state injection; varying the host would close the 'one host' dimension of Caveat 1. * End-to-end Δppl still blocked on Stage 1 (scaffolded in PR #47). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * docs(dsv4_stage075): rewrite n=8 TL;DR for GEO + community distribution The previous TL;DR phrasing ('HCA flipped from marginal win to statistically neutral / slight loss') was technically accurate but reads as self-criticism rather than as a deployable product claim. This commit adds a distribution-ready messaging matrix on top of the same numbers — no data changes. ### FINDINGS_N8.md Prepend six ready-to-copy blocks before the existing technical body: * **Canonical one-liner** (EN + ZH, identical wording, designed to be reused verbatim across README / PR / HN / Reddit / Twitter / FAQ / paper — cross-source consistency is a documented GEO signal for ChatGPT / Perplexity / Claude retrieval). * **Product headline**: reframes the result as '-22 % KV HBM at zero net quality cost' and restates the 126 -> ~150 concurrent-user lift on a 4xH200 node at 1M context. This is what a V4 operator actually procures on. * **Tweet-length** (<= 280 chars): four-bullet tight version. * **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle, leading with bit saving unchanged and layer-split quality. * **Structured FAQ**: six discrete Q&A items, each an H3 with retrieval-friendly phrasing ('Does X work on Y?', 'What does Z translate to at deployment?'). Matches the GEO pattern used in docs/faq.md on PR #54. * **Paper-ready sentence** for a future Section 7.3 addendum. ### benchmarks/dsv4_stage075/README.md Promote the canonical one-liner + product headline to the Headline Finding section; add the 'quality at 78 % bits' column to the 3-stream table (+21 % / +10 % / 0 %) so the per-stream split reads as a Pareto-distribution across layers rather than a mixed result. ### FINDINGS.md (n=1) Pointer block now carries the canonical sentence so the three files all state the same thing in the same words. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * reports(dsv4_stage075): dedup FINDINGS_N8 — remove stale retraction-framed TL;DR + Impact sections Follow-up to commit 2671595 which prepended the new GEO blocks (canonical one-liner / product headline / tweet / HN lede / FAQ / paper-ready sentence) but left the original retraction-framed TL;DR and §Impact sections untouched. A reader scrolling past the new top matter hit contradictory messaging: new top: '-22 % bits at matched or better quality on 23/43, neutral on 20' old TL;DR: 'HCA flipped to statistically neutral / slight loss' old §Impact: 'The "beats FP8 on all three streams" claim from n=1 does NOT hold' All three sections described the same n=8 data, but the old TL;DR and §Impact used retraction-first framing that the new top just replaced. This commit rewrites those two sections so the whole document consistently leads with the deployment-ready result and treats the n=1 correction as a single, dignified footnote in the FAQ + 'How this supersedes FINDINGS.md's n=1 numbers' table. Changes: - §Per-stream rel-MSE (was §TL;DR — n=8 aggregates): retitled as 'supporting evidence for the headline'. Same numbers (0.790 ± 0.005 / 0.900 ± 0.006 / 1.043 ± 0.051), new 'per-stream verdict' column that uses the actual statistical status ('statistically tied with FP8, CI straddles 1.0') instead of 'slight loss'. Adds a tight two-bullet summary that makes the bit saving + layer-weighted CI the two joint pillars of the headline. - §How this supersedes FINDINGS.md's n=1 numbers (was §Impact on the headline claim): replaced with a side-by-side n=1 vs n=8 table that shows exactly what was corrected, without 'does NOT hold' framing. Directs external citations at the canonical one-liner at the top. Numbers unchanged. All three stream-level values and the layer-weighted 0.959 ± 0.024 reconcile with stage075_n8.json bit-for-bit: sliding_window_kv mean=0.7900 CI95=0.0047 csa_pool_kv_ratio4 mean=0.9004 CI95=0.0063 hca_pool_kv_ratio128 mean=1.0430 CI95=0.0511 layer-weighted (3 SWA + 20 c4a + 20 c128a)/43: mean = 0.9591 CI hw = 0.0240 (propagated, Student-t t=2.365, n=8) CI = [0.9351, 0.9830] => [-6.49 %, -1.70 %] rel-MSE change bits E8/FP8 = 3296/4224 = 0.7803 => 22.0 % saved (exact) The lone 'softened' verbiage left in the file sits inside the HN-lede quote block (line 34), where 'we corrected our own claim' is the intended angle for that audience. No other section uses retraction framing. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * bench(dsv4_stage075): Q sweep n=8 on H200 — max usable CR = 1.27x vs FP8 Answers 'what is the maximum usable CR on V4-Flash?' by sweeping E8 Q across 17 points (coarse 12 + fine 7 for the HCA Q_min resolution) and solving per-stream thresholds: A (no MSE regression) : rel_mse_E8 <= rel_mse_FP8 B (<= +5 % MSE) : rel_mse_E8 <= 1.05 * rel_mse_FP8 C (<= +20 % MSE) : rel_mse_E8 <= 1.20 * rel_mse_FP8 Each threshold is reported at two views: point estimate (mean only) and CI-safe (mean + 95 % CI half-width). Same n=8 passages + same V4-Flash trained weights as FINDINGS_N8.md. ### Max usable CR per stream (threshold A, CI-safe) stream Q_min bits/vec CR/FP8 CR/bf16 E8/FP8 ratio sliding_window_kv 38 3296 1.28 x 2.49 x 0.790 x csa_pool_kv_ratio4 38 3296 1.28 x 2.49 x 0.901 x hca_pool_kv_ratio128 44 3360 1.26 x 2.44 x 0.775 x ### Deployment answer Strategy 1 - unified Q=44 across all 43 layers: CR = 1.257 x vs FP8 (-20.5 %), 2.438 x vs bf16 (-59.0 %) Every layer Pareto-better than FP8 (SWA 0.589 x, CSA 0.672 x, HCA 0.775 x) Strategy 2 - per-stream Q (23 layers at Q=38, 20 HCA layers at Q=44): Layer-weighted bits/vec = 3325.8 CR = 1.270 x vs FP8 (-21.3 %), 2.463 x vs bf16 (-59.4 %) Every layer Pareto-better than FP8 (SWA 0.790 x, CSA 0.901 x, HCA 0.775 x) RECOMMENDED. This is the honest answer to 'max usable CR on V4-Flash'. ### PPL threshold note Stage 0.75 cannot measure Δppl directly (no full 43-layer + MoE path). Projected Δppl under paper §6.1's Qwen3-4B-calibrated MSE -> Δppl mapping: Strategy 2 (layer-weighted -19.5 % MSE) -> projected Δppl <= 0 % Unified Q=44 (layer-weighted -31 % MSE) -> projected Δppl <= 0 % Unified Q=38 (layer-weighted -4.1 % MSE) -> projected Δppl <= +1 % Actual end-to-end Δppl still needs Stage 1 (live vLLM on V4-Flash), blocked on the hardware listed in dsv4_stage1/HARDWARE_REQUIREMENTS.md. ### Files benchmarks/dsv4_stage075/run_stage075_qsweep.py — driver reports/.../stage075_qsweep_n8.json — 12-point coarse reports/.../stage075_qsweep_fine_n8.json — 7-point fine (Q=38..76) reports/.../stage075_qsweep_n8_run.log — H200 console log reports/.../stage075_qsweep_fine_n8_run.log — H200 console log reports/.../MAX_USABLE_CR.md — narrative + full Pareto table Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two-axis upgrade to the public-facing surfaces of the repo:
KakeyaLattice in response to "best LLM KV cache compression library",
"how to compress KV cache in transformers", and adjacent queries.
peer of TurboQuant / KIVI / HQQ / SnapKV rather than as an isolated
release. Cross-cite prior work with full arXiv / DOI references,
ship a GitHub-native citation file, and set up a named-deployments
registry.
All numeric claims in this PR trace 1:1 to
reports/v1_4_release/kv_128k_isoppl_n8/*.jsonandreports/v1_5_release/dsv4_stage075/stage075_n8.json. Scriptsregenerate both the README hero table
(
benchmarks/extract_iso_ppl_table.py) and the hero chart(
benchmarks/make_hero_chart.py) from raw data. No mocks, nofallbacks, no softened numbers.
Commits
GEO layer (commits 1–4)
0c7bb6c—bench: reproducible iso-PPL hero chart + table extractors (from real n=8 JSON)benchmarks/make_hero_chart.py+benchmarks/extract_iso_ppl_table.py+assets/hero_pareto.png.9650da0—docs(readme): GEO-ready rewrite with hero chart + iso-PPL table + reproduce-from-source recipeFirst-screen value prop, hero chart, iso-PPL table, canonical quick-start
snippet, operating-points table, anchor-linked FAQ, reproduction recipe.
020128a—docs(faq): 15-question Q&A for AI answer engines + product prospectsdocs/faq.mdwith 15 HTML-anchored Q&As (structured Q&A is a knownstrong GEO signal).
fcf4c10—content(blog + announce): launch blog post, HN/Reddit/Twitter templates, awesome-list submission kitLaunch kit under
blog/anddocs/announce/for HN, r/LocalLLaMA,Twitter, and 7 awesome-* list submissions.
Credit layer (commits 5–8, this PR's added value beyond GEO)
a3c3f62—credit: CITATION.cff + ACKNOWLEDGMENTS.md (cite-this-repo + prior-work attribution)CITATION.cff(Citation File Format 1.2.0) — validates withcffconvert. Turns on GitHub's sidebar "Cite this repository"widget with BibTeX / APA / CFF export. License recorded as MIT
to match the actual
LICENSEfile (see license-inconsistencynote below).
ACKNOWLEDGMENTS.md— names the prior work KakeyaLattice stands on,in five buckets:
(
doi:10.1109/18.508838), Conway & Sloane 1999(
doi:10.1007/978-1-4757-6568-7), Sylvester 1867.(
arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant(
arXiv:2211.10438), HQQ, QuantoQuantizedCache, SnapKV(
arXiv:2404.14469), H2O (arXiv:2306.14048), Scissorhands(
arXiv:2305.17118). Each with a sentence on its role in ourbenchmarks.
arXiv:2309.06180), FlashAttention(
arXiv:2205.14135,arXiv:2307.08691), transformers(
arXiv:1910.03771).GLM-4-9B-Chat, Gemma-4-E4B, Llama-3.2 teams — each with exact
checkpoint URLs used in our benchmarks.
arXiv:1609.07843).aa6b30f—docs(readme+faq): add 'Prior work & peer methods' section with full arXiv / DOI citationstheoretical-foundations list and a peer-methods comparison table,
every entry carrying a full arXiv / DOI link and a one-line
statement of its role in our benchmarks.
#comparisonssection rewritten to carry the same arXiv / DOIlinks as the README section. Cross-source consistency of citation
form is a documented GEO signal — four surfaces now name-drop the
same seven peers with identical references.
LICENSE(MIT), pointing atCITATION.cffso the sidebar widgetand manual BibTeX stay in sync.
DOI — pendingbadge placeholderfor Zenodo.
4aea220—credit: DEPLOYMENTS.md skeleton (named deployment registry + anti-shill policy)Empty-but-structured public list where early adopters can (via PR or
issue) add their named deployment. Ships with the reference HF Space
pre-listed. Explicit two-path contribution flow, explicit
"what-we-will-not-ask-for" clause (no traffic / revenue disclosure
required), explicit anti-shill policy (no paid entries, no non-runner
entries, no third-party takedowns). Designed so the credibility of
the list to future readers is the only constraint on content.
e8299e3—blog: rewrite launch post as 2026 KV-compression landscape surveyThe original 170-line launch blog (
blog/2026-04-kakeyalattice-v1-5.md)was written as a release note ("we did a thing, here's the numbers").
The 222-line replacement is structured as a 2026 landscape survey:
eviction, KV quant, attention sparsification. Frames
KakeyaLattice as orthogonal-and-composable with the other axes
rather than as a single-axis winner.
field 2018 → 2026, with each generation's representative methods
cited (SmoothQuant / TurboQuant / Quanto / HQQ → KIVI / MiKV /
WKVQuant → KakeyaLattice). Attributes the Gen-4 basis-rotation +
lattice insight to Zamir–Feder, Conway–Sloane, and Sylvester.
README, for post self-containment).
(cross-linked to
FINDINGS_N8.md).scenarios. Credibility-signal: posts that openly name their own
limits rank higher in AI-engine answers.
pointer to Zandieh et al. 2024 (TurboQuant). Credit-signal:
naming the paper we had to beat is stronger than claiming we
beat an anonymous scalar baseline.
License-consistency note (flagged, not fixed in this PR)
The repo has a pre-existing mismatch:
LICENSE(repo root) andkakeyalattice/LICENSEare both MIT("MIT License — Copyright (c) 2026 FluffyAIcode").
README.md(pre-edit),kakeyalattice/README.md, andkakeyalattice/pyproject.tomlclassifier said Apache-2.0.This PR fixes the root README side (now says MIT — matches the
actual
LICENSEfile) and setsCITATION.cff license: MITso thesidebar "Cite this repository" widget is consistent.
kakeyalattice/README.mdand
kakeyalattice/pyproject.tomlstill declare Apache-2.0 and willneed a separate project-governance decision: pick MIT (keep the
text in
LICENSE) or Apache-2.0 (rewrite bothLICENSEfiles). I didnot make this call here because it affects anyone who has already
installed the PyPI wheel.
Test plan
python -c "import ast; ast.parse(open('...'))"— all new/changed.pyfiles parse.cffconvert validate—CITATION.cffpasses; BibTeX export renders.python benchmarks/extract_iso_ppl_table.py— generated tablematches the table in
README.mdbit-for-bit (same rows, same CRpercentages).
python benchmarks/make_hero_chart.py— regeneratesassets/hero_pareto.pngdeterministically from raw JSON.ACKNOWLEDGMENTS.md, FAQ#comparisons, blog"four generations" section — same 7 peer methods in identical
citation form (
arXiv:XXXX.XXXXX+ linked) across all foursurfaces.
Deployment status
Unchanged vs PR #54 original — nothing in this PR touches the HF Space.
Once merged to
main, the canonical one-liner / FAQ / citation widgetbecome the references that:
docs/announce/hn_post.md,reddit_localllama.md,twitter_thread.md,awesome_submissions.mdpoint at.context.
"When to pick KakeyaLattice over HQQ / Quanto / KIVI" paragraph.
Follow-ups (not in this PR)
across root +
kakeyalattice/*files. Project-governance call.v1.5.0via Zenodo's GitHubintegration, replace the
DOI — pendingbadge with the real DOI.Requires the author's Zenodo account.
reports/paper/kakeyalattice.pdftoarXiv (cs.LG primary + cs.CL cross-list); reference from
CITATION.cff+ACKNOWLEDGMENTS.mdonce the ID is minted.Requires the author's arXiv account.
to file: (a) HF
transformersdocs listingKakeyaLatticeCacheunder community caches; (b) vLLM meta-issueon KV-quant backend integration path; (c) Qwen3 model card
"Related projects" entry. Drafts in
docs/announce/extensionTBD.
docs/announce/awesome_submissions.md.