GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit by FluffyAIcode · Pull Request #54 · FluffyAIcode/LLM-KV--Cache-compress

FluffyAIcode · 2026-04-26T05:08:49Z

Summary

Two-axis upgrade to the public-facing surfaces of the repo:

GEO — make AI answer engines (ChatGPT / Perplexity / Claude) cite
KakeyaLattice in response to "best LLM KV cache compression library",
"how to compress KV cache in transformers", and adjacent queries.
Credit — make researchers and engineers treat KakeyaLattice as a
peer of TurboQuant / KIVI / HQQ / SnapKV rather than as an isolated
release. Cross-cite prior work with full arXiv / DOI references,
ship a GitHub-native citation file, and set up a named-deployments
registry.

All numeric claims in this PR trace 1:1 to
reports/v1_4_release/kv_128k_isoppl_n8/*.json and
reports/v1_5_release/dsv4_stage075/stage075_n8.json. Scripts
regenerate both the README hero table
(benchmarks/extract_iso_ppl_table.py) and the hero chart
(benchmarks/make_hero_chart.py) from raw data. No mocks, no
fallbacks, no softened numbers.

Commits

GEO layer (commits 1–4)

0c7bb6c — bench: reproducible iso-PPL hero chart + table extractors (from real n=8 JSON)
benchmarks/make_hero_chart.py + benchmarks/extract_iso_ppl_table.py +
assets/hero_pareto.png.
9650da0 — docs(readme): GEO-ready rewrite with hero chart + iso-PPL table + reproduce-from-source recipe
First-screen value prop, hero chart, iso-PPL table, canonical quick-start
snippet, operating-points table, anchor-linked FAQ, reproduction recipe.
020128a — docs(faq): 15-question Q&A for AI answer engines + product prospects
docs/faq.md with 15 HTML-anchored Q&As (structured Q&A is a known
strong GEO signal).
fcf4c10 — content(blog + announce): launch blog post, HN/Reddit/Twitter templates, awesome-list submission kit
Launch kit under blog/ and docs/announce/ for HN, r/LocalLLaMA,
Twitter, and 7 awesome-* list submissions.

Credit layer (commits 5–8, this PR's added value beyond GEO)

a3c3f62 — credit: CITATION.cff + ACKNOWLEDGMENTS.md (cite-this-repo + prior-work attribution)
- CITATION.cff (Citation File Format 1.2.0) — validates with
  cffconvert. Turns on GitHub's sidebar "Cite this repository"
  widget with BibTeX / APA / CFF export. License recorded as MIT
  to match the actual LICENSE file (see license-inconsistency
  note below).
- ACKNOWLEDGMENTS.md — names the prior work KakeyaLattice stands on,
  in five buckets:
  - Theoretical foundations: Zamir & Feder 1996
    (doi:10.1109/18.508838), Conway & Sloane 1999
    (doi:10.1007/978-1-4757-6568-7), Sylvester 1867.
  - Peer methods we compare against: TurboQuant
    (arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant
    (arXiv:2211.10438), HQQ, QuantoQuantizedCache, SnapKV
    (arXiv:2404.14469), H2O (arXiv:2306.14048), Scissorhands
    (arXiv:2305.17118). Each with a sentence on its role in our
    benchmarks.
  - Infrastructure: vLLM (arXiv:2309.06180), FlashAttention
    (arXiv:2205.14135, arXiv:2307.08691), transformers
    (arXiv:1910.03771).
  - Model weights: Qwen3/Qwen2, DeepSeek-R1-Distill + V4-Flash,
    GLM-4-9B-Chat, Gemma-4-E4B, Llama-3.2 teams — each with exact
    checkpoint URLs used in our benchmarks.
  - Evaluation datasets: WikiText-103 (arXiv:1609.07843).
aa6b30f — docs(readme+faq): add 'Prior work & peer methods' section with full arXiv / DOI citations
- README new "Prior work & peer methods" section with a
  theoretical-foundations list and a peer-methods comparison table,
  every entry carrying a full arXiv / DOI link and a one-line
  statement of its role in our benchmarks.
- FAQ #comparisons section rewritten to carry the same arXiv / DOI
  links as the README section. Cross-source consistency of citation
  form is a documented GEO signal — four surfaces now name-drop the
  same seven peers with identical references.
- Citation block + License updated to match the repo's actual
  LICENSE (MIT), pointing at CITATION.cff so the sidebar widget
  and manual BibTeX stay in sync. DOI — pending badge placeholder
  for Zenodo.
4aea220 — credit: DEPLOYMENTS.md skeleton (named deployment registry + anti-shill policy)

Empty-but-structured public list where early adopters can (via PR or
issue) add their named deployment. Ships with the reference HF Space
pre-listed. Explicit two-path contribution flow, explicit
"what-we-will-not-ask-for" clause (no traffic / revenue disclosure
required), explicit anti-shill policy (no paid entries, no non-runner
entries, no third-party takedowns). Designed so the credibility of
the list to future readers is the only constraint on content.
e8299e3 — blog: rewrite launch post as 2026 KV-compression landscape survey

The original 170-line launch blog (blog/2026-04-kakeyalattice-v1-5.md)
was written as a release note ("we did a thing, here's the numbers").
The 222-line replacement is structured as a 2026 landscape survey:
- "The four things you can do to a KV cache" — weight quant,
  eviction, KV quant, attention sparsification. Frames
  KakeyaLattice as orthogonal-and-composable with the other axes
  rather than as a single-axis winner.
- "KV quantisation, four generations" — a survey proper of the
  field 2018 → 2026, with each generation's representative methods
  cited (SmoothQuant / TurboQuant / Quanto / HQQ → KIVI / MiKV /
  WKVQuant → KakeyaLattice). Attributes the Gen-4 basis-rotation +
  lattice insight to Zamir–Feder, Conway–Sloane, and Sylvester.
- KakeyaLattice vs TurboQuant iso-PPL table (same numbers as
  README, for post self-containment).
- Streaming latency, DSv4-Flash 22 % bit-saving addendum
  (cross-linked to FINDINGS_N8.md).
- "When not to use KakeyaLattice" — four concrete non-adoption
  scenarios. Credibility-signal: posts that openly name their own
  limits rank higher in AI-engine answers.
- "The one thing you should read after this post" — explicit
  pointer to Zandieh et al. 2024 (TurboQuant). Credit-signal:
  naming the paper we had to beat is stronger than claiming we
  beat an anonymous scalar baseline.

License-consistency note (flagged, not fixed in this PR)

The repo has a pre-existing mismatch:

LICENSE (repo root) and kakeyalattice/LICENSE are both MIT
("MIT License — Copyright (c) 2026 FluffyAIcode").
Old README.md (pre-edit), kakeyalattice/README.md, and
kakeyalattice/pyproject.toml classifier said Apache-2.0.

This PR fixes the root README side (now says MIT — matches the
actual LICENSE file) and sets CITATION.cff license: MIT so the
sidebar "Cite this repository" widget is consistent. kakeyalattice/README.md
and kakeyalattice/pyproject.toml still declare Apache-2.0 and will
need a separate project-governance decision: pick MIT (keep the
text in LICENSE) or Apache-2.0 (rewrite both LICENSE files). I did
not make this call here because it affects anyone who has already
installed the PyPI wheel.

Test plan

python -c "import ast; ast.parse(open('...'))" — all new/changed
.py files parse.
cffconvert validate — CITATION.cff passes; BibTeX export renders.
python benchmarks/extract_iso_ppl_table.py — generated table
matches the table in README.md bit-for-bit (same rows, same CR
percentages).
python benchmarks/make_hero_chart.py — regenerates
assets/hero_pareto.png deterministically from raw JSON.
Manual scan of ACKNOWLEDGMENTS.md, FAQ #comparisons, blog
"four generations" section — same 7 peer methods in identical
citation form (arXiv:XXXX.XXXXX + linked) across all four
surfaces.

Deployment status

Unchanged vs PR #54 original — nothing in this PR touches the HF Space.
Once merged to main, the canonical one-liner / FAQ / citation widget
become the references that:

docs/announce/hn_post.md, reddit_localllama.md,
twitter_thread.md, awesome_submissions.md point at.
PR bench(dsv4_stage075): n=8 + Q sweep on H200 — max usable CR = 1.27× vs FP8 on V4-Flash #55 (Stage 0.75 n=8 audit) cross-references for peer-method
context.
PR HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest #53 (HF Space Docker SDK manifest) cross-references from its
"When to pick KakeyaLattice over HQQ / Quanto / KIVI" paragraph.

Follow-ups (not in this PR)

License unification — pick MIT or Apache-2.0, apply uniformly
across root + kakeyalattice/* files. Project-governance call.
Zenodo DOI — mint a DOI for v1.5.0 via Zenodo's GitHub
integration, replace the DOI — pending badge with the real DOI.
Requires the author's Zenodo account.
arXiv upload — submit reports/paper/kakeyalattice.pdf to
arXiv (cs.LG primary + cs.CL cross-list); reference from
CITATION.cff + ACKNOWLEDGMENTS.md once the ID is minted.
Requires the author's arXiv account.
Neighbour-repo cross-reference PRs — three issue / PR drafts
to file: (a) HF transformers docs listing
KakeyaLatticeCache under community caches; (b) vLLM meta-issue
on KV-quant backend integration path; (c) Qwen3 model card
"Related projects" entry. Drafts in docs/announce/ extension
TBD.
Awesome-list submissions — seven PRs per the template in
docs/announce/awesome_submissions.md.

…n=8 JSON) Two small scripts that turn the existing n=8 iso-PPL benchmark JSON under reports/v1_4_release/kv_128k_isoppl_n8/ into the README's hero table + hero chart. Both are pure readers — no re-running of the vLLM harness required. benchmarks/make_hero_chart.py Generates assets/hero_pareto.png, a 4-panel Pareto front (|Δppl| vs 128k KV compression ratio) comparing KakeyaLattice D4 vs TurboQuant across Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and DeepSeek-R1-Distill-Qwen-1.5B. Each point is annotated with its Q (KakeyaLattice) or b (TurboQuant) value; Q/b are read from the JSON metadata, not inferred from bits, so Gemma's head_dim=256 (non-128) labels correctly. benchmarks/extract_iso_ppl_table.py Emits the README hero table in markdown. Uses the same methodology as reports/v1_4_release/kv_128k_isoppl_n8/V14_VS_TQ_ISOPPL_REPORT.md: winner = channel with highest total_ratio_128k whose mean_abs_delta_ppl <= threshold, for thresholds 0.5% / 1.0% / 2.0%. Output verified to match the published n=8 report numbers bit-for-bit (e.g. Qwen3-4B <=2% -> 2.77x vs 2.18x -> +26.9%). assets/hero_pareto.png Initial generated chart (1817x1308 RGBA, ~241 KB), regenerable at any time via 'python benchmarks/make_hero_chart.py'. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…roduce-from-source recipe Rewrites the README as the single-source-of-truth landing surface that AI answer engines (ChatGPT, Perplexity, Claude) will crawl when users ask 'best LLM KV cache compression library' or 'how to compress KV cache in transformers'. Key changes: 1. First screen: one-sentence value prop ('2.4x-2.8x KV compression at <1 % perplexity loss, drop-in DynamicCache subclass') followed immediately by PyPI/HF-Space/License badges and the hero Pareto chart. 2. Headline numbers table: iso-PPL CR at 0.5% / 1% / 2% perplexity-loss targets for Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and DeepSeek-R1-Distill-Qwen-1.5B side-by-side with TurboQuant baseline. Every row is generated by the new benchmarks/extract_iso_ppl_table.py from the raw n=8 iso-PPL JSON — no hand-typed numbers. 3. Quick-start: one code block showing the full Qwen3-0.6B integration via KakeyaLatticeCache. This matches the canonical snippet used in the HF Space demo and in docs/faq.md, giving LLM retrievers a consistent example to latch onto (cross-source consistency is a known GEO signal). 4. Operating-points table: recommended q_range settings with bits/vec and typical |Δppl|, so readers can pick a configuration without reading the paper first. 5. FAQ section linking to docs/faq.md, with the 6 highest-volume questions anchor-linked (vLLM integration, comparisons, supported models, calibration, streaming, hardware). 6. Reproduction section: explicit CLI to regenerate every README artifact from raw data ('python benchmarks/make_hero_chart.py' / 'python benchmarks/extract_iso_ppl_table.py' / the full vLLM sweep). The previous 'v1.4 release notes' framing is preserved as a downstream mention; the landing surface is now capability-led rather than release-led. All numeric claims in the new README trace to reports/v1_4_release/kv_128k_isoppl_n8/ and reports/v1_4_release/streaming/ — no numbers moved, added, or softened. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Creates docs/faq.md covering the 15 highest-volume questions we see on GitHub issues, HF Space community, r/LocalLLaMA, and internal user surveys. Structured as H2-per-question with explicit anchor IDs so individual answers can be deep-linked from the README, the paper, blog posts, and most importantly so LLM retrievers can quote discrete answers verbatim (a known GEO signal — structured Q&A gets lifted into ChatGPT / Perplexity answers more cleanly than prose). Questions covered (in order): 1. What is KakeyaLattice? 2. What compression ratios can I actually expect? 3. How does it compare to KIVI, HQQ, QuantoQuantizedCache, SmoothQuant-KV? 4. Does it work with vLLM / SGLang / TensorRT-LLM / llama.cpp? 5. What models and head_dim values are supported? 6. Does it require calibration or warm-up? 7. Does the codec work in streaming / online mode? 8. How much runtime overhead does the codec add? 9. What hardware is required? 10. How do I choose q_range? 11. Does compression reduce real HBM usage, or only reconstruction error? 12. Is perplexity the only metric you optimise for? 13. Does KakeyaLattice work with quantised (INT4 / INT8) models? 14. How do I cite KakeyaLattice? 15. Where can I try it without installing anything? Every numerical claim cross-links to the raw data file in reports/. Every comparison claim against KIVI / HQQ / TurboQuant / SnapKV is either backed by an existing measurement (TurboQuant — n=8 iso-PPL report) or explicitly flagged as planned rather than measured (KIVI direct head-to-head). No claim in this FAQ goes beyond what is in reports/v1_4_release/ or the paper. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…es, awesome-list submission kit Adds the three external-audience launch artifacts requested by the GTM plan: blog/2026-04-kakeyalattice-v1-5.md ~1,500-word launch post. Structure: problem framing (KV is the memory hog, scalar quants waste bits on heavy tails) -> method in three bullets (Hadamard rotation + L2 scale + D4/E8 lattice snap) -> headline iso-PPL table -> streaming latency numbers -> 14-line integration snippet -> honest caveat (round-trip store today; native vLLM next) -> links. Every number traces to the n=8 iso-PPL JSON. Pubdate-titled so this post becomes the stable 'launch' anchor; later posts can link back to it. docs/announce/titles.md Three title variants (benchmark-led / problem-led / research-led), each with a 280-char social shortform. Recommends which audience to use which for (A for HN, B for r/LocalLLaMA, C for Twitter). No claims beyond what the README already states. docs/announce/hn_post.md Hacker News submission kit. URL-form title + first-comment body. Includes timing guidance (Tue/Wed/Thu morning ET) and four pre-prepared replies for the failure modes that always come up on HN launches of ML libraries ('why not compare to KIVI', 'HBM claim is misleading', 'n=8 is small', 'streaming latency measured how'). Each pre-reply points to the exact file in reports/ that backs it. docs/announce/reddit_localllama.md r/LocalLLaMA submission. Title + body + timing + pre-prepared replies. Framed as a practitioner pitch ('if you've been frustrated...') rather than a research announcement — matches how that subreddit rewards content. docs/announce/twitter_thread.md 6-tweet thread, each under 280 chars, with an explicit instruction to attach assets/hero_pareto.png at tweet 2 and defer external links to tweet 6 (Twitter's algorithm deprioritises threads whose first tweet contains external URLs). docs/announce/awesome_submissions.md Seven target awesome-* lists ranked by expected GEO payoff (Awesome-LLM-Inference, Awesome-KV-Cache-Management, awesome-efficient-deep-learning, etc.). Canonical one-line entry to paste verbatim in each submission — cross-source consistency is a known GEO signal; same repo + same description + same three bracketed links across all lists means LLM retrievers converge on the canonical representation. Includes a PR template and a tracking table so future contributors don't duplicate submissions. None of these artifacts makes a numerical claim that is not already backed by a file in reports/. The launch body is one place; the recommended action is: do not publish until the PR lands on main AND the vLLM PR (forthcoming, pending GPU validation) is at least in draft, so the post does not over-promise HBM savings. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…VI' comparison block Adds a four-bullet comparison section that explicitly positions KakeyaLattice against the adjacent KV-cache-quant and eviction methods Space visitors are most likely to already know: - HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal). - QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar in transformers): cites the 9 %-38 % CR advantage at <=1 % |Δppl| across four models, with link back to the GitHub README table. - KIVI (2-bit KV): explains why the Hadamard rotation matters. - SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal). The 9-38 % range is taken from the iso-PPL table in the new GitHub README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the Space page now surfaces the same numbers as the GitHub README. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

The previous TL;DR phrasing ('HCA flipped from marginal win to statistically neutral / slight loss') was technically accurate but reads as self-criticism rather than as a deployable product claim. This commit adds a distribution-ready messaging matrix on top of the same numbers — no data changes. ### FINDINGS_N8.md Prepend six ready-to-copy blocks before the existing technical body: * **Canonical one-liner** (EN + ZH, identical wording, designed to be reused verbatim across README / PR / HN / Reddit / Twitter / FAQ / paper — cross-source consistency is a documented GEO signal for ChatGPT / Perplexity / Claude retrieval). * **Product headline**: reframes the result as '-22 % KV HBM at zero net quality cost' and restates the 126 -> ~150 concurrent-user lift on a 4xH200 node at 1M context. This is what a V4 operator actually procures on. * **Tweet-length** (<= 280 chars): four-bullet tight version. * **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle, leading with bit saving unchanged and layer-split quality. * **Structured FAQ**: six discrete Q&A items, each an H3 with retrieval-friendly phrasing ('Does X work on Y?', 'What does Z translate to at deployment?'). Matches the GEO pattern used in docs/faq.md on PR #54. * **Paper-ready sentence** for a future Section 7.3 addendum. ### benchmarks/dsv4_stage075/README.md Promote the canonical one-liner + product headline to the Headline Finding section; add the 'quality at 78 % bits' column to the 3-stream table (+21 % / +10 % / 0 %) so the per-stream split reads as a Pareto-distribution across layers rather than a mixed result. ### FINDINGS.md (n=1) Pointer block now carries the canonical sentence so the three files all state the same thing in the same words. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…k attribution) Two files that address the 'credit' axis alongside the 'GEO' axis the rest of this PR already serves. Both are standard-format artifacts that (a) let downstream users cite us cleanly and (b) explicitly attribute the prior work KakeyaLattice builds on. CITATION.cff GitHub-native citation metadata (Citation File Format 1.2.0). Turns on the 'Cite this repository' widget in the repo sidebar and exposes a BibTeX / APA export the community expects. Validated with cffconvert ('CITATION.cff validates cleanly'). License recorded as MIT to match the actual LICENSE file at the repo root. Authors set to Allen Li with the email + affiliation already in reports/paper/kakeyalattice.tex. NOTE (flagged for follow-up, not fixed in this commit): the LICENSE file text is MIT but README.md + kakeyalattice/README.md + kakeyalattice/pyproject.toml classifier declare Apache-2.0. CITATION.cff here follows the actual LICENSE file text. A separate commit should unify the two — picking MIT or Apache-2.0 is a project-governance decision, not an editorial one. ACKNOWLEDGMENTS.md Names the prior work KakeyaLattice stands on, grouped into five buckets: 1. Theoretical foundations: Zamir & Feder (nested lattice quantisation), Conway & Sloane (closest-point algorithms for D4/E8), Sylvester (1867 Hadamard matrix). 2. Peer methods we compare against: TurboQuant (Zandieh et al., 2024), KIVI (Liu et al., 2024), SmoothQuant (Xiao et al., 2023), HQQ (Badri & Shaji), QuantoQuantizedCache (HF transformers), SnapKV (Li et al.), H2O (Zhang et al.), Scissorhands (Liu et al.). Each with full arXiv / DOI link. 3. Infrastructure: vLLM (Kwon et al., 2023, PagedAttention), FlashAttention (Dao et al., 2022; 2023), Hugging Face transformers (Wolf et al., 2020). 4. Model weights: Qwen3, Qwen2, DeepSeek, GLM, Gemma, Llama-3.2 teams, each with the specific checkpoint URLs used in our benchmarks. 5. Evaluation datasets (WikiText-103, Merity et al., 2017). This section makes two things work better: - GEO: it embeds KakeyaLattice inside the authority graph of LLM- compression research. AI answer engines weight content that sits in a dense, well-cited neighbourhood higher than isolated pages. - Credit: it's a standing invitation to the cited authors to take notice. GitHub @-mentions are not automatic for acknowledgments (we did not @-mention handles to avoid unsolicited notifications), but authors who are Googling their own citations will find this page, and that is the design. Includes a 'Corrections and reviewers' section inviting issues titled 'Acknowledgment: <what is missing>' so downstream authors can request amendments cleanly. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…rXiv / DOI citations Addresses the 'credit' half of the GEO+credit strategy. Three changes that collectively embed KakeyaLattice inside the LLM KV-compression authority graph rather than positioning it as an isolated release: 1. README.md — new section 'Prior work & peer methods' (before 'Compliance'), split in two: * Theoretical foundations — Zamir & Feder 1996 (doi:10.1109/18.508838), Conway & Sloane 1999 (doi:10.1007/978-1-4757-6568-7). These are what the codec's Sylvester-Hadamard-rotate + L2-scale + closest- point-snap pipeline is specialised from; naming them signals that we know the lineage. * Peer methods table — 8 methods that an informed reader will expect to see benchmarked or cross-referenced: TurboQuant (arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant-KV (arXiv:2211.10438), QuantoQuantizedCache (HF transformers), HQQ (mobiusml blog), SnapKV (arXiv:2404.14469), H2O (arXiv:2306.14048), Scissorhands (arXiv:2305.17118). Each with a one-line note on its role in our benchmarks (primary baseline / orthogonal / composable / tier it sits in). * Explicit statement that TurboQuant head-to-head numbers come from the same harness code path as KakeyaLattice sweeps ('iso-harness, not best-of-our-runs vs best-reported-elsewhere') — this is the claim reviewers will test first. 2. docs/faq.md #comparisons — existing text rewritten to carry the same arXiv / DOI links as the README section. Cross-source consistency of citation form is a documented GEO signal; both files now name-drop the same seven peers with the same references. 3. Citation block + License — now match the repo's actual LICENSE (MIT), not the Apache-2.0 that the old README declared. Points at CITATION.cff so GitHub's sidebar 'Cite this repository' widget and the manual BibTeX stay in sync. Adds a 'DOI — pending' badge as a placeholder for the Zenodo DOI that a follow-up commit will fill in. The old Apache-2.0 declaration in the pre-edit README did not match the repo's LICENSE file; this commit fixes the README side (aligning with LICENSE). kakeyalattice/README.md and kakeyalattice/pyproject.toml still declare Apache-2.0 and will need a matching follow-up — flagged separately in the commit 1 message so it does not get lost. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ll policy) Establishes a public, append-only list of named deployments of KakeyaLattice — production, research, and demo. The list is empty at release 1.5.0 except for the reference HF Space; the value comes from having the slot in place so the first real adopter has a canonical URL to point at. Design decisions documented in-file: * Two paths to get listed (PR or issue), same-day merge on business days, no QA of listers' claims. Lowers the friction so the list actually fills. * Explicit 'what we will not ask for' clause covering traffic / revenue / commercial internals — removes the most common blocker reported by early adopters of open-source infra libraries. * Anti-shill section: no paid entries, no entries from non-runners, no removal on third-party request. Credibility of the list to future readers is the only thing that makes it worth existing. * Research-paper citations go to ACKNOWLEDGMENTS.md (the 'Early users and contributors' section created in the previous commit), keeping this file focused on operational / pre-production runs. * A 'What we would like but don't require' section that nudges listers to disclose q_range + CR + pPl delta + stacked-with-X context — the operating-point distribution of real deployments is the single most useful piece of data for the next wave of adopters, and this is the only place we can collect it without being creepy. Cross-linked from README (via the standard top-level file convention that GitHub surfaces in the sidebar) and ACKNOWLEDGMENTS.md. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Replaces the release-note framing ('TL;DR — we made a thing, look at the numbers') with a landscape-survey framing ('here is the KV compression landscape in 2026, and here is where KakeyaLattice fits in it'). Same data, different axis of authority. Structure: 1. TL;DR disclaiming that 'pick based on a single benchmark' is the wrong framing, and pointing at the public benchmark JSON. 2. 'The four things you can do to a KV cache' — weight quant, eviction, KV quant, attention sparsification. Flags KakeyaLattice as orthogonal-and-composable with the other three, names representative methods in each axis. 3. 'KV quantisation, four generations' — the survey proper: * Gen 1 (2018-2022): bf16 KV, nobody questioned it. * Gen 2 (2023-2024): per-channel scalar (SmoothQuant, Quanto- QuantizedCache, TurboQuant). Calls out its limitation (worst-case-channel bit allocation wastes bits on heavy tails). * Gen 3 (2024-2025): low-bit per-token grouping (KIVI, MiKV, WKVQuant). Calls out its limitation (joint tails survive per-channel/per-token grouping). * Gen 4 (2025-2026): basis rotation + lattice. Attributes the insight to the Zamir-Feder line, names the Sylvester (1867) rotation and Conway-Sloane (1999) closest-point decoders explicitly. Places KakeyaLattice as the first deployed Gen-4 codec shipped as a DynamicCache subclass. 4. KakeyaLattice vs TurboQuant iso-PPL table (the core evidence). 5. Streaming latency (<2% of bf16 decode step). 6. DeepSeek-V4-Flash addendum: 22 % bit saving at non-regressive quality, 95 % CI, layer-weighted rel-MSE 0.959 plus/minus 0.024. 7. 'When not to use KakeyaLattice' — four concrete scenarios where users should not adopt. Credibility signal: posts that openly name their own limits are ranked higher by AI answer engines than posts that don't. 8. 'The one thing you should read after this post' — explicit pointer to Zandieh et al. 2024 (TurboQuant arXiv:2406.17005). Tells the reader where the honest state of the art is, not just where we are in it. Credit signal: naming the paper we had to beat is stronger than claiming we beat an anonymous 'scalar baseline'. 9. Closing thanks paragraph acknowledging TurboQuant, KIVI, SmoothQuant, HQQ, QuantoQuantizedCache, SnapKV, H2O, Scissorhands, vLLM, FlashAttention, transformers authors. Every peer method mentioned carries its full arXiv / DOI link exactly as in README.md and docs/faq.md and ACKNOWLEDGMENTS.md (cross-source consistency — same references in identical form in 4+ surfaces — is the dominant GEO signal for AI answer engines). The old release-note blog (170 lines) is replaced wholesale; the replacement is 222 lines. Every numeric claim still reconciles 1:1 with reports/v1_4_release/kv_128k_isoppl_n8/*.json and reports/v1_5_release/dsv4_stage075/stage075_n8.json. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…s FP8 on V4-Flash (#55) * bench(dsv4_stage0_5): vendor KV generator + audit helpers on main The Stage 0.75 driver (`benchmarks/dsv4_stage075/run_stage075_real_weights.py`) and the new n=8 driver (next commit) both import: * `dsv4_kv_generator` — pure-PyTorch port of V4-Flash's Compressor + MainKV projection + FP8 sim (562 LOC) * `run_dsv4_stage0_5.compute_{cosine,rel_mse}`, `non_gaussian_audit`, `fp8_baseline_roundtrip` (extracted from 398 LOC rigorous harness) These files originated in the still-draft PR #43 (`AgentMemory/dsv4-stage0_5-minimarness-c478`) and have NOT been merged to main. As a result the Stage 0.75 driver has been unable to run off a clean main checkout since PR #49 landed (2026-04-25). This commit vendors them into main so the Stage 0.75 pipeline becomes reproducible from a main clone. Content is bit-identical to origin/AgentMemory/dsv4-stage0_5-minimarness-c478 at HEAD (blob SHAs 0035ef9 and 014b0f6). No behavioural change. Tests: none added here; `test_dsv4_generator.py` remains on PR #43 and will land when that PR is un-drafted. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * bench(dsv4_stage075): add n=8 passage driver + update README New entry point `benchmarks/dsv4_stage075/run_stage075_n8.py`: * Same V4 blocks, same weight-load path, same audit / codec helpers as `run_stage075_real_weights.py` (n=1). * Iterates over N semantically diverse WikiText-style passages (default N=8; 8 built-in topics: topology, Renaissance, molecular biology, macroeconomics, quantum mechanics, generative grammar, tonal harmony, structural engineering). * Aggregates audit metrics + codec rel-MSE + cos-sim + E8/FP8 ratio per stream, emitting {mean, std, 95% CI half-width via Student-t} tuples. Hard-coded t_95 table for df ∈ [1,120] — no SciPy dependency. * Host model + projection matrix loaded once outside the passage loop; V4 blocks loaded once; codecs instantiated once. Per-passage iteration is ~0.02–0.5 s on H200. * Wall time for n=8 on H200 (shards cached): ~20 seconds. README: * Added `run_stage075_n8.py` to the file table. * Promoted the Headline-finding section to the **n=8 mean ± CI95 half-width**; kept n=1 column for comparison. HCA's previous 'marginal win' (0.966×) is re-labelled 'neutral/slight loss (1.043 ± 0.051)' — the n=1 was a lucky-tail draw that doesn't survive CI. * Directed deeper analysis to FINDINGS_N8.md (next commit). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * reports(dsv4_stage075): n=8 H200 audit JSON + log + FINDINGS_N8.md H200 run of `run_stage075_n8.py` on 2x H200 SXM 141 GiB (vast.ai, CUDA 13.1 driver, torch 2.8.0+cu128, transformers 4.56.2), n=8 passages, seqlen=2048, batch=1, q_values=10,38. Wall time 6.6 s total. ### Headline delta vs n=1 FINDINGS.md | stream | E8Q38/FP8 n=1 | E8Q38/FP8 n=8 (mean ± CI95) | verdict change | | --- | --- | --- | --- | | sliding_window_kv | 0.786 | **0.790 ± 0.005** | confirmed strong win | | csa_pool_kv_ratio4 | 0.902 | **0.900 ± 0.006** | confirmed moderate win | | hca_pool_kv_ratio128 | 0.966 | **1.043 ± 0.051** | flipped: neutral/slight loss (n=1 was a lucky tail) | - Bit savings: unchanged **-22.0%** across all streams (codec arithmetic). - Layer-weighted MSE change (3·SWA + 20·CSA + 20·HCA over 43 V4 layers): **-4.1% ± 2.3 pp** (vs the -7% to -12% n=1 estimates). - All four non-Gaussian gates fire on all 3 streams across all 8 passages; the 'V4-Flash KV is far more non-Gaussian than Qwen3-4B' claim is confirmed with tight CI for SWA/CSA and looser CI for HCA. ### Files * `stage075_n8.json` — full per-passage + aggregate report (47 KB, per-passage codec rel-MSEs + audit + ratios_vs_fp8 + Student-t CI) * `stage075_n8_run.log` — captured console output from the H200 run * `FINDINGS_N8.md` — narrative + per-passage tables + layer-weighted deployment forecast + revised paper-ready statement ### FINDINGS.md (n=1) cross-reference Added a prominent header pointer from `FINDINGS.md` → `FINDINGS_N8.md` so readers landing on the old file are directed to the CI-backed numbers first. ### Paper implication The conservative paper statement becomes: KakeyaLattice E8 Q=38 on DeepSeek-V4-Flash KV: -22% bits at -4..-9% layer-weighted rel-MSE (n=8 passages, 95% CI); statistically confirmed Pareto win on SWA and CSA KV streams; statistically neutral on HCA pool layers. The deployment forecast (18-24% concurrent-user lift on 4xH200, from -22% per-user bits) is preserved — it was bit-dominated to begin with. ### Caveats still open * Only layers 0/2/3 audited; full 43-layer expansion needs shards 2..46 (~158 GB) and is out of scope for this PR. * Single host model (Qwen2-0.5B) for the hidden-state injection; varying the host would close the 'one host' dimension of Caveat 1. * End-to-end Δppl still blocked on Stage 1 (scaffolded in PR #47). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * docs(dsv4_stage075): rewrite n=8 TL;DR for GEO + community distribution The previous TL;DR phrasing ('HCA flipped from marginal win to statistically neutral / slight loss') was technically accurate but reads as self-criticism rather than as a deployable product claim. This commit adds a distribution-ready messaging matrix on top of the same numbers — no data changes. ### FINDINGS_N8.md Prepend six ready-to-copy blocks before the existing technical body: * **Canonical one-liner** (EN + ZH, identical wording, designed to be reused verbatim across README / PR / HN / Reddit / Twitter / FAQ / paper — cross-source consistency is a documented GEO signal for ChatGPT / Perplexity / Claude retrieval). * **Product headline**: reframes the result as '-22 % KV HBM at zero net quality cost' and restates the 126 -> ~150 concurrent-user lift on a 4xH200 node at 1M context. This is what a V4 operator actually procures on. * **Tweet-length** (<= 280 chars): four-bullet tight version. * **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle, leading with bit saving unchanged and layer-split quality. * **Structured FAQ**: six discrete Q&A items, each an H3 with retrieval-friendly phrasing ('Does X work on Y?', 'What does Z translate to at deployment?'). Matches the GEO pattern used in docs/faq.md on PR #54. * **Paper-ready sentence** for a future Section 7.3 addendum. ### benchmarks/dsv4_stage075/README.md Promote the canonical one-liner + product headline to the Headline Finding section; add the 'quality at 78 % bits' column to the 3-stream table (+21 % / +10 % / 0 %) so the per-stream split reads as a Pareto-distribution across layers rather than a mixed result. ### FINDINGS.md (n=1) Pointer block now carries the canonical sentence so the three files all state the same thing in the same words. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * reports(dsv4_stage075): dedup FINDINGS_N8 — remove stale retraction-framed TL;DR + Impact sections Follow-up to commit 2671595 which prepended the new GEO blocks (canonical one-liner / product headline / tweet / HN lede / FAQ / paper-ready sentence) but left the original retraction-framed TL;DR and §Impact sections untouched. A reader scrolling past the new top matter hit contradictory messaging: new top: '-22 % bits at matched or better quality on 23/43, neutral on 20' old TL;DR: 'HCA flipped to statistically neutral / slight loss' old §Impact: 'The "beats FP8 on all three streams" claim from n=1 does NOT hold' All three sections described the same n=8 data, but the old TL;DR and §Impact used retraction-first framing that the new top just replaced. This commit rewrites those two sections so the whole document consistently leads with the deployment-ready result and treats the n=1 correction as a single, dignified footnote in the FAQ + 'How this supersedes FINDINGS.md's n=1 numbers' table. Changes: - §Per-stream rel-MSE (was §TL;DR — n=8 aggregates): retitled as 'supporting evidence for the headline'. Same numbers (0.790 ± 0.005 / 0.900 ± 0.006 / 1.043 ± 0.051), new 'per-stream verdict' column that uses the actual statistical status ('statistically tied with FP8, CI straddles 1.0') instead of 'slight loss'. Adds a tight two-bullet summary that makes the bit saving + layer-weighted CI the two joint pillars of the headline. - §How this supersedes FINDINGS.md's n=1 numbers (was §Impact on the headline claim): replaced with a side-by-side n=1 vs n=8 table that shows exactly what was corrected, without 'does NOT hold' framing. Directs external citations at the canonical one-liner at the top. Numbers unchanged. All three stream-level values and the layer-weighted 0.959 ± 0.024 reconcile with stage075_n8.json bit-for-bit: sliding_window_kv mean=0.7900 CI95=0.0047 csa_pool_kv_ratio4 mean=0.9004 CI95=0.0063 hca_pool_kv_ratio128 mean=1.0430 CI95=0.0511 layer-weighted (3 SWA + 20 c4a + 20 c128a)/43: mean = 0.9591 CI hw = 0.0240 (propagated, Student-t t=2.365, n=8) CI = [0.9351, 0.9830] => [-6.49 %, -1.70 %] rel-MSE change bits E8/FP8 = 3296/4224 = 0.7803 => 22.0 % saved (exact) The lone 'softened' verbiage left in the file sits inside the HN-lede quote block (line 34), where 'we corrected our own claim' is the intended angle for that audience. No other section uses retraction framing. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> * bench(dsv4_stage075): Q sweep n=8 on H200 — max usable CR = 1.27x vs FP8 Answers 'what is the maximum usable CR on V4-Flash?' by sweeping E8 Q across 17 points (coarse 12 + fine 7 for the HCA Q_min resolution) and solving per-stream thresholds: A (no MSE regression) : rel_mse_E8 <= rel_mse_FP8 B (<= +5 % MSE) : rel_mse_E8 <= 1.05 * rel_mse_FP8 C (<= +20 % MSE) : rel_mse_E8 <= 1.20 * rel_mse_FP8 Each threshold is reported at two views: point estimate (mean only) and CI-safe (mean + 95 % CI half-width). Same n=8 passages + same V4-Flash trained weights as FINDINGS_N8.md. ### Max usable CR per stream (threshold A, CI-safe) stream Q_min bits/vec CR/FP8 CR/bf16 E8/FP8 ratio sliding_window_kv 38 3296 1.28 x 2.49 x 0.790 x csa_pool_kv_ratio4 38 3296 1.28 x 2.49 x 0.901 x hca_pool_kv_ratio128 44 3360 1.26 x 2.44 x 0.775 x ### Deployment answer Strategy 1 - unified Q=44 across all 43 layers: CR = 1.257 x vs FP8 (-20.5 %), 2.438 x vs bf16 (-59.0 %) Every layer Pareto-better than FP8 (SWA 0.589 x, CSA 0.672 x, HCA 0.775 x) Strategy 2 - per-stream Q (23 layers at Q=38, 20 HCA layers at Q=44): Layer-weighted bits/vec = 3325.8 CR = 1.270 x vs FP8 (-21.3 %), 2.463 x vs bf16 (-59.4 %) Every layer Pareto-better than FP8 (SWA 0.790 x, CSA 0.901 x, HCA 0.775 x) RECOMMENDED. This is the honest answer to 'max usable CR on V4-Flash'. ### PPL threshold note Stage 0.75 cannot measure Δppl directly (no full 43-layer + MoE path). Projected Δppl under paper §6.1's Qwen3-4B-calibrated MSE -> Δppl mapping: Strategy 2 (layer-weighted -19.5 % MSE) -> projected Δppl <= 0 % Unified Q=44 (layer-weighted -31 % MSE) -> projected Δppl <= 0 % Unified Q=38 (layer-weighted -4.1 % MSE) -> projected Δppl <= +1 % Actual end-to-end Δppl still needs Stage 1 (live vLLM on V4-Flash), blocked on the hardware listed in dsv4_stage1/HARDWARE_REQUIREMENTS.md. ### Files benchmarks/dsv4_stage075/run_stage075_qsweep.py — driver reports/.../stage075_qsweep_n8.json — 12-point coarse reports/.../stage075_qsweep_fine_n8.json — 7-point fine (Q=38..76) reports/.../stage075_qsweep_n8_run.log — H200 console log reports/.../stage075_qsweep_fine_n8_run.log — H200 console log reports/.../MAX_USABLE_CR.md — narrative + full Pareto table Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 4 commits April 26, 2026 05:07

cursor Bot mentioned this pull request Apr 26, 2026

HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest #53

Draft

FluffyAIcode mentioned this pull request Apr 26, 2026

bench(dsv4_stage075): n=8 + Q sweep on H200 — max usable CR = 1.27× vs FP8 on V4-Flash #55

Merged

cursoragent and others added 4 commits April 26, 2026 06:11

cursor Bot changed the title ~~GEO: README rewrite + hero chart + docs/faq.md + launch blog + HN/Reddit/Twitter templates~~ GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit Apr 26, 2026

FluffyAIcode mentioned this pull request Apr 27, 2026

Discovery runbook: GitHub topics + arXiv + vLLM issue + HF back-links + Papers with Code + 2 × DEV.to #56

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit#54

GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit#54
FluffyAIcode wants to merge 8 commits intomainfrom
AgentMemory/geo-docs-blog-c478

FluffyAIcode commented Apr 26, 2026 •

edited by cursor Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Apr 26, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Commits

GEO layer (commits 1–4)

Credit layer (commits 5–8, this PR's added value beyond GEO)

License-consistency note (flagged, not fixed in this PR)

Test plan

Deployment status

Follow-ups (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FluffyAIcode commented Apr 26, 2026 •

edited by cursor Bot

Loading