Skip to content

GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit#54

Draft
FluffyAIcode wants to merge 8 commits intomainfrom
AgentMemory/geo-docs-blog-c478
Draft

GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit#54
FluffyAIcode wants to merge 8 commits intomainfrom
AgentMemory/geo-docs-blog-c478

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

@FluffyAIcode FluffyAIcode commented Apr 26, 2026

Summary

Two-axis upgrade to the public-facing surfaces of the repo:

  1. GEO — make AI answer engines (ChatGPT / Perplexity / Claude) cite
    KakeyaLattice in response to "best LLM KV cache compression library",
    "how to compress KV cache in transformers", and adjacent queries.
  2. Credit — make researchers and engineers treat KakeyaLattice as a
    peer of TurboQuant / KIVI / HQQ / SnapKV rather than as an isolated
    release. Cross-cite prior work with full arXiv / DOI references,
    ship a GitHub-native citation file, and set up a named-deployments
    registry.

All numeric claims in this PR trace 1:1 to
reports/v1_4_release/kv_128k_isoppl_n8/*.json and
reports/v1_5_release/dsv4_stage075/stage075_n8.json. Scripts
regenerate both the README hero table
(benchmarks/extract_iso_ppl_table.py) and the hero chart
(benchmarks/make_hero_chart.py) from raw data. No mocks, no
fallbacks, no softened numbers.

Commits

GEO layer (commits 1–4)

  1. 0c7bb6cbench: reproducible iso-PPL hero chart + table extractors (from real n=8 JSON)
    benchmarks/make_hero_chart.py + benchmarks/extract_iso_ppl_table.py +
    assets/hero_pareto.png.

  2. 9650da0docs(readme): GEO-ready rewrite with hero chart + iso-PPL table + reproduce-from-source recipe
    First-screen value prop, hero chart, iso-PPL table, canonical quick-start
    snippet, operating-points table, anchor-linked FAQ, reproduction recipe.

  3. 020128adocs(faq): 15-question Q&A for AI answer engines + product prospects
    docs/faq.md with 15 HTML-anchored Q&As (structured Q&A is a known
    strong GEO signal).

  4. fcf4c10content(blog + announce): launch blog post, HN/Reddit/Twitter templates, awesome-list submission kit
    Launch kit under blog/ and docs/announce/ for HN, r/LocalLLaMA,
    Twitter, and 7 awesome-* list submissions.

Credit layer (commits 5–8, this PR's added value beyond GEO)

  1. a3c3f62credit: CITATION.cff + ACKNOWLEDGMENTS.md (cite-this-repo + prior-work attribution)

    • CITATION.cff (Citation File Format 1.2.0) — validates with
      cffconvert. Turns on GitHub's sidebar "Cite this repository"
      widget with BibTeX / APA / CFF export. License recorded as MIT
      to match the actual LICENSE file (see license-inconsistency
      note below).
    • ACKNOWLEDGMENTS.md — names the prior work KakeyaLattice stands on,
      in five buckets:
      • Theoretical foundations: Zamir & Feder 1996
        (doi:10.1109/18.508838), Conway & Sloane 1999
        (doi:10.1007/978-1-4757-6568-7), Sylvester 1867.
      • Peer methods we compare against: TurboQuant
        (arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant
        (arXiv:2211.10438), HQQ, QuantoQuantizedCache, SnapKV
        (arXiv:2404.14469), H2O (arXiv:2306.14048), Scissorhands
        (arXiv:2305.17118). Each with a sentence on its role in our
        benchmarks.
      • Infrastructure: vLLM (arXiv:2309.06180), FlashAttention
        (arXiv:2205.14135, arXiv:2307.08691), transformers
        (arXiv:1910.03771).
      • Model weights: Qwen3/Qwen2, DeepSeek-R1-Distill + V4-Flash,
        GLM-4-9B-Chat, Gemma-4-E4B, Llama-3.2 teams — each with exact
        checkpoint URLs used in our benchmarks.
      • Evaluation datasets: WikiText-103 (arXiv:1609.07843).
  2. aa6b30fdocs(readme+faq): add 'Prior work & peer methods' section with full arXiv / DOI citations

    • README new "Prior work & peer methods" section with a
      theoretical-foundations list and a peer-methods comparison table,
      every entry carrying a full arXiv / DOI link and a one-line
      statement of its role in our benchmarks.
    • FAQ #comparisons section rewritten to carry the same arXiv / DOI
      links as the README section. Cross-source consistency of citation
      form is a documented GEO signal — four surfaces now name-drop the
      same seven peers with identical references.
    • Citation block + License updated to match the repo's actual
      LICENSE (MIT), pointing at CITATION.cff so the sidebar widget
      and manual BibTeX stay in sync. DOI — pending badge placeholder
      for Zenodo.
  3. 4aea220credit: DEPLOYMENTS.md skeleton (named deployment registry + anti-shill policy)

    Empty-but-structured public list where early adopters can (via PR or
    issue) add their named deployment. Ships with the reference HF Space
    pre-listed. Explicit two-path contribution flow, explicit
    "what-we-will-not-ask-for" clause (no traffic / revenue disclosure
    required), explicit anti-shill policy (no paid entries, no non-runner
    entries, no third-party takedowns). Designed so the credibility of
    the list to future readers is the only constraint on content.

  4. e8299e3blog: rewrite launch post as 2026 KV-compression landscape survey

    The original 170-line launch blog (blog/2026-04-kakeyalattice-v1-5.md)
    was written as a release note ("we did a thing, here's the numbers").
    The 222-line replacement is structured as a 2026 landscape survey:

    • "The four things you can do to a KV cache" — weight quant,
      eviction, KV quant, attention sparsification. Frames
      KakeyaLattice as orthogonal-and-composable with the other axes
      rather than as a single-axis winner.
    • "KV quantisation, four generations" — a survey proper of the
      field 2018 → 2026, with each generation's representative methods
      cited (SmoothQuant / TurboQuant / Quanto / HQQ → KIVI / MiKV /
      WKVQuant → KakeyaLattice). Attributes the Gen-4 basis-rotation +
      lattice insight to Zamir–Feder, Conway–Sloane, and Sylvester.
    • KakeyaLattice vs TurboQuant iso-PPL table (same numbers as
      README, for post self-containment).
    • Streaming latency, DSv4-Flash 22 % bit-saving addendum
      (cross-linked to FINDINGS_N8.md).
    • "When not to use KakeyaLattice" — four concrete non-adoption
      scenarios. Credibility-signal: posts that openly name their own
      limits rank higher in AI-engine answers.
    • "The one thing you should read after this post" — explicit
      pointer to Zandieh et al. 2024 (TurboQuant). Credit-signal:
      naming the paper we had to beat is stronger than claiming we
      beat an anonymous scalar baseline.

License-consistency note (flagged, not fixed in this PR)

The repo has a pre-existing mismatch:

  • LICENSE (repo root) and kakeyalattice/LICENSE are both MIT
    ("MIT License — Copyright (c) 2026 FluffyAIcode").
  • Old README.md (pre-edit), kakeyalattice/README.md, and
    kakeyalattice/pyproject.toml classifier said Apache-2.0.

This PR fixes the root README side (now says MIT — matches the
actual LICENSE file) and sets CITATION.cff license: MIT so the
sidebar "Cite this repository" widget is consistent. kakeyalattice/README.md
and kakeyalattice/pyproject.toml still declare Apache-2.0 and will
need a separate project-governance decision: pick MIT (keep the
text in LICENSE) or Apache-2.0 (rewrite both LICENSE files). I did
not make this call here because it affects anyone who has already
installed the PyPI wheel.

Test plan

  • python -c "import ast; ast.parse(open('...'))" — all new/changed
    .py files parse.
  • cffconvert validateCITATION.cff passes; BibTeX export renders.
  • python benchmarks/extract_iso_ppl_table.py — generated table
    matches the table in README.md bit-for-bit (same rows, same CR
    percentages).
  • python benchmarks/make_hero_chart.py — regenerates
    assets/hero_pareto.png deterministically from raw JSON.
  • Manual scan of ACKNOWLEDGMENTS.md, FAQ #comparisons, blog
    "four generations" section — same 7 peer methods in identical
    citation form (arXiv:XXXX.XXXXX + linked) across all four
    surfaces.

Deployment status

Unchanged vs PR #54 original — nothing in this PR touches the HF Space.
Once merged to main, the canonical one-liner / FAQ / citation widget
become the references that:

Follow-ups (not in this PR)

  1. License unification — pick MIT or Apache-2.0, apply uniformly
    across root + kakeyalattice/* files. Project-governance call.
  2. Zenodo DOI — mint a DOI for v1.5.0 via Zenodo's GitHub
    integration, replace the DOI — pending badge with the real DOI.
    Requires the author's Zenodo account.
  3. arXiv upload — submit reports/paper/kakeyalattice.pdf to
    arXiv (cs.LG primary + cs.CL cross-list); reference from
    CITATION.cff + ACKNOWLEDGMENTS.md once the ID is minted.
    Requires the author's arXiv account.
  4. Neighbour-repo cross-reference PRs — three issue / PR drafts
    to file: (a) HF transformers docs listing
    KakeyaLatticeCache under community caches; (b) vLLM meta-issue
    on KV-quant backend integration path; (c) Qwen3 model card
    "Related projects" entry. Drafts in docs/announce/ extension
    TBD.
  5. Awesome-list submissions — seven PRs per the template in
    docs/announce/awesome_submissions.md.
Open in Web Open in Cursor 

cursoragent and others added 4 commits April 26, 2026 05:07
…n=8 JSON)

Two small scripts that turn the existing n=8 iso-PPL benchmark JSON under
reports/v1_4_release/kv_128k_isoppl_n8/ into the README's hero table + hero
chart. Both are pure readers — no re-running of the vLLM harness required.

benchmarks/make_hero_chart.py
  Generates assets/hero_pareto.png, a 4-panel Pareto front (|Δppl| vs 128k
  KV compression ratio) comparing KakeyaLattice D4 vs TurboQuant across
  Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and DeepSeek-R1-Distill-Qwen-1.5B.
  Each point is annotated with its Q (KakeyaLattice) or b (TurboQuant)
  value; Q/b are read from the JSON metadata, not inferred from bits, so
  Gemma's head_dim=256 (non-128) labels correctly.

benchmarks/extract_iso_ppl_table.py
  Emits the README hero table in markdown. Uses the same methodology as
  reports/v1_4_release/kv_128k_isoppl_n8/V14_VS_TQ_ISOPPL_REPORT.md:
  winner = channel with highest total_ratio_128k whose mean_abs_delta_ppl
  <= threshold, for thresholds 0.5% / 1.0% / 2.0%. Output verified to
  match the published n=8 report numbers bit-for-bit (e.g. Qwen3-4B
  <=2% -> 2.77x vs 2.18x -> +26.9%).

assets/hero_pareto.png
  Initial generated chart (1817x1308 RGBA, ~241 KB), regenerable at any
  time via 'python benchmarks/make_hero_chart.py'.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…roduce-from-source recipe

Rewrites the README as the single-source-of-truth landing surface that AI
answer engines (ChatGPT, Perplexity, Claude) will crawl when users ask
'best LLM KV cache compression library' or 'how to compress KV cache in
transformers'. Key changes:

1. First screen: one-sentence value prop ('2.4x-2.8x KV compression at
   <1 % perplexity loss, drop-in DynamicCache subclass') followed
   immediately by PyPI/HF-Space/License badges and the hero Pareto chart.
2. Headline numbers table: iso-PPL CR at 0.5% / 1% / 2% perplexity-loss
   targets for Qwen3-4B, GLM-4-9B-Chat, Gemma-4-E4B, and
   DeepSeek-R1-Distill-Qwen-1.5B side-by-side with TurboQuant baseline.
   Every row is generated by the new benchmarks/extract_iso_ppl_table.py
   from the raw n=8 iso-PPL JSON — no hand-typed numbers.
3. Quick-start: one code block showing the full Qwen3-0.6B integration
   via KakeyaLatticeCache. This matches the canonical snippet used in
   the HF Space demo and in docs/faq.md, giving LLM retrievers a
   consistent example to latch onto (cross-source consistency is a
   known GEO signal).
4. Operating-points table: recommended q_range settings with bits/vec
   and typical |Δppl|, so readers can pick a configuration without
   reading the paper first.
5. FAQ section linking to docs/faq.md, with the 6 highest-volume
   questions anchor-linked (vLLM integration, comparisons, supported
   models, calibration, streaming, hardware).
6. Reproduction section: explicit CLI to regenerate every README
   artifact from raw data ('python benchmarks/make_hero_chart.py' /
   'python benchmarks/extract_iso_ppl_table.py' / the full vLLM sweep).

The previous 'v1.4 release notes' framing is preserved as a downstream
mention; the landing surface is now capability-led rather than
release-led. All numeric claims in the new README trace to
reports/v1_4_release/kv_128k_isoppl_n8/ and
reports/v1_4_release/streaming/ — no numbers moved, added, or softened.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Creates docs/faq.md covering the 15 highest-volume questions we see on
GitHub issues, HF Space community, r/LocalLLaMA, and internal user
surveys. Structured as H2-per-question with explicit anchor IDs so
individual answers can be deep-linked from the README, the paper, blog
posts, and most importantly so LLM retrievers can quote discrete answers
verbatim (a known GEO signal — structured Q&A gets lifted into ChatGPT
/ Perplexity answers more cleanly than prose).

Questions covered (in order):

  1. What is KakeyaLattice?
  2. What compression ratios can I actually expect?
  3. How does it compare to KIVI, HQQ, QuantoQuantizedCache, SmoothQuant-KV?
  4. Does it work with vLLM / SGLang / TensorRT-LLM / llama.cpp?
  5. What models and head_dim values are supported?
  6. Does it require calibration or warm-up?
  7. Does the codec work in streaming / online mode?
  8. How much runtime overhead does the codec add?
  9. What hardware is required?
 10. How do I choose q_range?
 11. Does compression reduce real HBM usage, or only reconstruction error?
 12. Is perplexity the only metric you optimise for?
 13. Does KakeyaLattice work with quantised (INT4 / INT8) models?
 14. How do I cite KakeyaLattice?
 15. Where can I try it without installing anything?

Every numerical claim cross-links to the raw data file in reports/. Every
comparison claim against KIVI / HQQ / TurboQuant / SnapKV is either backed
by an existing measurement (TurboQuant — n=8 iso-PPL report) or
explicitly flagged as planned rather than measured (KIVI direct
head-to-head). No claim in this FAQ goes beyond what is in
reports/v1_4_release/ or the paper.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…es, awesome-list submission kit

Adds the three external-audience launch artifacts requested by the GTM
plan:

blog/2026-04-kakeyalattice-v1-5.md
  ~1,500-word launch post. Structure: problem framing (KV is the memory
  hog, scalar quants waste bits on heavy tails) -> method in three
  bullets (Hadamard rotation + L2 scale + D4/E8 lattice snap) ->
  headline iso-PPL table -> streaming latency numbers -> 14-line
  integration snippet -> honest caveat (round-trip store today; native
  vLLM next) -> links. Every number traces to the n=8 iso-PPL JSON.
  Pubdate-titled so this post becomes the stable 'launch' anchor; later
  posts can link back to it.

docs/announce/titles.md
  Three title variants (benchmark-led / problem-led / research-led),
  each with a 280-char social shortform. Recommends which audience to
  use which for (A for HN, B for r/LocalLLaMA, C for Twitter). No
  claims beyond what the README already states.

docs/announce/hn_post.md
  Hacker News submission kit. URL-form title + first-comment body.
  Includes timing guidance (Tue/Wed/Thu morning ET) and four
  pre-prepared replies for the failure modes that always come up on
  HN launches of ML libraries ('why not compare to KIVI', 'HBM claim
  is misleading', 'n=8 is small', 'streaming latency measured how').
  Each pre-reply points to the exact file in reports/ that backs it.

docs/announce/reddit_localllama.md
  r/LocalLLaMA submission. Title + body + timing + pre-prepared replies.
  Framed as a practitioner pitch ('if you've been frustrated...') rather
  than a research announcement — matches how that subreddit rewards
  content.

docs/announce/twitter_thread.md
  6-tweet thread, each under 280 chars, with an explicit instruction to
  attach assets/hero_pareto.png at tweet 2 and defer external links to
  tweet 6 (Twitter's algorithm deprioritises threads whose first tweet
  contains external URLs).

docs/announce/awesome_submissions.md
  Seven target awesome-* lists ranked by expected GEO payoff
  (Awesome-LLM-Inference, Awesome-KV-Cache-Management,
  awesome-efficient-deep-learning, etc.). Canonical one-line entry to
  paste verbatim in each submission — cross-source consistency is a
  known GEO signal; same repo + same description + same three bracketed
  links across all lists means LLM retrievers converge on the canonical
  representation. Includes a PR template and a tracking table so future
  contributors don't duplicate submissions.

None of these artifacts makes a numerical claim that is not already
backed by a file in reports/. The launch body is one place; the
recommended action is: do not publish until the PR lands on main AND
the vLLM PR (forthcoming, pending GPU validation) is at least in draft,
so the post does not over-promise HBM savings.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Apr 26, 2026
…VI' comparison block

Adds a four-bullet comparison section that explicitly positions
KakeyaLattice against the adjacent KV-cache-quant and eviction methods
Space visitors are most likely to already know:

- HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal).
- QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar
  in transformers): cites the 9 %-38 % CR advantage at <=1 % |Δppl|
  across four models, with link back to the GitHub README table.
- KIVI (2-bit KV): explains why the Hadamard rotation matters.
- SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal).

The 9-38 % range is taken from the iso-PPL table in the new GitHub
README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL
JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the
Space page now surfaces the same numbers as the GitHub README.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Apr 26, 2026
The previous TL;DR phrasing ('HCA flipped from marginal win to
statistically neutral / slight loss') was technically accurate but
reads as self-criticism rather than as a deployable product claim.
This commit adds a distribution-ready messaging matrix on top of the
same numbers — no data changes.

### FINDINGS_N8.md

Prepend six ready-to-copy blocks before the existing technical body:

  * **Canonical one-liner** (EN + ZH, identical wording, designed to be
    reused verbatim across README / PR / HN / Reddit / Twitter / FAQ /
    paper — cross-source consistency is a documented GEO signal for
    ChatGPT / Perplexity / Claude retrieval).
  * **Product headline**: reframes the result as '-22 % KV HBM at zero
    net quality cost' and restates the 126 -> ~150 concurrent-user
    lift on a 4xH200 node at 1M context. This is what a V4 operator
    actually procures on.
  * **Tweet-length** (<= 280 chars): four-bullet tight version.
  * **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle,
    leading with bit saving unchanged and layer-split quality.
  * **Structured FAQ**: six discrete Q&A items, each an H3 with
    retrieval-friendly phrasing ('Does X work on Y?', 'What does Z
    translate to at deployment?'). Matches the GEO pattern used in
    docs/faq.md on PR #54.
  * **Paper-ready sentence** for a future Section 7.3 addendum.

### benchmarks/dsv4_stage075/README.md

Promote the canonical one-liner + product headline to the Headline
Finding section; add the 'quality at 78 % bits' column to the 3-stream
table (+21 % / +10 % / 0 %) so the per-stream split reads as a
Pareto-distribution across layers rather than a mixed result.

### FINDINGS.md (n=1)

Pointer block now carries the canonical sentence so the three files
all state the same thing in the same words.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursoragent and others added 4 commits April 26, 2026 06:11
…k attribution)

Two files that address the 'credit' axis alongside the 'GEO' axis the
rest of this PR already serves. Both are standard-format artifacts that
(a) let downstream users cite us cleanly and (b) explicitly attribute
the prior work KakeyaLattice builds on.

CITATION.cff
  GitHub-native citation metadata (Citation File Format 1.2.0). Turns on
  the 'Cite this repository' widget in the repo sidebar and exposes a
  BibTeX / APA export the community expects. Validated with cffconvert
  ('CITATION.cff validates cleanly'). License recorded as MIT to match
  the actual LICENSE file at the repo root. Authors set to Allen Li
  with the email + affiliation already in reports/paper/kakeyalattice.tex.

  NOTE (flagged for follow-up, not fixed in this commit): the LICENSE
  file text is MIT but README.md + kakeyalattice/README.md +
  kakeyalattice/pyproject.toml classifier declare Apache-2.0. CITATION.cff
  here follows the actual LICENSE file text. A separate commit should
  unify the two — picking MIT or Apache-2.0 is a project-governance
  decision, not an editorial one.

ACKNOWLEDGMENTS.md
  Names the prior work KakeyaLattice stands on, grouped into five
  buckets:

    1. Theoretical foundations: Zamir & Feder (nested lattice
       quantisation), Conway & Sloane (closest-point algorithms for
       D4/E8), Sylvester (1867 Hadamard matrix).
    2. Peer methods we compare against: TurboQuant (Zandieh et al.,
       2024), KIVI (Liu et al., 2024), SmoothQuant (Xiao et al., 2023),
       HQQ (Badri & Shaji), QuantoQuantizedCache (HF transformers),
       SnapKV (Li et al.), H2O (Zhang et al.), Scissorhands (Liu et al.).
       Each with full arXiv / DOI link.
    3. Infrastructure: vLLM (Kwon et al., 2023, PagedAttention),
       FlashAttention (Dao et al., 2022; 2023), Hugging Face
       transformers (Wolf et al., 2020).
    4. Model weights: Qwen3, Qwen2, DeepSeek, GLM, Gemma, Llama-3.2
       teams, each with the specific checkpoint URLs used in our
       benchmarks.
    5. Evaluation datasets (WikiText-103, Merity et al., 2017).

  This section makes two things work better:

    - GEO: it embeds KakeyaLattice inside the authority graph of LLM-
      compression research. AI answer engines weight content that sits
      in a dense, well-cited neighbourhood higher than isolated pages.
    - Credit: it's a standing invitation to the cited authors to take
      notice. GitHub @-mentions are not automatic for acknowledgments
      (we did not @-mention handles to avoid unsolicited notifications),
      but authors who are Googling their own citations will find this
      page, and that is the design.

  Includes a 'Corrections and reviewers' section inviting issues titled
  'Acknowledgment: <what is missing>' so downstream authors can request
  amendments cleanly.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…rXiv / DOI citations

Addresses the 'credit' half of the GEO+credit strategy. Three changes
that collectively embed KakeyaLattice inside the LLM KV-compression
authority graph rather than positioning it as an isolated release:

1. README.md — new section 'Prior work & peer methods' (before
   'Compliance'), split in two:
     * Theoretical foundations — Zamir & Feder 1996 (doi:10.1109/18.508838),
       Conway & Sloane 1999 (doi:10.1007/978-1-4757-6568-7). These are
       what the codec's Sylvester-Hadamard-rotate + L2-scale + closest-
       point-snap pipeline is specialised from; naming them signals
       that we know the lineage.
     * Peer methods table — 8 methods that an informed reader will
       expect to see benchmarked or cross-referenced: TurboQuant
       (arXiv:2406.17005), KIVI (arXiv:2402.02750), SmoothQuant-KV
       (arXiv:2211.10438), QuantoQuantizedCache (HF transformers),
       HQQ (mobiusml blog), SnapKV (arXiv:2404.14469), H2O
       (arXiv:2306.14048), Scissorhands (arXiv:2305.17118). Each with
       a one-line note on its role in our benchmarks (primary
       baseline / orthogonal / composable / tier it sits in).
     * Explicit statement that TurboQuant head-to-head numbers come
       from the same harness code path as KakeyaLattice sweeps
       ('iso-harness, not best-of-our-runs vs best-reported-elsewhere')
       — this is the claim reviewers will test first.

2. docs/faq.md #comparisons — existing text rewritten to carry the
   same arXiv / DOI links as the README section. Cross-source
   consistency of citation form is a documented GEO signal; both
   files now name-drop the same seven peers with the same references.

3. Citation block + License — now match the repo's actual LICENSE
   (MIT), not the Apache-2.0 that the old README declared. Points at
   CITATION.cff so GitHub's sidebar 'Cite this repository' widget
   and the manual BibTeX stay in sync. Adds a 'DOI — pending' badge
   as a placeholder for the Zenodo DOI that a follow-up commit will
   fill in.

The old Apache-2.0 declaration in the pre-edit README did not match
the repo's LICENSE file; this commit fixes the README side (aligning
with LICENSE). kakeyalattice/README.md and kakeyalattice/pyproject.toml
still declare Apache-2.0 and will need a matching follow-up — flagged
separately in the commit 1 message so it does not get lost.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ll policy)

Establishes a public, append-only list of named deployments of
KakeyaLattice — production, research, and demo. The list is empty at
release 1.5.0 except for the reference HF Space; the value comes from
having the slot in place so the first real adopter has a canonical URL
to point at.

Design decisions documented in-file:

* Two paths to get listed (PR or issue), same-day merge on business
  days, no QA of listers' claims. Lowers the friction so the list
  actually fills.
* Explicit 'what we will not ask for' clause covering traffic /
  revenue / commercial internals — removes the most common blocker
  reported by early adopters of open-source infra libraries.
* Anti-shill section: no paid entries, no entries from non-runners,
  no removal on third-party request. Credibility of the list to
  future readers is the only thing that makes it worth existing.
* Research-paper citations go to ACKNOWLEDGMENTS.md (the 'Early users
  and contributors' section created in the previous commit), keeping
  this file focused on operational / pre-production runs.
* A 'What we would like but don't require' section that nudges listers
  to disclose q_range + CR + pPl delta + stacked-with-X context — the
  operating-point distribution of real deployments is the single most
  useful piece of data for the next wave of adopters, and this is
  the only place we can collect it without being creepy.

Cross-linked from README (via the standard top-level file convention
that GitHub surfaces in the sidebar) and ACKNOWLEDGMENTS.md.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Replaces the release-note framing ('TL;DR — we made a thing, look at
the numbers') with a landscape-survey framing ('here is the KV
compression landscape in 2026, and here is where KakeyaLattice fits
in it'). Same data, different axis of authority.

Structure:

  1. TL;DR disclaiming that 'pick based on a single benchmark' is the
     wrong framing, and pointing at the public benchmark JSON.
  2. 'The four things you can do to a KV cache' — weight quant,
     eviction, KV quant, attention sparsification. Flags KakeyaLattice
     as orthogonal-and-composable with the other three, names
     representative methods in each axis.
  3. 'KV quantisation, four generations' — the survey proper:
       * Gen 1 (2018-2022): bf16 KV, nobody questioned it.
       * Gen 2 (2023-2024): per-channel scalar (SmoothQuant, Quanto-
         QuantizedCache, TurboQuant). Calls out its limitation
         (worst-case-channel bit allocation wastes bits on heavy tails).
       * Gen 3 (2024-2025): low-bit per-token grouping (KIVI, MiKV,
         WKVQuant). Calls out its limitation (joint tails survive
         per-channel/per-token grouping).
       * Gen 4 (2025-2026): basis rotation + lattice. Attributes the
         insight to the Zamir-Feder line, names the Sylvester
         (1867) rotation and Conway-Sloane (1999) closest-point
         decoders explicitly. Places KakeyaLattice as the first
         deployed Gen-4 codec shipped as a DynamicCache subclass.
  4. KakeyaLattice vs TurboQuant iso-PPL table (the core evidence).
  5. Streaming latency (<2% of bf16 decode step).
  6. DeepSeek-V4-Flash addendum: 22 % bit saving at non-regressive
     quality, 95 % CI, layer-weighted rel-MSE 0.959 plus/minus 0.024.
  7. 'When not to use KakeyaLattice' — four concrete scenarios where
     users should not adopt. Credibility signal: posts that openly
     name their own limits are ranked higher by AI answer engines
     than posts that don't.
  8. 'The one thing you should read after this post' — explicit pointer
     to Zandieh et al. 2024 (TurboQuant arXiv:2406.17005). Tells
     the reader where the honest state of the art is, not just where
     we are in it. Credit signal: naming the paper we had to beat is
     stronger than claiming we beat an anonymous 'scalar baseline'.
  9. Closing thanks paragraph acknowledging TurboQuant, KIVI,
     SmoothQuant, HQQ, QuantoQuantizedCache, SnapKV, H2O,
     Scissorhands, vLLM, FlashAttention, transformers authors.

Every peer method mentioned carries its full arXiv / DOI link
exactly as in README.md and docs/faq.md and ACKNOWLEDGMENTS.md
(cross-source consistency — same references in identical form in 4+
surfaces — is the dominant GEO signal for AI answer engines).

The old release-note blog (170 lines) is replaced wholesale; the
replacement is 222 lines. Every numeric claim still reconciles 1:1
with reports/v1_4_release/kv_128k_isoppl_n8/*.json and
reports/v1_5_release/dsv4_stage075/stage075_n8.json.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title GEO: README rewrite + hero chart + docs/faq.md + launch blog + HN/Reddit/Twitter templates GEO + Credit: README hero + FAQ + landscape-survey blog + CITATION.cff + ACKNOWLEDGMENTS.md + DEPLOYMENTS.md + launch kit Apr 26, 2026
FluffyAIcode added a commit that referenced this pull request Apr 27, 2026
…s FP8 on V4-Flash (#55)

* bench(dsv4_stage0_5): vendor KV generator + audit helpers on main

The Stage 0.75 driver (`benchmarks/dsv4_stage075/run_stage075_real_weights.py`)
and the new n=8 driver (next commit) both import:

  * `dsv4_kv_generator` — pure-PyTorch port of V4-Flash's Compressor
    + MainKV projection + FP8 sim (562 LOC)
  * `run_dsv4_stage0_5.compute_{cosine,rel_mse}`,
    `non_gaussian_audit`, `fp8_baseline_roundtrip`
    (extracted from 398 LOC rigorous harness)

These files originated in the still-draft PR #43
(`AgentMemory/dsv4-stage0_5-minimarness-c478`) and have NOT been
merged to main. As a result the Stage 0.75 driver has been unable to
run off a clean main checkout since PR #49 landed (2026-04-25). This
commit vendors them into main so the Stage 0.75 pipeline becomes
reproducible from a main clone.

Content is bit-identical to origin/AgentMemory/dsv4-stage0_5-minimarness-c478
at HEAD (blob SHAs 0035ef9 and 014b0f6). No behavioural change.

Tests: none added here; `test_dsv4_generator.py` remains on PR #43 and
will land when that PR is un-drafted.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* bench(dsv4_stage075): add n=8 passage driver + update README

New entry point `benchmarks/dsv4_stage075/run_stage075_n8.py`:

  * Same V4 blocks, same weight-load path, same audit / codec helpers
    as `run_stage075_real_weights.py` (n=1).
  * Iterates over N semantically diverse WikiText-style passages
    (default N=8; 8 built-in topics: topology, Renaissance, molecular
    biology, macroeconomics, quantum mechanics, generative grammar,
    tonal harmony, structural engineering).
  * Aggregates audit metrics + codec rel-MSE + cos-sim + E8/FP8 ratio
    per stream, emitting {mean, std, 95% CI half-width via Student-t}
    tuples. Hard-coded t_95 table for df ∈ [1,120] — no SciPy
    dependency.
  * Host model + projection matrix loaded once outside the passage
    loop; V4 blocks loaded once; codecs instantiated once. Per-passage
    iteration is ~0.02–0.5 s on H200.
  * Wall time for n=8 on H200 (shards cached): ~20 seconds.

README:
  * Added `run_stage075_n8.py` to the file table.
  * Promoted the Headline-finding section to the **n=8 mean ± CI95
    half-width**; kept n=1 column for comparison. HCA's previous
    'marginal win' (0.966×) is re-labelled 'neutral/slight loss
    (1.043 ± 0.051)' — the n=1 was a lucky-tail draw that doesn't
    survive CI.
  * Directed deeper analysis to FINDINGS_N8.md (next commit).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* reports(dsv4_stage075): n=8 H200 audit JSON + log + FINDINGS_N8.md

H200 run of `run_stage075_n8.py` on 2x H200 SXM 141 GiB (vast.ai,
CUDA 13.1 driver, torch 2.8.0+cu128, transformers 4.56.2), n=8 passages,
seqlen=2048, batch=1, q_values=10,38. Wall time 6.6 s total.

### Headline delta vs n=1 FINDINGS.md

| stream | E8Q38/FP8 n=1 | E8Q38/FP8 n=8 (mean ± CI95) | verdict change |
| --- | --- | --- | --- |
| sliding_window_kv | 0.786 | **0.790 ± 0.005** | confirmed strong win |
| csa_pool_kv_ratio4 | 0.902 | **0.900 ± 0.006** | confirmed moderate win |
| hca_pool_kv_ratio128 | 0.966 | **1.043 ± 0.051** | flipped: neutral/slight loss (n=1 was a lucky tail) |

- Bit savings: unchanged **-22.0%** across all streams (codec arithmetic).
- Layer-weighted MSE change (3·SWA + 20·CSA + 20·HCA over 43 V4 layers):
  **-4.1% ± 2.3 pp** (vs the -7% to -12% n=1 estimates).
- All four non-Gaussian gates fire on all 3 streams across all 8
  passages; the 'V4-Flash KV is far more non-Gaussian than Qwen3-4B'
  claim is confirmed with tight CI for SWA/CSA and looser CI for HCA.

### Files

  * `stage075_n8.json` — full per-passage + aggregate report
    (47 KB, per-passage codec rel-MSEs + audit + ratios_vs_fp8 + Student-t CI)
  * `stage075_n8_run.log` — captured console output from the H200 run
  * `FINDINGS_N8.md` — narrative + per-passage tables + layer-weighted
    deployment forecast + revised paper-ready statement

### FINDINGS.md (n=1) cross-reference

Added a prominent header pointer from `FINDINGS.md` → `FINDINGS_N8.md`
so readers landing on the old file are directed to the CI-backed
numbers first.

### Paper implication

The conservative paper statement becomes:

    KakeyaLattice E8 Q=38 on DeepSeek-V4-Flash KV: -22% bits at
    -4..-9% layer-weighted rel-MSE (n=8 passages, 95% CI); statistically
    confirmed Pareto win on SWA and CSA KV streams; statistically
    neutral on HCA pool layers.

The deployment forecast (18-24% concurrent-user lift on 4xH200, from
-22% per-user bits) is preserved — it was bit-dominated to begin with.

### Caveats still open

  * Only layers 0/2/3 audited; full 43-layer expansion needs shards 2..46
    (~158 GB) and is out of scope for this PR.
  * Single host model (Qwen2-0.5B) for the hidden-state injection;
    varying the host would close the 'one host' dimension of Caveat 1.
  * End-to-end Δppl still blocked on Stage 1 (scaffolded in PR #47).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* docs(dsv4_stage075): rewrite n=8 TL;DR for GEO + community distribution

The previous TL;DR phrasing ('HCA flipped from marginal win to
statistically neutral / slight loss') was technically accurate but
reads as self-criticism rather than as a deployable product claim.
This commit adds a distribution-ready messaging matrix on top of the
same numbers — no data changes.

### FINDINGS_N8.md

Prepend six ready-to-copy blocks before the existing technical body:

  * **Canonical one-liner** (EN + ZH, identical wording, designed to be
    reused verbatim across README / PR / HN / Reddit / Twitter / FAQ /
    paper — cross-source consistency is a documented GEO signal for
    ChatGPT / Perplexity / Claude retrieval).
  * **Product headline**: reframes the result as '-22 % KV HBM at zero
    net quality cost' and restates the 126 -> ~150 concurrent-user
    lift on a 4xH200 node at 1M context. This is what a V4 operator
    actually procures on.
  * **Tweet-length** (<= 280 chars): four-bullet tight version.
  * **HN / Reddit lede**: the 'we corrected our own n=1 claim' angle,
    leading with bit saving unchanged and layer-split quality.
  * **Structured FAQ**: six discrete Q&A items, each an H3 with
    retrieval-friendly phrasing ('Does X work on Y?', 'What does Z
    translate to at deployment?'). Matches the GEO pattern used in
    docs/faq.md on PR #54.
  * **Paper-ready sentence** for a future Section 7.3 addendum.

### benchmarks/dsv4_stage075/README.md

Promote the canonical one-liner + product headline to the Headline
Finding section; add the 'quality at 78 % bits' column to the 3-stream
table (+21 % / +10 % / 0 %) so the per-stream split reads as a
Pareto-distribution across layers rather than a mixed result.

### FINDINGS.md (n=1)

Pointer block now carries the canonical sentence so the three files
all state the same thing in the same words.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* reports(dsv4_stage075): dedup FINDINGS_N8 — remove stale retraction-framed TL;DR + Impact sections

Follow-up to commit 2671595 which prepended the new GEO blocks
(canonical one-liner / product headline / tweet / HN lede / FAQ /
paper-ready sentence) but left the original retraction-framed TL;DR and
§Impact sections untouched. A reader scrolling past the new top matter
hit contradictory messaging:

  new top:      '-22 % bits at matched or better quality on 23/43, neutral on 20'
  old TL;DR:    'HCA flipped to statistically neutral / slight loss'
  old §Impact:  'The "beats FP8 on all three streams" claim from n=1 does NOT hold'

All three sections described the same n=8 data, but the old TL;DR and
§Impact used retraction-first framing that the new top just replaced.
This commit rewrites those two sections so the whole document
consistently leads with the deployment-ready result and treats the n=1
correction as a single, dignified footnote in the FAQ +
'How this supersedes FINDINGS.md's n=1 numbers' table.

Changes:

- §Per-stream rel-MSE (was §TL;DR — n=8 aggregates): retitled as
  'supporting evidence for the headline'. Same numbers
  (0.790 ± 0.005 / 0.900 ± 0.006 / 1.043 ± 0.051), new 'per-stream
  verdict' column that uses the actual statistical status
  ('statistically tied with FP8, CI straddles 1.0') instead of
  'slight loss'. Adds a tight two-bullet summary that makes the bit
  saving + layer-weighted CI the two joint pillars of the headline.
- §How this supersedes FINDINGS.md's n=1 numbers (was §Impact on the
  headline claim): replaced with a side-by-side n=1 vs n=8 table that
  shows exactly what was corrected, without 'does NOT hold' framing.
  Directs external citations at the canonical one-liner at the top.

Numbers unchanged. All three stream-level values and the layer-weighted
0.959 ± 0.024 reconcile with stage075_n8.json bit-for-bit:

  sliding_window_kv    mean=0.7900  CI95=0.0047
  csa_pool_kv_ratio4   mean=0.9004  CI95=0.0063
  hca_pool_kv_ratio128 mean=1.0430  CI95=0.0511
  layer-weighted (3 SWA + 20 c4a + 20 c128a)/43:
    mean  = 0.9591
    CI hw = 0.0240 (propagated, Student-t t=2.365, n=8)
    CI    = [0.9351, 0.9830]  =>  [-6.49 %, -1.70 %] rel-MSE change
  bits E8/FP8 = 3296/4224 = 0.7803  =>  22.0 % saved (exact)

The lone 'softened' verbiage left in the file sits inside the HN-lede
quote block (line 34), where 'we corrected our own claim' is the
intended angle for that audience. No other section uses
retraction framing.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* bench(dsv4_stage075): Q sweep n=8 on H200 — max usable CR = 1.27x vs FP8

Answers 'what is the maximum usable CR on V4-Flash?' by sweeping E8
Q across 17 points (coarse 12 + fine 7 for the HCA Q_min
resolution) and solving per-stream thresholds:

    A (no MSE regression) : rel_mse_E8 <= rel_mse_FP8
    B (<= +5 % MSE)       : rel_mse_E8 <= 1.05 * rel_mse_FP8
    C (<= +20 % MSE)      : rel_mse_E8 <= 1.20 * rel_mse_FP8

Each threshold is reported at two views: point estimate (mean only) and
CI-safe (mean + 95 % CI half-width). Same n=8 passages + same V4-Flash
trained weights as FINDINGS_N8.md.

### Max usable CR per stream (threshold A, CI-safe)

  stream                       Q_min  bits/vec  CR/FP8   CR/bf16   E8/FP8 ratio
  sliding_window_kv            38     3296      1.28 x   2.49 x    0.790 x
  csa_pool_kv_ratio4           38     3296      1.28 x   2.49 x    0.901 x
  hca_pool_kv_ratio128         44     3360      1.26 x   2.44 x    0.775 x

### Deployment answer

Strategy 1 - unified Q=44 across all 43 layers:
  CR = 1.257 x vs FP8 (-20.5 %), 2.438 x vs bf16 (-59.0 %)
  Every layer Pareto-better than FP8 (SWA 0.589 x, CSA 0.672 x, HCA 0.775 x)

Strategy 2 - per-stream Q (23 layers at Q=38, 20 HCA layers at Q=44):
  Layer-weighted bits/vec = 3325.8
  CR = 1.270 x vs FP8 (-21.3 %), 2.463 x vs bf16 (-59.4 %)
  Every layer Pareto-better than FP8 (SWA 0.790 x, CSA 0.901 x, HCA 0.775 x)
  RECOMMENDED. This is the honest answer to 'max usable CR on V4-Flash'.

### PPL threshold note

Stage 0.75 cannot measure Δppl directly (no full 43-layer + MoE path).
Projected Δppl under paper §6.1's Qwen3-4B-calibrated MSE -> Δppl
mapping:

    Strategy 2 (layer-weighted -19.5 % MSE)  -> projected Δppl <= 0 %
    Unified Q=44 (layer-weighted -31 % MSE)  -> projected Δppl <= 0 %
    Unified Q=38 (layer-weighted -4.1 % MSE) -> projected Δppl <= +1 %

Actual end-to-end Δppl still needs Stage 1 (live vLLM on V4-Flash),
blocked on the hardware listed in dsv4_stage1/HARDWARE_REQUIREMENTS.md.

### Files

  benchmarks/dsv4_stage075/run_stage075_qsweep.py     — driver
  reports/.../stage075_qsweep_n8.json                 — 12-point coarse
  reports/.../stage075_qsweep_fine_n8.json            — 7-point fine  (Q=38..76)
  reports/.../stage075_qsweep_n8_run.log              — H200 console log
  reports/.../stage075_qsweep_fine_n8_run.log         — H200 console log
  reports/.../MAX_USABLE_CR.md                        — narrative + full Pareto table

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants