Agent memory that lives in the model's KV cache, not just in text. Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.
pip install atelya
Note: the PyPI package is atelya; the Python import and CLI are amem.
Most memory systems store text and re-prefill it into the prompt every turn — so the cost of
serving memory grows linearly with turns. Atelya OS (amem) keeps the relevant working set as
reused KV: compute it once, reuse it, don't recompute memory each turn.
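The difference between the two cost curves can be pictured with a toy model. This is a sketch under stated assumptions, not the benchmark's method: `prefill_tokens` and `changed_fraction` are hypothetical names, question tokens are ignored, and the selective recompute that CacheBlend performs is left out.

```python
def prefill_tokens(turns: int, memory_tokens: int, reuse_kv: bool,
                   changed_fraction: float = 0.0) -> float:
    """Memory tokens prefilled over a whole session (toy model).

    changed_fraction is an assumed knob: the share of the working set that
    is new on each later turn and therefore still has to be prefilled.
    """
    if turns <= 0:
        return 0.0
    if not reuse_kv:
        # Text memory: the full working set is re-prefilled on every turn.
        return float(memory_tokens * turns)
    # KV memory: prefill once, then pay only for the part that changed.
    return memory_tokens + (turns - 1) * changed_fraction * memory_tokens


for turns in (1, 10, 30):
    text_cost = prefill_tokens(turns, 2000, reuse_kv=False)
    kv_cost = prefill_tokens(turns, 2000, reuse_kv=True)
    print(f"turns={turns:>2}  re-prefill={text_cost:>8.0f}  reuse={kv_cost:>8.0f}")
```

In this idealized picture the re-prefill column grows with every turn while the reuse column stays at the one-time cost. Raising `changed_fraction` moves the reuse curve back toward the linear one, which is why the measured gain depends on how stable the working set is.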
All numbers below were measured on a single RTX 4070 with bench_real.py / amem_headtohead.py in
this repo. Full methodology, per-query data, and honest limits: BENCHMARK.md.
- Fidelity — 97.5% answer-for-answer agreement with a cold full re-prefill of the same chunks (n = 200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
- Cost — ~6x to ~54x less prefill. Reusing KV vs re-prefilling the same retrieved set is ~6x cheaper per query (n = 200, the conservative, default behavior); for a stable session the one-time working set amortizes to ~54x (measured, 30 queries). Your real number lands in that range depending on how much the relevant memory changes per query. Break-even ~1 query.
- Quality — at parity, not a win. Head-to-head vs Mem0 at a matched answerer + injection budget, answer correctness was 60% (amem) vs 55% (Mem0) — within noise at n = 20. We do not claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and this comparison deliberately matches retrieval to isolate cost, so it does not reflect Mem0's stronger production recall pipeline.
Honest framing for the cost number: lead with ~6x (rigorous, n = 200). The ~54x is the best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.
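The "within noise at n = 20" note on the Quality result can be sanity-checked with a normal-approximation margin for the difference between two proportions. This is only a rough sketch: it assumes the two systems were scored on independent samples of 20, which may not match the benchmark setup (a paired comparison on the same questions would call for a different test), and `diff_margin_95` is a hypothetical helper.

```python
import math


def diff_margin_95(p1: float, p2: float, n1: int, n2: int) -> float:
    """Approximate 95% margin of error for (p1 - p2), independent samples."""
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return 1.96 * standard_error


diff = 0.60 - 0.55                      # amem vs Mem0 answer correctness
margin = diff_margin_95(0.60, 0.55, 20, 20)
print(f"difference = {diff:+.2f}, 95% margin = +/-{margin:.2f}")
print("within noise" if abs(diff) < margin else "distinguishable")
```

With 20 queries per system the margin is around 0.3, several times larger than the 0.05 gap, which is consistent with reading the result as parity.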
- Full fit — you self-host vLLM / SGLang on a CUDA GPU, with memory-heavy or long-horizon agents. You own inference, so you can inject KV -> you get the flat cost curve and the KV moat.
- Partial fit — closed APIs (OpenAI) or Ollama. You can't inject KV into a model you don't control, so amem reduces the per-turn memory bill but can't flatten it. The memory SDK still helps; the KV-reuse cost curve does not apply.
pip install atelya # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]' # + vLLM / LMCache CacheBlend engine (CUDA GPU); unlocks KV reuse
The ~6x–54x cost win needs the self-host engine ([selfhost]). pip install atelya alone is the
memory-layer SDK; the KV moat requires inference you control. The Python package runs without the
optional Rust engine (that engine is a commercial performance deepener, not required).
Verified against amem 0.1.19.
kv-serve (the KV-reuse moat) needs amem[selfhost] + a CUDA GPU. No GPU? Swap in amem proxy (drop-in for Claude / GPT): same SDK; the cost curve just doesn't apply.
Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):
amem kv-serve # serves on http://localhost:8000 (needs amem[selfhost] + a CUDA GPU)
Use it from the tiny SDK. The second question reuses the first one's KV instead of re-prefilling the memory:
from amem import Amem
mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice") # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid)) # different q -> reuses KV
Or drop it in behind the standard OpenAI SDK: point base_url at amem and pass user as the
memory namespace (per-user residency is on by default):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)

amem proxy # closed-LLM drop-in (Claude / GPT) + cache orchestration (light, no GPU)
amem mcp # MCP server for Claude Desktop / Cursor / Claude Code (stdio)
amem serve # self-host serve (local open model + KV residency) [amem[selfhost]]
amem kv-serve # KV-native serve: CacheBlend reuse + KvPolicy eviction (moat) [amem[selfhost]]
amem bench # reproducible cost-curve benchmark (KV-residency vs re-prefill) [amem[selfhost]]
amem version # print version
Run amem --help for the authoritative list and flags.
text memory                amem (the bridge)                KV memory
re-prefilled every turn -> recall the relevant set, then -> its precomputed KV is REUSED
(linear cost)              reuse its KV (not the text)      (one-time prefill, then ~flat)
- CacheBlend (via LMCache) reuses each chunk's attention KV position-independently, with ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV instead of a full re-prefill.
- A value-model residency policy (relevance x recency x reuse - size) keeps the hottest memory resident and tiers the rest to CPU/disk.
- Not a recall-accuracy leaderboard claim. Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall; amem composes with a recall layer. Its edge is the cost of serving memory at parity fidelity.
- Transformer-only. CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not apply to them.
- A storage trade. KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk. You buy lower compute with more storage.
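The residency policy and the storage trade described above can be pictured as a scoring-and-tiering pass. What follows is a minimal sketch, not amem's implementation: the `Chunk` fields, the `size_weight` scaling, the 0..1 normalization, and the greedy fill are all assumptions made for illustration; only the shape of the score (relevance x recency x reuse - size) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    relevance: float  # assumed normalized to 0..1
    recency: float    # assumed normalized to 0..1
    reuse: float      # assumed normalized to 0..1 (e.g. scaled hit count)
    size_gb: float    # size of the chunk's KV


def value(chunk: Chunk, size_weight: float = 0.1) -> float:
    """Score of the form relevance x recency x reuse - size (weight is assumed)."""
    return chunk.relevance * chunk.recency * chunk.reuse - size_weight * chunk.size_gb


def plan_residency(chunks: list[Chunk], gpu_budget_gb: float) -> tuple[list[str], list[str]]:
    """Greedily keep the highest-value chunks on GPU; tier the rest to CPU/disk."""
    resident, tiered, used = [], [], 0.0
    for chunk in sorted(chunks, key=value, reverse=True):
        if used + chunk.size_gb <= gpu_budget_gb:
            resident.append(chunk.name)
            used += chunk.size_gb
        else:
            tiered.append(chunk.name)
    return resident, tiered


chunks = [
    Chunk("profile", relevance=0.9, recency=0.9, reuse=0.8, size_gb=1.0),
    Chunk("old_log", relevance=0.2, recency=0.5, reuse=0.1, size_gb=2.0),
    Chunk("project", relevance=0.7, recency=0.6, reuse=0.9, size_gb=0.5),
]
resident, tiered = plan_residency(chunks, gpu_budget_gb=1.5)
print("resident:", resident)
print("tiered:  ", tiered)
```

A greedy fill is the simplest choice here; a real policy would also need to re-score as recency and reuse change over time and decide when to promote tiered chunks back.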
Experimental, in active development — expect rough edges. Apache-2.0.