Atelya OS — KV-native memory for self-hosted agents

Agent memory that lives in the model's KV cache, not just in text. Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.

pip install atelya

Note: the PyPI package is atelya; the Python import and CLI are amem.

Most memory systems store text and re-prefill it into the prompt every turn — so the cost of serving memory grows linearly with turns. Atelya OS (amem) keeps the relevant working set as reused KV: compute it once, reuse it, don't recompute memory each turn.
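To make the linear-versus-flat contrast concrete, here is a minimal token-accounting sketch. It is not part of amem: the function names and the token counts (2,000 memory tokens, 50 question tokens) are hypothetical, and it ignores selective recompute and any change in the working set.

```python
def reprefill_cumulative(memory_tokens: int, question_tokens: int, turns: int) -> list[int]:
    """Cumulative prefill tokens when the memory text is re-injected every turn."""
    per_turn = memory_tokens + question_tokens
    return [per_turn * t for t in range(1, turns + 1)]


def kv_reuse_cumulative(memory_tokens: int, question_tokens: int, turns: int) -> list[int]:
    """Cumulative prefill tokens when the memory KV is computed once, then reused."""
    return [memory_tokens + question_tokens * t for t in range(1, turns + 1)]


# Hypothetical session: 2,000 tokens of memory, 50-token questions, 10 turns.
text_curve = reprefill_cumulative(2000, 50, turns=10)
kv_curve = kv_reuse_cumulative(2000, 50, turns=10)
print(text_curve[0], text_curve[-1])  # 2050 20500
print(kv_curve[0], kv_curve[-1])      # 2050 2500
```

Both curves start at the same point, since the first turn pays the full prefill either way. After that the re-prefill curve keeps paying for the memory while the reuse curve only pays for new question tokens.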

What's measured (not estimated)

All numbers below were measured on a single RTX 4070 with bench_real.py / amem_headtohead.py in this repo. Full methodology, per-query data, and honest limits: BENCHMARK.md.

Fidelity — 97.5% answer-for-answer agreement with a cold full re-prefill of the same chunks (n = 200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
Cost — ~6x to ~54x less prefill. Reusing KV vs re-prefilling the same retrieved set is ~6x cheaper per query (n = 200, the conservative, default behavior); for a stable session the one-time working set amortizes to ~54x (measured, 30 queries). Your real number lands in that range depending on how much the relevant memory changes per query. Break-even ~1 query.
Quality — at parity, not a win. Head-to-head vs Mem0 at a matched answerer + injection budget, answer correctness was 60% (amem) vs 55% (Mem0) — within noise at n = 20. We do not claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and this comparison deliberately matches retrieval to isolate cost, so it does not reflect Mem0's stronger production recall pipeline.
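The "within noise" reading of the quality result can be sanity-checked with a two-proportion z-score. The sketch below assumes that 60% and 55% at n = 20 correspond to 12 and 11 correct answers per system. It uses the normal approximation, which is rough at this sample size, so treat it as an order-of-magnitude check and not as the benchmark's own analysis.

```python
import math


def two_proportion_z(hits_a: int, hits_b: int, n_a: int, n_b: int) -> float:
    """z-score for the difference of two proportions (unpooled standard error)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_a - p_b) / se


# Assumed counts: 12/20 (amem) vs 11/20 (Mem0).
z = two_proportion_z(12, 11, 20, 20)
print(round(z, 2))  # 0.32
```

A z-score of about 0.32 is far below the usual 1.96 threshold, which is consistent with calling the 5-point gap parity and not a win.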

Honest framing for the cost number: lead with ~6x (rigorous, n = 200). The ~54x is the best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.
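Why the real number lands somewhere inside a range can be sketched with a simple session model. Everything here is an illustrative assumption and does not reproduce the BENCHMARK.md figures: `churn` is a hypothetical fraction of the working set that is replaced (and must be prefilled fresh) on each query after the first, and `recompute` is a hypothetical fraction of the reused KV that is selectively recomputed.

```python
def prefill_ratio(memory_tokens: int, question_tokens: int, queries: int,
                  churn: float = 0.0, recompute: float = 0.0) -> float:
    """Re-prefill tokens divided by KV-reuse tokens over one session."""
    baseline = queries * (memory_tokens + question_tokens)
    # The first query pays the full prefill in both schemes.
    first = memory_tokens + question_tokens
    # Later queries pay for the question, any new memory, and selective recompute.
    reused = (1.0 - churn) * memory_tokens
    later = question_tokens + churn * memory_tokens + recompute * reused
    return baseline / (first + (queries - 1) * later)


# Hypothetical inputs: 2,000 memory tokens, 50-token questions.
for churn in (0.0, 0.5, 1.0):
    print(churn, round(prefill_ratio(2000, 50, queries=30, churn=churn), 2))
print(prefill_ratio(2000, 50, queries=1))  # one query: both schemes cost the same
```

In this model the ratio is exactly 1 at a single query, grows with session length when the working set is stable, and falls back toward 1 as churn approaches 100%.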

Who this is for

Full fit: you self-host vLLM / SGLang on a CUDA GPU, with memory-heavy or long-horizon agents. You own inference, so you can inject KV, which gives you the flat cost curve and the KV moat.
Partial fit: closed APIs (OpenAI) or Ollama. You can't inject KV into a model you don't control, so amem reduces the per-turn memory bill but can't flatten it. The memory SDK still helps; the KV-reuse cost curve does not apply.

Install

pip install atelya                 # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]'     # + vLLM / LMCache CacheBlend engine (CUDA GPU) — unlocks KV reuse

The ~6x–54x cost win needs the self-host engine ([selfhost]). pip install atelya alone installs only the memory-layer SDK; the KV moat requires inference you control. The Python package runs without the optional Rust engine, which is a commercial performance add-on and is not required.
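Before starting the self-host engine, it can help to check which pieces are importable. This is a small sketch, not an amem command. It assumes the [selfhost] extra brings in the `vllm` and `lmcache` modules (their usual import names), and it only probes CUDA through PyTorch when PyTorch is present.

```python
import importlib.util


def selfhost_readiness() -> dict[str, bool]:
    """Report which self-host prerequisites appear to be available."""
    # find_spec returns None for a top-level module that is not installed.
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in ("amem", "vllm", "lmcache")
    }
    cuda = False
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check works without PyTorch
        cuda = bool(torch.cuda.is_available())
    report["cuda"] = cuda
    return report


for name, ok in selfhost_readiness().items():
    print(f"{name:8s} {'ok' if ok else 'missing'}")
```

If `vllm`, `lmcache`, or `cuda` shows as missing, only the memory-layer SDK paths are likely to work on that machine.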

Quickstart

Verified against amem 0.1.19. kv-serve (the KV-reuse moat) needs atelya[selfhost] + a CUDA GPU. No GPU? Swap in amem proxy (drop-in for Claude / GPT): same SDK, but the KV-reuse cost curve does not apply.

Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):

amem kv-serve            # serves on http://localhost:8000  (needs atelya[selfhost] + a CUDA GPU)

Use it from the tiny SDK — the second question reuses the first one's KV instead of re-prefilling the memory:

from amem import Amem

mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice")                       # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid))   # different q -> reuses KV

Or drop it in behind the standard OpenAI SDK — point base_url at amem and pass user as the memory namespace (per-user residency is on by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)

CLI

amem proxy        # closed-LLM drop-in (Claude / GPT) + cache orchestration   (light, no GPU)
amem mcp          # MCP server for Claude Desktop / Cursor / Claude Code       (stdio)
amem serve        # self-host serve (local open model + KV residency)          [atelya[selfhost]]
amem kv-serve     # KV-native serve: CacheBlend reuse + KvPolicy eviction (moat) [atelya[selfhost]]
amem bench        # reproducible cost-curve benchmark (KV-residency vs re-prefill) [atelya[selfhost]]
amem version      # print version

Run amem --help for the authoritative list and flags.

How it works — the bridge

text memory                amem (the bridge)                 KV memory
re-prefilled every turn  -> recall the relevant set, then  -> its precomputed KV is REUSED
(linear cost)               reuse its KV (not the text)       (one-time prefill, then ~flat)

CacheBlend (via LMCache) reuses each chunk's attention KV position-independently, with ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV instead of a full re-prefill.
A value-model residency policy (relevance x recency x reuse - size) keeps the hottest memory resident and tiers the rest to CPU/disk.
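The residency idea above can be sketched as a score-and-fill loop. This is an illustrative reading of "relevance x recency x reuse - size" and not amem's KvPolicy: the `Chunk` fields, the size penalty weight, and the greedy fill are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    relevance: float  # 0..1, match to the current task
    recency: float    # 0..1, decays with time since last use
    reuse: float      # 0..1, normalized hit count
    size_bytes: int   # size of the chunk's KV


def value(chunk: Chunk, size_penalty_per_mb: float = 0.01) -> float:
    """Hypothetical value score: product of the signals minus a size penalty."""
    benefit = chunk.relevance * chunk.recency * chunk.reuse
    return benefit - size_penalty_per_mb * chunk.size_bytes / 1_000_000


def plan_residency(chunks: list[Chunk], gpu_budget_bytes: int) -> tuple[list[str], list[str]]:
    """Keep the highest-value chunks on GPU; tier the rest to CPU/disk."""
    resident, tiered, used = [], [], 0
    for chunk in sorted(chunks, key=value, reverse=True):
        if used + chunk.size_bytes <= gpu_budget_bytes:
            resident.append(chunk.chunk_id)
            used += chunk.size_bytes
        else:
            tiered.append(chunk.chunk_id)
    return resident, tiered


chunks = [
    Chunk("a", relevance=0.9, recency=0.9, reuse=0.8, size_bytes=40_000_000),
    Chunk("b", relevance=0.5, recency=0.4, reuse=0.2, size_bytes=10_000_000),
    Chunk("c", relevance=0.8, recency=1.0, reuse=0.9, size_bytes=20_000_000),
]
resident, tiered = plan_residency(chunks, gpu_budget_bytes=60_000_000)
print(resident, tiered)  # ['c', 'a'] ['b']
```

A multiplicative benefit means a chunk that scores near zero on any one signal drops out quickly, while the subtracted size term makes large chunks earn their place.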

What this is not

Not a recall-accuracy leaderboard claim. Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall; amem composes with a recall layer. Its edge is the cost of serving memory at parity fidelity.
Transformer-only. CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not apply to them.
A storage trade. KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk. You buy lower compute with more storage.
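The storage trade can be estimated from a model's attention shape. The formula below is the standard per-token KV size for a transformer with grouped KV heads. The configuration values and the figure of roughly 4 bytes of text per token are hypothetical inputs chosen for illustration, not a specific model and not amem's measurement.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache stored per token (one key and one value per layer and KV head)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_to_text_ratio(kv_bytes: int, text_bytes_per_token: float = 4.0) -> float:
    """How many times larger the KV is than the text it represents."""
    return kv_bytes / text_bytes_per_token


# Hypothetical small model: 16 layers, 2 KV heads, head_dim 64, fp16 KV.
per_token = kv_bytes_per_token(num_layers=16, num_kv_heads=2, head_dim=64)
print(per_token)                    # 8192
print(kv_to_text_ratio(per_token))  # 2048.0
```

The ratio scales with layer count, KV head count, head dimension, and KV precision, so larger models sit well above this example and quantized KV sits below it. Plugging in your own model's config gives the storage you are trading for compute.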

Status & license

Experimental, in active development — expect rough edges. Apache-2.0.
