Claude Code has no memory. We added persistent memory and token-efficient context.
Memory persists. Money stays in your pocket.
Installation · Performance · How It Works · Best Practices · MCP Tools · Dashboard
- Docker Desktop
- Claude Code — `npm install -g @anthropic-ai/claude-code`
- Python 3.10+
- Cerebras API key — free at inference.cerebras.ai
- Go to inference.cerebras.ai
- Sign up — no credit card required
- Go to API Keys → create a new key — do not use the default key shown on the page, create your own
- Copy it, you will need it during setup
```bash
# 1. Clone the repo
git clone https://github.com/AbdoKnbGit/claude-code-memory.git
cd claude-code-memory

# 2. Copy the environment file and add your Cerebras key
cp .env.example .env
# Open .env and set CEREBRAS_API_KEY=csk-...

# 3. Run setup
python setup.py
```

Setup starts the Docker container, sets `ANTHROPIC_BASE_URL=http://localhost:8082` in your Claude Code config, and registers the hooks.
After setup: close your terminal, open a new one, restart Claude Code.
No `ANTHROPIC_API_KEY` needed. The proxy uses your existing Claude Code OAuth token. Nothing changes in how you authenticate.
Before opening a project, make sure the container is running:
```bash
docker ps
```

You should see `cc-nim-memory` in the list. If not:

```bash
docker compose up -d
```

Inside Claude Code, open `/mcp` — you should see `cc-memory` listed as connected. If it shows as disconnected, restart Claude Code after confirming the container is running.
Memory initialization is a one-time step per project. After that, it runs automatically every session.
If Claude Code did not initialize automatically, run this in the Claude Code chat:
cc initialize the memory
Claude will call memory_init, write the config files, and confirm. Restart Claude Code after.
Every time you start a new session, Claude Code starts from zero. It does not know your stack. It does not know the decisions you made yesterday. It does not know the bug you spent three hours debugging last week. You re-explain everything, every time — and every token of that re-explanation costs money.
On top of that, Claude Code has a 5-minute cache TTL. Take a short break, switch tabs, come back — cold start. You pay full price to rebuild the context you already paid for.
claude-code-memory runs a local Docker proxy between Claude Code and the Anthropic API. It is invisible in normal use. It does not change how you work. It silently does things that Claude Code cannot do alone — including several that require real intelligence, not just plumbing:
Semantic memory that knows what matters. Not everything you do is worth remembering. A local scoring model evaluates every tool call and decision for novelty, importance, and surprise — only meaningful signals get stored. When a new session starts, a semantic search across your ChromaDB finds the most relevant entries for what you are working on right now, not just the most recent ones. Claude starts each session knowing your stack, your architecture decisions, and the non-obvious bugs you already solved. No re-explaining. No re-reading files.
Intelligent deduplication and contradiction detection. Before saving anything, the system checks your existing entries with a vector similarity search. If a new decision contradicts something already stored — for example, migrating from Spring Boot to FastAPI — a Cerebras LLM judge evaluates the conflict, marks the old entry as superseded, and saves the new one. Your memory stays coherent as your project evolves. You never end up with contradictory context injected into the same session.
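As a rough sketch of that pre-save check — the threshold, function names, and the judge step here are illustrative assumptions, not the project's actual code:

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff for "possible duplicate or conflict"

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def check_before_save(new_embedding, existing):
    """existing: list of (entry_id, embedding) pairs from the vector store.
    Returns entries similar enough to need a second look before saving."""
    hits = [(eid, cosine(new_embedding, emb)) for eid, emb in existing]
    conflicts = [(eid, s) for eid, s in hits if s >= SIMILARITY_THRESHOLD]
    # In the real system, each hit would go to the Cerebras LLM judge, which
    # decides: duplicate (skip), contradiction (mark superseded), or unrelated.
    return conflicts
```

The key property is that the cheap vector check runs on every save, while the expensive LLM judge only runs on the small set of near-matches.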
Adaptive query classification. Before every memory search, a lightweight Cerebras model classifies your query — is this a resume signal asking for project context, a technical question, or a task? The classifier adjusts the search tier and scoring weights accordingly, so "what did we build last session" and "how does the auth flow work" get different retrieval strategies and consistently relevant results.
Graph-aware memory retrieval. Entries are not stored in isolation. The system builds a RAM-resident graph of relationships between memory nodes — auth connects to database, database connects to migration, and so on. When you ask about one component, a BFS traversal surfaces related entries you did not explicitly ask for but almost certainly need. This is why Claude can answer "how does our checkout work" without you telling it that checkout depends on cart, which depends on the product schema.
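The traversal itself is ordinary breadth-first search. A minimal sketch, assuming the graph is a plain adjacency map (node names are hypothetical):

```python
from collections import deque

def related_entries(graph, start, max_depth=2):
    """BFS over the in-memory relationship graph, collecting every node
    reachable within max_depth hops of the entry you asked about."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth == max_depth:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order

graph = {"checkout": ["cart"], "cart": ["product_schema"], "auth": ["database"]}
related_entries(graph, "checkout")  # surfaces cart and product_schema too
```

Asking about `checkout` pulls in `cart` and `product_schema` without touching the unrelated `auth` cluster, which is exactly the behavior described above.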
Cache optimization that actually works. The proxy manages Anthropic's prompt caching so your stable context — system prompt, memory block, tool definitions — is written to cache once and re-read at 10× lower cost on every subsequent turn. The memory block tracks its own hash every turn: when it has not changed, the proxy adds a second cache breakpoint so the block is served from cache rather than re-sent as fresh input. This happens automatically, every request, without any configuration.
Cache TTL extended from 5 minutes to 1 hour. Claude Code's native cache expires in 5 minutes. The proxy holds an explicit 1-hour cache and keeps it alive with a background keepalive ping every 55 minutes. Take a break, have a meeting, come back — your session is still warm.
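A keepalive like that can be sketched with a daemon thread. The 55-minute interval matches the description above; everything else here is illustrative:

```python
import threading
import time

def start_keepalive(ping, interval_s=55 * 60):
    """Fire `ping` on a background thread every `interval_s` seconds so the
    1-hour cache never goes cold between requests. `ping` would be a minimal
    request that touches the cached prefix."""
    def loop():
        while True:
            time.sleep(interval_s)
            ping()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

The daemon flag matters: the thread dies with the container process instead of keeping it alive on shutdown.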
Context compression that prevents late-session cost explosions. Long sessions accumulate context. Without compression, a 60-turn session can cost 4× more than the first 10 turns. The proxy runs three layers: a hard cap on large tool outputs, Cerebras-powered summarization of the oldest 50% of turns, and Anthropic server-side compaction — keeping context lean and costs flat throughout.
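The first two layers can be sketched as follows. The cap size and the 50/50 split are assumptions for illustration; the third layer, server-side compaction, happens on Anthropic's side:

```python
MAX_TOOL_OUTPUT_CHARS = 8_000  # assumed hard cap on a single tool result

def cap_tool_output(text: str) -> str:
    """Layer 1: truncate oversized tool outputs before they enter context."""
    if len(text) <= MAX_TOOL_OUTPUT_CHARS:
        return text
    return text[:MAX_TOOL_OUTPUT_CHARS] + "\n[... output truncated by proxy ...]"

def split_for_summary(turns):
    """Layer 2: the oldest 50% of turns go to the Cerebras summarizer;
    the recent half stays verbatim in context."""
    cut = len(turns) // 2
    return turns[:cut], turns[cut:]
```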
Tool definition filtering. Claude Code sends ~55,000 tokens of tool definitions with every single request. The proxy tracks which tools were used in the last four turns, keeps those plus a core set, and strips everything else. ~55,000 tokens becomes ~2,400. You pay for what you actually use.
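A sketch of the filtering rule, assuming a hypothetical always-kept core set and per-turn usage tracking:

```python
CORE_TOOLS = {"Read", "Write", "Edit", "Bash"}  # assumed always-kept set

def filter_tools(all_tools, recent_turns, window=4):
    """Keep core tools plus anything actually used in the last `window` turns.
    recent_turns: list of sets of tool names used, one set per turn."""
    recently_used = set().union(*recent_turns[-window:]) if recent_turns else set()
    keep = CORE_TOOLS | recently_used
    return [tool for tool in all_tools if tool["name"] in keep]
```

Everything outside `keep` is stripped from the request, which is what collapses the ~55,000-token tool payload to a few thousand.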
| | Without claude-code-memory | With claude-code-memory |
|---|---|---|
| Memory between sessions | ❌ starts from zero every time | ✅ context injected automatically |
| Cache after 5 min break | ❌ full cold start | ✅ warm up to 1 hour |
| Tool token cost | ❌ ~55,000 tokens every turn | ✅ ~2,400 tokens (filtered) |
| Long session cost curve | ❌ grows linearly, no ceiling | ✅ compression keeps it flat |
| Past decisions | ❌ re-read files every session | ✅ stored and recalled semantically |
When you open a new project, claude-code-memory checks for .mcp.json locally before doing anything — zero token cost. If the file exists, boot proceeds normally.
Let Claude Code handle it. In most cases you do not need to do anything. Open your project, start working. Claude Code detects the new project and initializes automatically.
If it did not initialize automatically — this only needs to be done once per project — tell Claude explicitly:
cc initialize the memory
Claude will check for the config files, create the memory database, and confirm. Restart Claude Code after.
When Claude finishes something meaningful, it will suggest saving:
💾 Store in memory? `store` · `pin` · `skip`
Reply with one word. Nothing saves without your confirmation.
If no suggestion appears and you want to save something, just say it in plain language:
save that we use JWT with 24h expiry and refresh token rotation
pin that the frontend is React Native not Flutter
store that rate limiting is 100 req/min per user
Claude calls the save tool immediately — no tool syntax needed.
At the end of every productive session, save before closing:
save a summary of what we built today and what is left to do
Pin it. The next session starts with this loaded.
If Claude seems to have forgotten something or gives a generic answer, it probably did not search before responding. Ask in plain language:
search your memory for our authentication setup
what do we have stored about the database?
check your memory before answering this
If it keeps ignoring context, be more direct:
Before you answer, search your memory for everything about the API design
No tool syntax needed — plain language is enough.
If memory search consistently returns wrong or empty results, the vector index may need rebuilding:
reindex memory for this project
Takes 10–30 seconds. Use this any time semantic search feels off.
Open http://localhost:8082 after starting the container.
- Live session activity: cache hits, writes, costs per turn
- Memory entries browser: all entries with pin status and scores
- Injection history: what was loaded each session and cost
- Real-time event stream as Claude works
Note: a new project does not appear in the dashboard until after the first `store` or `pin` action in that session. This is expected — the project registers on first save.
A synthetic Python/FastAPI codebase was built specifically for this test — 2,200 lines across 8 files with 4 planted bugs, a full pytest suite, and a 400-line application log.
Both runs used identical prompts, identical steps, and identical Claude Code settings. Sessions ran in separate windows with a 6-minute gap between them to expire Anthropic's native 5-minute cache. The proxy run started with a wiped Docker volume — no accumulated memory advantage on sessions 1–4.
Each session was recorded as a JSONL file. Costs are calculated from the raw usage fields in the Anthropic API response, deduplicated by message_id to avoid counting streaming chunks twice.
What was measured, not assumed:
- `cache_creation_input_tokens` — new content written to cache, billed at 2× the input rate
- `cache_read_input_tokens` — content re-read from cache, billed at 0.1× the input rate
- `output_tokens` — identical in both runs by design (same prescribed fixes, same format rule)
The only variable between the two runs was whether the proxy was active. Model, prompts, tasks, and codebase were identical.
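Given those billing multipliers, the per-session cost arithmetic looks roughly like this sketch. The dollar rates are assumed Opus-class prices per million tokens, not values taken from the repo:

```python
import json

INPUT_PER_MTOK = 15.0   # assumed input price, $/1M tokens
OUTPUT_PER_MTOK = 75.0  # assumed output price, $/1M tokens

def session_cost(jsonl_path):
    """Sum cost from raw usage fields, deduplicated by message_id so
    streaming chunks of the same message are only counted once."""
    seen, total = set(), 0.0
    with open(jsonl_path) as f:
        for line in f:
            event = json.loads(line)
            usage, mid = event.get("usage"), event.get("message_id")
            if not usage or mid in seen:
                continue
            seen.add(mid)
            total += usage.get("input_tokens", 0) / 1e6 * INPUT_PER_MTOK
            total += usage.get("cache_creation_input_tokens", 0) / 1e6 * INPUT_PER_MTOK * 2    # cache write: 2x
            total += usage.get("cache_read_input_tokens", 0) / 1e6 * INPUT_PER_MTOK * 0.1      # cache read: 0.1x
            total += usage.get("output_tokens", 0) / 1e6 * OUTPUT_PER_MTOK
    return total
```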
Independent verification: the JSONL files for all 8 sessions (both runs) are included in the repository. Run the analysis yourself:
```bash
python benchmark/analyze.py
```

This produces the cost table and the chart from raw data.
8 sessions, same codebase, same tasks — run with and without the proxy. All sessions used Opus 4.6.
| Session | What happened | No-proxy | cc-memory |
|---|---|---|---|
| S1 — Debug | Read 7 files, run tests, fix 2 bugs | $1.55 | $1.29 |
| S2 — Debug | Same structure, 2 different bugs | $1.08 | $0.84 |
| S3-4 — Stress | Build feature + heavy log analysis, context hit 143k tokens | $4.45 | $2.56 |
| S5 — Memory recall | 6 questions about past sessions, no files read | $0.11 | $0.10 |
| S6 — Build | Create new service file, read db layer | $0.20 | $0.19 |
| S7 — Recall | What file did you build last session? | $0.03 | $0.13 |
| S8 — Estimated | Same as S1/S2 pattern | $1.20 | $1.00 |
| Total (measured, S1–S7) | | $7.42 | $5.10 |
31.3% cheaper overall: $2.32 saved across the measured sessions (the estimated S8 is excluded from the totals).
A few things the numbers show honestly:
- S3-4 is where the gap is largest. Heavy file reads and large bash outputs bloat the context fast — compression and the output cap keep the proxy context 18,000 tokens smaller than no-proxy by the end.
- S7 costs more with the proxy ($0.13 vs $0.03). The memory injection overhead on a 2-turn session exceeds any saving. Short sessions are not where the proxy saves money.
- S5 memory recall: no-proxy answered zero questions correctly. cc-memory answered all six — named exact file names, method names, and bug fixes from 90 minutes earlier without reading a single file. That is not a cost metric, it is a capability metric.
The savings are conservative. These sessions ran with minimal memory overhead by design. Real working sessions with longer context accumulation show larger gaps.
As a project matures, sessions get longer, files multiply, and context grows. Without a proxy the cost curve is steep — more context means more tokens re-read every turn. The proxy's compression and caching keep the curve flatter.
Assumptions: 60 sessions/month (3/day, 5 days/week). Sonnet 4.6 for early stages, Opus 4.6 from month 4 onward as complexity increases.
| Month | Project stage | Context | No-proxy / session | cc-memory / session | Saved / month |
|---|---|---|---|---|---|
| 1 | Early — small codebase, getting started | 40k tokens | $0.55 | $0.27 | +$17 |
| 2 | Growing — 10–20 files, first features | 50k tokens | $0.80 | $0.42 | +$23 |
| 3 | Active — 30–50 files, architecture settled | 55k tokens | $0.99 | $0.54 | +$27 |
| 4 | Mature — 50–80 files, complex debug sessions | 70k tokens | $2.30 | $1.28 | +$61 |
| 6 | Large — 100+ files, multi-module | 100k tokens | $3.81 | $2.19 | +$97 |
| 12 | Production — maintenance + new features | 120k tokens | $4.80 | $2.78 | +$121 |
Over 12 months: $2,464 without proxy → $1,414 with cc-memory. $1,050 saved. 43% reduction.
The saving grows as the project grows because compression and caching have more to work with. A 10-turn early session has little to compress. A 60-turn session with 120k tokens of context is exactly where the three-layer compression pipeline earns its keep.
Note that the month-4 jump reflects switching from Sonnet to Opus — Opus costs 3–5× more per token, so the absolute saving in dollars grows even if the percentage stays similar (~43–45%).
🗺 Full architecture diagram — click to expand and zoom
Complete request flow from Claude Code through every proxy layer to the Anthropic API and back. Open the image in a new tab for full resolution and zoom.
```
Claude Code → claude-code-memory proxy (port 8082) → Anthropic API
                         │
                         ├─ Dedup: catch duplicate requests before they hit Anthropic
                         ├─ Memory: inject project context into every session
                         ├─ Cache: manage Anthropic cache keys for maximum hit rate
                         ├─ Compress: summarize old turns before context grows
                         ├─ Filter: cut tool definitions from 55k to 2,400 tokens
                         └─ Keepalive: ping every 55 min to extend cache TTL to 1 hour
```
Do not change model mid-session. Switching from Sonnet to Opus mid-session breaks the Anthropic cache key — every turn after pays full cache_write price again. If you need a different model, start a new session.
Do not leave sessions open indefinitely. After 40–50 turns, context is large and each turn is more expensive. Compression helps but has limits. When a session has been running 2+ hours or starts feeling slow, close it and open a new one. Memory carries everything forward automatically.
When you see a compact notification, restart. That is your signal that the session has reached its efficient limit. Start fresh — the new session runs at full cache efficiency from turn 1.
The 5-minute cache problem is solved. Native Claude Code cache expires after 5 minutes of inactivity — every break costs you a full cold start. claude-code-memory extends this to 1 hour with a background keepalive running inside the Docker container. Take a break, come back within an hour — your session is still warm. Beyond 1 hour is Anthropic's hard limit, nothing can extend it further.
Keep sessions focused. One task per session when possible. Mixed sessions create large unfocused context that costs more per turn.
The memory injection budget is 400 tokens (~10 entries). Be selective.
Pin — critical, always loaded, never trimmed:
- Stack: ".NET 8 + MongoDB, not SQL Server"
- Architecture rules: "Repository pattern — no direct DB calls in controllers"
- Hard constraints: "Must support iOS 15"
- Non-obvious bugs you spent hours debugging
Store — useful, trimmed after several sessions:
- Features completed this session
- API endpoint shapes and decisions
- Config values not obvious from code
Skip:
- File names and paths — Claude finds those itself
- Implementation details visible in the code
- Anything you will change soon
Available inside Claude Code automatically on port 8083. Use plain language — you never need to call these directly.
| Tool | What it does |
|---|---|
| `memory_remember` | Boot only — loads full project snapshot once per session |
| `memory_context` | Semantic search before project questions |
| `memory_search` | Explicit keyword lookup |
| `memory_suggest` | Proposes saving a decision |
| `memory_save` | Persists after your confirmation |
| `memory_manage` | Forget / pin / unpin by ID |
| `memory_reduce` | Auto-trims oldest non-pinned entries |
| `memory_status` | Shows entry count and token usage |
| `memory_reindex` | Rebuilds vector index from SQLite |
| `memory_clear` | Wipes all memory (asks for confirmation) |
| `memory_export` | Exports all entries to JSON |
| `memory_init` | Creates memory database for a new project |
"Memory not initialized"
Run cc initialize the memory, then restart Claude Code. Only needed once per project.
Search returns wrong or empty results
Run reindex memory for this project.
Claude ignores past decisions
Say check your memory before answering. If it keeps drifting, restart the session.
Wrong project saved
Say save to project correct-folder-name — Claude corrects itself. Project ID is always the folder name.
Cerebras rate limit (429)
Free tier: 30 requests/minute. Built-in retry handles it. Check docker logs cc-nim-memory if it persists.
Windows path issues
Use Git Bash or WSL, not CMD; paths are otherwise handled automatically by setup.py.
AGPL 3.0 — see LICENSE.
| | Name | Role |
|---|---|---|
| 👤 | AbdoKnbGit | Creator · Architecture · Vision |
| 🤖 | Anthropic (Claude) | AI Pair Programmer · Code Generation · Bug Fixing |
Local Docker · No cloud dependency · Anthropic cache-optimized · AbdoKnbGit/claude-code-memory