Why I use MemPalace, and the road that nearly made me quit #1685
williamblair333
started this conversation in
Show and tell
Replies: 1 comment
This is exactly why I am very gung-ho on turbovecdb. https://github.com/kostadis/turbovecdb It removes all of the problems you just articulated, by design.
Why I Use MemPalace, and the Road That Nearly Made Me Quit
Written with Claude as my ghostwriter — which is, as you'll see, exactly the point.
The problem I was trying to solve
Claude Code is remarkable. It can read your entire codebase, reason about architecture, write and debug code, and hold a genuinely useful conversation about complex systems. But the moment you close the terminal, it forgets everything. Next session, you're starting from scratch. Every decision you made last week, every bug you fought through, every "don't do it that way, we tried that" — gone.
I wanted Claude to remember. Not in a vague "here's a summary" way. I wanted it to be able to search its own history and actually find the relevant thing — the specific error message, the fix we landed on, the reason we chose one approach over another three weeks ago.
MemPalace fit the bill. It mines your Claude Code sessions and your project code, stores everything as searchable embeddings, and makes it available for retrieval on demand. The idea is that before Claude starts any task, it can search its own memory — "have I seen something like this before?" — and actually get useful answers instead of repeating past mistakes.
That's the theory. Here's what happened in practice.
The honeymoon
Installation was smooth. The first few weeks were genuinely impressive. I'd start a session, Claude would search its memory, and it would surface a decision from two weeks ago that was directly relevant. It felt like the missing piece.
I set up the infrastructure: nightly crons to mine both project code and conversation history, a health check script, a repair script for when things went sideways. I felt good about it.
Then things started getting weird.
The descent
Search results started degrading. Not catastrophically — just subtly. Fewer relevant hits. Older memories not surfacing. I ran the health check. It said everything was fine.
Then the first cryptic error appeared:
Nothing in the error told me where to look. I restarted Claude Code. It worked briefly. Then died again. I restarted again. Same thing. I assumed it was a server issue — something transient. I was wrong.
The actual problem was a pickle file on disk, in a format written by an older version of chromadb, that the current version couldn't read. Every restart loaded the same broken file and hit the same wall. Restarting was never going to fix it. I just didn't know that yet.
Around the same time, I noticed that search quality would crater every morning and slowly recover through the day. It took embarrassingly long to figure out why: the 4am repair cron was rebuilding the HNSW index from scratch every single night — and building it empty. The cron had no "skip if healthy" guard. It would archive the healthy index, rebuild from SQLite, and somehow produce an index with zero entries. Every morning, semantic search was dead. It recovered as the mining jobs ran during the day. I had no idea this was happening for weeks.
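A minimal sketch of what a "skip if healthy" guard could look like. This is not the actual cron script. Every name here (`nightly_repair`, the four callbacks, the `min_ratio` threshold) is hypothetical, and it assumes you can count entries in both the live index and the SQLite source of truth. The two ideas it illustrates are: don't rebuild an index that is already healthy, and never swap in a rebuilt index that came out empty or badly short.

```python
from typing import Callable


def nightly_repair(
    count_indexed: Callable[[], int],       # entries in the live HNSW index (hypothetical)
    count_expected: Callable[[], int],      # rows in the SQLite source of truth (hypothetical)
    rebuild_to_staging: Callable[[], int],  # rebuilds into a staging dir, returns entry count
    promote_staging: Callable[[], None],    # swaps the staged index in for the live one
    min_ratio: float = 0.99,                # assumed tolerance, tune for your data
) -> str:
    """Rebuild only when needed, and only promote a rebuild that looks sane."""
    expected = count_expected()
    indexed = count_indexed()

    # Guard 1: skip the rebuild entirely if the live index already looks healthy.
    if expected == 0 or indexed >= expected * min_ratio:
        return "skipped"

    # Guard 2: build next to the live index and check the result before swapping.
    staged = rebuild_to_staging()
    if staged < expected * min_ratio:
        return "aborted"  # the live index is left untouched

    promote_staging()
    return "rebuilt"
```

With a guard like this, a rebuild that produces zero entries returns "aborted" and the previous index keeps serving searches instead of being replaced by an empty one.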
When I finally dug into the health check, I found another problem: the FTS5 integrity check was lying. It used `PRAGMA integrity_check`, which reported clean results. The index was actually malformed, but `PRAGMA quick_check` would have caught it. The two commands behave differently, and I was running the wrong one. The health check had been giving false passes the whole time.
And then I found out that the script I'd written to fix FTS5 corruption was making it worse. It ran as an async hook at session start, opening a concurrent FTS5 transaction while another process might be writing, which is exactly how you corrupt an FTS5 index. My safety net was the source of the problem.
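For reference, a rough sketch of a health check that runs `PRAGMA quick_check` through Python's own `sqlite3` module rather than a separate CLI. The function name and the read-only open are my assumptions for illustration, not MemPalace's implementation. Opening read-only means a mistyped path raises an error instead of silently creating an empty database that would pass the check.

```python
import sqlite3
from pathlib import Path


def quick_check(db_path: str) -> list[str]:
    """Run PRAGMA quick_check; return the reported problems (empty list if clean)."""
    # mode=ro: never create or modify the database from a health check.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True, timeout=30)
    try:
        rows = conn.execute("PRAGMA quick_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # SQLite reports a single row containing "ok" when nothing is wrong.
    return [] if messages == ["ok"] else messages
```

Logging `sqlite3.sqlite_version` next to the result records which SQLite library actually performed the check, which helps when comparing results between tools.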
I won't pretend this wasn't a painful few weeks.
Going deeper
Underneath all of this was a real upstream bug in chromadb's Rust HNSW bindings: a thread-safety issue in `updatePoint` that causes silent index corruption under concurrent writes. It affects all of chromadb 1.5.x. There's no error. No warning. The index just quietly becomes unreliable.
The mitigation: pin `chroma-hnswlib==0.7.6` (uses Python HNSW instead of the Rust path) and set `hnsw:num_threads=1` in collection metadata to eliminate the concurrent update race. Neither of these is documented in MemPalace's own material. I found the bug by reading chromadb source and correlating symptoms.
There was also a SQLite version mismatch: the system `sqlite3` CLI (3.46.1) and Python's bundled sqlite3 (3.50.x) behave differently on FTS5 validation. If you validate from the command line, you might get clean results while Python sees corruption. The health check needs to use the same SQLite as everything else.
And there was a subtle chromadb API issue: `PersistentClient` silently ignores the `CHROMA_API_IMPL` environment variable. It always uses its own internal Rust API regardless. If you're setting that env var to control behavior (which MemPalace's own scripts do), `PersistentClient` won't respect it. Debugging with it produces misleading results.
What it looks like now
The palace has been stable for days. 289,000+ memories. Search works. Nightly crons run with proper lock coordination so they don't step on each other. The health check now uses `PRAGMA quick_check` and runs on every session start. The 4am repair cron skips if the index is already healthy. The FTS5 repair runs with exclusive access, not as an async hook.
The way I actually use it day-to-day: at the start of every task, Claude searches its own memory, not with jargon, just "have we solved something like this before?" It finds past bugs, past decisions, past "we tried that, it didn't work." At the end of each session, it writes a diary entry summarizing what happened. The mining jobs pull in both conversation history and project code nightly.
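One way to get the lock coordination described above is an advisory file lock that every cron job and repair script takes before touching the palace. This is only a sketch: `exclusive_lock`, `LockBusy`, and the lock path are hypothetical names, and it relies on `fcntl.flock`, so it assumes a Unix-like system and that every writer cooperates by taking the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


class LockBusy(RuntimeError):
    """Raised when another process already holds the lock."""


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB: fail immediately instead of queueing behind another job.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise LockBusy(lock_path) from exc
        yield
    finally:
        # Closing the descriptor releases the lock, including after an exception.
        os.close(fd)
```

A mine job or repair would wrap its work in `with exclusive_lock("/path/to/palace.lock"):` and exit quietly on `LockBusy`. Because the kernel drops the lock when the process dies, a crashed job does not leave a stale lock behind the way a bare lock file can.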
It's the persistent memory layer I wanted from the start. It just took a while to get there.
The fixes, briefly
For anyone hitting these symptoms:
- Pin `chroma-hnswlib==0.7.6`, set `hnsw:num_threads=1` on your collections in SQLite metadata.
- `dict object has no attribute dimensionality`: Your pickle is in legacy format. Migrate it to `types.SimpleNamespace`. It has attribute access, survives pickle round-trips, and works with chromadb's `cast()`.
- Use `PRAGMA quick_check`, not `PRAGMA integrity_check`. Use the same Python process (same SQLite version) as your mine jobs, not the system CLI.
A more detailed writeup of the HNSW corruption specifically is coming in a follow-up post. PR #1607 (open) addresses the `mempalace repair` FTS5 abort behavior: currently the repair bails and tells you to fix FTS5 manually when it could just fix it itself.
The infrastructure I built around all of this is in a public repo if anyone wants to look at the actual scripts: Uncle J's Refinery. Not an advertisement, just the reference if the above is useful and you want to see working implementations.
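For the legacy pickle fix in the list above, here is a minimal sketch of the dict-to-`SimpleNamespace` migration. The function name and the `.bak`/`.tmp` suffixes are my own choices for illustration, and it assumes the legacy pickle is a plain dict with string keys. Which file you point it at depends on your palace layout, so treat the path as something you have to locate yourself. Only unpickle files you created; `pickle.load` executes whatever the file tells it to.

```python
import pickle
import shutil
from pathlib import Path
from types import SimpleNamespace


def migrate_legacy_pickle(path: Path) -> bool:
    """Rewrite a dict-format pickle as a SimpleNamespace. Returns True if rewritten."""
    with path.open("rb") as fh:
        data = pickle.load(fh)
    if not isinstance(data, dict):
        return False  # not the legacy dict format, leave the file alone

    # Keep a copy of the original next to it before changing anything.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    # Write to a temp file, then rename over the original so a crash
    # mid-write cannot leave a half-written pickle in place.
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(SimpleNamespace(**data), fh)
    tmp.replace(path)
    return True
```

After the migration, `obj.dimensionality` resolves as an attribute instead of raising, and running the function a second time is a no-op because the file no longer holds a dict.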
Happy to answer questions. And if you're in the middle of any of this — it does come out the other side.