Summary
The main fusion.db database suffers recurring B-tree corruption, specifically targeting the agentLogEntries table (rootpage 21). In our project this has occurred twice within ~14 hours, each time rendering the database unusable until manual .recover.
Environment
- Fusion version: 0.13.0
- Node.js: v24.15.0
- SQLite driver: node:sqlite (DatabaseSync)
- OS: macOS 15.3 (Darwin 24.3.0), APFS filesystem
- Concurrency: maxConcurrent: 4, maxSpawnedAgentsGlobal: 20
- DB size: ~13 MB, ~29K rows in agentLogEntries, ~800 rows in activityLog
What happens
Running PRAGMA integrity_check produces errors like:
Error: stepping, database disk image is malformed (11)
*** in database main ***
Tree 21 page 3617: btreeInitPage() returns error code 11
Tree 21 page 3616: btreeInitPage() returns error code 11
Tree 21 page 3614: btreeInitPage() returns error code 11
... (dozens of consecutive pages)
row 541 missing from index idxActivityLogTypeTimestamp
wrong # of entries in index idxActivityLogTypeTimestamp
The page-level errors are always in agentLogEntries (rootpage 21); the index mismatches on idxActivityLogTypeTimestamp appear alongside them. The corrupted pages always fall in a contiguous, high-numbered range, which suggests the damage happens during a bulk append or a WAL checkpoint.
Root cause analysis
After investigating the Fusion source (extension.js), we identified several contributing factors:
1. Unbatched individual auto-committed INSERTs
appendAgentLog() (line ~32759) executes each INSERT INTO agentLogEntries as its own auto-committed statement. With 4 concurrent agents producing thinking/tool-call/text deltas, this generates hundreds of separate commits per minute, each of which is synced to disk individually. Every commit appends frames to the WAL, and the steady stream of appends keeps allocating new leaf pages at the tail of the table's B-tree.
Impact: Extreme WAL churn and B-tree page fragmentation on the highest-volume table.
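One way to reduce the per-row commit cost is to buffer entries and flush them inside a single transaction. The following is a minimal sketch, not Fusion's actual code: it assumes a DatabaseSync-like handle exposing exec() and prepare(), and the class name AgentLogBuffer, the column names (agentId, kind, text), and the batch size are all hypothetical.

```javascript
// Hypothetical batching wrapper; not Fusion's actual appendAgentLog().
// `db` is assumed to be a node:sqlite DatabaseSync (exec/prepare are its methods).
class AgentLogBuffer {
  constructor(db, maxBatch = 50) {
    this.db = db;
    this.maxBatch = maxBatch;
    this.pending = [];
    // Column names are placeholders for whatever agentLogEntries really uses.
    this.insert = db.prepare(
      'INSERT INTO agentLogEntries (agentId, kind, text) VALUES (?, ?, ?)'
    );
  }

  // Queue one entry; flush automatically once the batch is full.
  append(entry) {
    this.pending.push(entry);
    if (this.pending.length >= this.maxBatch) this.flush();
  }

  // Write all buffered rows in one transaction: one commit instead of one per row.
  flush() {
    if (this.pending.length === 0) return 0;
    const batch = this.pending;
    this.pending = [];
    this.db.exec('BEGIN IMMEDIATE');
    try {
      for (const e of batch) this.insert.run(e.agentId, e.kind, e.text);
      this.db.exec('COMMIT');
    } catch (err) {
      this.db.exec('ROLLBACK');
      this.pending = batch.concat(this.pending); // keep the rows for a retry
      throw err;
    }
    return batch.length;
  }
}
```

A caller would also want to invoke flush() on a short timer and at shutdown so that entries do not sit in memory indefinitely; anything still buffered when the process dies is lost, which is the trade-off for fewer commits.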
2. WAL checkpoint interval is too long for this write volume
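maintenanceIntervalMs defaults to 900000 (15 minutes). The maintenance checkpoint uses PRAGMA wal_checkpoint(TRUNCATE), the most aggressive mode, but it only runs every 15 minutes. Between runs the WAL file is never truncated, and the only thing limiting it is SQLite's passive auto-checkpoint (item 3), which readers can block.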
Additionally, if a maintenance cycle is skipped because the previous one is still running (observed in logs: "Maintenance cycle skipped -- previous cycle still running"), the WAL can grow even larger.
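Whatever the interval, it helps to know whether each checkpoint actually completed. The sketch below assumes a node:sqlite DatabaseSync handle in WAL mode; the helper name checkpointTruncate is hypothetical. It relies on the documented behaviour that the wal_checkpoint pragma returns a single row of three integers (blocked flag, frames in the WAL, frames checkpointed).

```javascript
// Run a TRUNCATE checkpoint and report whether it actually completed.
// `db` is assumed to be a node:sqlite DatabaseSync opened in WAL mode.
function checkpointTruncate(db) {
  const row = db.prepare('PRAGMA wal_checkpoint(TRUNCATE)').get();
  // The pragma returns one row of three integers; read them by position.
  const [busy, walFrames, checkpointedFrames] = Object.values(row);
  return { completed: busy === 0, walFrames, checkpointedFrames };
}
```

Logging the result when completed is false would show how often the maintenance cycle is being blocked by a reader or by the other process.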
3. No explicit PRAGMA wal_autocheckpoint or PRAGMA journal_size_limit
Neither of these PRAGMAs is set in the connection constructor (lines 2796-2811). The defaults are:
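- wal_autocheckpoint = 1000 (a passive checkpoint is attempted once the WAL reaches 1000 pages, ~4 MB at a 4 KB page size)
- journal_size_limit = -1 (no WAL size limit)
For a busy 4-agent setup, 4 MB of WAL accumulates quickly, and a passive checkpoint cannot finish, nor can the WAL be reset, while a reader still holds an older snapshot.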
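The ~4 MB figure can be checked with a little arithmetic. This sketch assumes a 4096-byte page size (the SQLite default; the actual page size of fusion.db is not stated here) and uses the WAL file layout of a 32-byte file header plus a 24-byte header per frame. The function name walBytes is made up for illustration.

```javascript
// Approximate WAL file size for a given number of frames.
// Per the SQLite WAL format: 32-byte file header, 24-byte header per frame.
// pageSize = 4096 is an assumption (SQLite's default), not a measured value.
function walBytes(frames, pageSize = 4096) {
  return 32 + frames * (pageSize + 24);
}

console.log(walBytes(1000)); // 4120032 bytes, roughly 4 MB (default threshold)
console.log(walBytes(100));  // 412032 bytes, roughly 400 KB
```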
4. Multi-process concurrent access risk
Both the Fusion daemon (Express server on port 4040) and the Claude Code extension process open fusion.db with DatabaseSync + WAL mode. If either process crashes or is killed (SIGKILL, system sleep/wake) while holding a lock or mid-checkpoint, the WAL can be left in an inconsistent state.
The only concurrency safeguard is busy_timeout = 5000. There is no startup integrity check or corrupted-DB detection.
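A startup check could look roughly like the following. This is a sketch under assumptions: `db` is a node:sqlite DatabaseSync, and the helper name checkIntegrity is hypothetical. It treats a thrown error the same as a failed check, since severe corruption can make the pragma itself fail, as in the "database disk image is malformed" output above.

```javascript
// Returns { ok, problems }. `db` is assumed to be a node:sqlite DatabaseSync.
function checkIntegrity(db) {
  try {
    const rows = db.prepare('PRAGMA integrity_check').all();
    // A healthy database yields a single row whose only value is 'ok'.
    const messages = rows.map((r) => String(Object.values(r)[0]));
    const ok = messages.length === 1 && messages[0] === 'ok';
    return { ok, problems: ok ? [] : messages };
  } catch (err) {
    // Severe corruption can make the pragma itself throw.
    return { ok: false, problems: [err.message] };
  }
}
```

On a database of this size (~13 MB) the full check should be cheap enough to run on every open; what to do when ok is false (refuse to start, alert, or attempt recovery) is a separate policy decision.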
Workaround (what we did)
- Stop all Fusion processes
- Delete .fusion/fusion.db-wal and .fusion/fusion.db-shm (stale WAL). This discards any commits that were not yet checkpointed, so copy all three files aside first.
- Run sqlite3 fusion.db ".recover" | sqlite3 fusion_recovered.db
- Rebuild FTS: INSERT INTO tasks_fts(tasks_fts) VALUES('rebuild');
- Run ANALYZE;
- Replace fusion.db with the recovered copy
- Restart Fusion
We also reduced maintenanceIntervalMs from 900000 to 300000 (5 min) to limit WAL growth.
Suggested fixes
- Batch agent log writes in explicit transactions. Wrap multiple appendAgentLog() calls in a single transaction so that there is one fsync per batch instead of one per row; we estimate this would cut WAL traffic by 60-80%.
- Set PRAGMA wal_autocheckpoint = 100 in the connection constructor. This triggers passive checkpointing every ~400 KB instead of ~4 MB, keeping the WAL small.
- Add PRAGMA journal_size_limit = 4194304 (4 MB) so the WAL file is truncated back to at most 4 MB after a checkpoint instead of staying at its high-water mark.
- Add a startup integrity check. Run PRAGMA integrity_check on DB open; if it fails, attempt an automatic .recover or alert the user instead of silently continuing with a corrupted DB.
- Consider PRAGMA synchronous = NORMAL instead of relying on the default FULL. In WAL mode, NORMAL still protects the database from corruption and is significantly faster, because the WAL is synced only at checkpoints rather than on every commit. The trade-off is that the most recent commits may be rolled back after a power loss or OS crash.
- Reduce the default maintenanceIntervalMs from 15 min to 5 min for the wal_checkpoint(TRUNCATE) cycle.
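The PRAGMA suggestions above could be applied together when the connection is opened. A minimal sketch, assuming a node:sqlite DatabaseSync handle; the function name applyWalSettings is hypothetical and this is not the existing Fusion connection constructor.

```javascript
// Hypothetical connection setup; the values mirror the suggestions above.
// `db` is assumed to be a node:sqlite DatabaseSync.
function applyWalSettings(db) {
  db.exec('PRAGMA journal_mode = WAL');
  db.exec('PRAGMA busy_timeout = 5000');            // existing safeguard
  db.exec('PRAGMA wal_autocheckpoint = 100');       // passive checkpoint every ~400 KB
  db.exec('PRAGMA journal_size_limit = 4194304');   // truncate WAL back to <= 4 MB
  db.exec('PRAGMA synchronous = NORMAL');           // sync WAL at checkpoints only
}
```

These settings are per connection, apart from journal_mode which persists in the database file, so both the daemon and the extension process would need to apply them.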
Reproduction
This is a steady-state issue rather than a discrete trigger. It occurs after prolonged operation with multiple concurrent agents producing high-volume log output. Our agentLogEntries table reached ~29K rows before the first corruption event.
Happy to provide the corrupted DB files or any additional diagnostics if helpful.