Recurring SQLite B-tree corruption in fusion.db (agentLogEntries) #24

@timothyjlaurent

Summary

The main fusion.db database suffers recurring B-tree corruption affecting the agentLogEntries table (rootpage 21). In our project this has occurred twice within ~14 hours, each time leaving the database unusable until we ran a manual .recover.

Environment

  • Fusion version: 0.13.0
  • Node.js: v24.15.0
  • SQLite driver: node:sqlite (DatabaseSync)
  • OS: macOS 15.3 (Darwin 24.3.0), APFS filesystem
  • Concurrency: maxConcurrent: 4, maxSpawnedAgentsGlobal: 20
  • DB size: ~13 MB, ~29K rows in agentLogEntries, ~800 rows in activityLog

What happens

Running PRAGMA integrity_check produces errors like:

Error: stepping, database disk image is malformed (11)
*** in database main ***
Tree 21 page 3617: btreeInitPage() returns error code 11
Tree 21 page 3616: btreeInitPage() returns error code 11
Tree 21 page 3614: btreeInitPage() returns error code 11
... (dozens of consecutive pages)
row 541 missing from index idxActivityLogTypeTimestamp
wrong # of entries in index idxActivityLogTypeTimestamp

The page-level errors are always in agentLogEntries (rootpage 21); the index mismatches on idxActivityLogTypeTimestamp appear alongside them. The corrupted pages always fall in a contiguous, high-numbered range, which suggests the damage happens during a bulk append or a WAL checkpoint.
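For anyone triaging a similar report, here is a minimal sketch of how the output above can be summarised. It is written with Python's built-in sqlite3 module rather than node:sqlite purely so it runs standalone. The helper names pages_by_tree and object_for_rootpage are ours, not part of Fusion or SQLite.

```python
import re
import sqlite3
from collections import defaultdict

# Matches integrity_check lines such as "Tree 21 page 3617: btreeInitPage() ..."
TREE_LINE = re.compile(r"^Tree (\d+) page (\d+):")


def pages_by_tree(lines):
    """Group the damaged page numbers by the root page of the tree they belong to."""
    found = defaultdict(list)
    for line in lines:
        match = TREE_LINE.match(line.strip())
        if match:
            found[int(match.group(1))].append(int(match.group(2)))
    return {tree: sorted(pages) for tree, pages in found.items()}


def object_for_rootpage(db, rootpage):
    """Return (type, name) of the table or index rooted at this page, or None."""
    return db.execute(
        "SELECT type, name FROM sqlite_master WHERE rootpage = ?", (rootpage,)
    ).fetchone()
```

Feeding the integrity_check lines to pages_by_tree and passing each key to object_for_rootpage is how a "Tree 21" entry can be mapped back to a table name.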

Root cause analysis

After investigating the Fusion source (extension.js), we identified several contributing factors:

1. Unbatched individual auto-committed INSERTs

appendAgentLog() (line ~32759) runs each INSERT INTO agentLogEntries as its own auto-committed statement. With 4 concurrent agents producing thinking/tool-call/text deltas, this generates hundreds of separate commits per minute. Each commit appends at least one full page frame to the WAL and, under the default synchronous = FULL, syncs the WAL to disk.

Impact: Heavy WAL churn on the highest-volume table.
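To make the WAL churn concrete, the sketch below compares the WAL size after N auto-committed inserts with the size after the same N inserts in one transaction. It uses Python's built-in sqlite3 as a stand-in for node:sqlite, and a simplified two-column table that is our assumption rather than Fusion's real schema.

```python
import os
import sqlite3
import tempfile


def wal_bytes_after(n_rows, batched):
    """Insert n_rows into a fresh WAL-mode database and return the WAL file size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: every statement auto-commits unless we BEGIN explicitly.
        db = sqlite3.connect(path, isolation_level=None)
        db.execute("PRAGMA journal_mode = WAL").fetchall()
        # Disable auto-checkpointing so the WAL size reflects every frame written.
        db.execute("PRAGMA wal_autocheckpoint = 0").fetchall()
        db.execute("CREATE TABLE agentLogEntries (id INTEGER PRIMARY KEY, text TEXT)")
        # Start both runs from an empty WAL.
        db.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
        if batched:
            db.execute("BEGIN")
        for i in range(n_rows):
            db.execute("INSERT INTO agentLogEntries (text) VALUES (?)", (f"delta {i}",))
        if batched:
            db.execute("COMMIT")
        size = os.path.getsize(path + "-wal")
        db.close()
        return size


if __name__ == "__main__":
    print("per-row commits:", wal_bytes_after(200, batched=False), "bytes")
    print("one transaction:", wal_bytes_after(200, batched=True), "bytes")
```

The per-row run rewrites the same leaf page into the WAL on every commit, which is why its WAL comes out much larger than the batched one.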

2. WAL checkpoint interval is too long for this write volume

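maintenanceIntervalMs defaults to 900000 (15 minutes). The maintenance cycle runs PRAGMA wal_checkpoint(TRUNCATE), the most aggressive checkpoint mode, but only once every 15 minutes. Between those runs, only SQLite's passive auto-checkpoints apply. They never shrink the WAL file, and the WAL cannot restart from the beginning while a reader still holds an older snapshot, so the file can keep growing.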

Additionally, if a maintenance cycle is skipped because the previous one is still running (observed in logs: "Maintenance cycle skipped -- previous cycle still running"), the WAL can grow even larger.
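One thing the maintenance cycle could do is inspect the row that wal_checkpoint returns, since it says whether the checkpoint actually completed. A minimal sketch using Python's built-in sqlite3 as a stand-in for node:sqlite; the function name checkpoint_truncate is ours.

```python
import sqlite3


def checkpoint_truncate(db):
    """Run a TRUNCATE checkpoint and report whether it ran to completion.

    The PRAGMA returns one row: (busy, log_frames, checkpointed_frames).
    busy is non-zero when a reader or writer prevented the checkpoint from finishing.
    """
    busy, log_frames, checkpointed = db.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return {
        "completed": busy == 0,
        "log_frames": log_frames,
        "checkpointed_frames": checkpointed,
    }
```

Logging a warning whenever completed is False would show how often the 15-minute checkpoint is blocked in practice.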

3. No explicit PRAGMA wal_autocheckpoint or PRAGMA journal_size_limit

Neither of these PRAGMAs is set in the connection constructor (lines 2796-2811). The defaults are:

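  • wal_autocheckpoint = 1000 pages (~4 MB of WAL at the default 4 KiB page size before a passive checkpoint is attempted)
  • journal_size_limit = -1 (no limit on the size of the WAL file left on disk)

For a busy 4-agent setup, 4 MB of WAL accumulates quickly, and a passive checkpoint stops early, leaving frames un-checkpointed, whenever a reader still needs them.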
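To confirm which values a given connection is actually running with, the settings can be read back at runtime. A minimal sketch, again using Python's built-in sqlite3 as a stand-in for node:sqlite; wal_settings is a hypothetical helper name.

```python
import sqlite3


def wal_settings(db):
    """Read back the WAL-related settings in effect on this connection."""

    def pragma(name):
        return db.execute(f"PRAGMA {name}").fetchone()[0]

    page_size = pragma("page_size")
    autocheckpoint_pages = pragma("wal_autocheckpoint")
    return {
        "journal_mode": pragma("journal_mode"),
        "page_size": page_size,
        "wal_autocheckpoint": autocheckpoint_pages,
        # WAL size at which SQLite attempts a passive checkpoint.
        "autocheckpoint_bytes": autocheckpoint_pages * page_size,
        "journal_size_limit": pragma("journal_size_limit"),
        "synchronous": pragma("synchronous"),  # 1 = NORMAL, 2 = FULL
    }
```

Note that wal_autocheckpoint, journal_size_limit, and synchronous are per-connection settings, so each process that opens fusion.db has to set them itself.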

4. Multi-process concurrent access risk

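Both the Fusion daemon (Express server on port 4040) and the Claude Code extension process open fusion.db with DatabaseSync in WAL mode. If either process crashes or is killed (SIGKILL, system sleep/wake) while holding a lock or mid-checkpoint, the other keeps writing against the same WAL. SQLite is designed to recover from this, so it should not corrupt the file on its own, but it is the scenario we suspect left the WAL in an inconsistent state.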

The only concurrency safeguard is busy_timeout = 5000. There is no startup integrity check or corrupted-DB detection.

Workaround (what we did)

  1. Stop all Fusion processes
  2. Delete .fusion/fusion.db-wal and .fusion/fusion.db-shm (stale WAL). Note that this discards any commits not yet checkpointed into fusion.db.
  3. Run sqlite3 fusion.db ".recover" | sqlite3 fusion_recovered.db
  4. Rebuild FTS: INSERT INTO tasks_fts(tasks_fts) VALUES('rebuild');
  5. Run ANALYZE;
  6. Replace fusion.db with the recovered copy
  7. Restart Fusion
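Because the steps above delete the WAL, it is worth copying the database and its sidecar files somewhere safe first, both for later diagnostics and in case recovery goes wrong. A minimal Python sketch; snapshot_db_files and the destination directory are our own names, not part of Fusion.

```python
import shutil
from pathlib import Path


def snapshot_db_files(db_path, dest_dir):
    """Copy the database plus any -wal / -shm sidecars into dest_dir.

    Run this only after all Fusion processes are stopped, so the files are
    not changing while they are copied. Returns the list of copied paths.
    """
    db_path = Path(db_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in ("", "-wal", "-shm"):
        src = db_path.with_name(db_path.name + suffix)
        if src.exists():
            copied.append(Path(shutil.copy2(src, dest / src.name)))
    return copied
```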

We also reduced maintenanceIntervalMs from 900000 to 300000 (5 min) to limit WAL growth.

Suggested fixes

  1. Batch agent log writes in explicit transactions. Buffer multiple appendAgentLog() calls and commit them in a single transaction, so there is one WAL sync per batch instead of one per row. We estimate this would cut WAL traffic by 60-80%.

  2. Set PRAGMA wal_autocheckpoint = 100 in the connection constructor. This triggers passive checkpointing every ~400 KB instead of ~4 MB, keeping the WAL small.

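  3. Add PRAGMA journal_size_limit = 4194304 (4 MB) so an oversized WAL file is truncated back to 4 MB when it is reset after a checkpoint. This limits the file left on disk; it does not stop the WAL from growing past 4 MB while checkpoints are blocked.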

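  4. Add a startup integrity check. Run PRAGMA integrity_check on DB open; if it fails, alert the user or attempt an automatic recovery instead of silently continuing with a corrupted DB. Note that .recover is a sqlite3 CLI command, not SQL, so it cannot be run through DatabaseSync directly.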

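  5. Consider PRAGMA synchronous = NORMAL instead of relying on the default FULL. In WAL mode, NORMAL still protects the database from corruption and is significantly faster, because the WAL is synced at checkpoints rather than on every commit. The trade-off is that the most recent commits can be lost on power failure or an OS crash.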

  6. Reduce the default maintenanceIntervalMs from 15 min to 5 min for the wal_checkpoint(TRUNCATE) cycle.
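Fixes 1 through 5 could look roughly like the sketch below. It uses Python's built-in sqlite3 so it can be run standalone; the PRAGMAs are the same ones we would expect to pass to DatabaseSync. The function names and the (agentId, text) columns are assumptions for illustration, not Fusion's actual API or schema.

```python
import sqlite3


def open_fusion_db(path):
    """Open the database with the proposed PRAGMAs and refuse to continue if it is corrupt."""
    # isolation_level=None: no implicit transactions; we BEGIN/COMMIT explicitly.
    db = sqlite3.connect(path, isolation_level=None)
    for pragma in (
        "journal_mode = WAL",
        "busy_timeout = 5000",
        "synchronous = NORMAL",           # fix 5
        "wal_autocheckpoint = 100",       # fix 2: passive checkpoint every 100 pages
        "journal_size_limit = 4194304",   # fix 3: truncate leftover WAL to 4 MB
    ):
        db.execute(f"PRAGMA {pragma}").fetchall()
    # Fix 4: a healthy database returns the single row "ok". On badly damaged
    # files the PRAGMA itself may raise sqlite3.DatabaseError instead.
    problems = [row[0] for row in db.execute("PRAGMA integrity_check")]
    if problems != ["ok"]:
        db.close()
        raise sqlite3.DatabaseError("integrity_check failed: " + "; ".join(problems))
    return db


def append_agent_logs(db, rows):
    """Fix 1: insert many (agentId, text) rows under a single commit."""
    db.execute("BEGIN IMMEDIATE")
    try:
        db.executemany(
            "INSERT INTO agentLogEntries (agentId, text) VALUES (?, ?)", rows
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
```

In Fusion, the equivalent of append_agent_logs would presumably be fed from a small in-memory buffer that is flushed every few hundred milliseconds or every N entries, whichever comes first.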

Reproduction

We have not found a discrete trigger; this is a steady-state issue that appears after prolonged operation with multiple concurrent agents producing high-volume log output. Our agentLogEntries table reached ~29K rows before the first corruption event.

Happy to provide the corrupted DB files or any additional diagnostics if helpful.
