Recurring SQLite B-tree corruption in fusion.db (agentLogEntries)

## Summary

The main `fusion.db` database suffers recurring B-tree corruption, always in the `agentLogEntries` table (rootpage 21). In our project this has occurred twice within ~14 hours, each time rendering the database unusable until we ran a manual `.recover`.

## Environment

- **Fusion version:** `0.13.0`
- **Node.js:** `v24.15.0`
- **SQLite driver:** `node:sqlite` (`DatabaseSync`)
- **OS:** macOS 15.3 (Darwin 24.3.0), APFS filesystem
- **Concurrency:** `maxConcurrent: 4`, `maxSpawnedAgentsGlobal: 20`
- **DB size:** ~13 MB, ~29K rows in `agentLogEntries`, ~800 rows in `activityLog`

## What happens

Running `PRAGMA integrity_check` produces errors like:

```
Error: stepping, database disk image is malformed (11)
*** in database main ***
Tree 21 page 3617: btreeInitPage() returns error code 11
Tree 21 page 3616: btreeInitPage() returns error code 11
Tree 21 page 3614: btreeInitPage() returns error code 11
... (dozens of consecutive pages)
row 541 missing from index idxActivityLogTypeTimestamp
wrong # of entries in index idxActivityLogTypeTimestamp
```

The affected table is always `agentLogEntries` (rootpage 21). The corrupted pages are always in a contiguous high-numbered range, suggesting corruption during a bulk append or WAL checkpoint.

## Root cause analysis

After investigating the Fusion source (`extension.js`), we identified several contributing factors:

### 1. Unbatched individual auto-committed INSERTs

`appendAgentLog()` (line ~32759) executes each `INSERT INTO agentLogEntries` as its own auto-committed transaction. With 4 concurrent agents producing thinking/tool-call/text deltas, this generates hundreds of separate commits per minute. Each commit appends frames to the WAL and, under the default `synchronous = FULL`, syncs it to disk; the steady stream of appends also keeps allocating new pages at the tail of the table's B-tree.

**Impact:** Extreme WAL churn and B-tree page fragmentation on the highest-volume table.
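To make the difference concrete, here is a minimal sketch of batching at the SQLite level. It uses Python's standard `sqlite3` module purely for illustration; it is not Fusion's code, and the column names (`agentId`, `kind`, `payload`) and the helper `append_agent_logs` are hypothetical.

```python
import sqlite3

def append_agent_logs(conn, rows):
    """Insert a batch of log rows inside ONE transaction.

    The whole batch costs a single commit instead of one commit per row.
    Column names are assumptions made for this sketch.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO agentLogEntries (agentId, kind, payload) VALUES (?, ?, ?)",
            rows,
        )

# Small demo against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agentLogEntries ("
    "id INTEGER PRIMARY KEY, agentId TEXT, kind TEXT, payload TEXT)"
)
append_agent_logs(conn, [
    ("agent-1", "thinking", "delta 1"),
    ("agent-1", "tool-call", "delta 2"),
    ("agent-2", "text", "delta 3"),
])
```

In a real writer the batch would presumably be flushed on a short timer or once it reaches a size threshold, so that a crash loses at most the rows still buffered in memory.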

### 2. WAL checkpoint interval is too long for this write volume

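`maintenanceIntervalMs` defaults to 900000 (15 minutes). The maintenance cycle runs `PRAGMA wal_checkpoint(TRUNCATE)`, the most aggressive checkpoint mode, but only once every 15 minutes. Between cycles, the WAL is bounded only by SQLite's passive auto-checkpoint (see section 3), which an active reader can block, so the WAL can keep growing.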

Additionally, if a maintenance cycle is skipped because the previous one is still running (observed in logs: `"Maintenance cycle skipped -- previous cycle still running"`), the WAL can grow even larger.
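A checkpoint also reports whether it actually completed, which a maintenance cycle could log. The sketch below (Python's standard `sqlite3`, for illustration only; `checkpoint_truncate` is a hypothetical helper, not a Fusion function) reads the three-column result row that `PRAGMA wal_checkpoint` returns.

```python
import os
import sqlite3
import tempfile

def checkpoint_truncate(conn):
    """Run a TRUNCATE checkpoint and report whether it completed.

    SQLite returns one row: (busy, log_frames, checkpointed_frames).
    A non-zero `busy` means another connection blocked the checkpoint.
    """
    busy, log_frames, checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return {
        "completed": busy == 0,
        "log_frames": log_frames,
        "checkpointed_frames": checkpointed,
    }

# Demo on a throwaway WAL-mode database (autocommit, single connection).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("PRAGMA journal_mode = WAL").fetchone()
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
result = checkpoint_truncate(conn)
```

Logging `completed` on every cycle would show whether checkpoints are routinely being blocked, which the skipped-cycle message alone does not reveal.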

### 3. No explicit `PRAGMA wal_autocheckpoint` or `PRAGMA journal_size_limit`

Neither of these PRAGMAs is set in the connection constructor (lines 2796-2811). The defaults are:
- `wal_autocheckpoint = 1000` (~4 MB of WAL before passive checkpoint)
- No WAL size limit

For a busy 4-agent setup, 4 MB of WAL accumulates quickly, and a passive checkpoint cannot finish or let the WAL be reused while a reader is still using older frames.
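The "~4 MB" figure is simply the auto-checkpoint page count multiplied by the page size. A small sketch, using Python's standard `sqlite3` for illustration (the helper names are hypothetical), that reads the effective settings from a connection and does the arithmetic:

```python
import sqlite3

def wal_bytes(pages, page_size):
    """WAL size, in bytes, at which a passive auto-checkpoint is attempted."""
    return pages * page_size

def read_wal_settings(conn):
    """Read the effective WAL-related settings from an open connection."""
    pages = conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    limit = conn.execute("PRAGMA journal_size_limit").fetchone()[0]
    return {
        "autocheckpoint_pages": pages,
        "page_size": page_size,
        "journal_size_limit": limit,  # -1 means no limit is configured
        "autocheckpoint_bytes": wal_bytes(pages, page_size),
    }

conn = sqlite3.connect(":memory:")
settings = read_wal_settings(conn)

# With 4096-byte pages:
default_threshold = wal_bytes(1000, 4096)   # 4,096,000 bytes, roughly 4 MB
proposed_threshold = wal_bytes(100, 4096)   # 409,600 bytes, roughly 400 KB
```

Running the same three PRAGMAs against a live `fusion.db` connection would confirm which values are actually in effect, since both the page size and the auto-checkpoint default can differ between SQLite builds.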

### 4. Multi-process concurrent access risk

Both the Fusion daemon (Express server on port 4040) and the Claude Code extension process open `fusion.db` with `DatabaseSync` + WAL mode. If either process crashes or is killed (SIGKILL, system sleep/wake) while holding a lock or mid-checkpoint, the WAL can be left in an inconsistent state.

The only concurrency safeguard is `busy_timeout = 5000`. There is no startup integrity check or corrupted-DB detection.

## Workaround (what we did)

1. Stop all Fusion processes
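2. Delete `.fusion/fusion.db-wal` and `.fusion/fusion.db-shm` (stale WAL). Note that this discards any commits not yet checkpointed into `fusion.db`, so copy all three files somewhere safe first.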
3. Run `sqlite3 fusion.db ".recover" | sqlite3 fusion_recovered.db`
4. Rebuild FTS: `INSERT INTO tasks_fts(tasks_fts) VALUES('rebuild');`
5. Run `ANALYZE;`
6. Replace `fusion.db` with the recovered copy
7. Restart Fusion

We also reduced `maintenanceIntervalMs` from 900000 to 300000 (5 min) to limit WAL growth.

## Suggested fixes

1. **Batch agent log writes in explicit transactions.** Wrap multiple `appendAgentLog()` calls within a single transaction to reduce WAL traffic by 60-80% (one fsync per batch instead of per row).

2. **Set `PRAGMA wal_autocheckpoint = 100`** in the connection constructor. This triggers passive checkpointing every ~400 KB instead of ~4 MB, keeping the WAL small.

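3. **Add `PRAGMA journal_size_limit = 4194304`** (4 MB) so the WAL file is truncated back to at most 4 MB after a checkpoint, rather than staying at its largest size after a spike.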

4. **Add a startup integrity check.** Run `PRAGMA integrity_check` (or the cheaper `PRAGMA quick_check`) on DB open; if it fails, alert the user or attempt automatic recovery (for example via the `sqlite3` CLI's `.recover`) instead of silently continuing with a corrupted DB.

5. **Consider `PRAGMA synchronous = NORMAL`** instead of relying on the default `FULL`. In WAL mode, `NORMAL` is significantly faster because the WAL is synced at checkpoints rather than on every commit. The database stays consistent; the trade-off is that a power loss may roll back the most recent commits.

6. **Reduce the default `maintenanceIntervalMs`** from 15 min to 5 min for the `wal_checkpoint(TRUNCATE)` cycle.
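Suggestions 2 through 5 could be combined into a single connection-open routine. The following is a rough sketch in Python's standard `sqlite3`, for illustration only: `open_with_safeguards` is a hypothetical name, and the same PRAGMA statements would need to be issued through `node:sqlite` in Fusion's actual connection constructor.

```python
import os
import sqlite3
import tempfile

def open_with_safeguards(path):
    """Open a database with the PRAGMAs suggested above, then verify it.

    Raises sqlite3.DatabaseError if the integrity check reports problems
    (on a badly damaged file the check itself may raise the same error).
    """
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    conn.execute("PRAGMA busy_timeout = 5000")
    conn.execute("PRAGMA journal_mode = WAL").fetchone()
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA wal_autocheckpoint = 100").fetchone()
    conn.execute("PRAGMA journal_size_limit = 4194304").fetchone()

    # integrity_check returns a single row "ok" when nothing is wrong.
    problems = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    if problems != ["ok"]:
        conn.close()
        raise sqlite3.DatabaseError(
            "integrity_check failed: " + "; ".join(problems[:5])
        )
    return conn

# Demo on a fresh, throwaway database file.
path = os.path.join(tempfile.mkdtemp(), "fusion-demo.db")
conn = open_with_safeguards(path)
```

A full `integrity_check` reads the entire file, so on larger databases `PRAGMA quick_check` may be the more practical choice at startup.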

## Reproduction

This is a steady-state issue rather than a discrete trigger. It occurs after prolonged operation with multiple concurrent agents producing high-volume log output. Our `agentLogEntries` table reached ~29K rows before the first corruption event.

Happy to provide the corrupted DB files or any additional diagnostics if helpful.
