fix(node-ui): drop FTS5 log index that bloated node-ui.db to 9 GB #687

Closed
branarakic wants to merge 5 commits into main from fix/node-ui-drop-logs-fts

@branarakic
Contributor

Incident motivating this PR

rc.10 → rc.11 upgrade on a 12-day-old testnet edge node, May 2026:

  • node-ui.db had grown to 8.98 GB, of which ~7 GB was the FTS5 shadow tables (logs_fts_data / _idx / _docsize / _config) backing free-text log search.
  • SQLite returned "database disk image is malformed" on boot.
  • The rc.11 daemon refused to start; recovery required moving node-ui.db aside and starting over with a fresh DB.

Forensic findings

The bloat was structural, not operational. Specifically:

  1. /api/logs?q= — the only HTTP consumer of the FTS5 index — has no production wiring. The dashboard's actual log viewer (LogsTab in Operations.tsx, the live log panel in PanelBottom.tsx) reads /api/node-log, which tails daemon.log directly and supports the same q= substring filter.
  2. fetchLogs() — the client wrapper — is exported from src/ui/api.ts but imported by zero React components (verified via grep across the entire src/ui/ tree). Only its own unit test exercised it.
  3. StructuredLogger — the "drop-in Logger that also writes to SQLite" advertised in SPEC_NODE_DASHBOARD.md — is exported from index.ts but never substituted for Logger in any daemon code path. The live log capture goes through Logger.setSink → dashDb.insertLog in lifecycle.ts, which this PR preserves.
  4. prune() ran on a 90-day retention cutoff and never deleted anything from a 12-day-old DB.
  5. FTS5 indexes fragment unless optimize is run periodically, and we never ran it.
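
Finding 4 comes down to date arithmetic. The following is a minimal sketch, assuming `prune()` deletes rows whose timestamp is older than "now minus the retention window". The helper names and millisecond timestamps are illustrative and are not taken from the actual DashboardDB code.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Rows with a timestamp strictly older than this cutoff are eligible for pruning. */
function pruneCutoff(nowMs: number, retentionDays: number): number {
  return nowMs - retentionDays * MS_PER_DAY;
}

/** True when the oldest row in the DB is already past the retention window. */
function hasPrunableRows(oldestRowMs: number, nowMs: number, retentionDays: number): boolean {
  return oldestRowMs < pruneCutoff(nowMs, retentionDays);
}

const nowMs = Date.UTC(2026, 4, 31); // arbitrary reference instant
const twelveDaysOld = nowMs - 12 * MS_PER_DAY; // oldest row on the incident node
const twentyDaysOld = nowMs - 20 * MS_PER_DAY; // hypothetical older node

const incidentAt90 = hasPrunableRows(twelveDaysOld, nowMs, 90); // false: nothing to delete
const olderAt90 = hasPrunableRows(twentyDaysOld, nowMs, 90); // false: still inside the window
const olderAt14 = hasPrunableRows(twentyDaysOld, nowMs, 14); // true: rows past day 14 are pruned
```

By the same arithmetic, a 14-day window would not have pruned a 12-day-old database either. The shorter default bounds growth going forward rather than accounting for the space reclaimed on the incident node.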

What this PR does

V15 of DashboardDB removes the dead FTS5 infrastructure while preserving the one DB-backed log feature that is in production use — per-operation log correlation in /api/operations/:id (the OperationDetail panel) and the failed-ops list, both served by simple WHERE operation_id = ? queries that don't touch FTS5.

Schema migration v14 → v15

```sql
DROP TRIGGER IF EXISTS logs_ai;
DROP TRIGGER IF EXISTS logs_ad;
DROP TABLE IF EXISTS logs_fts; -- drops 4 shadow tables atomically
VACUUM; -- one-shot reclaim, try/catch-wrapped
```
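
For readers who want to see how the statements above could sit inside a version-gated migration, here is a minimal sketch. `SqlRunner`, `RecordingRunner` and `migrateToV15` are hypothetical names introduced for illustration, and bumping `user_version` before the `VACUUM` is an assumption. This is not the actual `migrate()` implementation.

```typescript
/** Minimal database surface assumed by this sketch (hypothetical, not the DashboardDB API). */
interface SqlRunner {
  exec(sql: string): void;
  getUserVersion(): number;
  setUserVersion(version: number): void;
}

const SCHEMA_VERSION = 15;

function migrateToV15(db: SqlRunner): void {
  // Same shape as the guard quoted in this PR: already-migrated DBs are left alone.
  if (db.getUserVersion() >= SCHEMA_VERSION) return;

  // Same statements, in the same order, as the SQL above.
  db.exec("DROP TRIGGER IF EXISTS logs_ai");
  db.exec("DROP TRIGGER IF EXISTS logs_ad");
  db.exec("DROP TABLE IF EXISTS logs_fts");
  db.setUserVersion(SCHEMA_VERSION); // assumption: version is bumped before VACUUM

  // Disk reclamation is best-effort: a failed VACUUM must never block startup.
  try {
    db.exec("VACUUM");
  } catch {
    // Swallowed on purpose; in this sketch the space is simply not reclaimed on this run.
  }
}

/** In-memory stand-in that records statements and can simulate a failing VACUUM. */
class RecordingRunner implements SqlRunner {
  statements: string[] = [];
  version: number;
  failOnVacuum: boolean;

  constructor(version: number, failOnVacuum: boolean) {
    this.version = version;
    this.failOnVacuum = failOnVacuum;
  }

  exec(sql: string): void {
    if (this.failOnVacuum && sql === "VACUUM") throw new Error("database is locked");
    this.statements.push(sql);
  }

  getUserVersion(): number {
    return this.version;
  }

  setUserVersion(version: number): void {
    this.version = version;
  }
}

// Usage: a V14 database whose VACUUM fails still ends up at V15.
const lockedDb = new RecordingRunner(14, true);
migrateToV15(lockedDb);
```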

Other changes

  • `DEFAULT_RETENTION_DAYS` `90 → 14`. Bounds worst-case growth of the (now FTS5-less) `logs` table to ~150 MB. Operators can override via `setRetentionDays()`; value persists in `settings`.
  • `prune()` now `VACUUM`s whenever it deletes >10k log rows, so disk is reclaimed periodically — not only at migration.
  • Remove `/api/logs` HTTP handler.
  • Remove `DashboardDB.searchLogs()` and private `searchLogsFts()`.
  • Remove `fetchLogs()` client wrapper.
  • Remove `StructuredLogger` class + its export + its test + spec/README mentions.
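
The retention and VACUUM-threshold behaviour described above can be sketched as follows. `LogStore`, `MemoryLogStore` and `pruneLogs` are hypothetical names. Only the 14-day default and the ">10k deleted rows" threshold come from this PR; everything else is an assumption made to keep the example self-contained.

```typescript
/** Minimal storage surface assumed by this sketch (hypothetical, not the DashboardDB API). */
interface LogStore {
  /** Deletes log rows older than the cutoff and returns how many were removed. */
  deleteLogsOlderThan(cutoffMs: number): number;
  vacuum(): void;
}

const MS_PER_DAY = 86_400_000;
const DEFAULT_RETENTION_DAYS = 14;
const VACUUM_THRESHOLD_ROWS = 10_000;

interface PruneResult {
  deleted: number;
  vacuumed: boolean;
}

function pruneLogs(store: LogStore, retentionDays: number, nowMs: number): PruneResult {
  const cutoffMs = nowMs - retentionDays * MS_PER_DAY;
  const deleted = store.deleteLogsOlderThan(cutoffMs);

  // Only pay for a VACUUM when enough rows were deleted to make it worthwhile.
  const vacuumed = deleted > VACUUM_THRESHOLD_ROWS;
  if (vacuumed) store.vacuum();

  return { deleted, vacuumed };
}

/** In-memory stand-in: each entry is the timestamp (ms) of one log row. */
class MemoryLogStore implements LogStore {
  timestamps: number[];
  vacuumCount = 0;

  constructor(timestamps: number[]) {
    this.timestamps = timestamps;
  }

  deleteLogsOlderThan(cutoffMs: number): number {
    const before = this.timestamps.length;
    this.timestamps = this.timestamps.filter((t) => t >= cutoffMs);
    return before - this.timestamps.length;
  }

  vacuum(): void {
    this.vacuumCount += 1;
  }
}

/** Builds `count` rows that are all `ageDays` old relative to `nowMs`. */
function rowsOfAge(count: number, ageDays: number, nowMs: number): number[] {
  const rows: number[] = [];
  for (let i = 0; i < count; i++) rows.push(nowMs - ageDays * MS_PER_DAY);
  return rows;
}

// Usage: 10,001 rows past the window plus 5 recent rows.
const pruneNowMs = Date.UTC(2026, 4, 31);
const busyStore = new MemoryLogStore(
  rowsOfAge(10_001, 30, pruneNowMs).concat(rowsOfAge(5, 1, pruneNowMs)),
);
const busyResult = pruneLogs(busyStore, DEFAULT_RETENTION_DAYS, pruneNowMs);
// busyResult.deleted === 10001, busyResult.vacuumed === true, 5 rows remain
```

The threshold is a strict "greater than", so a prune that deletes exactly 10,000 rows would skip the VACUUM in this sketch.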

What is preserved

  • The `logs` table itself + `DashboardDB.insertLog()`.
  • `Logger.setSink → dashDb.insertLog` pipeline in `lifecycle.ts` (unchanged).
  • `/api/node-log` (file-tail endpoint — the actual production log viewer).
  • `DashboardDB.getOperation()` and `getFailedOperations()` per-operation log lookup.

Tests

  • New: V14 → V15 migration regression test. Builds a realistic V14 fixture (full schema via DashboardDB, then re-attaches FTS5 virtual table + both triggers + backfills the index), reopens through DashboardDB, asserts `user_version = 15`, that all FTS5 objects + both triggers are gone, that the pre-migration log row survives, and that subsequent `insertLog()` does not trip on an orphaned trigger.
  • Removed: `searchLogs` free-text / level / time-range / pagination test cases (method is gone; per-operation lookup is still covered by the existing operation-detail tests).
  • Removed: `structured-logger.test.ts` (module deleted).
  • Updated: `ui-api-pure.test.ts` drops `fetchLogs` import + test + matching mock-server branch.
  • Updated: 3 pre-existing `user_version` pin assertions bumped 14 → 15.

Full suite for `packages/node-ui`: 794 passed, 38 skipped, 0 failed.

Storage impact

  • Immediate after upgrade: VACUUM reclaims ~99% of `node-ui.db` on nodes that accumulated the FTS5 bloat. Verified on the incident node: 8.98 GB → ~150 MB after manual VACUUM of an FTS5-stripped copy.
  • Steady-state: `logs` table bounded at `retentionDays * daily-write-volume` (~150 MB at the new 14d default for an edge node, vs unbounded growth before).
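
As a rough worked example of the steady-state bound, the sketch below derives the daily write volume implied by the "~150 MB at 14 days" figure and applies it to other retention windows. It assumes a constant daily volume. The implied per-day figure is derived from the numbers above rather than measured, and `steadyStateMb` is an illustrative helper.

```typescript
/** Upper bound on logs-table size (MB) once pruning has reached steady state. */
function steadyStateMb(retentionDays: number, dailyWriteMb: number): number {
  return retentionDays * dailyWriteMb;
}

// Daily volume implied by "~150 MB at the 14-day default" (derived, not measured).
const impliedDailyMb = 150 / 14; // about 10.7 MB per day

const boundAt14Days = steadyStateMb(14, impliedDailyMb); // about 150 MB
const boundAt90Days = steadyStateMb(90, impliedDailyMb); // about 964 MB
```

Under the same assumption, an operator who restores the previous 90-day window through `setRetentionDays()` should expect the `logs` table to settle at a little under 1 GB.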

Migration safety / rollback

  • Fresh installs no longer create `logs_fts` at any version — the V1 schema CREATE block is the canonical V15 shape.
  • In-place upgrades from any `version < 15` trigger the cleanup.
  • Rollback requires reverting this PR and manually resetting `user_version` back to 14 (the existing `if (version >= SCHEMA_VERSION) return;` guard in `migrate()` would otherwise short-circuit and never recreate the dropped objects).

Test plan

  • `pnpm --filter @origintrail-official/dkg-node-ui build` clean
  • `pnpm --filter @origintrail-official/dkg-node-ui test` — 794 passed
  • `pnpm --filter @origintrail-official/dkg build` clean (CLI transitively consumes node-ui)
  • `pnpm --filter @origintrail-official/dkg test` — same 2 pre-existing failures in `test/repro-issue-633.test.ts` (EPCIS SPARQL bindings, unrelated; verified identical failures on `origin/main`)
  • `pnpm --filter @origintrail-official/dkg-node-ui test:e2e` (suggested for reviewer; Playwright wasn't run locally)
  • Smoke test against a live testnet edge node: open dashboard, click an operation in Operations page, confirm per-op logs render; check Logs tab still tails `daemon.log`.

Made with Cursor

Production incident, rc.10 → rc.11 boundary (May 2026): a 12-day-old
testnet edge node accumulated a 9 GB node-ui.db, ~7 GB of which was
the FTS5 shadow tables (logs_fts_data/_idx/_docsize/_config) backing
free-text log search. SQLite eventually returned
"database disk image is malformed" on boot and the daemon refused to
start; recovery required moving node-ui.db aside and starting over
with a fresh DB.

Forensic findings showed the bloat was structural, not operational:

  1. /api/logs?q= — the only HTTP consumer of the FTS5 index — had no
     production wiring. The dashboard's actual log viewer (LogsTab in
     Operations.tsx, PanelBottom live log) reads /api/node-log, which
     tails daemon.log directly and supports the same q= substring
     filter.
  2. fetchLogs() — the client wrapper — was exported from src/ui/api.ts
     but never imported by any React component (verified via grep).
     Only its own unit test exercised it.
  3. StructuredLogger — the "drop-in Logger that also writes to SQLite"
     described in SPEC_NODE_DASHBOARD.md — was exported from index.ts
     but never substituted for Logger in any daemon code path. The
     live log capture went through Logger.setSink → dashDb.insertLog
     in lifecycle.ts, which is preserved.
  4. prune() ran on a 90-day retention cutoff and never deleted
     anything from a 12-day-old DB.
  5. FTS5 indexes fragment unless optimize is run periodically, and we
     never ran it.

V15 of DashboardDB cleans this up while preserving the one DB-backed
log feature that *is* in use — per-operation log correlation in
/api/operations/:id (OperationDetail panel) and the failed-ops list,
both served by simple `WHERE operation_id = ?` queries that don't
touch FTS5.

What changes
------------
* DashboardDB SCHEMA_VERSION 14 → 15.
  V15 migration: DROP TRIGGER logs_ai, DROP TRIGGER logs_ad,
  DROP TABLE logs_fts (drops 4 shadow tables atomically), then a
  one-shot VACUUM so existing nodes actually reclaim the GBs.
  VACUUM is wrapped in try/catch — it requires an exclusive lock
  and we never block startup on disk reclamation.
* DEFAULT_RETENTION_DAYS 90 → 14. Bounds worst-case growth of the
  now-FTS5-less logs table to ~150 MB. Operators who want longer
  retention can override via setRetentionDays(); the value is
  persisted in `settings` and re-read on next boot.
* prune() now VACUUMs whenever it deletes >10k log rows (well above
  test-suite noise, well below daily log volume on a busy edge node),
  so disk is reclaimed periodically — not only at migration.
* Remove /api/logs HTTP handler.
* Remove DashboardDB.searchLogs() and the private searchLogsFts().
* Remove fetchLogs() client wrapper.
* Remove StructuredLogger class + its export + its test + spec/README
  mentions (the class was dead code in production).

What is preserved
-----------------
* The `logs` table and DashboardDB.insertLog().
* The Logger.setSink → dashDb.insertLog pipeline in lifecycle.ts.
* /api/node-log file-tail endpoint (the actual production log viewer).
* DashboardDB.getOperation() and getFailedOperations() per-operation
  log lookup (the one DB-backed log feature with a UI consumer).

Migration safety
----------------
* Fresh installs no longer create logs_fts at any version (V1 schema
  CREATE block in this file is the canonical V15 shape).
* In-place upgrades from any version V<15 trigger the cleanup.
* Downgrade-safe in the sense that V14 code reading a V15-migrated DB
  will see `user_version = 15` and skip migration (the existing
  `if (version >= SCHEMA_VERSION) return;` guard short-circuits, so it
  never tries to recreate dropped objects). A rollback requires
  reverting this PR *and* manually resetting `user_version` back to 14.

Tests
-----
* New: V14 → V15 migration regression test. Builds a realistic V14
  fixture (full schema via DashboardDB, then re-attaches FTS5/triggers
  and backfills the index), reopens through DashboardDB, asserts
  user_version=15, that all FTS5 objects + both triggers are gone,
  that the pre-migration log row survives, and that subsequent
  insertLog() does not trip on an orphaned trigger.
* Removed: searchLogs free-text / level / time-range / pagination
  test cases (the method is gone; per-operation lookup is still
  covered by the operation-detail tests above).
* Removed: structured-logger.test.ts (module deleted).
* Updated: ui-api-pure.test.ts drops fetchLogs import + test +
  matching mock-server branch.
* Updated: 3 pre-existing user_version pin assertions bumped 14 → 15.

Storage impact on existing nodes
--------------------------------
* Immediate after upgrade: VACUUM reclaims ~99% of node-ui.db size on
  nodes that accumulated the FTS5 bloat (verified on the incident
  node: 8.98 GB → ~150 MB after manual VACUUM of an FTS5-stripped
  copy).
* Steady-state going forward: logs table bounded at retentionDays *
  daily-write-volume (~150 MB at 14d default for an edge node, vs
  unbounded growth before).

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic
Contributor Author

Already merged into release/rc.12 in 719ec13; reaches main via #716. Closing as superseded — the open state against main is a side effect of merging into a non-base branch.

@branarakic branarakic closed this May 31, 2026
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
Drops the FTS5 log index that bloated node-ui.db to 9 GB. Directly
addresses the Miles cleanup root cause (Track 1.3 of the SWM-fanout
plan): the 21 GB dashDb on Miles was 2.1M log rows + their FTS5 index,
not oxigraph SWM cruft. Vacuum reclaimed 11.7 GB locally; this PR
prevents the bloat from recurring on every node.