fix: prevent silent index gaps when per-record puts fail during schema migration #529
Reviewed; no blockers found.
}
// Await the last pending put. If it rejects, that's also an indexing error.
try {
	await lastResolution;
I'm starting to not like this pattern. If any of the non-last promises reject, they will be unhandled.
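One way to address this concern, sketched under assumptions: attach a rejection handler to every put as it is issued, instead of awaiting only the last one. `Put`, `backfillAll`, and `failedKeys` are hypothetical names for illustration; this is not the code in `runIndexing`.

```typescript
// Hypothetical stand-in for an index put; not the real store API.
type Put = (key: number) => Promise<void>;

interface BackfillResult {
  attempted: number;
  failedKeys: number[];
}

// Track every put, not only the last one, so no rejection goes unhandled.
async function backfillAll(keys: number[], put: Put): Promise<BackfillResult> {
  const pending: Promise<void>[] = [];
  const failedKeys: number[] = [];
  for (const key of keys) {
    // Attaching the handler immediately means the rejection is observed
    // even though this particular promise is never awaited on its own.
    pending.push(
      put(key).catch(() => {
        failedKeys.push(key);
      })
    );
  }
  // Safe to await: every promise in the list already has a handler.
  await Promise.all(pending);
  return { attempted: keys.length, failedKeys };
}
```

Holding every promise in memory may be impractical for a large backfill; a boolean flag set inside the handler would serve the same purpose without the list.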
delete attribute.lastIndexedKey;
delete attribute.indexingPID;
delete attribute.indexingFailed;
delete attribute.indexingAttempt;

attribute.lastIndexedKey = attributeDescriptor?.lastIndexedKey ?? undefined;
attribute.indexingPID = process.pid;
delete attribute.indexingFailed; // clear failure flag for the new run
delete attribute.indexingAttempt; // reset attempt counter

}
await lastResolution;
logger.warn(
	`Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +
Question: should we be avoiding em dashes in our logs? I know it's 2026, but just because we can doesn't necessarily mean we should. It could complicate log monitoring.
- `Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +
+ `Indexing of ${Table.tableName} encountered errors on some records - index will remain incomplete. ` +
Adds targeted unit tests for fragility points in the runIndexing schema-migration backfill code path in resources/databases.ts:
- F1: stale `changed` reused after re-fetch under exclusive lock (lines 1093-1133). Passes in single-thread; real exposure is two concurrent workers triggering redundant migrations.
- F2: per-record indexing errors are caught and the loop continues silently, leaving gaps in the new index while marking the migration complete. **Test FAILS: 5 of 50 rows missing after backfill** when simulating transient ERR_BUSY errors on every 10th index put. This is the most likely root cause for serent-canopy issue #135's "Resource SDK search returns subset after schema migration + rolling restart" fingerprint: under production load, the rolling-restart deploy that added `@indexed createdAt` would have run the backfill concurrently with active writes / replication; any transient errors during that window leave silent index gaps that no future restart re-detects (indexingPID/lastIndexedKey are cleared on "completion").
- F3: stale composite key from concurrent write race during reindex. Test passes in single-thread; real exposure requires actual parallel writes during a longer-running reindex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
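The F2 failure mode can be modelled in a few lines. This is a minimal sketch, not the repro in the test file: `FlakyIndex` and `naiveBackfill` are hypothetical names, and the in-memory `Set` stands in for the real index store.

```typescript
// Hypothetical in-memory index; the real index is a store dbi.
class FlakyIndex {
  readonly entries = new Set<number>();
  private calls = 0;

  // Rejects on every 10th call, mimicking a transient ERR_BUSY.
  async put(key: number): Promise<void> {
    this.calls++;
    if (this.calls % 10 === 0) throw new Error("ERR_BUSY");
    this.entries.add(key);
  }
}

// The pre-fix shape: swallow the error, continue, report completion anyway.
async function naiveBackfill(
  keys: number[],
  index: FlakyIndex
): Promise<{ complete: boolean }> {
  for (const key of keys) {
    try {
      await index.put(key);
    } catch {
      // Error swallowed; the loop moves on to the next record.
    }
  }
  return { complete: true };
}
```

With 50 keys, puts number 10, 20, 30, 40 and 50 reject, so the index ends with 45 entries while the function still reports completion, which mirrors the "5 of 50 rows missing" result.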
…a migration

The `runIndexing` backfill in resources/databases.ts had two related bugs:
1. **Silent completion on error (the primary fix)**: per-record `index.put` errors were caught and logged, and async rejections from `when()` error callbacks were also logged, but in both cases the loop continued to the next record and indexing was marked _complete_ at the end. `indexingPID` and `isIndexing` were cleared even when some records were never added to the new index. Callers of `table()` saw a "finished" index that silently dropped those records from search results, which is exactly the serent-canopy issue #135 fingerprint: `tables.X.search` returning a subset while `search_by_conditions` (which scans primaryStore) returned the full set.
2. **`await lastResolution` bypass**: if the very _last_ put in the loop was itself a rejected promise, the subsequent `await lastResolution` threw and jumped straight to the outer catch, bypassing the `hadIndexingErrors` check entirely. Fixed by wrapping `await lastResolution` in its own try-catch.

Fix: when any error is detected during backfill, do NOT clear `indexingPID` or `isIndexing` and DO set `indexingFailed = true` on the descriptor. Consequences:
- Queries continue to receive 503 "not indexed yet" rather than silently returning partial results.
- `if (attributeDescriptor?.indexingPID)` in the `table()` open-index path now also keeps `isIndexing = true` on any new dbi created by a concurrent `resetDatabases()` during the migration, so that code path stays consistent.
- Any subsequent call to `table()` (including after restart with a different PID) detects `indexingFailed = true` and re-triggers the backfill from the last saved checkpoint (`lastIndexedKey`).

Tests: `unitTests/resources/schemaMigrationFragility.test.js` gains a concrete repro for F2. The test simulates transient ERR_BUSY rejects on every 10th index put (mock on `tagIndex.put`), then verifies:
(a) search throws 503 "not indexed yet" after the partial migration, not a silent partial result set, and
(b) `resetDatabases()` (simulating restart) detects `indexingFailed` and re-triggers a clean backfill pass that indexes all records.

The companion `Object.defineProperty(attribute, 'dbi', ...)` call is made `configurable: true` so it can be updated if a future retry mechanism rewrites the dbi assignment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
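The completion logic described in this commit message might look roughly like the sketch below. The field names (`indexingPID`, `lastIndexedKey`, `indexingFailed`, `isIndexing`) come from the message; the `AttributeDescriptor` and `Dbi` shapes and the `finishBackfill` helper are assumptions for illustration, and persisting the descriptor is left out.

```typescript
// Assumed shapes; the real descriptor and dbi carry more fields.
interface AttributeDescriptor {
  indexingPID?: number;
  lastIndexedKey?: unknown;
  indexingFailed?: boolean;
}

interface Dbi {
  isIndexing: boolean;
}

// Hypothetical helper: decide what state to leave behind after the loop.
function finishBackfill(
  descriptor: AttributeDescriptor,
  dbi: Dbi,
  hadIndexingErrors: boolean
): void {
  if (hadIndexingErrors) {
    // Keep indexingPID, lastIndexedKey and isIndexing untouched so queries
    // keep being rejected and a later run can resume from the checkpoint.
    descriptor.indexingFailed = true;
    return;
  }
  // Success path: clear all in-progress markers.
  delete descriptor.lastIndexedKey;
  delete descriptor.indexingPID;
  delete descriptor.indexingFailed;
  dbi.isIndexing = false;
}
```

The point of the early return is that a failed run leaves the index looking unfinished instead of looking complete with gaps.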
9348016 to 6855df6
- Remove two orphan lines (property was never set in the final code; leftover from an abandoned in-process retry mechanism).
- Wrap in a clearer comment acknowledging the pre-existing unhandled-rejection risk for non-last puts in multi-value scenarios (out of scope for this fix).
- Replace em dash with hyphen in log message for log-monitoring compatibility.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Addressed all four review comments:
🤖 Generated by Claude Sonnet 4.6 (1M context)
Summary
Fixes a silent-data-loss bug in `runIndexing` (`resources/databases.ts`): when a per-record `index.put` fails during a schema-migration backfill, the error was caught and logged and the loop continued, eventually marking the index as complete with gaps. Queries against the new index silently returned fewer rows than `primaryStore` scans (SQL / ops API), which is exactly the serent-canopy issue #135 fingerprint.

Root causes
- `when()` error handler (line 1276) logged and continued, so the migration appeared to complete successfully with dropped records.
- `await lastResolution` bypass: if the last put in the loop was itself a rejected promise, it threw past the `hadIndexingErrors` check to the outer catch, so the fix code was never reached.

Fix
- Wrap `await lastResolution` in its own try-catch that sets `hadIndexingErrors = true`.
- When `hadIndexingErrors` is true: do NOT clear `indexingPID`, `isIndexing`, or `lastIndexedKey`. Set `indexingFailed = true` on the descriptor and persist it.
- Add `indexingFailed` to the condition in `table()` so the next call (including after restart with a new PID) detects it and re-triggers the backfill from `lastIndexedKey`.
- A dbi opened while `indexingPID` is set inherits `isIndexing = true` (companion fix for the `resetDatabases()` race during an active migration).
- Make the `Object.defineProperty(attribute, 'dbi', ...)` call `configurable: true` for forward compatibility.

Test
- `unitTests/resources/schemaMigrationFragility.test.js`: F2 test now passes; `resetDatabases()` detects `indexingFailed` and re-triggers a clean backfill.

Review attention
- `table()` condition change (lines 1106/1115): adding `|| attributeDescriptor.indexingFailed`. Verify this doesn't fire spuriously in normal operation.
- `try { await lastResolution } catch` wrapping (line ~1298): previously the outer catch caught this; the new path correctly routes it through `hadIndexingErrors`. Low risk, but worth a second look.
- The `if (attributeDescriptor?.indexingPID) dbi.isIndexing = true` line (after the migration detection block) is new. It ensures any dbi created by a concurrent `resetDatabases()` also sees `isIndexing = true` while the migration is running. Confirm it doesn't leave indexes stuck in `isIndexing` when migration was already complete (it shouldn't, since `indexingPID` is cleared in the success path before `indexing-finished` signals).

🤖 Generated with Claude Sonnet 4.6 (1M context)
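For reviewers, the two `table()` changes flagged above can be pictured with the sketch below. It is an illustration under assumptions, not the code under review: `shouldRetriggerBackfill`, `openIndex`, and `assertQueryable` are hypothetical helpers, `existingCondition` stands in for whatever `table()` already checked, and the 503 error shape is assumed from the "not indexed yet" behaviour described in the commit message.

```typescript
// Assumed shapes; only the fields discussed in this PR are modelled.
interface AttributeDescriptor {
  indexingPID?: number;
  indexingFailed?: boolean;
}

interface Dbi {
  isIndexing: boolean;
}

// Models the added `|| attributeDescriptor.indexingFailed` clause.
function shouldRetriggerBackfill(
  existingCondition: boolean,
  descriptor?: AttributeDescriptor
): boolean {
  return existingCondition || descriptor?.indexingFailed === true;
}

// Models the new line: a dbi opened mid-migration inherits isIndexing.
function openIndex(descriptor?: AttributeDescriptor): Dbi {
  const dbi: Dbi = { isIndexing: false };
  if (descriptor?.indexingPID) dbi.isIndexing = true;
  return dbi;
}

// Queries against an index that is still being built are rejected.
function assertQueryable(dbi: Dbi): void {
  if (dbi.isIndexing) {
    throw Object.assign(new Error("not indexed yet"), { statusCode: 503 });
  }
}
```

In this model a descriptor with neither `indexingFailed` nor `indexingPID` never triggers a retry and never blocks queries, which is the "no spurious firing" property the review notes ask to confirm in the real code.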