Skip to content

fix: prevent silent index gaps when per-record puts fail during schema migration#529

Merged
kriszyp merged 4 commits into
mainfrom
investigate/schema-migration-fragility
May 14, 2026
Merged

fix: prevent silent index gaps when per-record puts fail during schema migration#529
kriszyp merged 4 commits into
mainfrom
investigate/schema-migration-fragility

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 14, 2026

Summary

Fixes a silent-data-loss bug in runIndexing (resources/databases.ts): when a per-record index.put fails during a schema-migration backfill, the error was caught/logged and the loop continued, eventually marking the index as complete with gaps. Queries against the new index silently returned fewer rows than primaryStore scans (SQL / ops API), which is exactly the serent-canopy issue #135 fingerprint.

Root causes

  1. Per-record error handling: both sync catch (line 1257) and async when() error handler (line 1276) logged and continued — the migration appeared to complete successfully with dropped records.
  2. await lastResolution bypass: if the last put in the loop was itself a rejected promise, it threw past the hadIndexingErrors check to the outer catch, so the fix code was never reached.

Fix

  • Wrap await lastResolution in its own try-catch that sets hadIndexingErrors = true.
  • When hadIndexingErrors is true: do NOT clear indexingPID, isIndexing, or lastIndexedKey. Set indexingFailed = true on the descriptor and persist it.
  • Add indexingFailed to the condition in table() so the next call (including after restart with a new PID) detects it and re-triggers the backfill from lastIndexedKey.
  • Also ensure any dbi opened while indexingPID is set inherits isIndexing = true (companion fix for the resetDatabases() race during an active migration).
  • Make the Object.defineProperty(attribute, 'dbi', ...) call configurable: true for forward compatibility.

Test

unitTests/resources/schemaMigrationFragility.test.js — F2 test now passes:

  • Simulates ERR_BUSY rejections on every 10th index put.
  • Verifies search throws 503 "not indexed yet" (not silently returns 45/50 rows).
  • Verifies resetDatabases() detects indexingFailed and re-triggers a clean backfill.

Review attention

  • table() condition change (lines 1106/1115): adding || attributeDescriptor.indexingFailed — verify this doesn't fire spuriously in normal operation.
  • try { await lastResolution } catch wrapping (line ~1298): previously the outer catch caught this; the new path correctly routes it through hadIndexingErrors. Low risk, but worth a second look.
  • The if (attributeDescriptor?.indexingPID) dbi.isIndexing = true line (after the migration detection block) is new. It ensures any dbi created by a concurrent resetDatabases() also sees isIndexing = true while the migration is running. Confirm it doesn't leave indexes stuck in isIndexing when migration was already complete (it shouldn't, since indexingPID is cleared in the success path before indexing-finished signals).

🤖 Generated with Claude Sonnet 4.6 (1M context)

@kriszyp kriszyp requested a review from a team as a code owner May 14, 2026 01:37
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 14, 2026

Reviewed; no blockers found.

Comment thread resources/databases.ts
}
// Await the last pending put. If it rejects, that's also an indexing error.
try {
await lastResolution;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to not like this pattern. If any of the non-last promises reject, they will be unhandled.

Comment thread resources/databases.ts Outdated
delete attribute.lastIndexedKey;
delete attribute.indexingPID;
delete attribute.indexingFailed;
delete attribute.indexingAttempt;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used?

Comment thread resources/databases.ts Outdated
attribute.lastIndexedKey = attributeDescriptor?.lastIndexedKey ?? undefined;
attribute.indexingPID = process.pid;
delete attribute.indexingFailed; // clear failure flag for the new run
delete attribute.indexingAttempt; // reset attempt counter
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used?

Comment thread resources/databases.ts Outdated
}
await lastResolution;
logger.warn(
`Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, should be be avoiding mdashes in our logs? I know it's 2026 and just because we can doesn't necessarily mean we should. It could complicate log monitoring.

Suggested change
`Indexing of ${Table.tableName} encountered errors on some records index will remain incomplete. ` +
`Indexing of ${Table.tableName} encountered errors on some records - index will remain incomplete. ` +

kriszyp and others added 2 commits May 13, 2026 21:36
Adds targeted unit tests for fragility points in the runIndexing schema-migration
backfill code path in resources/databases.ts:

- F1: stale `changed` reused after re-fetch under exclusive lock (lines 1093-1133).
  Passes in single-thread; real exposure is two concurrent workers triggering
  redundant migrations.
- F2: per-record indexing errors are caught and the loop continues silently,
  leaving gaps in the new index while marking the migration complete.
  **Test FAILS — 5 of 50 rows missing after backfill** when simulating transient
  ERR_BUSY errors on every 10th index put. This is the most likely root cause
  for serent-canopy issue #135's "Resource SDK search returns subset after
  schema migration + rolling restart" fingerprint: under production load, the
  rolling-restart deploy that added `@indexed createdAt` would have run the
  backfill concurrently with active writes / replication; any transient errors
  during that window leave silent index gaps that no future restart re-detects
  (indexingPID/lastIndexedKey are cleared on "completion").
- F3: stale composite key from concurrent write race during reindex. Test
  passes in single-thread; real exposure requires actual parallel writes
  during a longer-running reindex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…a migration

The `runIndexing` backfill in resources/databases.ts had two related bugs:

1. **Silent completion on error (the primary fix)**: per-record `index.put` errors
   were caught and logged, and async rejections from `when()` error callbacks were
   also logged, but in both cases the loop continued to the next record and indexing
   was marked _complete_ at the end. `indexingPID` and `isIndexing` were cleared
   even when some records were never added to the new index. Callers of `table()`
   saw a "finished" index that silently dropped those records from search results —
   exactly the serent-canopy issue #135 fingerprint: `tables.X.search` returning a
   subset while `search_by_conditions` (which scans primaryStore) returned the full
   set.

2. **`await lastResolution` bypass**: if the very _last_ put in the loop was itself
   a rejected promise, the subsequent `await lastResolution` threw and jumped
   straight to the outer catch, bypassing the `hadIndexingErrors` check entirely.
   Fixed by wrapping `await lastResolution` in its own try-catch.

Fix: when any error is detected during backfill, do NOT clear `indexingPID` or
`isIndexing` and DO set `indexingFailed = true` on the descriptor. Consequences:
- Queries continue to receive 503 "not indexed yet" rather than silently returning
  partial results.
- `if (attributeDescriptor?.indexingPID)` in the `table()` open-index path now also
  keeps `isIndexing = true` on any new dbi created by a concurrent `resetDatabases()`
  during the migration, so that code path stays consistent.
- Any subsequent call to `table()` (including after restart with a different PID)
  detects `indexingFailed = true` and re-triggers the backfill from the last saved
  checkpoint (`lastIndexedKey`).

Tests: `unitTests/resources/schemaMigrationFragility.test.js` gains a concrete
repro for F2. The test simulates transient ERR_BUSY rejects on every 10th index
put (mock on `tagIndex.put`), then verifies: (a) search throws 503 "not indexed
yet" after the partial migration — not a silent partial result set, and (b)
`resetDatabases()` (simulating restart) detects `indexingFailed` and re-triggers a
clean backfill pass that indexes all records.

The companion `Object.defineProperty(attribute, 'dbi', ...)` call is made
`configurable: true` so it can be updated if a future retry mechanism rewrites
the dbi assignment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@kriszyp kriszyp force-pushed the investigate/schema-migration-fragility branch from 9348016 to 6855df6 Compare May 14, 2026 03:37
kriszyp and others added 2 commits May 13, 2026 21:57
- Remove two orphan  lines (property
  was never set in the final code — leftover from an abandoned in-process
  retry mechanism).
- Wrap  in a clearer comment acknowledging the
  pre-existing unhandled-rejection risk for non-last puts in multi-value
  scenarios (out of scope for this fix).
- Replace em dash with hyphen in log message for log-monitoring
  compatibility.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@kriszyp
Copy link
Copy Markdown
Member Author

kriszyp commented May 14, 2026

Addressed all four review comments:

  1. Line 1302 (await lastResolution): Added a comment clarifying that this try-catch specifically handles the case where lastResolution itself rejects (the last put in the loop), which would otherwise bypass hadIndexingErrors. Acknowledged that the broader unhandled-rejection risk for non-last puts in multi-value attributes is a pre-existing issue in the when() pattern and out of scope for this fix.

  2. Lines 1130 & 1341 (delete attribute.indexingAttempt): Removed both lines. The indexingAttempt property was never set anywhere in the final code — it was a leftover from an abandoned in-process retry mechanism that was replaced with the simpler indexingFailed approach.

  3. Line 1331 (em dash): Changed to - in the log message.

🤖 Generated by Claude Sonnet 4.6 (1M context)

@kriszyp kriszyp merged commit 8b1d996 into main May 14, 2026
30 of 35 checks passed
@kriszyp kriszyp deleted the investigate/schema-migration-fragility branch May 14, 2026 15:02
kriszyp added a commit that referenced this pull request May 15, 2026
- Remove two orphan  lines (property
  was never set in the final code — leftover from an abandoned in-process
  retry mechanism).
- Wrap  in a clearer comment acknowledging the
  pre-existing unhandled-rejection risk for non-last puts in multi-value
  scenarios (out of scope for this fix).
- Replace em dash with hyphen in log message for log-monitoring
  compatibility.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
kriszyp added a commit that referenced this pull request May 18, 2026
- Remove two orphan  lines (property
  was never set in the final code — leftover from an abandoned in-process
  retry mechanism).
- Wrap  in a clearer comment acknowledging the
  pre-existing unhandled-rejection risk for non-last puts in multi-value
  scenarios (out of scope for this fix).
- Replace em dash with hyphen in log message for log-monitoring
  compatibility.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants