fix: prevent silent index gaps when per-record puts fail during schema migration by kriszyp · Pull Request #529 · HarperFast/harper

kriszyp · 2026-05-14T01:37:25Z

Summary

Fixes a silent-data-loss bug in runIndexing (resources/databases.ts): when a per-record index.put fails during a schema-migration backfill, the error was caught/logged and the loop continued, eventually marking the index as complete with gaps. Queries against the new index silently returned fewer rows than primaryStore scans (SQL / ops API), which is exactly the serent-canopy issue #135 fingerprint.

Root causes

Per-record error handling: both sync catch (line 1257) and async when() error handler (line 1276) logged and continued — the migration appeared to complete successfully with dropped records.
await lastResolution bypass: if the last put in the loop was itself a rejected promise, it threw past the hadIndexingErrors check to the outer catch, so the fix code was never reached.

Fix

Wrap await lastResolution in its own try-catch that sets hadIndexingErrors = true.
When hadIndexingErrors is true: do NOT clear indexingPID, isIndexing, or lastIndexedKey. Set indexingFailed = true on the descriptor and persist it.
Add indexingFailed to the condition in table() so the next call (including after restart with a new PID) detects it and re-triggers the backfill from lastIndexedKey.
Also ensure any dbi opened while indexingPID is set inherits isIndexing = true (companion fix for the resetDatabases() race during an active migration).
Make the Object.defineProperty(attribute, 'dbi', ...) call configurable: true for forward compatibility.

Test

unitTests/resources/schemaMigrationFragility.test.js — F2 test now passes:

Simulates ERR_BUSY rejections on every 10th index put.
Verifies search throws 503 "not indexed yet" (not silently returns 45/50 rows).
Verifies resetDatabases() detects indexingFailed and re-triggers a clean backfill.

Review attention

table() condition change (lines 1106/1115): adding || attributeDescriptor.indexingFailed — verify this doesn't fire spuriously in normal operation.
try { await lastResolution } catch wrapping (line ~1298): previously the outer catch caught this; the new path correctly routes it through hadIndexingErrors. Low risk, but worth a second look.
The if (attributeDescriptor?.indexingPID) dbi.isIndexing = true line (after the migration detection block) is new. It ensures any dbi created by a concurrent resetDatabases() also sees isIndexing = true while the migration is running. Confirm it doesn't leave indexes stuck in isIndexing when migration was already complete (it shouldn't, since indexingPID is cleared in the success path before indexing-finished signals).

🤖 Generated with Claude Sonnet 4.6 (1M context)

claude · 2026-05-14T01:41:33Z

Reviewed; no blockers found.

cb1kenobi · 2026-05-14T03:30:25Z

+		}
+		// Await the last pending put. If it rejects, that's also an indexing error.
+		try {
+			await lastResolution;


I'm starting to not like this pattern. If any of the non-last promises reject, they will be unhandled.

cb1kenobi · 2026-05-14T03:32:50Z

 				delete attribute.lastIndexedKey;
 				delete attribute.indexingPID;
+				delete attribute.indexingFailed;
+				delete attribute.indexingAttempt;


Is this being used?

cb1kenobi · 2026-05-14T03:32:56Z

 							attribute.lastIndexedKey = attributeDescriptor?.lastIndexedKey ?? undefined;
 							attribute.indexingPID = process.pid;
+							delete attribute.indexingFailed; // clear failure flag for the new run
+							delete attribute.indexingAttempt; // reset attempt counter


Is this being used?

cb1kenobi · 2026-05-14T03:34:37Z

+			}
+			await lastResolution;
+			logger.warn(
+				`Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +


Question, should be be avoiding mdashes in our logs? I know it's 2026 and just because we can doesn't necessarily mean we should. It could complicate log monitoring.

Suggested change

`Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +

`Indexing of ${Table.tableName} encountered errors on some records - index will remain incomplete. ` +

Adds targeted unit tests for fragility points in the runIndexing schema-migration backfill code path in resources/databases.ts: - F1: stale `changed` reused after re-fetch under exclusive lock (lines 1093-1133). Passes in single-thread; real exposure is two concurrent workers triggering redundant migrations. - F2: per-record indexing errors are caught and the loop continues silently, leaving gaps in the new index while marking the migration complete. **Test FAILS — 5 of 50 rows missing after backfill** when simulating transient ERR_BUSY errors on every 10th index put. This is the most likely root cause for serent-canopy issue #135's "Resource SDK search returns subset after schema migration + rolling restart" fingerprint: under production load, the rolling-restart deploy that added `@indexed createdAt` would have run the backfill concurrently with active writes / replication; any transient errors during that window leave silent index gaps that no future restart re-detects (indexingPID/lastIndexedKey are cleared on "completion"). - F3: stale composite key from concurrent write race during reindex. Test passes in single-thread; real exposure requires actual parallel writes during a longer-running reindex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…a migration The `runIndexing` backfill in resources/databases.ts had two related bugs: 1. **Silent completion on error (the primary fix)**: per-record `index.put` errors were caught and logged, and async rejections from `when()` error callbacks were also logged, but in both cases the loop continued to the next record and indexing was marked _complete_ at the end. `indexingPID` and `isIndexing` were cleared even when some records were never added to the new index. Callers of `table()` saw a "finished" index that silently dropped those records from search results — exactly the serent-canopy issue #135 fingerprint: `tables.X.search` returning a subset while `search_by_conditions` (which scans primaryStore) returned the full set. 2. **`await lastResolution` bypass**: if the very _last_ put in the loop was itself a rejected promise, the subsequent `await lastResolution` threw and jumped straight to the outer catch, bypassing the `hadIndexingErrors` check entirely. Fixed by wrapping `await lastResolution` in its own try-catch. Fix: when any error is detected during backfill, do NOT clear `indexingPID` or `isIndexing` and DO set `indexingFailed = true` on the descriptor. Consequences: - Queries continue to receive 503 "not indexed yet" rather than silently returning partial results. - `if (attributeDescriptor?.indexingPID)` in the `table()` open-index path now also keeps `isIndexing = true` on any new dbi created by a concurrent `resetDatabases()` during the migration, so that code path stays consistent. - Any subsequent call to `table()` (including after restart with a different PID) detects `indexingFailed = true` and re-triggers the backfill from the last saved checkpoint (`lastIndexedKey`). Tests: `unitTests/resources/schemaMigrationFragility.test.js` gains a concrete repro for F2. The test simulates transient ERR_BUSY rejects on every 10th index put (mock on `tagIndex.put`), then verifies: (a) search throws 503 "not indexed yet" after the partial migration — not a silent partial result set, and (b) `resetDatabases()` (simulating restart) detects `indexingFailed` and re-triggers a clean backfill pass that indexes all records. The companion `Object.defineProperty(attribute, 'dbi', ...)` call is made `configurable: true` so it can be updated if a future retry mechanism rewrites the dbi assignment. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Remove two orphan lines (property was never set in the final code — leftover from an abandoned in-process retry mechanism). - Wrap in a clearer comment acknowledging the pre-existing unhandled-rejection risk for non-last puts in multi-value scenarios (out of scope for this fix). - Replace em dash with hyphen in log message for log-monitoring compatibility. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

kriszyp · 2026-05-14T04:03:14Z

Addressed all four review comments:

Line 1302 (await lastResolution): Added a comment clarifying that this try-catch specifically handles the case where lastResolution itself rejects (the last put in the loop), which would otherwise bypass hadIndexingErrors. Acknowledged that the broader unhandled-rejection risk for non-last puts in multi-value attributes is a pre-existing issue in the when() pattern and out of scope for this fix.
Lines 1130 & 1341 (delete attribute.indexingAttempt): Removed both lines. The indexingAttempt property was never set anywhere in the final code — it was a leftover from an abandoned in-process retry mechanism that was replaced with the simpler indexingFailed approach.
Line 1331 (em dash): Changed — to - in the log message.

🤖 Generated by Claude Sonnet 4.6 (1M context)

- Remove two orphan lines (property was never set in the final code — leftover from an abandoned in-process retry mechanism). - Wrap in a clearer comment acknowledging the pre-existing unhandled-rejection risk for non-last puts in multi-value scenarios (out of scope for this fix). - Replace em dash with hyphen in log message for log-monitoring compatibility. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

kriszyp requested a review from a team as a code owner May 14, 2026 01:37

cb1kenobi reviewed May 14, 2026

View reviewed changes

kriszyp and others added 2 commits May 13, 2026 21:36

kriszyp force-pushed the investigate/schema-migration-fragility branch from 9348016 to 6855df6 Compare May 14, 2026 03:37

kriszyp and others added 2 commits May 13, 2026 21:57

docs: document runIndexing schema-migration internals in DESIGN.md

20292d0

cb1kenobi approved these changes May 14, 2026

View reviewed changes

kriszyp merged commit 8b1d996 into main May 14, 2026
30 of 35 checks passed

kriszyp deleted the investigate/schema-migration-fragility branch May 14, 2026 15:02

kriszyp mentioned this pull request May 18, 2026

cherry-pick: fix: prevent silent index gaps when per-record puts fail during schema migration (PR #529) #580

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent silent index gaps when per-record puts fail during schema migration#529

fix: prevent silent index gaps when per-record puts fail during schema migration#529
kriszyp merged 4 commits into
mainfrom
investigate/schema-migration-fragility

kriszyp commented May 14, 2026

Uh oh!

claude Bot commented May 14, 2026

Uh oh!

cb1kenobi May 14, 2026

Uh oh!

cb1kenobi May 14, 2026

Uh oh!

cb1kenobi May 14, 2026

Uh oh!

cb1kenobi May 14, 2026

Uh oh!

kriszyp commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	`Indexing of ${Table.tableName} encountered errors on some records — index will remain incomplete. ` +
	`Indexing of ${Table.tableName} encountered errors on some records - index will remain incomplete. ` +

Conversation

kriszyp commented May 14, 2026

Summary

Root causes

Fix

Test

Review attention

Uh oh!

claude Bot commented May 14, 2026

Uh oh!

cb1kenobi May 14, 2026

Choose a reason for hiding this comment

Uh oh!

cb1kenobi May 14, 2026

Choose a reason for hiding this comment

Uh oh!

cb1kenobi May 14, 2026

Choose a reason for hiding this comment

Uh oh!

cb1kenobi May 14, 2026

Choose a reason for hiding this comment

Uh oh!

kriszyp commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants