Skip to content

Fix relaunch re-OCR: 1 ms tolerance on mtime fingerprint check#5

Merged
GordonBeeming merged 2 commits into
mainfrom
gb/fingerprint-mtime-tolerance
Apr 22, 2026
Merged

Fix relaunch re-OCR: 1 ms tolerance on mtime fingerprint check#5
GordonBeeming merged 2 commits into
mainfrom
gb/fingerprint-mtime-tolerance

Conversation

@GordonBeeming
Copy link
Copy Markdown
Owner

Summary

  • Indexer was re-OCR'ing ~37% of files on every relaunch. Verified on a 4,756-row production DB: 2972 already indexed, 1785 new file(s) to OCR on startup, despite no real changes to those files. Root cause is exact Date == Date equality on mtime after a SQLite round-trip — a Double ULP at year-2026 timestamps (~400 ns) falls below APFS's nanosecond mtime resolution, so a random subset of rows stopped matching themselves bit-for-bit.
  • Fix: route both Indexer call sites through a new ScreenshotStore.isAlreadyIndexed(at:mtime:size:) that compares mtime with 1 ms tolerance and size exactly. Size still catches genuine edits (byte count changes almost always), so the slack on mtime doesn't mask real changes. No schema change; the existing DB benefits immediately on next launch.
  • Regression test upserts a high-precision fractional mtime (Date(timeIntervalSince1970: 1_776_864_366.2835145)) and asserts isAlreadyIndexed returns true for the same (mtime, size), false when size differs. This is the exact shape of a filesystem mtime that was breaking equality in production.
  • The smell was already in the repo: testFingerprintReturnsLastKnown compares mtime with accuracy: 0.001 — the test tolerated what the indexer rejected.

Test plan

  • swift test --parallel — 36 tests green including new regression
  • On existing 4,756-row DB, first startup log: 4756 already indexed, 0 new file(s) to OCR (not 2972 already indexed, 1785 new)
  • Quit + relaunch — same result, deterministic
  • Add a fresh screenshot — indexer picks it up (N already indexed, 1 new)
  • Edit an existing screenshot (re-save with size change) — picked up on rescan

Verified in production against a 4,756-row index: ~1,785 files were
being re-OCR'd on every relaunch because the fingerprint guard used
exact `Date == Date`. After the round-trip
`Date → timeIntervalSince1970 → SQLite REAL → Date(timeIntervalSince1970:)`
a Double ULP at 2026 timestamps (~400 ns at 1.78e9 s) drifts the
value below APFS's ns-resolution mtime, so a random subset of rows
stopped matching themselves. The existing unit test for
`fingerprint(for:)` already compared with `accuracy: 0.001`, which
is the tolerance production needed all along.

Route both Indexer call sites through a new
`ScreenshotStore.isAlreadyIndexed(at:mtime:size:)` that applies a
1 ms mtime tolerance and an exact size match. Size still catches
real edits (which almost always change byte count), so the slack on
mtime doesn't mask genuine changes. No schema change; existing
rows benefit immediately on next launch.

Added a regression test that upserts a high-precision fractional
mtime (`Date(timeIntervalSince1970: 1_776_864_366.2835145)`) and
asserts `isAlreadyIndexed` returns `true` for the same `(mtime,
size)` and `false` when size differs.

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: GitButler <gitbutler@gitbutler.com>
@GordonBeeming GordonBeeming marked this pull request as ready for review April 22, 2026 14:18
Copilot AI review requested due to automatic review settings April 22, 2026 14:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes unnecessary re-OCR on relaunch by making the “already indexed” fingerprint check resilient to floating-point drift in mtimes after a SQLite REAL round-trip.

Changes:

  • Added ScreenshotStore.isAlreadyIndexed(at:mtime:size:) that compares mtime with ~1ms tolerance and size exactly.
  • Updated Indexer to use the new helper at both fingerprint-check call sites.
  • Added a regression test covering high-precision fractional mtimes and size mismatch behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
Sources/VistaCore/ScreenshotStore.swift Introduces tolerance-based “already indexed” check to avoid exact Date equality pitfalls.
Sources/VistaCore/Indexer.swift Routes incremental-scan and single-file skip logic through isAlreadyIndexed.
Tests/VistaCoreTests/ScreenshotStoreTests.swift Adds regression coverage for mtime precision drift across SQLite round-trips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Sources/VistaCore/ScreenshotStore.swift Outdated
Comment thread Sources/VistaCore/ScreenshotStore.swift Outdated
Comment thread Sources/VistaCore/ScreenshotStore.swift Outdated
Comment thread Tests/VistaCoreTests/ScreenshotStoreTests.swift Outdated
Four bot-review follow-ups on the mtime tolerance fix:

- Extract the 0.001 s literal into `ScreenshotStore.mtimeTolerance`
  so the value lives in one place and tests can reference it instead
  of re-declaring a magic number.
- Switch the comparison from `drift < 0.001` to
  `drift <= Self.mtimeTolerance` so the doc's "1 ms tolerance" and
  the code agree on inclusive equality at exactly 1 ms.
- Rewrite the "below APFS's ns-resolution mtime" phrase in both the
  `ScreenshotStore` doc comment and the regression test comment:
  a ~400 ns Double ULP is *coarser* than APFS's 1 ns resolution,
  which is what makes some timestamps unrepresentable as Double —
  the previous wording read backwards.
- Added a drift-outside-tolerance assertion to the regression test
  using `mtimeTolerance * 10` as the epsilon (well beyond the
  ~400 ns round-trip slop, so deterministic).

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: GitButler <gitbutler@gitbutler.com>
@GordonBeeming GordonBeeming merged commit 5a6c05a into main Apr 22, 2026
2 checks passed
@GordonBeeming GordonBeeming deleted the gb/fingerprint-mtime-tolerance branch April 22, 2026 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants