Skip to content

Defer OTA when live DB open, use native download for large DBs, and reorder app init/hydration#1533

Closed
CraigBuckmaster wants to merge 1 commit into
masterfrom
codex/investigate-app-crash-after-db-download-a2f85k
Closed

Defer OTA when live DB open, use native download for large DBs, and reorder app init/hydration#1533
CraigBuckmaster wants to merge 1 commit into
masterfrom
codex/investigate-app-crash-after-db-download-a2f85k

Conversation

@CraigBuckmaster
Copy link
Copy Markdown
Owner

Motivation

  • Prevent crashes and corruption from swapping the on-disk content DB while an app-held SQLite handle is open.
  • Avoid JS memory spikes when downloading large full DB payloads by using a native download path instead of buffering via XHR.
  • Ensure app-level stores and user DB are hydrated only after the content DB is available, and make the first-launch download flow await post-download initialization.

Description

  • Added getDbIfInitialized() and closeDatabaseConnection() in src/db/database.ts to expose the live DB handle and allow safe close-before-swap.
  • In services/ContentUpdater.ts added a guard to defer OTA apply when a live DB handle exists, introduced LARGE_DB_NATIVE_DOWNLOAD_THRESHOLD_BYTES, and selected a native File.downloadFileAsync path for large full-DB downloads while keeping XHR-based progress for small downloads.
  • Replaced memory-heavy checksum-for-full-db flow by opening the downloaded DB and verifying the db_meta content_hash and PRAGMA integrity_check before swapping the file.
  • Reworked app startup in App.tsx to introduce hydrateAppState() and to early-return to the download screen when initDatabase() reports 'needs_download', and to run the full hydration sequence after a successful post-download init.
  • Updated DbDownloadScreen.tsx so onComplete may be async and is awaited in the download flow.
  • Updated __tests__/services/ContentUpdater.test.ts to mock the live DB helper, add tests for reusing live DB handles, deferring OTA when DB is open, native-download behavior for large payloads, and integrity-check failure scenarios, and to adjust progress/error expectations accordingly.

Testing

  • Ran the unit test suite for the updater with jest (the ContentUpdater service suite and related mocks); all tests in the updated __tests__/services/ContentUpdater.test.ts passed.
  • Verified the modified startup and download flow in unit tests that exercise getInstalledVersion, checkForUpdates, downloadFullDb, and applyDelta behaviors; tests succeeded.

Codex Task

CraigBuckmaster pushed a commit that referenced this pull request Apr 20, 2026
…S5 teardown crash)

## The crash
TestFlight 1.0.7(20) segfaulted 3.5s after launch — EXC_BAD_ACCESS
on Thread 7, triggered by:

  exsqlite3_finalize
  sqlite3Fts5IndexClose
  fts5DisconnectMethod
  sqlite3VtabUnlock
  disconnectAllVtab
  sqlite3Close
  SQLiteModule.closeDatabase (expo-sqlite iOS, SQLiteModule.swift:489)

The call chain: SDK 54 expo-sqlite's closeAsync → sqlite3Close →
disconnectAllVtab → iterates every FTS5 vtab → fts5DisconnectMethod
→ fts5FreeVtab → sqlite3Fts5IndexClose → exsqlite3_finalize on a
stale/freed statement pointer → crash.

This is a teardown-path bug in FTS5 virtual-table cleanup that
manifests specifically when:
  1. The DB has FTS5 tables (scripture.db has them for verse/people search)
  2. PRAGMA integrity_check was run on the connection (which loads
     and validates every FTS5 index, leaving them in a state that
     doesn't clean up correctly)
  3. closeAsync is called on that connection

Our full-DB download verification path added by #1529 (cherry-pick
from #1531 commit 1) did exactly that: open temp DB, run content_hash
check, run integrity_check, close. The integrity_check was the
trigger.

## The fix
Remove the PRAGMA integrity_check from the full-DB post-download
verification path. Keep the content_hash check — it's sufficient
validation: the hash was computed at build time over the full file
contents, baked into db_meta at build time, and uploaded with the
DB to R2. A corrupt/truncated download is overwhelmingly likely to
fail at SQLite open (exception thrown before our read) or to miss
the db_meta row entirely.

This is less rigorous than integrity_check but avoids the teardown
crash. TODO: re-add proper integrity verification once expo-sqlite's
FTS5 teardown bug is resolved upstream, or migrate to streaming
SHA-256 during the chunked write.

## Scope
INTENTIONALLY LIMITED. The delta-apply path in applyDelta() has
the same latent PRAGMA integrity_check → close pattern (~line 234).
Left in place because:
  - Delta updates don't fire for first-launch users (the crash we're
    chasing is build-20 first-launch)
  - Delta integrity_check runs AFTER transactional SQL mutation, so
    it's more valuable as a corruption check than the post-download
    one was
  - Touching it expands scope beyond the confirmed crash site
  - If it ever DOES crash for a delta user, same fix applies; we'll
    know when we see the crash log

## Tests
- Removed 'returns failed when integrity_check fails after download'
  (asserted behavior that no longer exists)
- Removed 5 dead integrity_check mocks from full-DB tests where they
  were set up but would never be consumed now
- Delta-path tests untouched (still exercise integrity_check)
- 35 ContentUpdater tests pass (was 36, minus 1 removed)
- tsc clean, lint 0 errors

## Build budget
Build 21 will be the 2nd build on master since #1529 shipped. With
no other changes, failure mode cascade from the cascade in #1529:
  - If Alert shows on relaunch → a JS error got past the fixes.
    Handler captured it. We see it next launch.
  - If app reaches home screen → all three crash causes fixed.
    We revert the diagnostic and ship.
  - If app still crashes silently → exceptionally rare; a different
    native-layer issue we haven't hit yet.

## References
- Crash incident: 8C94684A-D8BE-45F5-B9A7-412450196AC4
- iOS Analytics log: CompanionStudy-2026-04-19-..-1.0.7(20)
- Supersedes PR #1533 (which kept the integrity_check call in place)
@CraigBuckmaster
Copy link
Copy Markdown
Owner Author

Closing — the underlying crash was in the PRAGMA integrity_check + closeAsync sequence triggering FTS5 teardown on iOS 26, not OTA timing. The fix is to remove the integrity_check call entirely (#1534), not to defer OTA around it.

The 'defer OTA apply while live DB is open' idea from this PR is still valuable for the OTA-during-runtime path (subsequent launches, not first-launch). That idea is being folded into #1532 (the DB lifecycle hardening PR) as a follow-up once build 21 confirms #1534 resolves the first-launch crash.

CraigBuckmaster pushed a commit that referenced this pull request Apr 20, 2026
…tion

## Safety valve
Previously, closeDatabaseConnection() swallowed errors from closeAsync
and always set db = null, reporting success. If the native SQLite close
actually failed (e.g., a statement still in flight), the subsequent
file-swap would then move a temp file over a DB that SQLite still held
open — leading to lock errors or, on iOS 26 with FTS5 tables, the
teardown crash that PR #1534 just fixed for a different code path.

Now closeDatabaseConnection() returns Promise<boolean>:
  - true  : close succeeded OR db was already null (safe to swap)
  - false : closeAsync threw (NOT safe to swap)

ContentUpdater.applyDelta and .downloadFullDb both check the return
value. On false, they:
  1. Log a warning
  2. Cleanup any temp files (full-DB path)
  3. Return { status: 'up_to_date' }

The next OTA check will try again. Failing an update is strictly
better than corrupting the DB file.

## Option B TODO
Added a 30-line documentation block above ContentUpdaterService
explaining the long-term architectural direction: defer-to-cold-start
instead of close-and-reopen. Rationale:
  - The ONLY fully-safe time to swap the DB file is when no SQLite
    connection exists at all.
  - That condition naturally occurs during app cold-start, before
    initDatabase() runs.
  - Current close-and-reopen creates that condition artificially mid-
    session, which is fragile (stale getDb() refs, in-flight query
    errors).
  - Defer-to-cold-start trades 'updates land immediately' for
    'updates land on next launch', which is acceptable for scholarly
    content updates.

This PR ships phase 1 (close-and-reopen with safety valve). Phase 2
(full cold-start-swap architecture) is a separate project tracked as
a TODO comment. PR #1533's original 'defer OTA when live DB open'
idea informed this direction but cannot be implemented as-is (the
DB is always open after startup, so defer-on-live-db would mean
updates never fire).

## Tests
- Added 2 focused tests covering the safety valve:
  - aborts delta apply when live DB close fails
  - aborts full DB swap when live DB close fails
  Both assert status=up_to_date AND that the swap/backup was NOT
  attempted (mockFileOps.move / copy).
- Updated mock default return values for closeDatabaseConnection
  from undefined to true (matches new signature).
- 41 ContentUpdater tests pass (was 39 + 2 new).
- Broader sweep: 30 suites / 454 tests pass.
- Typecheck clean, lint 0 errors.

## Not in scope
Cold-start-swap implementation is out of scope for this PR — that's
a separate project. This PR keeps #1532's original close-and-reopen
strategy and hardens it; it does not replace it.
@CraigBuckmaster CraigBuckmaster deleted the codex/investigate-app-crash-after-db-download-a2f85k branch April 21, 2026 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant