fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE #2637
abdulraqeeb33 wants to merge 3 commits into main
…oid SessionService NPE
Under SDK_BACKGROUND_THREADING, the public initWithContext(context, appId)
runs internalInit() on a fire-and-forget IO coroutine. The internal-only
suspend overload initWithContext(context) -- used by SyncJobService.onStartJob
-- was returning true as soon as initState was IN_PROGRESS, even though
bootstrap() had not yet populated SessionService.session. SyncJobService
would then call runBackgroundServices(), which invokes
SessionService.backgroundRun() -> endSession() and NPEs on session!!.isValid.
Fix the suspend overload to honor its documented contract ("Remain suspend
until initialization is fully completed"): when init is already in flight,
suspend on suspendCompletion.await() until it actually completes, then
return based on the final state. The public sync overload is unchanged --
host app's MainApplication.onCreate() still returns immediately, no ANR
risk re-introduced.
Also add defensive null guards to SessionService.endSession / onFocus /
onUnfocused / startTime / scheduleBackgroundRunIn so any future caller
that bypasses bootstrap() no-ops instead of crashing.
Adds a regression test (SDKInitTests) that stalls a first
initWithContextSuspend inside internalInit (via a custom
BlockingPrefsContext that signals when prefs are first touched), kicks off
a second initWithContextSuspend, and asserts that os.isInitialized is true
at the moment the second call returns. Verified to fail on main and pass
with the fix.
Co-authored-by: Cursor <cursoragent@cursor.com>
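The contract described in the commit message above can be sketched in isolation. This is a minimal, blocking analogue rather than the SDK's code: `SketchInit`, `SketchInitState`, and `initAndWait` are hypothetical names, and a JDK `CompletableFuture` stands in for the coroutine `CompletableDeferred` so the example needs only the standard library.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.TimeUnit

enum class SketchInitState { NOT_STARTED, IN_PROGRESS, SUCCESS, FAILED }

/** Hypothetical stand-in for the SDK's init bookkeeping; not the real OneSignalImp. */
class SketchInit(private val work: () -> Boolean) {
    private val lock = Any()
    private var state = SketchInitState.NOT_STARTED
    private val completion = CompletableFuture<Unit>()

    val isInitialized: Boolean
        get() = synchronized(lock) { state == SketchInitState.SUCCESS }

    /** Returns only once init has reached a terminal state, whoever started it. */
    fun initAndWait(timeoutMs: Long = 5_000): Boolean {
        val shouldRunInit = synchronized(lock) {
            val start = state == SketchInitState.NOT_STARTED
            if (start) state = SketchInitState.IN_PROGRESS
            start
        }
        if (!shouldRunInit) {
            // Re-entrant caller: wait for the in-flight run instead of returning early.
            completion.get(timeoutMs, TimeUnit.MILLISECONDS)
            return isInitialized
        }
        val ok = work()
        synchronized(lock) {
            state = if (ok) SketchInitState.SUCCESS else SketchInitState.FAILED
        }
        completion.complete(Unit) // release every waiter after the state is final
        return ok
    }
}
```

A second caller that arrives while `work` is still running blocks in `completion.get(...)` and therefore cannot observe `IN_PROGRESS` as success. The sketch deliberately omits retry and exception handling, which the later review comments address.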
Pull request overview
This PR addresses a race condition under SDK_BACKGROUND_THREADING where a re-entrant suspend initialization (initWithContextSuspend) could return before bootstrap completed, allowing background services (notably SessionService) to run with uninitialized/null internal models.
Changes:
- Updated `OneSignalImp.initWithContextSuspend` to await in-flight initialization completion before returning a result.
- Added a regression test that reproduces the re-entrant init race and asserts the suspend init does not return early.
- Added defensive null-guards in `SessionService` to avoid NPEs when invoked before `bootstrap()`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt | Makes re-entrant suspend init wait for completion instead of returning during IN_PROGRESS. |
| OneSignalSDK/onesignal/core/src/main/java/com/onesignal/session/internal/session/impl/SessionService.kt | Adds pre-bootstrap null handling to prevent crashes when background services run too early. |
| OneSignalSDK/onesignal/core/src/test/java/com/onesignal/core/internal/application/SDKInitTests.kt | Adds a regression test to catch early-return behavior during in-flight initialization. |
Comments suppressed due to low confidence (1)
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt:662
- If `internalInit(...)` throws (or any unexpected exception occurs before `notifyInitComplete()`), `initState` can remain `IN_PROGRESS` and `suspendCompletion` may never complete. With the new `suspendCompletion.await()` path, this can hang re-entrant init callers indefinitely. Wrap the init execution so `initState` is set to `FAILED` and the completion signal is completed in a `finally` block (capturing the exception into `initFailureException` as appropriate).
if (!shouldRunInit) {
// Another caller has already started (or completed) init. Honor this method's
// contract by suspending until initialization is *fully* completed -- not just
// kicked off. This closes a race where re-entrant suspend callers (e.g. the
// SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
// proceed to use IBackgroundService implementations like SessionService whose
// bootstrap() had not yet run, NPE'ing on still-null model fields.
Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
suspendCompletion.await()
return@withContext initState == InitState.SUCCESS
}
val result = internalInit(context, appId)
// initState is already set correctly in internalInit, no need to overwrite it
result
val blockingCtx = object : ContextWrapper(context) {
    override fun getSharedPreferences(name: String, mode: Int): SharedPreferences {
        started.countDown()
        trigger.await()
        return super.getSharedPreferences(name, mode)
Good catch — addressed in 524c45b. Bounded trigger.await(30, TimeUnit.SECONDS) inside the ContextWrapper.getSharedPreferences() overrides for both this test and the original in-flight init test, so a logic bug or cancelled coroutine can't leave the IO worker blocking forever and stalling the suite.
// Sanity: the second caller has not pre-empted the test by returning before
// we unblock the first caller (timing depends on lazy ServiceProvider locks).
Thread.sleep(200)

// Unblock the first caller so internalInit() can complete (state -> SUCCESS).
trigger.countDown()
Acknowledged. The 200ms sleep is intentionally testing a negative assertion — "the second caller did NOT return within window X" — which collapses to the same wall-clock semantics whether expressed as Thread.sleep, delay(), or polling, since polling for an event that should never fire is just a sleep with extra steps. The flake risk is only on the upper bound (we picked 200ms, the actual window can be much smaller), and the bounded trigger.await(30s) from 524c45b ensures the test fails-fast rather than hanging if 200ms is ever insufficient.
Happy to switch to delay() for stylistic consistency with coroutine-based tests if you'd prefer — the semantic is identical.
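To make the trade-off concrete, here is a small sketch of a negative assertion expressed as a timed latch wait, together with a bounded wait of the kind described for `trigger.await(30, TimeUnit.SECONDS)`. The helper names `didNotFireWithin` and `awaitOrThrow` are hypothetical, and throwing on timeout is an assumption of this sketch, not something the PR states.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

/**
 * Returns true if [event] did NOT fire within [windowMs].
 * When the event never fires, the timed await costs the full window,
 * which is the same wall-clock behaviour as a sleep followed by a check.
 */
fun didNotFireWithin(event: CountDownLatch, windowMs: Long): Boolean =
    !event.await(windowMs, TimeUnit.MILLISECONDS)

/** Bounded wait: throws instead of blocking forever if [trigger] is never released. */
fun awaitOrThrow(trigger: CountDownLatch, timeoutSeconds: Long = 30) {
    check(trigger.await(timeoutSeconds, TimeUnit.SECONDS)) {
        "trigger was not released within ${timeoutSeconds}s"
    }
}
```

The one practical difference from `Thread.sleep` is on the failing path: if the event does fire, the latch version returns as soon as it happens instead of sleeping out the rest of the window.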
  ) : ISessionService, IBootstrapService, IStartableService, IBackgroundService, IApplicationLifecycleHandler {
      override val startTime: Long
-         get() = session!!.startTime
+         get() = session?.startTime ?: 0L
Fixed in 524c45b. startTime now defaults to _time.currentTimeMillis when bootstrap hasn't run, so call sites computing now - startTime (IAM session-duration in InAppMessagesManager and SESSION_TIME triggers in DynamicTriggerController) see ~0ms elapsed instead of ~58 years (Jan 1970).
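The effect of the fallback choice on `now - startTime` can be shown with a tiny model. `SketchSession`, `SketchSessionService`, and `elapsedMs` are hypothetical names used only for illustration, and the injected `clock` lambda is an assumed stand-in for `_time.currentTimeMillis`.

```kotlin
/** Hypothetical session model holding only the field relevant here. */
class SketchSession(val startTime: Long)

/** Hypothetical service; [clock] is an assumed stand-in for the SDK's time source. */
class SketchSessionService(private val clock: () -> Long) {
    var session: SketchSession? = null // stays null until bootstrap has run

    // Fall back to "now" so that `now - startTime` is about 0 before bootstrap.
    // A fallback of 0L would instead make the difference equal to `now` itself.
    val startTime: Long
        get() = session?.startTime ?: clock()
}

/** The kind of computation the call sites perform. */
fun elapsedMs(service: SketchSessionService, now: Long): Long = now - service.startTime
```

With a fixed clock, `elapsedMs` is zero while `session` is null and switches to the real session age once a session is assigned.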
+ val shouldRunInit: Boolean
  synchronized(initLock) {
-     if (initState.isSDKAccessible()) {
-         Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
-         return@withContext true
+     shouldRunInit = !initState.isSDKAccessible()
+     if (shouldRunInit) {
+         initState = InitState.IN_PROGRESS
      }
  }
Addressed in 4b5d86f. suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state flips into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths to keep them consistent. Both await sites (initWithContextSuspend and waitUntilInitInternal) now local-capture the deferred under the lock so they wait on the same generation they observed.
Added regression test initWithContextSuspend resets latch on retry-after-FAILED which: forces a FAILED first init, kicks off a stalled retry, then a re-entrant suspend caller. Verified to fail on the un-hardened code with expected:<true> but was:<false> (the re-entrant caller wakes on the stale latch and reads transient IN_PROGRESS state) and pass with the fix.
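The "generation" idea can be sketched on its own: each new attempt installs a fresh latch, and waiters capture the latch reference under the lock so they never wake on an earlier attempt's completion. `GenerationLatch` and its method names are hypothetical, and `CompletableFuture` is used here as a standard-library stand-in for `CompletableDeferred`.

```kotlin
import java.util.concurrent.CompletableFuture

/** Hypothetical latch holder; CompletableFuture stands in for CompletableDeferred. */
class GenerationLatch {
    private val lock = Any()

    @Volatile
    private var completion = CompletableFuture<Unit>()

    /** Starts a new attempt and replaces the latch so old completions cannot leak in. */
    fun begin() {
        synchronized(lock) { completion = CompletableFuture() }
    }

    /** Completes the latch belonging to the current attempt. */
    fun finish() {
        val current = synchronized(lock) { completion }
        current.complete(Unit)
    }

    /** Captures the latch under the lock; callers wait on the generation they observed. */
    fun currentGeneration(): CompletableFuture<Unit> = synchronized(lock) { completion }
}
```

After a failed first attempt, `begin()` for the retry yields a latch that is not yet done, so a re-entrant caller that captures it blocks until the retry finishes rather than returning on the stale one.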
  Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)")

  // This ensures the stack trace points to the caller that triggered init, not the async worker thread.
  initFailureException = IllegalStateException("OneSignal initWithContext failed.")

  // Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations
  return withContext(runtimeIoDispatcher) {
      // do not do this again if already initialized or init is in progress
+     val shouldRunInit: Boolean
      synchronized(initLock) {
-         if (initState.isSDKAccessible()) {
-             Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
-             return@withContext true
+         shouldRunInit = !initState.isSDKAccessible()
+         if (shouldRunInit) {
+             initState = InitState.IN_PROGRESS
          }
      }
Fixed in 524c45b. initFailureException is now only assigned inside the shouldRunInit == true branch of the synchronized(initLock) block, so re-entrant callers (the SyncJobService entry point in particular) no longer overwrite the original initiator's failure-attribution stack trace.
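A minimal sketch of that rule, assuming a simplified two-state model: the attribution exception is created only by the caller that actually starts init, so later re-entrant callers cannot replace it. `FailureAttribution`, `enter`, and `callerTag` are hypothetical names for illustration.

```kotlin
/** Hypothetical holder showing who is allowed to record the failure attribution. */
class FailureAttribution {
    private val lock = Any()
    private var inFlight = false

    var initFailureException: IllegalStateException? = null
        private set

    /** Returns true if this caller started init; only then is the attribution recorded. */
    fun enter(callerTag: String): Boolean = synchronized(lock) {
        val shouldRunInit = !inFlight
        if (shouldRunInit) {
            inFlight = true
            // Created here so the stack trace points at the initiating caller.
            initFailureException = IllegalStateException("init failed (started by $callerTag)")
        }
        shouldRunInit
    }
}
```

Calling `enter("host")` and then `enter("job")` leaves the exception created for `"host"` in place, since the second call takes the re-entrant branch.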
-     initState = InitState.IN_PROGRESS
+ if (!shouldRunInit) {
+     // Another caller has already started (or completed) init. Honor this method's
+     // contract by suspending until initialization is *fully* completed -- not just
+     // kicked off. This closes a race where re-entrant suspend callers (e.g. the
+     // SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
+     // proceed to use IBackgroundService implementations like SessionService whose
+     // bootstrap() had not yet run, NPE'ing on still-null model fields.
+     Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
+     suspendCompletion.await()
+     return@withContext initState == InitState.SUCCESS
🔴 The new suspendCompletion.await() at line 656 introduced by this PR exposes two pre-existing lifecycle defects in suspendCompletion (declared as a single-shot val CompletableDeferred<Unit> at line 53). (1) Never reset on retry: InitState.isSDKAccessible() returns false for FAILED, so a retry-after-FAILED flips state back to IN_PROGRESS and re-runs internalInit — but the deferred is already permanently completed by the first failure's notifyInitComplete(), so the new await() returns instantly and return@withContext initState == InitState.SUCCESS reads transient state, silently dropping SyncJobService work (returns false while still IN_PROGRESS). (2) Never completed on throw: internalInit only calls notifyInitComplete() at three explicit return sites; a throw from initEssentials, bootstrapServices, subscribeToConfigStore, updateConfig, userSwitcher.initUser, or startupService.scheduleStart propagates out leaving the deferred uncompleted, so re-entrant SyncJobService callers now hang on await() indefinitely instead of early-returning, holding a JobService budget slot until the OS kills it. Fix: re-allocate suspendCompletion (or use a fresh signal) every time shouldRunInit flips state to IN_PROGRESS, AND wrap internalInit's body in try { … } catch (t: Throwable) { initState = FAILED; notifyInitComplete(); throw }.
Extended reasoning...
Bug summary
This PR introduces a new caller of suspendCompletion.await() in initWithContextSuspend (lines 647–657). That await() depends on two invariants the existing code does not actually guarantee, and as a result the very fix that closes the SessionService-NPE window introduces two new failure modes on adjacent paths.
`suspendCompletion` is declared once, in field-initializer position:
private val suspendCompletion = CompletableDeferred<Unit>() // line 53
A `CompletableDeferred` is single-shot: once `.complete()` is called, it stays complete for the lifetime of the `OneSignalImp` instance and every subsequent `await()` returns immediately. There is no reassignment anywhere in the file.
notifyInitComplete() fires on every terminal path of internalInit, including the FAILED paths (lines 345-347 user-locked, lines 365-367 missing appId) as well as SUCCESS (line 376). And InitState.isSDKAccessible() returns true only for IN_PROGRESS/SUCCESS — not for FAILED — so a retry after a failed init is allowed: the synchronized block at 641–646 flips FAILED → IN_PROGRESS and re-runs internalInit.
Defect 1: stale latch on retry-after-FAILED
Step-by-step proof:
1. Host calls `OneSignal.initWithContext(context)` (no appId). Public sync path under `SDK_BACKGROUND_THREADING` flips `initState` to `IN_PROGRESS` and runs `suspendifyOnIO { internalInit(context, null) }` (lines 314–320). `internalInit` hits the `resolveAppId` failure branch, sets `initState = FAILED`, and calls `notifyInitComplete()` → `suspendCompletion.complete(Unit)`. The latch is now permanently tripped.
2. Host retries with `OneSignal.initWithContext(context, "validAppId")`. State is `FAILED` → `!isSDKAccessible()` is true → state flips back to `IN_PROGRESS` → fire-and-forget `internalInit` begins bootstrapping again.
3. `SyncJobService.onStartJob` fires concurrently and calls `OneSignal.initWithContext(this)`, which routes to `initWithContextSuspend(context, null)`. In the synchronized block, `isSDKAccessible() == true` (state is `IN_PROGRESS`), so `shouldRunInit = false`.
4. The new branch at lines 648–657 calls `suspendCompletion.await()`. It returns immediately because the latch was tripped in step 1.
5. `return@withContext initState == InitState.SUCCESS` reads the transient state, almost always `IN_PROGRESS` (returns `false`). `SyncJobService.onStartJob` sees `false` and aborts before `runBackgroundServices()`, silently dropping the JobService work. If state has just flipped to `SUCCESS` between the read and the comparison, `true` is returned mid-bootstrap: exactly the contract violation this PR is trying to close, just reached via a different sequence.
The retry-after-failure pattern is not hypothetical: SDKInitTests.kt line 65–83 exercises exactly this sequence (initWithContext with no appId fails, then initWithContext with appId is called and expected to succeed). The same flaw also affects waitUntilInitInternal at line 499 (used by getServiceWithFeatureGate/loginSuspend/logoutSuspend): on retry-after-FAILED with state IN_PROGRESS, await() returns immediately, the FAILED check at line 511 passes, and the caller proceeds to call getter() against a not-yet-bootstrapped service.
Defect 2: indefinite hang if internalInit throws
internalInit (lines 336–378) only calls notifyInitComplete() at three explicit return sites. There is no try/catch/finally wrapping the body. Any throw from initEssentials, bootstrapServices (which iterates and calls bootstrap() on every IBootstrapService), otelManager.subscribeToConfigStore, resolveAppId itself, updateConfig, userSwitcher.initUser, or startupService.scheduleStart propagates up the stack with initState left at IN_PROGRESS and suspendCompletion never completed.
Pre-PR, a re-entrant suspend caller would observe isSDKAccessible() == true (because IN_PROGRESS is accessible) and early-return true. Post-PR, the new await() at line 656 hangs indefinitely. Under SDK_BACKGROUND_THREADING, the public initWithContext path runs internalInit via suspendifyOnIO — fire-and-forget — so a throw is swallowed by the coroutine scope's exception handler with no signal back to the waiter. Every concurrent re-entrant call (including SyncJobService) hangs indefinitely on await(), holding the JobService budget slot until the OS terminates the worker. This is the inverted failure mode of the bug being fixed — instead of returning too early, the JobService coroutine never returns at all.
bootstrapServices() is the most likely throw site in practice: StartupService.bootstrap() iterates every registered IBootstrapService implementation across modules, and any single bootstrap() throwing leaks the IN_PROGRESS state.
Suggested fix
Two changes are needed:
- Re-allocate `suspendCompletion` (make it a `var` and reset it, or use a fresh `CompletableDeferred` per init) every time the synchronized block flips state from a terminal value into `IN_PROGRESS`.
- Wrap `internalInit`'s body in `try { … } catch (t: Throwable) { initState = FAILED; notifyInitComplete(); throw }` (or an equivalent `finally` block) so any failure mode reliably releases waiters.
Both changes are mechanical and low-risk. The PR's own regression test (initWithContextSuspend with in-flight init waits for completion before returning) does not exercise either retry-after-FAILED or a throw inside internalInit, which is why these gaps slipped through.
Both defects addressed in 4b5d86f.
Defect 1 (stale latch on retry-after-FAILED): suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state transitions into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths. Both await sites (initWithContextSuspend and waitUntilInitInternal) local-capture the deferred under the lock so they wait on the same generation they observed. Regression test initWithContextSuspend resets latch on retry-after-FAILED reproduces the exact sequence you described (forced FAILED + stalled retry + re-entrant suspend caller); fails on the un-hardened code with expected:<true> but was:<false> and passes with the fix.
Defect 2 (indefinite hang if internalInit throws): Wrapped internalInit's body in try { … } catch (t: Throwable) { … initState = FAILED; notifyInitComplete(); return false }. The catch attaches the cause to initFailureException via addSuppressed and returns false instead of rethrowing — guaranteeing both terminal-state and waiter-release on every code path including initEssentials / bootstrapServices / subscribeToConfigStore / updateConfig / userSwitcher.initUser / startupService.scheduleStart throws. Regression test initWithContextSuspend reaches terminal state when internalInit throws reproduces by mocking AndroidUtils.isAndroidUserUnlocked to throw; fails on the un-hardened code with the throw escaping as RuntimeException and passes with the fix.
Both regression tests verified to fail-then-pass by stashing the OneSignalImp.kt change locally, running the suite, restoring, and re-running. Thanks for the thorough writeup — the suggested fixes are exactly what landed.
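The shape of the Defect 2 hardening can be sketched as follows. This is an illustration under stated assumptions, not the landed code: `GuardedInit` is a hypothetical class, `CompletableFuture` stands in for `CompletableDeferred`, and the sketch releases waiters in a `finally` block, whereas the reply above describes calling `notifyInitComplete()` inside the `catch`.

```kotlin
import java.util.concurrent.CompletableFuture

/** Hypothetical wrapper: every exit path ends in a terminal state with waiters released. */
class GuardedInit(private val body: () -> Boolean) {
    enum class State { IN_PROGRESS, SUCCESS, FAILED }

    @Volatile
    var state = State.IN_PROGRESS
        private set

    val completion = CompletableFuture<Unit>()
    val initFailureException = IllegalStateException("init failed")

    fun run(): Boolean =
        try {
            val ok = body()
            state = if (ok) State.SUCCESS else State.FAILED
            ok
        } catch (t: Throwable) {
            // Keep the original cause for diagnostics, but report failure by return value.
            initFailureException.addSuppressed(t)
            state = State.FAILED
            false
        } finally {
            // Runs on success, failure, and throw, so no waiter can hang.
            completion.complete(Unit)
        }
}
```

Catching `Throwable` also swallows `Error` subclasses; whether that is desirable is a design decision, and the sketch simply mirrors the `catch (t: Throwable)` wording used in the review thread.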
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED): suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once any init terminated -- including FAILED -- the deferred stayed permanently complete. A re-entrant suspend caller arriving DURING a subsequent retry would `await()` on the already-completed deferred, return instantly, and read transient initState (likely IN_PROGRESS -> false), silently dropping JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under `synchronized(initLock)` whenever state flips into IN_PROGRESS (both sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`). Both await sites local-capture the deferred under the lock so they wait on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws): internalInit had no try/catch wrapping its body. An unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/userSwitcher.initUser/startupService.scheduleStart would leave initState at IN_PROGRESS and suspendCompletion uncompleted forever. With the new await() path introduced in PR #2637, every re-entrant suspend caller (SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to initFailureException, set initState = FAILED, call notifyInitComplete(), and return false instead of rethrowing. Guarantees a terminal state and released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced FAILED first init, kicks off a re-entrant caller, asserts it doesn't wake on the stale latch (verified to fail on the un-hardened code with expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw inside internalInit, asserts the suspend init returns false and the SDK can retry cleanly afterwards (verified to fail on the un-hardened code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>

- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed) instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call sites computing `now - startTime` (IAM session-duration / SESSION_TIME triggers) get a sensible default rather than Jan 1970 deltas. (Copilot review comment)
- initWithContextSuspend: only assign `initFailureException` when the call actually starts init (`shouldRunInit == true`). Re-entrant callers no longer overwrite the original initiator's failure-attribution stack trace. (Copilot review comment)
- SDKInitTests: bound the `trigger.await()` inside the in-test ContextWrapper overrides to 30s so that a logic bug elsewhere in the test (or a cancelled coroutine) can't deadlock the IO worker forever and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
|
Branch unstacked from #2605 + force-pushed. The branch was previously rebased on top of #2605 ( What changed:
Local verification:
When #2605 lands on |
Description
One Line Summary
Closes a race under `SDK_BACKGROUND_THREADING` where `SyncJobService` re-entering the internal suspend `initWithContext` returned `true` before bootstrap finished, causing `SessionService.backgroundRun()` to NPE on a still-null `session` field.
Linear: SDK-4475
Details
Motivation
Production stack (only seen with `SDK_BACKGROUND_THREADING` enabled):
The race:
1. The host app calls `OneSignal.initWithContext(context, appId)`. Under the FF this flips `initState` to `IN_PROGRESS` and runs `internalInit` on a fire-and-forget IO coroutine, then returns immediately.
2. While that coroutine is still inside `internalInit` (before `bootstrapServices()` has called `SessionService.bootstrap()`), `SyncJobService.onStartJob` fires on a separate IO worker.
3. `SyncJobService` calls `OneSignal.initWithContext(this)` (the internal suspend overload, no appId).
4. `initWithContextSuspend` saw `initState.isSDKAccessible() == true` (because `IN_PROGRESS` is "accessible") and returned `true` without waiting for bootstrap to finish, violating its own documented contract: "Remain suspend until initialization is fully completed."
5. `SyncJobService` proceeded to `OneSignal.getService<IBackgroundManager>().runBackgroundServices()`, which loops through `IBackgroundService`s and calls `backgroundRun()` on `SessionService` whose `session` field was still `null`. NPE on `if (!session!!.isValid)`.

With FF off, public `initWithContext` runs `runBlocking { internalInit(...) }` synchronously, so bootstrap is always finished before init returns. The window doesn't exist.
Scope
- `OneSignalImp.initWithContextSuspend`: when init is already in flight, now suspends on `suspendCompletion.await()` until it actually completes, then returns based on the final state (`SUCCESS` → true, `FAILED` → false). Honors the documented contract.
- `initWithContext(context, appId)`: intentionally unchanged. The fire-and-forget under FF-on is the entire point of the FF (no ANR on host app's `MainApplication.onCreate()`). The wait belongs on the suspend re-entry path, where the caller (`SyncJobService`) is already on a background coroutine and depends on the SDK being fully ready for its very next line.
- `SessionService`: defensive null guards in `endSession` / `onFocus` / `onUnfocused` / `startTime` / `scheduleBackgroundRunIn` so any future caller that bypasses `bootstrap()` no-ops instead of crashing here.

Wait duration
Bounded by actual `internalInit` cost: `initEssentials` + `bootstrapServices` + `resolveAppId` + `userSwitcher.initUser`. Tens to hundreds of ms in practice, well within `JobService` budgets.
Relationship to #2605
#2605 addresses the same general class (callers acting while `IN_PROGRESS`) but at the accessor side (`getServiceWithFeatureGate` blocks on `IN_PROGRESS`). It does not close this NPE because `SyncJobService` reaches `IBackgroundManager` via raw `getService<T>()` (`IServiceProvider`), which doesn't go through the accessor gate. The two fixes are complementary; merge order is independent.
Testing
Unit testing
Added `SDKInitTests."initWithContextSuspend with in-flight init waits for completion before returning"`:
- Stalls a first `initWithContextSuspend` inside `internalInit` via a custom `BlockingPrefsContext` that signals (via `CountDownLatch`) when `getSharedPreferences` is first touched, by which point `initState` is deterministically `IN_PROGRESS`.
- Kicks off a second `initWithContextSuspend` and captures `os.isInitialized` at the exact moment that second call returns.
- Asserts `os.isInitialized == true` at return, i.e. the second call did not return until init was fully complete.

Verified locally:
- With `OneSignalImp.kt` reverted: test FAILS with `expected:<true> but was:<false>`, which is exactly the bug.
- With the fix: test passes.

Full `:OneSignal:core:testReleaseUnitTest` run: 801 tests, only the 2 unrelated pre-existing `SDKInitTests` failures remain on `main` (those are what #2605 addresses); confirmed pre-existing by stashing this PR and re-running. All `SDKInitSuspendTests` (11) and `SessionListenerTests` (3) pass.
Manual testing
Not directly reproducible at will without instrumentation (race window depends on JobService timing). Repro path is documented above; the regression test deterministically reproduces the contract violation.
Affected code checklist
Checklist
Overview
Testing
`SDKInitTests` failures remain, unrelated, addressed by fix: resolve pre-existing test failures on main #2605)
Final pass
Made with Cursor