Skip to content

fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637

Open
abdulraqeeb33 wants to merge 3 commits intomainfrom
ar/sdk-4475
Open

fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637
abdulraqeeb33 wants to merge 3 commits intomainfrom
ar/sdk-4475

Conversation

@abdulraqeeb33
Copy link
Copy Markdown
Contributor

Description

One Line Summary

Closes a race under SDK_BACKGROUND_THREADING where SyncJobService re-entering the internal suspend initWithContext returned true before bootstrap finished, causing SessionService.backgroundRun() to NPE on a still-null session field.

Linear: SDK-4475

Details

Motivation

Production stack (only seen with SDK_BACKGROUND_THREADING enabled):

java.lang.NullPointerException
    at com.onesignal.session.internal.session.impl.b.endSession(Unknown Source:2)
    at com.onesignal.session.internal.session.impl.b.backgroundRun(Unknown Source:0)
    at com.onesignal.core.internal.background.impl.a$b$a.invokeSuspend(Unknown Source:74)
    ...
    at com.onesignal.core.internal.background.impl.a.runBackgroundServices(Unknown Source:6)
    at com.onesignal.core.services.SyncJobService$a.invokeSuspend(Unknown Source:104)

The race:

  1. Host app calls OneSignal.initWithContext(context, appId). Under the FF this flips initState to IN_PROGRESS and runs internalInit on a fire-and-forget IO coroutine, then returns immediately.
  2. While that coroutine is still inside internalInit (before bootstrapServices() has called SessionService.bootstrap()), SyncJobService.onStartJob fires on a separate IO worker.
  3. SyncJobService calls OneSignal.initWithContext(this) (the internal suspend overload, no-appId).
  4. initWithContextSuspend saw initState.isSDKAccessible() == true (because IN_PROGRESS is "accessible") and returned true without waiting for bootstrap to finish — violating its own documented contract: "Remain suspend until initialization is fully completed."
  5. SyncJobService proceeded to OneSignal.getService<IBackgroundManager>().runBackgroundServices(), which loops through IBackgroundServices and calls backgroundRun() on SessionService whose session field was still null. NPE on if (!session!!.isValid).

With FF off, public initWithContext runs runBlocking { internalInit(...) } synchronously — bootstrap is always finished before init returns. The window doesn't exist.

Scope

  • OneSignalImp.initWithContextSuspend — when init is already in flight, now suspends on suspendCompletion.await() until it actually completes, then returns based on the final state (SUCCESS → true, FAILED → false). Honors the documented contract.
  • Public sync initWithContext(context, appId) — intentionally unchanged. The fire-and-forget under FF-on is the entire point of the FF (no ANR on host app's MainApplication.onCreate()). The wait belongs on the suspend re-entry path, where the caller (SyncJobService) is already on a background coroutine and depends on the SDK being fully ready for its very next line.
  • SessionService — defensive null guards in endSession / onFocus / onUnfocused / startTime / scheduleBackgroundRunIn so any future caller that bypasses bootstrap() no-ops instead of crashing here.

Wait duration

Bounded by actual internalInit cost — initEssentials + bootstrapServices + resolveAppId + userSwitcher.initUser. Tens to hundreds of ms in practice, well within JobService budgets.

Relationship to #2605

#2605 addresses the same general class (callers acting while IN_PROGRESS) but at the accessor side (getServiceWithFeatureGate blocks on IN_PROGRESS). It does not close this NPE because SyncJobService reaches IBackgroundManager via raw getService<T>() (IServiceProvider), which doesn't go through the accessor gate. The two fixes are complementary; merge order is independent.

Testing

Unit testing

Added SDKInitTests."initWithContextSuspend with in-flight init waits for completion before returning":

  • Stalls a first initWithContextSuspend inside internalInit via a custom BlockingPrefsContext that signals (via CountDownLatch) when getSharedPreferences is first touched — by which point initState is deterministically IN_PROGRESS.
  • Kicks off a second initWithContextSuspend and captures os.isInitialized at the exact moment that second call returns.
  • Asserts os.isInitialized == true at return — i.e. the second call did not return until init was fully complete.

Verified locally:

  • Without the fix (OneSignalImp.kt reverted): test FAILS with expected:<true> but was:<false> — exactly catches the bug.
  • With the fix: test PASSES.

Full :OneSignal:core:testReleaseUnitTest run: 801 tests, only the 2 unrelated pre-existing SDKInitTests failures remain on main (those are what #2605 addresses); confirmed pre-existing by stashing this PR and re-running. All SDKInitSuspendTests (11) and SessionListenerTests (3) pass.

Manual testing

Not directly reproducible at will without instrumentation (race window depends on JobService timing). Repro path is documented above; the regression test deterministically reproduces the contract violation.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs (no public API changes)

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible (only the 2 pre-existing SDKInitTests failures remain — unrelated, addressed by fix: resolve pre-existing test failures on main #2605)
  • I have personally tested this on my device, or explained why that is not possible — race window depends on JobService timing; the regression test deterministically reproduces the contract violation instead

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

Made with Cursor

…oid SessionService NPE

Under SDK_BACKGROUND_THREADING, the public initWithContext(context, appId)
runs internalInit() on a fire-and-forget IO coroutine. The internal-only
suspend overload initWithContext(context) -- used by SyncJobService.onStartJob
-- was returning true as soon as initState was IN_PROGRESS, even though
bootstrap() had not yet populated SessionService.session. SyncJobService
would then call runBackgroundServices(), which invokes
SessionService.backgroundRun() -> endSession() and NPEs on session!!.isValid.

Fix the suspend overload to honor its documented contract ("Remain suspend
until initialization is fully completed"): when init is already in flight,
suspend on suspendCompletion.await() until it actually completes, then
return based on the final state. The public sync overload is unchanged --
host app's MainApplication.onCreate() still returns immediately, no ANR
risk re-introduced.

Also add defensive null guards to SessionService.endSession / onFocus /
onUnfocused / startTime / scheduleBackgroundRunIn so any future caller
that bypasses bootstrap() no-ops instead of crashing.

Adds a regression test (SDKInitTests) that stalls a first
initWithContextSuspend inside internalInit (via a custom
BlockingPrefsContext that signals when prefs are first touched), kicks off
a second initWithContextSuspend, and asserts that os.isInitialized is true
at the moment the second call returns. Verified to fail on main and pass
with the fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 18:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition under SDK_BACKGROUND_THREADING where a re-entrant suspend initialization (initWithContextSuspend) could return before bootstrap completed, allowing background services (notably SessionService) to run with uninitialized/null internal models.

Changes:

  • Updated OneSignalImp.initWithContextSuspend to await in-flight initialization completion before returning a result.
  • Added a regression test that reproduces the re-entrant init race and asserts the suspend init does not return early.
  • Added defensive null-guards in SessionService to avoid NPEs when invoked before bootstrap().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt Makes re-entrant suspend init wait for completion instead of returning during IN_PROGRESS.
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/session/internal/session/impl/SessionService.kt Adds pre-bootstrap null handling to prevent crashes when background services run too early.
OneSignalSDK/onesignal/core/src/test/java/com/onesignal/core/internal/application/SDKInitTests.kt Adds a regression test to catch early-return behavior during in-flight initialization.
Comments suppressed due to low confidence (1)

OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt:662

  • If internalInit(...) throws (or any unexpected exception occurs before notifyInitComplete()), initState can remain IN_PROGRESS and suspendCompletion may never complete. With the new suspendCompletion.await() path, this can hang re-entrant init callers indefinitely. Wrap the init execution so initState is set to FAILED and the completion signal is completed in a finally block (capturing the exception into initFailureException as appropriate).
            if (!shouldRunInit) {
                // Another caller has already started (or completed) init. Honor this method's
                // contract by suspending until initialization is *fully* completed -- not just
                // kicked off. This closes a race where re-entrant suspend callers (e.g. the
                // SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
                // proceed to use IBackgroundService implementations like SessionService whose
                // bootstrap() had not yet run, NPE'ing on still-null model fields.
                Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
                suspendCompletion.await()
                return@withContext initState == InitState.SUCCESS
            }

            val result = internalInit(context, appId)
            // initState is already set correctly in internalInit, no need to overwrite it
            result

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +512 to +516
val blockingCtx = object : ContextWrapper(context) {
override fun getSharedPreferences(name: String, mode: Int): SharedPreferences {
started.countDown()
trigger.await()
return super.getSharedPreferences(name, mode)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in 524c45b. Bounded trigger.await(30, TimeUnit.SECONDS) inside the ContextWrapper.getSharedPreferences() overrides for both this test and the original in-flight init test, so a logic bug or cancelled coroutine can't leave the IO worker blocking forever and stalling the suite.

Comment on lines +547 to +552
// Sanity: the second caller has not pre-empted the test by returning before
// we unblock the first caller (timing depends on lazy ServiceProvider locks).
Thread.sleep(200)

// Unblock the first caller so internalInit() can complete (state -> SUCCESS).
trigger.countDown()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The 200ms sleep is intentionally testing a negative assertion — "the second caller did NOT return within window X" — which collapses to the same wall-clock semantics whether expressed as Thread.sleep, delay(), or polling, since polling for an event that should never fire is just a sleep with extra steps. The flake risk is only on the upper bound (we picked 200ms, the actual window can be much smaller), and the bounded trigger.await(30s) from 524c45b ensures the test fails-fast rather than hanging if 200ms is ever insufficient.

Happy to switch to delay() for stylistic consistency with coroutine-based tests if you'd prefer — the semantic is identical.

) : ISessionService, IBootstrapService, IStartableService, IBackgroundService, IApplicationLifecycleHandler {
override val startTime: Long
get() = session!!.startTime
get() = session?.startTime ?: 0L
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 524c45b. startTime now defaults to _time.currentTimeMillis when bootstrap hasn't run, so call sites computing now - startTime (IAM session-duration in InAppMessagesManager and SESSION_TIME triggers in DynamicTriggerController) see ~0ms elapsed instead of ~58 years (Jan 1970).

Comment on lines +640 to +646
val shouldRunInit: Boolean
synchronized(initLock) {
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return@withContext true
shouldRunInit = !initState.isSDKAccessible()
if (shouldRunInit) {
initState = InitState.IN_PROGRESS
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4b5d86f. suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state flips into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths to keep them consistent. Both await sites (initWithContextSuspend and waitUntilInitInternal) now local-capture the deferred under the lock so they wait on the same generation they observed.

Added regression test initWithContextSuspend resets latch on retry-after-FAILED which: forces a FAILED first init, kicks off a stalled retry, then a re-entrant suspend caller. Verified to fail on the un-hardened code with expected:<true> but was:<false> (the re-entrant caller wakes on the stale latch and reads transient IN_PROGRESS state) and pass with the fix.

Comment on lines 633 to +646
Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)")

// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
initFailureException = IllegalStateException("OneSignal initWithContext failed.")

// Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations
return withContext(runtimeIoDispatcher) {
// do not do this again if already initialized or init is in progress
val shouldRunInit: Boolean
synchronized(initLock) {
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return@withContext true
shouldRunInit = !initState.isSDKAccessible()
if (shouldRunInit) {
initState = InitState.IN_PROGRESS
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 524c45b. initFailureException is now only assigned inside the shouldRunInit == true branch of the synchronized(initLock) block, so re-entrant callers (the SyncJobService entry point in particular) no longer overwrite the original initiator's failure-attribution stack trace.

Comment on lines 647 to +657

initState = InitState.IN_PROGRESS
if (!shouldRunInit) {
// Another caller has already started (or completed) init. Honor this method's
// contract by suspending until initialization is *fully* completed -- not just
// kicked off. This closes a race where re-entrant suspend callers (e.g. the
// SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
// proceed to use IBackgroundService implementations like SessionService whose
// bootstrap() had not yet run, NPE'ing on still-null model fields.
Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
suspendCompletion.await()
return@withContext initState == InitState.SUCCESS
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new suspendCompletion.await() at line 656 introduced by this PR exposes two pre-existing lifecycle defects in suspendCompletion (declared as a single-shot val CompletableDeferred<Unit> at line 53). (1) Never reset on retry: InitState.isSDKAccessible() returns false for FAILED, so a retry-after-FAILED flips state back to IN_PROGRESS and re-runs internalInit — but the deferred is already permanently completed by the first failure's notifyInitComplete(), so the new await() returns instantly and return@withContext initState == InitState.SUCCESS reads transient state, silently dropping SyncJobService work (returns false while still IN_PROGRESS). (2) Never completed on throw: internalInit only calls notifyInitComplete() at three explicit return sites; a throw from initEssentials, bootstrapServices, subscribeToConfigStore, updateConfig, userSwitcher.initUser, or startupService.scheduleStart propagates out leaving the deferred uncompleted, so re-entrant SyncJobService callers now hang on await() indefinitely instead of early-returning, holding a JobService budget slot until the OS kills it. Fix: re-allocate suspendCompletion (or use a fresh signal) every time shouldRunInit flips state to IN_PROGRESS, AND wrap internalInit's body in try { … } catch (t: Throwable) { initState = FAILED; notifyInitComplete(); throw }.

Extended reasoning...

Bug summary

This PR introduces a new caller of suspendCompletion.await() in initWithContextSuspend (lines 647–657). That await() depends on two invariants the existing code does not actually guarantee, and as a result the very fix that closes the SessionService-NPE window introduces two new failure modes on adjacent paths.

suspendCompletion is declared once, in field-initializer position:

private val suspendCompletion = CompletableDeferred<Unit>()  // line 53

A CompletableDeferred is single-shot — once .complete() is called, it stays complete for the lifetime of the OneSignalImp instance and every subsequent await() returns immediately. There is no reassignment anywhere in the file.

notifyInitComplete() fires on every terminal path of internalInit, including the FAILED paths (lines 345-347 user-locked, lines 365-367 missing appId) as well as SUCCESS (line 376). And InitState.isSDKAccessible() returns true only for IN_PROGRESS/SUCCESSnot for FAILED — so a retry after a failed init is allowed: the synchronized block at 641–646 flips FAILED → IN_PROGRESS and re-runs internalInit.

Defect 1: stale latch on retry-after-FAILED

Step-by-step proof:

  1. Host calls OneSignal.initWithContext(context) (no appId). Public sync path under SDK_BACKGROUND_THREADING flips initState to IN_PROGRESS and suspendifyOnIO { internalInit(context, null) } (lines 314–320). internalInit hits the resolveAppId failure branch, sets initState = FAILED, and calls notifyInitComplete()suspendCompletion.complete(Unit). The latch is now permanently tripped.
  2. Host retries with OneSignal.initWithContext(context, "validAppId"). State is FAILED!isSDKAccessible() is true → state flips back to IN_PROGRESS → fire-and-forget internalInit begins bootstrapping again.
  3. SyncJobService.onStartJob fires concurrently and calls OneSignal.initWithContext(this), which routes to initWithContextSuspend(context, null). In the synchronized block, isSDKAccessible() == true (state is IN_PROGRESS), so shouldRunInit = false.
  4. The new branch at lines 648–657 calls suspendCompletion.await(). It returns immediately because the latch was tripped in step 1.
  5. return@withContext initState == InitState.SUCCESS reads the transient state — almost always IN_PROGRESS (returns false). SyncJobService.onStartJob sees false and aborts before runBackgroundServices(), silently dropping the JobService work. If state has just flipped to SUCCESS between the read and the comparison, true is returned mid-bootstrap — exactly the contract violation this PR is trying to close, just reached via a different sequence.

The retry-after-failure pattern is not hypothetical: SDKInitTests.kt line 65–83 exercises exactly this sequence (initWithContext with no appId fails, then initWithContext with appId is called and expected to succeed). The same flaw also affects waitUntilInitInternal at line 499 (used by getServiceWithFeatureGate/loginSuspend/logoutSuspend): on retry-after-FAILED with state IN_PROGRESS, await() returns immediately, the FAILED check at line 511 passes, and the caller proceeds to call getter() against a not-yet-bootstrapped service.

Defect 2: indefinite hang if internalInit throws

internalInit (lines 336–378) only calls notifyInitComplete() at three explicit return sites. There is no try/catch/finally wrapping the body. Any throw from initEssentials, bootstrapServices (which iterates and calls bootstrap() on every IBootstrapService), otelManager.subscribeToConfigStore, resolveAppId itself, updateConfig, userSwitcher.initUser, or startupService.scheduleStart propagates up the stack with initState left at IN_PROGRESS and suspendCompletion never completed.

Pre-PR, a re-entrant suspend caller would observe isSDKAccessible() == true (because IN_PROGRESS is accessible) and early-return true. Post-PR, the new await() at line 656 hangs indefinitely. Under SDK_BACKGROUND_THREADING, the public initWithContext path runs internalInit via suspendifyOnIO — fire-and-forget — so a throw is swallowed by the coroutine scope's exception handler with no signal back to the waiter. Every concurrent re-entrant call (including SyncJobService) hangs indefinitely on await(), holding the JobService budget slot until the OS terminates the worker. This is the inverted failure mode of the bug being fixed — instead of returning too early, the JobService coroutine never returns at all.

bootstrapServices() is the most likely throw site in practice: StartupService.bootstrap() iterates every registered IBootstrapService implementation across modules, and any single bootstrap() throwing leaks the IN_PROGRESS state.

Suggested fix

Two changes are needed:

  1. Re-allocate suspendCompletion (make it a var and reset it, or use a fresh CompletableDeferred per init) every time the synchronized block flips state from a terminal value into IN_PROGRESS.
  2. Wrap internalInit's body in try { … } catch (t: Throwable) { initState = FAILED; notifyInitComplete(); throw } (or an equivalent finally block) so any failure mode reliably releases waiters.

Both changes are mechanical and low-risk. The PR's own regression test (initWithContextSuspend with in-flight init waits for completion before returning) does not exercise either retry-after-FAILED or a throw inside internalInit, which is why these gaps slipped through.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both defects addressed in 4b5d86f.

Defect 1 (stale latch on retry-after-FAILED): suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state transitions into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths. Both await sites (initWithContextSuspend and waitUntilInitInternal) local-capture the deferred under the lock so they wait on the same generation they observed. Regression test initWithContextSuspend resets latch on retry-after-FAILED reproduces the exact sequence you described (forced FAILED + stalled retry + re-entrant suspend caller); fails on the un-hardened code with expected:<true> but was:<false> and passes with the fix.

Defect 2 (indefinite hang if internalInit throws): Wrapped internalInit's body in try { … } catch (t: Throwable) { … initState = FAILED; notifyInitComplete(); return false }. The catch attaches the cause to initFailureException via addSuppressed and returns false instead of rethrowing — guaranteeing both terminal-state and waiter-release on every code path including initEssentials / bootstrapServices / subscribeToConfigStore / updateConfig / userSwitcher.initUser / startupService.scheduleStart throws. Regression test initWithContextSuspend reaches terminal state when internalInit throws reproduces by mocking AndroidUtils.isAndroidUserUnlocked to throw; fails on the un-hardened code with the throw escaping as RuntimeException and passes with the fix.

Both regression tests verified to fail-then-pass by stashing the OneSignalImp.kt change locally, running the suite, restoring, and re-running. Thanks for the thorough writeup — the suggested fixes are exactly what landed.

abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
@abdulraqeeb33 abdulraqeeb33 changed the base branch from main to fix/pre-existing-test-failures May 7, 2026 15:09
AR Abdul Azeez and others added 2 commits May 7, 2026 11:34
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
@abdulraqeeb33 abdulraqeeb33 changed the base branch from fix/pre-existing-test-failures to main May 7, 2026 15:36
@abdulraqeeb33
Copy link
Copy Markdown
Contributor Author

Branch unstacked from #2605 + force-pushed.

The branch was previously rebased on top of #2605 (fix/pre-existing-test-failures), which pulled #2605's detekt violations into this PR's CI surface (12 weighted issues vs the maxIssues: 10 threshold). The two PRs don't actually depend on each other code-wise — they fix complementary problems at different layers — so I rebased back onto main directly.

What changed:

  • Base branch: fix/pre-existing-test-failuresmain.
  • Commits force-pushed:
  • Added "ReturnCount" to the @Suppress on internalInit so detekt doesn't trip on the baseline-ID mismatch caused by the @Suppress("TooGenericExceptionCaught") annotation. (The existing internalInit ReturnCount baseline entry was for the unannotated signature; once we added the annotation the ID no longer matched.)

Local verification:

  • :OneSignal:core:detekt → BUILD SUCCESSFUL (5 informational findings, all pre-existing on main, well below the maxIssues: 10 threshold).
  • :OneSignal:core:testReleaseUnitTest --tests SDKInitTests → 16/18 pass; the 2 failures are the same pre-existing main failures that fix: resolve pre-existing test failures on main #2605 is fixing (accessor instances after initWithContext without appID and initWithContext with appId does not block), unrelated to this PR.
  • All 3 regression tests in this PR pass.

When #2605 lands on main, this PR will need a future rebase to absorb the overlap in OneSignalImp.kt — that rebase will likely have conflicts since both PRs heavily modify internalInit / initWithContext. Happy to handle that once #2605 merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants