Skip to content

Conversation

@teskje
Copy link
Contributor

@teskje teskje commented Mar 27, 2025

Previously the code initializing the LD client would correctly await initialized_async to see if the initialization succeeded. However, if it didn't succeed it would simply wait a bit and then call initialized_async again. Reading the LD server sdk code, there is no reason to assume that the call would return something different if repeated.

This PR changes the logic to call start_with_default_executor again when initialized_async reports failure, to attempt a new initialization. It also moves to the mz-ore Retry type, instead of implementing manual retry logic.

Motivation

  • This PR fixes a previously unreported bug.

If initializing the LD client fails the first time, the code gets stuck forever in a retry loop.

Tips for reviewer

I only stumbled over this when reading the code and I might be missing something. Please check my work!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje teskje marked this pull request as ready for review March 28, 2025 10:11
@teskje teskje requested a review from a team as a code owner March 28, 2025 10:11
@teskje teskje requested a review from ParkMyCar March 28, 2025 10:11
Comment on lines 73 to 75
if tokio::time::timeout(config_sync_timeout, init)
.await
.is_err()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout error handling here also looks very suspicious to me. We just log an INFO event, but the LD sync might be broken. Shouldn't we at least return an error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it should be an error

Previously the code initializing the LD client would correctly await
`initialized_async` to see if the initialization succeeded. However,
if it didn't succeed it would simply wait a bit and then call
`initialized_async` again. Reading the LD server sdk code, there is no
reason to assume that the call would return something different if
repeated.

This commit changes the logic to call `start_with_default_executor`
again when `initialized_async` reports failure, to attempt a new
initialization. It also moves to the mz-ore `Retry` type, instead of
implementing manual retry logic.
Copy link
Contributor

@ParkMyCar ParkMyCar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on this one @teskje, thanks for making the change though!

Comment on lines 73 to 75
if tokio::time::timeout(config_sync_timeout, init)
.await
.is_err()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 it should be an error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants