Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve handling of futures and threads during refresh. #1573

Merged
merged 2 commits into from
Oct 10, 2023

Conversation

hessjcg
Copy link
Collaborator

@hessjcg hessjcg commented Oct 3, 2023

Rewrite the performRefresh() as a chain of task futures from the ListeningScheduledExecutorService. Now, tasks
submitted to the ListeningScheduledExecutorService never block on another task submitted to the ListeningScheduledExecutorService.

This should fix a category of bugs that show up in exceptions and logs as "connection timed out" or "refresh failed"
or "bad client certificate". These exceptions can occur when the credentials fail to refresh.

This is the underlying bug: The ListeningScheduledExecutorService gets into a state where all its threads are busy
running tasks, all running tasks are blocked waiting for recently submitted task to complete, and the recently
submitted tasks can't start because there are no available threads in the ListeningScheduledExecutorService.

This changes the behavior of CloudSqlInstance.getInstanceData() and CloudSqlInstance.startRefreshAttempt()
in ways that have a very small possibility of destabilizing customer applications.

In version 1.14.1 and earlier: CloudSqlInstance.getInstanceData() behaved like this: When no refresh attempt is
in progress, returns immediately. Otherwise, blocks application thread until the current refresh attempt finishes.
If the refresh attempt succeeds, this returns the InstanceData. If not, this throws a RuntimeException, while a
new refresh attempt is submitted to the executor in the background.

@hessjcg hessjcg force-pushed the fix-refresh-futures branch 2 times, most recently from 50a8be0 to 4499923 Compare October 4, 2023 20:27
@hessjcg hessjcg force-pushed the fix-refresh-timeouts branch 2 times, most recently from b064bd6 to 1002788 Compare October 4, 2023 20:48
@github-actions github-actions bot removed the size: l label Oct 4, 2023
@github-actions github-actions bot removed the size: l label Oct 4, 2023
// If the currentInstanceData has expired, then force refresh (which will balk if a refresh
// is already running) and make this and future requests to getInstanceData wait on the
// refresh operation to complete.
if (instanceDataFuture.isDone()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this break our recent ZDT changes, though?

// If the currentInstanceData has expired, then force refresh (which will balk if a refresh
// is already running) and make this and future requests to getInstanceData wait on the
// refresh operation to complete.
if (instanceDataFuture.isDone()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we still think this is important, let's pull it out into a separate PR. It seems like a separate concern from the bigger threading fixes here.

@hessjcg hessjcg force-pushed the fix-refresh-futures branch 2 times, most recently from 51a5650 to d6ccb7e Compare October 9, 2023 20:20
@hessjcg hessjcg requested a review from enocom October 9, 2023 20:24
@hessjcg
Copy link
Collaborator Author

hessjcg commented Oct 9, 2023

New PR #1600 makes sure that CloudSqlInstance.getInstanceData() never returns expired data, but instead blocks
until it can return valid InstanceData.


AtomicInteger refreshCount = new AtomicInteger();
final PauseCondition badRequest1 = new PauseCondition();
final PauseCondition badRequest2 = new PauseCondition();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why two bad requests? Could we write this test with just one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I want to ensure that the retry is working - that the chain of futures is being built correctly even after the first failure. Early iterations of this PR had a bug where it would retry after one failure, but stop after the second failure.

badRequest2.proceed();
badRequest2.waitForCondition(() -> refreshCount.get() == 3, 2000);

// Allow the third bad request to complete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

// Once rate limiter is done, attempt to getInstanceData.
ListenableFuture<InstanceData> dataFuture =
Futures.whenAllComplete(rateLimit)
.callAsync(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question: why callAsync and not transformAsync?

@enocom
Copy link
Member

enocom commented Oct 10, 2023

For my own reference, this is a recreation of #1457.

@hessjcg hessjcg enabled auto-merge (squash) October 10, 2023 20:37
@hessjcg hessjcg merged commit f3458a6 into main Oct 10, 2023
21 checks passed
@hessjcg hessjcg deleted the fix-refresh-futures branch October 10, 2023 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants