fix: Improve handling of futures and threads during refresh. #1573

hessjcg · 2023-10-03T19:25:36Z

Rewrite performRefresh() as a chain of futures on the ListeningScheduledExecutorService. With this change, a task
submitted to the ListeningScheduledExecutorService never blocks on another task submitted to the same executor.

This should fix a category of bugs that show up in exceptions and logs as "connection timed out" or "refresh failed"
or "bad client certificate". These exceptions can occur when the credentials fail to refresh.

This is the underlying bug: the ListeningScheduledExecutorService gets into a state where all of its threads are busy
running tasks, every running task is blocked waiting for a recently submitted task to complete, and the recently
submitted tasks can't start because the ListeningScheduledExecutorService has no available threads.
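
The starvation described above can be reproduced in a few lines. The following is a minimal sketch, not the code from this PR: it uses the JDK's `CompletableFuture` and a plain `ExecutorService` instead of Guava's `ListenableFuture` and `ListeningScheduledExecutorService`, and the class name, method names, pool size of 1, and timeout values are all assumptions chosen to keep the illustration short. `blockingStyle()` shows a task waiting on another task in the same pool; `chainedStyle()` shows the same two steps expressed as a chain, so no pool thread ever waits.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; none of these names come from the connector's code base.
final class StarvationSketch {

  // Blocking style: the outer task occupies the pool's only thread while it waits
  // for an inner task that was submitted to the same pool.
  public static String blockingStyle() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      Future<String> outer =
          pool.submit(
              () -> {
                Future<String> inner = pool.submit(() -> "refreshed");
                try {
                  // Without a timeout this wait would never end; the timeout keeps the sketch finite.
                  return inner.get(200, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                  return "starved";
                }
              });
      return outer.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  // Chained style: the second step is attached as a continuation, so the first task
  // returns its thread to the pool before the second task needs one.
  public static String chainedStyle() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      CompletableFuture<String> chained =
          CompletableFuture.supplyAsync(() -> "certificate", pool)
              .thenComposeAsync(
                  cert -> CompletableFuture.supplyAsync(() -> cert + "+metadata", pool), pool);
      // Only the calling (application) thread blocks here, never a pool thread.
      return chained.get(2, TimeUnit.SECONDS);
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("blocking style: " + blockingStyle());
    System.out.println("chained style:  " + chainedStyle());
  }
}
```

With a larger pool the blocking style only fails once every thread is occupied by a waiting task, which is consistent with the intermittent timeouts and refresh failures described earlier.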

This changes the behavior of CloudSqlInstance.getInstanceData() and CloudSqlInstance.startRefreshAttempt()
in ways that have a very small possibility of destabilizing customer applications.

In version 1.14.1 and earlier, CloudSqlInstance.getInstanceData() behaved like this: when no refresh attempt is
in progress, it returns immediately. Otherwise, it blocks the application thread until the current refresh attempt finishes.
If the refresh attempt succeeds, it returns the InstanceData. If not, it throws a RuntimeException, while a
new refresh attempt is submitted to the executor in the background.

core/src/main/java/com/google/cloud/sql/core/CloudSqlInstance.java

enocom · 2023-10-09T17:42:14Z

core/src/main/java/com/google/cloud/sql/core/CloudSqlInstance.java

+      // If the currentInstanceData has expired, then force refresh (which will balk if a refresh
+      // is already running) and make this and future requests to getInstanceData wait on the
+      // refresh operation to complete.
+      if (instanceDataFuture.isDone()) {


Wouldn't this break our recent ZDT changes, though?

enocom · 2023-10-09T17:42:52Z

core/src/main/java/com/google/cloud/sql/core/CloudSqlInstance.java

+      // If the currentInstanceData has expired, then force refresh (which will balk if a refresh
+      // is already running) and make this and future requests to getInstanceData wait on the
+      // refresh operation to complete.
+      if (instanceDataFuture.isDone()) {


If we still think this is important, let's pull it out into a separate PR. It seems like a separate concern from the bigger threading fixes here.

core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceTest.java

hessjcg · 2023-10-09T20:50:00Z

New PR #1600 makes sure that CloudSqlInstance.getInstanceData() never returns expired data, but instead blocks
until it can return valid InstanceData.

enocom · 2023-10-09T22:00:50Z

core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceTest.java

+
+    AtomicInteger refreshCount = new AtomicInteger();
+    final PauseCondition badRequest1 = new PauseCondition();
+    final PauseCondition badRequest2 = new PauseCondition();


Why two bad requests? Could we write this test with just one?

hessjcg · 2023-10-10

Because I want to ensure that the retry is working: that the chain of futures is built correctly even after the first failure. Early iterations of this PR had a bug where it would retry after one failure but stop after the second failure.
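
For readers following along, here is a rough sketch of the kind of retry chain being tested. It is not the PR's implementation: it uses the JDK's `CompletableFuture` rather than Guava futures, and `refreshWithRetry`, `propagate`, `attemptsLeft`, and the 10 ms delay are hypothetical names and values. The demo mirrors the test's shape by failing twice before succeeding, which is the case a one-failure test would not cover.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration; none of these names come from the connector's code base.
final class RetryChainSketch {

  // Copies the outcome of `source` into `target` without blocking any thread.
  private static <T> void propagate(CompletableFuture<T> source, CompletableFuture<T> target) {
    source.whenComplete(
        (value, error) -> {
          if (error == null) {
            target.complete(value);
          } else {
            target.completeExceptionally(error);
          }
        });
  }

  // Runs `attempt` on the executor. On failure, schedules the next attempt as a new
  // link in the chain instead of waiting for it inside the failed task.
  public static <T> CompletableFuture<T> refreshWithRetry(
      Supplier<T> attempt, ScheduledExecutorService executor, int attemptsLeft) {
    CompletableFuture<T> result = new CompletableFuture<>();
    CompletableFuture.supplyAsync(attempt, executor)
        .whenComplete(
            (value, error) -> {
              if (error == null) {
                result.complete(value);
              } else if (attemptsLeft <= 1) {
                result.completeExceptionally(error);
              } else {
                // The retry must be re-armed after every failure, not only the first one.
                executor.schedule(
                    () -> propagate(refreshWithRetry(attempt, executor, attemptsLeft - 1), result),
                    10,
                    TimeUnit.MILLISECONDS);
              }
            });
    return result;
  }

  // Demo: the first two attempts fail, the third succeeds. Returns the attempt count.
  public static int attemptsUntilSuccess() {
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
    AtomicInteger refreshCount = new AtomicInteger();
    try {
      refreshWithRetry(
              () -> {
                if (refreshCount.incrementAndGet() < 3) {
                  throw new IllegalStateException("bad request");
                }
                return "instance data";
              },
              executor,
              5)
          .get(2, TimeUnit.SECONDS);
      return refreshCount.get();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("attempts until success: " + attemptsUntilSuccess());
  }
}
```

The sketch runs on a single-thread scheduler on purpose: because no task waits on another, one thread is enough to drive the whole chain through two failures and a success.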

enocom · 2023-10-09T22:01:35Z

core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceTest.java

+    badRequest2.proceed();
+    badRequest2.waitForCondition(() -> refreshCount.get() == 3, 2000);
+
+    // Allow the third bad request to complete


Is this comment correct?

enocom · 2023-10-10T19:37:20Z

core/src/main/java/com/google/cloud/sql/core/CloudSqlInstance.java

+    // Once rate limiter is done, attempt to getInstanceData.
+    ListenableFuture<InstanceData> dataFuture =
+        Futures.whenAllComplete(rateLimit)
+            .callAsync(


Open question: why callAsync and not transformAsync?

enocom · 2023-10-10T20:33:18Z

For my own reference, this is a recreation of #1457.

product-auto-label bot added the size: l label Oct 3, 2023

hessjcg force-pushed the fix-refresh-futures branch from 02a439b to 3d9a19a Compare October 3, 2023 19:54

hessjcg force-pushed the fix-refresh-timeouts branch from 8221fb6 to 3d72001 Compare October 3, 2023 19:54

hessjcg force-pushed the fix-refresh-futures branch from 3d9a19a to 8c6d6f0 Compare October 3, 2023 19:57

hessjcg force-pushed the fix-refresh-timeouts branch from 0c52db4 to f38e968 Compare October 3, 2023 20:43

hessjcg force-pushed the fix-refresh-futures branch from 8c6d6f0 to d3bfaf0 Compare October 3, 2023 20:43

hessjcg force-pushed the fix-refresh-timeouts branch from f38e968 to 4df326f Compare October 4, 2023 17:08

hessjcg force-pushed the fix-refresh-futures branch from d3bfaf0 to 62a72ad Compare October 4, 2023 17:08

github-actions bot removed the size: l label Oct 4, 2023

hessjcg force-pushed the fix-refresh-timeouts branch from 4df326f to 554ee9f Compare October 4, 2023 19:34

hessjcg force-pushed the fix-refresh-futures branch from 62a72ad to 79378f0 Compare October 4, 2023 19:34

product-auto-label bot added the size: l label Oct 4, 2023

hessjcg force-pushed the fix-refresh-timeouts branch from 554ee9f to dc0e382 Compare October 4, 2023 19:59

hessjcg force-pushed the fix-refresh-futures branch from 79378f0 to 48ae8d1 Compare October 4, 2023 19:59

hessjcg force-pushed the fix-refresh-timeouts branch from dc0e382 to 90f8546 Compare October 4, 2023 20:05

hessjcg force-pushed the fix-refresh-futures branch 2 times, most recently from 50a8be0 to 4499923 Compare October 4, 2023 20:27

hessjcg force-pushed the fix-refresh-timeouts branch from 90f8546 to 9c37eb4 Compare October 4, 2023 20:27

hessjcg force-pushed the fix-refresh-futures branch from 4499923 to f16917f Compare October 4, 2023 20:47

hessjcg force-pushed the fix-refresh-timeouts branch 2 times, most recently from b064bd6 to 1002788 Compare October 4, 2023 20:48

hessjcg force-pushed the fix-refresh-futures branch from f16917f to b14179d Compare October 4, 2023 20:53

github-actions bot removed the size: l label Oct 4, 2023

hessjcg force-pushed the fix-refresh-futures branch from b14179d to e760519 Compare October 4, 2023 22:11

product-auto-label bot added the size: l label Oct 4, 2023

github-actions bot removed the size: l label Oct 4, 2023

hessjcg force-pushed the fix-refresh-futures branch from e760519 to 3b68f1f Compare October 4, 2023 22:29

hessjcg force-pushed the fix-refresh-timeouts branch from 1c70685 to e1754cc Compare October 4, 2023 22:29

hessjcg force-pushed the fix-refresh-futures branch 2 times, most recently from bd6d635 to c6f81f9 Compare October 5, 2023 21:01

github-actions bot removed the size: l label Oct 6, 2023

product-auto-label bot added the api: cloudsql label Oct 6, 2023

github-actions bot removed the api: cloudsql label Oct 6, 2023

product-auto-label bot added the api: cloudsql label Oct 7, 2023

hessjcg force-pushed the fix-refresh-futures branch from c6f81f9 to 4ad8bbe Compare October 9, 2023 17:40

product-auto-label bot added the size: l label Oct 9, 2023

hessjcg force-pushed the fix-refresh-futures branch from 4ad8bbe to de5b290 Compare October 9, 2023 17:45

enocom requested changes Oct 9, 2023

hessjcg force-pushed the fix-refresh-futures branch 2 times, most recently from 51a5650 to d6ccb7e Compare October 9, 2023 20:20

hessjcg requested a review from enocom October 9, 2023 20:24

enocom reviewed Oct 9, 2023

github-actions bot removed the size: l label Oct 10, 2023

fix: Improve handling of futures and threads during refresh.

9968302

hessjcg force-pushed the fix-refresh-futures branch from d6ccb7e to 9968302 Compare October 10, 2023 14:36

product-auto-label bot added the size: l label Oct 10, 2023

hessjcg requested a review from enocom October 10, 2023 14:37

github-actions bot removed the size: l label Oct 10, 2023

enocom approved these changes Oct 10, 2023

Merge branch 'main' into fix-refresh-futures

0ecd331

product-auto-label bot added the size: l label Oct 10, 2023

hessjcg enabled auto-merge (squash) October 10, 2023 20:37

hessjcg merged commit f3458a6 into main Oct 10, 2023
21 checks passed

hessjcg deleted the fix-refresh-futures branch October 10, 2023 20:58

release-please bot mentioned this pull request Oct 10, 2023

chore(main): release 1.14.1 #1531

Merged
