Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove race condition bug in refresh logic #1390

Merged
merged 14 commits into from
Jul 20, 2023

Conversation

hessjcg
Copy link
Collaborator

@hessjcg hessjcg commented Jul 17, 2023

Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing.

New forceRefresh() logic ensures that:

  • Only 1 refresh cycle may run at a time.
  • If a refresh cycle is in progress, then it will not be canceled until it succeeds.

Add new test cases to validate race conditions, deadlocks, and concurrency.

Add additional logging to help diagnose production problems with certificate refresh.

Related to #1314

Fixes #1209
Fixes #1159

@hessjcg hessjcg requested a review from a team as a code owner July 17, 2023 19:38
}

/** A constructor for use in tests that allows the caller to set the RateLimiter. */
CloudSqlInstance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just update the constructor to pass in the rate limiter. We don't need two constructors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer 2 constructors. I don't want to expose a public method to construct the default rate limiter just so that CoreSocketFactory can construct a CloudSqlInstance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule, I think we shouldn't be making test only APIs. And parameterizing the rate limiter is actually useful. Imagine if we wanted to expose rate limiting knobs to the user, or configure instances with different rate limits.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I'm not convinced, but it's not worth arguing about.

return;
}

forceRefreshRunning = true;
nextInstanceData.cancel(false);
if (nextInstanceData.isCancelled()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we remove this if block and always schedule a next refresh since we also always call cancel? This after having observed that cancel doesn't always work as we might think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The if (isCanceled()) is now irrelevant since now we use the forceRefreshRunning field instead.

FYI: cancel works as expected: nextInstanceData.cancel(false) will cancel the task if it is queued, but not started. It will do nothing and return false if the task is running or done.

@@ -120,9 +120,11 @@ public static synchronized CoreSocketFactory getInstance() {
@VisibleForTesting
// Returns a listenable, scheduled executor that exits upon shutdown.
static ListeningScheduledExecutorService getDefaultExecutor() {
// TODO(kvg): Figure out correct way to determine number of threads
// Set the number of threads to 8. This should be enough threads that most users
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split this out into a separate PR since it's technically the fix of the bug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. See #1391

newRateLimiter());

assertThat(supplier.counter.get()).isEqualTo(0);
Thread.sleep(500);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we avoid a sleep to ensure our test remains solid?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like an integration test of the ScheduledExecutorService. So this test is measuring how this JVM library class behaves when time passes. In this case, there is no way to mock the passing of time.

@hessjcg hessjcg changed the title fix: Increase threadpool count to avoid deadlock fix: Fix race condition bug in refresh logic and add logging Jul 18, 2023
@hessjcg hessjcg requested a review from enocom July 18, 2023 16:46
@enocom enocom changed the title fix: Fix race condition bug in refresh logic and add logging fix: remove race condition bug in refresh logic Jul 18, 2023
}

/** A constructor for use in tests that allows the caller to set the RateLimiter. */
CloudSqlInstance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule, I think we shouldn't be making test only APIs. And parameterizing the rate limiter is actually useful. Imagine if we wanted to expose rate limiting knobs to the user, or configure instances with different rate limits.

forceRefreshRunning = true;
nextInstanceData.cancel(false);
logger.fine(
"Force Refresh: the next refresh operation was cancelled."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the instance to the log message in leading brackets as we've done elsewhere. You might even make a convenience method for it that all methods (or even classes) can use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

logger.info(
String.format(
"[%s %d] INSTANCE DATA DONE",
instanceName, Thread.currentThread().getId()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave the thread ID off -- unless it's worth keeping?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto with below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -152,14 +151,22 @@ public InstanceData getInstanceData(
.orElse(x509Certificate.getNotAfter());
}

logger.info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should all be logger.fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@hessjcg hessjcg force-pushed the gh-1314-increase-threadpool-count branch from 5f1a399 to 2c3d6cc Compare July 18, 2023 19:00
@hessjcg hessjcg requested a review from enocom July 18, 2023 19:13
@hessjcg hessjcg force-pushed the gh-1314-increase-threadpool-count branch from a5958e4 to 875cde6 Compare July 18, 2023 20:01
Copy link
Member

@enocom enocom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can test this directly. Let's talk tomorrow about how to do that.

@GuardedBy("instanceDataGuard")
private boolean forceRefreshRunning;

static final RateLimiter defaultRateLimiter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just pass this in and not expand the public API here. As an example, we pass in a RateLimiter in the AlloyDB Java Connector here: https://github.com/GoogleCloudPlatform/alloydb-java-connector/blob/main/alloydb-jdbc-connector/src/main/java/com/google/cloud/alloydb/Connector.java#L156.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


private final boolean flakey;

private AtomicInteger counter = new AtomicInteger();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this final.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

private final boolean flakey;

private AtomicInteger counter = new AtomicInteger();
private InstanceData response =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

null),
new SslData(null, null, null),
Date.from(Instant.now().plus(1, ChronoUnit.HOURS)));
private final ListeningScheduledExecutorService executor;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a local variable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


private static class TestDataSupplier implements InstanceDataSupplier {

private final boolean flakey;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flaky

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@hessjcg hessjcg requested a review from enocom July 19, 2023 16:55
Copy link
Member

@enocom enocom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the new tests here add ~1 minute to our build. I think there's a way we can test this without the time increase. Let's talk about how when you're ready.

@hessjcg hessjcg requested a review from enocom July 19, 2023 18:36
@hessjcg hessjcg enabled auto-merge (squash) July 19, 2023 18:44
MockAdminApi mockAdminApi = new MockAdminApi();
ListenableFuture<KeyPair> keyPairFuture =
Futures.immediateFuture(mockAdminApi.getClientKeyPair());
ListeningScheduledExecutorService executor = newTestExecutor();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since newTestExecutor is a call through to the actual executor, shall we just inline this method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Method inlined.

int c = counter.incrementAndGet();
Thread.sleep(100);

if (flaky && c % 2 == 0 && c > 10) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had added the greater than 10 in to allow Hikari to establish two pools of 5 connections each successfully before simulating flaky behavior. Does this need to be here for test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. I removed it.

// Attempt to retrieve data, ensure we wait for success
ListenableFuture<List<Object>> allData =
Futures.allAsList(
executor.submit(instance::getSslData),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the significant of three here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three would make multiple, simultaneous calls to getSslData().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why three? Why not four, or two?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just needs to be more than one. And three is a magic number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use two in that case, and add a comment. I'm asking for my personal interest, but also for future readers who will wonder the same.

executor.submit(instance::forceRefresh));
try {
all.get();
} catch (Exception e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we silently squashing the exception here? Should we add it to the test signature instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this method significantly and added comments. Please take another look.

allData2.get();
} catch (Exception e) {
}
assertThat(supplier.counter.get()).isEqualTo(2 + i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the supplier equal to 2 plus i? Shouldn't the supplier be called once for every completed force refresh?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should be more clear now.

}

@Test
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from a happy path test we have elsewhere? Do we need it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the happy path without any concurrency. Before I changed the mainline code, this test passed and the other tests with more concurrency failed. It was an important test to demonstrate that the bad behavior was caused by concurrency bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this wasn't used to drive out the bug, I think we should check that we have similar coverage elsewhere (I believe we do) and omit this one here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have exactly this same coverage elsewhere. This was used to drive out the bug. I would like to leave this here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is just the same as this test. In my view, the assertions don't communicate what different behavior we have.

newRateLimiter());

Thread.sleep(500);
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here? How does this test differ from other tests we already have?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key difference is that this one throws an exception from within a scheduled executor task. This used to exacerbate the deadlock problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this behavior covered by testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions? It looks like a duplicate test to my eyes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has a single CloudSqlInstance and ensures that it successfully refreshes the correct number of times. Thus proving that the balking behavior of forceRefresh() works right.

testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions tries to cause a race condition and deadlock with multiple CloudSqlInstances and many more threads. It is fundimentally a different test.

While developing, if there was a bug in my code, one of these tests would fail and the other would pass. So I think they are both useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this test proves that the CloudSqlInstance calls the underlying InstanceDataSupplier, which is a variation of this test. If you'd like to test that force refresh functions correctly, then I'd expect a call that fails, a force refresh, and then a call that succeeeds.

}

// Get SSL Data for each instance, forcing the first refresh to complete.
instances.forEach(i -> i.getSslData());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to a lambda reference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Collaborator Author

@hessjcg hessjcg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the CloudSqlInstanceConcurrencyTest based on comments.

int c = counter.incrementAndGet();
Thread.sleep(100);

if (flaky && c % 2 == 0 && c > 10) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. I removed it.

MockAdminApi mockAdminApi = new MockAdminApi();
ListenableFuture<KeyPair> keyPairFuture =
Futures.immediateFuture(mockAdminApi.getClientKeyPair());
ListeningScheduledExecutorService executor = newTestExecutor();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Method inlined.

// Attempt to retrieve data, ensure we wait for success
ListenableFuture<List<Object>> allData =
Futures.allAsList(
executor.submit(instance::getSslData),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three would make multiple, simultaneous calls to getSslData().

executor.submit(instance::forceRefresh));
try {
all.get();
} catch (Exception e) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this method significantly and added comments. Please take another look.

allData2.get();
} catch (Exception e) {
}
assertThat(supplier.counter.get()).isEqualTo(2 + i);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should be more clear now.

}

@Test
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the happy path without any concurrency. Before I changed the mainline code, this test passed and the other tests with more concurrency failed. It was an important test to demonstrate that the bad behavior was caused by concurrency bugs.

newRateLimiter());

Thread.sleep(500);
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key difference is that this one throws an exception from within a scheduled executor task. This used to exacerbate the deadlock problem.

}

// Get SSL Data for each instance, forcing the first refresh to complete.
instances.forEach(i -> i.getSslData());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@hessjcg hessjcg requested a review from enocom July 20, 2023 15:47
// Attempt to retrieve data, ensure we wait for success
ListenableFuture<List<Object>> allData =
Futures.allAsList(
executor.submit(instance::getSslData),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why three? Why not four, or two?

}
}

@Test(timeout = 45000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If there is a deadlock, then the test will only fail with a timeout. So we should leave this in just incase we cause a deadlock again in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to wait 45 seconds before failing? What's the default value?

}

@Test
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this wasn't used to drive out the bug, I think we should check that we have similar coverage elsewhere (I believe we do) and omit this one here.

newRateLimiter());

Thread.sleep(500);
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this behavior covered by testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions? It looks like a duplicate test to my eyes.

// exception after the first refresh attempt fails.
}

@Test(timeout = 45000) // 45 seconds timeout in case of deadlock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here. Do we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If this test deadlocks, then the test will fail with a timeout. Otherwise it will hang indefinitely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- can we bump this number down? Can we rely on the default value? I assume the test framework does a default timeout (I think it's 30s).

}
}

@Test(timeout = 45000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to wait 45 seconds before failing? What's the default value?

// Attempt to retrieve data, ensure we wait for success
ListenableFuture<List<Object>> allData =
Futures.allAsList(
executor.submit(instance::getSslData),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use two in that case, and add a comment. I'm asking for my personal interest, but also for future readers who will wonder the same.

// We expect some of these to throw an exception indicating that the refresh cycle
// got a failed attempt. When they throw an exception,
// sleep and try again. This shows that the refresh cycle is working.
Thread.sleep(100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a log statement here too?

}

@Test
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is just the same as this test. In my view, the assertions don't communicate what different behavior we have.

keyPairFuture,
newRateLimiter());

Thread.sleep(500);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the sleep here?

newRateLimiter());

Thread.sleep(500);
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this test proves that the CloudSqlInstance calls the underlying InstanceDataSupplier, which is a variation of this test. If you'd like to test that force refresh functions correctly, then I'd expect a call that fails, a force refresh, and then a call that succeeeds.

// exception after the first refresh attempt fails.
}

@Test(timeout = 45000) // 45 seconds timeout in case of deadlock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- can we bump this number down? Can we rely on the default value? I assume the test framework does a default timeout (I think it's 30s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants