-
Notifications
You must be signed in to change notification settings - Fork 4.6k
xdsclient: delay resource cache deletion to handle immediate re-subscription of same resource #8369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xdsclient: delay resource cache deletion to handle immediate re-subscription of same resource #8369
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8369 +/- ##
==========================================
+ Coverage 82.34% 82.42% +0.07%
==========================================
Files 414 414
Lines 40434 40531 +97
==========================================
+ Hits 33295 33406 +111
+ Misses 5781 5765 -16
- Partials 1358 1360 +2
🚀 New features to boost your workflow:
|
d361cc6
to
e70fea2
Compare
99d6a11
to
2e6ba1b
Compare
2e6ba1b
to
a06eb28
Compare
Thanks for the fix pull request! I have been running some tests and experiments across multiple trials to verify this race is fixed, it should be done soon. |
Status update: The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed |
// Call the event handler to remove unsubscribed cache entries. | ||
s.eventHandler.onRequiredToRemoveUnsubscribedCacheEntries(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this happen only if the call to Send
below is successful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the c++ logic here which does this right before sending the request. I think because we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.
I think this is good information that is worth capturing in a comment. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Get the resource types that this specific ADS stream was handling | ||
// before stopping it. | ||
xc.ads.mu.Lock() | ||
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState)) | ||
for typ := range xc.ads.resourceTypeState { | ||
typesHandledByStream = append(typesHandledByStream, typ) | ||
} | ||
xc.ads.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need all this cleanup code in close
?
An xdsChannel
is closed only after all authorities have released their reference to the channel, and authority releases its reference to a channel only when it has no more watches. So, shouldn't the authority's cache be empty at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another rare case of having unsubscriptions for which we have not yet actually sent unsubscribe messages, and now we never will, so we do a pass to delete any cache entries for which we've unsubscribed. Such a scenario can happen if unsubscribe happens while stream is broken and then xdsclient is closed. In the case when garbage collection is done via onRequest, the xdschannel.Close() part will be no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO. Although we can just move the logic to a helper in adsStreamImpl?
990cb99
to
4a0f4e6
Compare
a.logger.Infof("Resubscribing to resource of type %q and name %q", typ.TypeName, name) | ||
} | ||
xc.channel.subscribe(typ, name) | ||
if len(state.watchers) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should invert this condition and continue
if the condition is not met. That will ensure that the interesting piece of functionality is less indented. Please see: go/go-style/decisions#indent-error-flow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Get the resource types that this specific ADS stream was handling | ||
// before stopping it. | ||
xc.ads.mu.Lock() | ||
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState)) | ||
for typ := range xc.ads.resourceTypeState { | ||
typesHandledByStream = append(typesHandledByStream, typ) | ||
} | ||
xc.ads.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.
1506c39
to
7f9e915
Compare
@danielzhaotongliu the test in this PR confirms that the resource is served from cache until an attempt to send unsubscription request is made. Without this fix, the watcher will timeout in case of subscribing a resource who has no watchers but hasn't yet sent unsubscribe request. Its similar to the change that was made in c++. I have imported the change to a fresh cl/774823957 over latest. Could you run the tests on that? Even if we can't repro the race or if there are other reasons/issues, we probably should handle in separate issue. We still would like to merge this fix because logically it fixes the race of unsubscribe/subscribe from the xdsclient point of view as described in #8125. |
LGTM. |
7f9e915
to
84ac083
Compare
84ac083
to
8f55bb5
Compare
8f55bb5
to
1f38f8e
Compare
The change being reverted here (#8369) is a prime suspect for a race that can show up with the following sequence of events: - create a new gRPC channel with the `xds:///` scheme - make an RPC - close the channel - repeat (possibly from multiple goroutines) The observable behavior from the race is that the xDS client thinks that a Listener resource is removed by the control plane when it clearly is not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE and subsequent RPC failures. The reason the above mentioned PR is not being rolled back using `git revert` is because the xds directory structure has changed significantly since the time the PR was originally merged. Manually performing the revert seemed much easier. RELEASE NOTES: * xdsclient: Revert a change that introduces a race with xDS resource processing, leading to RPC failures
…ription of same resource (grpc#8369)
The change being reverted here (grpc#8369) is a prime suspect for a race that can show up with the following sequence of events: - create a new gRPC channel with the `xds:///` scheme - make an RPC - close the channel - repeat (possibly from multiple goroutines) The observable behavior from the race is that the xDS client thinks that a Listener resource is removed by the control plane when it clearly is not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE and subsequent RPC failures. The reason the above mentioned PR is not being rolled back using `git revert` is because the xds directory structure has changed significantly since the time the PR was originally merged. Manually performing the revert seemed much easier. RELEASE NOTES: * xdsclient: Revert a change that introduces a race with xDS resource processing, leading to RPC failures
Fixes: #8125
RELEASE NOTES: