Skip to content

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented May 29, 2025

Fixes: #8125

RELEASE NOTES:

  • xdsclient: fixed an edge case race causing "resource doesn not exist" when rapidly subscribing and unsubscribing the same resource.

Copy link

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 71.95122% with 23 lines in your changes missing coverage. Please review.

Project coverage is 82.42%. Comparing base (0a12fb0) to head (1f38f8e).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/clients/xdsclient/authority.go 54.28% 10 Missing and 6 partials ⚠️
xds/internal/clients/xdsclient/channel.go 79.41% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8369      +/-   ##
==========================================
+ Coverage   82.34%   82.42%   +0.07%     
==========================================
  Files         414      414              
  Lines       40434    40531      +97     
==========================================
+ Hits        33295    33406     +111     
+ Misses       5781     5765      -16     
- Partials     1358     1360       +2     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/ads_stream.go 83.03% <100.00%> (+0.07%) ⬆️
xds/internal/clients/xdsclient/xdsclient.go 83.19% <100.00%> (-0.93%) ⬇️
xds/internal/clients/xdsclient/channel.go 78.94% <79.41%> (+0.10%) ⬆️
xds/internal/clients/xdsclient/authority.go 79.43% <54.28%> (-1.24%) ⬇️

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch 3 times, most recently from d361cc6 to e70fea2 Compare June 2, 2025 13:25
@dfawley
Copy link
Member

dfawley commented Jun 2, 2025

cc @danielzhaotongliu

@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 99d6a11 to 2e6ba1b Compare June 3, 2025 15:12
@purnesh42H purnesh42H requested review from easwars and dfawley June 3, 2025 15:14
@purnesh42H purnesh42H added Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug and removed Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jun 3, 2025
@purnesh42H purnesh42H added this to the 1.74 Release milestone Jun 3, 2025
@dfawley dfawley removed their assignment Jun 3, 2025
@dfawley dfawley removed their request for review June 3, 2025 15:50
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 2e6ba1b to a06eb28 Compare June 4, 2025 06:57
@danielzhaotongliu
Copy link
Contributor

Thanks for the fix pull request!

I have been running some tests and experiments across multiple trials to verify this race is fixed, it should be done soon.

@danielzhaotongliu
Copy link
Contributor

Status update:

The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed

Comment on lines 448 to 449
// Call the event handler to remove unsubscribed cache entries.
s.eventHandler.onRequiredToRemoveUnsubscribedCacheEntries(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this happen only if the call to Send below is successful?

Copy link
Contributor Author

@purnesh42H purnesh42H Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the c++ logic here which does this right before sending the request. I think because we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to ensure the cache entries are deleted even if discovery request fails. In case of failure when the stream restarts, nonce is reset anyways.

I think this is good information that is worth capturing in a comment. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 148 to 157
// Get the resource types that this specific ADS stream was handling
// before stopping it.
xc.ads.mu.Lock()
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState))
for typ := range xc.ads.resourceTypeState {
typesHandledByStream = append(typesHandledByStream, typ)
}
xc.ads.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need all this cleanup code in close?

An xdsChannel is closed only after all authorities have released their reference to the channel, and authority releases its reference to a channel only when it has no more watches. So, shouldn't the authority's cache be empty at that point?

Copy link
Contributor Author

@purnesh42H purnesh42H Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another rare case of having unsubscriptions for which we have not yet actually sent unsubscribe messages, and now we never will, so we do a pass to delete any cache entries for which we've unsubscribed. Such a scenario can happen if unsubscribe happens while stream is broken and then xdsclient is closed. In the case when garbage collection is done via onRequest, the xdschannel.Close() part will be no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO. Although we can just move the logic to a helper in adsStreamImpl?

@easwars easwars assigned purnesh42H and unassigned easwars Jun 16, 2025
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch 3 times, most recently from 990cb99 to 4a0f4e6 Compare June 19, 2025 19:47
@purnesh42H purnesh42H requested a review from easwars June 19, 2025 19:51
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Jun 19, 2025
a.logger.Infof("Resubscribing to resource of type %q and name %q", typ.TypeName, name)
}
xc.channel.subscribe(typ, name)
if len(state.watchers) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should invert this condition and continue if the condition is not met. That will ensure that the interesting piece of functionality is less indented. Please see: go/go-style/decisions#indent-error-flow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 148 to 157
// Get the resource types that this specific ADS stream was handling
// before stopping it.
xc.ads.mu.Lock()
typesHandledByStream := make([]ResourceType, 0, len(xc.ads.resourceTypeState))
for typ := range xc.ads.resourceTypeState {
typesHandledByStream = append(typesHandledByStream, typ)
}
xc.ads.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with having to grab the lock of one type from another type. Could you please add a TODO here to ensure that this is cleaned up at some point in the future. Thanks.

@easwars easwars assigned purnesh42H and unassigned easwars Jun 23, 2025
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 1506c39 to 7f9e915 Compare June 23, 2025 17:58
@purnesh42H
Copy link
Contributor Author

purnesh42H commented Jun 23, 2025

Status update:

The initial end-to-end experiment (that was able to reproduce #8125) revealed a potential separate new issue but the experiment results could not confirm/deny this is fixed. Working on improving the experiment and rerun to confirm the race condition would be fixed

@danielzhaotongliu the test in this PR confirms that the resource is served from cache until an attempt to send unsubscription request is made. Without this fix, the watcher will timeout in case of subscribing a resource who has no watchers but hasn't yet sent unsubscribe request. Its similar to the change that was made in c++. I have imported the change to a fresh cl/774823957 over latest. Could you run the tests on that?

Even if we can't repro the race or if there are other reasons/issues, we probably should handle in separate issue. We still would like to merge this fix because logically it fixes the race of unsubscribe/subscribe from the xdsclient point of view as described in #8125.

@danielzhaotongliu
Copy link
Contributor

LGTM.

@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 7f9e915 to 84ac083 Compare July 3, 2025 05:14
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 84ac083 to 8f55bb5 Compare July 17, 2025 05:43
@purnesh42H purnesh42H force-pushed the fix-xdsclient-resource-sub-unsub-race branch from 8f55bb5 to 1f38f8e Compare July 23, 2025 04:43
@purnesh42H purnesh42H merged commit e1f69d8 into grpc:master Jul 24, 2025
15 checks passed
easwars added a commit to easwars/grpc-go that referenced this pull request Aug 20, 2025
easwars added a commit that referenced this pull request Aug 21, 2025
The change being reverted here (#8369) is a prime suspect for a race
that can show up with the following sequence of events:
- create a new gRPC channel with the `xds:///` scheme
- make an RPC
- close the channel
- repeat (possibly from multiple goroutines)

The observable behavior from the race is that the xDS client thinks that
a Listener resource is removed by the control plane when it clearly is
not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE
and subsequent RPC failures.

The reason the above mentioned PR is not being rolled back using `git
revert` is because the xds directory structure has changed significantly
since the time the PR was originally merged. Manually performing the
revert seemed much easier.

RELEASE NOTES:
* xdsclient: Revert a change that introduces a race with xDS resource
processing, leading to RPC failures
dimpavloff pushed a commit to dimpavloff/grpc-go that referenced this pull request Aug 22, 2025
eshitachandwani pushed a commit to eshitachandwani/grpc-go that referenced this pull request Aug 29, 2025
The change being reverted here (grpc#8369) is a prime suspect for a race
that can show up with the following sequence of events:
- create a new gRPC channel with the `xds:///` scheme
- make an RPC
- close the channel
- repeat (possibly from multiple goroutines)

The observable behavior from the race is that the xDS client thinks that
a Listener resource is removed by the control plane when it clearly is
not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE
and subsequent RPC failures.

The reason the above mentioned PR is not being rolled back using `git
revert` is because the xds directory structure has changed significantly
since the time the PR was originally merged. Manually performing the
revert seemed much easier.

RELEASE NOTES:
* xdsclient: Revert a change that introduces a race with xDS resource
processing, leading to RPC failures
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xdsclient: race around resource subscriptions and unsubscsriptions
5 participants