Skip to content

xdsclient: do not process updates from closed server channels #8389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Jun 9, 2025

The xDS client is currently making an incorrect assumption that can lead to panics.

When an xDS client authority receives a response from a server, it checks to see if the response is from the currently active channel. If not, it assumes that it is from a higher priority server (because when a response is received from a server, all lower priority servers are closed), and therefore expects that a channel to that server exists. This need not be true because the responses are not handled inline, but are handled in the serializer. Here is a possible sequence:

  • Authority contains servers A, B, C (in priority order).
  • Let's say A and B are DOWN, and C is UP. Therefore the active channel is now C.
  • At time T0, the authority receives a message from server B. This is not handled inline, but is queued in the serializer for handling.
  • At time T1, the authority receives a message from server C. This is not handled inline, but is queued in the serializer for handling.
  • At time T2, the serializer handles the message from server B and as part of this handling, releases the reference to server C as it has now seen a response from a higher priority server. Also, the active channel now becomes B.
  • At time T3, the serializer handles the message from server C, and implicitly (wrongly) assumes that since it is not from the currently active channel B, it must be from a higher priority server and expects a channel to be present for that server.

This fix ensures what when the authority receives a response from a server that is not currently the active channel, it must ensure that it is from a higher priority server before proceeding with handling that response. If it is not from a higher priority server, it should drop the response on the floor and move on.

RELEASE NOTES:

  • xdsclient: Fixed a rare panic caused by processing a response from a closed server.

@easwars easwars requested a review from arjan-bal June 9, 2025 18:16
@easwars
Copy link
Contributor Author

easwars commented Jun 9, 2025

Fixes internal issue: https://b.corp.google.com/issues/408227990

@easwars easwars added this to the 1.74 Release milestone Jun 9, 2025
@easwars
Copy link
Contributor Author

easwars commented Jun 9, 2025

@purnesh42H : FYI since the migration PR got rolled back. If this goes in before that eventual roll-forward, we need to ensure that we don't lose track of this fix. Thanks.

Copy link

codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 82.14%. Comparing base (d2e8366) to head (8c6cd66).

Files with missing lines Patch % Lines
xds/internal/xdsclient/authority.go 27.77% 10 Missing and 3 partials ⚠️
xds/internal/clients/xdsclient/authority.go 72.22% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8389      +/-   ##
==========================================
- Coverage   82.17%   82.14%   -0.03%     
==========================================
  Files         419      419              
  Lines       42065    42079      +14     
==========================================
  Hits        34567    34567              
- Misses       6023     6042      +19     
+ Partials     1475     1470       -5     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/authority.go 70.60% <72.22%> (-2.72%) ⬇️
xds/internal/xdsclient/authority.go 77.94% <27.77%> (-2.45%) ⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants