Skip to content

xdsclient: fixed a bug to not report connectivity error from new stream recreation if a response was received before #8373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented May 30, 2025

Fixes: #8351

As per gRFC 57 we do not consider it an error if the ADS stream was closed after having received a response on the stream. This should hold true even if we get an error during first attempt of recreation of the stream.

Successful forge test

RELEASE NOTES:

  • xdsclient: fixed a bug to not report connectivity error from new stream recreation if a response was received before

Copy link

codecov bot commented May 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.15%. Comparing base (ec4810c) to head (feb42ad).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8373      +/-   ##
==========================================
- Coverage   82.22%   82.15%   -0.08%     
==========================================
  Files         419      419              
  Lines       42047    42066      +19     
==========================================
- Hits        34574    34560      -14     
- Misses       6003     6035      +32     
- Partials     1470     1471       +1     
Files with missing lines Coverage Δ
xds/internal/clients/xdsclient/ads_stream.go 80.08% <100.00%> (-0.14%) ⬇️
xds/internal/xdsclient/transport/ads/ads_stream.go 83.87% <100.00%> (+0.24%) ⬆️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@purnesh42H purnesh42H force-pushed the fix-error-propagation-on-stream-restart-after-msg-recvd branch from 868f0f0 to 295dc3a Compare May 30, 2025 09:51
@purnesh42H purnesh42H requested review from easwars and arjan-bal May 30, 2025 09:53
@purnesh42H purnesh42H added Type: Bug Area: xDS Includes everything xDS related, including LB policies used with xDS. labels May 30, 2025
@purnesh42H purnesh42H added this to the 1.74 Release milestone May 30, 2025
…ng new stream creation if msg was received before
@purnesh42H purnesh42H force-pushed the fix-error-propagation-on-stream-restart-after-msg-recvd branch from 295dc3a to b54468a Compare May 30, 2025 10:01
@purnesh42H
Copy link
Contributor Author

purnesh42H commented May 30, 2025

Looks like its failing some fallack tests. Will check them. Unassigning the reviewers until then.

@purnesh42H purnesh42H self-assigned this May 30, 2025
@purnesh42H purnesh42H force-pushed the fix-error-propagation-on-stream-restart-after-msg-recvd branch 2 times, most recently from 02e583d to 86fbe6e Compare May 30, 2025 10:47
@purnesh42H purnesh42H force-pushed the fix-error-propagation-on-stream-restart-after-msg-recvd branch from 86fbe6e to 562b1a0 Compare May 30, 2025 11:05
@purnesh42H
Copy link
Contributor Author

Looks like its failing some fallack tests. Will check them. Unassigning the reviewers until then.

Fixed by resetting after first attempt of recreation

@purnesh42H purnesh42H changed the title xdsclient: fixed a bug to consider connectivity error as transient during new stream creation if a response was received before xdsclient: fixed a bug to not report connectivity error from new stream recreation if a response was received before May 30, 2025
@purnesh42H purnesh42H requested a review from easwars May 30, 2025 11:12
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H May 30, 2025
@purnesh42H purnesh42H requested a review from arjan-bal May 30, 2025 11:12
Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make the following fixes in the test logic and re-verify on forge?

  • Validate that we get a second update metric emitted.
  • Use a short context and verify that we don't receive a server failure metric.

mdWant = stats.MetricsData{
Handle: xdsClientServerFailureMetric.Descriptor(),
IntIncr: 1,
LabelKeys: []string{"grpc.target", "grpc.xds.server"},
LabelVals: []string{"Test/ServerFailureMetrics_AfterResponseRecv", mgmtServer.Address},
}
// Server failure should still have no recording point.
if err := tmr.WaitForInt64Count(ctx, mdWant); err == nil {
t.Fatal("tmr.WaitForInt64Count(ctx, mdWant) succeeded when expected to timeout.")
}

@purnesh42H
Copy link
Contributor Author

purnesh42H commented May 30, 2025

Can you please make the following fixes in the test logic and re-verify on forge?

  • Validate that we get a second update metric emitted.
  • Use a short context and verify that we don't receive a server failure metric.

mdWant = stats.MetricsData{
Handle: xdsClientServerFailureMetric.Descriptor(),
IntIncr: 1,
LabelKeys: []string{"grpc.target", "grpc.xds.server"},
LabelVals: []string{"Test/ServerFailureMetrics_AfterResponseRecv", mgmtServer.Address},
}
// Server failure should still have no recording point.
if err := tmr.WaitForInt64Count(ctx, mdWant); err == nil {
t.Fatal("tmr.WaitForInt64Count(ctx, mdWant) succeeded when expected to timeout.")
}

Validate that we get a second update metric emitted.

I don't think we can deterministically verify that with the current setup of test metrics recorder as only one metric is kept at a time. We already have tests that verify the behavior of re-requesting resources in case of stream restarts. I have removed the restart operation in metrics test so that we don't have possibility of re-request and resource update metrics.

  • Use a short context and verify that we don't receive a server failure metric.

Done

@purnesh42H purnesh42H requested a review from arjan-bal May 30, 2025 16:02
@purnesh42H
Copy link
Contributor Author

purnesh42H commented May 30, 2025

Looks like its failing some fallack tests. Will check them. Unassigning the reviewers until then.

Fixed by resetting messageReceived to false after first attempt of recreation because subsequent attempts should backoff as usual.

@easwars easwars assigned purnesh42H and unassigned easwars Jun 4, 2025
@easwars
Copy link
Contributor

easwars commented Jun 4, 2025

Made a comment on the bug. In my opinion, the test should be changed instead of the ADS stream implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: Test/ServerFailureMetrics_AfterResponseRecv
3 participants