Revision sets to True when deploy has the minimum number #15890
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: houshengbo. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment. |
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff             @@
##             main   #15890      +/-   ##
==========================================
+ Coverage   80.93%   80.95%   +0.01%
==========================================
  Files         210      210
  Lines       16731    16731
==========================================
+ Hits        13542    13545       +3
+ Misses       2839     2836       -3
  Partials      350      350
☔ View full report in Codecov by Sentry. |
If min-scale is involved then I think the issue might be in a different place - more specifically the PodAutoscaler. For example, serving/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go, Lines 33 to 38 in 8a39d5e.
We mark the revision ready here: serving/pkg/apis/serving/v1/revision_lifecycle.go, Lines 196 to 203 in 8a39d5e.
In |
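For readers following along without the inline embeds: the point being made is that the Revision derives its readiness from the PodAutoscaler's conditions. A minimal, self-contained sketch of that propagation, using hypothetical stand-in types rather than the real knative/serving API:

```go
package main

import "fmt"

// Hypothetical stand-ins for the PodAutoscaler and Revision statuses under
// discussion; they are not the real types from pa_lifecycle.go or
// revision_lifecycle.go.
type paStatus struct {
	scaleReady    bool  // the PA considers its scale target ready
	readyReplicas int32 // replicas currently passing readiness
}

type revStatus struct {
	active bool // stands in for the Revision's Active/Ready condition
}

// propagate mirrors the idea behind revision_lifecycle.go: the Revision's
// condition is derived from the PodAutoscaler. If the PA reports ready
// before minScale replicas exist, the Revision also flips ready too early,
// which is the behavior this PR is trying to prevent.
func propagate(rev *revStatus, pa paStatus, minScale int32) {
	rev.active = pa.scaleReady && pa.readyReplicas >= minScale
}

func main() {
	rev := &revStatus{}
	propagate(rev, paStatus{scaleReady: true, readyReplicas: 1}, 3)
	fmt.Println("revision active:", rev.active) // false: only 1 of the 3 minScale replicas is ready
}
```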
Another thing we should do is update the e2e test to surface the issue you found - e.g. fail if Ready=True but ReplicaReadyCount < MinScale: https://github.com/knative/serving/blob/main/test/e2e/minscale_readiness_test.go |
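A rough sketch of that suggested assertion; the helper and its arguments are illustrative, not the actual minscale_readiness_test.go helpers or status fields:

```go
package e2e

import "testing"

// Illustrative only: fail the test if the revision reports Ready=True while
// fewer than minScale replicas are actually ready. The real test would read
// these values from the cluster via knative's test clients; here they are
// plain arguments.
func assertMinScaleBeforeReady(t *testing.T, ready bool, readyReplicas, minScale int32) {
	t.Helper()
	if ready && readyReplicas < minScale {
		t.Fatalf("revision reported Ready=True with only %d of %d (minScale) ready replicas",
			readyReplicas, minScale)
	}
}
```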
@dprotaso Do we allow this situation to happen? If the revision is set to true, I guess traffic is assigned and shifted to the new revision, and the old revision starts to terminate (this is how we ran into the issue). If the deployment is running with fewer pods than minScale, revision availability should NOT be set to true. I will check on the PodAutoscaler part and the test cases. |
No - because I would expect When that condition is |
This line was added into
Per the PodAutoscaler, the logic was correct, as in kpa:
and in hpa:
so the PodAutoscaler is running as we expected for both kpa and hpa. They both check whether minScale is reached before setting the revision status. |
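In spirit, the autoscaler-side gate described here amounts to something like the following; the real KPA/HPA reconcilers go through knative's condition helpers, so treat the names, signature, and reason string as illustrative:

```go
package sketch

// Sketch of the gate both autoscalers are said to apply: hold the
// PodAutoscaler's Active condition until minScale ready pods exist.
func paActive(readyPods, minScale int32) (active bool, reason string) {
	if minScale > 0 && readyPods < minScale {
		return false, "MinScaleNotMet"
	}
	return true, ""
}
```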
/retest |
/retest |
/retest |
/retest |
/test istio-latest-no-mesh_serving_main |
Curling GitHub for a file is getting an HTTP 429 rate limit. /test istio-latest-no-mesh_serving_main |
/test istio-latest-no-mesh_serving_main |
@dprotaso I tried multiple times to make sure there is no race condition. The test results seem to be good so far. |
I've been trying to run the changes to TestMinScale locally against v1.18.0 (which should have the broken scaling). I'm finding that the test as-is does not fail reliably.
I think we need to change the test to check that the prior revision isn't scaled down until after the second revision is ready.
```go
if replicas := *revision.Status.ActualReplicas; replicas < minScale {
	t.Fatalf("Container is indicated as healthy. Expected actual replicas for revision %v to be %v but got %v", revision.Name, minScale, replicas)
}
```
I'm hitting this condition when running locally against a broken serving release. I wouldn't expect that, so it makes me think there's a delay before ActualReplicas is updated.
This fails sometimes when scaling up the first revision or the second. Because of that, I don't think we can reliably use this value in the test.
This is probably a separate issue from the one you're addressing.
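If the lag in ActualReplicas is the problem, one option is to poll the value instead of reading it once. A sketch using the standard apimachinery wait helper, with a hypothetical getActualReplicas accessor standing in for the real client call:

```go
package e2e

import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getActualReplicas is a hypothetical accessor standing in for however the
// test reads revision.Status.ActualReplicas from the cluster.
type getActualReplicas func(ctx context.Context) (int32, error)

// waitForMinScale polls instead of reading ActualReplicas once, to tolerate
// the status-update lag observed above.
func waitForMinScale(ctx context.Context, t *testing.T, get getActualReplicas, minScale int32) {
	t.Helper()
	err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			n, err := get(ctx)
			if err != nil {
				return false, err
			}
			return n >= minScale, nil
		})
	if err != nil {
		t.Fatalf("revision never reported at least minScale=%d actual replicas: %v", minScale, err)
	}
}
```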
This check is also duplicated on line 117
One other observation I have - I think as part of the test you should introduce a readiness delay on the second revision. What I'm seeing happen locally is that the new revision spins up instantaneously because the image is pre-cached on the node, so there isn't an observed early termination of the first revision. |
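One way to introduce that readiness delay is a readiness probe with an initial delay on the second revision's container, so new pods cannot become ready instantly even with the image pre-cached. A sketch using plain corev1 types; the path, port, and timings are placeholders, and the actual test would wire this in through knative's service options:

```go
package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// delayedReadinessProbe builds a readiness probe that keeps new pods
// not-ready for delaySeconds, so the second revision cannot become ready
// instantly even when its image is already cached on the node. The path
// and port below are placeholders for whatever the test image serves.
func delayedReadinessProbe(delaySeconds int32) *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: delaySeconds,
		PeriodSeconds:       1,
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",
				Port: intstr.FromInt(8080),
			},
		},
	}
}
```

With that delay in place, the test can then assert that the first revision keeps at least minScale ready pods until the second revision reports Ready.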
@houshengbo @dprotaso where are we on this? |
I dug into this more - I think we should fix
When we have no pods ready, we end up with a PA Status of
When 1 replica is ready (out of 3), we have the PA status
So |
Fixes #15889
Proposed Changes
Release Note