Revision sets to True when deploy has the minimum number #15890

Open: wants to merge 4 commits into base: main

Conversation

@houshengbo (Contributor) commented May 15, 2025

Fixes #15889

Proposed Changes

  • The Revision is set to healthy only when the deployment has the minimum number of replicas running and ready.

Release Note

The Revision is set to healthy only when the deployment has the minimum number of replicas running and ready.

@knative-prow knative-prow bot requested review from dprotaso, dsimansk and skonto May 15, 2025 21:29
@knative-prow knative-prow bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 15, 2025

knative-prow bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: houshengbo
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented May 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.95%. Comparing base (bbf34f6) to head (797b64c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15890      +/-   ##
==========================================
+ Coverage   80.93%   80.95%   +0.01%     
==========================================
  Files         210      210              
  Lines       16731    16731              
==========================================
+ Hits        13542    13545       +3     
+ Misses       2839     2836       -3     
  Partials      350      350              


@dprotaso (Member):

If min-scale is involved then I think the issue might be in a different place - more specifically the PodAutoscaler.

For example PodAutoscalerConditionScaleTargetInitialized should only be 'True' when we've reached min-scale.

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)

We mark the revision ready here

// Don't mark the resources available, if deployment status already determined
// it isn't so.
if ps.IsScaleTargetInitialized() && !resUnavailable {
	// Precondition for PA being initialized is SKS being active and
	// that implies that |service.endpoints| > 0.
	rs.MarkResourcesAvailableTrue()
	rs.MarkContainerHealthyTrue()
}


@dprotaso (Member):

Another thing we should do is update the e2e test to surface the issue you found, e.g. fail if Ready=True but ReplicaReadyCount < MinScale:

https://github.com/knative/serving/blob/main/test/e2e/minscale_readiness_test.go
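
A minimal sketch of such a check, under assumptions: the helper name, the way the deployment name and namespace are obtained, and reading the deployment's ReadyReplicas directly are all illustrative, not part of the existing test:

package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// assertMinScaleReached returns an error if, at the moment the Revision
// reports Ready=True, its backing deployment has fewer ready replicas than
// minScale.
func assertMinScaleReached(ctx context.Context, kube kubernetes.Interface, ns, deployName string, minScale int32) error {
	d, err := kube.AppsV1().Deployments(ns).Get(ctx, deployName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if got := d.Status.ReadyReplicas; got < minScale {
		return fmt.Errorf("revision reported Ready, but deployment %s has %d ready replicas, want >= %d", deployName, got, minScale)
	}
	return nil
}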

@houshengbo (Contributor, Author):

@dprotaso Do we allow this situation to happen?
Revision is set to true even if there is only one pod, even if the minscale is larger than one.

If the Revision is set to true, I guess traffic is assigned and shifted to the new revision, and the old revision starts to terminate (this is how we ran into the issue). If the deployment is running with fewer pods than minscale, the Revision availability should NOT be set to true.

I will check the PodAutoscaler part and the test cases.

@dprotaso (Member):

Revision is set to true even if there is only one pod, even if the minscale is larger than one.

No - because I would expect PodAutoscalerConditionScaleTargetInitialized to be False until minReady=minScale

When that condition is False then PodAutoscaler.Ready Condition should be False as well because it's a child condition of that LivingConditionSet.
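
A self-contained sketch of that dependency, using the condition-set API quoted above; the local paStatus type and the shortened condition names are illustrative stand-ins, not the real PodAutoscalerStatus:

package main

import (
	"fmt"

	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

// paStatus is a stand-in status type so the condition set can be driven locally.
type paStatus struct {
	duckv1.Status
}

const (
	condActive                 apis.ConditionType = "Active"
	condScaleTargetInitialized apis.ConditionType = "ScaleTargetInitialized"
	condSKSReady               apis.ConditionType = "SKSReady"
)

// Same shape as podCondSet above: Ready is computed from the three children.
var condSet = apis.NewLivingConditionSet(condActive, condScaleTargetInitialized, condSKSReady)

func main() {
	st := &paStatus{}
	m := condSet.Manage(st)
	m.InitializeConditions()

	// SKS and Active flip to True, but ScaleTargetInitialized stays Unknown
	// until min-scale is reached, so the top-level Ready condition is not True.
	m.MarkTrue(condSKSReady)
	m.MarkTrue(condActive)
	fmt.Println("ready:", m.IsHappy()) // false

	m.MarkTrue(condScaleTargetInitialized)
	fmt.Println("ready:", m.IsHappy()) // true
}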

@houshengbo (Contributor, Author):

rev.Status.MarkContainerHealthyTrue()

This line was added to pkg/reconciler/revision/reconcile_resources.go because we would like changes to the deployment resource itself to be reflected in the Revision.

As for the PodAutoscaler, the logic is already correct in the KPA:

https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/kpa/kpa.go#L274

and in the HPA:

https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/hpa/hpa.go#L102

so the PodAutoscaler behaves as expected for both the KPA and the HPA: each checks whether minscale has been reached before setting the Revision status.
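
Paraphrased as a sketch, that gate amounts to roughly the following; the helper name and the readyPods/minScale plumbing are assumptions for illustration, not the actual kpa.go/hpa.go code:

package sketch

import autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"

// markScaleTargetInitialized only reports the scale target as initialized once
// the number of ready pods has reached min-scale; until then the PodAutoscaler,
// and therefore the Revision, should not go Ready.
func markScaleTargetInitialized(pa *autoscalingv1alpha1.PodAutoscaler, readyPods, minScale int32) {
	if readyPods >= minScale {
		pa.Status.MarkScaleTargetInitialized()
	}
}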

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 16, 2025
@houshengbo (Contributor, Author):

/retest

@houshengbo (Contributor, Author):

/retest

2 similar comments: @houshengbo posted /retest twice more.

@dprotaso (Member):

/test istio-latest-no-mesh_serving_main

@dprotaso (Member):

Curling GitHub for a file is getting an HTTP 429 rate limit.

/test istio-latest-no-mesh_serving_main

@houshengbo (Contributor, Author):

/test istio-latest-no-mesh_serving_main

@houshengbo (Contributor, Author):

@dprotaso I tried multiple times to make sure there is no race condition. The test results seem to be good so far.

@dprotaso (Member) left a review:

I've been trying to run the changes to TestMinScale locally against v1.18.0 (which should have the broken scaling). I'm finding that the test, as is, does not fail reliably.

I think we need to change the test to check that the prior revision isn't scaled down until after the second revision is ready.

Comment on lines +99 to +101
if replicas := *revision.Status.ActualReplicas; replicas < minScale {
	t.Fatalf("Container is indicated as healthy. Expected actual replicas for revision %v to be %v but got %v", revision.Name, minScale, replicas)
}
@dprotaso (Member):

I'm hitting this condition when running locally against a broken serving release. I wouldn't expect that, so it makes me think there's a delay before ActualReplicas is updated.

This fails sometimes when scaling up the first revision or the second. Because of that, I don't think we can reliably use this value in the test.

This is probably a separate issue to the one you're addressing

@dprotaso (Member):

This check is also duplicated on line 117

@dprotaso (Member):

One other observation: I think as part of the test you should introduce a readiness delay on the second revision.

What I'm seeing happen locally is that the new revision spins up instantaneously because the image is pre-cached on the node, so there is no observed early termination of the first revision.
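
A rough sketch of such a delay (how the probe gets wired into the test's Service options is an assumption, not something taken from this PR):

package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessDelayProbe keeps a pod NotReady for at least delaySeconds, so a new
// revision cannot flip to Ready instantly even when its image is already
// cached on the node.
func readinessDelayProbe(delaySeconds int32) *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: delaySeconds,
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/",
				Port: intstr.FromInt(8080),
			},
		},
	}
}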

@yuzisun commented Jun 1, 2025

@houshengbo @dprotaso where are we on this?

@dprotaso (Member) commented Jun 5, 2025

I dug into this more - I think we should fix PropagateAutoscalerStatus

func (rs *RevisionStatus) PropagateAutoscalerStatus(ps *autoscalingv1alpha1.PodAutoscalerStatus) {

When we have no pods ready we end up with a PA Status of

status:
  actualScale: 0
  conditions:
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Active
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: K8s Service is not ready
    reason: NotReady
    status: Unknown
    type: SKSReady
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    status: Unknown
    type: ScaleTargetInitialized
  desiredScale: 3
  metricsServiceName: revision-failure-00002-private
  observedGeneration: 1
  serviceName: revision-failure-00002

When 1 replica is ready (out of 3) we have the PA status

status:
  actualScale: 1
  conditions:
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Active
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    message: Requests to the target are being buffered as resources are provisioned.
    reason: Queued
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-06-05T01:44:09Z"
    status: "True"
    type: SKSReady
  - lastTransitionTime: "2025-06-05T01:42:44Z"
    status: Unknown
    type: ScaleTargetInitialized
  desiredScale: 3
  metricsServiceName: revision-failure-00002-private
  observedGeneration: 1
  serviceName: revision-failure-00002

So PropagateAutoscalerStatus doesn't handle the case where SKSReady=True and ScaleTargetInitialized=Unknown; given that, it seems like we should set RevisionConditionResourcesAvailable=Unknown.
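
A hedged sketch of that direction (not this PR's diff; the reason/message strings, helper name, and the exact marker methods assumed on RevisionStatus are illustrative):

package sketch

import (
	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// propagateResourcesAvailable keeps ResourcesAvailable at Unknown while the
// scale target is not yet initialized (e.g. SKSReady=True but fewer than
// min-scale replicas are ready) instead of letting the Revision go Ready.
func propagateResourcesAvailable(rs *v1.RevisionStatus, ps *autoscalingv1alpha1.PodAutoscalerStatus) {
	if !ps.IsScaleTargetInitialized() {
		rs.MarkResourcesAvailableUnknown("ScaleTargetNotInitialized", "waiting for the initial scale to be reached")
		return
	}
	rs.MarkResourcesAvailableTrue()
	rs.MarkContainerHealthyTrue()
}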

Successfully merging this pull request may close these issues:

Rollout of new revision will kill old pods at the same time