Send success for pods for a deployment when its rolled out successfully #6534

tejal29 · 2021-08-27T20:51:03Z

Fixes https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277

I wanted to implement this approach last time in #6517.
However, I decided to not and rely on kubectl deployment rollout status to return success if pods are up and healthy.

I did try to investigate more on why kubectl rollout status deployment is returning a success and pods are unhealthy.
As per the docs, this should not happen https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

I am wondering if this could be due to fluctuating network.

From, event logs @vincentjocodes provided here https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277#issuecomment-906806476, it seems the pod was created, unhealthy, successful (probably health check passed) and then unhealthy again.

See this

(Look at line 301, 337, 389 (https://paste.googleplex.com/6664979378864128). The pod/ledgerwriter-74c84f5c7c-d7dh2 goes from 'InProgress' to 'Succeeded' to 'InProgress'. Doesn't resolve either 'Succeeded' or 'Failed'.)

The current logic,

If deployment rolled out successfully, we go fetch pods.
It could be that, the pod passed the health check, deployment became successful and then due to network connectivity issues pod failed health check again.
(The health check failure in this case was due to connection failure)

{"timestamp":"2021-08-26T23:13:41.666277Z","statusCheckSubtaskEvent":{"id":"pod/ledgerwriter-74c84f5c7c-d7dh2","task_id":"Deploy-1",
"resource":"pod/ledgerwriter-74c84f5c7c-d7dh2","
status":"InProgress",
"message":"Readiness probe failed: Get http://10.0.5.14:8080/ready: dial tcp 10.0.5.14:8080: connect: connection refused","statusCode":"STATUSCHECK_UNHEALTHY","actionableErr":{"errCode":"STATUSCHECK_UNHEALTHY","message":"Readiness probe failed: Get http://10.0.5.14:8080/ready: dial tcp 10.0.5.14:8080: connect: connection refused","suggestions":[{"suggestionCode":"CHECK_READINESS_PROBE","action":"Try checking container config `readinessProbe`"}]}}}

Solution:

If deployment is successful i.e. kubectl rollout status dep/ledgerwriter returns "deployment successfully rolled out`, send success event for all pods in the deployment.
This will ensure, we don't see the pending sign on VSC UI due to health checks instability.
Update the test to not expect a Validator run once deployment is successful.

Side effects:

Depending on when a pod becomes healthy, we might see multiple pod successful event. See here The IDEs would not have any side effect due to this. /cc @etanshaul
If from last time skaffold fetched pods on a pending deployment, a new pod was allocated and deployment rolled out successfully, then the new pod will be missed.
- A new pod would be allocated if a node on which previous pod was running died, or a deployment was updated due to kubectl apply or pod was killed due to too many retrials. However, all these scenarios are rare. Also note, the polling period is 1s.

MarlonGamez · 2021-08-27T21:32:26Z

pkg/skaffold/kubernetes/status/resource/deployment.go

+	if ae.ErrCode == proto.StatusCode_STATUSCHECK_SUCCESS {
+		for _, pod := range d.pods {
+			eventV2.ResourceStatusCheckEventCompletedMessage(
+				pod.String(),
+				fmt.Sprintf("%s %s: running.\n", tabHeader, pod.String()),
+				protoV2.ActionableErr{ErrCode: proto.StatusCode_STATUSCHECK_SUCCESS},
+			)
+		}
+		return


does this mean we'll send duplicate pod success messages since we're also sending them in fetchPods() func?

Yes, I think this would be the case unless I am confusing something. From the fetchPods() snippet here - https://github.com/GoogleContainerTools/skaffold/blob/main/pkg/skaffold/kubernetes/status/resource/deployment.go#L298-L303:

case proto.StatusCode_STATUSCHECK_SUCCESS: event.ResourceStatusCheckEventCompleted(p.String(), p.ActionableError()) eventV2.ResourceStatusCheckEventCompletedMessage( p.String(), fmt.Sprintf("%s running.\n", prefix), sErrors.V2fromV1(p.ActionableError()))

I believe we also send the same ResourceStatusCheckEventCompletedMessage event there

Yes. I agree there could be duplicate sucessful events depending upon timings.
e.g.

dep check -> waiting for rollout

pod check -> unhealhty
skaffold send a resource in progress event for pod
3 ) pod becomes healthy

Status check sleeps for 100 ms

dep rollout status -> successful.
skaffold send a resource complete event for pod

however in case where.

dep check -> waiting for rollout

pod becomes healthy and running

pod check -> returns succes.
skaffold sends a resource complete event for pod

Status check sleeps for 100 ms

dep rollout status -> successful.
skaffold send a another resource complete event for pod

It should be fine to send 2 successful resource complete event for a pod. It would be a no-op on the IDE side as the UI node was already marked green.

codecov · 2021-08-27T21:41:47Z

Codecov Report

Merging #6534 (17cb3bc) into main (cfe7882) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #6534      +/-   ##
==========================================
+ Coverage   70.44%   70.47%   +0.02%     
==========================================
  Files         515      515              
  Lines       23144    23150       +6     
==========================================
+ Hits        16303    16314      +11     
+ Misses       5785     5779       -6     
- Partials     1056     1057       +1

Impacted Files	Coverage Δ
pkg/skaffold/kubernetes/status/status_check.go	`52.15% <ø> (-0.51%)`	⬇️
.../skaffold/kubernetes/status/resource/deployment.go	`85.79% <100.00%> (+1.94%)`	⬆️
pkg/diag/validator/resource.go	`47.05% <0.00%> (-11.77%)`	⬇️
pkg/skaffold/docker/image.go	`65.56% <0.00%> (-1.00%)`	⬇️
pkg/skaffold/event/v2/status_check.go	`100.00% <0.00%> (+14.54%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfe7882...17cb3bc. Read the comment docs.

aaron-prindle · 2021-08-27T23:37:35Z

Seems there is gofmt error from Travis logs:

Gofmt errors in files:
./pkg/skaffold/kubernetes/status/resource/deployment_test.go
diff -u ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig ./pkg/skaffold/kubernetes/status/resource/deployment_test.go
--- ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig	2021-08-27 21:25:49.084829549 +0000
+++ ./pkg/skaffold/kubernetes/status/resource/deployment_test.go	2021-08-27 21:25:49.084829549 +0000
@@ -166,7 +166,7 @@
 	}
 	for _, test := range tests {
 		testutil.Run(t, test.description, func(t *testutil.T) {
-			ae := parseKubectlRolloutError(test.details, 10 * time.Second, test.err)
+			ae := parseKubectlRolloutError(test.details, 10*time.Second, test.err)
 			t.CheckDeepEqual(test.expectedAe, ae)
 		})
 	}
FAILED hack/gofmt.sh

tejal29 · 2021-08-28T02:31:40Z

Seems there is gofmt error from Travis logs:

Gofmt errors in files:
./pkg/skaffold/kubernetes/status/resource/deployment_test.go
diff -u ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig ./pkg/skaffold/kubernetes/status/resource/deployment_test.go
--- ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig	2021-08-27 21:25:49.084829549 +0000
+++ ./pkg/skaffold/kubernetes/status/resource/deployment_test.go	2021-08-27 21:25:49.084829549 +0000
@@ -166,7 +166,7 @@
 	}
 	for _, test := range tests {
 		testutil.Run(t, test.description, func(t *testutil.T) {
-			ae := parseKubectlRolloutError(test.details, 10 * time.Second, test.err)
+			ae := parseKubectlRolloutError(test.details, 10*time.Second, test.err)
 			t.CheckDeepEqual(test.expectedAe, ae)
 		})
 	}
FAILED hack/gofmt.sh

Thanks I have to fix my go version :(

MarlonGamez

lgtm, thanks for clarification on duplicate items

aaron-prindle

LGTM

Send success for pods if deployment is successfule

b2839e4

tejal29 requested a review from a team as a code owner August 27, 2021 20:51

tejal29 requested a review from aaron-prindle August 27, 2021 20:51

pull-request-size bot added the size/M label Aug 27, 2021

google-cla bot added the cla: yes label Aug 27, 2021

fix test

bda9eef

MarlonGamez reviewed Aug 27, 2021

View reviewed changes

fix lint

17cb3bc

MarlonGamez approved these changes Aug 30, 2021

View reviewed changes

aaron-prindle approved these changes Aug 30, 2021

View reviewed changes

MarlonGamez merged commit 290280e into GoogleContainerTools:main Aug 31, 2021

MarlonGamez mentioned this pull request Aug 31, 2021

Use ResourceStatusCheckEvent[Updated/Completed] where appropriate #6550

Merged

tejal29 deleted the send_success_events_for_pods branch November 9, 2021 20:25

sethrylan mentioned this pull request Nov 29, 2022

fix: add v2 to module path #8169

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Send success for pods for a deployment when its rolled out successfully #6534

Send success for pods for a deployment when its rolled out successfully #6534

tejal29 commented Aug 27, 2021 •

edited

MarlonGamez Aug 27, 2021

aaron-prindle Aug 27, 2021 •

edited

tejal29 Aug 28, 2021 •

edited

codecov bot commented Aug 27, 2021 •

edited

aaron-prindle commented Aug 27, 2021

tejal29 commented Aug 28, 2021

MarlonGamez left a comment

aaron-prindle left a comment

Send success for pods for a deployment when its rolled out successfully #6534

Send success for pods for a deployment when its rolled out successfully #6534

Conversation

tejal29 commented Aug 27, 2021 • edited

MarlonGamez Aug 27, 2021

Choose a reason for hiding this comment

aaron-prindle Aug 27, 2021 • edited

Choose a reason for hiding this comment

tejal29 Aug 28, 2021 • edited

Choose a reason for hiding this comment

codecov bot commented Aug 27, 2021 • edited

Codecov Report

aaron-prindle commented Aug 27, 2021

tejal29 commented Aug 28, 2021

MarlonGamez left a comment

Choose a reason for hiding this comment

aaron-prindle left a comment

Choose a reason for hiding this comment

tejal29 commented Aug 27, 2021 •

edited

aaron-prindle Aug 27, 2021 •

edited

tejal29 Aug 28, 2021 •

edited

codecov bot commented Aug 27, 2021 •

edited