Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send success for pods for a deployment when its rolled out successfully #6534

Merged

Conversation

tejal29
Copy link
Member

@tejal29 tejal29 commented Aug 27, 2021

Fixes https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277

I wanted to implement this approach last time in #6517.
However, I decided to not and rely on kubectl deployment rollout status to return success if pods are up and healthy.

I did try to investigate more on why kubectl rollout status deployment is returning a success and pods are unhealthy.
As per the docs, this should not happen https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

I am wondering if this could be due to fluctuating network.

From, event logs @vincentjocodes provided here https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277#issuecomment-906806476, it seems the pod was created, unhealthy, successful (probably health check passed) and then unhealthy again.

See this

(Look at line 301, 337, 389 (https://paste.googleplex.com/6664979378864128). The pod/ledgerwriter-74c84f5c7c-d7dh2 goes from 'InProgress' to 'Succeeded' to 'InProgress'. Doesn't resolve either 'Succeeded' or 'Failed'.)

The current logic,

  1. If deployment rolled out successfully, we go fetch pods.
    It could be that, the pod passed the health check, deployment became successful and then due to network connectivity issues pod failed health check again.
    (The health check failure in this case was due to connection failure)
{"timestamp":"2021-08-26T23:13:41.666277Z","statusCheckSubtaskEvent":{"id":"pod/ledgerwriter-74c84f5c7c-d7dh2","task_id":"Deploy-1",
"resource":"pod/ledgerwriter-74c84f5c7c-d7dh2","
status":"InProgress",
"message":"Readiness probe failed: Get http://10.0.5.14:8080/ready: dial tcp 10.0.5.14:8080: connect: connection refused","statusCode":"STATUSCHECK_UNHEALTHY","actionableErr":{"errCode":"STATUSCHECK_UNHEALTHY","message":"Readiness probe failed: Get http://10.0.5.14:8080/ready: dial tcp 10.0.5.14:8080: connect: connection refused","suggestions":[{"suggestionCode":"CHECK_READINESS_PROBE","action":"Try checking container config `readinessProbe`"}]}}}

Solution:

  1. If deployment is successful i.e. kubectl rollout status dep/ledgerwriter returns "deployment successfully rolled out`, send success event for all pods in the deployment.
    This will ensure, we don't see the pending sign on VSC UI due to health checks instability.
  2. Update the test to not expect a Validator run once deployment is successful.

Side effects:

  1. Depending on when a pod becomes healthy, we might see multiple pod successful event. See here The IDEs would not have any side effect due to this. /cc @etanshaul
  2. If from last time skaffold fetched pods on a pending deployment, a new pod was allocated and deployment rolled out successfully, then the new pod will be missed.
    • A new pod would be allocated if a node on which previous pod was running died, or a deployment was updated due to kubectl apply or pod was killed due to too many retrials. However, all these scenarios are rare. Also note, the polling period is 1s.

@tejal29 tejal29 requested a review from a team as a code owner August 27, 2021 20:51
@google-cla google-cla bot added the cla: yes label Aug 27, 2021
Comment on lines +144 to +152
if ae.ErrCode == proto.StatusCode_STATUSCHECK_SUCCESS {
for _, pod := range d.pods {
eventV2.ResourceStatusCheckEventCompletedMessage(
pod.String(),
fmt.Sprintf("%s %s: running.\n", tabHeader, pod.String()),
protoV2.ActionableErr{ErrCode: proto.StatusCode_STATUSCHECK_SUCCESS},
)
}
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean we'll send duplicate pod success messages since we're also sending them in fetchPods() func?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would be the case unless I am confusing something. From the fetchPods() snippet here - https://github.com/GoogleContainerTools/skaffold/blob/main/pkg/skaffold/kubernetes/status/resource/deployment.go#L298-L303:

			case proto.StatusCode_STATUSCHECK_SUCCESS:
				event.ResourceStatusCheckEventCompleted(p.String(), p.ActionableError())
				eventV2.ResourceStatusCheckEventCompletedMessage(
					p.String(),
					fmt.Sprintf("%s running.\n", prefix),
					sErrors.V2fromV1(p.ActionableError()))

I believe we also send the same ResourceStatusCheckEventCompletedMessage event there

Copy link
Member Author

@tejal29 tejal29 Aug 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I agree there could be duplicate sucessful events depending upon timings.
e.g.

  1. dep check -> waiting for rollout
  2. pod check -> unhealhty
    skaffold send a resource in progress event for pod
    3 ) pod becomes healthy
  3. Status check sleeps for 100 ms
  4. dep rollout status -> successful.
    skaffold send a resource complete event for pod

however in case where.

  1. dep check -> waiting for rollout
  2. pod becomes healthy and running
  3. pod check -> returns succes.
    skaffold sends a resource complete event for pod
  4. Status check sleeps for 100 ms
  5. dep rollout status -> successful.
    skaffold send a another resource complete event for pod

It should be fine to send 2 successful resource complete event for a pod. It would be a no-op on the IDE side as the UI node was already marked green.

@codecov
Copy link

codecov bot commented Aug 27, 2021

Codecov Report

Merging #6534 (17cb3bc) into main (cfe7882) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6534      +/-   ##
==========================================
+ Coverage   70.44%   70.47%   +0.02%     
==========================================
  Files         515      515              
  Lines       23144    23150       +6     
==========================================
+ Hits        16303    16314      +11     
+ Misses       5785     5779       -6     
- Partials     1056     1057       +1     
Impacted Files Coverage Δ
pkg/skaffold/kubernetes/status/status_check.go 52.15% <ø> (-0.51%) ⬇️
.../skaffold/kubernetes/status/resource/deployment.go 85.79% <100.00%> (+1.94%) ⬆️
pkg/diag/validator/resource.go 47.05% <0.00%> (-11.77%) ⬇️
pkg/skaffold/docker/image.go 65.56% <0.00%> (-1.00%) ⬇️
pkg/skaffold/event/v2/status_check.go 100.00% <0.00%> (+14.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfe7882...17cb3bc. Read the comment docs.

@aaron-prindle
Copy link
Contributor

Seems there is gofmt error from Travis logs:

Gofmt errors in files:
./pkg/skaffold/kubernetes/status/resource/deployment_test.go
diff -u ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig ./pkg/skaffold/kubernetes/status/resource/deployment_test.go
--- ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig	2021-08-27 21:25:49.084829549 +0000
+++ ./pkg/skaffold/kubernetes/status/resource/deployment_test.go	2021-08-27 21:25:49.084829549 +0000
@@ -166,7 +166,7 @@
 	}
 	for _, test := range tests {
 		testutil.Run(t, test.description, func(t *testutil.T) {
-			ae := parseKubectlRolloutError(test.details, 10 * time.Second, test.err)
+			ae := parseKubectlRolloutError(test.details, 10*time.Second, test.err)
 			t.CheckDeepEqual(test.expectedAe, ae)
 		})
 	}
FAILED hack/gofmt.sh

@tejal29
Copy link
Member Author

tejal29 commented Aug 28, 2021

Seems there is gofmt error from Travis logs:

Gofmt errors in files:
./pkg/skaffold/kubernetes/status/resource/deployment_test.go
diff -u ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig ./pkg/skaffold/kubernetes/status/resource/deployment_test.go
--- ./pkg/skaffold/kubernetes/status/resource/deployment_test.go.orig	2021-08-27 21:25:49.084829549 +0000
+++ ./pkg/skaffold/kubernetes/status/resource/deployment_test.go	2021-08-27 21:25:49.084829549 +0000
@@ -166,7 +166,7 @@
 	}
 	for _, test := range tests {
 		testutil.Run(t, test.description, func(t *testutil.T) {
-			ae := parseKubectlRolloutError(test.details, 10 * time.Second, test.err)
+			ae := parseKubectlRolloutError(test.details, 10*time.Second, test.err)
 			t.CheckDeepEqual(test.expectedAe, ae)
 		})
 	}
FAILED hack/gofmt.sh

Thanks I have to fix my go version :(

Copy link
Contributor

@MarlonGamez MarlonGamez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for clarification on duplicate items

Copy link
Contributor

@aaron-prindle aaron-prindle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MarlonGamez MarlonGamez merged commit 290280e into GoogleContainerTools:main Aug 31, 2021
@tejal29 tejal29 deleted the send_success_events_for_pods branch November 9, 2021 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants