Send success for pods of a deployment when it's rolled out successfully #6534
Conversation
```go
if ae.ErrCode == proto.StatusCode_STATUSCHECK_SUCCESS {
	// The deployment rolled out successfully, so mark every pod it owns
	// as complete instead of waiting on another pod-level check.
	for _, pod := range d.pods {
		eventV2.ResourceStatusCheckEventCompletedMessage(
			pod.String(),
			fmt.Sprintf("%s %s: running.\n", tabHeader, pod.String()),
			protoV2.ActionableErr{ErrCode: proto.StatusCode_STATUSCHECK_SUCCESS},
		)
	}
	return
}
```
Does this mean we'll send duplicate pod success messages, since we're also sending them in the fetchPods() func?
Yes, I think this would be the case, unless I am confusing something. From the fetchPods() snippet here, https://github.com/GoogleContainerTools/skaffold/blob/main/pkg/skaffold/kubernetes/status/resource/deployment.go#L298-L303:
```go
case proto.StatusCode_STATUSCHECK_SUCCESS:
	event.ResourceStatusCheckEventCompleted(p.String(), p.ActionableError())
	eventV2.ResourceStatusCheckEventCompletedMessage(
		p.String(),
		fmt.Sprintf("%s running.\n", prefix),
		sErrors.V2fromV1(p.ActionableError()))
```
I believe we also send the same ResourceStatusCheckEventCompletedMessage event there.
Yes, I agree there could be duplicate successful events depending on timing. For example:

1. Deployment check -> waiting for rollout.
2. Pod check -> unhealthy. Skaffold sends a resource-in-progress event for the pod.
3. Pod becomes healthy. The status check sleeps for 100 ms.
4. Deployment rollout status -> successful. Skaffold sends a resource-complete event for the pod.

However, in the case where:

1. Deployment check -> waiting for rollout.
2. Pod becomes healthy and running.
3. Pod check -> returns success. Skaffold sends a resource-complete event for the pod. The status check sleeps for 100 ms.
4. Deployment rollout status -> successful. Skaffold sends another resource-complete event for the pod.

It should be fine to send two successful resource-complete events for a pod. It would be a no-op on the IDE side, as the UI node was already marked green (see the sketch below).
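For illustration, here is a minimal sketch (hypothetical types and names, not the actual Cloud Code handler) of why the duplicate success event is harmless on the consumer side: once a resource is marked green, a second success event for it changes nothing.

```go
package main

import "fmt"

// statusTracker is a hypothetical stand-in for the IDE-side state
// backing the status UI.
type statusTracker struct {
	healthy map[string]bool // resource name -> already marked green
}

func (t *statusTracker) handleCompleted(resource string) {
	if t.healthy[resource] {
		// Duplicate success event: the UI node is already green, so no-op.
		return
	}
	t.healthy[resource] = true
	fmt.Printf("marking %s green\n", resource)
}

func main() {
	t := &statusTracker{healthy: map[string]bool{}}
	// In the second scenario above, both the pod-level check and the
	// deployment rollout check report success for the same pod.
	t.handleCompleted("pod/ledgerwriter-abc")
	t.handleCompleted("pod/ledgerwriter-abc") // second event: no-op
}
```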
Codecov Report
```
@@            Coverage Diff             @@
##             main    #6534      +/-   ##
==========================================
+ Coverage   70.44%   70.47%   +0.02%
==========================================
  Files         515      515
  Lines       23144    23150       +6
==========================================
+ Hits        16303    16314      +11
+ Misses       5785     5779       -6
- Partials     1056     1057       +1
```
Continue to review full report at Codecov.
It seems there is a gofmt error in the Travis logs.
Thanks, I have to fix my Go version :(
LGTM, thanks for the clarification on the duplicate items.
LGTM
Fixes https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277
I wanted to implement this approach last time, in #6517. However, I decided not to, and instead relied on `kubectl rollout status deployment` returning success once pods are up and healthy. I did try to investigate further why `kubectl rollout status deployment` returns success while pods are unhealthy. Per the docs, this should not happen: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment.
I am wondering if this could be due to a flaky network.
From the event logs @vincentjocodes provided here, https://github.com/GoogleCloudPlatform/cloud-code-vscode-internal/issues/5277#issuecomment-906806476, it seems the pod was created, went unhealthy, became successful (probably because the health check passed), and then went unhealthy again.
With the current logic, it could be that the pod passed the health check, the deployment became successful, and then, due to network connectivity issues, the pod failed the health check again. (The health check failure in this case was due to a connection failure.)
Solution: when `kubectl rollout status deploy/ledgerwriter` returns "deployment successfully rolled out", send a success event for all pods in the deployment. This ensures we don't see the pending sign in the VSC UI due to health-check instability.
Side effects: a pod we have marked successful could become unhealthy again afterwards, e.g. if another `kubectl apply` was run or the pod was killed due to too many retries. However, all these scenarios are rare. Also note, the polling period is 1s.