Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make sure status code error code is propogated to devloopEndEvent #4468

Merged
merged 6 commits into from
Jul 14, 2020

Conversation

tejal29
Copy link
Member

@tejal29 tejal29 commented Jul 14, 2020

As discovered when testing, the DevLoopEndEvent consumed by IDEs do not contain the failure seen in deploy phase.

In this PR,

  1. Add StatusCode to BiuldState, DeployState and StatusCheckState
  2. Make changes to deploy.StatusCheck to report first deployment with a pod error as StatusCheck.ErrorCode.
    ( in future, maybe we should create this as an array. One or more distinct error code can happen for multi-service deployments, e.g. ImagePullErr for "foo" deployment and PodUnhealthy for "bar" deployment)
  3. Finally DevLoopFailedInPhase now consumes the newly added StatusCode.

Other small changes:

  1. change the diag.validator.Pod to only fetch events when unknown error code is read.

@tejal29
Copy link
Member Author

tejal29 commented Jul 14, 2020

Testing on cloud code samples,

tejaldesai@tejaldesai-macbookpro2 go-hello-world (master)../../../skaffold/out/skaffold dev -d gcr.io/tejal-test 
Listing files to watch...
 - go-hello-world
Generating tags...
 - go-hello-world -> gcr.io/tejal-test/go-hello-world:latest
Checking cache...
 - go-hello-world: Found Remotely
Tags used in deployment:
 - go-hello-world -> gcr.io/tejal-test/go-hello-world:latest@sha256:03de98982cf604768ff1ff28948df5d9a0b6008eae55861c5ae3e45b54a0cf11
Starting deploy...
WARN[0001] image [go-hello-world] is not used by the deployment 
 - deployment.apps/go-hello-world created
 - service/go-hello-world-external created
Waiting for deployments to stabilize...
 - deployment/go-hello-world: waiting for rollout to finish: 0 of 1 updated replicas are available...
    - pod/go-hello-world-dcbdf9cdd-2zfrg: creating container server
 - deployment/go-hello-world: waiting for rollout to finish: 0 of 1 updated replicas are available...
    - pod/go-hello-world-dcbdf9cdd-2zfrg: container server is waiting to start: go-hello-world1 can't be pulled
 - deployment/go-hello-world failed. Error: could not stabilize within 10s: context deadline exceeded.
Cleaning up...
 - deployment.apps "go-hello-world" deleted
 - service "go-hello-world-external" deleted

Events:

tejaldesai@tejaldesai-macbookpro2 go-hello-world (master)curl localhost:50052/v1/event_log
{"result":{"timestamp":"2020-07-14T08:01:41.389844Z","event":{"metaEvent":{"entry":"Starting Skaffold: ...}

{"result":{"timestamp":"2020-07-14T08:01:41.389880Z",
     "event":{"devLoopEvent":{"status":"In Progress"}},
       "entry":"DevInit Iteration 0 in progress"}}
{"result":{"timestamp":"2020-07-14T08:01:42.691181Z",
      "event":{"deployEvent":{"status":"In Progress"}},"entry":"Deploy started"}}
{"result":{"timestamp":"2020-07-14T08:01:44.214254Z",
       "event":{"deployEvent":{"status":"Complete"}},"entry":"Deploy complete"}}
{"result":{"timestamp":"2020-07-14T08:01:44.214284Z","
     event":{"statusCheckEvent":{"status":"Started"}},"entry":"Status check started"}}
{"result":{"timestamp":"2020-07-14T08:01:45.133382Z","
        event":{"resourceStatusCheckEvent":{"resource":"pod/go-hello-world-dcbdf9cdd-2zfrg",
                      "status":"In Progress",
                       "message":"creating container server",
                     "statusCode":"STATUSCHECK_CONTAINER_CREATING",
                "actionableErr":{...},
                   "entry":"Resource pod/go-hello-world-dcbdf9cdd-2zfrg status updated to In Progress"}}
... 
{"result":{"timestamp":"2020-07-14T08:01:46.553280Z",
        "event":{"resourceStatusCheckEvent":{
                  "resource":"pod/go-hello-world-dcbdf9cdd-2zfrg",
                  "status":"Failed","err":"container server is waiting to start: go-hello-world1 can't be pulled",
                "statusCode":"STATUSCHECK_IMAGE_PULL_ERR",
          "actionableErr":{"errCode":"STATUSCHECK_IMAGE_PULL_ERR",
             "message":"container server is waiting to start: go-hello-world1 can't be pulled",
            "suggestions":[{"suggestionCode":"CHECK_CONTAINER_IMAGE",
         "action":"Try checking container config `image`"}]}}},
         "entry":"Resource pod/go-hello-world-dcbdf9cdd-2zfrg status failed with container server is waiting to start: go-hello-world1 can't be pulled"}}

....
{"result":{"timestamp":"2020-07-14T08:01:54.444497Z",
        "event":{"resourceStatusCheckEvent":{
                  "resource":"deployment/go-hello-world",
                 "status":"Failed",
                 "err":"could not stabilize within 10s: context deadline exceeded",
                "statusCode":"STATUSCHECK_DEADLINE_EXCEEDED",
             "actionableErr":{"errCode":"STATUSCHECK_DEADLINE_EXCEEDED",
                 "message":"could not stabilize within 10s: context deadline exceeded"}}},
      "entry":"Resource deployment/go-hello-world status failed with could not stabilize within 10s: context deadline exceeded"}}

{"result":{"timestamp":"2020-07-14T08:01:54.444606Z",
        "event":{"statusCheckEvent":{"status":"Failed",
              "err":"1/1 deployment(s) failed",
               "errCode":"STATUSCHECK_IMAGE_PULL_ERR",
          "actionableErr":{"errCode":"STATUSCHECK_IMAGE_PULL_ERR",
               "message":"1/1 deployment(s) failed"}}},"entry":"Status check failed"}}

{"result":{"timestamp":"2020-07-14T08:01:54.445727Z",
           "event":{"devLoopEvent":{
                  "status":"Failed",
                  "err":{"errCode":"STATUSCHECK_IMAGE_PULL_ERR",
                    "message":"1/1 deployment(s) failed"}}},
           "entry":"DevInit Iteration 0 failed with error code STATUSCHECK_IMAGE_PULL_ERR"}}

curl: (18) transfer closed with outstanding read data remaining

@tejal29
Copy link
Member Author

tejal29 commented Jul 14, 2020

Readiness Healthcheck Test

  1. Add a bogus healthcheck
tejaldesai@tejaldesai-macbookpro2 go-hello-world (master)cat kubernetes-manifests/hello.deployment.yaml 
 ....
    spec:
      containers:
      - name: server
        image: go-hello-world
        ports:
        - containerPort: 8080
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
  1. Run skaffold dev
../../../skaffold/out/skaffold dev -d gcr.io/tejal-test 
  1. Event stream
{"result":{"timestamp":"2020-07-14T09:01:31.838816Z",
   "event":{"metaEvent":
          {"entry":"Starting Skaffold: \u0026{Version:v1.12.0-40-g746f3996a ConfigVersion:skaffold/v2beta6 .....

{"result":{"timestamp":"2020-07-14T09:01:36.421384Z",
   "event":{"deployEvent":{"status":"Complete"}},"entry":"Deploy complete"}}
{"result":{"timestamp":"2020-07-14T09:01:36.421451Z",
    "event":{"statusCheckEvent":{"status":"Started"}},"entry":"Status check started"}}

{"result":{"timestamp":"2020-07-14T09:01:38.124380Z",
    "event":{"resourceStatusCheckEvent":{"resource":"deployment/go-hello-world","status":"In Progress","message":"waiting for rollout to finish: 0 of 1 updated replicas are available...\n",
   "statusCode":"STATUSCHECK_DEPLOYMENT_ROLLOUT_PENDING","actionableErr":{"errCode":"STATUSCHECK_DEPLOYMENT_ROLLOUT_PENDING","message":"waiting for rollout to finish: 0 of 1 updated replicas are available...\n"}}},"entry":"Resource deployment/go-hello-world status updated to In Progress"}}

{"result":{"timestamp":"2020-07-14T09:01:38.645748Z","event":{"resourceStatusCheckEvent":{"resource":"pod/go-hello-world-7b5869b68b-fhh9d",
 "status":"Succeeded","message":"Succeeded",
 "statusCode":"STATUSCHECK_SUCCESS"}},"entry":"Resource pod/go-hello-world-7b5869b68b-fhh9d status completed successfully"}}
...

{"result":{"timestamp":"2020-07-14T09:01:49.218603Z",
   "event":{"resourceStatusCheckEvent":{"resource":"pod/go-hello-world-7b5869b68b-fhh9d",
    "status":"Failed",
      "err":"Readiness probe failed: cat: /tmp/healthy: No such file or directory\n",
     "statusCode":"STATUSCHECK_UNHEALTHY",
     "actionableErr":{"errCode":"STATUSCHECK_UNHEALTHY","message":"Readiness probe failed: cat: /tmp/healthy: No such file or directory\n",
   "suggestions":[
             {"suggestionCode":"CHECK_READINESS_PROBE",
             "action":"Try checking container config `readinessProbe`"}
   ]}}},
    "entry":"Resource pod/go-hello-world-7b5869b68b-fhh9d status failed with Readiness probe failed: cat: /tmp/healthy: No such file or directory\n"}}
....

{"result":{"timestamp":"2020-07-14T09:03:37.224202Z",
  "event":{"statusCheckEvent":{
                 "status":"Failed","err":"1/1 deployment(s) failed",
                "errCode":"STATUSCHECK_UNHEALTHY",
                 "actionableErr":{"errCode":"STATUSCHECK_UNHEALTHY","message":"1/1 deployment(s) failed"}}},"entry":"Status check failed"}}

{"result":{"timestamp":"2020-07-14T09:03:37.224612Z",
          "event":{
           "devLoopEvent":{
                 "status":"Failed",
                  "err":{
                     "errCode":"STATUSCHECK_UNHEALTHY",
                      "message":"1/1 deployment(s) failed"}}},
   "entry":"DevInit Iteration 0 failed with error code STATUSCHECK_UNHEALTHY"}}

@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #4468 into master will increase coverage by 0.06%.
The diff coverage is 73.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4468      +/-   ##
==========================================
+ Coverage   72.26%   72.33%   +0.06%     
==========================================
  Files         331      332       +1     
  Lines       12863    12929      +66     
==========================================
+ Hits         9296     9352      +56     
- Misses       2975     2981       +6     
- Partials      592      596       +4     
Impacted Files Coverage Δ
pkg/diag/validator/pod.go 2.89% <0.00%> (-0.07%) ⬇️
pkg/skaffold/deploy/status_check.go 51.74% <53.84%> (+0.68%) ⬆️
pkg/skaffold/event/event.go 90.62% <80.95%> (-0.22%) ⬇️
pkg/skaffold/deploy/resource/deployment.go 78.33% <91.66%> (+1.48%) ⬆️
pkg/skaffold/errors/errors.go 78.57% <0.00%> (-4.77%) ⬇️
...affold/kubernetes/portforward/kubectl_forwarder.go 60.97% <0.00%> (-2.44%) ⬇️
pkg/skaffold/build/buildpacks/lifecycle.go 82.71% <0.00%> (-1.04%) ⬇️
cmd/skaffold/app/cmd/flags.go 86.36% <0.00%> (ø)
pkg/skaffold/config/options.go 100.00% <0.00%> (ø)
pkg/skaffold/build/buildpacks/fetcher.go 64.28% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6af3198...f1a9f52. Read the comment docs.

return proto.StatusCode_STATUSCHECK_SUCCESS
}

func (d *Deployment) WithPodStatuses(scs []proto.StatusCode) *Deployment {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test function that shouldn't be here?

pkg/skaffold/event/event_test.go Outdated Show resolved Hide resolved
@@ -195,12 +206,15 @@ func statusCheckEventSucceeded() {
})
}

func statusCheckEventFailed(err error) {
aiErr := sErrors.ActionableErr(sErrors.StatusCheck, err)
func statusCheckEventFailed(err error, errCode proto.StatusCode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ordering is inconsistent with StatusCheckEventEnded()

@@ -299,8 +319,22 @@ func DevLoopFailedWithErrorCode(i int, statusCode proto.StatusCode, err error) {

// DevLoopFailed notifies that a dev loop has failed in a given phase
func DevLoopFailedInPhase(iteration int, phase sErrors.Phase, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to improve this. I think we should be aiming to generate the StatusCode at the error-raising time so that we're not guessing.

@briandealwis
Copy link
Member

Last commit removes spurious empty logs output that results from no log out or output with a trailing newline:

 - deployment/leeroy-app: waiting for rollout to finish: 0 of 1 updated replicas are available...
    - pod/leeroy-app-549b49dfb4-2q7qw: container leeroy-app is backing off waiting to restart
[leeroy-app-549b49dfb4-2q7qw leeroy-app] 

@briandealwis briandealwis merged commit 58f3319 into GoogleContainerTools:master Jul 14, 2020
briandealwis added a commit that referenced this pull request Jul 14, 2020
Also avoid spurious empty lines in log output

Co-authored-by: Brian de Alwis <bdealwis@google.com>
briandealwis added a commit that referenced this pull request Jul 14, 2020
Also avoid spurious empty lines in log output

Co-authored-by: Brian de Alwis <bdealwis@google.com>
@briandealwis briandealwis mentioned this pull request Jul 14, 2020
@tejal29 tejal29 deleted the fix_devloop_end branch April 15, 2021 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants