500 error on badly cleaned up containers/workers #3588

Closed
enugentdt opened this issue Mar 25, 2019 · 9 comments

enugentdt commented Mar 25, 2019

Bug Report

I restarted all of my Concourse nodes to apply v5.0.1, but did it badly, and ended up just SIGKILL'ing everything (I'm sorry!). Now that stuff is rebooted, almost every resource is causing this error in Concourse:

Mar 25 17:53:54 web1.concourse.stm.inf.demilletech.net concourse[12561]: {"timestamp":"2019-03-25T21:53:54.670443499Z","level":"error","source":"atc","message":"atc.pipelines.radar.failed-to-run-scan-resource","data":{"error":"Backend error: Exit status: 500, message: {\"Type\":\"\",\"Message\":\"exit status 2\",\"Handle\":\"\",\"ProcessID\":\"\",\"Binary\":\"\"}\n","pipeline":"api-server","session":"18.5","team":"isoscribe"}}
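
If you want to pull the same errors out of your own web node's logs, something like this works, assuming the web node runs as a systemd unit named concourse-web (that unit name is an assumption; adjust it to your install):

# Show recent resource-scan failures from the ATC; the unit name is assumed.
journalctl -u concourse-web --since "1 hour ago" | grep failed-to-run-scan-resource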

Steps to Reproduce

  1. Create a pipeline
  2. Make it do stuff
  3. SIGKILL everything that has to do with the workers (one way to do this is sketched after this list)
  4. Start the nodes back up
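
One way to do step 3, assuming the worker runs as a systemd unit named concourse-worker (the unit name is an assumption; adjust to your setup):

# Hard-kill the worker so nothing gets a chance to clean up
# (this is the "badly cleaned up" part). The unit name is assumed.
sudo systemctl kill --signal=SIGKILL concourse-worker
# Or, if it isn't under systemd, kill the worker process directly:
sudo pkill -9 -f "concourse worker"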

Expected Results

Not this

Actual Results

This

Version Info

  • Concourse version: 5.0.1
  • Deployment type (BOSH/Docker/binary): Binary
  • Infrastructure/IaaS: VMware
  • Browser (if applicable): Chrome
  • Did this used to work? Yes
enugentdt (Author) commented:

Update: found these fun logs in the worker.

Mar 25 18:04:16 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:16.967840789Z","level":"info","source":"guardian","message":"guardian.api.garden-server.get-properties.got-properties","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","session":"3.1.3547"}}
Mar 25 18:04:16 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:16.970850780Z","level":"info","source":"guardian","message":"guardian.run.started","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","path":"/opt/resource/check","session":"2457"}}
Mar 25 18:04:16 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:16.970924230Z","level":"info","source":"guardian","message":"guardian.run.exec.start","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","path":"/opt/resource/check","session":"2457.2"}}
Mar 25 18:04:16 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:16.982503296Z","level":"info","source":"guardian","message":"guardian.run.exec.execrunner.start","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","path":"/opt/resource/check","session":"2457.2.2"}}
Mar 25 18:04:16 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:16.985502348Z","level":"info","source":"guardian","message":"guardian.run.exec.execrunner.read-exit-fd","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","path":"/opt/resource/check","session":"2457.2.2"}}
Mar 25 18:04:17 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:17.028544812Z","level":"info","source":"guardian","message":"guardian.run.exec.execrunner.runc-exit-status","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","path":"/opt/resource/check","session":"2457.2.2","status":0}}
Mar 25 18:04:17 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:17.028653502Z","level":"info","source":"guardian","message":"guardian.run.exec.execrunner.done","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","path":"/opt/resource/check","session":"2457.2.2"}}
Mar 25 18:04:17 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:17.028675807Z","level":"info","source":"guardian","message":"guardian.run.exec.finished","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","path":"/opt/resource/check","session":"2457.2"}}
Mar 25 18:04:17 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:17.028692734Z","level":"info","source":"guardian","message":"guardian.run.finished","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","path":"/opt/resource/check","session":"2457"}}
Mar 25 18:04:17 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:17.028712947Z","level":"info","source":"guardian","message":"guardian.api.garden-server.run.spawned","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","session":"3.1.3548","spec":{"Path":"/opt/resource/check","Dir":"","User":"root","Limits":{},"TTY":null}}}
Mar 25 18:04:18 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:04:18.614739424Z","level":"info","source":"guardian","message":"guardian.api.garden-server.run.exited","data":{"handle":"1deea1d4-508d-4da6-5cd3-33512f1e6744","id":"3fd48814-8ccc-42c1-63cb-e7c4e4957cb0","session":"3.1.3548","status":0}}

vito (Member) commented Mar 26, 2019

Sorry, but without verifiable steps to reproduce, expected/actual results, etc., this isn't really a bug report. It looks like something is definitely wrong, but this is better off in our support forums or in Discord, as it will take a bit of digging to get to the bottom of it. From the logs pasted it's not super clear where the bug is (the logs in the second comment show things working normally, if not verbosely).

This may be helped by #3079 though, which we're working on.

vito added the support label Mar 26, 2019

support bot commented Mar 26, 2019

👋 @enugentdt, we use the issue tracker exclusively for bug reports and feature requests. However, this issue appears to be a support request. Please ask in the support forums or in Discord instead of opening GitHub issues.

support bot closed this as completed Mar 26, 2019

fiftin commented Mar 28, 2019

I have a similar error after changing DNS settings and restarting the server.
v5.0.1
Ubuntu 18.04
AWS

enugentdt (Author) commented:

I just realized I attached the wrong logs... 🤦‍♂️ I'm sorry about that. Here are the real logs:

Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.940735079Z","level":"info","source":"guardian","message":"guardian.list-containers.starting","data":{"session":"2448"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.941216346Z","level":"info","source":"guardian","message":"guardian.list-containers.finished","data":{"session":"2448"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.951915374Z","level":"info","source":"guardian","message":"guardian.api.garden-server.get-properties.got-properties","data":{"handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","session":"3.1.3535"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.954830475Z","level":"info","source":"guardian","message":"guardian.run.started","data":{"handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","path":"/opt/resource/check","session":"2449"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.954890434Z","level":"info","source":"guardian","message":"guardian.run.exec.start","data":{"handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","path":"/opt/resource/check","session":"2449.2"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.968447637Z","level":"error","source":"guardian","message":"guardian.run.exec.create-workdir-failed","data":{"error":"exit status 2","handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","path":"/opt/resource/check","session":"2449.2"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.968511963Z","level":"info","source":"guardian","message":"guardian.run.exec.finished","data":{"handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","path":"/opt/resource/check","session":"2449.2"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.968531217Z","level":"info","source":"guardian","message":"guardian.run.finished","data":{"handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","path":"/opt/resource/check","session":"2449"}}
Mar 25 18:03:55 worker3 concourse[23925]: {"timestamp":"2019-03-25T22:03:55.968550597Z","level":"error","source":"guardian","message":"guardian.api.garden-server.run.failed","data":{"error":"exit status 2","handle":"ca6c871f-0c4d-47fd-452d-d76a058d4e3d","session":"3.1.3536"}}

What's interesting is that it exited with "exit status 2." My guess is that, yeah, #3079 would fix this. It looks like Concourse expects the container to still exist, runs a command against that "existing" container, and fails for obvious reasons.

Sorry again about attaching the wrong logs!
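
For anyone hitting the same thing, a rough workaround sketch (not something confirmed in this thread; the fly target "ci" and the work dir path are placeholders, and worker3 is just the hostname from the logs above): make the web node forget the stale containers by pruning the affected worker and bringing it back with a clean work dir.

# List workers and their states; "ci" is a placeholder fly target.
fly -t ci workers

# Stop the worker process on the affected host; once it stalls, prune its
# record so the web node drops the containers it still thinks exist there.
fly -t ci prune-worker -w worker3

# On the worker host, wipe the old container/volume state before restarting,
# so Garden and the web node start from a consistent view. Use whatever
# --work-dir the worker was started with (the path below is assumed).
sudo rm -rf /opt/concourse/worker/*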

enugentdt (Author) commented:

I just wanted to update the reproducing steps, as I'm able to make this happen consistently. Might this be worth reopening, @vito?

Steps to reproduce:

  1. Create a concourse web node and a concourse worker
  2. Create a pipeline, and have it run through at least once (could be just a simple check-put, or could be something existing)
  3. Hard poweroff the worker VM (one way to force this is sketched after this list)
  4. Bring the worker VM back online
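
One way to force the hard poweroff in step 3 from inside the guest (just an illustration; a hypervisor-level power-off works the same way):

# Enable sysrq and trigger an immediate power-off, skipping all shutdown
# hooks - the equivalent of pulling the plug on the VM.
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo o | sudo tee /proc/sysrq-trigger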

When the worker VM comes back up, this is the view you'll get:
[Screenshot (2019-04-04): pipeline view after the worker VM comes back up]

All the resources show some variation of the following error:

Backend error: Exit status: 500, message: {"Type":"","Message":"exit status 2","Handle":"","ProcessID":"","Binary":""}

But yes, #3079 will hopefully fix this. However, it has persisted since early v4, so maybe there's something lower in the stack that is upset? Or maybe it's a Garden issue, but I'm not too sure how Garden works in relation to Concourse.

vito (Member) commented Apr 10, 2019

@enugentdt Thanks for following up! I think I'll leave this closed anyway, though, since #3079 is already on our radar and will probably be our approach to fixing this in general. 👍

enugentdt (Author) commented:

No worries at all. Sounds good, and I will be looking forward to the release containing #3079!

Thanks

pizzapim commented Jun 8, 2023

For anybody stumbling onto this issue, I found a manual way of fixing the pipeline. For me, destroying and re-creating the pipeline does not fix the issue. Not sure why, but manually checking the resource while the pipeline is still paused after creation does fix it.
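
In fly terms, that manual fix looks roughly like this (the target "ci", the config file, and the resource name are placeholders; "api-server" is the pipeline name from the logs above):

# Set the pipeline; a newly set pipeline starts out paused.
fly -t ci set-pipeline -p api-server -c pipeline.yml
# Force a check on the broken resource while the pipeline is still paused.
fly -t ci check-resource -r api-server/my-resource
# Unpause once the check succeeds.
fly -t ci unpause-pipeline -p api-server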
