Return an error payload if run_async! fails #143

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

Member

@agrare agrare commented Nov 15, 2023

Currently, if the command to start the container fails, we simply raise the exception up to the caller rather than "handle" it by returning an error payload. This effectively aborts the workflow runtime without going through the normal failure paths that are already built in to handle a task failing after it has started.

This commit changes how these failures are handled in order to line up with how the rest of the Task error handling operates.
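
A minimal sketch of the change described above, not the actual Floe code. `launch_container`, `start_task`, and `LaunchError` are hypothetical stand-ins for the runner's start command and the exception it raises; only the shape of the returned payload is taken from this PR.

```ruby
# Hypothetical stand-in for the exception raised when the start command fails.
class LaunchError < StandardError; end

# Hypothetical launcher: fails for one known-bad image name.
def launch_container(image)
  raise LaunchError, "exit code: 125 error was: manifest unknown" if image.end_with?("hello-worl:latest")

  "container-for-#{image}"
end

# Returns a runner context on success, or an ASL-style error payload on
# failure, so the caller can route it through its normal Task error handling
# instead of seeing an exception.
def start_task(image)
  {"container_ref" => launch_container(image)}
rescue LaunchError => err
  {"Error" => "States.TaskFailed", "Cause" => err.to_s}
end

puts start_task("docker.io/agrare/hello-world:latest").inspect
puts start_task("docker.io/agrare/hello-worl:latest").inspect
```

The caller only has to check whether the returned hash carries an "Error" key, which is the same check it would make for a task that failed after starting.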

#141

@agrare agrare requested a review from Fryguy as a code owner November 15, 2023 18:57
@agrare agrare mentioned this pull request Nov 15, 2023
3 tasks
@agrare
Member Author

agrare commented Nov 15, 2023

Docker with a bad image name:

Before:

$ bundle exec exe/floe --workflow examples/workflow.asl --input '{"foo": 1}'
I, [2023-11-15T13:57:48.875720 #1307986]  INFO -- : Running state: [FirstState] with input [{"foo"=>1}]...
D, [2023-11-15T13:57:48.876025 #1307986] DEBUG -- : Running docker run --detach -e foo\=1 -e _CREDENTIALS\=/run/secrets -v /tmp/20231115-1307986-3gb4mb:/run/secrets:z --name floe-hello-worl-054ec601 docker.io/agrare/hello-worl:latest
bundler: failed to load command: exe/floe (exe/floe)
/home/grare/adam/.gem/ruby/3.0.0/gems/awesome_spawn-1.6.0/lib/awesome_spawn.rb:111:in `run!': docker exit code: 125 error was: Unable to find image 'agrare/hello-worl:latest' locally (AwesomeSpawn::CommandResultError)
docker: Error response from daemon: manifest for agrare/hello-worl:latest not found: manifest unknown: manifest unknown.
See 'docker run --help'.
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/runner/docker.rb:120:in `docker!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/runner/docker.rb:74:in `run_container'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/runner/docker.rb:32:in `run_async!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/states/task.rb:38:in `start'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow/state.rb:52:in `run_nonblock!'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:80:in `step_nonblock'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:67:in `step'
	from /home/grare/adam/src/manageiq/floe/lib/floe/workflow.rb:62:in `run!'
	from exe/floe:37:in `<top (required)>'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli/exec.rb:58:in `load'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli/exec.rb:58:in `kernel_load'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli/exec.rb:23:in `run'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli.rb:492:in `exec'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli.rb:34:in `dispatch'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/cli.rb:28:in `start'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/exe/bundle:37:in `block in <top (required)>'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
	from /home/grare/adam/.gem/ruby/3.0.0/gems/bundler-2.4.19/exe/bundle:29:in `<top (required)>'
	from /home/grare/adam/.gem/ruby/3.0.0/bin/bundle:25:in `load'
	from /home/grare/adam/.gem/ruby/3.0.0/bin/bundle:25:in `<main>'

After:

adam@desktop:~/src/manageiq/floe$ bundle exec exe/floe --workflow examples/workflow.asl --input '{"foo": 1}'
I, [2023-11-15T13:58:10.519437 #1308095]  INFO -- : Running state: [FirstState] with input [{"foo"=>1}]...
D, [2023-11-15T13:58:10.519799 #1308095] DEBUG -- : Running docker run --detach -e foo\=1 -e _CREDENTIALS\=/run/secrets -v /tmp/20231115-1308095-6b6s9g:/run/secrets:z --name floe-hello-worl-1502d56a docker.io/agrare/hello-worl:latest
I, [2023-11-15T13:58:10.811974 #1308095]  INFO -- : Running state: [FirstState] with input [{"foo"=>1}]...Complete - next state: [FailState] output: [{"foo"=>1, "Error"=>"States.TaskFailed", "Cause"=>"docker exit code: 125 error was: Unable to find image 'agrare/hello-worl:latest' locally\ndocker: Error response from daemon: manifest for agrare/hello-worl:latest not found: manifest unknown: manifest unknown.\nSee 'docker run --help'.\n"}]
I, [2023-11-15T13:58:10.812034 #1308095]  INFO -- : Running state: [FailState] with input [{"foo"=>1, "Error"=>"States.TaskFailed", "Cause"=>"docker exit code: 125 error was: Unable to find image 'agrare/hello-worl:latest' locally\ndocker: Error response from daemon: manifest for agrare/hello-worl:latest not found: manifest unknown: manifest unknown.\nSee 'docker run --help'.\n"}]...
I, [2023-11-15T13:58:10.812069 #1308095]  INFO -- : Running state: [FailState] with input [{"foo"=>1, "Error"=>"States.TaskFailed", "Cause"=>"docker exit code: 125 error was: Unable to find image 'agrare/hello-worl:latest' locally\ndocker: Error response from daemon: manifest for agrare/hello-worl:latest not found: manifest unknown: manifest unknown.\nSee 'docker run --help'.\n"}]...Complete - next state: [] output: [{"Error"=>"FailStateError", "Cause"=>"No Matches!"}]
{"Error"=>"FailStateError", "Cause"=>"No Matches!"}

NOTE: the "No Matches!" output appears because examples/workflow.asl has a failure match. The new way of handling errors now allows a user to catch and retry these types of exceptions (or handle them any other way they want).
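
To illustrate what "catch" means for a payload like the one above, here is a simplified model of ASL `Catch` matching. It is a sketch, not Floe's implementation: `find_catcher` and the state names are hypothetical, and only exact names plus the `States.ALL` wildcard are modeled.

```ruby
# Simplified model of ASL "Catch" matching; not Floe's implementation.
# Each catcher lists error names in "ErrorEquals"; "States.ALL" matches any error.
def find_catcher(catchers, error_payload)
  catchers.find do |catcher|
    names = catcher.fetch("ErrorEquals")
    names.include?("States.ALL") || names.include?(error_payload["Error"])
  end
end

# Hypothetical catchers; the "Next" state names are made up for illustration.
catchers = [
  {"ErrorEquals" => ["States.Timeout"],    "Next" => "TimeoutState"},
  {"ErrorEquals" => ["States.TaskFailed"], "Next" => "CleanupState"}
]

payload = {"Error" => "States.TaskFailed", "Cause" => "docker exit code: 125 ..."}
catcher = find_catcher(catchers, payload)
puts catcher ? catcher["Next"] : "no catcher matched"
```

Because the failure now arrives as a payload with an "Error" name rather than as a Ruby exception, it can take part in this kind of matching at all.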

@agrare agrare force-pushed the handle_docker_podman_image_errors branch from 65a37c4 to cf4f721 Compare November 15, 2023 19:07
  cleanup(runner_context)
- raise
+ {"Error" => "States.TaskFailed", "Cause" => err.to_s}
Member

Would it help with clarity to also include the error class?

Suggested change
- {"Error" => "States.TaskFailed", "Cause" => err.to_s}
+ {"Error" => "States.TaskFailed", "Cause" => "#{err.class.name}: #{err}"}

Member Author

Sure, done

Member Author

Hm, thinking about this a little more, I'm not sure how helpful the Ruby exception class is going to be to the user, since this is Floe-internal stuff.

What if we are more explicit about catching e.g. Kubeclient::Error and AwesomeSpawn::CommandResultError, and just print the error string but let other exceptions raise up?

Member

great suggestion

Member

@Fryguy Fryguy Nov 17, 2023

What if we are more explicit about catching e.g. Kubeclient::Error and AwesomeSpawn::CommandResultError, and just print the error string but let other exceptions raise up?

Not sure - the ultimate problem is that the parent wasn't handling it at the task level and it just got stuck, right? Perhaps we need changes for handling other errors on that side as well?

Member Author

@agrare agrare Nov 17, 2023

Well, the problem that we saw on Kubernetes was that the pod status was Pending, so even though it "failed" we treated it as if it hadn't started yet (since that status is also Pending). The problem you were likely seeing on Docker wasn't that an exception had been raised, but that it actually wasn't done pulling the image yet.

If an unhandled exception is raised, the MiqQueue deliver method will catch it and still invoke the queue_callback to mark the task as failed (I just tested this by throwing a raise in the workflow run_nonblock).

I'd like user errors, such as podman/k8s failing to pull an image, to be handled as ASL errors, and anything else, like a NoMethodError on NilClass (straight bugs), to be raised as exceptions so we get a backtrace in the logs.
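
The policy described here can be sketched as follows. This is an illustration under assumptions, not the merged code: `ImagePullError` is a hypothetical stand-in for the expected runner errors (the thread names Kubeclient::Error and AwesomeSpawn::CommandResultError), and `run_with_payload` is a made-up helper.

```ruby
# Hypothetical stand-in for an expected, user-caused failure such as a failed
# image pull; the real runner errors are not loaded in this sketch.
class ImagePullError < StandardError; end

# Convert only the expected error class into an ASL error payload.
# Anything else (a programming bug) propagates with its backtrace intact.
def run_with_payload
  yield
rescue ImagePullError => err
  {"Error" => "States.TaskFailed", "Cause" => err.to_s}
end

result = run_with_payload { raise ImagePullError, "manifest unknown" }
puts result.inspect

begin
  run_with_payload { nil.upcase } # NoMethodError: a straight bug
rescue NoMethodError => err
  puts "propagated: #{err.class}"
end
```

Rescuing a named class rather than StandardError is what keeps bugs visible: a broad rescue would turn a NoMethodError into a States.TaskFailed payload and hide the backtrace.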

Member

ok yeah I agree 👍

@agrare agrare force-pushed the handle_docker_podman_image_errors branch from cf4f721 to 16e6b54 Compare November 17, 2023 14:17
@agrare agrare force-pushed the handle_docker_podman_image_errors branch from 16e6b54 to aa67039 Compare November 17, 2023 14:36
@miq-bot
Member

miq-bot commented Nov 17, 2023

Checked commit agrare@aa67039 with ruby 2.6.10, rubocop 1.28.2, haml-lint 0.35.0, and yamllint
4 files checked, 0 offenses detected
Everything looks fine. 🍪

@kbrock
Member

kbrock commented Nov 21, 2023

After: verified it is working.

{
  "Comment": "Invalid Image Name",
  "StartAt": "UnknownImageNameRun",
  "States": {
    "UnknownImageNameRun": {
      "Type": "Task",
      "Resource": "docker://docker.io/kbrock/unknown:latest",
      "Parameters": {
        "ERROR": "failure message"
      },
      "End": true
    }
  }
}
$ bundle exec exe/floe --docker-runner podman --workflow examples/error-invalid-image-name.json --input='{"foo": 2}'

{"Error"=>"States.TaskFailed", "Cause"=>"podman exit code: 125 error was: Trying to pull docker.io/kbrock/unknown:latest...\nError: initializing source docker://kbrock/unknown:latest: reading manifest latest in docker.io/kbrock/unknown: manifest unknown\n"}

$ bundle exec exe/floe --docker-runner docker --workflow examples/error-invalid-image-name.json --input='{"foo": 2}'

{"Error"=>"States.TaskFailed", "Cause"=>"docker exit code: 125 error was: Unable to find image 'kbrock/unknown:latest' locally\ndocker: Error response from daemon: manifest for kbrock/unknown:latest not found: manifest unknown: manifest unknown.\nSee 'docker run --help'.\n"}

$ bundle exec exe/floe --docker-runner kubernetes --docker-runner-options=kubeconfig_context=default --workflow examples/error-invalid-image-name.json --input='{"foo": 2}'

{"Error"=>"ErrImagePull", "Cause"=>"rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/kbrock/unknown:latest\": failed to resolve reference \"docker.io/kbrock/unknown:latest\": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed"}
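
The Kubernetes result above uses a container waiting reason as the "Error". As a rough sketch of how a Pending pod that failed to pull its image could be told apart from one that simply has not started, assuming the pod is available as a plain Hash in the Kubernetes API shape (`status.containerStatuses[].state.waiting`): the helper name and the list of failure reasons are assumptions, and this is not Floe's code.

```ruby
# Waiting reasons treated as failures in this sketch (an assumed list).
FAILURE_REASONS = %w[ErrImagePull ImagePullBackOff].freeze

# Returns an error payload if any container is waiting for a failure reason,
# or nil if the pod looks like it is still legitimately starting.
def pending_pod_error(pod)
  statuses = pod.dig("status", "containerStatuses") || []
  waiting  = statuses.map { |s| s.dig("state", "waiting") }.compact
  failed   = waiting.find { |w| FAILURE_REASONS.include?(w["reason"]) }
  failed && {"Error" => failed["reason"], "Cause" => failed["message"]}
end

pod = {
  "status" => {
    "phase"             => "Pending",
    "containerStatuses" => [
      {"state" => {"waiting" => {"reason" => "ErrImagePull", "message" => "pull access denied"}}}
    ]
  }
}

puts pending_pod_error(pod).inspect
```

Both pods in this situation report the phase Pending, so the phase alone cannot separate "not started yet" from "will never start"; the waiting reason can.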

@kbrock kbrock merged commit f586f21 into ManageIQ:master Nov 21, 2023
5 checks passed
@agrare agrare deleted the handle_docker_podman_image_errors branch November 21, 2023 14:48
agrare added a commit that referenced this pull request Nov 21, 2023
Fixed
- Return an error payload if run_async! fails (#143)

Changed
- Extract run_container_params for docker/podman (#142)
@kbrock
Member

kbrock commented Nov 21, 2023

Yeah, so my only comment here is that the errors differ by platform. Hope this information helps others writing workflows:

# kubernetes
  "Error" => "ErrImagePull"
# docker and podman
  "Error" => "States.TaskFailed"

@agrare
Member Author

agrare commented Nov 21, 2023

Yeah, I went with the more general States.TaskFailed because I couldn't get a helpful exception type from the failed AwesomeSpawn runs. I didn't think AwesomeSpawn::CommandResultError as the "Error" would be helpful to the user.

@kbrock kbrock self-assigned this Jan 1, 2024