
Add result verification to task runner #337

Merged: KnVerey merged 4 commits into master on Oct 10, 2018

Conversation

KnVerey (Contributor) commented Oct 3, 2018

PLEASE NOTE: The refactoring piece of this has now been extracted to #340

Feature

The runner task now streams container logs and waits for the pod to succeed or fail by default

  • You can disable this using --skip-wait or verify_result: false (Note: named for consistency with the deploy task)
  • If you choose not to verify the result, we now look up the pod once and print its status in the summary (it will probably be Pending, but that seemed more helpful than nothing).
  • You can use --max-watch-seconds=N or max_watch_seconds: N to set a time limit on the new result verification feature

Closes #304

@Shopify/cloudx @fw42 @sirupsen

Detail

This proposes using the existing ResourceWatcher but moving the log handling for all pods into a new object with a traditional (for our resource models) sync-data/use-result separation. This makes it possible to change log output from being a side-effect of sync, to using a new post-sync hook that is run sequentially (sync is concurrent). I have not done so in this PR, but I expect this implementation to make it trivially possible to stream task-runner logs during deploys as well, which could be a nice improvement.

This PR also tweaks the UX (i.e. logging) for this task in general, to try to make it more useful and more consistent with the deploy task. See screenshots at the bottom of this description.

Notes for reviewers

  • Since this overhauls the task's logging, please tophat ALL the runner integration tests using PRINT_LOGS=1.
  • Please also try to come up with test scenarios I may have missed.
  • Since this PR is already pretty big, I'm deferring some potential improvements to future PRs. If you think any of these should go in before feature release, please speak up. List:
    • If a pod is stuck pending for more than X seconds, either emit a warning or time out (task runner pods in deploys too)
    • If a pod hasn't logged in the past X seconds, time out (not entirely convinced we should do this at all)
    • Add a context check to the validations (like we have for the deploy task)
    • Add a central retry mechanism and use it on many/most of the calls the task makes itself (as opposed to the ones in the watcher, which effectively retries every second anyway)
    • Stream task runner pods during deploys too (when/once there's only one pod to stream)
    • Try to make the watcher responsible for raising timeout and failure exceptions at the end of the watch (currently done by the calling tasks, with the exception of "global" timeouts)
    • Reconsider our treatment of exists? in all resource classes. If the API server has no record of it after the creation request, it isn't going to spontaneously appear (the user probably deleted it out of band?), so we should consider it failed.
    • Consider a "delete pod on timeout" option

If we're in agreement that these are ok to defer, I'll cut them into separate issues before merging.

Alternatives considered

  • K8s run watch pod #320, in particular:
    • this commit, which introduces a separate watcher type for streaming
    • this commit, which uses the normal watcher but doesn't have a separate class for logs yet
    • this commit, which uses the normal watcher and has a class for the logs, but changes less overall than I have here (logging still happens during sync, the logs class exposes both fetching and printing, and it does not store the logs seen internally)
  • I also tried using kubeclient.watch_pod_log on this branch. I'm glad I tried it, but I liked it a lot less: it was more complex (after a few iterations I ended up using the watcher in a side thread anyway, to be able to abort appropriately), it isn't compatible with the deploy part of the tool, so it duplicated/fought some of its functionality, and on top of that the stream still wasn't working against my minikube by the end of the day I gave myself to try it.

Screenshots

(four screenshots of the new task-runner logging output omitted)

@KnVerey KnVerey mentioned this pull request Oct 3, 2018

@sync_mediator.sync(remainder)
remainder.each(&:post_sync)
Contributor

As implemented, sync is run on all the resources and then post_sync is called. In Rails (https://api.rubyonrails.org/classes/ActiveSupport/Callbacks.html), post_sync would trigger immediately after each resource has had sync called. Is the implemented behavior needed? If not, I think it would be preferable to do it the more traditional way.

Contributor Author

tl;dr The technical reason is that sync runs concurrently and post_sync shouldn't. I also don't see the value of the indirection (vs writing the code at the end of sync itself) if it were executed immediately after.

The idea here is that sync should exclusively collect state from the API server and should not have any side-effects. It is run in parallel, so weird things can happen if the instance associates additional behaviours with it (e.g. seems to me that if two pods complete in different sync groups during the same polling cycle today, their logs will get interleaved). The post_sync hook (naming suggestions appreciated) is a way to let instances do something every polling cycle if they need to, with the guarantee that (a) their instance data is already up to date; and (b) nothing else is logging at the same time. I considered making this dump_logs or something, but thought it would be better design to keep the watcher agnostic about the details of what instances might want to do.
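
To make the ordering concrete, here is a minimal sketch of the polling cycle described above (the thread handling and every name other than sync/post_sync are assumptions, not the actual watcher code):

# Sketch only: one polling cycle of the watcher (assumed structure, not the real class)
def watch(remainder, sync_mediator, delay_between_polls: 1)
  until remainder.empty?
    # sync: pure state collection, safe to run concurrently
    remainder.map { |resource| Thread.new { resource.sync(sync_mediator) } }.each(&:join)

    # post_sync: side-effects such as log output, run one resource at a time,
    # with the guarantee that all instance data is already up to date
    remainder.each(&:post_sync)

    remainder.reject!(&:deploy_finished?)
    sleep(delay_between_polls) unless remainder.empty?
  end
end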

Contributor

(b) seems like it's just a result of running the hook serially. If the goal is to keep the watcher agnostic, which I agree with, I think a more descriptive name like serialized_post_sync would help (I'm terrible at naming :( )

def create_pod(pod)
@logger.info "Creating pod '#{pod.name}'"
pod.deploy_started_at = Time.now.utc
kubeclient.create_pod(Kubeclient::Resource.new(pod.definition))
Contributor

Should this be inverted, e.g. pod.to_kubeclient_resource?

Contributor Author

👍 I like that idea
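
A possible shape for that (hypothetical method, not part of this PR):

# Hypothetical helper on the resource class, so callers don't build the
# Kubeclient::Resource themselves:
def to_kubeclient_resource
  Kubeclient::Resource.new(definition)
end

# The call site above would then read:
#   kubeclient.create_pod(pod.to_kubeclient_resource)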

@logger.info "Creating pod '#{pod.name}'"
pod.deploy_started_at = Time.now.utc
kubeclient.create_pod(Kubeclient::Resource.new(pod.definition))
@pod_name = pod.name
Contributor

Why an instance var?

KnVerey (Contributor Author) commented Oct 4, 2018

Because today, the pod's name is exposed as the return value of run (note attr_reader :pod_name above). I don't want to keep doing that, but there are legitimate use cases for wanting to know the name of the pod created so you can take follow-up action. For example, the caller might want to rescue DeploymentTimeoutError, delete the pod created (to make sure two aren't running at once), and retry the runner. Incidentally, I'm wondering if we should add a --delete-pod-on-timeout option to the runner itself (not in this PR).
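
For example, a hypothetical caller relying on that (only "the pod name is exposed" and the rescue-and-clean-up idea come from the comment above; the exact RunnerTask constructor/run! arguments are illustrative and from memory, a logger is assumed to be set up already, and cleanup_pod is a made-up helper):

task = KubernetesDeploy::RunnerTask.new(namespace: "my-ns", context: "my-context", logger: logger)
begin
  task.run!(task_template: "hello-cloud-template-runner", entrypoint: nil,
    args: ["rake", "some:task"], env_vars: [])
rescue KubernetesDeploy::DeploymentTimeoutError
  # make sure two copies of the task never run at once before retrying
  cleanup_pod(task.pod_name) if task.pod_name # e.g. shell out to `kubectl delete pod <name>`
  raise
end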

end
end

def print_all(prevent_duplicate: true)
Contributor

Plans for prevent_duplicate later?

Contributor Author

Not really. I had added this based on Flo's comment to the effect that having print_all no-op on the second call by default is unintuitive.

def print_latest
@last_printed_indexes ||= Hash.new { |hash, key| hash[key] = -1 }
@container_logs.each do |container_name, logs|
prefix = "[#{container_name}] " if @container_names.length > 1
Contributor Author

An alternative to this would be to force the caller to specify a single container to print/"stream"

Contributor

Part of me was thinking that we have to do some ugly things because a pod can have multiple containers. But is that something we actually need/want to allow for the runner?

Contributor Author

We definitely need to not care if they have extra containers, but they do already have to tell us which one is doing the main work (so we can inject the env vars/args/etc in that one). Maybe we could set pod.main_container or something and pick that one. If we use this in deploys though (which I'd like to), we won't have that information, so we'd probably need to keep the prefixing for that case. WDYT? 🤔

]
cmd << "--since-time=#{since.to_datetime.rfc3339}" if since.present?

out, _err, _st = kubectl.run(*cmd, log_failure: false)
Contributor Author

This continues the existing behaviour of treating ALL failures to retrieve logs as "none available I guess". I'm thinking it would just be spammy to output anything about failures because:
(a) log output is for the benefit of the human operator only (in no way affects the control loop);
(b) this call already ends up getting "retried" every polling cycle, which for the runner is every second; and
(c) If Kubernetes is behind on consuming the logs from the container, there's nothing the operator can do about it.
Can you think of a case where it could be helpful to the operator, or a class of error that should actually be handled in some way?
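
For reference, the fetch being discussed looks roughly like this (a sketch reconstructed around the snippet above; the method name and the instance variables are assumptions):

def fetch_logs(kubectl, since: nil)
  cmd = [
    "logs", @parent_id,
    "--container=#{@container_name}",
    "--timestamps",
  ]
  cmd << "--since-time=#{since.to_datetime.rfc3339}" if since.present?

  # log_failure: false keeps kubectl's stderr quiet; any failure (pod gone, logs
  # not yet available, transient API error) just yields an empty result, and the
  # call is repeated on the next polling cycle anyway.
  out, _err, _st = kubectl.run(*cmd, log_failure: false)
  out.split("\n")
end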

end

def sync(kubectl)
latest_logs = fetch_logs(kubectl, since: @last_timestamp_seen_for_container.values.min)
Contributor Author

hm, @last_timestamp_seen_for_container could be a method call rather than tracked separately in this implementation. Will change.

KnVerey (Contributor Author) commented Oct 4, 2018

The commit I just pushed refactors PodLogs into RemoteLogs (it didn't actually need anything pod-specific in it) and ContainerLogs. I think this makes the code a lot easier to read, since log fetching actually happens at the container level.
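
Roughly, the split looks like this (an illustrative sketch of the shape only, not the actual classes; the constructor arguments and internals are assumptions, and the real code fetches logs incrementally rather than wholesale):

# Per-container log fetching and printing
class ContainerLogs
  attr_reader :container_name

  def initialize(parent_id:, container_name:, logger:)
    @parent_id = parent_id
    @container_name = container_name
    @logger = logger
    @lines = []
    @printed_up_to = -1
  end

  def sync(kubectl)
    out, _err, _st = kubectl.run("logs", @parent_id,
      "--container=#{@container_name}", "--timestamps", log_failure: false)
    @lines = out.split("\n") # real code fetches since the last timestamp seen and dedupes
  end

  def print_latest(prefix: nil)
    @lines[(@printed_up_to + 1)..-1].to_a.each { |line| @logger.info("#{prefix}#{line}") }
    @printed_up_to = @lines.length - 1
  end
end

# Pod-level object that just fans out to its containers
class RemoteLogs
  def initialize(parent_id:, container_names:, logger:)
    @containers = container_names.map do |name|
      ContainerLogs.new(parent_id: parent_id, container_name: name, logger: logger)
    end
  end

  def sync(kubectl)
    @containers.each { |c| c.sync(kubectl) }
  end

  def print_latest
    @containers.each do |c|
      prefix = "[#{c.container_name}] " if @containers.length > 1
      c.print_latest(prefix: prefix)
    end
  end
end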

dturn (Contributor) left a comment

This is looking good. What do you think about adding a test for RemoteLogs where a pod has 2 containers?

out.split("\n")
end

def rfc3330_timestamp(time)
Contributor

typo: rfc3339_timestamp


@KnVerey KnVerey changed the base branch from master to classes_for_remote_logs October 5, 2018 00:55
KnVerey (Contributor Author) commented Oct 5, 2018

I split off the refactoring pieces of this into #340 for ease of review. This branch is now based on that one. I will still merge them at the same time.

KnVerey (Contributor Author) commented Oct 5, 2018

What do you think about adding a test for RemoteLogs where a pod has 2 containers.

Added a bunch of unit tests in #340. 👍

dturn (Contributor) left a comment

Thanks for splitting out the logging stuff.

@@ -8,9 +8,10 @@ class Pod < KubernetesResource
Evicted
Preempting
)
attr_writer :stream_logs
Contributor

Do you have a use case in mind where you'd create the pod with stream_logs: false but change it before the pod completed?

Contributor Author

Yes, actually! I'm thinking that DeployTask could set this on pods in the predeploy with essentially pods.first.stream_logs = true if pods.one?, and the watcher could do the same to start streaming once there's only one pod left, e.g.:

Waiting for pod/fast-task, pod/slow-task...
Logs from pod/fast-task:
  dump
  the
  logs
Pod/fast-task deployed in 5 seconds
Waiting for pod/slow-task...
  stream
  the
  logs
Pod/slow-task deployed in 3 minutes

We can discuss that on a future issue/PR though, and I can remove this line for now.
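
If that ever happens, the watcher-side toggle could be as small as this (hypothetical future code, not in this PR; deploy_finished? is assumed to be the relevant completion check):

# Hypothetical: once only one task-runner pod remains, switch it to streaming
remaining_pods = remainder.select { |r| r.is_a?(Pod) && !r.deploy_finished? }
remaining_pods.first.stream_logs = true if remaining_pods.one?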

@KnVerey KnVerey requested a review from fw42 October 5, 2018 20:22
fw42 (Contributor) left a comment

If a pod is stuck pending for more than X seconds, either emit a warning or time out (task runner pods in deploys too)

Agree that this would be good to have!

If a pod hasn't logged in the past X seconds, time out (not entirely convinced we should do this at all)

For activefailover, this would have saved our ass a few times in the past. Not saying we need this necessarily, but one thing we definitely want to be resilient against is "pod dies and therefore will never log anything, but kubernetes-deploy already started tailing the logs and will never finish". I assume your code here is already better than the code we used to have in activefailover, but having a test specifically for this scenario would be great (not necessarily in this PR, just saying).

please tophat ALL the runner integration tests

FYI I didn't do this since I don't really know how to do this and also don't have this project set up locally.

If we're in agreement that these are ok to defer

Agree that this makes sense since this PR is already so big.

Overall this PR looks mostly good to me. I don't feel super qualified to review this and I saw a bunch of small things that bother me a little bit, but cleaning those up would be a bigger refactor that has nothing to do with this PR, so I'll keep those opinions to myself for now.

Thanks for working on this!

else
raise
raise TaskConfigurationError, "Error communicating with the API server"
Contributor

Why the change here and not just raise like before? Seems like treating "error talking to the API" as "your configuration must be wrong" might be a bit confusing (what if the problem was on the server's side?).

Contributor Author

As a rule, I like to wrap errors we're covering in something < FatalDeploymentError so that they get displayed nicely. You're right that TaskConfigurationError doesn't really make sense though. How about raise FatalDeploymentError, "Error retrieving pod template: #{e.class.name}: #{e.message}"? Or a new KubernetesAPIError < FatalDeploymentError maybe?

Contributor

FatalDeploymentError or KubernetesAPIError both sound good to me
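
For concreteness, the KubernetesAPIError option would look something like this (a sketch; FatalDeploymentError is the project's existing base error, and the file placement is assumed):

# errors.rb (sketch): a dedicated class that still inherits the
# "display nicely" behaviour of FatalDeploymentError
module KubernetesDeploy
  class KubernetesAPIError < FatalDeploymentError; end
end

# The non-404 branch of the rescue above would then become:
#   raise KubernetesAPIError, "Error retrieving pod template: #{e.class.name}: #{e.message}"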


pod_template.template
rescue KubeException => error
if error.error_code == 404
raise TaskTemplateMissingError.new(template_name, @namespace, @context)
msg = "Pod template `#{template_name}` not found in namespace `#{@namespace}`, context `#{@context}`"
@logger.summary.add_paragraph msg
Contributor

nitpick: missing parentheses here

@@ -210,8 +213,12 @@ def obj.run(*)

private

def log_to_stderr?
ENV["PRINT_LOGS"].to_s == "1"
Contributor

env vars are always strings, aren't they? the to_s here seems unnecessary

pod_template = build_pod_definition(task_template)
set_container_overrides!(pod_template, entrypoint, args, env_vars)
ensure_valid_restart_policy!(pod_template, verify_result)
Pod.new(namespace: @namespace, context: @context, logger: @logger, stream_logs: true,
Contributor

I was a bit surprised that stream_logs is always true here and doesn't seem to depend on any of the new options (like verify_result). Why is that?

Contributor Author

This option is really about how the logs are displayed when the pod is being watched, not whether it is watched: "stream logs as we get them" vs. "dump them all when the pod succeeds/fails". If you don't start watching this pod at all (because verify_result is false), stream_logs is irrelevant. Currently stream_logs is always true in task runs and always false in deploys. As discussed with Danny in a comment above, I think it would be cool if we could intelligently set it to true during certain deploys too.

# frozen_string_literal: true
module EnvTestHelper
def with_env(key, value)
old_env_id = ENV[key]
Contributor

nitpick but I don't understand why this is called an _id
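
For context, the helper in question presumably does something like this (reconstructed sketch; only the first two lines of the method are from the diff):

# frozen_string_literal: true
module EnvTestHelper
  def with_env(key, value)
    old_env_id = ENV[key] # the "_id" suffix is the nitpick above; it's just the previous value
    ENV[key] = value
    yield
  ensure
    ENV[key] = old_env_id # assigning nil removes the key, restoring an unset var
  end
end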

template = kubeclient.get_pod_template('hello-cloud-template-runner', @namespace)
mock.expects(:get_pod_template).returns(template)
mock.expects(:create_pod).raises(Kubeclient::HttpError.new("409", "Pod with same name exists", {}))
task_runner.instance_variable_set(:@kubeclient, mock)
Contributor

can you do task_runner.expects(:kubeclient).returns(mock) here? I think that would be a tiny bit less sketchy

Contributor Author

That would require disabling Mocha::Configuration.prevent(:stubbing_non_public_method). Not sure the sketchiness difference between the private ivar and the private method is worth removing that.

Contributor

Oh sweet. I didn't even know about stubbing_non_public_method. TIL. I guess this is ok then.

end

def test_run_with_template_runner_template_missing
def test_run_with_bad_restart_policy
deploy_task_template do |f|
Contributor

what does f stand for?

Contributor Author

fixtures

Contributor

can we just call it fixtures?

end

def test_run_with_env_vars
deploy_fixtures("hello-cloud", subset: ["template-runner.yml"])
def test_run_with_missing_namespace
Contributor

nitpick: test_run_fails_if_namespace_is_missing?

assert_task_run_success(result)

logging_assertion do |all_logs|
nums_printed = all_logs.scan(/Line (\d+)$/).flatten

missing_nums = nums_printed - (1..5_000).map(&:to_s)
Contributor Author

This was backwards, so the test wasn't working as expected.
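
In other words, the direction of the set difference was flipped (sketch of the before/after; the surrounding assertion is assumed):

expected_nums = (1..5_000).map(&:to_s)

# Before: extra numbers printed that weren't expected -- effectively always empty
#   missing_nums = nums_printed - expected_nums
# After: expected numbers that never showed up in the logs
missing_nums = expected_nums - nums_printed
assert_empty(missing_nums, "Lines missing from the logs: #{missing_nums.first(10).join(', ')}")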

@KnVerey KnVerey changed the base branch from classes_for_remote_logs to master October 10, 2018 15:50
@KnVerey KnVerey merged commit 49e7d88 into master Oct 10, 2018
@KnVerey KnVerey deleted the runner_verify_result branch October 10, 2018 17:43
@fw42 fw42 temporarily deployed to rubygems October 16, 2018 10:39 Inactive