Add result verification to task runner #337
Conversation
@sync_mediator.sync(remainder)
remainder.each(&:post_sync)
As implemented, sync is run on all the resources and then post_sync is called. In Rails (https://api.rubyonrails.org/classes/ActiveSupport/Callbacks.html), post_sync would trigger immediately after each resource has sync called. Is the implemented behavior needed? If not, I think it would be preferable to do it the more traditional way.
tl;dr The technical reason is that sync runs concurrently and post_sync shouldn't. I also don't see the value of the indirection (vs writing the code at the end of sync itself) if it were executed immediately after.

The idea here is that sync should exclusively collect state from the API server and should not have any side-effects. It is run in parallel, so weird things can happen if the instance associates additional behaviours with it (e.g. it seems to me that if two pods complete in different sync groups during the same polling cycle today, their logs will get interleaved). The post_sync hook (naming suggestions appreciated) is a way to let instances do something every polling cycle if they need to, with the guarantee that (a) their instance data is already up to date; and (b) nothing else is logging at the same time. I considered making this dump_logs or something, but thought it would be better design to keep the watcher agnostic about the details of what instances might want to do.
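The two-phase polling cycle described here can be sketched roughly like this. Everything below is hypothetical except the sync/post_sync split itself: Resource stands in for a real resource model, and the events queue exists only to make the ordering observable.

```ruby
require "thread"

# Resource is a stand-in for a resource model; only the names
# sync/post_sync come from the discussion above.
class Resource
  attr_reader :name

  def initialize(name, events)
    @name = name
    @events = events
  end

  # State collection only: safe to run concurrently
  def sync
    @events << "synced #{name}"
  end

  # Side effects (e.g. dumping logs): runs one resource at a time
  def post_sync
    @events << "post_synced #{name}"
  end
end

events = Queue.new
resources = %w(pod-a pod-b).map { |n| Resource.new(n, events) }

# Concurrent sync phase, then a strictly serial post_sync phase
resources.map { |r| Thread.new { r.sync } }.each(&:join)
resources.each(&:post_sync)

log = []
log << events.pop until events.empty?
```

Because the threads are joined before the serial phase starts, every post_sync entry is guaranteed to come after every sync entry, which is the interleaving guarantee (b) above relies on.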
(b) seems like it's just a result of running the hook serially. If the goal is to keep the watcher agnostic, which I agree with, I think a more descriptive name would help, e.g. serialized_post_sync (I'm terrible at naming :( )
lib/kubernetes-deploy/runner_task.rb
Outdated
def create_pod(pod)
  @logger.info "Creating pod '#{pod.name}'"
  pod.deploy_started_at = Time.now.utc
  kubeclient.create_pod(Kubeclient::Resource.new(pod.definition))
Is this inverted? e.g. pod.to_kubeclient_resource
👍 I like that idea
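The suggested inversion could look like the sketch below: the Pod model converts its own definition, instead of the task runner wrapping pod.definition itself. KubeclientResource is a stand-in for Kubeclient::Resource (which similarly wraps a hash of attributes); the rest is hypothetical.

```ruby
require "ostruct"

# Stand-in for Kubeclient::Resource, which also wraps attribute hashes
KubeclientResource = OpenStruct

class Pod
  def initialize(definition)
    @definition = definition
  end

  # The pod represents itself for the API client, so callers no longer
  # need to do Kubeclient::Resource.new(pod.definition)
  def to_kubeclient_resource
    KubeclientResource.new(@definition)
  end
end

pod = Pod.new(kind: "Pod", metadata: { name: "hello-runner" })
resource = pod.to_kubeclient_resource
```

This keeps knowledge of the API client's wrapper type in one place rather than at every creation call site.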
@logger.info "Creating pod '#{pod.name}'"
pod.deploy_started_at = Time.now.utc
kubeclient.create_pod(Kubeclient::Resource.new(pod.definition))
@pod_name = pod.name
Why an instance var?
Because today, the pod's name is exposed as the return value of run (note attr_reader :pod_name above). I don't want to keep doing that, but there are legitimate use cases for wanting to know the name of the pod created so you can take follow-up action. For example, the caller might want to rescue DeploymentTimeoutError, delete the pod created (to make sure two aren't running at once), and retry the runner. Incidentally, I'm wondering if we should add a --delete-pod-on-timeout option to the runner itself (not in this PR).
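The caller-side flow described here (rescue the timeout, delete the pod by the exposed name, retry) could look like this sketch. DeploymentTimeoutError matches the error named above; FlakyRunner and everything inside it are stand-ins, not the gem's real API.

```ruby
# Stand-ins: only the rescue/delete/retry shape comes from the comment above
class DeploymentTimeoutError < StandardError; end

class FlakyRunner
  attr_reader :pod_name

  def initialize
    @attempts = 0
  end

  # Records the created pod's name, like the runner's @pod_name ivar;
  # times out on the first attempt to exercise the retry path
  def run
    @attempts += 1
    @pod_name = "task-runner-#{@attempts}"
    raise DeploymentTimeoutError, "timed out" if @attempts == 1
    :success
  end
end

deleted_pods = []
runner = FlakyRunner.new
result =
  begin
    runner.run
  rescue DeploymentTimeoutError
    # Delete the timed-out pod so two aren't running at once, then retry
    deleted_pods << runner.pod_name
    retry
  end
```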
lib/kubernetes-deploy/pod_logs.rb
Outdated
  end
end

def print_all(prevent_duplicate: true)
Plans for prevent_duplicate later?
Not really. I had added this based on Flo's comment to the effect that having print_all no-op on the second call by default is unintuitive.
lib/kubernetes-deploy/pod_logs.rb
Outdated
def print_latest
  @last_printed_indexes ||= Hash.new { |hash, key| hash[key] = -1 }
  @container_logs.each do |container_name, logs|
    prefix = "[#{container_name}] " if @container_names.length > 1
An alternative to this would be to force the caller to specify a single container to print/"stream"
Part of me was thinking that we have to do some ugly things because a pod can have multiple containers. But is that something we actually need/want to allow for the runner?
We definitely need to not care if they have extra containers, but they do already have to tell us which one is doing the main work (so we can inject the env vars/args/etc in that one). Maybe we could set pod.main_container or something and pick that one. If we use this in deploys though (which I'd like to), we won't have that information, so we'd probably need to keep the prefixing for that case. WDYT? 🤔
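The prefixing rule under discussion can be isolated as a small sketch: a "[container] " prefix is applied only when the pod has more than one container. The helper name and data shape here are assumptions for illustration.

```ruby
# Hypothetical helper: container_logs maps container name => log lines
def prefixed_lines(container_logs)
  multiple = container_logs.size > 1
  container_logs.flat_map do |container_name, lines|
    # Prefix only matters when output from several containers could mix
    prefix = multiple ? "[#{container_name}] " : ""
    lines.map { |line| "#{prefix}#{line}" }
  end
end

single = prefixed_lines("task" => ["hello"])
multi  = prefixed_lines("task" => ["hello"], "sidecar" => ["ready"])
```

Forcing the caller to pick one container (or a pod.main_container) would make the single-container branch the only one, at the cost of losing sidecar output in deploys.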
lib/kubernetes-deploy/pod_logs.rb
Outdated
]
cmd << "--since-time=#{since.to_datetime.rfc3339}" if since.present?

out, _err, _st = kubectl.run(*cmd, log_failure: false)
This continues the existing behaviour of treating ALL failures to retrieve logs as "none available I guess". I'm thinking it would just be spammy to output anything about failures because:
(a) log output is for the benefit of the human operator only (in no way affects the control loop);
(b) this call already ends up getting "retried" every polling cycle, which for the runner is every second; and
(c) If Kubernetes is behind on consuming the logs from the container, there's nothing the operator can do about it.
Can you think of a case where it could be helpful to the operator, or a class of error that should actually be handled in some way?
lib/kubernetes-deploy/pod_logs.rb
Outdated
end

def sync(kubectl)
  latest_logs = fetch_logs(kubectl, since: @last_timestamp_seen_for_container.values.min)
hm, @last_timestamp_seen_for_container could be a method call rather than tracked separately in this implementation. Will change.

The commit I just pushed refactors this.
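Deriving the last-seen timestamp from the stored lines, rather than tracking it in a separate ivar, could look like this sketch. ContainerLogs, record, and the "timestamp message" line format are all assumptions for illustration.

```ruby
require "time"

# Hypothetical log store: container name => array of "timestamp message" lines
class ContainerLogs
  def initialize
    @lines = Hash.new { |hash, key| hash[key] = [] }
  end

  def record(container, line)
    @lines[container] << line
  end

  # Computed on demand from data we already keep, instead of a tracked ivar
  def last_timestamp_seen(container)
    last = @lines[container].last
    last && Time.parse(last.split(" ", 2).first)
  end
end

logs = ContainerLogs.new
logs.record("task", "2018-01-01T00:00:00Z starting")
logs.record("task", "2018-01-01T00:00:05Z done")
```

This removes one piece of mutable state that could drift out of sync with the logs themselves.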
This is looking good. What do you think about adding a test for RemoteLogs where a pod has 2 containers?
  out.split("\n")
end

def rfc3330_timestamp(time)
typo: rfc3339_timestamp
Force-pushed from 74c7515 to ba88330
I split off the refactoring pieces of this into #340 for ease of review. This branch is now based on that one. I will still merge them at the same time.

Added a bunch of unit tests in #340. 👍
Thanks for splitting out the logging stuff.
@@ -8,9 +8,10 @@ class Pod < KubernetesResource
    Evicted
    Preempting
  )
  attr_writer :stream_logs
do you have a use case in mind where you'd create the pod with stream_logs: false but change it before the pod completes?
Yes, actually! I'm thinking that DeployTask could set this on pods in the predeploy with essentially pods.first.stream_logs = true if pods.one?, and the watcher could do the same to start streaming once there's only one pod left, e.g.:
Waiting for pod/fast-task, pod/slow-task...
Logs from pod/fast-task:
dump
the
logs
Pod/fast-task deployed in 5 seconds
Waiting for pod/slow-task...
stream
the
logs
Pod/slow-task deployed in 3 minutes
We can discuss that on a future issue/PR though, and I can remove this line for now.
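The "switch the last remaining pod to streaming" idea above can be sketched as follows. WatchedPod is a stand-in; only the pods.one? line mirrors the comment, everything else is hypothetical.

```ruby
# Stand-in for a watched pod; stream_logs mirrors the attr_writer above
class WatchedPod
  attr_accessor :stream_logs
  attr_reader :name

  def initialize(name)
    @name = name
    @stream_logs = false # dump logs on completion by default
  end
end

pods = [WatchedPod.new("fast-task"), WatchedPod.new("slow-task")]

# fast-task completes and its logs are dumped; drop it from the watch list
pods.reject! { |pod| pod.name == "fast-task" }

# Only one pod left: switch it to live streaming
pods.first.stream_logs = true if pods.one?
```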
> If a pod is stuck pending for more than X seconds, either emit a warning or time out (task runner pods in deploys too)

Agree that this would be good to have!

> If a pod hasn't logged in the past X seconds, time out (not entirely convinced we should do this at all)

For activefailover, this would have saved our ass a few times in the past. Not saying we need this necessarily, but one thing we definitely want to be resilient against is "pod dies and therefore will never log anything, but kubernetes-deploy already started tailing the logs and will never finish". I assume your code here is already better than the code we used to have in activefailover, but having a test specifically for this scenario would be great (not necessarily in this PR, just saying).

> please tophat ALL the runner integration tests

FYI I didn't do this since I don't really know how to do this and also don't have this project set up locally.

> If we're in agreement that these are ok to defer

Agree that this makes sense since this PR is already so big.

Overall this PR looks mostly good to me. I don't feel super qualified to review this and I saw a bunch of small things that bother me a little bit, but cleaning those up would be a bigger refactor that has nothing to do with this PR, so I'll keep those opinions to myself for now.

Thanks for working on this!
lib/kubernetes-deploy/runner_task.rb
Outdated
else
  raise
  raise TaskConfigurationError, "Error communicating with the API server"
Why the change here and not just raise
like before? Seems like treating "error talking to the API" as "your configuration must be wrong" might be a bit confusing (what if the problem was on the server's side?).
As a rule, I like to wrap errors we're covering in something < FatalDeploymentError so that they get displayed nicely. You're right that TaskConfigurationError doesn't really make sense though. How about raise FatalDeploymentError, "Error retrieving pod template: #{e.class.name}: #{e.message}"? Or a new KubernetesAPIError < FatalDeploymentError maybe?
FatalDeploymentError or KubernetesAPIError both sound good to me
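The wrapping convention agreed on here could look like this sketch. The class names and message format follow the comment above; KubeException is a stand-in for kubeclient's error class, and fetch_template is hypothetical.

```ruby
# Errors covered by the tool subclass FatalDeploymentError so they get
# displayed nicely rather than as raw backtraces
class FatalDeploymentError < StandardError; end
class KubernetesAPIError < FatalDeploymentError; end
class KubeException < StandardError; end # stand-in for kubeclient's error

def fetch_template
  raise KubeException, "server unavailable" # simulated API failure
rescue KubeException => e
  # Wrap, preserving the original class and message for the operator
  raise KubernetesAPIError, "Error retrieving pod template: #{e.class.name}: #{e.message}"
end

message =
  begin
    fetch_template
  rescue FatalDeploymentError => e # one rescue covers all wrapped errors
    e.message
  end
```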
lib/kubernetes-deploy/runner_task.rb
Outdated
pod_template.template
rescue KubeException => error
  if error.error_code == 404
    raise TaskTemplateMissingError.new(template_name, @namespace, @context)
    msg = "Pod template `#{template_name}` not found in namespace `#{@namespace}`, context `#{@context}`"
    @logger.summary.add_paragraph msg
nitpick: missing parentheses here
test/test_helper.rb
Outdated
@@ -210,8 +213,12 @@ def obj.run(*)

private

def log_to_stderr?
  ENV["PRINT_LOGS"].to_s == "1"
env vars are always strings, aren't they? the to_s here seems unnecessary
pod_template = build_pod_definition(task_template)
set_container_overrides!(pod_template, entrypoint, args, env_vars)
ensure_valid_restart_policy!(pod_template, verify_result)
Pod.new(namespace: @namespace, context: @context, logger: @logger, stream_logs: true,
I was a bit surprised that stream_logs is always true here and doesn't seem to depend on any of the new options (like verify_result). Why is that?
This option is really about how the logs are displayed when the pod is being watched, not if: "stream logs as we get them" vs. "dump them all when this succeeds/fails". If you don't start watching this pod at all (because verify_result is false), stream_logs is irrelevant. Currently stream_logs is always true in task runs and always false in deploys. As discussed with Danny in a comment above, I think it would be cool if we could intelligently set it to true during certain deploys too.
test/helpers/env_test_helper.rb
Outdated
# frozen_string_literal: true
module EnvTestHelper
  def with_env(key, value)
    old_env_id = ENV[key]
nitpick but I don't understand why this is called an _id
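With the naming nitpick applied, the helper could read like this sketch: original_value instead of old_env_id, restored in an ensure block so the environment is clean even if the block raises.

```ruby
module EnvTestHelper
  # Temporarily set an env var for the duration of the block
  def with_env(key, value)
    original_value = ENV[key]
    ENV[key] = value
    yield
  ensure
    # Restore the previous value (assigning nil deletes the variable)
    ENV[key] = original_value
  end
end

include EnvTestHelper

ENV["PRINT_LOGS"] = "0"
seen = with_env("PRINT_LOGS", "1") { ENV["PRINT_LOGS"] }
```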
template = kubeclient.get_pod_template('hello-cloud-template-runner', @namespace)
mock.expects(:get_pod_template).returns(template)
mock.expects(:create_pod).raises(Kubeclient::HttpError.new("409", "Pod with same name exists", {}))
task_runner.instance_variable_set(:@kubeclient, mock)
can you do task_runner.expects(:kubeclient).returns(mock) here? I think that would be a tiny bit less sketchy
That would require disabling Mocha::Configuration.prevent(:stubbing_non_public_method). Not sure the sketchiness difference between the private ivar and the private method is worth removing that.
Oh sweet. I didn't even know about stubbing_non_public_method. TIL. I guess this is ok then.
test/integration/runner_task_test.rb
Outdated
end

def test_run_with_template_runner_template_missing
def test_run_with_bad_restart_policy
  deploy_task_template do |f|
what does f stand for?
fixtures
can we just call it fixtures?
test/integration/runner_task_test.rb
Outdated
end

def test_run_with_env_vars
  deploy_fixtures("hello-cloud", subset: ["template-runner.yml"])
def test_run_with_missing_namespace
nitpick: test_run_fails_if_namespace_is_missing?
Force-pushed from f22892e to 0c2b82e
test/integration/runner_task_test.rb
Outdated
assert_task_run_success(result)

logging_assertion do |all_logs|
  nums_printed = all_logs.scan(/Line (\d+)$/).flatten

  missing_nums = nums_printed - (1..5_000).map(&:to_s)
This was backwards, so the test wasn't working as expected.
Force-pushed from 6d97100 to e549ff1
PLEASE NOTE: The refactoring piece of this has now been extracted to #340
Feature
- The runner task now streams container logs and waits for the pod to succeed or fail by default
- --skip-wait or verify_result: false opts out (note: named for consistency with the deploy task)
- --max-watch-seconds=seconds or max_watch_seconds: N sets a time limit on the new result verification feature

Closes #304
@Shopify/cloudx @fw42 @sirupsen
Detail
This proposes using the existing ResourceWatcher but moving the log handling for all pods into a new object with a traditional (for our resource models) sync-data/use-result separation. This makes it possible to change log output from being a side-effect of sync, to using a new post-sync hook that is run sequentially (sync is concurrent). I have not done so in this PR, but I expect this implementation to make it trivially possible to stream task-runner logs during deploys as well, which could be a nice improvement.
This PR also tweaks the UX (i.e. logging) for this task in general, to try to make it more useful and more consistent with the deploy task. See screenshots at the bottom of this description.
Notes for reviewers
- PRINT_LOGS=1
- exists? in all resource classes. If the API server has no record of it after the creation request, it isn't going to spontaneously appear (the user probably deleted it out of band?), so we should consider it failed.

If we're in agreement that these are ok to defer, I'll cut them into separate issues before merging.
Alternatives considered
- kubeclient.watch_pod_log: I tried this on this branch. I'm glad I tried it, but I liked it a lot less. It was more complex (e.g. after a few iterations, I ended up using the watcher in a side thread anyway, to be able to abort appropriately), is not compatible with the deploy part of the tool so it duplicated/fought some of its functionality, and to boot the stream still wasn't working against my minikube by the end of the day I gave myself to try it.
on this branch. I'm glad I tried it, but I liked it a lot less. It was more complex (e.g. after a few iterations, I ended up using the watcher in a side thread anyway, to be able to abort appropriately), is not compatible with the deploy part of the tool so duplicated/fought some of its functionality, and to boot the stream still wasn't working against my minikube by the end of the day I gave myself to try it.Screenshots