
Add ability to detect and cleanup failed deployments #20444

Merged — 19 commits merged into ManageIQ:master from the jrafanie/cleanup_failed_deployments branch on Sep 9, 2020

Conversation


@jrafanie jrafanie commented Aug 13, 2020

What this PR does:

  • pods are labeled with the orchestrator pod that manages them
  • for this orchestrator's pods, a collector thread gets the initial pod information and monitors for updates
  • deployments with 1+ terminated container(s) and a sum of restarts > 5 are "killed" (a minimal sketch of this check follows below)
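
A minimal sketch of that check, assuming the pod object is a kubeclient resource; deployment_failed? is a hypothetical helper name, not necessarily the PR's actual code:

    # Hypothetical helper: a pod counts as a failed deployment when at least one
    # container is stuck in a terminated lastState and the containers' restart
    # counts add up to more than 5.
    def deployment_failed?(pod)
      container_statuses = pod.status.containerStatuses.to_a

      terminated = container_statuses.any? { |cs| cs.lastState&.terminated }
      restarts   = container_statuses.sum { |cs| cs.restartCount.to_i }

      terminated && restarts > 5
    end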

The easiest way to recreate this:

  • Add an amazon cloud provider in the manageiq UI.
  • Wait a few seconds for oc get pods to show new amazon-cloud-event-catcher and amazon-cloud-refresh worker pods creating/starting:
$ oc get pods
NAME                                              READY   STATUS              RESTARTS   AGE
1-amazon-cloud-event-catcher-2-7bd6479946-84c8n   0/1     ContainerCreating   0          2s
1-amazon-cloud-refresh-2-3-4-5548695867-9qkzx     0/1     ContainerCreating   0          1s
  • Delete the amazon cloud provider in the manageiq UI before the new pods finish starting.
  • The new worker pods will continually start/restart/fail/backoff until they reach 6 restarts and get killed.

TODO:

  • Mutex around reads/writes
  • Ensure the pod collector thread is running/restarted (a rough sketch of this and the mutex guard follows after this list)
  • Guard against monitoring/killing deployments for things such as postgresql, memcached, httpd, etc. To do this, we'll label all of the worker pods managed by the orchestrator, so it only monitors and kills failed pods that it manages. This will require new images to be tested to ensure the labels are set correctly and that the orchestrator can monitor the subset of pods identified by this label.
  • Fix some thread safety issues identified below.
  • More nuanced detection of failed deployments (5+ restarts and terminated status), since we might otherwise flag pods that often hit memory/cpu limits over days/months. This is basic and doesn't conflict with liveness check failures, since those pods will be restarted and won't remain in a terminated lastState. Any pod that has 5 or more container restarts and remains in a terminated state will get removed as a deployment.
  • Manual tests are great, but we'll need automated tests
    Fixes: Worker deployments exist after worker records are removed #20147
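
For the mutex and thread-restart items above, this is the rough shape of what's being described. It's an illustrative sketch only: save_pod and @monitor_thread are names that come up later in the review, but current_pods, pod_mutex, ensure_pod_monitor_started, and monitor_pods are hypothetical here.

    # Illustrative sketch, not the PR's exact code: guard reads/writes of the
    # shared pod cache with a mutex, and restart the collector thread if it dies.
    def current_pods
      @current_pods ||= {}
    end

    def pod_mutex
      @pod_mutex ||= Mutex.new
    end

    def save_pod(pod)
      pod_mutex.synchronize { current_pods[pod.metadata.name] = pod }
    end

    def ensure_pod_monitor_started
      return if @monitor_thread&.alive?

      @monitor_thread = Thread.new { monitor_pods } # monitor_pods: hypothetical collector loop
    end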

Here's an example of the events indicating two failed worker pods and their subsequent automatic removal:

[screenshot: events showing the two failed worker pods being detected and removed]

Side effect bonus:

  • Each pod's labels show which orchestrator it's managed by:
    [screenshot: pod labels including the manageiq-orchestrated-by label]

  • By filtering by the orchestrator, such as manageiq-orchestrated-by=orchestrator-5f89795bcc-89ztg, we can see all of the deployments managed by that orchestrator (or by any orchestrator pod if you filter by manageiq-orchestrated-by alone), and therefore which ones we're monitoring and which will get killed if they continually fail:
    [screenshot: deployments filtered by the manageiq-orchestrated-by label]

@jrafanie jrafanie requested a review from agrare August 13, 2020 20:17
@miq-bot miq-bot added the wip label Aug 13, 2020
@jrafanie jrafanie force-pushed the cleanup_failed_deployments branch 2 times, most recently from f023847 to b421fba Compare August 24, 2020 20:45
end

start_pod_monitor
end
jrafanie (Member Author):

copied from similar behavior in the event catcher

Member:

As discussed with you and @agrare, maybe not for this PR, but I suggest we move the general pattern of a WatchThread that can be auto-restarted into a generic class in the core manageiq repo's lib dir, for eventual extraction into kubeclient. That way, issues that arise can be fixed in one place (like the 410 Gone issue).

I recommend it live in core, and then the kubernetes / openshift providers use it directly. We may need to tweak the interface to be less EMS-oriented and more generic.
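
A rough sketch of that pattern, assuming the actual watch call (e.g. a kubeclient watch) is supplied as a block; the class and method names here are illustrative, not the existing provider code:

    # Illustrative auto-restarting watch wrapper: if the underlying watch raises
    # (e.g. the connection drops or a 410 Gone forces a re-list), log it and
    # start a fresh watch instead of letting the thread die.
    class RestartableWatchThread
      def initialize(&watch_block)
        @watch_block = watch_block
        @running     = true
      end

      def start
        @thread = Thread.new do
          while @running
            begin
              @watch_block.call
            rescue => err
              warn "watch failed: #{err.message}, restarting..."
              sleep 1
            end
          end
        end
        self
      end

      def stop
        @running = false
        @thread&.kill
      end
    end

Usage might look something like RestartableWatchThread.new { client.watch_pods(pod_options) { |event| process(event) } }.start, depending on the kubeclient version and how the event handling ends up being shared.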

when "status"
# other times, we can get 'status' type/kind, with a code of 410
# https://github.com/ManageIQ/manageiq-providers-kubernetes/blob/745ba1332fa43cfb0795644279f3f55b8751f1c8/app/models/manageiq/providers/kubernetes/container_manager/refresh_worker/watch_thread.rb#L48
break if event.code == 410
jrafanie (Member Author):

@agrare I couldn't track down why we're doing one thing in the kubernetes provider here and something different in the fluent-plugin referenced above. Maybe different API versions return a top-level status object, which can be used here, and other API versions return an error watch event with the status object inside? I'll need to track this down either way.

agrare (Member) commented Aug 25, 2020:

I don't know that k8s defines the event type when a 410 Gone is returned, here are the docs on watches https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes so this might be an implementation detail (whether it returns "status" or "error" as the event type).

I'll need to do some research to see how we should handle it; it would be unfortunate if we had to handle both cases.

jrafanie (Member Author):

yeah, at the worst, if we have to support both formats for 410, we can combine them like:

      when "error", "status"
        break if event.code == 410                        # outer watch event has failed 410 error code
        break if event.object && event.object.code == 410 # or outer watch event completed but inner event object has the 410 error code

It's less than ideal, but it covers both cases: the outer watch event carrying the 410 code, or the inner event object carrying it.

jrafanie (Member Author):

I suspect that, depending on what you're asking for, you can get a "success" with a failure object OR the whole request can fail. It's possible there are bugs on the kubernetes side in terms of consistency across various APIs, while there are possibly legitimate reasons to have a successful failure vs. a failing failure.

@jrafanie jrafanie force-pushed the cleanup_failed_deployments branch 5 times, most recently from 6effc55 to e0d8b4e Compare August 28, 2020 20:22
@jrafanie jrafanie marked this pull request as ready for review September 3, 2020 17:37

jrafanie commented Sep 3, 2020

@Fryguy @agrare @brandon I think this is ready for review. In terms of a surgical change, I think this is the basics and I can't really remove any of the functionality.

There are still things to do but it feels like we can have separate discussions on that. Perhaps I can make this logic optional to start and we can get this in and discuss the future items.

Note, I've tested this on pods by injecting the code into a rails console in the orchestrator so I'll need to retest or show others how to test when we have an image they can run this with.

Some of the items remaining that we might want to include here or just do later (YAGNI):

  • Label pods on deployment so we can identify/search just for the deployments we care to kill and let be recreated. Do we want to support killing the httpd deployment? Or memcached? Or postgresql? Or just pods that are in the miq_workers table? Or just EMS-based workers? We can choose to have a hardcoded opt-in or opt-out list to start and use kubernetes labels for future work (a rough opt-out sketch follows below).
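
As a sketch of the hardcoded opt-out idea (the list contents and method name are illustrative; nothing here was decided in this PR):

    # Illustrative opt-out list: never auto-delete these core deployments even
    # if their pods look unhealthy; only worker deployments would be fair game.
    PROTECTED_DEPLOYMENTS = %w[httpd memcached postgresql orchestrator].freeze

    def monitored_deployment?(deployment_name)
      !PROTECTED_DEPLOYMENTS.include?(deployment_name)
    end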

Later:

  • What do we want to do with the miq_workers table? This PR doesn't address the fact that, on pods, the worker row still isn't created at deployment time in the orchestrator; instead, it's created from within the pod as the pod initializes and starts.
  • Looking for a terminated lastState plus 5+ container restarts across the pod is pretty naive but seems to work for the situations we care about. Additionally, maybe there isn't a big downside to redeploying a pod that manages to get included as a false positive? Basically, if the pod is restarted and fails enough, it will hit this state and could be removed and redeployed.

@jrafanie jrafanie changed the title [WIP] Add ability to detect and cleanup failed deployments Add ability to detect and cleanup failed deployments Sep 3, 2020
@miq-bot miq-bot removed the wip label Sep 3, 2020
  private

  def pod_options
    @pod_options ||= {:namespace => my_namespace, :label_selector => "app=#{app_name}"}
jrafanie (Member Author):

Pairing with @Fryguy, we came up with a label we can set for all deployments, {:"#{app_name}-orchestrated-by" => ENV['POD_NAME']}, in the object definition. We can then use a selector here, app=manageiq,manageiq-orchestrated-by=orchestrator-9f99d8cb9-7mprg, so we'll only get pods that are managed by this orchestrator. In the orchestrator code, we can then look for all pods that we're managing...
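
A sketch of what that selector change might look like, adapting the pod_options method from the diff above (the exact label key depends on how the deployment objects end up being labeled):

    # Hypothetical adaptation of pod_options: only select pods labeled as being
    # orchestrated by this orchestrator pod.
    def pod_options
      @pod_options ||= {
        :namespace      => my_namespace,
        :label_selector => "app=#{app_name},#{app_name}-orchestrated-by=#{ENV['POD_NAME']}"
      }
    end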

…liminate the accessor

Update log message on error since we're not resetting the resource version
anymore.
  def collect_initial_pods
    pods = orchestrator.get_pods
    pods.each { |p| save_pod(p) }
    pods.resourceVersion
jrafanie (Member Author):

Thanks @agrare for the suggestion... returning the resourceVersion here and passing it to the watch is simpler.
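
A sketch of that wiring, assuming a hypothetical watch helper on the orchestrator (watch_pod_events and delete_pod are illustrative names; the PR's actual method names and event handling may differ):

    # Illustrative flow: list the pods once, remember them, then watch for
    # changes starting at the resourceVersion the listing returned so nothing
    # is missed between the list and the watch.
    def monitor_pods
      resource_version = collect_initial_pods
      watch_pod_events(resource_version)
    end

    def watch_pod_events(resource_version)
      orchestrator.watch_pods(resource_version).each do |event|
        case event.type.downcase
        when "added", "modified"
          save_pod(event.object)
        when "deleted"
          delete_pod(event.object)
        when "error", "status"
          # per the 410 Gone discussion above, the code may be on the outer
          # watch event or on the inner object
          break if event.code == 410 || event.object&.code == 410
        end
      end
    end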

Fryguy (Member) left a comment:

Great work Joe!


Fryguy commented Sep 9, 2020

@jrafanie Can you fix the rubocops? Some of them are legit.


jrafanie commented Sep 9, 2020

@jrafanie Can you fix the rubocops? Some of them are legit.

Yeah, I'm doing one more full build to make sure it fixes the whole end-to-end process and I'll clean those up.

* stale comments
* style

miq-bot commented Sep 9, 2020

Checked commits jrafanie/manageiq@db496aa~...9ba2fc4 with ruby 2.6.3, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
6 files checked, 2 offenses detected

app/models/miq_server/worker_management/monitor/kubernetes.rb

lib/container_orchestrator.rb


jrafanie commented Sep 9, 2020

Ok, final tests were successful. I added screenshots in the description to hopefully better document what this PR does.

The final style issues are meh:

  • get_pods is the kubeclient method, so I'm naming our caller of that the same.
  • @start_pod_monitor isn't any better than @monitor_thread... we can pick a more appropriate name when we extract this concept, since we're doing very similar things elsewhere


jrafanie commented Sep 9, 2020

Thanks @agrare @Fryguy @bdunne @simaishi for all the help with reviews/image building.

@Fryguy Fryguy merged commit f3c20e8 into ManageIQ:master Sep 9, 2020
@Fryguy Fryguy self-assigned this Sep 9, 2020
@jrafanie jrafanie deleted the cleanup_failed_deployments branch September 10, 2020 12:50
simaishi (Contributor):

@jrafanie backporting this to jansa conflicts because #20420 is not in the jansa branch. Not sure if we want to take #20420 as well. If not, please create a separate PR for the jansa branch.

jrafanie (Member Author):

@agrare I'm ok with bringing back #20420 to jansa, what do you think? I think there isn't much risk since systemd wasn't working correctly anyway, right?

Thanks @simaishi


agrare commented Sep 10, 2020

Yeah I'm 👍 with that

jrafanie (Member Author):

@simaishi Can you let me know if you are able to backport #20420 in order to backport this PR? Thanks!

simaishi (Contributor):

@jrafanie I can backport #20420, followed by #20444 without conflicts.

jrafanie (Member Author):

Sounds good @simaishi. We're both comfortable with bringing back both PRs to jansa.

simaishi pushed a commit that referenced this pull request Sep 11, 2020
Add ability to detect and cleanup failed deployments

(cherry picked from commit f3c20e8)
simaishi (Contributor):

Jansa backport details:

$ git log -1
commit ab194ebd64e259b0a8e44ad791cdbfdc9e19fecf
Author: Jason Frey <fryguy9@gmail.com>
Date:   Wed Sep 9 18:21:17 2020 -0400

    Merge pull request #20444 from jrafanie/cleanup_failed_deployments

    Add ability to detect and cleanup failed deployments

    (cherry picked from commit f3c20e88dbf7c29b775fdd225b3cc1e53e4e494f)

Successfully merging this pull request may close these issues.

Worker deployments exist after worker records are removed