Kill workers that don't stop after a configurable time #13805

Merged
gtanzillo merged 2 commits into ManageIQ:master from jrafanie:stopping_worker Feb 14, 2017

Member

jrafanie commented Feb 7, 2017

Previously, when a worker exceeded a memory threshold, we'd gracefully
ask it to exit. If the queue work it was doing took 1 hour, it would
keep running until the work was done and only then respond to the
exit request.

Now, we mark workers as 'stopping' when they exceed a threshold, and
they have up to 10 minutes to finish before we kill them. This value is
configurable in the 'stopping_timeout' field in each worker's advanced
settings.
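To make the new behaviour concrete, here is a minimal sketch of the decision a monitor could make on each pass. It is not the code from this PR: `Worker`, `next_action`, and `DEFAULT_STOPPING_TIMEOUT` are hypothetical names chosen for illustration. The only details taken from the description are the 'stopping' state, the 10-minute figure, and the per-worker `stopping_timeout` setting.

```ruby
# Hypothetical stand-in for a worker record; not the ManageIQ model.
Worker = Struct.new(:pid, :status, :stopping_since, :stopping_timeout) do
  # Seconds this worker has spent in the :stopping state (0 if not stopping).
  def stopping_for(now)
    status == :stopping ? now - stopping_since : 0
  end
end

# Assumed default, in seconds, based on the 10 minutes mentioned above.
DEFAULT_STOPPING_TIMEOUT = 10 * 60

# Decide what to do with one worker on this monitoring pass.
# Returns :ok, :mark_stopping, :wait, or :kill.
def next_action(worker, memory_used:, memory_threshold:, now: Time.now)
  case worker.status
  when :started
    # Over the threshold: ask the worker to stop gracefully.
    memory_used > memory_threshold ? :mark_stopping : :ok
  when :stopping
    # Already asked to stop: kill only once the timeout has elapsed.
    timeout = worker.stopping_timeout || DEFAULT_STOPPING_TIMEOUT
    worker.stopping_for(now) > timeout ? :kill : :wait
  else
    :ok
  end
end

# A worker that has been stopping for 700 seconds with no custom timeout.
worker = Worker.new(1234, :stopping, Time.now - 700, nil)
puts next_action(worker, memory_used: 600, memory_threshold: 500) # prints "kill"
```

Keeping the decision separate from the side effects (marking, killing) makes the timeout logic easy to exercise in a spec without real processes.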

https://bugzilla.redhat.com/show_bug.cgi?id=1395736

@gtanzillo @carbonin Please review

app/models/miq_server/worker_management/monitor.rb Outdated
@@ -99,6 +99,12 @@ def check_not_responding(class_name = nil)
    processed_workers.collect(&:id)
  end

  NOT_RESPONDING = :not_responding
  MEMORY_EXCEEDED = :memory_exceeded

carbonin Feb 8, 2017

Member

I feel like this information belongs somewhere else. Maybe MiqServer::WorkerManagement::Monitor::Reason?

Then ideally callers of worker_set_monitor_reason will also use the same constants. Not sure if making that kind of change is in this PR's scope though.
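A rough sketch of what this suggestion could look like follows. The `Reason` module and `set_monitor_reason` below are hypothetical stand-ins, not the ManageIQ code: the module is shown at top level rather than nested, and the real `worker_set_monitor_reason` signature is not reproduced here.

```ruby
# Hypothetical home for the reason constants, as suggested in the review.
module Reason
  NOT_RESPONDING  = :not_responding
  MEMORY_EXCEEDED = :memory_exceeded

  ALL = [NOT_RESPONDING, MEMORY_EXCEEDED].freeze
end

# Stand-in for a caller-facing setter. It accepts only the shared
# constants, so a mistyped symbol fails loudly instead of being stored.
def set_monitor_reason(store, pid, reason)
  raise ArgumentError, "unknown reason: #{reason.inspect}" unless Reason::ALL.include?(reason)

  store[pid] = reason
end

reasons = {}
set_monitor_reason(reasons, 1234, Reason::MEMORY_EXCEEDED)
puts reasons[1234] # prints "memory_exceeded"
```

With the constants in one place, callers and the monitor compare against the same values rather than repeating bare symbols.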

Member

gtanzillo left a comment

👍 Looks good

Member

gtanzillo commented Feb 10, 2017

Will merge after @jrafanie makes a couple of small changes.

jrafanie added 2 commits Jan 3, 2017
jrafanie force-pushed the jrafanie:stopping_worker branch to b60a5f0 Feb 10, 2017

Member

miq-bot commented Feb 10, 2017

Checked commits jrafanie/manageiq@e5f4bd3~...b60a5f0 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
6 files checked, 1 offense detected

spec/models/miq_server/worker_management/monitor_spec.rb

Member Author

jrafanie commented Feb 10, 2017

Ok, I think I got your good suggestion in. Take another look @carbonin

Member

carbonin left a comment

Looks good!

Member Author

jrafanie commented Feb 13, 2017

cc @jcarter12 (this is the stopping workers PR)

gtanzillo merged commit 9764870 into ManageIQ:master Feb 14, 2017
2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.01%) to 42.354%
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Feb 14, 2017
Due to module inclusion spaghetti, it's easier and less confusing
to reference the Reason constants consistently in the MiqServer class,
which is the ultimate destination for all of these modules.

Fixes ManageIQ#13901 introduced in ManageIQ#13805

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
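One plausible reading of this follow-up commit, sketched with plain Ruby modules: a constant defined in a module that is included, directly or transitively, into a class can be referenced through that class. All names below (`StopReason`, `DemoMonitor`, `DemoWorkerManagement`, `DemoServer`) are hypothetical and only mimic the nesting described above; the actual change in the commit is not shown on this page.

```ruby
# Hypothetical module holding the reason constants.
module StopReason
  MEMORY_EXCEEDED = :memory_exceeded
  NOT_RESPONDING  = :not_responding
end

# Intermediate modules, standing in for the chain of included modules.
module DemoMonitor
  include StopReason
end

module DemoWorkerManagement
  include DemoMonitor
end

# The class that ultimately includes everything.
class DemoServer
  include DemoWorkerManagement

  # Inside the class, the constant resolves through the ancestor chain.
  def default_reason
    MEMORY_EXCEEDED
  end
end

# From anywhere else, the class name is a stable way to reach the constant.
puts DemoServer::MEMORY_EXCEEDED     # prints "memory_exceeded"
puts DemoServer.new.default_reason   # prints "memory_exceeded"
```

Referencing the constants through the final class means code in any of the included modules does not need to know which module defines them.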
jrafanie added the euwe/yes label Feb 14, 2017
jrafanie deleted the jrafanie:stopping_worker branch Feb 14, 2017
jrafanie added the darga/yes label Feb 14, 2017
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Feb 16, 2017
Kill workers that don't stop after a configurable time
(cherry picked from commit 9764870)

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Feb 16, 2017
Kill workers that don't stop after a configurable time
(cherry picked from commit 9764870)

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
jrafanie added bug and removed enhancement labels Feb 16, 2017

Member Author

jrafanie commented Feb 16, 2017

Euwe backport: #13949
Darga backport: #13950

Contributor

simaishi commented Mar 3, 2017

Backported to Euwe via #13949

simaishi added euwe/backported and removed euwe/yes labels Mar 3, 2017

Contributor

simaishi commented Mar 10, 2017

Backported to Darga via #13950

simaishi added darga/backported and removed darga/yes labels Mar 10, 2017