Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disaster detection and disabled actions #1230

Merged
merged 51 commits into from Sep 8, 2016
Merged

Disaster detection and disabled actions #1230

merged 51 commits into from Sep 8, 2016

Conversation

ssalinas
Copy link
Member

@ssalinas ssalinas commented Aug 18, 2016

In a critical situation it is helpful to limit the amount of task churn in Singularity. This PR adds the ability for an admin to globally disable certain actions. So far it is implemented for BOUNCE, DEPLOY, SCALE, REMOVE, and DECOMMISSION but it's easy to add more if needed.

a POST to /disasters/disabled-actions/{action} adds an action to the list of ones that are disabled with an optional message
a DELETE to /disasters/disabled-actions/{action} removes it from that list

Singularity will respond with a 423 (locked) and the message given when disabling (or a default message)

In this PR I am also adding an automated way of disabling actions based on things such as task lag and the frequency of lost slaves or lost tasks.

TODO for this PR:

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add the ability to globally disable actions of certain types
  • Admin ui page to add/remove disabled actions
  • Add disaster detection around things like number of overdue tasks, slaves lost, etc, that will run on the leader only
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an api endpoint to manually enter/exit disaster mode
  • Admin ui page to manually enter/exit disaster mode and show current stats
  • Docs

/cc @tpetr

@ssalinas ssalinas modified the milestone: 0.11.0 Aug 19, 2016
@ssalinas
Copy link
Member Author

@tpetr @wolfd @Calvinp feel free to tear apart my react if I'm doing anything silly here. Modeled this page off of the webhooks one since it's fairly similar and pretty simple.

name: 'type',
label: 'Type',
isRequired: true,
options: DISABLED_ACTION_TYPES.map((type) => ({
Copy link
Contributor

@Calvinp Calvinp Aug 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to exclude already-disabled actions from this list.
(It would also be nice to be able to select multiple. But I don't think we have a multiselect component in FormModal yet, and I don't know if the backend even supports that.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't write the backend to support multiple yet, might be a good idea at some point. However, it does support overriding the message for one that is already present (i.e if you issue a new request for disabling bounces when they are alreayd disabled, it will take the message form the newer one). So I might leave it be for the moment

@ssalinas ssalinas added the hs_qa label Aug 22, 2016
@ssalinas
Copy link
Member Author

@tpetr Updated the PR a bit to take the time range over which events are occurring into account better. - task lag takes into account how long the calculated value has been over the specified threshold. A disaster will trigger if it has been over for a certain amount of time (45s default)

  • For lost tasks/slaves, the poller will take all lost items from the last X seconds into account (30s by default) instead of just checking an instantaneous value. Now the interval on which the poller runs does not matter as much (though things get more accurate the more often it does)
  • Poller now keeps more data points (10 by default)
  • Eliminated the call to get all pending tasks for efficiency (just getting the IDs) so now the poller can run more often as well.

@ssalinas
Copy link
Member Author

@tpetr added a more comprehensive SingularityAction enum, each action in the list has a property to determine if it can be disabled or not (i.e. if I actually put the validator code in place to make it disableable)

FREEZE_SLAVE(true), ACTIVATE_SLAVE(true), DECOMMISSION_SLAVE(true), VIEW_SLAVES(false),
FREEZE_RACK(true), ACTIVATE_RACK(true), DECOMMISSION_RACK(true), VIEW_RACKS(false);

private final boolean disableable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we word this better? canDisable maybe?

@@ -2,6 +2,8 @@

import static com.hubspot.singularity.data.transcoders.SingularityJsonTranscoderBinder.bindTranscoder;

import javax.ws.rs.HEAD;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this snuck in from merge conflicts

@ssalinas ssalinas merged commit 142f2bd into master Sep 8, 2016
@ssalinas ssalinas deleted the action_disable branch September 8, 2016 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants