New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disaster detection and disabled actions #1230

Merged
merged 51 commits into from Sep 8, 2016

Conversation

Projects
None yet
3 participants
@ssalinas
Member

ssalinas commented Aug 18, 2016

In a critical situation it is helpful to limit the amount of task churn in Singularity. This PR adds the ability for an admin to globally disable certain actions. So far it is implemented for BOUNCE, DEPLOY, SCALE, REMOVE, and DECOMMISSION but it's easy to add more if needed.

a POST to /disasters/disabled-actions/{action} adds an action to the list of ones that are disabled with an optional message
a DELETE to /disasters/disabled-actions/{action} removes it from that list

Singularity will respond with a 423 (locked) and the message given when disabling (or a default message)

In this PR I am also adding an automated way of disabling actions based on things such as task lag and the frequency of lost slaves or lost tasks.

TODO for this PR:

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add the ability to globally disable actions of certain types
  • Admin ui page to add/remove disabled actions
  • Add disaster detection around things like number of overdue tasks, slaves lost, etc, that will run on the leader only
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an api endpoint to manually enter/exit disaster mode
  • Admin ui page to manually enter/exit disaster mode and show current stats
  • Docs

/cc @tpetr

@ssalinas ssalinas modified the milestone: 0.11.0 Aug 19, 2016

@ssalinas ssalinas added the hs_staging label Aug 19, 2016

@ssalinas

This comment has been minimized.

Show comment
Hide comment
@ssalinas

ssalinas Aug 19, 2016

Member

@tpetr @wolfd @Calvinp feel free to tear apart my react if I'm doing anything silly here. Modeled this page off of the webhooks one since it's fairly similar and pretty simple.

Member

ssalinas commented Aug 19, 2016

@tpetr @wolfd @Calvinp feel free to tear apart my react if I'm doing anything silly here. Modeled this page off of the webhooks one since it's fairly similar and pretty simple.

Show outdated Hide outdated SingularityUI/app/components/common/modalButtons/NewDisabledActionModal.jsx
name: 'type',
label: 'Type',
isRequired: true,
options: DISABLED_ACTION_TYPES.map((type) => ({

This comment has been minimized.

@Calvinp

Calvinp Aug 19, 2016

Contributor

It would be nice to exclude already-disabled actions from this list.
(It would also be nice to be able to select multiple. But I don't think we have a multiselect component in FormModal yet, and I don't know if the backend even supports that.)

@Calvinp

Calvinp Aug 19, 2016

Contributor

It would be nice to exclude already-disabled actions from this list.
(It would also be nice to be able to select multiple. But I don't think we have a multiselect component in FormModal yet, and I don't know if the backend even supports that.)

This comment has been minimized.

@ssalinas

ssalinas Aug 22, 2016

Member

Yeah, I didn't write the backend to support multiple yet, might be a good idea at some point. However, it does support overriding the message for one that is already present (i.e if you issue a new request for disabling bounces when they are alreayd disabled, it will take the message form the newer one). So I might leave it be for the moment

@ssalinas

ssalinas Aug 22, 2016

Member

Yeah, I didn't write the backend to support multiple yet, might be a good idea at some point. However, it does support overriding the message for one that is already present (i.e if you issue a new request for disabling bounces when they are alreayd disabled, it will take the message form the newer one). So I might leave it be for the moment

Show outdated Hide outdated SingularityUI/app/components/disabledActions/DisabledActions.jsx
return props.fetchDisabledActions();
}
export default connect(mapStateToProps, mapDispatchToProps)(rootComponent(DisabledActions, 'DisabledActions', refresh));

This comment has been minimized.

@Calvinp

Calvinp Aug 19, 2016

Contributor

We probably want a space in the page title. ('Disabled Actions' instead of 'DisabledActions')

@Calvinp

Calvinp Aug 19, 2016

Contributor

We probably want a space in the page title. ('Disabled Actions' instead of 'DisabledActions')

This comment has been minimized.

@ssalinas

ssalinas Aug 22, 2016

Member

fixed

@ssalinas ssalinas added the hs_qa label Aug 22, 2016

@ssalinas ssalinas changed the title from Ability to globally disable actions to Disaster detection and disabled actions Aug 25, 2016

ssalinas added some commits Aug 25, 2016

@ssalinas

This comment has been minimized.

Show comment
Hide comment
@ssalinas

ssalinas Aug 29, 2016

Member

@tpetr Updated the PR a bit to take the time range over which events are occurring into account better. - task lag takes into account how long the calculated value has been over the specified threshold. A disaster will trigger if it has been over for a certain amount of time (45s default)

  • For lost tasks/slaves, the poller will take all lost items from the last X seconds into account (30s by default) instead of just checking an instantaneous value. Now the interval on which the poller runs does not matter as much (though things get more accurate the more often it does)
  • Poller now keeps more data points (10 by default)
  • Eliminated the call to get all pending tasks for efficiency (just getting the IDs) so now the poller can run more often as well.
Member

ssalinas commented Aug 29, 2016

@tpetr Updated the PR a bit to take the time range over which events are occurring into account better. - task lag takes into account how long the calculated value has been over the specified threshold. A disaster will trigger if it has been over for a certain amount of time (45s default)

  • For lost tasks/slaves, the poller will take all lost items from the last X seconds into account (30s by default) instead of just checking an instantaneous value. Now the interval on which the poller runs does not matter as much (though things get more accurate the more often it does)
  • Poller now keeps more data points (10 by default)
  • Eliminated the call to get all pending tasks for efficiency (just getting the IDs) so now the poller can run more often as well.

ssalinas added some commits Aug 29, 2016

@ssalinas

This comment has been minimized.

Show comment
Hide comment
@ssalinas

ssalinas Aug 30, 2016

Member

@tpetr added a more comprehensive SingularityAction enum, each action in the list has a property to determine if it can be disabled or not (i.e. if I actually put the validator code in place to make it disableable)

Member

ssalinas commented Aug 30, 2016

@tpetr added a more comprehensive SingularityAction enum, each action in the list has a property to determine if it can be disabled or not (i.e. if I actually put the validator code in place to make it disableable)

@ssalinas ssalinas added the hs_stable label Sep 2, 2016

Show outdated Hide outdated ...ularityBase/src/main/java/com/hubspot/singularity/SingularityAction.java
FREEZE_SLAVE(true), ACTIVATE_SLAVE(true), DECOMMISSION_SLAVE(true), VIEW_SLAVES(false),
FREEZE_RACK(true), ACTIVATE_RACK(true), DECOMMISSION_RACK(true), VIEW_RACKS(false);
private final boolean disableable;

This comment has been minimized.

@tpetr

tpetr Sep 7, 2016

Member

can we word this better? canDisable maybe?

@tpetr

tpetr Sep 7, 2016

Member

can we word this better? canDisable maybe?

Show outdated Hide outdated ...om/hubspot/singularity/scheduler/SingularityDisasterDetectionPoller.java
numPastDueTasks++;
totalTaskLagMillis += taskLagMillis;
if (taskLagMillis > configuration.getDeltaAfterWhichTasksAreLateMillis()) {
numLateTasks ++;

This comment has been minimized.

@tpetr

tpetr Sep 7, 2016

Member

nit: spacing

@tpetr

tpetr Sep 7, 2016

Member

nit: spacing

Show outdated Hide outdated ...om/hubspot/singularity/data/transcoders/SingularityTranscoderModule.java
@@ -2,6 +2,8 @@
import static com.hubspot.singularity.data.transcoders.SingularityJsonTranscoderBinder.bindTranscoder;
import javax.ws.rs.HEAD;

This comment has been minimized.

@tpetr

tpetr Sep 7, 2016

Member

this snuck in from merge conflicts

@tpetr

tpetr Sep 7, 2016

Member

this snuck in from merge conflicts

ssalinas added some commits Sep 7, 2016

@ssalinas ssalinas merged commit 142f2bd into master Sep 8, 2016

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@ssalinas ssalinas deleted the action_disable branch Sep 8, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment