Disaster detection and disabled actions #1230
name: 'type',
label: 'Type',
isRequired: true,
options: DISABLED_ACTION_TYPES.map((type) => ({
It would be nice to exclude already-disabled actions from this list.
(It would also be nice to be able to select multiple. But I don't think we have a multiselect component in FormModal yet, and I don't know if the backend even supports that.)
Yeah, I didn't write the backend to support multiple yet; might be a good idea at some point. However, it does support overriding the message for one that is already present (i.e., if you issue a new request for disabling bounces when they are already disabled, it will take the message from the newer one). So I might leave it be for the moment.
@tpetr Updated the PR a bit to better take into account the time range over which events are occurring.
- Task lag takes into account how long the calculated value has been over the specified threshold. A disaster will trigger if it has been over the threshold for a certain amount of time (45s default).
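The sustained-threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `TaskLagDisasterSketch`, its method names, and the lag threshold value are all assumptions. Only the 45s default duration comes from the comment.

```java
import java.util.OptionalLong;

// Hypothetical sketch: reports a disaster only after task lag has stayed
// above a threshold for a sustained period. Names are illustrative only.
class TaskLagDisasterSketch {
  // 45s default, as mentioned in the comment above.
  static final long DEFAULT_SUSTAINED_MILLIS = 45_000L;

  private final long lagThresholdMillis; // assumed configuration value
  private final long sustainedMillis;
  // Timestamp at which lag first exceeded the threshold, if it currently does.
  private OptionalLong overSince = OptionalLong.empty();

  TaskLagDisasterSketch(long lagThresholdMillis, long sustainedMillis) {
    this.lagThresholdMillis = lagThresholdMillis;
    this.sustainedMillis = sustainedMillis;
  }

  // Feed one observation; returns true when the disaster condition holds.
  boolean update(long nowMillis, long taskLagMillis) {
    if (taskLagMillis <= lagThresholdMillis) {
      // Lag recovered, so the timer starts over.
      overSince = OptionalLong.empty();
      return false;
    }
    if (!overSince.isPresent()) {
      overSince = OptionalLong.of(nowMillis);
    }
    return nowMillis - overSince.getAsLong() >= sustainedMillis;
  }
}
```

Resetting the timer whenever lag drops back under the threshold means a brief spike alone does not trigger a disaster; only continuous time over the threshold counts.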
@tpetr added a more comprehensive |
FREEZE_SLAVE(true), ACTIVATE_SLAVE(true), DECOMMISSION_SLAVE(true), VIEW_SLAVES(false),
FREEZE_RACK(true), ACTIVATE_RACK(true), DECOMMISSION_RACK(true), VIEW_RACKS(false);

private final boolean disableable;
can we word this better? canDisable maybe?
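For illustration, the suggested rename might look something like the sketch below. The enum name `SketchAction` is hypothetical, and only the constants visible in the diff above are included; the real enum in the PR may contain more.

```java
// Hypothetical sketch of the rename suggested above; not the PR's actual enum.
enum SketchAction {
  FREEZE_SLAVE(true), ACTIVATE_SLAVE(true), DECOMMISSION_SLAVE(true), VIEW_SLAVES(false),
  FREEZE_RACK(true), ACTIVATE_RACK(true), DECOMMISSION_RACK(true), VIEW_RACKS(false);

  // Whether an admin is allowed to globally disable this action.
  private final boolean canDisable;

  SketchAction(boolean canDisable) {
    this.canDisable = canDisable;
  }

  boolean canDisable() {
    return canDisable;
  }
}
```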
import static com.hubspot.singularity.data.transcoders.SingularityJsonTranscoderBinder.bindTranscoder;

import javax.ws.rs.HEAD;
this snuck in from merge conflicts
In a critical situation it is helpful to limit the amount of task churn in Singularity. This PR adds the ability for an admin to globally disable certain actions. So far it is implemented for BOUNCE, DEPLOY, SCALE, REMOVE, and DECOMMISSION, but it's easy to add more if needed.
- a POST to /disasters/disabled-actions/{action} adds an action to the list of disabled ones, with an optional message
- a DELETE to /disasters/disabled-actions/{action} removes it from that list

When a disabled action is requested, Singularity will respond with a 423 (locked) and the message given when disabling (or a default message).
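The behavior described above can be modeled with a small in-memory sketch. Everything here is an assumption made for illustration: the class `DisabledActionsSketch`, its method names, and the default message text are hypothetical and are not taken from the PR, which implements this through HTTP resources and Singularity's own storage. The message-override behavior follows the review discussion (a newer disable request replaces the earlier message).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the disabled-actions list; not Singularity's code.
class DisabledActionsSketch {
  enum Action { BOUNCE, DEPLOY, SCALE, REMOVE, DECOMMISSION }

  static final int LOCKED = 423; // HTTP 423 Locked
  static final int OK = 200;
  // Assumed wording; the PR's actual default message is not shown in the source.
  static final String DEFAULT_MESSAGE = "This action is currently disabled";

  private final Map<Action, String> disabled = new ConcurrentHashMap<>();

  // Models POST /disasters/disabled-actions/{action}.
  // Disabling an already-disabled action replaces the stored message.
  void disable(Action action, String message) {
    disabled.put(action, message == null ? DEFAULT_MESSAGE : message);
  }

  // Models DELETE /disasters/disabled-actions/{action}.
  void enable(Action action) {
    disabled.remove(action);
  }

  // Status a request for this action would receive in this model.
  int statusFor(Action action) {
    return disabled.containsKey(action) ? LOCKED : OK;
  }

  // Message returned with the 423, or null if the action is not disabled.
  String messageFor(Action action) {
    return disabled.get(action);
  }
}
```

In this model, calling `disable(Action.BOUNCE, null)` followed by `statusFor(Action.BOUNCE)` yields 423 with the default message, and `enable(Action.BOUNCE)` restores normal handling.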
In this PR I am also adding an automated way of disabling actions based on things such as task lag and the frequency of lost slaves or lost tasks.
TODO for this PR:
/cc @tpetr