Disaster detection #1247

Merged
ssalinas merged 20 commits into action_disable from disaster_detection on Aug 25, 2016

2 participants
@ssalinas
Member

ssalinas commented Aug 23, 2016

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add disaster detection based on signals such as the number of overdue tasks and slaves lost; it will run on the leader only
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an API endpoint to manually enter/exit disaster mode
  • Admin UI page to manually enter/exit disaster mode (Maybe?)
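
The detection and action-disabling items above can be pictured with a minimal sketch. This is not the code from this PR: the class name, the enum values, the `detect` and `disabledActions` helpers, and both thresholds are hypothetical, chosen only to show the shape of "compare metrics to thresholds, then disable actions while any threshold is crossed".

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: every name and threshold here is illustrative,
// not taken from the Singularity code base.
final class DisasterDetectorSketch {
  enum DisasterType { TOO_MANY_OVERDUE_TASKS, TOO_MANY_LOST_SLAVES }
  enum Action { TASK_RECONCILIATION, DEPLOY }

  // Assumed thresholds; real values would need tuning against production metrics.
  static final int MAX_OVERDUE_TASKS = 50;
  static final int MAX_LOST_SLAVES = 3;

  // Returns the set of disasters whose threshold is currently crossed.
  static Set<DisasterType> detect(int overdueTasks, int lostSlaves) {
    Set<DisasterType> active = EnumSet.noneOf(DisasterType.class);
    if (overdueTasks > MAX_OVERDUE_TASKS) {
      active.add(DisasterType.TOO_MANY_OVERDUE_TASKS);
    }
    if (lostSlaves > MAX_LOST_SLAVES) {
      active.add(DisasterType.TOO_MANY_LOST_SLAVES);
    }
    return active;
  }

  // Simplest possible policy: any active disaster disables every listed action.
  static Set<Action> disabledActions(Set<DisasterType> active) {
    return active.isEmpty() ? EnumSet.noneOf(Action.class) : EnumSet.allOf(Action.class);
  }

  public static void main(String[] args) {
    Set<DisasterType> active = detect(120, 1);
    System.out.println("Active disasters: " + active);
    System.out.println("Disabled actions: " + disabledActions(active));
  }
}
```

A per-disaster mapping from disaster type to disabled actions would be a natural refinement; the all-or-nothing policy is used here only to keep the sketch short.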

@ssalinas ssalinas added the hs_staging label Aug 24, 2016

@ssalinas ssalinas modified the milestone: 0.11.0 Aug 24, 2016

ssalinas added some commits Aug 24, 2016

+   }
+
+   for (SingularityDisasterType disaster : newActiveDisasters) {
+     addDisaster(disaster);

@darcatron

darcatron Aug 24, 2016

Contributor

Doesn't this again add previously added disasters?

@ssalinas

ssalinas Aug 24, 2016

Member

Ah, yeah, I can make it skip those. Originally I was going to store a message in there, so it made sense to call that again in case the message had been updated. I can probably do an if (!exists(path)) check first, thanks!
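
A minimal sketch of the existence check proposed in this reply, assuming an in-memory set as a stand-in for the real storage. The class name, the enum values, the `/disasters/active/` path prefix, and the `writeCount` helper are all hypothetical; only `addDisaster` and the `exists(path)` check come from the thread.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an in-memory set stands in for whatever store backs the real code.
final class DisasterStoreSketch {
  enum DisasterType { TOO_MANY_OVERDUE_TASKS, TOO_MANY_LOST_SLAVES }

  private final Set<String> paths = new HashSet<>();
  private int writes = 0;

  boolean exists(String path) {
    return paths.contains(path);
  }

  // Records the disaster only if it is not already present.
  // Returns true when a write actually happened.
  boolean addDisaster(DisasterType disaster) {
    String path = "/disasters/active/" + disaster.name(); // assumed path layout
    if (exists(path)) {
      return false; // already recorded, skip the redundant write
    }
    paths.add(path);
    writes++;
    return true;
  }

  // Number of writes performed so far, exposed only to make the skip observable.
  int writeCount() {
    return writes;
  }

  public static void main(String[] args) {
    DisasterStoreSketch store = new DisasterStoreSketch();
    store.addDisaster(DisasterType.TOO_MANY_LOST_SLAVES);
    store.addDisaster(DisasterType.TOO_MANY_LOST_SLAVES); // skipped
    System.out.println("Writes performed: " + store.writeCount());
  }
}
```

Against a shared store, a check followed by a create is not atomic. The PR description says detection runs on the leader only, which would limit concurrent writers, but that is an inference rather than something this thread confirms.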

@ssalinas

ssalinas Aug 24, 2016

Member

updated 👍


      checkRackAfterSlaveLoss(slave.get());
    } else {
      LOG.warn("Lost a slave {}, but didn't know about it", slaveId);
    }
  }
+ private void updateDisasterCounter(SingularitySlave slave) {
+   if (slave.getCurrentState().getState() == MachineState.ACTIVE) {

@darcatron

darcatron Aug 24, 2016

Contributor

I might not understand this part fully, but if the slave state is changed to MachineState.DEAD, will this ever be true?

@ssalinas

ssalinas Aug 24, 2016

Member

Ah, yeah the variable could probably use renaming. The one I'm grabbing is actually the previous slave state, not the one we are about to update to. Essentially I'm looking for any transitions from ACTIVE -> DEAD.
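
The point about checking the previous state can be sketched as follows. This is an illustration only: the class name, the `onSlaveLost` and `activeSlavesLost` names, and the simplified three-value `MachineState` stand-in are assumptions, not the PR's code. Naming the parameter `previousState` addresses the renaming mentioned above.

```java
// Hypothetical sketch: counts only slaves that were ACTIVE just before being lost.
final class SlaveTransitionSketch {
  // Simplified stand-in for the real MachineState enum.
  enum MachineState { ACTIVE, DECOMMISSIONED, DEAD }

  private int activeSlavesLost = 0;

  // Called when a slave is reported lost, before its stored state is updated to DEAD.
  // previousState is therefore the state the slave had prior to the loss.
  void onSlaveLost(MachineState previousState) {
    if (previousState == MachineState.ACTIVE) {
      activeSlavesLost++; // an ACTIVE -> DEAD transition
    }
  }

  int activeSlavesLost() {
    return activeSlavesLost;
  }

  public static void main(String[] args) {
    SlaveTransitionSketch counter = new SlaveTransitionSketch();
    counter.onSlaveLost(MachineState.ACTIVE);         // counted
    counter.onSlaveLost(MachineState.DECOMMISSIONED); // not counted
    System.out.println("Active slaves lost: " + counter.activeSlavesLost());
  }
}
```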

ssalinas added some commits Aug 24, 2016

@ssalinas ssalinas added the hs_qa label Aug 25, 2016

@ssalinas

ssalinas Aug 25, 2016

Member

Going to merge this into disabled actions since they go together and are now both on staging + qa

@ssalinas ssalinas merged commit 1cb3945 into action_disable Aug 25, 2016

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details

@ssalinas ssalinas deleted the disaster_detection branch Aug 25, 2016

@ssalinas ssalinas changed the title from (WIP) Disaster detection to Disaster detection Sep 13, 2016
