Disaster detection #1247

Merged
ssalinas merged 20 commits into action_disable from disaster_detection on Aug 25, 2016

Member

ssalinas commented Aug 23, 2016

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add disaster detection, running on the leader only, based on metrics such as the number of overdue tasks and slaves lost
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an API endpoint to manually enter/exit disaster mode
  • Admin UI page to manually enter/exit disaster mode (maybe?)
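
As a rough illustration of the threshold idea in the list above, here is a minimal sketch. Everything in it is an assumption made for illustration: the class name `DisasterDetectorSketch`, the enum values, the two counters, and the threshold parameters are hypothetical and are not taken from the PR, whose real enum is `SingularityDisasterType`.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: maps a few metrics to a set of active disasters.
final class DisasterDetectorSketch {
  // Illustrative categories only; the PR's real enum is SingularityDisasterType.
  enum DisasterType { TOO_MANY_OVERDUE_TASKS, TOO_MANY_SLAVES_LOST }

  private final int maxOverdueTasks; // assumed threshold
  private final int maxSlavesLost;   // assumed threshold

  DisasterDetectorSketch(int maxOverdueTasks, int maxSlavesLost) {
    this.maxOverdueTasks = maxOverdueTasks;
    this.maxSlavesLost = maxSlavesLost;
  }

  // Returns every disaster whose threshold is strictly exceeded.
  Set<DisasterType> detect(int overdueTasks, int slavesLost) {
    Set<DisasterType> active = EnumSet.noneOf(DisasterType.class);
    if (overdueTasks > maxOverdueTasks) {
      active.add(DisasterType.TOO_MANY_OVERDUE_TASKS);
    }
    if (slavesLost > maxSlavesLost) {
      active.add(DisasterType.TOO_MANY_SLAVES_LOST);
    }
    return active;
  }

  // Any active disaster is treated as a reason to disable risky actions.
  boolean shouldDisableActions(Set<DisasterType> active) {
    return !active.isEmpty();
  }
}
```

In this sketch the caller (the leader, per the description) would run `detect` periodically and consult `shouldDisableActions` before allowing an action; which actions are actually disabled is left open here.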
ssalinas added the hs_staging label Aug 24, 2016
ssalinas modified the milestone: 0.11.0 Aug 24, 2016
ssalinas added 6 commits Aug 24, 2016
}

for (SingularityDisasterType disaster : newActiveDisasters) {
  addDisaster(disaster);

darcatron Aug 24, 2016

Contributor

Doesn't this again add previously added disasters?

ssalinas Aug 24, 2016

Author Member

Ah, yeah, I can make it skip those. Originally I was going to store a message in there, so it made sense to call that again in case the message had been updated. Can probably do an if (!exists(path)) check first, thanks!

ssalinas Aug 24, 2016

Author Member

updated 👍
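
The exchange above suggests guarding `addDisaster` with an existence check so that already-recorded disasters are not written again. A minimal sketch of that guard follows, assuming an in-memory set as a stand-in for the persisted paths. The class `DisasterStoreSketch`, the path layout, and the enum values are hypothetical and not taken from the PR.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of an idempotent addDisaster.
final class DisasterStoreSketch {
  // Illustrative stand-in for the PR's SingularityDisasterType enum.
  enum DisasterType { LOST_SLAVES, LOST_TASKS }

  // In-memory stand-in for wherever active disasters are persisted.
  private final Set<String> paths = new LinkedHashSet<>();
  private int writeCount = 0;

  // Assumed path layout, for illustration only.
  private static String pathFor(DisasterType type) {
    return "/disasters/active/" + type.name();
  }

  boolean exists(String path) {
    return paths.contains(path);
  }

  // Writes the disaster only if it is not already recorded.
  // Returns true when a new entry was created.
  boolean addDisaster(DisasterType type) {
    String path = pathFor(type);
    if (exists(path)) {
      return false; // already active, skip the redundant write
    }
    paths.add(path);
    writeCount++;
    return true;
  }

  int getWriteCount() {
    return writeCount;
  }
}
```

With this guard, looping over `newActiveDisasters` on every detection pass only writes the entries that are actually new.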


    checkRackAfterSlaveLoss(slave.get());
  } else {
    LOG.warn("Lost a slave {}, but didn't know about it", slaveId);
  }
}

private void updateDisasterCounter(SingularitySlave slave) {
  if (slave.getCurrentState().getState() == MachineState.ACTIVE) {

darcatron Aug 24, 2016

Contributor

I might not understand this part fully, but if the slave state is changed to MachineState.DEAD, will this ever be true?

ssalinas Aug 24, 2016

Author Member

Ah, yeah the variable could probably use renaming. The one I'm grabbing is actually the previous slave state, not the one we are about to update to. Essentially I'm looking for any transitions from ACTIVE -> DEAD.
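
The point in the reply above is that the check runs against the slave's previous state, so only ACTIVE -> DEAD transitions are counted. A minimal sketch of that idea, with a parameter name that makes the intent explicit, is below. The class `SlaveLossCounterSketch`, the reduced local `MachineState` enum, and the reset method are assumptions for illustration and not the PR's code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: count only slaves that were ACTIVE before being lost.
final class SlaveLossCounterSketch {
  // Reduced local stand-in for the MachineState enum referenced in the diff.
  enum MachineState { ACTIVE, DECOMMISSIONING, DEAD }

  private final AtomicInteger activeSlavesLost = new AtomicInteger();

  // previousState is the state recorded before the slave is marked DEAD.
  void onSlaveLost(MachineState previousState) {
    if (previousState == MachineState.ACTIVE) {
      activeSlavesLost.incrementAndGet();
    }
  }

  // Reads the count for the current window and resets it (assumed behavior).
  int getAndReset() {
    return activeSlavesLost.getAndSet(0);
  }
}
```

Naming the argument `previousState` addresses the confusion raised in the review: a slave that was already decommissioning or dead does not count toward the disaster threshold.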

ssalinas added 3 commits Aug 24, 2016
ssalinas added the hs_qa label Aug 25, 2016
Member Author

ssalinas commented Aug 25, 2016

Going to merge this into disabled actions (action_disable) since they go together and are now both on staging + qa

ssalinas merged commit 1cb3945 into action_disable Aug 25, 2016
0 of 2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build is in progress
continuous-integration/travis-ci/push: The Travis CI build is in progress
ssalinas deleted the disaster_detection branch Aug 25, 2016
ssalinas changed the title from "(WIP) Disaster detection" to "Disaster detection" Sep 13, 2016