Disaster detection #1247

Merged
merged 20 commits into from Aug 25, 2016

@ssalinas (Member) commented Aug 23, 2016

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add disaster detection based on signals such as the number of overdue tasks and slaves lost; it will run on the leader only
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an API endpoint to manually enter/exit disaster mode
  • Admin UI page to manually enter/exit disaster mode (Maybe?)
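
The threshold idea in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name, the enum values, and the two threshold parameters are all assumptions made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these names are taken from the PR's code.
final class DisasterDetector {
  // Placeholder disaster types, loosely named after the signals in the description.
  enum DisasterType { EXCESSIVE_OVERDUE_TASKS, LOST_SLAVES }

  private final int maxOverdueTasks;
  private final int maxSlavesLost;

  DisasterDetector(int maxOverdueTasks, int maxSlavesLost) {
    this.maxOverdueTasks = maxOverdueTasks;
    this.maxSlavesLost = maxSlavesLost;
  }

  // Returns the set of disasters whose threshold is currently exceeded.
  Set<DisasterType> detect(int overdueTasks, int slavesLost) {
    Set<DisasterType> active = EnumSet.noneOf(DisasterType.class);
    if (overdueTasks > maxOverdueTasks) {
      active.add(DisasterType.EXCESSIVE_OVERDUE_TASKS);
    }
    if (slavesLost > maxSlavesLost) {
      active.add(DisasterType.LOST_SLAVES);
    }
    return active;
  }

  // In this sketch, actions are disabled while any disaster is active.
  static boolean actionsDisabled(Set<DisasterType> active) {
    return !active.isEmpty();
  }

  public static void main(String[] args) {
    DisasterDetector detector = new DisasterDetector(10, 3);
    System.out.println(actionsDisabled(detector.detect(25, 1))); // true
    System.out.println(actionsDisabled(detector.detect(2, 0)));  // false
  }
}
```

A real implementation would presumably choose which actions to disable per disaster type; the single boolean here only keeps the example short.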

@ssalinas ssalinas modified the milestone: 0.11.0 Aug 24, 2016
  }

  for (SingularityDisasterType disaster : newActiveDisasters) {
    addDisaster(disaster);

@darcatron (Contributor), Aug 24, 2016

Doesn't this re-add disasters that were already added?

@ssalinas (Member Author), Aug 24, 2016

Ah, yeah, I can make it skip those. Originally I was going to store a message in there, so it made sense to call that again in case the message had been updated. I can probably do an if (!exists(path)) check first, thanks!
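
A minimal sketch of the exists-before-add check described here, assuming an in-memory set in place of whatever store exists(path) queries. The class name, the string-typed disasters, and the example values are hypothetical and not taken from the PR.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the manager in the diff; not the PR's actual code.
final class ActiveDisasters {
  // In-memory substitute for the store that exists(path) would query.
  private final Set<String> active = new LinkedHashSet<>();

  boolean exists(String disaster) {
    return active.contains(disaster);
  }

  // Returns true only when the disaster was newly recorded.
  boolean addDisaster(String disaster) {
    if (exists(disaster)) {
      return false; // already recorded, so skip the redundant write
    }
    active.add(disaster);
    return true;
  }

  // Adds each disaster in the list and returns how many were actually new.
  int addAll(List<String> newActiveDisasters) {
    int added = 0;
    for (String disaster : newActiveDisasters) {
      if (addDisaster(disaster)) {
        added++;
      }
    }
    return added;
  }

  public static void main(String[] args) {
    ActiveDisasters disasters = new ActiveDisasters();
    // Disaster names below are illustrative placeholders.
    System.out.println(disasters.addAll(Arrays.asList("LOST_SLAVES", "OVERDUE_TASKS"))); // 2
    System.out.println(disasters.addAll(Arrays.asList("LOST_SLAVES", "LOST_TASKS")));    // 1
  }
}
```

If a message were stored alongside each disaster, as originally planned, the skip would have to be dropped or replaced with an update so that a changed message is not lost.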

@ssalinas (Member Author), Aug 24, 2016

updated 👍


      checkRackAfterSlaveLoss(slave.get());
    } else {
      LOG.warn("Lost a slave {}, but didn't know about it", slaveId);
    }
  }

  private void updateDisasterCounter(SingularitySlave slave) {
    if (slave.getCurrentState().getState() == MachineState.ACTIVE) {

@darcatron (Contributor), Aug 24, 2016

I might not understand this part fully, but if the slave state is changed to MachineState.DEAD, will this ever be true?

@ssalinas (Member Author), Aug 24, 2016

Ah, yeah the variable could probably use renaming. The one I'm grabbing is actually the previous slave state, not the one we are about to update to. Essentially I'm looking for any transitions from ACTIVE -> DEAD.
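
The transition check described in this reply could look roughly like the sketch below, which makes the previous state an explicit parameter. Only ACTIVE and DEAD come from the thread; the class, the counter, and the OTHER placeholder state are assumptions for illustration.

```java
// Hypothetical sketch; only the ACTIVE and DEAD state names come from the thread.
final class SlaveLossCounter {
  // OTHER is a placeholder for whatever additional states exist.
  enum MachineState { ACTIVE, DEAD, OTHER }

  private int activeSlavesLost = 0;

  // previousState is the state stored before the update;
  // newState is the state about to be written.
  void recordTransition(MachineState previousState, MachineState newState) {
    if (previousState == MachineState.ACTIVE && newState == MachineState.DEAD) {
      activeSlavesLost++; // count only ACTIVE -> DEAD transitions
    }
  }

  int getActiveSlavesLost() {
    return activeSlavesLost;
  }

  public static void main(String[] args) {
    SlaveLossCounter counter = new SlaveLossCounter();
    counter.recordTransition(MachineState.ACTIVE, MachineState.DEAD); // counted
    counter.recordTransition(MachineState.DEAD, MachineState.DEAD);   // ignored
    counter.recordTransition(MachineState.OTHER, MachineState.DEAD);  // ignored
    System.out.println(counter.getActiveSlavesLost()); // 1
  }
}
```

Naming the parameter previousState addresses the confusion raised in the question, since the comparison against ACTIVE then reads as a check on the state before the loss.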

@ssalinas ssalinas added the hs_qa label Aug 25, 2016

@ssalinas (Member Author) commented Aug 25, 2016

Going to merge this into the disabled actions branch (action_disable), since the two go together and are both now on staging and QA.

@ssalinas ssalinas merged commit 1cb3945 into action_disable Aug 25, 2016
0 of 2 checks passed
@ssalinas ssalinas deleted the disaster_detection branch Aug 25, 2016
@ssalinas ssalinas removed the hs_qa label Aug 25, 2016
@ssalinas ssalinas changed the title (WIP) Disaster detection Disaster detection Sep 13, 2016