
Disaster detection #1247

Merged: ssalinas merged 20 commits into action_disable from disaster_detection on Aug 25, 2016

Conversation

Contributor

@ssalinas ssalinas commented Aug 23, 2016

  • Add a test endpoint to explicitly trigger task reconciliation
  • Add disaster detection based on signals such as the number of overdue tasks, slaves lost, etc.; it runs on the leader only
  • Allow the disaster detection to disable certain actions when one of its thresholds is crossed
  • Email admins when a disaster is detected
  • Tune metrics appropriately to decide when something bad is happening
  • Add an API endpoint to manually enter/exit disaster mode
  • Admin UI page to manually enter/exit disaster mode (Maybe?)
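
The threshold idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `DisasterKind`, `DisableableAction`, `DisasterCheck`, the threshold values, and the mapping from disasters to disabled actions are all hypothetical names and choices made up for the example.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; none of these are taken from the PR's code.
enum DisasterKind { EXCESSIVE_OVERDUE_TASKS, EXCESSIVE_LOST_SLAVES }

enum DisableableAction { DEPLOY, BOUNCE_REQUEST }

final class DisasterCheck {
  private final int maxOverdueTasks;
  private final int maxLostSlaves;
  // Assumed mapping: which actions get disabled while a given disaster is active.
  private final Map<DisasterKind, Set<DisableableAction>> actionsToDisable =
      new EnumMap<>(DisasterKind.class);

  DisasterCheck(int maxOverdueTasks, int maxLostSlaves) {
    this.maxOverdueTasks = maxOverdueTasks;
    this.maxLostSlaves = maxLostSlaves;
    actionsToDisable.put(DisasterKind.EXCESSIVE_OVERDUE_TASKS,
        EnumSet.of(DisableableAction.DEPLOY));
    actionsToDisable.put(DisasterKind.EXCESSIVE_LOST_SLAVES,
        EnumSet.of(DisableableAction.DEPLOY, DisableableAction.BOUNCE_REQUEST));
  }

  // Returns every disaster whose threshold is strictly exceeded by the current stats.
  Set<DisasterKind> detect(int overdueTasks, int lostSlaves) {
    Set<DisasterKind> active = EnumSet.noneOf(DisasterKind.class);
    if (overdueTasks > maxOverdueTasks) {
      active.add(DisasterKind.EXCESSIVE_OVERDUE_TASKS);
    }
    if (lostSlaves > maxLostSlaves) {
      active.add(DisasterKind.EXCESSIVE_LOST_SLAVES);
    }
    return active;
  }

  // Union of the actions disabled by each active disaster.
  Set<DisableableAction> disabledActions(Set<DisasterKind> active) {
    Set<DisableableAction> disabled = EnumSet.noneOf(DisableableAction.class);
    for (DisasterKind kind : active) {
      disabled.addAll(actionsToDisable.get(kind));
    }
    return disabled;
  }
}
```

In this sketch the leader would call `detect` on each poll with fresh stats and pass the result to `disabledActions`; how the real code collects stats or notifies admins is not shown here.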

@ssalinas ssalinas modified the milestone: 0.11.0 Aug 24, 2016
}

for (SingularityDisasterType disaster : newActiveDisasters) {
  addDisaster(disaster);
Contributor

Doesn't this again add previously added disasters?

Contributor Author

Ah, yeah, I can make it skip those. Originally I was going to store a message in there, so it made sense to call that again in case the message had been updated. Can probably do an if (!exists(path)) check first, thanks!

Contributor Author

updated 👍
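
For readers following the thread, the existence check discussed above might look something like this. It is only a sketch under assumptions: `DisasterStore` is an in-memory stand-in for whatever path-based store the real code uses, and the `/disasters/active/` prefix, the `String` disaster names, and the write counter are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory stand-in for a path-based store; all names here are hypothetical.
final class DisasterStore {
  private final Set<String> paths = new HashSet<>();
  private int writes = 0;

  boolean exists(String path) {
    return paths.contains(path);
  }

  void create(String path) {
    paths.add(path);
    writes++;
  }

  // Records each newly active disaster, skipping any that is already stored.
  void addDisasters(List<String> newActiveDisasters) {
    for (String disaster : newActiveDisasters) {
      String path = "/disasters/active/" + disaster; // assumed path layout
      if (!exists(path)) {
        create(path);
      }
    }
  }

  // Number of create calls actually performed, exposed for illustration.
  int writeCount() {
    return writes;
  }
}
```

Calling `addDisasters` twice with the same disaster performs one write, which is the behavior the review comment asks for.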

}

private void updateDisasterCounter(SingularitySlave slave) {
  if (slave.getCurrentState().getState() == MachineState.ACTIVE) {
Contributor

I might not understand this part fully, but if the slave state is changed to MachineState.DEAD, will this ever be true?

Contributor Author

Ah, yeah the variable could probably use renaming. The one I'm grabbing is actually the previous slave state, not the one we are about to update to. Essentially I'm looking for any transitions from ACTIVE -> DEAD.
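
One way to make the ACTIVE -> DEAD intent explicit is to pass both the previous and the new state by name. The following is a sketch only: `SlaveState` is a simplified stand-in for the real `MachineState`, and `LostSlaveCounter`, `onStateChange`, and `getAndReset` are hypothetical names, not the PR's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the real machine states; hypothetical for this example.
enum SlaveState { ACTIVE, DECOMMISSIONING, DEAD }

final class LostSlaveCounter {
  private final AtomicInteger activeSlavesLost = new AtomicInteger();

  // Counts only transitions from ACTIVE directly to DEAD.
  void onStateChange(SlaveState previousState, SlaveState newState) {
    if (previousState == SlaveState.ACTIVE && newState == SlaveState.DEAD) {
      activeSlavesLost.incrementAndGet();
    }
  }

  // Returns the count since the last call and resets it to zero,
  // so a periodic poller sees only losses from the most recent interval.
  int getAndReset() {
    return activeSlavesLost.getAndSet(0);
  }
}
```

Naming the parameters `previousState` and `newState` addresses the confusion raised in the review, since the check no longer reads as if it inspected the state being written.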

@ssalinas ssalinas added the hs_qa label Aug 25, 2016
@ssalinas
Contributor Author

Going to merge this into disabled actions since they go together and are now both on staging + qa

@ssalinas ssalinas merged commit 1cb3945 into action_disable Aug 25, 2016
@ssalinas ssalinas deleted the disaster_detection branch August 25, 2016 13:50
@ssalinas ssalinas removed the hs_qa label Aug 25, 2016
@ssalinas ssalinas changed the title (WIP) Disaster detection Disaster detection Sep 13, 2016
2 participants