Prevent flapping slave from rejoining cluster #1428

Merged
merged 9 commits into from Mar 20, 2017

@PtrTeixeira
Contributor

PtrTeixeira commented Feb 22, 2017

This adds a node to zk (/singularity/inactive) that contains just
an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.
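
The offer check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the list of inactive hosts is held in an in-memory set instead of the zk node, and `InactiveHostFilter`, `deactivateHost`, `activateHost`, and `shouldAcceptOffer` are hypothetical names, not the classes or methods in this PR.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the host list stored at /singularity/inactive.
// The PR keeps this list in ZooKeeper; this sketch keeps it in memory.
class InactiveHostFilter {
  private final Set<String> inactiveHosts = ConcurrentHashMap.newKeySet();

  // Mark a host as inactive so that offers from it are discarded.
  void deactivateHost(String host) {
    inactiveHosts.add(host);
  }

  // Remove a host from the inactive list.
  void activateHost(String host) {
    inactiveHosts.remove(host);
  }

  // An offer is accepted only if its host is not on the inactive list.
  boolean shouldAcceptOffer(String offerHostname) {
    return !inactiveHosts.contains(offerHostname);
  }
}
```

With this shape, an offer from a host that was passed to `deactivateHost` is rejected until `activateHost` is called for the same hostname; offers from other hosts are unaffected.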

/cc @ssalinas Is this the right way to go about this?

PtrTeixeira added some commits Feb 22, 2017

Prevent flapping slave from rejoining cluster
This adds a node to zk (`/singularity/inactive`) that contains just
an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.
Fill out functions to activate/deactivate slaves
Complete the functions that are actually used to mark slaves as
activated or deactivated. Previously I was manually editing the list in
zk.
Add small test suite for inactive slave manager
Add a pair of tests for the `InactiveSlaveManager`. This is mostly
just to make sure that I get what's going on; in practice the tests
don't do much more than run Curator through a very small trial run.
}
@Test
public void itShouldNotContainHostAfterActivatingHost() {

@ssalinas

ssalinas Feb 23, 2017

Member

to extend these tests, you can have the class extend SingularitySchedulerTestBase. In there is a resourceOffers method which mocks the sending of an offer, allowing you to trigger the code path of getting an offer from a new slave.

PtrTeixeira added some commits Feb 23, 2017

Mark slaves on inactive host as decommissioned
When a previously-seen slave on a host which is marked as inactive
attempts to join the cluster, it is now marked as `DECOMMISSIONED`.
Previously, it was ignored and nothing happened to it. This
stops it from accepting offers and provides visibility into
what is going on w/r/t the flapping slave.
Add resource for marking slaves as inactive
Next step toward being able to mark a slave's machine as up-for-review
via the UI. Previously you would have needed to manually edit the node
in ZK in order to mark a host as inactive. Notably, un-marking this host
as inactive will not allow the slave to begin accepting offers until the
slave is also un-marked as decom'ed or it disappears and reappears with
a new slaveID. So restoring a slave will likely be a two-step process.
Add UI for marking hosts inactive
Adds a button on each slave for marking the host that it's on as
active/inactive. When there are hosts that are marked as inactive, it
will also display a list of all hosts that have been marked as inactive.
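
The decommissioning behaviour and the two-step restore described in the commit messages above can be sketched as a small state model. This is a simplified illustration, not the PR's code: `FlappingSlaveTracker` and all of its methods are hypothetical names, state is kept in memory instead of zk, and any slave seen on an inactive host is decommissioned (the PR describes this for previously-seen slaves).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of inactive hosts and per-slave state.
class FlappingSlaveTracker {
  enum SlaveState { ACTIVE, DECOMMISSIONED }

  private final Set<String> inactiveHosts = new HashSet<>();
  private final Map<String, SlaveState> slaveStates = new HashMap<>();

  void markHostInactive(String host) {
    inactiveHosts.add(host);
  }

  void markHostActive(String host) {
    inactiveHosts.remove(host);
  }

  // Called when an offer reveals a slave. Returns true if the offer may be used.
  boolean onOffer(String slaveId, String host) {
    if (inactiveHosts.contains(host)) {
      // A slave on an inactive host is decommissioned instead of silently ignored.
      slaveStates.put(slaveId, SlaveState.DECOMMISSIONED);
      return false;
    }
    // A slave id seen for the first time starts ACTIVE; a known id keeps its state.
    SlaveState state = slaveStates.computeIfAbsent(slaveId, id -> SlaveState.ACTIVE);
    return state == SlaveState.ACTIVE;
  }

  // Second restore step: clear the decommissioned state of a specific slave.
  void reactivateSlave(String slaveId) {
    slaveStates.put(slaveId, SlaveState.ACTIVE);
  }

  SlaveState stateOf(String slaveId) {
    return slaveStates.get(slaveId);
  }
}
```

In this model, calling `markHostActive` alone does not restore a decommissioned slave. Its offers are usable again only after `reactivateSlave` is also called, or when the slave comes back under a new slave id.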

@PtrTeixeira PtrTeixeira changed the title from [WIP] Prevent flapping slave from rejoining cluster to Prevent flapping slave from rejoining cluster Feb 24, 2017

PtrTeixeira commented Feb 24, 2017

/cc @ssalinas

>
<p>Are you sure you want to mark the host {slave.host} as inactive?</p>
<p>
This will decommission every slave on this host until you reactivate

@ssalinas

ssalinas Feb 27, 2017

Member

For clarity, maybe something like:

This will automatically decommission any host that joins with a matching hostname.

I don't think we will need to mention the piece about reactivation, since the slave will still appear in the decommissioned list with the reactivate button next to it

Fix issues from code review
This fixes a handful of phrasing issues on the UI and reorganizes how
the data is stored in zk. Specifically, rather than a single node which
contains an array, there is now a main node whose (empty) children
represent the hosts that have been deactivated.
There is now an additional method which simply checks whether a host is
active. This allows the query to zk to be deferred until it actually
receives an offer that reveals a new slave.
Rename ZooKeeper path
Change the name of the path in zk from `/inactive` to `/inactiveSlaves`.
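
The reorganized layout described in the two commits above (one empty child node per deactivated host under `/inactiveSlaves`) can be sketched as follows. This is an assumption-laden illustration: `InactiveHostNodes` and its methods are hypothetical, a sorted in-memory set of paths stands in for ZooKeeper, and the real node may sit under a longer base path. With Curator, the operations would roughly correspond to `create()`, `delete()`, `checkExists()`, and `getChildren()` on these paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the per-host child nodes; a set of paths stands in for zk.
class InactiveHostNodes {
  static final String ROOT = "/inactiveSlaves";

  private final Set<String> paths = new TreeSet<>();

  // Each deactivated host is an empty child node named after the host.
  static String pathFor(String host) {
    return ROOT + "/" + host;
  }

  void deactivate(String host) {
    paths.add(pathFor(host));
  }

  void activate(String host) {
    paths.remove(pathFor(host));
  }

  // A single existence check on one path; no need to read and parse a whole array.
  boolean isActive(String host) {
    return !paths.contains(pathFor(host));
  }

  // Listing the children of the root node yields the inactive hosts.
  List<String> inactiveHosts() {
    List<String> hosts = new ArrayList<>();
    for (String path : paths) {
      hosts.add(path.substring(ROOT.length() + 1));
    }
    return hosts;
  }
}
```

The point of the layout is that `isActive` touches only one child path, so the lookup can be deferred until an offer actually reveals a new slave.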

ssalinas Mar 6, 2017

Member

👍 Let's test this out in hs_staging

@ssalinas ssalinas added the hs_qa label Mar 13, 2017

@ssalinas ssalinas modified the milestone: 0.15.0 Mar 13, 2017

@ssalinas ssalinas added the hs_stable label Mar 20, 2017

@ssalinas ssalinas merged commit 98761ef into master Mar 20, 2017

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details

@ssalinas ssalinas deleted the deactivate-flapping-slave branch Mar 20, 2017
