Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent flapping slave from rejoining cluster #1428

Merged
merged 9 commits into from Mar 20, 2017
Merged

Conversation

@PtrTeixeira
Copy link
Contributor

PtrTeixeira commented Feb 22, 2017

This adds a node to zk (/singularity/inactive) that keeps just
contains an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.

/cc @ssalinas Is the right way to go about this?

PtrTeixeira added 3 commits Feb 22, 2017
This adds a node to zk (`/singularity/inactive`) that keeps just
contains an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.
Complete the functions that are actually used to mark slaves as
activated or deactivated. Previously I was manually editing the list in
zk.
Add a pair of tests for the the `InactiveSlaveManager`. This is mostly
just to make sure that I get what's going on; in practice the tests
don't do much more than run Curator through a very small trial run.
}
}

public void deactiveSlave(String slave) {

This comment has been minimized.

@ssalinas

ssalinas Feb 23, 2017 Member

deactivate?

final String slaveId = offer.getSlaveId().getValue();
final String rackId = slaveAndRackHelper.getRackIdOrDefault(offer);
final String host = slaveAndRackHelper.getMaybeTruncatedHost(offer);
final Map<String, String> textAttributes = slaveAndRackHelper.getTextAttributes(offer);

final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent());

if (inactiveSlaves.contains(host)) {

This comment has been minimized.

@ssalinas

ssalinas Feb 23, 2017 Member

maybe instead of doing this check here, we can do it after we find out if it's a new slave. My idea here would be that we can find the new slave, but if it is in the 'bad hosts' list, move it immediately to decommissioned. The end goal is to not schedule tasks on a slave that might be flapping due to issues. So, we definitely don't want it as active, but if we ignore it or mark it dead we don't have any info bout the fact that it's still (somewhat) present. Moving it to decommissioned right away (maybe with a message set) lets us know it's still there, but without the danger of scheduling tasks on a bad slave.

}

@Test
public void itShouldNotContainHostAfterActivatingHost() {

This comment has been minimized.

@ssalinas

ssalinas Feb 23, 2017 Member

to extend these tests, you can have the class extend SingularitySchedulerTestBase. In there is a resourceOffers method which mocks the sending of an offer, allowing you to trigger the code path of getting an offer from a new slave.

PtrTeixeira added 3 commits Feb 23, 2017
When a previously-seen slave on a host which is marked as inactive
attempts to join the cluster, it is now marked as `DECOMMISSIONED`.
Previously, it was ignored and nothing actually happened to it. This
actually will stop it from accepting offers, as well as provide
visibility into what is actually going on w/r/t the flapping slave.
Next step toward being able to mark a machine for a slave up-for-review
via the UI. Previously you would have needed to manually edit the node
in ZK in order mark a node as inactive. Notably, un-marking this host as
inactive will not allow the slave to being accepting offers until the
slave is also un-marked as decom'ed or it disappears and reappears with
a new slaveID. So restoring a slave will likely be a two step process.
Adds a button on each slave for marking the host that it's on as
active/inactive. When there are hosts that are marked as inactive, it
will also display a list of all hosts that have been marked as inactive.
@PtrTeixeira PtrTeixeira changed the title [WIP] Prevent flapping slave from rejoining cluster Prevent flapping slave from rejoining cluster Feb 24, 2017
@PtrTeixeira
Copy link
Contributor Author

PtrTeixeira commented Feb 24, 2017

/cc @ssalinas

}

public Set<String> getInactiveSlaves() {
Optional<String[]> maybeInactiveSlaves = getData(ROOT_PATH, this.inactiveSlaveTranscoder);

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

Wondering if it might be easier to just create empty child nodes with the proper name here, rather than continually reading/writing an array. Creation would just be creating a node at /inactive/(hostname), getting all would be a getChildren() call, and checking if an individual one would just be an exists() call rather than grabbing the whole array.

@@ -135,7 +138,7 @@ public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
final Set<Protos.OfferID> acceptedOffers = Sets.newHashSetWithExpectedSize(offers.size());

for (Protos.Offer offer : offers) {
slaveAndRackManager.checkOffer(offer);
slaveAndRackManager.checkOffer(offer, inactiveSlaveManager.getInactiveSlaves());

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

For efficiency we should probably grab this before looping through offers so we don't grab it 50+ times

@@ -405,7 +406,14 @@ public void checkOffer(Offer offer) {
final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent());

if (check(slave, slaveManager) == CheckResult.NEW) {
LOG.info("Offer revealed a new slave {}", slave);
if (inactiveSlaves.contains(slave.getHost())) {

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

alternatively, if we switch to the method of storing the hostnames I mentioned in the previous comment, we don't need to feed the list to this method and can just check if the individual inactive hostname node exists here

>
<p>Are you sure you want to mark the host {slave.host} as inactive?</p>
<p>
This will decommission every slave on this host until you reactivate

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

For clarity, maybe something like:

This will automatically decommission any host that joins with a matching hostname.

I don't think we will need to mention the piece about reactivation, since the slave will still appear in the decommissioned list with the reactivate button next to it

>
<p>Are you sure you want to reactivate host {slave.host}?</p>
<p>
This will allow new slaves from this host to make offers without

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

New slaves with this hostname will no longer be automatically marked as decommissioned

The making offers piece is a bit misleading. For one, we get the offers form the master, not the slave. Also, we will continue to get offers from the master for that slave as long as it is still connected. From Singularity's perspective, the important part is that it is a new slave, not necessarily that it continues to make offers

<p>These hosts are marked as inactive: </p>
<ul className="list-group">
{inactiveHosts.map((host) => (
<li className="list-group-item" key={host}>{host}</li>

This comment has been minimized.

@ssalinas

ssalinas Feb 27, 2017 Member

might be good to be able to remove the host from this list in this location as well.

In particular, it changes a handful of phrasing issues on the UI and
reorganizes how the data is stored in zk. In particular, rather than
single node which contains an array as children, it now has a main node
whose (empty) children represent the hosts that have been deactivated.
There is now an additional method which simply checks whether a host is
active. This allows the query to zk to be deferred until it actually
receives an offer that reveals a new slave.

@Singleton
public class InactiveSlaveManager extends CuratorManager {
private static final String ROOT_PATH = "/inactive";

This comment has been minimized.

@ssalinas

ssalinas Mar 3, 2017 Member

nit pick here, but could we have this be /inactiveSlaves? when looking at the top level in zk would be easier to tell what it's for

Change the name of the path in zk from `/inactive` to `/inactiveSlaves`.
@ssalinas
Copy link
Member

ssalinas commented Mar 6, 2017

👍 Let's test this out in hs_staging

@ssalinas ssalinas added the hs_qa label Mar 13, 2017
@ssalinas ssalinas modified the milestone: 0.15.0 Mar 13, 2017
@ssalinas ssalinas added the hs_stable label Mar 20, 2017
@ssalinas ssalinas merged commit 98761ef into master Mar 20, 2017
0 of 2 checks passed
0 of 2 checks passed
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details
@ssalinas ssalinas deleted the deactivate-flapping-slave branch Mar 20, 2017
@ssalinas ssalinas removed hs_qa labels Mar 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.