Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent flapping slave from rejoining cluster #1428

Merged
merged 9 commits into from Mar 20, 2017

Conversation

PtrTeixeira
Copy link
Contributor

This adds a node to zk (/singularity/inactive) that keeps just
contains an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.

/cc @ssalinas Is the right way to go about this?

This adds a node to zk (`/singularity/inactive`) that keeps just
contains an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.
Complete the functions that are actually used to mark slaves as
activated or deactivated. Previously I was manually editing the list in
zk.
Add a pair of tests for the the `InactiveSlaveManager`. This is mostly
just to make sure that I get what's going on; in practice the tests
don't do much more than run Curator through a very small trial run.
}
}

public void deactiveSlave(String slave) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deactivate?

final String slaveId = offer.getSlaveId().getValue();
final String rackId = slaveAndRackHelper.getRackIdOrDefault(offer);
final String host = slaveAndRackHelper.getMaybeTruncatedHost(offer);
final Map<String, String> textAttributes = slaveAndRackHelper.getTextAttributes(offer);

final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent());

if (inactiveSlaves.contains(host)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of doing this check here, we can do it after we find out if it's a new slave. My idea here would be that we can find the new slave, but if it is in the 'bad hosts' list, move it immediately to decommissioned. The end goal is to not schedule tasks on a slave that might be flapping due to issues. So, we definitely don't want it as active, but if we ignore it or mark it dead we don't have any info bout the fact that it's still (somewhat) present. Moving it to decommissioned right away (maybe with a message set) lets us know it's still there, but without the danger of scheduling tasks on a bad slave.

}

@Test
public void itShouldNotContainHostAfterActivatingHost() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to extend these tests, you can have the class extend SingularitySchedulerTestBase. In there is a resourceOffers method which mocks the sending of an offer, allowing you to trigger the code path of getting an offer from a new slave.

When a previously-seen slave on a host which is marked as inactive
attempts to join the cluster, it is now marked as `DECOMMISSIONED`.
Previously, it was ignored and nothing actually happened to it. This
actually will stop it from accepting offers, as well as provide
visibility into what is actually going on w/r/t the flapping slave.
Next step toward being able to mark a machine for a slave up-for-review
via the UI. Previously you would have needed to manually edit the node
in ZK in order mark a node as inactive. Notably, un-marking this host as
inactive will not allow the slave to being accepting offers until the
slave is also un-marked as decom'ed or it disappears and reappears with
a new slaveID. So restoring a slave will likely be a two step process.
Adds a button on each slave for marking the host that it's on as
active/inactive. When there are hosts that are marked as inactive, it
will also display a list of all hosts that have been marked as inactive.
@PtrTeixeira PtrTeixeira changed the title [WIP] Prevent flapping slave from rejoining cluster Prevent flapping slave from rejoining cluster Feb 24, 2017
@PtrTeixeira
Copy link
Contributor Author

/cc @ssalinas

}

public Set<String> getInactiveSlaves() {
Optional<String[]> maybeInactiveSlaves = getData(ROOT_PATH, this.inactiveSlaveTranscoder);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it might be easier to just create empty child nodes with the proper name here, rather than continually reading/writing an array. Creation would just be creating a node at /inactive/(hostname), getting all would be a getChildren() call, and checking if an individual one would just be an exists() call rather than grabbing the whole array.

@@ -135,7 +138,7 @@ public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
final Set<Protos.OfferID> acceptedOffers = Sets.newHashSetWithExpectedSize(offers.size());

for (Protos.Offer offer : offers) {
slaveAndRackManager.checkOffer(offer);
slaveAndRackManager.checkOffer(offer, inactiveSlaveManager.getInactiveSlaves());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For efficiency we should probably grab this before looping through offers so we don't grab it 50+ times

@@ -405,7 +406,14 @@ public void checkOffer(Offer offer) {
final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent());

if (check(slave, slaveManager) == CheckResult.NEW) {
LOG.info("Offer revealed a new slave {}", slave);
if (inactiveSlaves.contains(slave.getHost())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, if we switch to the method of storing the hostnames I mentioned in the previous comment, we don't need to feed the list to this method and can just check if the individual inactive hostname node exists here

>
<p>Are you sure you want to mark the host {slave.host} as inactive?</p>
<p>
This will decommission every slave on this host until you reactivate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, maybe something like:

This will automatically decommission any host that joins with a matching hostname.

I don't think we will need to mention the piece about reactivation, since the slave will still appear in the decommissioned list with the reactivate button next to it

>
<p>Are you sure you want to reactivate host {slave.host}?</p>
<p>
This will allow new slaves from this host to make offers without
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New slaves with this hostname will no longer be automatically marked as decommissioned

The making offers piece is a bit misleading. For one, we get the offers form the master, not the slave. Also, we will continue to get offers from the master for that slave as long as it is still connected. From Singularity's perspective, the important part is that it is a new slave, not necessarily that it continues to make offers

<p>These hosts are marked as inactive: </p>
<ul className="list-group">
{inactiveHosts.map((host) => (
<li className="list-group-item" key={host}>{host}</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be good to be able to remove the host from this list in this location as well.

In particular, it changes a handful of phrasing issues on the UI and
reorganizes how the data is stored in zk. In particular, rather than
single node which contains an array as children, it now has a main node
whose (empty) children represent the hosts that have been deactivated.
There is now an additional method which simply checks whether a host is
active. This allows the query to zk to be deferred until it actually
receives an offer that reveals a new slave.

@Singleton
public class InactiveSlaveManager extends CuratorManager {
private static final String ROOT_PATH = "/inactive";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit pick here, but could we have this be /inactiveSlaves? when looking at the top level in zk would be easier to tell what it's for

Change the name of the path in zk from `/inactive` to `/inactiveSlaves`.
@ssalinas
Copy link
Member

ssalinas commented Mar 6, 2017

👍 Let's test this out in hs_staging

@ssalinas ssalinas added the hs_qa label Mar 13, 2017
@ssalinas ssalinas modified the milestone: 0.15.0 Mar 13, 2017
@ssalinas ssalinas merged commit 98761ef into master Mar 20, 2017
@ssalinas ssalinas deleted the deactivate-flapping-slave branch March 20, 2017 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants