New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent flapping slave from rejoining cluster #1428
Conversation
This adds a node to zk (`/singularity/inactive`) that keeps just contains an array of hosts whose slaves have been marked as inactive. When Singularity checks if an offer should be accepted, it grabs the node from zk. If the offer is from a slave on a bad host, Singularity will discard it.
Complete the functions that are actually used to mark slaves as activated or deactivated. Previously I was manually editing the list in zk.
Add a pair of tests for the the `InactiveSlaveManager`. This is mostly just to make sure that I get what's going on; in practice the tests don't do much more than run Curator through a very small trial run.
} | ||
} | ||
|
||
public void deactiveSlave(String slave) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deactivate?
final String slaveId = offer.getSlaveId().getValue(); | ||
final String rackId = slaveAndRackHelper.getRackIdOrDefault(offer); | ||
final String host = slaveAndRackHelper.getMaybeTruncatedHost(offer); | ||
final Map<String, String> textAttributes = slaveAndRackHelper.getTextAttributes(offer); | ||
|
||
final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent()); | ||
|
||
if (inactiveSlaves.contains(host)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe instead of doing this check here, we can do it after we find out if it's a new slave. My idea here would be that we can find the new slave, but if it is in the 'bad hosts' list, move it immediately to decommissioned. The end goal is to not schedule tasks on a slave that might be flapping due to issues. So, we definitely don't want it as active, but if we ignore it or mark it dead we don't have any info bout the fact that it's still (somewhat) present. Moving it to decommissioned right away (maybe with a message set) lets us know it's still there, but without the danger of scheduling tasks on a bad slave.
} | ||
|
||
@Test | ||
public void itShouldNotContainHostAfterActivatingHost() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to extend these tests, you can have the class extend SingularitySchedulerTestBase
. In there is a resourceOffers
method which mocks the sending of an offer, allowing you to trigger the code path of getting an offer from a new slave.
When a previously-seen slave on a host which is marked as inactive attempts to join the cluster, it is now marked as `DECOMMISSIONED`. Previously, it was ignored and nothing actually happened to it. This actually will stop it from accepting offers, as well as provide visibility into what is actually going on w/r/t the flapping slave.
Next step toward being able to mark a machine for a slave up-for-review via the UI. Previously you would have needed to manually edit the node in ZK in order mark a node as inactive. Notably, un-marking this host as inactive will not allow the slave to being accepting offers until the slave is also un-marked as decom'ed or it disappears and reappears with a new slaveID. So restoring a slave will likely be a two step process.
Adds a button on each slave for marking the host that it's on as active/inactive. When there are hosts that are marked as inactive, it will also display a list of all hosts that have been marked as inactive.
/cc @ssalinas |
} | ||
|
||
public Set<String> getInactiveSlaves() { | ||
Optional<String[]> maybeInactiveSlaves = getData(ROOT_PATH, this.inactiveSlaveTranscoder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it might be easier to just create empty child nodes with the proper name here, rather than continually reading/writing an array. Creation would just be creating a node at /inactive/(hostname)
, getting all would be a getChildren()
call, and checking if an individual one would just be an exists()
call rather than grabbing the whole array.
@@ -135,7 +138,7 @@ public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) { | |||
final Set<Protos.OfferID> acceptedOffers = Sets.newHashSetWithExpectedSize(offers.size()); | |||
|
|||
for (Protos.Offer offer : offers) { | |||
slaveAndRackManager.checkOffer(offer); | |||
slaveAndRackManager.checkOffer(offer, inactiveSlaveManager.getInactiveSlaves()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For efficiency we should probably grab this before looping through offers so we don't grab it 50+ times
@@ -405,7 +406,14 @@ public void checkOffer(Offer offer) { | |||
final SingularitySlave slave = new SingularitySlave(slaveId, host, rackId, textAttributes, Optional.<MesosResourcesObject>absent()); | |||
|
|||
if (check(slave, slaveManager) == CheckResult.NEW) { | |||
LOG.info("Offer revealed a new slave {}", slave); | |||
if (inactiveSlaves.contains(slave.getHost())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, if we switch to the method of storing the hostnames I mentioned in the previous comment, we don't need to feed the list to this method and can just check if the individual inactive hostname node exists here
> | ||
<p>Are you sure you want to mark the host {slave.host} as inactive?</p> | ||
<p> | ||
This will decommission every slave on this host until you reactivate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, maybe something like:
This will automatically decommission any host that joins with a matching hostname.
I don't think we will need to mention the piece about reactivation, since the slave will still appear in the decommissioned list with the reactivate button next to it
> | ||
<p>Are you sure you want to reactivate host {slave.host}?</p> | ||
<p> | ||
This will allow new slaves from this host to make offers without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New slaves with this hostname will no longer be automatically marked as decommissioned
The making offers piece is a bit misleading. For one, we get the offers form the master, not the slave. Also, we will continue to get offers from the master for that slave as long as it is still connected. From Singularity's perspective, the important part is that it is a new slave, not necessarily that it continues to make offers
<p>These hosts are marked as inactive: </p> | ||
<ul className="list-group"> | ||
{inactiveHosts.map((host) => ( | ||
<li className="list-group-item" key={host}>{host}</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be good to be able to remove the host from this list in this location as well.
In particular, it changes a handful of phrasing issues on the UI and reorganizes how the data is stored in zk. In particular, rather than single node which contains an array as children, it now has a main node whose (empty) children represent the hosts that have been deactivated. There is now an additional method which simply checks whether a host is active. This allows the query to zk to be deferred until it actually receives an offer that reveals a new slave.
|
||
@Singleton | ||
public class InactiveSlaveManager extends CuratorManager { | ||
private static final String ROOT_PATH = "/inactive"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit pick here, but could we have this be /inactiveSlaves
? when looking at the top level in zk would be easier to tell what it's for
Change the name of the path in zk from `/inactive` to `/inactiveSlaves`.
|
This adds a node to zk (
/singularity/inactive
) that keeps justcontains an array of hosts whose slaves have been marked as inactive.
When Singularity checks if an offer should be accepted, it grabs the
node from zk. If the offer is from a slave on a bad host, Singularity
will discard it.
/cc @ssalinas Is the right way to go about this?