Disk decider can allocate more data than the node can handle #7753

Closed
grantr opened this issue Sep 16, 2014 · 11 comments · Fixed by #7785

grantr commented Sep 16, 2014

We had a disk full event recently that exposed a potentially dangerous behavior of the disk-based shard allocation.

It appears that the disk-based allocation algorithm checks to see whether shards will fit on a node and disallows shards that would increase the usage past the high watermark. That's good. But in our case, the disks filled up anyway.

We had a cluster where every node's data partition was close to full. When a node (we'll call it node A) ran out of space, most of the shards allocated to it failed and were deleted. This took it very close to the low watermark, but not quite under. Later, an unknown event (possibly a merge, possibly a human doing something) freed more space and brought disk usage back under the low watermark. Elasticsearch allocated a few of the failed shards from before back to the same node. However, recovery of those shards failed due to disk full errors.

At roughly the same time, another node in the cluster (node B) ran out of disk and failed a bunch of shards.

I believe the recovery failed because two events triggered allocation at roughly the same time. The first caused the disk-based allocator to allocate some shards to the node. While those shards were initializing, the second event caused another instance of the allocator to allocate even more shards to the same node.

Does the disk-based allocator consider the expected disk usage after current recoveries are finished, or does it ignore current recoveries?

Unfortunately I don't have logs of allocation decisions, so I don't know exactly which shards were allocated where. I know that all the shards that failed recovery were originally allocated to node A. It's possible that none of the shards from node B were actually allocated to node A.

Regardless of what actually happened, I'm hoping that someone can explain to me what the disk-based allocator would be expected to do in the above case where there are two allocation events in a short time.


bluelu commented Sep 18, 2014

Hi, we have hit the same issue multiple times (version 1.0.2):

This can easily be reproduced by restarting a cluster.

Our disk load is about 70% on average with 3 shards per node before we restart the cluster. After a restart, Elasticsearch sometimes allocates 2-3 or more additional shards to some of these nodes, even though those nodes' disks are already nearly full (the limits are set to 65% "don't allocate" and 95% "move away"). In the end these nodes won't have more than 4 shards "active", but there are still "unused" shards on the node which serve as backup copies in case recovery from the primary fails (since we restarted the cluster).

The main issue is that a new allocation does not seem to take the expected shard size of the primary into consideration.

Then, after a few minutes, nodes start running completely full and we have to manually abort the relocation to those nodes.
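
As a point of reference, here is a minimal sketch of how limits like the ones described above would look in elasticsearch.yml; the values are taken from this report and are not the defaults:

```yaml
# Disk-based shard allocation thresholds (illustrative values from the report above).
cluster.routing.allocation.disk.threshold_enabled: true
# Above the low watermark, no new shards are allocated to the node.
cluster.routing.allocation.disk.watermark.low: 65%
# Above the high watermark, Elasticsearch tries to relocate shards away from the node.
cluster.routing.allocation.disk.watermark.high: 95%
```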


dakrone commented Sep 18, 2014

Does the disk-based allocator consider the expected disk usage after current recoveries are finished, or does it ignore current recoveries?

It only considers the expected disk usage after the shard in question has completed relocation to the node. The size is considered independently for each shard, because AllocationDeciders are treated as semi-stateless and are run in multiple simulations.

I'm hoping that someone can explain to me what the disk-based allocator would be expected to do in the above case where there are two allocation events in a short time.

Disk usage and shard allocation are a bit more complex. In order to determine the free disk space, we have to poll for the amount of disk used at an interval, because AllocationDecider.canAllocate can be run thousands of times per second depending on the size of the cluster (we can't issue a nodes stats request on every invocation just to get fresh disk usage). This is why the InternalClusterInfoService has an interval (cluster.info.update.interval) for fetching new disk information, every 30 seconds by default.

So here's what can happen if not careful:

  • Node A is sitting at 74% disk usage, the low watermark is 75%.
  • FS stats are gathered
  • Master decides to allocate a shard to node A, sees it's under the low watermark, and starts relocation
  • Relocation finishes, Node A is now at 80% usage
  • Master decides to allocate another shard to node A, it's still under the low watermark because new fs stats haven't been gathered yet
  • Relocation finishes, even though Node A didn't have room for it!
  • FS stats are gathered
  • Master finally sees Node A is now above the high watermark and can try to relocate data away from it

This is the worst-case scenario. There are a few ways to address it.

First, the interval for the cluster info update can be shortened, so that FS information is gathered more frequently.

Second, cluster.routing.allocation.cluster_concurrent_rebalance can be lowered. The default is 2 (I don't know whether you raised it), but lowering it to 1 means only a single shard can be rebalanced at a time, which helps keep the picture of disk usage more "even". You can also lower the

For future debugging, to log what ES thinks the current sizes are, you can enable TRACE logging for cluster.InternalClusterInfoService (which will log the retrieved sizes) or for cluster.routing.allocation.decider.DiskThresholdDecider (which will log information about the allocation decisions).
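
Putting those suggestions together, a rough sketch of the relevant knobs (values are illustrative only; the logger section assumes the 1.x logging.yml layout):

```yaml
# elasticsearch.yml -- illustrative values, not recommendations
# Gather FS stats more often than the 30s default.
cluster.info.update.interval: 10s
# Allow only one concurrent rebalance so the disk-usage picture stays closer to reality.
cluster.routing.allocation.cluster_concurrent_rebalance: 1

# logging.yml -- TRACE logging for the classes mentioned above
logger:
  cluster.InternalClusterInfoService: TRACE
  cluster.routing.allocation.decider.DiskThresholdDecider: TRACE
```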

I will also try to think of a better way to prevent this situation from happening in the future.


bluelu commented Sep 18, 2014

@dakrone

Could it be that, when the master allocates a shard, it only takes into consideration the disk usage and the predicted size of that shard, but not the shards which haven't yet recovered/initialised and for which it did the same operation just before? In our case relocation will never finish.

I'm not sure, but I don't think cluster_concurrent_rebalance has any effect when you do a cluster restart, as primaries and replicas need to be allocated rather than rebalanced.
Also, out-of-date FS information should have no effect in our case.

That would explain why we see this behaviour at cluster restart.


dakrone commented Sep 18, 2014

@bluelu:

Could it be that, when the master allocates a shard, it only takes into consideration the disk usage and the predicted size of that shard, but not the shards which haven't yet recovered/initialised and for which it did the same operation just before?

It does take disk usage into account (through the ClusterInfoService), as well as the predicted size of the shard in question; however, it does not take into account the final, total size of other shards that are currently relocating to the node.

So a decider looking at a node with 0% disk usage and evaluating a 5gb shard will see that the node would end up with 5gb of used space, even if n relocations of other shards to this node have just started.


dakrone commented Sep 18, 2014

I think it might be possible to get the list of other shards currently relocating to a node and factor their sizes into the final disk usage total, if that solution sounds useful to you, @bluelu and @grantr. However, it would be good to first confirm the source of the issue (that it is indeed multiple relocations being evaluated independently). Turning on the logging I mentioned above would be helpful if you see the issue again!


bluelu commented Sep 18, 2014

Sounds like the exact solution for the problem we have.


dakrone commented Sep 18, 2014

I believe this may also help with #6168 /cc @gibrown


bluelu commented Sep 18, 2014

Yes, we run into this if we don't manually intervene like we do now.


TwP commented Sep 18, 2014

@dakrone we have run into this situation before, where the available disk space calculation did not take in-progress relocations into account. When @drewr was helping us recover a cluster, he recommended setting the number of concurrent relocations to 1 in order to prevent this from happening.

I can drop more details in here when I'm back in front of a computer.


grantr commented Sep 18, 2014

You can also lower the

You can also lower the what? Inquiring minds want to know ;)


dakrone commented Sep 19, 2014

You can also lower the what? Inquiring minds want to know ;)

Whoops, sorry! I meant to say: lower the speed at which the shard is recovered (through throttling), to give the InternalClusterInfoService more time to gather updated information. But that is not as ideal as taking the relocations into account.
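
For completeness, the kind of throttling meant here would be something like the following; the setting exists in 1.x and the value is only an example:

```yaml
# elasticsearch.yml -- slow down shard recovery so FS stats have time to catch up
# (example value; pick something appropriate for your hardware)
indices.recovery.max_bytes_per_sec: 10mb
```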

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 19, 2014
When using the DiskThresholdDecider, it's possible that shards could
already be marked as relocating to the node being evaluated. This commit
adds a new setting `cluster.routing.allocation.disk.include_relocations`
which adds the size of the shards currently being relocated to this node
to the node's used disk space.

This new option defaults to `true`, however it's possible to
over-estimate the usage for a node if the relocation is already
partially complete, for instance:

A node with a 10gb shard that's 45% of the way through a relocation
would add 10gb + (.45 * 10) = 14.5gb to the node's disk usage before
examining the watermarks to see if a new shard can be allocated.

Fixes elastic#7753
Relates to elastic#6168
dakrone added a commit that referenced this issue Sep 19, 2014
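
For anyone landing here later, a sketch of the setting introduced by the commit above; per the commit message it defaults to `true`, so it only needs to be touched if you want to opt out of counting in-flight relocations:

```yaml
# elasticsearch.yml -- added by #7785 (default: true)
# Set to false to restore the old behaviour of ignoring in-flight relocations,
# at the risk of over-allocating an almost-full node.
cluster.routing.allocation.disk.include_relocations: true
```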