Disk decider can allocate more data than the node can handle #7753

Closed
grantr opened this issue Sep 16, 2014 · 11 comments · Fixed by #7785

grantr commented Sep 16, 2014

We had a disk full event recently that exposed a potentially dangerous behavior of the disk-based shard allocation.

It appears that the disk-based allocation algorithm checks to see whether shards will fit on a node and disallows shards that would increase the usage past the high watermark. That's good. But in our case, the disks filled up anyway.

We had a cluster where every node's data partition was close to full. When a node (we'll call it node A) ran out of space, most of the shards allocated to it failed and were deleted. This took it very close to the low watermark, but not quite under. Later, an unknown event (possibly a merge, possibly a human doing something) freed more space and brought disk usage back under the low watermark. Elasticsearch allocated a few of the failed shards from before back to the same node. However, recovery of those shards failed due to disk full errors.

At roughly the same time, another node in the cluster (node B) ran out of disk and failed a bunch of shards.

I believe the recovery failed because two events triggered allocation at roughly the same time. The first caused the disk-based allocator to allocate some shards to the node. While those shards were initializing, the second event caused another instance of the allocator to allocate even more shards to the same node.

Does the disk-based allocator consider the expected disk usage after current recoveries are finished, or does it ignore current recoveries?

Unfortunately I don't have logs of allocation decisions, so I don't know exactly which shards were allocated where. I know that all the shards that failed recovery were originally allocated to node A. It's possible that none of the shards from node B were actually allocated to node A.

Regardless of what actually happened, I'm hoping that someone can explain to me what the disk-based allocator would be expected to do in the above case where there are two allocation events in a short time.


bluelu commented Sep 18, 2014

Hi, we have hit the same issue multiple times (version 1.0.2):

This can easily be reproduced by restarting a cluster.

Our disk load is about 70% on average with 3 shards per node before we restart the cluster. After a restart, Elasticsearch sometimes allocates 2-3 or more additional shards to some of these nodes, even though those nodes' disks are already nearly full (the limits are set to 65% "don't allocate" and 95% "move away"). In the end these nodes won't have more than 4 shards "active", but there are still "unused" shards on the node which serve as backup copies in case recovery from the primary fails (since we restarted the cluster).

The main issue is that a new allocation does not seem to take the expected shard size of the primary into consideration.

Then, after a few minutes, nodes start running completely full and we have to manually abort the relocation to those nodes.
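
As a point of reference, here is a minimal sketch of how limits like the ones described above would look in elasticsearch.yml; the values are taken from this report and are not the defaults:

```yaml
# Disk-based shard allocation thresholds (illustrative values from the report above).
cluster.routing.allocation.disk.threshold_enabled: true
# Above the low watermark, no new shards are allocated to the node.
cluster.routing.allocation.disk.watermark.low: 65%
# Above the high watermark, Elasticsearch tries to relocate shards away from the node.
cluster.routing.allocation.disk.watermark.high: 95%
```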


dakrone commented Sep 18, 2014

Does the disk-based allocator consider the expected disk usage after current recoveries are finished, or does it ignore current recoveries?

It only considers the expected disk usage after the shard in question has completed relocation to the node. The size is considered independently for each shard, because AllocationDeciders are treated as semi-stateless and are run in multiple simulations.

I'm hoping that someone can explain to me what the disk-based allocator would be expected to do in the above case where there are two allocation events in a short time.

Disk usage and shard allocation are a bit more complex. In order to determine the free disk space, we have to poll for the amount of disk used at an interval, because AllocationDecider.canAllocate can be run thousands of times per second depending on the size of the cluster (we can't issue a nodes stats request on every invocation just to get fresh disk usage). This is why the InternalClusterInfoService has an interval (cluster.info.update.interval) for fetching new disk information, every 30 seconds by default.

So here's what can happen if not careful:

  • Node A is sitting at 74% disk usage, the low watermark is 75%.
  • FS stats are gathered
  • Master decides to allocate a shard to node A, sees it's under the low watermark, and starts relocation
  • Relocation finishes, Node A is now at 80% usage
  • Master decides to allocate another shard to node A, it's still under the low watermark because new fs stats haven't been gathered yet
  • Relocation finishes, even though Node A didn't have room for it!
  • FS stats are gathered
  • Master finally sees Node A is now above the high watermark and can try to relocate data away from it

This is the worst-case scenario. There are a few ways to address it.

First, the interval for the cluster info update can be shortened, so that FS information is gathered more frequently.

Second, cluster.routing.allocation.cluster_concurrent_rebalance can be lowered. The default is 2 (I don't know whether you raised it), but lowering it to 1 means only a single shard can be rebalanced at a time, which helps keep the picture of disk usage more "even". You can also lower the

For future debugging, to log what ES thinks the current sizes are, you can enable TRACE logging for cluster.InternalClusterInfoService (which will log the retrieved sizes) or for cluster.routing.allocation.decider.DiskThresholdDecider (which will log information about the allocation decisions).
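
Putting those suggestions together, a rough sketch of the relevant knobs (values are illustrative only; the logger section assumes the 1.x logging.yml layout):

```yaml
# elasticsearch.yml -- illustrative values, not recommendations
# Gather FS stats more often than the 30s default.
cluster.info.update.interval: 10s
# Allow only one concurrent rebalance so the disk-usage picture stays closer to reality.
cluster.routing.allocation.cluster_concurrent_rebalance: 1

# logging.yml -- TRACE logging for the classes mentioned above
logger:
  cluster.InternalClusterInfoService: TRACE
  cluster.routing.allocation.decider.DiskThresholdDecider: TRACE
```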

I will also try to think of a better way to prevent this situation from happening in the future.


bluelu commented Sep 18, 2014

@dakrone

Could it be that, when the master allocates a shard, it only takes into consideration the disk usage and the predicted size of that shard, but not the shards which haven't yet recovered/initialised and for which it did the same operation just before? In our case relocation will never finish.

I'm not sure, but I don't think cluster_concurrent_rebalance has any effect when you do a cluster restart, as primaries and replicas need to be allocated rather than rebalanced.
Also, out-of-date FS information should have no effect in our case.

That would explain why we see this behaviour at cluster restart.


dakrone commented Sep 18, 2014

@bluelu:

Could it be that, when the master allocates a shard, it only takes into consideration the disk usage and the predicted size of that shard, but not the shards which haven't yet recovered/initialised and for which it did the same operation just before?

It does take disk usage into account (through the ClusterInfoService), as well as the predicted size of the shard in question; however, it does not take into account the final, total size of other shards that are currently relocating to the node.

So a decider looking at a node with 0% disk usage and evaluating a 5gb shard will see that the node would end up with 5gb of used space, even if n relocations of other shards to this node have just started.


dakrone commented Sep 18, 2014

I think it might be possible to get the list of other shards currently relocating to a node and factor their sizes into the final disk usage total, if that solution sounds useful to you, @bluelu and @grantr. However, it would be good to first confirm the source of the issue (that it is indeed multiple relocations being evaluated independently). Turning on the logging I mentioned above would be helpful if you see the issue again!


bluelu commented Sep 18, 2014

Sounds like the exact solution for the problem we have.


dakrone commented Sep 18, 2014

I believe this may also help with #6168 /cc @gibrown


bluelu commented Sep 18, 2014

Yes, we run into this if we don't manually intervene like we do now.


TwP commented Sep 18, 2014

@dakrone we have run into this situation before, where the available disk space calculation did not take in-progress relocations into account. When @drewr was helping us recover a cluster, he recommended setting the number of concurrent relocations to 1 in order to prevent this from happening.

I can drop more details in here when I'm back in front of a computer.


grantr commented Sep 18, 2014

You can also lower the

You can also lower the what? Inquiring minds want to know ;)


dakrone commented Sep 19, 2014

You can also lower the what? Inquiring minds want to know ;)

Whoops, sorry! I meant to say: lower the speed at which the shard is recovered (through throttling), to give the InternalClusterInfoService more time to gather updated information. But that is not as ideal as taking the relocations into account.
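
For completeness, the kind of throttling meant here would be something like the following; the setting exists in 1.x and the value is only an example:

```yaml
# elasticsearch.yml -- slow down shard recovery so FS stats have time to catch up
# (example value; pick something appropriate for your hardware)
indices.recovery.max_bytes_per_sec: 10mb
```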

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 19, 2014
When using the DiskThresholdDecider, it's possible that shards could
already be marked as relocating to the node being evaluated. This commit
adds a new setting `cluster.routing.allocation.disk.include_relocations`
which adds the size of the shards currently being relocated to this node
to the node's used disk space.

This new option defaults to `true`, however it's possible to
over-estimate the usage for a node if the relocation is already
partially complete, for instance:

A node with a 10gb shard that's 45% of the way through a relocation
would add 10gb + (.45 * 10) = 14.5gb to the node's disk usage before
examining the watermarks to see if a new shard can be allocated.

Fixes elastic#7753
Relates to elastic#6168
dakrone added a commit that referenced this issue Sep 19, 2014
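
For anyone landing here later, a sketch of the setting introduced by the commit above; per the commit message it defaults to `true`, so it only needs to be touched if you want to opt out of counting in-flight relocations:

```yaml
# elasticsearch.yml -- added by #7785 (default: true)
# Set to false to restore the old behaviour of ignoring in-flight relocations,
# at the risk of over-allocating an almost-full node.
cluster.routing.allocation.disk.include_relocations: true
```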