Shard allocation to take into account free disk space #3480
Comments
Thanks for opening this issue! I think we will get to this pretty soon, i.e. next week or so
ETA?
@dakrone what's the status of this...
@synhershko I'm currently working on developing this. It's trickier than a usual AllocationDecider (which prevents allocation), because fetching the disk usages and shard sizes is overhead that we don't want to incur for every operation, so the information needs to be cached for a time and refreshed at certain intervals.
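The caching idea described in the comment above can be sketched roughly like this. This is an illustrative Python sketch, not the actual Java `ClusterInfoService`; the class and method names here are hypothetical:

```python
import time


class ClusterInfoCache:
    """Illustrative sketch of interval-based caching (not the actual Java
    ClusterInfoService): hold on to the last fetched disk/shard information
    and only re-fetch once the configured interval has elapsed, so every
    allocation decision does not trigger a fresh (expensive) fetch."""

    def __init__(self, fetch, interval_seconds=30.0):
        self._fetch = fetch              # callable returning current disk/shard info
        self._interval = interval_seconds
        self._cached = None
        self._fetched_at = None          # monotonic timestamp of the last fetch

    def get(self, now=None):
        """Return cached info, refreshing it if the interval has elapsed."""
        now = time.monotonic() if now is None else now
        if self._fetched_at is None or now - self._fetched_at >= self._interval:
            self._cached = self._fetch()
            self._fetched_at = now
        return self._cached
```

The `now` parameter is only there to make the sketch easy to exercise deterministically; the real service would be driven by the master node's scheduler.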
This commit adds two main pieces. The first is a ClusterInfoService, a service running on the master nodes that fetches the total/free bytes for each data node in the cluster as well as the sizes of all shards in the cluster. This information is gathered every 30 seconds by default, and the interval can be changed dynamically via the `cluster.info.update.interval` setting. This ClusterInfoService can hopefully be used in the future to weight nodes for allocation based on their disk usage, if desired.

The second main piece is the DiskThresholdDecider, which can disallow a shard from being allocated to a node, or from remaining on the node, depending on configuration parameters. There are three main configuration parameters for the DiskThresholdDecider:

- `cluster.routing.allocation.disk.threshold_enabled` controls whether the decider is enabled. It defaults to false (disabled). Note that the decider is also disabled for clusters with only a single data node.
- `cluster.routing.allocation.disk.watermark.low` controls the low watermark for disk usage. It defaults to 0.70, meaning ES will not allocate new shards to nodes once they have more than 70% disk used. It can also be set to an absolute byte value (like 500mb) to prevent ES from allocating shards if less than the configured amount of space is available.
- `cluster.routing.allocation.disk.watermark.high` controls the high watermark. It defaults to 0.85, meaning ES will attempt to relocate shards to another node if the node disk usage rises above 85%. It can also be set to an absolute byte value (similar to the low watermark) to relocate shards once less than the configured amount of space is available on the node.

Closes #3480
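Put together, the settings described in the commit message could look something like this in `elasticsearch.yml`. The setting names and defaults are the ones listed above; the values shown are only an illustration:

```yaml
# Enable the DiskThresholdDecider (it defaults to false/disabled)
cluster.routing.allocation.disk.threshold_enabled: true

# Stop allocating new shards to a node once more than 70% of its disk is used
cluster.routing.allocation.disk.watermark.low: 0.70

# Try to relocate shards away from a node once more than 85% of its disk is used
cluster.routing.allocation.disk.watermark.high: 0.85

# Refresh the cached disk usage / shard size information every 30 seconds
cluster.info.update.interval: 30s
```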
hey guys - any chance this decider will participate in other operations that work with the FS, like optimize for example? What we are seeing is large indices being optimized and servers occasionally running very low on disk space because of that. Maybe if an index doesn't have enough room to optimize, a rebalance should kick in?
Hmm, that is a pretty rare condition though. I wonder if we really should have something like this in the core system, or if we should just ask for a custom allocation decider, since deciders can trigger a rebalance on such a condition.
It isn't that rare if you run a large data shop with replicas and all, with data constantly going in. It doesn't happen every day, but it did happen to us.
IMO optimize should be rare in most cases unless you have time-based indices etc. ;)
We do use rolling indexes...
Are you seeing a consistent amount of disk used for the optimize? If you know in advance about how much room you'll need for the optimize, you could set the high watermark for the disk threshold accordingly, and ES should relocate shards if the disk usage passes that watermark.
Since ES can have this kind of info and do the maths for me, I don't see why I need to plan for it in advance. Plus, I don't think ES can relocate a shard which is in the middle of being optimized, and setting the high watermark too high is something we wouldn't want to do either.
@dakrone something that just occurred to us - how would the free-space decider play along with ES's default of trying to have the same number of shards on each node? In our scenario (and I'm assuming this is quite common) we have many data servers, each with different HD capacities, ranging from ~120GB to ~1000GB. I'm pretty sure that if ES tries to balance based on both criteria, something will go very wrong. Did you take that into account? Or should we try breaking this with some nasty tests?
@synhershko since the decider is part of the balancing process, the allocator will attempt to find the "best" weights that still satisfy all of the deciders, so it will try to balance the shards evenly, but will still allow uneven allocation in the event that the disk limit has been reached on a particular machine or set of machines.
It should already be taken into account, but nasty tests are always appreciated! :)
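For illustration, the two watermark decisions discussed above can be sketched like this. This is a hypothetical Python sketch of the behavior described in the commit message, not the actual Java DiskThresholdDecider:

```python
def allocation_decision(disk_used_fraction, low=0.70, high=0.85):
    """Illustrative sketch of the disk-threshold decisions (not the actual
    Java DiskThresholdDecider). Below or at the low watermark, new shards
    may be allocated to the node; above the high watermark, shards should
    be relocated away from the node. Defaults match the documented ones."""
    can_allocate_new = disk_used_fraction <= low
    should_relocate = disk_used_fraction > high
    return can_allocate_new, should_relocate
```

Between the two watermarks a node keeps its existing shards but receives no new ones, which is what allows the balancer to end up with an uneven shard count when a node's disk is nearly full.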
Will try to get to it soon, then
@dakrone just to let you know, we now use this feature in our highly uneven cluster and so far all looks good. It seems like ES still tries to even out the number of shards on each node, but the free-space decider seems to do a good job. Thanks!
Awesome! I'm glad to hear it's working well for you! :D
@synhershko it's great that you came back to us. Are you having any trouble with the balancing, since you mention that it still tries to balance?
@s1monw no, it all seems to be fine. What I meant is that when there's balancing in action, which hardly ever happens now as far as I can tell, it will try to get to an end result where the disk allocation limits are respected AND there is more or less the same number of shards on each node. Which I think makes sense.

Before upgrading, I set out to write some integration tests which would exercise our cluster configuration with the internal moving pieces. I was able to recreate a similar scenario in the tests (different node sizes, different index sizes) and everything worked (shards were allocated, no node was over-allocated etc). I just couldn't find edge cases to test there - all seems to have already been covered in @dakrone's tests, so we dropped this effort and decided to take the pill.

We upgraded from a variant of 0.90.0 (custom compiled with some mods) to vanilla 0.90.7. The upgrade took a while (different Lucene versions) but went smoothly, and once the disk-aware decider was enabled it took the cluster some time to stabilize, but ever since we did that all seems to work fine. I'm leaving the company this week so will probably stop monitoring that cluster, but as I said, so far this looks very good and stable.
Simon says:
More at https://groups.google.com/forum/#!topic/elasticsearch/p-et4UxvcyU