
Allow for 'grace period expiration' before shard reallocation? #3569

Closed
diranged opened this issue Aug 25, 2013 · 8 comments

@diranged

It would be really useful to allow for a 'grace period' between when ES notices that a particular node has gone down and when shard reallocation begins. There are times when we might want to do a quick restart of an ES node, or take one down for a full reboot, and we don't want a shard reallocation to kick in because that's a very IO-intensive operation. In our case we also use the ZooKeeper plugin, and a shard reallocation can be triggered by a short communication break between the ES nodes and ZooKeeper.

@s1monw
Contributor

s1monw commented Aug 26, 2013

I think you can simply disable allocation via the cluster settings API (cluster.routing.allocation.disable_allocation). If you have scheduled downtimes, that is the way to go. Does that resolve your issue? If so, can you please close it?
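For reference, a minimal sketch of what that looks like against the cluster settings API, assuming a node reachable on localhost:9200 (the setting name is the one mentioned above; newer ES versions use a different setting, so check the docs for your release):

```
# Disable shard allocation before the planned restart
# (transient settings are not persisted across a full cluster restart)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": true
  }
}'

# Re-enable allocation once the node is back in the cluster
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": false
  }
}'
```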

@diranged
Author

Not really -- there are cases where we may see a little blip in the network that triggers shard reallocation. I think it's a pretty worthwhile feature to be able to say "wait 5m for things to settle before re-allocating shards". This is also useful when you are bringing up more than one additional node of capacity.

@ghost ghost assigned dakrone Aug 27, 2013
@Plasma

Plasma commented Mar 28, 2014

I was curious about this too; I'd like to see ES give me a grace period of something like 5 minutes before deciding to rebalance shards after a node goes offline.

Most likely that node is coming back up (e.g. due to a maintenance reboot of the service or host), and it would be preferable for the cluster to remain in a degraded state in the hope that the node returns, but then consider it dead after a certain timeout and rebalance as needed.

@clintongormley

If a shard disappears briefly then returns, it needs to recover from the primary shard in case anything has changed. Currently, that is likely to mean copying lots of segments over from the primary (as the primary and replica will probably have diverged). Making this fast will not be possible until #6069 is implemented.

We can revisit this issue once #6069 is in.

@clintongormley clintongormley added test and removed test labels Sep 5, 2014
@jlintz

jlintz commented Sep 27, 2014

Would love to see a configurable timeout for this as well. While we can use disable_allocation for planned events, we've had cases where a network blip caused indices to begin reshuffling.

@cfeio

cfeio commented Feb 17, 2015

+1, we would really love to see a configurable timeout as well, or a way to disable this behaviour completely.

In an unplanned outage we would prefer to be able to configure the cluster not to reassign and shuffle replicas, since that leads to a lot of data moving around. As far as I know there is no such setting.

@shyem

shyem commented Feb 19, 2015

+1, this would be really helpful for us: we have a massive amount of data in our ELK cluster, and since we already keep one replica available there is no point in duplicating it yet again during a short outage.

@jpountz
Contributor

jpountz commented Aug 26, 2015

Fixed via #11712
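For anyone landing here later, #11712 added delayed allocation. A rough sketch of how it can be configured, assuming the setting introduced there is index.unassigned.node_left.delayed_timeout (verify the exact name and default for your ES version):

```
# Hypothetical example: delay reallocation of shards from a departed node by 5 minutes,
# applied to all existing indices via the index settings API.
# Assumes index.unassigned.node_left.delayed_timeout is available in your ES version.
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.unassigned.node_left.delayed_timeout": "5m"
}'
```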

@jpountz jpountz closed this as completed Aug 26, 2015