
Shards relocating during rolling restarts #14387

Closed
PhaedrusTheGreek opened this issue Oct 30, 2015 · 12 comments
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v1.7.4 v2.1.0

Comments

@PhaedrusTheGreek
Contributor

This behaviour is reproducible in v1.6.0 through 2.0.0.

During rolling restarts it is expected that no shard relocations will occur; however, shard movement is occurring while the cluster is in a yellow health state.

Steps to reproduce:

  1. Create a cluster with at least 3 nodes, 1 index with 2 shards + 1 replica (4 shards total), and index some data.
  2. Stop all indexing
  3. Set allocation: none
  4. Issue a synced flush (_all/_flush/synced)
  5. Restart a single node
  6. Re-enable allocation

At step 6, shards are observed to be relocating, in addition to any recovery by sync_id that has occurred. After the recoveries and relocations complete, the cluster changes to a green state. This was tested in slow motion by limiting bandwidth to one of the nodes in the cluster.

Relocations are not observed in a 2-node cluster or when restarting the entire cluster.

@clintongormley

Hi @PhaedrusTheGreek

Could you add the exact commands etc. that you used to test? I'm on a poor network and can't view the video.

thanks

@PhaedrusTheGreek
Contributor Author

The relocating shards seem to be recoveries, not rebalances. I infer this because when I set the following, I see them all happening at once.

"cluster.routing.allocation.node_concurrent_recoveries" : 10

This is what I'm seeing after restarting a node - shards moving on and off:

index shard prirep state      docs   store ip           node                                                
big   0     r      RELOCATING 1026 605.7kb 192.168.0.2  Max -> 192.168.0.25 gUii9aw4QTW_CRP4Akg_Nw Scrier   
big   0     p      STARTED    1026 605.7kb 192.168.0.25 Arclight                                            
big   1     r      RELOCATING 1018 854.4kb 192.168.0.2  Max -> 192.168.0.25 8y8i2N0oQpag6zsVPm323g Arclight 
big   1     p      STARTED    1018 854.4kb 192.168.0.25 Scrier                                              
big2  2     r      STARTED     413 278.4kb 192.168.0.25 Arclight                                            
big2  2     p      RELOCATING  413 278.4kb 192.168.0.25 Scrier -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max    
big2  0     r      RELOCATING  405 270.6kb 192.168.0.2  Max -> 192.168.0.25 8y8i2N0oQpag6zsVPm323g Arclight 
big2  0     p      STARTED     405 270.6kb 192.168.0.25 Scrier                                              
big2  3     p      RELOCATING  405   269kb 192.168.0.25 Arclight -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max  
big2  3     r      STARTED     405   269kb 192.168.0.25 Scrier                                              
big2  1     r      RELOCATING  410 426.8kb 192.168.0.2  Max -> 192.168.0.25 gUii9aw4QTW_CRP4Akg_Nw Scrier   
big2  1     p      STARTED     410 426.8kb 192.168.0.25 Arclight                                            
big2  4     p      RELOCATING  411 443.7kb 192.168.0.25 Arclight -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max  
big2  4     r      STARTED     411 443.6kb 192.168.0.25 Scrier                                              
big3  2     r      STARTED     407 344.6kb 192.168.0.2  Max                                                 
big3  2     p      STARTED     407 344.6kb 192.168.0.25 Scrier                                              
big3  0     r      STARTED     406 319.2kb 192.168.0.25 Arclight                                            
big3  0     p      RELOCATING  406 319.3kb 192.168.0.25 Scrier -> 192.168.0.2 m3eFHPt0QyqEdiXUnk59Yg Max    
big3  3     r      STARTED     413 402.1kb 192.168.0.2  Max                                                 
big3  3     p      STARTED     413 402.1kb 192.168.0.25 Scrier                                              
big3  1     r      STARTED     411 276.8kb 192.168.0.2  Max                                                 
big3  1     p      STARTED     411 276.8kb 192.168.0.25 Arclight                                            
big3  4     r      STARTED     407 342.5kb 192.168.0.2  Max                                                 
big3  4     p      STARTED     407 342.5kb 192.168.0.25 Arclight                    

TRACE logs show a lot of this:

[2015-11-05 10:22:00,938][TRACE][indices.recovery         ] [Max] [big3][0] recovery completed from [Scrier][gUii9aw4QTW_CRP4Akg_Nw][Jasons-MacBook-Pro-3.local][inet[/192.168.0.25:9300]], took[2.6m]
   phase1: recovered_files [7] with total_size of [319.2kb], took [2.5m], throttling_wait [0s]
         : reusing_files   [0] with total_size of [0b]
   phase2: start took [19ms]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [1ms]
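
(The ongoing recoveries can also be watched from the REST layer; as a sketch, the cat recovery API lists each shard's recovery type, stage and source/target node:)

GET /_cat/recovery?v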

@PhaedrusTheGreek
Contributor Author

As for the exact commands for testing, all I am doing is starting up 3 nodes and restarting one with:

CTRL-C; bin/elasticsearch

Then watching things move around with:

GET /_cat/shards?v

@s1monw
Contributor

s1monw commented Nov 6, 2015

I assigned it to @ywelsch; we will look into this and come back to you shortly. In the meanwhile, can you show all the commands you are executing, especially the one for "Set allocation: none"?

@PhaedrusTheGreek
Contributor Author

This is the exact command I used:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "none"
  }
}

And I would see something like this on all nodes:

[2015-10-30 10:54:51,429][INFO ][cluster.routing.allocation.decider] [Humus Sapien] updating [cluster.routing.allocation.enable] from [ALL] to [NONE]

Shard relocations / recoveries begin after allocation is re-enabled like this:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "all"
  }
}

@bleskes
Contributor

bleskes commented Nov 6, 2015

A short update - @clintongormley and I researched this. It has to do with a race condition between the gateway allocator and the cluster balancer. When the node comes back / allocation is re-enabled, the gateway allocator asks the node for information about its shard stores. This is done asynchronously. While that request is in flight, the balanced allocator thinks the node is empty and assigns shards to it. Only later, when the gateway allocator assigns the missing shards back to the node, does the cluster rebalance again. Our idea for a fix was to disable balancing while there are in-flight data fetching requests...
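
(One way to observe that window from the outside - a sketch, assuming the in-flight fetch counter that cluster health exposes on these versions - is to poll cluster health while the node rejoins; as long as number_of_in_flight_fetch in the response is above zero, the gateway allocator is still waiting for shard store listings:)

GET /_cluster/health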

@s1monw
Contributor

s1monw commented Nov 6, 2015

@bleskes makes sense to me - I will take a look at implementing this.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Nov 6, 2015
…ilable

This commit prevents running rebalance operations if the store allocator is
still fetching async shard / store data, in order to prevent premature rebalance decisions
which need to be reverted once the shard store data is available. This typically happens
during rolling restarts, which it can make extremely painful.

Closes elastic#14387
@PhaedrusTheGreek
Contributor Author

Tested these workarounds with good results:

1.x

 "cluster.routing.allocation.balance.threshold" : "100.0f" (During Node Restart)
 "cluster.routing.allocation.balance.threshold" : "1.0f" (Return to Default)

2.0

"cluster.routing.rebalance.enable" : "none" (During Node Restart)
"cluster.routing.rebalance.enable" : "all" (Return to Default)

@astefan
Contributor

astefan commented Nov 10, 2015

Was this present in ES versions before 1.6?

@s1monw
Contributor

s1monw commented Nov 10, 2015

Was this present in ES versions before 1.6?

No, I don't think so - back then we fetched data synchronously, so this couldn't happen.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Nov 10, 2015
…ilable

This commit prevents running rebalance operations if the store allocator is
still fetching async shard / store data, in order to prevent premature rebalance decisions
which need to be reverted once the shard store data is available. This typically happens
during rolling restarts, which it can make extremely painful.

Closes elastic#14387
@bittusarkar

@s1monw Is this issue fixed in Elasticsearch 2.x?

@s1monw
Contributor

s1monw commented Oct 27, 2016

@bittusarkar yes see #14652

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
fixmebot bot referenced this issue in VectorXz/elasticsearch Apr 22, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch May 28, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch Aug 4, 2021