After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421

brwe · 2015-01-26T16:27:23Z

SimpleSortTests.testIssue8226 for example fails about once a week. Example failure:
http://build-us-00.elasticsearch.org/job/es_g1gc_1x_metal/3129/

I can reproduce it locally (although very rarely) with some additional logging (action.search.type: TRACE).

Here is a brief analysis of what happened. Would be great if someone could take a look and let me know if this makes sense.

Failure:

1> REPRODUCE WITH  : mvn clean test -Dtests.seed=774A2866F1B6042D -Dtests.class=org.elasticsearch.search.sort.SimpleSortTests -Dtests.method="testIssue8226 {#76 seed=[774A2866F1B6042D:ACB4FF9F8C8CA341]}" -Des.logger.level=DEBUG -Des.node.mode=network -Dtests.security.manager=true -Dtests.nightly=false -Dtests.client.ratio=0.0 -Dtests.heap.size=512m -Dtests.jvm.argline="-server -XX:+UseConcMarkSweepGC -XX:-UseCompressedOops -XX:+AggressiveOpts -Djava.net.preferIPv4Stack=true" -Dtests.locale=fi_FI -Dtests.timezone=Etc/GMT+9 -Dtests.processors=4
  1> Throwable:
  1> java.lang.AssertionError: One or more shards were not successful but didn't trigger a failure
  1> Expected: <47>
  1>      but: was <46>

Here is an example failure in detail, the relevant parts of the logs are below:

State

node_0 is master.
[test_5][0] is relocating from node_1 to node_0.
Cluster state 3673 has the shard as relocating, in cluster state 3674 it is started.
node_0 is the coordinating node for the search request.

In brief, the request fails for shard [test_5][0] because node_0 operates on an older cluster state 3673 when processing the search request, while node_1 is already on 3674.

Course of events:

node_0 sends shard started, but the shard is still in state POST_RECOVERY and will remain so until it receives the new cluster state and applies it locally
node_0(master) receives the shard started request and publishes the new cluster state 3674 to node_0 and node_1
node_1 receives the cluster state 3674 and applies it locally
node_0 sends search request for [test_5][0] to node_1 because according to cluster state 3673 the shard is there and relocating
-> request fails with IndexShardMissingException because node_1 already applied cluster state 3674 and deleted the shard.
node_0 then sends request for [test_5][0] to node_0 because the shard is there as well (according to cluster state 3673 it is and initializing)
-> request fails with IllegalIndexShardStateException because node_0 has not yet processed cluster state 3674 and therefore the shard is in POST_RECOVERY instead of STARTED
No shard failure is logged because IndexShardMissingException and IllegalIndexShardStateException are explicitly excluded from shard failures.
node_0 finally also gets to process the new cluster state and moves the shard [test_5][0] to STARTED but it is too late

This is a very rare condition and maybe too bad on client side because the information that one shard did not deliver results is there although it is not explicitly listed as shard failure. We can probably make the test pass easily be just waiting for relocations before executing the search request but that seems wrong because any search request can fail this way.

Sample log

[....]

  1> [2015-01-26 09:27:14,435][DEBUG][indices.recovery         ] [node_0] [test_5][0] recovery completed from [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}], took [84ms]
  1> [2015-01-26 09:27:14,435][DEBUG][cluster.action.shard     ] [node_0] sending shard started for [test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING], indexUUID [E3T8J7CaRkyK533W0hMBPw], reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]
  1> [2015-01-26 09:27:14,435][DEBUG][cluster.action.shard     ] [node_0] received shard started for [test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING], indexUUID [E3T8J7CaRkyK533W0hMBPw], reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]
  1> [2015-01-26 09:27:14,436][DEBUG][cluster.service          ] [node_0] processing [shard-started ([test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]), reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]]: execute
  1> [2015-01-26 09:27:14,436][DEBUG][cluster.action.shard     ] [node_0] [test_5][0] will apply shard started [test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING], indexUUID [E3T8J7CaRkyK533W0hMBPw], reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]


[....]


  1> [2015-01-26 09:27:14,441][DEBUG][cluster.service          ] [node_0] cluster state updated, version [3674], source [shard-started ([test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]), reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]]
  1> [2015-01-26 09:27:14,441][DEBUG][cluster.service          ] [node_0] publishing cluster state version 3674
  1> [2015-01-26 09:27:14,442][DEBUG][discovery.zen.publish    ] [node_1] received cluster state version 3674
  1> [2015-01-26 09:27:14,443][DEBUG][cluster.service          ] [node_1] processing [zen-disco-receive(from master [[node_0][GQ6yYxmyRT-sfvT0cmuqQQ][schmusi][inet[/192.168.2.102:9400]]{mode=network, enable_custom_paths=true}])]: execute
  1> [2015-01-26 09:27:14,443][DEBUG][cluster.service          ] [node_1] cluster state updated, version [3674], source [zen-disco-receive(from master [[node_0][GQ6yYxmyRT-sfvT0cmuqQQ][schmusi][inet[/192.168.2.102:9400]]{mode=network, enable_custom_paths=true}])]
  1> [2015-01-26 09:27:14,443][DEBUG][cluster.service          ] [node_1] set local cluster state to version 3674
  1> [2015-01-26 09:27:14,443][DEBUG][indices.cluster          ] [node_1] [test_5][0] removing shard (not allocated)
  1> [2015-01-26 09:27:14,443][DEBUG][index                    ] [node_1] [test_5] [0] closing... (reason: [removing shard (not allocated)])
  1> [2015-01-26 09:27:14,443][INFO ][test.store               ] [node_1] [test_5][0] Shard state before potentially flushing is STARTED
  1> [2015-01-26 09:27:14,453][DEBUG][search.sort              ] cluster state:
  1> version: 3673
  1> meta data version: 2043
  1> nodes:
  1>    [node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}
  1>    [node_0][GQ6yYxmyRT-sfvT0cmuqQQ][schmusi][inet[/192.168.2.102:9400]]{mode=network, enable_custom_paths=true}, local, master
  1> routing_table (version 3006):
  1> -- index [test_4]
  1> ----shard_id [test_4][4]
  1> --------[test_4][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][7]
  1> --------[test_4][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][7], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][0]
  1> --------[test_4][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][3]
  1> --------[test_4][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][1]
  1> --------[test_4][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][5]
  1> --------[test_4][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][6]
  1> --------[test_4][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][6], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][2]
  1> --------[test_4][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_3]
  1> ----shard_id [test_3][2]
  1> --------[test_3][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_3][0]
  1> --------[test_3][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][3]
  1> --------[test_3][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][1]
  1> --------[test_3][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][5]
  1> --------[test_3][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][6]
  1> --------[test_3][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_3][4]
  1> --------[test_3][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_2]
  1> ----shard_id [test_2][2]
  1> --------[test_2][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_2][0]
  1> --------[test_2][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_2][3]
  1> --------[test_2][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][1]
  1> --------[test_2][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][1], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][5]
  1> --------[test_2][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][4]
  1> --------[test_2][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_1]
  1> ----shard_id [test_1][0]
  1> --------[test_1][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_1][1]
  1> --------[test_1][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_0]
  1> ----shard_id [test_0][4]
  1> --------[test_0][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][0]
  1> --------[test_0][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][0], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][7]
  1> --------[test_0][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][7], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][3]
  1> --------[test_0][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][1]
  1> --------[test_0][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][5]
  1> --------[test_0][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][6]
  1> --------[test_0][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][2]
  1> --------[test_0][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_8]
  1> ----shard_id [test_8][0]
  1> --------[test_8][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_8][1]
  1> --------[test_8][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_8][2]
  1> --------[test_8][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1>
  1> -- index [test_7]
  1> ----shard_id [test_7][2]
  1> --------[test_7][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_7][0]
  1> --------[test_7][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_7][3]
  1> --------[test_7][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_7][1]
  1> --------[test_7][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_7][4]
  1> --------[test_7][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_6]
  1> ----shard_id [test_6][0]
  1> --------[test_6][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][3]
  1> --------[test_6][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][1]
  1> --------[test_6][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][2]
  1> --------[test_6][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_6][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_5]
  1> ----shard_id [test_5][0]
  1> --------[test_5][0], node[G4AEDzbrRae5BC_UD9zItA], relocating [GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[RELOCATING]
  1> ----shard_id [test_5][3]
  1> --------[test_5][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_5][1]
  1> --------[test_5][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_5][2]
  1> --------[test_5][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1>
  1> routing_nodes:
  1> -----node_id[GQ6yYxmyRT-sfvT0cmuqQQ][V]
  1> --------[test_4][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_1][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_8][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_8][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_7][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_7][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_6][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]
  1> --------[test_5][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> -----node_id[G4AEDzbrRae5BC_UD9zItA][V]
  1> --------[test_4][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][7], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][6], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_3][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_3][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_2][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_2][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][1], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_1][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][0], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][7], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_8][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_5][0], node[G4AEDzbrRae5BC_UD9zItA], relocating [GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[RELOCATING]
  1> --------[test_5][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_5][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ---- unassigned
  1>
  1> tasks: (1):
  1> 13638/URGENT/shard-started ([test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]), reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]/13ms
  1>



[...]


 1> [2015-01-26 09:27:14,459][TRACE][action.search.type       ] [node_0] got first-phase result from [G4AEDzbrRae5BC_UD9zItA][test_3][3]
  1> [2015-01-26 09:27:14,460][TRACE][action.search.type       ] [node_0] [test_5][0], node[G4AEDzbrRae5BC_UD9zItA], relocating [GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[RELOCATING]: Failed to execute [org.elasticsearch.action.search.SearchRequest@1469040a] lastShard [false]
  1> org.elasticsearch.transport.RemoteTransportException: [node_1][inet[/192.168.2.102:9401]][indices:data/read/search[phase/dfs]]
  1> Caused by: org.elasticsearch.index.IndexShardMissingException: [test_5][0] missing
  1>     at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:203)
  1>     at org.elasticsearch.search.SearchService.createContext(SearchService.java:539)
  1>     at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:523)
  1>     at org.elasticsearch.search.SearchService.executeDfsPhase(SearchService.java:208)
  1>     at org.elasticsearch.search.action.SearchServiceTransportAction$SearchDfsTransportHandler.messageReceived(SearchServiceTransportAction.java:757)
  1>     at org.elasticsearch.search.action.SearchServiceTransportAction$SearchDfsTransportHandler.messageReceived(SearchServiceTransportAction.java:748)
  1>     at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:275)
  1>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
  1>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>     at java.lang.Thread.run(Thread.java:745)
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_6][1]
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_5][2]
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_7][1]
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_3][2]
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_8][2]
  1> [2015-01-26 09:27:14,455][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_1][0]


[...]


  1> [2015-01-26 09:27:14,463][TRACE][action.search.type       ] [node_0] [test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.search.SearchRequest@1469040a] lastShard [true]
  1> org.elasticsearch.index.shard.IllegalIndexShardStateException: [test_5][0] CurrentState[POST_RECOVERY] operations only allowed when started/relocated
  1>     at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:839)
  1>     at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:651)
  1>     at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:647)
  1>     at org.elasticsearch.search.SearchService.createContext(SearchService.java:543)
  1>     at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:523)
  1>     at org.elasticsearch.search.SearchService.executeDfsPhase(SearchService.java:208)
  1>     at org.elasticsearch.search.action.SearchServiceTransportAction$3.call(SearchServiceTransportAction.java:197)
  1>     at org.elasticsearch.search.action.SearchServiceTransportAction$3.call(SearchServiceTransportAction.java:194)
  1>     at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
  1>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>     at java.lang.Thread.run(Thread.java:745)
  1> [2015-01-26 09:27:14,459][TRACE][action.search.type       ] [node_0] got first-phase result from [G4AEDzbrRae5BC_UD9zItA][test_2][4]



[...]



  1> [2015-01-26 09:27:14,493][DEBUG][cluster.service          ] [node_0] set local cluster state to version 3674
  1> [2015-01-26 09:27:14,493][DEBUG][index.shard              ] [node_0] [test_5][0] state: [POST_RECOVERY]->[STARTED], reason [global state is [STARTED]]
  1> [2015-01-26 09:27:14,493][DEBUG][river.cluster            ] [node_0] processing [reroute_rivers_node_changed]: execute
  1> [2015-01-26 09:27:14,493][DEBUG][river.cluster            ] [node_0] processing [reroute_rivers_node_changed]: no change in cluster_state
  1> [2015-01-26 09:27:14,493][DEBUG][cluster.service          ] [node_0] processing [shard-started ([test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], relocating [G4AEDzbrRae5BC_UD9zItA], [P], s[INITIALIZING]), reason [after recovery (replica) from node [[node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}]]]: done applying updated cluster_state (version: 3674)
  1> [2015-01-26 09:27:14,456][TRACE][action.search.type       ] [node_0] got first-phase result from [GQ6yYxmyRT-sfvT0cmuqQQ][test_2][3]


[...]

  1> [2015-01-26 09:27:14,527][DEBUG][search.sort              ] cluster state:
  1> version: 3674
  1> meta data version: 2043
  1> nodes:
  1>    [node_1][G4AEDzbrRae5BC_UD9zItA][schmusi][inet[/192.168.2.102:9401]]{mode=network, enable_custom_paths=true}
  1>    [node_0][GQ6yYxmyRT-sfvT0cmuqQQ][schmusi][inet[/192.168.2.102:9400]]{mode=network, enable_custom_paths=true}, local, master
  1> routing_table (version 3007):
  1> -- index [test_4]
  1> ----shard_id [test_4][2]
  1> --------[test_4][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][7]
  1> --------[test_4][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][7], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][0]
  1> --------[test_4][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][3]
  1> --------[test_4][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][1]
  1> --------[test_4][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][5]
  1> --------[test_4][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_4][6]
  1> --------[test_4][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][6], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_4][4]
  1> --------[test_4][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_3]
  1> ----shard_id [test_3][4]
  1> --------[test_3][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_3][0]
  1> --------[test_3][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][3]
  1> --------[test_3][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][1]
  1> --------[test_3][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][5]
  1> --------[test_3][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_3][6]
  1> --------[test_3][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_3][2]
  1> --------[test_3][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_2]
  1> ----shard_id [test_2][4]
  1> --------[test_2][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_2][0]
  1> --------[test_2][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_2][3]
  1> --------[test_2][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][1]
  1> --------[test_2][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][1], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][5]
  1> --------[test_2][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_2][2]
  1> --------[test_2][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_1]
  1> ----shard_id [test_1][0]
  1> --------[test_1][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_1][1]
  1> --------[test_1][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_0]
  1> ----shard_id [test_0][2]
  1> --------[test_0][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][0]
  1> --------[test_0][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][0], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][7]
  1> --------[test_0][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][7], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][3]
  1> --------[test_0][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][1]
  1> --------[test_0][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][5]
  1> --------[test_0][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_0][6]
  1> --------[test_0][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> ----shard_id [test_0][4]
  1> --------[test_0][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_8]
  1> ----shard_id [test_8][0]
  1> --------[test_8][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_8][1]
  1> --------[test_8][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_8][2]
  1> --------[test_8][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1>
  1> -- index [test_7]
  1> ----shard_id [test_7][4]
  1> --------[test_7][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_7][0]
  1> --------[test_7][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_7][3]
  1> --------[test_7][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_7][1]
  1> --------[test_7][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_7][2]
  1> --------[test_7][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1>
  1> -- index [test_6]
  1> ----shard_id [test_6][0]
  1> --------[test_6][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][3]
  1> --------[test_6][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][1]
  1> --------[test_6][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_6][2]
  1> --------[test_6][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_6][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1>
  1> -- index [test_5]
  1> ----shard_id [test_5][0]
  1> --------[test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> ----shard_id [test_5][3]
  1> --------[test_5][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_5][1]
  1> --------[test_5][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ----shard_id [test_5][2]
  1> --------[test_5][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1>
  1> routing_nodes:
  1> -----node_id[GQ6yYxmyRT-sfvT0cmuqQQ][V]
  1> --------[test_4][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_4][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_4][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_3][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_3][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_2][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_2][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_1][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][7], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][5], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_0][6], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_0][4], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_8][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_8][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_7][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_7][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_6][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][3], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][1], node[GQ6yYxmyRT-sfvT0cmuqQQ], [R], s[STARTED]
  1> --------[test_6][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_5][0], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> --------[test_5][2], node[GQ6yYxmyRT-sfvT0cmuqQQ], [P], s[STARTED]
  1> -----node_id[G4AEDzbrRae5BC_UD9zItA][V]
  1> --------[test_4][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][7], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_4][6], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_4][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_3][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_3][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_3][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_2][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_2][3], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][1], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][5], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_2][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_1][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][0], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][7], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][5], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_0][6], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_0][4], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_8][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][4], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_7][2], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][0], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_6][2], node[G4AEDzbrRae5BC_UD9zItA], [R], s[STARTED]
  1> --------[test_5][3], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> --------[test_5][1], node[G4AEDzbrRae5BC_UD9zItA], [P], s[STARTED]
  1> ---- unassigned
  1>
  1> tasks: (0):
  1>

[...]

The text was updated successfully, but these errors were encountered:

dakrone · 2015-01-27T00:21:32Z

A similar test failure:

org.elasticsearch.deleteByQuery.DeleteByQueryTests.testDeleteAllOneIndex

http://build-us-00.elasticsearch.org/job/es_g1gc_master_metal/2579/testReport/junit/org.elasticsearch.deleteByQuery/DeleteByQueryTests/testDeleteAllOneIndex/
http://build-us-00.elasticsearch.org/job/es_core_master_centos/2640/testReport/junit/org.elasticsearch.deleteByQuery/DeleteByQueryTests/testDeleteAllOneIndex/
http://build-us-00.elasticsearch.org/job/es_core_master_regression/1263/testReport/junit/org.elasticsearch.deleteByQuery/DeleteByQueryTests/testDeleteAllOneIndex/

It fails on the:

assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.numPrimaries));

Which I believe relates to the relocation issue Britta mentioned.

See: #9421

See: #9421 Conflicts: src/test/java/org/elasticsearch/deleteByQuery/DeleteByQueryTests.java

brwe · 2015-01-27T08:57:22Z

I think this is unrelated. I actually fixed the DeleteByQueryTests yesterday (c3f1982) and this commit does not seem to be in the build you linked to.

A brief explanation: DeleteByQuery is a write operation. The shard header returned and checked in DeleteByQueryTests is different from the one return for search requests. The reason why DeleteByQuery failed is because I added the check

assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.totalNumShards));

before which was wrong because there was no ensureGreen() so some of the replicas might not have ben initialized yet. I fixed this in c3f1982 by instead checking

assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.numPrimaries));

bleskes · 2015-01-27T10:36:17Z

I wonder if we should just allow reads in the POST_RECOVERY phase. At that point the shards is effectively ready to do everything it needs to do. @brwe this will solve the issue, right?

dakrone · 2015-01-27T16:42:41Z

@brwe okay, does that mean I can unmute the DeleteByQueryTests.testDeleteAllOneIndex?

brwe · 2015-01-27T16:45:32Z

yes

dakrone · 2015-01-27T17:42:04Z

Unmuted the DeleteByQueryTests.testDeleteAllOneIndex test

brwe · 2015-01-28T15:25:51Z

@bleskes I think that would fix it. However, before I push I want to try and write a test that reproduces reliably. Will not do before next week.

bleskes · 2015-02-24T12:48:00Z

@brwe please ping before starting on this. I want to make sure that we capture the original issue which caused us to introduce POST_RECOVERY. I don't recall exactly recall what the problem was (it was refresh related) and I think it was solved by a more recent change to how refresh work (#6545) but it requires careful thought

brwe · 2015-03-20T21:11:20Z

@bleskes ping :)
I finally came back to this and wrote a test that reproduces the failure reliably (#10194) but I did not quite get what you meant by "capture the original issue". Can you elaborate?

s1monw · 2015-04-13T14:39:28Z

@kimchy do you recall why we can't read in that state?

…y not be searchable if still in POST_RECOVERY) see elastic#9421

see elastic#9421

When a client indexes a documents and then calls refresh on the index then the document must be visible after that with search requests. This might not be the case if refresh is a BroadcastOperationAction, see DiscoveryWithServiceDisruptionsTests.testReadOnPostRecoveryShards related to elastic#9421

The restore portion of some snapshot/restore test is failing randomly due to elastic#9421. This change suspends rebalance during snapshot/restore operations until elastic#9421 is fixed. Closes elastic#12855

The restore portion of some snapshot/restore test is failing randomly due to #9421. This change suspends rebalance during snapshot/restore operations until #9421 is fixed. Closes #12855

prerequisite to elastic#9421 see also elastic#12600

Currently, we do not allow reads on shards which are in POST_RECOVERY which unfortunately can cause search failures on shards which just recovered if there no replicas (elastic#9421). The reason why we did not allow reads on shards that are in POST_RECOVERY is that after relocating a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed. We changed how refresh works now in elastic#13068 to make sure that shards cannot miss a refresh this way by sending refresh requests the same way that we send write requests. This commit changes IndexShard to allow reads on POST_RECOVERY now. In addition it adds two test: - test for issue elastic#9421 (After relocation shards might temporarily not be searchable if still in POST_RECOVERY) - test for visibility issue with relocation and refresh if reads allowed when shard is in POST_RECOVERY closes elastic#9421

Currently, we do not allow reads on shards which are in POST_RECOVERY which unfortunately can cause search failures on shards which just recovered if there no replicas (#9421). The reason why we did not allow reads on shards that are in POST_RECOVERY is that after relocating a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed. We changed how refresh works now in #13068 to make sure that shards cannot miss a refresh this way by sending refresh requests the same way that we send write requests. This commit changes IndexShard to allow reads on POST_RECOVERY now. In addition it adds two test: - test for issue #9421 (After relocation shards might temporarily not be searchable if still in POST_RECOVERY) - test for visibility issue with relocation and refresh if reads allowed when shard is in POST_RECOVERY closes #9421

clintongormley added >bug >test-failure Triaged test failures from CI labels Jan 26, 2015

dakrone added a commit that referenced this issue Jan 27, 2015

[TEST] Mute DeleteByQueryTests.testDeleteAllOneIndex

92b218b

See: #9421

dakrone added a commit that referenced this issue Jan 27, 2015

[TEST] Mute DeleteByQueryTests.testDeleteAllOneIndex

f9f9e7e

See: #9421 Conflicts: src/test/java/org/elasticsearch/deleteByQuery/DeleteByQueryTests.java

brwe self-assigned this Jan 28, 2015

brwe mentioned this issue Mar 20, 2015

[search] make sure search does not fail when shard is in post recovery after relo... #10194

Closed

brwe changed the title ~~[CI Failure] SimpleSortTests.testIssue8226~~ After relocation shards might temporarily not be searchable if still in POST_RECOVERY Apr 10, 2015

clintongormley added blocker v2.0.0-beta1 labels Jul 24, 2015

brwe mentioned this issue Aug 3, 2015

Fix for search failures if shard is in POST_RECOVERY #12600

Closed

clintongormley added v2.0.0 and removed v2.0.0-beta1 labels Aug 10, 2015

brwe added a commit to brwe/elasticsearch that referenced this issue Aug 11, 2015

Test for issue elastic#9421 (After relocation shards might temporaril…

4a66df7

…y not be searchable if still in POST_RECOVERY) see elastic#9421

brwe added a commit to brwe/elasticsearch that referenced this issue Aug 11, 2015

Allow reads on shards that are in POST_RECOVERY

a57089c

see elastic#9421

brwe mentioned this issue Aug 24, 2015

Make refresh a replicated action #13068

Merged

brwe added a commit to brwe/elasticsearch that referenced this issue Sep 1, 2015

Make refresh a replicated action

d81f426

prerequisite to elastic#9421 see also elastic#12600

brwe mentioned this issue Sep 1, 2015

Allow reads on shards that are in POST_RECOVERY #13246

Merged

brwe closed this as completed in #13246 Sep 2, 2015

clintongormley added v2.0.0-beta2 and removed v2.0.0 labels Sep 14, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421

After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421

brwe commented Jan 26, 2015

dakrone commented Jan 27, 2015

brwe commented Jan 27, 2015

bleskes commented Jan 27, 2015

dakrone commented Jan 27, 2015

brwe commented Jan 27, 2015

dakrone commented Jan 27, 2015

brwe commented Jan 28, 2015

bleskes commented Feb 24, 2015

brwe commented Mar 20, 2015

s1monw commented Apr 13, 2015

After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421

After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421

Comments

brwe commented Jan 26, 2015

State

Course of events:

Sample log

dakrone commented Jan 27, 2015

brwe commented Jan 27, 2015

bleskes commented Jan 27, 2015

dakrone commented Jan 27, 2015

brwe commented Jan 27, 2015

dakrone commented Jan 27, 2015

brwe commented Jan 28, 2015

bleskes commented Feb 24, 2015

brwe commented Mar 20, 2015

s1monw commented Apr 13, 2015