After relocation shards might temporarily not be searchable if still in POST_RECOVERY #9421
Comments
A similar test failure:
http://build-us-00.elasticsearch.org/job/es_g1gc_master_metal/2579/testReport/junit/org.elasticsearch.deleteByQuery/DeleteByQueryTests/testDeleteAllOneIndex/ It fails on the assertion assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.numPrimaries)); which I believe relates to the relocation issue Britta mentioned.
See: #9421 Conflicts: src/test/java/org/elasticsearch/deleteByQuery/DeleteByQueryTests.java
I think this is unrelated. I actually fixed the DeleteByQueryTests yesterday (c3f1982) and this commit does not seem to be in the build you linked to. A brief explanation: DeleteByQuery is a write operation. The shard header returned and checked in DeleteByQueryTests is different from the one returned for search requests. DeleteByQuery failed because I had previously added the check assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.totalNumShards)); which was wrong: there was no ensureGreen(), so some of the replicas might not have been initialized yet. I fixed this in c3f1982 by checking assertThat(shardInfo.getSuccessful(), greaterThanOrEqualTo(numShards.numPrimaries)); instead.
I wonder if we should just allow reads in the POST_RECOVERY phase. At that point the shard is effectively ready to do everything it needs to do. @brwe this will solve the issue, right?
@brwe okay, does that mean I can unmute the |
yes |
Unmuted the |
@bleskes I think that would fix it. However, before I push I want to try and write a test that reproduces reliably. Will not do before next week. |
@brwe please ping before starting on this. I want to make sure that we capture the original issue which caused us to introduce POST_RECOVERY. I don't recall exactly what the problem was (it was refresh related). I think it was solved by a more recent change to how refresh works (#6545), but it requires careful thought.
@kimchy do you recall why we can't read in that state? |
…y not be searchable if still in POST_RECOVERY) see elastic#9421
When a client indexes a document and then calls refresh on the index, the document must be visible to subsequent search requests. This might not be the case if refresh is a BroadcastOperationAction, see DiscoveryWithServiceDisruptionsTests.testReadOnPostRecoveryShards related to elastic#9421
The restore portion of some snapshot/restore test is failing randomly due to elastic#9421. This change suspends rebalance during snapshot/restore operations until elastic#9421 is fixed. Closes elastic#12855
prerequisite to elastic#9421 see also elastic#12600
Currently, we do not allow reads on shards which are in POST_RECOVERY, which unfortunately can cause search failures on shards which just recovered if there are no replicas (#9421). The reason why we did not allow reads on shards that are in POST_RECOVERY is that, after relocating, a shard might miss a refresh if the node that executed the refresh is behind with cluster state processing. If that happens, a user might execute index/refresh/search but still not find the document that was indexed. We changed how refresh works in #13068 to make sure that shards cannot miss a refresh this way, by sending refresh requests the same way that we send write requests. This commit changes IndexShard to allow reads on POST_RECOVERY now. In addition it adds two tests:
- test for issue #9421 (After relocation shards might temporarily not be searchable if still in POST_RECOVERY)
- test for visibility issue with relocation and refresh if reads allowed when shard is in POST_RECOVERY
closes #9421
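To make the change described above concrete, here is a minimal sketch of the read check before and after. It is not the actual IndexShard code: the class name ReadAllowedSketch, the helper canRead, and the two sets are hypothetical, the enum is a simplified stand-in for the shard lifecycle, and treating RELOCATED as readable before the change is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the read check; not the real IndexShard.
class ReadAllowedSketch {
    // Simplified stand-in for the shard lifecycle states.
    enum ShardState { CREATED, RECOVERING, POST_RECOVERY, STARTED, RELOCATED, CLOSED }

    // Assumed behaviour before the change: POST_RECOVERY rejects reads.
    static final Set<ShardState> READ_ALLOWED_BEFORE =
            EnumSet.of(ShardState.STARTED, ShardState.RELOCATED);

    // Behaviour after the change: POST_RECOVERY also accepts reads.
    static final Set<ShardState> READ_ALLOWED_AFTER =
            EnumSet.of(ShardState.POST_RECOVERY, ShardState.STARTED, ShardState.RELOCATED);

    // Returns true if a read may run on a shard in the given state.
    static boolean canRead(ShardState state, Set<ShardState> allowed) {
        return allowed.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(canRead(ShardState.POST_RECOVERY, READ_ALLOWED_BEFORE));
        System.out.println(canRead(ShardState.POST_RECOVERY, READ_ALLOWED_AFTER));
    }
}
```

The only state whose answer differs between the two sets is POST_RECOVERY, which is the window in which the search in #9421 failed.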
SimpleSortTests.testIssue8226 for example fails about once a week. Example failure:
http://build-us-00.elasticsearch.org/job/es_g1gc_1x_metal/3129/
I can reproduce it locally (although very rarely) with some additional logging (action.search.type: TRACE).
Here is a brief analysis of what happened. Would be great if someone could take a look and let me know if this makes sense.
Failure:
Here is an example failure in detail, the relevant parts of the logs are below:
State
node_0 is master.
[test_5][0] is relocating from node_1 to node_0.
Cluster state 3673 has the shard as relocating, in cluster state 3674 it is started.
node_0 is the coordinating node for the search request.
In brief, the request fails for shard [test_5][0] because node_0 operates on an older cluster state 3673 when processing the search request, while node_1 is already on 3674.
Course of events:
-> request fails with IndexShardMissingException because node_1 already applied cluster state 3674 and deleted the shard.
-> request fails with IllegalIndexShardStateException because node_0 has not yet processed cluster state 3674 and therefore the shard is in POST_RECOVERY instead of STARTED
No shard failure is logged because IndexShardMissingException and IllegalIndexShardStateException are explicitly excluded from shard failures.
This is a very rare condition and maybe too bad on client side because the information that one shard did not deliver results is there although it is not explicitly listed as shard failure. We can probably make the test pass easily by just waiting for relocations before executing the search request, but that seems wrong because any search request can fail this way.
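The course of events above can be replayed with a small toy model. This is only a sketch under stated assumptions: the class StaleRoutingSketch and its methods are hypothetical and do not use any Elasticsearch API, the cluster state versions 3673 and 3674 are taken from the example, and the coordinating node is assumed to try the relocation source first and the target second.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the race; none of this is Elasticsearch code.
class StaleRoutingSketch {
    enum ShardState { POST_RECOVERY, STARTED }

    // In cluster state 3674 the relocation is finished and the shard is started.
    static final long STARTED_VERSION = 3674;

    // Relocation source (node_1): once it applied 3674 it has deleted its copy.
    static String searchOnSource(long appliedVersion) {
        return appliedVersion >= STARTED_VERSION ? "IndexShardMissingException" : "OK";
    }

    // Relocation target (node_0): still POST_RECOVERY until it applies 3674.
    static String searchOnTarget(long appliedVersion, boolean readsAllowedInPostRecovery) {
        ShardState state = appliedVersion >= STARTED_VERSION
                ? ShardState.STARTED
                : ShardState.POST_RECOVERY;
        if (state == ShardState.STARTED || readsAllowedInPostRecovery) {
            return "OK";
        }
        return "IllegalIndexShardStateException";
    }

    // Assumed order: try the source first, fall back to the target on failure.
    static List<String> search(long sourceVersion, long targetVersion,
                               boolean readsAllowedInPostRecovery) {
        List<String> attempts = new ArrayList<>();
        String first = searchOnSource(sourceVersion);
        attempts.add(first);
        if (!first.equals("OK")) {
            attempts.add(searchOnTarget(targetVersion, readsAllowedInPostRecovery));
        }
        return attempts;
    }

    public static void main(String[] args) {
        // node_1 is on 3674, node_0 is still on 3673: both attempts fail.
        System.out.println(search(3674, 3673, false));
        // Same race, but reads are allowed in POST_RECOVERY: the retry succeeds.
        System.out.println(search(3674, 3673, true));
    }
}
```

In this model, waiting for relocations before searching only avoids the first case in a test, while allowing reads in POST_RECOVERY lets the second attempt succeed for any request that hits the same window.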
Sample log