Test Failure: SearchQueryIT.testIssue3177 - forced merged attempted on closed index #13266
The test fails because it calls optimize and expects no errors, but optimize fails because a shard is relocating. Here is how: first of all, when a shard relocates while optimize is called, the call might fail with
To avoid this in the test, we create an index, wait for relocations to finish, and only then call optimize.
This assumes that after However, because the test does not wait for green status, relocations can still start after
Because of the uneven distribution of shards one shard will be relocated from An ensureGreen() will fix the issue, but I do wonder why we start off with such an odd distribution of shards to begin with. I will investigate this.
In this test we assume that once waitForRelocation() has returned, shards are no longer relocated and optimize will therefore always succeed. However, because the test does not wait for green status, relocations can still start after waitForRelocation() has returned successfully. See #13266 for a detailed explanation.
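The gap between the two checks can be illustrated with a toy model. This is only a sketch: the `ShardState` enum and both helper methods are hypothetical and are not the test framework's `waitForRelocation()` or `ensureGreen()`. It only shows that a check asking "is anything relocating right now?" can pass while a shard is still initializing, which leaves room for a relocation to begin afterwards.

```java
import java.util.Arrays;
import java.util.List;

public class RelocationCheckSketch {
    // Hypothetical, simplified shard states; not Elasticsearch's real routing model.
    enum ShardState { INITIALIZING, STARTED, RELOCATING }

    // Models a check that only asks whether any shard is relocating at this moment.
    static boolean noShardRelocating(List<ShardState> shards) {
        return shards.stream().noneMatch(s -> s == ShardState.RELOCATING);
    }

    // Models a stricter check: every shard is started, so none is initializing or moving.
    static boolean allShardsStarted(List<ShardState> shards) {
        return shards.stream().allMatch(s -> s == ShardState.STARTED);
    }

    public static void main(String[] args) {
        // One shard is still initializing, but nothing is relocating yet.
        List<ShardState> shards = Arrays.asList(ShardState.STARTED, ShardState.INITIALIZING);
        System.out.println("no relocation in progress: " + noShardRelocating(shards));
        System.out.println("all shards started: " + allShardsStarted(shards));
    }
}
```

In this model the first check passes and the second does not, which is the state in which the test went on to call optimize.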
This allocation is a corner case that I tried to special-case in the shard allocator here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java#L647 In the last allocation round we have two shards left:
ideally we would put both on
We (@bleskes and I) discussed further why optimize fails at all while a shard is relocating. Before, I wrote
But actually, optimize should not fail at all just because a shard is relocating. We get a failure because the exceptions that the call returns when a shard is closed are not all handled the same way. I made a PR to fix this (#13380). However, while discussing this we found another issue: when we call optimize in tests while shards are relocating, we might actually miss some shards, for example because the coordinating node for the request does not yet have the new address of the shards, or because the request is not sent to initializing shards. We should fix that as well, as it might cause test failures too. In addition, other APIs such as stats do not seem to behave as expected either when shards are relocating. @tlrx might add more to that. Overall, I now think the problem is not so much that shards relocate unexpectedly in tests, but rather that APIs don't behave as expected while shards are relocating.
I noticed that the Indices Stats API, when executed at shard level, throws a ShardNotFoundException if the shard has no routing entry (see code here). Exceptions are then accumulated at
When we call optimize we ignore exceptions that indicate a closed shard. However, when a shard is closed while an optimize request is in flight, it might also trigger an AlreadyClosedException from the IndexWriter when we get the config, or a ForceMergeFailedEngineException with the EngineClosedException wrapped inside. Because these are not identified as exceptions that indicate a closed shard (TransportActions.isShardNotAvailableException(..)), optimize would sometimes report failures when shards were relocating while optimize was called and sometimes not. This caused weird test failures, see elastic#13266 . Instead, we should let EngineClosedException bubble up and also recognize AlreadyClosedException as an indicator for a closed shard.
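The effect of wrapping can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual implementation of TransportActions.isShardNotAvailableException(..): the three exception classes below are local stand-ins that only borrow the names used in this thread, and `indicatesClosedShard` is a hypothetical classifier that, by assumption, inspects only the top-level exception type.

```java
public class ClosedShardClassifierSketch {
    // Local stand-ins so the sketch compiles on its own; these are NOT the real
    // Lucene/Elasticsearch classes, they only reuse the names from the discussion.
    static class AlreadyClosedException extends IllegalStateException {
        AlreadyClosedException(String message) { super(message); }
    }

    static class EngineClosedException extends RuntimeException {
        EngineClosedException(String message) { super(message); }
    }

    static class ForceMergeFailedEngineException extends RuntimeException {
        ForceMergeFailedEngineException(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical classifier: checks only the top-level type (an assumption of this sketch).
    static boolean indicatesClosedShard(Throwable t) {
        return t instanceof EngineClosedException || t instanceof AlreadyClosedException;
    }

    public static void main(String[] args) {
        Throwable bare = new EngineClosedException("engine closed");
        Throwable wrapped = new ForceMergeFailedEngineException("force merge failed", bare);
        Throwable writerClosed = new AlreadyClosedException("writer closed");

        // The bare exception is recognized; the wrapped one hides its cause from the check.
        System.out.println("bare EngineClosedException: " + indicatesClosedShard(bare));
        System.out.println("wrapped in ForceMergeFailed: " + indicatesClosedShard(wrapped));
        System.out.println("AlreadyClosedException: " + indicatesClosedShard(writerClosed));
    }
}
```

Under these assumptions the wrapped exception is reported as a real failure while the bare one is ignored, which matches the "sometimes fails, sometimes not" behaviour described above. Letting EngineClosedException bubble up unwrapped makes both paths hit the same branch.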
The test was fixed. I opened #13719 for a general discussion on how different APIs deal with relocating shards.
Build URL: http://build-us-00.elastic.co/job/elasticsearch-20-strong/311/testReport/junit/org.elasticsearch.search.query/SearchQueryIT/testIssue3177/
Cannot reproduce locally
Reproduction command:
Stack trace for failure: