Resiliency: Recovering replicas might get stuck in initializing state #6808
Comments
Since Lucene version 4.8 each file has a checksum written as its footer. We used to calculate the checksums for all files transparently on the filesystem layer (Directory / Store), which is no longer necessary. This commit makes use of the new checksums in a backwards-compatible way, such that files written with the old checksum mechanism are still compared against the corresponding Adler32 checksum, while newer files are compared against Lucene's built-in CRC32 checksum. Since every written file is now checksummed by default, this commit also verifies the checksum for files during recovery and restore where applicable. Closes #5924 This commit also has a fix for #6808, since the added tests in `CorruptedFileTest.java` exposed the issue. Closes #6808
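The two checksum algorithms mentioned in the commit message are both available in the JDK. A minimal sketch of the version-dependent choice described above (illustrative only; `ChecksumSketch` and its `legacyFormat` flag are hypothetical names, not the actual Store/Directory code):

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch: files written before Lucene 4.8 carry an Adler32
// checksum, newer files use Lucene's built-in CRC32 footer, so the
// verification step picks the algorithm based on the file's format version.
public class ChecksumSketch {
    static long checksum(byte[] data, boolean legacyFormat) {
        Checksum c = legacyFormat ? new Adler32() : new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] segment = "example segment bytes".getBytes();
        // Same bytes, different algorithms: the two values will not match,
        // which is why the version check matters for backwards compatibility.
        System.out.println("adler32=" + checksum(segment, true));
        System.out.println("crc32=" + checksum(segment, false));
    }
}
```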
Out of elastic#6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, this change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure a second time (which we do have for failed-shard cases); second, we had already tried to handle it (wrongly) in the elect-primary method. This change fixes the handling to work correctly in the elect-primary method and adds unit tests to verify the behavior. Closes elastic#6816
Out of elastic#6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, this change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure a second time (which we do have for failed-shard cases); second, we had already tried to handle it (wrongly) in the elect-primary method. This change fixes the handling to work correctly in the elect-primary method and adds unit tests to verify the behavior. The change also exposed a problem in our handling of replica shards that stay initializing while a primary fails and another replica shard is elected as primary: we need to cancel the replica's ongoing recovery to make sure it restarts from the newly elected primary. Closes elastic#6825
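The first problem the commit message describes, applying the same shard failure twice, comes down to de-duplicating failure events. A hypothetical sketch of such a guard (`FailureGuardSketch` and the string routing id are illustrative, not the real cluster-state classes):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: remember which shard routings have already been
// failed so that a repeated failure event for the same routing is a no-op
// instead of re-applying the failure logic.
public class FailureGuardSketch {
    private final Set<String> failedRoutings = new HashSet<>();
    int failuresApplied = 0;

    // Returns true only the first time a given routing is failed.
    boolean applyFailedShard(String shardRoutingId) {
        if (!failedRoutings.add(shardRoutingId)) {
            return false; // already failed: ignore the duplicate event
        }
        failuresApplied++;
        return true;
    }

    public static void main(String[] args) {
        FailureGuardSketch guard = new FailureGuardSketch();
        System.out.println(guard.applyFailedShard("[index][0], replica")); // true
        System.out.println(guard.applyFailedShard("[index][0], replica")); // false
        System.out.println(guard.failuresApplied); // 1
    }
}
```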
was this fixed?
welp there are some shards stuck initializing on that node again (from newly created indexes)
note the garbled strings
hey @OlegYch I personally can't really see evidence that your issue is related to this one. Did you really see a shard initializing that was supposed to recover from a shard that is not actually allocated? You also said:
and in this issue the shards that got stuck were replicas. Can you provide more info than what you already added?
oh sorry, didn't understand that this issue was specifically about replica shards as opposed to primary shards, perhaps this is better described in #6816 ?
is this problem resolved after 1.4.0? My 1.4.0 still has this problem.
@garyelephant I just ran into this same issue after resetting the number of replicas using the API. 1.4.0, too. FWIW here's the status of the shards comprising one of the stuck indices: `index-2015.01.06 0 r INITIALIZING 10.0.0.162 es2.io.example.com`
@garyelephant @ioc32 this issue is closed. If you're still seeing problems in 1.4, please open a new issue with more information about the problem.
If a primary fails while a replica has started recovery but has not yet initialized the recovery process, the replica will retry until the primary is allocated again on the node. This never happens, so the replica gets stuck in the INITIALIZING state and is never cleaned up.
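The failure mode above can be modeled as a small state machine. This is a hypothetical illustration of the bug and its fix (the enum, method names, and retry loop are invented for the sketch; they are not the real recovery code): the buggy path treats a failed primary like a primary that is merely not yet active, so the replica retries forever, while the fixed path fails the replica as soon as the primary is failed.

```java
// Hypothetical model of the stuck-replica bug described in this issue.
public class StuckReplicaSketch {
    enum State { INITIALIZING, STARTED, FAILED }

    // Buggy behavior: a FAILED primary looks the same as a "not yet
    // active" primary, so the replica keeps retrying and stays
    // INITIALIZING once the retry budget (standing in for "forever") runs out.
    static State recoverReplica(State primary, int maxRetries) {
        for (int retry = 0; retry < maxRetries; retry++) {
            if (primary == State.STARTED) {
                return State.STARTED; // normal recovery path
            }
            // otherwise: retry, hoping the primary shows up again
        }
        return State.INITIALIZING; // stuck
    }

    // Fixed behavior: when the primary itself has failed, fail the
    // initializing replica too instead of retrying indefinitely.
    static State recoverReplicaFixed(State primary, int maxRetries) {
        for (int retry = 0; retry < maxRetries; retry++) {
            if (primary == State.FAILED) {
                return State.FAILED;
            }
            if (primary == State.STARTED) {
                return State.STARTED;
            }
        }
        return State.INITIALIZING;
    }

    public static void main(String[] args) {
        System.out.println(recoverReplica(State.FAILED, 10));      // INITIALIZING
        System.out.println(recoverReplicaFixed(State.FAILED, 10)); // FAILED
        System.out.println(recoverReplicaFixed(State.STARTED, 10)); // STARTED
    }
}
```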