Resiliency: Recovering replicas might get stuck in initializing state #6808

Closed
s1monw opened this issue Jul 10, 2014 · 8 comments

Comments

@s1monw
Contributor

s1monw commented Jul 10, 2014

If a primary fails while a replica has started recovery but has not yet initialized the recovery process, the replica will keep retrying until the primary is allocated on that node again. This never happens, so the replica gets stuck in the INITIALIZING state and is never cleaned up.
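For anyone trying to confirm the symptom: the stuck copies show up in the cat shards output and the cluster stays yellow/red. A minimal sketch (the host is a placeholder):

# List shard copies stuck in INITIALIZING (host is a placeholder).
curl -s 'localhost:9200/_cat/shards' | grep INITIALIZING

# Cluster-level view; anything other than green means unassigned or initializing copies.
curl -s 'localhost:9200/_cluster/health?pretty'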

@s1monw s1monw self-assigned this Jul 10, 2014
@s1monw s1monw closed this as completed in 72e6150 Jul 10, 2014
s1monw added a commit that referenced this issue Jul 10, 2014
Since Lucene version 4.8 each file has a checksum written as its
footer. We used to calculate the checksums for all files transparently
on the filesystem layer (Directory / Store), which is no longer
necessary. This commit makes use of the new checksums in a backwards
compatible way such that files written with the old checksum mechanism
are still compared against the corresponding Adler32 checksum, while
newer files are compared against the Lucene built-in CRC32 checksum.

Since every written file is now checksummed by default, this commit
also verifies the checksums of files during recovery and restore where
applicable.

Closes #5924

This commit also has a fix for #6808 since the added tests in
`CorruptedFileTest.java` exposed the issue.

Closes #6808
kimchy added a commit that referenced this issue Jul 10, 2014
Out of #6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, that change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure twice (which we do have in the failed-shard cases); second, we already tried to handle this (wrongly) in the elect-primary method.
This change fixes the handling to work correctly in the elect-primary method and adds unit tests to verify the behavior.
closes #6816
kimchy added a commit that referenced this issue Jul 11, 2014
Out of #6808, we improved the handling of a failing primary to make sure that initializing replicas are properly failed as well. After double-checking it, that change has two problems: first, if the same shard routing is failed again, there is no protection against applying the failure twice (which we do have in the failed-shard cases); second, we already tried to handle this (wrongly) in the elect-primary method.
This change fixes the handling to work correctly in the elect-primary method and adds unit tests to verify the behavior.
The change also exposed a problem in our handling of replica shards that stay initializing while a primary fails and another replica shard is elected as primary: we need to cancel the replica's ongoing recovery to make sure it restarts from the newly elected primary.
closes #6816
closes #6825
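As an operator-side illustration of what this fix automates: a replica recovery that is stuck like this can be cancelled by hand with the cluster reroute API, after which allocation restarts it from the current (newly elected) primary. This is only a sketch; the index name, shard number, and node name are placeholders.

# Cancel the stuck replica recovery so it gets re-allocated and
# starts recovering from the newly elected primary.
# "my-index", shard 1 and the node name are placeholders for the stuck copy.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "cancel": { "index": "my-index", "shard": 1, "node": "node-with-stuck-replica" } }
  ]
}'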
@clintongormley clintongormley changed the title [CLUSTER] Recovering replicas might get stuck in initializing state Resiliency: Recovering replicas might get stuck in initializing state Jul 16, 2014
@OlegYch

OlegYch commented Aug 22, 2014

Was this fixed? I think I just experienced this after a rolling upgrade from 1.3.1 to 1.3.2.
I disabled cluster.routing.allocation before upgrading, then installed the new version on each node and set cluster.routing.allocation=all again (the settings calls are sketched after the log excerpt below).
Then I waited several hours for the cluster to become green (with dog-slow response times in the meantime).
Then I restarted each node one by one again.
After waiting a bit more I noticed there were 4 primary shards stuck in INITIALIZING on one of the nodes and killed it - and the cluster went back up green in no time.
There were errors like this in the failed node log:

[2014-08-22 22:01:45,221][DEBUG][action.search.type       ] [thisnode] [myidx][1], node[oDw5wWJ-S6etTr0cGLNbGw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@57228563] lastShard [true]
org.elasticsearch.transport.SendRequestTransportException: [anothernode][inet[/10.35.62.130:9300]][search/phase/query]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [anothernode][inet[/10.35.62.130:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:874)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:556)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:206)
        ... 40 more
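For reference, the allocation toggling described above is usually done through the cluster settings API; in 1.3 the documented setting is cluster.routing.allocation.enable, so treat the exact setting name used here as an assumption. A sketch:

# Disable shard allocation before taking a node down for the upgrade
# (setting name as documented for 1.x; adjust if your cluster uses the older disable_allocation flag).
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... upgrade and restart the node ...

# Re-enable allocation once the node has rejoined the cluster.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'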

@OlegYch

OlegYch commented Aug 23, 2014

Welp, there are some shards stuck in INITIALIZING on that node again (from newly created indices).
The only exceptions in the log are:

[2014-08-23 00:02:12,622][ERROR][marvel.agent.exporter    ] [thisnode] remote target didn't respond with 200 OK response code [404 Not Found]. content: [:)
^E�errorrIndexMissingException[[.marvel-2014.08.23] missing]�status$^L��]

Note the garbled strings.
On the other nodes there is only stuff like:

[2014-08-23 00:38:31,588][DEBUG][action.bulk              ] [othernode] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2014-08-23 00:38:31,589][ERROR][marvel.agent.exporter    ] [othernode] create failure (index:[.marvel-2014.08.23] type: [node_stats]): UnavailableShardsException[[.marvel-2014.08.23][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@4f0b6ef7]

@clintongormley

/cc @bleskes @s1monw

@s1monw
Contributor Author

s1monw commented Aug 25, 2014

Hey @OlegYch, I personally can't really see evidence that your issue is related to this one. Did you actually see a shard initializing that was supposed to recover from a shard that is not allocated? You also said:

After waiting a bit more I noticed there were 4 primary shards stuck in INITIALIZING on one of the nodes and killed it - and the cluster went back up green in no time.

In this issue the shards that got stuck were replicas. Can you provide more info than what you already added?

@OlegYch

OlegYch commented Aug 25, 2014

Oh sorry, I didn't understand that this issue was specifically about replica shards as opposed to primary shards; perhaps what I'm seeing is better described in #6816?
I've removed that node from the cluster and stopped Elasticsearch, so I can upload files from it if that would help with diagnosing.

@garyelephant

Is this problem resolved after 1.4.0? My 1.4.0 cluster still has this problem.

curl es_host:9200/_cat/shards 2>&1 | grep '[UI]N'
test-2015.01.07      4 p INITIALIZING                   127.0.0.1 10.71.16.121
test-2015.01.07      4 r UNASSIGNED
test-2015.01.07      4 r UNASSIGNED
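To see where such a shard is stuck, the recovery and per-shard health APIs give a bit more detail; a small sketch using the same host and index names as above:

# Recovery stage and source/target node of each shard copy.
curl -s 'es_host:9200/_cat/recovery/test-2015.01.07?v'

# Per-shard health, including how many copies are active.
curl -s 'es_host:9200/_cluster/health/test-2015.01.07?level=shards&pretty'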

@ioc32

ioc32 commented Jan 9, 2015

@garyelephant I just ran into this same issue after resetting the number of replicas using the API (see the sketch below). 1.4.0, too.

FWIW here's the status of the shards comprising one of the stuck indices:

index-2015.01.06 0 r INITIALIZING 10.0.0.162 es2.io.example.com
index-2015.01.06 0 r UNASSIGNED
index-2015.01.06 1 r UNASSIGNED
index-2015.01.06 1 r UNASSIGNED
index-2015.01.06 2 r UNASSIGNED
index-2015.01.06 2 r UNASSIGNED
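For context, "resetting the number of replicas using the API" refers to the index settings API; a sketch with placeholder values, shown for illustration rather than as a recommended recovery procedure:

# Drop replicas and add them back; sometimes used to re-trigger replica allocation.
curl -XPUT 'localhost:9200/index-2015.01.06/_settings' -d '{ "index.number_of_replicas": 0 }'
curl -XPUT 'localhost:9200/index-2015.01.06/_settings' -d '{ "index.number_of_replicas": 1 }'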

@clintongormley

@garyelephant @ioc32 this issue is closed. If you're still seeing problems in 1.4, please open a new issue with more information about the problem.
