marking and sending shard failed due to [failed recovery] #14338

Closed
nihilson opened this issue Oct 28, 2015 · 3 comments

Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. feedback_needed

Comments

@nihilson

I'm getting the following exception after startup in a 2-node cluster running elasticsearch-1.7.2. Can someone help me understand what these messages mean and how to resolve them?

[10/28/15 11:31:57:364 EDT] 00000056 org.elasticsearch.indices.cluster                            W [NODE2] [[xtttenantmaster][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [xtttenantmaster][0]: Recovery failed from [NODE1][......][host1][inet[/*.*.*.*:9600]] into [NODE2][....][host2][inet[/*.*.*.*:9600]]
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:280)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:70)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:567)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:785)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[/*.*.*.*:9600]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.transport.NotSerializableTransportException: [org.elasticsearch.indices.recovery.DelayRecoveryException] source node does not have the shard listed in its state as allocated on the node;
clintongormley added the discuss and :Distributed/Recovery labels on Oct 29, 2015

@clintongormley

@bleskes any ideas here?

@bleskes
Contributor

bleskes commented Nov 2, 2015

Recovery of a replica is triggered by the master publishing a cluster state. Sometimes that cluster state is processed by the target node first and only then by the source node. The target node then initiates the recovery by sending a message to the source node. If that message arrives before the source node has processed the cluster state, it will respond with a DelayRecoveryException. This should only tell the target node to retry in a few seconds and is just part of the normal process.
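
To illustrate the retry behaviour described above, here is a minimal Java sketch. It is not Elasticsearch code: `SourceNode`, `recoverWithRetry`, the attempt limit, and the delay are hypothetical names and values chosen for the example, and the nested `DelayRecoveryException` is only a stand-in for the real class. It assumes the source node rejects the first requests because it has not yet processed the cluster state, and that the target simply waits and asks again.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RecoveryRetrySketch {

    /** Hypothetical stand-in for the "not ready yet, try again later" signal. */
    static class DelayRecoveryException extends RuntimeException {
        DelayRecoveryException(String message) {
            super(message);
        }
    }

    /** Hypothetical source node; it accepts the request only once it knows about the shard. */
    interface SourceNode {
        void startRecovery(String shard);
    }

    /**
     * Sends the start-recovery request and retries whenever the source asks for a delay.
     * Returns the number of attempts that were needed.
     */
    static int recoverWithRetry(SourceNode source, String shard, int maxAttempts, long retryDelayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                source.startRecovery(shard);
                return attempt;
            } catch (DelayRecoveryException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up after the last allowed attempt
                }
                try {
                    Thread.sleep(retryDelayMillis); // wait before asking the source again
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", interrupted);
                }
            }
        }
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }

    public static void main(String[] args) {
        AtomicInteger requestsSeen = new AtomicInteger();
        // Simulated source that has processed the cluster state only by the third request.
        SourceNode source = shard -> {
            if (requestsSeen.incrementAndGet() < 3) {
                throw new DelayRecoveryException(
                        "source node does not have the shard listed in its state as allocated on the node");
            }
        };
        int attempts = recoverWithRetry(source, "[xtttenantmaster][0]", 5, 10);
        System.out.println("recovery started after " + attempts + " attempts");
    }
}
```

In this simulation the delay signal is harmless: the target only needs to recognise it and try again.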

Sadly, in your case something goes wrong with the serialization of the DelayRecoveryException, which causes a different NotSerializableTransportException to be sent instead. The target node then cancels the recovery because it doesn't know what went wrong. Can you check whether anything in the source node logs hints at what wasn't serializable? Unfortunately, 1.7 doesn't do a good job of indicating this. In 2.0, exception serialization was completely rewritten to help with these issues.
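
The failure mode can be sketched in plain Java. This is an illustration under assumptions, not the 1.7 transport code: `DelayLikeException`, `NotSerializableWrapper`, and `prepareForWire` are hypothetical names, and the sketch assumes the exception is written with standard Java object serialization. If the exception references something that is not serializable, the sender can only ship a substitute that carries the original class name and message, so the receiver no longer sees the type that would have told it to retry.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class ExceptionSerializationSketch {

    /** Hypothetical exception that carries a non-serializable object with it. */
    static class DelayLikeException extends RuntimeException {
        private final Object context; // not transient, so serialization will try to write it

        DelayLikeException(String message, Object context) {
            super(message);
            this.context = context;
        }
    }

    /** Hypothetical fallback that keeps only the original class name and message. */
    static class NotSerializableWrapper extends RuntimeException {
        NotSerializableWrapper(Throwable original) {
            super("[" + original.getClass().getName() + "] " + original.getMessage() + ";");
        }
    }

    /**
     * Returns the error unchanged if it can be serialized, otherwise a wrapper.
     * The wrapper is a different type, so a receiver checking for the original
     * exception class will not recognise it.
     */
    static Throwable prepareForWire(Throwable error) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(error);
            return error;
        } catch (NotSerializableException e) {
            // e.getMessage() names the offending class, which is the hint worth logging
            System.err.println("could not serialize " + error.getClass().getName()
                    + ", offending class: " + e.getMessage());
            return new NotSerializableWrapper(error);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Throwable sent = prepareForWire(new DelayLikeException(
                "source node does not have the shard listed in its state as allocated on the node",
                new Object())); // java.lang.Object is not Serializable
        System.out.println(sent.getClass().getSimpleName() + ": " + sent.getMessage());
    }
}
```

The point of the sketch is the type change: a target that retries only on the delay exception treats the wrapper as an unknown failure and cancels the recovery.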

@clintongormley

No further feedback. Closing
