Translog recovery can fail due to mappings not present on recovery target #11281

Closed

s1monw opened this issue May 21, 2015 · 0 comments

Labels: blocker, >bug, :Distributed/Recovery, :Search/Mapping, v2.0.0-beta1

s1monw commented May 21, 2015

There is a small window where a type or a field may not yet have been published to the replica via a mapping update, but we are already sending a document of that type to the replica during translog recovery. This is essentially the same problem we have with normal indexing, where the first document introducing a type blocks until the mapping update is published, while subsequent documents don't need to introduce the mapping because the receiving node has already gotten the update. The window is small, but we hit it once in tests today:

http://build-us-00.elastic.co/job/es_core_master_centos/4808/consoleFull

resulting in this:

1> RemoteTransportException[[node_t2][local[658]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[node_t0][local[656]][internal:index/shard/recovery/translog_ops]]; nested: NullPointerException;
  1> Caused by: [test][0] Phase[2] phase2 failed
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:145)
  1>    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:127)
  1>    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:53)
  1>    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:136)
  1>    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:133)
  1>    at org.elasticsearch.transport.local.LocalTransport$2.doRun(LocalTransport.java:279)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)
  1> Caused by: RemoteTransportException[[node_t0][local[656]][internal:index/shard/recovery/translog_ops]]; nested: NullPointerException;
  1> Caused by: java.lang.NullPointerException
  1>    at org.elasticsearch.index.mapper.MapperAnalyzer.getWrappedAnalyzer(MapperAnalyzer.java:48)
  1>    at org.apache.lucene.analysis.DelegatingAnalyzerWrapper$DelegatingReuseStrategy.getReusableComponents(DelegatingAnalyzerWrapper.java:74)
  1>    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:139)
  1>    at org.elasticsearch.common.lucene.all.AllField.tokenStream(AllField.java:77)
  1>    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:606)
  1>    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
  1>    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
  1>    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
  1>    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
  1>    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
  1>    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1142)
  1>    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:522)
  1>    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:448)
  1>    at org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:112)
  1>    at org.elasticsearch.index.shard.TranslogRecoveryPerformer.performBatchRecovery(TranslogRecoveryPerformer.java:72)
  1>    at org.elasticsearch.index.shard.IndexShard.performBatchRecovery(IndexShard.java:812)
  1>    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:306)
  1>    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:297)
  1>    at org.elasticsearch.transport.local.LocalTransport$2.doRun(LocalTransport.java:279)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  1>    at java.lang.Thread.run(Thread.java:745)

Somehow we also need to wait for a cluster state update here during translog recovery.
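
To make the race concrete, here is a minimal, illustrative sketch of the kind of guard the replay path would need. All names in it (TranslogReplaySketch, MappingMissingException, performOperation) are hypothetical and not taken from the Elasticsearch code base; it only mirrors the role of TranslogRecoveryPerformer.performRecoveryOperation from the trace above. The idea: before applying a replayed operation, check that the target already knows the type, give a pending cluster state update a chance to arrive, and otherwise fail the phase cleanly instead of hitting the NullPointerException shown above.

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class TranslogReplaySketch {

    /** Thrown when a replayed operation refers to a type the target has not mapped yet. */
    static final class MappingMissingException extends RuntimeException {
        MappingMissingException(String type) {
            super("mapping for type [" + type + "] not present on recovery target");
        }
    }

    private final Set<String> knownTypes;        // types currently mapped on this shard
    private final CountDownLatch mappingUpdate;  // released when a new cluster state is applied

    TranslogReplaySketch(Set<String> knownTypes, CountDownLatch mappingUpdate) {
        this.knownTypes = knownTypes;
        this.mappingUpdate = mappingUpdate;
    }

    /** Apply one replayed index operation, waiting briefly for a missing mapping. */
    void performOperation(String type, String source) throws InterruptedException {
        if (!knownTypes.contains(type)) {
            // Give the pending mapping update a chance to arrive instead of
            // dereferencing a missing mapper and failing with an NPE ...
            boolean updated = mappingUpdate.await(30, TimeUnit.SECONDS);
            if (!updated || !knownTypes.contains(type)) {
                // ... and otherwise fail the phase so the source can retry the recovery.
                throw new MappingMissingException(type);
            }
        }
        index(type, source);
    }

    private void index(String type, String source) {
        // stand-in for the actual engine index call
        System.out.println("indexed [" + type + "]: " + source);
    }
}
```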

s1monw added the >bug, v2.0.0-beta1, blocker, :Search/Mapping, and :Distributed/Recovery labels on May 21, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this issue May 27, 2015
On rare occasions, the translog replay phase of recovery may require mapping changes on the target shard. This can happen when indexing on the primary introduces new mappings while the recovery is in phase 1. If the source node processes the new mapping from the master (allowing the indexing to proceed) before the target node does, and the recovery also moves on to phase 2 (translog replay) before the target has applied it, the translog operations arriving on the target node may be missing the mapping changes. Since this is extremely rare, we opt for a simple fix and just restart the recovery. Note that in this case the file copy phase will likely be very short, as the files are already in sync.

Restarting recoveries in such a late phase means we may need to copy segments_N files and/or files that were quickly merged away on the target again. This annoys the write-once protection in our testing infra. To work around it, I have introduced a counter in the temporary file name prefix used by the recovery code.

*** THERE IS STILL AN ONGOING ISSUE ***: Lucene will try to write the same segments_N file (which was cleaned by the recovery code) twice, triggering test failures.

 Closes elastic#11281
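
For the write-once workaround mentioned in the commit message, a minimal sketch of the idea follows. RecoveryTempFileNames and its methods are hypothetical names rather than the actual recovery code; the sketch only illustrates how a per-attempt counter in the temporary file name prefix keeps a restarted recovery from rewriting a file name that an earlier attempt already used.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RecoveryTempFileNames {

    // one counter per node/JVM; each recovery attempt takes the next value
    private static final AtomicInteger ATTEMPT = new AtomicInteger();

    private final String prefix;

    RecoveryTempFileNames() {
        // e.g. "recovery.3." -- unique per recovery attempt, so a restarted
        // recovery never reuses a temporary file name from an earlier attempt
        this.prefix = "recovery." + ATTEMPT.incrementAndGet() + ".";
    }

    String tempName(String originalFileName) {
        return prefix + originalFileName;
    }

    boolean isTempFile(String fileName) {
        return fileName.startsWith(prefix);
    }
}
```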
bleskes closed this as completed in ea41ee9 on Jun 4, 2015