SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard failing on Master #37005
Pinging @elastic/es-distributed
There was a race condition issue in the same class a while ago, but the failure presentation looks different from this one, and I don't think they're related. Including a link for completeness: #18121
The failure here comes from the fact that we fail to read the repo info:
A few lines up the log you can see:
=> apparently we run into a 0-byte index gen file. I grepped the build logs and could find another such case here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/3/consoleText in an unrelated test suite. This test has a shared per-suite cluster; the other example has a cluster per test case, which IMO rules out interference between tests. Looking at the implementation of MockRepository:
The problem seems to come from https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/snapshots/mockstore/MockRepository.java#L382 => the mock repository wrapper used in these tests randomly executes a non-atomic write.
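To make the failure mode concrete, here is a minimal illustrative sketch (not the actual MockRepository code) of why a non-atomic write lets a concurrent reader observe a 0-byte `index-N` blob: the target file becomes visible as soon as the output stream is opened, before any bytes have been written.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only -- not the MockRepository implementation. A non-atomic
// write creates the target blob first and streams bytes into it afterwards,
// so a reader that opens the file in between sees 0 bytes.
public class NonAtomicWriteSketch {
    public static void writeNonAtomically(Path target, byte[] data) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            // The file already exists (empty) at this point; its contents only
            // arrive with the write below, so a concurrent read returns 0 bytes.
            out.write(data);
        }
    }
}
```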
* Randomly doing non-atomic writes causes rare 0-byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure mode:
  * Cloud-based blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob: you either read a complete blob or nothing
  * For file-system-based blob stores, the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide atomic moves (see the sketch below)
* Closes elastic#37005
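For contrast, a minimal sketch of the atomic-write pattern the PR description refers to, assuming a plain `java.nio.file` based blob store (this is not the repository's actual implementation): the data is staged in a temporary file and then published with an atomic rename, so a reader sees either the previous generation or the complete new one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: stage the blob in a temp file, then publish it atomically.
public class AtomicBlobWriteSketch {
    public static void writeAtomically(Path target, byte[] data) throws IOException {
        // Temp file in the same directory so the rename stays on one file system.
        Path tmp = Files.createTempFile(target.getParent(), target.getFileName().toString(), ".tmp");
        Files.write(tmp, data);
        // Throws java.nio.file.AtomicMoveNotSupportedException if the file
        // system cannot perform the move as a single atomic operation.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```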
@original-brownbear mute the test for now?
@ywelsch sure, will do in a sec.
* Randomly doing non-atomic writes causes rare 0-byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure mode:
  * Cloud-based blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob: you either read a complete blob or nothing
  * For file-system-based blob stores, the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide atomic moves
* Closes #37005
Hi, I am running into a similar issue while creating a snapshot on 6.4:
Moreover, I see a log line very similar to the one pointed out by @original-brownbear:
(Note that it uses ….) I have a few specific questions:
Please let me know if there's any other helpful information I can provide, or if there's another question I should be asking. Thanks in advance for your work on Elasticsearch and for your time.
@abw333 Your issue is probably fixed in 6.7.0, see #37066.
Thanks, @ywelsch, I will give that a try!
Hi again. I was able to reproduce this failure reliably by attempting to take a snapshot into an HDFS directory that already had an empty file named …. I have a couple of follow-up questions.

I am currently taking snapshots of all of my indices with a single request. My indices all have roughly the same size. I am concerned that the snapshot fails a long time into the snapshot process, and then I have to restart from the beginning.

Question 1: If my snapshot fails with this race condition, when will it fail? Will it always be near the beginning? Or always near the end? Or does it vary? If it always fails near the beginning, I don't have anything to worry about. If it sometimes fails near the end, though, I would like to do something about it. I was thinking that, in this case, I could take a snapshot of each of my indices separately. Since they are roughly equally sized, this would lower the cost of a failure induced by the race condition.

Question 2: Does this strategy (i.e. snapshot each index separately, with the error handling described above) sound reasonable? Are there any other alternatives I should consider (e.g. selectively delete some of the contents of the snapshot directory instead of all of them, and then retry the snapshot)?
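Not an official recommendation, just a rough sketch of the per-index strategy described in Question 2, using the Elasticsearch low-level REST client. The host, the repository name `my_hdfs_repo`, and the index names are placeholders. Each index goes into its own snapshot, so a late failure only costs one index's worth of work and only that snapshot needs to be retried.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: one snapshot per index so a race-condition failure late in the
// process only requires retrying a single index-sized snapshot.
public class PerIndexSnapshotSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String[] indices = {"index_1", "index_2", "index_3"}; // placeholders
            for (String index : indices) {
                Request request = new Request("PUT",
                        "/_snapshot/my_hdfs_repo/snapshot_" + index + "?wait_for_completion=true");
                request.setJsonEntity("{\"indices\": \"" + index + "\"}");
                Response response = client.performRequest(request);
                System.out.println(index + ": " + response.getStatusLine());
            }
        }
    }
}
```

A retry on failure could wrap the `performRequest` call, after first cleaning up whatever the failed attempt left behind in the repository.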
@abw333 how often have you seen this happen? This should be extremely rare and only a problem with an extremely flaky or unreliable network, or with nodes that crash all the time.

That file is read two times, once at the beginning and once at the end of a snapshot (just before it is overwritten), but if the read were to fail at the end, it should already have failed at the beginning.

Given that this should happen very rarely, I'm not convinced you need any special strategy. As I mentioned above, you can just delete the offending 0-byte file and start snapshotting again.
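To illustrate the "delete the offending 0-byte file and retry" suggestion, here is a small sketch assuming a file-system-backed repository rooted at `repoRoot`; for an HDFS repository the same check and delete would be done with the `hdfs dfs` CLI instead. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: remove any 0-byte index-N generation files from the repository root
// so the next snapshot attempt does not trip over them.
public class StaleIndexGenCleanupSketch {
    public static void deleteEmptyIndexGenFiles(Path repoRoot) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(repoRoot, "index-*")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file) && Files.size(file) == 0) {
                    Files.delete(file); // offending 0-byte generation file
                }
            }
        }
    }
}
```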
@ywelsch I have a test that starts a snapshot using a …. Locally, the test passes reliably, but on my testing infrastructure it fails about once every 20 runs (which might indicate that the network on my testing infrastructure is not great). I have observed the failure (same stack trace) in two places: 1) when starting the snapshot with the …, and 2) during the actual snapshot. Although I can't reproduce the race condition locally, I can reproduce the failures by creating the empty file myself, either before the ….

1 is easy to address: I added some logic around the ….

2 is a little bit more complex, since it seems that here the race condition is occurring during the actual snapshot (as opposed to during the request that initiates the snapshot). Here, a failure can be more costly since a snapshot is a long-running operation. My worry is that the failure happens towards the end of the snapshot and then we need to rerun the whole thing, which would add a lot of latency to the process.

Is what I'm seeing consistent with your understanding of what's happening here? And do you have any recommendation on how to handle 2? Is the race condition happening during the actual snapshot job? If so, do you expect that the snapshot job would fail right at the beginning?
Test failed on master:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/144/
Does NOT reproduce for me on master with:
Relevant log snippet: