SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard failing on Master #37005

Closed
not-napoleon opened this issue Dec 27, 2018 · 14 comments · Fixed by #37011

@not-napoleon
Member

Test failed on master:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/144/

Does NOT reproduce for me on master with:

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=BA465B36283F19D1 \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotCanceledOnRemovedShard" \
  -Dtests.security.manager=true \
  -Dtests.locale=hu \
  -Dtests.timezone=America/Merida \
  -Dcompiler.java=11 \
  -Druntime.java=8

Relevant log snippet:

09:15:00   1> [2018-12-27T08:14:13,064][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm2] removing template [random_index_template]
09:15:00   1> [2018-12-27T08:14:13,092][INFO ][o.e.r.RepositoriesService] [node_sm2] delete repository [test-repo]
09:15:00   1> [2018-12-27T08:14:13,132][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testDataFileFailureDuringRestore] [SharedClusterSnapshotRestoreIT#testDataFileFailureDuringRestore]: cleaned up after test
09:15:00   1> [2018-12-27T08:14:13,132][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testDataFileFailureDuringRestore] after test
09:15:00   1> [2018-12-27T08:14:13,287][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] before test
09:15:00   1> [2018-12-27T08:14:13,287][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] [SharedClusterSnapshotRestoreIT#testSnapshotName]: setting up test
09:15:00   1> [2018-12-27T08:14:13,294][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm2] adding template [random_index_template] for index patterns [*]
09:15:00   1> [2018-12-27T08:14:13,331][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] [SharedClusterSnapshotRestoreIT#testSnapshotName]: all set up test
09:15:00   1> [2018-12-27T08:14:13,331][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] -->  creating repository
09:15:00   1> [2018-12-27T08:14:13,332][INFO ][o.e.r.RepositoriesService] [node_sm2] put repository [test-repo]
09:15:00   1> [2018-12-27T08:14:13,386][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] [SharedClusterSnapshotRestoreIT#testSnapshotName]: cleaning up after test
09:15:00   1> [2018-12-27T08:14:13,408][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm2] removing template [random_index_template]
09:15:00   1> [2018-12-27T08:14:13,434][INFO ][o.e.r.RepositoriesService] [node_sm2] delete repository [test-repo]
09:15:00   1> [2018-12-27T08:14:13,468][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] [SharedClusterSnapshotRestoreIT#testSnapshotName]: cleaned up after test
09:15:00   1> [2018-12-27T08:14:13,469][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotName] after test
09:15:00   1> [2018-12-27T08:14:13,607][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] before test
09:15:00   1> [2018-12-27T08:14:13,607][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] [SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard]: setting up test
09:15:00   1> [2018-12-27T08:14:13,612][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm2] adding template [random_index_template] for index patterns [*]
09:15:00   1> [2018-12-27T08:14:13,651][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] [SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard]: all set up test
09:15:00   1> [2018-12-27T08:14:13,653][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_sm2] [test-idx] creating index, cause [api], templates [random_index_template], shards [1]/[1], mappings []
09:15:00   1> [2018-12-27T08:14:13,843][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> indexing some data
09:15:00   1> [2018-12-27T08:14:13,846][INFO ][o.e.c.m.MetaDataMappingService] [node_sm2] [test-idx/lDUHpreESTWKTn3etC85pA] create_mapping [_doc]
09:15:00   1> [2018-12-27T08:14:14,083][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> creating repository
09:15:00   1> [2018-12-27T08:14:14,084][INFO ][o.e.s.m.MockRepository   ] [node_sm2] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,084][INFO ][o.e.r.RepositoriesService] [node_sm2] put repository [test-repo]
09:15:00   1> [2018-12-27T08:14:14,095][INFO ][o.e.s.m.MockRepository   ] [node_sm0] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,095][INFO ][o.e.s.m.MockRepository   ] [node_sd4] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,096][INFO ][o.e.s.m.MockRepository   ] [node_sd5] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,096][INFO ][o.e.s.m.MockRepository   ] [node_sd3] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,096][INFO ][o.e.s.m.MockRepository   ] [node_sm1] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,112][INFO ][o.e.s.m.MockRepository   ] [node_sm2] starting mock repository with random prefix SWTuuSLFKZ
09:15:00   1> [2018-12-27T08:14:14,120][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> snapshot
09:15:00   1> [2018-12-27T08:14:14,120][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> waiting for block to kick in on node [node_sd4]
09:15:00   1> [2018-12-27T08:14:14,144][INFO ][o.e.s.SnapshotsService   ] [node_sm2] snapshot [test-repo:test-snap/WLdcwqMYRHGiS_RxVVjbuQ] started
09:15:00   1> [2018-12-27T08:14:14,192][INFO ][o.e.s.m.MockRepository   ] [node_sd4] [test-repo] blocking I/O operation for file [__0] at path [[indices][c-oqPeO2SSyVFv3iTpzMZw][0]]
09:15:00   1> [2018-12-27T08:14:14,220][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> removing primary shard that is being snapshotted
09:15:00   1> [2018-12-27T08:14:14,229][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> unblocking blocked node [node_sd4]
09:15:00   1> [2018-12-27T08:14:14,229][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] --> ensuring snapshot is aborted and the aborted shard was marked as failed
09:15:00   1> [2018-12-27T08:14:14,431][WARN ][o.e.s.SnapshotShardsService] [node_sd4] [[test-idx][0]][test-repo:test-snap/WLdcwqMYRHGiS_RxVVjbuQ] failed to snapshot shard
09:15:00   1> org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
09:15:00   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1403) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1396) ~[main/:?]
09:15:00   1> 	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_192]
09:15:00   1> 	at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
09:15:00   1> 	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:133) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.snapshots.mockstore.BlobContainerWrapper.writeBlob(BlobContainerWrapper.java:53) ~[test/:?]
09:15:00   1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.writeBlob(MockRepository.java:352) ~[test/:?]
09:15:00   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1330) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1269) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:852) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:393) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.snapshots.SnapshotShardsService.access$100(SnapshotShardsService.java:94) ~[main/:?]
09:15:00   1> 	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:336) [main/:?]
09:15:00   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [main/:?]
09:15:00   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
09:15:00   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
09:15:00   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
09:15:00   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
09:15:00   1> [2018-12-27T08:14:14,435][WARN ][o.e.r.b.BlobStoreRepository] [node_sm2] [index-0] index blob is not valid x-content [0 bytes]
09:15:00   1> [2018-12-27T08:14:14,436][INFO ][o.e.s.SnapshotsService   ] [node_sm2] snapshot [test-repo:test-snap/WLdcwqMYRHGiS_RxVVjbuQ] completed with state [PARTIAL]
09:15:00   1> [2018-12-27T08:14:14,437][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] [SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard]: cleaning up after test
09:15:00   1> [2018-12-27T08:14:14,444][WARN ][o.e.i.c.IndicesClusterStateService] [node_sd4] [[test-idx][0]] marking and sending shard failed due to [master marked shard as active, but shard has not been created, mark shard as failed]
09:15:00   1> [2018-12-27T08:14:14,453][WARN ][o.e.c.r.a.AllocationService] [node_sm2] failing shard [failed shard, shard [test-idx][0], node[NmI-D7OpTF6LOv51WTCDyw], [P], s[STARTED], a[id=jqVYMdb5SQmT9qtKYKXh6Q], message [master marked shard as active, but shard has not been created, mark shard as failed], failure [Unknown], markAsStale [true]]
09:15:00   1> [2018-12-27T08:14:14,454][INFO ][o.e.c.r.a.AllocationService] [node_sm2] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[test-idx][0]] ...]).
09:15:00   1> [2018-12-27T08:14:14,476][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_sm2] [test-idx/lDUHpreESTWKTn3etC85pA] deleting index
09:15:00   1> [2018-12-27T08:14:14,531][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm2] removing template [random_index_template]
09:15:00   1> [2018-12-27T08:14:14,554][INFO ][o.e.r.RepositoriesService] [node_sm2] delete repository [test-repo]
09:15:00   1> [2018-12-27T08:14:14,577][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] [SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard]: cleaned up after test
09:15:00   1> [2018-12-27T08:14:14,577][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotCanceledOnRemovedShard] after test
09:15:00 ERROR   1.14s J7 | SharedClusterSnapshotRestoreIT.testSnapshotCanceledOnRemovedShard <<< FAILURES!
09:15:00    > Throwable #1: NotSerializableExceptionWrapper[not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes]
09:15:00    > 	at __randomizedtesting.SeedInfo.seed([BA465B36283F19D1:2BF74722441573D]:0)
09:15:00    > 	at org.elasticsearch.common.compress.CompressorFactory.compressor(CompressorFactory.java:56)
09:15:00    > 	at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:70)
09:15:00    > 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:670)
09:15:00    > 	at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:148)
09:15:00    > 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:96)
09:15:00    > 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54)
09:15:00    > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)
09:15:00    > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)
09:15:00    > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
09:15:00    > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
09:15:00    > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
09:15:00    > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
09:15:00    > 	at java.lang.Thread.run(Thread.java:748)
09:15:00   1> [2018-12-27T08:14:14,796][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testGetSnapshotsRequest] before test
09:15:00   1> [2018-12-27T08:14:14,797][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testGetSnapshotsRequest]
@not-napoleon added the :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) and >test-failure (Triaged test failures from CI) labels Dec 27, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear self-assigned this Dec 27, 2018
@not-napoleon
Member Author

There was a race condition issue in the same class a while ago, but the presentation looks different from this one, and I don't think they're related. Including a link for completeness: #18121

@original-brownbear
Member

The failure here comes from the fact that we fail to read the repo info:

ERROR   1.14s J7 | SharedClusterSnapshotRestoreIT.testSnapshotCanceledOnRemovedShard <<< FAILURES!
   > Throwable #1: NotSerializableExceptionWrapper[not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes]
   > 	at __randomizedtesting.SeedInfo.seed([BA465B36283F19D1:2BF74722441573D]:0)
   > 	at org.elasticsearch.common.compress.CompressorFactory.compressor(CompressorFactory.java:56)
   > 	at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:70)
   > 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:670)
   > 	at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:148)
   > 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:96)
   > 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)
   > 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

A few lines up in the log you can see:

  1> [2018-12-27T08:14:14,435][WARN ][o.e.r.b.BlobStoreRepository] [node_sm2] [index-0] index blob is not valid x-content [0 bytes]

=> apparently we run into an index generation file `index-0` that has 0 length.

I grepped the build logs and found another such case here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/3/consoleText in an unrelated test suite.

This test uses a shared per-suite cluster; the other example has a cluster per test case, which imo rules out interference between tests.

Looking at the implementation of org.elasticsearch.common.blobstore.fs.FsBlobContainer#writeBlobAtomic, it is not immediately clear how we could end up with an empty blob here. Looking into that now.
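
For context, the atomic write path boils down to a write-to-a-temp-file-then-rename pattern roughly like the sketch below. The class name, temp-file naming, and cleanup here are illustrative only, not the actual FsBlobContainer source; the relevant point is that `StandardCopyOption.ATOMIC_MOVE` either renames the fully written file in one step or throws, so readers should never observe a half-written blob.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class AtomicBlobWriteSketch {

    /**
     * Write the blob to a temporary file in the same directory, then rename it
     * into place in a single step. Readers should see either no file or the
     * fully written blob, never a partially written (e.g. 0-byte) one.
     */
    static void writeBlobAtomic(Path containerDir, String blobName, InputStream data) throws IOException {
        Path tempBlob = containerDir.resolve("pending-" + blobName + ".tmp"); // illustrative temp name
        Path targetBlob = containerDir.resolve(blobName);
        Files.copy(data, tempBlob); // fails if the temp file already exists
        try {
            // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the file system
            // cannot rename atomically, so a partial write cannot appear silently.
            Files.move(tempBlob, targetBlob, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            Files.deleteIfExists(tempBlob);
            throw e;
        }
    }
}
```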

@original-brownbear
Member

The problem seems to come from https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/snapshots/mockstore/MockRepository.java#L382 => the mock repository wrapper used in these tests randomly executes a non-atomic writeBlob when writeBlobAtomic is invoked.
That would allow a 0-length file to be read for `index-0` when no data has been fsynced to it yet.
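
The randomness in question has roughly the following shape. This is a hypothetical sketch, not the actual MockRepository source; `BlobContainerLike` and the method signatures below stand in for Elasticsearch's real blob container API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

// Minimal stand-in for the blob container API (assumed, not the real interface).
interface BlobContainerLike {
    void writeBlob(String name, InputStream data, long length) throws IOException;
    void writeBlobAtomic(String name, InputStream data, long length) throws IOException;
}

// Hypothetical failure-injection wrapper illustrating the randomness described above.
class RandomlyNonAtomicBlobContainer implements BlobContainerLike {
    private final BlobContainerLike delegate;
    private final Random random;

    RandomlyNonAtomicBlobContainer(BlobContainerLike delegate, Random random) {
        this.delegate = delegate;
        this.random = random;
    }

    @Override
    public void writeBlob(String name, InputStream data, long length) throws IOException {
        delegate.writeBlob(name, data, length);
    }

    @Override
    public void writeBlobAtomic(String name, InputStream data, long length) throws IOException {
        if (random.nextBoolean()) {
            // Sometimes degrade to a plain, non-atomic write: a concurrent reader can
            // now observe the target blob (e.g. index-0) before any bytes are flushed,
            // i.e. as a 0-byte file.
            delegate.writeBlob(name, data, length);
        } else {
            delegate.writeBlobAtomic(name, data, length);
        }
    }
}
```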

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 28, 2018
* Randomly doing non-atomic writes causes rare 0 byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure-mode:
  * Cloud based Blob stores (S3, GCS and Azure) do not have inconsistent partial reads of a blob, either you read a complete blob or nothing on them
  * For file system based blob stores the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide for atomic moves
* Closes elastic#37005
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 28, 2018
* Randomly doing non-atomic writes causes rare 0 byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure-mode:
  * Cloud based Blob stores (S3, GCS and Azure) do not have inconsistent partial reads of a blob, either you read a complete blob or nothing on them
  * For file system based blob stores the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide for atomic moves
* Closes elastic#37005
@javanna changed the title from "SharedClusterSnapshotRestoreIT failing on Master" to "SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard failing on Master" Dec 28, 2018
@javanna
Member

javanna commented Dec 28, 2018

@ywelsch
Contributor

ywelsch commented Dec 28, 2018

@original-brownbear mute the test for now?

@original-brownbear
Member

@ywelsch sure, will do in a sec.

@original-brownbear
Member

Muted in master (34d22f3) and 6.x (87d3aa7)

original-brownbear added a commit that referenced this issue Jan 7, 2019
* Randomly doing non-atomic writes causes rare 0 byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure-mode:
  * Cloud-based Blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob, either you read a complete blob or nothing on them
  * For file system based blob stores the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide for atomic moves
* Closes #37005
original-brownbear added a commit that referenced this issue Jan 7, 2019
* Randomly doing non-atomic writes causes rare 0 byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure-mode:
  * Cloud-based Blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob, either you read a complete blob or nothing on them
  * For file system based blob stores the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide for atomic moves
* Closes #37005
@abw333

abw333 commented Jan 22, 2019

Hi, I am running into a similar issue while creating a snapshot on 6.4:

org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: NotSerializableExceptionWrapper[not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes]
NotSerializableExceptionWrapper[not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes]
	at org.elasticsearch.common.compress.CompressorFactory.compressor(CompressorFactory.java:56)
	at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:69)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:671)
	at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:139)
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.buildResponse(TransportSnapshotsStatusAction.java:207)
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction.access$100(TransportSnapshotsStatusAction.java:61)
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction$1.onResponse(TransportSnapshotsStatusAction.java:127)
	at org.elasticsearch.action.admin.cluster.snapshots.status.TransportSnapshotsStatusAction$1.onResponse(TransportSnapshotsStatusAction.java:121)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:252)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:229)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:153)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:206)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:198)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1058)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1134)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1124)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1113)
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:260)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:256)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:665)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Moreover, I see a log line very similar to the one pointed out by @original-brownbear:

[2019-01-18T18:10:42,829][WARN ][o.e.r.h.HdfsRepository   ] [OyVQR66] [index-0] index blob is not valid x-content [0 bytes]

(Note that it uses HdfsRepository instead of BlobStoreRepository.)

I have a few specific questions:

  1. What exactly is MockRepository? I don't see it in the Elasticsearch package I am using, and I don't see much documentation in the file. Is it only used for testing?
  2. How is MockRepository related to BlobStoreRepository? Why does a change to MockRepository fix an issue that seems to be related to BlobStoreRepository?
  3. How likely is it that the issue that I am seeing is related to this underlying issue? Is it possible that I am seeing the same symptom, but it's another issue? How meaningful is the fact that I see HdfsRepository in my log instead of BlobStoreRepository?
  4. While debugging, I stumbled upon this other thread: SnapshotShardsServiceIT#testRetryPostingSnapshotStatusMessages fails on 6.4 (#33199). Is that related to this issue?
  5. Does this issue mean that snapshots are expected to randomly fail in production? If so, is the workaround for clients to add a retry mechanism until we upgrade to an Elasticsearch version that includes the fix (6.7, if I understood correctly)?

Please let me know if there's any other helpful information I can provide, or if there's another question that I should be asking. Thanks in advance for your work on Elasticsearch and for your time.

@ywelsch
Contributor

ywelsch commented Jan 22, 2019

@abw333 Your issue is probably fixed in 6.7.0, see #37066. MockRepository is a fake repository implementation that we use in our tests to randomly inject failures. A test failure uncovered a problem with repository types that don't support writing blobs in an atomic way. It turned out that only HDFSRepository did not support atomic writes, which we changed in #37066. With that change we now mandate that all repository implementations support atomic writes, and we adapted MockRepository accordingly (#37011) so that it more accurately reflects the behavior of the improved repository implementations. There is unfortunately no workaround here; in particular, retrying will not help. You could try deleting the offending 0-byte file and see if that allows further snapshots.
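
For a shared-filesystem repository, a throwaway utility along the lines of the sketch below can locate (and optionally remove) such a 0-byte generation blob. It is an illustrative sketch using only the JDK, not an official tool; for an HDFS repository the same check would have to go through the HDFS client or `hdfs dfs` instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class FindEmptyIndexBlobs {
    // Lists (and with --delete, removes) 0-byte index-N generation blobs at the
    // root of a filesystem-backed snapshot repository.
    public static void main(String[] args) throws IOException {
        Path repoRoot = Paths.get(args[0]);                      // e.g. /mnt/backups/test-repo
        boolean delete = args.length > 1 && "--delete".equals(args[1]);
        try (DirectoryStream<Path> blobs = Files.newDirectoryStream(repoRoot, "index-*")) {
            for (Path blob : blobs) {
                if (Files.isRegularFile(blob) && Files.size(blob) == 0) {
                    System.out.println("empty generation blob: " + blob);
                    if (delete) {
                        Files.delete(blob);
                        System.out.println("deleted " + blob);
                    }
                }
            }
        }
    }
}
```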

@abw333

abw333 commented Jan 22, 2019

Thanks, @ywelsch, I will give that a try!

@abw333

abw333 commented Jan 29, 2019

Hi again. I was able to reproduce this failure reliably by attempting to take a snapshot into an HDFS directory that already had an empty file named index-0 in it. I was then able to fix the failure by 1) detecting the failure caused by the race condition, 2) deleting the snapshot directory's contents, and 3) retrying the snapshot. Thanks again for the suggestion :)
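
A rough sketch of that detect/clean/retry loop is shown below; all names in it are hypothetical placeholders for your own client code, not Elasticsearch APIs.

```java
// Plug in your own client calls for creating the snapshot, classifying the
// failure, and cleaning up the repository.
public final class SnapshotWithRetry {

    interface SnapshotOps {
        void takeSnapshot() throws Exception;          // start the snapshot and wait for completion
        boolean isEmptyIndexBlobFailure(Exception e);  // detect the not_x_content_exception case
        void cleanRepository() throws Exception;       // remove the 0-byte index-N blob (or the repo contents)
    }

    static void snapshotWithRetry(SnapshotOps ops, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                ops.takeSnapshot();
                return;
            } catch (Exception e) {
                if (attempt >= maxAttempts || ops.isEmptyIndexBlobFailure(e) == false) {
                    throw e; // not the known race, or out of retries
                }
                ops.cleanRepository(); // clean up, then loop and retry
            }
        }
    }
}
```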

I have a couple of follow-up questions:

I am currently taking snapshots of all of my indices with a single request. My indices are all roughly the same size. I am concerned that the snapshot could fail a long time into the process, and then I would have to restart from the beginning.

Question 1: If my snapshot fails with this race condition, when will it fail? Will it always be near the beginning? Or always near the end? Or does it vary?

If it always fails near the beginning, I don't have anything to worry about. If it sometimes fails near the end, though, I would like to do something about it. I was thinking that, in this case, I could take a snapshot of each of my indices separately. Since they are roughly equally sized, this would lower the cost of a failure induced by the race condition.

Question 2: Does this strategy (i.e. snapshot each index separately, with the error handling described above) sound reasonable? Are there any other alternatives I should consider (e.g. selectively delete some of the contents of the snapshot directory instead of all of them, and then retry the snapshot)?

@ywelsch
Contributor

ywelsch commented Jan 29, 2019

@abw333 how often have you seen this happen? This should be extremely rare and only a problem with an extremely flaky or unreliable network, or with nodes that crash all the time.

Will it always be near the beginning? Or always near the end? Or does it vary?

The file is read twice, once at the beginning and once at the end of a snapshot (just before it is overwritten), but if the read were to fail at the end, it should already have failed at the beginning.

Question 2: Does this strategy (i.e. snapshot each index separately, with the error handling described above) sound reasonable? Are there any other alternatives I should consider (e.g. selectively delete some of the contents of the snapshot directory instead of all of them, and then retry the snapshot)?

Given that this should very rarely happen, I'm not convinced this would need any special strategy. As I mentioned above, you can just delete the offending 0-byte file and start snapshotting again.

@abw333

abw333 commented Jan 30, 2019

@ywelsch I have a test that starts a snapshot using a CreateSnapshotRequest and then monitors the snapshot status using a SnapshotsStatusRequest until the snapshot is complete.

Locally, the test passes reliably, but on my testing infrastructure it fails about once every 20 runs (which might indicate that the network on my testing infrastructure is not great).

I have observed the failure (same stack trace) in two places: 1) when starting the snapshot with the CreateSnapshotRequest and 2) when checking on the snapshot status with the SnapshotsStatusRequest.

Although I can't reproduce the race condition locally, I can reproduce the failures by creating the empty file myself either before the CreateSnapshotRequest or before the SnapshotsStatusRequest.

1 is easy to address: I added some logic around the CreateSnapshotRequest to detect the failure caused by the race condition and to retry (after cleaning up) if necessary. If my understanding is correct, the snapshot itself runs asynchronously, so the CreateSnapshotRequest returns fairly quickly and is therefore not too costly to retry.

2 is a little more complex, since here the race condition occurs during the actual snapshot (as opposed to during the request that initiates it). A failure here can be more costly, since a snapshot is a long-running operation. My worry is that the failure happens towards the end of the snapshot and then we need to rerun the whole thing, which would add a lot of latency to the process.

Is what I'm seeing consistent with your understanding of what's happening here? And do you have any recommendation on how to handle 2? Is the race condition happening during the actual snapshot job? If so, do you expect that the snapshot job would fail right at the beginning?
