Partial snapshots when deleted s3 files reappear in bucket #28322
Thanks for this very detailed report @TruthyBoolish! I've seen similar behavior from time to time on repositories with a lot of creation/deletion per hour. I created #30332, which relaxes the constraint when deleting files during the finalization of the snapshot. It should help reduce these errors.
When deleting or creating a snapshot for a given shard, Elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are no longer needed. During this deletion, an exception is thrown if the file to be deleted does not exist anymore. This behavior is challenging with cloud-based repository implementations like S3, where a deleted file can still appear in the bucket for a few seconds or minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the subsequent deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted. This pull request makes the deletion of these files a bit less strict, i.e. it does not fail if the file to be deleted no longer exists. It introduces a new BlobContainer.deleteIgnoringIfNotExists() method that can be used at specific places where not failing when deleting a file is considered harmless. Closes #28322
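The change described above can be illustrated with a minimal sketch. All names here are hypothetical and simplified, not Elasticsearch's actual BlobContainer API: the idea is a strict delete that throws when the blob is missing, plus a lenient wrapper that swallows only the not-found case.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the deleteIgnoringIfNotExists() idea: keep the strict
// delete, but add a lenient variant for call sites where a missing blob is harmless.
public class LenientDeleteSketch {

    interface BlobStore {
        // Strict delete: throws NoSuchFileException if the blob does not exist.
        void deleteBlob(String blobName) throws IOException;

        // Lenient delete: swallows only the "not found" case, so a blob that was
        // already removed (e.g. a stale S3 listing entry) does not abort snapshot
        // finalization; any other IOException still propagates.
        default void deleteIgnoringIfNotExists(String blobName) throws IOException {
            try {
                deleteBlob(blobName);
            } catch (NoSuchFileException e) {
                // Already gone: treat as success.
            }
        }
    }

    public static void main(String[] args) {
        Set<String> blobs = new HashSet<>();
        blobs.add("snap-1.dat");
        BlobStore store = name -> {
            if (!blobs.remove(name)) {
                throw new NoSuchFileException(name);
            }
        };
        try {
            store.deleteIgnoringIfNotExists("snap-1.dat"); // deletes the blob
            store.deleteIgnoringIfNotExists("snap-1.dat"); // already gone: no exception
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        System.out.println("remaining blobs: " + blobs.size());
    }
}
```

The second call would have thrown with a strict deleteBlob(); the lenient variant treats it as a no-op, which is exactly the behavior wanted when a stale listing schedules a delete for a blob that S3 has since finished removing.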
As far as I can find in the release notes, this is fixed in versions 6.3.0 and 6.4.0; correct me if I am wrong.
Elasticsearch version: 5.6.5 and 6.1.2
Plugins installed: [discovery-ec2, repository-s3]
JVM version: 1.8
OS version: Ubuntu 16.04
Description of the problem including expected versus actual behavior:
The s3 repository plugin occasionally produces partial snapshots due to throwing an exception while deleting a file from AWS s3 storage. This is a result of s3 only providing eventual consistency for read-after-delete operations. The full documentation on that is here:
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
but the key section says:
A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
In more concrete terms, I'm using the s3 repository plugin in circumstances where the deleted snapshot files can still show up in the bucket for seconds/minutes after they've been deleted. When the s3 plugin then tries to delete one of these files for the second time, s3 then correctly shows it doesn't exist, and the s3 plugin considers that enough of an error condition to abort that particular part of the snapshot.
Here is a log of such an error (with some redactions). This log shows the problem happens to multiple indices and shards in the snapshot operation, and some of the failures involve a common file in s3.
This log was produced with elasticsearch and repository-s3 version 5.6.5, but I have recreated the problem in 6.1.2 as well. The logic involved has not changed, so this makes sense.
When listBlobs() is called at the start of a snapshot operation,
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1133
s3 can return the files which have already been deleted.
Later, during the finalize phase, it attempts to delete all of the blobs that were listed above. The deleteBlob() method verifies that each blob exists, but the blobs that were previously deleted no longer exist individually, so deleteBlob() throws an exception.
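The failure sequence above can be modeled with a toy blob store (all class and method names here are hypothetical) whose listing lags behind deletes, the way s3's read-after-delete consistency can: listBlobs() returns a stale key, and the subsequent strict delete throws NoSuchFileException.

```java
import java.nio.file.NoSuchFileException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the bug: 'actual' is what really exists, 'listing' is a possibly
// stale view, like an s3 bucket listing that has not yet seen a delete propagate.
public class StaleListingSketch {
    private final Set<String> actual = new HashSet<>();
    private final Set<String> listing = new HashSet<>();

    void put(String name) { actual.add(name); listing.add(name); }

    // A delete whose effect has not yet reached the listing.
    void deleteWithLag(String name) { actual.remove(name); }

    List<String> listBlobs() { return new ArrayList<>(listing); }

    // Mirrors the strict deleteBlob(): verify existence first, then delete.
    void deleteBlob(String name) throws NoSuchFileException {
        if (!actual.contains(name)) {
            throw new NoSuchFileException(name);
        }
        actual.remove(name);
        listing.remove(name);
    }

    public static void main(String[] args) {
        StaleListingSketch bucket = new StaleListingSketch();
        bucket.put("index-0");
        bucket.deleteWithLag("index-0");           // deleted, but listing is stale
        for (String stale : bucket.listBlobs()) {  // still returns "index-0"
            try {
                bucket.deleteBlob(stale);          // second delete: blob is gone
            } catch (NoSuchFileException e) {
                System.out.println("snapshot would go partial: " + e.getFile());
            }
        }
    }
}
```

In the real plugin the stale key comes from s3 rather than an in-memory set, but the shape of the failure is the same: list, diff, delete, NoSuchFileException, partial snapshot.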
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1238
Steps to reproduce:
This depends on unpredictable s3 consistency behaviors, and so may not be easy to reproduce naturally. I will include a patch against 6.1.2 which will allow you to simulate this scenario by sleeping at the key points in the snapshot operation. This will allow you to add and delete the files from s3, to simulate its eventual consistency.
All that's needed to reproduce is to take a snapshot and have an s3 deleted file reappear in the bucket contents at the right time. It will be a partial snapshot, and the above errors will be logged.
If using the patch which will be provided:
If using the patch which will be provided: add a file with an `index-` prefix to the s3 path noted in the logs, for instance `index-DUMMY.txt`. (This simulates a file which was previously deleted by the plugin showing up again in the bucket contents.)
Suggested fix:
Since this is unlikely to occur for the majority of s3 users, a fix which would not affect them also seems appropriate. One option that occurs to me is an optional configuration property that filters the listBlobs() result for any blobs the plugin knows it has already deleted. Keeping track of that state may present a challenge, so being able to configure more lenient deleteBlob() checks (and thus accepting the risk of the s3 repo being concurrently modified by other processes) would also make sense to me.
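The first idea above could look something like this sketch (purely illustrative; none of these names exist in the plugin): a small tracker that remembers which blobs this process has already deleted and filters them out of subsequent listings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of the suggested fix: track blobs we already deleted in
// this session and hide them from later listings, so a stale s3 listing does
// not schedule a second (failing) delete.
public class DeleteTrackingContainer {
    private final Set<String> deletedByUs = new HashSet<>();

    // Record a successful delete issued by this process.
    void recordDelete(String blobName) {
        deletedByUs.add(blobName);
    }

    // Filter a raw listing that may still contain stale, already-deleted keys.
    List<String> filterListing(List<String> rawListing) {
        return rawListing.stream()
                .filter(name -> !deletedByUs.contains(name))
                .collect(Collectors.toList());
    }
}
```

The hard part, as noted, is the state: `deletedByUs` only covers deletes made by this process, so a repository modified concurrently by another process is still at risk, which is why a more lenient deleteBlob() is the simpler alternative.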
Patch to simulate the issue: