Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial snapshots when deleted s3 files reappear in bucket #28322

Closed
TruthyBoolish opened this issue Jan 22, 2018 · 3 comments
Closed

Partial snapshots when deleted s3 files reappear in bucket #28322

TruthyBoolish opened this issue Jan 22, 2018 · 3 comments
Assignees
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@TruthyBoolish
Copy link

TruthyBoolish commented Jan 22, 2018

Elasticsearch version: 5.6.5 and 6.1.2

Plugins installed: [discovery-ec2, repository-s3]

JVM version : 1.8

OS version : Ubuntu 16.04

Description of the problem including expected versus actual behavior:
The s3 repository plugin occasionally produces partial snapshots due to throwing an exception while deleting a file from AWS s3 storage. This is a result of s3 only providing eventual consistency for read-after-delete operations. The full documentation on that is here:
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

but the key section says:
A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.

In more concrete terms, I'm using the s3 repository plugin in circumstances where the deleted snapshot files can still show up in the bucket for seconds/minutes after they've been deleted. When the s3 plugin then tries to delete one of these files for the second time, s3 then correctly shows it doesn't exist, and the s3 plugin considers that enough of an error condition to abort that particular part of the snapshot.

Here is a log of such an error (with some redactions). This log shows the problem happens to multiple indices and shards in the snapshot operation, and some of the failures involve a common file in s3.

[2018-01-21T18:01:25,809][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.17][1]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [pending-index-62] during cleanup
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [pending-index-62] does not exist
	at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
	... 10 more
[2018-01-21T18:02:35,691][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.24][0]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [index-62] during cleanup
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [index-62] does not exist
	at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
	... 10 more
[2018-01-21T18:02:36,148][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.24][1]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [index-62] during cleanup
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [index-62] does not exist
	at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
	... 10 more
[2018-01-21T18:02:39,338][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2016.12.21][0]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [pending-index-63] during cleanup
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [pending-index-63] does not exist
	at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
	... 10 more
[2018-01-21T18:03:40,905][INFO ][o.e.s.SnapshotShardsService] [q8fsCAc] snapshot [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] is done

This log was produced with elasticsearch and repository-s3 version 5.6.5, but I have recreated the problem in 6.1.2 as well. The logic involved has not changed, so this makes sense.
When listBlobs() is called at the start of a snapshot operation,
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1133
s3 can return the files which have already been deleted.

Later, during the finalize phase, it will now attempt to delete all of the blobs that were listed above. The deleteBlob() method will verify that each blob exists, but the blobs that were previously deleted will not individually show as existent. So deleteBlob() will throw an exception.
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1238

Steps to reproduce:
This depends on unpredictable s3 consistency behaviors, and so may not be easy to reproduce naturally. I will include a patch against 6.1.2 which will allow you to simulate this scenario by sleeping at the key points in the snapshot operation. This will allow you to add and delete the files from s3, to simulate its eventual consistency.

All that's needed to reproduce is to take a snapshot and have an s3 deleted file reappear in the bucket contents at the right time. It will be a partial snapshot, and the above errors will be logged.

If using the patch which will be provided:

  1. Take a snapshot and watch the logs
  2. When directed in the log message, add a file starting with index- to the s3 path noted in the logs. For instance, index-DUMMY.txt (This simulates a file which was previously deleted by the plugin showing up again in the bucket contents.)
  3. Keep watching the logs, and when the listBlobs() operation is complete, they will direct you to remove the file you uploaded. (This simulates that the file, though it was in the bucket contents, will fail the blobExists() check.)
  4. The snapshot will then fail shortly after.

Suggested fix:
Since this is unlikely to occur for the majority of s3 users, a fix which would not affect the majority of s3 users also seems appropriate. An optional configuration property which lets me filter the listBlobs() result for any blobs that I know I have already deleted occurs to me. Keeping track of that state may present a challenge, so being able to configure more lenient deleteBlob() checks (and thus accepting the risk of the s3 repo being concurrently modified by other processes) would also make sense to me.

Patch to simulate the issue:

diff --git a/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java b/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
index e3ee189..e1398d2 100644
--- a/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
+++ b/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
@@ -167,6 +167,8 @@ import static java.util.Collections.unmodifiableMap;
  */
 public abstract class BlobStoreRepository extends AbstractLifecycleComponent implements Repository {
 
+    private static Map<SnapshotId, Boolean> slept = new HashMap<>();
+
     private BlobContainer snapshotsBlobContainer;
 
     protected final RepositoryMetaData metadata;
@@ -1130,10 +1132,24 @@ public abstract class BlobStoreRepository extends AbstractLifecycleComponent imp
             try {
                 final Map<String, BlobMetaData> blobs;
                 try {
+                    long sleepMinutes = 3;
+                    if(!slept.getOrDefault(snapshotId, false)) {
+                      logger.info("Sleeping for {} minutes. Add file starting with 'index-' to path {} now so listBlobs() will see it.", sleepMinutes, blobContainer.path().buildAsString());
+                      Thread.sleep(sleepMinutes * 60 * 1000);
+		    }
+
                     blobs = blobContainer.listBlobs();
+
+                    if(!slept.getOrDefault(snapshotId, false)) {
+                      logger.info("listBlobs() complete. Sleeping for {} minutes. Delete the file from path {} now so it won't be present when the delete is attempted.", sleepMinutes, blobContainer.path().buildAsString());
+                      Thread.sleep(sleepMinutes * 60 * 1000);
+                      slept.put(snapshotId, true);
+		    }
                 } catch (IOException e) {
                     throw new IndexShardSnapshotFailedException(shardId, "failed to list blobs", e);
-                }
+                } catch (InterruptedException e) {
+		    throw new RuntimeException(e);
+		}
 
                 long generation = findLatestFileNameGeneration(blobs);
                 Tuple<BlobStoreIndexShardSnapshots, Integer> tuple = buildBlobStoreIndexShardSnapshots(blobs);

@tlrx
Copy link
Member

tlrx commented May 2, 2018

Thanks for this very detailed report @TruthyBoolish ! I've seen a similar behavior from time to time too on repositories with a lot of creation/deletion per hour.

I created #30332 that relaxes the constraint when deleting files during the finalization of the snapshot. It should help to reduce these errors.

@sinzin91
Copy link

sinzin91 commented May 3, 2018

@tlrx Will the fix from #30332 also be applied to ES v5.6.x?

tlrx added a commit that referenced this issue May 7, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
tlrx added a commit that referenced this issue May 7, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
tlrx added a commit that referenced this issue May 7, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
colings86 pushed a commit that referenced this issue May 8, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
@stawii
Copy link

stawii commented Jul 11, 2019

As far as I can find in release notes this is fixed in 6.3.0 and 6.4.0 version, correct my if I am wrong.
Can/will this be backported to 5.x lines?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Projects
None yet
Development

No branches or pull requests

7 participants