Description
Reproducible on Elasticsearch 5.5.
- Start a node with its file system repository path pointing at the mount of a small volume.
- Create a snapshot repository at this mount path.
- Create a few snapshots.
- Use dd (or similar) to create a large file on that volume so that all available space is used.
- Use the delete snapshot API to delete a snapshot. The call fails, complaining that there is no space left on the device (error response below; a scripted version of these steps follows it):
{"error":{"root_cause":[{"type":"repository_exception","reason":"[backup] failed to delete snapshot [snapshot4/fv_jj1JOQMilPvL4jHjExQ]"}],"type":"repository_exception","reason":"[backup] failed to delete snapshot [snapshot4/fv_jj1JOQMilPvL4jHjExQ]","caused_by":{"type":"i_o_exception","reason":"No space left on device"}},"status":500}
The only indication that a delete snapshot call actually creates files is that we appear to write a new "index generation" blob to the snapshot metadata store:
[2017-09-21T03:46:42,183][DEBUG][o.e.s.SnapshotsService ] [TwOmS-G] deleted snapshot is not running - deleting files
[2017-09-21T03:46:42,184][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] processing [delete snapshot]: took [3ms] done applying updated cluster_state (version: 7, uuid: VlTYntZNRdmhxQBSB0i8yQ)
[2017-09-21T03:46:42,200][DEBUG][o.e.r.f.FsRepository ] [TwOmS-G] Repository [backup] writing new index generational blob [index-6]
[2017-09-21T03:46:42,202][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] processing [remove snapshot deletion metadata]: execute
[2017-09-21T03:46:42,202][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] cluster state updated, version [8], source [remove snapshot deletion metadata]
[2017-09-21T03:46:42,202][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] publishing cluster state version [8]
[2017-09-21T03:46:42,202][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] applying cluster state version 8
[2017-09-21T03:46:42,202][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] set local cluster state to version 8
[2017-09-21T03:46:42,206][WARN ][r.suppressed ] path: /_snapshot/backup/snapshot4, params: {repository=backup, snapshot=snapshot4}
org.elasticsearch.repositories.RepositoryException: [backup] failed to delete snapshot [snapshot4/fv_jj1JOQMilPvL4jHjExQ]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:494) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$6(SnapshotsService.java:1263) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.0.jar:5.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60) ~[?:?]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:?]
at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[?:?]
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211) ~[?:?]
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[?:1.8.0_144]
at java.nio.channels.Channels.writeFully(Channels.java:101) ~[?:1.8.0_144]
at java.nio.channels.Channels.access$000(Channels.java:61) ~[?:1.8.0_144]
at java.nio.channels.Channels$1.write(Channels.java:174) ~[?:1.8.0_144]
at org.elasticsearch.common.io.Streams.copy(Streams.java:80) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeAtomic(BlobStoreRepository.java:953) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:842) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:441) ~[elasticsearch-5.5.0.jar:5.5.0]
... 5 more
[2017-09-21T03:46:42,216][DEBUG][o.e.c.s.ClusterService ] [TwOmS-G] processing [remove snapshot deletion metadata]: took [13ms] done applying updated cluster_state (version: 8, uuid: NByDoHeYRl2rCXeiPn2LTA)
This behavior seems counterintuitive. Naturally, when the repository path fills up, admins will attempt to delete older snapshots, expecting the delete to remove the segment files in the repository that are no longer referenced by any remaining snapshot and thereby reclaim space. Because the delete snapshot API actually ends up creating new files, they cannot quickly recover space using our APIs. Are we writing the new index generation to the snapshot metadata store first, before deleting the older index generation files, instead of doing it in the reverse order?
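To make the question concrete, here is a hypothetical sketch of the two orderings; the repository object and its methods (write_index_gen, delete_unreferenced_blobs) are illustrative placeholders, not the actual BlobStoreRepository internals:

```python
def delete_snapshot_observed(repo, snapshot):
    # Matches the 5.5 logs and stack trace above: writeIndexGen runs first,
    # so on a full volume the metadata write fails with ENOSPC and the
    # space-freeing deletes below are never reached.
    repo.write_index_gen(exclude=snapshot)    # creates a new index-N blob
    repo.delete_unreferenced_blobs(snapshot)  # would have reclaimed space

def delete_snapshot_reversed(repo, snapshot):
    # The reverse order asked about above: delete the blobs that are only
    # referenced by the removed snapshot first, then write the new index
    # generation once space has been freed.
    repo.delete_unreferenced_blobs(snapshot)
    repo.write_index_gen(exclude=snapshot)
```

If something like the second ordering were safe, the space-reclaiming deletes would run before any new metadata file is written, so a full volume would not block the delete.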