SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard failing on Master #37005
Pinging @elastic/es-distributed
There was a race condition issue in the same class a while ago, but the failure presentation looks different from this one, and I don't think they're related. Including a link for completeness: #18121
The failure here comes from the fact that we fail to read the repo info:
A few lines up the log you can see:
=> apparently we run into a 0-byte index gen file. I grepped the build logs and could find another such case here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/3/consoleText in an unrelated test suite. This test has a shared per-suite cluster; the other example has a cluster per test case, which IMO rules out interference between tests. Looking at the implementation of MockRepository:
The problem seems to come from https://github.com/elastic/elasticsearch/blob/master/server/src/test/java/org/elasticsearch/snapshots/mockstore/MockRepository.java#L382 => the mock repository wrapper used in these tests randomly executes a non-atomic write.
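To make the failure mode concrete, here is a minimal illustrative sketch (not the actual MockRepository code) of why a non-atomic write lets a concurrent reader observe a 0-byte `index-N` blob: the target file becomes visible as soon as the output stream is opened, before any bytes have been written.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only -- not the MockRepository implementation. A non-atomic
// write creates the target blob first and streams bytes into it afterwards,
// so a reader that opens the file in between sees 0 bytes.
public class NonAtomicWriteSketch {
    public static void writeNonAtomically(Path target, byte[] data) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            // The file already exists (empty) at this point; its contents only
            // arrive with the write below, so a concurrent read returns 0 bytes.
            out.write(data);
        }
    }
}
```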
* Randomly doing non-atomic writes causes rare 0-byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure mode:
  * Cloud-based blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob: you either read a complete blob or nothing
  * For file-system-based blob stores, the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide atomic moves (see the sketch below)
* Closes elastic#37005
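For contrast, a minimal sketch of the atomic-write pattern the PR description refers to, assuming a plain `java.nio.file` based blob store (this is not the repository's actual implementation): the data is staged in a temporary file and then published with an atomic rename, so a reader sees either the previous generation or the complete new one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: stage the blob in a temp file, then publish it atomically.
public class AtomicBlobWriteSketch {
    public static void writeAtomically(Path target, byte[] data) throws IOException {
        // Temp file in the same directory so the rename stays on one file system.
        Path tmp = Files.createTempFile(target.getParent(), target.getFileName().toString(), ".tmp");
        Files.write(tmp, data);
        // Throws java.nio.file.AtomicMoveNotSupportedException if the file
        // system cannot perform the move as a single atomic operation.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```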
@original-brownbear mute the test for now?
@ywelsch sure, will do in a sec.
* Randomly doing non-atomic writes causes rare 0-byte reads from `index-N` files in tests
* Removing this randomness fixes these random failures and is valid because it does not reproduce a real-world failure mode:
  * Cloud-based blob stores (S3, GCS, and Azure) do not have inconsistent partial reads of a blob: you either read a complete blob or nothing
  * For file-system-based blob stores, the atomic move we do (to atomically write a file) by setting `java.nio.file.StandardCopyOption#ATOMIC_MOVE` would throw if the file system does not provide atomic moves
* Closes #37005
Hi, I am running into a similar issue while creating a snapshot on 6.4:
Moreover, I see a log line very similar to the one pointed out by @original-brownbear:
(Note that it uses ….) I have a few specific questions:
Please let me know if there's any other helpful information I can provide, or if there's another question I should be asking. Thanks in advance for your work on Elasticsearch and for your time.
@abw333 Your issue is probably fixed in 6.7.0, see #37066.
Thanks, @ywelsch, I will give that a try!
Hi again. I was able to reproduce this failure reliably by attempting to take a snapshot into an HDFS directory that already had an empty file named …. I have a couple of follow-up questions.

I am currently taking snapshots of all of my indices with a single request. My indices all have roughly the same size. I am concerned that the snapshot fails a long time into the snapshot process, and then I have to restart from the beginning.

Question 1: If my snapshot fails with this race condition, when will it fail? Will it always be near the beginning? Or always near the end? Or does it vary? If it always fails near the beginning, I don't have anything to worry about. If it sometimes fails near the end, though, I would like to do something about it. I was thinking that, in this case, I could take a snapshot of each of my indices separately. Since they are roughly equally sized, this would lower the cost of a failure induced by the race condition.

Question 2: Does this strategy (i.e. snapshot each index separately, with the error handling described above) sound reasonable? Are there any other alternatives I should consider (e.g. selectively delete some of the contents of the snapshot directory instead of all of them, and then retry the snapshot)?
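Not an official recommendation, just a rough sketch of the per-index strategy described in Question 2, using the Elasticsearch low-level REST client. The host, the repository name `my_hdfs_repo`, and the index names are placeholders. Each index goes into its own snapshot, so a late failure only costs one index's worth of work and only that snapshot needs to be retried.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: one snapshot per index so a race-condition failure late in the
// process only requires retrying a single index-sized snapshot.
public class PerIndexSnapshotSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String[] indices = {"index_1", "index_2", "index_3"}; // placeholders
            for (String index : indices) {
                Request request = new Request("PUT",
                        "/_snapshot/my_hdfs_repo/snapshot_" + index + "?wait_for_completion=true");
                request.setJsonEntity("{\"indices\": \"" + index + "\"}");
                Response response = client.performRequest(request);
                System.out.println(index + ": " + response.getStatusLine());
            }
        }
    }
}
```

A retry on failure could wrap the `performRequest` call, after first cleaning up whatever the failed attempt left behind in the repository.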
@abw333 how often have you seen this happen? This should be extremely rare and only a problem with an extremely flaky or unreliable network, or with nodes that crash all the time.

That file is read two times, once at the beginning and once at the end of a snapshot (just before it is overwritten), but if the read were to fail at the end, it should already have failed at the beginning.

Given that this should happen very rarely, I'm not convinced you need any special strategy. As I mentioned above, you can just delete the offending 0-byte file and start snapshotting again.
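To illustrate the "delete the offending 0-byte file and retry" suggestion, here is a small sketch assuming a file-system-backed repository rooted at `repoRoot`; for an HDFS repository the same check and delete would be done with the `hdfs dfs` CLI instead. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: remove any 0-byte index-N generation files from the repository root
// so the next snapshot attempt does not trip over them.
public class StaleIndexGenCleanupSketch {
    public static void deleteEmptyIndexGenFiles(Path repoRoot) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(repoRoot, "index-*")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file) && Files.size(file) == 0) {
                    Files.delete(file); // offending 0-byte generation file
                }
            }
        }
    }
}
```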
@ywelsch I have a test that starts a snapshot using a …. Locally, the test passes reliably, but on my testing infrastructure it fails about once every 20 runs (which might indicate that the network on my testing infrastructure is not great). I have observed the failure (same stack trace) in two places: 1) when starting the snapshot with the …, and 2) during the actual snapshot. Although I can't reproduce the race condition locally, I can reproduce the failures by creating the empty file myself, either before the ….

1 is easy to address: I added some logic around the ….

2 is a little bit more complex, since it seems that here the race condition is occurring during the actual snapshot (as opposed to during the request that initiates the snapshot). Here, a failure can be more costly since a snapshot is a long-running operation. My worry is that the failure happens towards the end of the snapshot and then we need to rerun the whole thing, which would add a lot of latency to the process.

Is what I'm seeing consistent with your understanding of what's happening here? And do you have any recommendation on how to handle 2? Is the race condition happening during the actual snapshot job? If so, do you expect that the snapshot job would fail right at the beginning?
Test failed on master:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=fedora/144/
Does NOT reproduce for me on master with:
Relevant log snippet: