Sporadic NoSuchJobException in sort tests #760
Comments
I've added a Lucene indexer to Jenkins so we can do full-text search on build results:
Hmm, I was sort of hoping the fix to #808 would fix this too, but after looking at it, it's either a race condition (an exists check followed immediately by a load) or stale reads from SDB. I tried to aggravate the race locally by sleeping in between the two calls, but no dice. Edit: Now that I think about it, you could just do a single load and catch the NoSuchJobException rather than checking existence separately, but that seems like sweeping the problem under the rug.
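For illustration, here is a minimal sketch of the two patterns discussed above. Everything in it is a stand-in: `InMemoryJobStore`, its `exists`/`load` methods, and the local `NoSuchJobException` class are hypothetical and only mimic the shape of a job store; they are not the project's actual code or the AWS job store's API.

```python
class NoSuchJobException(Exception):
    """Local stand-in for the job store's exception (assumption, not the real class)."""


class InMemoryJobStore:
    """Hypothetical minimal job store, used only to illustrate the pattern."""

    def __init__(self):
        self._jobs = {}

    def create(self, job_id, job):
        self._jobs[job_id] = job

    def delete(self, job_id):
        self._jobs.pop(job_id, None)

    def exists(self, job_id):
        return job_id in self._jobs

    def load(self, job_id):
        try:
            return self._jobs[job_id]
        except KeyError:
            raise NoSuchJobException(job_id)


def load_if_exists_racy(store, job_id):
    # Check-then-act: the job can disappear, or a stale read can claim it
    # still exists, between the exists() call and the load() call.
    if store.exists(job_id):
        return store.load(job_id)
    return None


def load_if_exists(store, job_id):
    # Single call: a missing job is reported as None instead of raising.
    try:
        return store.load(job_id)
    except NoSuchJobException:
        return None
```

The second function removes the window between the two calls, but it does not explain why the store reported a deleted job as present in the first place.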
I think this is a stale read after delete. We've been in touch with AWS support about this. They didn't outright reject the idea of SDB not being fully consistent and recommended that we upgrade to DynamoDB. What I wonder is why this happens in the same code location every time.
… meaning that if it is a stale read, the time window should be somewhat flexible and therefore cause different problems in different places.
I've been trying to brute-force reproduce this by creating and deleting jobs in quick succession. I tried 16k jobs. No dice. We should entertain the possibility of the job being deleted concurrently, maybe by the worker. Then again, why does this only happen with AWS? If there is a race between leader and worker, it should manifest itself in other stores as well.
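A brute-force loop of the kind described above might look roughly like the following sketch. The `DictJobStore` class and the `count_stale_reads` helper are hypothetical names introduced for illustration. A real attempt would point the loop at the SDB-backed store; the in-memory stand-in here is strongly consistent, so it never reports a stale read.

```python
class DictJobStore:
    """Hypothetical, strongly consistent stand-in for a job store."""

    def __init__(self):
        self._jobs = {}

    def create(self, job_id, job):
        self._jobs[job_id] = job

    def delete(self, job_id):
        self._jobs.pop(job_id, None)

    def exists(self, job_id):
        return job_id in self._jobs


def count_stale_reads(store, n_jobs):
    """Create and delete n_jobs jobs back to back, counting how often
    the store still reports a job as existing right after its deletion."""
    stale = 0
    for i in range(n_jobs):
        job_id = "job-%d" % i
        store.create(job_id, {"index": i})
        store.delete(job_id)
        if store.exists(job_id):
            stale += 1
    return stale


if __name__ == "__main__":
    # 16k iterations, matching the number of jobs mentioned above.
    print(count_stale_reads(DictJobStore(), 16000))
```

A single-threaded loop like this only probes for stale reads; it cannot surface a concurrent delete by another process, which would need a second actor deleting jobs while the first one loads them.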
Punting to the next sprint.
Given that this is most likely an SDB issue, I like @joelarmstrong's solution. We can fix this particular symptom to curb the problem until we switch to DynamoDB, if/when that happens.
Catch NoSuchJobException caused by stale SDB read (resolves #760)
…ception/3.3.x Catch NoSuchJobException caused by stale SDB read (resolves #760)
http://jenkins.cgcloud.info/job/toil-pull-requests/1017/testReport/junit/src.toil.test.sort.sortTest/SortTest/testAwsSingle/