Sporadic NoSuchJobException in sort tests #760

Closed
hannes-ucsc opened this issue Apr 7, 2016 · 12 comments

@hannes-ucsc (Member) commented Apr 7, 2016

http://jenkins.cgcloud.info/job/toil-pull-requests/1017/testReport/junit/src.toil.test.sort.sortTest/SortTest/testAwsSingle/

@cket (Contributor) commented May 2, 2016

hannes-ucsc modified the milestone: 3.2.0 release May 10, 2016
hannes-ucsc modified the milestones: 3.2.0, Sprint 02 May 12, 2016
@hannes-ucsc (Member Author) commented

I've added a Lucene indexer to Jenkins so we can do full text search on build results:

http://jenkins.cgcloud.info/search/?q=NoSuchJobException

@joelarmstrong (Contributor) commented Jun 14, 2016

Hmm, I was sort of hoping the fix to #808 would fix this too, but after looking at it, it seems to be either a race condition (an exists check followed immediately by a load) or stale reads from SDB. I tried to aggravate the race locally by sleeping in between, but no dice.

Edit: Now that I think about it, you could just do a single load and catch the NoSuchJobException rather than checking existence separately, but that seems like sweeping the problem under the rug.
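
As a minimal sketch of that single-load variant, assuming a job store with exists()/load() methods and a NoSuchJobException along the lines of Toil's abstract job store (the names and import path below are illustrative, not the actual call sites involved in this failure):

```python
# Illustrative sketch only; method names mirror Toil's abstract job store.
from toil.jobStores.abstractJobStore import NoSuchJobException

def load_if_present_racy(job_store, job_store_id):
    # Current pattern: exists() followed by load() leaves a window in which
    # the job can disappear (or the read can be stale), so load() can still
    # raise NoSuchJobException.
    if job_store.exists(job_store_id):
        return job_store.load(job_store_id)
    return None

def load_if_present(job_store, job_store_id):
    # Single load, treating NoSuchJobException as "not present". This also
    # papers over stale reads, which is the "sweeping under the rug" concern.
    try:
        return job_store.load(job_store_id)
    except NoSuchJobException:
        return None
```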

@hannes-ucsc (Member Author) commented

I think this is a stale read after delete. We've been in touch with AWS support about this. They didn't outright reject the idea of SDB not being fully consistent and recommended that we upgrade to DynamoDB.

What I wonder is why this happens in the same code location every time.

@hannes-ucsc (Member Author) commented

… meaning that if it is a stale read, the time window should be somewhat variable and therefore cause problems in different places, not always the same one.
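
For reference, SDB reads are eventually consistent unless a consistent read is requested explicitly. A rough illustration of the kind of stale read suspected here, assuming boto 2's SDB bindings and a throwaway, pre-created domain (both illustrative):

```python
# Illustrative only: probing SDB read-after-delete behaviour with boto 2.
# An eventually consistent read issued right after a delete may still return
# the old attributes; a consistent read should not.
import boto

conn = boto.connect_sdb()                          # credentials from the environment
domain = conn.get_domain('toil-stale-read-test')   # hypothetical domain name

domain.put_attributes('job0', {'command': 'sort'}, replace=True)
domain.delete_attributes('job0')

stale = domain.get_attributes('job0')                         # may still show attributes
fresh = domain.get_attributes('job0', consistent_read=True)   # should come back empty
print(stale, fresh)
```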

@hannes-ucsc (Member Author) commented

I've been trying to brute-force reproduce this by creating and deleting jobs in quick succession. I tried 16k jobs. No dice. We should also entertain the possibility of the job being deleted concurrently, maybe by the worker. Then again, why does this only happen with AWS? If there were a race between leader and worker, it should manifest itself in the other job stores as well.
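
For the record, the brute-force attempt was shaped roughly like the following (a hypothetical harness, not the actual test code; the create() arguments in particular are illustrative):

```python
# Hypothetical reproduction harness, not the actual test code.
from toil.jobStores.abstractJobStore import NoSuchJobException

def hammer(job_store, attempts=16000):
    """Create and delete jobs in quick succession, counting stale reads."""
    stale_hits = 0
    for _ in range(attempts):
        # The create() arguments are illustrative placeholders.
        job = job_store.create('true', memory=100 * 1024**2, cores=1, disk=100 * 1024**2)
        job_store.delete(job.jobStoreID)
        try:
            job_store.load(job.jobStoreID)
            stale_hits += 1              # load succeeded after delete: stale read
        except NoSuchJobException:
            pass                         # expected: the job is gone
    return stale_hits
```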

@hannes-ucsc (Member Author) commented

Punting for next sprint.

hannes-ucsc modified the milestones: Sprint 04 (3.2.1), Sprint 03 (3.2.0) Jun 23, 2016
hannes-ucsc modified the milestones: Sprint 05 (3.4.0), Sprint 04 (3.3.0) Jun 29, 2016
hannes-ucsc removed the ready label Jul 2, 2016
hannes-ucsc modified the milestones: Sprint 05 (skipped), Sprint 06 (3.5.0) Jul 5, 2016
@cket (Contributor) commented Jul 28, 2016

Given that this is most likely an SDB issue, I like @joelarmstrong's solution. We can fix this particular symptom to curb the problem until we switch to DynamoDB, if and when that happens.
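
A related stopgap in the same spirit, sketched under the same illustrative assumptions as above (this is an alternative sketch, not the exact change proposed): retry the load briefly before giving up, so a transiently stale read does not fail the run outright.

```python
# Sketch of a stopgap: retry briefly on NoSuchJobException to tolerate an
# eventually consistent read. Names are illustrative, not an actual patch.
import time
from toil.jobStores.abstractJobStore import NoSuchJobException

def load_with_retry(job_store, job_store_id, attempts=3, delay=1.0):
    for attempt in range(attempts):
        try:
            return job_store.load(job_store_id)
        except NoSuchJobException:
            if attempt == attempts - 1:
                raise               # still missing after retries: genuinely gone
            time.sleep(delay)       # give SDB time to converge
```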
