Sporadic NoSuchJobException in sort tests #760

Closed
hannes-ucsc opened this issue Apr 7, 2016 · 12 comments

@hannes-ucsc (Member) commented Apr 7, 2016

http://jenkins.cgcloud.info/job/toil-pull-requests/1017/testReport/junit/src.toil.test.sort.sortTest/SortTest/testAwsSingle/

@cket (Contributor) commented May 2, 2016

hannes-ucsc modified the milestone: 3.2.0 release May 10, 2016
hannes-ucsc modified the milestones: 3.2.0, Sprint 02 May 12, 2016
@hannes-ucsc (Member Author) commented

I've added a Lucene indexer to Jenkins so we can do full text search on build results:

http://jenkins.cgcloud.info/search/?q=NoSuchJobException

@joelarmstrong (Contributor) commented Jun 14, 2016

Hmm, I was sort of hoping the fix to #808 would fix this too, but after looking at it, it seems to be either a race condition (an exists check followed immediately by a load) or stale reads from SDB. I tried to aggravate the race locally by sleeping in between, but no dice.

Edit: Now that I think about it, you could just do a single load and catch the NoSuchJobException rather than checking existence separately, but that seems like sweeping the problem under the rug.
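
As a minimal sketch of that single-load variant, assuming a job store with exists()/load() methods and a NoSuchJobException along the lines of Toil's abstract job store (the names and import path below are illustrative, not the actual call sites involved in this failure):

```python
# Illustrative sketch only; method names mirror Toil's abstract job store.
from toil.jobStores.abstractJobStore import NoSuchJobException

def load_if_present_racy(job_store, job_store_id):
    # Current pattern: exists() followed by load() leaves a window in which
    # the job can disappear (or the read can be stale), so load() can still
    # raise NoSuchJobException.
    if job_store.exists(job_store_id):
        return job_store.load(job_store_id)
    return None

def load_if_present(job_store, job_store_id):
    # Single load, treating NoSuchJobException as "not present". This also
    # papers over stale reads, which is the "sweeping under the rug" concern.
    try:
        return job_store.load(job_store_id)
    except NoSuchJobException:
        return None
```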

@hannes-ucsc (Member Author) commented

I think this is a stale read after delete. We've been in touch with AWS support about this. They didn't outright reject the idea of SDB not being fully consistent and recommended that we upgrade to DynamoDB.

What I wonder is why this happens in the same code location every time.

@hannes-ucsc (Member Author) commented

… meaning that if it is a stale read, the time window should be somewhat variable and therefore cause problems in different places, not always the same one.
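
For reference, SDB reads are eventually consistent unless a consistent read is requested explicitly. A rough illustration of the kind of stale read suspected here, assuming boto 2's SDB bindings and a throwaway, pre-created domain (both illustrative):

```python
# Illustrative only: probing SDB read-after-delete behaviour with boto 2.
# An eventually consistent read issued right after a delete may still return
# the old attributes; a consistent read should not.
import boto

conn = boto.connect_sdb()                          # credentials from the environment
domain = conn.get_domain('toil-stale-read-test')   # hypothetical domain name

domain.put_attributes('job0', {'command': 'sort'}, replace=True)
domain.delete_attributes('job0')

stale = domain.get_attributes('job0')                         # may still show attributes
fresh = domain.get_attributes('job0', consistent_read=True)   # should come back empty
print(stale, fresh)
```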

@hannes-ucsc (Member Author) commented

I've been trying to brute-force reproduce this by creating and deleting jobs in quick succession. I tried 16k jobs. No dice. We should also entertain the possibility of the job being deleted concurrently, maybe by the worker. Then again, why does this only happen with AWS? If there were a race between leader and worker, it should manifest itself in the other job stores as well.
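
For the record, the brute-force attempt was shaped roughly like the following (a hypothetical harness, not the actual test code; the create() arguments in particular are illustrative):

```python
# Hypothetical reproduction harness, not the actual test code.
from toil.jobStores.abstractJobStore import NoSuchJobException

def hammer(job_store, attempts=16000):
    """Create and delete jobs in quick succession, counting stale reads."""
    stale_hits = 0
    for _ in range(attempts):
        # The create() arguments are illustrative placeholders.
        job = job_store.create('true', memory=100 * 1024**2, cores=1, disk=100 * 1024**2)
        job_store.delete(job.jobStoreID)
        try:
            job_store.load(job.jobStoreID)
            stale_hits += 1              # load succeeded after delete: stale read
        except NoSuchJobException:
            pass                         # expected: the job is gone
    return stale_hits
```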

@hannes-ucsc (Member Author) commented

Punting for next sprint.

hannes-ucsc modified the milestones: Sprint 04 (3.2.1), Sprint 03 (3.2.0) Jun 23, 2016
hannes-ucsc modified the milestones: Sprint 05 (3.4.0), Sprint 04 (3.3.0) Jun 29, 2016
hannes-ucsc removed the ready label Jul 2, 2016
hannes-ucsc modified the milestones: Sprint 05 (skipped), Sprint 06 (3.5.0) Jul 5, 2016
@cket (Contributor) commented Jul 28, 2016

Given that this is most likely an SDB issue, I like @joelarmstrong's solution. We can fix this particular symptom to curb the problem until we switch to DynamoDB, if and when that happens.
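
A related stopgap in the same spirit, sketched under the same illustrative assumptions as above (this is an alternative sketch, not the exact change proposed): retry the load briefly before giving up, so a transiently stale read does not fail the run outright.

```python
# Sketch of a stopgap: retry briefly on NoSuchJobException to tolerate an
# eventually consistent read. Names are illustrative, not an actual patch.
import time
from toil.jobStores.abstractJobStore import NoSuchJobException

def load_with_retry(job_store, job_store_id, attempts=3, delay=1.0):
    for attempt in range(attempts):
        try:
            return job_store.load(job_store_id)
        except NoSuchJobException:
            if attempt == attempts - 1:
                raise               # still missing after retries: genuinely gone
            time.sleep(delay)       # give SDB time to converge
```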
