
Convert ADAM pipeline to Toil Spark support (resolves #325) #328

Closed
wants to merge 1 commit

Conversation

almussel
Contributor

No description provided.

@almussel
Contributor Author

@fnothaft feel free to take over this PR

@@ -36,11 +36,12 @@

import yaml
from toil.job import Job
Contributor

Small nit: can you move import yaml into the earlier block of imports? This way, all of the from toil.* imports will be in a single block.

Contributor Author

Currently all of the pipelines I refactored have the imports ordered like this as per PEP8 guidelines. I'm happy to change it, but it'll reduce consistency.

Contributor

I'm not suggesting changing the order. I'm suggesting to move the line of whitespace from before import yaml to after import yaml so that import yaml is in the previous block of imports and the second block of imports is just imports from toil.*.

Contributor Author

My comment still stands. PEP8 wants a blank line between the standard library imports and third-party imports. I agree with you though that it looks nicer your way.

@fnothaft
Contributor

Other than the one really trivial nit, LGTM!

@fnothaft
Contributor

@almussel can you fix that nit and squash down?

@fnothaft
Contributor

Also, can you update the toil minimum version in setup.py? This is necessary to fix the failing tests.

@almussel
Contributor Author

Will do! Should I go ahead and remove the spark-utils code as well?

@@ -33,14 +33,15 @@
import os
import textwrap
from subprocess import check_call, check_output

import yaml
Contributor

Ah! I didn't realize yaml was a 3rd party import. Yes, you're correct then that there should be a space before it. Sorry for the back and forth here.

@fnothaft
Contributor

Will do! Should I go ahead and remove the spark-utils code as well?

That'd be great!

@fnothaft
Contributor

Also, we'll need to change https://github.com/BD2KGenomics/toil-scripts/blob/master/jenkins.sh#L14 to the latest toil version.

@almussel force-pushed the issues/325-toil-spark branch 6 times, most recently from 549a815 to cf5ba5b (June 28, 2016, 21:55)
@almussel
Contributor Author

Not really sure where these failures are coming from. The toil version has been updated.

@fnothaft
Contributor

@jvivian can you look at these test failures? They are in the RNA-seq tests.

@hannes-ucsc
Contributor

Welcome to Toil 3.2.1. From what I can tell, these are cache-related errors. Might want to ask @arkal for advice. One of the errors is due to lack of space, so I put a pin into i-915c4804 for @arkal to take a look. /mnt/ephemeral is mounted, though.

@arkal

arkal commented Jun 28, 2016

As of Toil 3.2.0, you are no longer allowed to call os.remove(x) on a file x if x was obtained from the job store using readGlobalFile. If you want to delete it from the local dir, use job.fileStore.deleteLocalFile instead. FYI, it is also illegal to modify such a file. If you want to modify or delete it, pass mutable=True when you readGlobalFile. That is why src.toil_scripts.rnaseq_cgl.test.test_rnaseq_cgl.test_rnaseq_cgl failed.

As for src.toil_scripts.lib.test.test_jobs.test_sample_batcher, the sum of disk asked for by jobs is greater than the amount of disk you have on the system. The singleMachine batch system does not look at disk when it is scheduling jobs, just mem and cpus. This wasn't a problem before 3.2.0 but now that we care about the remaining space on disk, we track that. Issue DataBiosphere/toil#770 is targeting this.
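The file-lifecycle rule described above can be sketched with a toy stand-in (plain Python, no Toil dependency; TinyFileStore and everything about it other than the readGlobalFile/deleteLocalFile method names is invented for illustration):

```python
import os
import tempfile

class TinyFileStore:
    """Toy model of a caching file store: it records every local copy it
    hands out, so deletions must go through it to keep the cache
    accounting right. Deleting behind its back with os.remove would
    break the bookkeeping, which is the error mode described above."""

    def __init__(self):
        self._local = {}  # file_id -> local path of the copy handed out

    def readGlobalFile(self, file_id, local_path, mutable=False):
        # Materialize a (pretend) copy of the stored file at local_path.
        with open(local_path, 'w') as f:
            f.write('contents of ' + file_id)
        self._local[file_id] = local_path
        return local_path

    def deleteLocalFile(self, file_id):
        # The store removes the copy itself and updates its records.
        os.remove(self._local.pop(file_id))

work_dir = tempfile.mkdtemp()
store = TinyFileStore()
path = store.readGlobalFile('tar_id', os.path.join(work_dir, 'sample.tar'))
store.deleteLocalFile('tar_id')  # released through the store, not os.remove
print(os.path.exists(path))      # prints False
```

In real Toil code the same shape applies: read with job.fileStore.readGlobalFile (passing mutable=True if you intend to modify or delete the copy directly) and release with job.fileStore.deleteLocalFile.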

@almussel
Contributor Author

Thanks @jvivian!

@jvivian
Collaborator

jvivian commented Jun 29, 2016

@arkal — I followed your advice and still got hit with AssertionError: Attempting to delete a non-local file. I'll just be removing the delete so that I can unblock this PR.

@arkal

arkal commented Jun 29, 2016

Welp, I'm a dumbass and I feel bad. @jvivian it should be

fsID = job.fileStore.readGlobalFile(tar_id, os.path.join(work_dir, 'sample.tar'))
job.fileStore.deleteLocalFile(fsID)

@fnothaft
Contributor

Looks like these tests are still failing @jvivian...

@jvivian
Collaborator

jvivian commented Jun 29, 2016

Looks like these tests are still failing @jvivian...

Sorry, I know what it is, just busy on another PR. I'll get to it in a second.

@fnothaft
Contributor

Sorry, I know what it is, just busy on another PR. I'll get to it in a second.

Ah, no rush.

@jvivian
Collaborator

jvivian commented Jun 29, 2016

move_files is deprecated with @arkal's caching changes... Is anyone opposed to me replacing it with a shutil.copy (and changing the name to copy_files)?

EDIT: It was actually a rename in a job function, so I'll skip the move_files refactor.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

RNA-seq passes now; a different test is failing due to a cache assertion regarding disk space. I've added disk='1K' to the test (it spawns 200 jobs) to see if that fixes it.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

Working on the disk issue with @arkal. I could set tmpdir in test_map_job to /mnt/ephemeral/..., but it should work given disk is set to 1K. The other possibility is that resource requirements passed via *args don't actually work, and that's what map_job does.

@arkal

arkal commented Jun 30, 2016

Yes John, it's as I suspected... *args does not contain disk, mem, or cpus. FunctionWrappingJob pops those from the kwargs before running Job.__init__. To demonstrate, I hacked your test to become

def map_job(job, func, inputs, *args):
    """
    Spawns a tree of jobs to avoid overloading the number of jobs spawned by a single parent.
    This function is appropriate to use when batching samples greater than 1,000.

    :param JobFunctionWrappingJob job: passed automatically by Toil
    :param function func: Function to spawn dynamically, passes one sample as first argument
    :param list inputs: Array of samples to be batched
    :param list args: any arguments to be passed to the function
    """
    print(args)
    raise RuntimeError
    ...

and the stack trace has

Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,343 WARNING: toil.leader: Y/t/jobXmc2TD:  ---TOIL WORKER OUTPUT LOG---
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:The localize() method should only be invoked on a worker.
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't find resource for leader path '/Users/arjun/src/toil-scripts/src/toil_scripts'
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/Users/arjun/src/toil-scripts/src', name='toil_scripts.lib.jobs')
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:The localize() method should only be invoked on a worker.
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't find resource for leader path '/Users/arjun/src/toil-scripts/src/toil_scripts'
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/Users/arjun/src/toil-scripts/src', name='toil_scripts.lib.jobs')
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  Traceback (most recent call last):
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/worker.py", line 338, in main
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:      job._runner(jobWrapper=jobWrapper, jobStore=jobStore, fileStore=fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2592, in _runner
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:      returnValues = self._run(jobWrapper, fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2540, in _run
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      return self.run(fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2715, in run
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/src/toil_scripts/lib/jobs.py", line 16, in map_job
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      raise RuntimeError
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  RuntimeError
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  Exiting the worker because of a failed jobWrapper on host Arjuns-MacBook-Pro.local
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  ERROR:toil.worker:Exiting the worker because of a failed jobWrapper on host Arjuns-MacBook-Pro.local
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job Y/t/jobXmc2TD to 0
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  ('a', 'b', 'c')

See the last line: only ('a', 'b', 'c').
So now you're trying to run x jobs concurrently (where x is the number of available cpus), each asking for 2GB. I think your jobstore is growing with every job, leading to a point where there is just not enough space on the jenkins root volume for x jobs to run.

Hope this helps
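The kwargs-popping behavior described above can be mimicked in plain Python (no Toil dependency; wrap_job_fn is a made-up stand-in for what FunctionWrappingJob is described to do, and map_job_args is a hypothetical user function):

```python
# Resource hints travel as keyword arguments and are consumed by the
# wrapping layer before the user function runs, so they never show up
# in the function's *args.
RESOURCE_KEYS = ('disk', 'memory', 'cores')

def wrap_job_fn(func, *args, **kwargs):
    # Pop resource requirements out of kwargs, as FunctionWrappingJob is
    # described to do before calling Job.__init__ ...
    resources = {k: kwargs.pop(k) for k in RESOURCE_KEYS if k in kwargs}
    # ... so only the remaining arguments reach the user function.
    return resources, func(*args, **kwargs)

def map_job_args(*args):
    return args  # what the user function actually sees

resources, seen = wrap_job_fn(map_job_args, 'a', 'b', 'c', disk='1K')
print(seen)       # ('a', 'b', 'c')
print(resources)  # {'disk': '1K'}
```

This is why adding disk='1K' inside *args would have no effect: the resource hint has to be a keyword argument at the point where the job is wrapped, not a positional one forwarded to the function.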

@hannes-ucsc
Contributor

BD2KGenomics/cgcloud#194 caused toil-jenkins-slave to get only half the possible ephemeral space (70GB instead of 140GB). That's fixed now and, according to @arkal, should fix the build problem caused by lack of disk space.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

Sorry for the hold up on this. I'll merge #316 once it's finished CI and then use the function it's introducing to fix the map_job_test.

@fnothaft
Contributor

I will be testing this patch this AM, so even if tests pass, please hold for me to merge.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

I will be testing this patch this AM, so even if tests pass, please hold for me to merge.

Working on my fix now.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

@fnothaft — I rebased onto master (no conflicts) for my fix. Ping me if you're fine with me force pushing.

@fnothaft
Contributor

@jvivian force push is OK here.

@jvivian
Collaborator

jvivian commented Jul 1, 2016

I'll rebase the bwa-alignment PR on this once it's merged.

@fnothaft
Contributor

fnothaft commented Jul 1, 2016

Note to self: we need some documentation updates to the ADAM uberscript. I'll circle back on that later.

@fnothaft
Contributor

fnothaft commented Jul 2, 2016

So, no dice on this right now. Still debugging what's going wrong, but the ADAM pipeline is not running on the cluster properly. It isn't throwing an error, which is weird, but it is definitely not running.

@almussel
Contributor Author

@fnothaft @jvivian @hannes-ucsc Are you all okay with me force pushing to this branch again?

@fnothaft
Contributor

@fnothaft @jvivian @hannes-ucsc Are you all okay with me force pushing to this branch again?

SGTM!

@jvivian
Collaborator

jvivian commented Jul 13, 2016

Fine with me as well


@hannes-ucsc
Contributor

Are you all okay with me force pushing to this branch again?

yes

@fnothaft
Contributor

fnothaft commented Aug 6, 2016

I've moved this work over to #351 for testing. Closing.
