
Convert ADAM pipeline to Toil Spark support (resolves #325) #328

Closed
wants to merge 1 commit

Conversation

almussel
Contributor

No description provided.

@almussel
Contributor Author

@fnothaft feel free to take over this PR

@@ -36,11 +36,12 @@

import yaml
from toil.job import Job
Contributor

Small nit: can you move import yaml into the earlier block of imports? This way, all of the from toil.* imports will be in a single block.

Contributor Author

Currently all of the pipelines I refactored have the imports ordered like this as per PEP8 guidelines. I'm happy to change it, but it'll reduce consistency.

Contributor

I'm not suggesting changing the order. I'm suggesting to move the line of whitespace from before import yaml to after import yaml so that import yaml is in the previous block of imports and the second block of imports is just imports from toil.*.

Contributor Author

My comment still stands. PEP8 wants a blank line between the standard library imports and third-party imports. I agree with you though that it looks nicer your way.

@fnothaft
Contributor

Other than the one really trivial nit, LGTM!

@fnothaft
Contributor

@almussel can you fix that nit and squash down?

@fnothaft
Contributor

Also, can you update the toil minimum version in setup.py? This is necessary to fix the failing tests.

@almussel
Contributor Author

Will do! Should I go ahead and remove the spark-utils code as well?

@@ -33,14 +33,15 @@
import os
import textwrap
from subprocess import check_call, check_output

import yaml
Contributor

Ah! I didn't realize yaml was a 3rd party import. Yes, you're correct then that there should be a space before it. Sorry for the back and forth here.

@fnothaft
Contributor

Will do! Should I go ahead and remove the spark-utils code as well?

That'd be great!

@fnothaft
Contributor

Also, we'll need to change https://github.com/BD2KGenomics/toil-scripts/blob/master/jenkins.sh#L14 to the latest toil version.

@almussel force-pushed the issues/325-toil-spark branch 6 times, most recently from 549a815 to cf5ba5b (June 28, 2016, 21:55)
@almussel
Contributor Author

Not really sure where these failures are coming from. The toil version has been updated.

@fnothaft
Contributor

@jvivian can you look at these test failures? They are in the RNA-seq tests.

@hannes-ucsc
Contributor

Welcome to Toil 3.2.1. From what I can tell, these are cache-related errors. Might want to ask @arkal for advice. One of the errors is due to lack of space, so I put a pin into i-915c4804 for @arkal to take a look. /mnt/ephemeral is mounted, though.

@arkal

arkal commented Jun 28, 2016

As of Toil 3.2.0, you are no longer allowed to call os.remove(x) on a file x if x was obtained from the job store using readGlobalFile. If you want to delete it from the local dir, use job.fileStore.deleteLocalFile instead. FYI, it is also illegal to modify such a file. If you want to modify or delete it, pass mutable=True when you readGlobalFile. That is why src.toil_scripts.rnaseq_cgl.test.test_rnaseq_cgl.test_rnaseq_cgl failed.

As for src.toil_scripts.lib.test.test_jobs.test_sample_batcher, the sum of disk asked for by jobs is greater than the amount of disk you have on the system. The singleMachine batch system does not look at disk when it is scheduling jobs, just mem and cpus. This wasn't a problem before 3.2.0 but now that we care about the remaining space on disk, we track that. Issue DataBiosphere/toil#770 is targeting this.
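The file-lifecycle rule described above can be sketched with a toy stand-in (plain Python, no Toil dependency; TinyFileStore and everything about it other than the readGlobalFile/deleteLocalFile method names is invented for illustration):

```python
import os
import tempfile

class TinyFileStore:
    """Toy model of a caching file store: it records every local copy it
    hands out, so deletions must go through it to keep the cache
    accounting right. Deleting behind its back with os.remove would
    break the bookkeeping, which is the error mode described above."""

    def __init__(self):
        self._local = {}  # file_id -> local path of the copy handed out

    def readGlobalFile(self, file_id, local_path, mutable=False):
        # Materialize a (pretend) copy of the stored file at local_path.
        with open(local_path, 'w') as f:
            f.write('contents of ' + file_id)
        self._local[file_id] = local_path
        return local_path

    def deleteLocalFile(self, file_id):
        # The store removes the copy itself and updates its records.
        os.remove(self._local.pop(file_id))

work_dir = tempfile.mkdtemp()
store = TinyFileStore()
path = store.readGlobalFile('tar_id', os.path.join(work_dir, 'sample.tar'))
store.deleteLocalFile('tar_id')  # released through the store, not os.remove
print(os.path.exists(path))      # prints False
```

In real Toil code the same shape applies: read with job.fileStore.readGlobalFile (passing mutable=True if you intend to modify or delete the copy directly) and release with job.fileStore.deleteLocalFile.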

@almussel
Contributor Author

Thanks @jvivian!

@jvivian
Collaborator

jvivian commented Jun 29, 2016

@arkal — I followed your advice and still got hit with AssertionError: Attempting to delete a non-local file. I'll just be removing the delete so that I can unblock this PR.

@arkal

arkal commented Jun 29, 2016

Welp, I'm a dumbass and I feel bad. @jvivian it should be

fsID = job.fileStore.readGlobalFile(tar_id, os.path.join(work_dir, 'sample.tar'))
job.fileStore.deleteLocalFile(fsID)

@fnothaft
Contributor

Looks like these tests are still failing @jvivian...

@jvivian
Collaborator

jvivian commented Jun 29, 2016

Looks like these tests are still failing @jvivian...

Sorry, I know what it is, just busy on another PR. I'll get to it in a second.

@fnothaft
Contributor

Sorry, I know what it is, just busy on another PR. I'll get to it in a second.

Ah, no rush.

@jvivian
Collaborator

jvivian commented Jun 29, 2016

move_files is deprecated with @arkal's caching changes... Is anyone opposed to me replacing it with a shutil.copy (and changing the name to copy_files)?

EDIT: It was actually a rename in a job function, so I'll skip the move_files refactor.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

RNA-seq passes now; a different test is failing due to a cache assertion regarding disk space. I've added disk='1K' to the test (it spawns 200 jobs) to see if that fixes it.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

Working on the disk issue with @arkal. I could set tmpdir in test_map_job to /mnt/ephemeral/..., but it should work given disk is set to 1K. The other possibility is that resource requirements passed via *args don't actually work, and that's what map_job does.

@arkal

arkal commented Jun 30, 2016

Yes John, it's as I suspected... *args does not contain disk, mem, or cpus. FunctionWrappingJob pops those from the kwargs before running Job.__init__. To demonstrate, I hacked your test to become

def map_job(job, func, inputs, *args):
    """
    Spawns a tree of jobs to avoid overloading the number of jobs spawned by a single parent.
    This function is appropriate to use when batching samples greater than 1,000.

    :param JobFunctionWrappingJob job: passed automatically by Toil
    :param function func: Function to spawn dynamically, passes one sample as first argument
    :param list inputs: Array of samples to be batched
    :param list args: any arguments to be passed to the function
    """
    print(args)
    raise RuntimeError
    ...

and the stack trace has

Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,343 WARNING: toil.leader: Y/t/jobXmc2TD:  ---TOIL WORKER OUTPUT LOG---
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:The localize() method should only be invoked on a worker.
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't find resource for leader path '/Users/arjun/src/toil-scripts/src/toil_scripts'
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/Users/arjun/src/toil-scripts/src', name='toil_scripts.lib.jobs')
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:The localize() method should only be invoked on a worker.
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't find resource for leader path '/Users/arjun/src/toil-scripts/src/toil_scripts'
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/Users/arjun/src/toil-scripts/src', name='toil_scripts.lib.jobs')
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:  Traceback (most recent call last):
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/worker.py", line 338, in main
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:      job._runner(jobWrapper=jobWrapper, jobStore=jobStore, fileStore=fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2592, in _runner
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:      returnValues = self._run(jobWrapper, fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,344 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2540, in _run
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      return self.run(fileStore)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/venv/lib/python2.7/site-packages/toil/job.py", line 2715, in run
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:    File "/Users/arjun/src/toil-scripts/src/toil_scripts/lib/jobs.py", line 16, in map_job
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:      raise RuntimeError
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  RuntimeError
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  Exiting the worker because of a failed jobWrapper on host Arjuns-MacBook-Pro.local
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  ERROR:toil.worker:Exiting the worker because of a failed jobWrapper on host Arjuns-MacBook-Pro.local
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job Y/t/jobXmc2TD to 0
Arjuns-MacBook-Pro.local: 2016-06-29 18:14:47,345 WARNING: toil.leader: Y/t/jobXmc2TD:  ('a', 'b', 'c')

See the last line: only ('a', 'b', 'c').
So now you're trying to run x jobs concurrently (where x is the number of available cpus), each asking for 2GB. I think your jobstore is growing with every job, leading to a point where there is just not enough space on the jenkins root volume for x jobs to run.

Hope this helps
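The kwargs-popping behavior described above can be mimicked in plain Python (no Toil dependency; wrap_job_fn is a made-up stand-in for what FunctionWrappingJob is described to do, and map_job_args is a hypothetical user function):

```python
# Resource hints travel as keyword arguments and are consumed by the
# wrapping layer before the user function runs, so they never show up
# in the function's *args.
RESOURCE_KEYS = ('disk', 'memory', 'cores')

def wrap_job_fn(func, *args, **kwargs):
    # Pop resource requirements out of kwargs, as FunctionWrappingJob is
    # described to do before calling Job.__init__ ...
    resources = {k: kwargs.pop(k) for k in RESOURCE_KEYS if k in kwargs}
    # ... so only the remaining arguments reach the user function.
    return resources, func(*args, **kwargs)

def map_job_args(*args):
    return args  # what the user function actually sees

resources, seen = wrap_job_fn(map_job_args, 'a', 'b', 'c', disk='1K')
print(seen)       # ('a', 'b', 'c')
print(resources)  # {'disk': '1K'}
```

This is why adding disk='1K' inside *args would have no effect: the resource hint has to be a keyword argument at the point where the job is wrapped, not a positional one forwarded to the function.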

@hannes-ucsc
Contributor

BD2KGenomics/cgcloud#194 caused toil-jenkins-slave to get only half the possible ephemeral space (70GB instead of 140GB). That's fixed now and, according to @arkal, should fix the build problem caused by lack of disk space.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

Sorry for the hold up on this. I'll merge #316 once it's finished CI and then use the function it's introducing to fix the map_job_test.

@fnothaft
Contributor

I will be testing this patch this AM, so even if tests pass, please hold for me to merge.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

I will be testing this patch this AM, so even if tests pass, please hold for me to merge.

Working on my fix now.

@jvivian
Collaborator

jvivian commented Jun 30, 2016

@fnothaft — I rebased onto master (no conflicts) for my fix. Ping me if you're fine with me force pushing.

@fnothaft
Contributor

@jvivian force push is OK here.

@jvivian
Collaborator

jvivian commented Jul 1, 2016

I'll rebase the bwa-alignment PR on this once it's merged.

@fnothaft
Contributor

fnothaft commented Jul 1, 2016

Note to self: we need some documentation updates to the ADAM uberscript. I'll circle back on that later.

@fnothaft
Contributor

fnothaft commented Jul 2, 2016

So, no dice on this right now. Still debugging what's going wrong, but the ADAM pipeline is not running on the cluster properly. It isn't throwing an error, which is weird, but it is definitely not running.

@almussel
Contributor Author

@fnothaft @jvivian @hannes-ucsc Are you all okay with me force pushing to this branch again?

@fnothaft
Contributor

@fnothaft @jvivian @hannes-ucsc Are you all okay with me force pushing to this branch again?

SGTM!

@jvivian
Collaborator

jvivian commented Jul 13, 2016

Fine with me as well


@hannes-ucsc
Contributor

Are you all okay with me force pushing to this branch again?

yes

@fnothaft
Contributor

fnothaft commented Aug 6, 2016

I've moved this work over to #351 for testing. Closing.
