
Adjust wdl task runtime attributes (mem, cpu etc) depending on size of input #134

Open
holmeso opened this issue Oct 24, 2019 · 3 comments

holmeso commented Oct 24, 2019

A lot of our wdl tasks have been failing due to insufficient resources when dealing with data that is outside our usual expected size range.
It's been proposed (thanks Scott) that rather than dialling up all the resources requested by wdl tasks, a more "scientific" approach be used: check the size of the input before running a task on it, and set the resource request for that task appropriately.

For example, there was a recent failure in the VcfMerge task, which takes two vcf files and merges them.
The wdl task that calls this command is set up to use 20GB of memory.
Vcf files from WGS typically have around 5 million records in them. A recent run against some NIH data produced vcf files containing over 12 million records. 20GB of memory was not enough for this dataset and the job ran out of memory.
If there was a means of checking the size of the vcf files before they are merged, the resources for the merge wdl task could be adjusted appropriately.

Describe the solution you'd like

I believe that there is a way within wdl of getting line counts / file sizes. We would then need to come up with some cutoffs and the corresponding resource parameters.
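Something along these lines, maybe (a rough sketch only, in WDL 1.0 syntax; the task name, merge command, cutoff and memory figures are all made up for illustration):

  task vcf_merge {
    input {
      File vcf1
      File vcf2
    }
    # combined input size, in GB, drives the memory request
    Float input_gb = size(vcf1, "G") + size(vcf2, "G")

    command {
      echo "actual merge command goes here"
    }

    runtime {
      # hypothetical cutoff: double the request for unusually large inputs
      memory: if input_gb > 2.0 then "40 GB" else "20 GB"
    }
  }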

Describe alternatives you've considered
Bump all wdl task resources when running with large datasets.

Additional context
??

@holmeso holmeso added the enhancement New feature or request label Oct 24, 2019
@delocalizer

I first discussed this with Ross a couple of months ago when the NIH bams hit, and I like the idea in principle. Bespoke per-task resource requests based on measuring the inputs would be ideal, but because the runtime block is evaluated before the command runs you can't do the measurement in-task: you'd have to create a separate custom task that measures the input sizes and produces the resource values to hand to the following task that actually does the work.

I also considered exactly the alternative you suggest above: a single float-valued "scaling" parameter in the workflow, defaulting to 1, passed to all tasks, which would multiply their mem & walltime requests. It'd be less accurate, but simpler.
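Roughly something like this (sketch only; names and defaults are illustrative, WDL 1.0 syntax assumed):

  workflow example {
    input {
      # single knob, bumped for unusually large datasets
      Float resource_scale = 1.0
    }
    call merge_step { input: scale = resource_scale }
    # ...and passed the same way to every other task
  }

  task merge_step {
    input {
      Float scale
    }
    Int base_mem_gb = 20

    command {
      echo "actual work goes here"
    }

    runtime {
      memory: "~{ceil(base_mem_gb * scale)} GB"
      # walltime request could be scaled the same way
    }
  }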


delocalizer commented Oct 24, 2019

I take it back. Perhaps using size() in the runtime block would work, so you wouldn't need a separate task:
https://software.broadinstitute.org/wdl/documentation/spec#float-sizefile-string

Yep, works great...

  runtime {
    # requests ~1 MiB of memory per KB of input, i.e. roughly 1000x the input file size
    memory: "${size(infile, "K") * 1} MiB"
  }
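
One thing we'd probably want on top of that: a minimum floor (and maybe a ceil()), since for very small inputs that expression would request almost no memory.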

@shimbalama

I'm happy to take this on, with the caveat that it's a learning task and that I have one or two things I'm working on that will have to take higher priority. That said, I should have plenty of time between tests to tinker with this, so hopefully it shouldn't take too long.
