
Adjust wdl task runtime attributes (mem, cpu etc) depending on size of input #134

Open
holmeso opened this issue Oct 24, 2019 · 3 comments

holmeso commented Oct 24, 2019

A lot of our wdl tasks have been failing due to insufficient resources when dealing with data that is outside our usual expected size range.
It's been proposed (thanks Scott) that rather than dialling up all the resources requested by wdl tasks, a more "scientific" approach be used: check the size of the input before running a task on it, and set the resource request for that task appropriately.

For example, there was a recent failure in the VcfMerge task, which takes two vcf files and merges them.
The wdl task that calls this command is set up to use 20GB of memory.
Vcf files from WGS typically have around 5 million records in them. A recent run against some NIH data produced vcf files containing over 12 million records. 20GB of memory was not enough for this dataset and the job ran out of memory.
If there was a means of checking the size of the vcf files before they are merged, the resources for the merge wdl task could be adjusted appropriately.

Describe the solution you'd like

I believe that there is a way within wdl of getting line counts / file sizes. We would then need to come up with some cutoffs and the corresponding resource parameters.
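Something along these lines, maybe (a rough sketch only, in WDL 1.0 syntax; the task name, merge command, cutoff and memory figures are all made up for illustration):

  task vcf_merge {
    input {
      File vcf1
      File vcf2
    }
    # combined input size, in GB, drives the memory request
    Float input_gb = size(vcf1, "G") + size(vcf2, "G")

    command {
      echo "actual merge command goes here"
    }

    runtime {
      # hypothetical cutoff: double the request for unusually large inputs
      memory: if input_gb > 2.0 then "40 GB" else "20 GB"
    }
  }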

Describe alternatives you've considered
Bump all wdl task resources when running with large datasets.

Additional context
??

@holmeso holmeso added the enhancement New feature or request label Oct 24, 2019
@delocalizer

I first discussed this with Ross a couple of months ago when the NIH bams hit, and I like the idea in principle. Bespoke per-task resource requests based on measuring the inputs would be ideal, but because the runtime block is evaluated before the command runs you can't do the measurement in-task: you'd have to create a separate custom task that measures the input sizes and produces the resource values to hand to the following task that actually does the work.

I also considered exactly the alternative you suggest above: a single float-valued "scaling" parameter in the workflow, defaulting to 1, passed to all tasks, which would multiply their mem & walltime requests. It'd be less accurate, but simpler.
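Roughly something like this (sketch only; names and defaults are illustrative, WDL 1.0 syntax assumed):

  workflow example {
    input {
      # single knob, bumped for unusually large datasets
      Float resource_scale = 1.0
    }
    call merge_step { input: scale = resource_scale }
    # ...and passed the same way to every other task
  }

  task merge_step {
    input {
      Float scale
    }
    Int base_mem_gb = 20

    command {
      echo "actual work goes here"
    }

    runtime {
      memory: "~{ceil(base_mem_gb * scale)} GB"
      # walltime request could be scaled the same way
    }
  }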


delocalizer commented Oct 24, 2019

I take it back. Perhaps using size() in the runtime block would work, so you wouldn't need a separate task:
https://software.broadinstitute.org/wdl/documentation/spec#float-sizefile-string

Yep, works great...

  runtime {
    # requests ~1 MiB of memory per KB of input, i.e. roughly 1000x the input file size
    memory: "${size(infile, "K") * 1} MiB"
  }
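
One thing we'd probably want on top of that: a minimum floor (and maybe a ceil()), since for very small inputs that expression would request almost no memory.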

@shimbalama

I'm happy to take this on, with the caveat that it's a learning task and that I have one or two things I'm working on that will have to take higher priority. That said, I should have plenty of time between tests to tinker with this, so hopefully it shouldn't take too long.
