Stop duplicating inputs: Documentation #42

Merged · 6 commits · Jul 29, 2021
11 changes: 6 additions & 5 deletions _documentation_/README.md
@@ -10,8 +10,9 @@ For pipeline specific docs, see the README file in their folder (such as /ld-pru
| running-locally.md | Advice for running these workflows locally |

### Dev Documentation
| file | description |
|--------------- |------------------------------------------------------------------------- |
| cwl-vs-wdl-dev.md | In-depth comparison of the CWL and WDL versions of this pipeline |
| params-vs-config.md | Overview of the unique way the null model wf operates including why it quadruples inputs |
| update-checklist.md | General notes for code maintainers |
| file | description |
|--------------- |------------------------------------------------------------------------- |
| cwl-vs-wdl-dev.md | In-depth comparison of the CWL and WDL versions of this pipeline |
| params-vs-config.md | Overview of the unique way the null model workflow operates |
| symlink-workaround.md | Explains how Cromwell file localization can create the need for symlinking |
| update-checklist.md | General notes for code maintainers |
7 changes: 2 additions & 5 deletions _documentation_/for developers/cwl-vs-wdl-dev.md
@@ -11,9 +11,7 @@ These differences are likely only of interest to maintainers of this repo or tho
**The chromosome file workaround:** A few R scripts require chromosome numbers to be passed in on the command line rather than in the configuration file. To do this, the chromosome number is written to a file (completely separate from the configuration file) in the task's inline Python section. Upon exiting the Python block, this extra file is read in Bash and then passed to the Rscript call as an argument (as opposed to being in the configuration file). Although the actual call to the R script is identical, as the CWL also passes the chromosome number as an argument, the CWL can rely on its inline JavaScript to do this, while the WDL must use this workaround to pass the chromosome number out of the Python scope. As this only involves writing a tiny file that doesn't scale with inputs, it has no cost implications.
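As a rough sketch of the pattern (the script name, file names, and `--chromosome` flag here are illustrative, not necessarily the repo's actual ones), the command section looks something like this:

```
python << CODE
import os
# the chromosome number is derived inside Python, e.g. parsed from the
# input filename, so at this point it only exists in the Python scope
chr_num = os.path.basename("~{gds}").split("chr")[1].split(".")[0]
# write it to a tiny file (separate from the config file) so the
# value can escape the Python block
with open("chr_number.txt", "w") as f:
    f.write(chr_num)
CODE

# back in Bash: read the value and hand it to the R script as a
# command-line argument instead of putting it in the config file
BASH_CHR=$(cat chr_number.txt)
Rscript /usr/local/analysis_pipeline/R/example.R example.config --chromosome ${BASH_CHR}
```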

## null-model.wdl
The twice-localized workaround is used for a slightly different reason than it is in other workflows. See [params-vs-config](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20developers/params-vs-config.md) for more information.

The second significant difference: The CWL technically has duplicated outputs. The WDL instead returns each file once. On SB, cwl.output.json sets the outputs as the following, where ! indicates a duplicated output, inverse norm transformation is applied, and the output_prefix is set to `test`:
The CWL technically has duplicated outputs. The WDL instead returns each file once. On SB, cwl.output.json sets the outputs as the following, where ! indicates a duplicated output, inverse norm transformation is applied, and the output_prefix is set to `test`:
* configs:
* null_model.config
* ! null_model.config.null_model.params
@@ -30,12 +28,11 @@ The second significant difference: The CWL technically has duplicated outputs. T
Because everything in null_model_output is already covered by null_model_files, it does not exist as an output in the WDL.

## ld-pruning.wdl
* The twice-localized workaround is used for the same reason as it is in unique_variant_ids; it is taking in an input array of files from more than one input directory.
* WDL does not have an equivalent to ScatterMethod:DotProduct, so it instead scatters using zip(); see the sketch after this list.
* check_merged_gds uses the chromosome file workaround.
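A minimal sketch of the zip() approach (task and variable names are hypothetical):

```
scatter (gds_n_chr in zip(gds_files, chromosomes)) {
  call subset_gds {
    input:
      gds = gds_n_chr.left,
      chr = gds_n_chr.right
  }
}
```

Here zip() pairs the nth file with the nth chromosome, giving the same pairwise behavior as CWL's dotproduct scatter.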

## vcf-to-gds.wdl
* The twice-localized workaround is used in unique_variant_ids because the R script requires an array of GDS files and uses a config file, which can only support an array of GDS files if each GDS file is in the same parent directory. As the GDS files are generated by a scattered task, they are sometimes given unique input directories on some file systems. As such, they first need to be pulled from all of their input directories and placed in the working directory before the config file is generated. This is explained more in-depth as the example case for [#2](https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/2).
* The twice-localized workaround is used in unique_variant_ids due to permission errors on Terra. See [#2](https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/2).
* The WDL will not start the check_gds task if check_gds is false (see the sketch after this list). The CWL will start the check_gds task regardless and generate a config file; the true/false only applies to calling the script.
  * Reasoning: given how GCS billing works, this has the potential to be cheaper. Otherwise we would pay to provision powerful non-preemptible compute designed for an intense task, then use that compute only for making a text file.
* The WDL can correctly handle a mixture of file extensions, not so much by design as due to the specifics of its implementation. The CWL will handle such a mixture incorrectly in check_gds (but can correctly handle a homogeneous group, such as all files being bcf).
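In WDL terms the check_gds gating is just a conditional call, sketched below with hypothetical task and output names; no VM is provisioned and no config file is written when the condition is false:

```
if (check_gds) {
  call runCheckGds {
    input:
      gdss = unique_variant_ids.unique_variant_id_gds_per_chr,
      vcfs = vcf_files
  }
}
```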
8 changes: 4 additions & 4 deletions _documentation_/for developers/params-vs-config.md
@@ -1,16 +1,16 @@
# Params vs Configs, or, why Null Model Quadruples Its Inputs
# Params vs Config Files In Null Model

Most pipelines in this repo use a .config file to point the Rscript at its requisite inputs. The first task of the null model workflow, null_model_r, works like this. It generates a .config file which is used by null_model.R via the following call:
`Rscript /usr/local/analysis_pipeline/R/null_model.R null_model.config`

But one of the first things null_model.R does is generate a params file based upon the null_model.config file that it just received. This params file is not exactly the same as the config file, as it has some formatting differences, but it effectively acts as a copy of the important information in this task's config file. This params file is among the outputs of this task, and it is passed as an input into the next task, null_model_report.
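As a purely hypothetical illustration of the relationship (the actual keys and formatting written by null_model.R differ), the config might contain:

```
out_prefix "test"
outcome "outcome"
phenotype_file "phenotype.RData"
family "gaussian"
```

while the params file it spawns restates the same information, plus whatever defaults the R script filled in:

```
out_prefix "test"
outcome "outcome"
phenotype_file "phenotype.RData"
family "gaussian"
n_pcs 0
```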

This is because null_model_report essentially needs two configs.
This is because null_model_report essentially needs *two* configuration files.

First of all, null_model_report generates a .config file which only contains the distribution family, the outprefix, and n_categories_boxplot. If it were like most other tasks in this repo, the config file would also contain phenotype_file or other inputs, but it does not. Instead, null_model_report uses the previous task's params file for most of its inputs. It is this param file from the first task that is used to generate an Rmd (R markdown) file in the second task.
First of all, null_model_report generates a .config file which only contains the distribution family, the outprefix, and n_categories_boxplot. If it were like most other tasks in this repo, the config file would also contain phenotype_file or other inputs, but it does not. Instead, null_model_report uses the previous task's params file for most of its inputs. It is this params file from the first task that is used to generate an Rmd (R markdown) file in the second task.

Additionally, the null_model_report task uses a full copy of all files passed into null_model_r. For instance, if you place 23 chromosome-level GDS files into null_model_r, then they will also be passed into null_model_report.

All of this is the case in both the CWL and the WDL.

Because the params file is based upon the first task's config file, and the second task cannot reach into the file system of the first task, the first task's config file must use relative paths. Currently, GCS limitations on softlinking mean that relative paths only work when using the [Twice Localized Workaround](https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/2) in both tasks of the WDL. This means that the amount of disk space required is four times the size of your inputs: a 3 GB phenotype_file will be duplicated once in the first task, and once more in the second task, resulting in the overall pipeline requiring 12 GB of space for just that file.
Because the params file is based upon the first task's config file, and the second task cannot reach into the file system of the first task, the first task's config file must use relative paths, and symlinks or copies must be made in both tasks so that those relative paths resolve.
39 changes: 39 additions & 0 deletions _documentation_/for developers/symlink-workaround.md
@@ -0,0 +1,39 @@
## In brief
Some tasks currently have a workaround wherein symlinks are used. This is not a problem in and of itself, but I am documenting why it is necessary, as this may help other people WDLize similar programs.

## Reason 1: Inputs go into different folders
_**Important note**: This describes a Cromwell execution on a local filesystem. The details of localizing files, and which folders they go into, change on different backends. Some backends may not even need this workaround._

The second task of the ld-prune workflow generates chromosome-level subset GDS files in a scattered task. The third task, which is not scattered, takes in those files as inputs in order to merge them into a single GDS file.

When a scattered task's outputs are passed as inputs to a non-scattered task, each instance of the scattered task's outputs is placed into its own new folder. Let's say my scattered task runs on 5 GDS files, generating 5 subset GDS files. My subsequent task is passed those GDS files like this:

<img width="623" alt="Screenshot 2021-04-09 at 3 34 00 PM" src="https://user-images.githubusercontent.com/27784612/114250466-a9331f80-9952-11eb-9e09-f114f9d89e4f.png">

That is to say, each GDS file now lives in its own folder within /inputs/.
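In plain text, the layout in that screenshot looks roughly like this (the numeric folder names are arbitrary and differ per run):

```
inputs/
├── 172409144/
│   └── 1KG_phase3_subset_chr1.gds
├── 172409145/
│   └── 1KG_phase3_subset_chr2.gds
└── 172409146/
    └── 1KG_phase3_subset_chr3.gds
```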

This is problematic with how the R scripts use configuration files. These configuration files expect one line to represent a given pattern for an input file, such as

> gds_file '1KG_phase3_subset_chr .gds'

where the space is filled in with expected chromosome numbers by the script itself at runtime.

When making configuration files, we have two options for referring to files like these: either we pass in the full path, or just a filename. If we pass in the full path, the resulting configuration file will be invalid, because each GDS file lives in a separate folder and therefore has a different path. If we pass in a filename, the resulting configuration file will technically be valid, but it will fail, because the files do not, strictly speaking, exist in the working directory, but rather in some subfolder of /inputs/.

However, if we copy or symlink each of those input files into the working directory, we can use the filename method, because now the files are actually where the R script expects them.

```
# symlink every input GDS file into the working directory so that
# bare filenames in the config file resolve correctly
BASH_FILES=(~{sep=" " gdss})
for BASH_FILE in "${BASH_FILES[@]}"
do
    ln -s "${BASH_FILE}" .
done
```
Where gdss is the array of input files from the previous scattered task.

## Reason 2: Parameter files passed between different Docker containers
More information: https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20developers/params-vs-config.md

The null model workflow generates a parameters file in its first task, which must use relative paths, because the next task uses that same parameters file to generate its own output. If it used absolute paths, the path would point to the first task's inputs directory, which is not available in the second task. This is because, when running on Terra, each task runs inside its own Docker container. (The local version of Cromwell technically does not need Docker, but for Terra it is a hard requirement.) Files that are not explicitly passed in or out of that container cannot be accessed by other tasks. In other words, in the WDL context, each task has its own file system. Additionally, the input directory's name is not consistent across tasks even if they are based upon the same Docker image, nor can it be predicted before runtime.

Therefore, we must use relative paths in the params file, and we additionally must symlink files from the input directory to the working directory for these relative paths to function with the R scripts.
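As a hypothetical illustration (the exact Cromwell paths are made up), a params line like

> phenotype_file "/cromwell_root/inputs/123456789/phenotype.RData"

would break in null_model_report, because that inputs directory only existed inside the first task's container. The line must instead read

> phenotype_file "phenotype.RData"

which resolves in both tasks once the file has been symlinked into each task's working directory.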
8 changes: 1 addition & 7 deletions _documentation_/for users/cwl-vs-wdl-user.md
@@ -6,17 +6,11 @@ The original CWLs allocate memory and CPUs on a workflow level, while the WDLs d

In line with workflow inputs: The original CWL does not have an option for disk space, but the WDL does, as it is a soft requirement for running on Terra. If not defined by the user, it will fall back on an estimate. Should that estimate prove inappropriate, the user will need to override it on a task level.

##### *The twice-localized workaround*
Due to a workaround involving file inputs, some WDL tasks require up to twice as much disk space as their CWL equivalent tasks would require. Exactly which files must be duplicated depends on the workflow. Be aware that sometimes the size of the files that need to be duplicated will scale with the size of your inputs. Cost-related implications for users are discussed for each workflow below; algorithmic explanations beyond what the typical user needs to know can be found in [this document](https://github.com/DataBiosphere/analysis_pipeline_WDL/tree/main/_documentation_/for%20developers/cwl-vs-wdl-dev.md).

## ld-pruning.wdl
* The twice-localized workaround duplicates a series of intermediate files (specifically in the merge_gds task). **The size of these intermediate files scales with the size of your input GDS files.** However, they have already been pruned by a previous task, so you can assume they will be smaller than your input files.
* The CWL appears to contain a bug where `exclude_PCA_cor` being set to false is not respected ([#14](https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/14)). The WDL avoids this.
* The output of merge_gds across the CWL and WDL do not md5 to the same value, but should be functionally equivalent. See [#22](https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/22) for more information.

## null-model.wdl
The majority of input files are affected by the twice-localized workaround in both tasks. As such, we recommend that, in each task, you allocate about double the size of your inputs in disk space.

The CWL includes a function which is designed to have an output phenotype file inherit metadata from the required input phenotype file. It is specific to the Seven Bridges platform and therefore has not been included in the WDL. We have tested the workflow extensively and have not found a situation where the phenotype output file from the WDL differs from the phenotype output of the CWL; i.e., in spite of this deletion, the two outputs md5 to the same value across workflows.

Be aware that the original CWL gives significantly different output for null_model_file depending on whether the Seven Bridges backend selected at workspace creation time is AWS (default) or Google [(#3 on CWL)](https://github.com/UW-GAC/analysis_pipeline_cwl/issues/3). This naturally carries over when comparing the output of the original CWL on an AWS backend to the output of this WDL on a non-AWS backend. Rather than relying on an md5sum, we define "significantly different output" to mean the outputs in question do not pass when the R function `all.equal()` is run at default values. By these criteria:
@@ -30,4 +24,4 @@


## vcf-to-gds.wdl
* The twice-localized workaround duplicates a series of intermediate files (specifically in the unique_variant_ids task). **The size of these intermediate files scales with the size of your input VCFs.** However, they are GDS files, which are more heavily compressed than VCFs, so you can assume they will be smaller than your input files.
The twice-localized workaround duplicates a series of intermediate files (specifically in the unique_variant_ids task). **The size of these intermediate files scales with the size of your input VCFs.** However, they are GDS files, which are more heavily compressed than VCFs, so you can assume they will be smaller than your input files.
1 change: 0 additions & 1 deletion null-model/README.md
@@ -5,7 +5,6 @@
* Running this on Google Cloud Compute will give different outputs compared to it being run on AWS. More information here: https://github.com/DataBiosphere/analysis_pipeline_WDL/issues/31
* Be aware that the optional checker workflow for Null Model varies considerably from other checker workflows in this repo. If you intend on using it, please be sure to read the [checker workflow documentation](https://github.com/DataBiosphere/analysis_pipeline_WDL/null-model/checker/README.md).
* A total of nine sample JSONs are present in this repo. They are based upon example configuration files in [the original repo's testdata folder](https://github.com/UW-GAC/analysis_pipeline/tree/master/testdata). Note that [fast_scoreSE](https://github.com/UW-GAC/analysis_pipeline/blob/master/testdata/null_model_fast_scoreSE.config) refers to another pipeline and the various `*_reportonly.config` files are redundant; these configs have therefore not been included.
* When run on Google Cloud Compute, each task of this pipeline will request double the disk size of your inputs + 1 GB. Manually overriding this to a smaller number is not recommended.

## Table Of Contents
<!---toc start-->
2 changes: 1 addition & 1 deletion vcf-to-gds/README.md
@@ -1,5 +1,5 @@
# Conversion to GDS (vcf-to-gds-wf.wdl)
This workflow converts vcf.gz/vcf/vcf.bgz/bcf, one per chromosome, to GDS files. It then provides each variant across the files with unique variant IDs for later compatibility with PLINK, and optionally checks the resulting GDS files against their original inputs for consistency. This represents [the first "chunk" of the original pipeline](https://github.com/UW-GAC/analysis_pipeline#conversion-to-gds) minus the currently-not-recommended optional merge. Merging is still planned to be supported, but only after linkage disequilibrium (also in-progress) is calculated.
This workflow converts vcf.gz/vcf/vcf.bgz/bcf, one per chromosome, to GDS files. It then provides each variant across the files with unique variant IDs for later compatibility with PLINK, and optionally checks the resulting GDS files against their original inputs for consistency. This represents [the first "chunk" of the original pipeline](https://github.com/UW-GAC/analysis_pipeline#conversion-to-gds) minus the currently-not-recommended optional merge. For merging GDS files, see the merge_gds task in this repo's LD Pruning workflow.

Please be aware that the third step, check_gds, is skipped by default, as it is incredibly slow on non-downsampled data. Running this step on modern TOPMed data can literally take days.
