The Twice-Localized Workaround #2

Open
aofarrel opened this issue Apr 27, 2021 · 6 comments
Labels
Cromwell limitation? May be a limitation of how Cromwell works CWL-WDL divergence GCS limitation? May be a limitation of how GCS works wontfix This will not be worked on

Comments

@aofarrel
Collaborator

aofarrel commented Apr 27, 2021

In brief

Some tasks currently have a workaround wherein some input files are copied over twice. This results in these tasks requiring up to twice as much disk space as they would otherwise. Currently, disk size estimate calculations for these tasks account for this, so it is unlikely this will cause users to get an error. But this still results in slightly increased costs, so it's worth making note of.
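As a rough illustration of the disk arithmetic described above, here is a minimal bash sketch. The scratch directory and the file names are stand-ins invented for this example, and the workflows' real disk size estimates are computed elsewhere; this only shows the "budget for two copies" idea.

```shell
# Hypothetical sketch: budget disk space when every input ends up stored twice.
# The scratch directory and file names below are made up for illustration.
scratch="$(mktemp -d)"
head -c 1000 /dev/zero > "${scratch}/chr1.gds"
head -c 2000 /dev/zero > "${scratch}/chr2.gds"

# Sum the sizes of all inputs in bytes.
total_bytes=0
for f in "${scratch}"/*.gds; do
    size=$(wc -c < "${f}")
    total_bytes=$(( total_bytes + size ))
done

# Localized once by Cromwell, then copied once by the task: budget for both.
doubled_bytes=$(( total_bytes * 2 ))
echo "inputs: ${total_bytes} bytes, budget: ${doubled_bytes} bytes"
```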

Example

The first task of the vcf-to-gds workflow generates GDS files in a scattered task. The second task, which is not scattered, takes in those files as inputs to give them unique variant IDs.

In this situation, wherein a scattered task's outputs are passed as inputs to a non-scattered task, each instance of the scattered task's outputs is localized into its own new folder. Let's say my scattered task runs on five vcf files, generating five gds files. My second task is passed in those gds files like this:

[Image: Screenshot 2021-04-09 at 3 34 00 PM]

That is to say, each gds file now lives in its own folder within /inputs/.

This is problematic with how the R scripts use configuration files. These configuration files expect one line to represent a given pattern for an input file, such as

gds_file '1KG_phase3_subset_chr .gds'

where the space is filled in with expected chromosome numbers by the script itself at runtime.
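To make the pattern expansion concrete, here is a small bash imitation of it. In the real pipeline the substitution is done by the R script, not by bash, and the list of chromosomes here is just an assumption for the example.

```shell
# Illustration only: mimic how a pattern containing a blank is filled in
# once per chromosome. The real substitution happens inside the R script.
pattern='1KG_phase3_subset_chr .gds'
expanded=()
for chr in 1 2 3 20 X; do
    # Replace the first space in the pattern with the chromosome label.
    expanded+=("${pattern/ /${chr}}")
done
printf '%s\n' "${expanded[@]}"
```

Because there is only one pattern line, every expanded name has to resolve relative to the same directory.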

We have two options when referring to files like these when making configuration files: Either we pass in the path, or just a filename. If we pass in the full path, the resulting configuration file will be invalid, because every gds file has a different path due to each gds file living in a separate folder. If we pass in a filename, the resulting configuration file will technically be valid, but it will fail because the files strictly speaking do not exist in the working directory, but rather in some subfolder of /inputs/.

However, if we copy or symlink each of those input files into the working directory, we can use the filename method, because now files are actually where the R script expects them.

BASH_FILES=(~{sep=" " gdss})
for BASH_FILE in "${BASH_FILES[@]}";
do
	# symlink each input into the working directory under its own filename
	ln -s "${BASH_FILE}" .
done

Where gdss is the array of input files from the previous scattered task.
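The loop can be tried outside of Cromwell with a mock directory layout. In this sketch the folder names (111, 222, execution) and the file contents are invented, and the array is filled in by hand where a WDL command block would use ~{sep=" " gdss}.

```shell
# Stand-in for Cromwell's layout: each input sits in its own subfolder of inputs/.
# All directory and file names here are hypothetical.
root="$(mktemp -d)"
mkdir -p "${root}/inputs/111" "${root}/inputs/222" "${root}/execution"
echo "chr1" > "${root}/inputs/111/subset_chr1.gds"
echo "chr2" > "${root}/inputs/222/subset_chr2.gds"

# In a WDL command block this array would come from ~{sep=" " gdss}.
BASH_FILES=("${root}/inputs/111/subset_chr1.gds" "${root}/inputs/222/subset_chr2.gds")

cd "${root}/execution"
for BASH_FILE in "${BASH_FILES[@]}"; do
    ln -s "${BASH_FILE}" .
done

# Both files are now reachable by bare filename from the working directory.
ls -l
```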

However, this approach is non-functional on Terra -- a permission denied error is thrown. There are three root causes:

  1. Terra does not give root permissions when executing workflows
  2. Cromwell tends to give localized input files rw-r--r-- permissions
  3. The Rscript in question uses openfn.gds() with readonly=FALSE

Neither chmod nor mv is allowed on Terra in this context, so we need to duplicate the files to create copies for which we have write permissions.

BASH_FILES=(~{sep=" " gdss})
for BASH_FILE in "${BASH_FILES[@]}";
do
	# copy each input into the working directory; the task owns the copy
	cp "${BASH_FILE}" .
done

Other workflows are able to use symlinks as they only open the inputs as read-only.
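One way to combine the two approaches is to symlink when an input is already writable and copy only when it is not. The sketch below is hypothetical and is not what the workflows do: link_or_copy is an invented helper, and the demo makes the input read-only by mode, whereas on Terra the cause is ownership. The chmod on the copy exists only for that reason and assumes chmod is permitted on files the task created itself.

```shell
# Hypothetical helper: symlink when the input is already writable, else copy.
link_or_copy() {
    local src="$1" dest_dir="$2"
    if [ -w "${src}" ]; then
        ln -s "${src}" "${dest_dir}/"
    else
        cp "${src}" "${dest_dir}/"
        # The copy belongs to the current user, so we may add write permission.
        chmod u+w "${dest_dir}/$(basename "${src}")"
    fi
}

# Mock layout with invented names; a-w simulates a non-writable localized input.
demo="$(mktemp -d)"
mkdir -p "${demo}/inputs/111" "${demo}/execution"
echo "data" > "${demo}/inputs/111/subset_chr1.gds"
chmod a-w "${demo}/inputs/111/subset_chr1.gds"

link_or_copy "${demo}/inputs/111/subset_chr1.gds" "${demo}/execution"
```

Tasks that only read their inputs would take the symlink branch and avoid the extra disk usage.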

@aofarrel aofarrel added Cromwell limitation? May be a limitation of how Cromwell works GCS limitation? May be a limitation of how GCS works labels Apr 27, 2021
@aofarrel aofarrel changed the title unique_variant_id contains a workaround which copies input GDS files over twice [a] unique_variant_id contains a workaround which copies input GDS files over twice May 1, 2021
@aofarrel aofarrel changed the title [a] unique_variant_id contains a workaround which copies input GDS files over twice some tasks contain a workaround which copies input GDS files over twice May 25, 2021
@aofarrel aofarrel changed the title some tasks contain a workaround which copies input GDS files over twice The Double Input Workaround Jun 1, 2021
@aofarrel aofarrel added CWL-WDL divergence wontfix This will not be worked on labels Jun 1, 2021
@aofarrel aofarrel added this to Won't Fix/Unfixable in Issue Triage Jun 10, 2021
@aofarrel aofarrel changed the title The Double Input Workaround The Twice-Localized Workaround Jul 9, 2021
@aofarrel
Collaborator Author

aofarrel commented Jul 13, 2021

The LD Pruning workflow's merge_gds task used to use this workaround for the exact same reason as vcf-to-gds: There's an input array of files from a previous scattered task, and the config file will only support their addition if each file in that array is in the same directory. However, it turns out its Rscripts open the files in read-only mode, so symlinks will suffice.

@aofarrel
Collaborator Author

aofarrel commented Jul 26, 2021

All this time I was assuming that the issues I had with softlinks were unavoidable, both because vcf-to-gds did not work with them and because of all these people saying "softlinks don't exist in Google Cloud Storage." I've since updated this issue and older comments to explain that the issue is actually down to permissions.

@aofarrel
Collaborator Author

local

in:

  ls -lha .

out:

total 28K
drwxr-xrwx 9 topmed topmed 288 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:15 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:15 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:15 script.submit
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:15 stderr.background
-rw-r--r-- 1 topmed topmed 0 Jul 26 14:15 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:15 stdout.background

in:

  ls -lha ../inputs

out:

total 0
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -1179415630
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 -478351052
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:15 ..
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 1318600307
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 2019664885
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 222713526

in:

  ls -lha ../inputs/*

out:

../inputs/-1179415630:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 71K Jul 26 14:15 1KG_phase3_subset_chr1.gds

../inputs/-478351052:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:14 1KG_phase3_subset_chr3.gds

../inputs/1318600307:
total 72K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 72K Jul 26 14:15 1KG_phase3_subset_chr2.gds

../inputs/2019664885:
total 76K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 73K Jul 26 14:15 1KG_phase3_subset_chr20.gds

../inputs/222713526:
total 60K
drwxr-xrwx 3 topmed topmed 96 Jul 26 14:15 .
drwxr-xrwx 7 topmed topmed 224 Jul 26 14:15 ..
-rw-r--r-- 2 topmed topmed 58K Jul 26 14:14 1KG_phase3_subset_chrX.gds

After the creation of symbolic links, the execution directory has:

total 32K
drwxr-xrwx 14 topmed topmed 448 Jul 26 14:22 .
drwxr-xrwx 5 topmed topmed 160 Jul 26 14:22 ..
lrwxr-xr-x 1 topmed topmed 123 Jul 26 14:22 1KG_phase3_subset_chr1.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/872935859/1KG_phase3_subset_chr1.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr2.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-924015500/1KG_phase3_subset_chr2.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chr20.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-222950922/1KG_phase3_subset_chr20.gds
lrwxr-xr-x 1 topmed topmed 124 Jul 26 14:22 1KG_phase3_subset_chr3.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/1574000437/1KG_phase3_subset_chr3.gds
lrwxr-xr-x 1 topmed topmed 125 Jul 26 14:22 1KG_phase3_subset_chrX.gds -> /bark-bark/vcftogds/46c1a6d5-73da-4a34-befd-6cee3cf19b05/call-unique_variant_id/inputs/-2019902281/1KG_phase3_subset_chrX.gds
-rw-r--r-- 1 topmed topmed 6.3K Jul 26 14:22 script
-rw-r--r-- 1 topmed topmed 461 Jul 26 14:22 script.background
-rw-r--r-- 1 topmed topmed 499 Jul 26 14:22 script.submit
-rw-r--r-- 1 topmed topmed 1.7K Jul 26 14:22 stderr
-rw-r--r-- 1 topmed topmed 12 Jul 26 14:22 stderr.background
-rw-r--r-- 1 topmed topmed 1.9K Jul 26 14:22 stdout
-rw-r--r-- 1 topmed topmed 6 Jul 26 14:22 stdout.background

@aofarrel
Collaborator Author

On Google, ls -lha ../* finds:

ls -lha ../bin ../boot ../cromwell_root ../dev ../etc ../google ../home ../lib ../lib64 ../media ../mnt ../opt ../proc ../root ../run ../sbin ../srv ../sys ../tmp ../usr ../var

Within ../cromwell_root:

drwxrwxrwx 5 root   root   4.0K Jul 26 14:53 .
drwxr-xr-x 1 root   root   4.0K Jul 26 14:53 ..
-rw-r--r-- 1 root   root   2.0K Jul 26 14:53 gcs_delocalization.sh
-rw-r--r-- 1 root   root   1.7K Jul 26 14:53 gcs_localization.sh
-rw-r--r-- 1 root   root    14K Jul 26 14:53 gcs_transfer.sh
drwxrwxrwx 2 root   root    16K Jul 26 14:49 lost+found
-rw-r--r-- 1 root   root   4.7K Jul 26 14:53 script
-rw-r--r-- 1 topmed topmed  245 Jul 26 14:53 stderr
-rw-r--r-- 1 topmed topmed  677 Jul 26 14:53 stdout
drwxrwxrwx 2 topmed topmed 4.0K Jul 26 14:53 tmp.78134170
drwxr-xr-x 3 root   root   4.0K Jul 26 14:53 topmed_workflow_testing

@aofarrel
Collaborator Author

The Terra errors for this commit reference ln even though I'm not using ln in the task.

ln: failed to access '/cromwell_root/*.gds': No such file or directory

However, outside of the task but within the script folder...

# hardlink or symlink all the files into the glob directory
( ln -L /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b 2> /dev/null ) || ( ln /cromwell_root/*.gds /cromwell_root/glob-5650d15b9bd471dc83ac35b7daef1c7b )
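The literal '*.gds' in the error message is consistent with how bash treats a glob that matches nothing: by default the pattern is passed through unchanged, so ln receives the string itself. A minimal sketch of that behaviour follows, run in an empty scratch directory rather than /cromwell_root. It illustrates the shell behaviour only and is not a claim about what the task actually produced.

```shell
# Illustration: an unmatched glob is passed through literally by default.
empty_dir="$(mktemp -d)"
cd "${empty_dir}"

literal=(*.gds)              # no matches, so the array holds the string '*.gds'
echo "default: ${literal[0]}"

shopt -s nullglob            # with nullglob, an unmatched glob expands to nothing
matches=(*.gds)
shopt -u nullglob
echo "nullglob: ${#matches[@]} match(es)"
```

So an ln error that names a path ending in '*.gds' suggests that no .gds files existed at that location when the line ran.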

@aofarrel
Collaborator Author

I was going through old branches and found terra-permissions-workaround from about ten months ago. This is essentially all it changed; it only touched vcf-to-gds. I highly doubt the information in it is all true, since that workaround isn't implemented in what's on main, but it may be worth recording...

[Image: Screen Shot 2022-07-12 at 11 56 56 AM]
