The Twice-Localized Workaround #2
Comments
The LD Pruning workflow's merge_gds task used to use this workaround for the exact same reason as vcf-to-gds: there's an input array of files from a previous scattered task, and the config file will only support their addition if each file in that array is in the same directory. However, it turns out its R scripts open the files in read-only mode, so symlinks will suffice.
All this time I was assuming that the issues I had with softlinks were unavoidable due to vcf-to-gds not working with them and due to all these people saying "softlinks don't exist in Google Cloud Storage," but I've since updated this issue and older comments to explain that the issue is actually down to permissions.
local
in:
out:
in:
out:
in:
out:
After the creation of symbolic links, the execution directory has:
On Google,
Within ../cromwell_root:
The Terra errors for this commit reference
However, outside of the task but within the script folder...
Going through old branches, I found terra-permissions-workaround from about ten months ago. This is essentially all it changed; it was just touching vcf-to-gds. I highly doubt the information in it is all true since that workaround isn't implemented with what's on main, but it may be worth recording...
In brief
Some tasks currently have a workaround wherein some input files are copied over twice. This results in these tasks requiring up to twice as much disk space as they would otherwise. Currently, disk size estimate calculations for these tasks account for this, so it is unlikely this will cause users to get an error. But this still results in slightly increased costs, so it's worth making note of.
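As a rough illustration of the disk accounting described above, the sketch below counts the input size twice when a task duplicates its inputs. The function name, the fixed overhead, and the rounding are assumptions made for illustration; they are not taken from the workflow's actual disk size calculation.

```python
import math

def estimate_disk_gb(input_bytes, copies_inputs=True, overhead_gb=2):
    """Hypothetical disk estimate: inputs count twice when the task duplicates them."""
    total_input_bytes = sum(input_bytes)
    # A task using the workaround holds the localized inputs plus one copy of each.
    multiplier = 2 if copies_inputs else 1
    input_gb = math.ceil(multiplier * total_input_bytes / 1024**3)
    return input_gb + overhead_gb
```

With five 1 GiB inputs, this estimate asks for 12 GB with the workaround and 7 GB without it, which is where the extra cost comes from.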
Example
The first task of the vcf-to-gds workflow generates GDS files in a scattered task. The second task, which is not scattered, takes in those files as inputs to give them unique variant IDs.
In this situation, where a scattered task passes its outputs to a non-scattered task, the output of each instance of the scattered task lands in its own new folder. Let's say my scattered task runs on five vcf files, generating five gds files. My second task is passed in those gds files like this:
That is to say, each gds file now lives in its own folder within /inputs/.
This is problematic with how the R scripts use configuration files. These configuration files expect one line to represent a given pattern for an input file, such as
where the space is filled in with expected chromosome numbers by the script itself at runtime.
We have two options when referring to files like these when making configuration files: Either we pass in the path, or just a filename. If we pass in the full path, the resulting configuration file will be invalid, because every gds file has a different path due to each gds file living in a separate folder. If we pass in a filename, the resulting configuration file will technically be valid, but it will fail because the files strictly speaking do not exist in the working directory, but rather in some subfolder of /inputs/.
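To make the two options concrete, here is a small sketch using made-up localized paths. The shard folder names and file names are hypothetical, chosen only to mimic one folder per scattered output. It checks whether all inputs share a directory, which is what a single path pattern would need.

```python
import os

# Hypothetical localized inputs: one folder per instance of the scattered task.
gdss = [
    "/inputs/shard-0/chr1.gds",
    "/inputs/shard-1/chr2.gds",
    "/inputs/shard-2/chr3.gds",
]

def shared_directory(paths):
    """Return the common parent directory, or None if the files are spread out."""
    parents = {os.path.dirname(p) for p in paths}
    return parents.pop() if len(parents) == 1 else None

# Option 1 (full path): only workable if every file has the same parent directory.
full_path_pattern_possible = shared_directory(gdss) is not None

# Option 2 (filename only): the names fit one pattern, but the files are not
# in the working directory, so opening them by bare name would fail.
basenames = [os.path.basename(p) for p in gdss]
```

Here `full_path_pattern_possible` is false because each file has a different parent, while `basenames` are uniform but do not point at anything in the working directory.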
However, if we copy or symlink each of those input files into the working directory, we can use the filename method, because now files are actually where the R script expects them.
Where gdss is the array of input files from the previous scattered task.
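A minimal sketch of the symlink step, assuming `gdss` is a list of localized input paths. The helper name `link_into_workdir` is hypothetical, and this is an illustration of the idea rather than the command the workflow actually runs.

```python
import os

def link_into_workdir(gdss, workdir="."):
    """Symlink each input into workdir under its basename; return the new paths."""
    linked = []
    for src in gdss:
        dest = os.path.join(workdir, os.path.basename(src))
        # Skip names that already exist so the step can be re-run safely.
        if not os.path.lexists(dest):
            os.symlink(os.path.abspath(src), dest)
        linked.append(dest)
    return linked
```

After this runs, a configuration file can refer to the inputs by filename alone, since every name now resolves inside the working directory.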
However, this approach is non-functional on Terra -- a permission denied error is thrown. There are three root causes:
rw-r--r-- permissions
chmod or mv are also not allowed on Terra in this context, so we need to duplicate the files to create a copy for which we have write permissions.
Other workflows are able to use symlinks as they only open the inputs as read-only.
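The duplication described above can be sketched as follows. The helper name `copy_into_workdir` is hypothetical, and the sketch assumes that a freshly created copy is owned by the user running the task, which is what makes it writable.

```python
import os
import shutil

def copy_into_workdir(gdss, workdir="."):
    """Copy each input into workdir under its basename; return the new paths."""
    copies = []
    for src in gdss:
        dest = os.path.join(workdir, os.path.basename(src))
        # copyfile copies contents only, not permission bits, so the copy is a
        # new file created by the current user with default permissions.
        shutil.copyfile(src, dest)
        copies.append(dest)
    return copies
```

Unlike a symlink, each copy occupies its own disk space, which is why tasks using this workaround need up to twice the disk of the inputs alone.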