Module/gatk rnaseq/1.0 #184

Jwong684 · 2021-05-07T22:29:40Z

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for New Module

Required

If applicable

I added more granular output subdirectories.
I added rules to the reference_files workflow to generate any new reference files.
I added subdirectories with large intermediate files to the list of scratch_subdirectories in the default.yaml configuration file.
I updated the list of available wildcards for the input files in the default.yaml configuration file.

Checklist for Updated Module

To be completed.

Kdreval

Thanks Jasper! I left some comments. I also have a general question: there is a {pair_status} wildcard, but the module makes no reference to normal samples. RNA-seq would be unpaired, so does this wildcard need to be included?

Kdreval · 2021-05-08T07:17:25Z

demo/config.yaml

+    gatk_rnaseq:
+        inputs:
+            sample_bam: "data/{sample_id}.bam"
+            sample_bai: "data/{sample_id}.bam.bai"


This needs a newline added at the end of the file.

It still shows no newline here. Maybe I am looking at an outdated file?

Sorry, I must have missed it. It's fixed.
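A quick local check can confirm whether a file ends with a newline before pushing. The following is a minimal sketch; the helper name `ends_with_newline` is hypothetical and not part of lcr-modules.

```python
import os


def ends_with_newline(path):
    """Return True if the file at `path` is non-empty and its last byte is a newline."""
    with open(path, "rb") as handle:
        # seek() returns the new position, so 0 here means the file is empty.
        if handle.seek(0, os.SEEK_END) == 0:
            return False
        # Step back one byte from the end and read it.
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

For this thread it could be called as `ends_with_newline("demo/config.yaml")` from the repository root.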

Kdreval · 2021-05-08T07:17:48Z

modules/gatk_rnaseq/1.0/config/default.yaml

+                window: 35 # window size between SNPs in cluster
+                cluster_size: 3 # at least 3 SNPs in cluster
+                # hard filtering (filters OUT) based on metrics: 
+                    # FS (FisherStrand): Phred-scale probability that there is a strand bias from a Fisher's test. (default FS > 30.0)


Thanks for the detailed documentation!

Kdreval · 2021-05-08T07:20:04Z

modules/gatk_rnaseq/1.0/config/default.yaml

+            bcftools: "{MODSDIR}/envs/bcftools-1.10.2.yaml"
+
+        threads:
+            gatk_splitntrim: 12


It is possible to combine these keys and reuse them across rules. In other words, if several rules need the same number of threads, they can refer to the same key in the config. The same can be applied to resources. I think this reduces the number of keys to specify or adjust, and reduces the complexity of the config. What do you think?

It's possible, but wouldn't it also cause confusion if there's ever a need to change the numbers? Also, would there be a unique name that could be applied to a subset of the rules? ("thread_12" would be non-descriptive and wouldn't indicate which rules use this parameter.)

You are right, let's keep the more detailed and informative names.
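The trade-off discussed above can be sketched with a plain dictionary standing in for the loaded `threads` section. Only `gatk_splitntrim: 12` comes from the diff; the second rule name, its value, and the helper `threads_for` are hypothetical and used for illustration only.

```python
# Stand-in for the module's `threads` section after the YAML config is loaded.
# Only `gatk_splitntrim: 12` appears in the diff; the other entry is hypothetical.
per_rule_threads = {
    "gatk_splitntrim": 12,
    "gatk_variant_filtration": 12,  # hypothetical rule name and value
}


def threads_for(threads_config, rule_name):
    """Look up the thread count for one rule, failing loudly if the key is missing."""
    try:
        return threads_config[rule_name]
    except KeyError:
        raise KeyError(f"No threads configured for rule '{rule_name}'") from None


# With per-rule keys, changing one rule's threads leaves the other rules untouched.
per_rule_threads["gatk_splitntrim"] = 24
```

With a single shared key such as `thread_12`, the same change would apply to every rule that refers to it, and the key name would no longer match its value.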

Kdreval · 2021-05-08T07:25:14Z

modules/gatk_rnaseq/1.0/envs/bcftools-1.10.2.yaml

@@ -0,0 +1,36 @@
+name: test-bcftools


This (and the other environment file) should rather live in the lcr-modules/envs/ folder and be symlinked into the module. This makes it easier to reuse the same environments across modules, reducing the need to build multiple environments. There is already an environment for bcftools with the same version, so you can simply symlink lcr-modules/envs/bcftools/bcftools-1.10.2.yaml into this module.
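As an illustration of the symlink suggestion, here is a minimal sketch that creates a relative link from the module's envs folder to the shared environment file. The helper name `link_shared_env` is hypothetical, and it assumes both paths are given relative to the same repository root.

```python
import os


def link_shared_env(shared_env, module_env):
    """Create `module_env` as a relative symlink pointing at `shared_env`."""
    module_dir = os.path.dirname(module_env) or "."
    os.makedirs(module_dir, exist_ok=True)
    # A relative target keeps the link valid when the repository is cloned elsewhere.
    target = os.path.relpath(shared_env, start=module_dir)
    # Raises FileExistsError if a regular copy of the file is still in place.
    os.symlink(target, module_env)
    return target


# Example call, assuming the current directory is the lcr-modules repository root:
# link_shared_env(
#     "envs/bcftools/bcftools-1.10.2.yaml",
#     "modules/gatk_rnaseq/1.0/envs/bcftools-1.10.2.yaml",
# )
```

The existing copy of the environment file in the module would need to be removed before the link is created.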

Kdreval · 2021-05-08T07:28:33Z

modules/gatk_rnaseq/1.0/envs/gatk_rnaseq.yaml

@@ -0,0 +1,352 @@
+name: bioinformatics


Is this environment just GATK, or is there something else in it that this module requires to run? There is already a GATK environment at lcr-modules/envs/gatk/gatk-4.1.8.1.yaml; maybe it can be reused here? I see this one contains tools such as bedtools, vcf2maf, tmux, and a lot of Perl dependencies, but are they used in the module?

Hm, that is a good question. I pulled it from a working environment. I'll test whether it works with the gatk module.

Yes, it worked with the gatk module. I will just use the symlinked versions for both gatk and bcftools.

Kdreval · 2021-05-08T07:33:07Z