Possible to run docker in cavatica on openPBTA read data? #341

hbeale · 2019-12-16T23:18:37Z

@jharenza (Jackie Taroni suggested I ping you as the person most likely to have expertise in this area.)

We have a docker that we would like to be applied to the aligned read data to produce data that our analysis in OpenPBTA would consume. Do you have a process for making this request? I've summarized our proposed analysis below.

We would like to perform QC analysis on the aligned read data by counting the number of Mapped Exonic Non-Duplicate (MEND) reads in each sample. We will be performing outlier analysis (#229) and meta-analysis of the outlier results, and we wish to know which samples (if not all) are of high enough quality to generalize from. We have defined the relationship of MEND counts to sensitivity and specificity of outlier calling in the manuscript linked below.

The QC analysis takes as input a hg38-aligned bam file and generates as output several small text files, as well as a duplicate-marked bam file (which can be discarded). The process takes approximately 2 hours on a sample containing 70 million reads when processed on a computer with 64GB of memory and 12 VCPUs.

The value of the analysis is maximized if we also have access to the STAR output log titled "Log.final.out" and fastqc output (e.g. R1_fastqc.zip and R2_fastqc.zip).

The QC process is dockerized and the code is available at https://github.com/UCSC-Treehouse/mend_qc.

The MEND approach is described more fully here: https://www.biorxiv.org/content/10.1101/716829v1

jharenza · 2019-12-17T00:27:16Z

@hbeale - thanks for reaching out! We can do one of two things:

1. We can create a CAVATICA project with all aligned reads for you and add some CAVATICA compute credits, and your team can run the docker/workflow in CAVATICA.
2. We can put it in our queue to run (this may be later this week, and many people are leaving next week for the holiday).

With either of these options, we would release the output files in a new data release, and when you work on the PR you could pull those files from the data release. If the data release is not immediate (it may not be until the new year, as I am going to release #326 tomorrow), we could get you the files ahead of time; if you run it yourself, you'd have them immediately. Which do you prefer?

hbeale · 2019-12-17T16:43:26Z

Thanks, @jharenza! Option two sounds more reproducible; let's go with that if possible. What else do you need from me?

hbeale · 2019-12-17T16:43:50Z

(And your timeline sounds fine to me.)

jharenza · 2019-12-17T23:15:25Z

Ok, @hbeale - we have a group meeting tomorrow and will discuss and get back to you with some plans.

hbeale · 2019-12-17T23:38:18Z

Thanks! The value of the analysis is maximized if we also have access to the STAR output log titled "Log.final.out" and fastqc output (e.g. R1_fastqc.zip and R2_fastqc.zip). I've amended the first comment to reflect this. Can you discuss including these in your data release as well? Thank you.

jharenza · 2019-12-18T20:18:25Z

@hbeale - @zhangb1 was able to create a workflow for this today, so I think we can queue that up later this week. Re: the STAR output, that should be no problem - we can zip and release those. Re: the fastqc output, we actually run RNASeqQC. I am attaching a sample output file here so you can check whether what you need is in these files or whether you need the FASTQC program run? Thanks!

96a41796-c1b6-447f-9f88-b2e7e52005b1.Aligned.out.sorted.bam.metrics.txt

hbeale · 2019-12-19T18:05:21Z

thanks @zhangb1!
Regarding FastQC and RNASeqQC, I can do without either. I usually get total reads from the FastQC output, but I can also get it from the STAR "Log.final.out". Thank you!
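As a minimal sketch of the fallback mentioned above, the total read count can be pulled from the text of a STAR "Log.final.out" file by looking for the "Number of input reads" line. The function name `parse_star_input_reads` and the sample excerpt are hypothetical and only for illustration; the numbers are made up.

```python
def parse_star_input_reads(log_text: str) -> int:
    """Return the 'Number of input reads' value from STAR Log.final.out text."""
    for line in log_text.splitlines():
        # STAR writes each metric as "<label> |<tab><value>"
        if "|" not in line:
            continue
        label, value = line.split("|", 1)
        if label.strip() == "Number of input reads":
            return int(value.strip())
    raise ValueError("'Number of input reads' not found in log text")


# Made-up excerpt in the Log.final.out layout, for illustration only
sample_log = """\
                          Number of input reads |\t70000000
                      Average input read length |\t200
                   Uniquely mapped reads number |\t63000000
"""

total_reads = parse_star_input_reads(sample_log)
print(total_reads)
```

In practice the text would come from reading the released log file for each sample, for example with `pathlib.Path(...).read_text()`.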

jharenza · 2019-12-19T18:10:50Z

ok great!

jharenza · 2020-01-02T21:29:25Z

@hbeale we have completed the MEND QC run, and will plan to release this data + STAR Log.final.out files with #v13. Can give you an updated timeline for release in the next week.

jharenza · 2020-01-06T21:14:33Z

Hi @hbeale - below are the outputs from Mend QC - which did you want in the release?

readDist.txt: The output of RSeQC read_distribution.py (~1kb)
bam_umend_qc.tsv: uniqMappedNonDupeReadCount, estExonicUniqMappedNonDupeReadCount and PASS/FAIL
bam_umend_qc.json: Same as bam_umend_qc.tsv but in json format
sortedByCoord.md.bam: BAM with duplicates marked sorted by coordinate
sortedByCoord.md.bam.bai: Index for sortedByCoord.md.bam

cc: @migbro

Thanks!

hbeale · 2020-01-06T21:16:29Z

Great! Please release readDist.txt (the output of RSeQC read_distribution.py, ~1kb) and bam_umend_qc.tsv (uniqMappedNonDupeReadCount, estExonicUniqMappedNonDupeReadCount and PASS/FAIL).
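For downstream use, a released bam_umend_qc.tsv could be read along these lines. This is a sketch under stated assumptions: the thread only names the two count fields and a PASS/FAIL value, so the one-row-per-file layout, the `input` column, and the `qc` column name used here are guesses, and `read_mend_qc` is a hypothetical helper, not part of mend_qc.

```python
import csv
import io

# Hypothetical file content. Only the two count column names come from the
# thread; the "input" and "qc" columns and all values are assumptions.
sample_tsv = (
    "input\tuniqMappedNonDupeReadCount\testExonicUniqMappedNonDupeReadCount\tqc\n"
    "sample.bam\t40000000\t25000000\tPASS\n"
)


def read_mend_qc(handle, status_column="qc"):
    """Parse a single-sample MEND QC table from an open text handle."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if len(rows) != 1:
        raise ValueError(f"expected exactly one data row, found {len(rows)}")
    row = rows[0]
    return {
        "uniq_mapped_non_dupe_reads": int(row["uniqMappedNonDupeReadCount"]),
        # Parsed as float because the value is an estimate
        "est_exonic_reads": float(row["estExonicUniqMappedNonDupeReadCount"]),
        "passed": row[status_column] == "PASS",
    }


result = read_mend_qc(io.StringIO(sample_tsv))
print(result)
```

The `status_column` parameter is there so the helper can be pointed at whatever the PASS/FAIL column is actually called in the released files.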


jaclyn-taroni · 2020-01-18T11:57:19Z

@hbeale The MendQC files and STAR logs were included with the v13 release (#444), which is now available via the download script in the master branch. I am going to close this issue. If there are any issues or questions around those files, we can reopen or file a new data issue. Thanks!

hbeale added the data label Dec 16, 2019

jharenza self-assigned this Dec 17, 2019

jharenza mentioned this issue Dec 23, 2019

Planned data release: V13 #373


jaclyn-taroni closed this as completed Jan 18, 2020

hbeale mentioned this issue Feb 20, 2020

Proposed Analysis: Flag RNA-seq samples with substantially different read-type composition #550

