This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Possible to run docker in cavatica on openPBTA read data? #341

Closed
hbeale opened this issue Dec 16, 2019 · 12 comments

@hbeale
Contributor

hbeale commented Dec 16, 2019

@jharenza (Jackie Taroni suggested I ping you as the person most likely to have expertise in this area.)

We have a Docker container that we would like to run on the aligned read data to produce output that our analysis in OpenPBTA would consume. Do you have a process for making this request? I've summarized our proposed analysis below.

We would like to perform QC analysis on the aligned read data by counting the number of Mapped Exonic Non-Duplicate (MEND) reads. We will be performing outlier analysis (#229) and meta-analysis of the outlier results, and we wish to know which samples (if not all) are of high enough quality to generalize from. We have defined the relationship of MEND counts to sensitivity and specificity of outlier calling in the manuscript linked below.

The QC analysis takes as input an hg38-aligned BAM file and generates as output several small text files, as well as a duplicate-marked BAM file (which can be discarded). The process takes approximately 2 hours on a sample containing 70 million reads when processed on a computer with 64 GB of memory and 12 vCPUs.

The value of the analysis is maximized if we also have access to the STAR output log titled "Log.final.out" and fastqc output (e.g. R1_fastqc.zip and R2_fastqc.zip).

The QC process is dockerized and the code is available at https://github.com/UCSC-Treehouse/mend_qc.

The MEND approach is described more fully here: https://www.biorxiv.org/content/10.1101/716829v1

@hbeale hbeale added the data label Dec 16, 2019
@jharenza
Collaborator

@hbeale - thanks for reaching out! We can do one of two things:

  1. we can create a CAVATICA project with all aligned reads for you and add some CAVATICA compute credits, and your team can run the Docker workflow in CAVATICA
  2. we can put it in our queue to run (possibly later this week, though many people are leaving next week for the holiday)

With either of these solutions, we would release the output files in a new data release, and when you work on the PR, you could pull those files from the data release. If the data release is not immediate (it may not be until the new year, as I am going to release #326 tomorrow), we could get you the files ahead of time; if you run it yourself, you'd have them immediately. Which do you prefer?

@jharenza jharenza self-assigned this Dec 17, 2019
@hbeale
Contributor Author

hbeale commented Dec 17, 2019

Thanks, @jharenza! Option two sounds more reproducible; let's go with that if possible. What else do you need from me?

@hbeale
Contributor Author

hbeale commented Dec 17, 2019

(And your timeline sounds fine to me.)

@jharenza
Collaborator

Ok, @hbeale - we have a group meeting tomorrow and will discuss and get back to you with some plans.

@hbeale
Contributor Author

hbeale commented Dec 17, 2019

Thanks! The value of the analysis is maximized if we also have access to the STAR output log titled "Log.final.out" and fastqc output (e.g. R1_fastqc.zip and R2_fastqc.zip). I've amended the first comment to reflect this. Can you discuss including these in your data release as well? Thank you.

@jharenza
Collaborator

@hbeale - @zhangb1 was able to create a workflow for this today, so I think we can queue that up later this week. Re: the STAR output, that should be no problem - we can zip and release those. Re: the FastQC output, we actually run RNASeqQC. I am attaching a sample output file here so you can check whether what you need is in these files or whether you need FastQC itself to be run. Thanks!

96a41796-c1b6-447f-9f88-b2e7e52005b1.Aligned.out.sorted.bam.metrics.txt

@hbeale
Contributor Author

hbeale commented Dec 19, 2019

thanks @zhangb1!
Regarding FastQC and RNASeqQC, I can do without either. I usually get total reads from the FastQC output, but I can also get it from the STAR "Log.final.out". Thank you!
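
For anyone following along, reading the total read count from a STAR "Log.final.out" could look roughly like the sketch below. It assumes the usual `key |<tab>value` layout of that file; the function names and the numbers in the example text are made up for illustration and are not taken from any OpenPBTA sample.

```python
def parse_star_final_log(text):
    """Parse the 'key | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            # Section headers such as "UNIQUE READS:" carry no value.
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def total_input_reads(text):
    """Return the 'Number of input reads' entry as an integer."""
    return int(parse_star_final_log(text)["Number of input reads"])


# Illustrative excerpt only; the values are invented.
example_log = (
    "                          Number of input reads |\t70000000\n"
    "                      Average input read length |\t200\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t63000000\n"
)

print(total_input_reads(example_log))
```

With a real file, the text would come from something like `open("Log.final.out").read()`.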

@jharenza
Collaborator

ok great!

@jharenza jharenza mentioned this issue Dec 23, 2019
7 tasks
@jharenza
Collaborator

jharenza commented Jan 2, 2020

@hbeale we have completed the MEND QC run and plan to release this data + STAR Log.final.out files with #v13. We can give you an updated timeline for release in the next week.

@jharenza
Collaborator

jharenza commented Jan 6, 2020

Hi @hbeale - below are the outputs from MEND QC - which do you want in the release?

readDist.txt: The output of RSeQC read_distribution.py (~1 KB)
bam_umend_qc.tsv: uniqMappedNonDupeReadCount, estExonicUniqMappedNonDupeReadCount and PASS/FAIL
bam_umend_qc.json: Same as bam_umend_qc.tsv but in JSON format
sortedByCoord.md.bam: BAM with duplicates marked, sorted by coordinate
sortedByCoord.md.bam.bai: Index for sortedByCoord.md.bam

cc: @migbro

Thanks!

@hbeale
Contributor Author

hbeale commented Jan 6, 2020 via email

@jaclyn-taroni
Member

@hbeale The MendQC files and STAR logs were included with the v13 release (#444), which is now available via the download script in the master branch. I am going to close this issue. If there are any issues or questions around those files, we can reopen or file a new data issue. Thanks!
