IdentifyVariants #182

songtaogui · 2018-11-28T07:12:36Z

Hi,
I would like to do a "merge call" using IdentifyVariants, as suggested in #111. I have learned that IdentifyVariants takes breakend assemblies as required input (which should be merged beforehand) and a coordinate-sorted BAM file as optional input. So my questions are:

1. What does the "Coordinate-sorted BAM file" do during IdentifyVariants?
2. Is it OK to use the SV-reads BAM rather than the raw all-reads BAM?
3. Should I merge the BAM files of all samples into one (just like the assemblies)?

Thank you very much!
Songtao Gui


d-cameron · 2018-12-04T12:52:59Z

Ok, it's a bit more complicated than my initial comment. I've added an example/gridss_separate.sh script so you can a) see what each step does, and b) avoid running the full GRIDSS pipeline when you don't need to. The problem with just running independent assemblies is that the assembly.bam records the per-sample support according to input ordinal, not input name. This means that if you merge all the assembly BAMs together, the variant calling will consider every assembly to come from the first sample.
To get around this, you'll need to batch your samples and use dummy placeholders.
E.g., batch 1 would look like:

java -Xmx31G $JVM_ARGS gridss.AssembleBreakends \
		INPUT=input1.bam INPUT_LABEL=sample1 \
		INPUT=input2.bam INPUT_LABEL=sample2 \
		INPUT=empty.bam INPUT_LABEL=sample3 \
		INPUT=empty.bam INPUT_LABEL=sample4 \
		INPUT=empty.bam INPUT_LABEL=sample5 \
		INPUT=empty.bam INPUT_LABEL=sample6 \
		OUTPUT=batch1.bam

your second batch would look like:

java -Xmx31G $JVM_ARGS gridss.AssembleBreakends \
		INPUT=empty.bam INPUT_LABEL=sample1 \
		INPUT=empty.bam INPUT_LABEL=sample2 \
		INPUT=input3.bam INPUT_LABEL=sample3 \
		INPUT=input4.bam INPUT_LABEL=sample4 \
		INPUT=empty.bam INPUT_LABEL=sample5 \
		INPUT=empty.bam INPUT_LABEL=sample6 \
		OUTPUT=batch2.bam

and so on.

It is important that the input file and label ordering matches for every batch!

By using empty.bam (a BAM file containing zero reads) as the input file for each input not included in the batch, we don't incorporate them into the assembly, but we keep all the assembly bam files consistent with each other.
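A small sketch of how the per-batch commands could be generated so that the input and label order never drifts between batches. The function name `batch_command` and the `<sample>.bam` file naming are hypothetical, and only the arguments shown in the commands above are emitted; anything else a real run needs is left out here, as it is in this thread. The function only prints the command, it does not run GRIDSS.

```shell
#!/bin/sh
# Print the AssembleBreakends command for one batch.
# Usage: batch_command BATCH_NAME FIRST LAST SAMPLE...
# Samples at positions FIRST..LAST (1-based) get their real BAM;
# every other sample gets the empty.bam placeholder.
# Assumption: the BAM for a sample is named <sample>.bam.
batch_command() {
    batch=$1; first=$2; last=$3
    shift 3
    # Single quotes keep $JVM_ARGS literal in the printed command.
    printf 'java -Xmx31G $JVM_ARGS gridss.AssembleBreakends'
    i=0
    for sample in "$@"; do
        i=$((i + 1))
        if [ "$i" -ge "$first" ] && [ "$i" -le "$last" ]; then
            bam="$sample.bam"
        else
            bam="empty.bam"
        fi
        # The full sample list is always walked in the same order,
        # so the input ordinals line up across batches.
        printf ' INPUT=%s INPUT_LABEL=%s' "$bam" "$sample"
    done
    printf ' OUTPUT=%s.bam\n' "$batch"
}

# Batch 1 covers the first two of four samples.
batch_command batch1 1 2 sample1 sample2 sample3 sample4
```

Because every batch is generated from the same full sample list, the only thing that changes between batches is which positions receive a real BAM.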

The other issue is that assembly contig names will be reused across each batch. We can solve this issue by prepending the batch name to every read name within each assembly bam. Do this before invoking gridss.SoftClipToSplitReads on the assembly bam so you don't have to worry about split reads.
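One possible way to do the renaming is to stream the assembly through SAM text and rewrite the first column. This is only a sketch: the function name `prefix_read_names` and the `_` separator are hypothetical, and it assumes that the read name (QNAME) is the only place the contig name needs to change, which has not been checked against GRIDSS itself.

```shell
#!/bin/sh
# Prepend "<prefix>_" to the read name (column 1) of every SAM record on stdin.
# Header lines (starting with "@") pass through unchanged.
prefix_read_names() {
    awk -v prefix="$1" 'BEGIN { FS = OFS = "\t" }
        /^@/ { print; next }
        { $1 = prefix "_" $1; print }'
}

# Intended use with samtools (file names are placeholders):
#   samtools view -h batch1.bam \
#     | prefix_read_names batch1 \
#     | samtools view -b -o batch1.renamed.bam -
```

Only column 1 is touched, so sort order by coordinate is preserved and the renamed file can go straight into the merge.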

So the pipeline would look something like:

1. Preprocess all samples.
2. Run batched assembly on the input files, with an empty placeholder BAM for inputs not in the batch. I haven't tested GRIDSS past 1000x aggregate coverage, so don't make your batches too big.
3. Rename the assembly contig reads by including a batch identifier.
4. samtools merge the assemblies.
5. Extract metrics and run gridss.SoftClipToSplitReads on the merged assembly.
6. Perform the variant calling steps on the full input.
   You may need to disable async I/O and reduce the buffer sizes to keep memory usage under control, since you'll be reading from hundreds of samples in parallel.
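For sizing the batched assembly step, a rough calculation can help. The sketch below treats the 1000x figure as a cap on aggregate coverage per batch; that is an assumption, since 1000x is only the highest coverage reported as tested, not a documented limit. The function names are hypothetical.

```shell
#!/bin/sh
# Largest batch size such that batch_size * avg_coverage <= max_aggregate.
# Usage: max_batch_size AVG_COVERAGE MAX_AGGREGATE_COVERAGE
max_batch_size() {
    avg_coverage=$1; max_aggregate=$2
    echo $((max_aggregate / avg_coverage))
}

# Number of batches needed for all samples (ceiling division).
# Usage: batch_count TOTAL_SAMPLES SAMPLES_PER_BATCH
batch_count() {
    samples=$1; per_batch=$2
    echo $(( (samples + per_batch - 1) / per_batch ))
}

per_batch=$(max_batch_size 16 1000)
echo "$per_batch"              # prints 62
batch_count 200 "$per_batch"   # prints 4
```

For example, 40 samples at an average of 16x give 640x aggregate coverage, which stays under that assumed cap.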

zlye · 2019-10-24T15:38:40Z

Hi - Is there a way I could do this if I've already run the entire pipeline on each of my 200+ samples? Thanks

d-cameron · 2019-10-27T23:54:16Z

If you've already run the entire pipeline on 200 samples, then you'll have 200 VCFs. Are you wanting to go back to make a single VCF with 200 samples in it? What is your use case?

zlye · 2019-10-28T14:26:51Z

Hi, I would like to call CNVs on a population of 200 individuals. Ideally this dataset would enable me to make statements about the frequencies of various structural variants. (I am working with plant data, which is sometimes tricky because there are lots of duplications.) I ran the pipeline individually because I thought I had too many samples for the multi-sample option. I then saw this issue (#182) and understood that the batch method you describe might be more appropriate. However, given that I already have per-sample VCFs, is there a way you'd recommend to merge the individuals? (Unfortunately I don't have the intermediate files.) Or would you recommend re-running with the batch method? Also, how long should I expect a single batch to take if I run 40 samples with an average of 16x coverage? Thanks, Zoe


-- Zoe Lye, Ph.D. candidate, 12 Waverly Place, New York University, New York, NY 10003

d-cameron added a commit that referenced this issue Dec 4, 2018

#179 #182 added FAQ to readme

b238c03

d-cameron added a commit that referenced this issue Dec 4, 2018

#182 #180 Added gridss_separate.sh example script showing how each part of GRIDSS is run

75ccf90

d-cameron closed this as completed Dec 4, 2018

YuanwenGuo mentioned this issue Oct 15, 2019

java.lang.NumberFormatException #261

Closed
