Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IdentifyVariants #182

Closed
songtaogui opened this issue Nov 28, 2018 · 4 comments
Closed

IdentifyVariants #182

songtaogui opened this issue Nov 28, 2018 · 4 comments

Comments

@songtaogui
Copy link

Hi,
I would like to do a "merge call" using IdentifyVariants as suggested in #111. And I have learned that IdentifyVariants takes Breakend assemblies as required inputs ( which should be merged before ) and Coordinate-sorted BAM file as optional inputs. So my questions are:

  1. What does the "Coordinate-sorted BAM file" do during IdentifyVariants
  2. Is it OK to use SV-reads-bam rather than raw-all-reads-bam
  3. Should I merge bam files of all samples into one ( just like the assemblies ) ?

Thank you very much!
Songtao Gui

d-cameron added a commit that referenced this issue Dec 4, 2018
d-cameron added a commit that referenced this issue Dec 4, 2018
@d-cameron
Copy link
Member

Ok, it's a bit more complicated that my initial comment. I've added an example/gridss_separate.sh scripts so you can a) see what each step does, and b) don't have to run the full GRIDSS pipeline when you don't need to. The problem with just running independent assemblies is that the assembly.bam records the per-sample support according to input ordinal, not input name. This means if you merge all the assembly bams together the variant calling will consider every assembly to come from the first sample.
To get around this, you'll need to batch your samples and use dummy placeholders.
Eg: batch 1 would look like

java -Xmx31G $JVM_ARGS gridss.AssembleBreakends \
		INPUT=input1.bam INPUT_LABEL=sample1 \
		INPUT=input2.bam INPUT_LABEL=sample2 \
		INPUT=empty.bam INPUT_LABEL=sample3 \
		INPUT=empty.bam INPUT_LABEL=sample4 \
		INPUT=empty.bam INPUT_LABEL=sample5 \
		INPUT=empty.bam INPUT_LABEL=sample6 \
		OUTPUT=batch1.bam

your second batch would look like:

java -Xmx31G $JVM_ARGS gridss.AssembleBreakends \
		INPUT=empty.bam INPUT_LABEL=sample1 \
		INPUT=empty.bam INPUT_LABEL=sample2 \
		INPUT=input3.bam INPUT_LABEL=sample3 \
		INPUT=input4.bam INPUT_LABEL=sample4 \
		INPUT=empty.bam INPUT_LABEL=sample5 \
		INPUT=empty.bam INPUT_LABEL=sample6 \
		OUTPUT=batch1.bam

and so on.

It is important that the input file and label ordering matches for every batch!

By using empty.bam (a BAM file containing zero reads) as the input file for each input not included in the batch, we don't incorporate them into the assembly, but we keep all the assembly bam files consistent with each other.

The other issue is that assembly contig names will be reused across each batch. We can solve this issue by prepending the batch name to every read name within each assembly bam. Do this before invoking gridss.SoftClipToSplitReads on the assembly bam so you don't have to worry about split reads.

So the pipeline would look something like:

  • preprocess all sample
  • run batched assembly on the input files with a empty placeholder bam for inputs not in the batch. I haven't tested GRIDSS past 1000x aggregate coverage so don't make you batches too big.
  • rename the assembly contig reads by including a batch identifier
  • samtools merge the assemblies
  • extract metrics & run gridss.SoftClipToSplitReads on the merged assembly
  • perform the variant calling steps on the full input
    -- you may need to disable async I/O and reduce the buffer sizes to keep memory usage under control since you'll be reading from hundreds of samples in parallel.

@zlye
Copy link

zlye commented Oct 24, 2019

Hi - Is there a way I could do this if I've already run the entire pipeline on each of my 200+ samples? Thanks

@d-cameron
Copy link
Member

If you've already run the entire pipeline on 200 samples, then you'll have 200 VCFs. Are you wanting to go back to make a single VCF with 200 samples in it? What is your use case?

@zlye
Copy link

zlye commented Oct 28, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants