Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add genotyping script #234

Open
xxlpaladin opened this issue Aug 14, 2019 · 15 comments
Open

Add genotyping script #234

xxlpaladin opened this issue Aug 14, 2019 · 15 comments

Comments

@xxlpaladin
Copy link

Hello, thanks very much for your previous answers. And I have another small question recently. Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1), and some other SV callers can give the genotypes, such as manta and lumpy. So I wonder if there are any scripts can help to give the genotypes of SVs (generated by gridss), and I know SVTyper (https://github.com/hall-lab/svtyper/) can give the genotypes of SVs (the ungenotyped SVs and bam file should be provided).
Thank you,
Best wishes.

@d-cameron
Copy link
Member

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1)

This a philosophical difference between the approach taken by GRIDSS and both other callers. GRIDSS itself is a breakpoint ( /single breakend) caller. Critically, it makes no assumptions about the ploidy of the sample. Genotyping a SV requires that a) the ploidy is known, and b) genotyping the SV uder the ploidy assumption is actually meaningful. Except for a subset of simple events, the SV itself results in aneuploidy. When there is aneuploidy, genotyping under a diploid assumption does not make sense, hence why GRIDSS chooses not to do it.

That said, GRIDSS reports the most accurate variant allele fraction of any of the short read variant callers. You can genotype SVs by using the VF, REF, and REFPAIR fields.

VAF = VF / (VF + REF + REFPAIR) for variants larger than the library fragment size (I conservatively use a 1000bp threshold)

VAF = VF / (VF + REF) for variants smaller that the max library fragment size distribution (because a RP spanning the variant will be concordantly aligned if the ins/del/dup is short enough).

@d-cameron
Copy link
Member

Would a small R script that genotyped under diploid assumptions be a generally useful utility?

@d-cameron d-cameron removed the wontfix label Aug 15, 2019
@xxlpaladin
Copy link
Author

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1)

This a philosophical difference between the approach taken by GRIDSS and both other callers. GRIDSS itself is a breakpoint ( /single breakend) caller. Critically, it makes no assumptions about the ploidy of the sample. Genotyping a SV requires that a) the ploidy is known, and b) genotyping the SV uder the ploidy assumption is actually meaningful. Except for a subset of simple events, the SV itself results in aneuploidy. When there is aneuploidy, genotyping under a diploid assumption does not make sense, hence why GRIDSS chooses not to do it.
That said, GRIDSS reports the most accurate variant allele fraction of any of the short read variant callers. You can genotype SVs by using the VF, REF, and REFPAIR fields.
VAF = VF / (VF + REF + REFPAIR) for variants larger than the library fragment size (I conservatively use a 1000bp threshold)
VAF = VF / (VF + REF) for variants smaller that the max library fragment size distribution (because a RP spanning the variant will be concordantly aligned if the ins/del/dup is short enough).

Thanks, I think I have got it. But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs? For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1. And can you give some explanations about how to confirm the threshold value of VAF?

@xxlpaladin
Copy link
Author

Would a small R script that genotyped under diploid assumptions be a generally useful utility?

I think the script is very useful and necessary, because we can compare the genotypes of gridss SVs with the genotypes of the same SVs generated by other SV callers (e.g., manta, lumpy). And we can also use the genotypes for other further analyses.

@d-cameron
Copy link
Member

How would you like the genotype of of non-diploid regions to be defined?

@d-cameron
Copy link
Member

But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs?
There's no reason why you can't use a similar genotype model that is used by SNV callers.

For example:
https://software.broadinstitute.org/gatk/documentation/article?id=11079
or
https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf

For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1.

A cut-off at VAF=0.5 is definitely incorrect as that's right in the middle of the heterzygous distribution (at least for indel events).

And can you give some explanations about how to confirm the threshold value of VAF?

The threshold to use will come straight out of your model. You should check that it make sense emperically and if not, you'll need to tweak your error model.

@d-cameron d-cameron changed the title Question about genotyping of SV. Add genotyping script Aug 15, 2019
@d-cameron d-cameron reopened this Aug 15, 2019
@xxlpaladin
Copy link
Author

How would you like the genotype of of non-diploid regions to be defined?

I think we can ignore the non-diploid regions when we genotype the SVs, we can only genotype the SVs in diploid regions. If necessary, we can genotype the SVs in non-diploid regions independently (e.g., the SVs in mitochondrion).

@xxlpaladin
Copy link
Author

But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs?
There's no reason why you can't use a similar genotype model that is used by SNV callers.

For example:
https://software.broadinstitute.org/gatk/documentation/article?id=11079
or
https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf

For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1.

A cut-off at VAF=0.5 is definitely incorrect as that's right in the middle of the heterzygous distribution (at least for indel events).

Thanks for your details about this, I entirely agree with you about using the genotype model of SNV to SVs, and I also think about this yesterday. The VAF=0.5 is just an example to express what I mean, it is definitely incorrect.

And can you give some explanations about how to confirm the threshold value of VAF?

The threshold to use will come straight out of your model. You should check that it make sense emperically and if not, you'll need to tweak your error model.

I generally understand but writing script to get the genotypes is a little difficult for me. If possible, can you provide a common script about genotyping SVs? Thank you.

@DarioS
Copy link

DarioS commented Aug 19, 2019

What about a purity-adjusted VAF instead of a genotype for your use case? I guess that it could simply be calculated as VAF / purity with a ceiling of 1.

@xxlpaladin
Copy link
Author

What about a purity-adjusted VAF instead of a genotype for your use case? I guess that it could simply be calculated as VAF / purity with a ceiling of 1.

Thanks for your good suggestions, I have tried to run the latest version 'purple-2.33.jar' several times. But I think the purple pipeline is not suitable for my analysis. Because it can only be used to human, and all my samples are sheep, so do you have other suggestions?

@d-cameron
Copy link
Member

Coming back to this again: the key determination is the #fragments supporting ref, and the #fragments supporting the variant. For simple deletions we can port SNV logic over but it becomes a bit more complicated for other events.

Take a duplication-like event (-, + orientation on the same chromosome). If it's a simple duplication, the expected VAF is 1/3 for a diploid sample (ref chromatid + ref&var chromatid).

Options for implementing are:

  • forcing genotyping off 'simplest' event, and get it wrong for more complex events (e.g. translocation to the same chromosome look like a simple DEL or DUP but have different expected VAFs)
  • defer genotyping to only be supported after derivative chromosome reconstruction (e.g. as done by LINX)
  • Expose this complexity back to the user via command-line arguments?
  • Report alternative solutions in custom VCF fields with the 'simplest' in the main field.

@GeoMicroSoares
Copy link

Hi there, has this not been followed up? Would be beyond useful to have something like this available...

@d-cameron
Copy link
Member

Unfortunately, this has not progressed in GRIDSS itself. Thus far, our efforts have been focused on complex event reconstruction (e.g. breakage-fusion-bridge) and derivate chromsome reconstruction. Since this also incorporates CNA data, it is done by LINX (https://www.biorxiv.org/content/10.1101/781013v1), which is designed for human tumour data.

@d-cameron
Copy link
Member

I think we can ignore the non-diploid regions when we genotype the SVs, we can only genotype the SVs in diploid regions. If necessary, we can genotype the SVs in non-diploid regions independently (e.g., the SVs in mitochondrion).

A breakpoint causes a change in copy number at the break junction so the approach of "only genotype the SVs in diploid regions" doesn't work since at least one side of the break junction is not going to be diploid. The only exception to this is if there is a compensatory breakpoint at the same position (e.g. inversion).

To me, this request is similar to the ones leading to simple-event-annotation.R. I'm trying to avoid making the same mistake I did with that: creating simple implementation that was frequently incorrect and increasing the support burden (simple-event-annotation.R is no longer included in GRIDSS releases) due to people using it in situations it was not designed for. Properly genotyping structural variants is really an exercise in derivate chromosome reconstruction - a much more difficult problem than just fitting a model.

I know it's not the answer you're looking for, but GRIDSS does include sufficient annotation at each breakpoint that a simple genotyping model can be applied for the simple cases (the most trivial being REF=0 -> homozygous).

@d-cameron
Copy link
Member

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1), and some other SV callers can give the genotypes, such as manta and lumpy.

If there's no CN change, it's probably not a deletion so a caller shouldn't call and genotype it as such. VCFv4.4 will include the SVCLAIM field (samtools/hts-specs#517) which will assist in clarifying what claim a SV caller is making. For a deletion, SVCLAIM=BP is a breakpoint claim, SVCLAIM=CN is a copy number loss claim, and SVCLAIM=CNBP is an actual deletion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants