Add genotyping script #234

xxlpaladin · 2019-08-14T07:39:41Z

Hello, thanks very much for your previous answers. And I have another small question recently. Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1), and some other SV callers can give the genotypes, such as manta and lumpy. So I wonder if there are any scripts can help to give the genotypes of SVs (generated by gridss), and I know SVTyper (https://github.com/hall-lab/svtyper/) can give the genotypes of SVs (the ungenotyped SVs and bam file should be provided).
Thank you,
Best wishes.

d-cameron · 2019-08-15T02:21:47Z

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1)

This a philosophical difference between the approach taken by GRIDSS and both other callers. GRIDSS itself is a breakpoint ( /single breakend) caller. Critically, it makes no assumptions about the ploidy of the sample. Genotyping a SV requires that a) the ploidy is known, and b) genotyping the SV uder the ploidy assumption is actually meaningful. Except for a subset of simple events, the SV itself results in aneuploidy. When there is aneuploidy, genotyping under a diploid assumption does not make sense, hence why GRIDSS chooses not to do it.

That said, GRIDSS reports the most accurate variant allele fraction of any of the short read variant callers. You can genotype SVs by using the VF, REF, and REFPAIR fields.

VAF = VF / (VF + REF + REFPAIR) for variants larger than the library fragment size (I conservatively use a 1000bp threshold)

VAF = VF / (VF + REF) for variants smaller that the max library fragment size distribution (because a RP spanning the variant will be concordantly aligned if the ins/del/dup is short enough).

d-cameron · 2019-08-15T02:22:43Z

Would a small R script that genotyped under diploid assumptions be a generally useful utility?

xxlpaladin · 2019-08-15T08:36:40Z

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1)

This a philosophical difference between the approach taken by GRIDSS and both other callers. GRIDSS itself is a breakpoint ( /single breakend) caller. Critically, it makes no assumptions about the ploidy of the sample. Genotyping a SV requires that a) the ploidy is known, and b) genotyping the SV uder the ploidy assumption is actually meaningful. Except for a subset of simple events, the SV itself results in aneuploidy. When there is aneuploidy, genotyping under a diploid assumption does not make sense, hence why GRIDSS chooses not to do it.
That said, GRIDSS reports the most accurate variant allele fraction of any of the short read variant callers. You can genotype SVs by using the VF, REF, and REFPAIR fields.
VAF = VF / (VF + REF + REFPAIR) for variants larger than the library fragment size (I conservatively use a 1000bp threshold)
VAF = VF / (VF + REF) for variants smaller that the max library fragment size distribution (because a RP spanning the variant will be concordantly aligned if the ins/del/dup is short enough).

Thanks, I think I have got it. But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs? For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1. And can you give some explanations about how to confirm the threshold value of VAF?

xxlpaladin · 2019-08-15T08:54:57Z

Would a small R script that genotyped under diploid assumptions be a generally useful utility?

I think the script is very useful and necessary, because we can compare the genotypes of gridss SVs with the genotypes of the same SVs generated by other SV callers (e.g., manta, lumpy). And we can also use the genotypes for other further analyses.

d-cameron · 2019-08-15T14:11:36Z

How would you like the genotype of of non-diploid regions to be defined?

d-cameron · 2019-08-15T14:19:29Z

But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs?
There's no reason why you can't use a similar genotype model that is used by SNV callers.

For example:
https://software.broadinstitute.org/gatk/documentation/article?id=11079
or
https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf

For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1.

A cut-off at VAF=0.5 is definitely incorrect as that's right in the middle of the heterzygous distribution (at least for indel events).

And can you give some explanations about how to confirm the threshold value of VAF?

The threshold to use will come straight out of your model. You should check that it make sense emperically and if not, you'll need to tweak your error model.

xxlpaladin · 2019-08-16T01:32:02Z

How would you like the genotype of of non-diploid regions to be defined?

I think we can ignore the non-diploid regions when we genotype the SVs, we can only genotype the SVs in diploid regions. If necessary, we can genotype the SVs in non-diploid regions independently (e.g., the SVs in mitochondrion).

xxlpaladin · 2019-08-16T01:55:29Z

But do you have some suggestions about the threshold value of VAF to judge the genotype of SVs?
There's no reason why you can't use a similar genotype model that is used by SNV callers.

For example:
https://software.broadinstitute.org/gatk/documentation/article?id=11079
or
https://software.broadinstitute.org/gatk/media/docs/Samtools.pdf

For example, when VAF < 0.5, genotype is 0/1; when VAF > 0.5, genotype is 1/1.

A cut-off at VAF=0.5 is definitely incorrect as that's right in the middle of the heterzygous distribution (at least for indel events).

Thanks for your details about this, I entirely agree with you about using the genotype model of SNV to SVs, and I also think about this yesterday. The VAF=0.5 is just an example to express what I mean, it is definitely incorrect.

And can you give some explanations about how to confirm the threshold value of VAF?

The threshold to use will come straight out of your model. You should check that it make sense emperically and if not, you'll need to tweak your error model.

I generally understand but writing script to get the genotypes is a little difficult for me. If possible, can you provide a common script about genotyping SVs? Thank you.

DarioS · 2019-08-19T07:00:09Z

What about a purity-adjusted VAF instead of a genotype for your use case? I guess that it could simply be calculated as VAF / purity with a ceiling of 1.

xxlpaladin · 2019-08-20T03:23:17Z

What about a purity-adjusted VAF instead of a genotype for your use case? I guess that it could simply be calculated as VAF / purity with a ceiling of 1.

Thanks for your good suggestions, I have tried to run the latest version 'purple-2.33.jar' several times. But I think the purple pipeline is not suitable for my analysis. Because it can only be used to human, and all my samples are sheep, so do you have other suggestions?

d-cameron · 2019-10-31T07:13:36Z

Coming back to this again: the key determination is the #fragments supporting ref, and the #fragments supporting the variant. For simple deletions we can port SNV logic over but it becomes a bit more complicated for other events.

Take a duplication-like event (-, + orientation on the same chromosome). If it's a simple duplication, the expected VAF is 1/3 for a diploid sample (ref chromatid + ref&var chromatid).

Options for implementing are:

forcing genotyping off 'simplest' event, and get it wrong for more complex events (e.g. translocation to the same chromosome look like a simple DEL or DUP but have different expected VAFs)
defer genotyping to only be supported after derivative chromosome reconstruction (e.g. as done by LINX)
Expose this complexity back to the user via command-line arguments?
Report alternative solutions in custom VCF fields with the 'simplest' in the main field.

GeoMicroSoares · 2020-07-16T13:59:05Z

Hi there, has this not been followed up? Would be beyond useful to have something like this available...

d-cameron · 2020-07-17T04:23:01Z

Unfortunately, this has not progressed in GRIDSS itself. Thus far, our efforts have been focused on complex event reconstruction (e.g. breakage-fusion-bridge) and derivate chromsome reconstruction. Since this also incorporates CNA data, it is done by LINX (https://www.biorxiv.org/content/10.1101/781013v1), which is designed for human tumour data.

d-cameron · 2020-07-17T04:39:42Z

I think we can ignore the non-diploid regions when we genotype the SVs, we can only genotype the SVs in diploid regions. If necessary, we can genotype the SVs in non-diploid regions independently (e.g., the SVs in mitochondrion).

A breakpoint causes a change in copy number at the break junction so the approach of "only genotype the SVs in diploid regions" doesn't work since at least one side of the break junction is not going to be diploid. The only exception to this is if there is a compensatory breakpoint at the same position (e.g. inversion).

To me, this request is similar to the ones leading to simple-event-annotation.R. I'm trying to avoid making the same mistake I did with that: creating simple implementation that was frequently incorrect and increasing the support burden (simple-event-annotation.R is no longer included in GRIDSS releases) due to people using it in situations it was not designed for. Properly genotyping structural variants is really an exercise in derivate chromosome reconstruction - a much more difficult problem than just fitting a model.

I know it's not the answer you're looking for, but GRIDSS does include sufficient annotation at each breakpoint that a simple genotyping model can be applied for the simple cases (the most trivial being REF=0 -> homozygous).

d-cameron · 2020-07-17T04:44:42Z

Why gridss can not generate the genotypes of SVs (i.e., 0/1,1/1), and some other SV callers can give the genotypes, such as manta and lumpy.

If there's no CN change, it's probably not a deletion so a caller shouldn't call and genotype it as such. VCFv4.4 will include the SVCLAIM field (samtools/hts-specs#517) which will assist in clarifying what claim a SV caller is making. For a deletion, SVCLAIM=BP is a breakpoint claim, SVCLAIM=CN is a copy number loss claim, and SVCLAIM=CNBP is an actual deletion.

d-cameron closed this as completed Aug 15, 2019

d-cameron added help wanted wontfix labels Aug 15, 2019

d-cameron removed the wontfix label Aug 15, 2019

d-cameron added enhancement and removed help wanted labels Aug 15, 2019

d-cameron changed the title ~~Question about genotyping of SV.~~ Add genotyping script Aug 15, 2019

d-cameron reopened this Aug 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add genotyping script #234

Add genotyping script #234

xxlpaladin commented Aug 14, 2019

d-cameron commented Aug 15, 2019

d-cameron commented Aug 15, 2019

xxlpaladin commented Aug 15, 2019

xxlpaladin commented Aug 15, 2019

d-cameron commented Aug 15, 2019

d-cameron commented Aug 15, 2019

xxlpaladin commented Aug 16, 2019

xxlpaladin commented Aug 16, 2019

DarioS commented Aug 19, 2019

xxlpaladin commented Aug 20, 2019

d-cameron commented Oct 31, 2019

GeoMicroSoares commented Jul 16, 2020

d-cameron commented Jul 17, 2020

d-cameron commented Jul 17, 2020

d-cameron commented Jul 17, 2020

Add genotyping script #234

Add genotyping script #234

Comments

xxlpaladin commented Aug 14, 2019

d-cameron commented Aug 15, 2019

d-cameron commented Aug 15, 2019

xxlpaladin commented Aug 15, 2019

xxlpaladin commented Aug 15, 2019

d-cameron commented Aug 15, 2019

d-cameron commented Aug 15, 2019

xxlpaladin commented Aug 16, 2019

xxlpaladin commented Aug 16, 2019

DarioS commented Aug 19, 2019

xxlpaladin commented Aug 20, 2019

d-cameron commented Oct 31, 2019

GeoMicroSoares commented Jul 16, 2020

d-cameron commented Jul 17, 2020

d-cameron commented Jul 17, 2020

d-cameron commented Jul 17, 2020