Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect SNP representation when it overlaps a deletion #272

Open
ymcki opened this issue Feb 23, 2024 · 3 comments
Open

Incorrect SNP representation when it overlaps a deletion #272

ymcki opened this issue Feb 23, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@ymcki
Copy link

ymcki commented Feb 23, 2024

When I was trying to run vcfeval with the same vcf generated by Clair3 using the following command, I find that I am getting
3518 out of 5440881 FPs/FNs

./rtg vcfeval -t ~/genome/hs38.sdf -b sample.vcf.gz -c sample.vcf.gz --ref-overlap -o sample

Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure

2.000            5440881        5440881       3518       3518     0.9994       0.9994     0.9994
 None            5440881        5440881       3518       3518     0.9994       0.9994     0.9994

Upon looking at the FPs/FNs, I noticed that quite many of them are caused by incorrect SNP calls when a deletion overlaps an SNP.
For example:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:39,18:0.3103:.

By loading haplotagged bam to IGV, I can see that the deletion is on one haplotype and the SNP is on the other haplotype.. So the correct call should be 1|0 for the deletion and 1|2 for the SNP where 2 is for the missing allele * based on VCF spec.
https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele

So the corrected calls should be:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A,
7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:39,18:0.3103:.

and vcfeval doesn't flag them as FPs/FNs.

By assuming the hetero deletion calls are correct, I fixed the SNP calls with the following awk commands,
awk '!/^#/&&length($4)>length($5){split($10,a,":");d=length($4)-length($5);if(a[1]=="0|1"||a[1]=="1|0"||a[1]=="0/1"){for(i=1;i<=d;++i){printf "%s\t%d\t%s\n",$1,$2+i,a[1]}}}' sample.vcf > sample.del
awk 'BEGIN{b["0|1"]="1|2";b["1|0"]="2|1";b["0/1"]="1/2";while(getline<"sample.del"){a[sprintf("%s:%d",$1,$2)]=b[$3]}}/^#/{print}!/^#/&&length($4)!=length($5){print}!/^#/&&length($4)==length($5){p=sprintf("%s:%d",$1,$2);if(a[p]!=""){$5=sprintf("%s,*",$5);$10=sprintf("%s%s",a[p],substr($10,4))}OFS="\t";print}' sample.vcf > sample_fixed.vcf

The number of FPs/FNs is reduced to 942 which improves precision/sensitivity from 99.9354% to 99.9827%

In the remaining 942, I noticed that 295 of them are caused by overlapping homo deletion and homo SNP calls. The rest are more complex calls.

It would be great if Clair3 can fix these problems in the future release. It will help greatly in benchmarking and also reducing the false homo SNP calls that are often ignored by clinicians. Thank you very much for your time.

@aquaskyline aquaskyline added the enhancement New feature or request label Feb 23, 2024
@aquaskyline
Copy link
Member

aquaskyline commented Feb 23, 2024

A fix will be in the next version (v1.0.6~~v1.0.7TBD)

To double confirm the correct calls should be:
chr1 4620711 . AAAAT A 6.41 PASS 0/1
chr1 4620715 . T A,* 7.2 PASS 1/2

Correct?

@ymcki
Copy link
Author

ymcki commented Feb 23, 2024

A fix will be in the next version (v1.0.6)

To double confirm the correct calls should be: chr1 4620711 . AAAAT A 6.41 PASS 0/1 chr1 4620715 . T A,* 7.2 PASS 1/2

Correct?

Yes, this should make vcfeval happy.

@aquaskyline
Copy link
Member

Many thanks for the detailed analysis and report.

@aquaskyline aquaskyline changed the title Incorrect SNP call when a deletion overlaps an SNP Incorrect SNP representation when it overlaps a deletion Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants