You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When I was trying to run vcfeval with the same vcf generated by Clair3 using the following command, I find that I am getting
3518 out of 5440881 FPs/FNs
Upon looking at the FPs/FNs, I noticed that quite many of them are caused by incorrect SNP calls when a deletion overlaps an SNP.
For example:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:39,18:0.3103:.
So the corrected calls should be:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A, 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:39,18:0.3103:.
and vcfeval doesn't flag them as FPs/FNs.
By assuming the hetero deletion calls are correct, I fixed the SNP calls with the following awk commands,
awk '!/^#/&&length($4)>length($5){split($10,a,":");d=length($4)-length($5);if(a[1]=="0|1"||a[1]=="1|0"||a[1]=="0/1"){for(i=1;i<=d;++i){printf "%s\t%d\t%s\n",$1,$2+i,a[1]}}}' sample.vcf > sample.del
awk 'BEGIN{b["0|1"]="1|2";b["1|0"]="2|1";b["0/1"]="1/2";while(getline<"sample.del"){a[sprintf("%s:%d",$1,$2)]=b[$3]}}/^#/{print}!/^#/&&length($4)!=length($5){print}!/^#/&&length($4)==length($5){p=sprintf("%s:%d",$1,$2);if(a[p]!=""){$5=sprintf("%s,*",$5);$10=sprintf("%s%s",a[p],substr($10,4))}OFS="\t";print}' sample.vcf > sample_fixed.vcf
The number of FPs/FNs is reduced to 942 which improves precision/sensitivity from 99.9354% to 99.9827%
In the remaining 942, I noticed that 295 of them are caused by overlapping homo deletion and homo SNP calls. The rest are more complex calls.
It would be great if Clair3 can fix these problems in the future release. It will help greatly in benchmarking and also reducing the false homo SNP calls that are often ignored by clinicians. Thank you very much for your time.
The text was updated successfully, but these errors were encountered:
When I was trying to run vcfeval with the same vcf generated by Clair3 using the following command, I find that I am getting
3518 out of 5440881 FPs/FNs
./rtg vcfeval -t ~/genome/hs38.sdf -b sample.vcf.gz -c sample.vcf.gz --ref-overlap -o sample
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
Upon looking at the FPs/FNs, I noticed that quite many of them are caused by incorrect SNP calls when a deletion overlaps an SNP.
For example:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:39,18:0.3103:.
By loading haplotagged bam to IGV, I can see that the deletion is on one haplotype and the SNP is on the other haplotype.. So the correct call should be 1|0 for the deletion and 1|2 for the SNP where 2 is for the missing allele * based on VCF spec.
https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele
So the corrected calls should be:
chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268
chr1 4620715 . T A, 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:39,18:0.3103:.
and vcfeval doesn't flag them as FPs/FNs.
By assuming the hetero deletion calls are correct, I fixed the SNP calls with the following awk commands,
awk '!/^#/&&length($4)>length($5){split($10,a,":");d=length($4)-length($5);if(a[1]=="0|1"||a[1]=="1|0"||a[1]=="0/1"){for(i=1;i<=d;++i){printf "%s\t%d\t%s\n",$1,$2+i,a[1]}}}' sample.vcf > sample.del
awk 'BEGIN{b["0|1"]="1|2";b["1|0"]="2|1";b["0/1"]="1/2";while(getline<"sample.del"){a[sprintf("%s:%d",$1,$2)]=b[$3]}}/^#/{print}!/^#/&&length($4)!=length($5){print}!/^#/&&length($4)==length($5){p=sprintf("%s:%d",$1,$2);if(a[p]!=""){$5=sprintf("%s,*",$5);$10=sprintf("%s%s",a[p],substr($10,4))}OFS="\t";print}' sample.vcf > sample_fixed.vcf
The number of FPs/FNs is reduced to 942 which improves precision/sensitivity from 99.9354% to 99.9827%
In the remaining 942, I noticed that 295 of them are caused by overlapping homo deletion and homo SNP calls. The rest are more complex calls.
It would be great if Clair3 can fix these problems in the future release. It will help greatly in benchmarking and also reducing the false homo SNP calls that are often ignored by clinicians. Thank you very much for your time.
The text was updated successfully, but these errors were encountered: