Enhancing somatic variant calling and execution speed #22

Open
sloth-eat-pudding opened this issue Mar 30, 2024 · 5 comments

@sloth-eat-pudding

Hello,

I am working with HCC1395 data, with the tumor sample at 75x coverage and the normal sample at 45x coverage. I ran Clair3 on normal.bam to generate normal.vcf, then used that VCF to phase and haplotag tumor.bam before running a somatic mutation caller. Compared with phasing and haplotagging on the ClairS germline.vcf, false positives dropped from 15001 to 11593 (about 23%), while recall was nearly unchanged.

| phase and haplotag | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|
| ClairS germline.vcf | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| Clair3 normal.vcf | 72.50% | 77.46% | 74.90% | 30556 | 11593 | 8891 |
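
For readers who want to check the figures, the percentages in the table follow from the TP, FP, and FN counts using the standard definitions of precision, recall, and F1. The sketch below is illustrative only; the function name `precision_recall_f1` is hypothetical and is not part of ClairS or LongPhase.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) counts copied from the two rows of the table.
rows = {
    "ClairS germline.vcf": (30626, 15001, 8821),
    "Clair3 normal.vcf": (30556, 11593, 8891),
}

for name, (tp, fp, fn) in rows.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```

Because recall barely changes between the two rows, almost all of the F1 gain comes from the lower FP count.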

In one case where a false positive became a true negative, the variants were heterozygous in the normal sample but homozygous in the tumor sample. This suggests a loss of heterozygosity (LOH) event, so phasing with the normal VCF and tagging most of the tumor reads into the same haplotype appears to be the correct behavior. Have you considered this method?
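
The genotype pattern described here (heterozygous in the normal, homozygous in the tumor) can be expressed as a simple check on VCF-style GT strings. This is a minimal sketch under stated assumptions: the helper names are hypothetical, genotypes are assumed to be diploid strings such as `0/1` or `1|1`, and neither ClairS nor LongPhase is claimed to work this way internally.

```python
def parse_gt(gt: str) -> tuple[str, ...]:
    """Split a VCF GT string such as '0/1' or '1|1' into its alleles."""
    return tuple(gt.replace("|", "/").split("/"))


def is_heterozygous(gt: str) -> bool:
    alleles = parse_gt(gt)
    return "." not in alleles and len(set(alleles)) > 1


def is_homozygous(gt: str) -> bool:
    alleles = parse_gt(gt)
    return "." not in alleles and len(set(alleles)) == 1


def looks_like_loh(normal_gt: str, tumor_gt: str) -> bool:
    """True when the site is heterozygous in normal and homozygous in tumor."""
    return is_heterozygous(normal_gt) and is_homozygous(tumor_gt)


print(looks_like_loh("0/1", "1/1"))
print(looks_like_loh("0/1", "0|1"))
```

A real LOH call would also need allele fractions and regional context; the check above only captures the per-site pattern mentioned in this comment.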


I also noted in the literature that the primary reason for choosing LongPhase for phasing is its speed. LongPhase has a speed advantage in haplotagging as well. Since ClairS parallelizes at the chromosome level, we could add an option to restrict haplotagging to a specified range. Could this reduce the training costs for you? I also ran a haplotagging comparison, and the results show no significant difference between the two tools.

| haplotag | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|
| whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |

@sloth-eat-pudding changed the title from "Enhancing Mutation Variant calling and Execution Speed" to "Enhancing somatic variant calling and execution speed" on Mar 31, 2024

@aquaskyline
Member

Hi LongPhase team, thanks for asking. We spoke in more detail via email. ClairS is ready to use HP tags beyond the current HP1 and HP2; essentially, there is no limit to the number of HP categories ClairS can take. For parallelization, support for range processing sounds good. ClairS will most likely use it per chromosome.

@sloth-eat-pudding
Author

We have released version 1.7.
Haplotag now includes the --region feature.

The complete list of haplotag parameters

```
Usage:  haplotag [OPTION] ... READSFILE
      --help                          display this help and exit.

require arguments:
      -s, --snp-file=NAME             input SNP vcf file.
      -b, --bam-file=NAME             input bam file.
      -r, --reference=NAME            reference fasta.
optional arguments:
      --tagSupplementary              tag supplementary alignment. default:false
      --sv-file=NAME                  input phased SV vcf file.
      --mod-file=NAME                 input a modified VCF file (produced by longphase modcall and processed by longphase phase).
      -q, --qualityThreshold=Num      not tag alignment if the mapping quality less than threshold. default:1
      -p, --percentageThreshold=Num   the alignment will be tagged according to the haplotype corresponding to most alleles.
                                      if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
      -t, --threads=Num               number of thread. default:1
      -o, --out-prefix=NAME           prefix of phasing result. default:result
      --region=REGION                 tagging include only reads/variants overlapping those regions. default:(all regions)
      --log                           an additional log file records the result of each read. default:false
```
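
As an illustration of the per-chromosome use discussed earlier in this thread, the sketch below builds one `haplotag` command line per chromosome using only the options listed in the help text. Several things here are assumptions rather than documented behavior: that the executable is invoked as `longphase haplotag`, that `--region` accepts a bare contig name such as `chr1`, and the file names `phased.vcf`, `tumor.bam`, and `ref.fa`, which are placeholders. The function `haplotag_commands` is hypothetical, and the commands are only printed, not executed.

```python
import shlex


def haplotag_commands(snp_vcf, bam, reference, chromosomes,
                      out_prefix="result", threads=1):
    """Build one argument list per chromosome for `longphase haplotag`."""
    commands = []
    for chrom in chromosomes:
        commands.append([
            "longphase", "haplotag",          # assumed invocation
            "-s", snp_vcf,                    # input SNP vcf file
            "-b", bam,                        # input bam file
            "-r", reference,                  # reference fasta
            "-t", str(threads),               # number of threads
            "-o", f"{out_prefix}.{chrom}",    # separate output prefix per region
            f"--region={chrom}",              # assumed region syntax: contig name
        ])
    return commands


# Placeholder inputs; replace with real paths before running anything.
for cmd in haplotag_commands("phased.vcf", "tumor.bam", "ref.fa", ["chr1", "chr2"]):
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run` or a job scheduler; giving every region its own output prefix keeps parallel jobs from writing to the same file.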

@zhengzhenxian
Collaborator

@sloth-eat-pudding
Glad to see the new release of LongPhase. Our team will test it and get back to you.

ZX

@sloth-eat-pudding
Author

In #18 (comment), it was mentioned that a "longer phaseset and an improved haplotagging ratio" were needed. I therefore tried including indels in phasing and haplotagging.

Explanation of data sources

  • snp.vcf: uses the confident heterozygous germline variants from ClairS.
  • indel.vcf: generated by Clair3 calling.
      • normal & tumor: an indel is used only if its chromosome, position, and genotype are identical in the normal and tumor calls.
      • normal all: uses all indels from the normal.
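
The "normal & tumor" selection described above amounts to intersecting the two call sets on (chromosome, position, genotype). The sketch below shows that idea on in-memory records; the `Indel` type and `shared_indels` function are hypothetical, genotype strings are compared exactly as written (so `0/1` and `0|1` would not match), and this is not the script actually used to produce the results in this thread.

```python
from typing import Iterable, NamedTuple


class Indel(NamedTuple):
    chrom: str
    pos: int
    genotype: str  # VCF-style GT string, e.g. "0/1"


def shared_indels(normal: Iterable[Indel], tumor: Iterable[Indel]) -> list[Indel]:
    """Keep normal indels whose chrom, pos, and genotype also occur in tumor."""
    tumor_keys = set(tumor)
    return [rec for rec in normal if rec in tumor_keys]


# Toy records for illustration only.
normal_calls = [Indel("chr1", 100, "0/1"), Indel("chr1", 200, "1/1"), Indel("chr2", 50, "0/1")]
tumor_calls = [Indel("chr1", 100, "0/1"), Indel("chr1", 200, "0/1"), Indel("chr3", 5, "0/1")]

print(shared_indels(normal_calls, tumor_calls))
```

The "normal all" variant skips this intersection and passes every normal indel through.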

| snp.vcf | indel.vcf | phase | haplotag | Precision | Recall | F1-score | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| ClairS | - | longphase v1.3 | whatshap v1.7 | 67.12% | 77.64% | 72.00% | 30626 | 15001 | 8821 |
| ClairS | - | longphase v1.3 | longphase v1.3 | 67.27% | 77.62% | 72.07% | 30617 | 14897 | 8830 |
| ClairS | - | longphase v1.7 | whatshap v1.7 | 67.27% | 77.62% | 72.07% | 30619 | 14899 | 8828 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | whatshap v1.7 | 67.44% | 77.57% | 72.15% | 30599 | 14770 | 8848 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.3 (no tag indel) | 67.48% | 77.57% | 72.18% | 30601 | 14745 | 8846 |
| ClairS | indel (normal & tumor) | longphase v1.7-indel | longphase v1.7 (tag indel) | 67.75% | 77.52% | 72.31% | 30578 | 14553 | 8869 |
| ClairS | indel (normal all) | longphase v1.7-indel | longphase v1.7 (tag indel) | 68.80% | 77.44% | 72.87% | 30548 | 13853 | 8899 |

Would you be interested in trying to incorporate indels as well?

@aquaskyline
Member

Yes, doing that in the next version.
