Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sfm --haplotypecaller introduces duplicate lines in BAM #51

Open
ntm opened this issue Jun 30, 2021 · 2 comments
Open

sfm --haplotypecaller introduces duplicate lines in BAM #51

ntm opened this issue Jun 30, 2021 · 2 comments

Comments

@ntm
Copy link

ntm commented Jun 30, 2021

Hi,

I am testing elPrep, mostly for the variant-calling step: hoping to use it as a drop-in replacement for GATK4. I am using a small test BAM, paired-end reads, some mates map to different chromosomes. I noticed that in sfm mode, every read whose mate maps to a different chrom gets duplicated in the output.bam. This doesn't affect the GVCFs produced by elPrep, but the resulting BAM is not spec-compliant, and I suspect this may affect other tools working on the elPrep-produced BAMs. Details below.

Run elprep in sfm mode:
elprep sfm testin.bam testout_sfm.bam --haplotypecaller test_sfm.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Run elprep in filter mode:
elprep filter testin.bam testout_filter.bam --haplotypecaller test_filter.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Compare small BAMs sfm vs filter:
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 testout_sfm.bam --in2 testout_filter.bam
-> testout_sfm.bam contains duplicate lines for each read whose mate maps to a different chromosome.

Question: does this impact the elprep GVCFs?
zdiff test_sfm.g.vcf.gz test_filter.g.vcf.gz
-> only diff is the header ##elPrepCommandLine , no consequence for the elprep variant-caller.

The bug doesn't occur if I don't call variants:
elprep filter testin.bam ttt_filter.bam
elprep sfm testin.bam ttt_sfm.bam
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 ttt_sfm.bam --in2 ttt_filter.bam
-> no difference

The SAM spec says:
[QNAME] In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
[FLAG] For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’
-> the elPrep-produced BAMs with duplicate lines doesn't seem compliant (AFAICT?)

Regards,
Nicolas

@caherzee
Copy link
Contributor

Hi Nicolas,

It is a bug I am aware of, and have already fixed it internally, but I have not released the fix yet.

You are correct that the bug only occurs when using the haplotype caller and that VCF output is not affected. It is a bug that occurs because the final bam is incorrectly merged from the intermediate split files.

I expect to release a new version of elPrep soon.

Thanks for your patience.

Kind regards,
Charlotte

@matthdsm
Copy link
Contributor

Hi Charlotte,

Was this fixed in the latest release?

Thanks
M

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants