Skip to content

Add support for VCFv4.4 / VCFv4.5 Number= fields #1874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

daviesrob
Copy link
Member

Adds BCF_VL_* types for Number= P, LA, LG, LR and M, and updates bcf_hdr_register_hrec() so that it sets them.

Note that this changes handling of unrecognised Number= values. Previously htslib would attempt to treat these as numbers, and as it did not check for sscanf() failure they were stored as type BCF_VL_FIXED with count 0xfffff. This commit makes it print a warning and store the type as BCF_VL_VAR instead.

TODO:

  • Add tests
  • Check more header types in bcf_hdr_check_sanity()
  • Make bcf_remove_allele_set() work with local alleles

@jkbonfield
Copy link
Contributor

I dealt with the 2nd of your TODO action items. The first (tests) turns out to be irrelevant currently as there are no checks in the code either, so all test files pass regardless. It needs upgrading to add proper record validation, but that will come with a CPU cost so it needs more thought as to how to do this. Best done in a following PR I suspect.

My header validation permits Number=. as I see this used in the wild in numerous places. Indeed Bcftools own tests also fail on this all over the place, commonly having AD defined as Number=. for example.

Bcftools still has some warnings with the extra checks:

test/setGT.5.vcf has ##FORMAT=<ID=AD,Number=A,Type=Integer,Description="Alleles"> (should be "R") and test/view64bit.5.vcf has ##INFO=<ID=END,Number=A,Type=Integer,Description="dummy"> (should be "1").

We probably need @pd3 to look at these and comment whether these are widely used alternatives that fly against the specification. Technically a tool could look at AD,Number=A and derive the REF depth hasn't been reported, but the spec doesn't formally permit such nuance. Equally so it's possible that with multiple alleles END may have multiple values. It's not permitted, but perhaps that's an oversight in the spec and people have been doing this for years?

We can omit such things from validation if they are widely accepted spec breakages.

Note this doesn't produce errrors and the function has void return. It won't spam either as each error is reported at maximum once (not even once per file, but once per use of htslib).

@jkbonfield
Copy link
Contributor

Note I would like to be more aggressive on the newer 4.4 and 4.5 types, like LPL needing Number=LG, but ironically in Gnomad (and probably others) these are the cases where we use Number=. to avoid problems with parsers not supporting 4.5.

We could add a version header check perhaps so we only validate on specific versions. I'll look into seeing how easy that is to do. It'll make things a bit messier though.

Part of me thinks maybe just removing the BCF_VL_VAR loop-hole and causing more warnings to appear may just be the right thing. The data is incorrect according to the spec and the consequence is nothing more than a one line warning. Is it not correct that people are told their data is wrongly formatted, as long as we still continue with best-efforts as we've always done? However for the time being and in the interests of keeping the peace it's being lenient as people have got away with this in the past.

daviesrob and others added 3 commits April 17, 2025 15:39
Adds BCF_VL_* types for Number= P, LA, LG, LR and M

Upgrades bcf_hdr_register_hrec() so that it sets the appropriate
value when it finds one of the new types in a FORMAT header line.

Only set number type BCF_VL_FIXED if the value can actually be
parsed as an integer.  Unknown types will now cause a warning
to be printed and will be treated as BCF_VL_VAR.

Adds doxygen style comments to describe the meaning of the various
BCF_VL_* values.
Note I haven't added the myraid of base modification tags as there are
so many of them.
Previously we were gracious in accepting Number=. as it turns in in
several places in the wild.  However for VCF 4.4 and 4.5 we're more
strict as these aren't common place yet and we can complain louder
about errors.

The small caveat here is using 4.4 and 4.5 defined tags in an earlier
VCF version.  This happens when for example people wish to use local
alleles.  They've been used for many years, but only finally landed in
4.5.  Hence we see VCF 4.3 using these tags and Number=. as the
correct type code didn't exist in that version.

Also added phase-set tests which were accidentally missed on the
previous commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants