Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

At least some of our VCFs are invalid according to the VCF spec. #265

Open
grendeloz opened this issue Jun 14, 2021 · 7 comments
Open

At least some of our VCFs are invalid according to the VCF spec. #265

grendeloz opened this issue Jun 14, 2021 · 7 comments

Comments

@grendeloz
Copy link
Contributor

The bug

golang VCF libraries will not read out VCFs.

According to a strict reading of the last few version of the VCF spec (https://samtools.github.io/hts-specs/VCFv4.2.pdf, https://samtools.github.io/hts-specs/VCFv4.3.pdf) the INFO field has a prescribed format which we are departing from in a number of ways:

  1. Our Source element is not enclosed in double quotes.
  2. Our Version element is not enclosed in double quotes.
  3. We have added a non-standard element - FileDate

To Reproduce

I am using VCF fbe3b136-dc8b-4c8d-bde3-a6390c91b521.vcf from the COLO-829 analysis analysis_fbe3b136-dc8b-4c8d-bde3-a6390c91b521 for testing my code. The following two lines demonstrate the problems shown above - the first line has all 3 problems and the second line has problems 1 and 2:

##INFO=<ID=GERM,Number=2,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source=/mnt/lustre/reference/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf,FileDate=null>
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership",Source=/mnt/lustre/reference/dbsnp/141/00-All.vcf,Version=141>

If I cut out the first 1000 lines form this VCF and rectify all 3 problems, then the VCF will parse.

Expected behavior

The golang library appears to be applying the VCF spec strictly and not allowing for the addition of user-defined fields in INFO lines however the spec does not explicitly allow for user-defined fields in INFO lines so I think we should stop using them.

I'm guessing qannotate may be adding these but wherever it is, I'd like to have the quoting fixed. And for the FileDate, we could make Source a composite field that also contained the file date, for example:

##INFO=<ID=GERM,Number=2,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source=/mnt/lustre/reference/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf;2021-06-14">

If we go down the composite field path, I would suggest that we use semi-colon as the separator because comma is the default separator between subfields within an INFO field.

@ChristinaXu2017
Copy link
Contributor

ChristinaXu2017 commented Jun 14, 2021

It seems the FileDate wasn't a user-defined field, it should be part of "Source". Because it is missing double quotes, so accidentally becomes an extra field. No doubt we should put double quotes to all string values which allows space etc.

the comma inside a quoted string may be ok. But seme colon also good between file name an date inside Source.

@holmeso
Copy link
Contributor

holmeso commented Jun 14, 2021

What golang command are you using to test the validity of the vcf?

@holmeso
Copy link
Contributor

holmeso commented Jun 14, 2021

So it looks like there was a change on the 7th of June 2018 that removed the Source, FileDate and Version attributes from the INFO field header.

@holmeso
Copy link
Contributor

holmeso commented Jun 14, 2021

Running vcftools' vcf-validator against the vcf file in question throws a number of errors (this vcf harks back to the days where we would ampersand delimit values from samples run on different callers) but doesn't complain about the header.

If there is a way to be more thorough in our vcf validation (perhaps by using your golang command) then we should incorporate that into our tests.

@grendeloz
Copy link
Contributor Author

Sorry, not a golang command - I'm using the github.com/brentp/vcfgo package (like a class but not OO) in my own code and it was borking. It's been working fine on the toy VCFs I use for testing but it didn't like the full Colo-829 VCF I pulled down for testing at proper scale.

@grendeloz
Copy link
Contributor Author

Ollie - any idea on why we removed the Source, FileDate and Version? The FileDate is obviously non-standard but the other 2 are fine as long as they are double quoted. And after a more careful reading of the the spec, I don't think my proposed compound Source field would have worked anyway - it seems pretty clear that a simple value like dbsnp is what is expected in that field.

@holmeso
Copy link
Contributor

holmeso commented Jun 14, 2021

Ollie - any idea on why we removed the Source, FileDate and Version? The FileDate is obviously non-standard but the other 2 are fine as long as they are double quoted. And after a more careful reading of the the spec, I don't think my proposed compound Source field would have worked anyway - it seems pretty clear that a simple value like dbsnp is what is expected in that field.

I can't remember JP. It could just have been a mistake (I had merged a long running qsnp branch that commented out the Source/Version/FileDate attributes). If you would like the Source and Version to make it back in then please say the word. We would need to identify which INFO fields this should be applied to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants