This small release contains only minor fixes and the following improvements:
If the header line in a VCF file contains several samples with the same name, it is now flagged as an error, as recently clarified in the VCF specification.
Warnings are now logged if there are unused parameters in the command used to run any of the tools. Thanks @srbcheema1 for the contributions!
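The duplicate sample name check mentioned above can be illustrated with a short Python sketch. It is not the validator's own implementation; the function name `duplicate_sample_names` is hypothetical, and the sketch assumes a tab-separated `#CHROM` header line with sample names starting at the tenth column.

```python
from collections import Counter


def duplicate_sample_names(header_line):
    """Return the sample names that occur more than once in a #CHROM header line."""
    columns = header_line.rstrip("\r\n").split("\t")
    # Columns 1-8 are fixed (CHROM to INFO), column 9 is FORMAT,
    # and sample names start at column 10.
    samples = columns[9:]
    counts = Counter(samples)
    return [name for name, count in counts.items() if count > 1]


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA001\tNA002\tNA001"
print(duplicate_sample_names(header))
```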
A new tool has been added to the suite! This one checks that the REF column in a VCF matches the sequence contained in a FASTA file, and reports any mismatches in a summary or plain text file, in a similar fashion to the VCF validator reporting. A new report type that only outputs the valid lines is also included in this tool.
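As a rough illustration of the kind of comparison such a tool performs, the Python sketch below checks one REF allele against an in-memory reference. It is an assumption-laden sketch rather than the tool's actual logic: the function name `ref_matches`, the dictionary of sequences, and the case-insensitive comparison are all hypothetical choices made for this example.

```python
def ref_matches(sequences, chrom, pos, ref):
    """Check that `ref` equals the reference sequence at a 1-based position.

    `sequences` maps contig names to their full sequence, as if loaded
    from a FASTA file (loading is out of scope for this sketch).
    """
    sequence = sequences.get(chrom)
    if sequence is None:
        # Unknown contig: treated as a mismatch here (an assumption).
        return False
    start = pos - 1  # VCF positions are 1-based
    return sequence[start:start + len(ref)].upper() == ref.upper()


reference = {"1": "ACGTACGT"}
print(ref_matches(reference, "1", 2, "CG"))
print(ref_matches(reference, "1", 2, "CT"))
```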
We have also added support for Windows, making the suite compatible with the 3 major operating systems. Please be aware that you will need to decompress your files before validating them on Windows due to a known issue.
You can find the binaries for all versions, ready for direct download, attached to these notes.
MacOS users can now run the validation suite in their favorite OS, without needing Docker or admin permissions. Just copy the linked executable to your machine and run it in exactly the same way as on Linux. Please let us know if you find any compatibility issues by creating a bug report.
The validator can also read files compressed in multiple formats without the need for a pipe. You can find instructions in the updated README file.
Thanks to @srbcheema1 for these contributions!
The validator can now check fields specific to the gVCF extension. This includes <*> alternate alleles and how they relate to the END INFO field and sample genotypes.
Following some user reports (#101, #102) of incorrect counts being expected for FORMAT fields with Number=G, we confirmed with the specification that their cardinality depends on the ploidy of each sample genotype and not only on the ALT column. The issue should be solved now, but if you find any problems please open a new ticket!
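In the VCF specification, Number=G means one value per possible genotype, so the expected count follows from both the number of alleles and the ploidy of the sample. A minimal Python sketch of that count is shown below; the function name `expected_number_g` is hypothetical and this is not the validator's own code.

```python
from math import comb


def expected_number_g(num_alt_alleles, ploidy):
    """Number of possible genotypes for a sample of the given ploidy.

    Genotypes are unordered selections (with repetition) of `ploidy`
    alleles out of the REF allele plus all ALT alleles.
    """
    num_alleles = num_alt_alleles + 1  # include the REF allele
    return comb(num_alleles + ploidy - 1, ploidy)


print(expected_number_g(1, 2))  # diploid sample, one ALT allele
print(expected_number_g(1, 1))  # haploid sample, one ALT allele
```

With one ALT allele, a diploid sample has three possible genotypes while a haploid sample has two, which is why the ALT column alone cannot determine the count.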
This version also introduces some usability improvements. The biggest is a summary report in addition to the existing text and database outputs. This is human-readable and lists each type of error detected, the number of times it occurred, and the first line where it was observed.
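The aggregation behind such a summary can be sketched in a few lines of Python. The data layout here is an assumption for illustration only (a list of hypothetical `(line_number, message)` pairs), and the real report format may differ.

```python
def summarize(errors):
    """Group errors by message, keeping a count and the first line seen.

    `errors` is an iterable of (line_number, message) pairs in file order.
    """
    summary = {}
    for line_number, message in errors:
        if message not in summary:
            summary[message] = {"count": 0, "first_line": line_number}
        summary[message]["count"] += 1
    return summary


errors = [
    (12, "Duplicated ID"),
    (40, "Invalid INFO value"),
    (57, "Duplicated ID"),
]
for message, stats in summarize(errors).items():
    print(f"{message}: {stats['count']} time(s), first at line {stats['first_line']}")
```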
The --version option now reports which version of the validator you are running. Please note that in vcf-validator 0.4 and earlier this option was used to indicate which version of the specification the input file should match.
And finally, the validator now warns the user if the input is compressed, instead of reporting a confusing list of errors.
You can download the Linux binaries using the links, and also visit this page if you are interested in the full list of changes.
It has been a really productive summer thanks to @Anishka0107, the Google Summer of Code student who has improved the support for structural variants in the validator and the debugulator.
She has added new metadata validations to ensure that INFO and FORMAT fields match the header definition, and that said header matches the VCF specification itself. These validations apply not only to short variants but also to structural variation tags, which hadn't been fully supported until now!
She also expanded the checks (added in the previous version) that guarantee no duplicate values in the ID and FORMAT columns in a single line, to also include the FILTER and INFO columns. The debugulator can now automatically fix these duplicates, as well as the values assigned to some INFO tags (see #78 for more details).
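A duplicate check over these four columns could look roughly like the Python sketch below. It is only an illustration: the helper names are hypothetical, and for the INFO column the sketch assumes that "duplicate" means a repeated key rather than a repeated key=value pair.

```python
def find_duplicates(values):
    """Return the values that appear more than once, in order of detection."""
    seen, duplicates = set(), []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def duplicates_in_record(line):
    """Map column name to duplicated entries for one tab-separated VCF record."""
    fields = line.rstrip("\r\n").split("\t")
    found = {
        "ID": find_duplicates(fields[2].split(";")),
        "FILTER": find_duplicates(fields[6].split(";")),
        # INFO entries are key=value pairs or flags; compare keys only.
        "INFO": find_duplicates([e.split("=", 1)[0] for e in fields[7].split(";")]),
    }
    if len(fields) > 8:  # FORMAT is only present when there are samples
        found["FORMAT"] = find_duplicates(fields[8].split(":"))
    return {column: dups for column, dups in found.items() if dups}


record = "1\t100\trs1;rs1\tA\tT\t50\tq10;q10\tDP=5;DP=6;AF=0.5\tGT:DP:DP\t0/1:3:3"
print(duplicates_in_record(record))
```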
The last phase of GSoC was more focused on the purely technical aspects of the project: cleaning up the code, improving the documentation and slightly simplifying the grammar that detects syntax errors.
Please download the Linux binaries using the links below, and visit this page if you are interested in the full list of changes.
This version simplifies the integration of the validation tool in automated pipelines, detecting the version of the VCF file before running the validation. This also prevents errors from being raised due to involuntary mismatches between the command line argument and the file.
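Version detection of this kind relies on the `##fileformat` line that opens a VCF file. The Python sketch below shows one way to read it; the function name and the regular expression are illustrative assumptions, not the validator's actual code.

```python
import re

# The first line of a VCF file declares its version, e.g. "##fileformat=VCFv4.3".
FILEFORMAT_RE = re.compile(r"^##fileformat=VCFv(\d+\.\d+)\s*$")


def detect_vcf_version(first_line):
    """Return the version string declared in the fileformat line, or None."""
    match = FILEFORMAT_RE.match(first_line)
    return match.group(1) if match else None


print(detect_vcf_version("##fileformat=VCFv4.3\n"))
```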
New checks have also been included, to guarantee that no duplicate values are present in the ID and FORMAT columns in a single line. These checks are only applicable to version 4.3 of the specification!
The binaries can be downloaded using the links below.
The VCF specification allows the GT field to be omitted from the FORMAT column, but if present it must be the first field. This release solves an issue that was making the validator raise a misleading error if GT was not present.
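The rule can be expressed as a small predicate. The Python sketch below is illustrative only, with a hypothetical function name, and treats a FORMAT column without GT as valid.

```python
def gt_position_ok(format_column):
    """GT is optional in FORMAT, but must come first when it is listed."""
    keys = format_column.split(":")
    return "GT" not in keys or keys[0] == "GT"


print(gt_position_ok("GT:DP"))  # GT present and first
print(gt_position_ok("DP:GQ"))  # GT absent, which is allowed
print(gt_position_ok("DP:GT"))  # GT present but not first
```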
This maintenance release solves a couple of issues reported for version 0.4.1:
- Only a single value was considered valid as CIGAR field in the INFO column, when it should be a list as long as the number of alternate alleles. Thanks @sambrightman for your pull request!
- Errors due to the lack of newline characters at the end of the file were not properly reported.
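For the CIGAR fix in the first item above, the expected cardinality can be sketched as follows. This Python snippet is an illustration under the assumption that both ALT and CIGAR are comma-separated lists; the function name is hypothetical.

```python
def cigar_count_matches(alt_column, cigar_value):
    """CIGAR should provide one entry per alternate allele."""
    num_alts = len(alt_column.split(","))
    return len(cigar_value.split(",")) == num_alts


print(cigar_count_matches("T,G", "1M,1M"))
print(cigar_count_matches("T,G", "1M"))
```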
This maintenance release solves memory issues reported for version 0.4.
New dependencies were added to make it possible to detect more complex errors, but the amount of memory consumed grew without bound. This has been solved and memory usage now remains constant at less than 10 MB of RAM.
The new executables, compatible with any Linux version, can be downloaded using the links below.
In addition to the removal of duplicate variants introduced in the previous release, errors in the INFO and samples columns can now be fixed by removing the faulty field from the column. For instance, if an INFO value looks like
AN=123;AF=not_a_frequency;DP=345, the fix would transform it into AN=123;DP=345.
Other improvements included in this version are:
- Support for genomic ploidy different from 2
- Ensuring all the variants that don't require fixing are written after running the vcf-debugulator
- Simplified build process using a Docker image (recommended for developers only)
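The INFO fix described at the start of these notes, removing a faulty field from the column, can be sketched in Python as below. This is not the vcf-debugulator's implementation: both function names are hypothetical, and the validity rule (AF values must parse as numbers) is a simplified assumption chosen to match the example.

```python
def is_valid_entry(entry):
    """Hypothetical validity rule: every AF value must parse as a number."""
    key, _, value = entry.partition("=")
    if key == "AF":
        for item in value.split(","):
            try:
                float(item)
            except ValueError:
                return False
    return True


def drop_faulty_info(info):
    """Remove INFO entries that fail validation, keeping the rest in order."""
    kept = [entry for entry in info.split(";") if is_valid_entry(entry)]
    # An empty INFO column is written as "." in VCF.
    return ";".join(kept) if kept else "."


print(drop_faulty_info("AN=123;AF=not_a_frequency;DP=345"))
```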
You can download the executables using the links below.