-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch VT -> BCFtools for normalization - Structural Variant normalization #1014
Comments
I raised an issue for |
This has been fixed in bcftools, I think we can replace vt with bcftools in vcf_preprocess now: http://samtools.github.io/bcftools/howtos/install.html Should be able to look at file history or commented out bits to add it back |
Working in branch feature/issue_1014_bcftools_normalize TODO:
|
I should also write some tests for all of the different SV norm/unique fixes done and to just make sure they stay fixed and to they make it through an import pipeline |
Need to modify ModifiedImportedVariant for bcftools At the moment the modified imported variant is attached to "preprocess VCF" which doesn't have tool set. Will move it to "normalize" then we can determine whether it was done against VT or BCFTools going forward |
@davmlaw Will this be turned off in the next deploy, if we're disabling BCF tools? And what is prod currently doing? |
@EmmaTudini - there are 2 separate things here:
VT was the first tool to introduce normalization but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018 BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them. So, we are switching to BCFTools for normalization - there is no turning this on/off it's the only one we will use going foward, it should be the same as VT though (except for working correctly for Structural Variant Normalization - ie this issue) For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings, plan is to enable it in prod one after next release |
@davmlaw for my understanding, can you show me in relation to something like I see in the liftover pipeline there's a step And previously a similar step was being performed by VT? So does pyhgvs or BioCommons resolve to a variant coordinate, then bcf tools resolves that variant coordinate to its more normalised consistent format (and in many cases would leave it alone if it's already in its best representation)? |
BCFTools is doing what VT did - normalize VCF files (left aligned) HGVS standard requires normalization (3' - ie "stranded downstream") which is done in its own libraries (pyHGVS and Biocommons) |
I don't think we need to fix this as no prod environment would have ingested symbolic variants yet, but I found one in vg test that got moved after normalization:
|
Discussed in Shariant meeting - to test bcftools normalization difference vs existing ones with VT we should dump out variants from DB, run through bcftools and look for changes. This has been split off issue for testing: SACGF/variantgrid_private#3658 Other than the above, testing for this involves:
|
In Shariant testing, we found this: Will be able to work around it by always writing an END for symbolic variants |
I switched from using END to SVLEN on 02/21/2024 in #991 I tested whether it was the same in VEP I also used test files in bcftools normalization which were generated with END from before this change (it took a long time from me raising the issue with bcftools for it to make it into VG) Just need to put END back |
Rework to add END to This method has grown to be really unwieldy by tacking stuff on - it takes 2 different types of tuples. All callers to this use VariantCoordinate instead of tuples EXCEPT create liftover pipelines (which passes tuples_have_id_field=True) We should change it to just take VariantCoordinate, and maybe a parallel list of IDs (or maybe add an ID to VariantCoordinate?) But - need to change all the weird different kinds of tuples in the liftover code to even try this:
|
OK, have done refactoring to go full in on VariantCoordinate and get rid of weird mixed tuples Working in brach Need to test a whole bunch of stuff (see commit) |
ok, fixed going foward Still need to find/fix historical ones. It may be a real pain to fix these, may be better to just delete them manually as it has only been on test/dev systems. Something like:
Will probably need something extra to handle imported alleles etc |
@davmlaw Have commented on this issue https://github.com/SACGF/variantgrid_private/issues/3645#issuecomment-2208109928 - think the issues I've found might be related to this change |
@davmlaw In case you haven't seen, I've raised another comment on https://github.com/SACGF/variantgrid_private/issues/3645 |
I found a bug due to my incorrect calculation of Variant.END (needed by bcftools) - done in SACGF/variantgrid_private#3676 During/after that have been testing bcftools normalization pretty thoroughly and seems to all work fine |
Re-opening this issue, to help with my testing |
VT doesn't handle symbolic alts, perhaps we should switch to using bcftools
I raised this as a problem and they only fixed it for DELs
samtools/bcftools#1919
Find out the version affected, we should probably also raise an issue for DUPs
The text was updated successfully, but these errors were encountered: