Switch VT -> BCFtools for normalization - Structural Variant normalization #1014

Open
davmlaw opened this issue Mar 28, 2024 · 21 comments

@davmlaw
Contributor

davmlaw commented Mar 28, 2024

VT doesn't handle symbolic alts, perhaps we should switch to using bcftools

I raised this as a problem and they only fixed it for DELs

samtools/bcftools#1919

Find out the version affected, we should probably also raise an issue for DUPs

@davmlaw
Contributor Author

davmlaw commented Apr 2, 2024

I raised an issue for <DUP> but in the meantime - perhaps we should just convert symbolic variants into explicit before normalization?
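
A minimal sketch of what "convert symbolic into explicit" could look like, assuming POS is the 1-based anchor base before the event and the contig is available as an in-memory string. The function name and the in-memory reference are hypothetical (not VariantGrid code); a real pipeline would read bases from an indexed FASTA and would probably cap the size, since explicit alleles for large SVs get very long.

```python
def symbolic_to_explicit(pos: int, alt: str, svlen: int, contig_seq: str) -> tuple:
    """Return explicit (REF, ALT) for a <DEL> or <DUP> record.

    Hypothetical helper: pos is the 1-based VCF POS (the anchor base before
    the event) and contig_seq is the whole contig as a string.
    """
    length = abs(svlen)
    anchor = contig_seq[pos - 1]
    affected = contig_seq[pos:pos + length]  # bases POS+1 .. POS+length
    if len(affected) != length:
        raise ValueError("event runs past the end of the contig")
    if alt == "<DEL>":
        # REF carries the anchor plus the deleted bases, ALT keeps only the anchor
        return anchor + affected, anchor
    if alt == "<DUP>":
        # Insert a second copy of the duplicated bases right after the anchor
        return anchor, anchor + affected
    raise ValueError(f"unsupported symbolic alt: {alt}")
```

The explicit records could then go through the normal left-align step like any other indel.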

@davmlaw
Contributor Author

davmlaw commented May 9, 2024

BCFTools added support for del/dup recently

It seems like both tools don't look for SVLEN when handling symbolic variants:

BCFTools

VT

So I have disabled uniq for now

@davmlaw
Contributor Author

davmlaw commented May 23, 2024

This has been fixed in bcftools, I think we can replace vt with bcftools in vcf_preprocess now:

http://samtools.github.io/bcftools/howtos/install.html

Should be able to look at file history or commented out bits to add it back

@davmlaw
Contributor Author

davmlaw commented May 24, 2024

Working in branch feature/issue_1014_bcftools_normalize

TODO:

  • --check-ref=w dies due to contigs not being mentioned - need to handle it eg test data "17:7676584 G>AA"
  • Add check for bcftools version that normalizes uniques etc properly (deploy_check command?)
  • Modify install instructions etc, no longer need vt
  • Need to read in the norm/multiallelic changes etc (VT used to modify INFO)
REF_MISMATCH	NC_000017.10	7676584	G	C
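
For the version check TODO, a rough sketch of how a deploy check might parse `bcftools --version` (whose first line has the form `bcftools 1.20`). The minimum version below is a placeholder, since this thread does not say which release contains the fixes, and all function names are hypothetical.

```python
import re
import subprocess

# Placeholder minimum: the thread does not state which release has the fixes.
MIN_BCFTOOLS_VERSION = (1, 20)


def parse_bcftools_version(version_output: str) -> tuple:
    """Extract (major, minor, ...) from the output of `bcftools --version`."""
    match = re.search(r"^bcftools (\d+(?:\.\d+)*)", version_output, re.MULTILINE)
    if match is None:
        raise ValueError("could not find a bcftools version string")
    return tuple(int(part) for part in match.group(1).split("."))


def bcftools_is_new_enough(version_output: str, minimum: tuple = MIN_BCFTOOLS_VERSION) -> bool:
    """Compare versions as integer tuples so that 1.9 sorts before 1.20."""
    return parse_bcftools_version(version_output) >= minimum


def installed_bcftools_version_output() -> str:
    """Run the real binary; requires bcftools on PATH."""
    result = subprocess.run(["bcftools", "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Comparing integer tuples rather than strings avoids "1.9" being treated as newer than "1.20".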

davmlaw added a commit that referenced this issue May 24, 2024
davmlaw added a commit that referenced this issue May 27, 2024
@davmlaw
Contributor Author

davmlaw commented May 28, 2024

I should also write tests for all of the different SV norm/unique fixes, to make sure they stay fixed and that they make it through an import pipeline

@davmlaw
Contributor Author

davmlaw commented May 29, 2024

Need to modify ModifiedImportedVariant for bcftools

At the moment the modified imported variant is attached to "preprocess VCF" which doesn't have tool set. Will move it to "normalize" then we can determine whether it was done against VT or BCFTools going forward

@EmmaTudini
Contributor

@davmlaw Will this be turned off in the next deploy, if we're disabling BCF tools? And what is prod currently doing?

@davmlaw
Contributor Author

davmlaw commented Jun 17, 2024

@EmmaTudini - there are 2 separate things here:

  • VCF normalization - to make sure variants in inconsistent input representation end up stored as 1 consistent format in the database (left align, smallest representation etc) - prod currently uses "VT" for this
  • Liftover - conversion between genome builds. Prod currently uses ClinGen allele registry (and just converting MT as it's the same contig on both builds)
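
To illustrate what "left align, smallest representation" means for the normalization bullet, here is a small sketch of the usual trim-and-left-extend procedure for a single biallelic variant. It is an illustration only, not what VT or BCFTools literally run, and the in-memory contig string and function name are assumptions.

```python
def normalize_variant(pos: int, ref: str, alt: str, contig_seq: str) -> tuple:
    """Left-align and trim one biallelic variant (pos is 1-based)."""
    if ref == alt:
        return pos, ref, alt  # not a variant, nothing to do
    changed = True
    while changed:
        changed = False
        # Drop a shared last base
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = contig_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Finally drop shared leading bases, keeping at least one base per allele
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, on the toy contig `GCACACAT`, the deletion written as `3 ACA>A` and the one written as `1 GCA>G` describe the same change, and both end up as `1 GCA>G`.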

VT was the first tool to introduce normalization but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018

BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization. There is no turning this on/off; it's the only one we will use going forward. It should behave the same as VT, except that it works correctly for Structural Variant normalization (i.e. this issue)

For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings, plan is to enable it in prod one after next release

@TheMadBug
Member

@davmlaw for my understanding, can you show me in relation to something like
https://test.shariant.org.au/classification/hgvs_resolution_tool?genome_build=GRCh38&hgvs=NM_033071.3%28SYNE1%29%3Ac.21711G%3EA
where BCFTools comes into it? As opposed to what pyhgvs or BioCommons does?

I see in the liftover pipeline there's a step normalize that uses bcf tools, and a step vcf_clean_and_filter that also uses it.

And previously a similar step was being performed by VT?

So does pyhgvs or BioCommons resolve to a variant coordinate, then bcf tools resolves that variant coordinate to its more normalised consistent format (and in many cases would leave it alone if it's already in its best representation)?

@davmlaw
Contributor Author

davmlaw commented Jun 20, 2024

BCFTools is doing what VT did - normalize VCF files (left aligned)

HGVS standard requires normalization (3' - ie "stranded downstream") which is done in its own libraries (pyHGVS and Biocommons)
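
As a toy illustration of the two conventions (not code from either library), the sketch below shifts the same deletion fully left, as VCF normalization does, and fully right, as the HGVS 3' rule does for a forward-strand g. description. Coordinates are 0-based half-open and the function name is hypothetical.

```python
def shift_deletion(seq: str, start: int, end: int, direction: str) -> tuple:
    """Shift the deletion of seq[start:end] as far as possible in one direction."""
    if direction == "left":
        # Moving one base left is allowed when the base entering the window
        # equals the base leaving it
        while start > 0 and seq[start - 1] == seq[end - 1]:
            start, end = start - 1, end - 1
    elif direction == "right":
        while end < len(seq) and seq[start] == seq[end]:
            start, end = start + 1, end + 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start, end
```

On `GCACACAT`, deleting the middle `CA` can be reported at either end of the repeat. Both placements produce the same resulting sequence, which is why each format needs its own fixed convention.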

@davmlaw
Contributor Author

davmlaw commented Jun 21, 2024

I don't think we need to fix this as no prod environment would have ingested symbolic variants yet, but I found one in vg test that got moved after normalization:

NC_000021.9:g.43095483_43104346del

4966530 - existing one
5117588 - g.HGVS is NC_000021.9:g.5001137_5010000del records being "normalized by bcftools NC_000021.9|43095482|A|<DEL>"

@davmlaw
Contributor Author

davmlaw commented Jun 23, 2024

Discussed in Shariant meeting - to test bcftools normalization difference vs existing ones with VT we should dump out variants from DB, run through bcftools and look for changes. This has been split off issue for testing: SACGF/variantgrid_private#3658
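
The "dump, normalize, look for changes" comparison could be sketched roughly as below. This assumes the dump writes each variant's database primary key into the VCF ID column, that records are biallelic, and that the ID survives normalization; the function names are hypothetical.

```python
def read_records(vcf_lines) -> dict:
    """Map the ID column to (chrom, pos, ref, alt), skipping header lines."""
    records = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        records[variant_id] = (chrom, int(pos), ref, alt)
    return records


def changed_by_normalization(before_lines, after_lines) -> dict:
    """Return {id: (before, after)} for records whose coordinate changed."""
    before = read_records(before_lines)
    after = read_records(after_lines)
    return {
        variant_id: (before[variant_id], after[variant_id])
        for variant_id in before
        if variant_id in after and before[variant_id] != after[variant_id]
    }
```

Anything returned here is a variant that the existing database stores differently from how bcftools would normalize it.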

Other than the above, testing for this involves:

  • Make sure that normalization still occurs as it used to
  • Structural variants (dels/dups) are now normalized

@davmlaw davmlaw changed the title Structural Variant normalization BCFtools normalization - Structural Variant normalization Jun 23, 2024
@davmlaw davmlaw changed the title BCFtools normalization - Structural Variant normalization Switch VT -> BCFtools for normalization - Structural Variant normalization Jun 24, 2024
@davmlaw
Contributor Author

davmlaw commented Jul 1, 2024

In Shariant testing, we found this:

samtools/bcftools#2216

Will be able to work around it by always writing an END for symbolic variants
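
A sketch of that workaround, assuming the usual VCF convention for `<DEL>`/`<DUP>` that POS is the base before the event, so END = POS + |SVLEN|. The helper name is hypothetical and other symbolic types are deliberately rejected rather than guessed at.

```python
def symbolic_info(pos: int, alt: str, svlen: int) -> str:
    """Build an INFO string with both END and SVLEN for <DEL>/<DUP>.

    Assumes pos is the 1-based anchor base before the event.
    """
    if alt not in ("<DEL>", "<DUP>"):
        raise ValueError(f"END rule not defined here for {alt}")
    end = pos + abs(svlen)
    svtype = alt.strip("<>")
    return f"SVTYPE={svtype};END={end};SVLEN={svlen}"
```

As a sanity check against the variant mentioned earlier in this issue: `NC_000021.9:g.43095483_43104346del` is 8864 bases long with POS 43095482, and 43095482 + 8864 gives END 43104346, the last deleted base.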

@davmlaw
Contributor Author

davmlaw commented Jul 1, 2024

I switched from using END to SVLEN on 02/21/2024 in #991

I tested whether it was the same in VEP

I also used test files in bcftools normalization which were generated with END from before this change (it took a long time from me raising the issue with bcftools for it to make it into VG)

Just need to put END back

davmlaw added a commit that referenced this issue Jul 1, 2024
@davmlaw
Contributor Author

davmlaw commented Jul 1, 2024

Rework to add END to write_vcf_from_tuples

This method has grown to be really unwieldy by tacking stuff on - it takes 2 different types of tuples. All callers to this use VariantCoordinate instead of tuples EXCEPT create liftover pipelines (which passes tuples_have_id_field=True)

We should change it to just take VariantCoordinate, and maybe a parallel list of IDs (or maybe add an ID to VariantCoordinate?)
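
One possible shape for that, as a sketch only: a single coordinate type plus a parallel list of IDs. `Coord` below is a hypothetical stand-in, not the real `VariantCoordinate` (whose fields are not shown in this thread), and the `end` field is an assumption about where END would live.

```python
from typing import NamedTuple, Optional, Sequence


class Coord(NamedTuple):
    """Hypothetical stand-in for VariantCoordinate."""
    chrom: str
    position: int
    ref: str
    alt: str
    end: Optional[int] = None  # only set for symbolic alts


def vcf_lines_from_coordinates(coords: Sequence[Coord], ids: Sequence[str]) -> list:
    """Return VCF text lines for coords, using ids for the ID column."""
    if len(coords) != len(ids):
        raise ValueError("coords and ids must be the same length")
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for coord, variant_id in zip(coords, ids):
        info = "." if coord.end is None else f"END={coord.end}"
        lines.append("\t".join([
            coord.chrom, str(coord.position), str(variant_id),
            coord.ref, coord.alt, ".", ".", info,
        ]))
    return lines
```

With one input type there is no need for a flag like `tuples_have_id_field`; callers that have no IDs could pass a list of `"."`.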

But - need to change all the weird different kinds of tuples in the liftover code to even try this:

Allele.get_liftover_tuple
snpdb.liftover._get_build_liftover_tuples

@davmlaw
Contributor Author

davmlaw commented Jul 2, 2024

OK, have done refactoring to go full in on VariantCoordinate and get rid of weird mixed tuples

Working in branch feature/issue_1014_refactor_variant_tuple_add_end

Need to test a whole bunch of stuff (see commit)

@davmlaw
Contributor Author

davmlaw commented Jul 3, 2024

ok, fixed going forward

Still need to find/fix historical ones. It may be a real pain to fix these, may be better to just delete them manually as it has only been on test/dev systems. Something like:

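from snpdb.models import Variant
from annotation.models import AnnotationRangeLock

# Symbolic variants (SVLEN set) that were modified on import and ended up at position 1
bad_norm = Variant.objects.filter(modifiedimportedvariant__isnull=False, locus__position=1, svlen__isnull=False)
for v in bad_norm:
    # Release the variant from its annotation range lock, then delete it
    AnnotationRangeLock.release_variant(v)
    print(v.delete())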

Will probably need something extra to handle imported alleles etc

@EmmaTudini
Contributor

@davmlaw Have commented on this issue https://github.com/SACGF/variantgrid_private/issues/3645#issuecomment-2208109928 - think the issues I've found might be related to this change

@EmmaTudini
Contributor

@davmlaw In case you haven't seen, I've raised another comment on https://github.com/SACGF/variantgrid_private/issues/3645

davmlaw added a commit that referenced this issue Jul 4, 2024
@davmlaw
Contributor Author

davmlaw commented Aug 9, 2024

I found a bug due to my incorrect calculation of Variant.END (needed by bcftools) - done in SACGF/variantgrid_private#3676

During and after that I have been testing bcftools normalization pretty thoroughly, and it all seems to work fine

@davmlaw davmlaw closed this as completed Aug 18, 2024
@EmmaTudini
Contributor

Re-opening this issue, to help with my testing

@EmmaTudini EmmaTudini reopened this Aug 23, 2024