Switch VT -> BCFtools for normalization - Structural Variant normalization #1014

Open
davmlaw opened this issue Mar 28, 2024 · 12 comments

Comments

@davmlaw
Contributor

davmlaw commented Mar 28, 2024

VT doesn't handle symbolic alts, perhaps we should switch to using bcftools

I raised this as a problem and they only fixed it for DELs

samtools/bcftools#1919

Find out the version affected, we should probably also raise an issue for DUPs

@davmlaw
Contributor Author

davmlaw commented Apr 2, 2024

I raised an issue for <DUP> but in the meantime - perhaps we should just convert symbolic variants into explicit before normalization?
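
As a rough illustration of the "convert symbolic to explicit" idea, here is a minimal Python sketch. It assumes the usual VCF convention that POS is the anchor base just before the affected region and that END is the last affected base. The toy `REFERENCE` dict and the names `fetch` and `symbolic_to_explicit` are hypothetical, not part of the project code.

```python
# Hypothetical toy reference; a real pipeline would read from an indexed FASTA.
REFERENCE = {"chrTest": "ACGTTTGCA"}


def fetch(chrom: str, start: int, end: int) -> str:
    """Return the reference sequence for 1-based inclusive [start, end]."""
    return REFERENCE[chrom][start - 1:end]


def symbolic_to_explicit(chrom: str, pos: int, alt: str, end: int):
    """Turn a symbolic <DEL>/<DUP> record into explicit (ref, alt) alleles.

    pos is the anchor base before the affected region, end is the last
    affected base (assumed VCF conventions).
    """
    anchor = fetch(chrom, pos, pos)
    affected = fetch(chrom, pos + 1, end)
    if alt == "<DEL>":
        # The deleted bases appear in REF, only the anchor remains in ALT
        return anchor + affected, anchor
    if alt == "<DUP>":
        # A tandem duplication is an insertion of the copied bases after the anchor
        return anchor, anchor + affected
    raise ValueError(f"Unsupported symbolic alt: {alt}")
```

Explicit alleles for large structural variants can get very long, so this is probably only practical below some size limit.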

@davmlaw
Contributor Author

davmlaw commented May 9, 2024

BCFTools added support for del/dup recently

It seems like both tools don't look for SVLEN when handling symbolic variants:

BCFTools

VT

So I have disabled uniq for now

@davmlaw
Contributor Author

davmlaw commented May 23, 2024

This has been fixed in bcftools, I think we can replace vt with bcftools in vcf_preprocess now:

http://samtools.github.io/bcftools/howtos/install.html

Should be able to look at file history or commented out bits to add it back

@davmlaw
Contributor Author

davmlaw commented May 24, 2024

Working in branch feature/issue_1014_bcftools_normalize

TODO:

  • --check-ref=w dies due to contigs not being mentioned - need to handle it eg test data "17:7676584 G>AA"
  • Add check for bcftools version that normalizes uniques etc properly (deploy_check command?)
  • Modify install instructions etc, no longer need vt
  • Need to read in the norm/multiallelic changes etc (VT used to modify INFO)
REF_MISMATCH	NC_000017.10	7676584	G	C
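
For the version check item in the TODO list, one possible shape is sketched below. It relies on `bcftools --version` printing a first line like `bcftools 1.19`. The `MIN_BCFTOOLS_VERSION` value is a placeholder (the issue does not say which release contains the fixes), and the function names are hypothetical.

```python
import re
import subprocess
from typing import Tuple

# Placeholder threshold - the issue does not state the required release
MIN_BCFTOOLS_VERSION = (1, 20)


def parse_bcftools_version(version_output: str) -> Tuple[int, ...]:
    """Extract the version tuple from the text printed by `bcftools --version`."""
    match = re.search(r"^bcftools (\d+(?:\.\d+)*)", version_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("Could not find a bcftools version line")
    return tuple(int(part) for part in match.group(1).split("."))


def bcftools_version_ok(minimum: Tuple[int, ...] = MIN_BCFTOOLS_VERSION) -> bool:
    """Run bcftools and compare its version against the minimum."""
    result = subprocess.run(
        ["bcftools", "--version"], capture_output=True, text=True, check=True
    )
    return parse_bcftools_version(result.stdout) >= minimum
```

Keeping the parsing separate from the subprocess call means the parsing can be unit tested without bcftools installed.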

davmlaw added a commit that referenced this issue May 24, 2024
davmlaw added a commit that referenced this issue May 27, 2024
@davmlaw
Contributor Author

davmlaw commented May 28, 2024

I should also write some tests for all of the different SV norm/unique fixes, to make sure they stay fixed and that they make it through an import pipeline

@davmlaw
Contributor Author

davmlaw commented May 29, 2024

Need to modify ModifiedImportedVariant for bcftools

At the moment the modified imported variant is attached to "preprocess VCF", which doesn't have a tool set. I will move it to "normalize" so we can determine whether it was done against VT or BCFTools going forward

@EmmaTudini
Contributor

@davmlaw Will this be turned off in the next deploy, if we're disabling BCF tools? And what is prod currently doing?

@davmlaw
Contributor Author

davmlaw commented Jun 17, 2024

@EmmaTudini - there are 2 separate things here:

  • VCF normalization - to make sure variants in inconsistent input representation end up stored as 1 consistent format in the database (left align, smallest representation etc) - prod currently uses "VT" for this
  • Liftover - conversion between genome builds. Prod currently uses ClinGen allele registry (and just converting MT as it's the same contig on both builds)
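
To make "left align, smallest representation" concrete, here is a small Python sketch of the general left-align-and-trim approach on a toy sequence. This is not the actual VT or BCFTools code. The `normalize` function and the toy reference are made up for illustration, and handling at the start of a contig is left out (it just raises).

```python
def normalize(seq: str, pos: int, ref: str, alt: str):
    """Left-align and trim one biallelic variant.

    seq is the contig sequence, pos is 1-based. Returns (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("ref and alt are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


TOY_SEQ = "TGCACACAG"
# Two different ways of writing the same "CA" deletion in the CA repeat
print(normalize(TOY_SEQ, 6, "ACA", "A"))
print(normalize(TOY_SEQ, 2, "GCAC", "GC"))
```

Both inputs describe the same change, and both end up as the same record, which is what lets the database store one variant instead of two.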

VT was the first tool to introduce normalization but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018

BCFTools is an extremely popular package (samtools is universally used in next gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization - there is no turning this on/off, it's the only one we will use going forward. It should behave the same as VT (except for working correctly for Structural Variant normalization - ie this issue)

For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings, and the plan is to enable it in prod in the release after next

@TheMadBug
Member

@davmlaw for my understanding, can you show me in relation to something like
https://test.shariant.org.au/classification/hgvs_resolution_tool?genome_build=GRCh38&hgvs=NM_033071.3%28SYNE1%29%3Ac.21711G%3EA
where BCFTools comes into it? As opposed to what pyhgvs or BioCommons does?

I see in the liftover pipeline there's a step normalize that uses bcf tools, and a step vcf_clean_and_filter that also uses it.

And previously a similar step was being performed by VT?

So does pyhgvs or BioCommons resolve to a variant coordinate, then bcf tools resolves that variant coordinate to its more normalised consistent format (and in many cases would leave it alone if it's already in its best representation)?

@davmlaw
Contributor Author

davmlaw commented Jun 20, 2024

BCFTools is doing what VT did - normalize VCF files (left aligned)

HGVS standard requires normalization (3' - ie "stranded downstream") which is done in its own libraries (pyHGVS and Biocommons)
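
The difference between the two conventions can be shown on a toy repeat. This is only a sketch in genomic (forward strand) coordinates, ignoring transcript strand, and the function names are hypothetical. It is not how pyHGVS or Biocommons implement it.

```python
def shift_deletion_left(seq: str, start: int, end: int):
    """Move a deletion of 1-based inclusive [start, end] as far 5' as possible (VCF style)."""
    while start > 1 and seq[start - 2] == seq[end - 1]:
        start -= 1
        end -= 1
    return start, end


def shift_deletion_right(seq: str, start: int, end: int):
    """Move the same deletion as far 3' as possible (HGVS style on the forward strand)."""
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


TOY_SEQ = "TGCACACAG"
# A "CA" deletion written in the middle of the CA repeat (positions 5-6)
print(shift_deletion_left(TOY_SEQ, 5, 6))
print(shift_deletion_right(TOY_SEQ, 5, 6))
```

Deleting either interval gives the same resulting sequence, so the same event gets different coordinates depending on whether it is written as VCF or as HGVS.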

@davmlaw
Contributor Author

davmlaw commented Jun 21, 2024

I don't think we need to fix this as no prod environment would have ingested symbolic variants yet, but I found one in vg test that got moved after normalization:

NC_000021.9:g.43095483_43104346del

4966530 - existing one
5117588 - g.HGVS is NC_000021.9:g.5001137_5010000del records being "normalized by bcftools NC_000021.9|43095482|A|<DEL>"

@davmlaw
Contributor Author

davmlaw commented Jun 23, 2024

Discussed in Shariant meeting - to test how bcftools normalization differs from what VT produced for existing variants, we should dump out variants from the DB, run them through bcftools and look for changes. This has been split off into a separate issue for testing: SACGF/variantgrid_private#3658

Other than the above, testing for this involves:

  • Make sure that normalization still occurs as it used to
  • Structural variants (dels/dups) are now normalized
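
The "dump, normalize, look for changes" comparison could look something like the sketch below. It assumes the dumped VCF has one alt per record, that the ID column carries the database variant id, and that the "after" text is the output of running `bcftools norm` with a `--fasta-ref` over the dump. The function names are hypothetical.

```python
from typing import Dict, Tuple

Coordinate = Tuple[str, int, str, str]


def parse_records(vcf_text: str) -> Dict[str, Coordinate]:
    """Map the VCF ID column to (chrom, pos, ref, alt), skipping header lines."""
    records = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, variant_id, ref, alt = line.split("\t")[:5]
        records[variant_id] = (chrom, int(pos), ref, alt)
    return records


def changed_by_normalization(before_vcf: str, after_vcf: str):
    """Return {id: (before, after)} for records whose coordinates changed."""
    before = parse_records(before_vcf)
    after = parse_records(after_vcf)
    return {
        variant_id: (before[variant_id], after[variant_id])
        for variant_id in before
        if variant_id in after and after[variant_id] != before[variant_id]
    }
```

Anything this reports is a variant that was stored in a form VT accepted but bcftools would now move, which is the set that needs a manual look.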

@davmlaw davmlaw changed the title Structural Variant normalization BCFtools normalization - Structural Variant normalization Jun 23, 2024
@davmlaw davmlaw changed the title BCFtools normalization - Structural Variant normalization Switch VT -> BCFtools for normalization - Structural Variant normalization Jun 24, 2024