Structural Variant normalization and Representation #818

Closed
davmlaw opened this issue May 9, 2023 · 6 comments
Labels
CNV enhancement New feature or request

davmlaw commented May 9, 2023

Split out of #54 Structural Variant - copy number variation

It seems to be possible to write the same underlying genetic mutation in different ways using VCF files:

Say you have the same contig/position/ref base but the ALT is different:

alt=<DEL> FORMAT={GT: 0/1} # 1 copy of deletion
alt=<CNV> FORMAT={FC: 0.5} # 0.5 fold change is 1 copy deleted

The downside of <CNV> is that there is no representation in HGVS. Maybe it's possible via ISCN?

I don't think you can represent it as SPDI unless you just use pure numbers for the last 2 fields

Would it be a good idea to convert the <CNV> to a <DEL> if FC < 1 and a <DUP> if FC > 1?
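A minimal sketch of that conversion, assuming FC is a fold change relative to a baseline of 1.0 (no change). The function name and the choice to leave FC == 1 as <CNV> are illustrative assumptions, not existing VariantGrid code.

```python
def symbolic_alt_from_fold_change(fold_change: float) -> str:
    """Map a <CNV> fold change to a more specific symbolic ALT.

    Assumes fold_change is relative to a baseline of 1.0 (no change).
    """
    if fold_change < 0:
        raise ValueError(f"Fold change cannot be negative: {fold_change}")
    if fold_change < 1:
        return "<DEL>"  # loss relative to baseline
    if fold_change > 1:
        return "<DUP>"  # gain relative to baseline
    return "<CNV>"  # no change, so nothing more specific to convert to


# The FC=0.5 record from this issue would become a <DEL>
print(symbolic_alt_from_fold_change(0.5))
```

This only picks the ALT; it does not decide how many copies were gained, which is the open question about 4 copies being a DUP.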

They do NOT annotate the same, i.e. the DEL is impact=HIGH while the CNV is impact=MODIFIER

chr13   32890596        .       A       <CNV>   1       PASS    CN=2;EVENT=CNV_47;SVTYPE=BND;VARTYPE=copy_number_variation;END=32972909;ANT=BRCA2       FC      0.5
chr13   32890596        .       A       <DEL>   1       PASS    CN=2;EVENT=CNV_47;SVTYPE=DEL;VARTYPE=copy_number_variation;END=32972909;ANT=BRCA2       FC      0.5

Advantages of leaving it are:

  • Multi-sample entries from the same VCF will be against the same variant - ie you can see that some patients had loss, some had gain at the same site.

Advantages of normalizing it are:

  • CNVs from different VCFs (represented in diff ways) are linked together in the database
  • If you convert from <CNV> to <DEL/DUP> then you can represent as HGVS

I guess it comes down to whether you think a deletion or dup at the same site are the same - I would say not?

TODO: We should check how <CNV> vs <DEL> vs <DUP> are handled as VEP annotation - does it matter?

Does it make sense to call it a DUP if there are say 4 copies now? Is that just homozygous for a DUP?

gnomAD structural variants include CNV (as well as INS/DUP etc)

1	46000	nssv15780791	T	<CNV>	.	.	DBVARID=nssv15780791;SVTYPE=CNV;END=111250;EXPERIMENT=1;SAMPLESET=1;REGIONID=nsv4514092;AC=977;AFR_AC=161;AMR_AC=84;EAS_AC=5;EUR_AC=716;OTH_AC=11;AF=0.09479;AFR_AF=0.035439;AMR_AF=0.114286;EAS_AF=0.004139;EUR_AF=0.192163;OTH_AF=0.115789;AN=10307;AFR_AN=4543;AMR_AN=735;EAS_AN=1208;EUR_AN=3726;OTH_AN=95
@davmlaw davmlaw added the enhancement New feature or request label May 9, 2023
davmlaw commented May 10, 2023

VT (abandoned?) and bcftools don't normalize symbolic variants - raised issue samtools/bcftools#1919

I think this means we may need to convert symbolic variants to explicit ref/alt then normalize them and then convert back.

An issue would be how to make sure we re-load a dup as a dup rather than an ins (keep track of old allele in INFO?)
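A rough sketch of that round trip for a deletion, using a toy reference string. The function names and the ORIGINAL_ALT INFO key are made up for illustration; a real implementation would read the contig from a FASTA and run VT or bcftools norm between the two steps.

```python
def expand_symbolic_del(reference: str, pos: int, end: int) -> dict:
    """Expand a <DEL> at 1-based POS..END into explicit REF/ALT.

    `reference` is the contig sequence as a plain string (toy data here).
    POS is the anchor base before the deleted bases, as in VCF.
    """
    if not 1 <= pos < end <= len(reference):
        raise ValueError(f"Bad coordinates: pos={pos}, end={end}")
    ref = reference[pos - 1:end]  # anchor base plus the deleted bases
    alt = reference[pos - 1]      # anchor base only
    # ORIGINAL_ALT is a hypothetical INFO key used to restore the symbolic allele
    return {"pos": pos, "ref": ref, "alt": alt, "info": {"ORIGINAL_ALT": "<DEL>"}}


def restore_symbolic(record: dict) -> dict:
    """Convert a (normalized) explicit record back to its symbolic form."""
    end = record["pos"] + len(record["ref"]) - 1
    return {
        "pos": record["pos"],
        "ref": record["ref"][0],
        "alt": record["info"]["ORIGINAL_ALT"],
        "info": {"END": end},
    }


explicit = expand_symbolic_del("ACGTACGTAC", pos=3, end=6)
print(explicit["ref"], explicit["alt"])  # explicit alleles handed to the normalizer
print(restore_symbolic(explicit))
```

Carrying the original symbolic ALT in INFO is what would let a DUP come back as a DUP rather than an INS, since the explicit alleles alone cannot tell the two apart.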

@davmlaw davmlaw changed the title CNV normalization CNV normalization and Representation May 16, 2023
davmlaw commented May 16, 2023

END

Should end be max(len(ref), len(alt)) or abs(len(ref) - len(alt))? If you are looking at overlaps, the max seems better?
Make sure our calc from ref/alt is the same as END=XXX or length=XXX
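To make the two candidates concrete, here is a small sketch that computes both lengths and checks a ref-derived END against an INFO END. The helper names are hypothetical. The END formula (POS + len(REF) - 1) is the VCF default for explicit alleles.

```python
def allele_lengths(ref: str, alt: str) -> dict:
    """Return both candidate length definitions for an explicit ref/alt pair."""
    return {
        "max": max(len(ref), len(alt)),
        "abs_diff": abs(len(ref) - len(alt)),
    }


def end_from_ref(pos: int, ref: str) -> int:
    """Last reference base covered by the allele (1-based, inclusive)."""
    return pos + len(ref) - 1


def end_matches_info(pos: int, ref: str, info_end: int) -> bool:
    """Check that the END derived from ref agrees with END=XXX in INFO."""
    return end_from_ref(pos, ref) == info_end


# Deletion of 3 bases after an anchor base at position 100
print(allele_lengths("ACGT", "A"))
print(end_from_ref(100, "ACGT"))
# Insertion of 3 bases: same two lengths, but only 1 reference base is covered
print(allele_lengths("A", "ACGT"))
print(end_from_ref(100, "A"))
```

Note that max counts the anchor base, so it is one larger than the number of bases actually deleted, and for an insertion it describes the ALT rather than the span on the reference.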

HISTORICAL

If > 1kb then describe as a CNV - should we go through historical data and convert variants to CNVs?

Can find long ones via Sequence.objects.order_by("-length").first().variant_set.all().get().pk

NORMALIZATION

We should convert CNV to DUP
Need to also convert indels to raw strings (but what max size??) so they are properly normalized by VT or bcftools (eg NC_000003.11:g.128204049_128206714del)

---------

NC_000003.11:g.128204049_128206714del

As it is > 1kb - it should be represented in the DB as
But when we create it, we need to normalize - which means writing it out as a normal variant, and then running it through VT etc

We need to do this when creating a variant -

if length >= settings.VARIANT_STRUCTURAL_SIZE:
    # Keep only the anchor base; the ALT becomes a symbolic allele
    ref = ref[0]
    if hgvs_name.mutation_type in ('del', 'dup', 'ins'):
        alt = f"<{hgvs_name.mutation_type.upper()}>"
    else:
        raise ValueError(f"Don't know how to handle symbolic variant of type {hgvs_name.mutation_type}")

We’ll then run this VCF through VT (do anyway?) so it will get normalized

Normalize CNVs

May have to explicitly write out the ref/alt - run through VT
Then load in, convert to CNV inside the VCF loader?

When we are calling write_vcf_from_tuples - I think we actually want to write the proper ref-alt as

write_vcf_from_tuples

---------

Search for “chrom, position, ref”

get_results_from_variant_tuples

NC_000003.11:g.128204049_128206714del - can't create: the error says it needs an end

Variant = 3:128204042 > T

In [11]: name.get_vcf_coords()
Out[11]: ('NC_000003.11', 128204048, 128206714)

Example large CNV normalization

'NC_000003.11:g.128204049_128206714del' resolved to 3:128204042 TCT...GCC>T

The ClinGen Allele Registry agrees: it also normalizes to the same start

"hgvs":["NC_000003.11:g.128204049_128206714del","CM000665.1:g.128204049_128206714del"],
"chromosome":"3",
"coordinates":[{"end":128206708,
"start":128204042,

SPDI

NC_000003.12:128481050:GG:TT
NC_000003.12:128481050:1:TT

NC_000003.11:128204048:2666: see REST API call

{
  "data": {
    "hgvs": "NC_000003.11:g.128204049_128206714del"
  }
}
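The coordinate arithmetic behind the get_vcf_coords() output and the SPDI string above can be sketched as follows. This is a regex-based illustration that only handles the g.START_ENDdel form; the helper names are hypothetical and this is not the HGVS library's own parser.

```python
import re
from typing import Tuple

# Only matches genomic range deletions such as NC_000003.11:g.128204049_128206714del
HGVS_DEL = re.compile(r"^(?P<contig>[^:]+):g\.(?P<start>\d+)_(?P<end>\d+)del$")


def parse_hgvs_del(hgvs: str) -> Tuple[str, int, int]:
    """Return (contig, start, end) with 1-based inclusive HGVS coordinates."""
    match = HGVS_DEL.match(hgvs)
    if match is None:
        raise ValueError(f"Not a g. range deletion: {hgvs}")
    return match["contig"], int(match["start"]), int(match["end"])


def vcf_coords(hgvs: str) -> Tuple[str, int, int]:
    """VCF-style coords: POS is the anchor base before the first deleted base."""
    contig, start, end = parse_hgvs_del(hgvs)
    return contig, start - 1, end


def spdi_for_del(hgvs: str) -> str:
    """SPDI with a 0-based position, deleted length, and empty insertion."""
    contig, start, end = parse_hgvs_del(hgvs)
    deleted_length = end - start + 1
    return f"{contig}:{start - 1}:{deleted_length}:"


name = "NC_000003.11:g.128204049_128206714del"
print(vcf_coords(name))    # ('NC_000003.11', 128204048, 128206714)
print(spdi_for_del(name))  # NC_000003.11:128204048:2666:
```

The VCF anchor position and the 0-based SPDI position happen to be the same number here, because both are one less than the first deleted base in 1-based HGVS coordinates.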

However, the HGVS call doesn't appear to normalize: it has the inserted sequence "CTCCCG" but the same prefix on the deleted sequence

EmmaTudini commented Mar 19, 2024

@davmlaw Just found this slide deck: https://drive.google.com/file/d/1w4uIdBJX6duSDni8XlOc84tF4h1CbMa_/view
It seems the ClinGen Allele Registry is now accepting CNVs? Might be something to look into

davmlaw commented Mar 19, 2024

We use CAR for unique IDs, unfortunately they say:

Unlike smaller variants supported by the core Registry functionality, CNVs are not canonicalized

We'd also have to use their own invented format. Strangely, it supports this:

GRCh37 (chr7:132348499-132359000)x1

But not this:

NC_000007.13:g.132348499_132359000del

Gives: VariationTooLong
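If we did go that way, building the accepted string from coordinates is simple. The sketch below reproduces only the one shape shown in this comment; the function name is hypothetical, and the format has not been checked against the Registry's documentation beyond this example.

```python
def car_cnv_expression(build: str, chrom: str, start: int, end: int, copies: int) -> str:
    """Format a CNV in the 'BUILD (chrN:START-END)xCOPIES' shape shown above."""
    if start > end:
        raise ValueError(f"start ({start}) is after end ({end})")
    if copies < 0:
        raise ValueError(f"copies cannot be negative: {copies}")
    return f"{build} (chr{chrom}:{start}-{end})x{copies}"


print(car_cnv_expression("GRCh37", "7", 132348499, 132359000, 1))
```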

@davmlaw davmlaw added this to the SA Path VG 4 milestone May 9, 2024
@davmlaw davmlaw removed this from the SA Path VG 4 milestone May 9, 2024
@davmlaw davmlaw changed the title CNV normalization and Representation Structural Variant normalization and Representation May 10, 2024
@davmlaw davmlaw added this to the SA Path VG 4 milestone May 21, 2024
davmlaw commented May 23, 2024

Converting CNV to DUP (which we need to do for TSO500) is handled in issue https://github.com/SACGF/variantgrid_sapath/issues/304

The other variant normalization etc is now handled in #1014

@davmlaw davmlaw closed this as completed May 23, 2024
TheMadBug added a commit that referenced this issue Jun 18, 2024

* issue #818 - don't uniq on preprocess anymore
