**Table of contents**<a id='toc0_'></a>    
- [Normalizer Performance Analysis](#toc1_)    
  - [Import relevant packages](#toc1_1_)    
  - [Dictionaries to map variants to categories and record category counts](#toc1_2_)    
  - [CIViC](#toc1_3_)    
  - [MOA](#toc1_4_)    
  - [ClinVar](#toc1_5_)    
  - [Computing Coverage](#toc1_6_)    
  - [Generating Table](#toc1_7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Normalizer Performance Analysis](#toc0_)

This notebook analyzes the normalizer's performance on the CIViC, MOA, and ClinVar data.

## <a id='toc1_1_'></a>[Import relevant packages](#toc0_)

In [1]:
import os
import json
import re
import sys
from enum import IntEnum

import numpy as np
import pandas as pd
import plotly.graph_objects as go

module_path = os.path.abspath(os.path.join(".."))
if module_path not in sys.path:
    sys.path.append(module_path)



In [2]:
# Import NOT_SUPPORTED_VARIANT_CATEGORY_VALUES from utils.py and remove TRANSCRIPT_VAR
from utils import NotSupportedVariantCategory, NOT_SUPPORTED_VARIANT_CATEGORY_VALUES  # noqa: E402
NOT_SUPPORTED_VARIANT_CATEGORY_VALUES = NOT_SUPPORTED_VARIANT_CATEGORY_VALUES[:-1]
NOT_SUPPORTED_VARIANT_CATEGORY_VALUES

['Sequence',
 'Genotype/Haplotype',
 'Fusion',
 'Rearrangement',
 'Epigenetic Modification',
 'Copy Number',
 'Expression',
 'Gene Function',
 'Region-Defined',
 'Genome Feature',
 'Other']

## <a id='toc1_2_'></a>[Dictionaries to map variants to categories and record category counts](#toc0_)

Bin variants to categories.

For variants with multiple associated types: if the types have a subset relationship (e.g., frameshift; frameshift truncation), the variant is assigned to the category consistent with the superset type (frameshift). If the types are disjoint (e.g., Transcript Variant; Loss of Function Variant), the variant is assigned to the category most closely associated with the assayed data (Transcript Variant). This assignment is encoded in the CIVIC_CATEGORY_BINS dictionary.

In [3]:
CIVIC_CATEGORY_BINS = {
    "Delins": NotSupportedVariantCategory.SEQUENCE,
    "Direct Tandem Duplication": NotSupportedVariantCategory.SEQUENCE,
    "Disruptive Inframe Deletion": NotSupportedVariantCategory.SEQUENCE,
    "Disruptive Inframe Insertion": NotSupportedVariantCategory.SEQUENCE,
    "Coding Sequence Variant": NotSupportedVariantCategory.SEQUENCE,
    "Conservative Inframe Deletion": NotSupportedVariantCategory.SEQUENCE,
    "Copy Number Variants": NotSupportedVariantCategory.COPY_NUMBER,
    "Frameshift": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Truncation": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Variant;Minus 1 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Deletion": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Indel": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Insertion": NotSupportedVariantCategory.SEQUENCE,
    "Intron Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Minus 1 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Minus 2 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant": NotSupportedVariantCategory.SEQUENCE,
    "Non Conservative Missense Variant": NotSupportedVariantCategory.SEQUENCE,
    "Plus 1 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Region-Defined Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Regulatory Region Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Sequence Variants": NotSupportedVariantCategory.SEQUENCE,
    "Splice Acceptor Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Splice Donor Region Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Splice Donor Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Splicing Variant": NotSupportedVariantCategory.OTHER,
    "Start Lost": NotSupportedVariantCategory.SEQUENCE,
    "Stop Gained": NotSupportedVariantCategory.SEQUENCE,
    "Stop Lost": NotSupportedVariantCategory.SEQUENCE,
    "Synonymous Variant": NotSupportedVariantCategory.SEQUENCE,
    "Transcript Amplification": NotSupportedVariantCategory.COPY_NUMBER,
    "Transcript Fusion": NotSupportedVariantCategory.FUSION,
    "3 Prime UTR Variant": NotSupportedVariantCategory.REGION_DEFINED,
    "Amino Acid Deletion;Inframe Deletion": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Truncation;Minus 2 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Truncation;Plus 2 Frameshift Variant": NotSupportedVariantCategory.SEQUENCE,
    "Frameshift Variant;Delins": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Insertion;Delins": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Insertion;Inframe Deletion;Delins": NotSupportedVariantCategory.SEQUENCE,
    "Inframe Variant;Inframe Insertion;Inframe Deletion;Delins ": NotSupportedVariantCategory.SEQUENCE,
    "Minus 1 Frameshift Variant;Frameshift Truncation": NotSupportedVariantCategory.SEQUENCE,
    "Plus 1 Frameshift Variant;Frameshift Elongation": NotSupportedVariantCategory.SEQUENCE,
    "Plus 1 Frameshift Variant;Frameshift Truncation": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Gain Of Function Variant": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Loss Of Function Variant": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Loss Of Heterozygosity": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Polymorphic Sequence Variant": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Snp": NotSupportedVariantCategory.SEQUENCE,
    "Missense Variant;Transcript Fusion": NotSupportedVariantCategory.SEQUENCE,
    "Stop Gained;Loss Of Function Variant": NotSupportedVariantCategory.SEQUENCE,
    "Stop Lost;Inframe Deletion": NotSupportedVariantCategory.SEQUENCE,
}

CLINVAR_CATEGORY_BINS = {
    "Complex": NotSupportedVariantCategory.OTHER,
    "CompoundHeterozygote": NotSupportedVariantCategory.GENOTYPE_AND_HAPLOTYPE,
    "Deletion": NotSupportedVariantCategory.SEQUENCE,
    "Diplotype": NotSupportedVariantCategory.GENOTYPE_AND_HAPLOTYPE,
    "Distinct chromosomes": NotSupportedVariantCategory.REARRANGEMENT,
    "Duplication": NotSupportedVariantCategory.SEQUENCE,
    "Haplotype": NotSupportedVariantCategory.SEQUENCE,
    "Haplotype, single variant": NotSupportedVariantCategory.SEQUENCE,
    "Indel": NotSupportedVariantCategory.SEQUENCE,
    "Insertion": NotSupportedVariantCategory.SEQUENCE,
    "Inversion": NotSupportedVariantCategory.SEQUENCE,
    "Microsatellite": NotSupportedVariantCategory.SEQUENCE,
    "Phase unknown": NotSupportedVariantCategory.OTHER,
    "Tandem duplication": NotSupportedVariantCategory.SEQUENCE,
    "Translocation": NotSupportedVariantCategory.REARRANGEMENT,
    "Variation": NotSupportedVariantCategory.OTHER,
    "copy number gain": NotSupportedVariantCategory.COPY_NUMBER,
    "copy number loss": NotSupportedVariantCategory.COPY_NUMBER,
    "fusion": NotSupportedVariantCategory.FUSION,
    "protein only": NotSupportedVariantCategory.SEQUENCE,
    "single nucleotide variant": NotSupportedVariantCategory.SEQUENCE,
}
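
As a minimal sketch of how the binning rule plays out at lookup time, the snippet below uses a stand-in `Category` enum and a three-entry excerpt of the bins. Both are hypothetical simplifications: the real enum, NotSupportedVariantCategory, lives in utils.py, which is not shown here. Multi-type labels are stored as a single semicolon-joined key, so the lookup is on the whole string, and a label that is not a key returns `None`.

```python
from enum import Enum


class Category(Enum):
    """Hypothetical stand-in for NotSupportedVariantCategory (utils.py is not shown)."""

    SEQUENCE = "Sequence"
    FUSION = "Fusion"


# Excerpt of the bins: a multi-type label is one whole-string key.
BINS_EXCERPT = {
    "Frameshift Truncation": Category.SEQUENCE,
    # Disjoint types: binned with the type closest to the assayed data.
    "Missense Variant;Transcript Fusion": Category.SEQUENCE,
    "Transcript Fusion": Category.FUSION,
}


def bin_label(label: str):
    """Return the category for a type label, or None if the label is not binned."""
    return BINS_EXCERPT.get(label)


multi_type = bin_label("Missense Variant;Transcript Fusion")
not_binned = bin_label("Fusion")  # "Fusion" is not a key in this excerpt
print(multi_type, not_binned)
```

Because the keys are exact strings, any label that differs by spelling, ordering of the joined types, or whitespace will not be found.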

The Fields enum below names the positions in each category_counts entry. Each entry is a list holding, in order: the number of variants in that category that were normalized, the number ostensibly supported but unable to be normalized, the number that are not supported, the total number of variants, and the percentage normalized (a float, initialized to 0.0).

In [4]:
class Fields(IntEnum):
    """Create IntEnum for count fields in the category_counts dict."""

    NORMALIZED_COUNT = 0
    UNABLE_TO_NORMALIZE_COUNT = 1
    UNSUPPORTED_COUNT = 2
    TOTAL_COUNT = 3
    PERCENT_NORMALIZED = 4

In [5]:
category_counts = {v: [0, 0, 0, 0, 0.0] for v in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES}
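
A small sketch of how the enum is meant to be used as a list index. The two category names and the counts here are illustrative only; `demo_counts` is a hypothetical stand-in for category_counts.

```python
from enum import IntEnum


class Fields(IntEnum):
    """Positions of the count fields, mirroring the cell above."""

    NORMALIZED_COUNT = 0
    UNABLE_TO_NORMALIZE_COUNT = 1
    UNSUPPORTED_COUNT = 2
    TOTAL_COUNT = 3
    PERCENT_NORMALIZED = 4


# One independent list per category (the comprehension builds a new list each time).
demo_counts = {name: [0, 0, 0, 0, 0.0] for name in ("Sequence", "Fusion")}

# IntEnum members are ints, so they index the row directly.
demo_counts["Sequence"][Fields.NORMALIZED_COUNT] += 3
demo_counts["Sequence"][Fields.TOTAL_COUNT] += 3
demo_counts["Fusion"][Fields.UNSUPPORTED_COUNT] += 2
demo_counts["Fusion"][Fields.TOTAL_COUNT] += 2

print(demo_counts)
```

Using named positions keeps the update code readable without changing the plain-list layout of each entry.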

## <a id='toc1_3_'></a>[CIViC](#toc0_)



In order to score the normalizer's performance on the CIViC data, some cleaning is required.

First we need to read in the data that was ostensibly supported, get rid of variants with multiple type labels, and assign variant types to as many as possible of the entries that have a "Not provided" value for civic_variant_types.

Read in the TSV of normalized variants in CIViC.

In [6]:
civic_normalized_df = pd.read_csv(
    "../civic/variation_analysis/able_to_normalize_queries.tsv", sep="\t"
)
civic_normalized_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize


Prune columns and add new column to flag as normalized.

In [7]:
pruned_civic_normalized_df = civic_normalized_df[
    ["variant_id", "query", "query_type", "civic_variant_types"]
]
pruned_civic_normalized_df.insert(4, "normalization_status", "normalized")
pruned_civic_normalized_df.head()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,Stop Lost,normalized
1,1988,NC_000003.11:g.10191649A>T,genomic,Stop Lost,normalized
2,2488,3-10191647-T-G,genomic,Stop Lost,normalized
3,1986,NC_000003.11:g.10191648G>T,genomic,Stop Lost,normalized
4,1987,NC_000003.11:g.10191649A>G,genomic,Stop Lost,normalized


Repeat process with the variants that were unable to be normalized.

In [8]:
civic_not_normalized_df = pd.read_csv(
    "../civic/variation_analysis/unable_to_normalize_queries.tsv", sep="\t"
)
civic_not_normalized_df.shape

(83, 8)

In [9]:
pruned_civic_not_normalized_df = civic_not_normalized_df[
    ["variant_id", "query", "query_type", "civic_variant_types"]
]
pruned_civic_not_normalized_df.insert(4, "normalization_status", "not_normalized")
pruned_civic_not_normalized_df.head()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status
0,748,MLH1 *757L,protein,Stop Lost,not_normalized
1,3718,AR A748V,protein,Not provided,not_normalized
2,3725,AR A765T,protein,Not provided,not_normalized
3,4485,ERBB2 A775_G776ins YVMA,protein,Not provided,not_normalized
4,248,TERT C228T,protein,Regulatory Region Variant,not_normalized


Merge these dfs

In [10]:
frames = [pruned_civic_normalized_df, pruned_civic_not_normalized_df]
civic_supported_df = pd.concat(frames)
civic_supported_df.shape

(2098, 5)

Convert all queries to upper case to make it easier to account for untyped variants later on.

In [11]:
civic_supported_df["query"] = civic_supported_df["query"].apply(str.upper)

Checking variant types. The single largest type is "Not provided".

In [12]:
civic_supported_df["civic_variant_types"].value_counts(dropna=False)

civic_variant_types
Not provided                                                  898
Missense Variant                                              877
Stop Gained                                                    60
Transcript Amplification                                       36
Frameshift Truncation                                          35
Inframe Deletion                                               34
Inframe Insertion                                              17
Frameshift Truncation;Minus 1 Frameshift Variant               15
Synonymous Variant                                             13
Splice Donor Variant                                           13
Splice Acceptor Variant                                         9
Frameshift Variant                                              7
Minus 1 Frameshift Variant;Frameshift Truncation                6
Missense Variant;Gain Of Function Variant                       6
Stop Lost                                               

Most of these look like amino acid substitutions, so we define a regex to detect them and assign them the "Missense Variant" type.

If a variant has no assigned variant type in CIViC, is a protein query, and its query matches a regex pattern associated with amino acid substitutions (such as "PTEN A126D"), it is reclassified as a "Missense Variant".

In [13]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+[A-Z*]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)
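
To make the pattern's reach concrete, here is a quick check against a few upper-cased queries that appear in this notebook. The character class is written as `[A-Z*]` (inside a character class, `|` is a literal pipe rather than alternation). `SUBSTITUTION`, `samples`, and `flags` are names introduced for this sketch only.

```python
import re

# Gene symbol, whitespace, reference residue(s), position, then one alternate
# residue or "*". re.match anchors at the start of the string only.
SUBSTITUTION = re.compile(r"\S+\s+[A-Z]+\d+[A-Z*]")

samples = [
    "PTEN A126D",              # plain substitution
    "TERT C228T",              # same shape as a substitution
    "ERBB2 A775_G776INSYVMA",  # insertion: "_" follows the position
    "MYB AMPLIFICATION",       # no position digits
    "NF1 P.W426*",             # "P." prefix breaks the residue/position shape
]

flags = {query: bool(SUBSTITUTION.match(query)) for query in samples}
for query, matched in flags.items():
    print(f"{query!r}: {matched}")
```

The insertion, amplification, and `P.`-prefixed queries do not match, which is why they need their own patterns in later cells.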

Doing so reduced the 898 untyped variants down to 86.
Checking the remaining untyped variants.

In [14]:
untyped_variants = civic_supported_df[
    civic_supported_df["civic_variant_types"] == "Not provided"
]
print(len(untyped_variants))
untyped_variants.head(20)

86


Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag
11,3342,KRAS A11_G12INSGA,protein,Not provided,normalized,False
68,4484,ERBB2 A775_G776INSIVMA,protein,Not provided,normalized,False
69,4723,ERBB2 A775_G776INSTVMA,protein,Not provided,normalized,False
70,4724,ERBB2 A775_G776INSV,protein,Not provided,normalized,False
71,4725,ERBB2 A775_G776INSVVMA,protein,Not provided,normalized,False
72,2658,ERBB2 A775_G776INSYVMA,protein,Not provided,normalized,False
73,4483,ERBB2 A775_G776INSYVMA,protein,Not provided,normalized,False
80,3751,ARHGAP35 A865_L870DELINSV,protein,Not provided,normalized,False
88,2655,MYB AMPLIFICATION,protein,Not provided,normalized,False
116,1261,MDM2 AMPLIFICATION,protein,Not provided,normalized,False


Reassigning variants marked as {gene} Amplification as Transcript Amplification Variants

In [15]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+AMPLIFICATION", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Transcript Amplification",
    civic_supported_df["civic_variant_types"],
)

Reassigning amino acid insertions, delins, and deletions as "Missense Variant", including a couple of variants that have a random space before or after the sequence operation like "INS"

In [16]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [17]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+\s+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [18]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+-+\d+\s+INS+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [19]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DELINS+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [20]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DEL", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

And assigning missense types to a handful of remaining variants that are non-standard names for genomic and protein sequence variants

In [21]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+\s+P\.+[A-Z]+\d+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [22]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"\S+[A-Z]+\-+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "genomic")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [23]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+[A-Z]+>+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "genomic")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [24]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+_+\d+INS+[A-Z]", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "genomic")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

In [25]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"NF1 P\.W426\*", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Missense Variant",
    civic_supported_df["civic_variant_types"],
)

This variant uses a nomenclature that is nonstandard (and unique to this database) for an unspecified variant in a particular domain, so it is a region-defined variant.

In [26]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"DICER1 RNASE IIIB MUTATION", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Region-Defined Variant",
    civic_supported_df["civic_variant_types"],
)

This variant similarly indicates some variant in a non-coding region, so it is also a region-defined variant.

In [27]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"HNRNPH1 NON-CODING MUTATIONS", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Region-Defined Variant",
    civic_supported_df["civic_variant_types"],
)

Finally, this last variant is a fusion variant.

In [28]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(
    lambda x: bool(re.match(r"NRG1 NRG1 FUSIONS", x))
)
civic_supported_df["civic_variant_types"] = np.where(
    (civic_supported_df["query_type"] == "protein")
    & (civic_supported_df["civic_variant_types"] == "Not provided")
    & (civic_supported_df["variant flag"]),
    "Fusion",
    civic_supported_df["civic_variant_types"],
)
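
The retyping cells above all share one shape: flag queries matching a pattern, then overwrite the type only where it is still "Not provided" and the query type matches. One way to condense them, sketched here with a subset of the patterns, is a rule table applied in cell order. `RETYPE_RULES` and `infer_type` are hypothetical names, not part of this notebook.

```python
import re

# (query_type, pattern, new_type) triples copied from some of the cells above.
RETYPE_RULES = [
    ("protein", re.compile(r"\S+\s+AMPLIFICATION"), "Transcript Amplification"),
    ("protein", re.compile(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+[A-Z]"), "Missense Variant"),
    ("protein", re.compile(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DELINS+[A-Z]"), "Missense Variant"),
    ("protein", re.compile(r"\S+\s+P\.+[A-Z]+\d+[A-Z]"), "Missense Variant"),
    ("genomic", re.compile(r"NC_\d+\.+\d+:[A-Z]+\.+\d+[A-Z]+>+[A-Z]"), "Missense Variant"),
]


def infer_type(query: str, query_type: str, current_type: str) -> str:
    """Return a replacement type for an untyped query, else the current type."""
    if current_type != "Not provided":
        return current_type  # already typed: never overwritten
    for rule_query_type, pattern, new_type in RETYPE_RULES:
        if query_type == rule_query_type and pattern.match(query):
            return new_type  # first matching rule wins
    return current_type


examples = [
    ("MYB AMPLIFICATION", "protein", "Not provided"),
    ("ERBB2 A775_G776INSYVMA", "protein", "Not provided"),
    ("ARHGAP35 A865_L870DELINSV", "protein", "Not provided"),
    ("NC_000003.11:G.10191649A>T", "genomic", "Stop Lost"),
]
results = [infer_type(*example) for example in examples]
print(results)
```

Since every rule only touches rows that are still untyped, taking the first matching rule gives the same result as running the cells one after another. On the DataFrame, the function could be applied row-wise with `civic_supported_df.apply(lambda r: infer_type(r["query"], r["query_type"], r["civic_variant_types"]), axis=1)`.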

Verifying that we have categorized all unaccounted for variants.

In [29]:
untyped_variants = civic_supported_df[
    civic_supported_df["civic_variant_types"] == "Not provided"
]
print(len(untyped_variants))
untyped_variants.head(20)

0


Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag


Add category column to CIViC df.

In [30]:
civic_supported_df["category"] = civic_supported_df["civic_variant_types"].map(
    CIVIC_CATEGORY_BINS
)
civic_supported_df.tail()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
78,1630,FLT3 Y591_V592INSVDFREYE,protein,Missense Variant,not_normalized,False,Sequence
79,3724,AR Y763C,protein,Missense Variant,not_normalized,False,Sequence
80,5189,BRAF DELL485_P490INSY,protein,Missense Variant,not_normalized,False,Sequence
81,4777,HNRNPH1 NON-CODING MUTATIONS,protein,Region-Defined Variant,not_normalized,False,Region-Defined
82,151,BCL2 REG_E@[IGH]::BCL2,protein,Transcript Regulatory Region Fusion,not_normalized,False,
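
Note that the last row above has an empty category: `Series.map` with a dictionary yields NaN for any label that is not a key, and `value_counts()` drops NaN by default, so such rows silently fall out of the counts. A minimal sketch for listing the unmapped labels, using a hypothetical two-entry excerpt of the bins and a three-row stand-in frame:

```python
import pandas as pd

# Hypothetical excerpt of the bins and a small stand-in for civic_supported_df.
bins_excerpt = {
    "Missense Variant": "Sequence",
    "Region-Defined Variant": "Region-Defined",
}
demo_df = pd.DataFrame(
    {
        "civic_variant_types": [
            "Missense Variant",
            "Region-Defined Variant",
            "Transcript Regulatory Region Fusion",
        ]
    }
)
demo_df["category"] = demo_df["civic_variant_types"].map(bins_excerpt)

# Labels whose category is NaN, i.e. labels missing from the bins dictionary.
unmapped = (
    demo_df.loc[demo_df["category"].isna(), "civic_variant_types"].unique().tolist()
)
print(unmapped)
```

Running the same check on the real frame would show which type labels still need an entry in CIVIC_CATEGORY_BINS.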


Split df by normalized/not_normalized flag

In [31]:
civic_normalized_df_cats = civic_supported_df[
    civic_supported_df["normalization_status"] == "normalized"
]
civic_normalized_df_cats

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
0,2489,NC_000003.11:G.10191648_10191649INSC,genomic,Stop Lost,normalized,False,Sequence
1,1988,NC_000003.11:G.10191649A>T,genomic,Stop Lost,normalized,False,Sequence
2,2488,3-10191647-T-G,genomic,Stop Lost,normalized,False,Sequence
3,1986,NC_000003.11:G.10191648G>T,genomic,Stop Lost,normalized,False,Sequence
4,1987,NC_000003.11:G.10191649A>G,genomic,Stop Lost,normalized,False,Sequence
...,...,...,...,...,...,...,...
2010,877,NC_000020.11:G.58903752C>T,genomic,Synonymous Variant,normalized,False,Sequence
2011,731,NC_000003.11:G.37056036G>A,genomic,Splice Donor Variant,normalized,False,Region-Defined
2012,3045,VHL P.F76DEL,protein,Missense Variant,normalized,False,Sequence
2013,3310,HDAC9 P.L33R,protein,Missense Variant,normalized,False,Sequence


In [32]:
civic_not_normalized_df_cats = civic_supported_df[
    civic_supported_df["normalization_status"] == "not_normalized"
]
civic_not_normalized_df_cats

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
0,748,MLH1 *757L,protein,Stop Lost,not_normalized,False,Sequence
1,3718,AR A748V,protein,Missense Variant,not_normalized,False,Sequence
2,3725,AR A765T,protein,Missense Variant,not_normalized,False,Sequence
3,4485,ERBB2 A775_G776INS YVMA,protein,Missense Variant,not_normalized,False,Sequence
4,248,TERT C228T,protein,Regulatory Region Variant,not_normalized,False,Region-Defined
...,...,...,...,...,...,...,...
78,1630,FLT3 Y591_V592INSVDFREYE,protein,Missense Variant,not_normalized,False,Sequence
79,3724,AR Y763C,protein,Missense Variant,not_normalized,False,Sequence
80,5189,BRAF DELL485_P490INSY,protein,Missense Variant,not_normalized,False,Sequence
81,4777,HNRNPH1 NON-CODING MUTATIONS,protein,Region-Defined Variant,not_normalized,False,Region-Defined


For each df, get CIViC variant counts by category and add them to the counts dictionary.

In [33]:
civic_normalized_category_counts = json.loads(
    civic_normalized_df_cats["category"].value_counts().to_json()
)
civic_normalized_category_counts

{'Sequence': 1902, 'Copy Number': 63, 'Region-Defined': 28, 'Other': 1}

In [34]:
def add_json_counts(var_category_counts: dict, support_status: Fields) -> None:
    """Add per-category variant counts to the global category_counts dictionary.

    :param var_category_counts: mapping of variant category to count, as parsed
        from the JSON produced by value_counts().to_json()
    :param support_status: the Fields member indicating whether the counted
        variants are normalized (0), unable to be normalized (1), or
        unsupported (2) by the normalizer
    """
    for category, count in var_category_counts.items():
        category_counts[category][support_status] += count
        category_counts[category][Fields.TOTAL_COUNT] += count
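
For illustration, the same update can be written so that the totals dictionary is passed in explicitly instead of being read from a global, which makes it easy to try on a toy input. `add_counts` and `totals` are hypothetical names for this sketch; the numbers are taken from the CIViC counts shown in this notebook.

```python
from enum import IntEnum


class Fields(IntEnum):
    """Positions of the count fields, as defined earlier in the notebook."""

    NORMALIZED_COUNT = 0
    UNABLE_TO_NORMALIZE_COUNT = 1
    UNSUPPORTED_COUNT = 2
    TOTAL_COUNT = 3
    PERCENT_NORMALIZED = 4


def add_counts(totals: dict, new_counts: dict, status: Fields) -> None:
    """Add new_counts into the status column and the total column of totals."""
    for category, count in new_counts.items():
        totals[category][status] += count
        totals[category][Fields.TOTAL_COUNT] += count


totals = {"Sequence": [0, 0, 0, 0, 0.0], "Region-Defined": [0, 0, 0, 0, 0.0]}
add_counts(totals, {"Sequence": 1902}, Fields.NORMALIZED_COUNT)
add_counts(totals, {"Sequence": 78, "Region-Defined": 3}, Fields.UNABLE_TO_NORMALIZE_COUNT)
print(totals)
```

As in the notebook's function, a category that is missing from the totals dictionary raises a `KeyError`, and each call must be made exactly once per input or the totals are double-counted.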

In [35]:
add_json_counts(civic_normalized_category_counts, Fields.NORMALIZED_COUNT)
category_counts

{'Sequence': [1902, 0, 0, 1902, 0.0],
 'Genotype/Haplotype': [0, 0, 0, 0, 0.0],
 'Fusion': [0, 0, 0, 0, 0.0],
 'Rearrangement': [0, 0, 0, 0, 0.0],
 'Epigenetic Modification': [0, 0, 0, 0, 0.0],
 'Copy Number': [63, 0, 0, 63, 0.0],
 'Expression': [0, 0, 0, 0, 0.0],
 'Gene Function': [0, 0, 0, 0, 0.0],
 'Region-Defined': [28, 0, 0, 28, 0.0],
 'Genome Feature': [0, 0, 0, 0, 0.0],
 'Other': [1, 0, 0, 1, 0.0]}

In [36]:
civic_not_normalized_category_counts = json.loads(
    civic_not_normalized_df_cats["category"].value_counts().to_json()
)
civic_not_normalized_category_counts

{'Sequence': 78, 'Region-Defined': 3}

In [37]:
add_json_counts(civic_not_normalized_category_counts, Fields.UNABLE_TO_NORMALIZE_COUNT)
category_counts

{'Sequence': [1902, 78, 0, 1980, 0.0],
 'Genotype/Haplotype': [0, 0, 0, 0, 0.0],
 'Fusion': [0, 0, 0, 0, 0.0],
 'Rearrangement': [0, 0, 0, 0, 0.0],
 'Epigenetic Modification': [0, 0, 0, 0, 0.0],
 'Copy Number': [63, 0, 0, 63, 0.0],
 'Expression': [0, 0, 0, 0, 0.0],
 'Gene Function': [0, 0, 0, 0, 0.0],
 'Region-Defined': [28, 3, 0, 31, 0.0],
 'Genome Feature': [0, 0, 0, 0, 0.0],
 'Other': [1, 0, 0, 1, 0.0]}

Read in the TSV of unsupported variants. This data was already mapped to categories in civic_variant_analysis, so we only need to import the data and count the category column.

In [38]:
not_supported_variants = pd.read_csv(
    "../civic/variation_analysis/not_supported_variants.tsv", sep="\t"
)
print(not_supported_variants.shape)
not_supported_variants.head()

(1747, 6)


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript,False
1,4214,VHL,,Not provided,Transcript,False
2,4216,VHL,,Not provided,Transcript,False
3,4278,VHL,,Not provided,Transcript,False
4,4232,BRCA1,,Not provided,Transcript,False


Checking Counts.

In [39]:
not_supported_variants["category"].value_counts()

category
Transcript                 362
Fusion                     313
Expression                 294
Region-Defined             255
Sequence                   133
Rearrangement              122
Gene Function              111
Other                       79
Copy Number                 32
Genotype/Haplotype          22
Epigenetic Modification     14
Genome Feature              10
Name: count, dtype: int64

There is one small discrepancy here: the variants in the "Transcript" category should be binned under "Sequence".

In [40]:
not_supported_variants["category"] = not_supported_variants["category"].replace(
    "Transcript", "Sequence"
)


In [41]:
not_supported_variants_category_counts = json.loads(
    not_supported_variants["category"].value_counts().to_json()
)
not_supported_variants_category_counts

{'Sequence': 495,
 'Fusion': 313,
 'Expression': 294,
 'Region-Defined': 255,
 'Rearrangement': 122,
 'Gene Function': 111,
 'Other': 79,
 'Copy Number': 32,
 'Genotype/Haplotype': 22,
 'Epigenetic Modification': 14,
 'Genome Feature': 10}

In [42]:
add_json_counts(not_supported_variants_category_counts, Fields.UNSUPPORTED_COUNT)
category_counts

{'Sequence': [1902, 78, 495, 2475, 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 122, 122, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [63, 0, 32, 95, 0.0],
 'Expression': [0, 0, 294, 294, 0.0],
 'Gene Function': [0, 0, 111, 111, 0.0],
 'Region-Defined': [28, 3, 255, 286, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [1, 0, 79, 80, 0.0]}
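
A quick arithmetic check on a few of the rows above: the total field should equal the sum of the normalized, unable-to-normalize, and unsupported fields (for Sequence, 1902 + 78 + 495 = 2475). The rows are copied from the output; `civic_rows` and `inconsistent` are names used only in this sketch.

```python
# Rows copied from the output above:
# [normalized, unable to normalize, unsupported, total, percent normalized].
civic_rows = {
    "Sequence": [1902, 78, 495, 2475, 0.0],
    "Copy Number": [63, 0, 32, 95, 0.0],
    "Region-Defined": [28, 3, 255, 286, 0.0],
    "Other": [1, 0, 79, 80, 0.0],
}

# Categories whose first three fields do not add up to the total field.
inconsistent = [name for name, row in civic_rows.items() if sum(row[:3]) != row[3]]
print(inconsistent)
```

The same comprehension could be run over the full category_counts dictionary after each source is added.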

## <a id='toc1_4_'></a>[MOA](#toc0_)

Read the MOA TSV file of normalized variants.

In [43]:
moa_normalized_df = pd.read_csv(
    "../moa/feature_analysis/able_to_normalize_queries.tsv", sep="\t"
)
print(moa_normalized_df.shape)
moa_normalized_df.head()

(196, 5)


Unnamed: 0,variant_id,query,moa_feature_type,category,vrs_id
0,66,ABL1 p.T315I,somatic_variant,Sequence,ga4gh:VA.D6NzpWXKqBnbcZZrXNSXj4tMUwROKbsQ
1,68,ABL1 p.T315A,somatic_variant,Sequence,ga4gh:VA.37YVc2HpRgXOq3HtsjcL1eiyLhDXLmYy
2,70,ABL1 p.F317L,somatic_variant,Sequence,ga4gh:VA.ZJZc_8PkTSu-twmaJvj6yQXvPJHElPZc
3,71,ABL1 p.F317V,somatic_variant,Sequence,ga4gh:VA.SnGz3wUT2JaIid12PoI6OHc4t7LgHVj1
4,72,ABL1 p.F317I,somatic_variant,Sequence,ga4gh:VA.wDDVWfpuxnuYkLj5_0OrnaBvrJAXYcJA


Get variant counts by category and update the counts dictionary.

In [44]:
moa_normalized_category_counts = json.loads(
    moa_normalized_df["category"].value_counts().to_json()
)
moa_normalized_category_counts

{'Sequence': 159, 'Copy Number': 31, 'Gene Function': 6}

In [45]:
add_json_counts(moa_normalized_category_counts, Fields.NORMALIZED_COUNT)
category_counts

{'Sequence': [2061, 78, 495, 2634, 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 122, 122, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [94, 0, 32, 126, 0.0],
 'Expression': [0, 0, 294, 294, 0.0],
 'Gene Function': [6, 0, 111, 117, 0.0],
 'Region-Defined': [28, 3, 255, 286, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [1, 0, 79, 80, 0.0]}

Repeat same process for variants that were supported but failed to normalize.

In [46]:
moa_not_normalized_df = pd.read_csv(
    "../moa/feature_analysis/unable_to_normalize_queries.tsv", sep="\t"
)
print(moa_not_normalized_df.shape)
moa_not_normalized_df.head()

(0, 7)


Unnamed: 0,variant_id,query,moa_feature_type,category,exception_raised,message,warnings


In [47]:
moa_not_normalized_category_counts = json.loads(
    moa_not_normalized_df["category"].value_counts().to_json()
)
moa_not_normalized_category_counts

{}

In [48]:
add_json_counts(moa_not_normalized_category_counts, Fields.UNABLE_TO_NORMALIZE_COUNT)
category_counts

{'Sequence': [2061, 78, 495, 2634, 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 122, 122, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [94, 0, 32, 126, 0.0],
 'Expression': [0, 0, 294, 294, 0.0],
 'Gene Function': [6, 0, 111, 117, 0.0],
 'Region-Defined': [28, 3, 255, 286, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [1, 0, 79, 80, 0.0]}

Repeat same process for variants that are unsupported.

In [49]:
moa_not_supported_df = pd.read_csv(
    "../moa/feature_analysis/not_supported_variants.tsv", sep="\t"
)
print(moa_not_supported_df.shape)
print(moa_not_supported_df.head())
moa_not_supported_df["category"].value_counts(dropna=False)

(256, 4)
   variant_id                        query moa_feature_type       category
0           1             BCR--ABL1 Fusion    rearrangement  Rearrangement
1           9                   ALK Fusion    rearrangement  Rearrangement
2          12                          ALK    rearrangement  Rearrangement
3          15            ALK Translocation    rearrangement  Rearrangement
4          18  BRD4 t(15;19) Translocation    rearrangement  Rearrangement


category
Sequence          127
Region-Defined     40
Rearrangement      35
Copy Number        23
Other              12
Expression         11
Gene Function       8
Name: count, dtype: int64

In [50]:
moa_not_supported_category_counts = json.loads(
    moa_not_supported_df["category"].value_counts().to_json()
)
moa_not_supported_category_counts

{'Sequence': 127,
 'Region-Defined': 40,
 'Rearrangement': 35,
 'Copy Number': 23,
 'Other': 12,
 'Expression': 11,
 'Gene Function': 8}

In [51]:
add_json_counts(moa_not_supported_category_counts, Fields.UNSUPPORTED_COUNT)
category_counts

{'Sequence': [2061, 78, 622, 2761, 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 157, 157, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [94, 0, 55, 149, 0.0],
 'Expression': [0, 0, 305, 305, 0.0],
 'Gene Function': [6, 0, 119, 125, 0.0],
 'Region-Defined': [28, 3, 295, 326, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [1, 0, 91, 92, 0.0]}

## <a id='toc1_5_'></a>[ClinVar](#toc0_)

Read in the three ClinVar CSV files.

In [52]:
clinvar_normalized_df = pd.read_csv(
    "../clinvar/variation_analysis_output/variation_type_count_supported_df.csv"
)
print(clinvar_normalized_df.shape)
clinvar_normalized_df.head(20)

(10, 4)


Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count
0,0,single nucleotide variant,,3401625
1,1,Deletion,,160232
2,2,Duplication,,73474
3,3,Microsatellite,,36095
4,4,copy number gain,,21484
5,5,copy number loss,,20367
6,6,Indel,,17038
7,7,Insertion,,13019
8,8,Inversion,,1401
9,9,Variation,,379


In [53]:
clinvar_not_normalized_df = pd.read_csv(
    "../clinvar/variation_analysis_output/variation_type_count_supported_not_normalized_df.csv"
)
print(clinvar_not_normalized_df.shape)
clinvar_not_normalized_df.head(10)

(7, 4)


Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count
0,0,Deletion,,711
1,1,copy number loss,,538
2,2,copy number gain,,438
3,3,Duplication,,357
4,4,single nucleotide variant,,71
5,5,Indel,,2
6,6,Insertion,,1


In [54]:
clinvar_not_supported_df = pd.read_csv(
    "../clinvar/variation_analysis_output/variation_type_count_not_supported_df.csv"
)
print(clinvar_not_supported_df.shape)
clinvar_not_supported_df.head(20)

(59, 4)


Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count
0,0,copy number gain,Variant not described on GRCh37/GRCh38 assembly,2807
1,1,copy number loss,Variant not described on GRCh37/GRCh38 assembly,1706
2,2,Haplotype,haplotype and genotype variations are not supp...,617
3,3,Diplotype,haplotype and genotype variations are not supp...,596
4,4,Deletion,No viable variation members identified.,593
5,5,Microsatellite,repeat expressions are not supported.,456
6,6,Deletion,sequence for accession not supported by vrs-py...,336
7,7,CompoundHeterozygote,haplotype and genotype variations are not supp...,297
8,8,single nucleotide variant,No viable variation members identified.,290
9,9,Translocation,No viable variation members identified.,282


Add column and map variant types to categories.

In [55]:
clinvar_normalized_df["category"] = clinvar_normalized_df["in.variation_type"].map(
    CLINVAR_CATEGORY_BINS
)
clinvar_normalized_df.head(20)

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,single nucleotide variant,,3401625,Sequence
1,1,Deletion,,160232,Sequence
2,2,Duplication,,73474,Sequence
3,3,Microsatellite,,36095,Sequence
4,4,copy number gain,,21484,Copy Number
5,5,copy number loss,,20367,Copy Number
6,6,Indel,,17038,Sequence
7,7,Insertion,,13019,Sequence
8,8,Inversion,,1401,Sequence
9,9,Variation,,379,Other


In [56]:
clinvar_not_normalized_df["category"] = clinvar_not_normalized_df[
    "in.variation_type"
].map(CLINVAR_CATEGORY_BINS)
clinvar_not_normalized_df.head(20)

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,Deletion,,711,Sequence
1,1,copy number loss,,538,Copy Number
2,2,copy number gain,,438,Copy Number
3,3,Duplication,,357,Sequence
4,4,single nucleotide variant,,71,Sequence
5,5,Indel,,2,Sequence
6,6,Insertion,,1,Sequence


In [57]:
clinvar_not_supported_df["category"] = clinvar_not_supported_df[
    "in.variation_type"
].map(CLINVAR_CATEGORY_BINS)
clinvar_not_supported_df.head(20)

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,copy number gain,Variant not described on GRCh37/GRCh38 assembly,2807,Copy Number
1,1,copy number loss,Variant not described on GRCh37/GRCh38 assembly,1706,Copy Number
2,2,Haplotype,haplotype and genotype variations are not supp...,617,Sequence
3,3,Diplotype,haplotype and genotype variations are not supp...,596,Genotype/Haplotype
4,4,Deletion,No viable variation members identified.,593,Sequence
5,5,Microsatellite,repeat expressions are not supported.,456,Sequence
6,6,Deletion,sequence for accession not supported by vrs-py...,336,Sequence
7,7,CompoundHeterozygote,haplotype and genotype variations are not supp...,297,Genotype/Haplotype
8,8,single nucleotide variant,No viable variation members identified.,290,Sequence
9,9,Translocation,No viable variation members identified.,282,Rearrangement
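One thing worth checking after a dictionary-based mapping is whether any variation type was missing from the dictionary, since `Series.map` returns NaN for unmapped keys. The sketch below uses a toy dataframe and a three-entry subset of the mapping (the three pairs appear in the output above). `bins_subset`, `toy_df`, and "made-up type" are hypothetical names and values used only for illustration.

```python
import pandas as pd

# Illustrative subset of a type-to-category mapping; the full
# CLINVAR_CATEGORY_BINS dictionary is defined earlier in the notebook.
bins_subset = {
    "single nucleotide variant": "Sequence",
    "copy number gain": "Copy Number",
    "Variation": "Other",
}

toy_df = pd.DataFrame(
    {
        "in.variation_type": [
            "single nucleotide variant",
            "copy number gain",
            "made-up type",
        ],
        "count": [3, 2, 1],
    }
)
toy_df["category"] = toy_df["in.variation_type"].map(bins_subset)

# Types missing from the mapping come back as NaN in the category column.
unmapped = toy_df.loc[toy_df["category"].isna(), "in.variation_type"].tolist()
print(unmapped)  # ['made-up type']
```

Rows with a NaN category would not match any key when counts are summed per category, so an empty `unmapped` list is a useful sanity check.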


Because of how the data are structured and how the original analysis developed, some but not all CNVs (per in.variation_type) were annotated in the in.issue column as "Copy number change (cn loss|del and cn gain|dup)", "Absolute copy count", or "Min/max copy count range not supported". Conversely, some of the copy number gain/loss variants were not binned as CNVs per the in.vrs_xform_plan.policy. We therefore need to mark every variant in the union of the following two sets as belonging to the category of Copy Number Variants:

Variants with in.variation_type ==
1. copy number loss
2. copy number gain

Variants with in.vrs_xform_plan.policy ==
1. Copy number change (cn loss|del and cn gain|dup)
2. Absolute copy count
3. Min/max copy count range not supported

The mapping step above already caught the first set of variants. Now we go through each df once more and assign the variants we missed, per the in.vrs_xform_plan.policy values, to the category of Copy Number Variants.
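The union described above can be sketched as a logical OR of two boolean masks. This is a minimal illustration on a toy dataframe, not the notebook's actual cells: `toy_df`, `cnv_types`, and `cnv_issues` are hypothetical names, the issue strings are compared against the in.issue column as in the notebook's code, and the label "Copy Number" is chosen to match the keys of `category_counts`.

```python
import pandas as pd

cnv_types = ["copy number loss", "copy number gain"]
cnv_issues = [
    "Copy number change (cn loss|del and cn gain|dup)",
    "Absolute copy count",
    "Min/max copy count range not supported",
]

toy_df = pd.DataFrame(
    {
        "in.variation_type": ["copy number gain", "Deletion", "Deletion"],
        "in.issue": [None, "Absolute copy count", None],
        "category": ["Copy Number", "Sequence", "Sequence"],
    }
)

# A row counts as a CNV if either condition holds (union of the two sets).
is_cnv = toy_df["in.variation_type"].isin(cnv_types) | toy_df["in.issue"].isin(
    cnv_issues
)
toy_df.loc[is_cnv, "category"] = "Copy Number"
print(toy_df["category"].tolist())  # ['Copy Number', 'Copy Number', 'Sequence']
```

The second row is a Deletion by type but is picked up through its issue annotation, which is the case the second pass is meant to handle.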

In [58]:
cnv_per_policy = [
    "Copy number change (cn loss|del and cn gain|dup)",
    "Absolute copy count",
    "Min/max copy count range not supported",
]

In [59]:
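# Use the same label as the category_counts key ("Copy Number") so that
# sum_clinvar_counts picks these rows up.
clinvar_normalized_df.loc[
    clinvar_normalized_df["in.issue"].isin(cnv_per_policy), "category"
] = "Copy Number"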

In [60]:
clinvar_normalized_df

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,single nucleotide variant,,3401625,Sequence
1,1,Deletion,,160232,Sequence
2,2,Duplication,,73474,Sequence
3,3,Microsatellite,,36095,Sequence
4,4,copy number gain,,21484,Copy Number
5,5,copy number loss,,20367,Copy Number
6,6,Indel,,17038,Sequence
7,7,Insertion,,13019,Sequence
8,8,Inversion,,1401,Sequence
9,9,Variation,,379,Other


In [61]:
clinvar_not_normalized_df

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,Deletion,,711,Sequence
1,1,copy number loss,,538,Copy Number
2,2,copy number gain,,438,Copy Number
3,3,Duplication,,357,Sequence
4,4,single nucleotide variant,,71,Sequence
5,5,Indel,,2,Sequence
6,6,Insertion,,1,Sequence


In [62]:
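# Use the same label as the category_counts key ("Copy Number") so that
# sum_clinvar_counts picks these rows up.
clinvar_not_normalized_df.loc[
    clinvar_not_normalized_df["in.issue"].isin(cnv_per_policy),
    "category",
] = "Copy Number"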


clinvar_not_normalized_df

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,Deletion,,711,Sequence
1,1,copy number loss,,538,Copy Number
2,2,copy number gain,,438,Copy Number
3,3,Duplication,,357,Sequence
4,4,single nucleotide variant,,71,Sequence
5,5,Indel,,2,Sequence
6,6,Insertion,,1,Sequence


In [63]:
clinvar_not_supported_df

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,copy number gain,Variant not described on GRCh37/GRCh38 assembly,2807,Copy Number
1,1,copy number loss,Variant not described on GRCh37/GRCh38 assembly,1706,Copy Number
2,2,Haplotype,haplotype and genotype variations are not supp...,617,Sequence
3,3,Diplotype,haplotype and genotype variations are not supp...,596,Genotype/Haplotype
4,4,Deletion,No viable variation members identified.,593,Sequence
5,5,Microsatellite,repeat expressions are not supported.,456,Sequence
6,6,Deletion,sequence for accession not supported by vrs-py...,336,Sequence
7,7,CompoundHeterozygote,haplotype and genotype variations are not supp...,297,Genotype/Haplotype
8,8,single nucleotide variant,No viable variation members identified.,290,Sequence
9,9,Translocation,No viable variation members identified.,282,Rearrangement


In [64]:
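# Use the same label as the category_counts key ("Copy Number") so that
# sum_clinvar_counts picks these rows up.
clinvar_not_supported_df.loc[
    clinvar_not_supported_df["in.issue"].isin(cnv_per_policy),
    "category",
] = "Copy Number"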

clinvar_not_supported_df

Unnamed: 0.1,Unnamed: 0,in.variation_type,in.issue,count,category
0,0,copy number gain,Variant not described on GRCh37/GRCh38 assembly,2807,Copy Number
1,1,copy number loss,Variant not described on GRCh37/GRCh38 assembly,1706,Copy Number
2,2,Haplotype,haplotype and genotype variations are not supp...,617,Sequence
3,3,Diplotype,haplotype and genotype variations are not supp...,596,Genotype/Haplotype
4,4,Deletion,No viable variation members identified.,593,Sequence
5,5,Microsatellite,repeat expressions are not supported.,456,Sequence
6,6,Deletion,sequence for accession not supported by vrs-py...,336,Sequence
7,7,CompoundHeterozygote,haplotype and genotype variations are not supp...,297,Genotype/Haplotype
8,8,single nucleotide variant,No viable variation members identified.,290,Sequence
9,9,Translocation,No viable variation members identified.,282,Rearrangement


Get counts from the three dfs.

In [65]:
category_counts

{'Sequence': [2061, 78, 622, 2761, 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 157, 157, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [94, 0, 55, 149, 0.0],
 'Expression': [0, 0, 305, 305, 0.0],
 'Gene Function': [6, 0, 119, 125, 0.0],
 'Region-Defined': [28, 3, 295, 326, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [1, 0, 91, 92, 0.0]}

In [66]:
def sum_clinvar_counts(dataframe: pd.DataFrame, support_status: int) -> None:
    """Add ClinVar variant counts from a dataframe to ``category_counts``.

    :param dataframe: Counts of ClinVar variants by variation type, with a
        ``category`` column and a ``count`` column.
    :param support_status: Index of the count field to update: normalized (0),
        unable to be normalized (1), or unsupported (2) by the normalizer.
    """
    for category in category_counts:
        # Rows whose category is not a key of category_counts are skipped.
        subdf = dataframe[dataframe["category"] == category]
        if len(subdf):
            count = subdf["count"].sum()
            print(category, count)
            category_counts[category][support_status] += count
            category_counts[category][Fields.TOTAL_COUNT] += count
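As a possible cross-check on the per-category sums, the same aggregation can be written as a single pandas `groupby`. This is only a sketch on toy data, with `toy_df` and `per_category` as hypothetical names. It is not a replacement for the function above, which also updates the running totals.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "category": ["Sequence", "Sequence", "Copy Number", "Other"],
        "count": [5, 7, 3, 1],
    }
)

# One pass over the data: sum the count column within each category.
per_category = toy_df.groupby("category")["count"].sum().to_dict()
print(per_category)
```

Comparing such a dictionary against the values printed by the function is a quick way to spot categories that were silently skipped.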

In [67]:
sum_clinvar_counts(clinvar_normalized_df, Fields.NORMALIZED_COUNT)

category_counts

Sequence 3702884
Copy Number 41851
Other 379


{'Sequence': [np.int64(3704945), 78, 622, np.int64(3705645), 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 157, 157, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [np.int64(41945), 0, 55, np.int64(42000), 0.0],
 'Expression': [0, 0, 305, 305, 0.0],
 'Gene Function': [6, 0, 119, 125, 0.0],
 'Region-Defined': [28, 3, 295, 326, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [np.int64(380), 0, 91, np.int64(471), 0.0]}

In [68]:
sum_clinvar_counts(clinvar_not_normalized_df, Fields.UNABLE_TO_NORMALIZE_COUNT)

category_counts

Sequence 1142
Copy Number 976


{'Sequence': [np.int64(3704945), np.int64(1220), 622, np.int64(3706787), 0.0],
 'Genotype/Haplotype': [0, 0, 22, 22, 0.0],
 'Fusion': [0, 0, 313, 313, 0.0],
 'Rearrangement': [0, 0, 157, 157, 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [np.int64(41945), np.int64(976), 55, np.int64(42976), 0.0],
 'Expression': [0, 0, 305, 305, 0.0],
 'Gene Function': [6, 0, 119, 125, 0.0],
 'Region-Defined': [28, 3, 295, 326, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [np.int64(380), 0, 91, np.int64(471), 0.0]}

In [69]:
sum_clinvar_counts(clinvar_not_supported_df, Fields.UNSUPPORTED_COUNT)

category_counts

Sequence 4170
Genotype/Haplotype 893
Fusion 5
Rearrangement 284
Copy Number 4612
Other 271


{'Sequence': [np.int64(3704945),
  np.int64(1220),
  np.int64(4792),
  np.int64(3710957),
  0.0],
 'Genotype/Haplotype': [0, 0, np.int64(915), np.int64(915), 0.0],
 'Fusion': [0, 0, np.int64(318), np.int64(318), 0.0],
 'Rearrangement': [0, 0, np.int64(441), np.int64(441), 0.0],
 'Epigenetic Modification': [0, 0, 14, 14, 0.0],
 'Copy Number': [np.int64(41945),
  np.int64(976),
  np.int64(4667),
  np.int64(47588),
  0.0],
 'Expression': [0, 0, 305, 305, 0.0],
 'Gene Function': [6, 0, 119, 125, 0.0],
 'Region-Defined': [28, 3, 295, 326, 0.0],
 'Genome Feature': [0, 0, 10, 10, 0.0],
 'Other': [np.int64(380), 0, np.int64(362), np.int64(742), 0.0]}

## <a id='toc1_6_'></a>[Computing Coverage](#toc0_)

For the table, compute the fraction of variants normalized in each category. The fraction is stored as a four-decimal string and converted to a percentage when the table is built.

In [70]:
for i in category_counts.keys():
    normalized = category_counts[i][Fields.NORMALIZED_COUNT]
    total = category_counts[i][Fields.TOTAL_COUNT]
    percent_covered = normalized / total
    category_counts[i][Fields.PERCENT_NORMALIZED] = "%.4f" % percent_covered

category_counts

{'Sequence': [np.int64(3704945),
  np.int64(1220),
  np.int64(4792),
  np.int64(3710957),
  '0.9984'],
 'Genotype/Haplotype': [0, 0, np.int64(915), np.int64(915), '0.0000'],
 'Fusion': [0, 0, np.int64(318), np.int64(318), '0.0000'],
 'Rearrangement': [0, 0, np.int64(441), np.int64(441), '0.0000'],
 'Epigenetic Modification': [0, 0, 14, 14, '0.0000'],
 'Copy Number': [np.int64(41945),
  np.int64(976),
  np.int64(4667),
  np.int64(47588),
  '0.8814'],
 'Expression': [0, 0, 305, 305, '0.0000'],
 'Gene Function': [6, 0, 119, 125, '0.0480'],
 'Region-Defined': [28, 3, 295, 326, '0.0859'],
 'Genome Feature': [0, 0, 10, 10, '0.0000'],
 'Other': [np.int64(380), 0, np.int64(362), np.int64(742), '0.5121']}
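The coverage figure is simply normalized divided by total. The sketch below reproduces two of the values above and adds a guard for an empty category, which is an assumption on my part: every category in this notebook has a non-zero total, so the guard is never triggered here. `fraction_normalized` is a hypothetical helper name.

```python
def fraction_normalized(normalized: int, total: int) -> str:
    """Return normalized/total as a four-decimal string ('0.0000' if total is 0)."""
    if total == 0:
        # Assumed behavior for an empty category; not part of the notebook.
        return "0.0000"
    return f"{normalized / total:.4f}"


# Values taken from the Copy Number and Gene Function rows above.
print(fraction_normalized(41945, 47588))  # 0.8814
print(fraction_normalized(6, 125))  # 0.0480
print(fraction_normalized(0, 0))  # 0.0000 (guarded case)
```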

Compute total counts and overall coverage across all variant categories.

In [71]:
totals = [0, 0, 0, 0, 0.0]

for counts in category_counts.values():
    for f in Fields:
        if f != Fields.PERCENT_NORMALIZED:
            totals[f] += counts[f]

totals[Fields.PERCENT_NORMALIZED] = "%.4f" % (
    totals[Fields.NORMALIZED_COUNT] / totals[Fields.TOTAL_COUNT]
)

totals

[np.int64(3747304),
 np.int64(2199),
 np.int64(12238),
 np.int64(3761741),
 '0.9962']
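The column totals can be checked independently by transposing the rows and summing each column. The sketch below copies the five rows with non-zero normalized counts from the output above (all other categories contribute zero to the first two columns). `rows` and `column_sums` are hypothetical names.

```python
# [normalized, unable to normalize, unsupported, total] per category,
# copied from the category_counts output above.
rows = {
    "Sequence": [3704945, 1220, 4792, 3710957],
    "Copy Number": [41945, 976, 4667, 47588],
    "Gene Function": [6, 0, 119, 125],
    "Region-Defined": [28, 3, 295, 326],
    "Other": [380, 0, 362, 742],
}

# zip(*...) transposes the rows so each column can be summed in one step.
column_sums = [sum(column) for column in zip(*rows.values())]
print(column_sums[0])  # 3747304, matching the normalized total above
print(column_sums[1])  # 2199, matching the unable-to-normalize total above
```

Within each row, the first three entries also add up to the fourth, which is a second consistency check on the tally.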

## <a id='toc1_7_'></a>[Generating Table](#toc0_)

Generating a table in Plotly to show variant counts and normalization percentage by category, as well as the types of data fields associated with different variant categories.

In [72]:
core_field = "\u2b24"
optional_field = "<b>◯</b>"

colorwhite = "rgb(255, 255, 255)"
blueshade1 = "rgb(230, 240, 250)"
blueshade2 = "rgb(207, 226, 243)"
blueshade3 = "rgb(159, 197, 232)"
blueshade4 = "rgb(111, 168, 220)"
blueshade5 = "rgb(61, 133, 198)"
blueshade5point5 = "rgb(49, 116, 187)"
blueshade6 = "rgb(35, 100, 177)"
blueshade7 = "rgb(11, 83, 148)"

colors = [
    blueshade5point5,
    colorwhite,
    blueshade1,
    colorwhite,
    colorwhite,
    blueshade5,
    colorwhite,
    blueshade2,
    blueshade3,
    colorwhite,
    blueshade4,
]

data = {
    "variant_category": [
        v.replace(" Variants", "") for v in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES
    ],
    "counts": [
        f"{category_counts[v][Fields.TOTAL_COUNT]:,}"
        for v in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES
    ],
    "percent_normalized": [
        "%.2f" % round(float(category_counts[v][Fields.PERCENT_NORMALIZED]) * 100, 2)
        + "%"
        for v in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES
    ],
    "delta_sequence": [
        core_field,
        core_field,
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        optional_field,
    ],
    "delta_location": [
        optional_field,
        optional_field,
        core_field,
        core_field,
        "",
        "",
        "",
        "",
        "",
        "",
        "",
    ],
    "delta_frame": [
        optional_field,
        optional_field,
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        optional_field,
    ],
    "delta_quantity": [
        optional_field,
        optional_field,
        "",
        "",
        core_field,
        core_field,
        optional_field,
        "",
        "",
        optional_field,
        optional_field,
    ],
    "delta_function": [
        optional_field,
        optional_field,
        "",
        "",
        optional_field,
        optional_field,
        core_field,
        core_field,
        "",
        optional_field,
        optional_field,
    ],
    "region_specificity": [
        optional_field,
        optional_field,
        optional_field,
        optional_field,
        optional_field,
        optional_field,
        optional_field,
        optional_field,
        core_field,
        "",
        optional_field,
    ],
    "shading": colors,
}
df = pd.DataFrame(data)

fig = go.Figure(
    data=[
        go.Table(
            columnwidth=[90, 53, 65, 53, 50, 50, 50, 50, 50, 50],
            header=dict(
                values=[
                    "<b>Variant Category</b>",
                    "<b>Count</b>",
                    "<b>% Normalized</b>",
                    "<b>Δ Sequence</b>",
                    "<b>Δ Location</b>",
                    "<b>Δ Frame</b>",
                    "<b>Δ Quantity</b>",
                    "<b>Δ Function</b>",
                    "<b>Region Specificity</b>",
                ],
                line_color="black",
                fill_color="white",
                align="center",
                font=dict(color="black", size=18),
            ),
            cells=dict(
                values=[
                    df.variant_category,
                    df.counts,
                    df.percent_normalized,
                    df.delta_sequence,
                    df.delta_location,
                    df.delta_frame,
                    df.delta_quantity,
                    df.delta_function,
                    df.region_specificity,
                ],
                line_color=["black"],
                fill_color=[df.shading],
                align="right",
                font=dict(color="black", size=18),
                height=30,
            ),
        )
    ]
)

fig.add_annotation(
    dict(
        text="  \u2b24  Core information fields<br><br>  <b>◯</b>  Optional information fields  ",
        align="left",
        showarrow=False,
        xref="paper",
        xanchor="right",
        yref="paper",
        x=0.98,
        y=0.02,
        yanchor="bottom",
        bordercolor="black",
        borderwidth=1,
    )
)

fig.update_layout(
    height=615,
    width=1400,
    font=dict(size=18, color="Black"),
    title="<b>Counts, Normalizer Performance, and Data Types of Variants by Category</b>",
    margin=go.layout.Margin(
        l=2,  # left margin
        r=2,  # right margin
        b=0,  # bottom margin
        t=52,  # top margin
    ),
)
fig.show()
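The count and percent strings shown in the table cells can be reproduced in isolation. The sketch below is a simplified equivalent of the formatting used in the cell above, written with f-strings. `format_count` and `format_percent` are hypothetical helper names.

```python
def format_count(total: int) -> str:
    """Format an integer with thousands separators."""
    return f"{total:,}"


def format_percent(fraction_text: str) -> str:
    """Turn a fraction stored as text (e.g. '0.9984') into a percent label."""
    return f"{float(fraction_text) * 100:.2f}%"


print(format_count(3710957))  # 3,710,957
print(format_percent("0.9984"))  # 99.84%
```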

Exporting the table as .png and .svg files.

In [73]:
fig.write_image("../merged_performance_analysis_table.png", "png")

In [74]:
fig.write_image("../merged_performance_analysis_table.svg", "svg")