# <a id='toc1_'></a>[Normalizer Performance Analysis](#toc0_)

This notebook analyzes the normalizer's performance on the CIViC, MOA, and ClinVar data.

**Table of contents**<a id='toc0_'></a>    
- [Normalizer Performance Analysis](#toc1_)    
  - [Import relevant packages](#toc1_1_)    
  - [Dictionaries to map variants to categories and record category counts](#toc1_2_)    
  - [CIViC](#toc1_3_)    
  - [MOA](#toc1_4_)    
  - [ClinVar](#toc1_5_)    
  - [Computing Coverage](#toc1_6_)    
  - [Generating Table](#toc1_7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Import relevant packages](#toc0_)

In [None]:
import pandas as pd
import numpy as np
import json
import re
import plotly.graph_objects as go
from enum import Enum, IntEnum

## <a id='toc1_2_'></a>[Dictionaries to map variants to categories and record category counts](#toc0_)

Bin variants to categories.

For variants with multiple associated types: if two or more types have a subset relationship (e.g. frameshift; frameshift truncation), the variant is assigned to the category of the superset type (frameshift). If the types are disjoint (e.g. Transcript Variant; Loss of Function Variant), the variant is assigned to the category most closely associated with the assayed data (Transcript Variant). This assignment is done in the civic_category_bins dictionary.

In [None]:
civic_category_bins = {
    "Delins":"Sequence Variants",
    "Direct Tandem Duplication":"Sequence Variants",
    "Disruptive Inframe Deletion":"Sequence Variants",
    "Disruptive Inframe Insertion":"Sequence Variants",
    "Coding Sequence Variant":"Sequence Variants",
    "Conservative Inframe Deletion":"Sequence Variants",
    "Copy Number Variants":"Copy Number Variants",
    "Frameshift":"Sequence Variants",
    "Frameshift Truncation":"Sequence Variants",
    "Frameshift Variant":"Sequence Variants",
    "Frameshift Variant;Minus 1 Frameshift Variant":"Sequence Variants",
    "Inframe Deletion":"Sequence Variants",
    "Inframe Indel":"Sequence Variants",
    "Inframe Insertion":"Sequence Variants",
    "Intron Variant":"Region-Defined Variants",
    "Minus 1 Frameshift Variant":"Sequence Variants",
    "Minus 2 Frameshift Variant":"Sequence Variants",
    "Missense Variant":"Sequence Variants",
    "Non Conservative Missense Variant":"Sequence Variants",
    "Plus 1 Frameshift Variant":"Sequence Variants",
    "Region-Defined Variant":"Region-Defined Variants",
    "Regulatory Region Variant":"Region-Defined Variants",
    "Sequence Variants":"Sequence Variants",
    "Splice Acceptor Variant":"Region-Defined Variants",
    "Splice Donor Region Variant":"Region-Defined Variants",
    "Splice Donor Variant":"Region-Defined Variants",
    "Splicing Variant":"Other Variants",
    "Start Lost":"Sequence Variants",
    "Stop Gained":"Sequence Variants",
    "Stop Lost":"Sequence Variants",
    "Synonymous Variant":"Sequence Variants",
    "Transcript Amplification":"Copy Number Variants",
    "Transcript Fusion":"Fusion Variants",
    "3 Prime UTR Variant":"Region-Defined Variants",
    "Amino Acid Deletion;Inframe Deletion":"Sequence Variants",
    "Frameshift Truncation;Minus 2 Frameshift Variant":"Sequence Variants",
    "Frameshift Truncation;Plus 2 Frameshift Variant":"Sequence Variants",
    "Frameshift Variant;Delins":"Sequence Variants",
    "Inframe Insertion;Delins":"Sequence Variants",
    "Inframe Insertion;Inframe Deletion;Delins":"Sequence Variants",
    "Inframe Variant;Inframe Insertion;Inframe Deletion;Delins ":"Sequence Variants",
    "Minus 1 Frameshift Variant;Frameshift Truncation":"Sequence Variants",
    "Plus 1 Frameshift Variant;Frameshift Elongation":"Sequence Variants",
    "Plus 1 Frameshift Variant;Frameshift Truncation":"Sequence Variants",
    "Missense Variant;Gain Of Function Variant":"Sequence Variants", 
    "Missense Variant;Loss Of Function Variant":"Sequence Variants", 
    "Missense Variant;Loss Of Heterozygosity":"Sequence Variants", 
    "Missense Variant;Polymorphic Sequence Variant":"Sequence Variants", 
    "Missense Variant;Snp":"Sequence Variants", 
    "Missense Variant;Transcript Fusion":"Sequence Variants",
    "Stop Gained;Loss Of Function Variant":"Sequence Variants",
    "Stop Lost;Inframe Deletion":"Sequence Variants"
}



moa_category_bins = {
    "Copy Number Variants": "Copy Number Variants",
    "Expression Variants": "Expression Variants",
    "Other Variants": "Other Variants",
    "Rearrangement Variants": "Rearrangement Variants",
    "Sequence Variants": "Sequence Variants"
}



clinvar_category_bins = {
    "Complex":"Other Variants",
    "CompoundHeterozygote":"Genotype Variants",
    "Deletion":"Sequence Variants",
    "Diplotype":"Genotype Variants",
    "Distinct chromosomes":"Rearrangement Variants",
    "Duplication":"Sequence Variants",
    "Haplotype":"Sequence Variants",
    "Haplotype, single variant":"Sequence Variants",
    "Indel":"Sequence Variants",
    "Insertion":"Sequence Variants",
    "Inversion":"Sequence Variants",
    "Microsatellite":"Sequence Variants",
    "Phase unknown":"Other Variants",
    "Tandem duplication":"Sequence Variants",
    "Translocation":"Rearrangement Variants",
    "Variation":"Other Variants",
    "copy number gain":"Copy Number Variants",
    "copy number loss":"Copy Number Variants",
    "fusion":"Fusion Variants",
    "protein only":"Sequence Variants",
    "single nucleotide variant":"Sequence Variants"
}
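The binning rule described above can be illustrated with a small lookup helper. This is only a sketch: `bin_variant_types` and `example_bins` are hypothetical names that do not appear in this notebook, and the fallback to the first recognized component type is an assumption. The notebook itself lists every multi-type label explicitly in civic_category_bins.

```python
# Hypothetical helper, not part of the original analysis.
def bin_variant_types(type_label, bins):
    """Return the category for a ';'-joined type label, or None if unknown."""
    label = type_label.strip()
    # Prefer an explicit entry for the full (possibly multi-type) label.
    if label in bins:
        return bins[label]
    # Assumed fallback: use the first component type that has a category.
    for part in label.split(";"):
        part = part.strip()
        if part in bins:
            return bins[part]
    return None


# Toy subset of a bins dictionary, for illustration only.
example_bins = {
    "Frameshift Variant": "Sequence Variants",
    "Missense Variant": "Sequence Variants",
    "Transcript Fusion": "Fusion Variants",
    "Missense Variant;Transcript Fusion": "Sequence Variants",
}

explicit = bin_variant_types("Missense Variant;Transcript Fusion", example_bins)
fallback = bin_variant_types("Frameshift Variant;Frameshift Truncation", example_bins)
unknown = bin_variant_types("Not provided", example_bins)
```

Returning `None` for unknown labels keeps unmapped types visible instead of silently placing them in a default category.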

The Fields enum below names the positions within each category_counts entry. Each entry is a list holding, in order: the number of variants that were normalized, the number ostensibly supported but unable to be normalized, the number that are not supported, the total number of variants, and the fraction normalized (filled in later, in the Computing Coverage section).

In [None]:
class Fields(IntEnum):
    """Create IntEnum for count fields in the category_counts dict."""
    NORMALIZED_COUNT = 0
    UNABLE_TO_NORMALIZE_COUNT = 1
    UNSUPPORTED_COUNT = 2
    TOTAL_COUNT = 3
    PERCENT_NORMALIZED = 4

In [None]:
category_counts = {
    "Copy Number Variants":[0,0,0,0,0.0],
    "Epigenetic Modification":[0,0,0,0,0.0],
    "Expression Variants":[0,0,0,0,0.0],
    "Fusion Variants":[0,0,0,0,0.0],
    "Gene Function Variants":[0,0,0,0,0.0],
    "Genotype Variants":[0,0,0,0,0.0],
    "Other Variants":[0,0,0,0,0.0],
    "Rearrangement Variants":[0,0,0,0,0.0],
    "Region-Defined Variants":[0,0,0,0,0.0],
    "Sequence Variants":[0,0,0,0,0.0]
}
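As a minimal sketch of how the Fields positions index a counts row, the cell below re-declares the enum so that it runs on its own and updates a toy dictionary. `add_counts`, `toy_counts`, and all numbers are illustrative assumptions, not values from the analysis.

```python
from enum import IntEnum


class Fields(IntEnum):
    """Positions within a counts row (re-declared here for a standalone example)."""
    NORMALIZED_COUNT = 0
    UNABLE_TO_NORMALIZE_COUNT = 1
    UNSUPPORTED_COUNT = 2
    TOTAL_COUNT = 3
    PERCENT_NORMALIZED = 4


def add_counts(totals, new_counts, status):
    """Add new_counts to the column given by status and to the running total."""
    for category, count in new_counts.items():
        totals[category][status] += count
        totals[category][Fields.TOTAL_COUNT] += count


# Toy data: two categories, all counters starting at zero.
toy_counts = {
    "Sequence Variants": [0, 0, 0, 0, 0.0],
    "Fusion Variants": [0, 0, 0, 0, 0.0],
}
add_counts(toy_counts, {"Sequence Variants": 5, "Fusion Variants": 1}, Fields.NORMALIZED_COUNT)
add_counts(toy_counts, {"Sequence Variants": 2}, Fields.UNABLE_TO_NORMALIZE_COUNT)
```

Because IntEnum members behave as integers, they can be used directly as list indices. Member names are case-sensitive, so lookups must use the upper-case names defined in the class.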

## <a id='toc1_3_'></a>[CIViC](#toc0_)



In order to score the normalizer's performance on the CIViC data, some cleaning is required.

First we need to read in the data that was ostensibly supported, get rid of variants with multiple type labels, and assign variant types to as many as possible of the entries that have a "Not provided" value for civic_variant_types.

Read in the .csv of normalized variants in CIViC. The file is tab-separated despite its extension.

In [None]:
civic_normalized_df = pd.read_csv("../civic/variation_analysis/able_to_normalize_queries.csv",sep = "\t")
civic_normalized_df.head()

Prune columns and add new column to flag as normalized.

In [None]:
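# Copy the selection so the column insert below does not operate on a view of the original frame.
pruned_civic_normalized_df = civic_normalized_df[["variant_id","query","query_type","civic_variant_types"]].copy()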
pruned_civic_normalized_df.insert(4,"normalization_status","normalized")
pruned_civic_normalized_df.head()

Repeat process with the variants that were unable to be normalized.

In [None]:
civic_not_normalized_df = pd.read_csv("../civic/variation_analysis/unable_to_normalize_queries.csv",sep = "\t")
civic_not_normalized_df.shape

In [None]:
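# Copy the selection so the column insert below does not operate on a view of the original frame.
pruned_civic_not_normalized_df = civic_not_normalized_df[["variant_id","query","query_type","civic_variant_types"]].copy()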
pruned_civic_not_normalized_df.insert(4,"normalization_status","not_normalized")
pruned_civic_not_normalized_df.head()

Merge these dfs

In [None]:
frames = [pruned_civic_normalized_df, pruned_civic_not_normalized_df]
civic_supported_df = pd.concat(frames)
civic_supported_df.shape

Converting all queries to upper case to make it easier to account for untyped variants later on.

In [None]:
civic_supported_df["query"] = civic_supported_df["query"].apply(str.upper)

Checking variant types.  The single largest type is "Not provided".  
Most of these look like amino acid substitutions.
Defining a regex to detect these variants and assign them the "Missense Variant" type.

In [None]:
civic_supported_df["civic_variant_types"].value_counts(dropna=False)

If a variant has no assigned variant type in CIViC ("Not provided"), is a protein query, and matches a regex pattern for amino acid substitutions (such as "PTEN A126D"), it is reclassified as a "Missense Variant".

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+[A-Z|*]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

Doing so reduced the 816 untyped variants down to 70.
Checking the remaining weird variants.

In [None]:
untyped_variants = civic_supported_df[civic_supported_df["civic_variant_types"] == "Not provided"]
untyped_variants.head(20)
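To see what the substitution pattern used above accepts, the sketch below applies it to a few upper-cased labels. The pattern is the one from this notebook, written as a raw string. `is_substitution` and every sample query other than "PTEN A126D" are hypothetical.

```python
import re

# Same pattern as in the reclassification cell, as a raw string.
SUBSTITUTION = re.compile(r"\S+\s+[A-Z]+\d+[A-Z|*]")


def is_substitution(query: str) -> bool:
    """True if the start of the query looks like '<gene> <ref aa><position><alt aa or *>'."""
    return bool(SUBSTITUTION.match(query))


# Hypothetical sample labels, already upper-cased as in the notebook.
samples = ["PTEN A126D", "GENE1 R213*", "GENE1 AMPLIFICATION", "GENE1 G12", "GENE1 W288FS"]
flags = {query: is_substitution(query) for query in samples}

# re.match only anchors the start of the string; fullmatch requires the whole label to fit.
strict = bool(SUBSTITUTION.fullmatch("GENE1 W288FS"))
```

Because `re.match` anchors only the start of the string, a label with trailing characters after the substitution (such as the hypothetical "GENE1 W288FS") is also flagged. `re.fullmatch` would be a stricter alternative if that behavior is not wanted.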

Reassigning variants marked as {gene} Amplification as Transcript Amplification Variants

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+AMPLIFICATION", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Transcript Amplification", civic_supported_df["civic_variant_types"])


Reassigning amino acid insertions, delins, and deletions as "Missense Variant", including a couple of variants that have a stray space before or after the sequence operation (for example "INS").

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+\s+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+-+\d+\s+INS+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DELINS+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DEL", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


And assigning missense types to a handful of remaining variants that are non-standard names for genomic and protein sequence variants

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+P\.+[A-Z]+\d+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+[A-Z]+\-+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+[A-Z]+>+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+_+\d+INS+[A-Z]", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

This last variant uses a nonstandard nomenclature, unique to this database, for any variant within a particular domain, so it is a region-defined variant.

In [None]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match("DICER1 RNASE IIIB MUTATION", x)))
civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Region-Defined Variant", civic_supported_df["civic_variant_types"])
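The flag-and-reassign cells above all follow the same shape, so they could also be expressed as a list of rules applied in order. The following is a sketch under that assumption, run on toy data. `RULES`, `apply_rules`, and the sample queries other than "PTEN A126D" are hypothetical, and only three of the patterns are included.

```python
import re

import pandas as pd

# Toy frame with the same columns the reclassification cells rely on.
toy = pd.DataFrame({
    "query": ["PTEN A126D", "GENE1 AMPLIFICATION", "GENE1 E746_A750DEL", "GENE1 MUTATION"],
    "query_type": ["protein"] * 4,
    "civic_variant_types": ["Not provided"] * 4,
})

# (required query_type, pattern, type to assign); order matters, first match wins.
RULES = [
    ("protein", r"\S+\s+[A-Z]+\d+[A-Z|*]", "Missense Variant"),
    ("protein", r"\S+\s+AMPLIFICATION", "Transcript Amplification"),
    ("protein", r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DEL", "Missense Variant"),
]


def apply_rules(frame: pd.DataFrame, rules) -> pd.DataFrame:
    """Return a copy of frame with untyped rows retyped by the first matching rule."""
    out = frame.copy()
    for query_type, pattern, new_type in rules:
        compiled = re.compile(pattern)
        matches = out["query"].apply(lambda q: bool(compiled.match(q)))
        mask = (
            (out["query_type"] == query_type)
            & (out["civic_variant_types"] == "Not provided")
            & matches
        )
        out.loc[mask, "civic_variant_types"] = new_type
    return out


retyped = apply_rules(toy, RULES)
```

Keeping the rules in one list makes their order explicit, and working on a copy leaves the input frame unchanged, which avoids the need for a temporary "variant flag" column.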

Add category column to CIViC df.

In [None]:
civic_supported_df["category"] = civic_supported_df["civic_variant_types"].map(civic_category_bins)
civic_supported_df.tail()
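`Series.map` with a dictionary yields NaN for any label that is not a key, and `value_counts` drops NaN by default, so unmapped types would quietly disappear from the counts. A quick check along these lines can surface them. The toy labels and the names `toy_bins`, `toy_types`, and `unmapped` are illustrative assumptions.

```python
import pandas as pd

# Toy bins and labels; "Not provided" has no entry in the bins.
toy_bins = {
    "Missense Variant": "Sequence Variants",
    "Transcript Fusion": "Fusion Variants",
}
toy_types = pd.Series(["Missense Variant", "Transcript Fusion", "Not provided"])

categories = toy_types.map(toy_bins)

# Labels that did not map to any category.
unmapped = toy_types[categories.isna()].unique().tolist()

# value_counts() ignores NaN unless dropna=False is passed.
counted = int(categories.value_counts().sum())
counted_with_nan = int(categories.value_counts(dropna=False).sum())
```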

Split df by normalized/not_normalized flag

In [None]:
civic_normalized_df_cats = civic_supported_df[civic_supported_df["normalization_status"] == "normalized"]
civic_normalized_df_cats

In [None]:
civic_not_normalized_df_cats = civic_supported_df[civic_supported_df["normalization_status"] == "not_normalized"]
civic_not_normalized_df_cats

For each df, get the CIViC variant counts by category and add them to the counts dictionary.

In [None]:
civic_normalized_category_counts = json.loads(civic_normalized_df_cats["category"].value_counts().to_json())
civic_normalized_category_counts

In [None]:
def add_json_counts(var_category_counts: dict, support_status: int) -> None:
    """Add per-category variant counts to the category_counts dictionary.

    :param var_category_counts: mapping of variant category to count, as parsed from the JSON produced by value_counts().to_json() on a "category" column.
    :param support_status: an int flag (a Fields member) indicating whether the variants are normalized (0), unable to be normalized (1), or unsupported (2) by the normalizer
    """
    for category, count in var_category_counts.items():
        category_counts[category][support_status] += count
        # Fields member names are upper case; Fields.total_count would raise an AttributeError.
        category_counts[category][Fields.TOTAL_COUNT] += count

In [None]:
add_json_counts(civic_normalized_category_counts, Fields.NORMALIZED_COUNT)
category_counts

In [None]:
civic_not_normalized_category_counts = json.loads(civic_not_normalized_df_cats["category"].value_counts().to_json())
civic_not_normalized_category_counts

In [None]:
add_json_counts(civic_not_normalized_category_counts, Fields.UNABLE_TO_NORMALIZE_COUNT)
category_counts

Read in the csv for unsupported variants.  This data was already mapped to categories in civic_variant_analysis.  Therefore, we only need to import the data and perform the count on the category column.

In [None]:
not_supported_variants = pd.read_csv("../civic/variation_analysis/not_supported_variants.csv",sep = "\t")
print(not_supported_variants.shape)
not_supported_variants.head()

Checking Counts.

In [None]:
not_supported_variants["category"].value_counts()

There are two small discrepancies here. First, there is a hyphen missing from "Region-Defined Variants" which will cause a key error.  Second, the variants labelled as "Transcript Variants" here should be binned under "Sequence Variants".  Fixing that now.

In [None]:
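# Assign the result back rather than calling replace(..., inplace=True) on a column selection,
# which newer pandas versions warn about.
not_supported_variants["category"] = not_supported_variants["category"].replace(
    {"Region Defined Variants": "Region-Defined Variants", "Transcript Variants": "Sequence Variants"}
)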

In [None]:
not_supported_variants_category_counts = json.loads(not_supported_variants["category"].value_counts().to_json())
not_supported_variants_category_counts

In [None]:
add_json_counts(not_supported_variants_category_counts, Fields.UNSUPPORTED_COUNT)
category_counts

## <a id='toc1_4_'></a>[MOA](#toc0_)

Read MOA .csv file for Normalized variants

In [None]:
moa_normalized_df = pd.read_csv("../moa/feature_analysis/able_to_normalize_queries.csv",sep = "\t")
print(moa_normalized_df.shape)
moa_normalized_df.head()

Get variant counts by category and update the category_counts dictionary.

In [None]:
moa_normalized_category_counts = json.loads(moa_normalized_df["category"].value_counts().to_json())
moa_normalized_category_counts

In [None]:
add_json_counts(moa_normalized_category_counts, Fields.NORMALIZED_COUNT)
category_counts

Repeat same process for variants that were supported but failed to normalize.

In [None]:
moa_not_normalized_df = pd.read_csv("../moa/feature_analysis/unable_to_normalize_queries.csv",sep = "\t")
print(moa_not_normalized_df.shape)
moa_not_normalized_df.head()

In [None]:
moa_not_normalized_category_counts = json.loads(moa_not_normalized_df["category"].value_counts().to_json())
moa_not_normalized_category_counts

In [None]:
add_json_counts(moa_not_normalized_category_counts, Fields.UNABLE_TO_NORMALIZE_COUNT)
category_counts

Repeat same process for variants that are unsupported.

In [None]:
moa_not_supported_df = pd.read_csv("../moa/feature_analysis/not_supported_variants.csv",sep = "\t")
print(moa_not_supported_df.shape)
print(moa_not_supported_df.head())
moa_not_supported_df["category"].value_counts(dropna=False)

In [None]:
moa_not_supported_category_counts = json.loads(moa_not_supported_df["category"].value_counts().to_json())
moa_not_supported_category_counts

In [None]:
add_json_counts(moa_not_supported_category_counts, Fields.UNSUPPORTED_COUNT)
category_counts

## <a id='toc1_5_'></a>[ClinVar](#toc0_)

Read in the three ClinVar csv files.

In [None]:
clinvar_normalized_df = pd.read_csv("../clinvar/clinvar_variation_analysis_output/variation_type_count_supported_df.csv")
print(clinvar_normalized_df.shape)
clinvar_normalized_df.head(20)

In [None]:
clinvar_not_normalized_df = pd.read_csv("../clinvar/clinvar_variation_analysis_output/variation_type_count_supported_not_normalized_df.csv")
print(clinvar_not_normalized_df.shape)
clinvar_not_normalized_df.head(10)

In [None]:
clinvar_not_supported_df = pd.read_csv("../clinvar/clinvar_variation_analysis_output/variation_type_count_not_supported_df.csv")
print(clinvar_not_supported_df.shape)
clinvar_not_supported_df.head(20)

Add column and map variant types to categories.

In [None]:
clinvar_normalized_df["category"] = clinvar_normalized_df["in.variation_type"].map(clinvar_category_bins)
clinvar_normalized_df.head(20)

In [None]:
clinvar_not_normalized_df["category"] = clinvar_not_normalized_df["in.variation_type"].map(clinvar_category_bins)
clinvar_not_normalized_df.head(20)

In [None]:
clinvar_not_supported_df["category"] = clinvar_not_supported_df["in.variation_type"].map(clinvar_category_bins)
clinvar_not_supported_df.head(20)

Due to the structure of the data and the way the original analysis developed, some but not all CNVs per in.variation_type were annotated in the in.vrs_xform_plan.policy column as "Copy number change (cn loss|del and cn gain|dup)", "Absolute copy count", or "Min/max copy count range not supported".  Conversely, some of the copy number gain/loss variants were not binned as CNVs per in.vrs_xform_plan.policy.  Therefore, every variant in the union of the following two sets needs to be placed in the Copy Number Variants category:

Variants with in.variation_type ==
1. copy number loss
2. copy number gain

Variants with in.vrs_xform_plan.policy == 
1. Copy number change (cn loss|del and cn gain|dup)
2. Absolute copy count
3. Min/max copy count range not supported

Above we already caught the first set of variants. Now we must go back through each df one more time and map the variants we missed per in.vrs_xform_plan.policy values to the category of Copy Number Variants.

In [None]:
cnv_per_policy = ["Copy number change (cn loss|del and cn gain|dup)","Absolute copy count","Min/max copy count range not supported"]

In [None]:
clinvar_normalized_df.loc[
    clinvar_normalized_df["in.vrs_xform_plan.policy"].isin(cnv_per_policy),
      "category"
      ] = "Copy Number Variants"


In [None]:
clinvar_normalized_df

In [None]:
clinvar_not_normalized_df

In [None]:
clinvar_not_normalized_df.loc[
    clinvar_not_normalized_df["in.vrs_xform_plan.policy"].isin(cnv_per_policy),
      "category"
      ] = "Copy Number Variants"


clinvar_not_normalized_df

In [None]:
clinvar_not_supported_df

In [None]:
clinvar_not_supported_df.loc[
    clinvar_not_supported_df["in.vrs_xform_plan.policy"].isin(cnv_per_policy),
      "category"
      ] = "Copy Number Variants"

clinvar_not_supported_df
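The union described above can also be written as a single boolean mask. The sketch below applies it to a toy frame. The rows, the placeholder policy string, and the names `toy`, `cnv_types`, `cnv_policies`, and `is_cnv` are assumptions for illustration only.

```python
import pandas as pd

cnv_types = ["copy number loss", "copy number gain"]
cnv_policies = [
    "Copy number change (cn loss|del and cn gain|dup)",
    "Absolute copy count",
    "Min/max copy count range not supported",
]

# Toy rows; "some other policy" is a placeholder, not a real policy value.
toy = pd.DataFrame({
    "in.variation_type": ["copy number gain", "Deletion", "Duplication", "single nucleotide variant"],
    "in.vrs_xform_plan.policy": [
        "some other policy",
        "Copy number change (cn loss|del and cn gain|dup)",
        "some other policy",
        "some other policy",
    ],
    "category": ["Sequence Variants"] * 4,
})

# A row is a CNV if either its variation type or its policy says so.
is_cnv = toy["in.variation_type"].isin(cnv_types) | toy["in.vrs_xform_plan.policy"].isin(cnv_policies)
toy.loc[is_cnv, "category"] = "Copy Number Variants"
```

In this toy frame the first row is caught by its variation type and the second by its policy, while the remaining rows keep their original category.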

Get counts from the three dfs.

In [None]:
category_counts

In [None]:
def sum_clinvar_counts(dataframe: pd.DataFrame, support_status: int) -> None:
    """given a dataframe and whether that dataframe represents normalized, not_normalized, or not_supported variants, adds the counts of variants to dictionary of counts
    
    :param dataframe: counts of variants in clinvar with variant type information in dataframe format.
    :param support_status: an int flag to indicate if the variants in the dataframe are normalized (0), unable to be normalized (1), or unsupported (2) by the normalizer
    """
    for i in category_counts.keys():
        subdf = dataframe[dataframe["category"] == i]
        if len(subdf):
            category = i
            count = subdf["count"].sum()
            print(category, count)
            category_counts[category][support_status] += count
            category_counts[category][Fields.TOTAL_COUNT] += count
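The per-category sums computed by the loop above are equivalent to a grouped sum over the "count" column. A minimal sketch on toy data follows. The frame and the names `toy` and `per_category` are hypothetical, and this is an alternative formulation rather than the method used in this notebook.

```python
import pandas as pd

# Toy pre-aggregated rows: each row carries a category and a count.
toy = pd.DataFrame({
    "category": ["Sequence Variants", "Sequence Variants", "Copy Number Variants", None],
    "count": [10, 5, 2, 4],
})

# groupby drops rows whose key is missing, so the uncategorized row is not counted.
grouped = toy.groupby("category")["count"].sum()

# Convert to plain Python ints for use in an ordinary dictionary.
per_category = {name: int(value) for name, value in grouped.items()}
```

Rows without a category are skipped here, just as they are by the equality filter in the loop, so it can be worth checking for missing categories before summing.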


In [None]:
sum_clinvar_counts(clinvar_normalized_df,Fields.NORMALIZED_COUNT)

category_counts

In [None]:
sum_clinvar_counts(clinvar_not_normalized_df,Fields.UNABLE_TO_NORMALIZE_COUNT)

category_counts

In [None]:
sum_clinvar_counts(clinvar_not_supported_df,Fields.UNSUPPORTED_COUNT)

category_counts

## <a id='toc1_6_'></a>[Computing Coverage](#toc0_)

For the table, compute the fraction of all variants in each category that were normalized. The value is stored as a string with four decimal places and converted to a percentage when the table is built.

In [None]:
for i in category_counts.keys():
    normalized = category_counts[i][Fields.NORMALIZED_COUNT]
    total = category_counts[i][Fields.TOTAL_COUNT]
    # Guard against categories with no variants to avoid a ZeroDivisionError.
    percent_covered = normalized / total if total else 0.0
    category_counts[i][Fields.PERCENT_NORMALIZED] = "%.4f" % percent_covered

category_counts
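The coverage value is simply normalized / total for each category. The sketch below shows that arithmetic on toy numbers, including a category with no variants. `fraction_normalized`, `toy_counts`, and the counts themselves are illustrative assumptions, not results from the analysis.

```python
def fraction_normalized(normalized: int, total: int) -> float:
    """Fraction of variants normalized; defined as 0.0 when the category is empty."""
    return normalized / total if total else 0.0


# Toy (normalized, total) pairs; not real results.
toy_counts = {
    "Sequence Variants": (90, 100),
    "Epigenetic Modification": (0, 0),
}

fractions = {name: fraction_normalized(n, t) for name, (n, t) in toy_counts.items()}

# The '%' format spec multiplies by 100 and appends a percent sign.
labels = {name: f"{value:.2%}" for name, value in fractions.items()}
```

Reporting 0.00% for an empty category is a design choice. Showing a dash or "n/a" instead may be clearer, since no variants were available to normalize.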
    

## <a id='toc1_7_'></a>[Generating Table](#toc0_)

Generating a table in plotly to show variant counts and normalization percentage by category, as well as the types of data fields associated with different variant categories.

In [None]:
class VariantCategory(str, Enum):
    """Create enum for the kind of variants that are in the combined analysis."""
    SEQUENCE_VARS = "Sequence Variants"
    GENOTYPES = "Genotype Variants"
    FUSION = "Fusion Variants"
    REARRANGEMENTS = "Rearrangement Variants"
    EPIGENETIC_MODIFICATION = "Epigenetic Modification"
    COPY_NUMBER = "Copy Number Variants"
    EXPRESSION = "Expression Variants"
    GENE_FUNC = "Gene Function Variants"
    REGION_DEFINED_VAR = "Region-Defined Variants"
    OTHER = "Other Variants"

VARIANT_CATEGORY_VALUES = VariantCategory.__members__.values()
    

In [None]:
base_colors = ['rgb(49, 130, 189)','rgb(239, 243, 255)', 'rgb(189, 215, 231)', 'rgb(107, 174, 214)',
           'white']
core_field = "\u2B24"
optional_field = "<b>◯</b>"

colors = ['rgb(49, 130, 189)','white', 'white', 'white',
           'white', 'rgb(49, 130, 189)', 'white', 'white', 'rgb(189, 215, 231)','rgb(107, 174, 214)']
data = {'variant_category' : VARIANT_CATEGORY_VALUES,
        'counts' : [f'{category_counts[v.value][Fields.TOTAL_COUNT]:,}' for v in VARIANT_CATEGORY_VALUES],
        'percent_normalized' : ["%.2f" %round(float(category_counts[v.value][Fields.PERCENT_NORMALIZED])*100,2)+"%" for v in VARIANT_CATEGORY_VALUES],
        'delta_sequence' : [core_field, core_field, "", "", "", "", "", "", "", optional_field],
        'delta_location' : [optional_field, optional_field, core_field, core_field, "", "", "", "", "", ""],
        'delta_frame' : [optional_field, optional_field, "", "", "", "", "", "", "", optional_field],
        'delta_quantity' : [optional_field, optional_field, "", "", core_field, core_field, optional_field, "", "", optional_field],
        'delta_function' : [optional_field, optional_field, "", "", optional_field, optional_field, core_field, core_field, "", optional_field],
        'region_specificity' : [optional_field, optional_field, optional_field, optional_field, optional_field, optional_field, optional_field, optional_field, core_field, optional_field],
        'shading' : colors
         }
df = pd.DataFrame(data)

fig = go.Figure(data=[go.Table(
  columnwidth = [90,53,65,53,50,50,50,50,50,50],
  header=dict(
    values=["<b>Variant Category</b>", "<b>Count</b>", "<b>% Normalized</b>", "<b>Δ Sequence</b>", "<b>Δ Location</b>", "<b>Δ Frame</b>", "<b>Δ Quantity</b>", "<b>Δ Function</b>", "<b>Region Specificity</b>"],
    line_color='black', fill_color='white',
    align='center', font=dict(color='black', size=18)
  ),
  cells=dict(
    values=[df.variant_category, df.counts, df.percent_normalized, df.delta_sequence, df.delta_location, df.delta_frame, df.delta_quantity, df.delta_function, df.region_specificity],
    line_color=["black"], fill_color= [df.shading],
    align='right', font=dict(color='black', size=18), height=30
  ))

])

fig.add_annotation(
            dict(
                text='  \u2B24  Core information fields<br><br>  <b>◯</b>  Optional information fields  ',
                align='left',
                showarrow=False,
                xref='paper',
                xanchor = 'right',
                yref='paper',
                x=0.98,
                y=0.02,
                yanchor = 'bottom',
                bordercolor='black',
                borderwidth=1
            ))

fig.update_layout(
    height=585, 
    width=1400,
    font=dict(
        size=18,
        color="Black"
        ),
    title = "<b>Counts, Normalizer Performance, and Data Types of Variants by Category</b>",
        margin=go.layout.Margin(
        l=2, #left margin
        r=2, #right margin
        b=0, #bottom margin
        t=52  #top margin
    ))
fig.show()

Exporting the table as a .png file.

In [None]:
fig.write_image("../merged_performance_analysis_table.png",'png')