# <a id='toc1_'></a>[CIViC Evidence Analysis](#toc0_)
The civic_evidence_analysis notebook analyzes CIViC evidence data.

**Table of contents**<a id='toc0_'></a>    
- [CIViC Evidence Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
    - [Use latest cache that has been pushed to the repo](#toc1_1_3_)    
  - [Total Variants in CIViC](#toc1_2_)    
  - [Total Evidence items in CIViC](#toc1_3_)    
  - [Total Molecular Profiles in CIViC](#toc1_4_)    
- [Create analysis functions / global variables](#toc2_)    
  - [Summary dicts](#toc2_1_)    
  - [Define Analysis Functions](#toc2_2_)    
- [Analysis of Normalized Queries](#toc3_)    
  - [List of Normalized Variant IDs](#toc3_1_)    
  - [Variant analysis](#toc3_2_)    
  - [Transform df for evidence analysis](#toc3_3_)    
  - [Evidence analysis](#toc3_4_)    
  - [Impact](#toc3_5_)    
    - [Import molecular profile id](#toc3_5_1_)    
    - [Import molecular profile scores](#toc3_5_2_)    
- [Analysis of Unable to Normalize Queries](#toc4_)    
  - [List of Unable to Normalize Variant IDs](#toc4_1_)    
  - [Variant analysis](#toc4_2_)    
  - [Transform df for evidence analysis](#toc4_3_)    
  - [Evidence analysis](#toc4_4_)    
  - [Impact](#toc4_5_)    
    - [Import molecular profile id](#toc4_5_1_)    
    - [Import molecular profile scores](#toc4_5_2_)    
- [Analysis of Not Supported Variants](#toc5_)    
    - [List of Not Supported Variant IDs](#toc5_1_1_)    
  - [Variant Analysis](#toc5_2_)    
    - [Not Supported Variant Analysis by Subcategory](#toc5_2_1_)    
  - [Transform df for evidence analysis](#toc5_3_)    
  - [Evidence analysis](#toc5_4_)    
    - [Not Supported Variant Evidence Analysis by Subcategory](#toc5_4_1_)    
  - [Impact](#toc5_5_)    
    - [Via Evidence Level](#toc5_5_1_)    
      - [Analysis with only Accepted Variants](#toc5_5_1_1_)    
        - [Calculating evidence score via level](#toc5_5_1_1_1_)    
        - [Summary Table](#toc5_5_1_1_2_)    
        - [Calculating evidence score via level](#toc5_5_1_1_3_)    
      - [Analysis with Accepted and Submitted Variants](#toc5_5_1_2_)    
        - [Calculating evidence score via level](#toc5_5_1_2_1_)    
        - [Summary Table](#toc5_5_1_2_2_)    
    - [Via Molecular Profile Score- this was not used](#toc5_5_2_)    
      - [Import molecular profile id](#toc5_5_2_1_)    
      - [Import molecular profile scores](#toc5_5_2_2_)    
      - [Impact by Subcategory](#toc5_5_2_3_)    
- [Summary](#toc6_)    
  - [Variant Analysis](#toc6_1_)    
    - [Building Summary Table 1 & 2](#toc6_1_1_)    
    - [Summary Table 1](#toc6_1_2_)    
    - [Summary Table 2](#toc6_1_3_)    
    - [Building Summary Tables 3 - 5](#toc6_1_4_)    
    - [Summary Table 3](#toc6_1_5_)    
    - [Summary Table 4](#toc6_1_6_)    
    - [Summary Table 5](#toc6_1_7_)    
  - [Evidence Analysis](#toc6_2_)    
    - [Building Summary Tables 6 & 7](#toc6_2_1_)    
    - [Summary Table 6](#toc6_2_2_)    
    - [Summary Table 7](#toc6_2_3_)    
    - [Building Summary Tables 8 - 10](#toc6_2_4_)    
    - [Summary Table 8](#toc6_2_5_)    
    - [Summary Table 9](#toc6_2_6_)    
    - [Summary Table 10](#toc6_2_7_)    
  - [Impact](#toc6_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [1]:
import os
import sys
from pathlib import Path
from enum import Enum
import re
import numpy as np

import pandas as pd
import plotly.express as px
from civicpy import civic as civicpy

module_path = os.path.abspath(os.path.join("../.."))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils import load_civicpy_cache, NOT_SUPPORTED_VARIANT_CATEGORY_VALUES  # noqa: E402

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [2]:
path = Path("output")
path.mkdir(exist_ok=True)

### <a id='toc1_1_3_'></a>[Use latest cache that has been pushed to the repo](#toc0_)

In [3]:
load_civicpy_cache()

Using cache-20250717.pkl for civicpy cache


## <a id='toc1_2_'></a>[Total Variants in CIViC](#toc0_)

In [4]:
civic_variant_ids = civicpy.get_all_variants(include_status=["accepted", "submitted"])
total_number_variants = len(civic_variant_ids)
f"Total Number of variants in CIViC: {total_number_variants}"

'Total Number of variants in CIViC: 3845'

## <a id='toc1_3_'></a>[Total Evidence items in CIViC](#toc0_)

Rejected evidence items are excluded.

In [5]:
civic_evidence_items = civicpy.get_all_evidence(
    include_status=["accepted", "submitted"]
)

In [6]:
total_ac_sub_evidence = len(civic_evidence_items)
f"Total Number of accepted and submitted evidence items in CIViC: {total_ac_sub_evidence}"

'Total Number of accepted and submitted evidence items in CIViC: 10850'

## <a id='toc1_4_'></a>[Total Molecular Profiles in CIViC](#toc0_)

In [7]:
civic_molprofs = civicpy.get_all_molecular_profiles(
    include_status=["accepted", "submitted"]
)

# <a id='toc2_'></a>[Create analysis functions / global variables](#toc0_)

In [8]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    UNABLE_TO_NORMALIZE = "Unable to Normalize"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType]
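
As a small illustration of why the class mixes in `str`, the sketch below repeats the enum definition so that it runs on its own and shows two properties the rest of the notebook relies on: members iterate in definition order, and a member compares equal to its plain string value. The variable name `values` is only for this example.

```python
from enum import Enum


class VariantNormType(str, Enum):
    """Variation Normalization types (repeated here so the example is self-contained)"""

    NORMALIZED = "Normalized"
    UNABLE_TO_NORMALIZE = "Unable to Normalize"
    NOT_SUPPORTED = "Not Supported"


# Iterating over the Enum class yields its members in definition order.
values = [member.value for member in VariantNormType]
print(values)

# Because of the `str` mixin, a member compares equal to its plain string value.
print(VariantNormType.NORMALIZED == "Normalized")

# A member can also be recovered from its string value.
print(VariantNormType("Not Supported") is VariantNormType.NOT_SUPPORTED)
```

The definition order matters here because the summary dicts list the categories in the same order in which the analysis functions are later called.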

## <a id='toc2_1_'></a>[Summary dicts](#toc0_)

These dictionaries are filled in by the analysis functions below and are used to build the summary tables at the end of the analysis.

In [9]:
variant_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Variants per Category": [],
    "Fraction of all CIViC Variants": [],
    "Percent of all CIViC Variants": [],
    "Fraction of Accepted Variants": [],
    "Percent of Accepted Variants": [],
    "Fraction of Submitted Variants": [],
    "Percent of Submitted Variants": [],
}
variant_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Variants per Category': [],
 'Fraction of all CIViC Variants': [],
 'Percent of all CIViC Variants': [],
 'Fraction of Accepted Variants': [],
 'Percent of Accepted Variants': [],
 'Fraction of Submitted Variants': [],
 'Percent of Submitted Variants': []}

In [10]:
evidence_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of all CIViC Evidence Items": [],
    "Percent of all CIViC Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percent of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percent of Submitted Evidence Items": [],
}
evidence_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Evidence Items per Category': [],
 'Fraction of all CIViC Evidence Items': [],
 'Percent of all CIViC Evidence Items': [],
 'Fraction of Accepted Evidence Items': [],
 'Percent of Accepted Evidence Items': [],
 'Fraction of Submitted Evidence Items': [],
 'Percent of Submitted Evidence Items': []}
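
A minimal sketch of how a summary dict of this shape could become a table once each list holds one entry per category. The counts are made up for illustration, and `toy_summary` and `toy_summary_df` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical, made-up values; the real lists are filled by the analysis functions.
toy_summary = {
    "Variant Category": ["Normalized", "Unable to Normalize", "Not Supported"],
    "Count of CIViC Variants per Category": [],
}

# Each analysis call appends one value per list, in category order.
for count in (5, 2, 3):
    toy_summary["Count of CIViC Variants per Category"].append(count)

# pd.DataFrame requires all lists to have the same length, so the table
# can only be built after all three categories have been analyzed.
toy_summary_df = pd.DataFrame(toy_summary)
print(toy_summary_df)
```

One consequence of this design is that rerunning an analysis cell appends a second entry for the same category, so the cells are best run once, in order.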

## <a id='toc2_2_'></a>[Define Analysis Functions](#toc0_)

In [11]:
def variant_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do variant analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe with variant ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["variant_id"])
    variant_ids = list(df["variant_id"])

    # Count
    num_variants = len(variant_ids)
    fraction_variants = f"{num_variants} / {total_number_variants}"
    print(
        f"\nNumber of {variant_norm_type.value} Variants in CIViC: {fraction_variants}"
    )

    # Percent
    percentage_variants = f"{num_variants / total_number_variants * 100:.2f}%"
    print(
        f"Percent of {variant_norm_type.value} Variants in CIViC: {percentage_variants}"
    )

    # Get accepted counts
    num_accepted_variants = df.variant_accepted.sum()
    fraction_accepted_variants = f"{num_accepted_variants} / {num_variants}"
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variants: {fraction_accepted_variants}"
    )

    # Get accepted Percent
    percentage_accepted_variants = f"{num_accepted_variants / num_variants * 100:.2f}%"
    print(
        f"Percent of accepted {variant_norm_type.value} Variants: {percentage_accepted_variants}"
    )

    # Get submitted counts
    num_submitted_variants = len(df) - num_accepted_variants
    fraction_submitted_variants = f"{num_submitted_variants} / {num_variants}"
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variants: {fraction_submitted_variants}"
    )

    # Get submitted Percent
    percentage_submitted_variants = (
        f"{num_submitted_variants / num_variants * 100:.2f}%"
    )
    print(
        f"Percent of submitted {variant_norm_type.value} Variants: {percentage_submitted_variants}"
    )

    variant_analysis_summary["Count of CIViC Variants per Category"].append(
        num_variants
    )
    variant_analysis_summary["Fraction of all CIViC Variants"].append(fraction_variants)
    variant_analysis_summary["Percent of all CIViC Variants"].append(
        percentage_variants
    )
    variant_analysis_summary["Fraction of Accepted Variants"].append(
        fraction_accepted_variants
    )
    variant_analysis_summary["Percent of Accepted Variants"].append(
        percentage_accepted_variants
    )
    variant_analysis_summary["Fraction of Submitted Variants"].append(
        fraction_submitted_variants
    )
    variant_analysis_summary["Percent of Submitted Variants"].append(
        percentage_submitted_variants
    )

    return df
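
The counting logic in `variant_analysis` can be illustrated on a toy dataframe. This is only a sketch with made-up rows; `toy_df`, `unique_df`, and the counts are hypothetical and not taken from CIViC.

```python
import pandas as pd

# Made-up rows: variant 1 appears twice because it had two queries.
toy_df = pd.DataFrame(
    {
        "variant_id": [1, 1, 2, 3],
        "variant_accepted": [True, True, False, True],
    }
)

# Keep one row per variant, as variant_analysis does.
unique_df = toy_df.drop_duplicates(subset=["variant_id"])

num_variants = len(unique_df)
# Summing a boolean column counts the True values.
num_accepted = int(unique_df["variant_accepted"].sum())
# Every variant that is not accepted is counted as submitted.
num_submitted = num_variants - num_accepted

print(f"Accepted: {num_accepted} / {num_variants} ({num_accepted / num_variants * 100:.2f}%)")
print(f"Submitted: {num_submitted} / {num_variants} ({num_submitted / num_variants * 100:.2f}%)")
```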

In [12]:
def transform_df_evidence_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence ID information
    """
    tmp_df = df.copy(deep=True)

    _variants_evidence_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        _variant_evidence_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    for e in mp.evidence_items:
                        if e.id not in _variant_evidence_ids:
                            _variant_evidence_ids.append(e.id)

        _variants_evidence_ids.append(_variant_evidence_ids or "")

    tmp_df["evidence_ids"] = _variants_evidence_ids

    # Explode and rename evidence ids field
    tmp_df = tmp_df.explode(column="evidence_ids")
    tmp_df = tmp_df.rename(columns={"evidence_ids": "evidence_id"})

    return tmp_df
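
The explode step above turns one row per variant into one row per (variant, evidence ID) pair. The sketch below shows that behavior on made-up data. It also uses a plain dict keyed by variant ID instead of scanning every variant for each row; that lookup is an alternative shown for illustration, not the notebook's method, and `evidence_ids_by_variant` is a hypothetical name.

```python
import pandas as pd

toy_df = pd.DataFrame({"variant_id": [10, 20, 30]})

# Hypothetical lookup of variant ID -> evidence IDs (made-up values).
evidence_ids_by_variant = {10: [100, 101], 20: [102]}

# Variants without evidence get "" as a placeholder, mirroring the function above.
toy_df["evidence_ids"] = [
    evidence_ids_by_variant.get(v_id) or "" for v_id in toy_df["variant_id"]
]

# Lists are expanded to one row per element; the "" placeholder stays a single row.
exploded = toy_df.explode(column="evidence_ids").rename(
    columns={"evidence_ids": "evidence_id"}
)
print(exploded)
```

Note that `explode` repeats the original index for every expanded row, which is why the notebook output shows index values such as `0, 0, 1, 2, 2`.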

In [13]:
def transform_df_evidence(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence status, rating, and level

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence status, rating, and level information.
    """
    variants_evidence_ids = list(df["evidence_id"])

    # Add evidence status, rating, and level information
    _variants_evidence_statuses = []
    _variants_evidence_ratings = []
    _variants_evidence_levels = []

    for eid in variants_evidence_ids:
        _variant_evidence_statuses = []
        _variant_evidence_ratings = []
        _variant_evidence_levels = []

        for evidence in civic_evidence_items:
            if eid and (int(eid) == evidence.id):
                if evidence.status not in _variant_evidence_statuses:
                    _variant_evidence_statuses.append(evidence.status)

                if evidence.rating not in _variant_evidence_ratings:
                    _variant_evidence_ratings.append(evidence.rating)

                if evidence.evidence_level not in _variant_evidence_levels:
                    _variant_evidence_levels.append(evidence.evidence_level)

        _variants_evidence_statuses.append(_variant_evidence_statuses or "")
        _variants_evidence_ratings.append(_variant_evidence_ratings or "")
        _variants_evidence_levels.append(_variant_evidence_levels or "")

    df["evidence_status"] = _variants_evidence_statuses
    df["evidence_status"] = df["evidence_status"].str.join(", ")
    df["evidence_rating"] = _variants_evidence_ratings
    df["evidence_level"] = _variants_evidence_levels

    return df
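
For larger inputs, the nested loop over every evidence item could be replaced by a single dict lookup per row. The following is a sketch of that idea only, under stated assumptions: `ToyEvidence` is a hypothetical stand-in for a civicpy evidence record, the values are made up, and the columns hold scalars rather than the single-element lists produced by the function above.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ToyEvidence:
    """Hypothetical stand-in for a civicpy evidence record (illustration only)."""

    id: int
    status: str
    rating: int
    evidence_level: str


toy_evidence_items = [
    ToyEvidence(100, "accepted", 2, "C"),
    ToyEvidence(101, "submitted", 3, "D"),
]

# Build the ID -> record index once instead of rescanning the list per row.
evidence_by_id = {e.id: e for e in toy_evidence_items}

toy_df = pd.DataFrame({"evidence_id": [100, 101, ""]})


def lookup(eid, attr):
    """Return the attribute for a known evidence ID, or "" when there is none."""
    evidence = evidence_by_id.get(int(eid)) if eid != "" else None
    return getattr(evidence, attr) if evidence is not None else ""


toy_df["evidence_status"] = [lookup(eid, "status") for eid in toy_df["evidence_id"]]
toy_df["evidence_level"] = [
    lookup(eid, "evidence_level") for eid in toy_df["evidence_id"]
]
print(toy_df)
```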

In [14]:
def evidence_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do evidence analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe. Evidence ID duplicates are dropped unless
        `variant_norm_type` is Not Supported. Not Supported variants have
        subcategories, so evidence item duplicates should be dropped within each
        subcategory, not across all Not Supported variant evidence items.
    """
    # Count
    num_variant_unique_evidence_items = len(set(df.evidence_id))
    fraction_evidence_items = (
        f"{num_variant_unique_evidence_items} / {total_ac_sub_evidence}"
    )
    print(
        f"Number of {variant_norm_type.value} Variant Evidence items in CIViC: {fraction_evidence_items}"
    )

    # Percent
    percentage_evidence_items = (
        f"{num_variant_unique_evidence_items / total_ac_sub_evidence * 100:.2f}%"
    )
    print(
        f"Percent of {variant_norm_type.value} Variant Evidence items in CIViC: {percentage_evidence_items}"
    )

    # Add evidence accepted column
    df["evidence_accepted"] = df.evidence_status.map(
        {"accepted": True, "submitted": False}
    )

    # Drop evidence ID duplicates into a new temporary df, leaving `df` intact so that
    # duplicates can later be dropped by evidence ID and category
    df1 = df.drop_duplicates(subset=["evidence_id"])

    # Get accepted counts
    num_accepted_evidences_variants = df1.evidence_accepted.sum()
    fraction_accepted_evidences_variants = (
        f"{num_accepted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variant Evidence items: {fraction_accepted_evidences_variants}"
    )

    # Get accepted Percent
    percentage_accepted_evidences_variants = f"{num_accepted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percent of accepted {variant_norm_type.value} Variant Evidence items: {percentage_accepted_evidences_variants}"
    )

    # Get submitted counts
    number_submitted_evidences_variants = len(df1) - num_accepted_evidences_variants
    fraction_submitted_evidences_variants = (
        f"{number_submitted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variant Evidence items: {fraction_submitted_evidences_variants}"
    )

    # Get submitted Percent
    percentage_submitted_evidences_variants = f"{number_submitted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percent of submitted {variant_norm_type.value} Variant Evidence items: {percentage_submitted_evidences_variants}"
    )

    evidence_analysis_summary["Count of CIViC Evidence Items per Category"].append(
        num_variant_unique_evidence_items
    )
    evidence_analysis_summary["Fraction of all CIViC Evidence Items"].append(
        fraction_evidence_items
    )
    evidence_analysis_summary["Percent of all CIViC Evidence Items"].append(
        percentage_evidence_items
    )
    evidence_analysis_summary["Fraction of Accepted Evidence Items"].append(
        fraction_accepted_evidences_variants
    )
    evidence_analysis_summary["Percent of Accepted Evidence Items"].append(
        percentage_accepted_evidences_variants
    )
    evidence_analysis_summary["Fraction of Submitted Evidence Items"].append(
        fraction_submitted_evidences_variants
    )
    evidence_analysis_summary["Percent of Submitted Evidence Items"].append(
        percentage_submitted_evidences_variants
    )
    if variant_norm_type == VariantNormType.NOT_SUPPORTED:
        return df
    else:
        return df1
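
The accepted and submitted counts in `evidence_analysis` depend on two steps: mapping the status string to a boolean, and dropping duplicate evidence IDs before counting. The sketch below walks through both on made-up rows; all names and values are hypothetical.

```python
import pandas as pd

# Made-up rows: evidence item 100 is linked to two variants, so it appears twice.
toy_df = pd.DataFrame(
    {
        "evidence_id": [100, 100, 101, 102],
        "evidence_status": ["accepted", "accepted", "submitted", "submitted"],
    }
)

# Map the status string to a boolean flag.
toy_df["evidence_accepted"] = toy_df["evidence_status"].map(
    {"accepted": True, "submitted": False}
)

# Count each evidence item once, regardless of how many variants reference it.
unique_df = toy_df.drop_duplicates(subset=["evidence_id"])

num_unique = len(set(toy_df["evidence_id"]))
num_accepted = int(unique_df["evidence_accepted"].sum())
num_submitted = len(unique_df) - num_accepted

print(f"Accepted: {num_accepted} / {num_unique}")
print(f"Submitted: {num_submitted} / {num_unique}")
```

Without the deduplication step, evidence item 100 would be counted as accepted twice, inflating the accepted fraction.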

In [15]:
def transform_df_mp_id(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile ID information
    """
    tmp_df = df.copy(deep=True)

    variants_molprof_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        variant_molprof_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    if mp.id not in variant_molprof_ids:
                        variant_molprof_ids.append(mp.id)

        variants_molprof_ids.append(variant_molprof_ids or "")

    tmp_df["molecular_profile_id"] = variants_molprof_ids
    return tmp_df

In [16]:
def transform_df_mp_score(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score information
    """
    variants_molprof_scores = []
    normalized_variant_molprof_ids = list(df["molecular_profile_id"])

    for mp_ids in normalized_variant_molprof_ids:
        variant_molprof_scores = []
        for mp_id in mp_ids:
            for molprof in civic_molprofs:
                if int(mp_id) == molprof.id:
                    variant_molprof_scores.append(molprof.molecular_profile_score)

        variants_molprof_scores.append(variant_molprof_scores or "")

    df["molecular_profile_score"] = variants_molprof_scores
    return df

In [17]:
def transform_df_mp_score_sum(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score sum information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score sum information
    """
    df["molecular_profile_score_sum"] = df["molecular_profile_score"].apply(
        lambda x: sum(x)
    )
    return df
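
A short sketch of the score sum on made-up data, including the `""` placeholder that `transform_df_mp_score` stores when a variant has no scores. The values and the name `toy_df` are hypothetical.

```python
import pandas as pd

# Made-up score lists; the last row uses the "" placeholder for "no scores".
toy_df = pd.DataFrame({"molecular_profile_score": [[5.0], [10.0, 2.5], ""]})

# sum() over a list adds its elements; sum() over an empty string has nothing
# to iterate over and therefore returns the default start value 0.
toy_df["molecular_profile_score_sum"] = toy_df["molecular_profile_score"].apply(
    lambda scores: sum(scores)
)
print(toy_df)
```

This relies on the placeholder being an empty string: a non-empty string in that column would make `sum` raise a `TypeError`.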

# <a id='toc3_'></a>[Analysis of Normalized Queries](#toc0_)

## <a id='toc3_1_'></a>[List of Normalized Variant IDs](#toc0_)

In [18]:
normalized_queries_df = pd.read_csv(
    "../variation_analysis/able_to_normalize_queries.tsv", sep="\t"
)
normalized_queries_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize


## <a id='toc3_2_'></a>[Variant analysis](#toc0_)

In [19]:
normalized_queries_df = variant_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df.head()


Number of Normalized Variants in CIViC: 2015 / 3845
Percent of Normalized Variants in CIViC: 52.41%

Number of accepted Normalized Variants: 976 / 2015
Percent of accepted Normalized Variants: 48.44%

Number of submitted Normalized Variants: 1039 / 2015
Percent of submitted Normalized Variants: 51.56%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize


In [20]:
variant_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Variants per Category': [2015],
 'Fraction of all CIViC Variants': ['2015 / 3845'],
 'Percent of all CIViC Variants': ['52.41%'],
 'Fraction of Accepted Variants': ['976 / 2015'],
 'Percent of Accepted Variants': ['48.44%'],
 'Fraction of Submitted Variants': ['1039 / 2015'],
 'Percent of Submitted Variants': ['51.56%']}

## <a id='toc3_3_'></a>[Transform df for evidence analysis](#toc0_)

In [21]:
normalized_queries_add_evidence_df = transform_df_evidence_ids(normalized_queries_df)
normalized_queries_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,9347
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,6724
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,5336
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,10779
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,6723


In [22]:
normalized_queries_add_evidence_df = transform_df_evidence(
    normalized_queries_add_evidence_df
)
normalized_queries_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id,evidence_status,evidence_rating,evidence_level
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,9347,submitted,[3],[C]
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,6724,accepted,[2],[C]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,5336,accepted,[2],[C]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,10779,submitted,[3],[C]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,6723,accepted,[2],[C]


## <a id='toc3_4_'></a>[Evidence analysis](#toc0_)

In [23]:
normalized_queries_add_evidence_df = evidence_analysis(
    normalized_queries_add_evidence_df, VariantNormType.NORMALIZED
)
normalized_queries_add_evidence_df.head()

Number of Normalized Variant Evidence items in CIViC: 6457 / 10850
Percent of Normalized Variant Evidence items in CIViC: 59.51%

Number of accepted Normalized Variant Evidence items: 2415 / 6457
Percent of accepted Normalized Variant Evidence items: 37.40%

Number of submitted Normalized Variant Evidence items: 4042 / 6457
Percent of submitted Normalized Variant Evidence items: 62.60%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,9347,submitted,[3],[C],False
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,6724,accepted,[2],[C],True
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,5336,accepted,[2],[C],True
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,10779,submitted,[3],[C],False
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,6723,accepted,[2],[C],True


## <a id='toc3_5_'></a>[Impact](#toc0_)
Via molecular profile score

### <a id='toc3_5_1_'></a>[Import molecular profile id](#toc0_)

In [24]:
normalized_queries_add_molprof_df = transform_df_mp_id(normalized_queries_df)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,[2362]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,[1864]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,[2361]
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize,[1862]
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize,[1863]


### <a id='toc3_5_2_'></a>[Import molecular profile scores](#toc0_)

In [25]:
normalized_queries_add_molprof_df = transform_df_mp_score(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,[2362],[5.0]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,[1864],[5.0]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,[2361],[5.0]
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize,[1862],[10.0]
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize,[1863],[5.0]


In [26]:
normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.bq-oeQxlHsivQjLeBx2iIDHE6byLoIYf,normalize,[2362],[5.0],5.0
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.F28e9gdIz4RKTwb8Vch32ewM9byNWd7s,normalize,[1864],[5.0],5.0
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.locY4ll_kFLsvWR3-6n4zSCbY2WeBC4H,normalize,[2361],[5.0],5.0
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.Mikw3IoUZ58l_zejQQOT0D0inT2Cvxpr,normalize,[1862],[10.0],10.0
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.GkISlkjkoX6ts9HHLAzsjDvbCU0d6KyH,normalize,[1863],[5.0],5.0


# <a id='toc4_'></a>[Analysis of Unable to Normalize Queries](#toc0_)

## <a id='toc4_1_'></a>[List of Unable to Normalize Variant IDs](#toc0_)

In [27]:
not_normalized_queries_df = pd.read_csv(
    "../variation_analysis/unable_to_normalize_queries.tsv", sep="\t"
)
not_normalized_queries_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L']
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V']
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T']
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t..."
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T']


## <a id='toc4_2_'></a>[Variant analysis](#toc0_)

In [28]:
not_normalized_queries_df = variant_analysis(
    not_normalized_queries_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_queries_df.head()


Number of Unable to Normalize Variants in CIViC: 83 / 3845
Percent of Unable to Normalize Variants in CIViC: 2.16%

Number of accepted Unable to Normalize Variants: 14 / 83
Percent of accepted Unable to Normalize Variants: 16.87%

Number of submitted Unable to Normalize Variants: 69 / 83
Percent of submitted Unable to Normalize Variants: 83.13%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L']
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V']
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T']
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t..."
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T']


## <a id='toc4_3_'></a>[Transform df for evidence analysis](#toc0_)

In [29]:
not_normalized_quer_add_evidence_df = transform_df_evidence_ids(
    not_normalized_queries_df
)
not_normalized_quer_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323


In [30]:
not_normalized_quer_add_evidence_df = transform_df_evidence(
    not_normalized_quer_add_evidence_df
)
not_normalized_quer_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id,evidence_status,evidence_rating,evidence_level
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812,accepted,[1],[C]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128,submitted,[3],[D]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135,submitted,[3],[D]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494,submitted,[4],[D]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323,submitted,[3],[B]


## <a id='toc4_4_'></a>[Evidence analysis](#toc0_)

In [31]:
not_normalized_quer_add_evidence_df = evidence_analysis(
    not_normalized_quer_add_evidence_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_quer_add_evidence_df.head()

Number of Unable to Normalize Variant Evidence items in CIViC: 128 / 10850
Percent of Unable to Normalize Variant Evidence items in CIViC: 1.18%

Number of accepted Unable to Normalize Variant Evidence items: 20 / 128
Percent of accepted Unable to Normalize Variant Evidence items: 15.62%

Number of submitted Unable to Normalize Variant Evidence items: 108 / 128
Percent of submitted Unable to Normalize Variant Evidence items: 84.38%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812,accepted,[1],[C],True
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128,submitted,[3],[D],False
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135,submitted,[3],[D],False
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494,submitted,[4],[D],False
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323,submitted,[3],[B],False


## <a id='toc4_5_'></a>[Impact](#toc0_)
Via molecular profile score

### <a id='toc4_5_1_'></a>[Import molecular profile id](#toc0_)

In [32]:
not_normalized_queries_add_molprof_df = transform_df_mp_id(not_normalized_queries_df)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]"
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244]


### <a id='toc4_5_2_'></a>[Import molecular profile scores](#toc0_)

In [33]:
not_normalized_queries_add_molprof_df = transform_df_mp_score(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id,molecular_profile_score
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729],[2.5]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586],[0.0]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593],[0.0]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]","[0.0, 0.0]"
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244],[40.0]


In [34]:
not_normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729],[2.5],2.5
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586],[0.0],0.0
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593],[0.0],0.0
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]","[0.0, 0.0]",0.0
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244],[40.0],40.0
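
The helper `transform_df_mp_score_sum` is defined earlier in the notebook and is not shown in this section. Judging only from the output above, it appears to add up each row's list of molecular profile scores. A minimal sketch of that idea on a few rows copied from the table (the function name `sum_score_lists` is hypothetical and is not the notebook's implementation):

```python
import pandas as pd


def sum_score_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a per-row sum of the list-valued score column."""
    out = df.copy()
    # Each cell holds a list of scores, one per molecular profile
    out["molecular_profile_score_sum"] = out["molecular_profile_score"].apply(sum)
    return out


toy = pd.DataFrame(
    {
        "variant_id": [748, 4485, 248],
        "molecular_profile_score": [[2.5], [0.0, 0.0], [40.0]],
    }
)
result = sum_score_lists(toy)
print(result[["variant_id", "molecular_profile_score_sum"]])
```

A variant linked to several molecular profiles, such as variant 4485, contributes the sum of all of its profile scores.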


# <a id='toc5_'></a>[Analysis of Not Supported Variants](#toc0_)

### <a id='toc5_1_1_'></a>[List of Not Supported Variant ID's](#toc0_)

In [35]:
not_supported_queries_df = pd.read_csv(
    "../variation_analysis/not_supported_variants.tsv", sep="\t"
)
not_supported_queries_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript,False
1,4214,VHL,,Not provided,Transcript,False
2,4216,VHL,,Not provided,Transcript,False
3,4278,VHL,,Not provided,Transcript,False
4,4232,BRCA1,,Not provided,Transcript,False


## <a id='toc5_2_'></a>[Variant Analysis](#toc0_)

In [36]:
not_supported_queries_df = variant_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)
not_supported_queries_df.head()


Number of Not Supported Variants in CIViC: 1747 / 3845
Percent of Not Supported Variants in CIViC: 45.44%

Number of accepted Not Supported Variants: 814 / 1747
Percent of accepted Not Supported Variants: 46.59%

Number of submitted Not Supported Variants: 933 / 1747
Percent of submitted Not Supported Variants: 53.41%


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript,False
1,4214,VHL,,Not provided,Transcript,False
2,4216,VHL,,Not provided,Transcript,False
3,4278,VHL,,Not provided,Transcript,False
4,4232,BRCA1,,Not provided,Transcript,False


In [37]:
not_supported_queries_df["variant_accepted"].value_counts()

variant_accepted
False    933
True     814
Name: count, dtype: int64

### <a id='toc5_2_1_'></a>[Not Supported Variant Analysis by Subcategory](#toc0_)

In [38]:
not_supported_variant_analysis_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "Count of CIViC Variants per Category": [],
    "Fraction of Not Supported Variants": [],
    "Percent of Not Supported Variants": [],
    "Fraction of all CIViC Variants": [],
    "Percent of all CIViC Variants": [],
    "Fraction of Accepted Variants": [],
    "Percent of Accepted Variants": [],
    "Fraction of Submitted Variants": [],
    "Percent of Submitted Variants": [],
}

In [39]:
not_supported_variant_categories_summary_data = dict()
total_number_unique_not_supported_variants = len(
    set(not_supported_queries_df.variant_id)
)

# Iterate over the Not Supported variant categories
for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_variants = len(set(category_df.variant_id))
    not_supported_variant_categories_summary_data[category][
        "number_unique_not_supported_category_variants"
    ] = number_unique_not_supported_category_variants

    # Fraction
    fraction_not_supported_category_variant_of_civic = (
        f"{number_unique_not_supported_category_variants} / {total_number_variants}"
    )
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_civic"
    ] = fraction_not_supported_category_variant_of_civic

    # Percent
    percent_not_supported_category_variant_of_civic = f"{number_unique_not_supported_category_variants / total_number_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_civic"
    ] = percent_not_supported_category_variant_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants} / {total_number_unique_not_supported_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_total_not_supported"
    ] = fraction_not_supported_category_variant_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants / total_number_unique_not_supported_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_total_not_supported"
    ] = percent_not_supported_category_variant_of_total_not_supported

    # Accepted fraction
    number_accepted_not_supported_category_variants = category_df.variant_accepted.sum()
    fraction_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_accepted_not_supported_category_variants"
    ] = fraction_accepted_not_supported_category_variants

    # Accepted percent
    percentage_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_accepted_not_supported_category_variants"
    ] = percentage_accepted_not_supported_category_variants

    # Submitted fraction
    number_submitted_not_supported_category_variants = (
        len(category_df) - number_accepted_not_supported_category_variants
    )
    fraction_submitted_not_supported_category_variants = f"{number_submitted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_submitted_not_supported_category_variants"
    ] = fraction_submitted_not_supported_category_variants

    # Submitted percent
    percentage_submitted_not_supported_category_variants = f"{number_submitted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_submitted_not_supported_category_variants"
    ] = percentage_submitted_not_supported_category_variants

    not_supported_variant_analysis_summary[
        "Count of CIViC Variants per Category"
    ].append(number_unique_not_supported_category_variants)
    not_supported_variant_analysis_summary["Fraction of all CIViC Variants"].append(
        fraction_not_supported_category_variant_of_civic
    )
    not_supported_variant_analysis_summary["Percent of all CIViC Variants"].append(
        percent_not_supported_category_variant_of_civic
    )
    not_supported_variant_analysis_summary["Fraction of Not Supported Variants"].append(
        fraction_not_supported_category_variant_of_total_not_supported
    )
    not_supported_variant_analysis_summary["Percent of Not Supported Variants"].append(
        percent_not_supported_category_variant_of_total_not_supported
    )
    not_supported_variant_analysis_summary["Fraction of Accepted Variants"].append(
        fraction_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Percent of Accepted Variants"].append(
        percentage_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Fraction of Submitted Variants"].append(
        fraction_submitted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Percent of Submitted Variants"].append(
        percentage_submitted_not_supported_category_variants
    )

## <a id='toc5_3_'></a>[Transform df for evidence analysis](#toc0_)

In [40]:
not_supported_variants_add_evidence_df = transform_df_evidence_ids(
    not_supported_queries_df
)
not_supported_variants_add_evidence_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id
0,4170,VHL,,Not provided,Transcript,False,10647
1,4214,VHL,,Not provided,Transcript,False,10752
2,4216,VHL,,Not provided,Transcript,False,10754
3,4278,VHL,,Not provided,Transcript,False,10958
4,4232,BRCA1,,Not provided,Transcript,False,7164
...,...,...,...,...,...,...,...
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9613
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9618
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9619
1745,3508,CD274,v242,Not provided,Sequence,False,9695


There are some variants without evidence items. These variants were excluded from the impact analysis since they cannot contribute a variant impact score.

In [41]:
not_supported_variants_add_evidence_df.loc[
    not_supported_variants_add_evidence_df["evidence_id"] == ""
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id
47,5163,,Alterations,Not provided,Region-Defined,False,
530,4537,,Fusion,Transcript Fusion,Fusion,False,


In [42]:
not_supported_variants_add_evidence_df = transform_df_evidence(
    not_supported_variants_add_evidence_df
)
not_supported_variants_add_evidence_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level
0,4170,VHL,,Not provided,Transcript,False,10647,submitted,[2],[C]
1,4214,VHL,,Not provided,Transcript,False,10752,submitted,[3],[C]
2,4216,VHL,,Not provided,Transcript,False,10754,submitted,[3],[C]
3,4278,VHL,,Not provided,Transcript,False,10958,submitted,[3],[C]
4,4232,BRCA1,,Not provided,Transcript,False,7164,submitted,[3],[C]
...,...,...,...,...,...,...,...,...,...,...
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9613,submitted,[4],[B]
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9618,submitted,[4],[B]
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9619,submitted,[4],[B]
1745,3508,CD274,v242,Not provided,Sequence,False,9695,submitted,[4],[E]


## <a id='toc5_4_'></a>[Evidence analysis](#toc0_)

In [43]:
not_supported_variants_add_evidence_df = evidence_analysis(
    not_supported_variants_add_evidence_df, VariantNormType.NOT_SUPPORTED
)
not_supported_variants_add_evidence_df

Number of Not Supported Variant Evidence items in CIViC: 4926 / 10850
Percent of Not Supported Variant Evidence items in CIViC: 45.40%

Number of accepted Not Supported Variant Evidence items: 2558 / 4926
Percent of accepted Not Supported Variant Evidence items: 51.93%

Number of submitted Not Supported Variant Evidence items: 2368 / 4926
Percent of submitted Not Supported Variant Evidence items: 48.07%


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,4170,VHL,,Not provided,Transcript,False,10647,submitted,[2],[C],False
1,4214,VHL,,Not provided,Transcript,False,10752,submitted,[3],[C],False
2,4216,VHL,,Not provided,Transcript,False,10754,submitted,[3],[C],False
3,4278,VHL,,Not provided,Transcript,False,10958,submitted,[3],[C],False
4,4232,BRCA1,,Not provided,Transcript,False,7164,submitted,[3],[C],False
...,...,...,...,...,...,...,...,...,...,...,...
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9613,submitted,[4],[B],False
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9618,submitted,[4],[B],False
1744,3478,ESR2,underexpression beta-1,Not provided,Other,False,9619,submitted,[4],[B],False
1745,3508,CD274,v242,Not provided,Sequence,False,9695,submitted,[4],[E],False


### <a id='toc5_4_1_'></a>[Not Supported Variant Evidence Analysis by Subcategory](#toc0_)

List all possible variant categories. The non-unique data must be used here because a single evidence item can be used more than once across groups.


In [44]:
not_supported_variant_categories = (
    not_supported_variants_add_evidence_df.category.unique()
)
list(not_supported_variant_categories)

['Transcript',
 'Genotype/Haplotype',
 'Sequence',
 'Rearrangement',
 'Region-Defined',
 'Other',
 'Copy Number',
 'Fusion',
 'Gene Function',
 'Expression',
 'Genome Feature',
 'Epigenetic Modification']

A single evidence item may be associated with multiple variants.

In [45]:
duplicate = not_supported_variants_add_evidence_df[
    not_supported_variants_add_evidence_df.duplicated("evidence_id", keep=False)
]
duplicate

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
5,5005,HLA-A,*02:01P,Not provided,Genotype/Haplotype,False,12138,submitted,[4],[A],False
6,5006,HLA-A,*02:02P,Not provided,Genotype/Haplotype,False,12138,submitted,[4],[A],False
7,5007,HLA-A,*02:03P,Not provided,Genotype/Haplotype,False,12138,submitted,[4],[A],False
8,5008,HLA-A,*02:06P,Not provided,Genotype/Haplotype,False,12138,submitted,[4],[A],False
40,1296,CTNNB1,Activating Mutation,Gain Of Function Variant;Transcript Variant,Gene Function,True,12023,submitted,[2],[A],False
...,...,...,...,...,...,...,...,...,...,...,...
1624,4466,TERT,,Not provided,Transcript,False,11278,submitted,[2],[C],False
1662,5027,,e10::e18,Transcript Fusion,Fusion,False,11169,submitted,[1],[C],False
1683,5015,MAP2K4,loss-of-function Mutation,Not provided,Gene Function,False,12141,submitted,[3],[D],False
1689,4463,TSC1,mutation,Not provided,Region-Defined,False,11269,submitted,[4],[A],False


In [46]:
not_supported_variant_evidence_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of all CIViC Evidence Items": [],
    "Percent of all CIViC Evidence Items": [],
    "Fraction of Not Supported Variant Evidence Items": [],
    "Percent of Not Supported Variant Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percent of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percent of Submitted Evidence Items": [],
}

In [47]:
not_supported_variant_categories_evidence_summary_data = dict()
total_number_not_supported_variant_unique_evidence_items = len(
    set(not_supported_variants_add_evidence_df.evidence_id)
)

for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_evidence_summary_data[category] = {}
    evidence_category_df = not_supported_variants_add_evidence_df[
        not_supported_variants_add_evidence_df.category == category
    ]
 
    # Count
    number_unique_not_supported_category_evidence = len(
        set(evidence_category_df.evidence_id)
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "number_unique_not_supported_category_evidence"
    ] = number_unique_not_supported_category_evidence

    # Fraction
    fraction_not_supported_category_variant_evidence_of_civic = (
        f"{number_unique_not_supported_category_evidence} / {total_ac_sub_evidence}"
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_civic"
    ] = fraction_not_supported_category_variant_evidence_of_civic

    # Percent
    percent_not_supported_category_variant_evidence_of_civic = f"{number_unique_not_supported_category_evidence / total_ac_sub_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_civic"
    ] = percent_not_supported_category_variant_evidence_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence} / {total_number_not_supported_variant_unique_evidence_items}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_total_not_supported"
    ] = fraction_not_supported_category_variant_evidence_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence / total_number_not_supported_variant_unique_evidence_items * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_total_not_supported"
    ] = percent_not_supported_category_variant_evidence_of_total_not_supported

    # Accepted fraction (count each evidence item once within the category)
    unique_evidence_category_df = evidence_category_df.drop_duplicates("evidence_id")
    number_accepted_not_supported_category_variant_evidence = (
        unique_evidence_category_df.evidence_accepted.sum()
    )
    fraction_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_accepted_evidence_not_supported_category_variants"
    ] = fraction_accepted_evidence_not_supported_category_variants

    # Accepted percent
    percentage_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_accepted_evidence_not_supported_category_variants"
    ] = percentage_accepted_evidence_not_supported_category_variants

    # Submitted fraction
    number_submitted_not_supported_category_variant_evidence = (
        number_unique_not_supported_category_evidence
        - number_accepted_not_supported_category_variant_evidence
    )
    fraction_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_submitted_evidence_not_supported_category_variants"
    ] = fraction_submitted_evidence_not_supported_category_variants

    # Submitted percent
    percentage_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_submitted_evidence_not_supported_category_variants"
    ] = percentage_submitted_evidence_not_supported_category_variants

    not_supported_variant_evidence_summary[
        "Count of CIViC Evidence Items per Category"
    ].append(number_unique_not_supported_category_evidence)
    not_supported_variant_evidence_summary[
        "Fraction of all CIViC Evidence Items"
    ].append(fraction_not_supported_category_variant_evidence_of_civic)
    not_supported_variant_evidence_summary[
        "Percent of all CIViC Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_civic)
    not_supported_variant_evidence_summary[
        "Fraction of Not Supported Variant Evidence Items"
    ].append(fraction_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Percent of Not Supported Variant Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Fraction of Accepted Evidence Items"
    ].append(fraction_accepted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary["Percent of Accepted Evidence Items"].append(
        percentage_accepted_evidence_not_supported_category_variants
    )
    not_supported_variant_evidence_summary[
        "Fraction of Submitted Evidence Items"
    ].append(fraction_submitted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary[
        "Percent of Submitted Evidence Items"
    ].append(percentage_submitted_evidence_not_supported_category_variants)

## <a id='toc5_5_'></a>[Impact](#toc0_)

### <a id='toc5_5_1_'></a>[Via Evidence Level](#toc0_)

#### <a id='toc5_5_1_1_'></a>[Analysis with only Accepted Variants](#toc0_)

accepted variant = a variant with at least one 'accepted' evidence item

In [48]:
ns_var_w_evid_df = not_supported_variants_add_evidence_df.copy()
ns_var_w_evid_df["evidence_id"] = ns_var_w_evid_df["evidence_id"].apply(
    lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
)
ns_var_w_evid_df = ns_var_w_evid_df[ns_var_w_evid_df["evidence_id"].notna()]

After dropping rows without an evidence ID, no remaining row is missing an `evidence_accepted` value:

In [49]:
ns_var_w_evid_df[
    ns_var_w_evid_df["evidence_accepted"].isna()
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted


Keep only the rows whose evidence item is accepted. The variants that remain are those with at least one accepted evidence item (Accepted Variants).

In [50]:
ns_var_w_acc_evid_df = ns_var_w_evid_df[ns_var_w_evid_df["evidence_accepted"]].copy()

In [51]:
ns_var_w_acc_evid_df["evidence_accepted"].value_counts()

evidence_accepted
True    2615
Name: count, dtype: int64

##### <a id='toc5_5_1_1_1_'></a>[Calculating evidence score via level](#toc0_)

Each variant receives an evidence score equal to the sum of the numerical values assigned to the levels of its evidence items.
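
As a worked illustration, using the level-to-score mapping from `calculate_impact_score` (A = 10, B = 5, C = 3, D = 1, E = 0.5), a variant with one level B item and two level C items scores 5 + 3 + 3 = 11. The sketch below reproduces that arithmetic on a made-up dataframe; the variant IDs and levels are illustrative only, and levels are given as plain strings rather than the single-element lists used in the notebook:

```python
import pandas as pd

LEVEL_TO_SCORE = {"A": 10, "B": 5, "C": 3, "D": 1, "E": 0.5}

# Made-up evidence items: three for variant 1, one for variant 2
toy = pd.DataFrame(
    {
        "variant_id": [1, 1, 1, 2],
        "evidence_level": ["B", "C", "C", "E"],
    }
)

# Map each level to its numerical value, then sum per variant
toy["evidence_score"] = toy["evidence_level"].map(LEVEL_TO_SCORE)
scores = toy.groupby("variant_id")["evidence_score"].sum()
print(scores)
```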

In [52]:
def calculate_impact_score(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the alphabetical evidence level to a numerical score and sum the scores of the evidence items for each variant

    Note: the `evidence_level` and `evidence_score` columns of `df` are modified in place.

    :param df: Dataframe of variants with respective evidence items
    :return: Transformed dataframe with evidence score
    """
    EVIDENCE_LEVEL_TO_IMPACT = {"A": 10, "B": 5, "C": 3, "D": 1, "E": 0.5}
    # Take the first element of each evidence_level value (displayed as e.g. [C])
    df["evidence_level"] = df["evidence_level"].apply(lambda x: x[0])
    df["evidence_score"] = df["evidence_level"].map(EVIDENCE_LEVEL_TO_IMPACT)

    # groupby sorts the result by variant_id, so no explicit sort is needed
    df1 = df.groupby("variant_id").aggregate(
        {
            "gene_name": "first",
            "variant_name": "first",
            "category": "first",
            "evidence_id": "count",
            "evidence_score": "sum",
        }
    )
    df1 = df1.rename(
        columns={
            "evidence_id": "#_evidence_items",
            "evidence_score": "evidence_score_sum",
        }
    )

    return df1

In [53]:
not_supported_variants_w_acc_evid_df = calculate_impact_score(ns_var_w_acc_evid_df)
not_supported_variants_w_acc_evid_df

Unnamed: 0_level_0,gene_name,variant_name,category,#_evidence_items,evidence_score_sum
variant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,Fusion,Fusion,328,783.0
5,,Fusion,Fusion,47,91.0
17,BRAF,V600,Sequence,25,146.0
19,CCND1,Expression,Expression,2,10.0
20,CCND1,Overexpression,Expression,8,36.0
...,...,...,...,...,...
5162,EPOR,rearrangements,Rearrangement,1,5.0
5167,,Fusion,Fusion,1,3.0
5168,MECOM,rearrangement,Rearrangement,3,15.0
5171,ALK,Exon 2-18 Deletion,Rearrangement,2,6.0


##### <a id='toc5_5_1_1_2_'></a>[Summary Table](#toc0_)

In [54]:
def summarize_impact(df: pd.DataFrame) -> pd.DataFrame:
    """Calculates the number of variants, evidence items, and impact score per category

    :param df: Dataframe of variants
    :return: Transformed dataframe with the number of variants, evidence items, and impact score per category
    """
    df1 = df.reset_index()

    df1 = df1.groupby("category").aggregate(
        {"variant_id": "count", "#_evidence_items": "sum", "evidence_score_sum": "sum"}
    )
    df1 = df1.rename(
        columns={"evidence_score_sum": "impact", "variant_id": "number_of_variants"}
    )
    df1["average_impact_per_variant"] = (
        df1["impact"] / df1["number_of_variants"]
    ).round(2)
    df1 = df1.sort_values(by=["impact"], ascending=False)

    return df1

In [55]:
not_supported_accepted_variant_categories_df = summarize_impact(
    not_supported_variants_w_acc_evid_df
)
not_supported_accepted_variant_categories_df

Unnamed: 0_level_0,number_of_variants,#_evidence_items,impact,average_impact_per_variant
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fusion,203,1028,3475.5,17.12
Region-Defined,105,459,2197.0,20.92
Expression,181,345,1249.0,6.9
Rearrangement,52,238,1114.0,21.42
Sequence,70,193,844.5,12.06
Gene Function,59,171,613.5,10.4
Other,37,42,198.0,5.35
Transcript,56,58,168.0,3.0
Copy Number,19,35,113.0,5.95
Genotype/Haplotype,14,19,96.0,6.86


##### <a id='toc5_5_1_1_3_'></a>[Calculating evidence score via level](#toc0_)

In [56]:
not_supported_accepted_variant_categories_df.sum().round(2)

number_of_variants              814.00
#_evidence_items               2615.00
impact                        10175.50
average_impact_per_variant      122.09
dtype: float64
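
Note that the `average_impact_per_variant` entry in this output is the sum of the per-category averages, not an overall average. A minimal sketch of the overall average impact per variant, using only the two totals printed above:

```python
# Totals taken from the column sums printed above
total_impact = 10175.5
total_variants = 814

# Overall average impact per accepted Not Supported variant
overall_average = round(total_impact / total_variants, 2)
print(overall_average)
```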

#### <a id='toc5_5_1_2_'></a>[Analysis with Accepted and Submitted Variants](#toc0_)

submitted variant = a variant with only 'submitted' evidence items

In [57]:
ns_var_w_acc_sub_evid_df = not_supported_variants_add_evidence_df.copy()
ns_var_w_acc_sub_evid_df["evidence_id"] = ns_var_w_acc_sub_evid_df["evidence_id"].apply(
    lambda x: np.nan if isinstance(x, str) and x.strip() == "" else x
)
# Drop rows that have no evidence items
ns_var_w_acc_sub_evid_df = ns_var_w_acc_sub_evid_df[
    ns_var_w_acc_sub_evid_df["evidence_id"].notna()
]

##### <a id='toc5_5_1_2_1_'></a>[Calculating evidence score via level](#toc0_)

In [58]:
not_supported_variants_w_acc_sub_evid_df = calculate_impact_score(
    ns_var_w_acc_sub_evid_df
)
not_supported_variants_w_acc_sub_evid_df

Unnamed: 0_level_0,gene_name,variant_name,category,#_evidence_items,evidence_score_sum
variant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,Fusion,Fusion,462,1117.0
5,,Fusion,Fusion,95,167.0
17,BRAF,V600,Sequence,28,166.0
19,CCND1,Expression,Expression,2,10.0
20,CCND1,Overexpression,Expression,10,40.0
...,...,...,...,...,...
5178,CD44,CD44v10,Other,1,5.0
5179,,Fusion,Fusion,1,5.0
5180,BAX,mutation,Region-Defined,1,1.0
5187,,Fusion,Fusion,3,15.0


##### <a id='toc5_5_1_2_2_'></a>[Summary Table](#toc0_)

In [59]:
not_supported_accepted_submitted_variant_categories_df = summarize_impact(
    not_supported_variants_w_acc_sub_evid_df
)
not_supported_accepted_submitted_variant_categories_df

Unnamed: 0_level_0,number_of_variants,#_evidence_items,impact,average_impact_per_variant
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fusion,312,1625,5552.0,17.79
Region-Defined,254,828,3398.5,13.38
Rearrangement,122,601,2373.5,19.45
Expression,294,626,2117.0,7.2
Gene Function,111,404,1362.5,12.27
Transcript,362,435,1305.0,3.6
Sequence,133,307,1241.0,9.33
Other,79,128,506.5,6.41
Genotype/Haplotype,22,42,232.0,10.55
Copy Number,32,77,225.0,7.03


The impact contributed by submitted evidence items only (accepted and submitted impact minus accepted-only impact)

In [60]:
(
    not_supported_accepted_submitted_variant_categories_df["impact"]
    - not_supported_accepted_variant_categories_df["impact"]
).sort_values(ascending=False)

category
Fusion                     2076.5
Rearrangement              1259.5
Region-Defined             1201.5
Transcript                 1137.0
Expression                  868.0
Gene Function               749.0
Sequence                    396.5
Other                       308.5
Genotype/Haplotype          136.0
Copy Number                 112.0
Genome Feature               95.0
Epigenetic Modification      10.0
Name: impact, dtype: float64
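
The subtraction above aligns the two series on their `category` index. The output has a value for every category, so alignment does not appear to be a problem here. In the general case, a category that is present in only one table would produce `NaN` with plain subtraction. A hedged sketch on made-up numbers showing how `Series.sub` with `fill_value=0` treats a missing category as zero impact:

```python
import pandas as pd

# Made-up impact totals; "Genome Feature" is absent from the accepted-only series
both = pd.Series({"Fusion": 10.0, "Genome Feature": 4.0}, name="impact")
accepted_only = pd.Series({"Fusion": 6.0}, name="impact")

plain = both - accepted_only  # missing category becomes NaN
filled = both.sub(accepted_only, fill_value=0)  # missing category counts as 0

print(plain)
print(filled)
```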

In [61]:
not_supported_accepted_submitted_variant_categories_df.to_csv(
    "output/civic_both_evidence_cat_impact_df.csv", index=True
)
not_supported_accepted_variant_categories_df.to_csv(
    "output/civic_accepted_evidence_only_impact_df.csv",
    index=True,
)

### <a id='toc5_5_2_'></a>[Via Molecular Profile Score (not used)](#toc0_)
Since MOA evidence items are only scored by level, we used the impact score via evidence level for CIViC variants to remain consistent.

#### <a id='toc5_5_2_1_'></a>[Import molecular profile id](#toc0_)

In [62]:
not_supported_variants_add_molprof_df = transform_df_mp_id(not_supported_queries_df)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id
0,4170,VHL,,Not provided,Transcript,False,[4038]
1,4214,VHL,,Not provided,Transcript,False,[4082]
2,4216,VHL,,Not provided,Transcript,False,[4084]
3,4278,VHL,,Not provided,Transcript,False,[4146]
4,4232,BRCA1,,Not provided,Transcript,False,[4100]


#### <a id='toc5_5_2_2_'></a>[Import molecular profile scores](#toc0_)

In [63]:
not_supported_variants_add_molprof_df = transform_df_mp_score(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score
0,4170,VHL,,Not provided,Transcript,False,[4038],[0.0]
1,4214,VHL,,Not provided,Transcript,False,[4082],[0.0]
2,4216,VHL,,Not provided,Transcript,False,[4084],[0.0]
3,4278,VHL,,Not provided,Transcript,False,[4146],[0.0]
4,4232,BRCA1,,Not provided,Transcript,False,[4100],[0.0]


In [64]:
not_supported_variants_add_molprof_df = transform_df_mp_score_sum(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,4170,VHL,,Not provided,Transcript,False,[4038],[0.0],0.0
1,4214,VHL,,Not provided,Transcript,False,[4082],[0.0],0.0
2,4216,VHL,,Not provided,Transcript,False,[4084],[0.0],0.0
3,4278,VHL,,Not provided,Transcript,False,[4146],[0.0],0.0
4,4232,BRCA1,,Not provided,Transcript,False,[4100],[0.0],0.0
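
`transform_df_mp_score_sum` is defined earlier in the notebook and its implementation is not repeated here. As a hedged illustration of the step it performs, summing a list-valued column row by row could look like the sketch below. The toy frame and its values are made up for the example.

```python
import pandas as pd

# Toy frame: each row holds the scores of the molecular profiles for one variant.
toy_df = pd.DataFrame(
    {
        "variant_id": [1, 2, 3],
        "molecular_profile_score": [[0.0], [7.5, 12.5], []],
    }
)

# Apply the built-in sum to each list; an empty list sums to 0.
toy_df["molecular_profile_score_sum"] = toy_df["molecular_profile_score"].apply(sum)
print(toy_df)
```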


In [65]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] == 0.0)
    & (not_supported_variants_add_molprof_df["variant_accepted"])
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
42,2657,ERBB2,Activating Mutation,Gene Variant;Gain Of Function Variant,Gene Function,True,"[2526, 5070]","[0.0, 0.0]",0.0
51,5157,,Amplification,Not provided,Region-Defined,True,[5437],[0.0],0.0
129,4585,MTAP,Deletion,Not provided,Gene Function,True,[4644],[0.0],0.0
145,3744,VHL,,Not provided,Transcript,True,[3612],[0.0],0.0
158,1516,EGFR,EGFRVIII,Not provided,Gene Function,True,"[1424, 4245, 4345, 4346]","[0.0, 0.0, 0.0, 0.0]",0.0
...,...,...,...,...,...,...,...,...,...
1668,5144,,e24::e4,Transcript Fusion,Fusion,True,[5422],[0.0],0.0
1675,5150,,e7::e35,Not provided,Fusion,True,[5429],[0.0],0.0
1726,5168,MECOM,rearrangement,Not provided,Rearrangement,True,[5448],[0.0],0.0
1727,5162,EPOR,rearrangements,Not provided,Rearrangement,True,[5442],[0.0],0.0


In [66]:
not_supported_variants_add_molprof_df["molecular_profile_score_sum"].max()

np.float64(1065.0)

In [67]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] != 0.0)
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
11,2930,VHL,,Not provided,Transcript,True,[2799],[7.5],7.5
13,785,CHEK2,1100DELC,Frameshift Truncation,Sequence,True,[766],[15.0],15.0
16,823,EPCAM,3' Exon Deletion,Disruptive Inframe Deletion,Rearrangement,True,[801],[20.0],20.0
17,433,HIF1A,3' UTR Polymorphism,3 Prime UTR Variant;Snp,Region-Defined,True,[429],[10.0],10.0
20,2367,VHL,3p26.3-25.3 11Mb del,Not provided,Rearrangement,True,[2240],[7.5],7.5
...,...,...,...,...,...,...,...,...,...
1722,272,CDKN2A,p16 Expression,,Expression,True,[268],[180.0],180.0
1728,3313,CDKN1A,rs1059234,Not provided,Other,True,[3181],[15.0],15.0
1731,256,KIT,rs17084733,3 Prime UTR Variant,Other,True,[252],[15.0],15.0
1732,2671,CDKN1A,rs1801270,Not provided,Other,True,[2540],[15.0],15.0


#### <a id='toc5_5_2_3_'></a>[Impact by Subcategory](#toc0_)

In [68]:
not_supported_impact_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "CIVIC Total Sum Impact Score": [],
    "Average Impact Score per Variant": [],
    "Average Impact Score per Evidence Item": [],
    "Total Number Evidence Items": [
        v["number_unique_not_supported_category_evidence"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "% Accepted Evidence Items": [
        v["percentage_accepted_evidence_not_supported_category_variants"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "Total Number Variants": [
        v["number_unique_not_supported_category_variants"]
        for v in not_supported_variant_categories_summary_data.values()
    ],
}

In [69]:
not_supported_variant_categories_impact_data = dict()
for category, variant_summary, evidence_summary in zip(
    NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    not_supported_variant_categories_summary_data.values(),
    not_supported_variant_categories_evidence_summary_data.values(),
):
    not_supported_variant_categories_impact_data[category] = {}
    impact_category_df = not_supported_variants_add_molprof_df[
        not_supported_variants_add_molprof_df.category == category
    ]

    total_sum_not_supported_category_impact = impact_category_df[
        "molecular_profile_score_sum"
    ].sum()
    not_supported_variant_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

    # Look up the counts for the current category so that the averages are not
    # computed from values left over from an earlier loop. The summary dicts are
    # read in the same order as NOT_SUPPORTED_VARIANT_CATEGORY_VALUES, as in the
    # summary built in the previous cell.
    number_unique_not_supported_category_variants = variant_summary[
        "number_unique_not_supported_category_variants"
    ]
    number_unique_not_supported_category_evidence = evidence_summary[
        "number_unique_not_supported_category_evidence"
    ]

    avg_impact_score_variant = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_variants
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_variant"
    ] = avg_impact_score_variant

    avg_impact_score_evidence = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_evidence
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_evidence

    not_supported_impact_summary["CIVIC Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Variant"].append(
        avg_impact_score_variant
    )
    not_supported_impact_summary["Average Impact Score per Evidence Item"].append(
        avg_impact_score_evidence
    )

    print(f"{category}: {total_sum_not_supported_category_impact}")

Sequence: 2601.75
Genotype/Haplotype: 312.5
Fusion: 8112.75
Rearrangement: 3336.0
Epigenetic Modification: 285.5
Copy Number: 210.0
Expression: 3628.0
Gene Function: 1878.75
Region-Defined: 6565.0
Genome Feature: 0.0
Other: 576.0
Transcript: 349.0


In [70]:
not_supported_variant_impact_df = pd.DataFrame(not_supported_impact_summary)

In [71]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

Unnamed: 0,Category,CIVIC Total Sum Impact Score,Average Impact Score per Variant,Average Impact Score per Evidence Item,Total Number Evidence Items,% Accepted Evidence Items,Total Number Variants
0,Sequence,2601.75,19.56,8.67,300,64.33%,133
1,Genotype/Haplotype,312.5,14.2,8.01,39,48.72%,22
2,Fusion,8112.75,25.92,5.1,1590,64.65%,313
3,Rearrangement,3336.0,27.34,5.63,593,40.13%,122
4,Epigenetic Modification,285.5,20.39,12.41,23,95.65%,14
5,Copy Number,210.0,6.56,2.73,77,45.45%,32
6,Expression,3628.0,12.34,5.82,623,55.38%,294
7,Gene Function,1878.75,16.93,4.87,386,44.30%,111
8,Region-Defined,6565.0,25.75,8.4,782,58.70%,255
9,Genome Feature,0.0,0.0,0.0,25,20.00%,10


In [72]:
not_supported_variant_impact_df.to_csv(
    "output/not_supported_variant_impact_df.csv", index=False
)
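
The per-category totals and averages can also be expressed as a single `groupby` aggregation, which divides each category's total by that category's own counts. The sketch below uses a toy frame; the column names `n_evidence_items` and `score_sum` and all values are assumptions for illustration. It sums per-variant evidence counts, so unlike the notebook's unique evidence counts it would double count an evidence item shared by two variants.

```python
import pandas as pd

# Toy per-variant rows (illustrative values only).
toy = pd.DataFrame(
    {
        "category": ["Fusion", "Fusion", "Expression"],
        "n_evidence_items": [3, 1, 2],
        "score_sum": [15.0, 5.0, 4.0],
    }
)

# One row per category: total impact, number of variants, number of evidence items.
summary = toy.groupby("category").agg(
    total_impact=("score_sum", "sum"),
    n_variants=("score_sum", "size"),
    n_evidence=("n_evidence_items", "sum"),
)

# Averages use the counts from the same row, i.e. the same category.
summary["avg_per_variant"] = summary["total_impact"] / summary["n_variants"]
summary["avg_per_evidence"] = summary["total_impact"] / summary["n_evidence"]
print(summary.round(2))
```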

# <a id='toc6_'></a>[Summary](#toc0_)

## <a id='toc6_1_'></a>[Variant Analysis](#toc0_)

### <a id='toc6_1_1_'></a>[Building Summary Table 1 & 2](#toc0_)

In [73]:
all_variant_df = pd.DataFrame(variant_analysis_summary)

In [74]:
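def combine_frac_perc(df: pd.DataFrame, denominator: list[str]) -> pd.DataFrame:
    """Combine each fraction column and its percent column into one string column

    :param df: Dataframe of variant statistics
    :param denominator: Names of the denominators; for each name, the dataframe
        must contain a ``Fraction of <name>`` and a ``Percent of <name>`` column
    :return: Transformed dataframe where each ``Percent of <name>`` column holds
        the fraction followed by the percent, and the fraction column is dropped
    """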
    for d in denominator:
        perc_key = f"Percent of {d}"
        frac_key = f"Fraction of {d}"
        df[perc_key] = df[frac_key].astype(str) + "  (" + df[perc_key] + ")"
        df = df.drop([frac_key], axis=1)
    return df

In [75]:
all_variant_df = combine_frac_perc(
    all_variant_df, ["all CIViC Variants", "Accepted Variants", "Submitted Variants"]
)
all_variant_df

Unnamed: 0,Variant Category,Count of CIViC Variants per Category,Percent of all CIViC Variants,Percent of Accepted Variants,Percent of Submitted Variants
0,Normalized,2015,2015 / 3845 (52.41%),976 / 2015 (48.44%),1039 / 2015 (51.56%)
1,Unable to Normalize,83,83 / 3845 (2.16%),14 / 83 (16.87%),69 / 83 (83.13%)
2,Not Supported,1747,1747 / 3845 (45.44%),814 / 1747 (46.59%),933 / 1747 (53.41%)
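
To make the column handling in `combine_frac_perc` concrete, here is a small sketch of the same string concatenation applied to a single denominator on a toy frame. The frame contents are invented; only the `Fraction of ...` / `Percent of ...` naming pattern is taken from the function above.

```python
import pandas as pd

# Toy input: the fraction and percent are already formatted as strings.
toy = pd.DataFrame(
    {
        "Variant Category": ["A", "B"],
        "Fraction of Accepted Variants": ["1 / 4", "3 / 4"],
        "Percent of Accepted Variants": ["25.00%", "75.00%"],
    }
)

name = "Accepted Variants"
frac_key = f"Fraction of {name}"
perc_key = f"Percent of {name}"

# Overwrite the percent column with "<fraction>  (<percent>)", then drop the fraction.
toy[perc_key] = toy[frac_key].astype(str) + "  (" + toy[perc_key] + ")"
toy = toy.drop(columns=[frac_key])
print(toy)
```

Because the percent column is concatenated directly, it has to hold strings already; a numeric percent column would need `.astype(str)` first.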


In [76]:
all_variant_percent_status_df = all_variant_df.drop(
    [
        "Percent of all CIViC Variants",
        "Count of CIViC Variants per Category",
    ],
    axis=1,
)

for_merge_all_variant_percent_of_civic_df = all_variant_df.drop(
    [
        "Percent of Accepted Variants",
        "Percent of Submitted Variants",
    ],
    axis=1,
)

all_variant_percent_of_civic_df = for_merge_all_variant_percent_of_civic_df.drop(
    ["Count of CIViC Variants per Category"], axis=1
)

In [77]:
for_merge_all_variant_percent_of_civic_df.to_csv(
    "output/for_merge_all_variant_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percent they make up of all variants in CIViC data.

<ins>Numerator:</ins> # of CIViC variants based on normalization status
<br><ins>Denominator:</ins> # of all CIViC variants
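
Each cell in the summary tables pairs a fraction with its percent. A minimal sketch of how such a pair can be formatted from a numerator and denominator follows; `frac_and_percent` is a hypothetical helper written for this example, not a function from the notebook, and it does not guard against a zero denominator.

```python
def frac_and_percent(numerator: int, denominator: int) -> tuple[str, str]:
    """Return the fraction string and the percent string (two decimals)."""
    fraction = f"{numerator} / {denominator}"
    percent = f"{numerator / denominator:.2%}"
    return fraction, percent


# Counts taken from the Normalized row of the table: 2015 of 3845 variants.
fraction, percent = frac_and_percent(2015, 3845)
print(f"{fraction} ({percent})")
```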

In [78]:
all_variant_percent_of_civic_df = all_variant_percent_of_civic_df.set_index(
    "Variant Category"
)
all_variant_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Variants
Variant Category,Unnamed: 1_level_1
Normalized,2015 / 3845 (52.41%)
Unable to Normalize,83 / 3845 (2.16%)
Not Supported,1747 / 3845 (45.44%)


In [79]:
civic_summary_table_1 = all_variant_percent_of_civic_df

### <a id='toc6_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percent of the variants in each category are accepted (have at least one evidence item that is accepted) or not.

<ins>Numerator:</ins> # of CIViC variants based on normalization and acceptance status
<br><ins>Denominator:</ins> # of CIViC variants based on normalization status
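
The acceptance rule stated above (a variant is accepted if at least one of its evidence items is accepted) can be sketched in plain Python as follows. The variant IDs and the `(variant_id, status)` pairs are toy data, assumed only for illustration.

```python
from collections import defaultdict

# Toy (variant_id, evidence_status) pairs.
evidence_items = [(1, "accepted"), (1, "submitted"), (2, "submitted"), (3, "accepted")]

# A variant becomes accepted as soon as any of its evidence items is accepted.
variant_is_accepted = defaultdict(bool)
for variant_id, status in evidence_items:
    variant_is_accepted[variant_id] |= status == "accepted"

n_variants = len(variant_is_accepted)
n_accepted = sum(variant_is_accepted.values())
n_submitted = n_variants - n_accepted
print(f"{n_accepted} / {n_variants} ({n_accepted / n_variants:.2%})")
```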

In [80]:
all_variant_percent_status_df = all_variant_percent_status_df.set_index(
    "Variant Category"
)
all_variant_percent_status_df

Unnamed: 0_level_0,Percent of Accepted Variants,Percent of Submitted Variants
Variant Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Normalized,976 / 2015 (48.44%),1039 / 2015 (51.56%)
Unable to Normalize,14 / 83 (16.87%),69 / 83 (83.13%)
Not Supported,814 / 1747 (46.59%),933 / 1747 (53.41%)


In [81]:
civic_summary_table_2 = all_variant_percent_status_df

### <a id='toc6_1_4_'></a>[Building Summary Tables 3 - 5](#toc0_)

In [82]:
not_supported_variant_df = pd.DataFrame(not_supported_variant_analysis_summary)

In [83]:
not_supported_variant_df = combine_frac_perc(
    not_supported_variant_df,
    [
        "Not Supported Variants",
        "all CIViC Variants",
        "Accepted Variants",
        "Submitted Variants",
    ],
)
not_supported_variant_df

Unnamed: 0,Category,Count of CIViC Variants per Category,Percent of Not Supported Variants,Percent of all CIViC Variants,Percent of Accepted Variants,Percent of Submitted Variants
0,Sequence,133,133 / 1747 (7.61%),133 / 3845 (3.46%),70 / 133 (52.63%),63 / 133 (47.37%)
1,Genotype/Haplotype,22,22 / 1747 (1.26%),22 / 3845 (0.57%),14 / 22 (63.64%),8 / 22 (36.36%)
2,Fusion,313,313 / 1747 (17.92%),313 / 3845 (8.14%),203 / 313 (64.86%),110 / 313 (35.14%)
3,Rearrangement,122,122 / 1747 (6.98%),122 / 3845 (3.17%),52 / 122 (42.62%),70 / 122 (57.38%)
4,Epigenetic Modification,14,14 / 1747 (0.80%),14 / 3845 (0.36%),14 / 14 (100.00%),0 / 14 (0.00%)
5,Copy Number,32,32 / 1747 (1.83%),32 / 3845 (0.83%),19 / 32 (59.38%),13 / 32 (40.62%)
6,Expression,294,294 / 1747 (16.83%),294 / 3845 (7.65%),181 / 294 (61.56%),113 / 294 (38.44%)
7,Gene Function,111,111 / 1747 (6.35%),111 / 3845 (2.89%),59 / 111 (53.15%),52 / 111 (46.85%)
8,Region-Defined,255,255 / 1747 (14.60%),255 / 3845 (6.63%),105 / 255 (41.18%),150 / 255 (58.82%)
9,Genome Feature,10,10 / 1747 (0.57%),10 / 3845 (0.26%),4 / 10 (40.00%),6 / 10 (60.00%)


In [84]:
for_merge_not_supported_variant_percent_of_civic_df = not_supported_variant_df.drop(
    [
        "Percent of Not Supported Variants",
        "Percent of Accepted Variants",
        "Percent of Submitted Variants",
    ],
    axis=1,
)

not_supported_variant_percent_of_civic_df = (
    for_merge_not_supported_variant_percent_of_civic_df.drop(
        ["Count of CIViC Variants per Category"], axis=1
    )
)

not_supported_variant_percent_of_not_supported_df = not_supported_variant_df[
    ["Category", "Percent of Not Supported Variants"]
].copy()

not_supported_variant_percent_evidence_df = not_supported_variant_df.drop(
    [
        "Percent of all CIViC Variants",
        "Percent of Not Supported Variants",
        "Count of CIViC Variants per Category",
    ],
    axis=1,
)

In [85]:
for_merge_not_supported_variant_percent_of_civic_df.to_csv(
    "output/for_merge_not_supported_variant_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_1_5_'></a>[Summary Table 3](#toc0_)

The table below shows the categories that the Not Supported variants were broken into and what percent of all CIViC variants they make up. These percentages will not add up to 100% because Not Supported variants are only a subset of all CIViC variants.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of all CIViC variants

In [86]:
not_supported_variant_percent_of_civic_df = (
    not_supported_variant_percent_of_civic_df.set_index("Category")
)
not_supported_variant_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Variants
Category,Unnamed: 1_level_1
Sequence,133 / 3845 (3.46%)
Genotype/Haplotype,22 / 3845 (0.57%)
Fusion,313 / 3845 (8.14%)
Rearrangement,122 / 3845 (3.17%)
Epigenetic Modification,14 / 3845 (0.36%)
Copy Number,32 / 3845 (0.83%)
Expression,294 / 3845 (7.65%)
Gene Function,111 / 3845 (2.89%)
Region-Defined,255 / 3845 (6.63%)
Genome Feature,10 / 3845 (0.26%)


In [87]:
civic_summary_table_3 = not_supported_variant_percent_of_civic_df

### <a id='toc6_1_6_'></a>[Summary Table 4](#toc0_)

The table below shows the Not Supported variants broken up into 12 sub categories and what percent of the Not Supported variant group each sub category makes up.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC variants that are Not Supported

In [88]:
not_supported_variant_percent_of_not_supported_df = (
    not_supported_variant_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_percent_of_not_supported_df

Unnamed: 0_level_0,Percent of Not Supported Variants
Category,Unnamed: 1_level_1
Sequence,133 / 1747 (7.61%)
Genotype/Haplotype,22 / 1747 (1.26%)
Fusion,313 / 1747 (17.92%)
Rearrangement,122 / 1747 (6.98%)
Epigenetic Modification,14 / 1747 (0.80%)
Copy Number,32 / 1747 (1.83%)
Expression,294 / 1747 (16.83%)
Gene Function,111 / 1747 (6.35%)
Region-Defined,255 / 1747 (14.60%)
Genome Feature,10 / 1747 (0.57%)


In [89]:
civic_summary_table_4 = not_supported_variant_percent_of_not_supported_df

### <a id='toc6_1_7_'></a>[Summary Table 5](#toc0_)

The table below shows, for each of the 12 Not Supported sub categories, what percent of its variants are accepted or submitted.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory based on acceptance status
<br><ins>Denominator:</ins> # of CIViC variants that are Not Supported in a given Subcategory

In [90]:
not_supported_variant_percent_evidence_df = (
    not_supported_variant_percent_evidence_df.set_index("Category")
)
not_supported_variant_percent_evidence_df

Unnamed: 0_level_0,Percent of Accepted Variants,Percent of Submitted Variants
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Sequence,70 / 133 (52.63%),63 / 133 (47.37%)
Genotype/Haplotype,14 / 22 (63.64%),8 / 22 (36.36%)
Fusion,203 / 313 (64.86%),110 / 313 (35.14%)
Rearrangement,52 / 122 (42.62%),70 / 122 (57.38%)
Epigenetic Modification,14 / 14 (100.00%),0 / 14 (0.00%)
Copy Number,19 / 32 (59.38%),13 / 32 (40.62%)
Expression,181 / 294 (61.56%),113 / 294 (38.44%)
Gene Function,59 / 111 (53.15%),52 / 111 (46.85%)
Region-Defined,105 / 255 (41.18%),150 / 255 (58.82%)
Genome Feature,4 / 10 (40.00%),6 / 10 (60.00%)


In [91]:
civic_summary_table_5 = not_supported_variant_percent_evidence_df

## <a id='toc6_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc6_2_1_'></a>[Building Summary Tables 6 & 7](#toc0_)

In [92]:
all_variant_evidence_df = pd.DataFrame(evidence_analysis_summary)

In [93]:
all_variant_evidence_df = combine_frac_perc(
    all_variant_evidence_df,
    ["all CIViC Evidence Items", "Accepted Evidence Items", "Submitted Evidence Items"],
)
all_variant_evidence_df

Unnamed: 0,Variant Category,Count of CIViC Evidence Items per Category,Percent of all CIViC Evidence Items,Percent of Accepted Evidence Items,Percent of Submitted Evidence Items
0,Normalized,6457,6457 / 10850 (59.51%),2415 / 6457 (37.40%),4042 / 6457 (62.60%)
1,Unable to Normalize,128,128 / 10850 (1.18%),20 / 128 (15.62%),108 / 128 (84.38%)
2,Not Supported,4926,4926 / 10850 (45.40%),2558 / 4926 (51.93%),2368 / 4926 (48.07%)


In [94]:
for_merge_all_variant_evidence_percent_of_civic_df = all_variant_evidence_df.drop(
    ["Percent of Accepted Evidence Items", "Percent of Submitted Evidence Items"],
    axis=1,
)

all_variant_evidence_percent_of_civic_df = (
    for_merge_all_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

all_variant_evidence_percent_evidence_df = all_variant_evidence_df.drop(
    [
        "Percent of all CIViC Evidence Items",
        "Count of CIViC Evidence Items per Category",
    ],
    axis=1,
)

In [95]:
for_merge_all_variant_evidence_percent_of_civic_df.to_csv(
    "output/for_merge_all_variant_evidence_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_2_2_'></a>[Summary Table 6](#toc0_)

The table below shows what percent of all evidence items in CIViC are associated with Normalized, Unable to Normalize, and Not Supported variants. The percentages add up to more than 100% because an evidence item can be associated with variants in more than one category.

<ins>Numerator:</ins> # of CIViC evidence items based on normalization status of associated variant
<br><ins>Denominator:</ins> # of all CIViC evidence items
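
In the table, 6457 + 128 + 4926 = 11511, which is more than the 10850 evidence items in the denominator. The sketch below shows how this arises when categories are counted as sets of evidence IDs that can overlap. The IDs are toy values chosen for the example.

```python
# Toy evidence IDs per variant category; ID 103 is attached to variants
# in two different categories.
evidence_ids_by_category = {
    "Normalized": {101, 102, 103},
    "Unable to Normalize": {104},
    "Not Supported": {103, 105},
}

# The denominator counts each evidence item once.
all_evidence_ids = set().union(*evidence_ids_by_category.values())
total = len(all_evidence_ids)

# Each category's share is computed against the same de-duplicated total.
shares = {
    category: len(ids) / total for category, ids in evidence_ids_by_category.items()
}
print({category: f"{share:.2%}" for category, share in shares.items()})
print(f"sum of shares: {sum(shares.values()):.2%}")
```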

In [96]:
all_variant_evidence_percent_of_civic_df = (
    all_variant_evidence_percent_of_civic_df.set_index("Variant Category")
)
all_variant_evidence_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Evidence Items
Variant Category,Unnamed: 1_level_1
Normalized,6457 / 10850 (59.51%)
Unable to Normalize,128 / 10850 (1.18%)
Not Supported,4926 / 10850 (45.40%)


In [97]:
civic_summary_table_6 = all_variant_evidence_percent_of_civic_df

### <a id='toc6_2_3_'></a>[Summary Table 7](#toc0_)

The table below shows the percent of accepted and submitted evidence items per category of variants.

<ins>Numerator:</ins> # of CIViC evidence items based on evidence acceptance status and normalization status of associated variant
<br><ins>Denominator:</ins> # of all CIViC evidence items based on normalization status of associated variant

In [98]:
all_variant_evidence_percent_evidence_df = (
    all_variant_evidence_percent_evidence_df.set_index("Variant Category")
)
all_variant_evidence_percent_evidence_df

Unnamed: 0_level_0,Percent of Accepted Evidence Items,Percent of Submitted Evidence Items
Variant Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Normalized,2415 / 6457 (37.40%),4042 / 6457 (62.60%)
Unable to Normalize,20 / 128 (15.62%),108 / 128 (84.38%)
Not Supported,2558 / 4926 (51.93%),2368 / 4926 (48.07%)


In [99]:
civic_summary_table_7 = all_variant_evidence_percent_evidence_df

### <a id='toc6_2_4_'></a>[Building Summary Tables 8 - 10](#toc0_)

In [100]:
not_supported_variant_evidence_df = pd.DataFrame(not_supported_variant_evidence_summary)

In [101]:
not_supported_variant_evidence_df = combine_frac_perc(
    not_supported_variant_evidence_df,
    [
        "all CIViC Evidence Items",
        "Not Supported Variant Evidence Items",
        "Accepted Evidence Items",
        "Submitted Evidence Items",
    ],
)
not_supported_variant_evidence_df

Unnamed: 0,Category,Count of CIViC Evidence Items per Category,Percent of all CIViC Evidence Items,Percent of Not Supported Variant Evidence Items,Percent of Accepted Evidence Items,Percent of Submitted Evidence Items
0,Sequence,300,300 / 10850 (2.76%),300 / 4926 (6.09%),193 / 300 (64.33%),107 / 300 (35.67%)
1,Genotype/Haplotype,39,39 / 10850 (0.36%),39 / 4926 (0.79%),19 / 39 (48.72%),20 / 39 (51.28%)
2,Fusion,1590,1590 / 10850 (14.65%),1590 / 4926 (32.28%),1028 / 1590 (64.65%),562 / 1590 (35.35%)
3,Rearrangement,593,593 / 10850 (5.47%),593 / 4926 (12.04%),238 / 593 (40.13%),355 / 593 (59.87%)
4,Epigenetic Modification,23,23 / 10850 (0.21%),23 / 4926 (0.47%),22 / 23 (95.65%),1 / 23 (4.35%)
5,Copy Number,77,77 / 10850 (0.71%),77 / 4926 (1.56%),35 / 77 (45.45%),42 / 77 (54.55%)
6,Expression,623,623 / 10850 (5.74%),623 / 4926 (12.65%),345 / 623 (55.38%),278 / 623 (44.62%)
7,Gene Function,386,386 / 10850 (3.56%),386 / 4926 (7.84%),171 / 386 (44.30%),215 / 386 (55.70%)
8,Region-Defined,782,782 / 10850 (7.21%),782 / 4926 (15.87%),459 / 782 (58.70%),323 / 782 (41.30%)
9,Genome Feature,25,25 / 10850 (0.23%),25 / 4926 (0.51%),5 / 25 (20.00%),20 / 25 (80.00%)


In [102]:
for_merge_not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of Accepted Evidence Items",
            "Percent of Submitted Evidence Items",
        ],
        axis=1,
    )
)

not_supported_variant_evidence_percent_of_civic_df = (
    for_merge_not_supported_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_df[
        ["Category", "Percent of Not Supported Variant Evidence Items"]
    ].copy()
)


not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of all CIViC Evidence Items",
            "Count of CIViC Evidence Items per Category",
        ],
        axis=1,
    )
)

In [103]:
for_merge_not_supported_variant_evidence_percent_of_civic_df.to_csv(
    "output/for_merge_not_supported_variant_evidence_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_2_5_'></a>[Summary Table 8](#toc0_)

The table below shows the percent of all CIViC evidence items that are associated with each Not Supported variant sub category. These percentages will not add up to 100% because Not Supported variants account for only part of the CIViC evidence items, and because an evidence item can be associated with multiple variants.

<ins>Numerator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of all CIViC evidence items

In [104]:
not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_percent_of_civic_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Evidence Items
Category,Unnamed: 1_level_1
Sequence,300 / 10850 (2.76%)
Genotype/Haplotype,39 / 10850 (0.36%)
Fusion,1590 / 10850 (14.65%)
Rearrangement,593 / 10850 (5.47%)
Epigenetic Modification,23 / 10850 (0.21%)
Copy Number,77 / 10850 (0.71%)
Expression,623 / 10850 (5.74%)
Gene Function,386 / 10850 (3.56%)
Region-Defined,782 / 10850 (7.21%)
Genome Feature,25 / 10850 (0.23%)


In [105]:
civic_summary_table_8 = not_supported_variant_evidence_percent_of_civic_df

### <a id='toc6_2_6_'></a>[Summary Table 9](#toc0_)

The table below shows the percent of all evidence items associated with Not Supported variants that are associated with a variant sub category.

<ins>Numerator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC evidence items that are associated with Not Supported variants

In [106]:
not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_not_supported_df

Unnamed: 0_level_0,Percent of Not Supported Variant Evidence Items
Category,Unnamed: 1_level_1
Sequence,300 / 4926 (6.09%)
Genotype/Haplotype,39 / 4926 (0.79%)
Fusion,1590 / 4926 (32.28%)
Rearrangement,593 / 4926 (12.04%)
Epigenetic Modification,23 / 4926 (0.47%)
Copy Number,77 / 4926 (1.56%)
Expression,623 / 4926 (12.65%)
Gene Function,386 / 4926 (7.84%)
Region-Defined,782 / 4926 (15.87%)
Genome Feature,25 / 4926 (0.51%)


In [107]:
civic_summary_table_9 = not_supported_variant_evidence_percent_of_not_supported_df

### <a id='toc6_2_7_'></a>[Summary Table 10](#toc0_)

The table below shows the percent of evidence items associated with Not Supported variant sub categories that are accepted or submitted.

<ins>Numerator:</ins> # of CIViC evidence items based on evidence acceptance status that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory

In [108]:
not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_percent_evidence_df.set_index("Category")
)
not_supported_variant_evidence_percent_evidence_df

Unnamed: 0_level_0,Percent of Accepted Evidence Items,Percent of Submitted Evidence Items
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Sequence,193 / 300 (64.33%),107 / 300 (35.67%)
Genotype/Haplotype,19 / 39 (48.72%),20 / 39 (51.28%)
Fusion,1028 / 1590 (64.65%),562 / 1590 (35.35%)
Rearrangement,238 / 593 (40.13%),355 / 593 (59.87%)
Epigenetic Modification,22 / 23 (95.65%),1 / 23 (4.35%)
Copy Number,35 / 77 (45.45%),42 / 77 (54.55%)
Expression,345 / 623 (55.38%),278 / 623 (44.62%)
Gene Function,171 / 386 (44.30%),215 / 386 (55.70%)
Region-Defined,459 / 782 (58.70%),323 / 782 (41.30%)
Genome Feature,5 / 25 (20.00%),20 / 25 (80.00%)


In [109]:
civic_summary_table_10 = not_supported_variant_evidence_percent_evidence_df

## <a id='toc6_3_'></a>[Impact](#toc0_)

The Not Supported variants with their accepted and submitted evidence items:

In [110]:
not_supported_variants_w_acc_sub_evid_df

Unnamed: 0_level_0,gene_name,variant_name,category,#_evidence_items,evidence_score_sum
variant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,Fusion,Fusion,462,1117.0
5,,Fusion,Fusion,95,167.0
17,BRAF,V600,Sequence,28,166.0
19,CCND1,Expression,Expression,2,10.0
20,CCND1,Overexpression,Expression,10,40.0
...,...,...,...,...,...
5178,CD44,CD44v10,Other,1,5.0
5179,,Fusion,Fusion,1,5.0
5180,BAX,mutation,Region-Defined,1,1.0
5187,,Fusion,Fusion,3,15.0


In [111]:
not_supported_elevel_impact_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "CIVIC Total Sum Impact Score": [],
    "Average Impact Score per Variant": [],
    "Average Impact Score per Evidence Item": [],
    "Total Number Evidence Items": [
        v["number_unique_not_supported_category_evidence"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "% Accepted Evidence Items": [
        v["percentage_accepted_evidence_not_supported_category_variants"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "Total Number Variants": [
        v["number_unique_not_supported_category_variants"]
        for v in not_supported_variant_categories_summary_data.values()
    ],
}

In [112]:
not_supported_variant_categories_impact_data = dict()
for category, variant_summary, evidence_summary in zip(
    NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    not_supported_variant_categories_summary_data.values(),
    not_supported_variant_categories_evidence_summary_data.values(),
):
    not_supported_variant_categories_impact_data[category] = {}
    impact_category_df = not_supported_variants_w_acc_sub_evid_df[
        not_supported_variants_w_acc_sub_evid_df.category == category
    ]

    total_sum_not_supported_category_impact = impact_category_df[
        "evidence_score_sum"
    ].sum()
    not_supported_variant_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

    # Look up the counts for the current category so that the averages are not
    # computed from values left over from an earlier loop. The summary dicts are
    # read in the same order as NOT_SUPPORTED_VARIANT_CATEGORY_VALUES, as in the
    # summary built in the previous cell.
    number_unique_not_supported_category_variants = variant_summary[
        "number_unique_not_supported_category_variants"
    ]
    number_unique_not_supported_category_evidence = evidence_summary[
        "number_unique_not_supported_category_evidence"
    ]

    avg_impact_score_variant = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_variants
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_variant"
    ] = avg_impact_score_variant

    avg_impact_score_evidence = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_evidence
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_evidence

    not_supported_elevel_impact_summary["CIVIC Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_elevel_impact_summary["Average Impact Score per Variant"].append(
        avg_impact_score_variant
    )
    not_supported_elevel_impact_summary[
        "Average Impact Score per Evidence Item"
    ].append(avg_impact_score_evidence)

    print(f"{category}: {total_sum_not_supported_category_impact}")

Sequence: 1241.0
Genotype/Haplotype: 232.0
Fusion: 5552.0
Rearrangement: 2373.5
Epigenetic Modification: 92.0
Copy Number: 225.0
Expression: 2117.0
Gene Function: 1362.5
Region-Defined: 3398.5
Genome Feature: 120.0
Other: 506.5
Transcript: 1305.0


In [113]:
not_supported_variant_impact_df = pd.DataFrame(not_supported_elevel_impact_summary)

In [114]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

Unnamed: 0,Category,CIVIC Total Sum Impact Score,Average Impact Score per Variant,Average Impact Score per Evidence Item,Total Number Evidence Items,% Accepted Evidence Items,Total Number Variants
0,Sequence,1241.0,9.33,4.14,300,64.33%,133
1,Genotype/Haplotype,232.0,10.55,5.95,39,48.72%,22
2,Fusion,5552.0,17.74,3.49,1590,64.65%,313
3,Rearrangement,2373.5,19.45,4.0,593,40.13%,122
4,Epigenetic Modification,92.0,6.57,4.0,23,95.65%,14
5,Copy Number,225.0,7.03,2.92,77,45.45%,32
6,Expression,2117.0,7.2,3.4,623,55.38%,294
7,Gene Function,1362.5,12.27,3.53,386,44.30%,111
8,Region-Defined,3398.5,13.33,4.35,782,58.70%,255
9,Genome Feature,120.0,12.0,4.8,25,20.00%,10


The bar graph below shows the total impact score for each Not Supported variant sub category. Additionally, the bar colors illustrate the number of evidence items associated with each sub category.

In [115]:
fig = px.bar(
    not_supported_variant_impact_df,
    x="Category",
    y="CIVIC Total Sum Impact Score",
    hover_data=[
        "Total Number Evidence Items",
        "% Accepted Evidence Items",
    ],
    color="Total Number Evidence Items",
    labels={"CIVIC Total Sum Impact Score": "CIVIC Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig.update_traces(width=1)
fig.show()

In [116]:
fig.write_html("output/civic_ns_categories_impact_redgreen.html")

The scatter plot below shows the relationship between the Not Supported variant sub category impact score and the number of evidence items associated with variants in each sub category. Additionally, the size of each data point represents the number of variants in that sub category.

In [117]:
fig2 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Evidence Items",
    y="CIVIC Total Sum Impact Score",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    color="Category",
    hover_data=["% Accepted Evidence Items"],
)
fig2.show()

In [118]:
fig2.write_html("output/civic_ns_categories_impact_scatterplot.html")

In [119]:
fig3 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Variants",
    y="Average Impact Score per Evidence Item",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    color="Category",
    hover_data=["% Accepted Evidence Items", "Average Impact Score per Variant"],
)
fig3.show()