# <a id='toc1_'></a>[CIViC Evidence Analysis](#toc0_)
This notebook contains an analysis on CIViC evidence data

**Table of contents**<a id='toc0_'></a>    
- [CIViC Evidence Analysis](#toc1_)    
    - [Create output directory](#toc1_1_1_)    
  - [Total Variants in CIViC](#toc1_2_)    
  - [Total Evidence items in CIViC](#toc1_3_)    
  - [Total Molecular Profiles in CIViC](#toc1_4_)    
- [Create analysis functions / global variables](#toc2_)    
  - [Summary dicts](#toc2_1_)    
  - [Define Analysis Functions](#toc2_2_)    
- [Analysis of Normalized Queries](#toc3_)    
  - [List of Normalized Variants ID's](#toc3_1_)    
  - [Variant analysis](#toc3_2_)    
  - [Transform df for evidence analysis](#toc3_3_)    
  - [Evidence analysis](#toc3_4_)    
  - [Impact](#toc3_5_)    
    - [Import molecular profile id](#toc3_5_1_)    
    - [Import molecular profile scores](#toc3_5_2_)    
- [Analysis of Unable to Normalize Queries](#toc4_)    
  - [List of Unable to Normalize Variant ID's](#toc4_1_)    
  - [Variant analysis](#toc4_2_)    
  - [Transform df for evidence analysis](#toc4_3_)    
  - [Evidence analysis](#toc4_4_)    
  - [Impact](#toc4_5_)    
    - [Import molecular profile id](#toc4_5_1_)    
    - [Import molecular profile scores](#toc4_5_2_)    
- [Analysis of Not Supported Variants](#toc5_)    
    - [List of Not Supported Variant ID's](#toc5_1_1_)    
  - [Variant Analysis](#toc5_2_)    
    - [Not Supported Variant Analysis by Subcategory](#toc5_2_1_)    
  - [Transform df for evidence analysis](#toc5_3_)    
  - [Evidence analysis](#toc5_4_)    
    - [Not Supported Variant Evidence Analysis by Subcategory](#toc5_4_1_)    
  - [Impact](#toc5_5_)    
    - [Via Evidence Level](#toc5_5_1_)    
      - [Analysis with only Accepted Variants](#toc5_5_1_1_)    
        - [Calculating evidence score via level](#toc5_5_1_1_1_)    
        - [Summary Table](#toc5_5_1_1_2_)    
      - [Analysis with Accepted and Submitted Variants](#toc5_5_1_2_)    
        - [Calculating evidence score via level](#toc5_5_1_2_1_)    
        - [Summary Table](#toc5_5_1_2_2_)    
    - [Via Molecular Profile Score- this was not used eventaully since MOA evidence items are only scored by level](#toc5_5_2_)    
      - [Import molecular profile id](#toc5_5_2_1_)    
      - [Import molecular profile scores](#toc5_5_2_2_)    
      - [Impact by Subcategory](#toc5_5_2_3_)    
- [Summary](#toc6_)    
  - [Variant Analysis](#toc6_1_)    
    - [Building Summary Table 1 & 2](#toc6_1_1_)    
    - [Summary Table 1](#toc6_1_2_)    
    - [Summary Table 2](#toc6_1_3_)    
    - [Building Summary Tables 3 - 5](#toc6_1_4_)    
    - [Summary Table 3](#toc6_1_5_)    
    - [Summary Table 4](#toc6_1_6_)    
    - [Summary Table 5](#toc6_1_7_)    
  - [Evidence Analysis](#toc6_2_)    
    - [Building Summary Tables 6 & 7](#toc6_2_1_)    
    - [Summary Table 6](#toc6_2_2_)    
    - [Summmary Table 7](#toc6_2_3_)    
    - [Building Summary Tables 8 - 10](#toc6_2_4_)    
    - [Summary Table 8](#toc6_2_5_)    
    - [Summary Table 9](#toc6_2_6_)    
    - [Summary Table 10](#toc6_2_7_)    
  - [Impact](#toc6_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
from pathlib import Path
from enum import Enum
import zipfile

import pandas as pd
from civicpy import civic as civicpy
import plotly.express as px

In [2]:
# Use latest cache that has been pushed to the repo
latest_cache_zip_path = sorted(Path().glob("../cache-*.pkl.zip"))[-1]
print(f"Using {latest_cache_zip_path} for civicpy cache")

with zipfile.ZipFile(latest_cache_zip_path, "r") as zip_ref:
    zip_ref.extractall("../")

civicpy.load_cache(local_cache_path=Path("../cache.pkl"), on_stale="ignore")

Using ../cache-20230803.pkl.zip for civicpy cache


True

### <a id='toc1_1_1_'></a>[Create output directory](#toc0_)

In [3]:
path = Path("civic_evidence_analysis_output")
path.mkdir(exist_ok = True)

## <a id='toc1_2_'></a>[Total Variants in CIViC](#toc0_)

In [4]:
civic_variant_ids = civicpy.get_all_variants(include_status=["accepted", "submitted"])
total_number_variants = len(civic_variant_ids)
f"Total Number of variants in CIViC: {total_number_variants}"

'Total Number of variants in CIViC: 3519'

## <a id='toc1_3_'></a>[Total Evidence items in CIViC](#toc0_)

Need to remove all rejected evidence items

In [5]:
civic_evidence_ids = civicpy.get_all_evidence(include_status=["accepted", "submitted"])

In [6]:
total_ac_sub_evidence = len(civic_evidence_ids)
f"Total Number of accepted and submitted evidence items in CIViC: {total_ac_sub_evidence}"

'Total Number of accepted and submitted evidence items in CIViC: 9920'

## <a id='toc1_4_'></a>[Total Molecular Profiles in CIViC](#toc0_)

In [7]:
civic_molprof_ids = civicpy.get_all_molecular_profiles(
    include_status=["accepted", "submitted", "rejected"]
)

In [8]:
len(civic_evidence_ids)

9920

# <a id='toc2_'></a>[Create analysis functions / global variables](#toc0_)

In [9]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    UNABLE_TO_NORMALIZE = "Unable to Normalize"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType.__members__.values()]

In [10]:
class VariantCategory(str, Enum):
    """Create enum for the kind of variants that are in CIViC."""
    EXPRESSION = "Expression Variants"
    EPIGENETIC_MODIFICATION = "Epigenetic Modification"
    FUSION = "Fusion Variants"
    SEQUENCE_VARS = "Sequence Variants"
    GENE_FUNC = "Gene Function Variants"
    REARRANGEMENTS = "Rearrangement Variants"
    COPY_NUMBER = "Copy Number Variants"
    OTHER = "Other Variants"
    GENOTYPES = "Genotype Variants"
    REGION_DEFINED_VAR = "Region Defined Variants"
    TRANSCRIPT_VAR = "Transcript Variants"  # no attempt to normalize these ones, since there is no query we could use


VARIANT_CATEGORY_VALUES = [v.value for v in VariantCategory.__members__.values()]

## <a id='toc2_1_'></a>[Summary dicts](#toc0_)

These dictionaries will be mutated and used at the end of the analysis

In [11]:
variant_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Variant Items per Category": [],
    "Fraction of all CIViC Variant Items": [],
    "Percentage of all CIViC Variant Items": [],
    "Fraction of Accepted Variant Items": [],
    "Percentage of Accepted Variant Items": [],
    "Fraction of Not Accepted Variant Items": [],
    "Percentage of Not Accepted Variant Items": [],
}
variant_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Variant Items per Category': [],
 'Fraction of all CIViC Variant Items': [],
 'Percentage of all CIViC Variant Items': [],
 'Fraction of Accepted Variant Items': [],
 'Percentage of Accepted Variant Items': [],
 'Fraction of Not Accepted Variant Items': [],
 'Percentage of Not Accepted Variant Items': []}

In [12]:
evidence_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of all CIViC Evidence Items": [],
    "Percentage of all CIViC Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percentage of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percentage of Submitted Evidence Items": [],
}
evidence_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Evidence Items per Category': [],
 'Fraction of all CIViC Evidence Items': [],
 'Percentage of all CIViC Evidence Items': [],
 'Fraction of Accepted Evidence Items': [],
 'Percentage of Accepted Evidence Items': [],
 'Fraction of Submitted Evidence Items': [],
 'Percentage of Submitted Evidence Items': []}

## <a id='toc2_2_'></a>[Define Analysis Functions](#toc0_)

In [13]:
def variant_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do variant analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe with variant ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["variant_id"])
    variant_ids = list(df["variant_id"])

    # Count
    num_variants = len(variant_ids)
    fraction_variants = f"{num_variants} / {total_number_variants}"
    print(
        f"\nNumber of {variant_norm_type.value} Variants in CIViC: {fraction_variants}"
    )

    # Percentage
    percentage_variants = f"{num_variants / total_number_variants * 100:.2f}%"
    print(
        f"Percentage of {variant_norm_type.value} Variants in CIViC: {percentage_variants}"
    )

    # Get accepted counts
    num_accepted_variants = df.variant_accepted.sum()
    fraction_accepted_variants = f"{num_accepted_variants} / {num_variants}"
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variants: {fraction_accepted_variants}"
    )

    # Get accepted percentage
    percentage_accepted_variants = f"{num_accepted_variants / num_variants * 100:.2f}%"
    print(
        f"Percentage of accepted {variant_norm_type.value} Variants: {percentage_accepted_variants}"
    )

    # Get not accepted counts
    num_not_accepted_variants = len(df) - num_accepted_variants
    fraction_not_accepted_variants = f"{num_not_accepted_variants} / {num_variants}"
    print(
        f"\nNumber of not accepted {variant_norm_type.value} Variants: {fraction_not_accepted_variants}"
    )

    # Get not accepted percentage
    percentage_not_accepted_variants = (
        f"{num_not_accepted_variants / num_variants * 100:.2f}%"
    )
    print(
        f"Percentage of not accepted {variant_norm_type.value} Variants: {percentage_not_accepted_variants}"
    )

    variant_analysis_summary["Count of CIViC Variant Items per Category"].append(
        num_variants
    )
    variant_analysis_summary["Fraction of all CIViC Variant Items"].append(
        fraction_variants
    )
    variant_analysis_summary["Percentage of all CIViC Variant Items"].append(
        percentage_variants
    )
    variant_analysis_summary["Fraction of Accepted Variant Items"].append(
        fraction_accepted_variants
    )
    variant_analysis_summary["Percentage of Accepted Variant Items"].append(
        percentage_accepted_variants
    )
    variant_analysis_summary["Fraction of Not Accepted Variant Items"].append(
        fraction_not_accepted_variants
    )
    variant_analysis_summary["Percentage of Not Accepted Variant Items"].append(
        percentage_not_accepted_variants
    )

    return df

In [14]:
def transform_df_evidence_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence ID information
    """
    tmp_df = df.copy(deep=True)

    _variants_evidence_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        _variant_evidence_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    for e in mp.evidence_items:
                        if e.id not in _variant_evidence_ids:
                            _variant_evidence_ids.append(e.id)

        _variants_evidence_ids.append(_variant_evidence_ids or "")

    tmp_df["evidence_ids"] = _variants_evidence_ids

    # Explode and rename evidence ids field
    tmp_df = tmp_df.explode(column="evidence_ids")
    tmp_df = tmp_df.rename(columns={"evidence_ids": "evidence_id"})

    return tmp_df

In [15]:
def transform_df_evidence(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence status, rating, and level

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence status, rating, and level information.
        Rejected evidence items will be dropped.
    """
    variants_evidence_ids = list(df["evidence_id"])

    # Add evidence status, rating, and level information
    _variants_evidence_statuses = []
    _variants_evidence_ratings = []
    _variants_evidence_levels = []

    for eid in variants_evidence_ids:
        _variant_evidence_statuses = []
        _variant_evidence_ratings = []
        _variant_evidence_levels = []

        for evidence in civic_evidence_ids:
            if eid and (int(eid) == evidence.id):
                if evidence.status not in _variant_evidence_statuses:
                    _variant_evidence_statuses.append(evidence.status)

                if evidence.rating not in _variant_evidence_ratings:
                    _variant_evidence_ratings.append(evidence.rating)

                if evidence.evidence_level not in _variant_evidence_levels:
                    _variant_evidence_levels.append(evidence.evidence_level)

        _variants_evidence_statuses.append(_variant_evidence_statuses or "")
        _variants_evidence_ratings.append(_variant_evidence_ratings or "")
        _variants_evidence_levels.append(_variant_evidence_levels or "")

    df["evidence_status"] = _variants_evidence_statuses
    df["evidence_status"] = df["evidence_status"].str.join(", ")
    df["evidence_rating"] = _variants_evidence_ratings
    df["evidence_level"] = _variants_evidence_levels

    # Drop rejected evidence items
    df = df.drop(df[df.evidence_status == "rejected"].index)

    return df

In [16]:
def evidence_analysis1( #for Not Supported Variant analysis since it has sub categories and 
                        #evidence item duplicates should be dropped within the sub categories, 
                        # not accross all Not Supported Variant evidence items
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do evidence analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe with evidence ID duplicates dropped
    """
    # Count
    num_variant_unique_evidence_items = len(set(df.evidence_id))
    fraction_evidence_items = (
        f"{num_variant_unique_evidence_items} / {total_ac_sub_evidence}"
    )
    print(
        f"Number of {variant_norm_type.value} Variant Evidence items in CIViC: {fraction_evidence_items}"
    )

    # Percentage
    percentage_evidence_items = (
        f"{num_variant_unique_evidence_items / total_ac_sub_evidence * 100:.2f}%"
    )
    print(
        f"Percentage of {variant_norm_type.value} Variant Evidence items in CIViC: {percentage_evidence_items}"
    )

    # Add evidence accepted column
    df["evidence_accepted"] = df.evidence_status.map(
        {"accepted": True, "submitted": False}
    )

    # Drop evidence id duplicates
    df = df.drop_duplicates(subset=["evidence_id", "category"])

    # Get accepted counts
    num_accepted_evidences_variants = df.evidence_accepted.sum()
    fraction_accepted_evidences_variants = (
        f"{num_accepted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variant Evidence items: {fraction_accepted_evidences_variants}"
    )

    # Get accepted percentage
    percentage_accepted_evidences_variants = f"{num_accepted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percentage of accepted {variant_norm_type.value} Variant Evidence items: {percentage_accepted_evidences_variants}"
    )

    # Get submitted counts
    number_submitted_evidences_variants = len(df) - num_accepted_evidences_variants
    fraction_submitted_evidences_variants = (
        f"{number_submitted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variant Evidence items: {fraction_submitted_evidences_variants}"
    )

    # Get submitted percentage
    percentage_submitted_evidences_variants = f"{number_submitted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percentage of not submitted {variant_norm_type.value} Variant Evidence items: {percentage_submitted_evidences_variants}"
    )

    evidence_analysis_summary["Count of CIViC Evidence Items per Category"].append(
        num_variant_unique_evidence_items
    )
    evidence_analysis_summary["Fraction of all CIViC Evidence Items"].append(
        fraction_evidence_items
    )
    evidence_analysis_summary["Percentage of all CIViC Evidence Items"].append(
        percentage_evidence_items
    )
    evidence_analysis_summary["Fraction of Accepted Evidence Items"].append(
        fraction_accepted_evidences_variants
    )
    evidence_analysis_summary["Percentage of Accepted Evidence Items"].append(
        percentage_accepted_evidences_variants
    )
    evidence_analysis_summary["Fraction of Submitted Evidence Items"].append(
        fraction_submitted_evidences_variants
    )
    evidence_analysis_summary["Percentage of Submitted Evidence Items"].append(
        percentage_submitted_evidences_variants
    )

    return df

In [17]:
def evidence_analysis2(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do evidence analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe with evidence ID duplicates dropped
    """
    # Count
    num_variant_unique_evidence_items = len(set(df.evidence_id))
    fraction_evidence_items = (
        f"{num_variant_unique_evidence_items} / {total_ac_sub_evidence}"
    )
    print(
        f"Number of {variant_norm_type.value} Variant Evidence items in CIViC: {fraction_evidence_items}"
    )

    # Percentage
    percentage_evidence_items = (
        f"{num_variant_unique_evidence_items / total_ac_sub_evidence * 100:.2f}%"
    )
    print(
        f"Percentage of {variant_norm_type.value} Variant Evidence items in CIViC: {percentage_evidence_items}"
    )

    # Add evidence accepted column
    df["evidence_accepted"] = df.evidence_status.map(
        {"accepted": True, "submitted": False}
    )

    # Drop evidence id duplicates
    df = df.drop_duplicates(subset=["evidence_id"])

    # Get accepted counts
    num_accepted_evidences_variants = df.evidence_accepted.sum()
    fraction_accepted_evidences_variants = (
        f"{num_accepted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variant Evidence items: {fraction_accepted_evidences_variants}"
    )

    # Get accepted percentage
    percentage_accepted_evidences_variants = f"{num_accepted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percentage of accepted {variant_norm_type.value} Variant Evidence items: {percentage_accepted_evidences_variants}"
    )

    # Get submitted counts
    number_submitted_evidences_variants = len(df) - num_accepted_evidences_variants
    fraction_submitted_evidences_variants = (
        f"{number_submitted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variant Evidence items: {fraction_submitted_evidences_variants}"
    )

    # Get submitted percentage
    percentage_submitted_evidences_variants = f"{number_submitted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percentage of not submitted {variant_norm_type.value} Variant Evidence items: {percentage_submitted_evidences_variants}"
    )

    evidence_analysis_summary["Count of CIViC Evidence Items per Category"].append(
        num_variant_unique_evidence_items
    )
    evidence_analysis_summary["Fraction of all CIViC Evidence Items"].append(
        fraction_evidence_items
    )
    evidence_analysis_summary["Percentage of all CIViC Evidence Items"].append(
        percentage_evidence_items
    )
    evidence_analysis_summary["Fraction of Accepted Evidence Items"].append(
        fraction_accepted_evidences_variants
    )
    evidence_analysis_summary["Percentage of Accepted Evidence Items"].append(
        percentage_accepted_evidences_variants
    )
    evidence_analysis_summary["Fraction of Submitted Evidence Items"].append(
        fraction_submitted_evidences_variants
    )
    evidence_analysis_summary["Percentage of Submitted Evidence Items"].append(
        percentage_submitted_evidences_variants
    )

    return df

In [18]:
def transform_df_mp_id(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile ID information
    """
    tmp_df = df.copy(deep=True)

    variants_molprof_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        variant_molprof_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    if mp.id not in variant_molprof_ids:
                        variant_molprof_ids.append(mp.id)

        variants_molprof_ids.append(variant_molprof_ids or "")

    tmp_df["molecular_profile_id"] = variants_molprof_ids
    return tmp_df

In [19]:
def transform_df_mp_score(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score information
    """
    variants_molprof_scores = []
    normalized_variant_molprof_ids = list(df["molecular_profile_id"])

    for mp_ids in normalized_variant_molprof_ids:
        variant_molprof_scores = []
        for mp_id in mp_ids:
            for molprof in civic_molprof_ids:
                if int(mp_id) == molprof.id:
                    if molprof.molecular_profile_score not in variant_molprof_scores:
                        variant_molprof_scores.append(molprof.molecular_profile_score)

        variants_molprof_scores.append(variant_molprof_scores or "")

    df["molecular_profile_score"] = variants_molprof_scores
    return df

In [20]:
def transform_df_mp_score_sum(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score sum information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score sum information
    """
    df["molecular_profile_score_sum"] = df["molecular_profile_score"].apply(
        lambda x: sum(x)
    )
    return df

# <a id='toc3_'></a>[Analysis of Normalized Queries](#toc0_)

## <a id='toc3_1_'></a>[List of Normalized Variants ID's](#toc0_)

In [21]:
normalized_queries_df = pd.read_csv("../variation_analysis/able_to_normalize_queries.csv", sep="\t")
normalized_queries_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.AmLtooLEvgdnEHD5YVWk6u1e2XBe7FiP,normalize
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.KIz00usFWEmJHNyqmVL61obfgfRPgOIa,normalize


## <a id='toc3_2_'></a>[Variant analysis](#toc0_)

In [22]:
normalized_queries_df = variant_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df.head()


Number of Normalized Variants in CIViC: 1876 / 3519
Percentage of Normalized Variants in CIViC: 53.31%

Number of accepted Normalized Variants: 869 / 1876
Percentage of accepted Normalized Variants: 46.32%

Number of not accepted Normalized Variants: 1007 / 1876
Percentage of not accepted Normalized Variants: 53.68%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.AmLtooLEvgdnEHD5YVWk6u1e2XBe7FiP,normalize
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.KIz00usFWEmJHNyqmVL61obfgfRPgOIa,normalize


In [23]:
variant_analysis_summary

{'Variant Category': ['Normalized', 'Unable to Normalize', 'Not Supported'],
 'Count of CIViC Variant Items per Category': [1876],
 'Fraction of all CIViC Variant Items': ['1876 / 3519'],
 'Percentage of all CIViC Variant Items': ['53.31%'],
 'Fraction of Accepted Variant Items': ['869 / 1876'],
 'Percentage of Accepted Variant Items': ['46.32%'],
 'Fraction of Not Accepted Variant Items': ['1007 / 1876'],
 'Percentage of Not Accepted Variant Items': ['53.68%']}

## <a id='toc3_3_'></a>[Transform df for evidence analysis](#toc0_)

In [24]:
normalized_queries_add_evidence_df = transform_df_evidence_ids(normalized_queries_df)
normalized_queries_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,9347
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,6724
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,5336
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,10779
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,6723


In [25]:
normalized_queries_add_evidence_df = transform_df_evidence(
    normalized_queries_add_evidence_df
)
normalized_queries_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id,evidence_status,evidence_rating,evidence_level
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,9347,submitted,[3],[C]
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,6724,accepted,[2],[C]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,5336,accepted,[2],[C]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,10779,submitted,[3],[C]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,6723,accepted,[2],[C]


## <a id='toc3_4_'></a>[Evidence analysis](#toc0_)

In [26]:
normalized_queries_add_evidence_df = evidence_analysis2(
    normalized_queries_add_evidence_df, VariantNormType.NORMALIZED
)
normalized_queries_add_evidence_df.head()

Number of Normalized Variant Evidence items in CIViC: 5866 / 9920
Percentage of Normalized Variant Evidence items in CIViC: 59.13%

Number of accepted Normalized Variant Evidence items: 2080 / 5866
Percentage of accepted Normalized Variant Evidence items: 35.46%

Number of submitted Normalized Variant Evidence items: 3786 / 5866
Percentage of not submitted Normalized Variant Evidence items: 64.54%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,9347,submitted,[3],[C],False
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,6724,accepted,[2],[C],True
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,5336,accepted,[2],[C],True
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,10779,submitted,[3],[C],False
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,6723,accepted,[2],[C],True


## <a id='toc3_5_'></a>[Impact](#toc0_)
molecular profile score

### <a id='toc3_5_1_'></a>[Import molecular profile id](#toc0_)

In [27]:
normalized_queries_add_molprof_df = transform_df_mp_id(normalized_queries_df)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,[2362]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,[1864]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,[2361]
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.AmLtooLEvgdnEHD5YVWk6u1e2XBe7FiP,normalize,[1862]
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.KIz00usFWEmJHNyqmVL61obfgfRPgOIa,normalize,[1863]


In [28]:
normalized_queries_add_molprof_df.loc[
    normalized_queries_add_molprof_df["variant_id"] == 190
]

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id
86,190,EGFR Amplification,protein,True,Transcript Amplification,ga4gh:CX.sEHT64Lm86QaTXzw39uKLkBUbEkp4h_X,normalize,"[190, 4175, 4346, 4567]"


### <a id='toc3_5_2_'></a>[Import molecular profile scores](#toc0_)

In [29]:
normalized_queries_add_molprof_df = transform_df_mp_score(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,[2362],[5.0]
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,[1864],[5.0]
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,[2361],[5.0]
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.AmLtooLEvgdnEHD5YVWk6u1e2XBe7FiP,normalize,[1862],[10.0]
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.KIz00usFWEmJHNyqmVL61obfgfRPgOIa,normalize,[1863],[5.0]


Example query below

In [30]:
normalized_queries_add_molprof_df.loc[
    normalized_queries_add_molprof_df["variant_id"] == 190
]

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score
86,190,EGFR Amplification,protein,True,Transcript Amplification,ga4gh:CX.sEHT64Lm86QaTXzw39uKLkBUbEkp4h_X,normalize,"[190, 4175, 4346, 4567]","[173.0, 5.0, 0.0]"


In [31]:
normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,True,Stop Lost,ga4gh:VA.A34ZoIhq4xBuQbcE3bkj29n6diS6RzLB,normalize,[2362],[5.0],5.0
1,1988,NC_000003.11:g.10191649A>T,genomic,True,Stop Lost,ga4gh:VA.JcEpDvhUtgDWU4A-bxqLUuczBNb8QqRf,normalize,[1864],[5.0],5.0
2,2488,3-10191647-T-G,genomic,True,Stop Lost,ga4gh:VA.7nGd8dgHbqtxMHk_rLxrB6_IMAzJ8XnH,normalize,[2361],[5.0],5.0
3,1986,NC_000003.11:g.10191648G>T,genomic,True,Stop Lost,ga4gh:VA.AmLtooLEvgdnEHD5YVWk6u1e2XBe7FiP,normalize,[1862],[10.0],10.0
4,1987,NC_000003.11:g.10191649A>G,genomic,True,Stop Lost,ga4gh:VA.KIz00usFWEmJHNyqmVL61obfgfRPgOIa,normalize,[1863],[5.0],5.0


Example query below

In [32]:
normalized_queries_add_molprof_df.loc[
    normalized_queries_add_molprof_df["variant_id"] == 190
]

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,vrs_id,succeeded_endpoint,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
86,190,EGFR Amplification,protein,True,Transcript Amplification,ga4gh:CX.sEHT64Lm86QaTXzw39uKLkBUbEkp4h_X,normalize,"[190, 4175, 4346, 4567]","[173.0, 5.0, 0.0]",178.0


# <a id='toc4_'></a>[Analysis of Unable to Normalize Queries](#toc0_)

## <a id='toc4_1_'></a>[List of Unable to Normalize Variant ID's](#toc0_)

In [33]:
not_normalized_queries_df = pd.read_csv("../variation_analysis/unable_to_normalize_queries.csv", sep="\t")
not_normalized_queries_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L']
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V']
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T']
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t..."
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T']


## <a id='toc4_2_'></a>[Variant analysis](#toc0_)

In [34]:
not_normalized_queries_df = variant_analysis(
    not_normalized_queries_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_queries_df.head()


Number of Unable to Normalize Variants in CIViC: 80 / 3519
Percentage of Unable to Normalize Variants in CIViC: 2.27%

Number of accepted Unable to Normalize Variants: 11 / 80
Percentage of accepted Unable to Normalize Variants: 13.75%

Number of not accepted Unable to Normalize Variants: 69 / 80
Percentage of not accepted Unable to Normalize Variants: 86.25%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L']
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V']
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T']
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t..."
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T']


## <a id='toc4_3_'></a>[Transform df for evidence analysis](#toc0_)

In [35]:
not_normalized_quer_add_evidence_df = transform_df_evidence_ids(
    not_normalized_queries_df
)
not_normalized_quer_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323


In [36]:
not_normalized_quer_add_evidence_df = transform_df_evidence(
    not_normalized_quer_add_evidence_df
)
not_normalized_quer_add_evidence_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id,evidence_status,evidence_rating,evidence_level
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812,accepted,[1],[C]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128,submitted,[3],[D]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135,submitted,[3],[D]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494,submitted,[4],[D]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323,submitted,[3],[B]


## <a id='toc4_4_'></a>[Evidence analysis](#toc0_)

In [37]:
not_normalized_quer_add_evidence_df = evidence_analysis2(
    not_normalized_quer_add_evidence_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_quer_add_evidence_df.head()

Number of Unable to Normalize Variant Evidence items in CIViC: 127 / 9920
Percentage of Unable to Normalize Variant Evidence items in CIViC: 1.28%

Number of accepted Unable to Normalize Variant Evidence items: 17 / 127
Percentage of accepted Unable to Normalize Variant Evidence items: 13.39%

Number of submitted Unable to Normalize Variant Evidence items: 110 / 127
Percentage of not submitted Unable to Normalize Variant Evidence items: 86.61%


Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],1812,accepted,[1],[C],True
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],10128,submitted,[3],[D],False
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],10135,submitted,[3],[D],False
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11494,submitted,[4],[D],False
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...",11323,submitted,[3],[B],False


## <a id='toc4_5_'></a>[Impact](#toc0_)
molecular profile score

### <a id='toc4_5_1_'></a>[Import molecular profile id](#toc0_)

In [38]:
not_normalized_queries_add_molprof_df = transform_df_mp_id(not_normalized_queries_df)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]"
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244]


### <a id='toc4_5_2_'></a>[Import molecular profile scores](#toc0_)

In [39]:
not_normalized_queries_add_molprof_df = transform_df_mp_score(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id,molecular_profile_score
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729],[2.5]
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586],[0.0]
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593],[0.0]
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]",[0.0]
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244],[40.0]


In [40]:
not_normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

Unnamed: 0,variant_id,query,query_type,variant_accepted,civic_variant_types,exception_raised,message,warnings,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,748,MLH1 *757L,protein,True,Stop Lost,False,unable to normalize,['Unable to tokenize: *757L'],[729],[2.5],2.5
1,3718,AR A748V,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A748V'],[3586],[0.0],0.0
2,3725,AR A765T,protein,False,Not provided,False,unable to normalize,['Unable to translate AR A765T'],[3593],[0.0],0.0
3,4485,ERBB2 A775_G776ins YVMA,protein,False,Not provided,False,unable to normalize,"['Unable to tokenize: A775_G776ins', 'Unable t...","[4463, 4472]",[0.0],0.0
4,248,TERT C228T,protein,True,Regulatory Region Variant,False,unable to normalize,['Unable to translate TERT C228T'],[244],[40.0],40.0


# <a id='toc5_'></a>[Analysis of Not Supported Variants](#toc0_)

### <a id='toc5_1_1_'></a>[List of Not Supported Variant ID's](#toc0_)

In [41]:
not_supported_queries_df = pd.read_csv("../variation_analysis/not_supported_variants.csv", sep="\t")
not_supported_queries_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript Variants,False
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False
2,4214,VHL,,Not provided,Transcript Variants,False
3,4216,VHL,,Not provided,Transcript Variants,False
4,4278,VHL,,Not provided,Transcript Variants,False


## <a id='toc5_2_'></a>[Variant Analysis](#toc0_)

In [42]:
not_supported_queries_df = variant_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)
not_supported_queries_df.head()


Number of Not Supported Variants in CIViC: 1563 / 3519
Percentage of Not Supported Variants in CIViC: 44.42%

Number of accepted Not Supported Variants: 790 / 1563
Percentage of accepted Not Supported Variants: 50.54%

Number of not accepted Not Supported Variants: 773 / 1563
Percentage of not accepted Not Supported Variants: 49.46%


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript Variants,False
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False
2,4214,VHL,,Not provided,Transcript Variants,False
3,4216,VHL,,Not provided,Transcript Variants,False
4,4278,VHL,,Not provided,Transcript Variants,False


In [43]:
not_supported_queries_df["variant_accepted"].value_counts()

variant_accepted
True     790
False    773
Name: count, dtype: int64

### <a id='toc5_2_1_'></a>[Not Supported Variant Analysis by Subcategory](#toc0_)

In [44]:
not_supported_variant_analysis_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of CIViC Variant Items per Category": [],
    "Fraction of Not Supported Variant Items": [],
    "Percent of Not Supported Variant Items": [],
    "Fraction of all CIViC Variant Items": [],
    "Percent of all CIViC Variant Items": [],
    "Fraction of Accepted Variant Items": [],
    "Percent of Accepted Variant Items": [],
    "Fraction of Not Accepted Variant Items": [],
    "Percent of Not Accepted Variant Items": [],
}

In [45]:
not_supported_variant_categories_summary_data = dict()
total_number_unique_not_supported_variants = len(
    set(not_supported_queries_df.variant_id)
)

for category in VARIANT_CATEGORY_VALUES:  # These are not supported categories
    not_supported_variant_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_variants = len(set(category_df.variant_id))
    not_supported_variant_categories_summary_data[category][
        "number_unique_not_supported_category_variants"
    ] = number_unique_not_supported_category_variants

    # Fraction
    fraction_not_supported_category_variant_of_civic = (
        f"{number_unique_not_supported_category_variants} / {total_number_variants}"
    )
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_civic"
    ] = fraction_not_supported_category_variant_of_civic

    # Percent
    percent_not_supported_category_variant_of_civic = f"{number_unique_not_supported_category_variants / total_number_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_civic"
    ] = percent_not_supported_category_variant_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants} / {total_number_unique_not_supported_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_total_not_supported"
    ] = fraction_not_supported_category_variant_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants / total_number_unique_not_supported_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_total_not_supported"
    ] = percent_not_supported_category_variant_of_total_not_supported

    # Accepted fraction
    number_accepted_not_supported_category_variants = category_df.variant_accepted.sum()
    fraction_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_accepted_not_supported_category_variants"
    ] = fraction_accepted_not_supported_category_variants

    # Accepted percent
    percentage_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_accepted_not_supported_category_variants"
    ] = percentage_accepted_not_supported_category_variants

    # Not accepted fraction
    number_not_accepted_not_supported_category_variants = (
        len(category_df) - number_accepted_not_supported_category_variants
    )
    fraction_not_accepted_not_supported_category_variants = f" {number_not_accepted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_not_accepted_not_supported_category_variants"
    ] = fraction_not_accepted_not_supported_category_variants

    # Not accepted percent
    percentage_not_accepted_not_supported_category_variants = f"{number_not_accepted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_not_accepted_not_supported_category_variants"
    ] = percentage_not_accepted_not_supported_category_variants

    not_supported_variant_analysis_summary[
        "Count of CIViC Variant Items per Category"
    ].append(number_unique_not_supported_category_variants)
    not_supported_variant_analysis_summary[
        "Fraction of all CIViC Variant Items"
    ].append(fraction_not_supported_category_variant_of_civic)
    not_supported_variant_analysis_summary["Percent of all CIViC Variant Items"].append(
        percent_not_supported_category_variant_of_civic
    )
    not_supported_variant_analysis_summary[
        "Fraction of Not Supported Variant Items"
    ].append(fraction_not_supported_category_variant_of_total_not_supported)
    not_supported_variant_analysis_summary[
        "Percent of Not Supported Variant Items"
    ].append(percent_not_supported_category_variant_of_total_not_supported)
    not_supported_variant_analysis_summary["Fraction of Accepted Variant Items"].append(
        fraction_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Percent of Accepted Variant Items"].append(
        percentage_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary[
        "Fraction of Not Accepted Variant Items"
    ].append(fraction_not_accepted_not_supported_category_variants)
    not_supported_variant_analysis_summary[
        "Percent of Not Accepted Variant Items"
    ].append(percentage_not_accepted_not_supported_category_variants)

## <a id='toc5_3_'></a>[Transform df for evidence analysis](#toc0_)

In [46]:
not_supported_variants_add_evidence_df = transform_df_evidence_ids(
    not_supported_queries_df
)
not_supported_variants_add_evidence_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id
0,4170,VHL,,Not provided,Transcript Variants,False,10647
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,7428
2,4214,VHL,,Not provided,Transcript Variants,False,10752
3,4216,VHL,,Not provided,Transcript Variants,False,10754
4,4278,VHL,,Not provided,Transcript Variants,False,10958


There are no variants without evidence items

In [47]:
not_supported_variants_add_evidence_df.loc[
    not_supported_variants_add_evidence_df["evidence_id"] == ""
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id


In [48]:
not_supported_variants_add_evidence_df = transform_df_evidence(
    not_supported_variants_add_evidence_df
)
not_supported_variants_add_evidence_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level
0,4170,VHL,,Not provided,Transcript Variants,False,10647,submitted,[2],[C]
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,7428,submitted,[3],[C]
2,4214,VHL,,Not provided,Transcript Variants,False,10752,submitted,[3],[C]
3,4216,VHL,,Not provided,Transcript Variants,False,10754,submitted,[3],[C]
4,4278,VHL,,Not provided,Transcript Variants,False,10958,submitted,[3],[C]


## <a id='toc5_4_'></a>[Evidence analysis](#toc0_)

In [49]:
not_supported_variants_add_evidence_df = evidence_analysis1(
    not_supported_variants_add_evidence_df, VariantNormType.NOT_SUPPORTED
)
not_supported_variants_add_evidence_df.head()

Number of Not Supported Variant Evidence items in CIViC: 4243 / 9920
Percentage of Not Supported Variant Evidence items in CIViC: 42.77%

Number of accepted Not Supported Variant Evidence items: 2231 / 4243
Percentage of accepted Not Supported Variant Evidence items: 52.58%

Number of submitted Not Supported Variant Evidence items: 2048 / 4243
Percentage of not submitted Not Supported Variant Evidence items: 48.27%


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
0,4170,VHL,,Not provided,Transcript Variants,False,10647,submitted,[2],[C],False
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,7428,submitted,[3],[C],False
2,4214,VHL,,Not provided,Transcript Variants,False,10752,submitted,[3],[C],False
3,4216,VHL,,Not provided,Transcript Variants,False,10754,submitted,[3],[C],False
4,4278,VHL,,Not provided,Transcript Variants,False,10958,submitted,[3],[C],False


### <a id='toc5_4_1_'></a>[Not Supported Variant Evidence Analysis by Subcategory](#toc0_)

 List all the possible variant categories, have to use non unique file since evidence items are used more than once across groups


In [50]:
not_supported_variant_categories = (
    not_supported_variants_add_evidence_df.category.unique()
)
[v for v in not_supported_variant_categories]

['Transcript Variants',
 'Fusion Variants',
 'Rearrangement Variants',
 'Sequence Variants',
 'Region Defined Variants',
 'Other Variants',
 'Copy Number Variants',
 'Gene Function Variants',
 'Expression Variants',
 'Genotype Variants',
 'Epigenetic Modification']

Evidence items may be used across multiple variants

In [51]:
duplicate = not_supported_variants_add_evidence_df[
    not_supported_variants_add_evidence_df.duplicated("evidence_id", keep=False)
]
duplicate

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted
52,1,ABL1,BCR::ABL,Transcript Fusion,Fusion Variants,True,11341,submitted,[4],[A],False
112,4497,FGFR1,BCR::FGFR1,Not provided,Fusion Variants,False,11324,submitted,[3],[B],False
176,437,FLT3,D835,Protein Altering Variant,Sequence Variants,True,11260,submitted,[4],[A],False
176,437,FLT3,D835,Protein Altering Variant,Sequence Variants,True,11261,submitted,[4],[A],False
198,200,IKZF1,Deletion,Transcript Ablation,Gene Function Variants,True,7786,submitted,[5],[B],False
...,...,...,...,...,...,...,...,...,...,...,...
1322,4389,ALK,T1151dup,Inframe Insertion,Copy Number Variants,True,4608,submitted,[2],[D],False
1344,2371,ABL1,TKD MUTATION,Not provided,Gene Function Variants,True,11339,submitted,[4],[B],False
1367,4500,FGFR1,Translocation,Not provided,Rearrangement Variants,False,11324,submitted,[3],[B],False
1480,4528,ZFTA,ZFTA::RELA,Not provided,Fusion Variants,True,11446,submitted,[3],[B],False


In [52]:
not_supported_variant_evidence_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of CIViC Evidence Items": [],
    "Percent of all CIViC Evidence Items": [],
    "Fraction of Not Supported Variant Evidence Items": [],
    "Percent of Not Supported Variant Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percent of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percent of Submitted Evidence Items": [],
}

In [53]:
not_supported_variant_categories_evidence_summary_data = dict()
total_number_not_supported_variant_unique_evidence_items = len(
    set(not_supported_variants_add_evidence_df.evidence_id)
)

for category in VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_evidence_summary_data[category] = {}
    evidence_category_df = not_supported_variants_add_evidence_df[
        not_supported_variants_add_evidence_df.category == category
    ]
    evidence_category_df = evidence_category_df.drop_duplicates(subset=["evidence_id", "category"])

    # Count
    number_unique_not_supported_category_evidence = len(
        set(evidence_category_df.evidence_id)
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "number_unique_not_supported_category_evidence"
    ] = number_unique_not_supported_category_evidence

    # Fraction
    fraction_not_supported_category_variant_evidence_of_civic = (
        f"{number_unique_not_supported_category_evidence} / {total_ac_sub_evidence}"
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_civic"
    ] = fraction_not_supported_category_variant_evidence_of_civic

    # Percent
    percent_not_supported_category_variant_evidence_of_civic = f"{number_unique_not_supported_category_evidence / total_ac_sub_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_civic"
    ] = percent_not_supported_category_variant_evidence_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence} / {total_number_not_supported_variant_unique_evidence_items}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_total_not_supported"
    ] = fraction_not_supported_category_variant_evidence_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence / total_number_not_supported_variant_unique_evidence_items * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_total_not_supported"
    ] = percent_not_supported_category_variant_evidence_of_total_not_supported

    # Accepted fraction
    number_accepted_not_supported_category_variant_evidence = (
        evidence_category_df.evidence_accepted.sum()
    )
    fraction_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_accepted_evidence_not_supported_category_variants"
    ] = fraction_accepted_evidence_not_supported_category_variants

    # Accepted percent
    percentage_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_accepted_evidence_not_supported_category_variants"
    ] = percentage_accepted_evidence_not_supported_category_variants

    # Submitted fraction
    number_submitted_not_supported_category_variant_evidence = (
        number_unique_not_supported_category_evidence
        - evidence_category_df.evidence_accepted.sum()
    )
    fraction_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_submitted_evidence_not_supported_category_variants"
    ] = fraction_submitted_evidence_not_supported_category_variants

    # Submitted percent
    percentage_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_submitted_evidence_not_supported_category_variants"
    ] = percentage_submitted_evidence_not_supported_category_variants

    not_supported_variant_evidence_summary[
        "Count of CIViC Evidence Items per Category"
    ].append(number_unique_not_supported_category_evidence)
    not_supported_variant_evidence_summary["Fraction of CIViC Evidence Items"].append(
        fraction_not_supported_category_variant_evidence_of_civic
    )
    not_supported_variant_evidence_summary[
        "Percent of all CIViC Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_civic)
    not_supported_variant_evidence_summary[
        "Fraction of Not Supported Variant Evidence Items"
    ].append(fraction_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Percent of Not Supported Variant Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Fraction of Accepted Evidence Items"
    ].append(fraction_accepted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary["Percent of Accepted Evidence Items"].append(
        percentage_accepted_evidence_not_supported_category_variants
    )
    not_supported_variant_evidence_summary[
        "Fraction of Submitted Evidence Items"
    ].append(fraction_submitted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary[
        "Percent of Submitted Evidence Items"
    ].append(percentage_submitted_evidence_not_supported_category_variants)

## <a id='toc5_5_'></a>[Impact](#toc0_)

### <a id='toc5_5_1_'></a>[Via Evidence Level](#toc0_)

#### <a id='toc5_5_1_1_'></a>[Analysis with only Accepted Variants](#toc0_)

accepted variant = a variant with at least one evidence item that is accepted

In [54]:
ns_var_w_evid_df = not_supported_variants_add_evidence_df.copy()

There are no variants without an evidence status

In [55]:
df_na = ns_var_w_evid_df[(ns_var_w_evid_df[("evidence_accepted")]!=False) & ns_var_w_evid_df["evidence_accepted"].isna()].copy()
df_na

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted


Selecting only variants with at least one accepted evidence item (Accepted Variants)

In [56]:
ns_var_w_acc_evid_df = ns_var_w_evid_df[(ns_var_w_evid_df[("evidence_accepted")]!=False) & ns_var_w_evid_df["evidence_accepted"].notna()].copy()

##### <a id='toc5_5_1_1_1_'></a>[Calculating evidence score via level](#toc0_)

In [57]:
ns_var_w_acc_evid_df["evidence_score"] = ''
ns_var_w_acc_evid_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted,evidence_score
7,2930,VHL,,Not provided,Transcript Variants,True,7892,accepted,[3],[C],True,
9,785,CHEK2,1100DELC,Frameshift Truncation,Sequence Variants,True,1850,accepted,[3],[B],True,
12,823,EPCAM,3' Exon Deletion,Disruptive Inframe Deletion,Rearrangement Variants,True,1901,accepted,[4],[B],True,
13,433,HIF1A,3' UTR Polymorphism,3 Prime UTR Variant;Snp,Region Defined Variants,True,1031,accepted,[2],[B],True,
15,2367,VHL,3p26.3-25.3 11Mb del,Not provided,Rearrangement Variants,True,6287,accepted,[3],[C],True,
...,...,...,...,...,...,...,...,...,...,...,...,...
1543,272,CDKN2A,p16 Expression,,Expression Variants,True,1314,accepted,[2],[B],True,
1545,3313,CDKN1A,rs1059234,Not provided,Other Variants,True,9244,accepted,[3],[B],True,
1547,256,KIT,rs17084733,3 Prime UTR Variant,Other Variants,True,666,accepted,[3],[B],True,
1548,2671,CDKN1A,rs1801270,Not provided,Other Variants,True,7227,accepted,[3],[B],True,


In [58]:
evidence_level_to_impact = {"A": 10, "B": 5, "C": 3, "D": 1, "E": 0.5}

In [59]:
ns_var_w_acc_evid_df["evidence_level"] = ns_var_w_acc_evid_df["evidence_level"].apply(lambda x: x[0])

In [60]:
ns_var_w_acc_evid_df["evidence_score"] = ns_var_w_acc_evid_df["evidence_level"].map(evidence_level_to_impact)
ns_var_w_acc_evid_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted,evidence_score
7,2930,VHL,,Not provided,Transcript Variants,True,7892,accepted,[3],C,True,3.0
9,785,CHEK2,1100DELC,Frameshift Truncation,Sequence Variants,True,1850,accepted,[3],B,True,5.0
12,823,EPCAM,3' Exon Deletion,Disruptive Inframe Deletion,Rearrangement Variants,True,1901,accepted,[4],B,True,5.0
13,433,HIF1A,3' UTR Polymorphism,3 Prime UTR Variant;Snp,Region Defined Variants,True,1031,accepted,[2],B,True,5.0
15,2367,VHL,3p26.3-25.3 11Mb del,Not provided,Rearrangement Variants,True,6287,accepted,[3],C,True,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1543,272,CDKN2A,p16 Expression,,Expression Variants,True,1314,accepted,[2],B,True,5.0
1545,3313,CDKN1A,rs1059234,Not provided,Other Variants,True,9244,accepted,[3],B,True,5.0
1547,256,KIT,rs17084733,3 Prime UTR Variant,Other Variants,True,666,accepted,[3],B,True,5.0
1548,2671,CDKN1A,rs1801270,Not provided,Other Variants,True,7227,accepted,[3],B,True,5.0


Each variant recieves an evidence score by adding up the numerical value of levels of the evidence items associated with the variant

In [61]:
ns_var_w_acc_evid_df.sort_values(by=["variant_id"])
not_supported_variants_w_acc_evid_df = ns_var_w_acc_evid_df.groupby("variant_id").aggregate(
    {
        "gene_name": "first",
        "variant_name": "first",
        "category": "first",
        "evidence_id": "count",
        "evidence_score": "sum",
    }
)
not_supported_variants_w_acc_evid_df = not_supported_variants_w_acc_evid_df.rename(
    columns={"evidence_id": "#_evidence_items", "evidence_score": "evidence_score_sum"}
)
not_supported_variants_w_acc_evid_df

Unnamed: 0_level_0,gene_name,variant_name,category,#_evidence_items,evidence_score_sum
variant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,ABL1,BCR::ABL,Fusion Variants,129,320.0
5,ALK,EML4::ALK,Fusion Variants,47,91.0
17,BRAF,V600,Sequence Variants,22,123.0
19,CCND1,Expression,Expression Variants,2,10.0
20,CCND1,Overexpression,Expression Variants,8,36.0
...,...,...,...,...,...
4532,ZFTA,ZFTA::NCOA1,Fusion Variants,1,3.0
4533,ZFTA,ZFTA::NCOA2,Fusion Variants,1,3.0
4534,ZFTA,ZFTA::MAML2,Fusion Variants,1,3.0
4535,ZFTA,MN1::ZFTA,Fusion Variants,1,3.0


##### <a id='toc5_5_1_1_2_'></a>[Summary Table](#toc0_)

In [62]:
not_supported_accepted_variant_categories_df = not_supported_variants_w_acc_evid_df.groupby("category").aggregate(
    {"gene_name": "count", "#_evidence_items": "sum", "evidence_score_sum": "sum"}
)
not_supported_accepted_variant_categories_df = not_supported_accepted_variant_categories_df.rename(
    columns={"evidence_score_sum": "impact", "gene_name": "number_of_variants"}
)
not_supported_accepted_variant_categories_df["average_impact_per_variant"] = (not_supported_accepted_variant_categories_df["impact"] / not_supported_accepted_variant_categories_df["number_of_variants"]).round(2)
not_supported_accepted_variant_categories_df.sort_values(by=["impact"], ascending=False)

Unnamed: 0_level_0,number_of_variants,#_evidence_items,impact,average_impact_per_variant
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fusion Variants,204,747,2304.5,11.3
Region Defined Variants,99,408,1853.0,18.72
Expression Variants,180,342,1235.0,6.86
Rearrangement Variants,47,202,892.5,18.99
Sequence Variants,73,196,856.5,11.73
Gene Function Variants,49,154,540.5,11.03
Other Variants,42,58,226.0,5.38
Transcript Variants,51,54,156.0,3.06
Copy Number Variants,19,31,89.0,4.68
Genotype Variants,12,17,86.0,7.17


In [63]:
not_supported_accepted_variant_categories_df.sum().round(2)

number_of_variants             790.00
#_evidence_items              2231.00
impact                        8321.00
average_impact_per_variant     104.78
dtype: float64

#### <a id='toc5_5_1_2_'></a>[Analysis with Accepted and Submitted Variants](#toc0_)

submitted variant = a variant with no accepted evidence items but with evidence items with the "submitted" status

In [64]:
ns_var_w_acc_sub_evid_df = not_supported_variants_add_evidence_df.copy()

##### <a id='toc5_5_1_2_1_'></a>[Calculating evidence score via level](#toc0_)

In [65]:
ns_var_w_acc_sub_evid_df["evidence_score"] = ""
ns_var_w_acc_sub_evid_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted,evidence_score
0,4170,VHL,,Not provided,Transcript Variants,False,10647,submitted,[2],[C],False,
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,7428,submitted,[3],[C],False,
2,4214,VHL,,Not provided,Transcript Variants,False,10752,submitted,[3],[C],False,
3,4216,VHL,,Not provided,Transcript Variants,False,10754,submitted,[3],[C],False,
4,4278,VHL,,Not provided,Transcript Variants,False,10958,submitted,[3],[C],False,
...,...,...,...,...,...,...,...,...,...,...,...,...
1560,3478,ESR2,underexpression beta-1,Not provided,Other Variants,False,9618,submitted,[4],[B],False,
1560,3478,ESR2,underexpression beta-1,Not provided,Other Variants,False,9619,submitted,[4],[B],False,
1561,3508,CD274,v242,Not provided,Sequence Variants,False,9695,submitted,[4],[E],False,
1562,2422,NTRK3,~DEPRECATED-ETV6-NTRK3,Transcript Fusion,Other Variants,False,10692,submitted,[3],[C],False,


In [66]:
ns_var_w_acc_sub_evid_df["evidence_level"] = ns_var_w_acc_sub_evid_df["evidence_level"].apply(lambda y: y[0])

In [67]:
ns_var_w_acc_sub_evid_df["evidence_score"] = ns_var_w_acc_sub_evid_df["evidence_level"].map(
    evidence_level_to_impact
)
ns_var_w_acc_sub_evid_df

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,evidence_id,evidence_status,evidence_rating,evidence_level,evidence_accepted,evidence_score
0,4170,VHL,,Not provided,Transcript Variants,False,10647,submitted,[2],C,False,3.0
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,7428,submitted,[3],C,False,3.0
2,4214,VHL,,Not provided,Transcript Variants,False,10752,submitted,[3],C,False,3.0
3,4216,VHL,,Not provided,Transcript Variants,False,10754,submitted,[3],C,False,3.0
4,4278,VHL,,Not provided,Transcript Variants,False,10958,submitted,[3],C,False,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1560,3478,ESR2,underexpression beta-1,Not provided,Other Variants,False,9618,submitted,[4],B,False,5.0
1560,3478,ESR2,underexpression beta-1,Not provided,Other Variants,False,9619,submitted,[4],B,False,5.0
1561,3508,CD274,v242,Not provided,Sequence Variants,False,9695,submitted,[4],E,False,0.5
1562,2422,NTRK3,~DEPRECATED-ETV6-NTRK3,Transcript Fusion,Other Variants,False,10692,submitted,[3],C,False,3.0


In [68]:
ns_var_w_acc_sub_evid_df.sort_values(by=["variant_id"])
not_supported_variants_w_acc_sub_evid_df = ns_var_w_acc_sub_evid_df.groupby("variant_id").aggregate(
    {
        "gene_name": "first",
        "variant_name": "first",
        "category": "first",
        "evidence_id": "count",
        "evidence_score": "sum",
    }
)
not_supported_variants_w_acc_sub_evid_df = not_supported_variants_w_acc_sub_evid_df.rename(
    columns={"evidence_id": "#_evidence_items", "evidence_score": "evidence_score_sum"}
)
not_supported_variants_w_acc_sub_evid_df

Unnamed: 0_level_0,gene_name,variant_name,category,#_evidence_items,evidence_score_sum
variant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,ABL1,BCR::ABL,Fusion Variants,185,494.0
5,ALK,EML4::ALK,Fusion Variants,95,167.0
17,BRAF,V600,Sequence Variants,26,153.0
19,CCND1,Expression,Expression Variants,2,10.0
20,CCND1,Overexpression,Expression Variants,10,40.0
...,...,...,...,...,...
4539,MET,,Transcript Variants,1,3.0
4540,MET,,Transcript Variants,1,3.0
4543,DICER1,Loss-of-function,Gene Function Variants,6,35.0
4544,MET,,Transcript Variants,1,3.0


##### <a id='toc5_5_1_2_2_'></a>[Summary Table](#toc0_)

In [69]:
not_supported_accepted_submitted_variant_categories_df = not_supported_variants_w_acc_sub_evid_df.groupby("category").aggregate(
    {"gene_name": "count", "#_evidence_items": "sum", "evidence_score_sum": "sum"}
)
not_supported_accepted_submitted_variant_categories_df = not_supported_accepted_submitted_variant_categories_df.rename(
    columns={"evidence_score_sum": "impact", "gene_name": "number_of_variants"}
)
not_supported_accepted_submitted_variant_categories_df["average_impact_per_variant"] = (
    not_supported_accepted_submitted_variant_categories_df["impact"] / not_supported_accepted_submitted_variant_categories_df["number_of_variants"]
).round(2)
not_supported_accepted_submitted_variant_categories_df.sort_values(by=["impact"], ascending=False)

Unnamed: 0_level_0,number_of_variants,#_evidence_items,impact,average_impact_per_variant
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fusion Variants,290,1218,3816.5,13.16
Region Defined Variants,124,566,2689.0,21.69
Expression Variants,287,610,2080.5,7.25
Rearrangement Variants,114,531,1969.0,17.27
Transcript Variants,366,446,1336.0,3.65
Sequence Variants,133,302,1206.0,9.07
Gene Function Variants,86,345,1134.0,13.19
Other Variants,83,144,515.5,6.21
Copy Number Variants,34,67,199.0,5.85
Genotype Variants,16,27,141.0,8.81


The difference in impact when removing the submitted variants from the analysis

In [70]:
(not_supported_accepted_submitted_variant_categories_df["impact"] - not_supported_accepted_variant_categories_df["impact"]).sort_values(ascending=False)

category
Fusion Variants            1512.0
Transcript Variants        1180.0
Rearrangement Variants     1076.5
Expression Variants         845.5
Region Defined Variants     836.0
Gene Function Variants      593.5
Sequence Variants           349.5
Other Variants              289.5
Copy Number Variants        110.0
Genotype Variants            55.0
Epigenetic Modification      10.0
Name: impact, dtype: float64

In [71]:
not_supported_accepted_submitted_variant_categories_df.to_csv("civic_evidence_analysis_output/civic_both_evidence_cat_impact_df.csv", index=True)
not_supported_accepted_variant_categories_df.to_csv("civic_evidence_analysis_output/civic_accepted_evidence_only_impact_df.csv", index=True)

### <a id='toc5_5_2_'></a>[Via Molecular Profile Score- this was not used eventaully since MOA evidence items are only scored by level](#toc0_)

#### <a id='toc5_5_2_1_'></a>[Import molecular profile id](#toc0_)

In [72]:
not_supported_variants_add_molprof_df = transform_df_mp_id(not_supported_queries_df)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id
0,4170,VHL,,Not provided,Transcript Variants,False,[4038]
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,[4350]
2,4214,VHL,,Not provided,Transcript Variants,False,[4082]
3,4216,VHL,,Not provided,Transcript Variants,False,[4084]
4,4278,VHL,,Not provided,Transcript Variants,False,[4146]


#### <a id='toc5_5_2_2_'></a>[Import molecular profile scores](#toc0_)

In [73]:
not_supported_variants_add_molprof_df = transform_df_mp_score(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score
0,4170,VHL,,Not provided,Transcript Variants,False,[4038],[0.0]
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,[4350],[0.0]
2,4214,VHL,,Not provided,Transcript Variants,False,[4082],[0.0]
3,4216,VHL,,Not provided,Transcript Variants,False,[4084],[0.0]
4,4278,VHL,,Not provided,Transcript Variants,False,[4146],[0.0]


In [74]:
not_supported_variants_add_molprof_df = transform_df_mp_score_sum(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
0,4170,VHL,,Not provided,Transcript Variants,False,[4038],[0.0],0.0
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False,[4350],[0.0],0.0
2,4214,VHL,,Not provided,Transcript Variants,False,[4082],[0.0],0.0
3,4216,VHL,,Not provided,Transcript Variants,False,[4084],[0.0],0.0
4,4278,VHL,,Not provided,Transcript Variants,False,[4146],[0.0],0.0


In [75]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] == 0.0)
    & (not_supported_variants_add_molprof_df["variant_accepted"] == True)
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
728,1247,BRCA2,M1R,Missense Variant,Other Variants,True,[1221],[0.0],0.0


In [76]:
not_supported_variants_add_molprof_df["molecular_profile_score_sum"].max()

862.5

In [77]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] != 0.0)
]

Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted,molecular_profile_id,molecular_profile_score,molecular_profile_score_sum
7,2930,VHL,,Not provided,Transcript Variants,True,[2799],[7.5],7.5
9,785,CHEK2,1100DELC,Frameshift Truncation,Sequence Variants,True,[766],[15.0],15.0
12,823,EPCAM,3' Exon Deletion,Disruptive Inframe Deletion,Rearrangement Variants,True,[801],[20.0],20.0
13,433,HIF1A,3' UTR Polymorphism,3 Prime UTR Variant;Snp,Region Defined Variants,True,[429],[10.0],10.0
15,2367,VHL,3p26.3-25.3 11Mb del,Not provided,Rearrangement Variants,True,[2240],[7.5],7.5
...,...,...,...,...,...,...,...,...,...
1543,272,CDKN2A,p16 Expression,,Expression Variants,True,[268],[180.0],180.0
1545,3313,CDKN1A,rs1059234,Not provided,Other Variants,True,[3181],[15.0],15.0
1547,256,KIT,rs17084733,3 Prime UTR Variant,Other Variants,True,[252],[15.0],15.0
1548,2671,CDKN1A,rs1801270,Not provided,Other Variants,True,[2540],[15.0],15.0


#### <a id='toc5_5_2_3_'></a>[Impact by Subcategory](#toc0_)

In [78]:
not_supported_impact_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "CIVIC Total Sum Impact Score": [],
    "Average Impact Score per Variant": [],
    "Average Impact Score per Evidence Item": [],
    "Total Number Evidence Items": [
        v["number_unique_not_supported_category_evidence"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "% Accepted Evidence Items": [
        v["percentage_accepted_evidence_not_supported_category_variants"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "Total Number Variants": [
        v["number_unique_not_supported_category_variants"]
        for v in not_supported_variant_categories_summary_data.values()
    ],
}

In [79]:
not_supported_variant_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_impact_data[category] = {}
    impact_category_df = not_supported_variants_add_molprof_df[
        not_supported_variants_add_molprof_df.category == category
    ]

    total_sum_not_supported_category_impact = impact_category_df[
        "molecular_profile_score_sum"
    ].sum()
    not_supported_variant_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

    avg_impact_score_variant = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_variants
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_variant"
    ] = avg_impact_score_variant

    avg_impact_score_evidence = (
        total_sum_not_supported_category_impact
        / number_unique_not_supported_category_evidence
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_evidence

    not_supported_impact_summary["CIVIC Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Variant"].append(
        avg_impact_score_variant
    )
    not_supported_impact_summary["Average Impact Score per Evidence Item"].append(
        avg_impact_score_evidence
    )

    print(f"{category}: {total_sum_not_supported_category_impact}")

Expression Variants: 3618.0
Epigenetic Modification: 285.5
Fusion Variants: 6558.75
Sequence Variants: 2746.75
Gene Function Variants: 1805.5
Rearrangement Variants: 2794.0
Copy Number Variants: 225.0
Other Variants: 653.5
Genotype Variants: 312.5
Region Defined Variants: 6199.5
Transcript Variants: 356.5


In [80]:
not_supported_variant_impact_df = pd.DataFrame(not_supported_impact_summary)

In [81]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

Unnamed: 0,Category,CIVIC Total Sum Impact Score,Average Impact Score per Variant,Average Impact Score per Evidence Item,Total Number Evidence Items,% Accepted Evidence Items,Total Number Variants
0,Expression Variants,3618.0,9.89,8.11,610,56.07%,287
1,Epigenetic Modification,285.5,0.78,0.64,23,95.65%,14
2,Fusion Variants,6558.75,17.92,14.71,1218,61.33%,294
3,Sequence Variants,2746.75,7.5,6.16,302,64.90%,133
4,Gene Function Variants,1805.5,4.93,4.05,345,44.64%,91
5,Rearrangement Variants,2794.0,7.63,6.26,531,38.04%,116
6,Copy Number Variants,225.0,0.61,0.5,67,46.27%,34
7,Other Variants,653.5,1.79,1.47,144,40.28%,83
8,Genotype Variants,312.5,0.85,0.7,27,62.96%,16
9,Region Defined Variants,6199.5,16.94,13.9,566,72.08%,129


In [82]:
not_supported_variant_impact_df.to_csv(
    "civic_evidence_analysis_output/not_supported_variant_impact_df.csv", index=False
)

# <a id='toc6_'></a>[Summary](#toc0_)

## <a id='toc6_1_'></a>[Variant Analysis](#toc0_)

### <a id='toc6_1_1_'></a>[Building Summary Table 1 & 2](#toc0_)

In [83]:
all_variant_df = pd.DataFrame(variant_analysis_summary)

In [84]:
all_variant_df["Percentage of all CIViC Variant Items"] = (
    all_variant_df["Fraction of all CIViC Variant Items"].astype(str)
    + "  ("
    + all_variant_df["Percentage of all CIViC Variant Items"]
    + ")"
)
all_variant_df["Percentage of Accepted Variant Items"] = (
    all_variant_df["Fraction of Accepted Variant Items"].astype(str)
    + "  ("
    + all_variant_df["Percentage of Accepted Variant Items"]
    + ")"
)
all_variant_df["Percentage of Not Accepted Variant Items"] = (
    all_variant_df["Fraction of Not Accepted Variant Items"].astype(str)
    + "  ("
    + all_variant_df["Percentage of Not Accepted Variant Items"]
    + ")"
)

In [85]:
all_variant_df = all_variant_df.drop(
    [
        "Fraction of all CIViC Variant Items",
        "Fraction of Accepted Variant Items",
        "Fraction of Not Accepted Variant Items",
    ],
    axis=1,
)

In [86]:
all_variant_percent_status_df = all_variant_df.drop(
    [
        "Percentage of all CIViC Variant Items",
        "Count of CIViC Variant Items per Category",
    ],
    axis=1,
)

for_merge_all_variant_percent_of_civic_df = all_variant_df.drop(
    [
        "Percentage of Accepted Variant Items",
        "Percentage of Not Accepted Variant Items",
    ],
    axis=1,
)

all_variant_percent_of_civic_df = for_merge_all_variant_percent_of_civic_df.drop(
    ["Count of CIViC Variant Items per Category"], axis=1
)

In [87]:
for_merge_all_variant_percent_of_civic_df.to_csv(
    "civic_evidence_analysis_output/for_merge_all_variant_percent_of_civic_df.csv", index=False
)

### <a id='toc6_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percentage they make up of all variants in CIViC data.

In [88]:
all_variant_percent_of_civic_df = all_variant_percent_of_civic_df.set_index(
    "Variant Category"
)
all_variant_percent_of_civic_df

Unnamed: 0_level_0,Percentage of all CIViC Variant Items
Variant Category,Unnamed: 1_level_1
Normalized,1876 / 3519 (53.31%)
Unable to Normalize,80 / 3519 (2.27%)
Not Supported,1563 / 3519 (44.42%)


In [89]:
civic_summary_table_1 = all_variant_percent_of_civic_df

### <a id='toc6_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percentage of the variants in each category are accepted (have at least one evidence item that is accepted) or not.

In [90]:
all_variant_percent_status_df = all_variant_percent_status_df.set_index(
    "Variant Category"
)
all_variant_percent_status_df

Unnamed: 0_level_0,Percentage of Accepted Variant Items,Percentage of Not Accepted Variant Items
Variant Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Normalized,869 / 1876 (46.32%),1007 / 1876 (53.68%)
Unable to Normalize,11 / 80 (13.75%),69 / 80 (86.25%)
Not Supported,790 / 1563 (50.54%),773 / 1563 (49.46%)


In [91]:
civic_summary_table_2 = all_variant_percent_status_df

### <a id='toc6_1_4_'></a>[Building Summary Tables 3 - 5](#toc0_)

In [92]:
not_supported_variant_df = pd.DataFrame(not_supported_variant_analysis_summary)

In [93]:
not_supported_variant_df["Percent of Not Supported Variant Items"] = (
    not_supported_variant_df["Fraction of Not Supported Variant Items"].astype(str)
    + "  ("
    + not_supported_variant_df["Percent of Not Supported Variant Items"]
    + ")"
)
not_supported_variant_df["Percent of all CIViC Variant Items"] = (
    not_supported_variant_df["Fraction of all CIViC Variant Items"].astype(str)
    + "  ("
    + not_supported_variant_df["Percent of all CIViC Variant Items"]
    + ")"
)
not_supported_variant_df["Percent of Accepted Variant Items"] = (
    not_supported_variant_df["Fraction of Accepted Variant Items"].astype(str)
    + "  ("
    + not_supported_variant_df["Percent of Accepted Variant Items"]
    + ")"
)
not_supported_variant_df["Percent of Not Accepted Variant Items"] = (
    not_supported_variant_df["Fraction of Not Accepted Variant Items"].astype(str)
    + "  ("
    + not_supported_variant_df["Percent of Not Accepted Variant Items"]
    + ")"
)

In [94]:
not_supported_variant_df = not_supported_variant_df.drop(
    [
        "Fraction of Not Supported Variant Items",
        "Fraction of all CIViC Variant Items",
        "Fraction of Accepted Variant Items",
        "Fraction of Not Accepted Variant Items",
    ],
    axis=1,
)

In [95]:
for_merge_not_supported_variant_percent_of_civic_df = not_supported_variant_df.drop(
    [
        "Percent of Not Supported Variant Items",
        "Percent of Accepted Variant Items",
        "Percent of Not Accepted Variant Items",
    ],
    axis=1,
)

not_supported_variant_percent_of_civic_df = (
    for_merge_not_supported_variant_percent_of_civic_df.drop(
        ["Count of CIViC Variant Items per Category"], axis=1
    )
)

not_supported_variant_percent_of_not_supported_df = not_supported_variant_df.drop(
    [
        "Percent of all CIViC Variant Items",
        "Count of CIViC Variant Items per Category",
        "Percent of Accepted Variant Items",
        "Percent of Not Accepted Variant Items",
    ],
    axis=1,
)

not_supported_variant_percent_evidence_df = not_supported_variant_df.drop(
    [
        "Percent of all CIViC Variant Items",
        "Percent of Not Supported Variant Items",
        "Count of CIViC Variant Items per Category",
    ],
    axis=1,
)

In [96]:
for_merge_not_supported_variant_percent_of_civic_df.to_csv(
    "civic_evidence_analysis_output/for_merge_not_supported_variant_percent_of_civic_df.csv", index=False
)

### <a id='toc6_1_5_'></a>[Summary Table 3](#toc0_)

The table below shows the categories that the Not Supported variants were broken into and what percentage of all CIViC variants they make up. These percentages will not add up to 100% because Not Supported variants make up 45.62% of all CIViC variants.

In [97]:
not_supported_variant_percent_of_civic_df = (
    not_supported_variant_percent_of_civic_df.set_index("Category")
)
not_supported_variant_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Variant Items
Category,Unnamed: 1_level_1
Expression Variants,287 / 3519 (8.16%)
Epigenetic Modification,14 / 3519 (0.40%)
Fusion Variants,294 / 3519 (8.35%)
Sequence Variants,133 / 3519 (3.78%)
Gene Function Variants,91 / 3519 (2.59%)
Rearrangement Variants,116 / 3519 (3.30%)
Copy Number Variants,34 / 3519 (0.97%)
Other Variants,83 / 3519 (2.36%)
Genotype Variants,16 / 3519 (0.45%)
Region Defined Variants,129 / 3519 (3.67%)


In [98]:
civic_summary_table_3 = not_supported_variant_percent_of_civic_df

### <a id='toc6_1_6_'></a>[Summary Table 4](#toc0_)

The table below shows the Not Supported variants broken up into 12 sub categories and what percent each sub category take up in Not Supported variant group.

In [99]:
not_supported_variant_percent_of_not_supported_df = (
    not_supported_variant_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_percent_of_not_supported_df

Unnamed: 0_level_0,Percent of Not Supported Variant Items
Category,Unnamed: 1_level_1
Expression Variants,287 / 1563 (18.36%)
Epigenetic Modification,14 / 1563 (0.90%)
Fusion Variants,294 / 1563 (18.81%)
Sequence Variants,133 / 1563 (8.51%)
Gene Function Variants,91 / 1563 (5.82%)
Rearrangement Variants,116 / 1563 (7.42%)
Copy Number Variants,34 / 1563 (2.18%)
Other Variants,83 / 1563 (5.31%)
Genotype Variants,16 / 1563 (1.02%)
Region Defined Variants,129 / 1563 (8.25%)


In [100]:
civic_summary_table_4 = not_supported_variant_percent_of_not_supported_df

### <a id='toc6_1_7_'></a>[Summary Table 5](#toc0_)

The table below shows the Not Supported variants broken up into 12 sub categories and what percent each sub category take up in Not Supported variant group.

In [101]:
not_supported_variant_percent_evidence_df = (
    not_supported_variant_percent_evidence_df.set_index("Category")
)
not_supported_variant_percent_evidence_df

Unnamed: 0_level_0,Percent of Accepted Variant Items,Percent of Not Accepted Variant Items
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Expression Variants,180 / 287 (62.72%),107 / 287 (37.28%)
Epigenetic Modification,14 / 14 (100.00%),0 / 14 (0.00%)
Fusion Variants,204 / 294 (69.39%),90 / 294 (30.61%)
Sequence Variants,73 / 133 (54.89%),60 / 133 (45.11%)
Gene Function Variants,49 / 91 (53.85%),42 / 91 (46.15%)
Rearrangement Variants,47 / 116 (40.52%),69 / 116 (59.48%)
Copy Number Variants,19 / 34 (55.88%),15 / 34 (44.12%)
Other Variants,42 / 83 (50.60%),41 / 83 (49.40%)
Genotype Variants,12 / 16 (75.00%),4 / 16 (25.00%)
Region Defined Variants,99 / 129 (76.74%),30 / 129 (23.26%)


In [102]:
civic_summary_table_5 = not_supported_variant_percent_evidence_df

## <a id='toc6_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc6_2_1_'></a>[Building Summary Tables 6 & 7](#toc0_)

In [103]:
all_variant_evidence_df = pd.DataFrame(evidence_analysis_summary)

In [104]:
all_variant_evidence_df["Percentage of all CIViC Evidence Items"] = (
    all_variant_evidence_df["Fraction of all CIViC Evidence Items"].astype(str)
    + "  ("
    + all_variant_evidence_df["Percentage of all CIViC Evidence Items"]
    + ")"
)
all_variant_evidence_df["Percentage of Accepted Evidence Items"] = (
    all_variant_evidence_df["Fraction of Accepted Evidence Items"].astype(str)
    + "  ("
    + all_variant_evidence_df["Percentage of Accepted Evidence Items"]
    + ")"
)
all_variant_evidence_df["Percentage of Submitted Evidence Items"] = (
    all_variant_evidence_df["Fraction of Submitted Evidence Items"].astype(str)
    + "  ("
    + all_variant_evidence_df["Percentage of Submitted Evidence Items"]
    + ")"
)

In [105]:
all_variant_evidence_df = all_variant_evidence_df.drop(
    [
        "Fraction of all CIViC Evidence Items",
        "Fraction of Accepted Evidence Items",
        "Fraction of Submitted Evidence Items",
    ],
    axis=1,
)

In [106]:
for_merge_all_variant_evidence_percent_of_civic_df = all_variant_evidence_df.drop(
    ["Percentage of Accepted Evidence Items", "Percentage of Submitted Evidence Items"],
    axis=1,
)

all_variant_evidence_percent_of_civic_df = (
    for_merge_all_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

all_variant_evidence_percent_evidence_df = all_variant_evidence_df.drop(
    [
        "Percentage of all CIViC Evidence Items",
        "Count of CIViC Evidence Items per Category",
    ],
    axis=1,
)

In [107]:
for_merge_all_variant_evidence_percent_of_civic_df.to_csv(
    "civic_evidence_analysis_output/for_merge_all_variant_evidence_percent_of_civic_df.csv", index=False
)

### <a id='toc6_2_2_'></a>[Summary Table 6](#toc0_)

The table below shows what percentage of all evidence items in CIViC are associated with Normalized, Unable to Normalize, and Not Supported variants. This will not add up to 100% because evidence itmes may be used across multiple variants.

In [108]:
all_variant_evidence_percent_of_civic_df = (
    all_variant_evidence_percent_of_civic_df.set_index("Variant Category")
)
all_variant_evidence_percent_of_civic_df

Unnamed: 0_level_0,Percentage of all CIViC Evidence Items
Variant Category,Unnamed: 1_level_1
Normalized,5866 / 9920 (59.13%)
Unable to Normalize,127 / 9920 (1.28%)
Not Supported,4243 / 9920 (42.77%)


In [109]:
civic_summary_table_6 = all_variant_evidence_percent_of_civic_df

### <a id='toc6_2_3_'></a>[Summmary Table 7](#toc0_)

The table below shows the percentage of accepted and sumbitted evidence items per category of variants.

In [110]:
all_variant_evidence_percent_evidence_df = (
    all_variant_evidence_percent_evidence_df.set_index("Variant Category")
)
all_variant_evidence_percent_evidence_df

Unnamed: 0_level_0,Percentage of Accepted Evidence Items,Percentage of Submitted Evidence Items
Variant Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Normalized,2080 / 5866 (35.46%),3786 / 5866 (64.54%)
Unable to Normalize,17 / 127 (13.39%),110 / 127 (86.61%)
Not Supported,2231 / 4243 (52.58%),2048 / 4243 (48.27%)


In [111]:
civic_summary_table_7 = all_variant_evidence_percent_evidence_df

### <a id='toc6_2_4_'></a>[Building Summary Tables 8 - 10](#toc0_)

In [112]:
not_supported_variant_evidence_df = pd.DataFrame(not_supported_variant_evidence_summary)

In [113]:
not_supported_variant_evidence_df["Percent of all CIViC Evidence Items"] = (
    not_supported_variant_evidence_df["Fraction of CIViC Evidence Items"].astype(str)
    + "  ("
    + not_supported_variant_evidence_df["Percent of all CIViC Evidence Items"]
    + ")"
)
not_supported_variant_evidence_df["Percent of Not Supported Variant Evidence Items"] = (
    not_supported_variant_evidence_df[
        "Fraction of Not Supported Variant Evidence Items"
    ].astype(str)
    + "  ("
    + not_supported_variant_evidence_df[
        "Percent of Not Supported Variant Evidence Items"
    ]
    + ")"
)
not_supported_variant_evidence_df["Percent of Accepted Evidence Items"] = (
    not_supported_variant_evidence_df["Fraction of Accepted Evidence Items"].astype(str)
    + "  ("
    + not_supported_variant_evidence_df["Percent of Accepted Evidence Items"]
    + ")"
)
not_supported_variant_evidence_df["Percent of Submitted Evidence Items"] = (
    not_supported_variant_evidence_df["Fraction of Submitted Evidence Items"].astype(
        str
    )
    + "  ("
    + not_supported_variant_evidence_df["Percent of Submitted Evidence Items"]
    + ")"
)

In [114]:
not_supported_variant_evidence_df = not_supported_variant_evidence_df.drop(
    [
        "Fraction of CIViC Evidence Items",
        "Fraction of Not Supported Variant Evidence Items",
        "Fraction of Accepted Evidence Items",
        "Fraction of Submitted Evidence Items",
    ],
    axis=1,
)

In [115]:
for_merge_not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of Accepted Evidence Items",
            "Percent of Submitted Evidence Items",
        ],
        axis=1,
    )
)

not_supported_variant_evidence_percent_of_civic_df = (
    for_merge_not_supported_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of all CIViC Evidence Items",
            "Percent of Accepted Evidence Items",
            "Percent of Submitted Evidence Items",
            "Count of CIViC Evidence Items per Category",
        ],
        axis=1,
    )
)

not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of all CIViC Evidence Items",
            "Count of CIViC Evidence Items per Category",
        ],
        axis=1,
    )
)

In [116]:
for_merge_not_supported_variant_evidence_percent_of_civic_df.to_csv(
    "civic_evidence_analysis_output/for_merge_not_supported_variant_evidence_percent_of_civic_df.csv", index=False
)

### <a id='toc6_2_5_'></a>[Summary Table 8](#toc0_)

The table below shows the percentage of all CIViC evidence items that are associated with a Not Supported variant sub category. This will not add up to 100% since the evidence items can be associated with multiple variants.

In [117]:
not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_percent_of_civic_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_civic_df

Unnamed: 0_level_0,Percent of all CIViC Evidence Items
Category,Unnamed: 1_level_1
Expression Variants,610 / 9920 (6.15%)
Epigenetic Modification,23 / 9920 (0.23%)
Fusion Variants,1218 / 9920 (12.28%)
Sequence Variants,302 / 9920 (3.04%)
Gene Function Variants,345 / 9920 (3.48%)
Rearrangement Variants,531 / 9920 (5.35%)
Copy Number Variants,67 / 9920 (0.68%)
Other Variants,144 / 9920 (1.45%)
Genotype Variants,27 / 9920 (0.27%)
Region Defined Variants,566 / 9920 (5.71%)


In [118]:
civic_summary_table_8 = not_supported_variant_evidence_percent_of_civic_df

### <a id='toc6_2_6_'></a>[Summary Table 9](#toc0_)

The table below shows the percentage of all evidence items associated with Not Supported variants that are associated with a variant sub category.

In [119]:
not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_not_supported_df

Unnamed: 0_level_0,Percent of Not Supported Variant Evidence Items
Category,Unnamed: 1_level_1
Expression Variants,610 / 4243 (14.38%)
Epigenetic Modification,23 / 4243 (0.54%)
Fusion Variants,1218 / 4243 (28.71%)
Sequence Variants,302 / 4243 (7.12%)
Gene Function Variants,345 / 4243 (8.13%)
Rearrangement Variants,531 / 4243 (12.51%)
Copy Number Variants,67 / 4243 (1.58%)
Other Variants,144 / 4243 (3.39%)
Genotype Variants,27 / 4243 (0.64%)
Region Defined Variants,566 / 4243 (13.34%)


In [120]:
civic_summary_table_9 = not_supported_variant_evidence_percent_of_not_supported_df

### <a id='toc6_2_7_'></a>[Summary Table 10](#toc0_)

The table below shows the percentage of evidence items associated with Not Supported variant sub categories that are accepted or submitted.

In [121]:
not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_percent_evidence_df.set_index("Category")
)
not_supported_variant_evidence_percent_evidence_df

Unnamed: 0_level_0,Percent of Accepted Evidence Items,Percent of Submitted Evidence Items
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Expression Variants,342 / 610 (56.07%),268 / 610 (43.93%)
Epigenetic Modification,22 / 23 (95.65%),1 / 23 (4.35%)
Fusion Variants,747 / 1218 (61.33%),471 / 1218 (38.67%)
Sequence Variants,196 / 302 (64.90%),106 / 302 (35.10%)
Gene Function Variants,154 / 345 (44.64%),191 / 345 (55.36%)
Rearrangement Variants,202 / 531 (38.04%),329 / 531 (61.96%)
Copy Number Variants,31 / 67 (46.27%),36 / 67 (53.73%)
Other Variants,58 / 144 (40.28%),86 / 144 (59.72%)
Genotype Variants,17 / 27 (62.96%),10 / 27 (37.04%)
Region Defined Variants,408 / 566 (72.08%),158 / 566 (27.92%)


In [122]:
civic_summary_table_10 = not_supported_variant_evidence_percent_evidence_df

## <a id='toc6_3_'></a>[Impact](#toc0_)

In [123]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

Unnamed: 0,Category,CIVIC Total Sum Impact Score,Average Impact Score per Variant,Average Impact Score per Evidence Item,Total Number Evidence Items,% Accepted Evidence Items,Total Number Variants
0,Expression Variants,3618.0,9.89,8.11,610,56.07%,287
1,Epigenetic Modification,285.5,0.78,0.64,23,95.65%,14
2,Fusion Variants,6558.75,17.92,14.71,1218,61.33%,294
3,Sequence Variants,2746.75,7.5,6.16,302,64.90%,133
4,Gene Function Variants,1805.5,4.93,4.05,345,44.64%,91
5,Rearrangement Variants,2794.0,7.63,6.26,531,38.04%,116
6,Copy Number Variants,225.0,0.61,0.5,67,46.27%,34
7,Other Variants,653.5,1.79,1.47,144,40.28%,83
8,Genotype Variants,312.5,0.85,0.7,27,62.96%,16
9,Region Defined Variants,6199.5,16.94,13.9,566,72.08%,129


The bar graph below shows the relationship between the Not Suported variant sub category impact score and the sub category. Additionally, the colors illustrate the number of evidence items associated each sub category.

In [124]:
fig = px.bar(
    not_supported_variant_impact_df,
    x="Category",
    y="CIVIC Total Sum Impact Score",
    hover_data=[
        "Total Number Evidence Items",
        not_supported_variant_impact_df["% Accepted Evidence Items"],
    ],
    color="Total Number Evidence Items",
    labels={"CIVIC Total Sum Impact Score": "CIVIC Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig.update_traces(width=1)
fig.show()

In [126]:
fig.write_html("civic_evidence_analysis_output/civic_ns_categories_impact_redgreen.html")

The scatterplot below shows the relationship between the Not Suported variant sub category impact score and the number of evidence items associated with variants in each sub category. Additionally, the sizes of the data point represent the number of variants in each sub category. 

In [127]:
fig2 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Evidence Items",
    y="CIVIC Total Sum Impact Score",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    color="Category",
    hover_data="% Accepted Evidence Items",
)
fig2.show()

In [128]:
fig2.write_html("civic_evidence_analysis_output/civic_ns_categories_impact_scatterplot.html")

In [129]:
fig3 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Variants",
    y="Average Impact Score per Evidence Item",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    # color_discrete_sequence= Bold,
    color="Category",
    hover_data=["% Accepted Evidence Items", "Average Impact Score per Variant"],
)
fig3.show()