# <a id='toc1_'></a>[CIViC Evidence Analysis](#toc0_)
The `civic_evidence_analysis` notebook analyzes CIViC evidence data.

**Table of contents**<a id='toc0_'></a>    
- [CIViC Evidence Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
    - [Use latest cache that has been pushed to the repo](#toc1_1_3_)    
  - [Total Variants in CIViC](#toc1_2_)    
  - [Total Evidence items in CIViC](#toc1_3_)    
  - [Total Molecular Profiles in CIViC](#toc1_4_)    
- [Create analysis functions / global variables](#toc2_)    
  - [Summary dicts](#toc2_1_)    
  - [Define Analysis Functions](#toc2_2_)    
- [Analysis of Normalized Queries](#toc3_)    
  - [List of Normalized Variant IDs](#toc3_1_)    
  - [Variant analysis](#toc3_2_)    
  - [Transform df for evidence analysis](#toc3_3_)    
  - [Evidence analysis](#toc3_4_)    
  - [Impact](#toc3_5_)    
    - [Import molecular profile id](#toc3_5_1_)    
    - [Import molecular profile scores](#toc3_5_2_)    
- [Analysis of Unable to Normalize Queries](#toc4_)    
  - [List of Unable to Normalize Variant IDs](#toc4_1_)    
  - [Variant analysis](#toc4_2_)    
  - [Transform df for evidence analysis](#toc4_3_)    
  - [Evidence analysis](#toc4_4_)    
  - [Impact](#toc4_5_)    
    - [Import molecular profile id](#toc4_5_1_)    
    - [Import molecular profile scores](#toc4_5_2_)    
- [Analysis of Not Supported Variants](#toc5_)    
    - [List of Not Supported Variant IDs](#toc5_1_1_)    
  - [Variant Analysis](#toc5_2_)    
    - [Not Supported Variant Analysis by Subcategory](#toc5_2_1_)    
  - [Transform df for evidence analysis](#toc5_3_)    
  - [Evidence analysis](#toc5_4_)    
    - [Not Supported Variant Evidence Analysis by Subcategory](#toc5_4_1_)    
  - [Impact](#toc5_5_)    
    - [Via Evidence Level](#toc5_5_1_)    
      - [Analysis with only Accepted Variants](#toc5_5_1_1_)    
        - [Calculating evidence score via level](#toc5_5_1_1_1_)    
        - [Summary Table](#toc5_5_1_1_2_)    
      - [Analysis with Accepted and Submitted Variants](#toc5_5_1_2_)    
        - [Calculating evidence score via level](#toc5_5_1_2_1_)    
        - [Summary Table](#toc5_5_1_2_2_)    
    - [Via Molecular Profile Score (ultimately not used, since MOA evidence items are only scored by level)](#toc5_5_2_)    
      - [Import molecular profile id](#toc5_5_2_1_)    
      - [Import molecular profile scores](#toc5_5_2_2_)    
      - [Impact by Subcategory](#toc5_5_2_3_)    
- [Summary](#toc6_)    
  - [Variant Analysis](#toc6_1_)    
    - [Building Summary Table 1 & 2](#toc6_1_1_)    
    - [Summary Table 1](#toc6_1_2_)    
    - [Summary Table 2](#toc6_1_3_)    
    - [Building Summary Tables 3 - 5](#toc6_1_4_)    
    - [Summary Table 3](#toc6_1_5_)    
    - [Summary Table 4](#toc6_1_6_)    
    - [Summary Table 5](#toc6_1_7_)    
  - [Evidence Analysis](#toc6_2_)    
    - [Building Summary Tables 6 & 7](#toc6_2_1_)    
    - [Summary Table 6](#toc6_2_2_)    
    - [Summary Table 7](#toc6_2_3_)    
    - [Building Summary Tables 8 - 10](#toc6_2_4_)    
    - [Summary Table 8](#toc6_2_5_)    
    - [Summary Table 9](#toc6_2_6_)    
    - [Summary Table 10](#toc6_2_7_)    
  - [Impact](#toc6_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [None]:
from pathlib import Path
from enum import Enum
import zipfile
import pandas as pd
from civicpy import civic as civicpy
import plotly.express as px

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [None]:
path = Path("output")
path.mkdir(exist_ok=True)

### <a id='toc1_1_3_'></a>[Use latest cache that has been pushed to the repo](#toc0_)

In [None]:
latest_cache_zip_path = sorted(Path().glob("../cache-*.pkl.zip"))[-1]
print(f"Using {latest_cache_zip_path} for civicpy cache")

with zipfile.ZipFile(latest_cache_zip_path, "r") as zip_ref:
    zip_ref.extractall("../")

civicpy.load_cache(local_cache_path=Path("../cache.pkl"), on_stale="ignore")
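
Taking the last element of the sorted glob results selects the newest cache only when the file names sort chronologically. The following is a minimal sketch of that selection step, assuming zero-padded `YYYYMMDD` stamps in the file names; the names used here are hypothetical and the real stamp format may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; the real date-stamp format of the cache files is an assumption.
names = ["cache-20230105.pkl.zip", "cache-20231201.pkl.zip", "cache-20230820.pkl.zip"]

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        (Path(tmp) / name).touch()
    # Same idea as the cell above: sort the matches and take the last one.
    latest = sorted(Path(tmp).glob("cache-*.pkl.zip"))[-1]
    latest_name = latest.name

print(latest_name)
```

If the stamps were not zero-padded, a lexicographic sort could pick an older file, so the naming scheme is worth checking before relying on this pattern.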

## <a id='toc1_2_'></a>[Total Variants in CIViC](#toc0_)

In [None]:
civic_variant_ids = civicpy.get_all_variants(include_status=["accepted", "submitted"])
total_number_variants = len(civic_variant_ids)
f"Total Number of variants in CIViC: {total_number_variants}"

## <a id='toc1_3_'></a>[Total Evidence items in CIViC](#toc0_)

Rejected evidence items are excluded

In [None]:
civic_evidence_items = civicpy.get_all_evidence(
    include_status=["accepted", "submitted"]
)

In [None]:
total_ac_sub_evidence = len(civic_evidence_items)
f"Total Number of accepted and submitted evidence items in CIViC: {total_ac_sub_evidence}"

## <a id='toc1_4_'></a>[Total Molecular Profiles in CIViC](#toc0_)

In [None]:
civic_molprofs = civicpy.get_all_molecular_profiles(
    include_status=["accepted", "submitted"]
)

# <a id='toc2_'></a>[Create analysis functions / global variables](#toc0_)

In [None]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    UNABLE_TO_NORMALIZE = "Unable to Normalize"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType.__members__.values()]
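
Because the enums in this section mix in `str`, their members compare equal to plain strings, which is convenient when matching against string columns in a dataframe. A small sketch with a hypothetical stand-in enum (`DemoNormType` is not part of the notebook):

```python
from enum import Enum


class DemoNormType(str, Enum):
    """Hypothetical stand-in that mirrors the str/Enum mixin used above."""

    NORMALIZED = "Normalized"
    NOT_SUPPORTED = "Not Supported"


# Iterating the class yields the members in definition order.
demo_values = [member.value for member in DemoNormType]

# Because of the str mixin, a member compares equal to its plain string value.
is_equal = DemoNormType.NORMALIZED == "Normalized"

print(demo_values, is_equal)
```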

In [None]:
class VariantCategory(str, Enum):
    """Create enum for the kind of variants that are in CIViC."""

    EXPRESSION = "Expression Variants"
    EPIGENETIC_MODIFICATION = "Epigenetic Modification"
    FUSION = "Fusion Variants"
    SEQUENCE_VARS = "Sequence Variants"
    GENE_FUNC = "Gene Function Variants"
    REARRANGEMENTS = "Rearrangement Variants"
    COPY_NUMBER = "Copy Number Variants"
    OTHER = "Other Variants"
    GENOTYPES = "Genotype Variants"
    REGION_DEFINED_VAR = "Region Defined Variants"
    TRANSCRIPT_VAR = "Transcript Variants"  # no attempt to normalize these ones, since there is no query we could use


VARIANT_CATEGORY_VALUES = [v.value for v in VariantCategory.__members__.values()]

## <a id='toc2_1_'></a>[Summary dicts](#toc0_)

These dictionaries will be mutated and used at the end of the analysis

In [None]:
variant_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Variants per Category": [],
    "Fraction of all CIViC Variants": [],
    "Percent of all CIViC Variants": [],
    "Fraction of Accepted Variants": [],
    "Percent of Accepted Variants": [],
    "Fraction of Submitted Variants": [],
    "Percent of Submitted Variants": [],
}
variant_analysis_summary

In [None]:
evidence_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of all CIViC Evidence Items": [],
    "Percent of all CIViC Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percent of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percent of Submitted Evidence Items": [],
}
evidence_analysis_summary

## <a id='toc2_2_'></a>[Define Analysis Functions](#toc0_)

In [None]:
def variant_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do variant analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Transformed dataframe with variant ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["variant_id"])
    variant_ids = list(df["variant_id"])

    # Count
    num_variants = len(variant_ids)
    fraction_variants = f"{num_variants} / {total_number_variants}"
    print(
        f"\nNumber of {variant_norm_type.value} Variants in CIViC: {fraction_variants}"
    )

    # Percent
    percentage_variants = f"{num_variants / total_number_variants * 100:.2f}%"
    print(
        f"Percent of {variant_norm_type.value} Variants in CIViC: {percentage_variants}"
    )

    # Get accepted counts
    num_accepted_variants = df.variant_accepted.sum()
    fraction_accepted_variants = f"{num_accepted_variants} / {num_variants}"
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variants: {fraction_accepted_variants}"
    )

    # Get accepted Percent
    percentage_accepted_variants = f"{num_accepted_variants / num_variants * 100:.2f}%"
    print(
        f"Percent of accepted {variant_norm_type.value} Variants: {percentage_accepted_variants}"
    )

    # Get submitted counts
    num_submitted_variants = len(df) - num_accepted_variants
    fraction_submitted_variants = f"{num_submitted_variants} / {num_variants}"
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variants: {fraction_submitted_variants}"
    )

    # Get submitted Percent
    percentage_submitted_variants = (
        f"{num_submitted_variants / num_variants * 100:.2f}%"
    )
    print(
        f"Percent of submitted {variant_norm_type.value} Variants: {percentage_submitted_variants}"
    )

    variant_analysis_summary["Count of CIViC Variants per Category"].append(
        num_variants
    )
    variant_analysis_summary["Fraction of all CIViC Variants"].append(fraction_variants)
    variant_analysis_summary["Percent of all CIViC Variants"].append(
        percentage_variants
    )
    variant_analysis_summary["Fraction of Accepted Variants"].append(
        fraction_accepted_variants
    )
    variant_analysis_summary["Percent of Accepted Variants"].append(
        percentage_accepted_variants
    )
    variant_analysis_summary["Fraction of Submitted Variants"].append(
        fraction_submitted_variants
    )
    variant_analysis_summary["Percent of Submitted Variants"].append(
        percentage_submitted_variants
    )

    return df
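
To make the counting logic in `variant_analysis` concrete, here is a minimal sketch on toy data. The dataframe and `toy_total` are hypothetical stand-ins for a query file and `total_number_variants`.

```python
import pandas as pd

# Hypothetical toy data; variant 2 appears twice, as it could in the query files.
toy_df = pd.DataFrame(
    {"variant_id": [1, 2, 2, 3], "variant_accepted": [True, False, False, True]}
)
toy_total = 10  # hypothetical stand-in for total_number_variants

# Duplicates are dropped first, so every count is per unique variant ID.
toy_df = toy_df.drop_duplicates(subset=["variant_id"])
num_variants = len(toy_df)
num_accepted = int(toy_df.variant_accepted.sum())
num_submitted = num_variants - num_accepted

fraction = f"{num_variants} / {toy_total}"
percent = f"{num_variants / toy_total * 100:.2f}%"
percent_accepted = f"{num_accepted / num_variants * 100:.2f}%"

print(fraction, percent, percent_accepted)
```

Note that the submitted count is derived as the remainder, so it relies on every variant being either accepted or submitted.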

In [None]:
def transform_df_evidence_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence ID information
    """
    tmp_df = df.copy(deep=True)

    _variants_evidence_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        _variant_evidence_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    for e in mp.evidence_items:
                        if e.id not in _variant_evidence_ids:
                            _variant_evidence_ids.append(e.id)

        _variants_evidence_ids.append(_variant_evidence_ids or "")

    tmp_df["evidence_ids"] = _variants_evidence_ids

    # Explode and rename evidence ids field
    tmp_df = tmp_df.explode(column="evidence_ids")
    tmp_df = tmp_df.rename(columns={"evidence_ids": "evidence_id"})

    return tmp_df
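
The explode step above turns one row per variant into one row per (variant, evidence item) pair. A minimal sketch on hypothetical IDs shows the effect, including how the `""` placeholder for a variant without evidence items survives as a single row:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "variant_id": [10, 11, 12],
        # Hypothetical evidence IDs; "" marks a variant with no evidence items.
        "evidence_ids": [[100, 101], "", [102]],
    }
)

exploded = toy_df.explode(column="evidence_ids").rename(
    columns={"evidence_ids": "evidence_id"}
)
rows = list(zip(exploded["variant_id"], exploded["evidence_id"]))

print(rows)
```

pandas treats a string as a scalar during `explode`, so the placeholder row is kept unchanged, whereas an empty list would have become `NaN`.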

In [None]:
def transform_df_evidence(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include evidence status, rating, and level

    :param df: Dataframe of variants
    :return: Transformed dataframe with evidence status, rating, and level information.
    """
    variants_evidence_ids = list(df["evidence_id"])

    # Add evidence status, rating, and level information
    _variants_evidence_statuses = []
    _variants_evidence_ratings = []
    _variants_evidence_levels = []

    for eid in variants_evidence_ids:
        _variant_evidence_statuses = []
        _variant_evidence_ratings = []
        _variant_evidence_levels = []

        for evidence in civic_evidence_items:
            if eid and (int(eid) == evidence.id):
                if evidence.status not in _variant_evidence_statuses:
                    _variant_evidence_statuses.append(evidence.status)

                if evidence.rating not in _variant_evidence_ratings:
                    _variant_evidence_ratings.append(evidence.rating)

                if evidence.evidence_level not in _variant_evidence_levels:
                    _variant_evidence_levels.append(evidence.evidence_level)

        _variants_evidence_statuses.append(_variant_evidence_statuses or "")
        _variants_evidence_ratings.append(_variant_evidence_ratings or "")
        _variants_evidence_levels.append(_variant_evidence_levels or "")

    df["evidence_status"] = _variants_evidence_statuses
    df["evidence_status"] = df["evidence_status"].str.join(", ")
    df["evidence_rating"] = _variants_evidence_ratings
    df["evidence_level"] = _variants_evidence_levels

    return df
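
`transform_df_evidence` scans the full evidence list once per evidence ID. If that becomes slow, one possible alternative is to index the items by ID once and look them up directly. This is only a sketch: the `SimpleNamespace` objects are hypothetical stand-ins for civicpy evidence objects, and `lookup_status` is not part of the notebook.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for civicpy evidence objects; only the attributes
# read by transform_df_evidence are modelled.
toy_evidence_items = [
    SimpleNamespace(id=100, status="accepted", rating=4, evidence_level="B"),
    SimpleNamespace(id=101, status="submitted", rating=None, evidence_level="C"),
]

# Build the index once so each lookup avoids a scan of the whole list.
evidence_by_id = {e.id: e for e in toy_evidence_items}


def lookup_status(eid):
    """Return the status for an evidence ID, or "" when the ID is empty or unknown."""
    if eid == "":
        return ""
    evidence = evidence_by_id.get(int(eid))
    return evidence.status if evidence else ""


statuses = [lookup_status(eid) for eid in [100, "", 101, 999]]
print(statuses)
```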

In [None]:
def evidence_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do evidence analysis (counts, percentages)

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of variants that are in `df`
    :return: Dataframe with evidence ID duplicates dropped. For
        `VariantNormType.NOT_SUPPORTED` the duplicates are kept, because Not
        Supported variants have subcategories and evidence item duplicates are
        dropped within each subcategory, not across all Not Supported variant
        evidence items.
    """
    # Count
    num_variant_unique_evidence_items = len(set(df.evidence_id))
    fraction_evidence_items = (
        f"{num_variant_unique_evidence_items} / {total_ac_sub_evidence}"
    )
    print(
        f"Number of {variant_norm_type.value} Variant Evidence items in CIViC: {fraction_evidence_items}"
    )

    # Percent
    percentage_evidence_items = (
        f"{num_variant_unique_evidence_items / total_ac_sub_evidence * 100:.2f}%"
    )
    print(
        f"Percent of {variant_norm_type.value} Variant Evidence items in CIViC: {percentage_evidence_items}"
    )

    # Add evidence accepted column
    df["evidence_accepted"] = df.evidence_status.map(
        {"accepted": True, "submitted": False}
    )

    # Drop evidence ID duplicates into a new temporary df. `df` itself keeps its
    # duplicates so that they can later be dropped by evidence ID and category.
    df1 = df.drop_duplicates(subset=["evidence_id"])

    # Get accepted counts
    num_accepted_evidences_variants = df1.evidence_accepted.sum()
    fraction_accepted_evidences_variants = (
        f"{num_accepted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of accepted {variant_norm_type.value} Variant Evidence items: {fraction_accepted_evidences_variants}"
    )

    # Get accepted Percent
    percentage_accepted_evidences_variants = f"{num_accepted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percent of accepted {variant_norm_type.value} Variant Evidence items: {percentage_accepted_evidences_variants}"
    )

    # Get submitted counts
    number_submitted_evidences_variants = len(df1) - num_accepted_evidences_variants
    fraction_submitted_evidences_variants = (
        f"{number_submitted_evidences_variants} / {num_variant_unique_evidence_items}"
    )
    print(
        f"\nNumber of submitted {variant_norm_type.value} Variant Evidence items: {fraction_submitted_evidences_variants}"
    )

    # Get submitted Percent
    percentage_submitted_evidences_variants = f"{number_submitted_evidences_variants / num_variant_unique_evidence_items * 100:.2f}%"
    print(
        f"Percent of submitted {variant_norm_type.value} Variant Evidence items: {percentage_submitted_evidences_variants}"
    )

    evidence_analysis_summary["Count of CIViC Evidence Items per Category"].append(
        num_variant_unique_evidence_items
    )
    evidence_analysis_summary["Fraction of all CIViC Evidence Items"].append(
        fraction_evidence_items
    )
    evidence_analysis_summary["Percent of all CIViC Evidence Items"].append(
        percentage_evidence_items
    )
    evidence_analysis_summary["Fraction of Accepted Evidence Items"].append(
        fraction_accepted_evidences_variants
    )
    evidence_analysis_summary["Percent of Accepted Evidence Items"].append(
        percentage_accepted_evidences_variants
    )
    evidence_analysis_summary["Fraction of Submitted Evidence Items"].append(
        fraction_submitted_evidences_variants
    )
    evidence_analysis_summary["Percent of Submitted Evidence Items"].append(
        percentage_submitted_evidences_variants
    )
    if variant_norm_type == VariantNormType.NOT_SUPPORTED:
        return df
    else:
        return df1
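
The core of `evidence_analysis` is to map the status to a boolean, deduplicate by evidence ID, and then count. A minimal sketch on hypothetical data, where one evidence item is shared by two variants:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "variant_id": [1, 2, 3],
        "evidence_id": [100, 100, 101],  # evidence 100 is shared by two variants
        "evidence_status": ["accepted", "accepted", "submitted"],
    }
)

toy_df["evidence_accepted"] = toy_df.evidence_status.map(
    {"accepted": True, "submitted": False}
)
# Deduplicate so a shared evidence item is only counted once.
unique_df = toy_df.drop_duplicates(subset=["evidence_id"])

num_unique = len(set(toy_df.evidence_id))
num_accepted = int(unique_df.evidence_accepted.sum())
num_submitted = len(unique_df) - num_accepted

print(num_unique, num_accepted, num_submitted)
```

Without the deduplication step, the shared evidence item would be counted as accepted twice.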

In [None]:
def transform_df_mp_id(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile ID information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile ID information
    """
    tmp_df = df.copy(deep=True)

    variants_molprof_ids = []
    variant_ids = list(tmp_df["variant_id"])

    for v_id in variant_ids:
        variant_molprof_ids = []

        for variant in civic_variant_ids:
            if int(v_id) == variant.id:
                for mp in variant.molecular_profiles:
                    if mp.id not in variant_molprof_ids:
                        variant_molprof_ids.append(mp.id)

        variants_molprof_ids.append(variant_molprof_ids or "")

    tmp_df["molecular_profile_id"] = variants_molprof_ids
    return tmp_df

In [None]:
def transform_df_mp_score(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score information
    """
    variants_molprof_scores = []
    normalized_variant_molprof_ids = list(df["molecular_profile_id"])

    for mp_ids in normalized_variant_molprof_ids:
        variant_molprof_scores = []
        for mp_id in mp_ids:
            for molprof in civic_molprofs:
                if int(mp_id) == molprof.id:
                    variant_molprof_scores.append(molprof.molecular_profile_score)

        variants_molprof_scores.append(variant_molprof_scores or "")

    df["molecular_profile_score"] = variants_molprof_scores
    return df

In [None]:
def transform_df_mp_score_sum(df: pd.DataFrame) -> pd.DataFrame:
    """Transform dataframe to include molecular profile score sum information

    :param df: Dataframe of variants
    :return: Transformed dataframe with molecular profile score sum information
    """
    df["molecular_profile_score_sum"] = df["molecular_profile_score"].apply(
        lambda x: sum(x)
    )
    return df
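
The score sum is a plain `sum` over each variant's list of molecular profile scores. The sketch below uses hypothetical scores and also shows that the `""` placeholder, used when no molecular profile matched, sums to `0`:

```python
# Hypothetical score lists; "" is the placeholder used when no profile matched.
toy_scores = [[12.5, 7.5], [3.0], ""]

# sum() over an empty iterable (including an empty string) returns 0.
toy_sums = [sum(scores) for scores in toy_scores]

print(toy_sums)
```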

# <a id='toc3_'></a>[Analysis of Normalized Queries](#toc0_)

## <a id='toc3_1_'></a>[List of Normalized Variant IDs](#toc0_)

In [None]:
normalized_queries_df = pd.read_csv(
    "../variation_analysis/able_to_normalize_queries.csv", sep="\t"
)
normalized_queries_df.head()

## <a id='toc3_2_'></a>[Variant analysis](#toc0_)

In [None]:
normalized_queries_df = variant_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df.head()

In [None]:
variant_analysis_summary

## <a id='toc3_3_'></a>[Transform df for evidence analysis](#toc0_)

In [None]:
normalized_queries_add_evidence_df = transform_df_evidence_ids(normalized_queries_df)
normalized_queries_add_evidence_df.head()

In [None]:
normalized_queries_add_evidence_df = transform_df_evidence(
    normalized_queries_add_evidence_df
)
normalized_queries_add_evidence_df.head()

## <a id='toc3_4_'></a>[Evidence analysis](#toc0_)

In [None]:
normalized_queries_add_evidence_df = evidence_analysis(
    normalized_queries_add_evidence_df, VariantNormType.NORMALIZED
)
normalized_queries_add_evidence_df.head()

## <a id='toc3_5_'></a>[Impact](#toc0_)
Via molecular profile score

### <a id='toc3_5_1_'></a>[Import molecular profile id](#toc0_)

In [None]:
normalized_queries_add_molprof_df = transform_df_mp_id(normalized_queries_df)
normalized_queries_add_molprof_df.head()

### <a id='toc3_5_2_'></a>[Import molecular profile scores](#toc0_)

In [None]:
normalized_queries_add_molprof_df = transform_df_mp_score(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

In [None]:
normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    normalized_queries_add_molprof_df
)
normalized_queries_add_molprof_df.head()

# <a id='toc4_'></a>[Analysis of Unable to Normalize Queries](#toc0_)

## <a id='toc4_1_'></a>[List of Unable to Normalize Variant IDs](#toc0_)

In [None]:
not_normalized_queries_df = pd.read_csv(
    "../variation_analysis/unable_to_normalize_queries.csv", sep="\t"
)
not_normalized_queries_df.head()

## <a id='toc4_2_'></a>[Variant analysis](#toc0_)

In [None]:
not_normalized_queries_df = variant_analysis(
    not_normalized_queries_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_queries_df.head()

## <a id='toc4_3_'></a>[Transform df for evidence analysis](#toc0_)

In [None]:
not_normalized_quer_add_evidence_df = transform_df_evidence_ids(
    not_normalized_queries_df
)
not_normalized_quer_add_evidence_df.head()

In [None]:
not_normalized_quer_add_evidence_df = transform_df_evidence(
    not_normalized_quer_add_evidence_df
)
not_normalized_quer_add_evidence_df.head()

## <a id='toc4_4_'></a>[Evidence analysis](#toc0_)

In [None]:
not_normalized_quer_add_evidence_df = evidence_analysis(
    not_normalized_quer_add_evidence_df, VariantNormType.UNABLE_TO_NORMALIZE
)
not_normalized_quer_add_evidence_df.head()

## <a id='toc4_5_'></a>[Impact](#toc0_)
Via molecular profile score

### <a id='toc4_5_1_'></a>[Import molecular profile id](#toc0_)

In [None]:
not_normalized_queries_add_molprof_df = transform_df_mp_id(not_normalized_queries_df)
not_normalized_queries_add_molprof_df.head()

### <a id='toc4_5_2_'></a>[Import molecular profile scores](#toc0_)

In [None]:
not_normalized_queries_add_molprof_df = transform_df_mp_score(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

In [None]:
not_normalized_queries_add_molprof_df = transform_df_mp_score_sum(
    not_normalized_queries_add_molprof_df
)
not_normalized_queries_add_molprof_df.head()

# <a id='toc5_'></a>[Analysis of Not Supported Variants](#toc0_)

### <a id='toc5_1_1_'></a>[List of Not Supported Variant IDs](#toc0_)

In [None]:
not_supported_queries_df = pd.read_csv(
    "../variation_analysis/not_supported_variants.csv", sep="\t"
)
not_supported_queries_df.head()

## <a id='toc5_2_'></a>[Variant Analysis](#toc0_)

In [None]:
not_supported_queries_df = variant_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)
not_supported_queries_df.head()

In [None]:
not_supported_queries_df["variant_accepted"].value_counts()

### <a id='toc5_2_1_'></a>[Not Supported Variant Analysis by Subcategory](#toc0_)

In [None]:
not_supported_variant_analysis_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of CIViC Variants per Category": [],
    "Fraction of Not Supported Variants": [],
    "Percent of Not Supported Variants": [],
    "Fraction of all CIViC Variants": [],
    "Percent of all CIViC Variants": [],
    "Fraction of Accepted Variants": [],
    "Percent of Accepted Variants": [],
    "Fraction of Submitted Variants": [],
    "Percent of Submitted Variants": [],
}

In [None]:
not_supported_variant_categories_summary_data = dict()
total_number_unique_not_supported_variants = len(
    set(not_supported_queries_df.variant_id)
)

for category in VARIANT_CATEGORY_VALUES:  # These are not supported categories
    not_supported_variant_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_variants = len(set(category_df.variant_id))
    not_supported_variant_categories_summary_data[category][
        "number_unique_not_supported_category_variants"
    ] = number_unique_not_supported_category_variants

    # Fraction
    fraction_not_supported_category_variant_of_civic = (
        f"{number_unique_not_supported_category_variants} / {total_number_variants}"
    )
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_civic"
    ] = fraction_not_supported_category_variant_of_civic

    # Percent
    percent_not_supported_category_variant_of_civic = f"{number_unique_not_supported_category_variants / total_number_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_civic"
    ] = percent_not_supported_category_variant_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants} / {total_number_unique_not_supported_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_not_supported_category_variant_of_total_not_supported"
    ] = fraction_not_supported_category_variant_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_of_total_not_supported = f"{number_unique_not_supported_category_variants / total_number_unique_not_supported_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percent_not_supported_category_variant_of_total_not_supported"
    ] = percent_not_supported_category_variant_of_total_not_supported

    # Accepted fraction
    number_accepted_not_supported_category_variants = category_df.variant_accepted.sum()
    fraction_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_accepted_not_supported_category_variants"
    ] = fraction_accepted_not_supported_category_variants

    # Accepted percent
    percentage_accepted_not_supported_category_variants = f"{number_accepted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_accepted_not_supported_category_variants"
    ] = percentage_accepted_not_supported_category_variants

    # Submitted fraction
    number_submitted_not_supported_category_variants = (
        len(category_df) - number_accepted_not_supported_category_variants
    )
    fraction_submitted_not_supported_category_variants = f"{number_submitted_not_supported_category_variants} / {number_unique_not_supported_category_variants}"
    not_supported_variant_categories_summary_data[category][
        "fraction_submitted_not_supported_category_variants"
    ] = fraction_submitted_not_supported_category_variants

    # Submitted percent
    percentage_submitted_not_supported_category_variants = f"{number_submitted_not_supported_category_variants / number_unique_not_supported_category_variants * 100:.2f}%"
    not_supported_variant_categories_summary_data[category][
        "percentage_submitted_not_supported_category_variants"
    ] = percentage_submitted_not_supported_category_variants

    not_supported_variant_analysis_summary[
        "Count of CIViC Variants per Category"
    ].append(number_unique_not_supported_category_variants)
    not_supported_variant_analysis_summary["Fraction of all CIViC Variants"].append(
        fraction_not_supported_category_variant_of_civic
    )
    not_supported_variant_analysis_summary["Percent of all CIViC Variants"].append(
        percent_not_supported_category_variant_of_civic
    )
    not_supported_variant_analysis_summary["Fraction of Not Supported Variants"].append(
        fraction_not_supported_category_variant_of_total_not_supported
    )
    not_supported_variant_analysis_summary["Percent of Not Supported Variants"].append(
        percent_not_supported_category_variant_of_total_not_supported
    )
    not_supported_variant_analysis_summary["Fraction of Accepted Variants"].append(
        fraction_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Percent of Accepted Variants"].append(
        percentage_accepted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Fraction of Submitted Variants"].append(
        fraction_submitted_not_supported_category_variants
    )
    not_supported_variant_analysis_summary["Percent of Submitted Variants"].append(
        percentage_submitted_not_supported_category_variants
    )

## <a id='toc5_3_'></a>[Transform df for evidence analysis](#toc0_)

In [None]:
not_supported_variants_add_evidence_df = transform_df_evidence_ids(
    not_supported_queries_df
)
not_supported_variants_add_evidence_df

There are no variants without evidence items

In [None]:
not_supported_variants_add_evidence_df.loc[
    not_supported_variants_add_evidence_df["evidence_id"] == ""
]

In [None]:
not_supported_variants_add_evidence_df = transform_df_evidence(
    not_supported_variants_add_evidence_df
)
not_supported_variants_add_evidence_df

## <a id='toc5_4_'></a>[Evidence analysis](#toc0_)

In [None]:
not_supported_variants_add_evidence_df = evidence_analysis(
    not_supported_variants_add_evidence_df, VariantNormType.NOT_SUPPORTED
)
not_supported_variants_add_evidence_df

### <a id='toc5_4_1_'></a>[Not Supported Variant Evidence Analysis by Subcategory](#toc0_)

List all the variant categories that are present. The non-deduplicated dataframe has to be used here, since an evidence item can be used more than once across categories.


In [None]:
not_supported_variant_categories = (
    not_supported_variants_add_evidence_df.category.unique()
)
[v for v in not_supported_variant_categories]

Evidence items may be used across multiple variants

In [None]:
duplicate = not_supported_variants_add_evidence_df[
    not_supported_variants_add_evidence_df.duplicated("evidence_id", keep=False)
]
duplicate
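
Passing `keep=False` to `duplicated` flags every row that shares an evidence ID, not just the later repeats. A minimal sketch on hypothetical IDs contrasts this with the default behaviour:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"variant_id": [1, 2, 3, 4], "evidence_id": [100, 101, 100, 102]}
)

# keep=False marks all rows of a duplicated evidence ID, including the first.
mask = toy_df.duplicated("evidence_id", keep=False)
shared = toy_df[mask]
shared_variant_ids = shared["variant_id"].tolist()

# The default (keep="first") only marks the later repeats.
default_count = int(toy_df.duplicated("evidence_id").sum())

print(shared_variant_ids, default_count)
```

With `keep=False`, the result lists every variant that shares an evidence item, which is what makes the overlap between variants visible.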

In [None]:
not_supported_variant_evidence_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of CIViC Evidence Items per Category": [],
    "Fraction of all CIViC Evidence Items": [],
    "Percent of all CIViC Evidence Items": [],
    "Fraction of Not Supported Variant Evidence Items": [],
    "Percent of Not Supported Variant Evidence Items": [],
    "Fraction of Accepted Evidence Items": [],
    "Percent of Accepted Evidence Items": [],
    "Fraction of Submitted Evidence Items": [],
    "Percent of Submitted Evidence Items": [],
}

In [None]:
not_supported_variant_categories_evidence_summary_data = dict()
total_number_not_supported_variant_unique_evidence_items = len(
    set(not_supported_variants_add_evidence_df.evidence_id)
)

for category in VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_evidence_summary_data[category] = {}
    evidence_category_df = not_supported_variants_add_evidence_df[
        not_supported_variants_add_evidence_df.category == category
    ]
    evidence_category_df = evidence_category_df.drop_duplicates(
        subset=["evidence_id", "category"]
    )

    # Count
    number_unique_not_supported_category_evidence = len(
        set(evidence_category_df.evidence_id)
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "number_unique_not_supported_category_evidence"
    ] = number_unique_not_supported_category_evidence

    # Fraction
    fraction_not_supported_category_variant_evidence_of_civic = (
        f"{number_unique_not_supported_category_evidence} / {total_ac_sub_evidence}"
    )
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_civic"
    ] = fraction_not_supported_category_variant_evidence_of_civic

    # Percent
    percent_not_supported_category_variant_evidence_of_civic = f"{number_unique_not_supported_category_evidence / total_ac_sub_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_civic"
    ] = percent_not_supported_category_variant_evidence_of_civic

    # Not supported fraction
    fraction_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence} / {total_number_not_supported_variant_unique_evidence_items}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_not_supported_category_variant_evidence_of_total_not_supported"
    ] = fraction_not_supported_category_variant_evidence_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_variant_evidence_of_total_not_supported = f"{number_unique_not_supported_category_evidence / total_number_not_supported_variant_unique_evidence_items * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percent_not_supported_category_variant_evidence_of_total_not_supported"
    ] = percent_not_supported_category_variant_evidence_of_total_not_supported

    # Accepted fraction
    number_accepted_not_supported_category_variant_evidence = (
        evidence_category_df.evidence_accepted.sum()
    )
    fraction_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_accepted_evidence_not_supported_category_variants"
    ] = fraction_accepted_evidence_not_supported_category_variants

    # Accepted percent
    percentage_accepted_evidence_not_supported_category_variants = f"{number_accepted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_accepted_evidence_not_supported_category_variants"
    ] = percentage_accepted_evidence_not_supported_category_variants

    # Submitted fraction
    number_submitted_not_supported_category_variant_evidence = (
        number_unique_not_supported_category_evidence
        - evidence_category_df.evidence_accepted.sum()
    )
    fraction_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence} / {number_unique_not_supported_category_evidence}"
    not_supported_variant_categories_evidence_summary_data[category][
        "fraction_submitted_evidence_not_supported_category_variants"
    ] = fraction_submitted_evidence_not_supported_category_variants

    # Submitted percent
    percentage_submitted_evidence_not_supported_category_variants = f"{number_submitted_not_supported_category_variant_evidence / number_unique_not_supported_category_evidence * 100:.2f}%"
    not_supported_variant_categories_evidence_summary_data[category][
        "percentage_submitted_evidence_not_supported_category_variants"
    ] = percentage_submitted_evidence_not_supported_category_variants

    not_supported_variant_evidence_summary[
        "Count of CIViC Evidence Items per Category"
    ].append(number_unique_not_supported_category_evidence)
    not_supported_variant_evidence_summary[
        "Fraction of all CIViC Evidence Items"
    ].append(fraction_not_supported_category_variant_evidence_of_civic)
    not_supported_variant_evidence_summary[
        "Percent of all CIViC Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_civic)
    not_supported_variant_evidence_summary[
        "Fraction of Not Supported Variant Evidence Items"
    ].append(fraction_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Percent of Not Supported Variant Evidence Items"
    ].append(percent_not_supported_category_variant_evidence_of_total_not_supported)
    not_supported_variant_evidence_summary[
        "Fraction of Accepted Evidence Items"
    ].append(fraction_accepted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary["Percent of Accepted Evidence Items"].append(
        percentage_accepted_evidence_not_supported_category_variants
    )
    not_supported_variant_evidence_summary[
        "Fraction of Submitted Evidence Items"
    ].append(fraction_submitted_evidence_not_supported_category_variants)
    not_supported_variant_evidence_summary[
        "Percent of Submitted Evidence Items"
    ].append(percentage_submitted_evidence_not_supported_category_variants)

## <a id='toc5_5_'></a>[Impact](#toc0_)

### <a id='toc5_5_1_'></a>[Via Evidence Level](#toc0_)

#### <a id='toc5_5_1_1_'></a>[Analysis with only Accepted Variants](#toc0_)

accepted variant = a variant with at least one 'accepted' evidence item
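
The definition above can be sketched on a toy table. This is a minimal illustration, not the notebook's own pipeline: the IDs are made up, and only the column names `variant_id`, `evidence_id`, and `evidence_accepted` are taken from the dataframes used here.

```python
import pandas as pd

# Toy evidence table (made-up IDs): one row per (variant, evidence item).
toy_evidence_df = pd.DataFrame(
    {
        "variant_id": [1, 1, 2, 3, 3],
        "evidence_id": [10, 11, 12, 13, 14],
        "evidence_accepted": [True, False, False, True, True],
    }
)

# A variant counts as accepted if any of its evidence items is accepted.
variant_is_accepted = toy_evidence_df.groupby("variant_id")["evidence_accepted"].any()

accepted_variant_ids = variant_is_accepted[variant_is_accepted].index.tolist()
# The remaining variants have only submitted evidence items.
submitted_variant_ids = variant_is_accepted[~variant_is_accepted].index.tolist()

print(accepted_variant_ids)   # [1, 3]
print(submitted_variant_ids)  # [2]
```

Variant 1 has one accepted and one submitted item, so it still counts as accepted; variant 2 has no accepted item.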

In [None]:
ns_var_w_evid_df = not_supported_variants_add_evidence_df.copy()

There are no variants without an evidence status

In [None]:
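# Rows with a missing evidence status; an empty result confirms the statement above.
# (Combining the boolean column with its own isna() mask would always be empty.)
ns_var_w_evid_df[ns_var_w_evid_df["evidence_accepted"].isna()]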

Selecting only variants with at least one accepted evidence item (Accepted Variants)

In [None]:
ns_var_w_acc_evid_df = ns_var_w_evid_df[
    ns_var_w_evid_df["evidence_accepted"]
].copy()

In [None]:
ns_var_w_acc_evid_df = ns_var_w_acc_evid_df.drop_duplicates(
    subset=["evidence_id", "category"]
)

##### <a id='toc5_5_1_1_1_'></a>[Calculating evidence score via level](#toc0_)

Each variant receives an evidence score: the sum of the numerical values assigned to the evidence levels of the evidence items associated with that variant.
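
As a small worked example of this scoring, assume a hypothetical variant with three evidence items at levels B, C, and E. The mapping is the one defined in `calculate_impact_score`; the variant and its levels are made up for illustration.

```python
# Level-to-score mapping used by calculate_impact_score in this notebook.
EVIDENCE_LEVEL_TO_IMPACT = {"A": 10, "B": 5, "C": 3, "D": 1, "E": 0.5}

# Hypothetical variant with three evidence items at levels B, C, and E.
toy_levels = ["B", "C", "E"]

# The evidence score is the sum of the per-item level values: 5 + 3 + 0.5.
toy_evidence_score = sum(EVIDENCE_LEVEL_TO_IMPACT[level] for level in toy_levels)
print(toy_evidence_score)  # 8.5
```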

In [None]:
def calculate_impact_score(df: pd.DataFrame) -> pd.DataFrame:
    """Convert each alphabetical evidence level to a numerical score and sum the scores per variant

    :param df: Dataframe of variants with respective evidence items
    :return: Transformed dataframe with the evidence item count and summed evidence score per variant
    """
    EVIDENCE_LEVEL_TO_IMPACT = {"A": 10, "B": 5, "C": 3, "D": 1, "E": 0.5}
    # Keep only the leading letter of the evidence level (A-E)
    df["evidence_level"] = df["evidence_level"].apply(lambda x: x[0])
    df["evidence_score"] = df["evidence_level"].map(EVIDENCE_LEVEL_TO_IMPACT)

    # groupby sorts by variant_id by default, so no separate sort is needed
    df1 = df.groupby("variant_id").aggregate(
        {
            "gene_name": "first",
            "variant_name": "first",
            "category": "first",
            "evidence_id": "count",
            "evidence_score": "sum",
        }
    )
    df1 = df1.rename(
        columns={
            "evidence_id": "#_evidence_items",
            "evidence_score": "evidence_score_sum",
        }
    )

    return df1

In [None]:
not_supported_variants_w_acc_evid_df = calculate_impact_score(ns_var_w_acc_evid_df)
not_supported_variants_w_acc_evid_df

##### <a id='toc5_5_1_1_2_'></a>[Summary Table](#toc0_)

In [None]:
def summarize_impact(df: pd.DataFrame) -> pd.DataFrame:
    """Calculates the number of variants, evidence items, and impact score per category

    :param df: Dataframe of variants
    :return: Transformed dataframe with the number of variants, evidence items, and impact score per category
    """
    df1 = df.groupby("category").aggregate(
        {"gene_name": "count", "#_evidence_items": "sum", "evidence_score_sum": "sum"}
    )
    df1 = df1.rename(
        columns={"evidence_score_sum": "impact", "gene_name": "number_of_variants"}
    )
    df1["average_impact_per_variant"] = (
        df1["impact"] / df1["number_of_variants"]
    ).round(2)
    df1 = df1.sort_values(by=["impact"], ascending=False)
    
    return df1

In [None]:
not_supported_accepted_variant_categories_df = summarize_impact(
    not_supported_variants_w_acc_evid_df
)
not_supported_accepted_variant_categories_df

In [None]:
not_supported_accepted_variant_categories_df.sum().round(2)

#### <a id='toc5_5_1_2_'></a>[Analysis with Accepted and Submitted Variants](#toc0_)

submitted variant = a variant with only 'submitted' evidence items

In [None]:
ns_var_w_acc_sub_evid_df = not_supported_variants_add_evidence_df.copy()

In [None]:
ns_var_w_acc_sub_evid_df = ns_var_w_acc_sub_evid_df.drop_duplicates(
    subset=["evidence_id", "category"]
)

##### <a id='toc5_5_1_2_1_'></a>[Calculating evidence score via level](#toc0_)

In [None]:
not_supported_variants_w_acc_sub_evid_df = calculate_impact_score(
    ns_var_w_acc_sub_evid_df
)
not_supported_variants_w_acc_sub_evid_df

##### <a id='toc5_5_1_2_2_'></a>[Summary Table](#toc0_)

In [None]:
not_supported_accepted_submitted_variant_categories_df = summarize_impact(
    not_supported_variants_w_acc_sub_evid_df
)
not_supported_accepted_submitted_variant_categories_df

The difference in impact when removing the submitted variants from the analysis
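
One detail worth keeping in mind for this subtraction: pandas aligns the two Series on their index, so a category that appears in only one table yields `NaN`. The sketch below uses made-up category names and values; whether any category is actually missing from the accepted-only table depends on the data.

```python
import pandas as pd

# Toy per-category impact totals (made-up category names and values).
impact_accepted_and_submitted = pd.Series(
    {"Fusion": 30.0, "Expression": 12.5, "Copy Number": 4.0}
)
impact_accepted_only = pd.Series({"Fusion": 25.0, "Expression": 12.5})

# Plain subtraction aligns on the index; "Copy Number" has no partner, so it becomes NaN.
plain_diff = impact_accepted_and_submitted - impact_accepted_only

# Treating a missing category as zero impact keeps it in the result.
filled_diff = impact_accepted_and_submitted.sub(impact_accepted_only, fill_value=0)

print(plain_diff["Copy Number"])   # nan
print(filled_diff["Copy Number"])  # 4.0
```

If every category has at least one accepted evidence item, both forms give the same result.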

In [None]:
(
    not_supported_accepted_submitted_variant_categories_df["impact"]
    - not_supported_accepted_variant_categories_df["impact"]
).sort_values(ascending=False)

In [None]:
not_supported_accepted_submitted_variant_categories_df.to_csv(
    "output/civic_both_evidence_cat_impact_df.csv", index=True
)
not_supported_accepted_variant_categories_df.to_csv(
    "output/civic_accepted_evidence_only_impact_df.csv",
    index=True,
)

### <a id='toc5_5_2_'></a>[Via Molecular Profile Score (not used)](#toc0_)
Since MOA evidence items are scored only by level, we used the impact score via evidence level for CIViC variants to remain consistent.

#### <a id='toc5_5_2_1_'></a>[Import molecular profile id](#toc0_)

In [None]:
not_supported_variants_add_molprof_df = transform_df_mp_id(not_supported_queries_df)
not_supported_variants_add_molprof_df.head()

#### <a id='toc5_5_2_2_'></a>[Import molecular profile scores](#toc0_)

In [None]:
not_supported_variants_add_molprof_df = transform_df_mp_score(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

In [None]:
not_supported_variants_add_molprof_df = transform_df_mp_score_sum(
    not_supported_variants_add_molprof_df
)
not_supported_variants_add_molprof_df.head()

In [None]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] == 0.0)
    & (not_supported_variants_add_molprof_df["variant_accepted"] == True)
]

In [None]:
not_supported_variants_add_molprof_df["molecular_profile_score_sum"].max()

In [None]:
not_supported_variants_add_molprof_df[
    (not_supported_variants_add_molprof_df["molecular_profile_score_sum"] != 0.0)
]

#### <a id='toc5_5_2_3_'></a>[Impact by Subcategory](#toc0_)

In [None]:
not_supported_impact_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "CIVIC Total Sum Impact Score": [],
    "Average Impact Score per Variant": [],
    "Average Impact Score per Evidence Item": [],
    "Total Number Evidence Items": [
        v["number_unique_not_supported_category_evidence"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "% Accepted Evidence Items": [
        v["percentage_accepted_evidence_not_supported_category_variants"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "Total Number Variants": [
        v["number_unique_not_supported_category_variants"]
        for v in not_supported_variant_categories_summary_data.values()
    ],
}

In [None]:
not_supported_variant_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_impact_data[category] = {}
    impact_category_df = not_supported_variants_add_molprof_df[
        not_supported_variants_add_molprof_df.category == category
    ]

    total_sum_not_supported_category_impact = impact_category_df[
        "molecular_profile_score_sum"
    ].sum()
    not_supported_variant_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

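    # Use this category's own counts rather than values left over from earlier loops
    avg_impact_score_variant = (
        total_sum_not_supported_category_impact
        / not_supported_variant_categories_summary_data[category][
            "number_unique_not_supported_category_variants"
        ]
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_variant"
    ] = avg_impact_score_variant

    avg_impact_score_evidence = (
        total_sum_not_supported_category_impact
        / not_supported_variant_categories_evidence_summary_data[category][
            "number_unique_not_supported_category_evidence"
        ]
    )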
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_evidence

    not_supported_impact_summary["CIVIC Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Variant"].append(
        avg_impact_score_variant
    )
    not_supported_impact_summary["Average Impact Score per Evidence Item"].append(
        avg_impact_score_evidence
    )

    print(f"{category}: {total_sum_not_supported_category_impact}")

In [None]:
not_supported_variant_impact_df = pd.DataFrame(not_supported_impact_summary)

In [None]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

In [None]:
not_supported_variant_impact_df.to_csv(
    "output/not_supported_variant_impact_df.csv", index=False
)

# <a id='toc6_'></a>[Summary](#toc0_)

## <a id='toc6_1_'></a>[Variant Analysis](#toc0_)

### <a id='toc6_1_1_'></a>[Building Summary Table 1 & 2](#toc0_)

In [None]:
all_variant_df = pd.DataFrame(variant_analysis_summary)

In [None]:
def combine_frac_perc(df: pd.DataFrame, denominator: list[str]) -> pd.DataFrame:
    """Put fraction and percent string into one string

    :param df: Dataframe of variant statistics
    :param denominator: list of denominator labels; each label must have matching
        "Fraction of <label>" and "Percent of <label>" columns in df
    :return: Transformed dataframe with fraction and percent string as one string
    """
    for d in denominator:
        perc_key = f"Percent of {d}"
        frac_key = f"Fraction of {d}"
        df[perc_key] = (
            df[frac_key].astype(str) + "  (" + df[perc_key] + ")"
        )
        df = df.drop([frac_key], axis=1)
    return df

In [None]:
all_variant_df = combine_frac_perc(
    all_variant_df, ["all CIViC Variants", "Accepted Variants", "Submitted Variants"]
)
all_variant_df

In [None]:
all_variant_percent_status_df = all_variant_df.drop(
    [
        "Percent of all CIViC Variants",
        "Count of CIViC Variants per Category",
    ],
    axis=1,
)

for_merge_all_variant_percent_of_civic_df = all_variant_df.drop(
    [
        "Percent of Accepted Variants",
        "Percent of Submitted Variants",
    ],
    axis=1,
)

all_variant_percent_of_civic_df = for_merge_all_variant_percent_of_civic_df.drop(
    ["Count of CIViC Variants per Category"], axis=1
)

In [None]:
for_merge_all_variant_percent_of_civic_df.to_csv(
    "output/for_merge_all_variant_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percent they make up of all variants in CIViC data.

<ins>Numerator:</ins> # of CIViC variants based on normalization status
<br><ins>Denominator:</ins> # of all CIViC variants
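
To show how one table cell is assembled from a numerator and denominator, here is a minimal sketch with made-up counts. The string formats follow the f-strings and `combine_frac_perc` used in this notebook; the numbers are illustrative only.

```python
# Made-up counts for illustration; the notebook derives the real ones from the CIViC cache.
toy_numerator = 12    # e.g. variants in one normalization category
toy_denominator = 48  # e.g. all CIViC variants

fraction_str = f"{toy_numerator} / {toy_denominator}"
percent_str = f"{toy_numerator / toy_denominator * 100:.2f}%"

# combine_frac_perc joins the two strings in this shape.
combined_cell = f"{fraction_str}  ({percent_str})"
print(combined_cell)  # 12 / 48  (25.00%)
```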

In [None]:
all_variant_percent_of_civic_df = all_variant_percent_of_civic_df.set_index(
    "Variant Category"
)
all_variant_percent_of_civic_df

In [None]:
civic_summary_table_1 = all_variant_percent_of_civic_df

### <a id='toc6_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the 3 categories that CIViC variants were divided into after normalization and what percent of the variants in each category are accepted (have at least one evidence item that is accepted) or not.

<ins>Numerator:</ins> # of CIViC variants based on normalization and acceptance status
<br><ins>Denominator:</ins> # of CIViC variants based on normalization status

In [None]:
all_variant_percent_status_df = all_variant_percent_status_df.set_index(
    "Variant Category"
)
all_variant_percent_status_df

In [None]:
civic_summary_table_2 = all_variant_percent_status_df

### <a id='toc6_1_4_'></a>[Building Summary Tables 3 - 5](#toc0_)

In [None]:
not_supported_variant_df = pd.DataFrame(not_supported_variant_analysis_summary)

In [None]:
not_supported_variant_df = combine_frac_perc(
    not_supported_variant_df,
    [
        "Not Supported Variants",
        "all CIViC Variants",
        "Accepted Variants",
        "Submitted Variants",
    ],
)
not_supported_variant_df

In [None]:
for_merge_not_supported_variant_percent_of_civic_df = not_supported_variant_df.drop(
    [
        "Percent of Not Supported Variants",
        "Percent of Accepted Variants",
        "Percent of Submitted Variants",
    ],
    axis=1,
)

not_supported_variant_percent_of_civic_df = (
    for_merge_not_supported_variant_percent_of_civic_df.drop(
        ["Count of CIViC Variants per Category"], axis=1
    )
)

not_supported_variant_percent_of_not_supported_df = not_supported_variant_df[
    ["Category", "Percent of Not Supported Variants"]
].copy()

not_supported_variant_percent_evidence_df = not_supported_variant_df.drop(
    [
        "Percent of all CIViC Variants",
        "Percent of Not Supported Variants",
        "Count of CIViC Variants per Category",
    ],
    axis=1,
)

In [None]:
for_merge_not_supported_variant_percent_of_civic_df.to_csv(
    "output/for_merge_not_supported_variant_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_1_5_'></a>[Summary Table 3](#toc0_)

The table below shows the categories that the Not Supported variants were broken into and what percent of all CIViC variants they make up. These percentages will not add up to 100% because Not Supported variants are only a subset of all CIViC variants.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of all CIViC variants

In [None]:
not_supported_variant_percent_of_civic_df = (
    not_supported_variant_percent_of_civic_df.set_index("Category")
)
not_supported_variant_percent_of_civic_df

In [None]:
civic_summary_table_3 = not_supported_variant_percent_of_civic_df

### <a id='toc6_1_6_'></a>[Summary Table 4](#toc0_)

The table below shows the Not Supported variants broken up into 11 sub categories and what percent of the Not Supported variant group each sub category makes up.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC variants that are Not Supported

In [None]:
not_supported_variant_percent_of_not_supported_df = (
    not_supported_variant_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_percent_of_not_supported_df

In [None]:
civic_summary_table_4 = not_supported_variant_percent_of_not_supported_df

### <a id='toc6_1_7_'></a>[Summary Table 5](#toc0_)

The table below shows, for each of the 11 Not Supported variant sub categories, what percent of its variants are accepted or submitted.

<ins>Numerator:</ins> # of CIViC variants that are Not Supported in a given Subcategory based on acceptance status
<br><ins>Denominator:</ins> # of CIViC variants that are Not Supported in a given Subcategory

In [None]:
not_supported_variant_percent_evidence_df = (
    not_supported_variant_percent_evidence_df.set_index("Category")
)
not_supported_variant_percent_evidence_df

In [None]:
civic_summary_table_5 = not_supported_variant_percent_evidence_df

## <a id='toc6_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc6_2_1_'></a>[Building Summary Tables 6 & 7](#toc0_)

In [None]:
all_variant_evidence_df = pd.DataFrame(evidence_analysis_summary)

In [None]:
all_variant_evidence_df = combine_frac_perc(
    all_variant_evidence_df,
    ["all CIViC Evidence Items", "Accepted Evidence Items", "Submitted Evidence Items"],
)
all_variant_evidence_df

In [None]:
for_merge_all_variant_evidence_percent_of_civic_df = all_variant_evidence_df.drop(
    ["Percent of Accepted Evidence Items", "Percent of Submitted Evidence Items"],
    axis=1,
)

all_variant_evidence_percent_of_civic_df = (
    for_merge_all_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

all_variant_evidence_percent_evidence_df = all_variant_evidence_df.drop(
    [
        "Percent of all CIViC Evidence Items",
        "Count of CIViC Evidence Items per Category",
    ],
    axis=1,
)

In [None]:
for_merge_all_variant_evidence_percent_of_civic_df.to_csv(
    "output/for_merge_all_variant_evidence_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_2_2_'></a>[Summary Table 6](#toc0_)

The table below shows what percent of all evidence items in CIViC are associated with Normalized, Unable to Normalize, and Not Supported variants. This will not add up to 100% because evidence items may be used across multiple variants.

<ins>Numerator:</ins> # of CIViC evidence items based on normalization status of associated variant
<br><ins>Denominator:</ins> # of all CIViC evidence items
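
The reason the percentages can exceed 100% in total can be seen on a toy table. This is a sketch under assumptions: the IDs are made up and the `variant_category` column name is hypothetical.

```python
import pandas as pd

# Toy data (made-up IDs): evidence item 101 is linked to variants in two categories.
toy_links_df = pd.DataFrame(
    {
        "evidence_id": [100, 101, 101, 102],
        "variant_category": [
            "Normalized",
            "Normalized",
            "Not Supported",
            "Unable to Normalize",
        ],
    }
)

total_unique_evidence = toy_links_df["evidence_id"].nunique()  # 3

# Unique evidence items per category, as a percent of all unique evidence items.
percent_per_category = (
    toy_links_df.groupby("variant_category")["evidence_id"].nunique()
    / total_unique_evidence
    * 100
)

print(percent_per_category.round(2).to_dict())
print(round(percent_per_category.sum(), 2))  # 133.33
```

Evidence item 101 is counted once in each of two categories, so the column total is 4/3 of the denominator.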

In [None]:
all_variant_evidence_percent_of_civic_df = (
    all_variant_evidence_percent_of_civic_df.set_index("Variant Category")
)
all_variant_evidence_percent_of_civic_df

In [None]:
civic_summary_table_6 = all_variant_evidence_percent_of_civic_df

### <a id='toc6_2_3_'></a>[Summary Table 7](#toc0_)

The table below shows the percent of accepted and submitted evidence items per category of variants.

<ins>Numerator:</ins> # of CIViC evidence items based on evidence acceptance status and normalization status of associated variant
<br><ins>Denominator:</ins> # of all CIViC evidence items based on normalization status of associated variant

In [None]:
all_variant_evidence_percent_evidence_df = (
    all_variant_evidence_percent_evidence_df.set_index("Variant Category")
)
all_variant_evidence_percent_evidence_df

In [None]:
civic_summary_table_7 = all_variant_evidence_percent_evidence_df

### <a id='toc6_2_4_'></a>[Building Summary Tables 8 - 10](#toc0_)

In [None]:
not_supported_variant_evidence_df = pd.DataFrame(not_supported_variant_evidence_summary)

In [None]:
not_supported_variant_evidence_df = combine_frac_perc(
    not_supported_variant_evidence_df,
    [
        "all CIViC Evidence Items",
        "Not Supported Variant Evidence Items",
        "Accepted Evidence Items",
        "Submitted Evidence Items",
    ],
)
not_supported_variant_evidence_df

In [None]:
for_merge_not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of Accepted Evidence Items",
            "Percent of Submitted Evidence Items",
        ],
        axis=1,
    )
)

not_supported_variant_evidence_percent_of_civic_df = (
    for_merge_not_supported_variant_evidence_percent_of_civic_df.drop(
        ["Count of CIViC Evidence Items per Category"], axis=1
    )
)

not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_df[
        ["Category", "Percent of Not Supported Variant Evidence Items"]
    ].copy()
)


not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_df.drop(
        [
            "Percent of Not Supported Variant Evidence Items",
            "Percent of all CIViC Evidence Items",
            "Count of CIViC Evidence Items per Category",
        ],
        axis=1,
    )
)

In [None]:
for_merge_not_supported_variant_evidence_percent_of_civic_df.to_csv(
    "output/for_merge_not_supported_variant_evidence_percent_of_civic_df.csv",
    index=False,
)

### <a id='toc6_2_5_'></a>[Summary Table 8](#toc0_)

The table below shows the percent of all CIViC evidence items that are associated with a Not Supported variant sub category. This will not add up to 100% since the evidence items can be associated with multiple variants.

<ins>Numerator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of all CIViC evidence items

In [None]:
not_supported_variant_evidence_percent_of_civic_df = (
    not_supported_variant_evidence_percent_of_civic_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_civic_df

In [None]:
civic_summary_table_8 = not_supported_variant_evidence_percent_of_civic_df

### <a id='toc6_2_6_'></a>[Summary Table 9](#toc0_)

The table below shows what percent of the evidence items associated with Not Supported variants fall into each variant sub category.

<ins>Numerator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC evidence items that are associated with Not Supported variants

In [None]:
not_supported_variant_evidence_percent_of_not_supported_df = (
    not_supported_variant_evidence_percent_of_not_supported_df.set_index("Category")
)
not_supported_variant_evidence_percent_of_not_supported_df

In [None]:
civic_summary_table_9 = not_supported_variant_evidence_percent_of_not_supported_df

### <a id='toc6_2_7_'></a>[Summary Table 10](#toc0_)

The table below shows the percent of evidence items associated with Not Supported variant sub categories that are accepted or submitted.

<ins>Numerator:</ins> # of CIViC evidence items based on evidence acceptance status that are associated with Not Supported variants in a given Subcategory
<br><ins>Denominator:</ins> # of CIViC evidence items that are associated with Not Supported variants in a given Subcategory
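
The per-category accepted and submitted percentages can be sketched with a `groupby`. The rows and category names below are made up; the `category` and `evidence_accepted` column names and the "submitted = total - accepted" rule follow the code earlier in this notebook.

```python
import pandas as pd

# Toy evidence rows (made-up categories and IDs), already de-duplicated per category.
toy_category_evidence_df = pd.DataFrame(
    {
        "category": ["Fusion", "Fusion", "Fusion", "Expression", "Expression"],
        "evidence_id": [1, 2, 3, 4, 5],
        "evidence_accepted": [True, True, False, True, False],
    }
)

# Count accepted items and all items per category.
per_category = toy_category_evidence_df.groupby("category")["evidence_accepted"].agg(
    accepted="sum", total="count"
)
# Every evidence item that is not accepted is treated as submitted.
per_category["submitted"] = per_category["total"] - per_category["accepted"]
per_category["percent_accepted"] = (
    per_category["accepted"] / per_category["total"] * 100
).round(2)
per_category["percent_submitted"] = (
    per_category["submitted"] / per_category["total"] * 100
).round(2)

print(per_category)
```

Within each category the two percentages sum to 100%, because both use that category's evidence count as the denominator.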

In [None]:
not_supported_variant_evidence_percent_evidence_df = (
    not_supported_variant_evidence_percent_evidence_df.set_index("Category")
)
not_supported_variant_evidence_percent_evidence_df

In [None]:
civic_summary_table_10 = not_supported_variant_evidence_percent_evidence_df

## <a id='toc6_3_'></a>[Impact](#toc0_)

The impact summary below includes both accepted and submitted variants.

In [None]:
not_supported_variants_w_acc_sub_evid_df

In [None]:
not_supported_elevel_impact_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "CIVIC Total Sum Impact Score": [],
    "Average Impact Score per Variant": [],
    "Average Impact Score per Evidence Item": [],
    "Total Number Evidence Items": [
        v["number_unique_not_supported_category_evidence"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "% Accepted Evidence Items": [
        v["percentage_accepted_evidence_not_supported_category_variants"]
        for v in not_supported_variant_categories_evidence_summary_data.values()
    ],
    "Total Number Variants": [
        v["number_unique_not_supported_category_variants"]
        for v in not_supported_variant_categories_summary_data.values()
    ],
}

In [None]:
not_supported_variant_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    not_supported_variant_categories_impact_data[category] = {}
    impact_category_df = not_supported_variants_w_acc_sub_evid_df[
        not_supported_variants_w_acc_sub_evid_df.category == category
    ]

    total_sum_not_supported_category_impact = impact_category_df[
        "evidence_score_sum"
    ].sum()
    not_supported_variant_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

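    # Use this category's own counts rather than values left over from earlier loops
    avg_impact_score_variant = (
        total_sum_not_supported_category_impact
        / not_supported_variant_categories_summary_data[category][
            "number_unique_not_supported_category_variants"
        ]
    )
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_variant"
    ] = avg_impact_score_variant

    avg_impact_score_evidence = (
        total_sum_not_supported_category_impact
        / not_supported_variant_categories_evidence_summary_data[category][
            "number_unique_not_supported_category_evidence"
        ]
    )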
    not_supported_variant_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_evidence

    not_supported_elevel_impact_summary["CIVIC Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_elevel_impact_summary["Average Impact Score per Variant"].append(
        avg_impact_score_variant
    )
    not_supported_elevel_impact_summary[
        "Average Impact Score per Evidence Item"
    ].append(avg_impact_score_evidence)

    print(f"{category}: {total_sum_not_supported_category_impact}")

In [None]:
not_supported_variant_impact_df = pd.DataFrame(not_supported_elevel_impact_summary)

In [None]:
not_supported_variant_impact_df = not_supported_variant_impact_df.round(2)
not_supported_variant_impact_df

The bar graph below shows the total impact score for each Not Supported variant sub category. Additionally, the colors illustrate the number of evidence items associated with each sub category.

In [None]:
fig = px.bar(
    not_supported_variant_impact_df,
    x="Category",
    y="CIVIC Total Sum Impact Score",
    hover_data=[
        "Total Number Evidence Items",
        "% Accepted Evidence Items",
    ],
    color="Total Number Evidence Items",
    labels={"CIVIC Total Sum Impact Score": "CIVIC Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig.update_traces(width=1)
fig.show()

In [None]:
fig.write_html(
    "output/civic_ns_categories_impact_redgreen.html"
)

The scatter plot below shows the relationship between the Not Supported variant sub category impact score and the number of evidence items associated with variants in each sub category. Additionally, the size of each data point represents the number of variants in that sub category.

In [None]:
fig2 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Evidence Items",
    y="CIVIC Total Sum Impact Score",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    color="Category",
    hover_data=["% Accepted Evidence Items"],
)
fig2.show()

In [None]:
fig2.write_html(
    "output/civic_ns_categories_impact_scatterplot.html"
)

In [None]:
fig3 = px.scatter(
    data_frame=not_supported_variant_impact_df,
    x="Total Number Variants",
    y="Average Impact Score per Evidence Item",
    size="Total Number Variants",
    size_max=40,
    text="Total Number Variants",
    color="Category",
    hover_data=["% Accepted Evidence Items", "Average Impact Score per Variant"],
)
fig3.show()