# <a id='toc1_'></a>[Molecular Oncology Almanac Assertion Analysis](#toc0_)

MOA evidence items are referred to as assertions and MOA variants are referred to as features in this analysis. 

The moa_features_analysis notebook must be run before this notebook because it updates the cache.

**Table of contents**<a id='toc0_'></a>    
- [Molecular Oncology Almanac Assertion Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
  - [Create analysis functions / global variables](#toc1_2_)    
  - [All Features (Variants) Analysis](#toc1_3_)    
    - [Creating a table with feature (variant) and assertion (evidence) information](#toc1_3_1_)    
    - [Converting feature (variant) types to normalized categories](#toc1_3_2_)    
    - [Adding a numerical impact score based on the predictive implication](#toc1_3_3_)    
    - [Impact Score Analysis](#toc1_3_4_)    
    - [Features (Variants) Analysis](#toc1_3_5_)    
    - [Assertions (Evidence Items) Analysis](#toc1_3_6_)    
    - [Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc1_3_7_)    
  - [Create functions / global variables used in analysis](#toc1_4_)    
  - [Normalized Analysis](#toc1_5_)    
  - [Not Supported Analysis](#toc1_6_)    
    - [Feature (Variant) Analysis](#toc1_6_1_)    
    - [Not Supported Feature (Variant) Analysis by Subcategory](#toc1_6_2_)    
    - [Not Supported Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc1_6_3_)    
    - [Impact Score Analysis by Subcategory](#toc1_6_4_)    
- [MOA Summary](#toc2_)    
  - [Feature (Variant) Analysis](#toc2_1_)    
    - [Building Summary Tables 1 - 3](#toc2_1_1_)    
    - [Summary Table 1](#toc2_1_2_)    
    - [Summary Table 2](#toc2_1_3_)    
    - [Summary Table 3](#toc2_1_4_)    
  - [Evidence Analysis](#toc2_2_)    
    - [Building Summary Table 4](#toc2_2_1_)    
    - [Summary Table 4](#toc2_2_2_)    
    - [Building Summary Tables 5 & 6](#toc2_2_3_)    
    - [Summary Table 5](#toc2_2_4_)    
    - [Summary Table 6](#toc2_2_5_)    
  - [Impact](#toc2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [None]:
from enum import Enum
from typing import Dict
import json
from pathlib import Path
import pandas as pd
import plotly.express as px
from ga4gh.core import sha512t24u
import zipfile

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [None]:
path = Path("moa_assertion_analysis_output")
path.mkdir(exist_ok=True)

## <a id='toc1_2_'></a>[Create analysis functions / global variables](#toc0_)

In [None]:
# Use latest zip that has been pushed to the repo
latest_zip_path = sorted(Path().glob("../feature_analysis/moa_features_*.zip"))[-1]
json_fn = latest_zip_path.name[:-4]

with zipfile.ZipFile(latest_zip_path, "r") as zip_ref:
    zip_ref.extractall()

with open(json_fn, "r") as f:
    variants_resp = json.load(f)

f"Using {json_fn} for MOA features"
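
Picking the "latest" archive with `sorted(...)[-1]` relies on lexicographic ordering of file names. A minimal sketch of that assumption follows; the file names are hypothetical (the real archive names are not shown here) and are assumed to embed a zero-padded, fixed-width date such as `YYYYMMDD`, which is what makes lexicographic order match chronological order.

```python
from pathlib import Path

# Hypothetical archive names; assumed to carry a zero-padded YYYYMMDD date
names = [
    "moa_features_20240115.json.zip",
    "moa_features_20231201.json.zip",
    "moa_features_20240302.json.zip",
]

# Lexicographic sort: the last element is the most recent date
latest = sorted(Path(n) for n in names)[-1]

# Strip the ".zip" suffix, as the cell above does with name[:-4]
json_name = latest.name[:-4]

print(latest.name)
print(json_name == latest.stem)  # Path.stem drops the final suffix as well
```

If the names used an unpadded date or a counter, this ordering would no longer be reliable and an explicit sort key would be needed.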

In [None]:
def get_feature_digest(feature: Dict) -> str:
    """Get digest for feature

    :param feature: MOA feature
    :return: Digest
    """
    attrs = json.dumps(
        feature["attributes"][0], sort_keys=True, separators=(",", ":"), indent=None
    ).encode("utf-8")
    return sha512t24u(attrs)
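
The digest is computed over a canonical JSON serialization (sorted keys, compact separators), so two features with the same attributes get the same digest regardless of key order. The sketch below illustrates this with the standard library only. It assumes `sha512t24u` follows the GA4GH truncated-digest convention (SHA-512, first 24 bytes, base64url); `sha512t24u_sketch`, `digest_attributes`, and the attribute dictionaries are hypothetical names and data for illustration.

```python
import base64
import hashlib
import json


def sha512t24u_sketch(blob: bytes) -> str:
    """Assumed equivalent of sha512t24u: SHA-512, first 24 bytes, base64url."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def digest_attributes(attributes: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash."""
    blob = json.dumps(attributes, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u_sketch(blob)


# Hypothetical attribute dictionaries: a and b differ only in key order
a = {"feature_type": "somatic_variant", "gene": "BRAF", "protein_change": "p.V600E"}
b = {"protein_change": "p.V600E", "gene": "BRAF", "feature_type": "somatic_variant"}
c = {"feature_type": "somatic_variant", "gene": "BRAF", "protein_change": "p.V600K"}

print(digest_attributes(a) == digest_attributes(b))  # same content
print(digest_attributes(a) == digest_attributes(c))  # different content
```

This is why the analysis counts unique features by digest rather than by `feature_id`: distinct IDs with identical attributes collapse to one digest.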

In [None]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType.__members__.values()]

In [None]:
class VariantCategory(str, Enum):
    """Create enum for the kind of variants that are in MOA."""

    EXPRESSION = "Expression Variants"
    EPIGENETIC_MODIFICATION = "Epigenetic Modification"
    FUSION = "Fusion Variants"
    SEQUENCE_VARS = "Sequence Variants"
    GENE_FUNC = "Gene Function Variants"
    REARRANGEMENTS = "Rearrangement Variants"
    COPY_NUMBER = "Copy Number Variants"
    OTHER = "Other Variants"
    GENOTYPES = "Genotype Variants"
    REGION_DEFINED_VAR = "Region Defined Variants"
    TRANSCRIPT_VAR = "Transcript Variants"  # no attempt to normalize these ones, since there is no query we could use


VARIANT_CATEGORY_VALUES = [v.value for v in VariantCategory.__members__.values()]

In [None]:
class ItemType(str, Enum):
    """Create enum for the kind of items that will be analyzed."""

    FEATURE = "feature"
    ASSERTION = "assertion"

## <a id='toc1_3_'></a>[All Features (Variants) Analysis](#toc0_)

### <a id='toc1_3_1_'></a>[Creating a table with feature (variant) and assertion (evidence) information](#toc0_)

In [None]:
# Create dictionary for MOA feature digest -> feature type

features = {}

for feature in variants_resp:
    feature_id = feature["feature_id"]
    digest = get_feature_digest(feature)
    features[digest] = feature["feature_type"]

count_unique_feature_ids = len(features.keys())
print(count_unique_feature_ids)

In [None]:
# Use latest zip that has been pushed to the repo
latest_zip_path = sorted(Path().glob("moa_assertions_*.zip"))[-1]
json_fna = latest_zip_path.name[:-4]

with zipfile.ZipFile(latest_zip_path, "r") as zip_ref:
    zip_ref.extractall()

with open(json_fna, "r") as f:
    assertions_resp = json.load(f)

f"Using {json_fna} for MOA assertions"

In [None]:
# Create DF for assertions and their associated feature + predictive implication

transformed = []

# Mapping from feature ID to feature digest
feature_id_to_digest = {}

for assertion in assertions_resp:
    assertion_id = assertion["assertion_id"]
    predictive_implication = assertion["predictive_implication"]

    if len(assertion["features"]) != 1:
        print(f"assertion id ({assertion_id}) does not have 1 feature")
        continue

    feature = assertion["features"][0]
    feature_id = feature["feature_id"]
    feature_digest = get_feature_digest(feature)

    feature_id_to_digest[feature_id] = feature_digest

    transformed.append(
        {
            "assertion_id": assertion_id,
            "feature_id": feature_id,
            "feature_type": features[feature_digest],
            "predictive_implication": predictive_implication,
            "feature_digest": feature_digest,
        }
    )
moa_df = pd.DataFrame(transformed)
print(len(moa_df["feature_digest"].unique()))
moa_df

In [None]:
moa_df.to_csv("moa_assertion_analysis_output/moa_df.csv")

In [None]:
unique_features_df = moa_df.sort_values("feature_id").drop_duplicates(
    subset=["feature_digest"]
)
len_unique_feature_ids = len(list(unique_features_df.feature_id))
len_unique_feature_ids

In [None]:
total_len_features = len(moa_df.feature_digest.unique())
f"Total number of unique features (variants): {total_len_features}"

In [None]:
assert total_len_features == len_unique_feature_ids

In [None]:
total_len_assertions = len(moa_df.assertion_id.unique())
f"Total number of unique assertions (evidence items): {total_len_assertions}"

### <a id='toc1_3_2_'></a>[Converting feature (variant) types to normalized categories](#toc0_)

In [None]:
list(moa_df.feature_type.unique())

In [None]:
moa_df["category"] = moa_df["feature_type"].copy()

moa_df["category"] = moa_df["category"].replace(
    "rearrangement", VariantCategory.REARRANGEMENTS.value
)
moa_df["category"] = moa_df["category"].replace(
    "aneuploidy", VariantCategory.COPY_NUMBER.value
)
moa_df["category"] = moa_df["category"].replace(
    "knockdown", VariantCategory.EXPRESSION.value
)
moa_df["category"] = moa_df["category"].replace(
    "somatic_variant", VariantCategory.SEQUENCE_VARS.value
)
moa_df["category"] = moa_df["category"].replace(
    "germline_variant", VariantCategory.SEQUENCE_VARS.value
)
moa_df["category"] = moa_df["category"].replace(
    "microsatellite_stability", VariantCategory.REARRANGEMENTS.value
)
moa_df["category"] = moa_df["category"].replace(
    "mutational_burden", VariantCategory.OTHER.value
)
moa_df["category"] = moa_df["category"].replace(
    "mutational_signature", VariantCategory.OTHER.value
)
moa_df["category"] = moa_df["category"].replace(
    "copy_number", VariantCategory.COPY_NUMBER.value
)

moa_df.head()
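
The chain of `replace` calls above amounts to a single lookup from feature type to category. As a sketch, the same mapping can be expressed as one dictionary passed to `Series.replace`; `FEATURE_TYPE_TO_CATEGORY` and `demo_df` are hypothetical names, and the category strings are copied from `VariantCategory`.

```python
import pandas as pd

# Hypothetical consolidated mapping mirroring the replace() calls above
FEATURE_TYPE_TO_CATEGORY = {
    "rearrangement": "Rearrangement Variants",
    "aneuploidy": "Copy Number Variants",
    "knockdown": "Expression Variants",
    "somatic_variant": "Sequence Variants",
    "germline_variant": "Sequence Variants",
    "microsatellite_stability": "Rearrangement Variants",
    "mutational_burden": "Other Variants",
    "mutational_signature": "Other Variants",
    "copy_number": "Copy Number Variants",
}

demo_df = pd.DataFrame({"feature_type": ["somatic_variant", "copy_number", "knockdown"]})
demo_df["category"] = demo_df["feature_type"].replace(FEATURE_TYPE_TO_CATEGORY)

# Feature types missing from the mapping pass through unchanged,
# so any leftover value here signals an unmapped type
unmapped = set(demo_df["category"]) - set(FEATURE_TYPE_TO_CATEGORY.values())
print(demo_df)
print(unmapped)
```

Checking for leftover values is a cheap way to notice a new feature type appearing in a later MOA release.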

In [None]:
list(moa_df.category.unique())

### <a id='toc1_3_3_'></a>[Adding a numerical impact score based on the predictive implication](#toc0_)
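Each assertion receives a numerical impact score derived from its predictive implication, following the structure of MOA scoring: FDA-Approved and Guideline score 10, Clinical evidence and Clinical trial score 5, Preclinical scores 1, and Inferential scores 0.5.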

In [None]:
predictive_implication_categories = moa_df.predictive_implication.unique()
list(predictive_implication_categories)

In [None]:
moa_df["impact_score"] = moa_df["predictive_implication"].copy()

moa_df.loc[moa_df["impact_score"] == "FDA-Approved", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Guideline", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Clinical evidence", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Clinical trial", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Preclinical", "impact_score"] = 1
moa_df.loc[moa_df["impact_score"] == "Inferential", "impact_score"] = 0.5

moa_df.head()
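
As a small worked example of the scoring, the mapping can be written as a plain dictionary and summed over a handful of assertions. `IMPACT_SCORES` and the list of implications are hypothetical names and data; the score values are the ones assigned in the cell above.

```python
# Score per predictive implication, as assigned above
IMPACT_SCORES = {
    "FDA-Approved": 10,
    "Guideline": 10,
    "Clinical evidence": 5,
    "Clinical trial": 5,
    "Preclinical": 1,
    "Inferential": 0.5,
}

# Hypothetical predictive implications for four assertions
implications = ["FDA-Approved", "Clinical trial", "Preclinical", "Inferential"]

# A dictionary lookup raises KeyError on an unexpected label
# instead of silently leaving a string behind
scores = [IMPACT_SCORES[p] for p in implications]
total = sum(scores)
print(scores, total)
```

The category totals computed in the next section are sums of exactly this kind, taken over all assertions in a category.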

### <a id='toc1_3_4_'></a>[Impact Score Analysis](#toc0_)

In [None]:
feature_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    feature_categories_impact_data[category] = {}
    impact_category_df = moa_df[moa_df.category == category]

    total_sum_category_impact = impact_category_df["impact_score"].sum()
    feature_categories_impact_data[category][
        "total_sum_category_impact"
    ] = total_sum_category_impact
    print(f"{category}: {total_sum_category_impact}")

### <a id='toc1_3_5_'></a>[Features (Variants) Analysis](#toc0_)

In [None]:
def calc_perc_item_analysis(item_type: ItemType, total_len: int) -> dict:
    """Calculates the percent of either the features or the assertions in MOA

    :param item_type: The type of item
    :param total_len: The total number of items defined by 'item_type'
    :return: Dictionary with a string indicating the percent of the item 
    """
    moa_item_data = dict()

    for category in VARIANT_CATEGORY_VALUES:
        moa_item_data[category] = {}
        item_type_df = moa_df[moa_df.category == category]
        if item_type == ItemType.FEATURE:
            number_unique_category_items = len(set(item_type_df.feature_digest))
        else:
            number_unique_category_items = len(set(item_type_df.assertion_id))

        if item_type == ItemType.FEATURE:
            singular = ItemType.FEATURE.value
            plural = "features"
        else:
            singular = ItemType.ASSERTION.value
            plural = "assertions"

        moa_item_data[category][
            f"number_unique_category_{plural}"
        ] = number_unique_category_items

        fraction_category_item = f"{number_unique_category_items} / {total_len}"
        moa_item_data[category][
            f"fraction_category_{singular}"
        ] = fraction_category_item

        percent_category_item = (
            "{:.2f}".format(number_unique_category_items / total_len * 100) + "%"
        )

        moa_item_data[category][f"percent_category_{singular}"] = percent_category_item

    return moa_item_data
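
The fraction and percent strings built in `calc_perc_item_analysis` follow one pattern that recurs throughout this notebook. A minimal sketch of that pattern, with a guard for an empty denominator, is shown below; `frac_and_percent` is a hypothetical helper and the counts are made up.

```python
from typing import Tuple


def frac_and_percent(count: int, total: int) -> Tuple[str, str]:
    """Return ("count / total", "xx.xx%") strings; "0.00%" when total is 0."""
    fraction = f"{count} / {total}"
    percent = f"{count / total * 100:.2f}%" if total else "0.00%"
    return fraction, percent


# Hypothetical counts: 3 of 8 unique features fall in one category
fraction, percent = frac_and_percent(3, 8)
print(fraction, percent)
```

Both values are strings meant for display, so any further arithmetic should use the underlying counts rather than these formatted values.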

In [None]:
moa_feature_data = calc_perc_item_analysis(ItemType.FEATURE, total_len_features)

### <a id='toc1_3_6_'></a>[Assertions (Evidence Items) Analysis](#toc0_)

In [None]:
moa_assertion_data = calc_perc_item_analysis(ItemType.ASSERTION, total_len_assertions)

### <a id='toc1_3_7_'></a>[Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc0_)

In [None]:
feature_category_impact_score = [
    v["total_sum_category_impact"] for v in feature_categories_impact_data.values()
]
feature_category_number = [
    v["number_unique_category_features"] for v in moa_feature_data.values()
]
feature_category_fraction = [
    v["fraction_category_feature"] for v in moa_feature_data.values()
]
feature_category_percent = [
    v["percent_category_feature"] for v in moa_feature_data.values()
]
feature_category_assertion_number = [
    v["number_unique_category_assertions"] for v in moa_assertion_data.values()
]
feature_category_assertion_fraction = [
    v["fraction_category_assertion"] for v in moa_assertion_data.values()
]
feature_category_assertion_percent = [
    v["percent_category_assertion"] for v in moa_assertion_data.values()
]

In [None]:
feature_category_dict = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Number of Features": feature_category_number,
    "Fraction of Features": feature_category_fraction,
    "Percent of Features": feature_category_percent,
    "Number of Assertions": feature_category_assertion_number,
    "Fraction of Assertions": feature_category_assertion_fraction,
    "Percent of Assertions": feature_category_assertion_percent,
    "Impact Score": feature_category_impact_score,
}

In [None]:
moa_feature_df = pd.DataFrame(feature_category_dict)
moa_feature_df
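
The cells that follow call `combine_frac_perc`, whose definition does not appear in this part of the notebook. Because later cells still drop the `Fraction of ...` column after calling it, the helper presumably folds the fraction into the percent column and leaves the fraction column in place. The sketch below is a guess at that behavior under a different, hypothetical name (`combine_frac_perc_sketch`); the actual helper may format the string differently or modify the frame in place.

```python
import pandas as pd


def combine_frac_perc_sketch(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Hypothetical helper: fold 'Fraction of X' into 'Percent of X'.

    Returns a copy whose 'Percent of {col_name}' column reads like
    '37.50% (3 / 8)'. The 'Fraction of {col_name}' column is kept.
    """
    out = df.copy()
    out[f"Percent of {col_name}"] = (
        out[f"Percent of {col_name}"] + " (" + out[f"Fraction of {col_name}"] + ")"
    )
    return out


# Made-up single-row frame using the column naming from this notebook
demo = pd.DataFrame(
    {
        "Category": ["Sequence Variants"],
        "Fraction of Features": ["3 / 8"],
        "Percent of Features": ["37.50%"],
    }
)
combined = combine_frac_perc_sketch(demo, "Features")
print(combined)
```

Returning a copy keeps the input frame unchanged, which matters when the same frame is passed to the helper more than once.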

In [None]:
moa_feature_df = combine_frac_perc(moa_feature_df, "Features")

In [None]:
moa_feature_df = combine_frac_perc(moa_feature_df, "Assertions")

In [None]:
moa_feature_df_abbreviated = moa_feature_df[
    [
        "Category",
        "Percent of Features",
        "Percent of Assertions",
        "Impact Score",
    ]
].copy()

In [None]:
moa_feature_df_abbreviated = moa_feature_df_abbreviated.set_index("Category")
moa_feature_df_abbreviated

In [None]:
fig = px.scatter(
    data_frame=moa_feature_df,
    x="Number of Assertions",
    y="Impact Score",
    size="Number of Features",
    size_max=40,
    text="Number of Features",
    color="Category",
)
fig.show()

In [None]:
fig.write_html(
    "moa_assertion_analysis_output/moa_feature_categories_impact_scatterplot.html"
)

## <a id='toc1_4_'></a>[Create functions / global variables used in analysis](#toc0_)

In [None]:
feature_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percent of all MOA Features": [],
}
feature_analysis_summary

In [None]:
def feature_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do feature analysis (counts, percents). Updates `feature_analysis_summary`

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of features that are in `df`
    :return: Transformed dataframe with feature ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["feature_id"])
    feature_ids = list(df["feature_id"])

    # Count
    num_features = len(feature_ids)
    fraction_features = f"{num_features} / {total_len_features}"
    print(f"\nNumber of {variant_norm_type.value} Features in MOA: {fraction_features}")

    # Percent
    percent_features = f"{num_features / total_len_features * 100:.2f}%"
    print(f"Percent of {variant_norm_type.value} Features in MOA: {percent_features}")

    feature_analysis_summary["Count of MOA Features per Category"].append(num_features)
    feature_analysis_summary["Fraction of all MOA Features"].append(fraction_features)
    feature_analysis_summary["Percent of all MOA Features"].append(percent_features)

    return df

In [None]:
assertion_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of all MOA Assertions": [],
    "Percent of all MOA Assertions": [],
}
assertion_analysis_summary

In [None]:
def assertion_analysis(
    all_df: pd.DataFrame,
    variant_norm_df: pd.DataFrame,
    variant_norm_type: VariantNormType,
) -> None:
    """Do evidence analysis (counts, percents). Updates `assertion_analysis_summary`

    :param all_df: Dataframe for all assertions and features
    :param variant_norm_df: Dataframe for features given certain `variant_norm_type`
    :param variant_norm_type: The kind of variants that are in `variant_norm_df`
    """
    # Need to do this bc of duplicate features
    _feature_ids = set(variant_norm_df.feature_digest)
    tmp_df = all_df[all_df["feature_digest"].isin(_feature_ids)]

    # Count
    num_assertions = len(tmp_df.assertion_id)
    fraction_assertions = f"{num_assertions} / {total_len_assertions}"
    print(
        f"Number of {variant_norm_type.value} Feature Assertions in MOA: {fraction_assertions}"
    )

    # Percent
    percent_assertions = f"{num_assertions / total_len_assertions * 100:.2f}%"
    print(
        f"Percent of {variant_norm_type.value} Feature Assertions in MOA: {percent_assertions}"
    )

    assertion_analysis_summary["Count of MOA Assertions per Category"].append(
        num_assertions
    )
    assertion_analysis_summary["Fraction of all MOA Assertions"].append(
        fraction_assertions
    )
    assertion_analysis_summary["Percent of all MOA Assertions"].append(
        percent_assertions
    )

In [None]:
feature_id_to_digest_df = pd.DataFrame(
    feature_id_to_digest.items(), columns=["feature_id", "feature_digest"]
)
feature_id_to_digest_df

## <a id='toc1_5_'></a>[Normalized Analysis](#toc0_)

In [None]:
normalized_queries_df = pd.read_csv(
    "../feature_analysis/able_to_normalize_queries.csv", sep="\t"
)
normalized_queries_df = pd.merge(
    normalized_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
normalized_queries_df.shape

In [None]:
normalized_queries_df = pd.merge(
    normalized_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
normalized_queries_df = normalized_queries_df.drop(columns=["variant_id"])

In [None]:
normalized_queries_df = feature_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df

In [None]:
assertion_analysis(moa_df, normalized_queries_df, VariantNormType.NORMALIZED)

## <a id='toc1_6_'></a>[Not Supported Analysis](#toc0_)

In [None]:
not_supported_queries_df = pd.read_csv(
    "../feature_analysis/not_supported_variants.csv", sep="\t"
)
not_supported_queries_df = pd.merge(
    not_supported_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
not_supported_queries_df.shape

In [None]:
not_supported_queries_df = pd.merge(
    not_supported_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
not_supported_queries_df = not_supported_queries_df.drop(columns=["variant_id"])
not_supported_queries_df

### <a id='toc1_6_1_'></a>[Feature (Variant) Analysis](#toc0_)

In [None]:
not_supported_queries_df = feature_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)

### <a id='toc1_6_2_'></a>[Not Supported Feature (Variant) Analysis by Subcategory](#toc0_)

In [None]:
not_supported_feature_analysis_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percent of all MOA Features": [],
    "Fraction of Not Supported Features": [],
    "Percent of Not Supported Features": [],
}

In [None]:
not_supported_feature_categories_summary_data = dict()
total_number_unique_not_supported_features = len(
    set(not_supported_queries_df.feature_id)
)

for category in VARIANT_CATEGORY_VALUES:  # These are not supported categories
    not_supported_feature_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_features = len(set(category_df.feature_id))
    not_supported_feature_categories_summary_data[category][
        "number_unique_not_supported_category_features"
    ] = number_unique_not_supported_category_features

    # Fraction
    fraction_not_supported_category_feature_of_moa = (
        f"{number_unique_not_supported_category_features} / {total_len_features}"
    )
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_moa"
    ] = fraction_not_supported_category_feature_of_moa

    # Percent
    percent_not_supported_category_feature_of_moa = f"{number_unique_not_supported_category_features / total_len_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_moa"
    ] = percent_not_supported_category_feature_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features} / {total_number_unique_not_supported_features}"
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_total_not_supported"
    ] = fraction_not_supported_category_feature_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features / total_number_unique_not_supported_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_total_not_supported"
    ] = percent_not_supported_category_feature_of_total_not_supported

    not_supported_feature_analysis_summary["Count of MOA Features per Category"].append(
        number_unique_not_supported_category_features
    )
    not_supported_feature_analysis_summary["Fraction of all MOA Features"].append(
        fraction_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Percent of all MOA Features"].append(
        percent_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Fraction of Not Supported Features"].append(
        fraction_not_supported_category_feature_of_total_not_supported
    )
    not_supported_feature_analysis_summary["Percent of Not Supported Features"].append(
        percent_not_supported_category_feature_of_total_not_supported
    )

In [None]:
not_supported_variant_df = pd.DataFrame(not_supported_feature_analysis_summary)
not_supported_variant_df

### <a id='toc1_6_3_'></a>[Not Supported Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc0_)

List the variant categories present among the Not Supported features

In [None]:
not_supported_feature_categories = not_supported_queries_df.category.unique()
list(not_supported_feature_categories)

In [None]:
assertion_analysis(moa_df, not_supported_queries_df, VariantNormType.NOT_SUPPORTED)

In [None]:
not_supported_feature_assertion_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of all MOA Assertions": [],
    "Percent of all MOA Assertions": [],
    "Fraction of Not Supported Feature Assertions": [],
    "Percent of Not Supported Feature Assertions": [],
}

In [None]:
not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

In [None]:
not_supported_feature_categories_assertion_summary_data = dict()

not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

for category in VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_assertion_summary_data[category] = {}

    # Need to do this bc of duplicate features
    tmp_df = moa_df[moa_df["feature_digest"].isin(not_supported_feature_ids)]

    evidence_category_df = tmp_df[tmp_df.category == category]

    evidence_category_df = evidence_category_df.drop_duplicates(subset=["assertion_id"])

    # Count for Not Supported Feature Assertions
    total_number_not_supported_feature_unique_assertions = len(tmp_df.assertion_id)

    # Count per Category
    number_unique_not_supported_category_assertion = len(
        set(evidence_category_df.assertion_id)
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "number_unique_not_supported_category_assertion"
    ] = number_unique_not_supported_category_assertion

    # Fraction
    fraction_not_supported_category_feature_assertion_of_moa = (
        f"{number_unique_not_supported_category_assertion} / {total_len_assertions}"
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_moa"
    ] = fraction_not_supported_category_feature_assertion_of_moa

    # Percent
    percent_not_supported_category_feature_assertion_of_moa = f"{number_unique_not_supported_category_assertion / total_len_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_moa"
    ] = percent_not_supported_category_feature_assertion_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion} / {total_number_not_supported_feature_unique_assertions}"
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_total_not_supported"
    ] = fraction_not_supported_category_feature_assertion_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion / total_number_not_supported_feature_unique_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_total_not_supported"
    ] = percent_not_supported_category_feature_assertion_of_total_not_supported

    not_supported_feature_assertion_summary[
        "Count of MOA Assertions per Category"
    ].append(number_unique_not_supported_category_assertion)
    not_supported_feature_assertion_summary["Fraction of all MOA Assertions"].append(
        fraction_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary["Percent of all MOA Assertions"].append(
        percent_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary[
        "Fraction of Not Supported Feature Assertions"
    ].append(fraction_not_supported_category_feature_assertion_of_total_not_supported)
    not_supported_feature_assertion_summary[
        "Percent of Not Supported Feature Assertions"
    ].append(percent_not_supported_category_feature_assertion_of_total_not_supported)

In [None]:
number_unique_not_supported_category_features

### <a id='toc1_6_4_'></a>[Impact Score Analysis by Subcategory](#toc0_)

In [None]:
not_supported_impact_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "MOA Total Sum Impact Score": [],
    "Average Impact Score per Feature": [],
    "Average Impact Score per Assertion": [],
    "Total Number Assertions": [
        v["number_unique_not_supported_category_assertion"]
        for v in not_supported_feature_categories_assertion_summary_data.values()
    ],
    "Total Number Features": [
        v["number_unique_not_supported_category_features"]
        for v in not_supported_feature_categories_summary_data.values()
    ],
}

In [None]:
not_supported_feature_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_impact_data[category] = {}
    impact_category_df = not_supported_queries_df[
        not_supported_queries_df["category"] == category
    ].copy()

    total_sum_not_supported_category_impact = impact_category_df["impact_score"].sum()

    not_supported_feature_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

    number_unique_not_supported_category_features = (
        impact_category_df.feature_id.nunique()
    )
    number_unique_not_supported_category_assertion = (
        impact_category_df.assertion_id.nunique()
    )

    if number_unique_not_supported_category_features == 0:
        avg_impact_score_feature = 0
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_feature"
        ] = avg_impact_score_feature

        avg_impact_score_assertion = 0
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_evidence"
        ] = avg_impact_score_assertion
    else:
        avg_impact_score_feature = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_features:.2f}"
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_feature"
        ] = avg_impact_score_feature

        avg_impact_score_assertion = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_assertion:.2f}"
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_evidence"
        ] = avg_impact_score_assertion

    not_supported_impact_summary["MOA Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Feature"].append(
        avg_impact_score_feature
    )
    not_supported_impact_summary["Average Impact Score per Assertion"].append(
        avg_impact_score_assertion
    )

    print(
        f"Number of unique features within category: {number_unique_not_supported_category_features}"
    )
    print(
        f"{category}: {total_sum_not_supported_category_impact}, {avg_impact_score_feature}, {avg_impact_score_assertion}"
    )
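
The two averages divide the same category total by different counts: unique features and unique assertions. A minimal sketch with made-up rows, assuming one row per assertion, is shown below. The two averages differ only when some feature has more than one assertion in the frame.

```python
# Hypothetical rows: (feature_id, assertion_id, impact_score)
rows = [
    (1, 101, 10),
    (1, 102, 5),
    (2, 103, 1),
    (3, 104, 1),
]

total_impact = sum(score for _, _, score in rows)
n_features = len({feature_id for feature_id, _, _ in rows})
n_assertions = len({assertion_id for _, assertion_id, _ in rows})

# Guard against empty categories, as the loop above does
avg_per_feature = f"{total_impact / n_features:.2f}" if n_features else "0.00"
avg_per_assertion = f"{total_impact / n_assertions:.2f}" if n_assertions else "0.00"
print(total_impact, avg_per_feature, avg_per_assertion)
```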

In [None]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)

In [None]:
not_supported_feature_impact_df

In [None]:
not_supported_feature_impact_df.to_csv(
    "moa_assertion_analysis_output/not_supported_feature_impact_df.csv", index=False
)

# <a id='toc2_'></a>[MOA Summary](#toc0_)

## <a id='toc2_1_'></a>[Feature (Variant) Analysis](#toc0_)

### <a id='toc2_1_1_'></a>[Building Summary Tables 1 - 3](#toc0_)

In [None]:
all_features_df = pd.DataFrame(feature_analysis_summary)

In [None]:
all_features_df = combine_frac_perc(all_features_df, "all MOA Features")

In [None]:
for_merge_all_variant_percent_of_moa_df = all_features_df.drop(
    columns=["Fraction of all MOA Features"]
)

all_features_percent_of_moa_df = all_features_df.drop(
    columns=["Fraction of all MOA Features", "Count of MOA Features per Category"]
)

In [None]:
for_merge_all_variant_percent_of_moa_df.to_csv(
    "moa_assertion_analysis_output/for_merge_all_variant_percent_of_moa_df.csv",
    index=False,
)

### <a id='toc2_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the 2 categories that MOA features (variants) were divided into after normalization and what percent they make up of all features (variants) in MOA data. 

<ins>Numerator:</ins> # of MOA Features (variants) that are Normalized or Not Supported
<br><ins>Denominator:</ins> # of all MOA Features (variants)

In [None]:
all_features_percent_of_moa_df = all_features_percent_of_moa_df.set_index(
    "Variant Category"
)
all_features_percent_of_moa_df

In [None]:
moa_summary_table_1 = all_features_percent_of_moa_df

### <a id='toc2_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the categories that the Not Supported features (variants) were broken into and what percent of all MOA features (variants) they make up.

<ins>Numerator:</ins> # of MOA Features (variants) that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of all MOA Features (variants)

In [None]:
not_supported_features_df = pd.DataFrame(not_supported_feature_analysis_summary)

In [None]:
not_supported_features_total_df = combine_frac_perc(
    not_supported_features_df, "all MOA Features"
)

In [None]:
for_merge_not_supported_features_total_df = not_supported_features_total_df[
    [
        "Category",
        "Count of MOA Features per Category",
        "Percent of all MOA Features",
    ]
].copy()

In [None]:
not_supported_features_total_df = (
    not_supported_features_total_df[
        [
            "Category",
            "Percent of all MOA Features",
        ]
    ]
    .copy()
    .set_index("Category")
)
not_supported_features_total_df

In [None]:
moa_summary_table_2 = not_supported_features_total_df

In [None]:
for_merge_not_supported_features_total_df.to_csv(
    "moa_assertion_analysis_output/for_merge_not_supported_features_total_df.csv",
    index=False,
)

### <a id='toc2_1_4_'></a>[Summary Table 3](#toc0_)

The table below shows the subcategories into which the Not Supported features (variants) were divided and the percentage of the Not Supported group that each subcategory makes up.

<ins>Numerator:</ins> # of MOA Features (variants) that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of MOA Features (variants) that are Not Supported
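
Summary Tables 2 and 3 share the same numerator and differ only in the denominator. The sketch below illustrates that difference with made-up numbers; the subcategory names, the counts, and the variable `total_moa_features` are assumptions chosen for illustration and do not come from the MOA data.

```python
import pandas as pd

# Hypothetical numbers for illustration only; these are not actual MOA results.
total_moa_features = 200  # assumed count of all MOA features
subcategory_counts = pd.Series(
    {"Subcategory A": 50, "Subcategory B": 30, "Subcategory C": 20}
)
# Denominator for the "of Not Supported" view: sum over the subcategories.
not_supported_total = subcategory_counts.sum()

comparison_df = pd.DataFrame(
    {
        # Same numerator, denominator = all features (as in Summary Table 2)
        "Percent of all MOA Features": subcategory_counts * 100 / total_moa_features,
        # Same numerator, denominator = Not Supported features (as in Summary Table 3)
        "Percent of Not Supported Features": subcategory_counts
        * 100
        / not_supported_total,
    }
)
print(comparison_df)
```

In this sketch the second column sums to 100, while the first column sums to the share of all features that are Not Supported (50 with these made-up numbers).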

In [None]:
not_supported_features_category_df = combine_frac_perc(
    not_supported_features_df, "Not Supported Features"
)

In [None]:
not_supported_features_category_df = not_supported_features_category_df[
    ["Category", "Percent of Not Supported Features"]
]
not_supported_features_category_df = not_supported_features_category_df.set_index(
    "Category"
)
not_supported_features_category_df

In [None]:
moa_summary_table_3 = not_supported_features_category_df

## <a id='toc2_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc2_2_1_'></a>[Building Summary Table 4](#toc0_)

In [None]:
all_features_assertions_df = pd.DataFrame(assertion_analysis_summary)

In [None]:
all_features_assertions_df = combine_frac_perc(
    all_features_assertions_df, "all MOA Assertions"
)

In [None]:
for_merge_all_features_assertions_df = all_features_assertions_df.drop(
    columns=["Fraction of all MOA Assertions"]
)

all_features_assertions_df = for_merge_all_features_assertions_df.drop(
    columns=["Count of MOA Assertions per Category"]
)

In [None]:
for_merge_all_features_assertions_df.to_csv(
    "moa_assertion_analysis_output/for_merge_all_features_assertions_df.csv",
    index=False,
)

### <a id='toc2_2_2_'></a>[Summary Table 4](#toc0_)

The table below shows the percentage of all assertions (evidence items) in MOA that are associated with Normalized and Not Supported features (variants).

<ins>Numerator:</ins> # of MOA Assertions (evidence items) based on normalization status of associated features (variants)
<br><ins>Denominator:</ins> # of all MOA Assertions (evidence items)

In [None]:
all_features_assertions_df = all_features_assertions_df.set_index("Variant Category")
moa_summary_table_4 = all_features_assertions_df
moa_summary_table_4

### <a id='toc2_2_3_'></a>[Building Summary Tables 5 & 6](#toc0_)

In [None]:
not_supported_feature_assertion_df = pd.DataFrame(
    not_supported_feature_assertion_summary
)

In [None]:
not_supported_feature_assertion_df = combine_frac_perc(
    not_supported_feature_assertion_df, "all MOA Assertions"
)

In [None]:
not_supported_feature_assertion_df = combine_frac_perc(
    not_supported_feature_assertion_df, "Not Supported Feature Assertions"
)

In [None]:
not_supported_feature_assertion_df = not_supported_feature_assertion_df.drop(
    columns=[
        "Fraction of all MOA Assertions",
        "Fraction of Not Supported Feature Assertions",
    ]
)

In [None]:
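# Version saved to CSV for later merging: counts plus percent of all MOA assertions
for_merge_not_supported_feature_assertion_df = not_supported_feature_assertion_df.drop(
    ["Percent of Not Supported Feature Assertions"], axis=1
)

# Summary Table 5: percent of all MOA assertions per Not Supported subcategory
not_supported_feature_assertion_of_moa_df = (
    for_merge_not_supported_feature_assertion_df.drop(
        ["Count of MOA Assertions per Category"], axis=1
    )
)

# Summary Table 6: percent of Not Supported feature assertions per subcategory
not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_df.drop(
        ["Percent of all MOA Assertions", "Count of MOA Assertions per Category"],
        axis=1,
    )
)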

In [None]:
for_merge_not_supported_feature_assertion_df.to_csv(
    "moa_assertion_analysis_output/for_merge_not_supported_feature_assertion_df.csv",
    index=False,
)

### <a id='toc2_2_4_'></a>[Summary Table 5](#toc0_)

The table below shows the percentage of all MOA assertions (evidence items) that are associated with each Not Supported variant subcategory.

<ins>Numerator:</ins> # of MOA Assertions (evidence items) associated with Not Supported features (variants) in a given Subcategory
<br><ins>Denominator:</ins> # of all MOA Assertions (evidence items)

In [None]:
not_supported_feature_assertion_of_moa_df = (
    not_supported_feature_assertion_of_moa_df.set_index("Category")
)
moa_summary_table_5 = not_supported_feature_assertion_of_moa_df
moa_summary_table_5

### <a id='toc2_2_5_'></a>[Summary Table 6](#toc0_)

The table below shows, for the MOA assertions (evidence items) associated with Not Supported features (variants), the percentage that belongs to each variant subcategory.

<ins>Numerator:</ins> # of MOA Assertions (evidence items) associated with Not Supported features (variants) in a given Subcategory
<br><ins>Denominator:</ins> # of MOA Assertions (evidence items) associated with all Not Supported features (variants)

In [None]:
not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_of_not_supported_df.set_index("Category")
)
moa_summary_table_6 = not_supported_feature_assertion_of_not_supported_df
moa_summary_table_6

## <a id='toc2_3_'></a>[Impact](#toc0_)

The bar graph below shows the total impact score for each Not Supported variant subcategory. The color of each bar indicates the number of assertions (evidence items) associated with that subcategory.

In [None]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)
not_supported_feature_impact_df

In [None]:
not_supported_feature_impact_df.to_csv(
    "moa_assertion_analysis_output/not_supported_feature_impact_df.csv", index=False
)

In [None]:
fig3 = px.bar(
    not_supported_feature_impact_df,
    x="Category",
    y="MOA Total Sum Impact Score",
    hover_data=["Total Number Assertions"],
    color="Total Number Assertions",
    labels={"MOA Total Sum Impact Score": "MOA Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig3.update_traces(width=1)
fig3.show()

In [None]:
fig3.write_html("moa_assertion_analysis_output/moa_ns_categories_impact_redgreen.html")

The scatterplot below shows the relationship between the total impact score of each Not Supported variant subcategory and the number of assertions (evidence items) associated with features (variants) in that subcategory. The size of each data point represents the number of features (variants) in the subcategory.

In [None]:
fig2 = px.scatter(
    data_frame=not_supported_feature_impact_df,
    x="Total Number Assertions",
    y="MOA Total Sum Impact Score",
    size="Total Number Features",
    size_max=40,
    text="Total Number Features",
    color="Category",
)
fig2.show()

In [None]:
fig2.write_html(
    "moa_assertion_analysis_output/moa_ns_categories_impact_scatterplot.html"
)