# <a id='toc1_'></a>[Molecular Oncology Almanac Assertion Analysis](#toc0_)

The moa_assertion_analysis notebook analyzes the assertions in the Molecular Oncology Almanac (MOA).

MOA evidence items are referred to as assertions and MOA variants are referred to as features in this analysis. 

The moa_features_analysis notebook is a prerequisite for this notebook because it updates the cache.

**Table of contents**<a id='toc0_'></a>    
- [Molecular Oncology Almanac Assertion Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
  - [Create analysis functions / global variables](#toc1_2_)    
  - [All Features (Variants) Analysis](#toc1_3_)    
    - [Creating a table with feature (variant) and assertion (evidence) information](#toc1_3_1_)    
    - [Converting feature (variant) types to normalized categories](#toc1_3_2_)    
    - [Adding a numerical impact score based on the predictive implication](#toc1_3_3_)    
    - [Impact Score Analysis](#toc1_3_4_)    
    - [Features (Variants) Analysis](#toc1_3_5_)    
    - [Assertions (Evidence Items) Analysis](#toc1_3_6_)    
    - [Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc1_3_7_)    
  - [Create functions / global variables used in analysis](#toc1_4_)    
  - [Normalized Analysis](#toc1_5_)    
  - [Not Supported Analysis](#toc1_6_)    
    - [Feature (Variant) Analysis](#toc1_6_1_)    
    - [Not Supported Feature (Variant) Analysis by Subcategory](#toc1_6_2_)    
    - [Not Supported Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc1_6_3_)    
    - [Impact Score Analysis by Subcategory](#toc1_6_4_)    
- [MOA Summary](#toc2_)    
  - [Feature (Variant) Analysis](#toc2_1_)    
    - [Building Summary Tables 1 - 3](#toc2_1_1_)    
    - [Summary Table 1](#toc2_1_2_)    
    - [Summary Table 2](#toc2_1_3_)    
    - [Summary Table 3](#toc2_1_4_)    
  - [Evidence Analysis](#toc2_2_)    
    - [Building Summary Table 4](#toc2_2_1_)    
    - [Summary Table 4](#toc2_2_2_)    
    - [Building Summary Tables 5 & 6](#toc2_2_3_)    
    - [Summary Table 5](#toc2_2_4_)    
    - [Summary Table 6](#toc2_2_5_)    
  - [Impact](#toc2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [1]:
import os
import sys
import csv
import json
from enum import Enum
from pathlib import Path
from typing import Dict

import pandas as pd
import plotly.express as px
from ga4gh.core import sha512t24u

module_path = os.path.abspath(os.path.join("../.."))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils import (  # noqa: E402
    NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    MoaItemType,
    load_latest_moa_zip,
)

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [2]:
path = Path("output")
path.mkdir(exist_ok=True)

## <a id='toc1_2_'></a>[Create analysis functions / global variables](#toc0_)

In [3]:
# Use latest feature zip that has been pushed to the repo
variants_resp = load_latest_moa_zip(MoaItemType.FEATURE)

Using moa_features_20250717.json for MOA features


In [4]:
def get_feature_digest(feature: Dict) -> str:
    """Get digest for feature

    :param feature: MOA feature
    :return: Digest
    """
    attrs = json.dumps(
        feature["attributes"][0], sort_keys=True, separators=(",", ":"), indent=None
    ).encode("utf-8")
    return sha512t24u(attrs)
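
The digest lets MOA feature IDs with identical attributes collapse into a single feature. The following is a minimal sketch of that idea, not the `ga4gh.core` implementation: `sha512t24u_standin` is a hypothetical stdlib stand-in that assumes the GA4GH definition (SHA-512, truncated to the first 24 bytes, base64url-encoded), and the attribute dictionaries are invented for illustration.

```python
import base64
import hashlib
import json


def sha512t24u_standin(blob: bytes) -> str:
    """Hypothetical stand-in: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def digest_attributes(attributes: dict) -> str:
    """Serialize with sorted keys and compact separators, then digest."""
    blob = json.dumps(attributes, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u_standin(blob)


# Invented attribute dicts: a and b differ only in key order, c differs in content
a = {"gene": "BRAF", "protein_change": "p.V600E"}
b = {"protein_change": "p.V600E", "gene": "BRAF"}
c = {"gene": "BRAF", "protein_change": "p.V600K"}

print(digest_attributes(a) == digest_attributes(b))  # True: key order is normalized
print(digest_attributes(a) == digest_attributes(c))  # False: content differs
print(len(digest_attributes(a)))  # 32: 24 bytes encode to 32 base64url characters
```

Sorting the keys before hashing is what makes the digest independent of the order in which attributes appear in the source JSON.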

In [5]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType.__members__.values()]

## <a id='toc1_3_'></a>[All Features (Variants) Analysis](#toc0_)

### <a id='toc1_3_1_'></a>[Creating a table with feature (variant) and assertion (evidence) information](#toc0_)

In [6]:
# Create dictionary for MOA Feature Digest -> Feature Type

features = {}

for feature in variants_resp:
    feature_id = feature["feature_id"]
    digest = get_feature_digest(feature)
    features[digest] = feature["feature_type"]

count_unique_feature_ids = len(features.keys())
print(count_unique_feature_ids)

452


In [7]:
# Use latest assertion zip that has been pushed to the repo
assertions_resp = load_latest_moa_zip(MoaItemType.ASSERTION)

Using moa_assertions_20250717.json for MOA assertions


In [8]:
# Create DF for assertions and their associated feature + predictive implication

transformed = []

# Mapping from feature ID to feature digest
feature_id_to_digest = {}

for assertion in assertions_resp:
    assertion_id = assertion["assertion_id"]
    predictive_implication = assertion["predictive_implication"]

    if len(assertion["features"]) != 1:
        print(f"assertion id ({assertion_id}) does not have 1 feature")
        continue

    feature = assertion["features"][0]
    feature_id = feature["feature_id"]
    feature_digest = get_feature_digest(feature)

    feature_id_to_digest[feature_id] = feature_digest

    transformed.append(
        {
            "assertion_id": assertion_id,
            "feature_id": feature_id,
            "feature_type": features[feature_digest],
            "predictive_implication": predictive_implication,
            "feature_digest": feature_digest,
        }
    )
moa_df = pd.DataFrame(transformed)
print(len(moa_df["feature_digest"].unique()))
moa_df

452


Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
...,...,...,...,...,...
1002,1006,1006,somatic_variant,FDA-Approved,IpbUYvcvcPuvnET_IyVv-Yi_IUKddyvg
1003,1007,1007,somatic_variant,FDA-Approved,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS
1004,1008,1008,rearrangement,FDA-Approved,yzNjLPO2biW_kdgqdjlMQXAEzDMrTxin
1005,1009,1009,somatic_variant,FDA-Approved,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS


In [9]:
# Create dictionary for MOA Feature Digest -> Variant Category
feature_digest_to_category = {}

for fn in [
    "able_to_normalize_queries.tsv",
    "unable_to_normalize_queries.tsv",
    "not_supported_variants.tsv",
]:
    with open(f"../feature_analysis/{fn}") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        for row in reader:
            feature_id = row[0]
            category = row[3]
            digest = feature_id_to_digest[int(feature_id)]
            feature_digest_to_category[digest] = category
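
The loop above reads each TSV by position: column 0 holds the feature ID and column 3 holds the category. A small self-contained sketch of the same pattern follows. The header names other than `variant_id`, the rows, and the digests are all invented for illustration; only the positional indexing mirrors the notebook.

```python
import csv
import io

# Invented TSV content; only the column positions (0 and 3) mirror the notebook
tsv_text = (
    "variant_id\tquery\tquery_type\tcategory\n"
    "1\tq1\tprotein\tSequence\n"
    "2\tq2\tgenomic\tCopy Number\n"
)
# Invented mapping from feature ID to digest
toy_feature_id_to_digest = {1: "aaa", 2: "bbb"}

toy_digest_to_category = {}
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
next(reader)  # skip the header row
for row in reader:
    digest = toy_feature_id_to_digest[int(row[0])]  # IDs are read as strings
    toy_digest_to_category[digest] = row[3]

print(toy_digest_to_category)  # {'aaa': 'Sequence', 'bbb': 'Copy Number'}
```

Because the lookup is keyed by digest, feature IDs that share a digest also share a category entry, and a later file overwrites an earlier one for the same digest.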

In [10]:
# Add category column
moa_df["category"] = moa_df["feature_digest"].apply(
    lambda digest: feature_digest_to_category[digest]
)
moa_df

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement
...,...,...,...,...,...,...
1002,1006,1006,somatic_variant,FDA-Approved,IpbUYvcvcPuvnET_IyVv-Yi_IUKddyvg,Sequence
1003,1007,1007,somatic_variant,FDA-Approved,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS,Sequence
1004,1008,1008,rearrangement,FDA-Approved,yzNjLPO2biW_kdgqdjlMQXAEzDMrTxin,Rearrangement
1005,1009,1009,somatic_variant,FDA-Approved,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS,Sequence


In [11]:
moa_df.to_csv("output/moa_df.csv")

In [12]:
unique_features_df = moa_df.sort_values("feature_id").drop_duplicates(
    subset=["feature_digest"]
)
len_unique_feature_ids = len(list(unique_features_df.feature_id))
len_unique_feature_ids

452
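
Sorting by `feature_id` before dropping duplicates matters because `drop_duplicates` keeps the first occurrence by default, so the lowest feature ID represents each digest. A toy sketch with invented IDs and digests:

```python
import pandas as pd

# Invented data: three feature IDs share the digest "aaa"
toy = pd.DataFrame(
    {
        "feature_id": [3, 1, 2, 4],
        "feature_digest": ["aaa", "aaa", "bbb", "aaa"],
    }
)

# keep="first" is the default, so sorting decides which row survives
unique_toy = toy.sort_values("feature_id").drop_duplicates(subset=["feature_digest"])
print(unique_toy["feature_id"].tolist())  # [1, 2]
```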

In [13]:
total_len_features = len(moa_df.feature_digest.unique())
f"Total number of unique features (variants): {total_len_features}"

'Total number of unique features (variants): 452'

In [14]:
assert total_len_features == len_unique_feature_ids

In [15]:
total_len_assertions = len(moa_df.assertion_id.unique())
f"Total number of unique assertions (evidence items): {total_len_assertions}"

'Total number of unique assertions (evidence items): 1007'

### <a id='toc1_3_2_'></a>[Converting feature (variant) types to normalized categories](#toc0_)

In [16]:
list(moa_df.feature_type.unique())

['rearrangement',
 'somatic_variant',
 'germline_variant',
 'copy_number',
 'microsatellite_stability',
 'mutational_signature',
 'mutational_burden',
 'knockdown',
 'aneuploidy']

In [17]:
list(moa_df.category.unique())

['Rearrangement',
 'Sequence',
 'Gene Function',
 'Region-Defined',
 'Other',
 'Copy Number',
 'Expression']

### <a id='toc1_3_3_'></a>[Adding a numerical impact score based on the predictive implication](#toc0_)
This is based on the structure of MOA scoring.

In [18]:
predictive_implication_categories = moa_df.predictive_implication.unique()
list(predictive_implication_categories)

['FDA-Approved',
 'Guideline',
 'Clinical trial',
 'Preclinical',
 'Inferential',
 'Clinical evidence']

In [19]:
moa_df["impact_score"] = moa_df["predictive_implication"].copy()

moa_df.loc[moa_df["impact_score"] == "FDA-Approved", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Guideline", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Clinical evidence", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Clinical trial", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Preclinical", "impact_score"] = 1
moa_df.loc[moa_df["impact_score"] == "Inferential", "impact_score"] = 0.5

moa_df.head()

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
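
The same scoring can be written as a single dictionary lookup. The sketch below is an alternative under stated assumptions, not the notebook's method: the scores are the ones assigned above, while `IMPACT_SCORES`, the toy rows, and the `"Unknown"` label are invented for illustration.

```python
import pandas as pd

# Scores copied from the assignments above; the constant name is hypothetical
IMPACT_SCORES = {
    "FDA-Approved": 10,
    "Guideline": 10,
    "Clinical evidence": 5,
    "Clinical trial": 5,
    "Preclinical": 1,
    "Inferential": 0.5,
}

toy = pd.DataFrame(
    {"predictive_implication": ["FDA-Approved", "Preclinical", "Inferential", "Unknown"]}
)
# Labels missing from the dictionary become NaN instead of staying as strings
toy["impact_score"] = toy["predictive_implication"].map(IMPACT_SCORES)

print(toy["impact_score"].tolist())  # [10.0, 1.0, 0.5, nan]
```

One difference to be aware of: `map` produces a float column, so category totals would print as `12.0` rather than `12`, and an unexpected label shows up as a missing value that can be checked with `isna()`.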


### <a id='toc1_3_4_'></a>[Impact Score Analysis](#toc0_)

In [20]:
feature_categories_impact_data = dict()
for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    feature_categories_impact_data[category] = {}
    impact_category_df = moa_df[moa_df.category == category]

    total_sum_category_impact = impact_category_df["impact_score"].sum()
    feature_categories_impact_data[category]["total_sum_category_impact"] = (
        total_sum_category_impact
    )
    print(f"{category}: {total_sum_category_impact}")

Sequence: 4216.5
Genotype/Haplotype: 0
Fusion: 0
Rearrangement: 693.0
Epigenetic Modification: 0
Copy Number: 427.5
Expression: 12
Gene Function: 112
Region-Defined: 600.0
Genome Feature: 0
Other: 199.0
Transcript: 0
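
The per-category totals can also be computed in one pass with `groupby`. This is a sketch on invented rows, with a hypothetical `CATEGORIES` list standing in for `NOT_SUPPORTED_VARIANT_CATEGORY_VALUES`; `reindex` supplies a zero for categories that have no rows.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame(
    {
        "category": ["Sequence", "Sequence", "Copy Number", "Rearrangement"],
        "impact_score": [10, 0.5, 5, 1],
    }
)
# Hypothetical stand-in for NOT_SUPPORTED_VARIANT_CATEGORY_VALUES
CATEGORIES = ["Sequence", "Fusion", "Rearrangement", "Copy Number"]

totals = (
    toy.groupby("category")["impact_score"]
    .sum()
    .reindex(CATEGORIES, fill_value=0)  # categories without rows get 0
)
print(totals.to_dict())
```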


### <a id='toc1_3_5_'></a>[Features (Variants) Analysis](#toc0_)

In [21]:
def calc_perc_item_analysis(item_type: MoaItemType, total_len: int) -> dict:
    """Calculates the percent of either the features or the assertions in MOA

    :param item_type: The type of item
    :param total_len: The total number of items defined by 'item_type'
    :return: Dictionary with a string indicating the percent of the item
    """
    moa_item_data = dict()

    for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
        moa_item_data[category] = {}
        item_type_df = moa_df[moa_df.category == category]
        if item_type == MoaItemType.FEATURE:
            number_unique_category_items = len(set(item_type_df.feature_digest))
        else:
            number_unique_category_items = len(set(item_type_df.assertion_id))

        if item_type == MoaItemType.FEATURE:
            singular = MoaItemType.FEATURE.value
            plural = "features"
        else:
            singular = MoaItemType.ASSERTION.value
            plural = "assertions"

        moa_item_data[category][f"number_unique_category_{plural}"] = (
            number_unique_category_items
        )

        fraction_category_item = f"{number_unique_category_items} / {total_len}"
        moa_item_data[category][f"fraction_category_{singular}"] = (
            fraction_category_item
        )

        percent_category_item = (
            "{:.2f}".format(number_unique_category_items / total_len * 100) + "%"
        )

        moa_item_data[category][f"percent_category_{singular}"] = percent_category_item

    return moa_item_data
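
As a worked example of the fraction and percent strings built above, the sketch below uses the Sequence counts reported in this notebook (286 of 452 features, 634 of 1007 assertions). The helper name `format_fraction_and_percent` is hypothetical.

```python
def format_fraction_and_percent(count: int, total: int) -> tuple[str, str]:
    """Return the 'count / total' string and the percent with two decimals."""
    fraction = f"{count} / {total}"
    percent = f"{count / total * 100:.2f}%"
    return fraction, percent


print(format_fraction_and_percent(286, 452))  # ('286 / 452', '63.27%')
print(format_fraction_and_percent(634, 1007))  # ('634 / 1007', '62.96%')
```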

In [22]:
moa_feature_data = calc_perc_item_analysis(MoaItemType.FEATURE, total_len_features)

### <a id='toc1_3_6_'></a>[Assertions (Evidence Items) Analysis](#toc0_)

In [23]:
moa_assertion_data = calc_perc_item_analysis(
    MoaItemType.ASSERTION, total_len_assertions
)

### <a id='toc1_3_7_'></a>[Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc0_)

In [24]:
feature_category_impact_score = [
    v["total_sum_category_impact"] for v in feature_categories_impact_data.values()
]
feature_category_number = [
    v["number_unique_category_features"] for v in moa_feature_data.values()
]
feature_category_fraction = [
    v["fraction_category_feature"] for v in moa_feature_data.values()
]
feature_category_percent = [
    v["percent_category_feature"] for v in moa_feature_data.values()
]
feature_category_assertion_number = [
    v["number_unique_category_assertions"] for v in moa_assertion_data.values()
]
feature_category_assertion_fraction = [
    v["fraction_category_assertion"] for v in moa_assertion_data.values()
]
feature_category_assertion_percent = [
    v["percent_category_assertion"] for v in moa_assertion_data.values()
]

In [25]:
feature_category_dict = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "Number of Features": feature_category_number,
    "Fraction of Features": feature_category_fraction,
    "Percent of Features": feature_category_percent,
    "Number of Assertions": feature_category_assertion_number,
    "Fraction of Assertions": feature_category_assertion_fraction,
    "Percent of Assertions": feature_category_assertion_percent,
    "Impact Score": feature_category_impact_score,
}

In [26]:
moa_feature_df = pd.DataFrame(feature_category_dict)
moa_feature_df

Unnamed: 0,Category,Number of Features,Fraction of Features,Percent of Features,Number of Assertions,Fraction of Assertions,Percent of Assertions,Impact Score
0,Sequence,286,286 / 452,63.27%,634,634 / 1007,62.96%,4216.5
1,Genotype/Haplotype,0,0 / 452,0.00%,0,0 / 1007,0.00%,0.0
2,Fusion,0,0 / 452,0.00%,0,0 / 1007,0.00%,0.0
3,Rearrangement,35,35 / 452,7.74%,85,85 / 1007,8.44%,693.0
4,Epigenetic Modification,0,0 / 452,0.00%,0,0 / 1007,0.00%,0.0
5,Copy Number,54,54 / 452,11.95%,116,116 / 1007,11.52%,427.5
6,Expression,11,11 / 452,2.43%,12,12 / 1007,1.19%,12.0
7,Gene Function,14,14 / 452,3.10%,20,20 / 1007,1.99%,112.0
8,Region-Defined,40,40 / 452,8.85%,109,109 / 1007,10.82%,600.0
9,Genome Feature,0,0 / 452,0.00%,0,0 / 1007,0.00%,0.0


In [27]:
def combine_frac_perc(df: pd.DataFrame, denominator: list[str]) -> pd.DataFrame:
    """Put fraction and percent string into one string

    :param df: Dataframe of variant statistics
    :param denominator: Names of the fraction denominators (e.g. ``["Features"]``);
        each needs matching "Fraction of ..." and "Percent of ..." columns in `df`
    :return: Transformed dataframe with fraction and percent string as one string
    """
    for d in denominator:
        perc_key = f"Percent of {d}"
        frac_key = f"Fraction of {d}"
        df[perc_key] = df[frac_key].astype(str) + "  (" + df[perc_key] + ")"
        df = df.drop([frac_key], axis=1)
    return df
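
A toy run of the function shows the effect on a single row. The sketch repeats the function body so that it is self-contained; the row uses the Sequence feature counts from this notebook.

```python
import pandas as pd


def combine_frac_perc(df: pd.DataFrame, denominator: list) -> pd.DataFrame:
    """Merge each 'Fraction of X' column into the matching 'Percent of X' column."""
    for d in denominator:
        perc_key = f"Percent of {d}"
        frac_key = f"Fraction of {d}"
        df[perc_key] = df[frac_key].astype(str) + "  (" + df[perc_key] + ")"
        df = df.drop([frac_key], axis=1)
    return df


toy = pd.DataFrame(
    {
        "Category": ["Sequence"],
        "Fraction of Features": ["286 / 452"],
        "Percent of Features": ["63.27%"],
    }
)
out = combine_frac_perc(toy, ["Features"])
print(out.columns.tolist())  # ['Category', 'Percent of Features']
print(out.loc[0, "Percent of Features"])  # 286 / 452  (63.27%)
```

The argument needs to be a list: passing the bare string `"Features"` would iterate over its characters and fail on the first column lookup.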

In [28]:
moa_feature_df = combine_frac_perc(moa_feature_df, ["Features"])

In [29]:
moa_feature_df = combine_frac_perc(moa_feature_df, ["Assertions"])

In [30]:
moa_feature_df_abbreviated = moa_feature_df[
    [
        "Category",
        "Percent of Features",
        "Percent of Assertions",
        "Impact Score",
    ]
].copy()

In [31]:
moa_feature_df_abbreviated = moa_feature_df_abbreviated.set_index("Category")
moa_feature_df_abbreviated

Unnamed: 0_level_0,Percent of Features,Percent of Assertions,Impact Score
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Sequence,286 / 452 (63.27%),634 / 1007 (62.96%),4216.5
Genotype/Haplotype,0 / 452 (0.00%),0 / 1007 (0.00%),0.0
Fusion,0 / 452 (0.00%),0 / 1007 (0.00%),0.0
Rearrangement,35 / 452 (7.74%),85 / 1007 (8.44%),693.0
Epigenetic Modification,0 / 452 (0.00%),0 / 1007 (0.00%),0.0
Copy Number,54 / 452 (11.95%),116 / 1007 (11.52%),427.5
Expression,11 / 452 (2.43%),12 / 1007 (1.19%),12.0
Gene Function,14 / 452 (3.10%),20 / 1007 (1.99%),112.0
Region-Defined,40 / 452 (8.85%),109 / 1007 (10.82%),600.0
Genome Feature,0 / 452 (0.00%),0 / 1007 (0.00%),0.0


In [32]:
fig = px.scatter(
    data_frame=moa_feature_df,
    x="Number of Assertions",
    y="Impact Score",
    size="Number of Features",
    size_max=40,
    text="Number of Features",
    color="Category",
)
fig.show()

In [33]:
fig.write_html("output/moa_feature_categories_impact_scatterplot.html")

## <a id='toc1_4_'></a>[Create functions / global variables used in analysis](#toc0_)

In [34]:
feature_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percent of all MOA Features": [],
}
feature_analysis_summary

{'Variant Category': ['Normalized', 'Not Supported'],
 'Count of MOA Features per Category': [],
 'Fraction of all MOA Features': [],
 'Percent of all MOA Features': []}

In [35]:
def feature_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do feature analysis (counts, percents). Updates `feature_analysis_summary`

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of features that are in `df`
    :return: Transformed dataframe with variant ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["feature_id"])
    feature_ids = list(df["feature_id"])

    # Count
    num_features = len(feature_ids)
    fraction_features = f"{num_features} / {total_len_features}"
    print(f"\nNumber of {variant_norm_type.value} Features in MOA: {fraction_features}")

    # Percent
    percent_features = f"{num_features / total_len_features * 100:.2f}%"
    print(f"Percent of {variant_norm_type.value} Features in MOA: {percent_features}")

    feature_analysis_summary["Count of MOA Features per Category"].append(num_features)
    feature_analysis_summary["Fraction of all MOA Features"].append(fraction_features)
    feature_analysis_summary["Percent of all MOA Features"].append(percent_features)

    return df

In [36]:
assertion_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of all MOA Assertions": [],
    "Percent of all MOA Assertions": [],
}
assertion_analysis_summary

{'Variant Category': ['Normalized', 'Not Supported'],
 'Count of MOA Assertions per Category': [],
 'Fraction of all MOA Assertions': [],
 'Percent of all MOA Assertions': []}

In [37]:
def assertion_analysis(
    all_df: pd.DataFrame,
    variant_norm_df: pd.DataFrame,
    variant_norm_type: VariantNormType,
) -> None:
    """Do evidence analysis (counts, percents). Updates `assertion_analysis_summary`

    Prints the evidence counts and percents for the given `variant_norm_type`.

    :param all_df: Dataframe for all assertions and features
    :param variant_norm_df: Dataframe for features given certain `variant_norm_type`
    :param variant_norm_type: The kind of variants that are in `variant_norm_df`
    """
    # Filter by digest because several feature IDs can share one feature digest
    _feature_ids = set(variant_norm_df.feature_digest)
    tmp_df = all_df[all_df["feature_digest"].isin(_feature_ids)]

    # Count
    num_assertions = len(tmp_df.assertion_id)
    fraction_assertions = f"{num_assertions} / {total_len_assertions}"
    print(
        f"Number of {variant_norm_type.value} Feature Assertions in MOA: {fraction_assertions}"
    )

    # Percent
    percent_assertions = f"{num_assertions / total_len_assertions * 100:.2f}%"
    print(
        f"Percent of {variant_norm_type.value} Feature Assertions in MOA: {percent_assertions}"
    )

    assertion_analysis_summary["Count of MOA Assertions per Category"].append(
        num_assertions
    )
    assertion_analysis_summary["Fraction of all MOA Assertions"].append(
        fraction_assertions
    )
    assertion_analysis_summary["Percent of all MOA Assertions"].append(
        percent_assertions
    )
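
The digest-based filter in `assertion_analysis` is what lets assertions attached to duplicate feature IDs still be counted. A toy sketch with invented IDs and digests:

```python
import pandas as pd

# Invented data: feature IDs 1 and 2 share the digest "aaa"
all_toy = pd.DataFrame(
    {
        "assertion_id": [1, 2, 3, 4],
        "feature_id": [1, 2, 3, 4],
        "feature_digest": ["aaa", "aaa", "bbb", "ccc"],
    }
)

# A de-duplicated subset that only contains feature IDs 1 and 4
subset_toy = all_toy[all_toy["feature_id"].isin([1, 4])]

# Filtering by digest also picks up assertion 2, whose feature duplicates feature 1
digests = set(subset_toy["feature_digest"])
matched = all_toy[all_toy["feature_digest"].isin(digests)]
print(matched["assertion_id"].tolist())  # [1, 2, 4]
```

Filtering on `feature_id` instead would return only assertions 1 and 4 and undercount the evidence for that feature.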

In [38]:
feature_id_to_digest_df = pd.DataFrame(
    feature_id_to_digest.items(), columns=["feature_id", "feature_digest"]
)
feature_id_to_digest_df

Unnamed: 0,feature_id,feature_digest
0,1,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
1,2,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
2,3,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
3,4,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
4,5,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
...,...,...
1002,1006,IpbUYvcvcPuvnET_IyVv-Yi_IUKddyvg
1003,1007,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS
1004,1008,yzNjLPO2biW_kdgqdjlMQXAEzDMrTxin
1005,1009,fakcYvzKfW8iRc3NOsxpYAHNWzO2rJtS


## <a id='toc1_5_'></a>[Normalized Analysis](#toc0_)

In [39]:
normalized_queries_df = pd.read_csv(
    "../feature_analysis/able_to_normalize_queries.tsv", sep="\t"
)
normalized_queries_df = pd.merge(
    normalized_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
normalized_queries_df.shape

(196, 7)

In [40]:
normalized_queries_df = pd.merge(
    normalized_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
normalized_queries_df = normalized_queries_df.drop(columns=["variant_id"])
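
The left merge above keeps one row per queried variant and pulls in the matching assertion columns. A minimal sketch on invented data, including one variant ID (99) with no match, shows how unmatched rows surface as missing values:

```python
import pandas as pd

# Invented data for illustration
queries = pd.DataFrame({"variant_id": [1, 99], "query": ["q1", "q99"]})
assertions = pd.DataFrame(
    {
        "assertion_id": [1, 2, 3],
        "feature_id": [1, 2, 3],
        "category": ["Sequence", "Copy Number", "Sequence"],
    }
)

merged = pd.merge(
    queries["variant_id"],  # a named Series is accepted as the left side
    assertions,
    left_on="variant_id",
    right_on="feature_id",
    how="left",  # keep every queried variant, matched or not
)
merged = merged.drop(columns=["variant_id"])

print(merged.shape)  # (2, 3)
print(int(merged["feature_id"].isna().sum()))  # 1
```

Checking for missing values after a left merge is a cheap way to confirm that every row in the TSV has a counterpart in `moa_df`.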

In [41]:
normalized_queries_df = feature_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df


Number of Normalized Features in MOA: 196 / 452
Percent of Normalized Features in MOA: 43.36%


Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,66,66,somatic_variant,Preclinical,1ZF2zGmQI4p_iTMPh7nkrUKJ77tIOkGq,Sequence,1
1,68,68,somatic_variant,Clinical evidence,j3HtSnIdrU8CcuW8_Qs3qVxOn-kMJV1T,Sequence,5
2,70,70,somatic_variant,Clinical evidence,X_Az48pPjt4IODuY2a50Yl2_1tGopcuF,Sequence,5
3,71,71,somatic_variant,Clinical evidence,LQQXFXpA4FCOQ3Fz4988x2vynER4J-Wh,Sequence,5
4,72,72,somatic_variant,Clinical evidence,DKoCqZUY0WBdUnoly9DL_PAjBBZTs51d,Sequence,5
...,...,...,...,...,...,...,...
191,933,933,somatic_variant,Guideline,IPk4rQ1eAdyoX-M6VQliEu-bO6FGh7SI,Sequence,10
192,934,934,somatic_variant,Guideline,n-ihfI4tIYL5o_cqV68kGTK-TRGFDn4s,Sequence,10
193,935,935,somatic_variant,Guideline,1KbdWJrRMPrBqij9mYcz0e2eBiJOBk2b,Sequence,10
194,982,982,somatic_variant,FDA-Approved,jsnDTkZjGqQYzYDrnV2j1QT8adztw8Cx,Sequence,10


In [42]:
assertion_analysis(moa_df, normalized_queries_df, VariantNormType.NORMALIZED)

Number of Normalized Feature Assertions in MOA: 391 / 1007
Percent of Normalized Feature Assertions in MOA: 38.83%


## <a id='toc1_6_'></a>[Not Supported Analysis](#toc0_)

In [43]:
not_supported_queries_df = pd.read_csv(
    "../feature_analysis/not_supported_variants.tsv", sep="\t"
)
not_supported_queries_df = pd.merge(
    not_supported_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
not_supported_queries_df.shape

(256, 6)

In [44]:
not_supported_queries_df = pd.merge(
    not_supported_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
not_supported_queries_df = not_supported_queries_df.drop(columns=["variant_id"])
not_supported_queries_df

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement,10
1,9,9,rearrangement,FDA-Approved,g99yF3kKnB-We_fMS5RaVygoSuT7qA-I,Rearrangement,10
2,12,12,rearrangement,FDA-Approved,e8PMq2A96-aBJ3Ip74ovx5VOUCztBTq7,Rearrangement,10
3,15,15,rearrangement,Guideline,DxfRiRV-3J6zRON4pnzNJjXkJf2bsp20,Rearrangement,10
4,18,18,rearrangement,Preclinical,BRsPjsZSCyDXnKtBt9XgsWX2JDNWY3FP,Rearrangement,1
...,...,...,...,...,...,...,...
251,956,956,somatic_variant,FDA-Approved,YOntQn-fL_3YdCwEFEqpVCNmzzxC8-4-,Sequence,10
252,964,964,rearrangement,FDA-Approved,yLodZXr73fFWpx8cIHv0ulWvyaHZcuO7,Rearrangement,10
253,965,965,rearrangement,FDA-Approved,sHaHVeGtfC5xdUaPl-qWIoSkAwucCCJF,Rearrangement,10
254,1001,1001,rearrangement,FDA-Approved,ovD4rWJe1XrPFmQlHBHX5hm8pfN9rEXQ,Rearrangement,10


### <a id='toc1_6_1_'></a>[Feature (Variant) Analysis](#toc0_)

In [45]:
not_supported_queries_df = feature_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)


Number of Not Supported Features in MOA: 256 / 452
Percent of Not Supported Features in MOA: 56.64%


### <a id='toc1_6_2_'></a>[Not Supported Feature (Variant) Analysis by Subcategory](#toc0_)

In [46]:
not_supported_feature_analysis_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percent of all MOA Features": [],
    "Fraction of Not Supported Features": [],
    "Percent of Not Supported Features": [],
}

In [47]:
not_supported_feature_categories_summary_data = dict()
total_number_unique_not_supported_features = len(
    set(not_supported_queries_df.feature_id)
)

# These are the not supported categories
for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_features = len(set(category_df.feature_id))
    not_supported_feature_categories_summary_data[category][
        "number_unique_not_supported_category_features"
    ] = number_unique_not_supported_category_features

    # Fraction
    fraction_not_supported_category_feature_of_moa = (
        f"{number_unique_not_supported_category_features} / {total_len_features}"
    )
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_moa"
    ] = fraction_not_supported_category_feature_of_moa

    # Percent
    percent_not_supported_category_feature_of_moa = f"{number_unique_not_supported_category_features / total_len_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_moa"
    ] = percent_not_supported_category_feature_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features} / {total_number_unique_not_supported_features}"
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_total_not_supported"
    ] = fraction_not_supported_category_feature_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features / total_number_unique_not_supported_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_total_not_supported"
    ] = percent_not_supported_category_feature_of_total_not_supported

    not_supported_feature_analysis_summary["Count of MOA Features per Category"].append(
        number_unique_not_supported_category_features
    )
    not_supported_feature_analysis_summary["Fraction of all MOA Features"].append(
        fraction_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Percent of all MOA Features"].append(
        percent_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Fraction of Not Supported Features"].append(
        fraction_not_supported_category_feature_of_total_not_supported
    )
    not_supported_feature_analysis_summary["Percent of Not Supported Features"].append(
        percent_not_supported_category_feature_of_total_not_supported
    )

In [48]:
not_supported_variant_df = pd.DataFrame(not_supported_feature_analysis_summary)
not_supported_variant_df

Unnamed: 0,Category,Count of MOA Features per Category,Fraction of all MOA Features,Percent of all MOA Features,Fraction of Not Supported Features,Percent of Not Supported Features
0,Sequence,127,127 / 452,28.10%,127 / 256,49.61%
1,Genotype/Haplotype,0,0 / 452,0.00%,0 / 256,0.00%
2,Fusion,0,0 / 452,0.00%,0 / 256,0.00%
3,Rearrangement,35,35 / 452,7.74%,35 / 256,13.67%
4,Epigenetic Modification,0,0 / 452,0.00%,0 / 256,0.00%
5,Copy Number,23,23 / 452,5.09%,23 / 256,8.98%
6,Expression,11,11 / 452,2.43%,11 / 256,4.30%
7,Gene Function,8,8 / 452,1.77%,8 / 256,3.12%
8,Region-Defined,40,40 / 452,8.85%,40 / 256,15.62%
9,Genome Feature,0,0 / 452,0.00%,0 / 256,0.00%


### <a id='toc1_6_3_'></a>[Not Supported Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc0_)

List the variant categories present among the not supported features.

In [49]:
not_supported_feature_categories = not_supported_queries_df.category.unique()
[v for v in not_supported_feature_categories]

['Rearrangement',
 'Sequence',
 'Gene Function',
 'Region-Defined',
 'Other',
 'Copy Number',
 'Expression']

In [50]:
assertion_analysis(moa_df, not_supported_queries_df, VariantNormType.NOT_SUPPORTED)

Number of Not Supported Feature Assertions in MOA: 616 / 1007
Percent of Not Supported Feature Assertions in MOA: 61.17%
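
The Normalized and Not Supported groups should partition MOA, so their counts should add up to the totals. The sketch below checks this with the numbers printed in this notebook; the variable names are hypothetical.

```python
# Counts copied from the outputs in this notebook
total_features, total_assertions = 452, 1007
normalized = {"features": 196, "assertions": 391}
not_supported = {"features": 256, "assertions": 616}

features_sum = normalized["features"] + not_supported["features"]
assertions_sum = normalized["assertions"] + not_supported["assertions"]

print(features_sum == total_features)  # True
print(assertions_sum == total_assertions)  # True
print(f"{not_supported['assertions'] / total_assertions * 100:.2f}%")  # 61.17%
```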


In [51]:
not_supported_feature_assertion_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of all MOA Assertions": [],
    "Percent of all MOA Assertions": [],
    "Fraction of Not Supported Feature Assertions": [],
    "Percent of Not Supported Feature Assertions": [],
}

In [52]:
not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

In [53]:
not_supported_feature_categories_assertion_summary_data = dict()

not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_assertion_summary_data[category] = {}

    # Filter by digest because several feature IDs can share one feature digest
    tmp_df = moa_df[moa_df["feature_digest"].isin(not_supported_feature_ids)]

    evidence_category_df = tmp_df[tmp_df.category == category]

    evidence_category_df = evidence_category_df.drop_duplicates(subset=["assertion_id"])

    # Count for Not Supported Feature Assertions
    total_number_not_supported_feature_unique_assertions = len(tmp_df.assertion_id)

    # Count per Category
    number_unique_not_supported_category_assertion = len(
        set(evidence_category_df.assertion_id)
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "number_unique_not_supported_category_assertion"
    ] = number_unique_not_supported_category_assertion

    # Fraction
    fraction_not_supported_category_feature_assertion_of_moa = (
        f"{number_unique_not_supported_category_assertion} / {total_len_assertions}"
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_moa"
    ] = fraction_not_supported_category_feature_assertion_of_moa

    # Percent
    percent_not_supported_category_feature_assertion_of_moa = f"{number_unique_not_supported_category_assertion / total_len_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_moa"
    ] = percent_not_supported_category_feature_assertion_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion} / {total_number_not_supported_feature_unique_assertions}"
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_total_not_supported"
    ] = fraction_not_supported_category_feature_assertion_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion / total_number_not_supported_feature_unique_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_total_not_supported"
    ] = percent_not_supported_category_feature_assertion_of_total_not_supported

    not_supported_feature_assertion_summary[
        "Count of MOA Assertions per Category"
    ].append(number_unique_not_supported_category_assertion)
    not_supported_feature_assertion_summary["Fraction of all MOA Assertions"].append(
        fraction_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary["Percent of all MOA Assertions"].append(
        percent_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary[
        "Fraction of Not Supported Feature Assertions"
    ].append(fraction_not_supported_category_feature_assertion_of_total_not_supported)
    not_supported_feature_assertion_summary[
        "Percent of Not Supported Feature Assertions"
    ].append(percent_not_supported_category_feature_assertion_of_total_not_supported)
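
The counting pattern in the loop above can be illustrated on a toy table. This is a minimal sketch rather than the notebook's data: `toy_moa_df`, its values, and `toy_not_supported_ids` are invented for illustration, and only the column names `feature_digest`, `assertion_id`, and `category` come from the cell above. It also swaps the explicit loop for `groupby(...).nunique()`, which should give the same per-category counts of unique assertions.

```python
import pandas as pd

# Hypothetical toy data: assertion 1 appears twice to mimic duplicate rows
toy_moa_df = pd.DataFrame(
    {
        "assertion_id": [1, 1, 2, 3, 4],
        "feature_digest": ["a", "a", "a", "b", "c"],
        "category": ["Sequence", "Sequence", "Sequence", "Copy Number", "Sequence"],
    }
)
toy_not_supported_ids = {"a", "b"}  # hypothetical Not Supported feature digests

# Keep only rows whose feature is in the Not Supported set
toy_tmp_df = toy_moa_df[toy_moa_df["feature_digest"].isin(toy_not_supported_ids)]

# Denominator: unique assertions across all Not Supported features
toy_total = toy_tmp_df["assertion_id"].nunique()

# Numerator: unique assertions per category
toy_counts = toy_tmp_df.groupby("category")["assertion_id"].nunique().to_dict()

for category, count in toy_counts.items():
    print(f"{category}: {count} / {toy_total} ({count / toy_total * 100:.2f}%)")
```

Counting with `nunique()` keeps a repeated `assertion_id` from inflating either the numerator or the denominator.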

In [54]:
number_unique_not_supported_category_features

0

### <a id='toc1_6_4_'></a>[Impact Score Analysis by Subcategory](#toc0_)

In [55]:
not_supported_impact_summary = {
    "Category": NOT_SUPPORTED_VARIANT_CATEGORY_VALUES,
    "MOA Total Sum Impact Score": [],
    "Average Impact Score per Feature": [],
    "Average Impact Score per Assertion": [],
    "Total Number Assertions": [
        v["number_unique_not_supported_category_assertion"]
        for v in not_supported_feature_categories_assertion_summary_data.values()
    ],
    "Total Number Features": [
        v["number_unique_not_supported_category_features"]
        for v in not_supported_feature_categories_summary_data.values()
    ],
}

In [56]:
not_supported_feature_categories_impact_data = dict()
for category in NOT_SUPPORTED_VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_impact_data[category] = {}
    impact_category_df = not_supported_queries_df[
        not_supported_queries_df["category"] == category
    ].copy()

    total_sum_not_supported_category_impact = impact_category_df["impact_score"].sum()

    not_supported_feature_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact

    number_unique_not_supported_category_features = (
        impact_category_df.feature_id.nunique()
    )
    number_unique_not_supported_category_assertion = (
        impact_category_df.assertion_id.nunique()
    )

    if number_unique_not_supported_category_features == 0:
        # No features in this category, so avoid dividing by zero
        avg_impact_score_feature = 0
        avg_impact_score_assertion = 0
    else:
        avg_impact_score_feature = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_features:.2f}"
        avg_impact_score_assertion = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_assertion:.2f}"

    not_supported_feature_categories_impact_data[category][
        "avg_impact_score_feature"
    ] = avg_impact_score_feature
    not_supported_feature_categories_impact_data[category][
        "avg_impact_score_evidence"
    ] = avg_impact_score_assertion

    not_supported_impact_summary["MOA Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Feature"].append(
        avg_impact_score_feature
    )
    not_supported_impact_summary["Average Impact Score per Assertion"].append(
        avg_impact_score_assertion
    )

    print(
        f"Number of unique features within category: {number_unique_not_supported_category_features}"
    )
    print(
        f"{category}: {total_sum_not_supported_category_impact}, {avg_impact_score_feature}, {avg_impact_score_assertion}"
    )

Number of unique features within category: 127
Sequence: 767.5, 6.04, 6.04
Number of unique features within category: 0
Genotype/Haplotype: 0, 0, 0
Number of unique features within category: 0
Fusion: 0, 0, 0
Number of unique features within category: 35
Rearrangement: 261.0, 7.46, 7.46
Number of unique features within category: 0
Epigenetic Modification: 0, 0, 0
Number of unique features within category: 23
Copy Number: 50.0, 2.17, 2.17
Number of unique features within category: 11
Expression: 11, 1.00, 1.00
Number of unique features within category: 8
Gene Function: 47, 5.88, 5.88
Number of unique features within category: 40
Region-Defined: 267.0, 6.67, 6.67
Number of unique features within category: 0
Genome Feature: 0, 0, 0
Number of unique features within category: 12
Other: 77.0, 6.42, 6.42
Number of unique features within category: 0
Transcript: 0, 0, 0
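
The printed values can be cross-checked with plain arithmetic. The sketch below uses only numbers copied from the output above; the names `printed`, `averages`, and `total_features` are illustrative and do not appear in the notebook. It recomputes the per-feature averages and confirms that the non-empty categories together account for the 256 Not Supported features reported in Summary Table 1.

```python
# (category, total impact score, unique feature count) copied from the printed output
printed = [
    ("Sequence", 767.5, 127),
    ("Rearrangement", 261.0, 35),
    ("Copy Number", 50.0, 23),
    ("Expression", 11, 11),
    ("Gene Function", 47, 8),
    ("Region-Defined", 267.0, 40),
    ("Other", 77.0, 12),
]

# Average impact score per feature, formatted the same way as in the loop
averages = {name: f"{total / n_features:.2f}" for name, total, n_features in printed}
for name, avg in averages.items():
    print(f"{name}: {avg}")

# Sum of the per-category feature counts
total_features = sum(n_features for _, _, n_features in printed)
print(f"Total Not Supported features: {total_features}")
```

In every printed row the per-assertion average equals the per-feature average, which suggests that `impact_category_df.assertion_id.nunique()` matched the feature count for each category in this run.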


In [57]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)

In [58]:
not_supported_feature_impact_df

| | Category | MOA Total Sum Impact Score | Average Impact Score per Feature | Average Impact Score per Assertion | Total Number Assertions | Total Number Features |
|---|---|---|---|---|---|---|
| 0 | Sequence | 767.5 | 6.04 | 6.04 | 328 | 127 |
| 1 | Genotype/Haplotype | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 2 | Fusion | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 3 | Rearrangement | 261.0 | 7.46 | 7.46 | 85 | 35 |
| 4 | Epigenetic Modification | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 5 | Copy Number | 50.0 | 2.17 | 2.17 | 37 | 23 |
| 6 | Expression | 11.0 | 1.0 | 1.0 | 12 | 11 |
| 7 | Gene Function | 47.0 | 5.88 | 5.88 | 14 | 8 |
| 8 | Region-Defined | 267.0 | 6.67 | 6.67 | 109 | 40 |
| 9 | Genome Feature | 0.0 | 0.0 | 0.0 | 0 | 0 |


In [59]:
not_supported_feature_impact_df.to_csv(
    "output/not_supported_feature_impact_df.csv", index=False
)

# <a id='toc2_'></a>[MOA Summary](#toc0_)

## <a id='toc2_1_'></a>[Feature (Variant) Analysis](#toc0_)

### <a id='toc2_1_1_'></a>[Building Summary Tables 1 - 3](#toc0_)

In [60]:
all_features_df = pd.DataFrame(feature_analysis_summary)

In [61]:
all_features_df = combine_frac_perc(all_features_df, ["all MOA Features"])
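
`combine_frac_perc` is defined earlier in the notebook and is not shown in this section. Judging only from the outputs below, it appears to merge each "Fraction of X" and "Percent of X" column pair into a single "Percent of X" column of the form `a / b (p%)`. The following is a hypothetical stand-in written under that assumption. The name `combine_frac_perc_sketch` and its body are illustrative and may differ from the notebook's actual helper.

```python
import pandas as pd


def combine_frac_perc_sketch(df: pd.DataFrame, suffixes: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for the notebook's combine_frac_perc helper."""
    out = df.copy()
    for suffix in suffixes:
        frac_col = f"Fraction of {suffix}"
        perc_col = f"Percent of {suffix}"
        # "196 / 452" and "43.36%" become "196 / 452 (43.36%)"
        out[perc_col] = out[frac_col] + " (" + out[perc_col] + ")"
        out = out.drop(columns=[frac_col])
    return out


# Values copied from Summary Table 1
example_df = pd.DataFrame(
    {
        "Variant Category": ["Normalized", "Not Supported"],
        "Count of MOA Features per Category": [196, 256],
        "Fraction of all MOA Features": ["196 / 452", "256 / 452"],
        "Percent of all MOA Features": ["43.36%", "56.64%"],
    }
)
combined_df = combine_frac_perc_sketch(example_df, ["all MOA Features"])
print(combined_df["Percent of all MOA Features"].tolist())
```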

In [62]:
for_merge_all_variant_percent_of_moa_df = all_features_df
all_features_percent_of_moa_df = all_features_df.drop(
    "Count of MOA Features per Category", axis=1
)

In [63]:
for_merge_all_variant_percent_of_moa_df.to_csv(
    "output/for_merge_all_variant_percent_of_moa_df.csv",
    index=False,
)

### <a id='toc2_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the two categories that MOA features (variants) fall into after normalization and the percent of all MOA features (variants) that each category makes up.

<ins>Numerator:</ins> # of MOA Features (variants) that are Normalized or Not Supported
<br><ins>Denominator:</ins> # of all MOA Features (variants)

In [64]:
all_features_percent_of_moa_df = all_features_percent_of_moa_df.set_index(
    "Variant Category"
)
all_features_percent_of_moa_df

| Variant Category | Percent of all MOA Features |
|---|---|
| Normalized | 196 / 452 (43.36%) |
| Not Supported | 256 / 452 (56.64%) |


In [65]:
moa_summary_table_1 = all_features_percent_of_moa_df

### <a id='toc2_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the categories that the Not Supported features (variants) were broken into and what percent of all MOA features (variants) they make up.

<ins>Numerator:</ins> # of MOA Features (variants) that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of all MOA Features (variants)

In [66]:
not_supported_features_df = pd.DataFrame(not_supported_feature_analysis_summary)

In [67]:
not_supported_features_total_df = combine_frac_perc(
    not_supported_features_df, ["all MOA Features"]
)

In [68]:
for_merge_not_supported_features_total_df = not_supported_features_total_df[
    [
        "Category",
        "Count of MOA Features per Category",
        "Percent of all MOA Features",
    ]
].copy()

In [69]:
not_supported_features_total_df = (
    not_supported_features_total_df[
        [
            "Category",
            "Percent of all MOA Features",
        ]
    ]
    .copy()
    .set_index("Category")
)
not_supported_features_total_df

| Category | Percent of all MOA Features |
|---|---|
| Sequence | 127 / 452 (28.10%) |
| Genotype/Haplotype | 0 / 452 (0.00%) |
| Fusion | 0 / 452 (0.00%) |
| Rearrangement | 35 / 452 (7.74%) |
| Epigenetic Modification | 0 / 452 (0.00%) |
| Copy Number | 23 / 452 (5.09%) |
| Expression | 11 / 452 (2.43%) |
| Gene Function | 8 / 452 (1.77%) |
| Region-Defined | 40 / 452 (8.85%) |
| Genome Feature | 0 / 452 (0.00%) |


In [70]:
moa_summary_table_2 = not_supported_features_total_df

In [71]:
for_merge_not_supported_features_total_df.to_csv(
    "output/for_merge_not_supported_features_total_df.csv",
    index=False,
)

### <a id='toc2_1_4_'></a>[Summary Table 3](#toc0_)

The table below shows the categories that the Not Supported features (variants) were broken into and what percent of the Not Supported variant group each subcategory makes up.

<ins>Numerator:</ins> # of MOA Features (variants) that are Not Supported in a given Subcategory
<br><ins>Denominator:</ins> # of MOA Features (variants) that are Not Supported

In [72]:
not_supported_features_category_df = combine_frac_perc(
    not_supported_features_df, ["Not Supported Features"]
)

In [73]:
not_supported_features_category_df = not_supported_features_category_df[
    ["Category", "Percent of Not Supported Features"]
]
not_supported_features_category_df = not_supported_features_category_df.set_index(
    "Category"
)
not_supported_features_category_df

| Category | Percent of Not Supported Features |
|---|---|
| Sequence | 127 / 256 (49.61%) |
| Genotype/Haplotype | 0 / 256 (0.00%) |
| Fusion | 0 / 256 (0.00%) |
| Rearrangement | 35 / 256 (13.67%) |
| Epigenetic Modification | 0 / 256 (0.00%) |
| Copy Number | 23 / 256 (8.98%) |
| Expression | 11 / 256 (4.30%) |
| Gene Function | 8 / 256 (3.12%) |
| Region-Defined | 40 / 256 (15.62%) |
| Genome Feature | 0 / 256 (0.00%) |


In [74]:
moa_summary_table_3 = not_supported_features_category_df
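
Summary Tables 2 and 3 share the same numerators and differ only in the denominator. As a worked example, take the Sequence row (127 features), with 452 total MOA features and 256 Not Supported features as reported in the tables above:

$$
\frac{127}{452} \times 100 \approx 28.10\% \quad \text{(Summary Table 2)}, \qquad \frac{127}{256} \times 100 \approx 49.61\% \quad \text{(Summary Table 3)}
$$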

## <a id='toc2_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc2_2_1_'></a>[Building Summary Table 4](#toc0_)

In [75]:
all_features_assertions_df = pd.DataFrame(assertion_analysis_summary)

In [76]:
all_features_assertions_df = combine_frac_perc(
    all_features_assertions_df, ["all MOA Assertions"]
)

In [77]:
for_merge_all_features_assertions_df = all_features_assertions_df
all_features_assertions_df = for_merge_all_features_assertions_df.drop(
    columns=["Count of MOA Assertions per Category"]
)

In [78]:
for_merge_all_features_assertions_df.to_csv(
    "output/for_merge_all_features_assertions_df.csv",
    index=False,
)

### <a id='toc2_2_2_'></a>[Summary Table 4](#toc0_)

The table below shows what percent of all assertions (evidence items) in MOA are associated with Normalized and Not Supported features (variants).

<ins>Numerator:</ins> # of MOA Assertions (evidence items) based on normalization status of associated features (variants)
<br><ins>Denominator:</ins> # of all MOA Assertions (evidence items)

In [79]:
all_features_assertions_df = all_features_assertions_df.set_index("Variant Category")
moa_summary_table_4 = all_features_assertions_df
moa_summary_table_4

| Variant Category | Percent of all MOA Assertions |
|---|---|
| Normalized | 391 / 1007 (38.83%) |
| Not Supported | 616 / 1007 (61.17%) |


### <a id='toc2_2_3_'></a>[Building Summary Tables 5 & 6](#toc0_)

In [80]:
not_supported_feature_assertion_df = pd.DataFrame(
    not_supported_feature_assertion_summary
)

In [81]:
not_supported_feature_assertion_df = combine_frac_perc(
    not_supported_feature_assertion_df, ["all MOA Assertions"]
)

In [82]:
not_supported_feature_assertion_df = combine_frac_perc(
    not_supported_feature_assertion_df, ["Not Supported Feature Assertions"]
)

In [83]:
for_merge_not_supported_feature_assertion_df = not_supported_feature_assertion_df.drop(
    ["Percent of Not Supported Feature Assertions"], axis=1
)

not_supported_feature_assertion_of_moa_df = (
    for_merge_not_supported_feature_assertion_df.drop(
        ["Count of MOA Assertions per Category"], axis=1
    )
)

not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_df.drop(
        ["Percent of all MOA Assertions", "Count of MOA Assertions per Category"],
        axis=1,
    )
)

In [84]:
for_merge_not_supported_feature_assertion_df.to_csv(
    "output/for_merge_not_supported_feature_assertion_df.csv",
    index=False,
)

### <a id='toc2_2_4_'></a>[Summary Table 5](#toc0_)

The table below shows the percent of all MOA assertions (evidence items) that are associated with each Not Supported variant subcategory.

<ins>Numerator:</ins> # of MOA Assertions (evidence items) associated with Not Supported features (variants) in a given Subcategory
<br><ins>Denominator:</ins> # of all MOA Assertions (evidence items)

In [85]:
not_supported_feature_assertion_of_moa_df = (
    not_supported_feature_assertion_of_moa_df.set_index("Category")
)
moa_summary_table_5 = not_supported_feature_assertion_of_moa_df
moa_summary_table_5

| Category | Percent of all MOA Assertions |
|---|---|
| Sequence | 328 / 1007 (32.57%) |
| Genotype/Haplotype | 0 / 1007 (0.00%) |
| Fusion | 0 / 1007 (0.00%) |
| Rearrangement | 85 / 1007 (8.44%) |
| Epigenetic Modification | 0 / 1007 (0.00%) |
| Copy Number | 37 / 1007 (3.67%) |
| Expression | 12 / 1007 (1.19%) |
| Gene Function | 14 / 1007 (1.39%) |
| Region-Defined | 109 / 1007 (10.82%) |
| Genome Feature | 0 / 1007 (0.00%) |


### <a id='toc2_2_5_'></a>[Summary Table 6](#toc0_)

The table below shows the percent of MOA assertions (evidence items) associated with Not Supported features (variants) that belong to each variant subcategory.

<ins>Numerator:</ins> # of MOA Assertions (evidence items) associated with Not Supported features (variants) in a given Subcategory
<br><ins>Denominator:</ins> # of MOA Assertions (evidence items) associated with all Not Supported features (variants)
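
The two assertion-level views differ only in their denominators, which the short sketch below makes concrete. The counts and totals are copied from the tables in this section. The helper `frac_perc` and the constant names are illustrative and are not part of the notebook.

```python
# Unique assertion counts per subcategory, copied from the tables in this section
assertion_counts = {
    "Sequence": 328,
    "Rearrangement": 85,
    "Copy Number": 37,
    "Expression": 12,
    "Gene Function": 14,
    "Region-Defined": 109,
}
TOTAL_MOA_ASSERTIONS = 1007  # denominator for Summary Table 5
TOTAL_NOT_SUPPORTED_ASSERTIONS = 616  # denominator for Summary Table 6


def frac_perc(numerator: int, denominator: int) -> str:
    """Format a count as 'numerator / denominator (percent%)'."""
    return f"{numerator} / {denominator} ({numerator / denominator * 100:.2f}%)"


for category, count in assertion_counts.items():
    of_moa = frac_perc(count, TOTAL_MOA_ASSERTIONS)
    of_not_supported = frac_perc(count, TOTAL_NOT_SUPPORTED_ASSERTIONS)
    print(f"{category}: {of_moa} | {of_not_supported}")
```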

In [86]:
not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_of_not_supported_df.set_index("Category")
)
moa_summary_table_6 = not_supported_feature_assertion_of_not_supported_df
moa_summary_table_6

| Category | Percent of Not Supported Feature Assertions |
|---|---|
| Sequence | 328 / 616 (53.25%) |
| Genotype/Haplotype | 0 / 616 (0.00%) |
| Fusion | 0 / 616 (0.00%) |
| Rearrangement | 85 / 616 (13.80%) |
| Epigenetic Modification | 0 / 616 (0.00%) |
| Copy Number | 37 / 616 (6.01%) |
| Expression | 12 / 616 (1.95%) |
| Gene Function | 14 / 616 (2.27%) |
| Region-Defined | 109 / 616 (17.69%) |
| Genome Feature | 0 / 616 (0.00%) |


## <a id='toc2_3_'></a>[Impact](#toc0_)

The bar graph below shows the total impact score for each Not Supported variant subcategory. The bar color indicates the number of assertions (evidence items) associated with each subcategory.

In [87]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)
not_supported_feature_impact_df

| | Category | MOA Total Sum Impact Score | Average Impact Score per Feature | Average Impact Score per Assertion | Total Number Assertions | Total Number Features |
|---|---|---|---|---|---|---|
| 0 | Sequence | 767.5 | 6.04 | 6.04 | 328 | 127 |
| 1 | Genotype/Haplotype | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 2 | Fusion | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 3 | Rearrangement | 261.0 | 7.46 | 7.46 | 85 | 35 |
| 4 | Epigenetic Modification | 0.0 | 0.0 | 0.0 | 0 | 0 |
| 5 | Copy Number | 50.0 | 2.17 | 2.17 | 37 | 23 |
| 6 | Expression | 11.0 | 1.0 | 1.0 | 12 | 11 |
| 7 | Gene Function | 47.0 | 5.88 | 5.88 | 14 | 8 |
| 8 | Region-Defined | 267.0 | 6.67 | 6.67 | 109 | 40 |
| 9 | Genome Feature | 0.0 | 0.0 | 0.0 | 0 | 0 |


In [88]:
not_supported_feature_impact_df.to_csv(
    "output/not_supported_feature_impact_df.csv", index=False
)

In [89]:
fig3 = px.bar(
    not_supported_feature_impact_df,
    x="Category",
    y="MOA Total Sum Impact Score",
    hover_data=["Total Number Assertions"],
    color="Total Number Assertions",
    labels={"MOA Total Sum Impact Score": "MOA Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig3.update_traces(width=1)
fig3.show()

In [90]:
fig3.write_html("output/moa_ns_categories_impact_redgreen.html")

The scatterplot below shows the relationship between the total impact score of each Not Supported variant subcategory and the number of assertions (evidence items) associated with features (variants) in that subcategory. The size of each data point represents the number of features (variants) in the subcategory.

In [91]:
fig2 = px.scatter(
    data_frame=not_supported_feature_impact_df,
    x="Total Number Assertions",
    y="MOA Total Sum Impact Score",
    size="Total Number Features",
    size_max=40,
    text="Total Number Features",
    color="Category",
)
fig2.show()

In [92]:
fig2.write_html("output/moa_ns_categories_impact_scatterplot.html")