# <a id='toc1_'></a>[Molecular Oncology Almanac Assertion Analysis](#toc0_)

MOA evidence items are referred to as assertions and MOA variants are referred to as features in this analysis. 

**Table of contents**<a id='toc0_'></a>    
- [Molecular Oncology Almanac Assertion Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
  - [Create analysis functions / global variables](#toc1_2_)    
  - [All Features (Variants) Analysis](#toc1_3_)    
    - [Creating a table with feature (variant) and assertion (evidence) information](#toc1_3_1_)    
    - [Converting feature (variant) types to normalized categories](#toc1_3_2_)    
    - [Adding a numerical impact score based on the predictive implication](#toc1_3_3_)    
    - [Impact Score Analysis](#toc1_3_4_)    
    - [Features (Variants) Analysis](#toc1_3_5_)    
    - [Assertions (Evidence Items) Analysis](#toc1_3_6_)    
    - [Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc1_3_7_)    
  - [Create functions / global variables used in analysis](#toc1_4_)    
  - [Normalized Analysis](#toc1_5_)    
  - [Not Supported Analysis](#toc1_6_)    
    - [Feature (Variant) Analysis](#toc1_6_1_)    
    - [Not Supported Feature (Variant) Analysis by Subcategory](#toc1_6_2_)    
    - [Not Support Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc1_6_3_)    
    - [Impact Score Analysis by Subcategory](#toc1_6_4_)    
- [MOA Summary](#toc2_)    
  - [Feature (Variant) Analysis](#toc2_1_)    
    - [Building Summary Tables 1 - 3](#toc2_1_1_)    
    - [Summary Table 1](#toc2_1_2_)    
    - [Summary Table 2](#toc2_1_3_)    
    - [Summary Table 3](#toc2_1_4_)    
  - [Evidence Analysis](#toc2_2_)    
    - [Building Summary Table 4](#toc2_2_1_)    
    - [Summary Table 4](#toc2_2_2_)    
    - [Building Sumary Tables 5 & 6](#toc2_2_3_)    
    - [Summary Table 5](#toc2_2_4_)    
    - [Summary Table 6](#toc2_2_5_)    
  - [Impact](#toc2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [3]:
from enum import Enum
from typing import Dict
import json
from pathlib import Path
import pandas as pd
import plotly.express as px
import requests
from ga4gh.core import sha512t24u

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [4]:
path = Path("moa_assertion_analysis_output")
path.mkdir(exist_ok = True)

## <a id='toc1_2_'></a>[Create analysis functions / global variables](#toc0_)

In [5]:
def get_feature_digest(feature: Dict) -> str:
    """Get digest for feature

    :param feature: MOA feature
    :return: Digest
    """
    attrs = json.dumps(
        feature["attributes"][0], sort_keys=True, separators=(",", ":"), indent=None
    ).encode("utf-8")
    return sha512t24u(attrs)

In [6]:
class VariantNormType(str, Enum):
    """Variation Normalization types"""

    NORMALIZED = "Normalized"
    NOT_SUPPORTED = "Not Supported"


VARIANT_NORM_TYPE_VALUES = [v.value for v in VariantNormType.__members__.values()]

In [7]:
class VariantCategory(str, Enum):
    """Create enum for the kind of variants that are in CIViC."""
    EXPRESSION = "Expression Variants"
    EPIGENETIC_MODIFICATION = "Epigenetic Modification"
    FUSION = "Fusion Variants"
    SEQUENCE_VARS = "Sequence Variants"
    GENE_FUNC = "Gene Function Variants"
    REARRANGEMENTS = "Rearrangement Variants"
    COPY_NUMBER = "Copy Number Variants"
    OTHER = "Other Variants"
    GENOTYPES = "Genotype Variants"
    REGION_DEFINED_VAR = "Region Defined Variants"
    TRANSCRIPT_VAR = "Transcript Variants"  # no attempt to normalize these ones, since there is no query we could use


VARIANT_CATEGORY_VALUES = [v.value for v in VariantCategory.__members__.values()]

## <a id='toc1_3_'></a>[All Features (Variants) Analysis](#toc0_)

### <a id='toc1_3_1_'></a>[Creating a table with feature (variant) and assertion (evidence) information](#toc0_)

In [8]:
# Create dictionary for MOA Feature ID -> Feature Type
r = requests.get("https://moalmanac.org/api/features")
if r.status_code == 200:
    feature_data = r.json()

features = {}

for feature in feature_data:
    feature_id = feature["feature_id"]
    digest = get_feature_digest(feature)
    features[digest] = feature["feature_type"]

count_unique_feature_ids = len(features.keys())
print(count_unique_feature_ids)

428


In [9]:
# Create DF for assertions and their associated feature + predictive implication
r = requests.get("https://moalmanac.org/api/assertions")
if r.status_code == 200:
    assertion_data = r.json()

transformed = []

# Mapping from feature ID to feature digest
feature_id_to_digest = {}

for assertion in assertion_data:
    assertion_id = assertion["assertion_id"]
    predictive_implication = assertion["predictive_implication"]

    if len(assertion["features"]) != 1:
        print(f"assertion id ({assertion_id}) does not have 1 feature")
        continue

    feature = assertion["features"][0]
    feature_id = feature["feature_id"]
    feature_digest = get_feature_digest(feature)

    feature_id_to_digest[feature_id] = digest

    transformed.append(
        {
            "assertion_id": assertion_id,
            "feature_id": feature_id,
            "feature_type": features[feature_digest],
            "predictive_implication": predictive_implication,
            "feature_digest": feature_digest,
        }
    )
moa_df = pd.DataFrame(transformed)
moa_df

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP
...,...,...,...,...,...
893,894,894,germline_variant,FDA-Approved,5UD28MjD2bPMEnlA6enWWTEftU7JVXzi
894,895,895,somatic_variant,FDA-Approved,OQ7B9XkAYOPvcmJES3ULOTn7Ai9ZQec9
895,896,896,germline_variant,FDA-Approved,i3K0tcuhFKzBkb4jkwHaU0_NLz4NKKJj
896,897,897,somatic_variant,FDA-Approved,3CBFHlLnAU0h_OurUo6SY_Wlo04P03N5


In [10]:
unique_features_df = moa_df.sort_values("feature_id").drop_duplicates("feature_digest")
len_unique_feature_ids = len(list(unique_features_df.feature_id))

In [11]:
total_len_features = len(moa_df.feature_digest.unique())
f"Total number of unique features (variants): {total_len_features}"

'Total number of unique features (variants): 428'

In [12]:
assert total_len_features == len_unique_feature_ids

In [13]:
total_len_assertions = len(moa_df.assertion_id.unique())
f"Total number of unique assertions (evidence items): {total_len_assertions}"

'Total number of unique assertions (evidence items): 898'

### <a id='toc1_3_2_'></a>[Converting feature (variant) types to normalized categories](#toc0_)

In [14]:
list(moa_df.feature_type.unique())

['rearrangement',
 'somatic_variant',
 'germline_variant',
 'copy_number',
 'microsatellite_stability',
 'mutational_signature',
 'mutational_burden',
 'knockdown',
 'aneuploidy']

In [15]:
moa_df["category"] = moa_df["feature_type"].copy()

moa_df["category"] = moa_df["category"].replace("rearrangement", VariantCategory.REARRANGEMENTS.value)
moa_df["category"] = moa_df["category"].replace("aneuploidy", VariantCategory.COPY_NUMBER.value)
moa_df["category"] = moa_df["category"].replace("knockdown", VariantCategory.EXPRESSION.value)
moa_df["category"] = moa_df["category"].replace("somatic_variant", VariantCategory.SEQUENCE_VARS.value)
moa_df["category"] = moa_df["category"].replace("germline_variant", VariantCategory.SEQUENCE_VARS.value)
moa_df["category"] = moa_df["category"].replace("microsatellite_stability", VariantCategory.REARRANGEMENTS.value)
moa_df["category"] = moa_df["category"].replace("mutational_burden", VariantCategory.OTHER.value)
moa_df["category"] = moa_df["category"].replace("mutational_signature", VariantCategory.OTHER.value)
moa_df["category"] = moa_df["category"].replace("copy_number", VariantCategory.COPY_NUMBER.value)

moa_df.head()

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants


In [16]:
list(moa_df.category.unique())

['Rearrangement Variants',
 'Sequence Variants',
 'Copy Number Variants',
 'Other Variants',
 'Expression Variants']

### <a id='toc1_3_3_'></a>[Adding a numerical impact score based on the predictive implication](#toc0_)
This is based on the structure of MOA scoring

In [17]:
predictive_implication_categories = moa_df.predictive_implication.unique()
list(predictive_implication_categories)

['FDA-Approved',
 'Guideline',
 'Clinical trial',
 'Preclinical',
 'Inferential',
 'Clinical evidence']

In [18]:
moa_df["impact_score"] = moa_df["predictive_implication"].copy()

moa_df.loc[moa_df["impact_score"] == "FDA-Approved", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Guideline", "impact_score"] = 10
moa_df.loc[moa_df["impact_score"] == "Clinical evidence", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Clinical trial", "impact_score"] = 5
moa_df.loc[moa_df["impact_score"] == "Preclinical", "impact_score"] = 1
moa_df.loc[moa_df["impact_score"] == "Inferential", "impact_score"] = 0.5

moa_df.head()

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
1,2,2,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
2,3,3,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
3,4,4,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
4,5,5,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10


### <a id='toc1_3_4_'></a>[Impact Score Analysis](#toc0_)

In [19]:
feature_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    feature_categories_impact_data[category] = {}
    impact_category_df = moa_df[moa_df.category == category]

    total_sum_category_impact = impact_category_df["impact_score"].sum()
    feature_categories_impact_data[category][
        "total_sum_category_impact"
    ] = total_sum_category_impact
    print(f"{category}: {total_sum_category_impact}")

Expression Variants: 12
Epigenetic Modification: 0
Fusion Variants: 0
Sequence Variants: 4182.5
Gene Function Variants: 0
Rearrangement Variants: 653.0
Copy Number Variants: 400.0
Other Variants: 53.5
Genotype Variants: 0
Region Defined Variants: 0
Transcript Variants: 0


### <a id='toc1_3_5_'></a>[Features (Variants) Analysis](#toc0_)

In [20]:
moa_feature_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    moa_feature_data[category] = {}
    feature_type_df = moa_df[moa_df.category == category]

    number_unique_category_features = len(set(feature_type_df.feature_digest))
    moa_feature_data[category][
        "number_unique_category_features"
    ] = number_unique_category_features

    fraction_category_feature = (
        f"{number_unique_category_features} / {total_len_features}"
    )
    moa_feature_data[category]["fraction_category_feature"] = fraction_category_feature

    percent_category_feature = (
        "{:.2f}".format(number_unique_category_features / total_len_features * 100)
        + "%"
    )
    moa_feature_data[category]["percent_category_feature"] = percent_category_feature

### <a id='toc1_3_6_'></a>[Assertions (Evidence Items) Analysis](#toc0_)

In [21]:
moa_assertion_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    moa_assertion_data[category] = {}
    assertion_type_df = moa_df[moa_df.category == category]

    number_unique_category_assertions = len(set(assertion_type_df.assertion_id))
    moa_assertion_data[category][
        "number_unique_category_assertions"
    ] = number_unique_category_assertions

    fraction_category_assertion = (
        f"{number_unique_category_assertions} / {total_len_assertions}"
    )
    moa_assertion_data[category][
        "fraction_category_assertion"
    ] = fraction_category_assertion

    percent_category_assertion = (
        "{:.2f}".format(number_unique_category_assertions / total_len_assertions * 100)
        + "%"
    )
    moa_assertion_data[category][
        "percent_category_assertion"
    ] = percent_category_assertion

### <a id='toc1_3_7_'></a>[Summaries for all Features (Variants) and Assertions (Evidence Items)](#toc0_)

In [22]:
feature_category_impact_score = [
    v["total_sum_category_impact"] for v in feature_categories_impact_data.values()
]
feature_category_number = [
    v["number_unique_category_features"] for v in moa_feature_data.values()
]
feature_category_fraction = [
    v["fraction_category_feature"] for v in moa_feature_data.values()
]
feature_category_percent = [
    v["percent_category_feature"] for v in moa_feature_data.values()
]
feature_category_assertion_number = [
    v["number_unique_category_assertions"] for v in moa_assertion_data.values()
]
feature_category_assertion_fraction = [
    v["fraction_category_assertion"] for v in moa_assertion_data.values()
]
feature_category_assertion_percent = [
    v["percent_category_assertion"] for v in moa_assertion_data.values()
]

In [23]:
feature_category_dict = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Number of Features": feature_category_number,
    "Fraction of Features": feature_category_fraction,
    "Percent of Features": feature_category_percent,
    "Number of Assertions": feature_category_assertion_number,
    "Fraction of Assertions": feature_category_assertion_fraction,
    "Percent of Assertions": feature_category_assertion_percent,
    "Impact Score": feature_category_impact_score,
}

In [24]:
moa_feature_df = pd.DataFrame(feature_category_dict)
moa_feature_df

Unnamed: 0,Category,Number of Features,Fraction of Features,Percent of Features,Number of Assertions,Fraction of Assertions,Percent of Assertions,Impact Score
0,Expression Variants,11,11 / 428,2.57%,12,12 / 898,1.34%,12.0
1,Epigenetic Modification,0,0 / 428,0.00%,0,0 / 898,0.00%,0.0
2,Fusion Variants,0,0 / 428,0.00%,0,0 / 898,0.00%,0.0
3,Sequence Variants,323,323 / 428,75.47%,679,679 / 898,75.61%,4182.5
4,Gene Function Variants,0,0 / 428,0.00%,0,0 / 898,0.00%,0.0
5,Rearrangement Variants,38,38 / 428,8.88%,82,82 / 898,9.13%,653.0
6,Copy Number Variants,47,47 / 428,10.98%,102,102 / 898,11.36%,400.0
7,Other Variants,9,9 / 428,2.10%,23,23 / 898,2.56%,53.5
8,Genotype Variants,0,0 / 428,0.00%,0,0 / 898,0.00%,0.0
9,Region Defined Variants,0,0 / 428,0.00%,0,0 / 898,0.00%,0.0


In [25]:
moa_feature_df["Percent of Features"] = (
    moa_feature_df["Fraction of Features"].astype(str)
    + " ("
    + moa_feature_df["Percent of Features"]
    + ")"
)
moa_feature_df["Percent of Assertions"] = (
    moa_feature_df["Fraction of Assertions"].astype(str)
    + " ("
    + moa_feature_df["Percent of Assertions"]
    + ")"
)

In [26]:
moa_feature_df_abbreviated = moa_feature_df.drop(
    [
        "Number of Features",
        "Fraction of Features",
        "Number of Assertions",
        "Fraction of Assertions",
    ],
    axis=1,
)

In [27]:
moa_feature_df_abbreviated = moa_feature_df_abbreviated.set_index("Category")
moa_feature_df_abbreviated

Unnamed: 0_level_0,Percent of Features,Percent of Assertions,Impact Score
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Expression Variants,11 / 428 (2.57%),12 / 898 (1.34%),12.0
Epigenetic Modification,0 / 428 (0.00%),0 / 898 (0.00%),0.0
Fusion Variants,0 / 428 (0.00%),0 / 898 (0.00%),0.0
Sequence Variants,323 / 428 (75.47%),679 / 898 (75.61%),4182.5
Gene Function Variants,0 / 428 (0.00%),0 / 898 (0.00%),0.0
Rearrangement Variants,38 / 428 (8.88%),82 / 898 (9.13%),653.0
Copy Number Variants,47 / 428 (10.98%),102 / 898 (11.36%),400.0
Other Variants,9 / 428 (2.10%),23 / 898 (2.56%),53.5
Genotype Variants,0 / 428 (0.00%),0 / 898 (0.00%),0.0
Region Defined Variants,0 / 428 (0.00%),0 / 898 (0.00%),0.0


In [28]:
fig = px.scatter(
    data_frame=moa_feature_df,
    x="Number of Assertions",
    y="Impact Score",
    size="Number of Features",
    size_max=40,
    text="Number of Features",
    color="Category",
)
fig.show()

In [31]:
fig.write_html("moa_assertion_analysis_output/moa_feature_categories_impact_scatterplot.html")

## <a id='toc1_4_'></a>[Create functions / global variables used in analysis](#toc0_)

In [32]:
feature_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percentage of all MOA Features": [],
}
feature_analysis_summary

{'Variant Category': ['Normalized', 'Not Supported'],
 'Count of MOA Features per Category': [],
 'Fraction of all MOA Features': [],
 'Percentage of all MOA Features': []}

In [33]:
def feature_analysis(
    df: pd.DataFrame, variant_norm_type: VariantNormType
) -> pd.DataFrame:
    """Do feature analysis (counts, percentages). Updates `feature_analysis_summary`

    :param df: Dataframe of variants
    :param variant_norm_type: The kind of features that are in `df`
    :return: Transformed dataframe with variant ID duplicates dropped
    """
    # Drop duplicate rows
    df = df.drop_duplicates(subset=["feature_id"])
    feature_ids = list(df["feature_id"])

    # Count
    num_features = len(feature_ids)
    fraction_features = f"{num_features} / {total_len_features}"
    print(f"\nNumber of {variant_norm_type.value} Features in MOA: {fraction_features}")

    # Percentage
    percentage_features = f"{num_features / total_len_features * 100:.2f}%"
    print(
        f"Percentage of {variant_norm_type.value} Features in MOA: {percentage_features}"
    )

    feature_analysis_summary["Count of MOA Features per Category"].append(num_features)
    feature_analysis_summary["Fraction of all MOA Features"].append(fraction_features)
    feature_analysis_summary["Percentage of all MOA Features"].append(
        percentage_features
    )

    return df

In [34]:
assertion_analysis_summary = {
    "Variant Category": VARIANT_NORM_TYPE_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of all MOA Assertions": [],
    "Percentage of all MOA Assertions": [],
}
assertion_analysis_summary

{'Variant Category': ['Normalized', 'Not Supported'],
 'Count of MOA Assertions per Category': [],
 'Fraction of all MOA Assertions': [],
 'Percentage of all MOA Assertions': []}

In [35]:
def assertion_analysis(
    all_df: pd.DataFrame,
    variant_norm_df: pd.DataFrame,
    variant_norm_type: VariantNormType,
):
    """Do evidence analysis (counts, percentages). Updates `assertion_analysis_summary`

    :param all_df: Dataframe for all assertions and features
    :param variant_norm_df: Dataframe for features given certain `variant_norm_type`
    :param variant_norm_type: The kind of variants that are in `df`
    """
    # Need to do this bc of duplicate features
    _feature_ids = set(variant_norm_df.feature_digest)
    tmp_df = all_df[all_df["feature_digest"].isin(_feature_ids)]

    # Count
    num_assertions = len(tmp_df.assertion_id)
    fraction_assertions = f"{num_assertions} / {total_len_assertions}"
    print(
        f"Number of {variant_norm_type.value} Feature Assertions in MOA: {fraction_assertions}"
    )

    # Percentage
    percentage_assertions = f"{num_assertions / total_len_assertions * 100:.2f}%"
    print(
        f"Percentage of {variant_norm_type.value} Feature Assertions in MOA: {percentage_assertions}"
    )

    assertion_analysis_summary["Count of MOA Assertions per Category"].append(
        num_assertions
    )
    assertion_analysis_summary["Fraction of all MOA Assertions"].append(
        fraction_assertions
    )
    assertion_analysis_summary["Percentage of all MOA Assertions"].append(
        percentage_assertions
    )

In [36]:
feature_id_to_digest_df = pd.DataFrame(
    feature_id_to_digest.items(), columns=["feature_id", "feature_digest"]
)
feature_id_to_digest_df

Unnamed: 0,feature_id,feature_digest
0,1,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
1,2,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
2,3,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
3,4,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
4,5,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
...,...,...
893,894,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
894,895,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
895,896,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1
896,897,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1


## <a id='toc1_5_'></a>[Normalized Analysis](#toc0_)

In [37]:
normalized_queries_df = pd.read_csv("../feature_analysis/able_to_normalize_queries.csv", sep="\t")
normalized_queries_df = pd.merge(
    normalized_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
normalized_queries_df.shape

(181, 7)

In [38]:
normalized_queries_df = pd.merge(
    normalized_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
normalized_queries_df = normalized_queries_df.drop(columns=["variant_id"])

In [39]:
normalized_queries_df = feature_analysis(
    normalized_queries_df, VariantNormType.NORMALIZED
)
normalized_queries_df


Number of Normalized Features in MOA: 181 / 428
Percentage of Normalized Features in MOA: 42.29%


Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,71,71,somatic_variant,Preclinical,KgolzM3HWhww4t4HywFYCySUtGRIQ_mx,Sequence Variants,1
1,73,73,somatic_variant,Clinical evidence,j3HtSnIdrU8CcuW8_Qs3qVxOn-kMJV1T,Sequence Variants,5
2,75,75,somatic_variant,Clinical evidence,X_Az48pPjt4IODuY2a50Yl2_1tGopcuF,Sequence Variants,5
3,76,76,somatic_variant,Clinical evidence,LQQXFXpA4FCOQ3Fz4988x2vynER4J-Wh,Sequence Variants,5
4,77,77,somatic_variant,Clinical evidence,DKoCqZUY0WBdUnoly9DL_PAjBBZTs51d,Sequence Variants,5
...,...,...,...,...,...,...,...
176,868,868,somatic_variant,FDA-Approved,fqvuveTjuO96HizOsbWgFQmfF76lGtdl,Sequence Variants,10
177,869,869,somatic_variant,FDA-Approved,1JInmjKzPW9V9q9UKen4VODk1drBadA2,Sequence Variants,10
178,870,870,somatic_variant,FDA-Approved,txWE0iDd8r36tzSRZw9tyMcMz9-L5M0g,Sequence Variants,10
179,895,895,somatic_variant,FDA-Approved,OQ7B9XkAYOPvcmJES3ULOTn7Ai9ZQec9,Sequence Variants,10


In [40]:
assertion_analysis(moa_df, normalized_queries_df, VariantNormType.NORMALIZED)

Number of Normalized Feature Assertions in MOA: 358 / 898
Percentage of Normalized Feature Assertions in MOA: 39.87%


## <a id='toc1_6_'></a>[Not Supported Analysis](#toc0_)

In [41]:
not_supported_queries_df = pd.read_csv("../feature_analysis/not_supported_variants.csv", sep="\t")
not_supported_queries_df = pd.merge(
    not_supported_queries_df,
    feature_id_to_digest_df,
    left_on="variant_id",
    right_on="feature_id",
)
not_supported_queries_df.shape

(249, 6)

In [42]:
not_supported_queries_df = pd.merge(
    not_supported_queries_df["variant_id"],
    moa_df,
    left_on="variant_id",
    right_on="feature_id",
    how="left",
)
not_supported_queries_df = not_supported_queries_df.drop(columns=["variant_id"])
not_supported_queries_df

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
1,12,12,rearrangement,FDA-Approved,g99yF3kKnB-We_fMS5RaVygoSuT7qA-I,Rearrangement Variants,10
2,15,15,rearrangement,FDA-Approved,e8PMq2A96-aBJ3Ip74ovx5VOUCztBTq7,Rearrangement Variants,10
3,18,18,rearrangement,Guideline,DxfRiRV-3J6zRON4pnzNJjXkJf2bsp20,Rearrangement Variants,10
4,21,21,rearrangement,Preclinical,BRsPjsZSCyDXnKtBt9XgsWX2JDNWY3FP,Rearrangement Variants,1
...,...,...,...,...,...,...,...
244,884,884,somatic_variant,FDA-Approved,OQ7B9XkAYOPvcmJES3ULOTn7Ai9ZQec9,Sequence Variants,10
245,889,889,somatic_variant,FDA-Approved,c3CkYcMt4ssh4AL4gpacJtFil8xl2TB2,Sequence Variants,10
246,890,890,somatic_variant,FDA-Approved,uAW4cOXId1N1MKo5fqHYdw9JGceCMmE5,Sequence Variants,10
247,891,891,somatic_variant,FDA-Approved,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1,Sequence Variants,10


### <a id='toc1_6_1_'></a>[Feature (Variant) Analysis](#toc0_)

In [43]:
not_supported_queries_df = feature_analysis(
    not_supported_queries_df, VariantNormType.NOT_SUPPORTED
)


Number of Not Supported Features in MOA: 249 / 428
Percentage of Not Supported Features in MOA: 58.18%


### <a id='toc1_6_2_'></a>[Not Supported Feature (Variant) Analysis by Subcategory](#toc0_)

In [44]:
not_supported_feature_analysis_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of MOA Features per Category": [],
    "Fraction of all MOA Features": [],
    "Percent of all MOA Features": [],
    "Fraction of Not Supported Features": [],
    "Percent of Not Supported Features": [],
}

In [45]:
not_supported_feature_categories_summary_data = dict()
total_number_unique_not_supported_features = len(
    set(not_supported_queries_df.feature_id)
)

for category in VARIANT_CATEGORY_VALUES:  # These are not supported categories
    not_supported_feature_categories_summary_data[category] = {}
    category_df = not_supported_queries_df[
        not_supported_queries_df.category == category
    ]

    # Count
    number_unique_not_supported_category_features = len(set(category_df.feature_id))
    not_supported_feature_categories_summary_data[category][
        "number_unique_not_supported_category_features"
    ] = number_unique_not_supported_category_features

    # Fraction
    fraction_not_supported_category_feature_of_moa = (
        f"{number_unique_not_supported_category_features} / {total_len_features}"
    )
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_moa"
    ] = fraction_not_supported_category_feature_of_moa

    # Percent
    percent_not_supported_category_feature_of_moa = f"{number_unique_not_supported_category_features / total_len_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_moa"
    ] = percent_not_supported_category_feature_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features} / {total_number_unique_not_supported_features}"
    not_supported_feature_categories_summary_data[category][
        "fraction_not_supported_category_feature_of_total_not_supported"
    ] = fraction_not_supported_category_feature_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_of_total_not_supported = f"{number_unique_not_supported_category_features / total_number_unique_not_supported_features * 100:.2f}%"
    not_supported_feature_categories_summary_data[category][
        "percent_not_supported_category_feature_of_total_not_supported"
    ] = percent_not_supported_category_feature_of_total_not_supported

    not_supported_feature_analysis_summary["Count of MOA Features per Category"].append(
        number_unique_not_supported_category_features
    )
    not_supported_feature_analysis_summary["Fraction of all MOA Features"].append(
        fraction_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Percent of all MOA Features"].append(
        percent_not_supported_category_feature_of_moa
    )
    not_supported_feature_analysis_summary["Fraction of Not Supported Features"].append(
        fraction_not_supported_category_feature_of_total_not_supported
    )
    not_supported_feature_analysis_summary["Percent of Not Supported Features"].append(
        percent_not_supported_category_feature_of_total_not_supported
    )

In [46]:
number_unique_not_supported_category_features

0

In [47]:
not_supported_variant_df = pd.DataFrame(not_supported_feature_analysis_summary)
not_supported_variant_df

Unnamed: 0,Category,Count of MOA Features per Category,Fraction of all MOA Features,Percent of all MOA Features,Fraction of Not Supported Features,Percent of Not Supported Features
0,Expression Variants,10,10 / 428,2.34%,10 / 249,4.02%
1,Epigenetic Modification,0,0 / 428,0.00%,0 / 249,0.00%
2,Fusion Variants,0,0 / 428,0.00%,0 / 249,0.00%
3,Sequence Variants,174,174 / 428,40.65%,174 / 249,69.88%
4,Gene Function Variants,0,0 / 428,0.00%,0 / 249,0.00%
5,Rearrangement Variants,39,39 / 428,9.11%,39 / 249,15.66%
6,Copy Number Variants,18,18 / 428,4.21%,18 / 249,7.23%
7,Other Variants,8,8 / 428,1.87%,8 / 249,3.21%
8,Genotype Variants,0,0 / 428,0.00%,0 / 249,0.00%
9,Region Defined Variants,0,0 / 428,0.00%,0 / 249,0.00%


### <a id='toc1_6_3_'></a>[Not Support Feature (Variant) Assertion (Evidence) Analysis by Subcategory](#toc0_)

List all the possible variant categories

In [48]:
not_supported_feature_categories = not_supported_queries_df.category.unique()
[v for v in not_supported_feature_categories]

['Rearrangement Variants',
 'Sequence Variants',
 'Copy Number Variants',
 'Other Variants',
 'Expression Variants']

In [49]:
not_supported_queries_df

Unnamed: 0,assertion_id,feature_id,feature_type,predictive_implication,feature_digest,category,impact_score
0,1,1,rearrangement,FDA-Approved,RnRyn89cJzVbVM93aw4OA44NIF5zblyP,Rearrangement Variants,10
1,12,12,rearrangement,FDA-Approved,g99yF3kKnB-We_fMS5RaVygoSuT7qA-I,Rearrangement Variants,10
2,15,15,rearrangement,FDA-Approved,e8PMq2A96-aBJ3Ip74ovx5VOUCztBTq7,Rearrangement Variants,10
3,18,18,rearrangement,Guideline,DxfRiRV-3J6zRON4pnzNJjXkJf2bsp20,Rearrangement Variants,10
4,21,21,rearrangement,Preclinical,BRsPjsZSCyDXnKtBt9XgsWX2JDNWY3FP,Rearrangement Variants,1
...,...,...,...,...,...,...,...
244,884,884,somatic_variant,FDA-Approved,OQ7B9XkAYOPvcmJES3ULOTn7Ai9ZQec9,Sequence Variants,10
245,889,889,somatic_variant,FDA-Approved,c3CkYcMt4ssh4AL4gpacJtFil8xl2TB2,Sequence Variants,10
246,890,890,somatic_variant,FDA-Approved,uAW4cOXId1N1MKo5fqHYdw9JGceCMmE5,Sequence Variants,10
247,891,891,somatic_variant,FDA-Approved,B5m8cSgi6w2xRCg0X_dPpQU2dwbvtXk1,Sequence Variants,10


In [50]:
assertion_analysis(moa_df, not_supported_queries_df, VariantNormType.NOT_SUPPORTED)

Number of Not Supported Feature Assertions in MOA: 594 / 898
Percentage of Not Supported Feature Assertions in MOA: 66.15%


In [51]:
not_supported_feature_assertion_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "Count of MOA Assertions per Category": [],
    "Fraction of MOA Assertions": [],
    "Percent of all MOA Assertions": [],
    "Fraction of Not Supported Feature Assertions": [],
    "Percent of Not Supported Feature Assertions": [],
}

In [115]:
not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

{'auITFECTrlGI1UTBH6-aWAQEhoaJRHMR', 'P-Yipc2TvmaRgnUAp55FjaT-LTUt3Yfc', 'Ey9F3N9DidgVt-CJ4kFR51idM9vTTbVD', 'bZxpjdSrbkgw8D0xwqG2lGCWskFFGxzg', '4qVoeZa3k3VP28q38mCC6j6uZJBBSvKt', 'ppk6p0D0ellIvd7_2TM4XxCPWlsMWzmh', '0L-ogJeoATt3NK1X8xVZ7J0vccIGkCfY', 'feFkHIOWV4kBvbDHVewGLa52U8ZBt9Uo', '5ceTclGLI0NocQ6uK3z13ZcpgytDgwyV', '5afOJbUvLSLOWF9Vurqjk70KZvuN3xF3', 'Hb1DEj27Wswiosz_Pgy_kFr7dcvzAMOS', '9LjwcHsiQCrcsdPOKDUFlf9c2FeQFW0t', 'eeg-oafZ-3D22nVAFqJNWTXyLinbZQHa', 'Tnor4FhXVjkemW3fZxO1tA77JS8-5AQq', 'l8Mac3EePI09odrKvjqIWAxUM2BZZvCk', 'ON6Ezh4WYs1PLjjduaf-ANSxWTlOay_9', 'M3nhslwAOtH53_1qKxTkH1lYwXjgZ_z9', 'KH5cDdTin0DOJUjksguu6lAlX0F7AJIy', 'c0-fFq7I9C8ewqyIgF--xqtBs-NsIQ3H', 'Ru7m2r3M04OPMvvUTWL3ZHLr0cjlyygh', 'fGmSXmL03h92tz1U9l3puJKrp6BKJd5Z', '5QoSW3rvW4ukXqbQagzzZ-Fdp3A4hZnb', 'KGCaHZMsoObEpd491MLOZVeSxa63NlYj', 'jgShxgLTJ-JqPQvcTY1vXcN96mP2jg9B', 'bTLvO0_2ddAudNKcRp8XBG4FHy-Rvxcx', 'EFnQfPffgfrNDaueeiR-8iBgECCRtOiu', '1d8qprONBVH0DCf33DxRrESsdKGmwq-w', 'YxmWusiYAa9bOE-_OVLzwg-_Rb

In [91]:
not_supported_feature_categories_assertion_summary_data = dict()
total_number_not_supported_feature_unique_assertions = len(
    set(not_supported_queries_df.assertion_id)
)
not_supported_feature_ids = set(not_supported_queries_df.feature_digest)

for category in VARIANT_CATEGORY_VALUES:
    not_supported_feature_categories_assertion_summary_data[category] = {}

    # Need to do this bc of duplicate features
    tmp_df = moa_df[moa_df["feature_digest"].isin(not_supported_feature_ids)]

    evidence_category_df = tmp_df[tmp_df.category == category]

    evidence_category_df = evidence_category_df.drop_duplicates(subset=["assertion_id"])

    # Count
    number_unique_not_supported_category_assertion = len(
        set(evidence_category_df.assertion_id)
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "number_unique_not_supported_category_assertion"
    ] = number_unique_not_supported_category_assertion

    # Fraction
    fraction_not_supported_category_feature_assertion_of_moa = (
        f"{number_unique_not_supported_category_assertion} / {total_len_assertions}"
    )
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_moa"
    ] = fraction_not_supported_category_feature_assertion_of_moa

    # Percent
    percent_not_supported_category_feature_assertion_of_moa = f"{number_unique_not_supported_category_assertion / total_len_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_moa"
    ] = percent_not_supported_category_feature_assertion_of_moa

    # Not supported fraction
    fraction_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion} / {total_number_not_supported_feature_unique_assertions}"
    not_supported_feature_categories_assertion_summary_data[category][
        "fraction_not_supported_category_feature_assertion_of_total_not_supported"
    ] = fraction_not_supported_category_feature_assertion_of_total_not_supported

    # Not supported percent
    percent_not_supported_category_feature_assertion_of_total_not_supported = f"{number_unique_not_supported_category_assertion / total_number_not_supported_feature_unique_assertions * 100:.2f}%"
    not_supported_feature_categories_assertion_summary_data[category][
        "percent_not_supported_category_feature_assertion_of_total_not_supported"
    ] = percent_not_supported_category_feature_assertion_of_total_not_supported

    not_supported_feature_assertion_summary[
        "Count of MOA Assertions per Category"
    ].append(number_unique_not_supported_category_assertion)
    not_supported_feature_assertion_summary["Fraction of MOA Assertions"].append(
        fraction_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary["Percent of all MOA Assertions"].append(
        percent_not_supported_category_feature_assertion_of_moa
    )
    not_supported_feature_assertion_summary[
        "Fraction of Not Supported Feature Assertions"
    ].append(fraction_not_supported_category_feature_assertion_of_total_not_supported)
    not_supported_feature_assertion_summary[
        "Percent of Not Supported Feature Assertions"
    ].append(percent_not_supported_category_feature_assertion_of_total_not_supported)

In [92]:
number_unique_not_supported_category_features

0

### <a id='toc1_6_4_'></a>[Impact Score Analysis by Subcategory](#toc0_)

In [93]:
not_supported_impact_summary = {
    "Category": VARIANT_CATEGORY_VALUES,
    "MOA Total Sum Impact Score": [],
    "Average Impact Score per Feature": [],
    "Average Impact Score per Assertion": [],
    "Total Number Assertions": [
        v["number_unique_not_supported_category_assertion"]
        for v in not_supported_feature_categories_assertion_summary_data.values()
    ],
    "Total Number Features": [
        v["number_unique_not_supported_category_features"]
        for v in not_supported_feature_categories_summary_data.values()
    ],
}

In [94]:
not_supported_feature_categories_impact_data = dict()
for category in VARIANT_CATEGORY_VALUES:
    # print(category)
    not_supported_feature_categories_impact_data[category] = {}
    impact_category_df = not_supported_queries_df[
        not_supported_queries_df["category"] == category
    ].copy()

    total_sum_not_supported_category_impact = impact_category_df["impact_score"].sum()
    # print(f"total sum {total_sum_not_supported_category_impact}")
    not_supported_feature_categories_impact_data[category][
        "total_sum_not_supported_category_impact"
    ] = total_sum_not_supported_category_impact
    # print(f"")
    number_unique_not_supported_category_features = (
        impact_category_df.feature_id.nunique()
    )
    number_unique_not_supported_category_assertion = (
        impact_category_df.assertion_id.nunique()
    )
    # print(number_unique_not_supported_category_features)
    # print(number_unique_not_supported_category_assertion)

    if number_unique_not_supported_category_features == 0:
        avg_impact_score_feature = 0
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_feature"
        ] = avg_impact_score_feature

        avg_impact_score_assertion = 0
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_evidence"
        ] = avg_impact_score_assertion
    else:
        avg_impact_score_feature = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_features:.2f}"
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_feature"
        ] = avg_impact_score_feature

        avg_impact_score_assertion = f"{total_sum_not_supported_category_impact / number_unique_not_supported_category_assertion:.2f}"
        not_supported_feature_categories_impact_data[category][
            "avg_impact_score_evidence"
        ] = avg_impact_score_assertion

    not_supported_impact_summary["MOA Total Sum Impact Score"].append(
        total_sum_not_supported_category_impact
    )
    not_supported_impact_summary["Average Impact Score per Feature"].append(
        avg_impact_score_feature
    )
    not_supported_impact_summary["Average Impact Score per Assertion"].append(
        avg_impact_score_assertion
    )

    print(
        f"Number of unique features within category: {number_unique_not_supported_category_features}"
    )
    print(
        f"{category}: {total_sum_not_supported_category_impact}, {avg_impact_score_feature}, {avg_impact_score_assertion}"
    )

Number of unique features within category: 10
Expression Variants: 10, 1.00, 1.00
Number of unique features within category: 0
Epigenetic Modification: 0, 0, 0
Number of unique features within category: 0
Fusion Variants: 0, 0, 0
Number of unique features within category: 174
Sequence Variants: 1080.5, 6.21, 6.21
Number of unique features within category: 0
Gene Function Variants: 0, 0, 0
Number of unique features within category: 39
Rearrangement Variants: 301.0, 7.72, 7.72
Number of unique features within category: 18
Copy Number Variants: 57.0, 3.17, 3.17
Number of unique features within category: 8
Other Variants: 22.5, 2.81, 2.81
Number of unique features within category: 0
Genotype Variants: 0, 0, 0
Number of unique features within category: 0
Region Defined Variants: 0, 0, 0
Number of unique features within category: 0
Transcript Variants: 0, 0, 0


In [95]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)

In [96]:
not_supported_feature_impact_df

Unnamed: 0,Category,MOA Total Sum Impact Score,Average Impact Score per Feature,Average Impact Score per Assertion,Total Number Assertions,Total Number Features
0,Expression Variants,10.0,1.0,1.0,10,10
1,Epigenetic Modification,0.0,0.0,0.0,0,0
2,Fusion Variants,0.0,0.0,0.0,0,0
3,Sequence Variants,1080.5,6.21,6.21,443,174
4,Gene Function Variants,0.0,0.0,0.0,0,0
5,Rearrangement Variants,301.0,7.72,7.72,82,39
6,Copy Number Variants,57.0,3.17,3.17,39,18
7,Other Variants,22.5,2.81,2.81,20,8
8,Genotype Variants,0.0,0.0,0.0,0,0
9,Region Defined Variants,0.0,0.0,0.0,0,0


In [97]:
not_supported_feature_impact_df.to_csv(
    "moa_assertion_analysis_output/not_supported_feature_impact_df.csv", index=False
)

# <a id='toc2_'></a>[MOA Summary](#toc0_)

## <a id='toc2_1_'></a>[Feature (Variant) Analysis](#toc0_)

### <a id='toc2_1_1_'></a>[Building Summary Tables 1 - 3](#toc0_)

In [98]:
all_features_df = pd.DataFrame(feature_analysis_summary)

In [99]:
all_features_df["Percentage of all MOA Features"] = (
    all_features_df["Fraction of all MOA Features"].astype(str)
    + "  ("
    + all_features_df["Percentage of all MOA Features"]
    + ")"
)

In [100]:
for_merge_all_variant_percent_of_moa_df = all_features_df.drop(
    columns=["Fraction of all MOA Features"]
)

all_features_percent_of_moa_df = all_features_df.drop(
    columns=["Fraction of all MOA Features", "Count of MOA Features per Category"]
)

In [101]:
for_merge_all_variant_percent_of_moa_df.to_csv(
    "moa_assertion_analysis_output/for_merge_all_variant_percent_of_moa_df.csv", index=False
)

### <a id='toc2_1_2_'></a>[Summary Table 1](#toc0_)

The table below shows the 2 categories that MOA features (variants) were divided into after normalization and what percentage they make up of all features (variants) in MOA data. 

In [102]:
all_features_percent_of_moa_df = all_features_percent_of_moa_df.set_index(
    "Variant Category"
)
all_features_percent_of_moa_df

Unnamed: 0_level_0,Percentage of all MOA Features
Variant Category,Unnamed: 1_level_1
Normalized,181 / 428 (42.29%)
Not Supported,249 / 428 (58.18%)


In [103]:
moa_summary_table_1 = all_features_percent_of_moa_df

### <a id='toc2_1_3_'></a>[Summary Table 2](#toc0_)

The table below shows the categories that the Not Supported features (variants) were broken into and what percentage of all MOA features (variants) they make up.

In [104]:
not_supported_features_total_df = pd.DataFrame(not_supported_feature_analysis_summary)
not_supported_features_total_df["Percent of all MOA Features"] = (
    not_supported_features_total_df["Fraction of all MOA Features"].astype(str)
    + "  ("
    + not_supported_features_total_df["Percent of all MOA Features"]
    + ")"
)
for_merge_not_supported_features_total_df = not_supported_features_total_df.drop(
    columns=[
        "Fraction of all MOA Features",
        "Fraction of Not Supported Features",
        "Percent of Not Supported Features",
    ]
)

not_supported_features_total_df = not_supported_features_total_df.drop(
    columns=[
        "Fraction of all MOA Features",
        "Fraction of Not Supported Features",
        "Percent of Not Supported Features",
        "Count of MOA Features per Category",
    ]
)
not_supported_features_total_df = not_supported_features_total_df.set_index("Category")
not_supported_features_total_df

Unnamed: 0_level_0,Percent of all MOA Features
Category,Unnamed: 1_level_1
Expression Variants,10 / 428 (2.34%)
Epigenetic Modification,0 / 428 (0.00%)
Fusion Variants,0 / 428 (0.00%)
Sequence Variants,174 / 428 (40.65%)
Gene Function Variants,0 / 428 (0.00%)
Rearrangement Variants,39 / 428 (9.11%)
Copy Number Variants,18 / 428 (4.21%)
Other Variants,8 / 428 (1.87%)
Genotype Variants,0 / 428 (0.00%)
Region Defined Variants,0 / 428 (0.00%)


In [105]:
moa_summary_table_2 = not_supported_features_total_df

In [106]:
for_merge_not_supported_features_total_df.to_csv(
    "moa_assertion_analysis_output/for_merge_not_supported_features_total_df.csv", index=False
)

### <a id='toc2_1_4_'></a>[Summary Table 3](#toc0_)

The table below shows the categories that the Not Supported features (variants) were broken into what percent each sub category take up in Not Supported variant group.

In [107]:
not_supported_features_category_df = pd.DataFrame(
    not_supported_feature_analysis_summary
)
not_supported_features_category_df["Percent of Not Supported Features"] = (
    not_supported_features_category_df["Fraction of Not Supported Features"].astype(str)
    + "  ("
    + not_supported_features_category_df["Percent of Not Supported Features"]
    + ")"
)
not_supported_features_category_df = not_supported_features_category_df.drop(
    columns=[
        "Fraction of all MOA Features",
        "Fraction of Not Supported Features",
        "Percent of all MOA Features",
        "Count of MOA Features per Category",
    ]
)
not_supported_features_category_df = not_supported_features_category_df.set_index(
    "Category"
)
not_supported_features_category_df

Unnamed: 0_level_0,Percent of Not Supported Features
Category,Unnamed: 1_level_1
Expression Variants,10 / 249 (4.02%)
Epigenetic Modification,0 / 249 (0.00%)
Fusion Variants,0 / 249 (0.00%)
Sequence Variants,174 / 249 (69.88%)
Gene Function Variants,0 / 249 (0.00%)
Rearrangement Variants,39 / 249 (15.66%)
Copy Number Variants,18 / 249 (7.23%)
Other Variants,8 / 249 (3.21%)
Genotype Variants,0 / 249 (0.00%)
Region Defined Variants,0 / 249 (0.00%)


In [108]:
moa_summary_table_3 = not_supported_features_category_df

## <a id='toc2_2_'></a>[Evidence Analysis](#toc0_)

### <a id='toc2_2_1_'></a>[Building Summary Table 4](#toc0_)

In [109]:
all_features_assertions_df = pd.DataFrame(assertion_analysis_summary)

In [110]:
all_features_assertions_df["Percentage of all MOA Assertions"] = (
    all_features_assertions_df["Fraction of all MOA Assertions"].astype(str)
    + "  ("
    + all_features_assertions_df["Percentage of all MOA Assertions"]
    + ")"
)

In [111]:
for_merge_all_features_assertions_df = all_features_assertions_df.drop(
    columns=["Fraction of all MOA Assertions"]
)

all_features_assertions_df = for_merge_all_features_assertions_df.drop(
    columns=["Count of MOA Assertions per Category"]
)

In [112]:
for_merge_all_features_assertions_df.to_csv(
    "moa_assertion_analysis_output/for_merge_all_features_assertions_df.csv", index=False
)

### <a id='toc2_2_2_'></a>[Summary Table 4](#toc0_)

The table below shows what percentage of all assertions (evidence items) in MOA are associated with Normalized and Not Supported features (variants)

In [113]:
all_features_assertions_df = all_features_assertions_df.set_index("Variant Category")
moa_summary_table_4 = all_features_assertions_df
moa_summary_table_4

Unnamed: 0_level_0,Percentage of all MOA Assertions
Variant Category,Unnamed: 1_level_1
Normalized,358 / 898 (39.87%)
Not Supported,594 / 898 (66.15%)


### <a id='toc2_2_3_'></a>[Building Sumary Tables 5 & 6](#toc0_)

In [114]:
not_supported_feature_assertion_df = pd.DataFrame(
    not_supported_feature_assertion_summary
)

ValueError: All arrays must be of the same length

In [None]:
not_supported_feature_assertion_df["Percent of all MOA Assertions"] = (
    not_supported_feature_assertion_df["Fraction of MOA Assertions"].astype(str)
    + "  ("
    + not_supported_feature_assertion_df["Percent of all MOA Assertions"]
    + ")"
)
not_supported_feature_assertion_df["Percent of Not Supported Feature Assertions"] = (
    not_supported_feature_assertion_df[
        "Fraction of Not Supported Feature Assertions"
    ].astype(str)
    + "  ("
    + not_supported_feature_assertion_df["Percent of Not Supported Feature Assertions"]
    + ")"
)

In [None]:
not_supported_feature_assertion_df = not_supported_feature_assertion_df.drop(
    columns=[
        "Fraction of MOA Assertions",
        "Fraction of Not Supported Feature Assertions",
    ]
)

In [None]:
for_merge_not_supported_feature_assertion_df = not_supported_feature_assertion_df.drop(
    ["Percent of Not Supported Feature Assertions"], axis=1
)

not_supported_feature_assertion_of_moa_df = (
    for_merge_not_supported_feature_assertion_df.drop(
        ["Count of MOA Assertions per Category"], axis=1
    )
)

not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_df.drop(
        ["Percent of all MOA Assertions", "Count of MOA Assertions per Category"],
        axis=1,
    )
)

In [None]:
for_merge_not_supported_feature_assertion_df.to_csv(
    "moa_assertion_analysis_output/for_merge_not_supported_feature_assertion_df.csv", index=False
)

### <a id='toc2_2_4_'></a>[Summary Table 5](#toc0_)

The table below shows the percentage of all MOA assertions (evidence items) that are associated with a Not Supported variant sub category.

In [None]:
not_supported_feature_assertion_of_moa_df = (
    not_supported_feature_assertion_of_moa_df.set_index("Category")
)
moa_summary_table_5 = not_supported_feature_assertion_of_moa_df
moa_summary_table_5

Unnamed: 0_level_0,Percent of all MOA Assertions
Category,Unnamed: 1_level_1
Expression Variants,10 / 898 (1.11%)
Epigenetic Modification,0 / 898 (0.00%)
Fusion Variants,0 / 898 (0.00%)
Sequence Variants,443 / 898 (49.33%)
Gene Function Variants,0 / 898 (0.00%)
Rearrangement Variants,82 / 898 (9.13%)
Copy Number Variants,39 / 898 (4.34%)
Other Variants,20 / 898 (2.23%)
Genotype Variants,0 / 898 (0.00%)
Region Defined Variants,0 / 898 (0.00%)


### <a id='toc2_2_5_'></a>[Summary Table 6](#toc0_)

The table below shows the percentage of all MOA Assertions (evidence items) associated with Not Supported features (variants) that are associated with a variant sub category. 

In [None]:
not_supported_feature_assertion_of_not_supported_df = (
    not_supported_feature_assertion_of_not_supported_df.set_index("Category")
)
moa_summary_table_6 = not_supported_feature_assertion_of_not_supported_df
moa_summary_table_6

Unnamed: 0_level_0,Percent of Not Supported Feature Assertions
Category,Unnamed: 1_level_1
Expression Variants,10 / 249 (4.02%)
Epigenetic Modification,0 / 249 (0.00%)
Fusion Variants,0 / 249 (0.00%)
Sequence Variants,443 / 249 (177.91%)
Gene Function Variants,0 / 249 (0.00%)
Rearrangement Variants,82 / 249 (32.93%)
Copy Number Variants,39 / 249 (15.66%)
Other Variants,20 / 249 (8.03%)
Genotype Variants,0 / 249 (0.00%)
Region Defined Variants,0 / 249 (0.00%)


## <a id='toc2_3_'></a>[Impact](#toc0_)

The bar graph below shows the relationship between the Not Suported variant sub category impact score and the sub category. Additionally, the colors illustrate the number of assertions (evidence items) associated each sub category.

In [None]:
not_supported_feature_impact_df = pd.DataFrame(not_supported_impact_summary)
not_supported_feature_impact_df

Unnamed: 0,Category,MOA Total Sum Impact Score,Average Impact Score per Feature,Average Impact Score per Assertion,Total Number Assertions,Total Number Features
0,Expression Variants,10.0,1.0,1.0,10,10
1,Epigenetic Modification,0.0,0.0,0.0,0,0
2,Fusion Variants,0.0,0.0,0.0,0,0
3,Sequence Variants,1080.5,6.21,6.21,443,174
4,Gene Function Variants,0.0,0.0,0.0,0,0
5,Rearrangement Variants,301.0,7.72,7.72,82,39
6,Copy Number Variants,57.0,3.17,3.17,39,18
7,Other Variants,22.5,2.81,2.81,20,8
8,Genotype Variants,0.0,0.0,0.0,0,0
9,Region Defined Variants,0.0,0.0,0.0,0,0


In [None]:
not_supported_feature_impact_df.to_csv(
    "moa_assertion_analysis_output/not_supported_feature_impact_df.csv", index=False
)

In [None]:
fig3 = px.bar(
    not_supported_feature_impact_df,
    x="Category",
    y="MOA Total Sum Impact Score",
    hover_data=["Total Number Assertions"],
    color="Total Number Assertions",
    labels={"MOA Total Sum Impact Score": "MOA Total Sum Impact Score"},
    text_auto=".1f",
    color_continuous_scale="geyser",
)
fig3.update_traces(width=1)
fig3.show()

In [None]:
fig3.write_html("moa_assertion_analysis_output/moa_ns_categories_impact_redgreen.html")

The scatterplot below shows the relationship between the Not Suported variant sub category impact score and the number of assertions (evidence items) associated with features (variants) in each sub category. Additionally, the sizes of the data point represent the number of features (variants) in each sub category. 

In [None]:
fig2 = px.scatter(
    data_frame=not_supported_feature_impact_df,
    x="Total Number Assertions",
    y="MOA Total Sum Impact Score",
    size="Total Number Features",
    size_max=40,
    text="Total Number Features",
    color="Category",
)
fig2.show()

In [None]:
fig2.write_html("moa_assertion_analysis_output/moa_ns_categories_impact_scatterplot.html")