# Normalizer Performance Analysis

This notebook analyzes the performance of the normalizer on the CIViC, MOA, and ClinVar data.

## Import relevant packages

In [1]:
from pathlib import Path
import pandas as pd
# from civicpy import civic as civicpy
# import plotly.express as px
import ndjson
import re
from dotenv import load_dotenv
import os

## Dictionaries to map variants to categories and record category counts

In [2]:
civic_category_bins = {
    "Sequence Variants": "Sequence Variants",
    "Copy Number Variants": "Copy Number Variants"
}



moa_category_bins = {
    "Sequence Variants": "Sequence Variants",
    "Copy Number Variants": "Copy Number Variants",
    "Rearrangement Variants": "Rearrangement Variants",
    "Expression Variants": "Expression Variants",
    "Other Variants": "Other Variants"
}

# Each value in category_counts (defined below) is a list of four integers:
# [nomalized_count, unable_to_normalize_count, unsupported_count, total_count]
# The constants below name the position of each count within that list.
nomalized_count = 0
unable_to_normalize_count = 1
unsupported_count = 2
total_count = 3

category_counts = {
    "Copy Number Variants":[0,0,0,0],
    "Expression Variants":[0,0,0,0],
    "Other Variants": [0,0,0,0],
    "Rearrangement Variants":[0,0,0,0],
    "Sequence Variants":[0,0,0,0]
}
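The tallying pattern used in the cells below (map a source category to its bin, then increment one slot plus the total) can be wrapped in a small helper. This is only a sketch: `add_counts`, the upper-case slot names, and the toy numbers are hypothetical and are not part of the notebook.

```python
# Slot positions inside each per-category list, mirroring the constants above.
NORMALIZED, UNABLE_TO_NORMALIZE, UNSUPPORTED, TOTAL = 0, 1, 2, 3


def add_counts(counts, bins, pairs, slot):
    """Add (category, count) pairs to `counts` at position `slot` and to the total.

    Hypothetical helper; `pairs` can be any iterable of (label, count),
    for example the result of `Series.items()` on a value_counts() Series.
    """
    for source_category, count in pairs:
        target = bins[source_category]  # map the source label to its bin
        counts[target][slot] += count
        counts[target][TOTAL] += count
    return counts


# Toy data, chosen only to exercise the helper.
demo_bins = {
    "Sequence Variants": "Sequence Variants",
    "Copy Number Variants": "Copy Number Variants",
}
demo_counts = {name: [0, 0, 0, 0] for name in demo_bins}

add_counts(demo_counts, demo_bins, [("Sequence Variants", 3), ("Copy Number Variants", 2)], NORMALIZED)
add_counts(demo_counts, demo_bins, [("Sequence Variants", 4)], UNSUPPORTED)
print(demo_counts)
```

Keeping the update in one place means the total can never drift out of step with the three status slots.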

## CIViC



Read in the file of normalized CIViC variants (tab-delimited, despite the .csv extension).

In [11]:
civic_normalized_df = pd.read_csv("../civic/variation_analysis/able_to_normalize_queries.csv",sep = "\t")
print(civic_normalized_df.shape)
civic_normalized_df.head()
type(civic_normalized_df)

(1876, 7)


pandas.core.frame.DataFrame

Trim columns and add new column to flag as normalized.

In [None]:
os.path.realpath("merged_performance_analysis.ipynb")
os.chdir('/Users/dpp002/Documents/Variant Normalizer Manuscript/variation-normalizer-manuscript-1/analysis')
print("Current working directory: {0}".format(os.getcwd()))
for entry in os.scandir('.'):
    if entry.is_file():
        print(entry.name)

In [None]:
os.chdir('/Users/dpp002/Documents/Variant Normalizer Manuscript/variation-normalizer-manuscript-1/analysis/civic/variation_analysis')
print("Current working directory: {0}".format(os.getcwd()))

for entry in os.scandir('.'):
    if entry.is_file():
        print(entry.name)

Add category column to CIViC df.

Bin variants to categories.

For variants with multiple associated types: if the types have a subset relationship (e.g., frameshift; frameshift truncation), reassign the variant to the superset type (frameshift). If the types are disjoint (e.g., Transcript Variant; Loss of Function Variant), reassign it to the type most closely associated with the assayed data (Transcript Variant).
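A minimal sketch of this rule is shown here. The lookup tables `PARENT_TYPE` and `ASSAY_PRIORITY` and the function `resolve_types` are assumptions made for illustration; the notebook's actual mappings are not shown, and the entries below only cover the two examples from the text.

```python
# Hypothetical lookup tables covering only the two examples in the text.
PARENT_TYPE = {"frameshift truncation": "frameshift"}  # specific type -> superset type
ASSAY_PRIORITY = ["transcript variant", "loss of function variant"]  # earlier = closer to assayed data


def _root(variant_type):
    """Follow PARENT_TYPE links until the most general type is reached."""
    while variant_type in PARENT_TYPE:
        variant_type = PARENT_TYPE[variant_type]
    return variant_type


def resolve_types(types):
    """Collapse a list of variant types into a single representative type."""
    cleaned = [t.strip().lower() for t in types if t.strip()]
    if not cleaned:
        return None
    # Replace each type by its superset type, then de-duplicate preserving order.
    roots = list(dict.fromkeys(_root(t) for t in cleaned))
    if len(roots) == 1:
        return roots[0]
    # Disjoint types: prefer the one ranked closest to the assayed data.
    ranked = [t for t in ASSAY_PRIORITY if t in roots]
    return ranked[0] if ranked else roots[0]


print(resolve_types(["frameshift", "frameshift truncation"]))
print(resolve_types(["Transcript Variant", "Loss of Function Variant"]))
```

The disjoint case depends entirely on the hand-curated priority list, so any pair of types missing from it falls back to the first type listed.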


For variants without an associated type (approximately 1,632 entries), use regexes to assign types where possible.

Finally, assign variants to category bins based on type.
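The regex fallback and the final binning step could look roughly like the following. The patterns, the type names, and the helpers `infer_type` and `assign_category` are all illustrative assumptions; the regexes actually used for the CIViC data are not shown in this notebook.

```python
import re

# Illustrative (pattern, type) pairs only; not the notebook's real regexes.
TYPE_PATTERNS = [
    (re.compile(r"^[A-Z]\d+[A-Z]$"), "missense variant"),            # e.g. V600E
    (re.compile(r"^[A-Z]\d+fs", re.IGNORECASE), "frameshift"),       # e.g. K100fs
    (re.compile(r"amplification", re.IGNORECASE), "amplification"),  # e.g. Amplification
]

# Hypothetical type-to-bin mapping, using the two CIViC bins defined earlier.
TYPE_TO_CATEGORY = {
    "missense variant": "Sequence Variants",
    "frameshift": "Sequence Variants",
    "amplification": "Copy Number Variants",
}


def infer_type(variant_name):
    """Return the first type whose pattern matches the variant name, else None."""
    name = variant_name.strip()
    for pattern, variant_type in TYPE_PATTERNS:
        if pattern.search(name):
            return variant_type
    return None


def assign_category(variant_name, known_type=None):
    """Use the curated type when present; otherwise fall back to the regexes."""
    variant_type = (known_type or "").strip().lower() or infer_type(variant_name)
    return TYPE_TO_CATEGORY.get(variant_type)  # None when no bin applies


for name in ["V600E", "K100fs", "AMPLIFICATION", "Overexpression"]:
    print(name, "->", assign_category(name))
```

Returning `None` for unmatched names keeps unclassifiable variants visible, so they can be counted or reviewed rather than silently binned.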

Split the df by the normalized/not_normalized/not_supported flags.

For each resulting df, get the CIViC variant counts by category and add them to the counts dictionary.
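One possible way to get per-flag, per-category counts without building three separate DataFrames is a cross-tabulation. The column names `status` and `category` and the toy rows below are assumptions; the combined CIViC table is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the combined CIViC table; column names are assumptions.
demo_df = pd.DataFrame({
    "status": ["normalized", "normalized", "not_normalized",
               "not_supported", "not_supported"],
    "category": ["Sequence Variants", "Copy Number Variants", "Sequence Variants",
                 "Sequence Variants", "Sequence Variants"],
})

# One row per category, one column per flag; absent combinations are filled with 0.
counts_by_status = pd.crosstab(demo_df["category"], demo_df["status"])
print(counts_by_status)

# An explicit split is also possible when each subset is needed on its own.
subsets = {status: group for status, group in demo_df.groupby("status")}
print({status: len(group) for status, group in subsets.items()})
```

The cross-tabulated table can then be copied into `category_counts` row by row, or used directly in place of the dictionary.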

## MOA

Read the MOA file of normalized variants (tab-delimited, despite the .csv extension).

In [3]:
moa_normalized_df = pd.read_csv("../moa/feature_analysis/able_to_normalize_queries.csv",sep = "\t")
print(moa_normalized_df.shape)
moa_normalized_df.head()
type(moa_normalized_df)

(181, 5)


pandas.core.frame.DataFrame

Get variant counts by category and update the category counts dictionary.

In [4]:
moa_normalized_category_counts = moa_normalized_df["category"].value_counts(dropna=False)
# Iterate over (category, count) pairs. This avoids positional Series[i] lookups,
# which newer pandas versions deprecate for label-based indexes.
for variant, count in moa_normalized_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][nomalized_count] += count
    category_counts[target_category][total_count] += count

for i in category_counts.items():
    print(i)


Sequence Variants 149
Copy Number Variants 32
('Copy Number Variants', [32, 0, 0, 32])
('Expression Variants', [0, 0, 0, 0])
('Other Variants', [0, 0, 0, 0])
('Rearrangement Variants', [0, 0, 0, 0])
('Sequence Variants', [149, 0, 0, 149])


Repeat the same process for variants that were supported but failed to normalize.

In [5]:
moa_not_normalized_df = pd.read_csv("../moa/feature_analysis/unable_to_normalize_queries.csv",sep = "\t")
print(moa_not_normalized_df.shape)
moa_not_normalized_df.head()
type(moa_not_normalized_df)

(0, 7)


pandas.core.frame.DataFrame

In [6]:
moa__not_normalized_category_counts = moa_not_normalized_df["category"].value_counts(dropna=False)
# Iterate over (category, count) pairs. This avoids positional Series[i] lookups,
# which newer pandas versions deprecate for label-based indexes.
for variant, count in moa__not_normalized_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][unable_to_normalize_count] += count
    category_counts[target_category][total_count] += count

for i in category_counts.items():
    print(i)


('Copy Number Variants', [32, 0, 0, 32])
('Expression Variants', [0, 0, 0, 0])
('Other Variants', [0, 0, 0, 0])
('Rearrangement Variants', [0, 0, 0, 0])
('Sequence Variants', [149, 0, 0, 149])


Repeat the same process for variants that are unsupported.

In [7]:
moa_not_supported_df = pd.read_csv("../moa/feature_analysis/not_supported_variants.csv",sep = "\t")
print(moa_not_supported_df.shape)
print(moa_not_supported_df.head())
type(moa_not_supported_df)
print(moa_not_supported_df["category"].value_counts(dropna=False))

(249, 4)
   variant_id                        query moa_feature_type  \
0           1             BCR--ABL1 Fusion    rearrangement   
1          12                   ALK Fusion    rearrangement   
2          15                          ALK    rearrangement   
3          18            ALK Translocation    rearrangement   
4          21  BRD4 t(15;19) Translocation    rearrangement   

                 category  
0  Rearrangement Variants  
1  Rearrangement Variants  
2  Rearrangement Variants  
3  Rearrangement Variants  
4  Rearrangement Variants  
category
Sequence Variants         181
Rearrangement Variants     35
Copy Number Variants       17
Expression Variants        11
Other Variants              5
Name: count, dtype: int64


In [8]:
# Inspect the unsupported variants binned as "Other Variants"
othervars = moa_not_supported_df[moa_not_supported_df["category"] == "Other Variants"]
othervars.head()

     variant_id                      query          moa_feature_type  \
210         781                   MSI-High  microsatellite_stability   
216         803                       High         mutational_burden   
217         805    High (>= 178 mutations)         mutational_burden   
218         806    High (>= 100 mutations)         mutational_burden   
219         808  High (>= 10 mutations/Mb)         mutational_burden   

           category  
210  Other Variants  
216  Other Variants  
217  Other Variants  
218  Other Variants  
219  Other Variants  

In [9]:
moa__not_supported_category_counts = moa_not_supported_df["category"].value_counts(dropna=False)
# Iterate over (category, count) pairs. This avoids positional Series[i] lookups,
# which newer pandas versions deprecate for label-based indexes.
for variant, count in moa__not_supported_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][unsupported_count] += count
    category_counts[target_category][total_count] += count

for i in category_counts.items():
    print(i)


Sequence Variants 181
Rearrangement Variants 35
Copy Number Variants 17
Expression Variants 11
Other Variants 5
('Copy Number Variants', [32, 0, 17, 49])
('Expression Variants', [0, 0, 11, 11])
('Other Variants', [0, 0, 5, 5])
('Rearrangement Variants', [0, 0, 35, 35])
('Sequence Variants', [149, 0, 181, 330])


## ClinVar

Get the df from the ClinVar analysis.

Get ClinVar variant counts by category and update the category counts dictionary.

Output counts df
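As a sketch of this step, the counts dictionary can be turned into a labelled DataFrame. The two rows below reuse the MOA totals printed earlier in this notebook; the column names, the `percent_normalized` column, and the output filename are assumptions.

```python
import pandas as pd

# Two rows copied from the MOA output above; the remaining categories are omitted for brevity.
demo_category_counts = {
    "Copy Number Variants": [32, 0, 17, 49],
    "Sequence Variants": [149, 0, 181, 330],
}

# Column names are assumptions that mirror the order of the four list positions.
counts_df = pd.DataFrame.from_dict(
    demo_category_counts,
    orient="index",
    columns=["normalized", "unable_to_normalize", "unsupported", "total"],
)

# Share of each category that normalized; categories with a total of 0 yield NaN.
counts_df["percent_normalized"] = (
    100 * counts_df["normalized"] / counts_df["total"]
).round(1)
print(counts_df)

# Hypothetical output path; uncomment to write the table to disk.
# counts_df.to_csv("category_counts.tsv", sep="\t")
```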

Generate figure(?)