# Normalizer Performance Analysis

This notebook analyzes the normalizer's performance on the CIViC, MOA, and ClinVar data.

## Import relevant packages

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
# from civicpy import civic as civicpy
# import plotly.express as px
import ndjson
import re
from dotenv import load_dotenv
import os

## Dictionaries to map variants to categories and record category counts

Bin variants to categories.

For variants with multiple associated types: if two or more types have a subset relationship (e.g. frameshift; frameshift truncation), they are assigned to the category consistent with the superset type (frameshift). If the types are disjoint (e.g. Transcript Variant; Loss of Function Variant), they are assigned to the category most closely associated with the assayed data (Transcript Variant). This assignment is done in the civic_category_bins dictionary.

In [2]:
civic_category_bins = {
    "Delins":"Sequence Variants",
    "Direct Tandem Duplication":"Sequence Variants",
    "Disruptive Inframe Deletion":"Sequence Variants",
    "Disruptive Inframe Insertion":"Sequence Variants",
    "Coding Sequence Variant":"Sequence Variants",
    "Conservative Inframe Deletion":"Sequence Variants",
    "Copy Number Variants":"Copy Number Variants",
    "Frameshift":"Sequence Variants",
    "Frameshift Truncation":"Sequence Variants",
    "Frameshift Variant":"Sequence Variants",
    "Frameshift Variant;Minus 1 Frameshift Variant":"Sequence Variants",
    "Inframe Deletion":"Sequence Variants",
    "Inframe Indel":"Sequence Variants",
    "Inframe Insertion":"Sequence Variants",
    "Intron Variant":"Region-Defined Variants",
    "Minus 1 Frameshift Variant":"Sequence Variants",
    "Minus 2 Frameshift Variant":"Sequence Variants",
    "Missense Variant":"Sequence Variants",
    "Non Conservative Missense Variant":"Sequence Variants",
    "Plus 1 Frameshift Variant":"Sequence Variants",
    "Region-Defined Variant":"Region-Defined Variants",
    "Regulatory Region Variant":"Region-Defined Variants",
    "Sequence Variants":"Sequence Variants",
    "Splice Acceptor Variant":"Region-Defined Variants",
    "Splice Donor Region Variant":"Region-Defined Variants",
    "Splice Donor Variant":"Region-Defined Variants",
    "Splicing Variant":"Other Variants",
    "Start Lost":"Sequence Variants",
    "Stop Gained":"Sequence Variants",
    "Stop Lost":"Sequence Variants",
    "Synonymous Variant":"Sequence Variants",
    "Transcript Amplification":"Copy Number Variants",
    "Transcript Fusion":"Fusion Variants",
    "3 Prime UTR Variant":"Region-Defined Variants",
    "Amino Acid Deletion;Inframe Deletion":"Sequence Variants",
    "Frameshift Truncation;Minus 2 Frameshift Variant":"Sequence Variants",
    "Frameshift Truncation;Plus 2 Frameshift Variant":"Sequence Variants",
    "Frameshift Variant;Delins":"Sequence Variants",
    "Inframe Insertion;Delins":"Sequence Variants",
    "Inframe Insertion;Inframe Deletion;Delins":"Sequence Variants",
    "Inframe Variant;Inframe Insertion;Inframe Deletion;Delins ":"Sequence Variants",
    "Minus 1 Frameshift Variant;Frameshift Truncation":"Sequence Variants",
    "Plus 1 Frameshift Variant;Frameshift Elongation":"Sequence Variants",
    "Plus 1 Frameshift Variant;Frameshift Truncation":"Sequence Variants",
    "Missense Variant;Gain Of Function Variant":"Sequence Variants", 
    "Missense Variant;Loss Of Function Variant":"Sequence Variants", 
    "Missense Variant;Loss Of Heterozygosity":"Sequence Variants", 
    "Missense Variant;Polymorphic Sequence Variant":"Sequence Variants", 
    "Missense Variant;Snp":"Sequence Variants", 
    "Missense Variant;Transcript Fusion":"Sequence Variants",
    "Stop Gained;Loss Of Function Variant":"Sequence Variants",
    "Stop Lost;Inframe Deletion":"Sequence Variants"
}



moa_category_bins = {
    "Copy Number Variants": "Copy Number Variants",
    "Expression Variants": "Expression Variants",
    "Other Variants": "Other Variants",
    "Rearrangement Variants": "Rearrangement Variants",
    "Sequence Variants": "Sequence Variants"
}



clinvar_category_bins = {
    "Complex":"Other Variants",
    "CompoundHeterozygote":"Genotype Variants",
    "Deletion":"Sequence Variants",
    "Diplotype":"Genotype Variants",
    "Distinct chromosomes":"Rearrangement Variants",
    "Duplication":"Sequence Variants",
    "Haplotype":"Sequence Variants",
    "Haplotype, single variant":"Sequence Variants",
    "Indel":"Sequence Variants",
    "Insertion":"Sequence Variants",
    "Inversion":"Sequence Variants",
    "Microsatellite":"Sequence Variants",
    "Phase unknown":"Other Variants",
    "Tandem duplication":"Sequence Variants",
    "Translocation":"Rearrangement Variants",
    "Variation":"Other Variants",
    "copy number gain":"Copy Number Variants",
    "copy number loss":"Copy Number Variants",
    "fusion":"Fusion Variants",
    "protein only":"Sequence Variants",
    "single nucleotide variant":"Sequence Variants"
}


# Each value in category_counts is a list of 4 integer counts, in this order:
# [nomalized_count, unable_to_normalize_count, unsupported_count, total_count]
# The four names below are the list positions of those counts.
nomalized_count = 0
unable_to_normalize_count = 1
unsupported_count = 2
total_count = 3

category_counts = {
    "Copy Number Variants":[0,0,0,0],
    "Epigenetic Modification":[0,0,0,0],
    "Expression Variants":[0,0,0,0],
    "Fusion Variants":[0,0,0,0],
    "Gene Function Variants":[0,0,0,0],
    "Genotype Variants":[0,0,0,0],
    "Other Variants":[0,0,0,0],
    "Rearrangement Variants":[0,0,0,0],
    "Region-Defined Variants":[0,0,0,0],
    "Sequence Variants":[0,0,0,0]
}
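
The tallying pattern that the following sections repeat for each dataset can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: `add_counts` and the constants `NORMALIZED`, `UNABLE`, `UNSUPPORTED`, and `TOTAL` are hypothetical names, and the toy Series stands in for a real `value_counts()` result.

```python
import pandas as pd

# Hypothetical index constants mirroring the list layout
# [normalized, unable_to_normalize, unsupported, total]
NORMALIZED, UNABLE, UNSUPPORTED, TOTAL = range(4)


def add_counts(counts, series, column, bins=None):
    """Add value_counts() results to counts[category][column] and to the total.

    bins optionally maps a source label to a target category.
    """
    for label, n in series.items():
        category = bins[label] if bins is not None else label
        counts[category][column] += int(n)
        counts[category][TOTAL] += int(n)
    return counts


# Toy data, not taken from any of the real input files
demo_counts = {"Sequence Variants": [0, 0, 0, 0], "Other Variants": [0, 0, 0, 0]}
demo_labels = pd.Series(["Sequence Variants", "Sequence Variants", "Other Variants"])
add_counts(demo_counts, demo_labels.value_counts(), NORMALIZED)
print(demo_counts)
```

Passing the column index as an argument keeps one function usable for the normalized, unable-to-normalize, and unsupported tallies.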

## CIViC



In order to score the normalizer's performance on the CIViC data, some cleaning is required.

First we need to read in the data that was ostensibly supported, get rid of variants with multiple type labels, and assign variant types to as many as possible of the entries that have a "Not provided" value for civic_variant_types.

Read in the .csv of normalized variants in CIViC.

In [3]:
civic_normalized_df = pd.read_csv("../civic/variation_analysis/able_to_normalize_queries.csv",sep = "\t")
print(civic_normalized_df.shape)
civic_normalized_df.head()
type(civic_normalized_df)

(1876, 7)


pandas.core.frame.DataFrame

Prune columns and add new column to flag as normalized.

In [4]:
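# .copy() makes an independent frame so the insert below does not act on a view of civic_normalized_df
pruned_civic_normalized_df = civic_normalized_df[["variant_id","query","query_type","civic_variant_types"]].copy()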
pruned_civic_normalized_df.insert(4,"normalization_status","normalized")
pruned_civic_normalized_df.head()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status
0,2489,NC_000003.11:g.10191648_10191649insC,genomic,Stop Lost,normalized
1,1988,NC_000003.11:g.10191649A>T,genomic,Stop Lost,normalized
2,2488,3-10191647-T-G,genomic,Stop Lost,normalized
3,1986,NC_000003.11:g.10191648G>T,genomic,Stop Lost,normalized
4,1987,NC_000003.11:g.10191649A>G,genomic,Stop Lost,normalized


Repeat process with the variants that were unable to be normalized.

In [5]:
civic_not_normalized_df = pd.read_csv("../civic/variation_analysis/unable_to_normalize_queries.csv",sep = "\t")
print(civic_not_normalized_df.shape)
civic_not_normalized_df.head()
type(civic_not_normalized_df)

(80, 8)


pandas.core.frame.DataFrame

In [6]:
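# .copy() makes an independent frame so the insert below does not act on a view of civic_not_normalized_df
pruned_civic_not_normalized_df = civic_not_normalized_df[["variant_id","query","query_type","civic_variant_types"]].copy()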
pruned_civic_not_normalized_df.insert(4,"normalization_status","not_normalized")
pruned_civic_not_normalized_df.head()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status
0,748,MLH1 *757L,protein,Stop Lost,not_normalized
1,3718,AR A748V,protein,Not provided,not_normalized
2,3725,AR A765T,protein,Not provided,not_normalized
3,4485,ERBB2 A775_G776ins YVMA,protein,Not provided,not_normalized
4,248,TERT C228T,protein,Regulatory Region Variant,not_normalized


Merge these dfs

In [7]:
frames = [pruned_civic_normalized_df, pruned_civic_not_normalized_df]
civic_supported_df = pd.concat(frames)
civic_supported_df.shape

(1956, 5)
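
One detail of the concatenation is that `pd.concat` keeps each frame's original index labels unless told otherwise, which is why later outputs show the not-normalized rows indexed from 0 again. The toy frames below are made up for illustration; `ignore_index=True` is shown as an option, not as what this notebook does.

```python
import pandas as pd

# Toy frames standing in for the two pruned CIViC frames
a = pd.DataFrame({"query": ["q1", "q2"], "normalization_status": "normalized"})
b = pd.DataFrame({"query": ["q3"], "normalization_status": "not_normalized"})

kept = pd.concat([a, b])                      # original labels are kept
reset = pd.concat([a, b], ignore_index=True)  # labels are renumbered from 0

print(kept.index.tolist())
print(reset.index.tolist())
```

Duplicate labels are harmless for the boolean-mask operations used here, but they matter for label-based lookups such as `.loc[0]`.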

Converting all queries to upper case to make it easier to account for untyped variants later on.

In [8]:
civic_supported_df["query"] = civic_supported_df["query"].apply(str.upper)

Checking variant types. The single largest type is "Not provided".
Most of these look like amino acid substitutions.
Defining a regex to detect these variants and assign the "Missense Variant" type to them.

In [9]:
civic_supported_df["civic_variant_types"].value_counts(dropna=False)

civic_variant_types
Not provided                                                 816
Missense Variant                                             815
Stop Gained                                                   60
Frameshift Truncation                                         36
Transcript Amplification                                      36
Inframe Deletion                                              35
Minus 1 Frameshift Variant;Frameshift Truncation              21
Inframe Insertion                                             17
Synonymous Variant                                            13
Splice Donor Variant                                          13
Splice Acceptor Variant                                        9
Missense Variant;Gain Of Function Variant                      9
Frameshift Variant                                             7
Stop Lost                                                      6
Conservative Inframe Deletion                                  6
Delin

If a variant does not have an assigned variant type in CIViC, it is a protein query, and the query matches a regex pattern associated with amino acid substitutions (such as "PTEN A126D"), then I am re-classifying it as a "Missense Variant" instead.

In [10]:
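# Raw string avoids invalid-escape warnings; note that | inside [...] is a literal pipe, not alternation
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+[A-Z|*]", x)))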

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

# civic_supported_df[0:20]
    # bool(re.match("\S+\s+[A-Z]+\d+[A-Z]","PTEN A126D"))

Doing so reduced the 816 untyped variants down to 70.
Checking the remaining untyped variants.

In [11]:
untyped_variants = civic_supported_df[civic_supported_df["civic_variant_types"] == "Not provided"]
print(untyped_variants.head(20))

     variant_id                      query query_type civic_variant_types  \
11         3342          KRAS A11_G12INSGA    protein        Not provided   
65         4484     ERBB2 A775_G776INSIVMA    protein        Not provided   
66         2658     ERBB2 A775_G776INSYVMA    protein        Not provided   
67         4483     ERBB2 A775_G776INSYVMA    protein        Not provided   
74         3751  ARHGAP35 A865_L870DELINSV    protein        Not provided   
82         2655          MYB AMPLIFICATION    protein        Not provided   
110        1261         MDM2 AMPLIFICATION    protein        Not provided   
113        1276          SMO AMPLIFICATION    protein        Not provided   
116        1684        PSMD4 AMPLIFICATION    protein        Not provided   
118        2205         FLT4 AMPLIFICATION    protein        Not provided   
119        2240         TLK2 AMPLIFICATION    protein        Not provided   
120        2397         CRKL AMPLIFICATION    protein        Not provided   
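
As a quick sanity check on the substitution pattern used in the reclassification step above, the snippet below runs it against a few query strings. The first four are modelled on queries shown in this notebook; `"KIT V560DEL"` is a hypothetical query added to illustrate one behavior: `re.match` anchors only at the start of the string, so anything that begins like a substitution is flagged. Whether such queries occur among the untyped CIViC rows is not checked here.

```python
import re

# The substitution pattern from the reclassification step, as a raw string
SUBSTITUTION = re.compile(r"\S+\s+[A-Z]+\d+[A-Z|*]")

samples = [
    "PTEN A126D",         # plain substitution
    "AR A748V",           # plain substitution
    "KRAS A11_G12INSGA",  # insertion
    "MYB AMPLIFICATION",  # amplification
    "KIT V560DEL",        # hypothetical single-residue deletion
]

flags = {q: bool(SUBSTITUTION.match(q)) for q in samples}
for query, flagged in flags.items():
    print(f"{query!r}: {flagged}")

# fullmatch requires the whole string to fit the pattern
strict = bool(SUBSTITUTION.fullmatch("KIT V560DEL"))
print("fullmatch on 'KIT V560DEL':", strict)
```

If prefix matches like the last sample are not wanted, `re.fullmatch` (or a trailing `$`) is one way to tighten the pattern.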

Reassigning variants marked as {gene} Amplification as Transcript Amplification Variants

In [12]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+AMPLIFICATION", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Transcript Amplification", civic_supported_df["civic_variant_types"])


Reassigning amino acid insertions, delins, and deletions as "Missense Variant", including a couple of variants that have a stray space before or after the sequence operation (such as "INS").

In [13]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [14]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+\s+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [15]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+-+\d+\s+INS+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [16]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DELINS+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [17]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+DEL", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


And assigning missense types to a handful of remaining variants that are non-standard names for genomic and protein sequence variants

In [18]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+\s+P\.+[A-Z]+\d+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])


In [19]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"\S+[A-Z]+\-+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [20]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+[A-Z]+>+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

In [21]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match(r"NC_\d+\.+\d+:[A-Z]+\.+\d+_+\d+INS+[A-Z]", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "genomic") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Missense Variant", civic_supported_df["civic_variant_types"])

This last variant uses a nonstandard nomenclature (unique to this database) for any variant within a particular domain, so it is a region-defined variant.

In [22]:
civic_supported_df["variant flag"] = civic_supported_df["query"].apply(lambda x: bool(re.match("DICER1 RNASE IIIB MUTATION", x)))

civic_supported_df["civic_variant_types"] = np.where((civic_supported_df["query_type"] == "protein") & (civic_supported_df["civic_variant_types"] == "Not provided") & (civic_supported_df["variant flag"]), "Region-Defined Variant", civic_supported_df["civic_variant_types"])
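
The repeated flag-then-`np.where` passes above all follow one shape: a query type, a pattern, and a replacement type applied only to "Not provided" rows. A table-driven version might look like the sketch below. This is an illustration under assumptions, not the notebook's implementation: `RULES` and `retype_untyped` are hypothetical names, only three of the passes are listed, and the toy rows are modelled on queries shown earlier.

```python
import re

import pandas as pd

# (query_type, pattern, new type), applied in order; a subset of the passes above
RULES = [
    ("protein", r"\S+\s+[A-Z]+\d+[A-Z|*]", "Missense Variant"),
    ("protein", r"\S+\s+AMPLIFICATION", "Transcript Amplification"),
    ("protein", r"\S+\s+[A-Z]+\d+_+[A-Z]+\d+INS+[A-Z]", "Missense Variant"),
]


def retype_untyped(df, rules=RULES):
    """Return a copy of df where 'Not provided' types are filled by the first matching rule."""
    out = df.copy()
    for query_type, pattern, new_type in rules:
        compiled = re.compile(pattern)
        matches = out["query"].map(lambda q: bool(compiled.match(q)))
        mask = (
            (out["query_type"] == query_type)
            & (out["civic_variant_types"] == "Not provided")
            & matches
        )
        out.loc[mask, "civic_variant_types"] = new_type
    return out


# Toy rows; the last one already has a type and should be left alone
toy = pd.DataFrame(
    {
        "query": ["PTEN A126D", "MYB AMPLIFICATION", "KRAS A11_G12INSGA", "MLH1 *757L"],
        "query_type": ["protein"] * 4,
        "civic_variant_types": ["Not provided", "Not provided", "Not provided", "Stop Lost"],
    }
)
retyped = retype_untyped(toy)["civic_variant_types"].tolist()
print(retyped)
```

Because a row stops being "Not provided" once a rule fires, rule order decides ties, just as cell order does in the notebook.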

Add category column to CIViC df.

In [23]:
civic_supported_df["category"] = civic_supported_df["civic_variant_types"].map(civic_category_bins)
civic_supported_df.tail()

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
75,953,KIT Y553_K558DEL,protein,Inframe Deletion,not_normalized,False,Sequence Variants
76,1542,KIT Y553_W557DELYEVQW,protein,Inframe Deletion,not_normalized,False,Sequence Variants
77,1548,KIT Y570_L576DEL,protein,Inframe Deletion,not_normalized,False,Sequence Variants
78,1630,FLT3 Y591_V592INSVDFREYE,protein,Missense Variant,not_normalized,False,Sequence Variants
79,3724,AR Y763C,protein,Missense Variant,not_normalized,False,Sequence Variants
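
`Series.map` with a dictionary returns NaN for any label that is not a key, and `value_counts()` drops NaN by default, so unmapped rows vanish silently from the tallies. The normalized frame has 1876 rows while the per-category counts printed for it later in this section sum to 1875, which would be consistent with one such row, though that is not verified here. A minimal check on made-up data (the names `toy_bins` and `toy_types` are hypothetical):

```python
import pandas as pd

# Toy bins and labels; the second label carries a trailing space on purpose
toy_bins = {"Missense Variant": "Sequence Variants", "Stop Lost": "Sequence Variants"}
toy_types = pd.Series(["Missense Variant", "Stop Lost ", "Stop Lost"])

toy_category = toy_types.map(toy_bins)
unmapped = toy_types[toy_category.isna()]

print(unmapped.value_counts())                          # labels with no bin
print(toy_category.value_counts().sum())                # NaN rows are dropped
print(toy_category.value_counts(dropna=False).sum())    # NaN rows are kept
```

Running the same `isna()` filter on the real category column would list any type strings that still need an entry in civic_category_bins.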


Split df by normalized/not_normalized flag

In [24]:
civic_normalized_df_cats = civic_supported_df[civic_supported_df["normalization_status"] == "normalized"]
civic_normalized_df_cats

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
0,2489,NC_000003.11:G.10191648_10191649INSC,genomic,Stop Lost,normalized,False,Sequence Variants
1,1988,NC_000003.11:G.10191649A>T,genomic,Stop Lost,normalized,False,Sequence Variants
2,2488,3-10191647-T-G,genomic,Stop Lost,normalized,False,Sequence Variants
3,1986,NC_000003.11:G.10191648G>T,genomic,Stop Lost,normalized,False,Sequence Variants
4,1987,NC_000003.11:G.10191649A>G,genomic,Stop Lost,normalized,False,Sequence Variants
...,...,...,...,...,...,...,...
1871,3161,3-10183878-G-A,genomic,Missense Variant,normalized,False,Sequence Variants
1872,877,NC_000020.11:G.58903752C>T,genomic,Synonymous Variant,normalized,False,Sequence Variants
1873,731,NC_000003.11:G.37056036G>A,genomic,Splice Donor Variant,normalized,False,Region-Defined Variants
1874,3045,VHL P.F76DEL,protein,Missense Variant,normalized,False,Sequence Variants


In [25]:
civic_not_normalized_df_cats = civic_supported_df[civic_supported_df["normalization_status"] == "not_normalized"]
civic_not_normalized_df_cats

Unnamed: 0,variant_id,query,query_type,civic_variant_types,normalization_status,variant flag,category
0,748,MLH1 *757L,protein,Stop Lost,not_normalized,False,Sequence Variants
1,3718,AR A748V,protein,Missense Variant,not_normalized,False,Sequence Variants
2,3725,AR A765T,protein,Missense Variant,not_normalized,False,Sequence Variants
3,4485,ERBB2 A775_G776INS YVMA,protein,Missense Variant,not_normalized,False,Sequence Variants
4,248,TERT C228T,protein,Regulatory Region Variant,not_normalized,False,Region-Defined Variants
...,...,...,...,...,...,...,...
75,953,KIT Y553_K558DEL,protein,Inframe Deletion,not_normalized,False,Sequence Variants
76,1542,KIT Y553_W557DELYEVQW,protein,Inframe Deletion,not_normalized,False,Sequence Variants
77,1548,KIT Y570_L576DEL,protein,Inframe Deletion,not_normalized,False,Sequence Variants
78,1630,FLT3 Y591_V592INSVDFREYE,protein,Missense Variant,not_normalized,False,Sequence Variants


For each df, get the CIViC variant counts by category and add them to the counts dictionary.

In [26]:
civic_normalized_category_counts = civic_normalized_df_cats["category"].value_counts()
# Iterate over (category, count) pairs; series[i] by integer position is deprecated on a label-indexed Series
for category, count in civic_normalized_category_counts.items():
    print(category, count)
    category_counts[category][nomalized_count] += int(count)
    category_counts[category][total_count] += int(count)

for i in category_counts.items():
    print(i)


Sequence Variants 1788
Copy Number Variants 57
Region-Defined Variants 28
Fusion Variants 1
Other Variants 1
('Copy Number Variants', [57, 0, 0, 57])
('Epigenetic Modification', [0, 0, 0, 0])
('Expression Variants', [0, 0, 0, 0])
('Fusion Variants', [1, 0, 0, 1])
('Gene Function Variants', [0, 0, 0, 0])
('Genotype Variants', [0, 0, 0, 0])
('Other Variants', [1, 0, 0, 1])
('Rearrangement Variants', [0, 0, 0, 0])
('Region-Defined Variants', [28, 0, 0, 28])
('Sequence Variants', [1788, 0, 0, 1788])


In [27]:
civic_not_normalized_category_counts = civic_not_normalized_df_cats["category"].value_counts()
# Iterate over (category, count) pairs; series[i] by integer position is deprecated on a label-indexed Series
for category, count in civic_not_normalized_category_counts.items():
    print(category, count)
    category_counts[category][unable_to_normalize_count] += int(count)
    category_counts[category][total_count] += int(count)

for i in category_counts.items():
    print(i)

Sequence Variants 77
Region-Defined Variants 3
('Copy Number Variants', [57, 0, 0, 57])
('Epigenetic Modification', [0, 0, 0, 0])
('Expression Variants', [0, 0, 0, 0])
('Fusion Variants', [1, 0, 0, 1])
('Gene Function Variants', [0, 0, 0, 0])
('Genotype Variants', [0, 0, 0, 0])
('Other Variants', [1, 0, 0, 1])
('Rearrangement Variants', [0, 0, 0, 0])
('Region-Defined Variants', [28, 3, 0, 31])
('Sequence Variants', [1788, 77, 0, 1865])


Read in the csv for unsupported variants.  This data was already mapped to categories in civic_variant_analysis.  Therefore, we only need to import the data and perform the count on the category column.

In [28]:
not_supported_variants = pd.read_csv("../civic/variation_analysis/not_supported_variants.csv",sep = "\t")
print(not_supported_variants.shape)
not_supported_variants.head()

(1563, 6)


Unnamed: 0,variant_id,gene_name,variant_name,civic_variant_types,category,variant_accepted
0,4170,VHL,,Not provided,Transcript Variants,False
1,4417,ALK,FBXO11::ALK,Not provided,Fusion Variants,False
2,4214,VHL,,Not provided,Transcript Variants,False
3,4216,VHL,,Not provided,Transcript Variants,False
4,4278,VHL,,Not provided,Transcript Variants,False


Checking Counts.

In [29]:
not_supported_variants["category"].value_counts()

category
Transcript Variants        366
Fusion Variants            294
Expression Variants        287
Sequence Variants          133
Region Defined Variants    129
Rearrangement Variants     116
Gene Function Variants      91
Other Variants              83
Copy Number Variants        34
Genotype Variants           16
Epigenetic Modification     14
Name: count, dtype: int64

There are two small discrepancies here. First, the hyphen is missing from "Region-Defined Variants", which would cause a KeyError when updating the counts dictionary. Second, the variants labelled as "Transcript Variants" here should be binned under "Other Variants". Fixing both now.

In [30]:
# Assign the result back: replace(..., inplace=True) on a selected column is chained assignment and may not update the DataFrame
not_supported_variants["category"] = not_supported_variants["category"].replace({
    "Region Defined Variants": "Region-Defined Variants",
    "Transcript Variants": "Other Variants",
})

In [31]:
not_supported_variants_category_counts = not_supported_variants["category"].value_counts()
# Iterate over (category, count) pairs; series[i] by integer position is deprecated on a label-indexed Series
for category, count in not_supported_variants_category_counts.items():
    print(category, count)
    category_counts[category][unsupported_count] += int(count)
    category_counts[category][total_count] += int(count)

for i in category_counts.items():
    print(i)

Other Variants 449
Fusion Variants 294
Expression Variants 287
Sequence Variants 133
Region-Defined Variants 129
Rearrangement Variants 116
Gene Function Variants 91
Copy Number Variants 34
Genotype Variants 16
Epigenetic Modification 14
('Copy Number Variants', [57, 0, 34, 91])
('Epigenetic Modification', [0, 0, 14, 14])
('Expression Variants', [0, 0, 287, 287])
('Fusion Variants', [1, 0, 294, 295])
('Gene Function Variants', [0, 0, 91, 91])
('Genotype Variants', [0, 0, 16, 16])
('Other Variants', [1, 0, 449, 450])
('Rearrangement Variants', [0, 0, 116, 116])
('Region-Defined Variants', [28, 3, 129, 160])
('Sequence Variants', [1788, 77, 133, 1998])


## MOA

Read MOA .csv file for Normalized variants

In [32]:
moa_normalized_df = pd.read_csv("../moa/feature_analysis/able_to_normalize_queries.csv",sep = "\t")
print(moa_normalized_df.shape)
moa_normalized_df.head()
type(moa_normalized_df)

(181, 5)


pandas.core.frame.DataFrame

Get variant counts by category and update the category_counts dictionary.

In [33]:
moa_normalized_category_counts = moa_normalized_df["category"].value_counts(dropna=False)
# Iterate over (label, count) pairs and map each MOA label to its target category
for variant, count in moa_normalized_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][nomalized_count] += int(count)
    category_counts[target_category][total_count] += int(count)

for i in category_counts.items():
    print(i)


Sequence Variants 149
Copy Number Variants 32
('Copy Number Variants', [89, 0, 34, 123])
('Epigenetic Modification', [0, 0, 14, 14])
('Expression Variants', [0, 0, 287, 287])
('Fusion Variants', [1, 0, 294, 295])
('Gene Function Variants', [0, 0, 91, 91])
('Genotype Variants', [0, 0, 16, 16])
('Other Variants', [1, 0, 449, 450])
('Rearrangement Variants', [0, 0, 116, 116])
('Region-Defined Variants', [28, 3, 129, 160])
('Sequence Variants', [1937, 77, 133, 2147])


Repeat same process for variants that were supported but failed to normalize.

In [34]:
moa_not_normalized_df = pd.read_csv("../moa/feature_analysis/unable_to_normalize_queries.csv",sep = "\t")
print(moa_not_normalized_df.shape)
moa_not_normalized_df.head()
type(moa_not_normalized_df)

(0, 7)


pandas.core.frame.DataFrame

In [35]:
moa__not_normalized_category_counts = moa_not_normalized_df["category"].value_counts(dropna=False)
# Iterate over (label, count) pairs; with an empty frame this loop does nothing
for variant, count in moa__not_normalized_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][unable_to_normalize_count] += int(count)
    category_counts[target_category][total_count] += int(count)

for i in category_counts.items():
    print(i)


('Copy Number Variants', [89, 0, 34, 123])
('Epigenetic Modification', [0, 0, 14, 14])
('Expression Variants', [0, 0, 287, 287])
('Fusion Variants', [1, 0, 294, 295])
('Gene Function Variants', [0, 0, 91, 91])
('Genotype Variants', [0, 0, 16, 16])
('Other Variants', [1, 0, 449, 450])
('Rearrangement Variants', [0, 0, 116, 116])
('Region-Defined Variants', [28, 3, 129, 160])
('Sequence Variants', [1937, 77, 133, 2147])


Repeat same process for variants that are unsupported.

In [36]:
moa_not_supported_df = pd.read_csv("../moa/feature_analysis/not_supported_variants.csv",sep = "\t")
print(moa_not_supported_df.shape)
print(moa_not_supported_df.head())
type(moa_not_supported_df)
print(moa_not_supported_df["category"].value_counts(dropna=False))

(249, 4)
   variant_id                        query moa_feature_type  \
0           1             BCR--ABL1 Fusion    rearrangement   
1          12                   ALK Fusion    rearrangement   
2          15                          ALK    rearrangement   
3          18            ALK Translocation    rearrangement   
4          21  BRD4 t(15;19) Translocation    rearrangement   

                 category  
0  Rearrangement Variants  
1  Rearrangement Variants  
2  Rearrangement Variants  
3  Rearrangement Variants  
4  Rearrangement Variants  
category
Sequence Variants         181
Rearrangement Variants     35
Copy Number Variants       17
Expression Variants        11
Other Variants              5
Name: count, dtype: int64


In [37]:
othervars = moa_not_supported_df[moa_not_supported_df["category"] == "Other Variants"]
# Call head() with parentheses so the rows are printed rather than the bound method
print(othervars.head())

     variant_id                      query          moa_feature_type  \
210         781                   MSI-High  microsatellite_stability   
216         803                       High         mutational_burden   
217         805    High (>= 178 mutations)         mutational_burden   
218         806    High (>= 100 mutations)         mutational_burden   
219         808  High (>= 10 mutations/Mb)         mutational_burden   

           category  
210  Other Variants  
216  Other Variants  
217  Other Variants  
218  Other Variants  
219  Other Variants  

In [38]:
moa__not_supported_category_counts = moa_not_supported_df["category"].value_counts(dropna=False)
# Iterate over (label, count) pairs and map each MOA label to its target category
for variant, count in moa__not_supported_category_counts.items():
    print(variant, count)
    target_category = moa_category_bins[variant]
    category_counts[target_category][unsupported_count] += int(count)
    category_counts[target_category][total_count] += int(count)

for i in category_counts.items():
    print(i)


Sequence Variants 181
Rearrangement Variants 35
Copy Number Variants 17
Expression Variants 11
Other Variants 5
('Copy Number Variants', [89, 0, 51, 140])
('Epigenetic Modification', [0, 0, 14, 14])
('Expression Variants', [0, 0, 298, 298])
('Fusion Variants', [1, 0, 294, 295])
('Gene Function Variants', [0, 0, 91, 91])
('Genotype Variants', [0, 0, 16, 16])
('Other Variants', [1, 0, 454, 455])
('Rearrangement Variants', [0, 0, 151, 151])
('Region-Defined Variants', [28, 3, 129, 160])
('Sequence Variants', [1937, 77, 314, 2328])


## ClinVar

Get df from ClinVar Analysis

Get ClinVar variant counts by category, update the category_counts dictionary

Output counts df

Generate figure(?)
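
The "Output counts df" step listed above could look roughly like the sketch below. It is an assumption about the intended output, not existing notebook code: the column names and the `percent_normalized` column are hypothetical, and only two rows are included, copied from the counts printed at the end of the MOA section.

```python
import pandas as pd

# Two rows copied from the counts printed at the end of the MOA section
counts = {
    "Copy Number Variants": [89, 0, 51, 140],
    "Sequence Variants": [1937, 77, 314, 2328],
}
columns = ["normalized", "unable_to_normalize", "unsupported", "total"]

counts_df = pd.DataFrame.from_dict(counts, orient="index", columns=columns)

# Sanity check: the first three columns should add up to the total
parts = counts_df[["normalized", "unable_to_normalize", "unsupported"]].sum(axis=1)
assert (parts == counts_df["total"]).all()

# Share of each category that normalized, as a percentage
counts_df["percent_normalized"] = (
    100 * counts_df["normalized"] / counts_df["total"]
).round(1)
print(counts_df)
```

A frame in this shape can be written out with `to_csv` or passed to a plotting library for the figure step.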