# <a id='toc1_'></a>[ClinVar Variant Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [ClinVar Variant Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
    - [Import variant information file](#toc1_1_3_)    
  - [Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc1_2_)    
  - [Add Normalization Status of Variant based on out.errors](#toc1_3_)    
    - [Set Normalize Status of Variant as T/F](#toc1_3_1_)    
      - [Summary Table](#toc1_3_1_1_)    
  - [Create subgroups based on Variant Status](#toc1_4_)    
    - [Supported and Normalized Variants](#toc1_4_1_)    
    - [Supported and Not Normalized Variants](#toc1_4_2_)    
    - [Not Supported Variants](#toc1_4_3_)    
  - [Counting variants from each group](#toc1_5_)    
  - [Counting variant types for each group](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [7]:
import ndjson
import pandas as pd
import numpy as np
import re
from pathlib import Path
import boto3
import gzip

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [8]:
path = Path("clinvar_variation_analysis_output")
path.mkdir(exist_ok = True)

### <a id='toc1_1_3_'></a>[Import variant information file](#toc0_)

In [9]:
s3 = boto3.client('s3')

with open('../clinvar/output-variation_identity-vrs-1.3.ndjson.gz', 'wb') as data:
    s3.download_fileobj('nch-igm-wagner-lab-public', 'variation-normalizer-manuscript/output-variation_identity-vrs-1.3.ndjson.gz', data)

In [10]:
with gzip.open('../clinvar/output-variation_identity-vrs-1.3.ndjson.gz', 'rb') as f:
    file_content = ndjson.load(f)
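For reference, a gzipped NDJSON file can also be read with the standard library alone. The sketch below is not the loader used in this notebook: `read_ndjson_gz` and the demo file are hypothetical and exist only to illustrate the file format (one JSON object per line).

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_ndjson_gz(path):
    """Return one parsed JSON object per non-empty line of a gzipped NDJSON file."""
    records = []
    # "rt" decodes the decompressed bytes as text so each line can be parsed.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Tiny hypothetical file, written only to exercise the reader.
demo_path = Path(tempfile.mkdtemp()) / "demo.ndjson.gz"
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write('{"in": {"id": "1"}, "out": {"type": "Allele"}}\n')
    handle.write('{"in": {"id": "2"}, "out": {"type": "Text"}}\n')

demo_records = read_ndjson_gz(demo_path)
print(len(demo_records))
```

The resulting list of dictionaries has the same shape that `pd.json_normalize` expects in the next cell.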


In [11]:
df0 = pd.json_normalize(file_content)

In [12]:
df = df0.copy()

## <a id='toc1_2_'></a>[Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc0_)

Fill blank policy values with the string "None" so that any missing entries are counted explicitly

In [13]:
df["in.vrs_xform_plan.policy"] = df["in.vrs_xform_plan.policy"].fillna("None")

In [14]:
df["in.vrs_xform_plan.policy"].value_counts()

in.vrs_xform_plan.policy
Canonical SPDI                                      2118669
Absolute copy count                                   53263
Copy number change (cn loss|del and cn gain|dup)      27104
NCBI36 genomic only                                    4771
No hgvs or location info                               3089
Genotype/Haplotype                                     1440
Invalid/unsupported hgvs                               1336
Remaining valid hgvs alleles                            941
Min/max copy count range not supported                   14
Name: count, dtype: int64

In [15]:
df["support_status"] = df["in.vrs_xform_plan.policy"].copy()

df.loc[df["support_status"] == "Canonical SPDI", "support_status"] = True
df.loc[df["support_status"] == "Absolute copy count", "support_status"] = True
df.loc[df["support_status"] == "Copy number change (cn loss|del and cn gain|dup)",
    "support_status"] = True
df.loc[df["support_status"] == "NCBI36 genomic only", "support_status"] = False
df.loc[df["support_status"] == "No hgvs or location info", "support_status"] = False
df.loc[df["support_status"] == "Genotype/Haplotype", "support_status"] = False
df.loc[df["support_status"] == "Invalid/unsupported hgvs", "support_status"] = False
df.loc[df["support_status"] == "Remaining valid hgvs alleles", "support_status"] = True
df.loc[df["support_status"] == "Min/max copy count range not supported", 
    "support_status"] = False

In [16]:
df["support_status"].value_counts()

support_status
True     2199977
False      10650
Name: count, dtype: int64
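The policy-to-status assignment above can also be written as a single lookup table, which keeps the supported and unsupported policies visible in one place. This is a minimal sketch, not the notebook's own code: `POLICY_SUPPORTED` and `is_supported` are hypothetical names, and the policy strings are copied from the counts above.

```python
# Hypothetical lookup table; keys are the policy strings listed above.
POLICY_SUPPORTED = {
    "Canonical SPDI": True,
    "Absolute copy count": True,
    "Copy number change (cn loss|del and cn gain|dup)": True,
    "Remaining valid hgvs alleles": True,
    "NCBI36 genomic only": False,
    "No hgvs or location info": False,
    "Genotype/Haplotype": False,
    "Invalid/unsupported hgvs": False,
    "Min/max copy count range not supported": False,
}


def is_supported(policy):
    """Return True/False for a listed policy, or None for an unlisted one."""
    return POLICY_SUPPORTED.get(policy)


print(is_supported("Canonical SPDI"))
print(is_supported("NCBI36 genomic only"))
```

With pandas, the same dictionary could be passed to `Series.map`. One difference to keep in mind: `map` turns unlisted policies into NaN, whereas the cell-by-cell assignment leaves their original string in place.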

## <a id='toc1_3_'></a>[Add Normalization Status of Variant based on out.errors](#toc0_)

The errors are stored as a list of values, some of which are strings and others dictionaries (depending on whether the error was handled by the Variation Normalizer itself or after it)

The "get_errors" function extracts the text of each error so that the result is easier to read and to process as a string

In [17]:
def get_errors(errors: list) -> str:
    """Takes the values for the errors and makes them into a string
    :param errors: list of errors
    :return: string representing error
    """
    errors_out = []
    for e in errors:
        if isinstance(e, str):
            errors_out.append(e)
        elif isinstance(e, dict):
            for k, v in e.items():
                # only get these keys from normalizer response
                if k not in ["msg", "response-errors"]:
                    continue
                if isinstance(v, str):
                    errors_out.append(v)
                elif isinstance(v, list):
                    # check the value (not the enclosing dict) so list fields are joined
                    errors_out.append(";".join(v))
    return ";".join(errors_out)

In [18]:
df["error_string"] = df["out.errors"].fillna("").apply(get_errors)
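To make the flattening concrete, here is a small standalone sketch of the same idea applied to one mixed list. `flatten_errors` is a hypothetical stand-in for the function above, and the sample entries (including the `status` key) are invented for illustration rather than taken from the data.

```python
def flatten_errors(errors):
    """Join plain-string errors and selected dict fields into one ';'-separated string."""
    kept_keys = ("msg", "response-errors")
    parts = []
    for entry in errors:
        if isinstance(entry, str):
            parts.append(entry)
        elif isinstance(entry, dict):
            for key, value in entry.items():
                if key not in kept_keys:
                    continue  # ignore every other field of the response
                if isinstance(value, str):
                    parts.append(value)
                elif isinstance(value, list):
                    parts.append(";".join(value))
    return ";".join(parts)


# Hypothetical input: one string error and one dictionary error.
sample = [
    "Unrecognized variation record",
    {
        "msg": "Error returned from variation normalizer",
        "response-errors": ["first detail", "second detail"],
        "status": 500,  # not in kept_keys, so it is dropped
    },
]
flat = flatten_errors(sample)
print(flat)
```

An empty string, which is what `fillna("")` supplies for rows without errors, yields an empty result because iterating over it produces nothing.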

In [19]:
df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,in.seq.position_vcf,in.seq.alt_allele_vcf,in.canonical_spdi,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string
0,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,clinvar:425693,...,,,,,,,,,False,
1,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,clinvar:90650,...,,,,,,,,,False,
2,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,clinvar:16098,...,,,,,,,,,False,
3,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,clinvar:14905,...,,,,,,,,,False,
4,1048409,NM_000512.5:c.567_1002dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:1048409,Text,clinvar:1048409,...,,,,,,,,,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,72251103,TCTGATTGATATTTAATAATGTAATTTAATTAAAATATATTTA,NC_000002.12:72251103::CTGATTGATATTTAATAATGTAA...,,,,,,True,
2210623,2202105,NM_153676.4(USH1C):c.496+14_496+15insGTACTCCAT...,SimpleAllele,Microsatellite,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,17527208,TCCCCCGCCCTCCCTCCCTCCCACCGTCATGGAGTA,NC_000011.10:17527208:C:CCCCCGCCCTCCCTCCCTCCCA...,,,,,,True,
2210624,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,8923204,TACAACGATGTGGGGGTTTGTAGTTACATCATTTAA,NC_000016.10:8923204:AC:ACAACGATGTGGGGGTTTGTAG...,,,,,,True,
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,177609590,TGTTGAGGTGGATTAAACCAAACCCAGCTACGCAAAATCTTA,NC_000005.10:177609590:GT:GTTGAGGTGGATTAAACCAA...,,,,,,True,


This is the number of unique error strings

Error messages can contain specific genomic coordinates, which are unlikely to occur more than once and would inflate this count; the output below shows only 3 distinct strings for this run

In [20]:
df["error_string"].nunique()

3

To get the core error message, each error string is lowercased and every run of digits is replaced with "#"

In [21]:
def reduce_errors(error_string: str) -> str:
    """ Reduces the error strings to consolidate number of groups
    :param error_string: string representing error
    :return: string representing error but shorter
    """
    out = error_string.lower()
    out = re.sub(r"\d+", "#", out)
    return out
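As an illustration of why masking digits consolidates groups, the sketch below applies the same lowercase-and-mask step to two messages that differ only in a coordinate. `mask_digits` is a hypothetical name and the messages are invented; they are not real normalizer output.

```python
import re


def mask_digits(error_string):
    """Lowercase the message and replace each run of digits with '#'."""
    return re.sub(r"\d+", "#", error_string.lower())


# Hypothetical messages that differ only in the position they mention.
examples = [
    "Unable to find position 12345 on NC_000001.11",
    "Unable to find position 67890 on NC_000001.11",
]
masked = {mask_digits(message) for message in examples}
print(masked)
```

Both messages collapse to the single string `unable to find position # on nc_#.#`, so they fall into one group when counted.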

In [22]:
def reduce_errors_more(error_string: str) -> str:
    """ Reduces the error strings to consolidate number of groups even more
    :param error_string: string representing error
    :return: string representing error but even shorter
    """
    errs = error_string.split(";")
    new_errs = [re.sub(r": ?\S+\s?", "", err) for err in errs]
    return ";".join(new_errs)
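The second reduction removes each colon together with the token that follows it. The sketch below uses a pattern intended to be equivalent to the one in `reduce_errors_more`; `strip_after_colon` is a hypothetical name and the inputs are invented.

```python
import re


def strip_after_colon(error_string):
    """Within each ';'-separated part, drop every ':' plus the token after it."""
    parts = error_string.split(";")
    return ";".join(re.sub(r": ?\S+\s?", "", part) for part in parts)


print(strip_after_colon("invalid position: #;unable to parse: nc_#.#"))
print(strip_after_colon("Success"))
print(strip_after_colon("bad value: # given"))
```

The last call shows a side effect worth knowing about: the whitespace after the removed token is consumed as well, so the result is `bad valuegiven`. That is harmless for grouping but makes the reduced strings less readable.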

In [23]:
df["error_string_reduce"] = df["error_string"].apply(reduce_errors)

In [24]:
df["error_string_reduce"] = df["error_string_reduce"].replace("", "Success")

In [25]:
df["error_string_reduce"].value_counts()

error_string_reduce
Success                                     2209136
error returned from variation normalizer       1060
unrecognized variation record                   431
Name: count, dtype: int64

Some Not Supported variants have an empty error string and are therefore inaccurately marked "Success": their Not Supported status was assigned manually above from the policy, not reported as an error.

A "Not Supported" label is entered manually for those variants so that they are not categorized as normalized

In [26]:
df.loc[
    (df["support_status"] == False) & (df["error_string_reduce"] == "Success"),
    "error_string_reduce",
] = "Not Supported"

The error strings are then reduced further by removing each colon and the token that follows it

In [27]:
df["error_string_reduce_2"] = df["error_string_reduce"].apply(reduce_errors_more)

In [28]:
df["error_string_reduce_2"].value_counts()

error_string_reduce_2
Success                                     2198936
Not Supported                                 10200
error returned from variation normalizer       1060
unrecognized variation record                   431
Name: count, dtype: int64

### <a id='toc1_3_1_'></a>[Set Normalize Status of Variant as T/F](#toc0_)

If an error is present, the variant was not normalized and therefore has a False Normalize Status

In [29]:
df["normalize_status"] = df["error_string_reduce_2"] == "Success"
df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,clinvar:425693,...,,,,,,False,,Not Supported,Not Supported,False
1,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,clinvar:90650,...,,,,,,False,,Not Supported,Not Supported,False
2,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,clinvar:16098,...,,,,,,False,,Not Supported,Not Supported,False
3,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,clinvar:14905,...,,,,,,False,,Not Supported,Not Supported,False
4,1048409,NM_000512.5:c.567_1002dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:1048409,Text,clinvar:1048409,...,,,,,,False,,Not Supported,Not Supported,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210623,2202105,NM_153676.4(USH1C):c.496+14_496+15insGTACTCCAT...,SimpleAllele,Microsatellite,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210624,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True


#### <a id='toc1_3_1_1_'></a>[Summary Table](#toc0_)

In the table below, each row is a support status and policy (the expected behavior) and each column is the output type that was actually produced; the cells count the variants in each combination.

If a variant in a supported ("expected to pass") policy ends up as Text, that is an instance of a normalizer failure on a supported variant

In [30]:
summary_df = df[["in.id", "support_status", "in.vrs_xform_plan.policy", "out.type"]].fillna(
    "NONE"
).groupby(["support_status", "in.vrs_xform_plan.policy", "out.type"]).count().unstack(
    level=2
).fillna(
    0
).astype(
    int
)["in.id"]

In [31]:
summary_df["VariantSum"] = summary_df.sum(axis = 1)

In [32]:
summary_df["NormalizedSum"] = summary_df[["Allele", "CopyNumberChange", "CopyNumberCount"]].sum(axis = 1)

In [33]:
summary_df["NormalizedPercent"] = (summary_df["NormalizedSum"] / summary_df["VariantSum"]).apply(lambda x : f"{round(x * 100, 2)}%")

In [34]:
summary_df = summary_df.drop(["VariantSum", "NormalizedSum"], axis=1)
summary_df

Unnamed: 0_level_0,out.type,Allele,CopyNumberChange,CopyNumberCount,NONE,Text,NormalizedPercent
support_status,in.vrs_xform_plan.policy,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
False,Genotype/Haplotype,0,0,0,0,1440,0.0%
False,Invalid/unsupported hgvs,0,0,0,19,1317,0.0%
False,Min/max copy count range not supported,0,0,0,0,14,0.0%
False,NCBI36 genomic only,0,0,0,0,4771,0.0%
False,No hgvs or location info,0,0,0,0,3089,0.0%
True,Absolute copy count,0,1,52440,819,3,98.46%
True,Canonical SPDI,2118669,0,0,0,0,100.0%
True,Copy number change (cn loss|del and cn gain|dup),0,26889,0,209,6,99.21%
True,Remaining valid hgvs alleles,927,0,0,13,1,98.51%


In [35]:
summary_df.to_csv(
    "clinvar_variation_analysis_output/variant_analysis_summary_df.csv"
)
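The NormalizedPercent column can be checked by hand from the counts in the summary table above. The sketch below repeats the same arithmetic (normalized output types divided by all output types for the row) in plain Python; `normalized_percent` is a hypothetical helper and the counts are copied from the supported rows of the table.

```python
# Counts per row in column order: Allele, CopyNumberChange, CopyNumberCount, NONE, Text.
rows = {
    "Absolute copy count": (0, 1, 52440, 819, 3),
    "Canonical SPDI": (2118669, 0, 0, 0, 0),
    "Copy number change (cn loss|del and cn gain|dup)": (0, 26889, 0, 209, 6),
    "Remaining valid hgvs alleles": (927, 0, 0, 13, 1),
}


def normalized_percent(counts):
    """Share of a row's variants whose output type is one of the three normalized types."""
    allele, cn_change, cn_count, none, text = counts
    normalized = allele + cn_change + cn_count
    total = normalized + none + text
    return f"{round(normalized / total * 100, 2)}%"


percents = {policy: normalized_percent(counts) for policy, counts in rows.items()}
for policy, percent in percents.items():
    print(policy, percent)
```

For example, Absolute copy count has 52441 normalized variants out of 53263, which rounds to 98.46%.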

## <a id='toc1_4_'></a>[Create subgroups based on Variant Status](#toc0_)

### <a id='toc1_4_1_'></a>[Supported and Normalized Variants](#toc0_)

In [36]:
supported_df = df.copy()

In [37]:
supported_df = supported_df.loc[
    (supported_df["support_status"] == True)
    & (supported_df["normalize_status"] == True)
]
supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
149,1676638,NM_000094.4(COL7A1):c.8729G>T (p.Gly2910Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
150,1676377,NM_012064.4(MIP):c.20C>T (p.Ala7Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
151,1676330,NM_054027.6(ANKH):c.259G>A (p.Val87Ile),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
152,1325429,NM_000138.5(FBN1):c.5855G>T (p.Gly1952Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
153,1676394,NM_000786.4(CYP51A1):c.1291C>T (p.Arg431Cys),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210623,2202105,NM_153676.4(USH1C):c.496+14_496+15insGTACTCCAT...,SimpleAllele,Microsatellite,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210624,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True


In [38]:
variation_type_count_supported_df = supported_df[["in.id", "in.variation_type"]].groupby("in.variation_type").count()

In [39]:
variation_type_count_supported_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_supported_df.csv"
)

### <a id='toc1_4_2_'></a>[Supported and Not Normalized Variants](#toc0_)

In [40]:
supported_not_normalized_df = df.copy()

In [41]:
supported_not_normalized_df = supported_not_normalized_df.loc[
    (supported_not_normalized_df["support_status"] == True)
    & (supported_not_normalized_df["normalize_status"] == False)
]
supported_not_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
164,989220,NC_000015.9:g.(44884528_44881613)_(44877833_44...,SimpleAllele,Duplication,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),989220,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
166,10342,NG_011403.2:g.(80027_96047)_(131648_164496)del,SimpleAllele,Deletion,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),10342,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
210,602019,GRCh37/hg19 15q11.2(chr15:22750305-23140114)x3,SimpleAllele,copy number gain,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,602019,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
277,1706497,GRCh37/hg19 16p11.2(chr16:29432212-30177807)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1706497,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
308,1705934,GRCh37/hg19 Xp22.33(chrX:566009-1356042)x3,SimpleAllele,copy number gain,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1705934,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2143307,830807,NC_000001.10:g.(?_145498103)_(145538307_?)dup,SimpleAllele,Duplication,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),830807,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
2143312,625541,GRCh37/hg19 1q21.1(chr1:145395604-145704146),SimpleAllele,copy number loss,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),625541,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
2143342,600207,GRCh37/hg19 1q32.1(chr1:206315933-206331193)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,600207,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False
2143345,565216,GRCh37/hg19 1q32.1(chr1:206173911-206288157)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,565216,,,...,,,,,,True,Error returned from variation normalizer,error returned from variation normalizer,error returned from variation normalizer,False


In [42]:
variation_type_count_supported_not_normalized_df = supported_not_normalized_df[["in.id", "in.variation_type"]].groupby("in.variation_type").count()
variation_type_count_supported_not_normalized_df

Unnamed: 0_level_0,in.id
in.variation_type,Unnamed: 1_level_1
Deletion,89
Duplication,72
Insertion,1
Variation,1
copy number gain,361
copy number loss,506
single nucleotide variant,11


In [43]:
variation_type_count_supported_not_normalized_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_supported_not_normalized_df.csv"
)

### <a id='toc1_4_3_'></a>[Not Supported Variants](#toc0_)

In [44]:
not_supported_df = df.copy()

In [45]:
not_supported_df = not_supported_df.loc[
    (not_supported_df["support_status"] == False)
    & (not_supported_df["normalize_status"] == False)
]
not_supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,clinvar:425693,...,,,,,,False,,Not Supported,Not Supported,False
1,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,clinvar:90650,...,,,,,,False,,Not Supported,Not Supported,False
2,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,clinvar:16098,...,,,,,,False,,Not Supported,Not Supported,False
3,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,clinvar:14905,...,,,,,,False,,Not Supported,Not Supported,False
4,1048409,NM_000512.5:c.567_1002dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:1048409,Text,clinvar:1048409,...,,,,,,False,,Not Supported,Not Supported,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2199349,1418992,NM_002076.4(GNS):c.841_842insTTTTTTTTTTTTTTTTT...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1418992,Text,clinvar:1418992,...,,,,,,False,,Not Supported,Not Supported,False
2199350,2134754,NM_152564.5(VPS13B):c.5627_5628insTTTTTTTTTTTT...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:2134754,Text,clinvar:2134754,...,,,,,,False,,Not Supported,Not Supported,False
2210617,1513408,NM_024928.5(STN1):c.340_352AAG[2]CTACAAGGCCGGG...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1513408,Text,clinvar:1513408,...,,,,,,False,,Not Supported,Not Supported,False
2210618,1464440,NM_001267550.2(TTN):c.29512_29513insGGCCGGGCGC...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1464440,Text,clinvar:1464440,...,,,,,,False,,Not Supported,Not Supported,False


In [46]:
variation_type_count_not_supported_df = not_supported_df[["in.id", "in.variation_type"]].groupby("in.variation_type").count()
variation_type_count_not_supported_df

Unnamed: 0_level_0,in.id
in.variation_type,Unnamed: 1_level_1
Complex,77
CompoundHeterozygote,249
Deletion,1720
Diplotype,596
Distinct chromosomes,1
Duplication,431
Haplotype,565
"Haplotype, single variant",21
Indel,178
Insertion,688


In [47]:
variation_type_count_not_supported_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_not_supported_df.csv"
)

Sanity check: making sure there are no unsupported variants that have been marked as normalized

In [48]:
not_supported_but_normalized_df = df.copy()

In [49]:
not_supported_but_normalized_df = not_supported_but_normalized_df.loc[
    (not_supported_but_normalized_df["support_status"] == False)
    & (not_supported_but_normalized_df["normalize_status"] == True)
]
not_supported_but_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status


## <a id='toc1_5_'></a>[Counting variants from each group](#toc0_)

In [50]:
num_supported = len(supported_df)
num_supported_not_normalized = len(supported_not_normalized_df)
num_not_supported_but_normalized = len(not_supported_but_normalized_df)
num_not_supported = len(not_supported_df)

In [51]:
summary_df2 = pd.DataFrame({"Supported":[num_supported, num_supported_not_normalized],
                "Not Supported":[num_not_supported_but_normalized, num_not_supported]})

In [52]:
summary_df2.index = ['Normalized', 'Not Normalized']
summary_df2

Unnamed: 0,Supported,Not Supported
Normalized,2198936,0
Not Normalized,1041,10650
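The four subgroups are the four combinations of two boolean flags, so their sizes should add up to the totals seen earlier. The sketch below shows the counting on a handful of hypothetical (support_status, normalize_status) pairs and then repeats the consistency check with the numbers reported above; all variable names are illustrative.

```python
from collections import Counter

# Hypothetical (support_status, normalize_status) pairs standing in for DataFrame rows.
records = [(True, True), (True, True), (True, False), (False, False), (False, False)]
counts = Counter(records)

# Same layout as the table above: columns by support, rows by normalization.
table = {
    "Supported": [counts[(True, True)], counts[(True, False)]],
    "Not Supported": [counts[(False, True)], counts[(False, False)]],
}
cells_total = sum(sum(column) for column in table.values())
print(table, cells_total == len(records))

# Counts reported in this notebook; each column should match support_status.value_counts().
reported = {"Supported": [2198936, 1041], "Not Supported": [0, 10650]}
supported_total = sum(reported["Supported"])
not_supported_total = sum(reported["Not Supported"])
print(supported_total, not_supported_total)
```

The column totals, 2199977 and 10650, agree with the True and False counts of `support_status` shown earlier.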


## <a id='toc1_6_'></a>[Counting variant types for each group](#toc0_)

In [53]:
variation_type_count_summary_df = pd.merge(pd.merge(variation_type_count_supported_df,variation_type_count_supported_not_normalized_df, on='in.variation_type', how = "left"), variation_type_count_not_supported_df, on='in.variation_type', how = "right")
variation_type_count_summary_df = variation_type_count_summary_df.replace(np.nan,'',regex=True)

In [54]:
variation_type_count_summary_df = variation_type_count_summary_df.rename(columns={"in.id_x": "supported", "in.id_y": "supported_not_normalized", "in.id": "not_supported"})

In [55]:
variation_type_count_summary_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_summary_df.csv"
)
variation_type_count_summary_df

Unnamed: 0_level_0,supported,supported_not_normalized,not_supported
in.variation_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Complex,,,77
CompoundHeterozygote,,,249
Deletion,109127.0,89.0,1720
Diplotype,,,596
Distinct chromosomes,,,1
Duplication,51305.0,72.0,431
Haplotype,,,565
"Haplotype, single variant",,,21
Indel,11326.0,,178
Insertion,9056.0,1.0,688
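One property of the merge above is that `how = "left"` followed by `how = "right"` keeps only the variation types present in the last table. The standard-library sketch below imitates that key selection with hypothetical counts, next to a union-of-keys alternative; it is meant to illustrate the join semantics, not to reproduce the notebook's numbers.

```python
# Hypothetical per-type counts for the three groups.
supported = {"Deletion": 5, "Duplication": 3, "single nucleotide variant": 9}
supported_not_normalized = {"Deletion": 1, "copy number gain": 2}
not_supported = {"Deletion": 4, "Haplotype": 7}

# Left join: keep the keys of `supported`, look up the second table where possible.
left_joined = {
    key: (supported[key], supported_not_normalized.get(key)) for key in supported
}

# Right join: keep only the keys of `not_supported`.
right_keyed = {
    key: left_joined.get(key, (None, None)) + (not_supported[key],)
    for key in not_supported
}

# Outer-style alternative: keep the union of keys from all three tables.
all_keys = set(supported) | set(supported_not_normalized) | set(not_supported)
outer_keyed = {
    key: (supported.get(key), supported_not_normalized.get(key), not_supported.get(key))
    for key in sorted(all_keys)
}

print(sorted(right_keyed))
print(sorted(outer_keyed))
```

In this toy data, "copy number gain" and "single nucleotide variant" appear only in the union version. Whether that matters for the real table depends on which variation types occur in the Not Supported group; with pandas, passing `how="outer"` to both merges would keep the union.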
