# <a id='toc1_'></a>[ClinVar Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [ClinVar Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Import variant file from downloads- change to personal username](#toc1_1_2_)    
  - [Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc1_2_)    
  - [Add Normalization Status of Variant based on out.errors](#toc1_3_)    
    - [Set Normalize Status of Variant as T/F](#toc1_3_1_)    
      - [Summary Table](#toc1_3_1_1_)    
  - [Create subgroups based on Variant Status](#toc1_4_)    
    - [Supported and Normalized Variants](#toc1_4_1_)    
    - [Supported and Not Normalized Variants](#toc1_4_2_)    
    - [Not Supported Variants](#toc1_4_3_)    
  - [Counting variants from each group](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [1]:
import ndjson
import pandas as pd
import numpy as np
import re
from dotenv import load_dotenv

### <a id='toc1_1_2_'></a>[Import variant file from downloads- change to personal username](#toc0_)

In [2]:
file = open("/Users/rsaxs014/Downloads/output-variation_identity-nolimits.ndjson")
records = ndjson.load(file)

df0 = pd.json_normalize(records)

In [3]:
df = df0.copy()

## <a id='toc1_2_'></a>[Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc0_)

Checking for blanks

In [4]:
df["in.vrs_xform_plan.policy"] = df["in.vrs_xform_plan.policy"].fillna("None")

In [5]:
df["in.vrs_xform_plan.policy"].value_counts()

in.vrs_xform_plan.policy
Canonical SPDI                                      2118669
Absolute copy count                                   53263
Copy number change (cn loss|del and cn gain|dup)      27104
NCBI36 genomic only                                    4771
No hgvs or location info                               3089
Genotype/Haplotype                                     1440
Invalid/unsupported hgvs                               1336
Remaining valid hgvs alleles                            941
Min/max copy count range not supported                   14
Name: count, dtype: int64

In [6]:
df["support_status"] = df["in.vrs_xform_plan.policy"].copy()

df.loc[df["support_status"] == "Canonical SPDI", "support_status"] = True
df.loc[df["support_status"] == "Absolute copy count", "support_status"] = True
df.loc[df["support_status"] == "Copy number change (cn loss|del and cn gain|dup)",
    "support_status"] = True
df.loc[df["support_status"] == "NCBI36 genomic only", "support_status"] = False
df.loc[df["support_status"] == "No hgvs or location info", "support_status"] = False
df.loc[df["support_status"] == "Genotype/Haplotype", "support_status"] = False
df.loc[df["support_status"] == "Invalid/unsupported hgvs", "support_status"] = False
df.loc[df["support_status"] == "Remaining valid hgvs alleles", "support_status"] = True
df.loc[df["support_status"] == "Min/max copy count range not supported", 
    "support_status"] = False

In [7]:
df["support_status"].value_counts()

support_status
True     2199977
False      10650
Name: count, dtype: int64

## <a id='toc1_3_'></a>[Add Normalization Status of Variant based on out.errors](#toc0_)

The errors are stored as a list of values, some of which are strings and other of which are dictionaries (determined by whether error was handled at the level of Variation Normalizer or after the normalizer)

The "get_errors" function extracts the text error responses for better readability and ease string processing

In [8]:
def get_errors(errors):
    errors_out = []
    for e in errors:
        if type(e) == str:
            errors_out.append(e)
        elif type(e) == dict:
            for k, v in e.items():
                if k not in [
                    "msg",
                    "response-errors",]:
                ## only get these keys from normalizer response
                    continue
                if type(v) == str:
                    errors_out.append(v)
                elif type(v) == list:
                    errors_out.append(";".join(v))
    return ";".join(errors_out)

In [9]:
df["error_string"] = df["out.errors"].fillna("").apply(get_errors)

This is the number of unique error strings

There are many different strings because many of the errors contain specific genomic coordinates, which are unlikely to occur more than once

In [10]:
df["error_string"].nunique()

505

To get the core error message, the numeric values are replaced with "#"

In [11]:
def reduce_errors(error_string):
    out = error_string.lower()
    out = re.sub("\d+", "#", out)
    return out

In [12]:
def reduce_errors_more(error_string):
    errs = error_string.split(";")
    new_errs = [re.sub("\:[ ]?[^\s]+[\s]?", "", err) for err in errs]
    return ";".join(new_errs)

In [13]:
df["error_string_reduce"] = df["error_string"].apply(reduce_errors)

There are Not Supported variants that have no error because they were labeled "Not Supported" manually.

An error ("Not Supported") is entered manually for those variants so that they are not categorized as normalized

In [14]:
df.loc[
    (df["support_status"] == False) & (df["error_string_reduce"] == ""),
    "error_string_reduce",
] = "Not Supported"

The error strings had to be reduced further

In [15]:
df["error_string_reduce_2"] = df["error_string_reduce"].apply(reduce_errors_more)

In [16]:
df["error_string_reduce_2"].value_counts()

error_string_reduce_2
not status #                                                                                              2118672
                                                                                                            80814
Not Supported                                                                                               10200
unrecognized variation record                                                                                 431
error returned from variation normalizer;unable to find classification for                                    403
error returned from variation normalizer;unable to tokenize;unable to find classification for                  51
error returned from variation normalizer;unable to find classification for;unable to tokenize                  44
error returned from variation normalizer;unable to find valid result for classification                         8
error returned from variation normalizer;nc_#.#is not a supported 

### <a id='toc1_3_1_'></a>[Set Normalize Status of Variant as T/F](#toc0_)

If an error is present, the variant was not normalized and therefore has a False Normalize Status

In [17]:
df["normalize_status"] = df["error_string_reduce"] != ""
df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.xrefs,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,...,out.canonical_spdi,in.max_copies,in.min_copies,out.seq.inner_start,out.seq.inner_stop,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,OMIM:138160.0009,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,...,,,,,,False,,Not Supported,Not Supported,True
1,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,...,,,,,,False,,Not Supported,Not Supported,True
2,2446408,"CDHR1, 783G-A (rs147346345)",SimpleAllele,single nucleotide variant,OMIM:609502.0005,Text,[id],No hgvs or location info,Text:clinvar:2446408,Text,...,,,,,,False,,Not Supported,Not Supported,True
3,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,...,,,,,,False,,Not Supported,Not Supported,True
4,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,OMIM:142857.0001,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,...,,,,,,False,,Not Supported,Not Supported,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,,Allele,[canonical_spdi],Canonical SPDI,1801441,,...,NC_000002.12:72251103::CTGATTGATATTTAATAATGTAA...,,,,,True,Not status 200,not status #,not status #,True
2210623,1464440,NM_001267550.2(TTN):c.29512_29513insGGCCGGGCGC...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1464440,Text,...,,,,,,False,,Not Supported,Not Supported,True
2210624,1513408,NM_024928.5(STN1):c.340_352AAG[2]CTACAAGGCCGGG...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1513408,Text,...,,,,,,False,,Not Supported,Not Supported,True
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,,Allele,[canonical_spdi],Canonical SPDI,1496502,,...,NC_000005.10:177609590:GT:GTTGAGGTGGATTAAACCAA...,,,,,True,Not status 200,not status #,not status #,True


#### <a id='toc1_3_1_1_'></a>[Summary Table](#toc0_)

In the table below, the cells show the number of variants with each expected behavior and how they actually ended up performing.

If a variant was in an "expected to pass" category and ends up as text, that is an instance of a normalizer failure on a supported variant

In [18]:
df[["in.id", "support_status", "in.vrs_xform_plan.policy", "out.type"]].fillna(
    "NONE"
).groupby(["support_status", "in.vrs_xform_plan.policy", "out.type"]).count().unstack(
    level=2
).fillna(
    0
).astype(
    int
)

Unnamed: 0_level_0,Unnamed: 1_level_0,in.id,in.id,in.id,in.id,in.id
Unnamed: 0_level_1,out.type,Allele,CopyNumberChange,CopyNumberCount,NONE,Text
support_status,in.vrs_xform_plan.policy,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
False,Genotype/Haplotype,0,0,0,0,1440
False,Invalid/unsupported hgvs,0,0,0,19,1317
False,Min/max copy count range not supported,0,0,0,0,14
False,NCBI36 genomic only,0,0,0,0,4771
False,No hgvs or location info,0,0,0,0,3089
True,Absolute copy count,0,1,52880,379,3
True,Canonical SPDI,0,0,0,2118669,0
True,Copy number change (cn loss|del and cn gain|dup),0,26988,0,110,6
True,Remaining valid hgvs alleles,935,0,0,5,1


## <a id='toc1_4_'></a>[Create subgroups based on Variant Status](#toc0_)

### <a id='toc1_4_1_'></a>[Supported and Normalized Variants](#toc0_)

In [19]:
supported_df = df.copy()

In [20]:
supported_df = supported_df.loc[
    (supported_df["support_status"] == True)
    & (supported_df["normalize_status"] == True)
]
supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.xrefs,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,...,out.canonical_spdi,in.max_copies,in.min_copies,out.seq.inner_start,out.seq.inner_stop,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
170,10342,NG_011403.2:g.(80027_96047)_(131648_164496)del,SimpleAllele,Deletion,OMIM:300841.0259,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),10342,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,True
177,989220,NC_000015.9:g.(44884528_44881613)_(44877833_44...,SimpleAllele,Duplication,,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),989220,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,True
283,1706497,GRCh37/hg19 16p11.2(chr16:29432212-30177807)x1,SimpleAllele,copy number loss,,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1706497,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,True
310,1705934,GRCh37/hg19 Xp22.33(chrX:566009-1356042)x3,SimpleAllele,copy number gain,,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1705934,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,True
336,1711099,GRCh37/hg19 10q26.13-26.3(chr10:127198625-1354...,SimpleAllele,copy number gain,,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1711099,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210618,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,,Allele,[canonical_spdi],Canonical SPDI,1972383,,...,NC_000016.10:8923204:AC:ACAACGATGTGGGGGTTTGTAG...,,,,,True,Not status 200,not status #,not status #,True
2210619,1181160,NM_030962.4(SBF2):c.55+97_55+98insCGGGCGTCGGGGC,SimpleAllele,Microsatellite,,Allele,[canonical_spdi],Canonical SPDI,1181160,,...,NC_000011.10:10293917:CCCCGACGCCCG:CCCCGACGCCC...,,,,,True,Not status 200,not status #,not status #,True
2210620,455365,NM_000249.4(MLH1):c.1039-8_1039-7insTTTTTTTTTT...,SimpleAllele,Insertion,"ClinGen:CA658655821,dbSNP:535965616",Allele,[canonical_spdi],Canonical SPDI,455365,,...,NC_000003.12:37025629::TTTTTTTTTTTTTTTTTTA,,,,,True,Not status 200,not status #,not status #,True
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,,Allele,[canonical_spdi],Canonical SPDI,1801441,,...,NC_000002.12:72251103::CTGATTGATATTTAATAATGTAA...,,,,,True,Not status 200,not status #,not status #,True


### <a id='toc1_4_2_'></a>[Supported and Not Normalized Variants](#toc0_)

In [21]:
supported_not_normalized_df = df.copy()

In [22]:
supported_not_normalized_df = supported_not_normalized_df.loc[
    (supported_not_normalized_df["support_status"] == True)
    & (supported_not_normalized_df["normalize_status"] == False)
]
supported_not_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.xrefs,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,...,out.canonical_spdi,in.max_copies,in.min_copies,out.seq.inner_start,out.seq.inner_stop,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
149,1676330,NM_054027.6(ANKH):c.259G>A (p.Val87Ile),SimpleAllele,single nucleotide variant,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.T3hLVZajyx7AGQKV7la2RgE9CO0Wiv2y,Allele,...,,,,,,True,,,,False
150,1676394,NM_000786.4(CYP51A1):c.1291C>T (p.Arg431Cys),SimpleAllele,single nucleotide variant,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.fQvrcvvCsScmvBvq4d6nxm09Q7avAmbX,Allele,...,,,,,,True,,,,False
151,1676371,NM_000088.4(COL1A1):c.1037C>T (p.Pro346Leu),SimpleAllele,single nucleotide variant,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.2d9Gsax-wmwO0zsQ-ZIhCExbACqjVxLO,Allele,...,,,,,,True,,,,False
152,1676476,NM_012309.5(SHANK2):c.460C>T (p.Gln154Ter),SimpleAllele,single nucleotide variant,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.dZAOK2Sy6aJlXan36nZYUjEJpk5WX8pF,Allele,...,,,,,,True,,,,False
153,1676377,NM_012064.4(MIP):c.20C>T (p.Ala7Val),SimpleAllele,single nucleotide variant,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.-FEJu4eN_baUv2BWb4BxuoSyW4r8FBub,Allele,...,,,,,,True,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2199359,1430299,NM_001012759.3(CTU2):c.737+9_737+10insTGAGAGCC...,SimpleAllele,Insertion,,Allele,[hgvs],Remaining valid hgvs alleles,ga4gh:VA.dm7V9TRmkiQSkvMWrtt75FRyyBZBNApj,Allele,...,,,,,,True,,,,False
2210490,993055,NC_000013.10:g.32889619_32890666dup,SimpleAllele,Duplication,,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),ga4gh:CX.eR37N88iZGBhrAbApaJ8svu_dimnjyWD,CopyNumberChange,...,,,,,,True,,,,False
2210491,1755741,NM_000059.4(BRCA2):c.6842-587_7007+2347dup,SimpleAllele,Duplication,,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),ga4gh:CX.wAuyGaZw-vC4Xx4hFVf6574KhJ9w4nnv,CopyNumberChange,...,,,,,,True,,,,False
2210553,1508452,NM_001903.5(CTNNA1):c.2709_*44dup (p.Met903_Te...,SimpleAllele,Duplication,,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),ga4gh:CX.JmTrwmY_R9nyZ6Z-nGYfwk02VTmd8OpW,CopyNumberChange,...,,,,,,True,,,,False


### <a id='toc1_4_3_'></a>[Not Supported Variants](#toc0_)

In [23]:
not_supported_df = df.copy()

In [24]:
not_supported_df = not_supported_df.loc[
    (not_supported_df["support_status"] == False)
    & (not_supported_df["normalize_status"] == False)
]
not_supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.xrefs,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,...,out.canonical_spdi,in.max_copies,in.min_copies,out.seq.inner_start,out.seq.inner_stop,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status


Sanity check: making sure there are no supported variants that have been marked as normalized

In [25]:
not_supported_but_normalized_df = df.copy()

In [26]:
not_supported_but_normalized_df = not_supported_but_normalized_df.loc[
    (not_supported_but_normalized_df["support_status"] == False)
    & (not_supported_but_normalized_df["normalize_status"] == True)
]
not_supported_but_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.xrefs,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,...,out.canonical_spdi,in.max_copies,in.min_copies,out.seq.inner_start,out.seq.inner_stop,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,OMIM:138160.0009,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,...,,,,,,False,,Not Supported,Not Supported,True
1,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,...,,,,,,False,,Not Supported,Not Supported,True
2,2446408,"CDHR1, 783G-A (rs147346345)",SimpleAllele,single nucleotide variant,OMIM:609502.0005,Text,[id],No hgvs or location info,Text:clinvar:2446408,Text,...,,,,,,False,,Not Supported,Not Supported,True
3,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,...,,,,,,False,,Not Supported,Not Supported,True
4,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,OMIM:142857.0001,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,...,,,,,,False,,Not Supported,Not Supported,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2199348,1418992,NM_002076.4(GNS):c.841_842insTTTTTTTTTTTTTTTTT...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1418992,Text,...,,,,,,False,,Not Supported,Not Supported,True
2199349,2134754,NM_152564.5(VPS13B):c.5627_5628insTTTTTTTTTTTT...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:2134754,Text,...,,,,,,False,,Not Supported,Not Supported,True
2210621,1453116,NM_000051.4(ATM):c.3376_3377insGGCCGGGCGCGGTGG...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1453116,Text,...,,,,,,False,,Not Supported,Not Supported,True
2210623,1464440,NM_001267550.2(TTN):c.29512_29513insGGCCGGGCGC...,SimpleAllele,Insertion,,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1464440,Text,...,,,,,,False,,Not Supported,Not Supported,True


## <a id='toc1_5_'></a>[Counting variants from each group](#toc0_)

In [27]:
num_supported = len(supported_df)
num_supported_not_normalized = len(supported_not_normalized_df)
num_not_supported_but_normalized = len(not_supported_but_normalized_df)
num_not_supported = len(not_supported_df)

In [28]:
print(num_supported)
print(num_supported_not_normalized)
print(num_not_supported_but_normalized)
print(num_not_supported)

2119163
80814
10650
0
