# <a id='toc1_'></a>[ClinVar Variant Analysis](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [ClinVar Variant Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
    - [Import variant information file](#toc1_1_3_)    
  - [Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc1_2_)    
  - [Add Normalization Status of Variant based on out.errors](#toc1_3_)    
    - [Set Normalize Status of Variant as T/F](#toc1_3_1_)    
      - [Summary Table](#toc1_3_1_1_)    
  - [Create subgroups based on Variant Status](#toc1_4_)    
    - [Supported and Normalized Variants](#toc1_4_1_)    
    - [Supported and Not Normalized Variants](#toc1_4_2_)    
    - [Not Supported Variants](#toc1_4_3_)    
  - [Counting variants from each group](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [1]:
import ndjson
import pandas as pd
import numpy as np
import re
from dotenv import load_dotenv
from pathlib import Path
from boto3.exceptions import ResourceLoadException
from botocore.config import Config
import boto3
import gzip

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [2]:
path = Path("clinvar_variation_analysis_output")
path.mkdir(exist_ok = True)

### <a id='toc1_1_3_'></a>[Import variant information file](#toc0_)

In [3]:
#to refresh SSO session, run aws sso login

# s3 = boto3.client('s3')

# with open('../clinvar/output-variation_identity-vrs-1.3.ndjson.gz', 'wb') as data:
#     s3.download_fileobj('nch-igm-wagner-lab-public', 'variation-normalizer-manuscript/output-variation_identity-vrs-1.3.ndjson.gz', data)

In [4]:
with gzip.open('output-variation_identity-vrs-1.3.ndjson.gz', 'rb') as f:
    file_content = ndjson.load(f)


In [5]:
df0 = pd.json_normalize(file_content)
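
The loading and flattening steps can be approximated with the standard library alone. The sketch below is not how ndjson.load or pd.json_normalize are implemented; the helper names read_ndjson_gz and flatten and the two sample records are made up for illustration. It shows where dotted column names such as in.vrs_xform_plan.policy come from: nested keys joined with ".".

```python
import gzip
import io
import json


def read_ndjson_gz(fileobj):
    """Parse a gzip-compressed NDJSON stream: one JSON object per non-empty line."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def flatten(record, parent=""):
    """Flatten nested dicts into one level, joining keys with '.'."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


# Two made-up records, compressed in memory to stand in for the real file.
sample_records = [
    {"in": {"id": "1", "vrs_xform_plan": {"policy": "Canonical SPDI"}},
     "out": {"type": "Allele"}},
    {"in": {"id": "2", "vrs_xform_plan": {"policy": "Genotype/Haplotype"}},
     "out": {"type": "Text", "errors": ["x"]}},
]
raw = gzip.compress(
    "\n".join(json.dumps(r) for r in sample_records).encode("utf-8")
)

records = read_ndjson_gz(io.BytesIO(raw))
rows = [flatten(r) for r in records]
print(sorted(rows[0]))
```

Each flattened row has one key per leaf value, which is the shape the later cells rely on when they index columns like out.errors and out.type.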

In [6]:
df = df0.copy()

## <a id='toc1_2_'></a>[Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc0_)

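Check for blanks: missing policy values are filled with the string "None" so that they would show up in the counts below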

In [7]:
df["in.vrs_xform_plan.policy"] = df["in.vrs_xform_plan.policy"].fillna("None")

In [8]:
df["in.vrs_xform_plan.policy"].value_counts()

in.vrs_xform_plan.policy
Canonical SPDI                                      2118669
Absolute copy count                                   53263
Copy number change (cn loss|del and cn gain|dup)      27104
NCBI36 genomic only                                    4771
No hgvs or location info                               3089
Genotype/Haplotype                                     1440
Invalid/unsupported hgvs                               1336
Remaining valid hgvs alleles                            941
Min/max copy count range not supported                   14
Name: count, dtype: int64

In [9]:
df["support_status"] = df["in.vrs_xform_plan.policy"].copy()

df.loc[df["support_status"] == "Canonical SPDI", "support_status"] = True
df.loc[df["support_status"] == "Absolute copy count", "support_status"] = True
df.loc[df["support_status"] == "Copy number change (cn loss|del and cn gain|dup)",
    "support_status"] = True
df.loc[df["support_status"] == "NCBI36 genomic only", "support_status"] = False
df.loc[df["support_status"] == "No hgvs or location info", "support_status"] = False
df.loc[df["support_status"] == "Genotype/Haplotype", "support_status"] = False
df.loc[df["support_status"] == "Invalid/unsupported hgvs", "support_status"] = False
df.loc[df["support_status"] == "Remaining valid hgvs alleles", "support_status"] = True
df.loc[df["support_status"] == "Min/max copy count range not supported", 
    "support_status"] = False

In [10]:
df["support_status"].value_counts()

support_status
True     2199977
False      10650
Name: count, dtype: int64
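
The policy-to-status assignment can also be written as a single lookup table. The sketch below uses only the standard library; the name POLICY_SUPPORTED is an assumption, while the True/False values and the per-policy counts are copied from the cells above. It recomputes the two totals as a cross-check.

```python
# Policy -> supported? Values mirror the df.loc assignments in the notebook.
POLICY_SUPPORTED = {
    "Canonical SPDI": True,
    "Absolute copy count": True,
    "Copy number change (cn loss|del and cn gain|dup)": True,
    "Remaining valid hgvs alleles": True,
    "NCBI36 genomic only": False,
    "No hgvs or location info": False,
    "Genotype/Haplotype": False,
    "Invalid/unsupported hgvs": False,
    "Min/max copy count range not supported": False,
}

# Per-policy counts copied from the value_counts() output.
policy_counts = {
    "Canonical SPDI": 2118669,
    "Absolute copy count": 53263,
    "Copy number change (cn loss|del and cn gain|dup)": 27104,
    "NCBI36 genomic only": 4771,
    "No hgvs or location info": 3089,
    "Genotype/Haplotype": 1440,
    "Invalid/unsupported hgvs": 1336,
    "Remaining valid hgvs alleles": 941,
    "Min/max copy count range not supported": 14,
}

# Any policy missing from the lookup table would be listed here.
unmapped = set(policy_counts) - set(POLICY_SUPPORTED)

supported_total = sum(n for p, n in policy_counts.items() if POLICY_SUPPORTED[p])
not_supported_total = sum(n for p, n in policy_counts.items() if not POLICY_SUPPORTED[p])
print(supported_total, not_supported_total)
```

With pandas, the same dictionary could be passed to Series.map on the policy column. One difference from the df.loc approach is that a value absent from the dictionary (for example the "None" filler) becomes NaN instead of keeping its original string.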

## <a id='toc1_3_'></a>[Add Normalization Status of Variant based on out.errors](#toc0_)

The errors are stored as a list of values, some of which are strings and others of which are dictionaries (depending on whether the error was handled at the level of the Variation Normalizer or after the normalizer)

The "get_errors" function extracts the text of each error response for better readability and easier string processing

In [11]:
def get_errors(errors):
    errors_out = []
    for e in errors:
        if isinstance(e, str):
            errors_out.append(e)
        elif isinstance(e, dict):
            for k, v in e.items():
                # only get these keys from normalizer response
                if k not in ["msg", "response-errors"]:
                    continue
                if isinstance(v, str):
                    errors_out.append(v)
                elif isinstance(v, list):
                    errors_out.append(";".join(v))
    return ";".join(errors_out)

In [12]:
df["error_string"] = df["out.errors"].fillna("").apply(get_errors)
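
To make the flattening concrete, here is a small self-contained run of the same logic on a made-up out.errors value. The sample entries, including the "status" key, are hypothetical; only the "msg" and "response-errors" keys come from the notebook.

```python
def get_errors(errors):
    """Flatten a mixed list of str / dict error entries into one ';'-joined string."""
    errors_out = []
    for e in errors:
        if isinstance(e, str):
            errors_out.append(e)
        elif isinstance(e, dict):
            for k, v in e.items():
                # Only these keys carry error text; everything else is skipped.
                if k not in ("msg", "response-errors"):
                    continue
                if isinstance(v, str):
                    errors_out.append(v)
                elif isinstance(v, list):
                    errors_out.append(";".join(v))
    return ";".join(errors_out)


# Hypothetical out.errors value mixing a plain string and a dictionary.
sample = [
    "Unrecognized variation record",
    {
        "msg": "Error returned from variation normalizer",
        "response-errors": ["Unable to tokenize: X", "Unable to find classification for: X"],
        "status": 422,  # hypothetical key, ignored by get_errors
    },
]

flattened = get_errors(sample)
empty = get_errors("")  # what a missing value becomes after fillna("")
print(flattened)
```

Iterating over the empty string yields nothing, which is why rows without errors end up with an empty error_string.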

This is the number of unique error strings

There are many different strings because many of the errors contain specific genomic coordinates, which are unlikely to occur more than once

In [13]:
df["error_string"].nunique()

532

To get the core error message, the numeric values are replaced with "#"

In [14]:
def reduce_errors(error_string):
    out = error_string.lower()
    out = re.sub(r"\d+", "#", out)
    return out
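
A short illustration of the digit masking, using made-up error strings that differ only in their numbers. The inputs are hypothetical; the function body matches the cell above, written with a raw-string pattern.

```python
import re


def reduce_errors(error_string):
    """Lower-case the message and replace every run of digits with '#'."""
    out = error_string.lower()
    return re.sub(r"\d+", "#", out)


# Hypothetical messages: same error, different assemblies and accessions.
a = reduce_errors("Unable to find a GRCh38 accession for: NC_000015.9")
b = reduce_errors("Unable to find a GRCh37 accession for: NC_000001.10")
c = reduce_errors("Unable to find classification for: NC_000015.9:g.(100_?)_(?_200)del")
print(a)
print(c)
```

Because a and b reduce to the same string, they are counted together in value_counts().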

In [15]:
def reduce_errors_more(error_string):
    errs = error_string.split(";")
    new_errs = [re.sub(r"\:[ ]?[^\s]+[\s]?", "", err) for err in errs]
    return ";".join(new_errs)

In [16]:
df["error_string_reduce"] = df["error_string"].apply(reduce_errors)

In [17]:
df["error_string_reduce"] = df["error_string_reduce"].replace("", "Success")

In [18]:
df["error_string_reduce"].value_counts()

error_string_reduce
Success                                                                                                                                                  2209136
error returned from variation normalizer;unable to find a grch# accession for: nc_#.#                                                                        539
unrecognized variation record                                                                                                                                431
error returned from variation normalizer;unable to find classification for: nc_#.#:g.(#_?)_(?_#)del                                                          233
error returned from variation normalizer;unable to find classification for: nc_#.#:g.(#_?)_(?_#)dup                                                          151
error returned from variation normalizer;unable to tokenize: cm#.#:g.#_#dup;unable to find classification for: cm#.#:g.#_#dup                                 23
error returned

Some variants with a Not Supported policy have no entry in out.errors, so the step above labels them "Success" even though they were not normalized. Their Not Supported status comes from the manual policy mapping, not from an error message.

For these variants the label is overwritten with "Not Supported" so that they are not categorized as normalized

In [19]:
df.loc[
    (df["support_status"] == False) & (df["error_string_reduce"] == "Success"),
    "error_string_reduce",
] = "Not Supported"

The error strings are reduced further by removing the variant-specific text that follows each colon

In [20]:
df["error_string_reduce_2"] = df["error_string_reduce"].apply(reduce_errors_more)
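
The effect of this second reduction can be checked on a few of the reduced strings shown in the earlier value_counts() output. The sketch below repeats the function so that it runs on its own; the variable names are illustrative.

```python
import re


def reduce_errors_more(error_string):
    """Drop ':' plus the token that follows it from each ';'-separated message."""
    errs = error_string.split(";")
    return ";".join(re.sub(r":[ ]?[^\s]+[\s]?", "", err) for err in errs)


# Reduced strings taken from the value_counts() output of error_string_reduce.
reduced = [
    "error returned from variation normalizer;unable to find classification for: nc_#.#:g.(#_?)_(?_#)del",
    "error returned from variation normalizer;unable to find classification for: nc_#.#:g.(#_?)_(?_#)dup",
    "error returned from variation normalizer;unable to tokenize: cm#.#:g.#_#dup;unable to find classification for: cm#.#:g.#_#dup",
    "Success",
]

reduced_more = [reduce_errors_more(s) for s in reduced]
for s in reduced_more:
    print(s)
```

The del and dup messages collapse into one category. Note that the pattern also consumes one whitespace character after the removed token, so any text that followed it would be joined directly to the preceding words.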

In [21]:
df["error_string_reduce_2"].value_counts()

error_string_reduce_2
Success                                                                                                   2198936
Not Supported                                                                                               10200
error returned from variation normalizer;unable to find a grch# accession for                                 539
unrecognized variation record                                                                                 431
error returned from variation normalizer;unable to find classification for                                    403
error returned from variation normalizer;unable to tokenize;unable to find classification for                  49
error returned from variation normalizer;unable to find classification for;unable to tokenize                  46
error returned from variation normalizer;unable to find valid result for classification                         8
error returned from variation normalizer;unable to translate varia

### <a id='toc1_3_1_'></a>[Set Normalize Status of Variant as T/F](#toc0_)

If an error is present, the variant was not normalized and therefore has a False Normalize Status

In [22]:
df["normalize_status"] = df["error_string_reduce_2"] == "Success"
df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,clinvar:425693,...,,,,,,False,,Not Supported,Not Supported,False
1,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,clinvar:90650,...,,,,,,False,,Not Supported,Not Supported,False
2,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,clinvar:16098,...,,,,,,False,,Not Supported,Not Supported,False
3,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,clinvar:14905,...,,,,,,False,,Not Supported,Not Supported,False
4,1048409,NM_000512.5:c.567_1002dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:1048409,Text,clinvar:1048409,...,,,,,,False,,Not Supported,Not Supported,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210623,2202105,NM_153676.4(USH1C):c.496+14_496+15insGTACTCCAT...,SimpleAllele,Microsatellite,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210624,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True


#### <a id='toc1_3_1_1_'></a>[Summary Table](#toc0_)

In the table below, the cells show the number of variants with each expected behavior and how they actually ended up performing.

If a variant was in an "expected to pass" category and ends up as text, that is an instance of a normalizer failure on a supported variant

In [23]:
summary_df = df[["in.id", "support_status", "in.vrs_xform_plan.policy", "out.type"]].fillna(
    "NONE"
).groupby(["support_status", "in.vrs_xform_plan.policy", "out.type"]).count().unstack(
    level=2
).fillna(
    0
).astype(
    int
)["in.id"]

In [24]:
summary_df["VariantSum"] = summary_df.sum(axis = 1)

In [25]:
summary_df["NormalizedSum"] = summary_df[["Allele", "CopyNumberChange", "CopyNumberCount"]].sum(axis = 1)

In [26]:
summary_df["NormalizedPercent"] = (summary_df["NormalizedSum"] / summary_df["VariantSum"]).apply(lambda x : f"{round(x * 100, 2)}%")

In [27]:
summary_df = summary_df.drop(["VariantSum", "NormalizedSum"], axis=1)
summary_df

support_status,in.vrs_xform_plan.policy,Allele,CopyNumberChange,CopyNumberCount,NONE,Text,NormalizedPercent
False,Genotype/Haplotype,0,0,0,0,1440,0.0%
False,Invalid/unsupported hgvs,0,0,0,19,1317,0.0%
False,Min/max copy count range not supported,0,0,0,0,14,0.0%
False,NCBI36 genomic only,0,0,0,0,4771,0.0%
False,No hgvs or location info,0,0,0,0,3089,0.0%
True,Absolute copy count,0,1,52440,819,3,98.46%
True,Canonical SPDI,2118669,0,0,0,0,100.0%
True,Copy number change (cn loss|del and cn gain|dup),0,26889,0,209,6,99.21%
True,Remaining valid hgvs alleles,927,0,0,13,1,98.51%
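
The NormalizedPercent column is the share of variants whose out.type is Allele, CopyNumberChange, or CopyNumberCount. As a minimal check of that arithmetic, the sketch below recomputes the percentage for a few rows using the counts from the table; the function name normalized_percent is an assumption.

```python
# out.type values that count as a successful normalization.
NORMALIZED_TYPES = ("Allele", "CopyNumberChange", "CopyNumberCount")


def normalized_percent(type_counts):
    """Share of normalized out.type values, formatted like the notebook column."""
    total = sum(type_counts.values())
    normalized = sum(type_counts.get(t, 0) for t in NORMALIZED_TYPES)
    return f"{round(normalized / total * 100, 2)}%"


# Non-zero counts copied from the summary table rows.
rows = {
    "Absolute copy count": {
        "CopyNumberChange": 1, "CopyNumberCount": 52440, "NONE": 819, "Text": 3,
    },
    "Copy number change (cn loss|del and cn gain|dup)": {
        "CopyNumberChange": 26889, "NONE": 209, "Text": 6,
    },
    "Remaining valid hgvs alleles": {"Allele": 927, "NONE": 13, "Text": 1},
    "Genotype/Haplotype": {"Text": 1440},
}

percents = {policy: normalized_percent(counts) for policy, counts in rows.items()}
for policy, pct in percents.items():
    print(policy, pct)
```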


In [28]:
summary_df.to_csv(
    "clinvar_variation_analysis_output/variant_analysis_summary_df.csv"
)

## <a id='toc1_4_'></a>[Create subgroups based on Variant Status](#toc0_)

### <a id='toc1_4_1_'></a>[Supported and Normalized Variants](#toc0_)

In [29]:
supported_df = df.copy()

In [30]:
supported_df = supported_df.loc[
    (supported_df["support_status"] == True)
    & (supported_df["normalize_status"] == True)
]
supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
149,1676638,NM_000094.4(COL7A1):c.8729G>T (p.Gly2910Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
150,1676377,NM_012064.4(MIP):c.20C>T (p.Ala7Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
151,1676330,NM_054027.6(ANKH):c.259G>A (p.Val87Ile),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
152,1325429,NM_000138.5(FBN1):c.5855G>T (p.Gly1952Val),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
153,1676394,NM_000786.4(CYP51A1):c.1291C>T (p.Arg431Cys),SimpleAllele,single nucleotide variant,Allele,[hgvs],Remaining valid hgvs alleles,,Allele,,...,,,,,,True,,Success,Success,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2210622,1801441,NM_015189.3(EXOC6B):c.2197-66917_2197-66916ins...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210623,2202105,NM_153676.4(USH1C):c.496+14_496+15insGTACTCCAT...,SimpleAllele,Microsatellite,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210624,1972383,NM_003470.3(USP7):c.383+10_383+11insGTTTAAATGA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True
2210625,1496502,NM_007255.3(B4GALT7):c.881_882insTGAGGTGGATTAA...,SimpleAllele,Insertion,Allele,[canonical_spdi],Canonical SPDI,,Allele,,...,,,,,,True,,Success,Success,True


In [39]:
variation_type_count_supported_df = supported_df.value_counts(["in.variation_type", "in.vrs_xform_plan.policy"]).reset_index()
variation_type_count_supported_df

Unnamed: 0,in.variation_type,in.vrs_xform_plan.policy,count
0,single nucleotide variant,Canonical SPDI,1934692
1,Deletion,Canonical SPDI,92875
2,Duplication,Canonical SPDI,43185
3,copy number loss,Absolute copy count,26857
4,Microsatellite,Canonical SPDI,26678
5,copy number gain,Absolute copy count,25583
6,Deletion,Copy number change (cn loss|del and cn gain|dup),16249
7,Indel,Canonical SPDI,10986
8,Insertion,Canonical SPDI,8865
9,Duplication,Copy number change (cn loss|del and cn gain|dup),8119


In [38]:
variation_type_count_supported_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_supported_df.csv"
)

### <a id='toc1_4_2_'></a>[Supported and Not Normalized Variants](#toc0_)

In [33]:
supported_not_normalized_df = df.copy()

In [34]:
supported_not_normalized_df = supported_not_normalized_df.loc[
    (supported_not_normalized_df["support_status"] == True)
    & (supported_not_normalized_df["normalize_status"] == False)
]
supported_not_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
164,989220,NC_000015.9:g.(44884528_44881613)_(44877833_44...,SimpleAllele,Duplication,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),989220,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
166,10342,NG_011403.2:g.(80027_96047)_(131648_164496)del,SimpleAllele,Deletion,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),10342,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
210,602019,GRCh37/hg19 15q11.2(chr15:22750305-23140114)x3,SimpleAllele,copy number gain,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,602019,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
277,1706497,GRCh37/hg19 16p11.2(chr16:29432212-30177807)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1706497,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
308,1705934,GRCh37/hg19 Xp22.33(chrX:566009-1356042)x3,SimpleAllele,copy number gain,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,1705934,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2143307,830807,NC_000001.10:g.(?_145498103)_(145538307_?)dup,SimpleAllele,Duplication,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),830807,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
2143312,625541,GRCh37/hg19 1q21.1(chr1:145395604-145704146),SimpleAllele,copy number loss,CopyNumberChange,[hgvs],Copy number change (cn loss|del and cn gain|dup),625541,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
2143342,600207,GRCh37/hg19 1q32.1(chr1:206315933-206331193)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,600207,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False
2143345,565216,GRCh37/hg19 1q32.1(chr1:206173911-206288157)x1,SimpleAllele,copy number loss,CopyNumberCount,"[hgvs, absolute_copies]",Absolute copy count,565216,,,...,,,,,,True,Error returned from variation normalizer;Unabl...,error returned from variation normalizer;unabl...,error returned from variation normalizer;unabl...,False


In [42]:
variation_type_count_supported_not_normalized_df = supported_not_normalized_df.value_counts(["in.variation_type", "in.vrs_xform_plan.policy"]).reset_index()
variation_type_count_supported_not_normalized_df

Unnamed: 0,in.variation_type,in.vrs_xform_plan.policy,count
0,copy number loss,Absolute copy count,475
1,copy number gain,Absolute copy count,344
2,Deletion,Copy number change (cn loss|del and cn gain|dup),89
3,Duplication,Copy number change (cn loss|del and cn gain|dup),72
4,copy number loss,Copy number change (cn loss|del and cn gain|dup),31
5,copy number gain,Copy number change (cn loss|del and cn gain|dup),17
6,single nucleotide variant,Remaining valid hgvs alleles,11
7,Insertion,Remaining valid hgvs alleles,1
8,Variation,Remaining valid hgvs alleles,1


In [43]:
variation_type_count_supported_not_normalized_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_supported_not_normalized_df.csv"
)

### <a id='toc1_4_3_'></a>[Not Supported Variants](#toc0_)

In [46]:
not_supported_df = df.copy()

In [47]:
not_supported_df = not_supported_df.loc[
    (not_supported_df["support_status"] == False)
    & (not_supported_df["normalize_status"] == False)
]
not_supported_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status
0,425693,NM_001204.6(BMPR2):c.77-?_247+?dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:425693,Text,clinvar:425693,...,,,,,,False,,Not Supported,Not Supported,False
1,90650,NM_000251.2(MSH2):c.1387-?_1510+?del,SimpleAllele,Deletion,Text,[id],No hgvs or location info,Text:clinvar:90650,Text,clinvar:90650,...,,,,,,False,,Not Supported,Not Supported,False
2,16098,"SLC2A2, 1-BP INS, 793C",SimpleAllele,Insertion,Text,[id],No hgvs or location info,Text:clinvar:16098,Text,clinvar:16098,...,,,,,,False,,Not Supported,Not Supported,False
3,14905,"HLA-DRB1, HLA-DRB1*1101",SimpleAllele,Variation,Text,[id],No hgvs or location info,Text:clinvar:14905,Text,clinvar:14905,...,,,,,,False,,Not Supported,Not Supported,False
4,1048409,NM_000512.5:c.567_1002dup,SimpleAllele,Duplication,Text,[id],No hgvs or location info,Text:clinvar:1048409,Text,clinvar:1048409,...,,,,,,False,,Not Supported,Not Supported,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2199349,1418992,NM_002076.4(GNS):c.841_842insTTTTTTTTTTTTTTTTT...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1418992,Text,clinvar:1418992,...,,,,,,False,,Not Supported,Not Supported,False
2199350,2134754,NM_152564.5(VPS13B):c.5627_5628insTTTTTTTTTTTT...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:2134754,Text,clinvar:2134754,...,,,,,,False,,Not Supported,Not Supported,False
2210617,1513408,NM_024928.5(STN1):c.340_352AAG[2]CTACAAGGCCGGG...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1513408,Text,clinvar:1513408,...,,,,,,False,,Not Supported,Not Supported,False
2210618,1464440,NM_001267550.2(TTN):c.29512_29513insGGCCGGGCGC...,SimpleAllele,Insertion,Text,[id],Invalid/unsupported hgvs,Text:clinvar:1464440,Text,clinvar:1464440,...,,,,,,False,,Not Supported,Not Supported,False


In [49]:
variation_type_count_not_supported_df = not_supported_df.value_counts(["in.variation_type", "in.vrs_xform_plan.policy"]).reset_index()
variation_type_count_not_supported_df

Unnamed: 0,in.variation_type,in.vrs_xform_plan.policy,count
0,copy number gain,NCBI36 genomic only,2897
1,copy number loss,NCBI36 genomic only,1749
2,Deletion,No hgvs or location info,1236
3,Diplotype,Genotype/Haplotype,596
4,Haplotype,Genotype/Haplotype,565
5,Microsatellite,Invalid/unsupported hgvs,415
6,Deletion,Invalid/unsupported hgvs,413
7,Insertion,No hgvs or location info,382
8,Insertion,Invalid/unsupported hgvs,306
9,single nucleotide variant,No hgvs or location info,287


In [50]:
variation_type_count_not_supported_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_not_supported_df.csv"
)

Sanity check: making sure there are no unsupported variants that have been marked as normalized (the table below should be empty)

In [51]:
not_supported_but_normalized_df = df.copy()

In [52]:
not_supported_but_normalized_df = not_supported_but_normalized_df.loc[
    (not_supported_but_normalized_df["support_status"] == False)
    & (not_supported_but_normalized_df["normalize_status"] == True)
]
not_supported_but_normalized_df

Unnamed: 0,in.id,in.name,in.subclass_type,in.variation_type,in.vrs_xform_plan.type,in.vrs_xform_plan.inputs,in.vrs_xform_plan.policy,out.id,out.type,out.definition,...,out.seq.ref_allele_vcf,out.seq.position_vcf,out.seq.alt_allele_vcf,in.max_copies,in.min_copies,support_status,error_string,error_string_reduce,error_string_reduce_2,normalize_status


## <a id='toc1_5_'></a>[Counting variants from each group](#toc0_)

In [None]:
num_supported = len(supported_df)
num_supported_not_normalized = len(supported_not_normalized_df)
num_not_supported_but_normalized = len(not_supported_but_normalized_df)
num_not_supported = len(not_supported_df)

In [None]:
summary_df2 = pd.DataFrame({"Supported":[num_supported, num_supported_not_normalized],
                "Not Supported":[num_not_supported_but_normalized, num_not_supported]})

In [None]:
summary_df2.index = ['Normalized', 'Not Normalized']
summary_df2
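
The same two-by-two layout can be built in one pass by counting (support_status, normalize_status) pairs, which also makes the sanity check explicit. This is a standard-library sketch on made-up pairs, not the notebook's data.

```python
from collections import Counter

# Made-up (support_status, normalize_status) pairs standing in for DataFrame rows.
statuses = [
    (True, True), (True, True), (True, True),
    (True, False),
    (False, False), (False, False),
]
counts = Counter(statuses)

# Columns are support status, rows are normalization status, as in summary_df2.
table = {
    "Supported": {
        "Normalized": counts[(True, True)],
        "Not Normalized": counts[(True, False)],
    },
    "Not Supported": {
        "Normalized": counts[(False, True)],
        "Not Normalized": counts[(False, False)],
    },
}

# Sanity checks: no unsupported variant is normalized, and every row is counted once.
unsupported_but_normalized = table["Not Supported"]["Normalized"]
cell_total = sum(sum(column.values()) for column in table.values())
print(table)
```

A Counter returns 0 for a pair that never occurs, so the empty "Not Supported and Normalized" cell needs no special handling.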

## Counting variant types for each group

In [None]:
variation_type_count_summary_df = pd.merge(
    pd.merge(
        variation_type_count_supported_df,
        variation_type_count_supported_not_normalized_df,
        on="in.variation_type",
        how="left",
    ),
    variation_type_count_not_supported_df,
    on="in.variation_type",
    how="right",
)
variation_type_count_summary_df = variation_type_count_summary_df.replace(np.nan,'',regex=True)

In [None]:
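# value_counts().reset_index() names the count column "count" (see the tables above),
# so the merged frame holds count_x, count_y and count rather than in.id columns
variation_type_count_summary_df = variation_type_count_summary_df.rename(columns={"count_x": "supported", "count_y": "supported_not_normalized", "count": "not_supported"})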

In [None]:
variation_type_count_summary_df.to_csv(
    "clinvar_variation_analysis_output/variation_type_count_summary_df.csv"
)
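
Joining on in.variation_type alone pairs every policy row of a variation type with every other row of that type, so a type that appears under several policies (Deletion, for example) yields repeated counts. As an alternative sketch, not the notebook's method, the three tables can be combined with an outer join keyed on (variation type, policy). The counts below are copied from the tables above; the dictionary names are assumptions.

```python
CN_CHANGE = "Copy number change (cn loss|del and cn gain|dup)"

# A few (variation_type, policy) -> count entries copied from the three tables.
supported = {
    ("Deletion", CN_CHANGE): 16249,
    ("Duplication", CN_CHANGE): 8119,
    ("copy number loss", "Absolute copy count"): 26857,
}
supported_not_normalized = {
    ("Deletion", CN_CHANGE): 89,
    ("Duplication", CN_CHANGE): 72,
    ("copy number loss", "Absolute copy count"): 475,
}
not_supported = {
    ("copy number loss", "NCBI36 genomic only"): 1749,
    ("Deletion", "No hgvs or location info"): 1236,
}

# Outer join: every key seen in any table gets exactly one row.
all_keys = sorted(set(supported) | set(supported_not_normalized) | set(not_supported))
summary_rows = [
    {
        "in.variation_type": variation_type,
        "in.vrs_xform_plan.policy": policy,
        # Missing groups are left blank, as in the notebook's NaN replacement.
        "supported": supported.get((variation_type, policy), ""),
        "supported_not_normalized": supported_not_normalized.get((variation_type, policy), ""),
        "not_supported": not_supported.get((variation_type, policy), ""),
    }
    for variation_type, policy in all_keys
]

for row in summary_rows:
    print(row)
```

With pandas, a comparable result could be obtained by merging on both columns with how="outer".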