# <a id='toc1_'></a>[ClinVar Variant Analysis](#toc0_)

The clinvar_variation_analysis notebook contains an analysis of ClinVar variant data.

**Table of contents**<a id='toc0_'></a>    
- [ClinVar Variant Analysis](#toc1_)    
  - [Initialize](#toc1_1_)    
    - [Import necessary libraries](#toc1_1_1_)    
    - [Create output directory](#toc1_1_2_)    
    - [Import variant information file](#toc1_1_3_)    
  - [Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc1_2_)    
  - [Add Normalization Status of Variant based on out.errors](#toc1_3_)    
    - [Set Normalize Status of Variant as T/F](#toc1_3_1_)    
      - [Summary Table](#toc1_3_1_1_)    
  - [Create subgroups based on Variant Status](#toc1_4_)    
    - [Supported and Normalized Variants](#toc1_4_1_)    
    - [Supported and Not Normalized Variants](#toc1_4_2_)    
    - [Not Supported Variants](#toc1_4_3_)    
  - [Counting variants from each group](#toc1_5_)    
  - [Counting variant types for each group](#toc1_6_)    


## <a id='toc1_1_'></a>[Initialize](#toc0_)

### <a id='toc1_1_1_'></a>[Import necessary libraries](#toc0_)

In [None]:
import ndjson
import pandas as pd
import numpy as np
from pathlib import Path
import boto3
import gzip

### <a id='toc1_1_2_'></a>[Create output directory](#toc0_)

In [None]:
path = Path("variation_analysis_output")
path.mkdir(exist_ok=True)

### <a id='toc1_1_3_'></a>[Import variant information file](#toc0_)

In [None]:
# Comment out this cell if the file already exists in the variation_analysis_output folder
s3 = boto3.client("s3")

with open("../clinvar/output-variation_identity-vrs-1.3.ndjson.gz", "wb") as data:
    s3.download_fileobj(
        "nch-igm-wagner-lab-public",
        "variation-normalizer-manuscript/output-variation_identity-vrs-1.3.ndjson.gz",
        data,
    )
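Rather than commenting the download cell in and out, the download can be guarded by an existence check. The following is a minimal sketch of that idea, not part of the original notebook: `ensure_local_file` and `fake_download` are hypothetical names, and `fake_download` only stands in for the `s3.download_fileobj` call so that the example runs without AWS access.

```python
from pathlib import Path
from typing import Callable


def ensure_local_file(target: Path, download: Callable[[Path], None]) -> bool:
    """Call download(target) only if target is missing.

    Returns True when the download callable was invoked, False otherwise.
    """
    if target.exists():
        return False
    # Create the parent folder so the download has somewhere to write.
    target.parent.mkdir(parents=True, exist_ok=True)
    download(target)
    return True


def fake_download(target: Path) -> None:
    # Hypothetical stand-in for the S3 download; writes placeholder bytes.
    target.write_bytes(b"placeholder")
```

In the notebook, the callable passed as `download` would wrap the `boto3` call shown above.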

In [None]:
with gzip.open("output-variation_identity-vrs-1.3.ndjson.gz", "rb") as f:
    file_content = ndjson.load(f)

In [None]:
df = pd.json_normalize(file_content)
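The dotted column names used in the rest of the notebook (`in.id`, `out.type`, and so on) come from `pd.json_normalize` flattening nested records. The sketch below illustrates this on two invented records written to a temporary gzip file; the record contents are assumptions made for illustration, and the standard-library `json` module is used in place of `ndjson` only to keep the example dependency-light.

```python
import gzip
import json
import tempfile
from pathlib import Path

import pandas as pd

# Invented records that mimic the nested "in"/"out" layout.
records = [
    {"in": {"id": "1", "variation_type": "Deletion"}, "out": {"type": "Allele"}},
    {"in": {"id": "2", "variation_type": "Duplication"}, "out": {"errors": ["boom"]}},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.ndjson.gz"
    # Write one JSON document per line, gzip-compressed.
    with gzip.open(sample_path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # Read the file back line by line.
    with gzip.open(sample_path, "rt", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

# Nested keys become dotted column names; missing keys become NaN.
sample_df = pd.json_normalize(loaded)
print(sorted(sample_df.columns))
```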

## <a id='toc1_2_'></a>[Add Supported Status of Variant based on in.vrs_xform_plan.policy](#toc0_)

Fill missing policy values with the string "None" so that they appear in the counts below.

In [None]:
df["in.vrs_xform_plan.policy"] = df["in.vrs_xform_plan.policy"].fillna("None")

In [None]:
df["in.vrs_xform_plan.policy"].value_counts()

In [None]:
df["support_status"] = df["in.vrs_xform_plan.policy"].copy()

df.loc[df["support_status"] == "Canonical SPDI", "support_status"] = True
df.loc[df["support_status"] == "Absolute copy count", "support_status"] = True
df.loc[
    df["support_status"] == "Copy number change (cn loss|del and cn gain|dup)",
    "support_status",
] = True
df.loc[df["support_status"] == "NCBI36 genomic only", "support_status"] = False
df.loc[df["support_status"] == "No hgvs or location info", "support_status"] = False
df.loc[df["support_status"] == "Genotype/Haplotype", "support_status"] = False
df.loc[df["support_status"] == "Invalid/unsupported hgvs", "support_status"] = False
df.loc[df["support_status"] == "Remaining valid hgvs alleles", "support_status"] = True
df.loc[
    df["support_status"] == "Min/max copy count range not supported", "support_status"
] = False

In [None]:
df["support_status"].value_counts()
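The label-by-label assignment above can also be expressed as a set lookup. This is a sketch of an alternative, not the notebook's method, and it behaves differently in one respect: any policy not listed in the hypothetical `SUPPORTED_POLICIES` set (including the "None" placeholder) becomes `False` rather than keeping its original string.

```python
import pandas as pd

# Policies treated as supported in the cell above.
SUPPORTED_POLICIES = {
    "Canonical SPDI",
    "Absolute copy count",
    "Copy number change (cn loss|del and cn gain|dup)",
    "Remaining valid hgvs alleles",
}

# Toy data; the real column is df["in.vrs_xform_plan.policy"].
toy = pd.DataFrame({"policy": ["Canonical SPDI", "Genotype/Haplotype", "None"]})

# isin returns a boolean Series, so the column has a true bool dtype.
toy["support_status"] = toy["policy"].isin(SUPPORTED_POLICIES)
print(toy)
```

A boolean dtype also keeps later expressions such as `~toy["support_status"]` unambiguous.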

## <a id='toc1_3_'></a>[Add Normalization Status of Variant based on out.errors](#toc0_)

The errors are stored as a list of values, some of which are strings and others of which are dictionaries, depending on whether the error was handled by the Variation Normalizer itself or after the normalizer.

The `get_errors` function extracts the error text from each entry, which improves readability and makes string processing easier.

In [None]:
def get_errors(errors: list) -> str:
    """Takes the values for the errors and represents them as a string
    :param errors: list of errors
    :return: string representing error
    """
    errors_out = []
    for e in errors:
        if isinstance(e, str):
            errors_out.append(e)
        elif isinstance(e, dict):
            for k, v in e.items():
                if k not in ["msg", "response-errors"]:
                    ## only get these keys from normalizer response
                    continue
                if isinstance(v, str):
                    errors_out.append(v)
                elif isinstance(v, list):
                    errors_out.append(";".join(v))
    return ";".join(errors_out)

In [None]:
df["error_string"] = df["out.errors"].fillna("").apply(get_errors)

In [None]:
df["error_string"] = df["error_string"].replace("", "Success")

In [None]:
df["error_string"].value_counts()
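To make the flattening concrete, the sketch below applies the same idea to a small mixed list. `flatten_errors` is a condensed, hypothetical restatement written so the example runs on its own, and the error messages are invented for illustration.

```python
def flatten_errors(errors) -> str:
    """Join string errors and selected dictionary values into one string."""
    parts = []
    for e in errors:
        if isinstance(e, str):
            parts.append(e)
        elif isinstance(e, dict):
            for key, value in e.items():
                # Only these keys carry error text in this sketch.
                if key not in ("msg", "response-errors"):
                    continue
                if isinstance(value, str):
                    parts.append(value)
                elif isinstance(value, list):
                    parts.append(";".join(value))
    return ";".join(parts)


# Invented sample: one plain string and two dictionary-style errors.
sample = [
    "Unable to tokenize",
    {"msg": "timeout", "status": 500},
    {"response-errors": ["a", "b"]},
]
print(flatten_errors(sample))
```

An empty input (the `fillna("")` case) yields an empty string, which is what the notebook then relabels as "Success".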

Some Not Supported variants have no recorded error, so the previous step inaccurately marks them as "Success". This happens because their Not Supported status was assigned manually rather than reported as an error.

For those variants, the error string is set to "Not Supported" so that they are not categorized as normalized.

In [None]:
df.loc[
    (~df["support_status"]) & (df["error_string"] == "Success"),
    "error_string",
] = "Not Supported"
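The override can be checked on a toy frame. This is a sketch under the assumption that `support_status` may be stored with object dtype after the label replacements; the explicit `astype(bool)` is an addition that keeps the mask boolean and is not in the original cell.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        # Object dtype mimics a column built by replacing strings with booleans.
        "support_status": pd.Series([True, False, False], dtype=object),
        "error_string": ["Success", "Success", "Some error"],
    }
)

# Cast before negating so that ~ acts as a logical NOT.
unsupported = ~toy["support_status"].astype(bool)
toy.loc[unsupported & (toy["error_string"] == "Success"), "error_string"] = "Not Supported"

# Only rows that still read "Success" count as normalized.
toy["normalize_status"] = toy["error_string"] == "Success"
print(toy)
```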

### <a id='toc1_3_1_'></a>[Set Normalize Status of Variant as T/F](#toc0_)

If an error is present, the variant was not normalized and therefore has a False Normalize Status.

In [None]:
df["normalize_status"] = df["error_string"] == "Success"
df

#### <a id='toc1_3_1_1_'></a>[Summary Table](#toc0_)

In the table below, each cell shows the number of variants with a given expected behavior (support status and policy), broken down by the output type they actually received.

If a variant in an "expected to pass" category ends up as text, that is an instance of a normalizer failure on a supported variant.

In [None]:
summary_df = (
    df[["in.id", "support_status", "in.vrs_xform_plan.policy", "out.type"]]
    .fillna("NONE")
    .groupby(["support_status", "in.vrs_xform_plan.policy", "out.type"])
    .count()
    .unstack(level=2)
    .fillna(0)
    .astype(int)["in.id"]
)

In [None]:
summary_df["VariantSum"] = summary_df.sum(axis=1)

In [None]:
summary_df["NormalizedSum"] = summary_df[
    ["Allele", "CopyNumberChange", "CopyNumberCount"]
].sum(axis=1)

In [None]:
summary_df["NormalizedPercent"] = (
    summary_df["NormalizedSum"] / summary_df["VariantSum"]
).apply(lambda x: f"{round(x * 100, 2)}%")

In [None]:
summary_df = summary_df.drop(["VariantSum", "NormalizedSum"], axis=1)
summary_df
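The same kind of table can be reproduced on toy data with `pd.crosstab`, which may make the percentage calculation easier to follow. This is a sketch with invented counts, not the notebook's method; the column names `policy` and `out_type` are placeholders for `in.vrs_xform_plan.policy` and `out.type`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "policy": ["Canonical SPDI", "Canonical SPDI", "Canonical SPDI", "Genotype/Haplotype"],
        "out_type": ["Allele", "Allele", "Text", "Text"],
    }
)

# Rows are policies, columns are output types, cells are variant counts.
counts = pd.crosstab(toy["policy"], toy["out_type"])

# Output types counted as normalized; keep only those present in the toy data.
normalized_types = [
    c for c in ["Allele", "CopyNumberChange", "CopyNumberCount"] if c in counts.columns
]
normalized_pct = counts[normalized_types].sum(axis=1) / counts.sum(axis=1) * 100
counts["NormalizedPercent"] = normalized_pct.map(lambda x: f"{x:.2f}%")
print(counts)
```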

In [None]:
summary_df.to_csv("variation_analysis_output/variant_analysis_summary_df.csv")

## <a id='toc1_4_'></a>[Create subgroups based on Variant Status](#toc0_)

### <a id='toc1_4_1_'></a>[Supported and Normalized Variants](#toc0_)

In [None]:
supported_df = df.copy()

In [None]:
supported_df = supported_df.loc[
    (supported_df["support_status"] & supported_df["normalize_status"])
]
supported_df

In [None]:
variation_type_count_supported_df = (
    supported_df[["in.id", "in.variation_type"]].groupby("in.variation_type").count()
)

In [None]:
variation_type_count_supported_df.to_csv(
    "variation_analysis_output/variation_type_count_supported_df.csv"
)

### <a id='toc1_4_2_'></a>[Supported and Not Normalized Variants](#toc0_)

In [None]:
supported_not_normalized_df = df.copy()

In [None]:
supported_not_normalized_df = supported_not_normalized_df.loc[
    (
        supported_not_normalized_df["support_status"]
        & ~supported_not_normalized_df["normalize_status"]
    )
]
supported_not_normalized_df

In [None]:
variation_type_count_supported_not_normalized_df = (
    supported_not_normalized_df[["in.id", "in.variation_type"]]
    .groupby("in.variation_type")
    .count()
)
variation_type_count_supported_not_normalized_df

In [None]:
variation_type_count_supported_not_normalized_df.to_csv(
    "variation_analysis_output/variation_type_count_supported_not_normalized_df.csv"
)

### <a id='toc1_4_3_'></a>[Not Supported Variants](#toc0_)

In [None]:
not_supported_df = df.copy()

In [None]:
not_supported_df = not_supported_df.loc[
    ~not_supported_df["support_status"] & ~not_supported_df["normalize_status"]
]
not_supported_df

In [None]:
variation_type_count_not_supported_df = (
    not_supported_df[["in.id", "in.variation_type"]]
    .groupby("in.variation_type")
    .count()
)
variation_type_count_not_supported_df

In [None]:
variation_type_count_not_supported_df.to_csv(
    "variation_analysis_output/variation_type_count_not_supported_df.csv"
)

Sanity check: making sure there are no Not Supported variants that have been marked as normalized.

In [None]:
not_supported_but_normalized_df = df.copy()

In [None]:
not_supported_but_normalized_df = not_supported_but_normalized_df.loc[
    (
        ~not_supported_but_normalized_df["support_status"]
        & not_supported_but_normalized_df["normalize_status"]
    )
]
not_supported_but_normalized_df
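Instead of inspecting the filtered frame by eye, the check can be turned into an assertion that fails loudly. A minimal sketch on toy data, assuming both status columns are boolean:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "support_status": [True, True, False],
        "normalize_status": [True, False, False],
    }
)

# Rows that are unsupported yet flagged as normalized should not exist.
unexpected = toy.loc[~toy["support_status"] & toy["normalize_status"]]
assert unexpected.empty, f"{len(unexpected)} unsupported variants marked as normalized"
```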

## <a id='toc1_5_'></a>[Counting variants from each group](#toc0_)

In [None]:
num_supported = len(supported_df)
num_supported_not_normalized = len(supported_not_normalized_df)
num_not_supported_but_normalized = len(not_supported_but_normalized_df)
num_not_supported = len(not_supported_df)

In [None]:
summary_df2 = pd.DataFrame(
    {
        "Supported": [num_supported, num_supported_not_normalized],
        "Not Supported": [num_not_supported_but_normalized, num_not_supported],
    }
)

In [None]:
summary_df2.index = ["Normalized", "Not Normalized"]
summary_df2
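The 2x2 table can also be derived directly from the two status columns with `pd.crosstab`, which avoids building four intermediate frames. The following is a sketch on invented data rather than the notebook's approach; the `reindex` call is there so that an empty combination still shows up as 0.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "support_status": [True, True, True, False],
        "normalize_status": [True, True, False, False],
    }
)

table = pd.crosstab(toy["normalize_status"], toy["support_status"])
# Fix the row and column order and fill any missing combination with 0.
table = table.reindex(index=[True, False], columns=[True, False], fill_value=0)
table.index = ["Normalized", "Not Normalized"]
table.columns = ["Supported", "Not Supported"]
print(table)
```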

## <a id='toc1_6_'></a>[Counting variant types for each group](#toc0_)

In [None]:
variation_type_count_summary_df = pd.merge(
    pd.merge(
        variation_type_count_supported_df,
        variation_type_count_supported_not_normalized_df,
        on="in.variation_type",
        how="left",
    ),
    variation_type_count_not_supported_df,
    on="in.variation_type",
    how="right",
)
# Variation types missing from a group have no count; treat them as 0.
variation_type_count_summary_df = variation_type_count_summary_df.fillna(0)

In [None]:
variation_type_count_summary_df = variation_type_count_summary_df.rename(
    columns={
        "in.id_x": "supported",
        "in.id_y": "supported_not_normalized",
        "in.id": "not_supported",
    }
)

In [None]:
variation_type_count_summary_df.to_csv(
    "variation_analysis_output/variation_type_count_summary_df.csv"
)
variation_type_count_summary_df
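Note that a left merge followed by a right merge keeps only the variation types present in the Not Supported counts. If every variation type from any group should appear, one option is to align the three count columns with `pd.concat` along `axis=1`, which performs an outer join on the index. This is a sketch on invented counts, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd


def toy_counts(data: dict) -> pd.DataFrame:
    """Build a toy count frame shaped like the groupby(...).count() output."""
    index = pd.Index(list(data), name="in.variation_type")
    return pd.DataFrame({"in.id": list(data.values())}, index=index)


supported = toy_counts({"Deletion": 5, "Indel": 2})
supported_not_normalized = toy_counts({"Deletion": 1})
not_supported = toy_counts({"Inversion": 3})

# Outer-align the three count columns; the dict keys become column names.
combined = (
    pd.concat(
        {
            "supported": supported["in.id"],
            "supported_not_normalized": supported_not_normalized["in.id"],
            "not_supported": not_supported["in.id"],
        },
        axis=1,
    )
    .fillna(0)
    .astype(int)
)
print(combined)
```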