# Preparation of variant consequence data

Genetic variants can be classified by the consequence that they have on the function of a gene or its protein product. Broadly, these consequences fall into the following categories:

* Synonymous: These variants change a codon without changing the amino acid it encodes, so the protein sequence is unaltered and there is usually no observable impact on an individual's health or traits.
* Missense: This type of variant leads to the replacement of one amino acid with another in the protein sequence, potentially affecting its structure or function.
* Nonsense (stop-gained): These variants result in a premature stop codon in the protein-coding sequence, leading to the production of a truncated and often non-functional protein.
* Frameshift: These variants insert or delete genetic material, causing a shift in the reading frame of the gene and disrupting the correct synthesis of the protein.
* Splice site: Variants at the boundary regions of introns and exons can affect the process of gene splicing, leading to abnormal protein production.
* Upstream: These variants occur in the DNA sequence located before (or "upstream" of) a gene. They can affect the regulation or expression of the gene by influencing how it is transcribed or controlled.
* Downstream: These variants occur in the DNA sequence located after (or "downstream" of) a gene. They might impact the gene's transcript processing, translation, or overall function.
* Deletion/Duplication: These variants involve the loss or duplication of a segment of DNA, potentially affecting gene function or dosage.
* Start lost: A start-lost variant alters the start codon of a gene, which affects the initiation of protein synthesis during gene expression.
* 3-prime/5-prime UTR: These variants are located in the untranslated regions of a transcript, before (5-prime) and after (3-prime) the coding region, and may impact various gene regulatory functions.
* Intronic: Intronic variants are located in introns, the regions of a gene that are removed from the transcript during splicing and do not encode protein. These variants may impact the splicing process.

Consequence data for variants identified in-house in African population groups from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) was retrieved from the [Ensembl Variant Effect Predictor]() (VEP) API. The retrieved data was stored in `Data/Raw/SUB/{gene_name}_VEP.csv` and `Data/Raw/SUPER/{gene_name}_VEP.csv`, where `{gene_name}` refers to the name of a specific gene. This notebook reads the files in `Data/Raw/SUPER`.

To prepare the data for further analysis, the following steps were performed: 

1. The consequence data for all genes was merged into a single dataset.
2. Only the relevant features, such as variant ID, reference and alternate allele information, variant position information, and variant consequences, were selected. These features were renamed if necessary.
3. An additional feature was added to provide a broad classification of variant consequences. 

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

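# PROJECT_ROOT is read from the environment variables loaded by load_dotenv()
PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    # Work from the Notebooks folder and make the project's Utils package importable
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)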

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions

Import variant consequence information

In [2]:
# Read the list of genes of interest
genes = pd.read_csv(
    os.path.join(PROJECT_ROOT, "Metadata", "locations.csv")
).location_name

# Import variant consequence information for each gene of interest
consequence_frames = []
for gene in genes:
    gene_vep_path = os.path.join(
        PROJECT_ROOT, "Data", "Raw", "SUPER", f"{gene}_VEP.csv"
    )
    # Genes without a VEP file are skipped
    if os.path.exists(gene_vep_path):
        consequence_df = pd.read_csv(gene_vep_path, sep=",")
        consequence_df["GENE"] = gene
        consequence_frames.append(consequence_df)

# Combine the variant consequence information for all genes into a single dataframe
gene_vep_data = (
    pd.concat(consequence_frames) if consequence_frames else pd.DataFrame()
)

gene_vep_data.head(5)

| | Unnamed: 0 | ID | POS | REF | ALT | Co-Located Variant | Transcript ID | Transcript Strand | Existing Variation | Start Coordinates | ... | Biotype | CADD_PHRED | input | SIFT_score | SIFT_pred | Polyphen_score | Polyphen_pred | CONDEL | CONDEL_pred | GENE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | chr13:110148882C-CT | 110148882 | C | CT | False | NM_001845.6 | -1.0 | - | 110148882 | ... | protein_coding | | 13:110148882-110148883:-1/CT | | | | | | | COL4A1 |
| 1 | 1 | rs552586867 | 110148891 | C | G | True | NM_001845.6 | -1.0 | rs552586867 | 110148891 | ... | protein_coding | 0.446 | 13:110148891-110148891:1/G | | | | | | | COL4A1 |
| 2 | 2 | rs59409892 | 110148917 | C | G | True | NM_001845.6 | -1.0 | rs59409892 | 110148917 | ... | protein_coding | 3.938 | 13:110148917-110148917:1/G | | | | | | | COL4A1 |
| 3 | 3 | rs535182970 | 110148920 | G | C | True | NM_001845.6 | -1.0 | rs535182970 | 110148920 | ... | protein_coding | 6.825 | 13:110148920-110148920:1/C | | | | | | | COL4A1 |
| 4 | 4 | rs56406633 | 110148959 | A | G | True | NM_001845.6 | -1.0 | rs56406633 | 110148959 | ... | protein_coding | 14.95 | 13:110148959-110148959:1/G | | | | | | | COL4A1 |
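
The import loop skips any gene that has no VEP file, without reporting it. A minimal sketch of a check that lists such genes, assuming the same `{gene}_VEP.csv` naming; the helper name `find_missing_vep_files` and the placeholder gene `GENE_X` are hypothetical and not part of the project's `Utils` module.

```python
import os
import tempfile


def find_missing_vep_files(genes, vep_dir):
    """Return the genes that have no `{gene}_VEP.csv` file in `vep_dir`."""
    return [
        gene
        for gene in genes
        if not os.path.exists(os.path.join(vep_dir, f"{gene}_VEP.csv"))
    ]


# Demonstration on a temporary directory that holds a single (empty) VEP file.
# "GENE_X" is a made-up placeholder for a gene without a file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "COL4A1_VEP.csv"), "w").close()
    missing_demo = find_missing_vep_files(["COL4A1", "GENE_X"], demo_dir)
```

In the notebook, the same helper could be called with `genes` and `os.path.join(PROJECT_ROOT, "Data", "Raw", "SUPER")` before the files are merged.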


## Feature selection

Select the variant ID, position, reference allele, alternate allele, gene and consequence features for further analysis. Rename features if necessary.

In [3]:
gene_vep_data = gene_vep_data[
    ["ID", "POS", "REF", "ALT", "Consequence", "GENE"]
].rename(columns={"Consequence": "CONSEQUENCE", "ID": "VAR_NAME"})
gene_vep_data.head(5)

| | VAR_NAME | POS | REF | ALT | CONSEQUENCE | GENE |
|---|---|---|---|---|---|---|
| 0 | chr13:110148882C-CT | 110148882 | C | CT | downstream_gene_variant | COL4A1 |
| 1 | rs552586867 | 110148891 | C | G | downstream_gene_variant | COL4A1 |
| 2 | rs59409892 | 110148917 | C | G | downstream_gene_variant | COL4A1 |
| 3 | rs535182970 | 110148920 | G | C | downstream_gene_variant | COL4A1 |
| 4 | rs56406633 | 110148959 | A | G | downstream_gene_variant | COL4A1 |
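
VEP can report more than one row per variant (for example, one per transcript), so it may be worth checking whether the selected columns still identify each row uniquely. Whether duplicates occur in this dataset is not stated here. A minimal sketch on a small example frame follows; the helper name `count_duplicate_variants` is hypothetical, and the repeated row is constructed purely for illustration.

```python
import pandas as pd


def count_duplicate_variants(df, keys=("VAR_NAME", "POS", "REF", "ALT", "GENE")):
    """Count rows that repeat an earlier row on the given key columns."""
    return int(df.duplicated(subset=list(keys)).sum())


# Illustrative frame: the second row deliberately repeats the first.
example = pd.DataFrame(
    {
        "VAR_NAME": ["rs552586867", "rs552586867", "rs59409892"],
        "POS": [110148891, 110148891, 110148917],
        "REF": ["C", "C", "C"],
        "ALT": ["G", "G", "G"],
        "CONSEQUENCE": ["downstream_gene_variant"] * 3,
        "GENE": ["COL4A1"] * 3,
    }
)

n_duplicates = count_duplicate_variants(example)
```

If the count is non-zero, `DataFrame.drop_duplicates` with the same `subset` is one way to keep a single row per variant, although which row to keep is an analysis decision.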


## Add a feature

Variant consequence terms are numerous and specific, each describing a particular effect on gene function or protein structure. For easier interpretation and analysis, related consequences can be grouped into broader categories. An additional feature, `CONSEQUENCE_CLASSIFICATION`, was added to provide this broad classification.

In [4]:
gene_vep_data["CONSEQUENCE_CLASSIFICATION"] = gene_vep_data["CONSEQUENCE"].map(
    constants.VARIANT_CLASSIFICATION
)

gene_vep_data.head(5)

| | VAR_NAME | POS | REF | ALT | CONSEQUENCE | GENE | CONSEQUENCE_CLASSIFICATION |
|---|---|---|---|---|---|---|---|
| 0 | chr13:110148882C-CT | 110148882 | C | CT | downstream_gene_variant | COL4A1 | upstream/downstream |
| 1 | rs552586867 | 110148891 | C | G | downstream_gene_variant | COL4A1 | upstream/downstream |
| 2 | rs59409892 | 110148917 | C | G | downstream_gene_variant | COL4A1 | upstream/downstream |
| 3 | rs535182970 | 110148920 | G | C | downstream_gene_variant | COL4A1 | upstream/downstream |
| 4 | rs56406633 | 110148959 | A | G | downstream_gene_variant | COL4A1 | upstream/downstream |
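
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so consequence terms missing from `constants.VARIANT_CLASSIFICATION` would be left unclassified without warning. The contents of that constant are not shown in this notebook. The sketch below therefore uses a small hypothetical mapping (only the `upstream/downstream` label is taken from the output; the other labels are assumptions) to show how unmapped terms could be listed.

```python
import pandas as pd

# Hypothetical stand-in for constants.VARIANT_CLASSIFICATION.
# Only "upstream/downstream" appears in the notebook output;
# the remaining labels are illustrative assumptions.
EXAMPLE_CLASSIFICATION = {
    "upstream_gene_variant": "upstream/downstream",
    "downstream_gene_variant": "upstream/downstream",
    "missense_variant": "missense",
    "synonymous_variant": "synonymous",
}

consequences = pd.Series(
    ["downstream_gene_variant", "missense_variant", "intron_variant"]
)
classified = consequences.map(EXAMPLE_CLASSIFICATION)

# Terms with no entry in the mapping come back as NaN; list them explicitly.
unmapped_terms = sorted(consequences[classified.isna()].unique())
```

Applied to the notebook's data, the same pattern would use `gene_vep_data["CONSEQUENCE"]` and `gene_vep_data["CONSEQUENCE_CLASSIFICATION"]` in place of the example series.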


## Save consequence data to a CSV file

In [5]:
gene_vep_data.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_consequences.csv",
    ),
    index=False,
)
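
The raw VEP files carry an `Unnamed: 0` column, which is the kind of column that typically appears when a CSV written together with its index is read back by pandas. Passing `index=False`, as in the cell above, avoids adding such a column to the processed file. A minimal sketch of the difference, using a small example frame and a temporary directory (the file names are illustrative):

```python
import os
import tempfile

import pandas as pd

# Small example frame with the same columns as the processed data.
example = pd.DataFrame(
    {
        "VAR_NAME": ["rs552586867", "rs59409892"],
        "POS": [110148891, 110148917],
        "REF": ["C", "C"],
        "ALT": ["G", "G"],
        "CONSEQUENCE": ["downstream_gene_variant"] * 2,
        "GENE": ["COL4A1"] * 2,
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    example.to_csv(with_index_path)  # the index is written as an unnamed first column
    example.to_csv(without_index_path, index=False)

    reloaded_with_index = pd.read_csv(with_index_path)
    reloaded_without_index = pd.read_csv(without_index_path)

# Column names after the round trip.
columns_with_index = list(reloaded_with_index.columns)
columns_without_index = list(reloaded_without_index.columns)
```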