# Preparation of variant consequence data

Genetic variants can be classified by the consequences that they have on the function of a gene. 

The consequences of a genetic variant can be broadly classified into several categories:

* Synonymous: A genetic variant that does not result in a protein amino acid change. 
* Non-synonymous: A genetic variant that changes an amino acid in a protein. 
* Upstream: An upstream gene variant refers to a genetic change or alteration that occurs in the DNA sequence located before (or "upstream" of) a particular gene. Upstream variants can potentially affect the regulation or expression of the gene by influencing how the gene is transcribed or controlled.
* Downstream: A downstream gene variant occurs in the DNA sequence located after (or "downstream" of) a specific gene. Downstream variants might impact processes related to the gene's transcript processing, translation, or overall function.
* Intronic: Intronic variants are located within introns, the non-coding segments of a gene that are removed from the transcript during splicing. These variants may impact the splicing process of genes.
* Regulatory: A genetic variant located in a regulatory region (for example, a promoter or enhancer) that may interfere with gene regulatory elements.
* Splice site: A genetic variant within a site where RNA splicing takes place, at the boundary between an exon and an intron.
* 3-prime/5-prime UTR: These variants are located in the untranslated regions of a transcript, before (5-prime) and after (3-prime) the protein-coding sequence, and may impact various gene regulatory functions.
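
As a minimal sketch of how these categories are typically used downstream, the snippet below tallies variants per consequence category. The rows are invented for illustration; only the upper-case label style (e.g. `DOWNSTREAM`) is taken from the raw data shown later in this notebook, and the other labels are assumptions.

```python
import pandas as pd

# Toy annotations: positions and labels are invented for illustration only
toy_annotations = pd.DataFrame(
    {
        "Pos": [101, 102, 103, 104, 105],
        "Consequence": [
            "DOWNSTREAM",
            "SYNONYMOUS",
            "NON_SYNONYMOUS",
            "DOWNSTREAM",
            "INTRONIC",
        ],
    }
)

# Normalise the labels to lowercase, then count variants per category
category_counts = toy_annotations["Consequence"].str.lower().value_counts()
print(category_counts)
```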

Consequence data for variants identified in-house in African population groups was retrieved from the [Ensembl Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html) using [CADD v1.6](https://cadd.gs.washington.edu/score). The retrieved data was stored in `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv`, where `{gene_name}` refers to the name of a specific gene. The retrieved data was prepared for analysis by following the steps outlined in this notebook.

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions
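
The setup cell assumes that `PROJECT_ROOT` is defined in the environment (or in a `.env` file); if it is missing, `os.getenv` returns `None` and the path handling fails with a less obvious error. A minimal sketch of an explicit check follows. The helper `require_env` and the variable `DEMO_PROJECT_ROOT` are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            "Environment variable {} is not set; define it in your .env file".format(name)
        )
    return value


# Demonstration with a throwaway variable (hypothetical name and path)
os.environ["DEMO_PROJECT_ROOT"] = "/tmp/demo"
demo_root = require_env("DEMO_PROJECT_ROOT")
print(demo_root)
```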

Suppress all warnings (including those raised by pandas)

In [2]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")
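
Suppressing every warning can also hide messages that matter. As a hedged alternative, the sketch below silences a single warning category and confirms that other categories still come through. The function `noisy` is a hypothetical stand-in for any code that emits warnings.

```python
import warnings


def noisy():
    """Emit two warnings of different categories (stand-in for real code)."""
    warnings.warn("deprecated behaviour", FutureWarning)
    warnings.warn("something else worth reading", UserWarning)


# Record warnings so the effect of the filter can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # show everything by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # except this category
    noisy()

caught_categories = [w.category for w in caught]
print(caught_categories)
```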

## Data loading

Read variant consequence information from the `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv` files and combine it into a single DataFrame.

In [3]:
# Initialize an empty DataFrame to store gene consequence data
gene_consequence_data = pd.DataFrame()

# Read gene names from locations.csv file
genes = pd.read_csv(
    os.path.join(PROJECT_ROOT, "Metadata", "locations.csv")
).location_name

# Iterate over each gene to load corresponding consequence data
for gene in genes:
    # Construct the path to the gene's consequence data file
    gene_consequence_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "VEP",
        "GRCh38-v1.6_{}.tsv".format(gene),
    )

    # Initialize an empty DataFrame for the gene's consequence data
    consequence_df = pd.DataFrame()

    # Check if the consequence data file exists
    if os.path.exists(gene_consequence_path):
        # Read the consequence data into a DataFrame, skipping the first line
        # so that the column names are taken from the second line
        consequence_df = pd.read_csv(gene_consequence_path, sep="\t", skiprows=[0])
        
        # Add a column specifying the gene name for each row of data
        consequence_df["GENE"] = gene

    # Concatenate the current gene's consequence data to the main DataFrame
    gene_consequence_data = pd.concat([gene_consequence_data, consequence_df])

# Display the first two rows of the aggregated gene consequence data
gene_consequence_data.head(2)

|   | #Chrom | Pos | Ref | Alt | Type | Length | AnnoType | Consequence | ConsScore | ConsDetail | ... | Rare10000bp | Sngl10000bp | EnsembleRegulatoryFeature | dbscSNV-ada_score | dbscSNV-rf_score | RemapOverlapTF | RemapOverlapCL | RawScore | PHRED | GENE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | 1 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1311 |  |  |  |  |  | -0.437825 | 0.16 | COL4A1 |
| 1 | 13 | 110148891 | C | G | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1314 |  |  |  |  |  | -0.227221 | 0.446 | COL4A1 |
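
The loading loop can also be written so that the per-gene frames are collected in a list and concatenated once, which avoids growing a DataFrame inside the loop. This is a sketch under stated assumptions, not the notebook's own code: the helper `load_consequences`, the gene names `GENE_A` and `GENE_MISSING`, and the temporary file are hypothetical, and `ignore_index=True` renumbers the rows, unlike the cell above.

```python
import os
import tempfile

import pandas as pd


def load_consequences(directory, genes):
    """Load per-gene TSV files and stack them into one DataFrame."""
    frames = []
    for gene in genes:
        path = os.path.join(directory, "GRCh38-v1.6_{}.tsv".format(gene))
        if not os.path.exists(path):
            continue  # skip genes without a consequence file
        # Skip the first line so the second line supplies the column names
        frame = pd.read_csv(path, sep="\t", skiprows=[0])
        frame["GENE"] = gene
        frames.append(frame)
    # pd.concat raises on an empty list, so fall back to an empty DataFrame
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with a throwaway file: one leading comment line, then the header row
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "GRCh38-v1.6_GENE_A.tsv"), "w") as handle:
        handle.write("## comment line\n#Chrom\tPos\tRef\tAlt\n13\t100\tC\tG\n13\t101\tA\tT\n")
    demo = load_consequences(tmp, ["GENE_A", "GENE_MISSING"])

print(demo)
```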


## Feature selection

Select the chromosome, position, reference allele, alternate allele, variant type, gene and consequence features for further analysis, and rename them to upper-case column names.

In [4]:
# Select features and rename feature names
gene_consequence_data_filtered = gene_consequence_data[
    ["#Chrom", "Pos", "Ref", "Alt", "Type", "Consequence", "ConsDetail", "GeneName"]
].rename(
    columns={
        "#Chrom": "CHROM",
        "Pos": "POS",
        "Ref": "REF",
        "Alt": "ALT",
        "Type": "TYPE",
        "Consequence": "CONSEQUENCE_CLASSIFICATION",
        "ConsDetail": "CONSEQUENCE",
        "GeneName": "GENE",
    }
)

# Convert values in CONSEQUENCE_CLASSIFICATION column to lowercase
gene_consequence_data_filtered["CONSEQUENCE_CLASSIFICATION"] = gene_consequence_data_filtered[
    "CONSEQUENCE_CLASSIFICATION"
].str.lower()
gene_consequence_data_filtered.head(5)

|   | CHROM | POS | REF | ALT | TYPE | CONSEQUENCE_CLASSIFICATION | CONSEQUENCE | GENE |
|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | downstream | downstream | COL4A1 |
| 1 | 13 | 110148891 | C | G | SNV | downstream | downstream | COL4A1 |
| 2 | 13 | 110148917 | C | G | SNV | downstream | downstream | COL4A1 |
| 3 | 13 | 110148920 | G | C | SNV | downstream | downstream | COL4A1 |
| 4 | 13 | 110148959 | A | G | SNV | downstream | downstream | COL4A1 |
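
The select, rename and lowercase steps can be tried on a toy frame. The sketch below uses the vectorised `.str.lower()` accessor, which passes missing values through, whereas `apply(str.lower)` raises a `TypeError` when it meets one. The rows and the `Extra` column are invented for illustration.

```python
import pandas as pd

# Toy input: the second row has a missing consequence on purpose
raw = pd.DataFrame(
    {
        "#Chrom": [13, 13],
        "Pos": [100, 101],
        "Consequence": ["DOWNSTREAM", None],
        "Extra": [1, 2],
    }
)

# Keep the columns of interest, rename them, and take an independent copy
tidy = (
    raw[["#Chrom", "Pos", "Consequence"]]
    .rename(
        columns={
            "#Chrom": "CHROM",
            "Pos": "POS",
            "Consequence": "CONSEQUENCE_CLASSIFICATION",
        }
    )
    .copy()
)

# Lowercase the labels; missing values stay missing
tidy["CONSEQUENCE_CLASSIFICATION"] = tidy["CONSEQUENCE_CLASSIFICATION"].str.lower()
print(tidy)
```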


Add a column with a unique ID for each variant.

In [5]:
# Create a unique ID using the variant position, ref and alt allele information
gene_consequence_data_filtered["ID"] = (
    gene_consequence_data_filtered[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)

gene_consequence_data_filtered.head(5)

|   | CHROM | POS | REF | ALT | TYPE | CONSEQUENCE_CLASSIFICATION | CONSEQUENCE | GENE | ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | downstream | downstream | COL4A1 | 110148882_C_CT |
| 1 | 13 | 110148891 | C | G | SNV | downstream | downstream | COL4A1 | 110148891_C_G |
| 2 | 13 | 110148917 | C | G | SNV | downstream | downstream | COL4A1 | 110148917_C_G |
| 3 | 13 | 110148920 | G | C | SNV | downstream | downstream | COL4A1 | 110148920_G_C |
| 4 | 13 | 110148959 | A | G | SNV | downstream | downstream | COL4A1 | 110148959_A_G |
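
For reference, the same `POS_REF_ALT` identifier can be built by concatenating the columns directly instead of joining row by row. The sketch below shows both forms on two rows taken from the output above and checks that they agree; it is an illustration, not a replacement for the cell.

```python
import pandas as pd

variants = pd.DataFrame(
    {"POS": [110148882, 110148891], "REF": ["C", "C"], "ALT": ["CT", "G"]}
)

# Row-wise join, as in the cell above
ids_joined = variants[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)

# Column-wise string concatenation
ids_concat = variants["POS"].astype(str) + "_" + variants["REF"] + "_" + variants["ALT"]

print(ids_joined.tolist())
print(ids_concat.tolist())
```

Because the identifier does not include the chromosome, it distinguishes variants only within a single chromosome.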


## Data filtering

Keep only the variants annotated to one of the genes listed in `Metadata/locations.csv`; remove all others.

In [6]:
gene_consequence_data_filtered = gene_consequence_data_filtered[
    gene_consequence_data_filtered["GENE"].isin(genes)
].copy()
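
A small sketch of the same `isin` filter on toy data, which also reports how many rows were dropped. The gene name `OTHER_GENE` and the two invented rows are hypothetical; only `COL4A1` and its first row appear in this notebook.

```python
import pandas as pd

annotations = pd.DataFrame(
    {
        "ID": ["110148882_C_CT", "200_A_T", "300_G_C"],
        "GENE": ["COL4A1", "OTHER_GENE", "COL4A1"],
    }
)
genes_of_interest = pd.Series(["COL4A1"])

# Boolean mask: True where the annotated gene is in the list of interest
mask = annotations["GENE"].isin(genes_of_interest)
kept = annotations[mask].copy()
n_dropped = int((~mask).sum())

print(kept)
print("Dropped rows:", n_dropped)
```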

## Display and save the prepared data

In [7]:
gene_consequence_data_filtered.head(2)

|   | CHROM | POS | REF | ALT | TYPE | CONSEQUENCE_CLASSIFICATION | CONSEQUENCE | GENE | ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | downstream | downstream | COL4A1 | 110148882_C_CT |
| 1 | 13 | 110148891 | C | G | SNV | downstream | downstream | COL4A1 | 110148891_C_G |


The columns in the DataFrame above house the following information:

* `CHROM`: Chromosome number where the variant is located.
* `POS`: Genomic position of the variant.
* `REF`: Reference allele (original allele).
* `ALT`: Alternate allele (mutated allele).
* `TYPE`: Type of genetic variant (e.g., `SNV` for a single-nucleotide variant, `INS` for an insertion).
* `CONSEQUENCE_CLASSIFICATION`: Classification of the variant's impact on the gene (e.g., downstream, upstream).
* `CONSEQUENCE`: Specific consequence of the variant (e.g., missense, nonsense, synonymous).
* `GENE`: Name of the gene affected by the variant.
* `ID`: Variant identifier built from `POS`, `REF` and `ALT`, joined by underscores.

In [8]:
gene_consequence_data_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_consequences.csv",
    ),
    index=False,
)
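
After saving, it can be worth reading the file back to confirm that the columns and row count survived the round trip. The sketch below does this with a temporary directory and a one-row frame copied from the output above, so it does not touch the project's `Data/Processed` folder.

```python
import os
import tempfile

import pandas as pd

prepared = pd.DataFrame(
    {
        "CHROM": [13],
        "POS": [110148882],
        "REF": ["C"],
        "ALT": ["CT"],
        "ID": ["110148882_C_CT"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Variant_consequences.csv")
    # index=False keeps the row index out of the file
    prepared.reset_index(drop=True).to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```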