# Preparation of variant consequences

Genetic variants can be classified by the consequences that they have on the function of a gene. 

The consequences of a genetic variant can be broadly classified into several categories:

* Synonymous: A variant in a coding region that does not change the encoded amino acid.
* Non-synonymous: A variant in a coding region that changes an amino acid in the protein.
* Upstream: A variant in the DNA sequence located before (or "upstream" of) a gene. Upstream variants can affect the regulation or expression of the gene by influencing how the gene is transcribed or controlled.
* Downstream: A variant in the DNA sequence located after (or "downstream" of) a gene. Downstream variants might affect the gene's transcript processing, translation, or overall function.
* Intronic: A variant located within an intron, a region of a gene that is transcribed but does not encode protein. These variants may affect the splicing of the gene's transcript.
* Regulatory: A variant located in a gene regulatory element, such as a promoter or enhancer, that may interfere with gene regulation.
* Splice site: A variant within a site where RNA splicing takes place.
* 3-prime/5-prime UTR: Variants located in the untranslated regions before (5-prime) and after (3-prime) the coding sequence of a gene. They may affect various gene regulatory functions.

Consequence data for variants identified in-house in African population groups was retrieved from the [Ensembl Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html) using [CADD v1.6](https://cadd.gs.washington.edu/score). The retrieved data was stored in `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv`, where `{gene_name}` refers to the name of a specific gene.

To prepare the data for further analysis, the following steps were performed: 

1. The consequence data for all genes was merged into a single dataset.
2. Only the relevant features, such as chromosome, reference and alternate allele information, variant position information, and variant consequences, were selected. These features were renamed if necessary.
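
The two steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual code: the gene names and values in `toy_frames` are made up, and only the column names mirror the CADD output used later in this notebook.

```python
import pandas as pd

# Hypothetical per-gene tables standing in for the per-gene TSV files.
toy_frames = {
    "GENE_A": pd.DataFrame(
        {
            "#Chrom": [1, 1],
            "Pos": [100, 250],
            "Ref": ["A", "C"],
            "Alt": ["G", "T"],
            "Consequence": ["UPSTREAM", "INTRONIC"],
            "PHRED": [1.2, 3.4],
        }
    ),
    "GENE_B": pd.DataFrame(
        {
            "#Chrom": [2],
            "Pos": [900],
            "Ref": ["G"],
            "Alt": ["GA"],
            "Consequence": ["DOWNSTREAM"],
            "PHRED": [0.5],
        }
    ),
}

# Step 1: merge all per-gene tables into a single dataset,
# tagging each row with the gene whose table it came from.
merged = pd.concat(
    [df.assign(GENE=gene) for gene, df in toy_frames.items()],
    ignore_index=True,
)

# Step 2: keep only the relevant features and rename them.
selected = merged[["#Chrom", "Pos", "Ref", "Alt", "Consequence", "GENE"]].rename(
    columns={
        "#Chrom": "CHROM",
        "Pos": "POS",
        "Ref": "REF",
        "Alt": "ALT",
        "Consequence": "CONSEQUENCE",
    }
)

print(selected)
```

Columns that are not selected, such as `PHRED` in this toy example, are dropped from the result.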

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions

Import variant consequence information

In [2]:
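genes = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "locations.csv",
    )
).location_name

# Collect one DataFrame per gene and concatenate once after the loop.
consequence_frames = []
for gene in genes:
    gene_consequence_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "VEP",
        "GRCh38-v1.6_{}.tsv".format(gene),
    )

    # Genes without a consequence file are skipped.
    if os.path.exists(gene_consequence_path):
        # skiprows=[0] skips the first line so that the "#Chrom" line is read as the header.
        consequence_df = pd.read_csv(gene_consequence_path, sep="\t", skiprows=[0])
        consequence_df["GENE"] = gene
        consequence_frames.append(consequence_df)

# Fall back to an empty DataFrame if no consequence files were found.
gene_consequence_data = (
    pd.concat(consequence_frames) if consequence_frames else pd.DataFrame()
)

gene_consequence_data.head(5)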


Unnamed: 0,#Chrom,Pos,Ref,Alt,Type,Length,AnnoType,Consequence,ConsScore,ConsDetail,...,Rare10000bp,Sngl10000bp,EnsembleRegulatoryFeature,dbscSNV-ada_score,dbscSNV-rf_score,RemapOverlapTF,RemapOverlapCL,RawScore,PHRED,GENE
0,13,110148882,C,CT,INS,1,Intergenic,DOWNSTREAM,1,downstream,...,91,1311,,,,,,-0.437825,0.16,COL4A1
1,13,110148891,C,G,SNV,0,Intergenic,DOWNSTREAM,1,downstream,...,91,1314,,,,,,-0.227221,0.446,COL4A1
2,13,110148917,C,G,SNV,0,Intergenic,DOWNSTREAM,1,downstream,...,91,1312,,,,,,0.269936,3.938,COL4A1
3,13,110148920,G,C,SNV,0,Intergenic,DOWNSTREAM,1,downstream,...,91,1312,,,,,,0.530972,6.825,COL4A1
4,13,110148959,A,G,SNV,0,Intergenic,DOWNSTREAM,1,downstream,...,92,1315,,,,,,1.380228,14.95,COL4A1


## Feature selection

Select the chromosome, position, reference allele, alternate allele, gene and consequence features for further analysis. Rename features if necessary.

In [3]:
# Select the features of interest, then rename them to upper-case names.
gene_consequence_data_filtered = gene_consequence_data[
    ["#Chrom", "Pos", "Ref", "Alt", "Type", "Consequence", "ConsDetail", "GeneName"]
].rename(
    columns={
        "#Chrom": "CHROM",
        "Pos": "POS",
        "Ref": "REF",
        "Alt": "ALT",
        "Type": "TYPE",
        "Consequence": "CONSEQUENCE_CLASSIFICATION",
        "ConsDetail": "CONSEQUENCE",
        "GeneName": "GENE",
    }
)
# Lowercase the consequence classes, for example "DOWNSTREAM" becomes "downstream".
gene_consequence_data_filtered["CONSEQUENCE_CLASSIFICATION"] = gene_consequence_data_filtered[
    "CONSEQUENCE_CLASSIFICATION"
].str.lower()
gene_consequence_data_filtered.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,TYPE,CONSEQUENCE_CLASSIFICATION,CONSEQUENCE,GENE
0,13,110148882,C,CT,INS,downstream,downstream,COL4A1
1,13,110148891,C,G,SNV,downstream,downstream,COL4A1
2,13,110148917,C,G,SNV,downstream,downstream,COL4A1
3,13,110148920,G,C,SNV,downstream,downstream,COL4A1
4,13,110148959,A,G,SNV,downstream,downstream,COL4A1
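
After selecting the features, tallying the consequence classes is a quick way to confirm that the lowercased labels look as expected. The sketch below uses a made-up table; `toy`, its gene names, and its values are assumptions for illustration and do not come from the real data.

```python
import pandas as pd

# Hypothetical stand-in for the selected consequence table.
toy = pd.DataFrame(
    {
        "CONSEQUENCE_CLASSIFICATION": ["downstream", "downstream", "intronic", "synonymous"],
        "GENE": ["GENE_A", "GENE_A", "GENE_A", "GENE_B"],
    }
)

# Number of variants in each consequence class.
class_counts = toy["CONSEQUENCE_CLASSIFICATION"].value_counts()

# Number of variants per gene and consequence class; absent combinations become 0.
per_gene = (
    toy.groupby(["GENE", "CONSEQUENCE_CLASSIFICATION"]).size().unstack(fill_value=0)
)

print(class_counts)
print(per_gene)
```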


## Assign a unique ID to each variant

Add a column with a unique ID for each variant.

In [4]:
gene_consequence_data_filtered["ID"] = (
    gene_consequence_data_filtered[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)
)

gene_consequence_data_filtered.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,TYPE,CONSEQUENCE_CLASSIFICATION,CONSEQUENCE,GENE,ID
0,13,110148882,C,CT,INS,downstream,downstream,COL4A1,110148882_CT_C
1,13,110148891,C,G,SNV,downstream,downstream,COL4A1,110148891_G_C
2,13,110148917,C,G,SNV,downstream,downstream,COL4A1,110148917_G_C
3,13,110148920,G,C,SNV,downstream,downstream,COL4A1,110148920_C_G
4,13,110148959,A,G,SNV,downstream,downstream,COL4A1,110148959_G_A
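
Because the ID is built from `POS`, `ALT` and `REF` only, two rows receive the same ID if the same variant is annotated more than once, for example against two genes. Whether this happens in the real data is not shown here; the sketch below only illustrates, on made-up rows, one way to check for repeated IDs.

```python
import pandas as pd

# Hypothetical rows; the last two describe the same variant annotated against two genes.
toy = pd.DataFrame(
    {
        "CHROM": [13, 13, 13],
        "POS": [110148882, 110148891, 110148891],
        "REF": ["C", "C", "C"],
        "ALT": ["CT", "G", "G"],
        "GENE": ["GENE_A", "GENE_A", "GENE_B"],
    }
)

# Same ID construction as in the cell above: POS_ALT_REF.
toy["ID"] = toy[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)

# keep=False marks every row whose ID occurs more than once.
repeated = toy[toy["ID"].duplicated(keep=False)]

print(repeated)
```

If `repeated` is not empty, the ID identifies a variant rather than a single row, which matters when the ID is later used as a join key.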


## Variant filtering

Remove variants that are not associated with the specified genes.

In [5]:
# Keep only rows whose gene is in the list of specified genes.
gene_consequence_data_filtered = gene_consequence_data_filtered[
    gene_consequence_data_filtered["GENE"].isin(genes)
].copy()
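
The filter relies on `Series.isin`, which also returns `False` for missing gene names, so rows without a gene annotation are removed together with rows for other genes. A minimal sketch on made-up data (the names `toy`, `genes_of_interest`, and all values are hypothetical) shows the behaviour and how to count the removed rows:

```python
import pandas as pd

# Hypothetical list of genes of interest.
genes_of_interest = pd.Series(["GENE_A", "GENE_B"], name="location_name")

# Hypothetical variants: one in a listed gene, one in another gene, one without a gene.
toy = pd.DataFrame(
    {
        "ID": ["100_G_A", "250_T_C", "900_GA_G"],
        "GENE": ["GENE_A", "GENE_C", None],
    }
)

# True only for rows whose gene appears in the list; missing values give False.
mask = toy["GENE"].isin(genes_of_interest)

kept = toy[mask].copy()
n_removed = int((~mask).sum())

print(kept)
print(f"Removed {n_removed} of {len(toy)} rows")
```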

## Save consequence data to a CSV file

In [6]:
gene_consequence_data_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_consequences.csv",
    ),
    index=False,
)