# Preparation of variant effect score and prediction data

Variant effect scores are numerical metrics that quantify the potential functional impact of genetic variants on genes or proteins. They are derived from computational predictions that estimate how likely a variant is to have a deleterious (pathogenic) effect.

Variant effect scores and predictions were retrieved using [CADD v1.6](https://cadd.gs.washington.edu/score) for the variants identified in-house in African populations. The retrieved data is stored in `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv`, where `{gene_name}` is the name of a specific gene.

The data was prepared for analysis by following the steps outlined in this notebook.

## Imports

Notebook setup

In [7]:
import os
import sys

from dotenv import load_dotenv

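# Load environment variables, including PROJECT_ROOT, from the .env file
load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT is None:
    raise EnvironmentError("PROJECT_ROOT is not set; define it in the .env file")

# Run from the Notebooks directory and make the project modules importable
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)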

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions
import numpy as np

Suppress all Python warnings (including those raised by pandas) to keep the notebook output readable

In [8]:
import warnings
warnings.filterwarnings("ignore")
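
A blanket `"ignore"` filter also hides warnings that may be worth seeing. As a hedged alternative, the sketch below silences only one warning category and lets the rest through. The choice of `FutureWarning` is an assumption for illustration; the notebook does not say which warnings it is meant to hide.

```python
import warnings

# Record warnings so the effect of the filter can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only FutureWarning (assumed category); other warnings still surface
    warnings.filterwarnings("ignore", category=FutureWarning)

    warnings.warn("behaviour will change in a future release", FutureWarning)
    warnings.warn("something else worth noticing", UserWarning)

# Only the UserWarning was recorded
caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)
```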

## Data loading

Load the data from the `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv` files into a single DataFrame.

In [9]:
# Read the list of gene names from the gene_locations.csv file
genes = pd.read_csv(
    os.path.join(PROJECT_ROOT, "Metadata", "gene_locations.csv")
).location_name

# Collect one DataFrame per gene, then concatenate them once at the end
gene_vep_frames = []

for gene in genes:
    # Construct the file path to the gene's VEP data file
    gene_vep_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "VEP",
        "GRCh38-v1.6_{}.tsv".format(gene),
    )

    # Skip genes for which no VEP data file exists
    if not os.path.exists(gene_vep_path):
        continue

    # Read the VEP data, skipping the first line of the file so that
    # the second line is used as the column header row
    consequence_df = pd.read_csv(gene_vep_path, sep="\t", skiprows=[0])

    # Add a column specifying the gene name for each row of data
    consequence_df["GENE"] = gene
    gene_vep_frames.append(consequence_df)

# Combine the per-gene data; fall back to an empty DataFrame if no files were found
gene_vep_data = pd.concat(gene_vep_frames) if gene_vep_frames else pd.DataFrame()

# Display the first two rows of the aggregated gene VEP data
gene_vep_data.head(2)

|   | #Chrom | Pos | Ref | Alt | Type | Length | AnnoType | Consequence | ConsScore | ConsDetail | ... | Rare10000bp | Sngl10000bp | EnsembleRegulatoryFeature | dbscSNV-ada_score | dbscSNV-rf_score | RemapOverlapTF | RemapOverlapCL | RawScore | PHRED | GENE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | 1 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1311 |  |  |  |  |  | -0.437825 | 0.16 | COL4A1 |
| 1 | 13 | 110148891 | C | G | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1314 |  |  |  |  |  | -0.227221 | 0.446 | COL4A1 |
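
The effect of `skiprows=[0]` can be checked on a small in-memory example. This is a minimal sketch, assuming each file starts with a single comment line followed by a tab-separated header row. The comment line is hypothetical and the columns are abbreviated to a handful for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of one per-gene TSV file: a leading comment line,
# then the column header row, then the data rows
example_tsv = (
    "## hypothetical comment line\n"
    "#Chrom\tPos\tRef\tAlt\tPHRED\n"
    "13\t110148882\tC\tCT\t0.16\n"
    "13\t110148891\tC\tG\t0.446\n"
)

# skiprows=[0] drops only the first line, so the second line becomes the header
example_df = pd.read_csv(io.StringIO(example_tsv), sep="\t", skiprows=[0])

# Tag every row with the gene the file belongs to
example_df["GENE"] = "COL4A1"

print(example_df.columns.tolist())
print(len(example_df))
```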


## Feature selection

Select the chromosome, position, reference and alternate alleles, and gene name of each variant, together with the prediction categories and scores from PolyPhen and SIFT and the raw and PHRED-scaled CADD scores. Rename the columns as necessary.

In [10]:
# Select the columns of interest from the gene VEP data and rename them
gene_vep_data_filtered = gene_vep_data.copy()[
    [
        "#Chrom",
        "Pos",
        "Ref",
        "Alt",
        "GeneName",
        "PolyPhenCat",
        "PolyPhenVal",
        "SIFTcat",
        "SIFTval",
        "RawScore",
        "PHRED",
    ]
].rename(
    columns={
        "#Chrom": "CHROM",
        "Pos": "POS",
        "Ref": "REF",
        "Alt": "ALT",
        "GeneName": "GENE",
        "PolyPhenCat": "POLYPHEN_PRED",
        "PolyPhenVal": "POLYPHEN_SCORE",
        "SIFTcat": "SIFT_PRED",
        "SIFTval": "SIFT_SCORE",
        "RawScore": "CADD_RAW_SCORE",
        "PHRED": "CADD_PHRED_SCORE",
    }
)

# Display the first two rows of the filtered data
gene_vep_data_filtered.head(2)

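|   | CHROM | POS | REF | ALT | GENE | POLYPHEN_PRED | POLYPHEN_SCORE | SIFT_PRED | SIFT_SCORE | CADD_RAW_SCORE | CADD_PHRED_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | COL4A1 |  |  |  |  | -0.437825 | 0.16 |
| 1 | 13 | 110148891 | C | G | COL4A1 |  |  |  |  | -0.227221 | 0.446 |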
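
Selecting and renaming can both be driven from a single mapping, so the column list and the rename dictionary cannot drift apart. The sketch below is one possible way to do this, not the notebook's own method. It uses a subset of the mapping from the cell above and a hypothetical one-row input, and it fails early if an expected column is missing.

```python
import pandas as pd

# Source column name -> output column name (subset of the mapping above)
column_map = {
    "#Chrom": "CHROM",
    "Pos": "POS",
    "Ref": "REF",
    "Alt": "ALT",
    "PHRED": "CADD_PHRED_SCORE",
}

# Hypothetical input with one extra column ("Type") that should be dropped
raw = pd.DataFrame(
    {
        "#Chrom": [13],
        "Pos": [110148891],
        "Ref": ["C"],
        "Alt": ["G"],
        "Type": ["SNV"],
        "PHRED": [0.446],
    }
)

# Fail with a clear message if any expected column is absent
missing = [column for column in column_map if column not in raw.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# The dictionary keys select the columns; the full dictionary renames them
selected = raw[list(column_map)].rename(columns=column_map)
print(selected.columns.tolist())
```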


## Assign a unique ID to each variant

Add an `ID` column that identifies each variant by its position, reference allele, and alternate allele, in the form `POS_REF_ALT`.

In [11]:
# Create a new 'ID' column by concatenating 'POS', 'REF', and 'ALT' columns with underscores
gene_vep_data_filtered["ID"] = (
    gene_vep_data_filtered[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)
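
An equivalent way to build the ID is plain string concatenation, which also makes it easy to count how many rows share an ID. The rows below are hypothetical, and whether repeated IDs occur in the real data is not stated in this notebook; the sketch only shows how such a check could look.

```python
import pandas as pd

# Hypothetical variants; the last two rows describe the same variant
variants = pd.DataFrame(
    {
        "POS": [110148882, 110148891, 110148891],
        "REF": ["C", "C", "C"],
        "ALT": ["CT", "G", "G"],
    }
)

# Build the ID as POS_REF_ALT using vectorized string concatenation
variants["ID"] = (
    variants["POS"].astype(str) + "_" + variants["REF"] + "_" + variants["ALT"]
)

# Count rows whose ID already appeared in an earlier row
n_repeated = int(variants["ID"].duplicated().sum())

print(variants["ID"].tolist())
print(n_repeated)
```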

## Data filtering

Remove variants whose annotated gene (the `GENE` column, taken from CADD's `GeneName`) is not in the list of genes read from `gene_locations.csv`.

In [12]:
# Keep only variants whose annotated gene is in the list of genes of interest
gene_vep_data_filtered = gene_vep_data_filtered[
    gene_vep_data_filtered["GENE"].isin(genes)
].copy()
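
The behaviour of the `isin` filter can be illustrated on a few made-up rows. In this sketch the gene list and the variant rows are hypothetical. It shows that rows annotated to another gene, and rows with no gene name at all, are both removed.

```python
import pandas as pd

# Hypothetical list of genes of interest
genes = pd.Series(["COL4A1"], name="location_name")

# Hypothetical variants: one in a gene of interest, one in another gene,
# and one without a gene annotation
variants = pd.DataFrame(
    {
        "ID": ["1_A_G", "2_C_T", "3_G_A"],
        "GENE": ["COL4A1", "OTHER_GENE", None],
    }
)

# isin returns False for missing values, so unannotated rows are dropped too
mask = variants["GENE"].isin(genes)
kept = variants[mask].copy()

print(f"kept {int(mask.sum())} of {len(variants)} rows")
```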

## Display and save the prepared data

In [13]:
gene_vep_data_filtered.head(2)

|   | CHROM | POS | REF | ALT | GENE | POLYPHEN_PRED | POLYPHEN_SCORE | SIFT_PRED | SIFT_SCORE | CADD_RAW_SCORE | CADD_PHRED_SCORE | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | COL4A1 |  |  |  |  | -0.437825 | 0.16 | 110148882_C_CT |
| 1 | 13 | 110148891 | C | G | COL4A1 |  |  |  |  | -0.227221 | 0.446 | 110148891_C_G |


The information stored in each of the columns above is as follows:

* `CHROM`: Chromosome number where the variant is located.
* `POS`: Genomic position of the variant.
* `REF`: Reference allele (original allele).
* `ALT`: Alternate allele (mutated allele).
* `GENE`: Name of the gene affected by the variant.
* `POLYPHEN_PRED`: PolyPhen prediction category (e.g., benign, possibly damaging).
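* `POLYPHEN_SCORE`: PolyPhen score between 0 and 1; higher values indicate a higher probability that the variant is damaging.
* `SIFT_PRED`: SIFT prediction category (e.g., tolerated, deleterious).
* `SIFT_SCORE`: SIFT score between 0 and 1; lower values indicate a more likely deleterious effect on protein function.
* `CADD_RAW_SCORE`: Raw CADD score; higher values indicate that the variant is more likely to be deleterious.
* `CADD_PHRED_SCORE`: PHRED-scaled CADD score, which expresses the rank of the raw score among all possible single-nucleotide variants of the reference genome on a logarithmic scale.
* `ID`: Variant identifier in the form `POS_REF_ALT`.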
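
To make the PHRED scaling behind `CADD_PHRED_SCORE` concrete: CADD defines the scaled score as `-10 * log10(rank / total)`, where `rank` is the position of a variant's raw score among all scored variants. The sketch below applies that formula to made-up ranks. The function name `phred_from_rank` and the total of one million variants are illustrative assumptions, not part of CADD or of this notebook.

```python
import math


def phred_from_rank(rank: int, total: int) -> float:
    """PHRED-scale a rank, where rank 1 is the most deleterious of `total` variants."""
    return -10 * math.log10(rank / total)


# Variants at the boundary of the top 10%, 1% and 0.1% of one million variants
scaled = [
    round(phred_from_rank(rank, 1_000_000), 1)
    for rank in (100_000, 10_000, 1_000)
]
print(scaled)  # [10.0, 20.0, 30.0]
```

Read this way, a PHRED score of 10 or more places a variant in the top 10% of scores, 20 or more in the top 1%, and 30 or more in the top 0.1%.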

In [14]:
gene_vep_data_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    ),
    index=False,
)
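
A quick round trip can confirm that the saved file reloads with the expected columns and a clean index. This is a minimal sketch using a temporary directory and two hypothetical rows rather than the project's `Data/Processed` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared data with a repeated index, as can happen after
# concatenating several per-gene tables
prepared = pd.DataFrame(
    {
        "ID": ["110148882_C_CT", "110148891_C_G"],
        "CADD_PHRED_SCORE": [0.16, 0.446],
    },
    index=[0, 0],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Variant_effects.csv")

    # index=False keeps the index out of the file
    prepared.reset_index(drop=True).to_csv(out_path, index=False)

    # Reload the file to check what was actually written
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```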