# Preparation of variant effect score data

Variant effect scores are numerical values or metrics used to quantitatively assess the potential functional impact of genetic variants on genes or proteins. These scores are based on computational predictions which estimate how likely a genetic variant is to have a deleterious (pathogenic) effect.

Genetic variant effect score and prediction data were retrieved using [CADD v1.6](https://cadd.gs.washington.edu/score) for the variants identified in-house in African populations. The retrieved data were stored in `Data/Raw/VEP/GRCh38-v1.6_{gene_name}.tsv`, where `{gene_name}` refers to the name of a specific gene.

The data were prepared for analysis by selecting features of interest (chromosome number, position, reference and alternate allele, and variant effect scores and predictions), assigning an ID to each variant, and keeping only the variants annotated to the genes of interest.

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions
import numpy as np
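
The setup cell assumes that `PROJECT_ROOT` is defined in the `.env` file. If it is missing, `os.getenv` returns `None` and the string concatenation inside `os.chdir` fails with a `TypeError`. A minimal sketch of a more explicit guard follows; `resolve_project_root` is a hypothetical helper for illustration and is not part of the project's `Utils` package.

```python
import os


def resolve_project_root(env=None):
    """Return PROJECT_ROOT from the environment or fail with a clear message.

    Hypothetical helper for illustration; not part of the project's Utils package.
    """
    env = os.environ if env is None else env
    root = env.get("PROJECT_ROOT")
    if not root:
        raise RuntimeError("PROJECT_ROOT is not set; define it in the .env file")
    return root


# Example with an explicit mapping instead of the real environment
example_root = resolve_project_root({"PROJECT_ROOT": "/path/to/project"})
notebooks_dir = os.path.join(example_root, "Notebooks")
```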

Import variant effect data

In [2]:
genes = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "locations.csv",
    )
).location_name

# Collect one DataFrame per gene, then concatenate once after the loop
gene_frames = []

for gene in genes:
    gene_vep_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "VEP",
        "GRCh38-v1.6_{}.tsv".format(gene),
    )

    # Genes without a CADD file are skipped
    if os.path.exists(gene_vep_path):
        # Skip the first line of the file so the "#Chrom ..." line is read as the header
        consequence_df = pd.read_csv(gene_vep_path, sep="\t", skiprows=[0])
        consequence_df["GENE"] = gene
        gene_frames.append(consequence_df)

gene_vep_data = pd.concat(gene_frames) if gene_frames else pd.DataFrame()

gene_vep_data.head(5)


| | #Chrom | Pos | Ref | Alt | Type | Length | AnnoType | Consequence | ConsScore | ConsDetail | ... | Rare10000bp | Sngl10000bp | EnsembleRegulatoryFeature | dbscSNV-ada_score | dbscSNV-rf_score | RemapOverlapTF | RemapOverlapCL | RawScore | PHRED | GENE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | INS | 1 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1311 | | | | | | -0.437825 | 0.16 | COL4A1 |
| 1 | 13 | 110148891 | C | G | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1314 | | | | | | -0.227221 | 0.446 | COL4A1 |
| 2 | 13 | 110148917 | C | G | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1312 | | | | | | 0.269936 | 3.938 | COL4A1 |
| 3 | 13 | 110148920 | G | C | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 91 | 1312 | | | | | | 0.530972 | 6.825 | COL4A1 |
| 4 | 13 | 110148959 | A | G | SNV | 0 | Intergenic | DOWNSTREAM | 1 | downstream | ... | 92 | 1315 | | | | | | 1.380228 | 14.95 | COL4A1 |
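
The `skiprows=[0]` argument is used because CADD output files typically begin with a `##` version line, followed by the `#Chrom ...` header line. The sketch below reproduces that layout with an in-memory file. The version line is a placeholder, only a handful of columns are included, and the two variant rows are copied from the output above.

```python
import io

import pandas as pd

# Illustrative stand-in for a CADD output file: a "##" version line,
# then the "#Chrom ..." header, then tab-separated variant rows.
example_tsv = (
    "## CADD version line (placeholder)\n"
    "#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED\n"
    "13\t110148882\tC\tCT\t-0.437825\t0.16\n"
    "13\t110148891\tC\tG\t-0.227221\t0.446\n"
)

# Skipping row 0 makes pandas treat the "#Chrom ..." line as the header
example_df = pd.read_csv(io.StringIO(example_tsv), sep="\t", skiprows=[0])
example_df["GENE"] = "COL4A1"
```

Without `skiprows=[0]`, pandas would try to use the version line as the header and the column names would be lost.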


## Select features of interest
Select chromosome, position, reference and alternate allele, gene name, and effect scores and predictions from PolyPhen, SIFT, and CADD, then rename the columns to a consistent upper-case scheme.

In [3]:
gene_vep_data_filtered = gene_vep_data.copy()[
    [
        "#Chrom",
        "Pos",
        "Ref",
        "Alt",
        "GeneName",
        "PolyPhenCat",
        "PolyPhenVal",
        "SIFTcat",
        "SIFTval",
        "RawScore",
        "PHRED",
    ]
].rename(
    columns={
        "#Chrom": "CHROM",
        "Pos": "POS",
        "Ref": "REF",
        "Alt": "ALT",
        "GeneName": "GENE",     
        "PolyPhenCat":"POLYPHEN_PRED",
        "PolyPhenVal":"POLYPHEN_SCORE",
        "SIFTcat":"SIFT_PRED",
        "SIFTval":"SIFT_SCORE",
        "RawScore":"CADD_RAW_SCORE",
        "PHRED":"CADD_PHRED_SCORE",          
    }
)

gene_vep_data_filtered.head(5)

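| | CHROM | POS | REF | ALT | GENE | POLYPHEN_PRED | POLYPHEN_SCORE | SIFT_PRED | SIFT_SCORE | CADD_RAW_SCORE | CADD_PHRED_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | COL4A1 | | | | | -0.437825 | 0.16 |
| 1 | 13 | 110148891 | C | G | COL4A1 | | | | | -0.227221 | 0.446 |
| 2 | 13 | 110148917 | C | G | COL4A1 | | | | | 0.269936 | 3.938 |
| 3 | 13 | 110148920 | G | C | COL4A1 | | | | | 0.530972 | 6.825 |
| 4 | 13 | 110148959 | A | G | COL4A1 | | | | | 1.380228 | 14.95 |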
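
For context on the two CADD columns: `CADD_RAW_SCORE` is the model's raw output, while `CADD_PHRED_SCORE` expresses a variant's rank among all possible reference-genome SNVs on a log scale, so a value of 10 corresponds to the top 10% and 20 to the top 1%. The sketch below illustrates only that scaling. `phred_from_rank` is a hypothetical helper and the total of 1,000,000 variants is made up; the scores in this notebook come precomputed from CADD.

```python
import math


def phred_from_rank(rank, total):
    """PHRED-like scaling: -10 * log10(rank / total), where rank 1 is the most deleterious."""
    if not 0 < rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10 * math.log10(rank / total)


# Variants ranked in the top 10%, 1% and 0.1% of 1,000,000 hypothetical variants
top_10_percent = phred_from_rank(100_000, 1_000_000)
top_1_percent = phred_from_rank(10_000, 1_000_000)
top_01_percent = phred_from_rank(1_000, 1_000_000)
```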


## Assign a unique ID to each variant

Add an `ID` column that joins each variant's position, alternate allele, and reference allele with underscores (`POS_ALT_REF`).

In [4]:
gene_vep_data_filtered["ID"] = (
    gene_vep_data_filtered[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)
)

gene_vep_data_filtered.head(5)

| | CHROM | POS | REF | ALT | GENE | POLYPHEN_PRED | POLYPHEN_SCORE | SIFT_PRED | SIFT_SCORE | CADD_RAW_SCORE | CADD_PHRED_SCORE | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | COL4A1 | | | | | -0.437825 | 0.16 | 110148882_CT_C |
| 1 | 13 | 110148891 | C | G | COL4A1 | | | | | -0.227221 | 0.446 | 110148891_G_C |
| 2 | 13 | 110148917 | C | G | COL4A1 | | | | | 0.269936 | 3.938 | 110148917_G_C |
| 3 | 13 | 110148920 | G | C | COL4A1 | | | | | 0.530972 | 6.825 | 110148920_C_G |
| 4 | 13 | 110148959 | A | G | COL4A1 | | | | | 1.380228 | 14.95 | 110148959_G_A |
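
Because the ID is built from `POS`, `ALT`, and `REF` only, it can be worth checking that it is actually unique, for example in case the same variant appears in more than one input file. A hedged sketch on a toy frame: `toy` is illustrative, two of its rows are taken from the output above, and the third is a deliberate duplicate.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "POS": [110148882, 110148891, 110148891],
        "REF": ["C", "C", "C"],
        "ALT": ["CT", "G", "G"],
    }
)

# Same POS_ALT_REF layout as the notebook cell, built with vectorised string concatenation
toy["ID"] = toy["POS"].astype(str) + "_" + toy["ALT"] + "_" + toy["REF"]

# List every ID that occurs more than once
duplicated_ids = toy.loc[toy["ID"].duplicated(keep=False), "ID"].unique().tolist()
ids_are_unique = toy["ID"].is_unique
```

If `ids_are_unique` is `False`, the entries in `duplicated_ids` show which variants need to be deduplicated or disambiguated before the ID is used as a key.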


## Variant filtering

Keep only variants whose annotated gene name (`GENE`) is among the genes listed in `Metadata/locations.csv`.

In [5]:
gene_vep_data_filtered = gene_vep_data_filtered[
    gene_vep_data_filtered["GENE"].isin(genes)
].copy()

gene_vep_data_filtered.head(5)

| | CHROM | POS | REF | ALT | GENE | POLYPHEN_PRED | POLYPHEN_SCORE | SIFT_PRED | SIFT_SCORE | CADD_RAW_SCORE | CADD_PHRED_SCORE | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 110148882 | C | CT | COL4A1 | | | | | -0.437825 | 0.16 | 110148882_CT_C |
| 1 | 13 | 110148891 | C | G | COL4A1 | | | | | -0.227221 | 0.446 | 110148891_G_C |
| 2 | 13 | 110148917 | C | G | COL4A1 | | | | | 0.269936 | 3.938 | 110148917_G_C |
| 3 | 13 | 110148920 | G | C | COL4A1 | | | | | 0.530972 | 6.825 | 110148920_C_G |
| 4 | 13 | 110148959 | A | G | COL4A1 | | | | | 1.380228 | 14.95 | 110148959_G_A |
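
A file retrieved for one gene's region may contain rows annotated to other genes, or rows with no gene name at all. The sketch below shows how an `isin` filter treats both cases, which is the behaviour this step relies on. The gene list, `OTHER_GENE`, and the row assignments are placeholders for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the location_name column of Metadata/locations.csv
target_genes = pd.Series(["COL4A1"], name="location_name")

toy_variants = pd.DataFrame(
    {
        "ID": ["110148882_CT_C", "110148891_G_C", "110148917_G_C"],
        "GENE": ["COL4A1", "OTHER_GENE", None],
    }
)

# isin returns False for missing gene names, so those rows are dropped too
mask = toy_variants["GENE"].isin(target_genes)
kept = toy_variants[mask].copy()
n_dropped = int((~mask).sum())
```

Reporting `n_dropped` alongside the filtered frame makes it easier to notice when the filter removes more variants than expected.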


## Save variant effect data to a CSV file

In [6]:
gene_vep_data_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    ),
    index=False,
)
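
A quick way to sanity-check the export is to read the file back and compare it with the frame in memory. The sketch below uses a temporary directory and a toy frame rather than the project's `Data/Processed` path; `toy_effects` is illustrative.

```python
import os
import tempfile

import pandas as pd

toy_effects = pd.DataFrame(
    {
        "CHROM": [13, 13],
        "POS": [110148882, 110148891],
        "CADD_PHRED_SCORE": [0.16, 0.446],
        "ID": ["110148882_CT_C", "110148891_G_C"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Variant_effects.csv")
    # Same call pattern as the notebook cell: drop the index before writing
    toy_effects.reset_index(drop=True).to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy_effects.shape
same_columns = list(reloaded.columns) == list(toy_effects.columns)
```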