# Preparation of variant phenotype data

Genetic variant-phenotype association data was retrieved using [the Functional Annotation of Variants - Online Resource v2.0](https://favor.genohub.org/batch-annotation) for the variants identified in-house in African populations. The retrieved data was stored in `Data/PHENO/GRCh38-variant_phenotypes.tsv`.

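The data was prepared for analysis by:

- Selecting features of interest
- Extracting the chromosome, position, reference allele and alternate allele of each variant
- Assigning a unique ID to each variant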

## Imports

Notebook setup

In [43]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions
import numpy as np
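
The setup cell assumes that `PROJECT_ROOT` is defined in the `.env` file; if it is missing, `os.getenv` returns `None` and the path concatenation fails with a less obvious error. The following is a minimal sketch of a guard, not part of the original notebook. The helper name `resolve_project_root` is hypothetical.

```python
import os
from pathlib import Path


def resolve_project_root(env_var: str = "PROJECT_ROOT") -> Path:
    """Return the project root named by an environment variable.

    Hypothetical helper: raises a descriptive error instead of letting
    a missing variable surface later as a TypeError.
    """
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; define it in the .env file")
    root = Path(value)
    if not root.is_dir():
        raise RuntimeError(f"{env_var} points to a missing directory: {root}")
    return root
```

Calling `resolve_project_root()` after `load_dotenv()` would stop the notebook early with a readable message when the environment is not configured.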

Import variant phenotype data

In [44]:
variant_pheno = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "PHENO",
        "GRCh38-Favorv2.0_ALLGENES.csv",
    )
)

variant_pheno.head(5)

Unnamed: 0,VariantVcf,Chromosome,Position,Rsid,GenecodeComprehensiveCategory,GenecodeComprehensiveInfo,UcscInfo,Clnsig,Clnsigincl,Clndn,Clndnincl,Clnrevstat,Clndisdb,Clndisdbincl
0,13-110148917-C-G,13,110148917,rs59409892,downstream,COL4A1,"ENST00000375820.9,ENST00000649720.1,ENST000006...",,,,,,,
1,13-110148891-C-G,13,110148891,rs552586867,downstream,COL4A1,"ENST00000375820.9,ENST00000649720.1,ENST000006...",,,,,,,
2,13-110149494-C-T,13,110149494,rs552877576,UTR3,"COL4A1(ENST00000375820.10:c.*869G>A,ENST000006...",ENST00000649720.1,,,,,,,
3,13-110149715-AAT-A,13,110149715,rs886049952,UTR3,"COL4A1(ENST00000375820.10:c.*647_*646delAT,ENS...",ENST00000649720.1,,,,,,,
4,13-110151168-C-T,13,110151168,rs557686466,intronic,COL4A1,ENST00000649720.1,,,,,,,


## Select features of interest
Select the variant identifiers (`VariantVcf` and `Rsid`) and the associated ClinVar disease name (`Clndn`).

In [45]:
variant_pheno_filtered = variant_pheno[["VariantVcf", "Rsid", "Clndn"]].copy()
variant_pheno_filtered.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn
0,13-110148917-C-G,rs59409892,
1,13-110148891-C-G,rs552586867,
2,13-110149494-C-T,rs552877576,
3,13-110149715-AAT-A,rs886049952,
4,13-110151168-C-T,rs557686466,
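
`Clndn` is empty for all five previewed rows, so it can be useful to check how many variants carry a disease name at all. The sketch below uses a small toy frame rather than the real data; the value `example_disease` is a placeholder, not an actual annotation.

```python
import pandas as pd

# Toy frame with the same columns as the selection above.
# "example_disease" is a placeholder value, not a real annotation.
toy = pd.DataFrame(
    {
        "VariantVcf": ["13-110148917-C-G", "13-110148891-C-G", "13-110149494-C-T"],
        "Rsid": ["rs59409892", "rs552586867", "rs552877576"],
        "Clndn": [None, "example_disease", None],
    }
)

# Boolean mask of rows that have a disease name.
has_name = toy["Clndn"].notna()

# Number of annotated variants, and the annotated subset itself.
n_annotated = int(has_name.sum())
annotated = toy.loc[has_name].reset_index(drop=True)
```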


## Extract variant reference and alternate allele information
Extract the chromosome, genomic position, reference allele and alternate allele from the `VariantVcf` column.

In [46]:
variant_pheno_filtered[['CHROM','POS','REF','ALT']] = variant_pheno_filtered["VariantVcf"].str.split("-", expand=True)
variant_pheno_filtered.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT
0,13-110148917-C-G,rs59409892,,13,110148917,C,G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T
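
The split relies on every `VariantVcf` value having exactly four hyphen-separated fields, and it leaves `POS` as a string. A minimal sketch of how both points could be checked on toy rows, assuming integer positions are wanted downstream (the original notebook keeps them as strings):

```python
import pandas as pd

# Toy rows in the same CHROM-POS-REF-ALT layout as the VariantVcf column.
toy = pd.DataFrame({"VariantVcf": ["13-110148917-C-G", "13-110149715-AAT-A"]})

parts = toy["VariantVcf"].str.split("-", expand=True)

# Every identifier should split into exactly four non-null fields.
assert parts.shape[1] == 4 and parts.notna().all().all()

parts.columns = ["CHROM", "POS", "REF", "ALT"]

# str.split yields strings; cast POS to an integer for numeric comparisons.
parts["POS"] = parts["POS"].astype("int64")

toy = toy.join(parts)
```

If a value contained an extra hyphen, the split would produce more than four columns and the assertion would fail before any column assignment.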


## Assign a unique ID to each variant

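Add an `ID` column that joins `POS`, `REF` and `ALT` with underscores. Because the chromosome is not part of the ID, uniqueness is only guaranteed within a single chromosome.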

In [47]:
variant_pheno_filtered["ID"] = (
    variant_pheno_filtered[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)

variant_pheno_filtered.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT,ID
0,13-110148917-C-G,rs59409892,,13,110148917,C,G,110148917_C_G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G,110148891_C_G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T,110149494_C_T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A,110149715_AAT_A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T,110151168_C_T
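
Since the ID is built from `POS`, `REF` and `ALT` only, two variants at the same position on different chromosomes would receive the same ID. The sketch below shows one way to detect such collisions on toy data; the `CHROM_ID` column is a hypothetical alternative and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows: the first two share POS, REF and ALT but differ in CHROM.
toy = pd.DataFrame(
    {
        "CHROM": ["13", "1", "13"],
        "POS": ["110148917", "110148917", "110149494"],
        "REF": ["C", "C", "C"],
        "ALT": ["G", "G", "T"],
    }
)

# Same POS_REF_ALT layout as the ID column above.
toy["ID"] = toy["POS"] + "_" + toy["REF"] + "_" + toy["ALT"]
n_dup_ids = int(toy["ID"].duplicated().sum())

# Hypothetical chromosome-qualified ID that avoids the collision.
toy["CHROM_ID"] = toy["CHROM"] + "_" + toy["ID"]
n_dup_chrom_ids = int(toy["CHROM_ID"].duplicated().sum())
```

On the real data, a non-zero count from `variant_pheno_filtered["ID"].duplicated().sum()` would indicate that the ID needs the chromosome prefix.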


## Save variant phenotype data to a CSV file

In [49]:
variant_pheno_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    ),
    index=False,
)
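
When the saved file is read back, `pd.read_csv` infers column types, so `CHROM` and `POS` come back as integers instead of the strings written here. A minimal sketch of the round trip on a toy frame, written to a temporary directory rather than the project's `Data/Processed` folder:

```python
import os
import tempfile

import pandas as pd

# Toy frame with string-typed columns, as produced by the split step.
toy = pd.DataFrame(
    {
        "CHROM": ["13"],
        "POS": ["110148917"],
        "REF": ["C"],
        "ALT": ["G"],
        "ID": ["110148917_C_G"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Variant_disease_phenotypes.csv")
    toy.to_csv(path, index=False)

    # Default read: numeric-looking columns are inferred as integers.
    inferred = pd.read_csv(path)

    # Explicit dtypes keep CHROM and POS as text.
    as_text = pd.read_csv(path, dtype={"CHROM": str, "POS": str})
```

Passing `dtype` explicitly matters most for `CHROM`, because values such as `X` or `Y` would otherwise make the inferred type depend on which chromosomes happen to be in the file.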