# Preparation of variant phenotype data

Variant-phenotype association data was retrieved using [the Functional Annotation of Variants - Online Resource v2.0](https://favor.genohub.org/batch-annotation) (FAVOR) for the variants identified in-house in African populations. The retrieved data was stored in the `Data/Raw/PHENO/GRCh38-Favorv2.0_ALLGENES.csv` file.

The data was prepared for analysis by following the steps outlined in this notebook.

## Imports

Notebook setup

In [3]:
import os
import sys

from dotenv import load_dotenv

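# Read PROJECT_ROOT from the local .env file
load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT is None:
    raise RuntimeError("PROJECT_ROOT is not set; define it in the .env file")

# Work from the Notebooks directory and make project modules importable
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)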

import pandas as pd

## Data loading

Load the variant phenotype data from `Data/Raw/PHENO/GRCh38-Favorv2.0_ALLGENES.csv` into a DataFrame.

In [4]:
variant_pheno = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "PHENO",
        "GRCh38-Favorv2.0_ALLGENES.csv",
    )
)

variant_pheno.head(2)

| | VariantVcf | Chromosome | Position | Rsid | GenecodeComprehensiveCategory | GenecodeComprehensiveInfo | UcscInfo | Clnsig | Clnsigincl | Clndn | Clndnincl | Clnrevstat | Clndisdb | Clndisdbincl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13-110148917-C-G | 13 | 110148917 | rs59409892 | downstream | COL4A1 | ENST00000375820.9,ENST00000649720.1,ENST000006... | | | | | | | |
| 1 | 13-110148891-C-G | 13 | 110148891 | rs552586867 | downstream | COL4A1 | ENST00000375820.9,ENST00000649720.1,ENST000006... | | | | | | | |
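
Before selecting columns, it can help to confirm that the loaded table contains the columns used in later steps and to count how many variants carry a disease name. The following is a minimal sketch on a small in-memory sample rather than the real export; the `REQUIRED_COLUMNS` name and the two-row sample are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Toy stand-in for the FAVOR export (two rows copied from the preview above).
sample_csv = io.StringIO(
    "VariantVcf,Rsid,Clndn\n"
    "13-110148917-C-G,rs59409892,\n"
    "13-110148891-C-G,rs552586867,\n"
)
sample = pd.read_csv(sample_csv)

# Hypothetical list of the columns that the later steps rely on.
REQUIRED_COLUMNS = ["VariantVcf", "Rsid", "Clndn"]
missing_columns = [c for c in REQUIRED_COLUMNS if c not in sample.columns]

# Number of variants that have a disease name annotation.
annotated_count = int(sample["Clndn"].notna().sum())
```

On the real data, a non-empty `missing_columns` list would indicate that the export layout differs from what this notebook expects.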


## Feature selection

Select the variant identifier columns (`VariantVcf`, `Rsid`) and the associated clinical disease name column (`Clndn`) for further analysis.

In [5]:
# Select the columns first, then copy, so later assignments act on an independent DataFrame
variant_pheno_filtered = variant_pheno[["VariantVcf", "Rsid", "Clndn"]].copy()

Extract variant chromosome, genomic position, reference and alternate allele information from the `VariantVcf` column. 

In [6]:
variant_pheno_filtered[['CHROM','POS','REF','ALT']] = variant_pheno_filtered["VariantVcf"].str.split("-", expand=True)
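
The split above relies on every `VariantVcf` value having exactly four hyphen-separated fields. A minimal sketch of that assumption on a toy DataFrame follows (the `toy` and `parts` names are illustrative). It also shows that `POS` comes back as a string and can be cast if numeric comparison is needed.

```python
import pandas as pd

# Toy DataFrame using the two identifiers shown in the preview.
toy = pd.DataFrame({"VariantVcf": ["13-110148917-C-G", "13-110148891-C-G"]})

parts = toy["VariantVcf"].str.split("-", expand=True)

# Each identifier should yield exactly four fields: CHROM, POS, REF, ALT.
assert parts.shape[1] == 4
toy[["CHROM", "POS", "REF", "ALT"]] = parts

# str.split returns strings; cast POS when numeric sorting or comparison is needed.
toy["POS"] = toy["POS"].astype(int)
```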

Add an `ID` column that identifies each variant by joining `POS`, `REF`, and `ALT` with underscores.

In [7]:
variant_pheno_filtered["ID"] = (
    variant_pheno_filtered[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)
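
Because the `ID` is built from position and alleles only, two variants on different chromosomes could in principle share an `ID`. The sketch below uses made-up coordinates to show how such collisions can be counted. The `ID_WITH_CHROM` column is a hypothetical alternative and is not used in the original notebook.

```python
import pandas as pd

# Made-up variants: same position and alleles on two different chromosomes.
toy = pd.DataFrame(
    {
        "CHROM": ["1", "2"],
        "POS": ["100", "100"],
        "REF": ["C", "C"],
        "ALT": ["G", "G"],
    }
)

# Same construction as in the notebook cell above.
toy["ID"] = toy[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)

# Count rows whose ID repeats an earlier row.
n_duplicate_ids = int(toy["ID"].duplicated().sum())

# Hypothetical alternative that also includes the chromosome.
toy["ID_WITH_CHROM"] = (
    toy[["CHROM", "POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)
```

Running the same `duplicated()` check on `variant_pheno_filtered["ID"]` would show whether this matters for the real data.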

## View and save prepared data

In [8]:
variant_pheno_filtered.head(2)

| | VariantVcf | Rsid | Clndn | CHROM | POS | REF | ALT | ID |
|---|---|---|---|---|---|---|---|---|
| 0 | 13-110148917-C-G | rs59409892 | | 13 | 110148917 | C | G | 110148917_C_G |
| 1 | 13-110148891-C-G | rs552586867 | | 13 | 110148891 | C | G | 110148891_C_G |


The prepared DataFrame contains the following columns:

* `VariantVcf`: Variant identifier in `CHROM-POS-REF-ALT` form, as it would appear in a VCF record.
* `Rsid`: Reference SNP ID from the dbSNP database.
* `Clndn`: ClinVar disease name(s) associated with the variant.
* `CHROM`: Chromosome on which the variant is located.
* `POS`: Position of the variant on the chromosome.
* `REF`: Reference allele.
* `ALT`: Alternate allele.
* `ID`: Identifier for the variant, built by joining `POS`, `REF`, and `ALT` with underscores.

In [9]:
# index=False already omits the index, so no reset_index is needed before saving
variant_pheno_filtered.to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    ),
    index=False,
)
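
When the saved file is read back, pandas infers column types again, so a `CHROM` column holding only numeric chromosome labels would be parsed as integers. A minimal round-trip sketch follows, using a temporary file and toy values (the file name and data are illustrative assumptions). It shows how passing `dtype` keeps `CHROM` as text.

```python
import os
import tempfile

import pandas as pd

# Toy data with numeric-looking chromosome labels stored as strings.
toy = pd.DataFrame(
    {
        "CHROM": ["13", "1"],
        "POS": ["110148917", "500"],
        "ID": ["110148917_C_G", "500_A_T"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_variants.csv")
    toy.to_csv(out_path, index=False)

    # Force CHROM to stay a string column on reload.
    reloaded = pd.read_csv(out_path, dtype={"CHROM": str})
```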