# Retrieval of variant phenotype data

Variant-phenotype association data was retrieved from [the Functional Annotation of Variants - Online Resource v2.0 (FAVOR)](https://favor.genohub.org/batch-annotation) for the variants identified in-house in African populations. The FAVOR user interface only accepts variant input in a specific format, so this notebook details how that input was prepared. Because the phenotype data was then retrieved through the user interface, the retrieval step itself is not documented here.

## Imports

In [1]:
import os
import sys
import pandas as pd

from dotenv import load_dotenv

load_dotenv()

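PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT is None:
    raise EnvironmentError("PROJECT_ROOT is not set; define it in the .env file.")
if PROJECT_ROOT not in sys.path:
    # Work from the Notebooks folder and make the Utils package importable
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)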

import Utils.constants as constants
import Utils.functions as functions

## Data Loading

To compile a list of variants analysed in African populations for which we want to obtain phenotype data, we first need to load the in-house African variant data.

In [2]:
ih_afr = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

ih_afr.head(5)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS,IH_AF
0,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Bantu Kenya,EA,20,0,20,0.0
1,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Yoruba,WA,276,0,276,0.0
2,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,San,SA,12,0,12,0.0
3,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Mende,WA,166,0,166,0.0
4,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Mbuti Pygmy,CA,24,0,24,0.0


Each of the variants listed in the DataFrame above resides in a particular gene. Load information on the genomic coordinates of each gene.

In [3]:
gene_coords = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "gene_locations.csv",
    )
)
gene_coords = gene_coords.rename(columns={"location_name":"GENE"})
gene_coords.head(5)

Unnamed: 0,GENE,chromosome,start,stop,strand
0,COL4A1,13,110148863,110307257,-1
1,NOS3,7,150990917,151014688,1
2,IL6,7,22725784,22732102,1
3,IL1B,2,112829651,112836943,-1
4,AGT,1,230690676,230745683,-1


## Compile data into required input format

Select the aggregated variant data for Recent Africans.

In [4]:
ih_afr = ih_afr[ih_afr.REG == "Recent African"]

Remove variants with an alternate allele count of 0, since these variants were not observed in Recent Africans.

In [5]:
ih_afr_filtered = ih_afr[ih_afr.IH_ALT_CTS != 0]

ih_afr_filtered.head(5)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS,IH_AF
12,110148891_C_G,rs552586867,SNP,110148891,C,G,COL4A1,,Recent African,1219,1,1220,0.00082
28,110148917_C_G,rs59409892,SNP,110148917,C,G,COL4A1,,Recent African,1101,119,1220,0.097541
144,110149176_T_A,rs546124548,SNP,110149176,T,A,COL4A1,,Recent African,1219,1,1220,0.00082
252,110149349_G_A,rs139916479,SNP,110149349,G,A,COL4A1,,Recent African,1215,5,1220,0.004098
288,110149494_C_T,rs552877576,SNP,110149494,C,T,COL4A1,,Recent African,1219,1,1220,0.00082
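
The filter above relies on the count columns being internally consistent. The following is a minimal sketch of a sanity check, not part of the original analysis. It uses a toy DataFrame copied from three of the rows shown above, and it assumes that `IH_TOTAL_CTS` is the sum of the reference and alternate counts and that `IH_AF` is the alternate count divided by the total, rounded to six decimals. The names `toy`, `counts_add_up` and `af_matches` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the output above (not the full dataset)
toy = pd.DataFrame(
    {
        "ID": ["110148891_C_G", "110148917_C_G", "110149349_G_A"],
        "IH_REF_CTS": [1219, 1101, 1215],
        "IH_ALT_CTS": [1, 119, 5],
        "IH_TOTAL_CTS": [1220, 1220, 1220],
        "IH_AF": [0.00082, 0.097541, 0.004098],
    }
)

# Assumption: total count = reference count + alternate count
counts_add_up = (toy["IH_REF_CTS"] + toy["IH_ALT_CTS"] == toy["IH_TOTAL_CTS"]).all()

# Assumption: allele frequency = alternate count / total count, rounded to 6 decimals
expected_af = toy["IH_ALT_CTS"] / toy["IH_TOTAL_CTS"]
af_matches = np.isclose(toy["IH_AF"], expected_af, rtol=0, atol=1e-6).all()

print(counts_add_up, af_matches)
```

If either check fails on the real data, the rows involved are worth inspecting before the zero-count filter is trusted.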


Join the chromosome information from the `gene_coords` DataFrame with the variant information in the `ih_afr_filtered` DataFrame.

In [6]:
ih_afr_filtered = pd.merge(
    ih_afr_filtered,
    gene_coords[["GENE", "chromosome"]],
    on="GENE",
).rename(columns={"chromosome": "CHROM"})

ih_afr_filtered.head(5)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS,IH_AF,CHROM
0,110148891_C_G,rs552586867,SNP,110148891,C,G,COL4A1,,Recent African,1219,1,1220,0.00082,13
1,110148917_C_G,rs59409892,SNP,110148917,C,G,COL4A1,,Recent African,1101,119,1220,0.097541,13
2,110149176_T_A,rs546124548,SNP,110149176,T,A,COL4A1,,Recent African,1219,1,1220,0.00082,13
3,110149349_G_A,rs139916479,SNP,110149349,G,A,COL4A1,,Recent African,1215,5,1220,0.004098,13
4,110149494_C_T,rs552877576,SNP,110149494,C,T,COL4A1,,Recent African,1219,1,1220,0.00082,13
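
`pd.merge` performs an inner join by default, so a variant whose gene is missing from `gene_coords` would be dropped without warning. The sketch below shows one possible way to guard against that, using toy stand-ins for the two DataFrames. The gene name `HYPOTHETICAL_GENE`, the ID `1_A_T` and the variable names are made up for illustration. `validate="many_to_one"` makes pandas raise an error if a gene appears more than once in the coordinate table, and `indicator=True` adds a `_merge` column that flags unmatched rows.

```python
import pandas as pd

# Toy stand-ins for ih_afr_filtered and gene_coords (illustrative values only)
variants = pd.DataFrame(
    {
        "ID": ["110148891_C_G", "110148917_C_G", "1_A_T"],
        "GENE": ["COL4A1", "COL4A1", "HYPOTHETICAL_GENE"],
    }
)
genes = pd.DataFrame({"GENE": ["COL4A1", "NOS3"], "chromosome": [13, 7]})

# A left join keeps every variant, matched or not
merged = pd.merge(
    variants,
    genes[["GENE", "chromosome"]],
    on="GENE",
    how="left",
    validate="many_to_one",  # raises MergeError if a gene is duplicated in `genes`
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Genes that have variants but no chromosome information
unmatched = merged.loc[merged["_merge"] == "left_only", "GENE"].unique().tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would indicate that the inner join used above loses no variants.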


Select the chromosome, position, reference allele and alternate allele of each variant.

In [7]:
selected_variant_info = ih_afr_filtered[["CHROM", "POS", "REF", "ALT"]]
selected_variant_info.head(5)

Unnamed: 0,CHROM,POS,REF,ALT
0,13,110148891,C,G
1,13,110148917,C,G
2,13,110149176,T,A
3,13,110149349,G,A
4,13,110149494,C,T


Combine the chromosome, position, reference allele and alternate allele of each variant into a single `CHROM-POS-REF-ALT` string, the format used here to retrieve variant annotation information from https://favor.genohub.org/batch-annotation.

In [8]:
formatted_variant_info = pd.DataFrame(
    data=selected_variant_info["CHROM"].astype(str)
    + "-" + selected_variant_info["POS"].astype(str)
    + "-" + selected_variant_info["REF"]
    + "-" + selected_variant_info["ALT"]
)
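
Before saving, it may help to check that every string has the expected `CHROM-POS-REF-ALT` shape. The following is a minimal sketch on toy data. The helper `to_variant_strings` and the regular expression are assumptions made for illustration, not an official FAVOR specification: the pattern accepts chromosomes 1 to 22, X or Y, a numeric position, and alleles made up of A, C, G and T. The de-duplication step is likewise an optional extra rather than part of the original workflow.

```python
import pandas as pd

# Toy variants; the third row deliberately duplicates the first
toy = pd.DataFrame(
    {
        "CHROM": [13, 13, 13],
        "POS": [110148891, 110148917, 110148891],
        "REF": ["C", "C", "C"],
        "ALT": ["G", "G", "G"],
    }
)


def to_variant_strings(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: build CHROM-POS-REF-ALT strings and drop duplicates."""
    strings = (
        df["CHROM"].astype(str)
        + "-" + df["POS"].astype(str)
        + "-" + df["REF"]
        + "-" + df["ALT"]
    )
    return strings.drop_duplicates().reset_index(drop=True)


# Assumed pattern: chromosome 1-22, X or Y; integer position; A/C/G/T alleles
PATTERN = r"(?:[1-9]|1\d|2[0-2]|X|Y)-\d+-[ACGT]+-[ACGT]+"

variant_strings = to_variant_strings(toy)
is_valid = variant_strings.str.fullmatch(PATTERN)

print(variant_strings.tolist())
print(bool(is_valid.all()))
```

Any string that fails the pattern can be listed with `variant_strings[~is_valid]` and inspected by hand.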

## Save formatted data to txt file

In [9]:
formatted_variant_info.to_csv('Variant_phenotype_formatted_variants.txt', header=False, index=False)
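
One way to confirm the export is to read the file back and compare it with the DataFrame. The sketch below is illustrative only: it uses two toy variant strings and writes to a temporary directory so that the real output file is not touched. The names `formatted`, `out_path` and `lines` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for formatted_variant_info (a single unnamed column of strings)
formatted = pd.DataFrame(["13-110148891-C-G", "13-110148917-C-G"])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Variant_phenotype_formatted_variants.txt")

    # Same options as above: no header row and no index column
    formatted.to_csv(out_path, header=False, index=False)

    # Read the file back as plain text, one variant per line
    with open(out_path) as handle:
        lines = handle.read().splitlines()

print(lines)
```

Each line of the file should hold exactly one variant string, with no header and no index values.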