# Variant disease phenotypes

This analysis aimed to answer the following research question: **How many of the variants identified in African populations have known disease associations, and what are these associations?**

To achieve this, the following steps were performed:

1. Genetic Variant Phenotype Data Retrieval and Preparation: Known disease phenotypes for the variants were retrieved from [FAVOR v2.0](https://favor.genohub.org/). The retrieved data were processed and prepared following the steps in the [Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb) Jupyter notebook. The processed data can be found [here](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Data/Processed/Variant_disease_phenotypes.csv).
2. Analysis and Visualisation:
    * To be completed...

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

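# Read the project root from the environment (loaded from .env above)
PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT is None:
    raise EnvironmentError("PROJECT_ROOT is not set; define it in the .env file")

# Work from the Notebooks folder and make the Utils package importable
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)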

import numpy as np
import pandas as pd
import seaborn as sns
import upsetplot
from matplotlib import pyplot as plt
import Utils.constants as constants
import Utils.functions as functions

Import variant phenotype data

In [3]:
phenotype_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    )
)

phenotype_data.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT,ID
0,13-110148917-C-G,rs59409892,,13,110148917,C,G,110148917_C_G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G,110148891_C_G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T,110149494_C_T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A,110149715_AAT_A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T,110151168_C_T


Import genetic variant count data for African populations

In [6]:
ih_afr = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

ih_afr.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE
0,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0,INDEL
1,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0,INDEL
2,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0,INDEL
3,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0,INDEL
4,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0,INDEL


Import variant effect data

In [4]:
vep_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    )
)

vep_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,GENE,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE,ID
0,13,110148882,C,CT,COL4A1,,,,,-0.437825,0.16,110148882_C_CT
1,13,110148891,C,G,COL4A1,,,,,-0.227221,0.446,110148891_C_G
2,13,110148917,C,G,COL4A1,,,,,0.269936,3.938,110148917_C_G
3,13,110148920,G,C,COL4A1,,,,,0.530972,6.825,110148920_G_C
4,13,110148959,A,G,COL4A1,,,,,1.380228,14.95,110148959_A_G


Import variant consequence data

In [5]:
consequence_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_consequences.csv",
    )
)

consequence_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,TYPE,CONSEQUENCE_CLASSIFICATION,CONSEQUENCE,GENE,ID
0,13,110148882,C,CT,INS,downstream,downstream,COL4A1,110148882_C_CT
1,13,110148891,C,G,SNV,downstream,downstream,COL4A1,110148891_C_G
2,13,110148917,C,G,SNV,downstream,downstream,COL4A1,110148917_C_G
3,13,110148920,G,C,SNV,downstream,downstream,COL4A1,110148920_G_C
4,13,110148959,A,G,SNV,downstream,downstream,COL4A1,110148959_A_G


## Analysis and Visualisation

### Data selection

Select phenotype data for variants that are present in the aggregated Recent African population group for analysis.

In [8]:
# Select aggregated variant count and frequency data for Recent Africans. Remove variants with an alternate allele count of 0. These variants are not present in Recent Africans.

ih_afr_subpops = ih_afr[(ih_afr["REG"] == "Recent African") & (ih_afr["IH_ALT_CTS"] > 0)]

# Add in phenotype data for variants that are present in the Recent African populations
ih_afr_subpops_phenotype_data = (
    ih_afr_subpops.merge(
        phenotype_data,
        how="left",
        on=["REF", "ALT", "POS"],
    )
    .drop(columns="ID_y")
    .rename(columns={"ID_x": "ID"})
)

ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0
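
The left merge above keeps every variant observed in Recent Africans, including those with no phenotype record (their `Clndn` is missing). The sketch below shows one way to sanity-check such a merge on toy data. The table contents are made up for illustration, and the `validate` and `indicator` options are suggested additions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for the count and phenotype tables (illustrative values only)
counts = pd.DataFrame(
    {
        "ID": ["100_C_G", "200_G_A"],
        "POS": [100, 200],
        "REF": ["C", "G"],
        "ALT": ["G", "A"],
        "IH_ALT_CTS": [1, 5],
    }
)
phenotypes = pd.DataFrame(
    {
        "POS": [200],
        "REF": ["G"],
        "ALT": ["A"],
        "Clndn": ["Example_disease"],
    }
)

merged = counts.merge(
    phenotypes,
    how="left",
    on=["REF", "ALT", "POS"],
    validate="many_to_one",  # raise if a variant has several phenotype rows
    indicator=True,  # adds a "_merge" column recording where each row came from
)

# Variants that found no phenotype record at all
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["ID"].tolist())
```

If the phenotype table can legitimately hold more than one row per variant, `validate` would need to be relaxed or the table deduplicated first.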


How many variants have known disease phenotypes?
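
A minimal sketch of how this question could be answered, assuming that a non-missing `Clndn` value means a variant has a known disease phenotype. The helper `count_variants_with_phenotype` and the toy table are hypothetical and only illustrate the counting step.

```python
import pandas as pd


def count_variants_with_phenotype(df, id_col="ID", phenotype_col="Clndn"):
    """Return (unique variants with a phenotype annotation, total unique variants)."""
    annotated = df.loc[df[phenotype_col].notna(), id_col].nunique()
    total = df[id_col].nunique()
    return annotated, total


# Toy data: variant "c" appears twice to show that duplicates are counted once
toy = pd.DataFrame(
    {
        "ID": ["a", "b", "c", "c"],
        "Clndn": [None, "Disease_X", "Disease_Y", "Disease_Y"],
    }
)

annotated, total = count_variants_with_phenotype(toy)
print(f"{annotated} of {total} variants have a known disease phenotype")
```

In the notebook, the same helper could be applied to `ih_afr_subpops_phenotype_data`.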

What disease phenotypes are present?
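
One possible way to tabulate the phenotypes is sketched below. It assumes that `Clndn` follows the ClinVar convention of separating multiple disease names with `|` and replacing spaces with underscores; this should be checked against the actual data. The helper `phenotype_counts` and the toy table are hypothetical.

```python
import pandas as pd


def phenotype_counts(df, id_col="ID", phenotype_col="Clndn"):
    """Count how many unique variants are annotated with each disease name."""
    # Keep one row per annotated variant
    annotated = df.dropna(subset=[phenotype_col]).drop_duplicates(subset=[id_col])
    names = (
        annotated[phenotype_col]
        .str.split("|")  # assumed separator between multiple disease names
        .explode()  # one row per (variant, disease name) pair
        .str.replace("_", " ", regex=False)
        .str.strip()
    )
    return names.value_counts()


toy = pd.DataFrame(
    {
        "ID": ["a", "b", "c"],
        "Clndn": ["Disease_X|Disease_Y", "Disease_X", None],
    }
)

counts = phenotype_counts(toy)
print(counts)
```

Placeholder values in the annotation field, if any are present, would need to be filtered out before interpreting the counts.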

Are the variants with known disease phenotypes predicted to be deleterious (CADD>=20)?
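
A sketch of how this comparison could be set up, assuming the threshold of 20 applies to the PHRED-scaled score (`CADD_PHRED_SCORE`) and that the `ID` column identifies the same variants in both tables. The helper `flag_deleterious`, the `PREDICTED_DELETERIOUS` column, and the toy values are hypothetical.

```python
import pandas as pd

CADD_THRESHOLD = 20  # threshold named in the question above


def flag_deleterious(phenotype_df, effect_df, threshold=CADD_THRESHOLD):
    """Attach CADD scores to annotated variants and flag those at or above the threshold."""
    annotated = phenotype_df.dropna(subset=["Clndn"]).drop_duplicates(subset=["ID"])
    scores = effect_df[["ID", "CADD_PHRED_SCORE"]].drop_duplicates(subset=["ID"])
    merged = annotated.merge(scores, how="left", on="ID")
    # Variants without a CADD score compare as False here
    merged["PREDICTED_DELETERIOUS"] = merged["CADD_PHRED_SCORE"] >= threshold
    return merged


# Toy data (illustrative values only)
toy_phenotypes = pd.DataFrame(
    {"ID": ["a", "b", "c"], "Clndn": ["Disease_X", None, "Disease_Y"]}
)
toy_effects = pd.DataFrame(
    {"ID": ["a", "b", "c"], "CADD_PHRED_SCORE": [25.1, 30.0, 3.9]}
)

result = flag_deleterious(toy_phenotypes, toy_effects)
print(result[["ID", "Clndn", "CADD_PHRED_SCORE", "PREDICTED_DELETERIOUS"]])
```

In the notebook, `ih_afr_subpops_phenotype_data` and `vep_data` would take the place of the toy tables.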