# Variant disease phenotypes

This analysis aimed to answer the following research question: **How many of the variants identified in African populations have known disease associations, and what are these associations?**

To achieve this, the following steps were performed:

1. Genetic Variant Phenotype Data Retrieval and Preparation: Known disease phenotypes for the variants were retrieved from [Favor v2.0](https://favor.genohub.org/). The retrieved data were processed and prepared as described in the [Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb) Jupyter notebook. The processed data can be found [here](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Data/Processed/Variant_disease_phenotypes.csv).
2. Analysis and Visualisation:
    * To be completed...

## Imports

Notebook setup

In [142]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import numpy as np
import pandas as pd
import seaborn as sns
import upsetplot
from matplotlib import pyplot as plt
import Utils.constants as constants
import Utils.functions as functions
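
The setup cell assumes that `PROJECT_ROOT` is defined in the environment or in a `.env` file. A minimal sketch of a guard that fails early with a clear message when the variable is missing is shown here. The helper name `resolve_project_root` and the example path are hypothetical and not part of this repository.

```python
import os


def resolve_project_root(env=None):
    """Return PROJECT_ROOT from a mapping, raising a clear error if it is unset."""
    # Hypothetical helper: falls back to the real environment when no mapping is given
    env = os.environ if env is None else env
    root = env.get("PROJECT_ROOT")
    if not root:
        raise RuntimeError("PROJECT_ROOT is not set; define it in the .env file")
    return root


# Example with an explicit mapping instead of the real environment
example_root = resolve_project_root({"PROJECT_ROOT": "/path/to/project"})
notebooks_dir = os.path.join(example_root, "Notebooks")
print(notebooks_dir)
```

Building the path with `os.path.join` rather than string concatenation keeps the separator handling consistent with the data-loading cells that follow.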

Import variant phenotype data

In [143]:
phenotype_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    )
)

phenotype_data.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT,ID
0,13-110148917-C-G,rs59409892,,13,110148917,C,G,110148917_C_G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G,110148891_C_G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T,110149494_C_T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A,110149715_AAT_A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T,110151168_C_T


Import genetic variant count data for African populations

In [144]:
ih_afr = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

ih_afr.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE
0,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0,INDEL
1,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0,INDEL
2,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0,INDEL
3,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0,INDEL
4,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0,INDEL


Import variant effect data

In [145]:
vep_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    )
)

vep_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,GENE,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE,ID
0,13,110148882,C,CT,COL4A1,,,,,-0.437825,0.16,110148882_C_CT
1,13,110148891,C,G,COL4A1,,,,,-0.227221,0.446,110148891_C_G
2,13,110148917,C,G,COL4A1,,,,,0.269936,3.938,110148917_C_G
3,13,110148920,G,C,COL4A1,,,,,0.530972,6.825,110148920_G_C
4,13,110148959,A,G,COL4A1,,,,,1.380228,14.95,110148959_A_G


## Analysis and Visualisation

### Data selection

Select phenotype data for variants present in the aggregated Recent African population for analysis.

In [146]:
# Select aggregated variant count and frequency data for Recent Africans.
# Remove variants with an alternate allele count of 0, since these are not
# present in Recent Africans.
ih_afr_subpops = ih_afr[(ih_afr["REG"] == "Recent African") & (ih_afr["IH_ALT_CTS"] > 0)]

# Add phenotype data for the variants present in the Recent African populations
ih_afr_subpops_phenotype_data = (
    ih_afr_subpops.merge(
        phenotype_data,
        how="left",
        on=["REF", "ALT", "POS"],
    )
    .drop(columns="ID_y")
    .rename(columns={"ID_x": "ID"})
)

ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0
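
A left merge keeps every variant from the count data, whether or not a phenotype record exists for it. One way to see how many variants found a match is the `indicator` argument of `DataFrame.merge`. The sketch below uses small made-up frames rather than the project data; the column names mirror this notebook, but all values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for ih_afr_subpops and phenotype_data (values are made up)
counts = pd.DataFrame(
    {
        "ID": ["100_C_G", "200_G_A", "300_T_C"],
        "POS": [100, 200, 300],
        "REF": ["C", "G", "T"],
        "ALT": ["G", "A", "C"],
    }
)
phenotypes = pd.DataFrame(
    {
        "ID": ["100_C_G", "200_G_A"],
        "POS": [100, 200],
        "REF": ["C", "G"],
        "ALT": ["G", "A"],
        "Clndn": [None, "Disease_A|Disease_B"],
    }
)

# indicator=True adds a _merge column: "both" or "left_only" for a left merge
merged = counts.merge(
    phenotypes.drop(columns="ID"),
    how="left",
    on=["REF", "ALT", "POS"],
    indicator=True,
)

match_summary = merged["_merge"].value_counts()
print(match_summary)
```

Dropping the duplicate `ID` column from the right-hand frame before merging is one way to avoid the `ID_x`/`ID_y` suffixes altogether.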


How many variants have known disease phenotypes?

In [147]:
# Split the phenotypes listed in each row of the Clndn column into a list

ih_afr_subpops_phenotype_data['Clndn'] = ih_afr_subpops_phenotype_data['Clndn'].str.split('|')
ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,[Brain_small_vessel_disease_1_with_or_without_...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0


In [148]:
# Convert the phenotypes list into multiple rows

ih_afr_subpops_phenotype_explode = ih_afr_subpops_phenotype_data.explode('Clndn')
ih_afr_subpops_phenotype_explode.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0
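
The split and explode steps can be illustrated on a toy frame. This is a sketch with made-up identifiers and labels, not the project data; it shows that missing values stay missing and that the original row index is repeated for each phenotype.

```python
import pandas as pd

# Toy frame with pipe-separated phenotype labels (values are made up)
toy = pd.DataFrame(
    {
        "ID": ["v1", "v2", "v3"],
        "Clndn": ["Disease_A|Disease_B", None, "not_provided"],
    }
)

# Split pipe-separated labels into lists; missing values remain missing
toy["Clndn"] = toy["Clndn"].str.split("|")

# One row per (variant, phenotype) pair; the original index is repeated
exploded = toy.explode("Clndn")
print(exploded)
```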


In [149]:
# Remove rows with a missing or uninformative value in the Clndn column.
# After the split and explode steps each row holds a single label, so the
# combined value 'not_specified|not_provided' no longer needs its own check.
uninformative_labels = ["not_provided", "not_specified", "none_provided"]

variants_with_known_phenotypes = ih_afr_subpops_phenotype_explode[
    ih_afr_subpops_phenotype_explode["Clndn"].notna()
    & ~ih_afr_subpops_phenotype_explode["Clndn"].isin(uninformative_labels)
]
variants_with_known_phenotypes.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_with_hemorrhage,13.0
7,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,Recent African,0.000825,INDEL,13-110149715-A-AAT,,Porencephalic_cyst,13.0
7,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,Recent African,0.000825,INDEL,13-110149715-A-AAT,,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0


In [150]:
# Count the number of variants with disease phenotype descriptions

variants_with_known_phenotypes['ID'].nunique()

133
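
Because the exploded table holds one row per variant-phenotype pair, the row count and the number of distinct variants differ. A small sketch with made-up data shows the distinction and why `nunique` on `ID` is the appropriate count here.

```python
import pandas as pd

# Toy exploded table: v1 carries three phenotypes, v2 carries one
exploded = pd.DataFrame(
    {
        "ID": ["v1", "v1", "v1", "v2"],
        "Clndn": ["Disease_A", "Disease_B", "Disease_C", "Disease_A"],
    }
)

n_pairs = len(exploded)  # number of variant-phenotype pairs
n_variants = exploded["ID"].nunique()  # number of distinct variants
phenotypes_per_variant = exploded.groupby("ID")["Clndn"].nunique()

print(n_pairs, n_variants)
print(phenotypes_per_variant)
```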

What disease phenotypes are present for each gene?

In [151]:
# Count the variants recorded for each gene and phenotype combination
variants_with_known_phenotypes_grouped_by_gene = (
    variants_with_known_phenotypes[["GENE", "Clndn", "ID"]]
    .groupby(["GENE", "Clndn"])
    .count()
    .reset_index()
    .sort_values(by=["GENE", "ID"], ascending=True)
    .rename(columns={"ID": "Count"})
)

variants_with_known_phenotypes_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,"Hypertension,_essential,_susceptibility_to",1
1,AGT,"Preeclampsia,_susceptibility_to",1
3,AGT,Susceptibility_to_progression_to_renal_failure...,1
2,AGT,Renal_dysplasia,22
4,AP4B1,History_of_neurodevelopmental_disorder,7
5,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",10
9,COL4A1,Porencephalic_cyst,17
8,COL4A1,Brain_small_vessel_disease_with_hemorrhage,26
7,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,38
6,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",40


How many of the variants with known disease phenotypes are predicted to be deleterious (CADD PHRED score >= 10)?

In [152]:
# Join the phenotype and CADD data

variants_phenotypes_and_cadd_phred = pd.merge(variants_with_known_phenotypes, vep_data, how='left', on=['ID','CHROM','POS','REF','ALT','GENE'])
variants_phenotypes_and_cadd_phred.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,...,VariantVcf,Rsid,Clndn,CHROM,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE
0,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0,,,,,0.257986,3.798
1,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0,,,,,0.257986,3.798
2,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_with_hemorrhage,13.0,,,,,0.257986,3.798
3,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,Porencephalic_cyst,13.0,,,,,1.172573,13.42
4,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0,,,,,1.172573,13.42


In [171]:
# Filter for variants with a CADD PHRED score >= 10

deleterious_variants_with_known_phenotype = variants_phenotypes_and_cadd_phred[variants_phenotypes_and_cadd_phred.CADD_PHRED_SCORE >= 10]
deleterious_variants_with_known_phenotype.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,...,VariantVcf,Rsid,Clndn,CHROM,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE
78,110205399_G_A,rs34843786,110205399,G,A,COL4A1,,17,1220,1203,...,13-110205399-G-A,rs34843786,Brain_small_vessel_disease_with_hemorrhage,13.0,benign,0.054,tolerated,0.17,2.258115,21.3
124,113896329_A_G,rs1217401,113896329,A,G,AP4B1,,973,1220,247,...,1-113896329-A-G,rs1217401,"Spastic_paraplegia_47,_autosomal_recessive",1.0,benign,0.111,tolerated,0.07,2.355377,21.9
125,113896329_A_G,rs1217401,113896329,A,G,AP4B1,,973,1220,247,...,1-113896329-A-G,rs1217401,History_of_neurodevelopmental_disorder,1.0,benign,0.111,tolerated,0.07,2.355377,21.9
126,113897925_C_T,rs145803736,113897925,C,T,AP4B1,,1,1220,1219,...,1-113897925-C-T,rs145803736,History_of_neurodevelopmental_disorder,1.0,probably_damaging,0.962,tolerated,0.08,3.822708,25.9
127,113898727_T_C,rs145182838,113898727,T,C,AP4B1,,16,1220,1204,...,1-113898727-T-C,rs145182838,"Spastic_paraplegia_47,_autosomal_recessive",1.0,benign,0.38,tolerated,0.27,2.169579,20.6
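
CADD PHRED scores are on a log scale, which is worth keeping in mind when reading the threshold used above. As the scale is generally described, a PHRED-scaled score of 10 corresponds to the top 10% of scored substitutions, 20 to the top 1%, and 30 to the top 0.1%. The conversion can be sketched as follows; the function name is illustrative.

```python
def phred_to_top_fraction(phred_score):
    """Convert a PHRED-scaled score to the top-ranked fraction it implies."""
    return 10 ** (-phred_score / 10)


# Print the implied top fraction for a few commonly quoted thresholds
for score in (10, 20, 30):
    print(score, phred_to_top_fraction(score))
```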


In [172]:
# Count the number of variants with known disease phenotypes and a CADD PHRED score >= 10

deleterious_variants_with_known_phenotype.ID.nunique()

26

In [173]:
# Percentage of the variants with known disease phenotypes that also have a CADD PHRED score >= 10

(deleterious_variants_with_known_phenotype.ID.nunique()/variants_with_known_phenotypes['ID'].nunique())*100

19.548872180451127
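
The percentage above follows directly from the two counts reported earlier (26 of 133 variants). A short worked version, using a hypothetical helper that guards against an empty denominator, might look like this:

```python
def percentage(part, whole):
    """Return part as a percentage of whole, or None when whole is zero."""
    if whole == 0:
        return None
    return part / whole * 100


# Counts taken from the cell outputs above
n_deleterious_with_phenotype = 26
n_with_phenotype = 133

pct = percentage(n_deleterious_with_phenotype, n_with_phenotype)
print(f"{pct:.1f}%")
```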

What phenotypes are represented in the variants with CADD PHRED scores >= 10?

In [185]:
# Count the deleterious variants recorded for each gene and phenotype combination
deleterious_variants_with_known_phenotype_grouped_by_gene = (
    deleterious_variants_with_known_phenotype[["GENE", "Clndn", "ID"]]
    .groupby(["GENE", "Clndn"])
    .count()
    .reset_index()
    .sort_values(by=["GENE", "ID"], ascending=True)
    .rename(columns={"ID": "Count"})
)

deleterious_variants_with_known_phenotype_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,Renal_dysplasia,4
1,AP4B1,History_of_neurodevelopmental_disorder,5
2,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",5
3,COL4A1,Brain_small_vessel_disease_with_hemorrhage,1
4,IL10,Inflammatory_bowel_disease,2
5,IL6,"Crohn_disease-associated_growth_failure,_susce...",1
6,IL6,"Diabetes_mellitus,_type_1,_susceptibility_to",1
7,IL6,"Diabetes_mellitus_type_2,_susceptibility_to",1
8,IL6,Intracranial_hemorrhage_in_brain_cerebrovascul...,1
9,IL6,Kaposi_sarcoma,1


What are the known phenotypes of the variants with CADD PHRED scores >= 10 and common frequencies (>= 0.1)?

In [184]:
# Display common variants, and their associated clinical disease phenotype

common_deleterious_variants_with_known_phenotype = deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.IH_AF >= 0.1][['GENE','Clndn','ID','VAR_NAME','IH_AF','CADD_PHRED_SCORE']].sort_values(by=['GENE','Clndn','ID'], ascending=True)
common_deleterious_variants_with_known_phenotype

Unnamed: 0,GENE,Clndn,ID,VAR_NAME,IH_AF,CADD_PHRED_SCORE
125,AP4B1,History_of_neurodevelopmental_disorder,113896329_A_G,rs1217401,0.797541,21.9
124,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113896329_A_G,rs1217401,0.797541,21.9
237,IL6,"Crohn_disease-associated_growth_failure,_susce...",22727026_C_G,rs1800795,0.99918,18.77
233,IL6,"Diabetes_mellitus,_type_1,_susceptibility_to",22727026_C_G,rs1800795,0.99918,18.77
232,IL6,"Diabetes_mellitus_type_2,_susceptibility_to",22727026_C_G,rs1800795,0.99918,18.77
234,IL6,Intracranial_hemorrhage_in_brain_cerebrovascul...,22727026_C_G,rs1800795,0.99918,18.77
236,IL6,Kaposi_sarcoma,22727026_C_G,rs1800795,0.99918,18.77
235,IL6,"Rheumatoid_arthritis,_systemic_juvenile,_susce...",22727026_C_G,rs1800795,0.99918,18.77
169,MTHFR,Gastrointestinal_stromal_tumor,11794419_T_G,rs1801131,0.145902,20.9
173,MTHFR,Homocystinuria_due_to_MTHFR_deficiency,11794419_T_G,rs1801131,0.145902,20.9


In [181]:
# Count the number of common variants with known phenotypes and a CADD PHRED score >= 10

common_deleterious_variants_with_known_phenotype.ID.nunique()

3

What are the known phenotypes of the variants with CADD PHRED scores >= 10 and uncommon frequencies (< 0.1)?

In [182]:
# Display uncommon variants, and their associated clinical disease phenotype

uncommon_deleterious_variants_with_known_phenotype = deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.IH_AF < 0.1][['GENE','Clndn','ID','VAR_NAME','IH_AF','CADD_PHRED_SCORE']].sort_values(by=['GENE','Clndn','ID'], ascending=True)
uncommon_deleterious_variants_with_known_phenotype

Unnamed: 0,GENE,Clndn,ID,VAR_NAME,IH_AF
245,AGT,Renal_dysplasia,230703157_G_A,rs143479528,0.004098
246,AGT,Renal_dysplasia,230703274_G_A,rs61751077,0.005738
251,AGT,Renal_dysplasia,230706171_C_T,rs139685563,0.001639
257,AGT,Renal_dysplasia,230710231_G_A,rs4762,0.04918
126,AP4B1,History_of_neurodevelopmental_disorder,113897925_C_T,rs145803736,0.00082
129,AP4B1,History_of_neurodevelopmental_disorder,113900051_A_T,rs149335605,0.00082
131,AP4B1,History_of_neurodevelopmental_disorder,113900120_C_A,rs111785152,0.008197
139,AP4B1,History_of_neurodevelopmental_disorder,113902736_T_C,rs34249695,0.02459
127,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113898727_T_C,rs145182838,0.013115
128,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113900051_A_T,rs149335605,0.00082


In [183]:
# Count the number of uncommon variants with known phenotypes and a CADD PHRED score >= 10

uncommon_deleterious_variants_with_known_phenotype.ID.nunique()

23
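
The common and uncommon groups are complementary, so their variant counts should add up to the total number of deleterious variants with known phenotypes (here 3 + 23 = 26, which matches the earlier count). A minimal sketch of that sanity check on made-up data, using the same 0.1 frequency threshold:

```python
import pandas as pd

AF_THRESHOLD = 0.1  # same cut-off as the cells above

# Toy table with one row per variant-phenotype pair (values are made up)
toy = pd.DataFrame(
    {
        "ID": ["v1", "v1", "v2", "v3"],
        "IH_AF": [0.80, 0.80, 0.05, 0.10],
    }
)

common = toy[toy["IH_AF"] >= AF_THRESHOLD]
uncommon = toy[toy["IH_AF"] < AF_THRESHOLD]

n_common = common["ID"].nunique()
n_uncommon = uncommon["ID"].nunique()

# The two groups should partition the set of variants
assert n_common + n_uncommon == toy["ID"].nunique()
print(n_common, n_uncommon)
```

The check would fail if any variant had a missing `IH_AF`, since a missing value satisfies neither comparison; that makes it a cheap way to catch gaps in the frequency data.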