# Variant disease phenotypes

This analysis aimed to answer the following research question: **How many of the variants identified in African populations have known disease associations, and what are these associations?**

To achieve this, the following steps were performed:

1. Genetic Variant Phenotype Data Retrieval and Preparation: Known disease phenotypes for the variants were retrieved from [FAVOR v2.0](https://favor.genohub.org/). The retrieved data were processed and prepared as described in the [Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb) Jupyter notebook. The processed data can be found [here](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Data/Processed/Variant_disease_phenotypes.csv).
2. Analysis and Visualisation:
    * To be completed...

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import numpy as np
import pandas as pd
import seaborn as sns
import upsetplot
from matplotlib import pyplot as plt
import Utils.constants as constants
import Utils.functions as functions
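
The setup cell above assumes that `PROJECT_ROOT` is defined in the environment (for example via the `.env` file read by `load_dotenv()`). If it is missing, `os.getenv` returns `None` and the later path handling fails with a less obvious error. A minimal sketch of an explicit check follows; the helper name `resolve_project_root` is hypothetical and not part of the project's `Utils` package.

```python
import os
from pathlib import Path


def resolve_project_root(env=None):
    """Return PROJECT_ROOT as a Path, or fail early with a clear message.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    root = env.get("PROJECT_ROOT")
    if not root:
        raise RuntimeError(
            "PROJECT_ROOT is not set; define it in the environment or the .env file."
        )
    return Path(root)


# Example with an explicit mapping instead of the real environment
example_root = resolve_project_root({"PROJECT_ROOT": "/tmp/project"})
```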

Import variant phenotype data

In [2]:
phenotype_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    )
)

phenotype_data.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT,ID
0,13-110148917-C-G,rs59409892,,13,110148917,C,G,110148917_C_G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G,110148891_C_G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T,110149494_C_T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A,110149715_AAT_A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T,110151168_C_T


Import genetic variant count data for African populations

In [3]:
ih_afr = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

ih_afr.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE
0,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0,INDEL
1,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0,INDEL
2,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0,INDEL
3,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0,INDEL
4,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0,INDEL


Import genetic variant count data for global populations

In [4]:
alfa_global = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "ALFA_allele_counts_b.csv",
    )
)

alfa_global.head(5)

Unnamed: 0,variant_id,reference_allele,alternate_allele,ALT_CT_ALFA_East Asian,ALT_CT_ALFA_European,ALT_CT_ALFA_Latin American 1,ALT_CT_ALFA_Latin American 2,ALT_CT_ALFA_South Asian,REF_CT_ALFA_East Asian,REF_CT_ALFA_European,REF_CT_ALFA_Latin American 1,REF_CT_ALFA_Latin American 2,REF_CT_ALFA_South Asian
0,rs1000343,C,T,0.0,49.0,5.0,10.0,0.0,490.0,109377.0,673.0,2200.0,184.0
1,rs1000989,T,C,55.0,21489.0,123.0,1330.0,1685.0,109.0,37269.0,273.0,2052.0,3283.0
2,rs1000990,T,C,32.0,5355.0,40.0,261.0,36.0,54.0,8931.0,106.0,349.0,62.0
3,rs1005573,C,T,35.0,10693.0,209.0,1810.0,79.0,69.0,4955.0,87.0,956.0,31.0
4,rs1007311,A,G,56.0,9154.0,61.0,154.0,43.0,56.0,11242.0,85.0,456.0,55.0


Import variant effect data

In [5]:
vep_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    )
)

vep_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,GENE,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE,ID
0,13,110148882,C,CT,COL4A1,,,,,-0.437825,0.16,110148882_C_CT
1,13,110148891,C,G,COL4A1,,,,,-0.227221,0.446,110148891_C_G
2,13,110148917,C,G,COL4A1,,,,,0.269936,3.938,110148917_C_G
3,13,110148920,G,C,COL4A1,,,,,0.530972,6.825,110148920_G_C
4,13,110148959,A,G,COL4A1,,,,,1.380228,14.95,110148959_A_G


## Analysis and Visualisation

### Data selection

Select phenotype data for the variants present in the aggregated Recent African population group for analysis.

In [6]:
# Select aggregated variant count and frequency data for Recent Africans. Remove variants with an alternate allele count of 0. These variants are not present in Recent Africans.

ih_afr_subpops = ih_afr[(ih_afr["REG"] == "Recent African") & (ih_afr["IH_ALT_CTS"] > 0)]

# Add phenotype data for the variants present in the Recent African populations
ih_afr_subpops_phenotype_data = (
    ih_afr_subpops.merge(
        phenotype_data,
        how="left",
        on=["REF", "ALT", "POS"],
    )
    .drop(columns="ID_y")
    .rename(columns={"ID_x": "ID"})
)

ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0
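
A left merge keeps every allele-count row, so it can be useful to check how many rows actually found a phenotype record. The sketch below uses small toy tables (illustrative values only, not the project data) together with the `indicator` and `validate` options of `DataFrame.merge`. Note also that the join key in the cell above is `REF`, `ALT` and `POS` without a chromosome, because the allele-count table shown has no `CHROM` column; whether positions can collide across chromosomes in the real data is something to verify rather than assume.

```python
import pandas as pd

# Toy stand-ins for the allele-count and phenotype tables (illustrative values)
counts = pd.DataFrame(
    {
        "ID": ["110148891_C_G", "110149349_G_A", "110148882_C_CT"],
        "POS": [110148891, 110149349, 110148882],
        "REF": ["C", "G", "C"],
        "ALT": ["G", "A", "CT"],
    }
)
phenotypes = pd.DataFrame(
    {
        "POS": [110148891, 110149349],
        "REF": ["C", "G"],
        "ALT": ["G", "A"],
        "Clndn": [None, "Porencephalic_cyst"],
    }
)

merged = counts.merge(
    phenotypes,
    how="left",
    on=["REF", "ALT", "POS"],
    validate="many_to_one",  # raises if a key is duplicated in the phenotype table
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

n_matched = int((merged["_merge"] == "both").sum())
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

Rows flagged `left_only` are variants with no record at all in the phenotype table, which is different from a matched record whose `Clndn` value is empty.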


### How many variants have previously been associated with HIE and are present in Africans?

Create a list of variants (with rsids) associated with HIE in the genes of interest. This information was retrieved from: https://doi.org/10.1016/j.ygeno.2022.110508.

In [7]:
hie_variant_rsids = [
    "rs2067853",
    "rs1217401",
    "rs2043211",
    "rs1001179",
    "rs1800896",
    "rs1071676",
    "rs1143623",
    "rs16944",
    "rs1800795",
    "rs1801133",
    "rs1808593",
    "rs2070744",
    "rs6517135",
    "rs1799964"
]

How many HIE-associated variants with rsids are there?

In [8]:
len(hie_variant_rsids)

14

Which of these variants are present in Africans?

In [9]:
variants_with_hie_assoc = ih_afr_subpops_phenotype_data[ih_afr_subpops_phenotype_data.VAR_NAME.isin(hie_variant_rsids)]
variants_with_hie_assoc.VAR_NAME.unique()

array(['rs1071676', 'rs1217401', 'rs1801133', 'rs2070744', 'rs1808593',
       'rs1800896', 'rs1800795', 'rs2067853', 'rs2043211'], dtype=object)

How many of the variants are present in Africans?

In [10]:
variants_with_hie_assoc.VAR_NAME.nunique()

9
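
The complementary question, which of the 14 listed rsids were not found, can be answered with a set difference. The sketch below copies the two lists from the cells above. "Not found" here only means that the rsid did not match a `VAR_NAME` with a non-zero alternate allele count in the Recent African data; it does not by itself show that the variant is absent from African populations.

```python
hie_variant_rsids = [
    "rs2067853", "rs1217401", "rs2043211", "rs1001179", "rs1800896",
    "rs1071676", "rs1143623", "rs16944", "rs1800795", "rs1801133",
    "rs1808593", "rs2070744", "rs6517135", "rs1799964",
]

# rsids returned by the lookup in the African data (output of In [9])
found_in_africans = [
    "rs1071676", "rs1217401", "rs1801133", "rs2070744", "rs1808593",
    "rs1800896", "rs1800795", "rs2067853", "rs2043211",
]

# Set difference: listed rsids that were not returned by the lookup
not_found = sorted(set(hie_variant_rsids) - set(found_in_africans))
print(not_found)
```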

### How many variants have a known ClinVar disease phenotype?

In [11]:
# Split the phenotypes listed in each row of the Clndn column into a list

ih_afr_subpops_phenotype_data['Clndn'] = ih_afr_subpops_phenotype_data['Clndn'].str.split('|')

In [12]:
# Convert the phenotypes list into multiple rows

ih_afr_subpops_phenotype_explode = ih_afr_subpops_phenotype_data.explode('Clndn')

In [13]:
# Remove rows with missing or placeholder values in the Clndn column

uninformative_clndn = ['not_provided', 'not_specified', 'none_provided']
variants_with_ClinVar_phenotypes = ih_afr_subpops_phenotype_explode[
    ih_afr_subpops_phenotype_explode['Clndn'].notna()
    & ~ih_afr_subpops_phenotype_explode['Clndn'].isin(uninformative_clndn)
]

In [14]:
# Count the number of variants with disease phenotype descriptions

variants_with_ClinVar_phenotypes['ID'].nunique()

133
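
The split, explode and filter steps above can be seen end to end on a small toy table. This is a sketch with made-up variant IDs; the placeholder labels are the ones filtered in the notebook.

```python
import pandas as pd

# Labels treated as "no phenotype provided"
PLACEHOLDERS = {"not_provided", "not_specified", "none_provided"}

toy = pd.DataFrame(
    {
        "ID": ["v1", "v2", "v3", "v4"],
        "Clndn": [
            "Porencephalic_cyst|not_provided",
            None,
            "not_specified|not_provided",
            "Renal_dysplasia",
        ],
    }
)

# One row per (variant, phenotype) pair
exploded = toy.assign(Clndn=toy["Clndn"].str.split("|")).explode("Clndn")

# Keep rows that carry an informative phenotype label
informative = exploded[
    exploded["Clndn"].notna() & ~exploded["Clndn"].isin(PLACEHOLDERS)
]

# Count variants, not rows: one variant can carry several phenotypes
n_variants = informative["ID"].nunique()
```

Counting with `nunique()` on `ID` matters because exploding multiplies rows for variants with more than one phenotype.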

In [15]:
variants_with_ClinVar_phenotypes[variants_with_ClinVar_phenotypes.VAR_NAME.isin(hie_variant_rsids)].ID.nunique()

5

In [16]:
variants_with_known_phenotypes_grouped_by_gene = variants_with_ClinVar_phenotypes[['GENE','Clndn','ID']].groupby(['GENE','Clndn']).count().reset_index().sort_values(by=['GENE','ID'], ascending=True)
variants_with_known_phenotypes_grouped_by_gene.rename(columns={'ID':'Count'}, inplace=True)

variants_with_known_phenotypes_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,"Hypertension,_essential,_susceptibility_to",1
1,AGT,"Preeclampsia,_susceptibility_to",1
3,AGT,Susceptibility_to_progression_to_renal_failure...,1
2,AGT,Renal_dysplasia,22
4,AP4B1,History_of_neurodevelopmental_disorder,7
5,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",10
9,COL4A1,Porencephalic_cyst,17
8,COL4A1,Brain_small_vessel_disease_with_hemorrhage,26
7,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,38
6,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",40


### Combine information on variants associated with HIE and variants with disease phenotypes in ClinVar

In [17]:
# Concatenate information on variants with known ClinVar phenotypes and HIE associations. Drop duplicate information.

variants_with_known_phenotypes = pd.concat([variants_with_ClinVar_phenotypes, variants_with_hie_assoc]).drop_duplicates()

variants_with_known_phenotypes.VAR_NAME.nunique()

137

### How many of the variants with known disease phenotypes (ClinVar and HIE studies) are predicted to be deleterious (CADD Phred score >= 10)?

In [18]:
# Join the phenotype and cadd data

variants_phenotypes_and_cadd_phred = pd.merge(variants_with_known_phenotypes, vep_data, how='left', on=['ID','CHROM','POS','REF','ALT','GENE'])

In [19]:
# Filter for variants with cadd >=10

deleterious_variants_with_known_phenotype = variants_phenotypes_and_cadd_phred[variants_phenotypes_and_cadd_phred.CADD_PHRED_SCORE >= 10]

In [20]:
# Count the number of variants with known disease phenotypes and cadd >= 10

deleterious_variants_with_known_phenotype.ID.nunique()

41

In [21]:
# Percentage of the variants with known disease phenotypes that have cadd >= 10

(deleterious_variants_with_known_phenotype.ID.nunique()/variants_with_known_phenotypes['ID'].nunique())*100

29.927007299270077
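
As a check on the percentage, the same arithmetic can be reproduced from the two counts reported above (41 variants with CADD Phred >= 10 out of 137 variants with a known phenotype):

```python
n_deleterious = 41   # variants with a known phenotype and CADD Phred >= 10
n_known = 137        # variants with a known phenotype (ClinVar or HIE studies)

share = n_deleterious / n_known * 100
rounded = round(share, 2)
print(f"{rounded}% of variants with a known phenotype have CADD Phred >= 10")
```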

What phenotypes are represented among the variants with CADD Phred scores >= 10?

In [22]:
deleterious_variants_with_known_phenotype_grouped_by_gene = deleterious_variants_with_known_phenotype[['GENE','Clndn','ID']].groupby(['GENE','Clndn']).count().reset_index().sort_values(by=['GENE','ID'], ascending=True)
deleterious_variants_with_known_phenotype_grouped_by_gene.rename(columns={'ID':'Count'}, inplace=True)

deleterious_variants_with_known_phenotype_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,Renal_dysplasia,5
3,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive|His...",1
1,AP4B1,History_of_neurodevelopmental_disorder,5
2,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",7
6,COL4A1,Brain_small_vessel_disease_with_hemorrhage,3
5,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,5
7,COL4A1,Porencephalic_cyst,5
4,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",6
8,IL10,Inflammatory_bowel_disease,5
9,IL1B,Gastric_cancer_susceptibility_after_h._pylori_...,1


In [23]:
deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.VAR_NAME.isin(hie_variant_rsids)].ID.nunique()

3

In [42]:
deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.VAR_NAME.isin(hie_variant_rsids)].VAR_NAME.unique()

array(['rs1217401', 'rs1801133', 'rs1800795'], dtype=object)

### What are the known phenotypes of the variants with CADD Phred scores >= 10 and common frequencies (>=0.1)?

In [25]:
# Display common variants, and their associated clinical disease phenotype

common_deleterious_variants_with_known_phenotype = deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.IH_AF>=0.1][['GENE','Clndn','ID','VAR_NAME','REF','ALT','IH_AF','CADD_PHRED_SCORE']].sort_values(by=['GENE','Clndn','ID'], ascending=True)
common_deleterious_variants_with_known_phenotype

Unnamed: 0,GENE,Clndn,ID,VAR_NAME,REF,ALT,IH_AF,CADD_PHRED_SCORE
125,AP4B1,History_of_neurodevelopmental_disorder,113896329_A_G,rs1217401,A,G,0.797541,21.9
124,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113896329_A_G,rs1217401,A,G,0.797541,21.9
264,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive|His...",113896329_A_G,rs1217401,A,G,0.797541,21.9
8,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110149776_G_T,rs13260,G,T,0.266393,11.67
83,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110205548_A_G,rs677877,A,G,0.38806,10.31
116,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110307009_C_G,rs9515185,C,G,0.216172,13.89
120,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110307117_C_A,rs113651836,C,A,0.107084,14.63
7,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110149776_G_T,rs13260,G,T,0.266393,11.67
82,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110205548_A_G,rs677877,A,G,0.38806,10.31
115,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110307009_C_G,rs9515185,C,G,0.216172,13.89


In [26]:
# Count the number of variants with known phenotypes, CADD >=10 and that are common

common_deleterious_variants_with_known_phenotype.ID.nunique()

10

### How do the frequencies in Africans of the common variants with known phenotypes and CADD Phred scores >=10 compare to those in other global populations?

In [27]:
# Add ALFA global allele count data to African data in common_deleterious_variants_with_known_phenotype dataframe

common_deleterious_variants_with_known_phenotype_incl_alfa = pd.merge(common_deleterious_variants_with_known_phenotype.drop_duplicates(subset=['VAR_NAME','REF','ALT']).drop(columns=['Clndn','ID','CADD_PHRED_SCORE']), alfa_global, how='left', left_on=['VAR_NAME','REF','ALT'], right_on=['variant_id','reference_allele','alternate_allele'])
common_deleterious_variants_with_known_phenotype_incl_alfa = common_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=['variant_id','reference_allele','alternate_allele'])

In [28]:
# Calculate the alternate allele frequencies for the global populations

common_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_East Asian'] = common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_East Asian']/(common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_East Asian']+common_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_East Asian'])
common_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_South Asian'] = common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_South Asian']/(common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_South Asian']+common_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_South Asian'])
common_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_European'] = common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_European']/(common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_European']+common_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_European'])
common_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_Latin American 1'] = common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 1']/(common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 1']+common_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_Latin American 1'])
common_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_Latin American 2'] = common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 2']/(common_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 2']+common_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_Latin American 2'])

common_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=['ALT_CT_ALFA_East Asian','ALT_CT_ALFA_South Asian','ALT_CT_ALFA_European','ALT_CT_ALFA_Latin American 1','ALT_CT_ALFA_Latin American 2','REF_CT_ALFA_East Asian','REF_CT_ALFA_South Asian','REF_CT_ALFA_European','REF_CT_ALFA_Latin American 1','REF_CT_ALFA_Latin American 2'], inplace=True)
common_deleterious_variants_with_known_phenotype_incl_alfa

Unnamed: 0,GENE,VAR_NAME,REF,ALT,IH_AF,ALFA_AF_East Asian,ALFA_AF_South Asian,ALFA_AF_European,ALFA_AF_Latin American 1,ALFA_AF_Latin American 2
0,AP4B1,rs1217401,A,G,0.797541,0.057899,0.327415,0.321924,0.419217,0.192529
1,COL4A1,rs13260,G,T,0.266393,0.004016,0.039773,0.091209,0.130312,0.226508
2,COL4A1,rs677877,A,G,0.38806,0.758065,0.66,0.641808,0.613095,0.550746
3,COL4A1,rs9515185,C,G,0.216172,0.488372,0.489796,0.406679,0.493151,0.372131
4,COL4A1,rs113651836,C,A,0.107084,0.026316,0.0,0.197838,0.0,0.0
5,IL10,rs1518111,T,C,0.555738,0.295049,0.603571,0.784843,0.734914,0.661386
6,IL1B,rs1143627,G,A,0.351639,0.5057,0.415344,0.664999,0.585017,0.445177
7,IL6,rs1800795,C,G,0.99918,1.0,0.857143,0.559989,0.828767,0.819672
8,MTHFR,rs1801131,T,G,0.145902,0.214497,0.410439,0.314106,0.219092,0.192032
9,NOS3,rs1799983,T,G,0.935246,0.857143,0.333333,0.679167,0.763889,0.24
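
The per-population frequency is the alternate allele count divided by the sum of the alternate and reference counts. The five near-identical assignments above could be expressed as a loop; the sketch below is one possible refactoring, with a hypothetical helper name `add_alfa_frequencies`, shown on two rows and two populations taken from the ALFA preview earlier in the notebook.

```python
import pandas as pd


def add_alfa_frequencies(df, populations):
    """Add ALFA_AF_<population> columns and drop the raw count columns."""
    out = df.copy()
    for pop in populations:
        alt = out[f"ALT_CT_ALFA_{pop}"]
        ref = out[f"REF_CT_ALFA_{pop}"]
        # A total count of zero yields NaN rather than raising an error
        out[f"ALFA_AF_{pop}"] = alt / (alt + ref)
    count_cols = [
        f"{prefix}_CT_ALFA_{pop}"
        for prefix in ("ALT", "REF")
        for pop in populations
    ]
    return out.drop(columns=count_cols)


alfa_example = pd.DataFrame(
    {
        "variant_id": ["rs1000343", "rs1000989"],
        "ALT_CT_ALFA_East Asian": [0.0, 55.0],
        "REF_CT_ALFA_East Asian": [490.0, 109.0],
        "ALT_CT_ALFA_European": [49.0, 21489.0],
        "REF_CT_ALFA_European": [109377.0, 37269.0],
    }
)

alfa_af = add_alfa_frequencies(alfa_example, ["East Asian", "European"])
```

Returning a new frame instead of modifying the input in place makes the step safe to re-run in a notebook.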


### What are the frequencies of the common variants with known phenotypes and CADD Phred scores >=10 in each African ethnolinguistic population group?

In [29]:
# Get a list of variants to get frequencies for.

common_deleterious_variants_with_known_phenotype_list = list(common_deleterious_variants_with_known_phenotype.ID.unique())

In [30]:
# Extract ethnolinguistic population frequencies for each of these variants. 

ih_afr_subpops_common_deleterious_variants = ih_afr[(ih_afr.ID.isin(common_deleterious_variants_with_known_phenotype_list)) & ~ (ih_afr.REG == 'Recent African')]

In [31]:
# Filter out irrelevant data. Pivot data.

ih_afr_subpops_common_deleterious_variants_filtered = ih_afr_subpops_common_deleterious_variants.drop(columns=['POS','REF','ALT','IH_ALT_CTS','IH_TOTAL_CTS','IH_REF_CTS','VARIANT_TYPE','REG'])
ih_afr_subpops_common_deleterious_variants_filtered = ih_afr_subpops_common_deleterious_variants_filtered.pivot(index=['ID','VAR_NAME','GENE'], columns=['SUB_POP'], values=['IH_AF']).reset_index()

In [32]:
# Add overall African frequencies

ih_afr_subpops_common_deleterious_variants_filtered = pd.merge(left=common_deleterious_variants_with_known_phenotype_incl_alfa[['VAR_NAME','IH_AF']], right=ih_afr_subpops_common_deleterious_variants_filtered, how='left', on=['VAR_NAME'])
ih_afr_subpops_common_deleterious_variants_filtered


Unnamed: 0,VAR_NAME,IH_AF,"(ID, )","(GENE, )","(IH_AF, Bantu Kenya)","(IH_AF, Bantu South Africa)","(IH_AF, Biaka Pygmy)","(IH_AF, Esan)","(IH_AF, Luhya)","(IH_AF, Mandenka)","(IH_AF, Mandinka)","(IH_AF, Mbuti Pygmy)","(IH_AF, Mende)","(IH_AF, San)","(IH_AF, Yoruba)"
0,rs1217401,0.797541,113896329_A_G,AP4B1,0.75,0.8125,0.568182,0.815534,0.788043,0.95,0.831897,0.291667,0.777108,0.666667,0.84058
1,rs13260,0.266393,110149776_G_T,COL4A1,0.4,0.3125,0.454545,0.271845,0.320652,0.15,0.206897,0.166667,0.26506,0.0,0.271739
2,rs677877,0.38806,110205548_A_G,COL4A1,0.55,0.285714,0.318182,0.38835,0.384615,0.55,0.448276,0.041667,0.353659,0.166667,0.380597
3,rs9515185,0.216172,110307009_C_G,COL4A1,0.15,0.125,0.204545,0.169903,0.157609,0.236842,0.327586,0.458333,0.26506,0.2,0.154412
4,rs113651836,0.107084,110307117_C_A,COL4A1,0.05,0.0,0.090909,0.169903,0.097826,0.131579,0.081897,0.041667,0.066265,0.0,0.131387
5,rs1518111,0.555738,206771300_T_C,IL10,0.7,0.6875,0.681818,0.533981,0.608696,0.45,0.534483,0.75,0.512048,0.916667,0.525362
6,rs1143627,0.351639,112836810_G_A,IL1B,0.3,0.1875,0.340909,0.359223,0.277174,0.55,0.426724,0.333333,0.319277,0.166667,0.347826
7,rs1800795,0.99918,22727026_C_G,IL6,1.0,1.0,1.0,1.0,1.0,1.0,0.99569,1.0,1.0,1.0,1.0
8,rs1801131,0.145902,11794419_T_G,MTHFR,0.1,0.25,0.386364,0.135922,0.184783,0.025,0.12069,0.166667,0.138554,0.333333,0.119565
9,rs1799983,0.935246,150999023_T_G,NOS3,0.9,1.0,0.931818,0.898058,0.961957,0.95,0.922414,1.0,0.951807,0.583333,0.952899
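
The tuple-style headers in the table above, such as `(IH_AF, San)`, come from passing `values` as a list to `pivot`, which produces two-level column labels. A minimal sketch of the same long-to-wide reshaping with a scalar `values` argument, which gives flat population columns, is shown below. It uses a few frequencies copied from the table and assumes a pandas version in which `pivot` accepts a list for `index`.

```python
import pandas as pd

# Long format: one row per (variant, population)
long_af = pd.DataFrame(
    {
        "ID": ["113896329_A_G"] * 3 + ["110149776_G_T"] * 3,
        "VAR_NAME": ["rs1217401"] * 3 + ["rs13260"] * 3,
        "GENE": ["AP4B1"] * 3 + ["COL4A1"] * 3,
        "SUB_POP": ["San", "Yoruba", "Mende"] * 2,
        "IH_AF": [0.666667, 0.84058, 0.777108, 0.0, 0.271739, 0.26506],
    }
)

# Wide format: one row per variant, one column per population
wide_af = (
    long_af.pivot(index=["ID", "VAR_NAME", "GENE"], columns="SUB_POP", values="IH_AF")
    .rename_axis(columns=None)  # drop the "SUB_POP" label on the column axis
    .reset_index()
)
```

With flat column names, the later merge on `VAR_NAME` joins two frames whose columns have the same number of levels.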


### What are the known phenotypes of the variants with CADD Phred scores >= 10 and rare frequencies (<=0.01)?

In [33]:
# Display rare variants, and their associated clinical disease phenotype

rare_deleterious_variants_with_known_phenotype = deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.IH_AF<=0.01][['GENE','Clndn','ID','VAR_NAME','REF','ALT','IH_AF','CADD_PHRED_SCORE']].sort_values(by=['GENE','Clndn','ID'], ascending=True)
rare_deleterious_variants_with_known_phenotype

Unnamed: 0,GENE,Clndn,ID,VAR_NAME,REF,ALT,IH_AF,CADD_PHRED_SCORE
245,AGT,Renal_dysplasia,230703157_G_A,rs143479528,G,A,0.004098,24.5
246,AGT,Renal_dysplasia,230703274_G_A,rs61751077,G,A,0.005738,23.5
247,AGT,Renal_dysplasia,230704308_A_G,rs61731499,A,G,0.006557,11.45
251,AGT,Renal_dysplasia,230706171_C_T,rs139685563,C,T,0.001639,22.9
126,AP4B1,History_of_neurodevelopmental_disorder,113897925_C_T,rs145803736,C,T,0.00082,25.9
129,AP4B1,History_of_neurodevelopmental_disorder,113900051_A_T,rs149335605,A,T,0.00082,22.0
131,AP4B1,History_of_neurodevelopmental_disorder,113900120_C_A,rs111785152,C,A,0.008197,20.7
123,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113895442_A_C,rs148748734,A,C,0.001639,12.93
128,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113900051_A_T,rs149335605,A,T,0.00082,22.0
130,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113900120_C_A,rs111785152,C,A,0.008197,20.7


In [34]:
# Count the number of variants with known phenotypes, CADD >=10 and that are rare

rare_deleterious_variants_with_known_phenotype.ID.nunique()

23
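
The notebook uses two frequency cut-offs: `IH_AF >= 0.1` for common variants and `IH_AF <= 0.01` for rare variants. Variants between the two thresholds fall into neither group. The sketch below makes the binning explicit; the function name and the "intermediate" label are assumptions introduced here for illustration.

```python
COMMON_MIN_AF = 0.1   # threshold used for common variants in this notebook
RARE_MAX_AF = 0.01    # threshold used for rare variants in this notebook


def classify_frequency(af):
    """Bin an alternate allele frequency using the notebook's two cut-offs."""
    if af >= COMMON_MIN_AF:
        return "common"
    if af <= RARE_MAX_AF:
        return "rare"
    return "intermediate"  # hypothetical label for 0.01 < af < 0.1


# Frequencies taken from the tables above, plus one value between the cut-offs
labels = {af: classify_frequency(af) for af in (0.797541, 0.00082, 0.05)}
```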

### How do the frequencies in Africans of the rare variants with known phenotypes and CADD Phred scores >=10 compare to those in other global populations?

In [35]:
# Add ALFA global allele count data to African data in rare_deleterious_variants_with_known_phenotype dataframe

rare_deleterious_variants_with_known_phenotype_incl_alfa = pd.merge(rare_deleterious_variants_with_known_phenotype.drop_duplicates(subset=['VAR_NAME','REF','ALT']).drop(columns=['Clndn','ID','CADD_PHRED_SCORE']), alfa_global, how='left', left_on=['VAR_NAME','REF','ALT'], right_on=['variant_id','reference_allele','alternate_allele'])
rare_deleterious_variants_with_known_phenotype_incl_alfa = rare_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=['variant_id','reference_allele','alternate_allele'])

In [36]:
# Calculate the alternate allele frequencies for the global populations

rare_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_East Asian'] = rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_East Asian']/(rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_East Asian']+rare_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_East Asian'])
rare_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_South Asian'] = rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_South Asian']/(rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_South Asian']+rare_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_South Asian'])
rare_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_European'] = rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_European']/(rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_European']+rare_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_European'])
rare_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_Latin American 1'] = rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 1']/(rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 1']+rare_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_Latin American 1'])
rare_deleterious_variants_with_known_phenotype_incl_alfa['ALFA_AF_Latin American 2'] = rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 2']/(rare_deleterious_variants_with_known_phenotype_incl_alfa['ALT_CT_ALFA_Latin American 2']+rare_deleterious_variants_with_known_phenotype_incl_alfa['REF_CT_ALFA_Latin American 2'])

rare_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=['ALT_CT_ALFA_East Asian','ALT_CT_ALFA_South Asian','ALT_CT_ALFA_European','ALT_CT_ALFA_Latin American 1','ALT_CT_ALFA_Latin American 2','REF_CT_ALFA_East Asian','REF_CT_ALFA_South Asian','REF_CT_ALFA_European','REF_CT_ALFA_Latin American 1','REF_CT_ALFA_Latin American 2'], inplace=True)
rare_deleterious_variants_with_known_phenotype_incl_alfa

Unnamed: 0,GENE,VAR_NAME,REF,ALT,IH_AF,ALFA_AF_East Asian,ALFA_AF_South Asian,ALFA_AF_European,ALFA_AF_Latin American 1,ALFA_AF_Latin American 2
0,AGT,rs143479528,G,A,0.004098,0.0,0.0,6e-06,0.0,0.0
1,AGT,rs61751077,G,A,0.005738,0.0,0.0,1.3e-05,0.0,0.001053
2,AGT,rs61731499,A,G,0.006557,0.0,0.0,2.6e-05,0.004926,0.0
3,AGT,rs139685563,C,T,0.001639,0.0,0.0,0.000509,0.001256,0.0
4,AP4B1,rs145803736,C,T,0.00082,0.0,0.0,3.3e-05,0.0,0.0
5,AP4B1,rs149335605,A,T,0.00082,0.0,0.0,0.0,0.0,0.0
6,AP4B1,rs111785152,C,A,0.008197,0.0,0.0,1e-05,0.003953,0.0
7,AP4B1,rs148748734,A,C,0.001639,0.0,0.0,0.0,0.0,0.0
8,AP4B1,rs143769705,T,C,0.00082,0.0,0.0,0.0,0.0,0.0
9,COL4A1,chr13:110149715A-AAT,A,AAT,0.000825,,,,,


### What are the frequencies of the rare variants with known phenotypes and CADD Phred scores >=10 in each African ethnolinguistic population group?

In [37]:
# Get a list of variants to get frequencies for.

rare_deleterious_variants_with_known_phenotype_list = list(rare_deleterious_variants_with_known_phenotype.ID.unique())

In [38]:
# Extract ethnolinguistic population frequencies for each of these variants. 

ih_afr_subpops_rare_deleterious_variants = ih_afr[(ih_afr.ID.isin(rare_deleterious_variants_with_known_phenotype_list)) & ~ (ih_afr.REG == 'Recent African')]

In [39]:
# Filter out irrelevant data. Pivot data.

ih_afr_subpops_rare_deleterious_variants_filtered = ih_afr_subpops_rare_deleterious_variants.drop(columns=['POS','REF','ALT','IH_ALT_CTS','IH_TOTAL_CTS','IH_REF_CTS','VARIANT_TYPE','REG'])
ih_afr_subpops_rare_deleterious_variants_filtered = ih_afr_subpops_rare_deleterious_variants_filtered.pivot(index=['ID','VAR_NAME','GENE'], columns=['SUB_POP'], values=['IH_AF']).reset_index()

In [40]:
# Add overall African frequencies

ih_afr_subpops_rare_deleterious_variants_filtered = pd.merge(left=rare_deleterious_variants_with_known_phenotype_incl_alfa[['VAR_NAME','IH_AF']], right=ih_afr_subpops_rare_deleterious_variants_filtered, how='left', on=['VAR_NAME'])
ih_afr_subpops_rare_deleterious_variants_filtered


Unnamed: 0,VAR_NAME,IH_AF,"(ID, )","(GENE, )","(IH_AF, Bantu Kenya)","(IH_AF, Bantu South Africa)","(IH_AF, Biaka Pygmy)","(IH_AF, Esan)","(IH_AF, Luhya)","(IH_AF, Mandenka)","(IH_AF, Mandinka)","(IH_AF, Mbuti Pygmy)","(IH_AF, Mende)","(IH_AF, San)","(IH_AF, Yoruba)"
0,rs143479528,0.004098,230703157_G_A,AGT,0.0,0.0,0.0,0.014563,0.0,0.0,0.0,0.0,0.012048,0.0,0.0
1,rs61751077,0.005738,230703274_G_A,AGT,0.0,0.0,0.0,0.0,0.0,0.0,0.008621,0.0,0.024096,0.0,0.003623
2,rs61731499,0.006557,230704308_A_G,AGT,0.05,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.083333,0.007246
3,rs139685563,0.001639,230706171_C_T,AGT,0.0,0.0,0.0,0.0,0.01087,0.0,0.0,0.0,0.0,0.0,0.0
4,rs145803736,0.00082,113897925_C_T,AP4B1,0.0,0.0,0.0,0.0,0.005435,0.0,0.0,0.0,0.0,0.0,0.0
5,rs149335605,0.00082,113900051_A_T,AP4B1,0.0,0.0,0.0,0.0,0.005435,0.0,0.0,0.0,0.0,0.0,0.0
6,rs111785152,0.008197,113900120_C_A,AP4B1,0.0,0.0,0.0,0.0,0.005435,0.0,0.025862,0.0,0.018072,0.0,0.0
7,rs148748734,0.001639,113895442_A_C,AP4B1,0.0,0.0,0.0,0.004854,0.0,0.0,0.0,0.0,0.0,0.0,0.003623
8,rs143769705,0.00082,113901851_T_C,AP4B1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0
9,chr13:110149715A-AAT,0.000825,110149715_A_AAT,COL4A1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003623
