# Variant disease phenotypes

This analysis aimed to answer the following research question: **How many of the variants identified in African populations have known disease associations, and what are these associations?**

To achieve this, the following steps were performed:

1. Genetic Variant Phenotype Data Retrieval and Preparation: Known disease phenotypes for the variants were retrieved from [FAVOR v2.0](https://favor.genohub.org/). The retrieved data were processed and prepared as described in the [Notebooks\Data_preparation\6-Variant_phenotype_associations.ipynb](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Notebooks/Data_preparation/6-Variant_phenotype_associations.ipynb) Jupyter notebook. The processed data can be found [here](https://github.com/MeganHolborn/Genetic_data_analysis/blob/main/Data/Processed/Variant_disease_phenotypes.csv).
2. Analysis and Visualisation:
    * To be completed...

## Imports

Notebook setup

In [124]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

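PROJECT_ROOT = os.getenv("PROJECT_ROOT")
# Fail early with a clear message if the .env file does not define PROJECT_ROOT
if PROJECT_ROOT is None:
    raise RuntimeError("PROJECT_ROOT is not set; define it in the .env file.")
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)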

import numpy as np
import pandas as pd
import seaborn as sns
import upsetplot
from matplotlib import pyplot as plt
import Utils.constants as constants
import Utils.functions as functions

Import variant phenotype data

In [125]:
phenotype_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_disease_phenotypes.csv",
    )
)

phenotype_data.head(5)

Unnamed: 0,VariantVcf,Rsid,Clndn,CHROM,POS,REF,ALT,ID
0,13-110148917-C-G,rs59409892,,13,110148917,C,G,110148917_C_G
1,13-110148891-C-G,rs552586867,,13,110148891,C,G,110148891_C_G
2,13-110149494-C-T,rs552877576,,13,110149494,C,T,110149494_C_T
3,13-110149715-AAT-A,rs886049952,,13,110149715,AAT,A,110149715_AAT_A
4,13-110151168-C-T,rs557686466,,13,110151168,C,T,110151168_C_T


Import genetic variant count data for African populations

In [126]:
ih_afr = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

ih_afr.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE
0,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0,INDEL
1,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0,INDEL
2,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0,INDEL
3,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0,INDEL
4,110148882_C_CT,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0,INDEL


Import genetic variant count data for global populations

In [127]:
alfa_global = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "ALFA_allele_counts_b.csv",
    )
)

alfa_global.head(5)

Unnamed: 0,variant_id,reference_allele,alternate_allele,ALT_CT_ALFA_East Asian,ALT_CT_ALFA_European,ALT_CT_ALFA_Latin American 1,ALT_CT_ALFA_Latin American 2,ALT_CT_ALFA_South Asian,REF_CT_ALFA_East Asian,REF_CT_ALFA_European,REF_CT_ALFA_Latin American 1,REF_CT_ALFA_Latin American 2,REF_CT_ALFA_South Asian
0,rs1000343,C,T,0.0,49.0,5.0,10.0,0.0,490.0,109377.0,673.0,2200.0,184.0
1,rs1000989,T,C,55.0,21489.0,123.0,1330.0,1685.0,109.0,37269.0,273.0,2052.0,3283.0
2,rs1000990,T,C,32.0,5355.0,40.0,261.0,36.0,54.0,8931.0,106.0,349.0,62.0
3,rs1005573,C,T,35.0,10693.0,209.0,1810.0,79.0,69.0,4955.0,87.0,956.0,31.0
4,rs1007311,A,G,56.0,9154.0,61.0,154.0,43.0,56.0,11242.0,85.0,456.0,55.0


Import variant effect data

In [128]:
vep_data = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Variant_effects.csv",
    )
)

vep_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,GENE,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE,ID
0,13,110148882,C,CT,COL4A1,,,,,-0.437825,0.16,110148882_C_CT
1,13,110148891,C,G,COL4A1,,,,,-0.227221,0.446,110148891_C_G
2,13,110148917,C,G,COL4A1,,,,,0.269936,3.938,110148917_C_G
3,13,110148920,G,C,COL4A1,,,,,0.530972,6.825,110148920_G_C
4,13,110148959,A,G,COL4A1,,,,,1.380228,14.95,110148959_A_G


## Analysis and Visualisation

### Data selection

Select the variants present in the aggregated Recent African population and add their known disease phenotype data for analysis.

In [129]:
# Select aggregated variant count and frequency data for Recent Africans. Remove variants with an alternate allele count of 0. These variants are not present in Recent Africans.

ih_afr_subpops = ih_afr[(ih_afr["REG"] == "Recent African") & (ih_afr["IH_ALT_CTS"] > 0)]

# Add phenotype data for the variants that are present in Recent Africans.
# The left join keeps variants without a phenotype record (Clndn is then missing).
ih_afr_subpops_phenotype_data = (
    ih_afr_subpops.merge(
        phenotype_data,
        how="left",
        on=["REF", "ALT", "POS"],
    )
    .drop(columns="ID_y")
    .rename(columns={"ID_x": "ID"})
)

ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0
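
Because the join above is a left join, variants without a match in the phenotype table are kept with a missing `Clndn` value. The following is a minimal sketch, on toy data rather than the project files, of how the `validate` and `indicator` options of `DataFrame.merge` could be used to check such a join. The frames `counts` and `phenotypes` and all their values are illustrative assumptions, and whether the phenotype table really has one row per variant has not been verified here.

```python
import pandas as pd

# Toy stand-ins for the count and phenotype tables (illustrative values only)
counts = pd.DataFrame(
    {
        "POS": [100, 200, 300],
        "REF": ["C", "G", "A"],
        "ALT": ["G", "A", "AAT"],
        "IH_ALT_CTS": [1, 5, 2],
    }
)
phenotypes = pd.DataFrame(
    {
        "POS": [100, 200],
        "REF": ["C", "G"],
        "ALT": ["G", "A"],
        "Clndn": [None, "Disease_A|Disease_B"],
    }
)

merged = counts.merge(
    phenotypes,
    how="left",
    on=["REF", "ALT", "POS"],
    validate="many_to_one",  # raises MergeError if a variant is duplicated in phenotypes
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Variants that had no record in the phenotype table
unmatched = merged.loc[merged["_merge"] == "left_only", ["POS", "REF", "ALT"]]
print(merged["_merge"].value_counts())
print(unmatched)
```

If `validate` raises an error on the real data, the phenotype table contains repeated variants and the join would multiply rows, which would inflate later counts.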


How many variants have known disease phenotypes?

In [130]:
# Split the phenotypes listed in each row of the Clndn column into a list

ih_afr_subpops_phenotype_data['Clndn'] = ih_afr_subpops_phenotype_data['Clndn'].str.split('|')
ih_afr_subpops_phenotype_data.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,[Brain_small_vessel_disease_1_with_or_without_...,13.0
4,110149494_C_T,rs552877576,110149494,C,T,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149494-C-T,rs552877576,,13.0


In [131]:
# Convert the phenotypes list into multiple rows

ih_afr_subpops_phenotype_explode = ih_afr_subpops_phenotype_data.explode('Clndn')
ih_afr_subpops_phenotype_explode.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
0,110148891_C_G,rs552586867,110148891,C,G,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110148891-C-G,rs552586867,,13.0
1,110148917_C_G,rs59409892,110148917,C,G,COL4A1,,119,1220,1101,Recent African,0.097541,SNP,13-110148917-C-G,rs59409892,,13.0
2,110149176_T_A,rs546124548,110149176,T,A,COL4A1,,1,1220,1219,Recent African,0.00082,SNP,13-110149176-T-A,rs546124548,,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0
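
The split and explode steps above can be illustrated on a small example. This sketch uses a hypothetical three-row frame (`toy`) to show two behaviours worth keeping in mind: rows with a missing `Clndn` survive the explode as a single row with a missing value, and exploded rows repeat the original index label.

```python
import pandas as pd

# Hypothetical data: one multi-phenotype variant, one missing, one placeholder
toy = pd.DataFrame(
    {
        "ID": ["v1", "v2", "v3"],
        "Clndn": ["Disease_A|Disease_B", None, "not_provided"],
    }
)

# Split on the literal pipe character; missing values stay missing
toy["Clndn"] = toy["Clndn"].str.split("|")

# One row per (variant, phenotype); the original index label is repeated
exploded = toy.explode("Clndn")
print(exploded)

# Optional: give every row its own index label
exploded_reset = exploded.reset_index(drop=True)
print(exploded_reset.index.tolist())
```

The repeated index labels explain why index 3 appears several times in the output above. Resetting the index is optional and only matters if later steps select rows by label.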


In [132]:
# Remove rows where Clndn is missing or holds a placeholder value.
# After the split and explode above, each row holds a single phenotype label.

uninformative_phenotypes = ['not_provided', 'not_specified', 'none_provided']
variants_with_known_phenotypes = ih_afr_subpops_phenotype_explode[
    ih_afr_subpops_phenotype_explode['Clndn'].notna()
    & ~ih_afr_subpops_phenotype_explode['Clndn'].isin(uninformative_phenotypes)
]
variants_with_known_phenotypes.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE,VariantVcf,Rsid,Clndn,CHROM
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0
3,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,Recent African,0.004098,SNP,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_with_hemorrhage,13.0
7,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,Recent African,0.000825,INDEL,13-110149715-A-AAT,,Porencephalic_cyst,13.0
7,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,Recent African,0.000825,INDEL,13-110149715-A-AAT,,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0


In [133]:
# Count the number of variants with disease phenotype descriptions

variants_with_known_phenotypes['ID'].nunique()

133

What disease phenotypes are present for each gene?

In [134]:
# Count the variant-phenotype rows for each gene and phenotype
variants_with_known_phenotypes_grouped_by_gene = (
    variants_with_known_phenotypes[['GENE', 'Clndn', 'ID']]
    .groupby(['GENE', 'Clndn'])
    .count()
    .reset_index()
    .sort_values(by=['GENE', 'ID'], ascending=True)
    .rename(columns={'ID': 'Count'})
)

variants_with_known_phenotypes_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,"Hypertension,_essential,_susceptibility_to",1
1,AGT,"Preeclampsia,_susceptibility_to",1
3,AGT,Susceptibility_to_progression_to_renal_failure...,1
2,AGT,Renal_dysplasia,22
4,AP4B1,History_of_neurodevelopmental_disorder,7
5,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",10
9,COL4A1,Porencephalic_cyst,17
8,COL4A1,Brain_small_vessel_disease_with_hemorrhage,26
7,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,38
6,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",40
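
Note that `.count()` counts rows, so the `Count` column above gives the number of variant-phenotype rows per gene and phenotype. If a variant could appear more than once for the same phenotype, counting distinct IDs would be safer. The sketch below shows that alternative on hypothetical data; whether such duplicates occur in the real table is not checked here.

```python
import pandas as pd

# Hypothetical data: v1 is linked to two phenotypes of gene G1
toy = pd.DataFrame(
    {
        "GENE": ["G1", "G1", "G1", "G2"],
        "Clndn": ["Disease_A", "Disease_A", "Disease_B", "Disease_C"],
        "ID": ["v1", "v2", "v1", "v3"],
    }
)

# Named aggregation: count distinct variant IDs per gene and phenotype
phenotype_counts = (
    toy.groupby(["GENE", "Clndn"])
    .agg(Count=("ID", "nunique"))
    .reset_index()
    .sort_values(by=["GENE", "Count"], ascending=True)
)
print(phenotype_counts)
```

Because one variant can carry several phenotypes, the counts in such a table sum to more than the number of unique variants.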


How many of the variants with known disease phenotypes are predicted to be deleterious (CADD Phred score >= 10)?

In [135]:
# Join the phenotype and cadd data

variants_phenotypes_and_cadd_phred = pd.merge(variants_with_known_phenotypes, vep_data, how='left', on=['ID','CHROM','POS','REF','ALT','GENE'])
variants_phenotypes_and_cadd_phred.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,...,VariantVcf,Rsid,Clndn,CHROM,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE
0,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_1_with_or_without_o...,13.0,,,,,0.257986,3.798
1,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0,,,,,0.257986,3.798
2,110149349_G_A,rs139916479,110149349,G,A,COL4A1,,5,1220,1215,...,13-110149349-G-A,rs139916479,Brain_small_vessel_disease_with_hemorrhage,13.0,,,,,0.257986,3.798
3,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,Porencephalic_cyst,13.0,,,,,1.172573,13.42
4,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0,,,,,1.172573,13.42


In [136]:
# Filter for variants with cadd >=10

deleterious_variants_with_known_phenotype = variants_phenotypes_and_cadd_phred[variants_phenotypes_and_cadd_phred.CADD_PHRED_SCORE >= 10]
deleterious_variants_with_known_phenotype.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,...,VariantVcf,Rsid,Clndn,CHROM,POLYPHEN_PRED,POLYPHEN_SCORE,SIFT_PRED,SIFT_SCORE,CADD_RAW_SCORE,CADD_PHRED_SCORE
3,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,Porencephalic_cyst,13.0,,,,,1.172573,13.42
4,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",13.0,,,,,1.172573,13.42
5,110149715_A_AAT,chr13:110149715A-AAT,110149715,A,AAT,COL4A1,,1,1212,1211,...,13-110149715-A-AAT,,Brain_small_vessel_disease_with_hemorrhage,13.0,,,,,1.172573,13.42
6,110149776_G_T,rs13260,110149776,G,T,COL4A1,,325,1220,895,...,13-110149776-G-T,rs13260,Porencephalic_cyst,13.0,,,,,1.00375,11.67
7,110149776_G_T,rs13260,110149776,G,T,COL4A1,,325,1220,895,...,13-110149776-G-T,rs13260,Brain_small_vessel_disease_1_with_or_without_o...,13.0,,,,,1.00375,11.67
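
For context on the threshold used above: CADD Phred-scaled scores rank a variant against all possible substitutions on a log10 scale, so a score of 10 corresponds to the top 10% of predicted deleteriousness and a score of 20 to the top 1%. The helper below is a small sketch of that relationship; the function name `cadd_phred_to_top_fraction` is hypothetical and not part of this project's `Utils` module.

```python
def cadd_phred_to_top_fraction(phred_score):
    """Return the top-ranked fraction of variants implied by a CADD Phred score."""
    return 10 ** (-phred_score / 10)


# Compare a few commonly quoted thresholds
for threshold in (10, 20, 30):
    fraction = cadd_phred_to_top_fraction(threshold)
    print(f"CADD Phred >= {threshold}: top {fraction:.1%} of variants")
```

A cutoff of 10 is therefore fairly permissive, which is worth remembering when interpreting the counts that follow.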


In [137]:
# Count the number of variants with known disease phenotypes and cadd >= 10

deleterious_variants_with_known_phenotype.ID.nunique()

41

In [138]:
# Percentage of the variants with known disease phenotypes that also have a CADD Phred score >= 10

(deleterious_variants_with_known_phenotype.ID.nunique()/variants_with_known_phenotypes['ID'].nunique())*100

30.82706766917293

What phenotypes are represented among the variants with CADD Phred scores >= 10?

In [139]:
# Count the variant-phenotype rows for each gene and phenotype
deleterious_variants_with_known_phenotype_grouped_by_gene = (
    deleterious_variants_with_known_phenotype[['GENE', 'Clndn', 'ID']]
    .groupby(['GENE', 'Clndn'])
    .count()
    .reset_index()
    .sort_values(by=['GENE', 'ID'], ascending=True)
    .rename(columns={'ID': 'Count'})
)

deleterious_variants_with_known_phenotype_grouped_by_gene

Unnamed: 0,GENE,Clndn,Count
0,AGT,Renal_dysplasia,5
1,AP4B1,History_of_neurodevelopmental_disorder,5
2,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",7
5,COL4A1,Brain_small_vessel_disease_with_hemorrhage,3
4,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,5
6,COL4A1,Porencephalic_cyst,5
3,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",6
7,IL10,Inflammatory_bowel_disease,5
8,IL1B,Gastric_cancer_susceptibility_after_h._pylori_...,1
9,IL6,"Crohn_disease-associated_growth_failure,_susce...",1


What are the known phenotypes of the variants that have CADD Phred scores >= 10 and are common (allele frequency >= 0.1)?

In [140]:
# Display common variants, and their associated clinical disease phenotype

common_deleterious_variants_with_known_phenotype = deleterious_variants_with_known_phenotype[deleterious_variants_with_known_phenotype.IH_AF >= 0.1][['GENE','Clndn','ID','VAR_NAME','REF','ALT','IH_AF','CADD_PHRED_SCORE']].sort_values(by=['GENE','Clndn','ID'], ascending=True)
common_deleterious_variants_with_known_phenotype

Unnamed: 0,GENE,Clndn,ID,VAR_NAME,REF,ALT,IH_AF,CADD_PHRED_SCORE
125,AP4B1,History_of_neurodevelopmental_disorder,113896329_A_G,rs1217401,A,G,0.797541,21.9
124,AP4B1,"Spastic_paraplegia_47,_autosomal_recessive",113896329_A_G,rs1217401,A,G,0.797541,21.9
8,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110149776_G_T,rs13260,G,T,0.266393,11.67
83,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110205548_A_G,rs677877,A,G,0.38806,10.31
116,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110307009_C_G,rs9515185,C,G,0.216172,13.89
120,COL4A1,"Angiopathy,_hereditary,_with_nephropathy,_aneu...",110307117_C_A,rs113651836,C,A,0.107084,14.63
7,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110149776_G_T,rs13260,G,T,0.266393,11.67
82,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110205548_A_G,rs677877,A,G,0.38806,10.31
115,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110307009_C_G,rs9515185,C,G,0.216172,13.89
119,COL4A1,Brain_small_vessel_disease_1_with_or_without_o...,110307117_C_A,rs113651836,C,A,0.107084,14.63


In [141]:
# Count the number of variants with known phenotypes, CADD >=10 and that are common

common_deleterious_variants_with_known_phenotype.ID.nunique()

10

How do the frequencies in Africans of the common variants with known phenotypes and CADD Phred scores >= 10 compare with those in other global populations?

In [142]:
# Add ALFA global allele count data to African data in common_deleterious_variants_with_known_phenotype dataframe

common_deleterious_variants_with_known_phenotype_incl_alfa = pd.merge(common_deleterious_variants_with_known_phenotype.drop_duplicates(subset=['VAR_NAME','REF','ALT']).drop(columns=['Clndn','ID','CADD_PHRED_SCORE']), alfa_global, how='left', left_on=['VAR_NAME','REF','ALT'], right_on=['variant_id','reference_allele','alternate_allele'])
common_deleterious_variants_with_known_phenotype_incl_alfa = common_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=['variant_id','reference_allele','alternate_allele'])
common_deleterious_variants_with_known_phenotype_incl_alfa

Unnamed: 0,GENE,VAR_NAME,REF,ALT,IH_AF,ALT_CT_ALFA_East Asian,ALT_CT_ALFA_European,ALT_CT_ALFA_Latin American 1,ALT_CT_ALFA_Latin American 2,ALT_CT_ALFA_South Asian,REF_CT_ALFA_East Asian,REF_CT_ALFA_European,REF_CT_ALFA_Latin American 1,REF_CT_ALFA_Latin American 2,REF_CT_ALFA_South Asian
0,AP4B1,rs1217401,A,G,0.797541,291.0,108308.0,685.0,1809.0,1715.0,4735.0,228132.0,949.0,7587.0,3523.0
1,COL4A1,rs13260,G,T,0.266393,2.0,11567.0,92.0,1374.0,7.0,496.0,115251.0,614.0,4692.0,169.0
2,COL4A1,rs677877,A,G,0.38806,94.0,20620.0,103.0,369.0,66.0,30.0,11508.0,65.0,301.0,34.0
3,COL4A1,rs9515185,C,G,0.216172,42.0,5809.0,72.0,227.0,48.0,44.0,8475.0,74.0,383.0,50.0
4,COL4A1,rs113651836,C,A,0.107084,1.0,2105.0,0.0,0.0,0.0,37.0,8535.0,104.0,232.0,42.0
5,IL10,rs1518111,T,C,0.555738,1323.0,121558.0,341.0,668.0,169.0,3161.0,33324.0,123.0,342.0,111.0
6,IL1B,rs1143627,G,A,0.351639,2484.0,166327.0,695.0,3914.0,157.0,2428.0,83789.0,493.0,4878.0,221.0
7,IL6,rs1800795,C,G,0.99918,86.0,8000.0,121.0,500.0,84.0,0.0,6286.0,25.0,110.0,14.0
8,MTHFR,rs1801131,T,G,0.145902,145.0,69682.0,280.0,964.0,2076.0,531.0,152160.0,998.0,4056.0,2982.0
9,NOS3,rs1799983,T,G,0.935246,24.0,27547.0,275.0,12.0,2.0,4.0,13013.0,85.0,38.0,4.0
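
The next cell converts these ALFA counts into alternate allele frequencies, computed as the alternate allele count divided by the sum of the alternate and reference counts. As a worked check, the sketch below applies that formula to the rs13260 counts copied from the table above. The function name `alt_allele_frequency` is hypothetical, and returning NaN when no alleles were observed is an assumption about how that case should be handled.

```python
def alt_allele_frequency(alt_count, ref_count):
    """Alternate allele frequency from alternate and reference allele counts."""
    total = alt_count + ref_count
    if total == 0:
        # No alleles observed: the frequency is undefined
        return float("nan")
    return alt_count / total


# rs13260 counts copied from the ALFA table above
rs13260_east_asian = alt_allele_frequency(2.0, 496.0)
rs13260_european = alt_allele_frequency(11567.0, 115251.0)

print(round(rs13260_east_asian, 6))  # 0.004016
print(round(rs13260_european, 6))
```

Frequencies based on small totals, such as the 498 East Asian alleles here, are less precise than those based on the much larger European sample.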


In [143]:
# Calculate the alternate allele frequencies for the global populations

alfa_populations = ['East Asian', 'South Asian', 'European', 'Latin American 1', 'Latin American 2']

# Alternate allele frequency = alternate count / (alternate count + reference count)
for population in alfa_populations:
    alt_counts = common_deleterious_variants_with_known_phenotype_incl_alfa[f'ALT_CT_ALFA_{population}']
    ref_counts = common_deleterious_variants_with_known_phenotype_incl_alfa[f'REF_CT_ALFA_{population}']
    common_deleterious_variants_with_known_phenotype_incl_alfa[f'ALFA_AF_{population}'] = alt_counts / (alt_counts + ref_counts)

# Drop the raw count columns now that the frequencies have been calculated
count_columns = [f'{allele}_CT_ALFA_{population}' for allele in ('ALT', 'REF') for population in alfa_populations]
common_deleterious_variants_with_known_phenotype_incl_alfa.drop(columns=count_columns, inplace=True)
common_deleterious_variants_with_known_phenotype_incl_alfa

Unnamed: 0,GENE,VAR_NAME,REF,ALT,IH_AF,ALFA_AF_East Asian,ALFA_AF_South Asian,ALFA_AF_European,ALFA_AF_Latin American 1,ALFA_AF_Latin American 2
0,AP4B1,rs1217401,A,G,0.797541,0.057899,0.327415,0.321924,0.419217,0.192529
1,COL4A1,rs13260,G,T,0.266393,0.004016,0.039773,0.091209,0.130312,0.226508
2,COL4A1,rs677877,A,G,0.38806,0.758065,0.66,0.641808,0.613095,0.550746
3,COL4A1,rs9515185,C,G,0.216172,0.488372,0.489796,0.406679,0.493151,0.372131
4,COL4A1,rs113651836,C,A,0.107084,0.026316,0.0,0.197838,0.0,0.0
5,IL10,rs1518111,T,C,0.555738,0.295049,0.603571,0.784843,0.734914,0.661386
6,IL1B,rs1143627,G,A,0.351639,0.5057,0.415344,0.664999,0.585017,0.445177
7,IL6,rs1800795,C,G,0.99918,1.0,0.857143,0.559989,0.828767,0.819672
8,MTHFR,rs1801131,T,G,0.145902,0.214497,0.410439,0.314106,0.219092,0.192032
9,NOS3,rs1799983,T,G,0.935246,0.857143,0.333333,0.679167,0.763889,0.24
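
One way to make the comparison in the table above easier to read is to reshape it into long format and compute the difference between the African and each global frequency. The following is a sketch under stated assumptions: it uses two rows and two populations copied from the table, and the column names `POPULATION`, `ALFA_AF` and `AF_DIFF` are hypothetical.

```python
import pandas as pd

# Two rows and two populations copied from the frequency table above
freqs = pd.DataFrame(
    {
        "VAR_NAME": ["rs1217401", "rs13260"],
        "IH_AF": [0.797541, 0.266393],
        "ALFA_AF_East Asian": [0.057899, 0.004016],
        "ALFA_AF_European": [0.321924, 0.091209],
    }
)

alfa_cols = [col for col in freqs.columns if col.startswith("ALFA_AF_")]

# Long format: one row per (variant, population)
long_freqs = freqs.melt(
    id_vars=["VAR_NAME", "IH_AF"],
    value_vars=alfa_cols,
    var_name="POPULATION",
    value_name="ALFA_AF",
)
long_freqs["POPULATION"] = long_freqs["POPULATION"].str.replace(
    "ALFA_AF_", "", regex=False
)

# Positive values: the alternate allele is more frequent in the Recent African data
long_freqs["AF_DIFF"] = long_freqs["IH_AF"] - long_freqs["ALFA_AF"]
print(long_freqs)
```

The long format is also the shape that seaborn plotting functions generally expect, should the comparison be visualised later.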
