## Map PGS ID to general disease category
This script loads metadata from the PGS Catalog and the EFO (Experimental Factor Ontology) database, then maps each polygenic score (PGS) ID to a more general disease category (e.g., melanoma -> cancer, obesity -> metabolic disorder).

After loading the data, the first step is to build a dictionary (key -> value) with PGS IDs as keys and the most general EFO category names as values.

Next, the script loads your PGSwas or PRSwas results (p-values or betas), parses the PGS ID from each entry in the PRS_name column, and adds a new column to your data called "Trait_category" that holds the more general disease category name.
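
The two steps described above can be sketched on toy data. This is only a minimal illustration: the ids, category labels, and the `toy_results` table are made-up stand-ins, not values read from the real files.

```python
import pandas as pd

# Toy lookup tables (illustrative values, not taken from the real files)
pgs_to_efo = {"PGS000001": "EFO_0000305", "PGS000002": "EFO_0004214"}
efo_to_category = {"EFO_0000305": "Cancer", "EFO_0004214": "Cardiovascular disease"}

# Step 1: chain the two dictionaries into a direct PGS -> category lookup
pgs_to_category = {pgs: efo_to_category[efo] for pgs, efo in pgs_to_efo.items()}

# Step 2: take the text before the first underscore as the PGS id and look it up
toy_results = pd.DataFrame({"PRS_name": ["PGS000001_example", "PGS000002_example"]})
toy_results["Trait_category"] = [
    pgs_to_category[name.split("_")[0]] for name in toy_results["PRS_name"]
]
print(toy_results)
```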

In [1]:
import pandas as pd

In [2]:
#mapping tables needed to translate polygenic score (PGS) ids into experimental factor ontology (EFO) trait categories
pgs_metadata_df = pd.read_csv("/blue/raquel.dias/raquel.dias/UKB/PGS_catalog/pgs_all_metadata_scores.csv")
trait_category_df = pd.read_csv("/blue/raquel.dias/raquel.dias/UKB/PGS_catalog/pgs_traits_data.csv")

#clean additional text following the EFO ids
trait_category_df['Trait Identifier (ontology ID)'] = trait_category_df['Trait Identifier (ontology ID)'].str.replace(r':.*', '', regex=True)

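#clean traits that match multiple categories (labels run together, e.g. "...diseaseOther..."), keeping the first category only
#[a-z] rather than \w before the capital letter, so an acronym such as "PGS" inside a label is not truncated
trait_category_df['Trait Category'] = trait_category_df['Trait Category'].str.replace(r"([a-z])([A-Z]).*", r"\1", regex=True)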

#clean PGS ids that match to multiple EFO ids, keeping first EFO id only
pgs_metadata_df['Mapped Trait(s) (EFO ID)'] = pgs_metadata_df['Mapped Trait(s) (EFO ID)'].str.replace(r'\|.*', '', regex=True)
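
To see what each cleaning pattern does, the short sketch below applies them to made-up strings. The input values are hypothetical and only shaped like the real columns. The category pattern here requires a lowercase letter immediately before the capital letter, so an acronym such as "PGS" is left intact.

```python
import pandas as pd

# Hypothetical raw values, invented to exercise each pattern
efo_raw = pd.Series(["EFO_0004214:extra text", "HP_0002027"])
category_raw = pd.Series(["Cardiovascular diseaseOther measurement", "Sex-specific PGS"])
efo_multi = pd.Series(["EFO_0000305|EFO_1000649", "EFO_0006335"])

# Drop everything from the first colon onward
efo_clean = efo_raw.str.replace(r":.*", "", regex=True)
# Cut at the first lowercase-to-uppercase boundary, keeping the first label
category_clean = category_raw.str.replace(r"([a-z])([A-Z]).*", r"\1", regex=True)
# Drop everything from the first pipe onward, keeping the first EFO id
efo_first = efo_multi.str.replace(r"\|.*", "", regex=True)

print(efo_clean.tolist())
print(category_clean.tolist())
print(efo_first.tolist())
```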

In [3]:
pgs_metadata_df

Unnamed: 0,Polygenic Score (PGS) ID,PGS Name,Reported Trait,Mapped Trait(s) (EFO label),Mapped Trait(s) (EFO ID),PGS Development Method,PGS Development Details/Relevant Parameters,Original Genome Build,Number of Variants,Number of Interaction Terms,...,PGS Publication (PGP) ID,Publication (PMID),Publication (doi),Score and results match the original publication,Ancestry Distribution (%) - Source of Variant Associations (GWAS),Ancestry Distribution (%) - Score Development/Training,Ancestry Distribution (%) - PGS Evaluation,FTP link,Release Date,License/Terms of Use
0,PGS000001,PRS77_BC,Breast Cancer,breast carcinoma,EFO_0000305,SNPs passing genome-wide significance,P<5x10-8,NR,77,0,...,PGP000001,25855707.0,10.1093/jnci/djv036,True,European:100,,European:100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2019-10-14,PGS obtained from the Catalog should be cited ...
1,PGS000002,PRS77_ERpos,ER-positive Breast Cancer,estrogen-receptor positive breast cancer,EFO_1000649,SNPs passing genome-wide significance,P<5x10-8,NR,77,0,...,PGP000001,25855707.0,10.1093/jnci/djv036,True,European:100,,European:100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2019-10-14,PGS obtained from the Catalog should be cited ...
2,PGS000003,PRS77_ERneg,ER-negative Breast Cancer,estrogen-receptor negative breast cancer,EFO_1000650,SNPs passing genome-wide significance,P<5x10-8,NR,77,0,...,PGP000001,25855707.0,10.1093/jnci/djv036,True,European:100,,European:100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2019-10-14,PGS obtained from the Catalog should be cited ...
3,PGS000004,PRS313_BC,Breast Cancer,breast carcinoma,EFO_0000305,Hard-Thresholding Stepwise Forward Regression,p < 10^-5,GRCh37,313,0,...,PGP000002,30554720.0,10.1016/j.ajhg.2018.11.002,True,European:100,European:100,European:76.1|Multi-ancestry (including Europe...,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2019-10-14,PGS obtained from the Catalog should be cited ...
4,PGS000005,PRS313_ERpos,ER-positive Breast Cancer,estrogen-receptor positive breast cancer,EFO_1000649,Hard-Thresholding Stepwise Forward Regression,p < 10^-5,GRCh37,313,0,...,PGP000002,30554720.0,10.1016/j.ajhg.2018.11.002,True,European:100,European:100,European:90|Not Reported:5|Multi-ancestry (exc...,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2019-10-14,PGS obtained from the Catalog should be cited ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3802,PGS003967,SBP_EUR,Systolic blood pressure,systolic blood pressure,EFO_0006335,"PRSice2, LDPred2-Auto",R2 = 0.1,NR,1108368,0,...,PGP000510,37268629.0,10.1038/s41467-023-38990-9,True,European:81|East Asian:11.3|African:5.2|Hispan...,European:49.5|African:23.6|Hispanic or Latin A...,,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2023-10-17,PGS obtained from the Catalog should be cited ...
3803,PGS003968,SBP_weightedPRSsum,Systolic blood pressure,systolic blood pressure,EFO_0006335,"PRSice2, LDPred2-Auto, PRS-CSx","PRSsum = UKBB+ICBP (PRS1), MVP, (PRS2), BBJ (P...",NR,1267240,0,...,PGP000510,37268629.0,10.1038/s41467-023-38990-9,True,European:79|East Asian:11|African:7.6|Hispanic...,European:62.6|African:16.8|Hispanic or Latin A...,Multi-ancestry (including European):100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2023-10-17,PGS obtained from the Catalog should be cited ...
3804,PGS003969,PRS39_HF,Heart failure,heart failure,EFO_0003144,Genome-wide significant SNPs,,NR,39,0,...,PGP000511,37429843.0,10.1038/s41467-023-39253-3,True,European:100,,European:100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2023-10-17,PGS obtained from the Catalog should be cited ...
3805,PGS003970,HARE_META_SBP_PRS_MALE,Systolic Blood Pressure,systolic blood pressure,EFO_0006335,PRS-CS,Hapmap only variants with EUR as LD reference,GRCh38,1115602,0,...,PGP000512,37807951.0,10.1161/circgen.123.004259,True,European:94.1|Multi-ancestry (excluding Europe...,Multi-ancestry (including European):100,Multi-ancestry (including European):100,https://ftp.ebi.ac.uk/pub/databases/spot/pgs/s...,2023-10-17,PGS obtained from the Catalog should be cited ...


In [4]:
trait_category_df

Unnamed: 0,Trait (ontology term label),Trait Identifier (ontology ID),Trait Category,Number of Related PGS
0,Abdominal Aortic Aneurysm,EFO_0004214,Cardiovascular disease,3
1,Abdominal pain,HP_0002027,Other trait,3
2,Abnormal circulating lipid concentration,HP_0003119,Other trait,2
3,Abnormal EKG,HP_0003115,Other trait,1
4,ACPA-negative rheumatoid arthritis,EFO_0009460,Immune system disorder,1
...,...,...,...,...
613,wellbeing measurement,EFO_0007869,Other measurement,4
614,white matter hyperintensity measurement,EFO_0005665,Other measurement,1
615,white matter volume measurement,EFO_0008320,Other measurement,4
616,whole body water mass,EFO_0009805,Other measurement,5


In [5]:
PGS_EFO = dict(zip(pgs_metadata_df['Polygenic Score (PGS) ID'], pgs_metadata_df['Mapped Trait(s) (EFO ID)']))
EFO_CAT = dict(zip(trait_category_df['Trait Identifier (ontology ID)'], trait_category_df['Trait Category']))

In [6]:
#PGS_CAT translates all PGS ids into disease categories
#PGS ids whose EFO id has no entry in the trait table are skipped instead of raising a KeyError
PGS_CAT = {pgs: EFO_CAT[efo] for pgs, efo in PGS_EFO.items() if efo in EFO_CAT}
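
Before relying on the combined dictionary, it can help to list the PGS ids whose EFO id has no category. The sketch below is one possible check on toy dictionaries. `find_unmapped` is a hypothetical helper, not part of the original notebook, and the ids are invented.

```python
def find_unmapped(pgs_to_efo, efo_to_category):
    """Return the PGS ids whose EFO id is missing from the category table."""
    return sorted(
        pgs for pgs, efo in pgs_to_efo.items() if efo not in efo_to_category
    )

# Toy inputs (illustrative only): the second EFO id has no category entry
toy_pgs_efo = {"PGS000001": "EFO_0000305", "PGS000002": "EFO_9999999"}
toy_efo_cat = {"EFO_0000305": "Cancer"}

unmapped = find_unmapped(toy_pgs_efo, toy_efo_cat)
print(unmapped)
```

With the real data, the same call would be `find_unmapped(PGS_EFO, EFO_CAT)`.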

In [7]:
#PGS_CAT

In [11]:
#replace this by your pvalue table
pvalue_table = pd.read_csv("results/hochberg_qvalues_21068_166900.csv")
pvalue_table

Unnamed: 0,PRS_name,intercept,PRS_q.value,X22009.0.1,X22009.0.2,X22009.0.3,X22009.0.4,X22009.0.5,X31.0.0,X21001.0.0,...,X54_11012,X54_11013,X54_11014,X54_11016,X54_11017,X54_11018,X54_11020,X54_11021,X54_11022,X54_11023
0,AD_sumstats_ad_190513(29|Tlab_NonUKB),1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,Aragam_KG_PRS_241_Coronary_artery_disease_beta...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,Aragam_KG_PRS_241_Coronary_artery_disease_join...,,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,
3,cad_190822_weights(168|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,
4,CAD_GRS_300_JACC_2019_proxy(300|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,Tlab_TC_GLGC_PRS_74_TC_Nat_Genet_2013(73|Tlab_...,,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,
2940,Tlab_TG_GLGC_PRS_40_TG_Nat_Genet_2013(40|Tlab_...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2941,Vanhoye_X_PRS_165_LDL_Translational_Resarch_20...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2942,Vujkovic_PRS_558_T2D_Nat_Genet(550|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,


In [12]:
#convert each PGS id into its respective category name
category_names = []
for PRS_name in pvalue_table['PRS_name']:
    #the text before the first underscore is either a PGS id or a study/trait prefix
    PGS_id = PRS_name.split('_')[0]
    if PGS_id in PGS_CAT:
        category_names.append(PGS_CAT[PGS_id])
    #scores that are not in the PGS catalog are assigned by hand from their name prefix
    elif PGS_id in ['cad', 'CAD', 'Aragam', 'Tlab', 'Nielsen', 'Vanhoye']:
        category_names.append('Cardiovascular disease')
    elif PGS_id in ['Vujkovic', 'Mahajan']:
        category_names.append('Digestive system disorder')
    elif PGS_id in ['AD', 'Lam']:
        category_names.append('Neurological disorder')
    #anything else falls back to a generic category
    else:
        category_names.append('Other measurement')

#add the category names to a column named "Trait_category"
pvalue_table['Trait_category'] = category_names
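
The same assignment can also be written without an explicit loop. The sketch below is an alternative formulation on toy data, not the notebook's own method. `toy_pgs_cat` and `toy_prefix_cat` are made-up stand-ins for `PGS_CAT` and the hand-written prefix rules.

```python
import pandas as pd

# Toy stand-ins for PGS_CAT and the manual prefix rules (illustrative values)
toy_pgs_cat = {"PGS000001": "Cancer"}
toy_prefix_cat = {"CAD": "Cardiovascular disease", "AD": "Neurological disorder"}

# Merge both lookups; catalog ids win if a key appears in both
lookup = {**toy_prefix_cat, **toy_pgs_cat}

toy_table = pd.DataFrame(
    {"PRS_name": ["PGS000001_x", "CAD_GRS_300", "AD_sumstats", "Unknown_score"]}
)

# Text before the first underscore, mapped through the lookup;
# names without a match become NaN and then receive the fallback label
prefix = toy_table["PRS_name"].str.split("_").str[0]
toy_table["Trait_category"] = prefix.map(lookup).fillna("Other measurement")

# Quick sanity check of how many scores landed in each category
print(toy_table["Trait_category"].value_counts())
```

Counting the categories afterwards is a cheap way to spot an unexpectedly large fallback group.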

In [13]:
pvalue_table

Unnamed: 0,PRS_name,intercept,PRS_q.value,X22009.0.1,X22009.0.2,X22009.0.3,X22009.0.4,X22009.0.5,X31.0.0,X21001.0.0,...,X54_11013,X54_11014,X54_11016,X54_11017,X54_11018,X54_11020,X54_11021,X54_11022,X54_11023,Trait_category
0,AD_sumstats_ad_190513(29|Tlab_NonUKB),1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Neurological disorder
1,Aragam_KG_PRS_241_Coronary_artery_disease_beta...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Cardiovascular disease
2,Aragam_KG_PRS_241_Coronary_artery_disease_join...,,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,Cardiovascular disease
3,cad_190822_weights(168|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,Cardiovascular disease
4,CAD_GRS_300_JACC_2019_proxy(300|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,Cardiovascular disease
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,Tlab_TC_GLGC_PRS_74_TC_Nat_Genet_2013(73|Tlab_...,,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,Cardiovascular disease
2940,Tlab_TG_GLGC_PRS_40_TG_Nat_Genet_2013(40|Tlab_...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Cardiovascular disease
2941,Vanhoye_X_PRS_165_LDL_Translational_Resarch_20...,1.0,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Cardiovascular disease
2942,Vujkovic_PRS_558_T2D_Nat_Genet(550|Tlab_NonUKB),,0.999954,0.996844,0.112562,0.171108,0.983634,0.376903,0.01566,6.774339e-11,...,,,,,,,,,,Digestive system disorder


In [14]:
#save the pvalue table with the new column added
#change the file path and name below according to your own pvalue result file
pvalue_table.to_csv("results/hochberg_qvalues_21068_166900_trait_cat.csv", index=False)