Pipeline to calculate PRS based on a list of sumstats. Weights are calculated with PRScs: https://github.com/getian107/PRScs
The sumstats used for R10 are the following:
filename | pheno | publication |
---|---|---|
AD_sumstats_Jansenetal.txt.gz | Alzheimers_Jansen_2019 | https://www.nature.com/articles/s41588-018-0311-9 |
adhd_eur_jun2017.gz | ADHD | https://www.biorxiv.org/content/early/2017/06/03/145581 |
alsMetaSummaryStats_march21st2018.tab.gz | ALS_Nicolas_2018 | https://www.cell.com/neuron/references/S0896-6273(18)30148-X |
Bipolar_vs_control_PGC_Cell_2018_formatted.txt.gz | bipolar_disorder | PGC;Cell;2018;https://doi.org/10.1016/j.cell.2018.05.046 |
cad.add.160614.website.txt.gz | coronary_artery_disease | https://www.nature.com/articles/ng.3396 |
ckqny.scz2snpres.gz | Schizophrenia_PGC_2014 | https://www.nature.com/articles/nature13595 |
daner_PGC_SCZ52_0513a.resultfiles_PGC_SCZ52_0513.sh2_nofin.gz | Schizophrenia_PGC_2014_noFinns | PGC_schizophrenia_GWAS_Ripke_et_al._excluding_Finnish |
dpw_excludingFinnishStudies_4finngen.txt.gz | alcohol_consumption | https://www.nature.com/articles/s41398-019-0676-2 |
EAGLE_AD_GWAS_results_2015.txt.gz | atopic_dermatitis | https://www.nature.com/articles/ng.3424 |
Educational_Attainment_excl23andme_Lee_2018_NatGen_formatted.txt.gz | educational_attainment | Lee;Nat_Gen;2018;https://doi.org/10.1038/s41588-018-0147-3 |
EUR.CD.gwas_info03_filtered.assoc.gz | Crohns_disease | https://www.nature.com/articles/ng.3359 |
EUR.IBD.gwas_info03_filtered.assoc.gz | Inflammatory_bowel_disease | https://www.nature.com/articles/ng.3359 |
EUR.UC.gwas_info03_filtered.assoc.gz | Ulcerative_colitis | https://www.nature.com/articles/ng.3359 |
focal_epilepsy_METAL_4finngen.txt.gz | Focal_epilepsy | https://www.nature.com/articles/s41467-018-07524-z |
gabriel_asthma_meta-analysis_36studies_format_repository_NEJM.txt.gz | Asthma_Moffatt_2010 | https://www.nejm.org/doi/full/10.1056/nejmoa0906312 |
generalised_epilepsy_METAL_4finngen.txt.gz | Generalized_epilepsy | https://www.nature.com/articles/s41467-018-07524-z |
GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | Height_giant_2014 | https://www.nature.com/articles/ng.3097 |
GWAS_CP_all.txt.gz | Cognitive_performance | https://www.nature.com/articles/s41588-018-0147-3 |
Hb_gwas_summary_fromNealeLab.tsv.gz | Haemoglobin_concentration | UKBB |
HbA1c_METAL_European.txt.gz | HbA1c | http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002383 |
HDL_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz | HDL-C | https://www.nature.com/articles/ng.3300 |
IGAP_stage_1.txt.gz | Alzheimers_Lambert_2013 | https://www.nature.com/articles/ng.2802 |
iPSYCH-PGC_ASD_Nov2017.gz | ASD | https://www.biorxiv.org/content/early/2017/11/27/224774 |
kunkle_etal_stage1.txt.gz | Alzheimers_Kunkle_2019 | https://www.niagads.org/datasets/ng00075 |
LDL_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz | LDL-C | https://www.nature.com/articles/ng.3300 |
Mahajan.NatGenet2018b.T2D.European.txt.gz | T2D_Mahajan_2018 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6287706/ |
MDD2018_ex23andMe.19fields.gz | Major_depressive_disorder | https://www.nature.com/articles/s41588-018-0090-3 |
meta_v3_onco_euro_overall_ChrAll_1_release.txt.gz | prostate_cancer | https://www.nature.com/articles/s41588-018-0142-8 |
METAANALYSIS_DIAGRAM_SE1.txt.gz | T2D_Scott_2017 | http://diabetes.diabetesjournals.org/content/66/11/2888 |
metastroke.all.chr.bp.gz | Ischaemic_stroke | https://www.thelancet.com/journals/laneur/article/PIIS1474-4422(12)70234-X/abstract |
MTAG_CP.to10K.txt.gz | Cognitive_Performance | https://www.nature.com/articles/s41588-018-0147-3 |
MTAG_EA.to10K.txt.gz | Educational_attainment | https://www.nature.com/articles/s41588-018-0147-3 |
pgc.bip.full.2012-04.txt.gz | Bipolar_disorder_PGC_2011 | https://www.nature.com/articles/ng.943 |
pgc.ed.freeze1.summarystatistics.July2017.txt.gz | Anorexia | Missing |
RA_GWASmeta_European_v2.txt.gz | rheumatoid_arthritis | https://www.nature.com/articles/nature12873 |
SavageJansen_2018_intelligence_metaanalysis_formatted.txt.gz | intelligence | Savage;Nat_Gen-2018;https://doi.org/10.1038/s41588-018-0152-6 |
Saxena_fullUKBB_Longsleep_summary_stats_formatted.txt.gz | habitual_sleep_duration | Dashti;Natcomm-2019;https://doi.org/10.1038/s41467-019-08917-4 |
Shrine_30804560_FEV1_meta-analysis.txt.gz | FEV1 | https://www.nature.com/articles/s41588-018-0321-7 |
Shrine_30804560_FEV1_to_FVC_RATIO_meta-analysis.txt.gz | FEV1/FVC | https://www.nature.com/articles/s41588-018-0321-7 |
Shrine_30804560_FVC_meta-analysis.txt.gz | FVC | https://www.nature.com/articles/s41588-018-0321-7 |
Shrine_30804560_PEF_meta-analysis.txt.gz | Peak_expiratory_flow | https://www.nature.com/articles/s41588-018-0321-7 |
SNP_gwas_mc_merge_nogc.tbl.uniq.gz | BMI | https://www.nature.com/articles/nature14177 |
SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt.gz | Posttraumatic_stress_disorder | https://www.nature.com/articles/mp201777 |
sumstats_neuroticism_ctg_format_formatted.txt.gz | neuroticism | Nagel;NatGen2018-https://doi.org/10.1038/s41588-018-0151-7 |
TC_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz | Total_cholesterol | https://www.nature.com/articles/ng.3300 |
TG_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz | Triglycerides | https://www.nature.com/articles/ng.3300 |
UKBB_gwas_Neale_Chronotype_formatted.txt.gz | chronotype_neale | UKBB |
UKBB_gwas_Neale_pulserate_formatted.txt.gz | heart_rate_neale | UKBB |
UKBB_gwas_Neale_SleepDuration_formatted.txt.gz | sleepduration_neale | UKBB |
UKBB_gwas_Neale_SleeplessnessInsomnia_formatted.txt.gz | insomnia_neale | UKBB |
UKBB_HYPO.txt.gz | hypothyroidism | UKBB |
UKB-ICBPmeta750k_DBPsummaryResults.txt.gz | diastolic_blood_pressure | UKBB_ICBP |
UKB-ICBPmeta750k_PPsummaryResults.edited.txt.gz | pulse_pressure | UKBB_ICBP |
UKB-ICBPmeta750k_SBPsummaryResults.txt.gz | systolic_blood_pressure | UKBB_ICBP |
GSCAN_AgeofInitiation.txt.gz | smoking_age_of_initiation | https://www.nature.com/articles/s41588-018-0307-5 |
GSCAN_CigarettesPerDay.txt.gz | smoking_cigarettes_per_day | https://www.nature.com/articles/s41588-018-0307-5 |
GSCAN_DrinksPerWeek.txt.gz | drinks_per_week | https://www.nature.com/articles/s41588-018-0307-5 |
GSCAN_SmokingInitiation.txt.gz | smoking_initiation | https://www.nature.com/articles/s41588-018-0307-5 |
PGC_UKB_depression_genome-wide.txt.gz | depression | https://www.nature.com/articles/s41588-018-0090-3 |
PGC_UKB_23andMe_depression_10000.txt.gz | depression_23andme | https://www.nature.com/articles/s41588-018-0090-3 |
PGC3_SCZ_wave3_public_without_frequencies.v1.tsv.gz | schizophrenia_pgc_2022 | https://www.nature.com/articles/s41586-022-04434-5 |
PGC3_SCZ_wave3_public_without_frequencies.clumped.v1.tsv.gz | schizophrenia_pgc_2022_clumped | https://www.nature.com/articles/s41586-022-04434-5 |
daner_PGC_BIP32b_mds7a_0416a.gz | bipolar_disorder | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956732/ |
CIMBA_BRCA1_BCAC_TN_meta_summary_level_statistics.txt.gz | breast_cancer | https://www.nature.com/articles/s41588-020-0609-2 |
AF_HRC_GWAS_ALLv11.txt.gz | atrial_fibrillation | https://www.nature.com/articles/s41588-018-0133-9 |
lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz | lifespan | https://elifesciences.org/articles/39856 |
RISK_GWAS_MA_UKB+23andMe+replication.txt.gz | risk_tolerance | https://www.nature.com/articles/s41588-018-0309-3 |
chronic_pain-bgen.stats.gz | multisite_chronic_pain | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6592570/ |
Saxena_fullUKBB_Insomnia_summary_stats.txt.gz | insomnia | https://www.nature.com/articles/s41588-019-0361-7 |
continuous-LDLC-both_sexes-medadj_irnt.tsv.gz | ldl_adj_irnt | https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409 |
biomarkers-30750-both_sexes-irnt.tsv.gz | hba1c_inrt | https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409 |
biomarkers-30870-both_sexes-irnt.tsv.gz | triglycerides_irnt | https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409 |
MEGASTROKE.1.AS.TRANS.out.gz | any_stroke | https://www.nature.com/articles/s41588-018-0058-3 |
UKBB.asthma-2.assoc.gz | asthma_han_2020 | https://www.nature.com/articles/s41467-020-15649-3 |
ICP_FG_DC_EG_DBDS_202012.gz | intrahepatic_cholestatis_of_pregnancy | FG |
Meta-analysis_Wood_et_al+UKBiobank_2018_height.txt.gz | height_giant_2014_ukb | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488973/ |
Meta-analysis_Locke_et_al+UKBiobank_2018_bmi.txt.gz | height_giant_2014_ukb | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488973/ |
GSEM.GWAS.EXTERNALIZING.SHARE.v20191014.txt.gz | externalizing | https://www.biorxiv.org/content/10.1101/2020.10.16.342501v1 |
MTAG_glaucoma_four_traits_summary_statistics.txt.gz | glaucoma_Craig_2020 | https://www.nature.com/articles/s41588-019-0556-y |
phecode-696.4-both_sexes.tsv.gz | Psoriasis | UKBB |
phecode-250.1-both_sexes.tsv.gz | T1D | UKBB |
BCX2_EOS_Trans_GWAMA.out.gz | EOS_Trans | https://pubmed.ncbi.nlm.nih.gov/32888493/ |
biomarkers-30600-both_sexes-irnt.tsv.gz | Albumin_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30610-both_sexes-irnt.tsv.gz | Alkaline_phosphatase_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30620-both_sexes-irnt.tsv.gz | Alanine_aminotransferase_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30630-both_sexes-irnt.tsv.gz | Apolipoprotein_A_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30640-both_sexes-irnt.tsv.gz | Apolipoprotein_B_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30650-both_sexes-irnt.tsv.gz | Aspartate_aminotransferase_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30660-both_sexes-irnt.tsv.gz | Direct_bilirubin_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30670-both_sexes-irnt.tsv.gz | Urea_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30680-both_sexes-irnt.tsv.gz | Calcium_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30690-both_sexes-irnt.tsv.gz | Cholesterol_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30700-both_sexes-irnt.tsv.gz | Creatinine_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30710-both_sexes-irnt.tsv.gz | C-reactive_protein_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30720-both_sexes-irnt.tsv.gz | Cystatin_C_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30730-both_sexes-irnt.tsv.gz | Gamma_glutamyltransferase_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30740-both_sexes-irnt.tsv.gz | Glucose_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30760-both_sexes-irnt.tsv.gz | HDL_cholesterol_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30770-both_sexes-irnt.tsv.gz | IGF-1_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30780-both_sexes-irnt.tsv.gz | LDL_direct_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30790-both_sexes-irnt.tsv.gz | Lipoprotein_A_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30800-both_sexes-irnt.tsv.gz | Oestradiol_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30810-both_sexes-irnt.tsv.gz | Phosphate_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30820-both_sexes-irnt.tsv.gz | Rheumatoid_factor_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30830-both_sexes-irnt.tsv.gz | SHBG_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30840-both_sexes-irnt.tsv.gz | Total_bilirubin_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30850-both_sexes-irnt.tsv.gz | Testosterone_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30860-both_sexes-irnt.tsv.gz | Total_protein_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30880-both_sexes-irnt.tsv.gz | Urate_panukbb | https://pan.ukbb.broadinstitute.org/ |
biomarkers-30890-both_sexes-irnt.tsv.gz | Vitamin_D_panukbb | https://pan.ukbb.broadinstitute.org/ |
ukb.allasthma.upload.final.assoc_v2.txt.gz | asthma_Zhu_2019 | https://erj.ersjournals.com/content/54/6/1901507-0.long |
michailidoubreastcancerall.txt.gz | Breast_cancer | https://www.nature.com/articles/nature24284 |
20171017_MW_eGFR_overall_EA_nstud42.dbgap_v2.txt.gz | estimated_glomerular_filtration_rate | https://www.nature.com/articles/s41588-019-0407-x |
CRC_LP1_VQ_CFR1_CFR2_COIN_FIN_ONCO_SCOT_AUSTRIA_SP1_SOCCS3_CROATIA_BIOB_LBC_DACHS_meta_uk10k_1kG_info8_allchr.txt.gz | Colorectal_cancer | https://www.nature.com/articles/s41467-019-09775-w |
urate_chr1_22_LQ_IQ06_mac10_all_741_rsid_v2.txt.gz | Urate | https://www.nature.com/articles/s41588-019-0504-x |
Tachmazidou_30664745_HIPOA.txt.gz | hip_osteoarthritis | https://www.nature.com/articles/s41588-018-0327-1 |
Tachmazidou_30664745_KNEEOA.txt.gz | knee_osteoarthritis | https://www.nature.com/articles/s41588-018-0327-1 |
BMD_v3_SumStats.txt.gz | Bone_mineral_density | https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0200785 |
anno.CLEANED.MVP.EUR.VTE.results.EQC.tbl.gz | Venous_thromboembolism | https://www.nature.com/articles/s41588-019-0519-3 |
GCST90011770_buildGRCh37.tsv.gz | Glaucoma_Gharahkhani_2021 | https://www.nature.com/articles/s41467-020-20851-4 |
Hysi_et_al_Refractive_error_NatGenet_2020_beta_se.txt.gz | Refractive_error_and_myopia | https://www.nature.com/articles/s41588-020-0599-0 |
ALS_sumstats_EUR_only.txt.gz | ALS_Rheenen_2021 | https://www.nature.com/articles/s41588-021-00973-1 |
discovery_metav3.0.meta.gz | Multiple_Sclerosis | https://pubmed.ncbi.nlm.nih.gov/31604244/ |
HERMES_Jan2019_HeartFailure_summary_data.txt.gz | heart_failure | https://cvd.hugeamp.org/dinspector.html?dataset=GWAS_HERMES_eu |
CKD_overall_ALL_JW_20180223_nstud30.dbgap.txt.gz | chronic_kidney_disease | https://ckdgen.imbi.uni-freiburg.de/ |
netherlands_total.fastGWA.gz | netherlands_total_cost | https://pubmed.ncbi.nlm.nih.gov/34213412/ |
netherlands_gp.fastGWA.gz | netherlands_gp_cost | https://pubmed.ncbi.nlm.nih.gov/34213412/ |
netherlands_hospital.fastGWA.gz | netherlands_hospital_cost | https://pubmed.ncbi.nlm.nih.gov/34213412/ |
netherlands_pharmacy.fastGWA.gz | netherlands_pharmacy_cost | https://pubmed.ncbi.nlm.nih.gov/34213412/ |
GCST90027158_buildGRCh38.tsv.gz | alzheimer | https://www.nature.com/articles/s41588-022-01024-z,WARNING:FINNGEN_SAMPLES_INCLUDED |
This step generates a mapping to/from rsid/chrompos based on data available at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.
rsid_map.py produces:
- finngen.rsid.map.tsv (rsid--> chrompos)
rs10 7_92754574
rs1000000 12_126406434
rs1000000219 13_95689463
- finngen.variants.tsv (chrompos --> ref/alt)
10_100000235 C T
10_100000979 T C
10_100001839 CAA C
The first file is used throughout the computation to go to/from rsid notation. The second is used to filter out variants that do not have the right alleles.
Also, if an rsid list is provided (e.g. hm3 rsids), it returns the subset of variants in the original bim file that match those rsids:
- hm3.snplist
chr10_100000235_C_T
chr10_100002628_A_C
chr10_100004827_A_C
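As a rough illustration of the mapping step, the sketch below parses a few VCF-style lines and builds the two lookups described above. It is not the code in rsid_map.py: the function name `build_maps`, the in-memory input and the alleles in the example rows are assumptions made for the example (only the rsids and positions are taken from the output shown above).

```python
def build_maps(vcf_lines):
    """Return (rsid -> chrom_pos, chrom_pos -> [(ref, alt), ...]) from VCF text lines."""
    rsid_map = {}
    variants = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta lines and the column header
        chrom, pos, rsid, ref, alts = line.rstrip("\n").split("\t")[:5]
        chrompos = f"{chrom}_{pos}"
        rsid_map[rsid] = chrompos
        # a VCF record can list several comma-separated ALT alleles
        for alt in alts.split(","):
            variants.setdefault(chrompos, []).append((ref, alt))
    return rsid_map, variants


# Placeholder alleles; rsids and positions match the example output above.
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "7\t92754574\trs10\tA\tC,G\t.\t.\t.",
    "12\t126406434\trs1000000\tG\tA\t.\t.\t.",
]
rsid_map, variants = build_maps(example)
```

Keeping every ref/alt pair per position is what makes the later allele filter possible: a variant is kept only if its alleles appear in the list for its chrom_pos.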
PRScs automatically does some allele matching:
- the reference genome (1kg) only contains non-ambiguous variants:
cat snpinfo_1kg_hm3 | sed -E 1d | cut -f 4,5 | awk '{print $1$2}' | sort | uniq
AC
AG
CA
CT
GA
GT
TC
TG
Also, in the parsing phase it checks for the ref/alt order and fixes the beta accordingly.
vld_snp = set(zip(vld_dict['SNP'], vld_dict['A1'], vld_dict['A2']))
ref_snp = set(zip(ref_dict['SNP'], ref_dict['A1'], ref_dict['A2'])) | set(zip(ref_dict['SNP'], ref_dict['A2'], ref_dict['A1'])) | \
set(zip(ref_dict['SNP'], [mapping[aa] for aa in ref_dict['A1']], [mapping[aa] for aa in ref_dict['A2']])) | \
set(zip(ref_dict['SNP'], [mapping[aa] for aa in ref_dict['A2']], [mapping[aa] for aa in ref_dict['A1']]))
sst_snp = set(zip(sst_dict['SNP'], sst_dict['A1'], sst_dict['A2'])) | set(zip(sst_dict['SNP'], sst_dict['A2'], sst_dict['A1'])) | \
set(zip(sst_dict['SNP'], [mapping[aa] for aa in sst_dict['A1'] if aa in ATGC], [mapping[aa] for aa in sst_dict['A2'] if aa in ATGC])) | \
set(zip(sst_dict['SNP'], [mapping[aa] for aa in sst_dict['A2'] if aa in ATGC], [mapping[aa] for aa in sst_dict['A1'] if aa in ATGC]))
comm_snp = ref_snp & vld_snp & sst_snp
with open(sst_file) as ff:
if (snp, a1, a2) in comm_snp:
...
beta_std = sp.sign(beta)*abs(norm.ppf(p/2.0))/n_sqrt
elif (snp, a2, a1) in comm_snp:
beta_std = -1*sp.sign(beta)*abs(norm.ppf(p/2.0))/n_sqrt
...
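The orientation logic in the excerpt can be reduced to a small stand-alone sketch. This is a simplification under stated assumptions, not PRScs code: the helper names `standardized_beta` and `oriented_beta` are hypothetical, alleles are assumed to be single bases, and the standard library's `NormalDist` is swapped in for `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def standardized_beta(beta, p, n):
    """sign(beta) * |z(p/2)| / sqrt(n), mirroring the formula in the excerpt."""
    sign = (beta > 0) - (beta < 0)
    return sign * abs(NormalDist().inv_cdf(p / 2.0)) / sqrt(n)


def oriented_beta(sst_alleles, ref_alleles, beta, p, n):
    """Orient the sumstats effect to the reference (a1, a2) order.

    Returns None when the alleles cannot be matched, even after a strand flip.
    Assumes non-ambiguous SNPs (no A/T or C/G pairs), as in the 1kg reference.
    """
    r1, r2 = ref_alleles
    same = {(r1, r2), (COMPLEMENT[r1], COMPLEMENT[r2])}
    swapped = {(a2, a1) for a1, a2 in same}
    b = standardized_beta(beta, p, n)
    if sst_alleles in same:
        return b
    if sst_alleles in swapped:
        return -b
    return None


# Same variant reported with swapped alleles: the sign of the effect flips.
direct = oriented_beta(("G", "A"), ("G", "A"), -0.01, 0.05, 10000)
flipped = oriented_beta(("A", "G"), ("G", "A"), -0.01, 0.05, 10000)
```

Because ambiguous pairs are excluded from the reference, the `same` and `swapped` sets never overlap, so the sign is always well defined.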
The final weights are printed based on the a1/a2 order of the reference panel (i.e. the EUR 1kg panel in this case).
PRScs does check for strand flip.
Our solution is therefore the following.
We build an rsid to chrom pos mapping from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz. This allows us to move back and forth between rsid and chrompos notations, and therefore to merge summary stats with different formats.
Then:
- in the munging phase we split the summary stats entries based on whether variants are identified by rsid or by some form of chrom/pos notation
- the rsid file is filtered for rsids present in finngen. chrom and pos information are updated to finngen data
- the chrompos file is updated to have chrom and pos added (if provided, else extracted from variant id) and then lifted to build 38
- the two files are then merged to a FINNGEN chrom_pos_ref_alt notation, making sure that the variant exists in finngen data (checking for strand flip as well).
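The first of these steps, splitting entries by identifier type, could look roughly like the sketch below. The regular expressions and the function name `split_by_id` are assumptions for illustration; the actual munging script may recognise more (or fewer) id formats.

```python
import re

# Assumed id formats: "rs123" for rsids, and "[chr]CHROM:POS..." or
# "[chr]CHROM_POS..." for chrom/pos style ids.
RSID = re.compile(r"^rs\d+$")
CHROMPOS = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?)[:_](\d+)")


def split_by_id(ids):
    """Split variant ids into rsids, (id, chrom, pos) tuples and unrecognised ids."""
    rsids, chrompos, other = [], [], []
    for vid in ids:
        if RSID.match(vid):
            rsids.append(vid)
            continue
        m = CHROMPOS.match(vid)
        if m:
            chrompos.append((vid, m.group(1), int(m.group(2))))
        else:
            other.append(vid)
    return rsids, chrompos, other


rsids, chrompos, other = split_by_id(
    ["rs8100066", "chr19_260912_G_A", "19:261033", "not_a_variant"]
)
```

The rsid list would then be joined against the FinnGen rsid map, while the chrom/pos tuples would go through liftover to build 38 before the two sets are merged.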
This produces a file with chrom and pos based on Finngen, but with A1/A2/OR/P based on the original data:
CHR SNP A1 A2 BP OR P
19 chr19_260912 G A 260912 0.9957872809136792 0.050031874722
19 chr19_261033 A G 261033 0.9957998422696626 0.0507241244322
19 chr19_266034 C T 266034 1.0053796666398893 0.150796052266
19 chr19_267039 C T 267039 0.9961140904000996 0.0691110781996
19 chr19_276245 T C 276245 0.995420944482037 0.0366293445837
19 chr19_277776 A G 277776 0.9964618752820534 0.119939685495
19 chr19_280299 C T 280299 0.9964396546231319 0.120269108388
19 chr19_281360 T C 281360 0.9968755417956986 0.169347605744
19 chr19_288246 C T 288246 0.9991125513478095 0.722369442562
19 chr19_288374 C T 288374 1.0021901113140002 0.299346915861
Weights are calculated using PRScs. In order to run PRScs we then convert the file to rsids:
SNP A1 A2 OR P
rs8100066 G A 0.9957872809136792 0.050031874722
rs8105536 A G 0.9957998422696626 0.0507241244322
rs2312724 C T 1.0053796666398893 0.150796052266
rs1020382 C T 0.9961140904000996 0.0691110781996
rs12459906 T C 0.995420944482037 0.0366293445837
rs11084928 A G 0.9964618752820534 0.119939685495
rs11878315 C T 0.9964396546231319 0.120269108388
rs7815 T C 0.9968755417956986 0.169347605744
rs10409452 C T 0.9991125513478095 0.722369442562
rs12981067 C T 1.0021901113140002 0.299346915861
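A minimal sketch of this conversion, assuming the rsid map is already loaded as a dictionary (rsid to chrom_pos) and the munged rows are tuples in the column order shown above. The function name `to_rsid_rows` is hypothetical.

```python
def to_rsid_rows(rows, rsid_map):
    """Convert (CHR, SNP, A1, A2, BP, OR, P) rows to (SNP=rsid, A1, A2, OR, P).

    Rows whose position has no rsid in the map are dropped. If several rsids
    share one position, this simple inversion keeps only one of them.
    """
    chrompos_to_rsid = {cp: rs for rs, cp in rsid_map.items()}
    out = []
    for chrom, _snp, a1, a2, bp, odds_ratio, p in rows:
        rsid = chrompos_to_rsid.get(f"{chrom}_{bp}")
        if rsid is not None:
            out.append((rsid, a1, a2, odds_ratio, p))
    return out


# First two rows of the example above; the map entries mirror the rsid output.
rsid_map = {"rs8100066": "19_260912", "rs8105536": "19_261033"}
rows = [
    ("19", "chr19_260912", "G", "A", 260912, 0.9957872809136792, 0.050031874722),
    ("19", "chr19_261033", "A", "G", 261033, 0.9957998422696626, 0.0507241244322),
    ("19", "chr19_999", "C", "T", 999, 1.0, 0.5),  # made-up row with no rsid
]
converted = to_rsid_rows(rows, rsid_map)
```

Note that A1/A2/OR/P pass through untouched, which is the point of the design: only the identifier changes.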
This guarantees that the beta is still correct, since it's based on the original summary stats. It also means we can "recycle" the munged data for other reference panels, if needed in the future.
Then PRScs is run and weights are calculated only for the subset of variants shared across reference panel, summary stats and validation bim file (finngen).
19 rs8100066 260912 G A -2.438700e-05
19 rs8105536 261033 A G -1.608225e-05
19 rs2312724 266034 C T 2.586432e-04
19 rs1020382 267039 C T 8.532887e-06
19 rs12459906 276245 T C -3.629306e-05
19 rs11084928 277776 A G -3.484712e-05
19 rs11878315 280299 C T -6.442467e-06
19 rs7815 281360 T C -1.508200e-05
19 rs10409452 288246 C T 2.060791e-05
19 rs12981067 288374 C T 4.191780e-05
The weight file is converted to chrom_pos again through the finngen rsid/chrom_pos mapping, using the a1/a2 from the weights. However, now there is a double mismatch that needs to be fixed:
- the weights were calculated based on the a1/a2 order of the reference data set
- the output positions are based on the reference data set.
19 chr19_1208073_C_T 1208072 C T 7.354914e-06
19 chr19_1218220_T_C 1218219 T C 5.217805e-06
19 chr19_1220005_G_A 1220004 G A 9.294323e-05
19 chr19_1221162_T_C 1221161 T C 1.765576e-05
19 chr19_1226005_A_C 1226004 A C 8.979659e-05
19 chr19_1232559_C_T 1232558 C T 4.271571e-05
19 chr19_1238900_C_T 1238899 C T 2.828170e-05
In order to fix this, we replicate each entry, considering all possible permutations of the ref_alt in the variant id. This guarantees that at least one permutation is the matching Finngen variant. Also, the position is updated to the one in the id.
19 chr19_1208073_C_T 1208073 C T 7.354914e-06
19 chr19_1208073_T_C 1208073 C T 7.354914e-06
19 chr19_1208073_G_A 1208073 C T 7.354914e-06
19 chr19_1208073_A_G 1208073 C T 7.354914e-06
19 chr19_1218220_T_C 1218220 T C 5.217805e-06
19 chr19_1218220_C_T 1218220 T C 5.217805e-06
19 chr19_1218220_A_G 1218220 T C 5.217805e-06
19 chr19_1218220_G_A 1218220 T C 5.217805e-06
19 chr19_1220005_G_A 1220005 G A 9.294323e-05
19 chr19_1220005_A_G 1220005 G A 9.294323e-05
19 chr19_1220005_C_T 1220005 G A 9.294323e-05
19 chr19_1220005_T_C 1220005 G A 9.294323e-05
19 chr19_1221162_T_C 1221162 T C 1.765576e-05
19 chr19_1221162_C_T 1221162 T C 1.765576e-05
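The expansion can be sketched as follows. This is an illustration rather than the pipeline's script: the function name `expand_weight_row` is hypothetical, and single-base alleles are assumed (for longer alleles only the two orderings are emitted, with no strand-flipped ids).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def expand_weight_row(chrom, variant_id, a1, a2, weight):
    """Replicate one weight row over all ref/alt permutations of its id.

    The id is expected as PREFIX_POS_REF_ALT (e.g. chr19_1208073_C_T). The
    position is taken from the id; a1, a2 and the weight are left untouched.
    """
    prefix, pos, ref, alt = variant_id.rsplit("_", 3)
    pairs = [(ref, alt), (alt, ref)]
    if ref in COMPLEMENT and alt in COMPLEMENT:
        cref, calt = COMPLEMENT[ref], COMPLEMENT[alt]
        pairs += [(cref, calt), (calt, cref)]
    rows, seen = [], set()
    for r, a in pairs:
        vid = f"{prefix}_{pos}_{r}_{a}"
        if vid not in seen:  # guards against duplicates for ambiguous pairs
            seen.add(vid)
            rows.append((chrom, vid, int(pos), a1, a2, weight))
    return rows


expanded = expand_weight_row("19", "chr19_1208073_C_T", "C", "T", 7.354914e-06)
ids = [row[1] for row in expanded]
```

Only one of the generated ids can exist in the FinnGen variant list, so the extra rows are harmless: they are simply never matched at scoring time.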
Now we have all elements in place:
- variants are identified with a Finngen ID
- the position is updated to finngen data
- the effect allele is still the original one
- weights are calculated accordingly based on the effect allele
Finally, scores are calculated with plink2 --score, which only scores variants whose ids match, while still computing the score for the correct allele.
We've added the wdl finngen_weights.wdl to the wdl section; it allows weights to be calculated based on the FG sumstats. In order to do so we've built a custom LD panel based on the Finnish panel used for imputation.
The scripts used to generate it are in the scripts folder.
Here's a breakdown of how each step works and how to edit the wdl for one's needs:
"finngen_weights.test": False,
"finngen_weights.run_scores": False,
"finngen_weights.pheno_list": "gs://path/to/list/of/phenos.txt",
"finngen_weights.plink_root": "gs://finngen-production-library-red/finngen_R8/genotype_plink_1.0/data/finngen_R8_hm3",
"finngen_weights.prefix": "finngen_R?",
Test mode cuts the input sumstats to only 10k variants and performs the weights calculation in test mode (very few iterations). The output will be useless, but it will run in a very short time (mins vs hours).
run_scores determines whether scores are calculated or not.
pheno_list is the path to the list of phenos to analyze (see munging step).
plink_root is the base of the .bim file used as validation by cs-prs. If scores are calculated, it also needs to have .bed and .fam files in the same directory.
prefix is the prefix prepended to all output files.
The task munge_sumstats munges the input data to the CHROM_POS_REF_ALT format needed with FG.
Relevant inputs:
"finngen_weights.munge_sumstats.file_root": "gs://finngen-production-library-green/finngen_R5/finngen_R5_analysis_data/summary_stats/release/finngen_R5_PHENO.gz",
"finngen_weights.munge_sumstats.columns":"#chrom,pos,ref,alt,beta,pval",
The first is the path to the input sumstats. The string PHENO is replaced with each of the phenotypes listed in the pheno_list global parameter described above.
The columns are the core of the munging step. They need to be listed in the order that PRScs expects them, i.e. chrom pos ref alt beta pvalue.
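To make the two inputs concrete, here is a small sketch of how they might be resolved. It is not taken from the wdl: the function name `resolve_inputs` and the phenotype name `T2D` are hypothetical, while the file root and column string are the ones shown above.

```python
EXPECTED = ("chrom", "pos", "ref", "alt", "beta", "pval")


def resolve_inputs(file_root, phenos, columns):
    """Expand PHENO in the file root and map column names to their roles."""
    cols = columns.split(",")
    if len(cols) != len(EXPECTED):
        raise ValueError(f"expected {len(EXPECTED)} columns, got {len(cols)}")
    # position in the string decides the role, so the order matters
    col_map = dict(zip(EXPECTED, cols))
    paths = {pheno: file_root.replace("PHENO", pheno) for pheno in phenos}
    return paths, col_map


file_root = (
    "gs://finngen-production-library-green/finngen_R5/finngen_R5_analysis_data/"
    "summary_stats/release/finngen_R5_PHENO.gz"
)
paths, col_map = resolve_inputs(file_root, ["T2D"], "#chrom,pos,ref,alt,beta,pval")
```

Since roles are assigned purely by position, a sumstats file with different header names only needs its own names listed in the same chrom, pos, ref, alt, beta, pval order.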
"finngen_weights.weights.ref_list": "gs://finngen-production-library-green/prs/ref_list_fin.txt",
"finngen_weights.weights.rsid_map": "gs://finngen-production-library-green/finngen_R8/finngen_R8_analysis_data/variant_mapping/finngen_R8_hm3.rsid.map.tsv",
ref_list is the list of files containing the custom LD panel built with the scripts described in the introduction.
rsid_map is a file structured as follows:
rs11596870 10_100000235
rs11190363 10_100002628
rs7902856 10_100004827
rs7078766 10_100005136
[...]
It is used to output weights in rsid format.
This task outputs three files:
- weights in CHR_POS_REF_ALT format
- weights in rsid format
- the log output of CS-PRS for checking all the relevant metadata
If interested in calculating scores, the only edit one can make is to increase/decrease the number of cpus in the .json. The .bed file is determined by the plink_root parameter passed in the global parameters.