Skip to content

FINNGEN/CS-PRS-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CS-PRS-pipeline

Pipeline to calculates PRS based on a list of sumstats. Weights are calculated with PRScs: https://github.com/getian107/PRScs

The sumstats used for R10 are the following:

filename pheno publication
AD_sumstats_Jansenetal.txt.gz Alzheimers_Jansen_2019 https://www.nature.com/articles/s41588-018-0311-9
adhd_eur_jun2017.gz ADHD https://www.biorxiv.org/content/early/2017/06/03/145581
alsMetaSummaryStats_march21st2018.tab.gz ALS_Nicolas_2018 https://www.cell.com/neuron/references/S0896-6273(18)30148-X
Bipolar_vs_control_PGC_Cell_2018_formatted.txt.gz bipolar_disorder PGC;Cell;2018;https://doi.org/10.1016/j.cell.2018.05.046
cad.add.160614.website.txt.gz coronary_artery_disease https://www.nature.com/articles/ng.3396
ckqny.scz2snpres.gz Schizophrenia_PGC_2014 https://www.nature.com/articles/nature13595
daner_PGC_SCZ52_0513a.resultfiles_PGC_SCZ52_0513.sh2_nofin.gz Schizophrenia_PGC_2014_noFinns PGC_schzizophrenia_GWAS_Ripke_et_al._excluding_Finnish
dpw_excludingFinnishStudies_4finngen.txt.gz alcohol_consumption https://www.nature.com/articles/s41398-019-0676-2
EAGLE_AD_GWAS_results_2015.txt.gz atopic_dermatitis https://www.nature.com/articles/ng.3424
Educational_Attainment_excl23andme_Lee_2018_NatGen_formatted.txt.gz educational_attainment Lee;Nat_Gen;2018;https://doi.org/10.1038/s41588-018-0147-3
EUR.CD.gwas_info03_filtered.assoc.gz Crohns_disease https://www.nature.com/articles/ng.3359
EUR.IBD.gwas_info03_filtered.assoc.gz Inflammatory_bowel_disease https://www.nature.com/articles/ng.3359
EUR.UC.gwas_info03_filtered.assoc.gz Ulcerative_colitis https://www.nature.com/articles/ng.3359
focal_epilepsy_METAL_4finngen.txt.gz Focal_epilepsy https://www.nature.com/articles/s41467-018-07524-z
gabriel_asthma_meta-analysis_36studies_format_repository_NEJM.txt.gz Asthma_Moffatt_2010 https://www.nejm.org/doi/full/10.1056/nejmoa0906312
generalised_epilepsy_METAL_4finngen.txt.gz Generalized_epilepsy https://www.nature.com/articles/s41467-018-07524-z
GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz Height_giant_2014 https://www.nature.com/articles/ng.3097
GWAS_CP_all.txt.gz Cognitive_performance https://www.nature.com/articles/s41588-018-0147-3
Hb_gwas_summary_fromNealeLab.tsv.gz Haemoglobin_concentration UKBB
HbA1c_METAL_European.txt.gz HbA1c http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002383
HDL_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz HDL-C https://www.nature.com/articles/ng.3300
IGAP_stage_1.txt.gz Alzheimers_Lambert_2013 https://www.nature.com/articles/ng.2802
iPSYCH-PGC_ASD_Nov2017.gz ASD https://www.biorxiv.org/content/early/2017/11/27/224774
kunkle_etal_stage1.txt.gz Alzheimers_Kunkle_2019 https://www.niagads.org/datasets/ng00075
LDL_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz LDL-C https://www.nature.com/articles/ng.3300
Mahajan.NatGenet2018b.T2D.European.txt.gz T2D_Mahajan_2018 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6287706/
MDD2018_ex23andMe.19fields.gz Major_depressive_disorder https://www.nature.com/articles/s41588-018-0090-3
meta_v3_onco_euro_overall_ChrAll_1_release.txt.gz prostate_cancer https://www.nature.com/articles/s41588-018-0142-8
METAANALYSIS_DIAGRAM_SE1.txt.gz T2D_Scott_2017 http://diabetes.diabetesjournals.org/content/66/11/2888
metastroke.all.chr.bp.gz Ischaemic_stroke https://www.thelancet.com/journals/laneur/article/PIIS1474-4422(12)70234-X/abstract
MTAG_CP.to10K.txt.gz Cognitive_Performance https://www.nature.com/articles/s41588-018-0147-3
MTAG_EA.to10K.txt.gz Educational_attainment https://www.nature.com/articles/s41588-018-0147-3
pgc.bip.full.2012-04.txt.gz Bipolar_disorder_PGC_2011 https://www.nature.com/articles/ng.943
pgc.ed.freeze1.summarystatistics.July2017.txt.gz Anorexia Missing
RA_GWASmeta_European_v2.txt.gz rheumatoid_arthritis https://www.nature.com/articles/nature12873
SavageJansen_2018_intelligence_metaanalysis_formatted.txt.gz intelligence Savage;Nat_Gen-2018;https://doi.org/10.1038/s41588-018-0152-6
Saxena_fullUKBB_Longsleep_summary_stats_formatted.txt.gz habitual_sleep_duration Dashti;Natcomm-2019;https://doi.org/10.1038/s41467-019-08917-4
Shrine_30804560_FEV1_meta-analysis.txt.gz FEV1 https://www.nature.com/articles/s41588-018-0321-7
Shrine_30804560_FEV1_to_FVC_RATIO_meta-analysis.txt.gz FEV1/FVC https://www.nature.com/articles/s41588-018-0321-7
Shrine_30804560_FVC_meta-analysis.txt.gz FVC https://www.nature.com/articles/s41588-018-0321-7
Shrine_30804560_PEF_meta-analysis.txt.gz Peak_expiratory_flow https://www.nature.com/articles/s41588-018-0321-7
SNP_gwas_mc_merge_nogc.tbl.uniq.gz BMI https://www.nature.com/articles/nature14177
SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt.gz Posttraumatic_stress_disorder https://www.nature.com/articles/mp201777
sumstats_neuroticism_ctg_format_formatted.txt.gz neuroticism Nagel;NatGen2018-https://doi.org/10.1038/s41588-018-0151-7
TC_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz Total_cholesterol https://www.nature.com/articles/ng.3300
TG_Meta_ENGAGE_1000G_non_FINRISK_QCd.txt.gz Triglycerides https://www.nature.com/articles/ng.3300
UKBB_gwas_Neale_Chronotype_formatted.txt.gz chronotype_neale UKBB
UKBB_gwas_Neale_pulserate_formatted.txt.gz heart_rate_neale UKBB
UKBB_gwas_Neale_SleepDuration_formatted.txt.gz sleepduration_neale UKBB
UKBB_gwas_Neale_SleeplessnessInsomnia_formatted.txt.gz insomnia_neale UKBB
UKBB_HYPO.txt.gz hypothyroidism UKBB
UKB-ICBPmeta750k_DBPsummaryResults.txt.gz diastolic_blood_pressure UKBB_ICBP
UKB-ICBPmeta750k_PPsummaryResults.edited.txt.gz pulse_pressure UKBB_ICBP
UKB-ICBPmeta750k_SBPsummaryResults.txt.gz systolic_blood_pressure UKBB_ICBP
GSCAN_AgeofInitiation.txt.gz smoking_age_of_initiation https://www.nature.com/articles/s41588-018-0307-5
GSCAN_CigarettesPerDay.txt.gz smoking_cigarettes_per_day https://www.nature.com/articles/s41588-018-0307-5
GSCAN_DrinksPerWeek.txt.gz drinks_per_week https://www.nature.com/articles/s41588-018-0307-5
GSCAN_SmokingInitiation.txt.gz smoking_initiation https://www.nature.com/articles/s41588-018-0307-5
PGC_UKB_depression_genome-wide.txt.gz depression https://www.nature.com/articles/s41588-018-0090-3
PGC_UKB_23andMe_depression_10000.txt.gz depression_23andme https://www.nature.com/articles/s41588-018-0090-3
PGC3_SCZ_wave3_public_without_frequencies.v1.tsv.gz schizophrenia_pgc_2022 https://www.nature.com/articles/s41586-022-04434-5
PGC3_SCZ_wave3_public_without_frequencies.clumped.v1.tsv.gz schizophrenia_pgc_2022_clumped https://www.nature.com/articles/s41586-022-04434-5
daner_PGC_BIP32b_mds7a_0416a.gz bipolar_disorder https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6956732/
CIMBA_BRCA1_BCAC_TN_meta_summary_level_statistics.txt.gz breast_cancer https://www.nature.com/articles/s41588-020-0609-2
AF_HRC_GWAS_ALLv11.txt.gz atrial_fibrillation https://www.nature.com/articles/s41588-018-0133-9
lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz lifespan https://elifesciences.org/articles/39856
RISK_GWAS_MA_UKB+23andMe+replication.txt.gz risk_tolerance https://www.nature.com/articles/s41588-018-0309-3
chronic_pain-bgen.stats.gz multisite_chronic_pain https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6592570/
Saxena_fullUKBB_Insomnia_summary_stats.txt.gz insomnia https://www.nature.com/articles/s41588-019-0361-7
continuous-LDLC-both_sexes-medadj_irnt.tsv.gz ldl_adj_irnt https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409
biomarkers-30750-both_sexes-irnt.tsv.gz hba1c_inrt https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409
biomarkers-30870-both_sexes-irnt.tsv.gz triglycerides_irnt https://docs.google.com/spreadsheets/d/1AeeADtT0U1AukliiNyiVzVRdLYPkTbruQSk38DeutU8/edit#gid=511623409
MEGASTROKE.1.AS.TRANS.out.gz any_stroke https://www.nature.com/articles/s41588-018-0058-3
UKBB.asthma-2.assoc.gz asthma_han_2020 https://www.nature.com/articles/s41467-020-15649-3
ICP_FG_DC_EG_DBDS_202012.gz intrahepatic_cholestatis_of_pregnancy FG
Meta-analysis_Wood_et_al+UKBiobank_2018_height.txt.gz height_giant_2014_ukb https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488973/
Meta-analysis_Locke_et_al+UKBiobank_2018_bmi.txt.gz height_giant_2014_ukb https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6488973/
GSEM.GWAS.EXTERNALIZING.SHARE.v20191014.txt.gz externalizing https://www.biorxiv.org/content/10.1101/2020.10.16.342501v1
MTAG_glaucoma_four_traits_summary_statistics.txt.gz glaucoma_Craig_2020 https://www.nature.com/articles/s41588-019-0556-y
phecode-696.4-both_sexes.tsv.gz Psoriasis UKBB
phecode-250.1-both_sexes.tsv.gz T1D UKBB
BCX2_EOS_Trans_GWAMA.out.gz EOS_Trans https://pubmed.ncbi.nlm.nih.gov/32888493/
biomarkers-30600-both_sexes-irnt.tsv.gz Albumin_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30610-both_sexes-irnt.tsv.gz Alkaline_phosphatase_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30620-both_sexes-irnt.tsv.gz Alanine_aminotransferase_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30630-both_sexes-irnt.tsv.gz Apolipoprotein_A_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30640-both_sexes-irnt.tsv.gz Apolipoprotein_B_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30650-both_sexes-irnt.tsv.gz Aspartate_aminotransferase_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30660-both_sexes-irnt.tsv.gz Direct_bilirubin_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30670-both_sexes-irnt.tsv.gz Urea_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30680-both_sexes-irnt.tsv.gz Calcium_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30690-both_sexes-irnt.tsv.gz Cholesterol_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30700-both_sexes-irnt.tsv.gz Creatinine_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30710-both_sexes-irnt.tsv.gz C-reactive_protein_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30720-both_sexes-irnt.tsv.gz Cystatin_C_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30730-both_sexes-irnt.tsv.gz Gamma_glutamyltransferase_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30740-both_sexes-irnt.tsv.gz Glucose_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30760-both_sexes-irnt.tsv.gz HDL_cholesterol_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30770-both_sexes-irnt.tsv.gz IGF-1_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30780-both_sexes-irnt.tsv.gz LDL_direct_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30790-both_sexes-irnt.tsv.gz Lipoprotein_A_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30800-both_sexes-irnt.tsv.gz Oestradiol_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30810-both_sexes-irnt.tsv.gz Phosphate_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30820-both_sexes-irnt.tsv.gz Rheumatoid_factor_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30830-both_sexes-irnt.tsv.gz SHBG_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30840-both_sexes-irnt.tsv.gz Total_bilirubin_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30850-both_sexes-irnt.tsv.gz Testosterone_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30860-both_sexes-irnt.tsv.gz Total_protein_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30880-both_sexes-irnt.tsv.gz Urate_panukbb https://pan.ukbb.broadinstitute.org/
biomarkers-30890-both_sexes-irnt.tsv.gz Vitamin_D_panukbb https://pan.ukbb.broadinstitute.org/
ukb.allasthma.upload.final.assoc_v2.txt.gz asthma_Zhu_2019 https://erj.ersjournals.com/content/54/6/1901507-0.long
michailidoubreastcancerall.txt.gz Breast_cancer https://www.nature.com/articles/nature24284
20171017_MW_eGFR_overall_EA_nstud42.dbgap_v2.txt.gz estimated_glomerular_filtration_rate https://www.nature.com/articles/s41588-019-0407-x
CRC_LP1_VQ_CFR1_CFR2_COIN_FIN_ONCO_SCOT_AUSTRIA_SP1_SOCCS3_CROATIA_BIOB_LBC_DACHS_meta_uk10k_1kG_info8_allchr.txt.gz Colorectal_cancer https://www.nature.com/articles/s41467-019-09775-w
urate_chr1_22_LQ_IQ06_mac10_all_741_rsid_v2.txt.gz Urate https://www.nature.com/articles/s41588-019-0504-x
Tachmazidou_30664745_HIPOA.txt.gz hip_osteoarthritis https://www.nature.com/articles/s41588-018-0327-1
Tachmazidou_30664745_KNEEOA.txt.gz knee_osteoarthritis https://www.nature.com/articles/s41588-018-0327-1
BMD_v3_SumStats.txt.gz Bone_mineral_density https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0200785
anno.CLEANED.MVP.EUR.VTE.results.EQC.tbl.gz Venous_thromboembolism https://www.nature.com/articles/s41588-019-0519-3
GCST90011770_buildGRCh37.tsv.gz Glaucoma_Gharahkhani_2021 https://www.nature.com/articles/s41467-020-20851-4
Hysi_et_al_Refractive_error_NatGenet_2020_beta_se.txt.gz Refractive_error_and_myopia https://www.nature.com/articles/s41588-020-0599-0
ALS_sumstats_EUR_only.txt.gz ALS_Rheenen_2021 https://www.nature.com/articles/s41588-021-00973-1
discovery_metav3.0.meta.gz Multiple_Sclerosis https://pubmed.ncbi.nlm.nih.gov/31604244/
HERMES_Jan2019_HeartFailure_summary_data.txt.gz heart_failure https://cvd.hugeamp.org/dinspector.html?dataset=GWAS_HERMES_eu
CKD_overall_ALL_JW_20180223_nstud30.dbgap.txt.gz chronic_kidney_disease https://ckdgen.imbi.uni-freiburg.de/
netherlands_total.fastGWA.gz netherlands_total_cost https://pubmed.ncbi.nlm.nih.gov/34213412/
netherlands_gp.fastGWA.gz netherlands_gp_cost https://pubmed.ncbi.nlm.nih.gov/34213412/
netherlands_hospital.fastGWA.gz netherlands_hospital_cost https://pubmed.ncbi.nlm.nih.gov/34213412/
netherlands_pharmacy.fastGWA.gz netherlands_pharmacy_cost https://pubmed.ncbi.nlm.nih.gov/34213412/
GCST90027158_buildGRCh38.tsv.gz alzheimer https://www.nature.com/articles/s41588-022-01024-z,WARNING:FINNGEN_SAMPLES_INCLUDED

Rsid map

This step generates a mapping to/from rsid/chrompos based on data available at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.

rsid_map.py produces:

  • finngen.rsid.map.tsv (rsid--> chrompos)
rs10	7_92754574
rs1000000	12_126406434
rs1000000219	13_95689463
  • finngen.variants.tsv (chrompos --> ref/alt)
10_100000235	C	T
10_100000979	T	C
10_100001839	CAA	C

The first file is used throught the computation to go to/from rsid notation. The second is used to filter out variants that do not have the right alleles.

Also, if a rsid list is provided (e.g. hm3 rsids), it returns the subset of variants in the original bim file that match to those rsids:

  • hm3.snplist
chr10_100000235_C_T
chr10_100002628_A_C
chr10_100004827_A_C

Munging

PRScs automatically does some allele matching:

  • the reference genome (1kg) only contains non ambigous variants :
cat snpinfo_1kg_hm3 | sed -E 1d |  cut -f 4,5 | awk '{print $1$2}' | sort | uniq
AC
AG
CA
CT
GA
GT
TC
TG

Also, in the parsing phase it checks for the ref/alt order and fixes the beta accordingly.

vld_snp = set(zip(vld_dict['SNP'], vld_dict['A1'], vld_dict['A2']))
ref_snp = set(zip(ref_dict['SNP'], ref_dict['A1'], ref_dict['A2'])) | set(zip(ref_dict['SNP'], ref_dict['A2'], ref_dict['A1'])) | \
              set(zip(ref_dict['SNP'], [mapping[aa] for aa in ref_dict['A1']], [mapping[aa] for aa in ref_dict['A2']])) | \
              set(zip(ref_dict['SNP'], [mapping[aa] for aa in ref_dict['A2']], [mapping[aa] for aa in ref_dict['A1']]))
sst_snp = set(zip(sst_dict['SNP'], sst_dict['A1'], sst_dict['A2'])) | set(zip(sst_dict['SNP'], sst_dict['A2'], sst_dict['A1'])) | \
              set(zip(sst_dict['SNP'], [mapping[aa] for aa in sst_dict['A1'] if aa in ATGC], [mapping[aa] for aa in sst_dict['A2'] if aa in ATGC])) | \
              set(zip(sst_dict['SNP'], [mapping[aa] for aa in sst_dict['A2'] if aa in ATGC], [mapping[aa] for aa in sst_dict['A1'] if aa in ATGC]))

comm_snp = ref_snp & vld_snp & sst_snp
with open(sst_file) as ff:
     if (snp, a1, a2) in comm_snp:
     	...   
	beta_std = sp.sign(beta)*abs(norm.ppf(p/2.0))/n_sqrt
     elif (snp, a2, a1) in comm_snp:
     	beta_std = -1*sp.sign(beta)*abs(norm.ppf(p/2.0))/n_sqrt
	 ...

The final weights are printed based on the a1/a2 order of the reference panel (i.e. the EUR 1kg panel in this case).

PRScs does check for strand flip.

Our solution is therefore the following.

We build a rsid to chrom pos mapping from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz . This allows to move back and forth from/to rsid/chrompos notations and therefore to merge summary stats with different formats.

Then:

  • in the munging phase we split the summary stats entries based on whether variants are identified by rsid or by some of chrom/pos notation
  • the rsid file is filtered for rsids present in finngen. chrom and pos information are updated to finngen data
  • the chrompos file is updated to have chrom and pos added (if provided, else extracted from variant id) and then lifted to build 38
  • the two files are then merged to a FINNGEN chrom_pos_ref_alt notation, making sure that the variant exists in finngen data (checking for strand flip as well).

This produces a file with chrom and pos based on Finngen, but with A1/A2/OR/P based on the original data:

CHR     SNP     A1      A2      BP      OR      P
19	chr19_260912	G	A	260912	0.9957872809136792	0.050031874722
19	chr19_261033	A	G	261033	0.9957998422696626	0.0507241244322
19	chr19_266034	C	T	266034	1.0053796666398893	0.150796052266
19	chr19_267039	C	T	267039	0.9961140904000996	0.0691110781996
19	chr19_276245	T	C	276245	0.995420944482037	0.0366293445837
19	chr19_277776	A	G	277776	0.9964618752820534	0.119939685495
19	chr19_280299	C	T	280299	0.9964396546231319	0.120269108388
19	chr19_281360	T	C	281360	0.9968755417956986	0.169347605744
19	chr19_288246	C	T	288246	0.9991125513478095	0.722369442562
19	chr19_288374	C	T	288374	1.0021901113140002	0.299346915861

Weights

Weights are calculated using PRScs. In order to run PRScs we then convert the file to rsids:

SNP	A1	A2	OR	P
rs8100066	G	A	0.9957872809136792	0.050031874722
rs8105536	A	G	0.9957998422696626	0.0507241244322
rs2312724	C	T	1.0053796666398893	0.150796052266
rs1020382	C	T	0.9961140904000996	0.0691110781996
rs12459906	T	C	0.995420944482037	0.0366293445837
rs11084928	A	G	0.9964618752820534	0.119939685495
rs11878315	C	T	0.9964396546231319	0.120269108388
rs7815	T	C	0.9968755417956986	0.169347605744
rs10409452	C	T	0.9991125513478095	0.722369442562
rs12981067	C	T	1.0021901113140002	0.299346915861

This guarantees that the beta is still correct,since it's based on the original summary stats. However, this way, we can "recycle" the munged data also for other reference panels, if needed in the future.

Then PRScs is run and weights are calculated only for the subset of variants shared across reference panel, summary stats and validation bim file (finngen).

19	rs8100066	260912	G	A	-2.438700e-05
19	rs8105536	261033	A	G	-1.608225e-05
19	rs2312724	266034	C	T	2.586432e-04
19	rs1020382	267039	C	T	8.532887e-06
19	rs12459906	276245	T	C	-3.629306e-05
19	rs11084928	277776	A	G	-3.484712e-05
19	rs11878315	280299	C	T	-6.442467e-06
19	rs7815	281360	T	C	-1.508200e-05
19	rs10409452	288246	C	T	2.060791e-05
19	rs12981067	288374	C	T	4.191780e-05

The weight file is converted to chrom_pos again through the finngen rsid/chrom_pos mapping,using the a1/a2 from the weights. However, now there is a double mismatch that needs to be fixed:

  1. the weights were calculated based on the a1/a2 order of the reference data set
  2. the output positions are based on the reference data set.
19      chr19_1208073_C_T       1208072 C       T       7.354914e-06
19      chr19_1218220_T_C       1218219 T       C       5.217805e-06
19      chr19_1220005_G_A       1220004 G       A       9.294323e-05
19      chr19_1221162_T_C       1221161 T       C       1.765576e-05
19      chr19_1226005_A_C       1226004 A       C       8.979659e-05
19      chr19_1232559_C_T       1232558 C       T       4.271571e-05
19      chr19_1238900_C_T       1238899 C       T       2.828170e-05

In order to fix this, we replicate each entry, considering all possible permutations of the ref_alt in the variant id. This guarantees that at least one permutation is the matching Finngen variant. Also, the position is updated to the one in the id.

19	chr19_1208073_C_T	1208073	C	T	7.354914e-06
19	chr19_1208073_T_C	1208073	C	T	7.354914e-06
19	chr19_1208073_G_A	1208073	C	T	7.354914e-06
19	chr19_1208073_A_G	1208073	C	T	7.354914e-06
19	chr19_1218220_T_C	1218220	T	C	5.217805e-06
19	chr19_1218220_C_T	1218220	T	C	5.217805e-06
19	chr19_1218220_A_G	1218220	T	C	5.217805e-06
19	chr19_1218220_G_A	1218220	T	C	5.217805e-06
19	chr19_1220005_G_A	1220005	G	A	9.294323e-05
19	chr19_1220005_A_G	1220005	G	A	9.294323e-05
19	chr19_1220005_C_T	1220005	G	A	9.294323e-05
19	chr19_1220005_T_C	1220005	G	A	9.294323e-05
19	chr19_1221162_T_C	1221162	T	C	1.765576e-05
19	chr19_1221162_C_T	1221162	T	C	1.765576e-05

Now we have all elements in place:

  • variants are identified with a Finngen ID
  • the position is updated to finngen data
  • the effect allele is still the original one
  • weights are calcualted accordingly based on the effect allele

Scores

Finally scores are calculate with plink2 --sscore which will only compute if the variant ids match, but still computing the score for the correct allele.

FINNGEN WEIGHTS

We've added in the wdl section the wdl finngen_weights.wdl that allows to calculate weights based on the FG sumstats. In order to do so we've build a custom LD panel based on the Finnish panel used for imputation.

In the scripts folder link there are the scripts used to generate it.

Here's a breakdown of how it works in each step and how to edit the wdl for one's needs

Global parameters

"finngen_weights.test": False,
"finngen_weights.run_scores": False,
"finngen_weights.pheno_list": "gs://path/to/list/of/phenos.txt",
"finngen_weights.plink_root": "gs://finngen-production-library-red/finngen_R8/genotype_plink_1.0/data/finngen_R8_hm3",
"finngen_weights.prefix": "finngen_R?",

Test mode cuts the input sumstats to only 10k variants and performs the weights calculation in test mode (very few iterations). The output will be useless, but it will run in a very short time (mins vs hours).

run_scores determines whether scores are calculated or not.

pheno_list is the path to the list of phenos to analyze (see munging step).

plink_root is the base of the .bim file used as validation by cs-prs. If scores are passed, it also needs to have .bed and .fam files in the same directory

prefix'is the prefix prepended to the output of all files

Munging

The task munge_sumstats munges the input data to a CHROM_POS_REF_ALT format needed with FG.

Relevant inputs:

"finngen_weights.munge_sumstats.file_root": "gs://finngen-production-library-green/finngen_R5/finngen_R5_analysis_data/summary_stats/release/finngen_R5_PHENO.gz",
"finngen_weights.munge_sumstats.columns":"#chrom,pos,ref,alt,beta,pval",

The first is the path to the input sumsats. The string PHENO is replaced with the phenotypes defined in the global parameter described before.

The columns are the core of the munging step. These column configurations need to be in the order that PRCSs expects them i.e chrom pos ref alt beta pvalue.

Weights

"finngen_weights.weights.ref_list": "gs://finngen-production-library-green/prs/ref_list_fin.txt",
"finngen_weights.weights.rsid_map": "gs://finngen-production-library-green/finngen_R8/finngen_R8_analysis_data/variant_mapping/finngen_R8_hm3.rsid.map.tsv",

ref_list is the list of files containing the custom LD panel built with thes scripts described in the introduction.

rsid_map is a filed structures as follows:

rs11596870 10_100000235
rs11190363 10_100002628
rs7902856 10_100004827
rs7078766 10_100005136
[...]

That is used to output weights in rsid format.

This task outputs three files:

  • weights in CHR_POS_REF_ALT format
  • weights in rsid format
  • the log output of CS-PRS for checking all the relevant metadata

Scores

If interested in calculating scores, the only edit one can make is to increase/decrease the number of cpus in the .json. The .bed file is determined by the plink_root parameter passed in the global parameters.