## Generate polygenic risk score

This notebook generates a polygenic risk score (PRS) in the UKB training set (approximately 380k European subjects). The score is then validated separately in 50k unrelated UKB European subjects. It is constructed with LDpred2, following the tutorial at https://privefl.github.io/bigsnpr/articles/LDpred2.html.

This script is implemented in R.

### Prerequisites

In [1]:
library(bigsnpr)
library(data.table)

options(bigstatsr.check.parallel.blas = FALSE)
options(default.nproc.blas = NULL)

Loading required package: bigstatsr



### Import summary statistics

Summary statistics are from the latent trait GWAS (BOLT-LMM, then GenomicSEM) on the UKB training sets.

In [7]:
gwas_cystatin = read.table('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin_vaf_effectflip_innersnp.tsv', sep = '\t', header = TRUE)

#Set column labels as per LDPred2 instructions
names(gwas_cystatin) = c('chr','pos','SNPID','a1','a0','MAF','beta','beta_se','P','N','ALT_FREQ', 'alleles','locus')
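
Renaming by position silently mislabels columns if the column order of the input file ever changes. A minimal sketch of a name-based alternative on a toy data frame follows; the original header names used here (`CHR`, `BP`, `EFFECT_ALLELE`, ...) are hypothetical, since the headers of the real file are not shown in this notebook.

```r
# Toy stand-in for the summary statistics; the original header names
# (CHR, BP, EFFECT_ALLELE, ...) are hypothetical, not those of the real file.
toy <- data.frame(CHR = 1L, BP = 779322L, EFFECT_ALLELE = "G", OTHER_ALLELE = "A",
                  EST = 0.0123, SE = 0.0072, stringsAsFactors = FALSE)

# Map old names to the names used in this notebook (chr, pos, a0, a1, beta, beta_se)
rename_map <- c(CHR = "chr", BP = "pos", EFFECT_ALLELE = "a1",
                OTHER_ALLELE = "a0", EST = "beta", SE = "beta_se")

# Fail early if an expected column is missing, then rename by name
stopifnot(all(names(rename_map) %in% names(toy)))
idx <- match(names(rename_map), names(toy))
names(toy)[idx] <- unname(rename_map)
```

With this pattern a missing or reordered column raises an error instead of producing a mislabelled table.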

### Match SNPs to LD reference

The LD reference was downloaded as described in the LDpred2 tutorial.

In [8]:
map_ldref <- readRDS("~/PGS/ukb_ld/map.rds")
sumstats <- gwas_cystatin
sumstats$n_eff <- sumstats$N

info_snp <- snp_match(sumstats, map_ldref, match.min.prop=0)
(info_snp <- tidyr::drop_na(tibble::as_tibble(info_snp)))

df_beta <- info_snp

1,031,527 variants to be matched.

0 ambiguous SNPs have been removed.

1,031,527 variants have been matched; 0 were flipped and 0 were reversed.



| chr | pos | a0 | a1 | SNPID | MAF | beta | beta_se | P | N | ⋯ | locus | n_eff | `_NUM_ID_.ss` | rsid | af_UKBB | ld | pos_hg17 | pos_hg18 | pos_hg38 | `_NUM_ID_` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | ⋯ | `<chr>` | `<int>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 779322 | A | G | rs4040617 | 0.128190 | 0.01227300 | 0.0071802 | 8.7406e-02 | 86462 | ⋯ | 1:779322 | 86462 | 1 | rs4040617 | 0.12795871 | 3.680956 | 819185 | 769185 | 843942 | 5 |
| 1 | 888659 | T | C | rs3748597 | 0.053717 | 0.01288600 | 0.0106390 | 2.2580e-01 | 86462 | ⋯ | 1:888659 | 86462 | 2 | rs3748597 | 0.94655001 | 5.480429 | 928802 | 878522 | 953279 | 21 |
| 1 | 916834 | G | A | rs6694632 | 0.410400 | 0.00757740 | 0.0049094 | 1.2272e-01 | 86462 | ⋯ | 1:916834 | 86462 | 3 | rs6694632 | 0.58994510 | 7.776957 | 956901 | 906697 | 981454 | 25 |
| 1 | 918384 | G | T | rs13303118 | 0.415150 | 0.00678510 | 0.0048959 | 1.6578e-01 | 86462 | ⋯ | 1:918384 | 86462 | 4 | rs13303118 | 0.58530752 | 7.799277 | 958451 | 908247 | 983004 | 26 |
| 1 | 918573 | A | G | rs2341354 | 0.410170 | 0.00717160 | 0.0049075 | 1.4392e-01 | 86462 | ⋯ | 1:918573 | 86462 | 5 | rs2341354 | 0.59015511 | 7.765813 | 958640 | 908436 | 983193 | 27 |
| 1 | 932457 | G | A | rs1891910 | 0.229370 | 0.02319600 | 0.0057922 | 6.2098e-05 | 86462 | ⋯ | 1:932457 | 86462 | 6 | rs1891910 | 0.23015347 | 5.990979 | 972524 | 922320 | 997077 | 31 |
| 1 | 944564 | T | C | rs3128117 | 0.397690 | 0.00707180 | 0.0049334 | 1.5173e-01 | 86462 | ⋯ | 1:944564 | 86462 | 7 | rs3128117 | 0.39805554 | 7.269782 | 984631 | 934427 | 1009184 | 34 |
| 1 | 947034 | G | A | rs2465126 | 0.036583 | -0.00447500 | 0.0127440 | 7.2549e-01 | 86462 | ⋯ | 1:947034 | 86462 | 8 | rs2465126 | 0.96394223 | 7.690968 | 987101 | 936897 | 1011654 | 35 |
| 1 | 950243 | A | C | rs1891906 | 0.393090 | 0.00817660 | 0.0049444 | 9.8185e-02 | 86462 | ⋯ | 1:950243 | 86462 | 9 | rs1891906 | 0.39360436 | 7.257954 | 990310 | 940106 | 1014863 | 38 |
| 1 | 959842 | C | T | rs2710888 | 0.366680 | 0.00603520 | 0.0050135 | 2.2867e-01 | 86462 | ⋯ | 1:959842 | 86462 | 10 | rs2710888 | 0.36671423 | 5.977691 | 999909 | 949705 | 1024462 | 40 |
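
The matching log reports how many variants were "reversed", meaning that `a0` and `a1` are swapped relative to the reference. The following is a simplified base-R sketch of that one case on made-up data, not the `snp_match()` implementation: when the alleles are swapped, the alleles are put back in reference order and the sign of the effect is negated.

```r
# Toy summary statistics and reference; all values are made up for illustration.
ss  <- data.frame(pos = c(100L, 200L), a0 = c("A", "C"), a1 = c("G", "T"),
                  beta = c(0.10, -0.20), stringsAsFactors = FALSE)
ref <- data.frame(pos = c(100L, 200L), a0 = c("A", "T"), a1 = c("G", "C"),
                  stringsAsFactors = FALSE)

# A variant is "reversed" when a0/a1 are swapped relative to the reference
reversed <- ss$a0 == ref$a1 & ss$a1 == ref$a0

# Align reversed variants to the reference: swap the alleles and negate beta
old_a0 <- ss$a0[reversed]
ss$a0[reversed] <- ss$a1[reversed]
ss$a1[reversed] <- old_a0
ss$beta[reversed] <- -ss$beta[reversed]
```

In the run above no variant needed this treatment, which is expected when the GWAS and the LD reference use the same allele coding.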


### Generate LD matrices

In [10]:
tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  cat(chr, ".. ", sep = "")

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'map_ldref'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ## indices in 'corr_chr'
  ind.chr3 <- match(ind.chr2, which(map_ldref$chr == chr))

  corr_chr <- readRDS(paste0("~/PGS/ukb_ld/LD_chr", chr, ".rds"))[ind.chr3, ind.chr3]

  if (chr == 1) {
    corr <- as_SFBM(corr_chr, tmp)
  } else {
    corr$add_columns(corr_chr, nrow(corr))
  }
}

saveRDS(corr, file = "corr_hm3_cystatin_prs.rds")
saveRDS(df_beta, file = "df_beta_hm3_cystatin_prs.rds")

1.. 

Creating directory "tmp-data" which didn't exist..



2.. 3.. 4.. 5.. 6.. 7.. 8.. 9.. 10.. 11.. 12.. 13.. 14.. 15.. 16.. 17.. 18.. 19.. 20.. 21.. 22.. 
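
The three index vectors in the loop are easy to confuse. The toy sketch below (base R, made-up sizes, plain vectors in place of the real tables) traces one row of `df_beta` to its row in the full reference map and then to its row within the per-chromosome LD matrix; `map_chr`, `beta_chr` and `num_id` are illustrative names only.

```r
# Toy LD reference map: 6 variants, the first 4 on chr 1 and the last 2 on chr 2
map_chr <- c(1, 1, 1, 1, 2, 2)

# Toy matched table: 3 variants survived matching; `num_id` plays the role
# of df_beta$`_NUM_ID_` (row index into the full reference map)
beta_chr <- c(1, 1, 2)
num_id   <- c(2, 4, 6)

chr <- 2
ind.chr  <- which(beta_chr == chr)                   # rows of the matched table
ind.chr2 <- num_id[ind.chr]                          # rows of the full reference map
ind.chr3 <- match(ind.chr2, which(map_chr == chr))   # rows within the chr 2 LD matrix
```

Here the third matched variant is row 6 of the reference map, which is the second variant on chromosome 2, so it maps to row and column 2 of that chromosome's LD matrix.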

### Run LDpred2-auto

Running LDSC on a latent trait is difficult, so we use a placeholder heritability of 0.4 as the initial value (`h2_init`). This appeared to perform well for our use case.

In [None]:
map_ldref <- readRDS("ukb_ld/map.rds")

N_CORES = 32
# Load the LD matrix and matched summary statistics saved in the previous step
corr = readRDS(file = "corr_hm3_cystatin_prs.rds")
df_beta = readRDS(file = "df_beta_hm3_cystatin_prs.rds")

h2_est <- 0.4

# LDpred2-auto
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.9, N_CORES),
                               ncores = N_CORES)
beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
                    
saveRDS(multi_auto, file = "multi_auto_hm3_cystatin_prs.rds")
saveRDS(beta_auto, file = "beta_auto_hm3_cystatin_prs.rds")
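
Each value in `vec_p_init` starts one chain, so the call above runs `N_CORES` chains from different initial proportions of causal variants. As we understand it, `seq_log()` returns a sequence that is evenly spaced on the log scale; a base-R equivalent under that assumption is sketched below.

```r
N_CORES <- 32

# Assumed base-R equivalent of seq_log(1e-4, 0.9, N_CORES):
# evenly spaced on the log scale between the two end points
p_init <- exp(seq(log(1e-4), log(0.9), length.out = N_CORES))

# On a log-spaced grid the ratio between consecutive values is constant
ratios <- p_init[-1] / p_init[-N_CORES]
range(p_init)
```

A log-spaced grid puts most starting points at small values of p, which suits traits where only a small fraction of variants is expected to be causal.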

### Visualize chain

The authors recommend visually inspecting the trace of the p and h2 estimates to confirm that the model has converged; in our case it clearly does.

In [None]:
multi_auto = readRDS(file = "multi_auto_hm3_cystatin_prs.rds")
beta_auto = readRDS(file = "beta_auto_hm3_cystatin_prs.rds")
df_beta = readRDS(file = "df_beta_hm3_cystatin_prs.rds")

library(ggplot2)
auto <- multi_auto[[1]]
plot_grid(
  qplot(y = auto$path_p_est) + 
    theme_bigstatsr() + 
    geom_hline(yintercept = auto$p_est, col = "blue") +
    scale_y_log10() +
    labs(y = "p"),
  qplot(y = auto$path_h2_est) + 
    theme_bigstatsr() + 
    geom_hline(yintercept = auto$h2_est, col = "blue") +
    labs(y = "h2"),
  ncol = 1, align = "hv"
)
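
Visual inspection covers one chain at a time. To our understanding, the LDpred2 tutorial also describes an automatic filter that drops poorly behaved chains based on the spread of each chain's `corr_est` before averaging. The sketch below applies that idea to a toy list with made-up numbers; the presence of a `corr_est` element is an assumption, as it is only returned by more recent bigsnpr versions, and `toy_auto` is a hypothetical stand-in for `multi_auto`.

```r
# Toy stand-in for multi_auto: three chains with made-up values
toy_auto <- list(
  list(corr_est = c(-0.20, 0.00, 0.25), beta_est = c(0.10, 0.00, -0.05)),
  list(corr_est = c(-0.21, 0.01, 0.24), beta_est = c(0.12, 0.00, -0.07)),
  list(corr_est = c(-0.01, 0.00, 0.01), beta_est = c(0.00, 0.00, 0.00))
)

# Spread of corr_est per chain; a chain that collapsed has a small spread
rng <- sapply(toy_auto, function(auto) diff(range(auto$corr_est)))

# Keep chains whose spread is close to the largest spreads
keep <- which(rng > 0.95 * quantile(rng, 0.95, na.rm = TRUE))

# Average the effect sizes over the retained chains only
beta_keep <- rowMeans(sapply(toy_auto[keep], function(auto) auto$beta_est))
```

In this toy example the third chain has a much smaller spread than the other two and is excluded from the average.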

### Generate score

We create a `variant` column (chr:pos) as an alternative to matching by dbSNP ID. This is helpful for the exome sequencing cohort because it saves us from annotating each SNP with a dbSNP ID.

In [None]:
info_snp = df_beta

info_snp$beta_auto <- rowMeans(beta_auto)
info_snp$variant <- paste(info_snp$chr, info_snp$pos, sep=':')
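# Export SNP ID, effect allele (a1), LDpred2-auto weight and chr:pos label, selected by name
score = info_snp[, c('SNPID', 'a1', 'beta_auto', 'variant')]
write.table(score, file='UKB380_PGS_LDPRED2.tsv', quote=FALSE, sep='\t', row.names = FALSE)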
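
For reference, applying the exported weights to genotypes amounts to a weighted sum of effect-allele dosages per individual. The toy sketch below uses base R with a made-up dosage matrix and made-up weights; in practice this step would be run on the validation genotypes with a dedicated tool.

```r
# Toy dosage matrix: 3 individuals x 2 variants, counts of the effect allele (a1)
G <- matrix(c(0, 1, 2,
              2, 1, 0), nrow = 3,
            dimnames = list(paste0("id", 1:3), c("1:779322", "1:888659")))

# Toy weights keyed by the chr:pos variant label (values are made up)
weights <- data.frame(variant = c("1:888659", "1:779322"),
                      beta_auto = c(-0.02, 0.01))

# Align weights to the genotype columns by variant label, then sum dosage * beta
w <- weights$beta_auto[match(colnames(G), weights$variant)]
prs <- drop(G %*% w)
```

Matching on the variant label before multiplying guards against the weights file and the genotype columns being in a different order.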