## Generate PGS SNP list

LDpred2 recommends restricting the model to approximately 1 million HapMap3 SNPs. However, in each cohort we examined, around 1% of these SNPs are missing. To maximize portability of this PGS, we intersect the SNPs available in three QC'd cohorts (UK Biobank, TCGA, GTEx) with the HapMap3 variants to produce a single integrated SNP list.

## Load Hail

In [1]:
import seaborn as sns
import hail as hl
import os
from gnomad.utils.liftover import *
from gnomad.utils.annotations import *
from gnomad.sample_qc.pipeline import *
from gnomad.sample_qc.ancestry import *

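# Scratch directory shared by Spark and Hail
tmp = "/mnt/grid/janowitz/home/skleeman/tmp2"
os.environ["SPARK_LOCAL_DIRS"] = tmp

# Must be set before hl.init() so the Spark driver picks up the memory settings
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 200g --executor-memory 2g pyspark-shell"

# Local Spark with 16 cores; GRCh38 is the default reference for the TCGA and GTEx data
hl.init(default_reference='GRCh38', master='local[16]', min_block_size=128, local_tmpdir=tmp, tmp_dir=tmp)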

Running on Apache Spark version 2.4.5
SparkUI available at http://bamgpu04.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.63-0bc3808faa6d
LOGGING: writing to /mnt/grid/janowitz/home/skleeman/cystatinc/prs/hail-20210327-2019-0.2.63-0bc3808faa6d.log


## Process data

### Load each dataset

Lift over the GTEx and imputed TCGA variants from GRCh38 to GRCh37, the build used for the UKB summary statistics.

In [None]:
#UKB summary statistics (INFO > 0.8, MAF > 0.01)
ukb = hl.import_table('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin_vaf_effectflip.tsv', impute=True)
ukb.write('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/ukb.ht')

#TCGA imputed data (RSQ > 0.6, MAF > 0.001)

tcga = hl.read_matrix_table('/mnt/grid/janowitz/rdata_norepl/tcga_germline/hail/tcga_imputed_info0.6.mt')
tcga = hl.variant_qc(tcga) #Default Hail variant QC pipeline
tcga = tcga.filter_rows(tcga.variant_qc.AF[1] > 0.001) #Alt allele frequency > 0.1% (AF[1] is the alt allele, not necessarily the minor allele)
tcga_variants = tcga.rows()
tcga_variants = default_lift_data(tcga_variants)
tcga_variants.write('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/tcga.ht')

#GTEX WGS data (GTEX QC pipeline, V8, MAF > 0.001)

gtex = hl.read_matrix_table('/mnt/grid/janowitz/rdata_norepl/gtex/hail/gtex_raw.mt')
gtex = hl.variant_qc(gtex) #Default Hail variant QC pipeline
gtex = gtex.filter_rows(gtex.variant_qc.AF[1] > 0.001) #Alt allele frequency > 0.1% (AF[1] is the alt allele, not necessarily the minor allele)
gtex_variants = gtex.rows()
gtex_variants = default_lift_data(gtex_variants)
gtex_variants.write('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/gtex.ht')
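
The filters above threshold `variant_qc.AF[1]`, which in Hail is the frequency of the first alternate allele rather than the minor allele frequency. The two agree only when the alternate allele is also the minor allele. The pure-Python sketch below illustrates the difference for a biallelic site; the function names `minor_allele_frequency` and `passes_maf` are hypothetical and are not part of Hail or of this notebook.

```python
def minor_allele_frequency(alt_af: float) -> float:
    """Return the MAF of a biallelic site given its alternate allele frequency."""
    if not 0.0 <= alt_af <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    return min(alt_af, 1.0 - alt_af)


def passes_maf(alt_af: float, threshold: float = 0.001) -> bool:
    """True if the minor allele frequency exceeds the threshold."""
    return minor_allele_frequency(alt_af) > threshold


# Hypothetical site where the alternate allele is nearly fixed
alt_af = 0.9995
print("alt-AF filter keeps site:", alt_af > 0.001)
print("MAF filter keeps site:", passes_maf(alt_af))
```

For biallelic sites, a MAF-based filter in Hail could probably be written as `hl.min(mt.variant_qc.AF) > 0.001`. Because HapMap3 SNPs are common variants, the distinction likely affects few of the SNPs retained downstream, but that has not been checked here.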

### Filter to HapMap3 SNPs

In [4]:
hm3 = hl.import_table('/mnt/grid/janowitz/home/skleeman/cystatinc/prs/UKB380_PGS_LDPRED2.tsv', impute=True)
hm3 = hm3.key_by('SNPID')

ukb = hl.read_table('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/ukb.ht')
ukb = ukb.filter(hl.is_defined(hm3[ukb.SNP]))
ukb = ukb.annotate(alleles=[ukb.A2, ukb.A1],
                   locus=hl.locus(hl.str(ukb.CHR), ukb.BP, reference_genome='GRCh37'))
ukb = ukb.key_by('locus','alleles')
ukb.count()

2021-03-27 20:32:57 Hail: INFO: Reading table to impute column types
2021-03-27 20:32:59 Hail: INFO: Finished type imputation
  Loading field 'SNPID' as type str (imputed)
  Loading field 'a1' as type str (imputed)
  Loading field 'beta_auto' as type float64 (imputed)
  Loading field 'variant' as type str (imputed)
2021-03-27 20:33:05 Hail: INFO: Coerced sorted dataset
2021-03-27 20:33:06 Hail: INFO: Ordering unsorted dataset with network shuffle


1043492
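
The cell above re-keys the UKB summary statistics by `locus` and `alleles`, with `A2` listed before `A1`. Hail places the reference allele first in `alleles`, so this ordering presumably treats `A2` as the reference allele; that is an assumption about the summary statistics file, not something the notebook states. A minimal pure-Python sketch of the same key construction, using a made-up row and a hypothetical `make_key` helper:

```python
from typing import NamedTuple, Tuple


class VariantKey(NamedTuple):
    contig: str
    position: int
    alleles: Tuple[str, str]  # (assumed reference, assumed alternate)


def make_key(row: dict) -> VariantKey:
    """Build a (contig, position, alleles) key from a summary-statistics row."""
    return VariantKey(str(row["CHR"]), int(row["BP"]), (row["A2"], row["A1"]))


# Hypothetical row; the rsID and coordinates are invented for illustration
row = {"SNP": "rs_example", "CHR": 1, "BP": 12345, "A1": "T", "A2": "C"}
key = make_key(row)
print(key)
```

Keying on alleles as well as position means that later joins only match variants whose allele pair and orientation agree across datasets.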

### Filter to TCGA SNPs

In [6]:
tcga_variant = hl.read_table('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/tcga.ht')
ukb = ukb.semi_join(tcga_variant)

ukb.count()

2021-03-27 20:36:25 Hail: INFO: Coerced sorted dataset
2021-03-27 20:36:26 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:36:36 Hail: INFO: Ordering unsorted dataset with network shuffle


1037629
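
`semi_join` keeps the rows of the left table whose key is present in the right table and adds no fields from the right table. A rough pure-Python analogue is sketched below; the variants are invented, and the `semi_join` function here is a stand-in for illustration rather than Hail's implementation.

```python
from typing import Dict, Set, Tuple

# (contig, position, (ref, alt)), mirroring a locus + alleles key
VariantKey = Tuple[str, int, Tuple[str, str]]


def semi_join(left: Dict[VariantKey, dict], right_keys: Set[VariantKey]) -> Dict[VariantKey, dict]:
    """Keep rows of `left` whose key appears in `right_keys`; rows are unchanged."""
    return {key: row for key, row in left.items() if key in right_keys}


# Hypothetical summary-statistics rows
ukb_rows = {
    ("1", 1000, ("A", "G")): {"SNP": "rs_example1"},
    ("1", 2000, ("C", "T")): {"SNP": "rs_example2"},
    ("2", 3000, ("G", "A")): {"SNP": "rs_example3"},
}
# Hypothetical cohort variants; the last one has its alleles in the opposite order
tcga_keys = {("1", 1000, ("A", "G")), ("2", 3000, ("A", "G"))}

kept = semi_join(ukb_rows, tcga_keys)
print(sorted(kept))
```

In this sketch the variant at position 3000 is dropped even though both datasets contain the site, because the allele order differs. Some of the SNPs lost at each filtering step may be mismatches of this kind rather than truly absent variants, although the notebook does not distinguish the two cases.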

### Filter to GTEx SNPs

In [8]:
gtex_variant = hl.read_table('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/gtex.ht')
ukb = ukb.semi_join(gtex_variant)

ukb.count()

2021-03-27 20:43:23 Hail: INFO: Coerced sorted dataset
2021-03-27 20:43:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:43:36 Hail: INFO: Ordering unsorted dataset with network shuffle


1031527
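
The three `count()` results give the attrition at each step. The short sketch below tabulates them; the counts are taken from the cell outputs in this notebook, while the step labels and the `attrition` helper are introduced here for illustration.

```python
# Variant counts reported by ukb.count() after each filter
counts = [
    ("HapMap3 SNPs in UKB summary statistics", 1043492),
    ("also present in TCGA", 1037629),
    ("also present in GTEx", 1031527),
]


def attrition(steps):
    """Return (label, n, dropped since previous step, fraction of first count retained)."""
    first = steps[0][1]
    prev = first
    rows = []
    for label, n in steps:
        rows.append((label, n, prev - n, n / first))
        prev = n
    return rows


for label, n, dropped, frac in attrition(counts):
    print(f"{label}: {n} variants, {dropped} dropped, {frac:.2%} of the starting set retained")
```

Each cohort removes roughly 0.6% of the remaining SNPs, for a combined loss of a little over 1% of the starting set, in line with the per-cohort missingness noted in the introduction.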

### Save inner SNPs

In [9]:
ukb.export('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin_vaf_effectflip_innersnp.tsv')

variants = ukb.select(rsid=ukb.SNP)
print(variants.count())

variants.write('/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/innersnps.ht',overwrite=True)

2021-03-27 20:46:28 Hail: INFO: Coerced sorted dataset
2021-03-27 20:46:30 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:46:36 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:47:20 Hail: INFO: merging 6 files totalling 107.2M...
2021-03-27 20:47:21 Hail: INFO: while writing:
    /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin_vaf_effectflip_innersnp.tsv
  merge time: 1.165s
2021-03-27 20:47:28 Hail: INFO: Coerced sorted dataset
2021-03-27 20:47:30 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:47:40 Hail: INFO: Ordering unsorted dataset with network shuffle


1031527


2021-03-27 20:48:31 Hail: INFO: Coerced sorted dataset
2021-03-27 20:48:33 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:48:41 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-03-27 20:49:25 Hail: INFO: wrote table with 1031527 rows in 6 partitions to /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/snplist/innersnps.ht
    Total size: 20.11 MiB
    * Rows: 20.11 MiB
    * Globals: 11.00 B
    * Smallest partition: 143242 rows (2.80 MiB)
    * Largest partition:  208544 rows (4.04 MiB)
