## Variant reference

This notebook summarizes the generation of the variant reference file for GenomicSEM.

### Prerequisites

In [None]:
import seaborn as sns
import hail as hl
import os
from gnomad.utils.liftover import *
from gnomad.utils.annotations import *
from gnomad.sample_qc.pipeline import *
from gnomad.sample_qc.ancestry import *

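# Scratch directory for Spark and Hail temporary files
tmp = "/mnt/grid/janowitz/home/skleeman/tmp2"
os.environ["SPARK_LOCAL_DIRS"] = tmp

# Spark memory settings; these must be set before hl.init() starts the Spark session
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 200g --executor-memory 2g pyspark-shell"

# Run Hail locally on 16 cores; GRCh38 is the default reference genome
hl.init(default_reference='GRCh38', master='local[16]', min_block_size=128, local_tmpdir=tmp, tmp_dir=tmp)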

### Generate reference

Use the BOLT-LMM output from the GWAS to extract the allele frequency of the effect allele. Because the GWAS is performed separately in each super-population, this gives accurate allele frequencies, for the super-populations we defined, across all imputed SNPs with INFO > 0.8. Orient the alleles against the GRCh37 reference so that A1 is ALT and A2 is REF, then calculate the variant (ALT) allele frequency. Annotate each SNP with MAF, where MAF = VAF if VAF <= 0.5 and MAF = 1 - VAF if VAF > 0.5. We create two references, one for MAF > 0.5% and one for MAF > 1%; the MAF > 0.5% reference is used for the GenomicSEM analysis.
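
The orientation and MAF rules described above can be sketched in plain Python for a single SNP. This is only an illustration of the logic, not the code the notebook runs (the Hail cell that follows does the real work on whole tables). The function name `orient_to_reference` and its arguments are hypothetical, and the sketch leaves out the SNP-only filter that the Hail code applies.

```python
def orient_to_reference(a1, a2, a1_freq, ref_base):
    """Orient one SNP against the reference base (illustrative helper).

    a1, a2   : the two alleles reported for the SNP
    a1_freq  : frequency of a1 (the VAF in the notebook's terms)
    ref_base : reference base at the SNP position

    Returns (alt, alt_freq, maf), or None when neither allele
    matches the reference base.
    """
    if a1 == ref_base:
        # a1 is the reference allele, so a2 is ALT with the complementary frequency
        alt, alt_freq = a2, 1.0 - a1_freq
    elif a2 == ref_base:
        # a2 is the reference allele, so a1 is already ALT
        alt, alt_freq = a1, a1_freq
    else:
        # neither allele matches the reference; such SNPs are treated as missing
        return None

    # MAF is the frequency of the rarer allele, whichever one that is
    maf = a1_freq if a1_freq <= 0.5 else 1.0 - a1_freq
    return alt, alt_freq, maf


# Example: a1 is the reference base, so ALT is a2 and its frequency is 1 - a1_freq
print(orient_to_reference("A", "G", 0.25, "A"))
```

MAF does not depend on which allele is labelled ALT, so it can be computed before or after the orientation step.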

In [None]:
populations=["AFR","CSA","EAS", "EUR"]
#populations=["EUR"]

rg = hl.get_reference('GRCh37')
rg.add_sequence('/mnt/grid/janowitz/home/references/liftover/human_g1k_v37.fasta.gz') 

for pop in populations:
    folder = '/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/' + pop +'/'
    
    summary1 = hl.import_table(folder + 'cystatin_bolt_stats_bgen.gz', impute=True, force=True)

    summary1 = summary1.annotate(locus=hl.locus(hl.str(summary1.CHR), summary1.BP,reference_genome='GRCh37'))
    summary1 = summary1.key_by(summary1.locus)
    
    summary1 = summary1.select(CHR = summary1.CHR, BP = summary1.BP, SNP = summary1.SNP, A1 = summary1.ALLELE1, A2 = summary1.ALLELE0, VAF=summary1.A1FREQ)
    
    summary1 = summary1.annotate(
        MAF = hl.if_else(summary1.VAF <= 0.5, summary1.VAF, 1 - summary1.VAF),
        REF = summary1.locus.sequence_context())

    summary1 = summary1.annotate(
        ALT = (hl.case()
            .when(summary1.A1 == summary1.REF, summary1.A2)
            .when(summary1.A2 == summary1.REF, summary1.A1)
            .or_missing()),
        ALT_FREQ = (hl.case()
            .when(summary1.A1 == summary1.REF, 1-summary1.VAF)
            .when(summary1.A2 == summary1.REF, summary1.VAF)
            .or_missing()),
    )
    
    summary1 = summary1.filter(hl.is_snp(summary1.REF, summary1.ALT))
    summary1 = summary1.key_by(summary1.SNP)
    summary1 = summary1.select(CHR = summary1.CHR, BP = summary1.BP, MAF=summary1.MAF, A1 = summary1.ALT, A2 = summary1.REF, ALT_FREQ = summary1.ALT_FREQ)
    
    summary1 = summary1.filter(summary1.MAF > 0.005)
    print(summary1.count())
    summary1.export('/mnt/grid/janowitz/home/references/maf/panukb_snps0.005_'+ pop +'.tsv')
    
    summary1 = summary1.filter(summary1.MAF > 0.01)
    print(summary1.count())
    summary1.export('/mnt/grid/janowitz/home/references/maf/panukb_snps0.01_'+ pop +'.tsv')

### Remove duplicate SNPs

The GenomicSEM authors advised that multi-allelic/duplicate SNPs (a dbSNP ID that appears more than once in the variant reference file) need to be removed. From an email from Michael Nivard: "I quickly discussed this, and we see some issues, for one its hard to determine what the variance of a multi-alleic variant is. This in turn will cause issues down stream when defining the variance of a SNP in the covariance matrix."
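
To make the rule concrete: every row whose SNP ID occurs more than once is dropped entirely, rather than keeping one copy. A minimal Python sketch of that rule is shown here, assuming the file has a header line and the SNP ID in the first tab-separated column. The function name `drop_repeated_snps` is hypothetical. Unlike the awk command in the cell below, this sketch preserves the input row order.

```python
from collections import Counter


def drop_repeated_snps(lines):
    """Keep the header plus rows whose SNP ID (first column) occurs exactly once."""
    header, *rows = lines
    # Count how many times each SNP ID appears in the data rows
    counts = Counter(row.split("\t")[0] for row in rows)
    # Drop every row of an ID seen more than once, not just the extra copies
    return [header] + [row for row in rows if counts[row.split("\t")[0]] == 1]


# Toy input: rs1 appears twice, so both of its rows are removed
example = ["SNP\tCHR", "rs1\t1", "rs2\t1", "rs1\t1", "rs3\t2"]
print(drop_repeated_snps(example))
```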

In [None]:
%%bash

cd /mnt/grid/janowitz/home/references/maf/

# Keep the header, then keep only rows whose SNP ID (column 1) occurs exactly once
for maf in 0.005 0.01; do
    for pop in AFR CSA EAS EUR; do
        awk 'NR == 1 ; NR > 1 {a[$1]++;b[$1]=$0}END{for(x in a)if(a[x]==1)print b[x]}' \
            "panukb_snps${maf}_${pop}.tsv" > "panukb_snps${maf}_${pop}_dedup.tsv"
    done
done