# Post-GWAS analyses

# Aim

The aim of this notebook is to carry out post-GWAS analyses, such as annotating SNPs to genes, expression and pathway analyses, among others.

Here [snpGeneSets v1.12](https://www.umc.edu/SoPH/Departments-and-Faculty/Data-Science/Research/Services/Software.html) and ANNOVAR are used; FUMA is a web-based resource that is also useful for this purpose.


## To run this notebook

## 1. bim_from_plink
```
sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb bim_from_plink \
    --cwd ~/output\
    --bimfiles ~/ukb23155_c{1..22}_b0_v1.bim \
    --bim_name ~/ukb23155_chr1_chr22.bim \
    --container_annovar ~/annovar.sif
```

## 2. bim_from_bgen

```
sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb bim_from_bgen \
    --cwd ~/output\
    --genoFile ~/ukb_imp_chr{1..22}_v3.bgen \
    --bim_name "imputed_variants" \
    --container_lmm ~/lmm.sif
```

## 3. annovar

Set `--rsid` to True if the SNP field of the summary statistics contains rsIDs, and to False if it contains `chr:pos:ref:alt` identifiers.

```
sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd ~/output\
    --p_filter 0.05 \
    --rsid True \
    --sumstatsFile *.snp_stats.gz \
    --bim_name ~/imputed_variants.merged.bim \
    --humandb /gpfs/ysm/datasets/db/annovar/humandb \
    --xref_path /gpfs/gibbs/pi/dewan/data/UKBiobank \
    --build 'hg19' \
    --container_lmm ~/annovar.sif
```
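
To decide what to pass to `--rsid`, it can help to look at a few identifiers from the SNP column first. The sketch below is only an illustration: `looks_like_rsid`, its `threshold` argument and the sample identifiers are hypothetical and are not part of `snptogene.ipynb`.

```python
import re

# Hypothetical helper (not part of the workflow): guess whether variant
# identifiers are rsIDs or follow the chr:pos:ref:alt convention.
RSID_PATTERN = re.compile(r"^rs\d+$")

def looks_like_rsid(snp_ids, threshold=0.5):
    """Return True when more than `threshold` of the IDs look like 'rs<digits>'."""
    ids = [str(s) for s in snp_ids]
    if not ids:
        raise ValueError("no SNP identifiers supplied")
    n_rsid = sum(1 for s in ids if RSID_PATTERN.match(s))
    return n_rsid / len(ids) > threshold

# Made-up identifiers for illustration
print(looks_like_rsid(["rs123", "rs456", "1:1000:A:G"]))  # True
print(looks_like_rsid(["1:1000:A:G", "2:2000:C:T"]))      # False
```

A majority vote is used rather than checking a single row because a file can mix both identifier styles.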


In [1]:
sos run snptogene.ipynb -h


usage: sos run snptogene.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bim_from_plink
  bim_from_bgen
  annovar
  snp_to_gene

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --annovar-module '\nmodule load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0\necho "Module annovar loaded"\n{cmd}\n'
                        Load annovar module from cluster
  --container-annovar 'gaow/gatk4-annovar'
                        Software container option
  --container-lmm 'statisticalgenetics/lmm:2.3'
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 15h
         

In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# Load annovar module from cluster
parameter: annovar_module = '''
module load ANNOVAR/2020Jun08-foss-2018b-Perl-5.28.0
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_annovar = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:2.3'
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "15h"
# Memory expected
parameter: mem = "30G"
# Number of threads
parameter: numThreads = 10

## Step to merge `*.bim` files from plink-formatted data (e.g. exome data in the UKBB, genotype array data)

In [None]:
# Merge all the *.bim files into a single file. Needs to be run once per type of data (e.g. genotype, exome)
[bim_from_plink]
# Path to the *.bim files to merge
parameter: bimfiles= paths
# Specify path of the merged bim file
parameter: bim_name = path
input: bimfiles 
output: bim_name
task: trunk_workers = 1, walltime = '10h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      # Overwrite (rather than append to) the merged file so that reruns do not duplicate variants
      cat ${_input} > ${_output}
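
The merge is a plain concatenation of the per-chromosome files. As a minimal sketch of the same operation in Python, the hypothetical helper `merge_bim` below opens the output in write mode, so running it twice gives the same file. The file names and variants are made up for the demonstration.

```python
import os
import tempfile

def merge_bim(bim_paths, merged_path):
    """Concatenate .bim files in the given order; return the number of rows written."""
    n_rows = 0
    # Mode "w" truncates an existing file, so a rerun never duplicates rows
    with open(merged_path, "w") as out:
        for bim in bim_paths:
            with open(bim) as handle:
                for line in handle:
                    if line.strip():  # skip blank lines
                        out.write(line.rstrip("\n") + "\n")
                        n_rows += 1
    return n_rows

# Toy demonstration with two one-line files (made-up variants)
with tempfile.TemporaryDirectory() as tmp:
    chr1 = os.path.join(tmp, "chr1.bim")
    chr2 = os.path.join(tmp, "chr2.bim")
    with open(chr1, "w") as fh:
        fh.write("1 1:1000:A:G 0 1000 G A\n")
    with open(chr2, "w") as fh:
        fh.write("2 2:2000:C:T 0 2000 T C\n")
    merged = os.path.join(tmp, "merged.bim")
    merge_bim([chr1, chr2], merged)
    print(merge_bim([chr1, chr2], merged))  # the second run still writes 2 rows
```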

## Step to create a list of variants from `*.bgen` files and a merged `*.bim` file to annotate (e.g. imputed genotype data in the UKBB)

In [None]:
# Create a merged *.bim file from mfi files
[bim_from_bgen]
# Specify bgen files path
parameter: genoFile = paths
# Specify name of the merged bim file
parameter: bim_name = str
# The input here is the bgen file from which to extract the list of variants
input: genoFile, group_by=1
output: f'{cwd}/{_input:bn}.bim'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # Create the bim for each mfi 
    # From the ukb_mfi file get the chr, alternate_id, pos, allele 1 (alternative and usually minor) and allele 2 (reference and usually major)
    # The chromosome is everything before the first ":" of the alternate_id, so two-digit chromosomes are kept whole (assumes IDs of the form chr:pos_ref_alt)
    # Add the 0 cM column 
    awk '{ gsub("_",":",$1); split($1, id, ":"); print id[1], $1, $3, $5, $4 }' ${_input} | awk 'BEGIN{FS=OFS=" "}{$2 = $2 OFS 0}1' > ${_output}
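
The column shuffling done by the two `awk` calls can be hard to follow. The sketch below reproduces it for a single line in Python. It assumes, as the comments in the step state, that the input columns are alternate_id, rsid, position, allele 1 and allele 2, and that the alternate_id has the form `chr:pos_ref_alt`. `mfi_to_bim_row` and the example line are hypothetical.

```python
def mfi_to_bim_row(line):
    """Convert one whitespace-delimited mfi line into a 6-column .bim row.

    Assumed input columns: alternate_id, rsid, position, allele1, allele2, ...
    Output columns: chr, variant ID, cM (always 0), position, allele2, allele1
    """
    fields = line.split()
    alt_id, pos, allele1, allele2 = fields[0], fields[2], fields[3], fields[4]
    variant_id = alt_id.replace("_", ":")   # chr:pos_ref_alt -> chr:pos:ref:alt
    chrom = variant_id.split(":")[0]        # works for one- and two-digit chromosomes
    return " ".join([chrom, variant_id, "0", pos, allele2, allele1])

# Made-up example line on chromosome 10
print(mfi_to_bim_row("10:12345_A_G rs1 12345 A G 0.1 G 0.98"))
# 10 10:12345:A:G 0 12345 G A
```

Splitting on `:` rather than taking the first character of the identifier matters for chromosomes 10 to 22.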

## Steps to annotate summary statistics files using annovar

In [None]:
# Get the list of significantly associated SNPs
[annovar_1]
# Column name for BP
parameter: bp = 'POS'
# Column name for p-value
parameter: pval = 'P'
# Column name for SNP
parameter: snp = 'SNP'
# Set p-value to filter for annotations
parameter: p_filter=5e-8
# If the data contains rsid instead of chr:pos:ref:alt
parameter: rsid = False
# Path sumstats file
parameter: sumstatsFile = path
input: sumstatsFile
output: f'{cwd}/{_input:bnn}.snp_annotate'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: container=container_lmm, expand='${ }', stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library('dplyr')
    # Import the sumstats file as dataframe
    data <- read.table(gzfile('${_input}'), sep='\t', header=T)
    # Keep SNPs with a p-value below the --p_filter threshold (default 5e-8)
    # Subset the data to SNP, BETA, SE and P for gene mapping
    # For imputed data the SNP field may need to change from rsid to chr:pos:ref:alt (see the commented-out line below)
    if (${"TRUE" if rsid else "FALSE"}){
      sig.p <- data %>%
         filter(P < ${p_filter}) %>%
         select(CHR, POS, REF, ALT, BETA, SE, P, SNP)
      #sig.p$SNP <- paste(sig.p$CHR, sig.p$POS, sig.p$REF, sig.p$ALT, sep=":")
      sig.p <- sig.p %>%
          select(SNP,BETA,SE,P)
    } else {
      sig.p <- data %>%
      filter(P < ${p_filter}) %>%
      select(SNP,BETA,SE,P)
    }  
    write.table(sig.p, '${_output}', sep = " ", quote=FALSE, row.names=FALSE, col.names=FALSE) 
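
As a rough illustration of what this step produces, the sketch below applies the same p-value filter to a tab-separated table using only the Python standard library. `filter_sumstats` is a hypothetical name, the column names follow the defaults above (`SNP`, `BETA`, `SE`, `P`), and the two records are invented.

```python
import csv
import io

def filter_sumstats(handle, p_filter=5e-8, snp="SNP", pval="P"):
    """Return 'SNP BETA SE P' strings for rows whose p-value is below p_filter."""
    kept = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        try:
            p = float(rec[pval])
        except ValueError:
            continue  # skip rows with a non-numeric p-value such as NA
        if p < p_filter:
            kept.append(" ".join([rec[snp], rec["BETA"], rec["SE"], rec[pval]]))
    return kept

# Invented summary statistics: only the first variant passes the filter
text = (
    "CHR\tPOS\tREF\tALT\tSNP\tBETA\tSE\tP\n"
    "1\t1000\tA\tG\trs1\t0.10\t0.01\t1e-10\n"
    "1\t2000\tC\tT\trs2\t0.02\t0.01\t0.04\n"
)
print(filter_sumstats(io.StringIO(text)))  # ['rs1 0.10 0.01 1e-10']
```

The output has no header and is space-separated, which is the layout the next step reads.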

In [None]:
# Get chr, start, end, ref_allele, alt_allele format
[annovar_2]
parameter: bim_name = path
output: f'{_input:n}.avinput'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' 
    # Find the variants from the previous step (filtered summary stats) in the bim file created by bim_from_plink or bim_from_bgen
    # Make sure the variants use the same format in both files (e.g. chr1 vs 1); inconsistent files cause wrong results
    awk -F" " 'NR==FNR{a[$1]=$1" "$2" "$3" "$4; next} ($2 in a){print $1,$2,$3,$4,$5,$6,a[$2]}' ${_input} ${bim_name} > ${_output:n}.tmp
    # create the annovar input file, the imputed data has only bi-allelic variants
    awk '{print $1, $4, $4 + (length($6) - 1), $6, $5, $7, $8, $9, $10}' ${_output:n}.tmp >  ${_output}
    # remove temporary files
    rm -f ${_output:n}.tmp 
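
The second `awk` call derives the ANNOVAR end coordinate from the start position and the length of the reference allele. A minimal Python sketch of that arithmetic follows. `to_avinput` is a hypothetical helper and the variants are made up.

```python
def to_avinput(chrom, pos, ref, alt, *extra):
    """Build one space-separated avinput line: chr, start, end, ref, alt, extra fields."""
    start = int(pos)
    end = start + len(ref) - 1  # a single-base reference allele gives end == start
    return " ".join([str(chrom), str(start), str(end), ref, alt, *map(str, extra)])

# A SNV and a two-base deletion (made-up variants)
print(to_avinput("1", 100, "A", "G", "rs1", "0.10", "0.01", "1e-10"))
# 1 100 100 A G rs1 0.10 0.01 1e-10
print(to_avinput("1", 100, "CAT", "C"))
# 1 100 102 CAT C
```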

In [None]:
# Annotate variants file using ANNOVAR
[annovar_3]
# humandb path for ANNOVAR
parameter: humandb = path
# Path to x-ref file
parameter: xref_path = path
# Human genome build hg19 or hg38
parameter: build = 'hg38'
# Annovar protocol
if build == 'hg19':
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements46way', 'gwasCatalog', 'gnomad211_exome', 'avsnp150', 'dbnsfp42a', 'dbscsnv11', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'g', 'r', 'r', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '']
else:
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements30way', 'encRegTfbsClustered', 'gwasCatalog', 'gnomad30_genome', 'gnomad211_exome', 'gme', 'kaviar_20150923', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'clinvar_20220320', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '']
    
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{xref_path}/mart_export_2021_LOFtools.txt")
output: anno_file = f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash: container=container_annovar, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    # Do not add -intronhgvs as an option: it writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    # -nastring can only be . for VCF files
    # regsnpintron might cause shifted lines (use with care)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn}\
        -otherinfo\
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)}\
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout \
        -xreffile ${x_ref} 
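
`table_annovar.pl` pairs the comma-separated `-protocol`, `-operation` and `-arg` lists by position, so the three lists must have the same length. The sketch below is a hypothetical pre-flight check (`build_annovar_options` is not part of the workflow) run on a shortened version of the lists above. It only builds the option strings and does not call ANNOVAR.

```python
def build_annovar_options(protocol, operation, arg):
    """Check that the three lists line up and return the joined command-line options."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"length mismatch: {len(protocol)} protocols, "
            f"{len(operation)} operations, {len(arg)} args"
        )
    return [
        "-protocol", ",".join(protocol),
        "-operation", ",".join(operation),
        "-arg", ",".join(arg),
    ]

# Shortened lists for illustration; empty strings mean no extra argument
protocol = ["refGene", "phastConsElements46way", "avsnp150"]
operation = ["g", "r", "f"]
arg = ["-splicing 12 -exonicsplicing", "", ""]
print(build_annovar_options(protocol, operation, arg))
```

Running such a check before submitting the task can catch a database that was added to `protocol` but not to `operation` or `arg`.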

In [None]:
# Re-format the annovar csv to have the BETA, SE and P in the front and with headers
[annovar_4]
input: named_output('anno_file')
output: f'{cwd}/{_input:bn}.formatted.csv'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
python: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' 
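    import pandas as pd
    # Read the ANNOVAR csv output
    df = pd.read_csv('${_input}', header=0)
    # Otherinfo1 is expected to hold the space-separated variant ID, BETA, SE and P
    df1 = df[["Otherinfo1"]].astype(str)
    # Split at most three times so that exactly four columns are produced
    df2 = df1["Otherinfo1"].str.split(" ", n = 3, expand = True)
    df2.columns = ["alternate_id", "BETA", "SE", "P"]
    # Put the association statistics in front of the annotation columns
    df = df2.join(df)
    df.to_csv('${_output}', index=False)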
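
The reformatting step relies on `Otherinfo1` holding four space-separated fields. The sketch below shows the same split in plain Python with an explicit check on the number of fields. `split_otherinfo` is a hypothetical helper and the example value is made up.

```python
FIELDS = ("alternate_id", "BETA", "SE", "P")

def split_otherinfo(value):
    """Split an Otherinfo1 string into the four expected fields."""
    parts = str(value).split(" ", 3)  # at most three splits, so at most four parts
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {value!r}")
    return dict(zip(FIELDS, parts))

print(split_otherinfo("1:1000:A:G 0.10 0.01 1e-10"))
# {'alternate_id': '1:1000:A:G', 'BETA': '0.10', 'SE': '0.01', 'P': '1e-10'}
```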

In [2]:
# Annotate snps to gene
[snp_to_gene]
# Column name for BP
parameter: bp = 'POS'
# Column name for p-value
parameter: pval = 'P'
# Column name for SNP
parameter: snp = 'SNP'
# Path sumstats file
parameter: sumstatsFile = path
# Genome assembly as a GRCh number: 37 (hg19) or 38 (hg38)
parameter: hg = int
input: sumstatsFile
output: f'{_input:nn}.gene_ann'      
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand='${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    library('snpGeneSets')
    library('dplyr')
    # Import the sumstats file as dataframe
    data <- read.table(gzfile('${_input}'), header=T)
    head(data)
    # Filter SNPs with p-val <5e-8
    # Subset data to obtain only chr, pos and snp for gene mapping
    sig.p <- data %>%
      filter(P < 5e-8) %>%
      mutate(chr = CHR,
             pos = ${bp},
             snp = as.character(${snp})) %>%
      select(chr, pos, snp)
    head(sig.p)
    # Get the annotation of SNPs with different genome assemblies
    snpMapAnn<- getSNPMap(sig.p$snp, GRCh=${hg})
    # Mapping SNPs to genes (define gene boundary ‘up’ for the upstream region and ‘down’ for the downstream region with default value of 2,000 bp for both)
    snpGeneMapAnn<- snp2Gene(snpMapAnn$rsid_map$snp)
    cat("The unique number of genes is",length(unique(snpGeneMapAnn$map$gene_id)),"\n")
    cat("The number of variants that could not be mapped to a gene is:",length(snpGeneMapAnn$other),"\n")
    #Get the gene-name and gene-id for the mapped variants
    gene_mapped <- getGeneMap(snpGeneMapAnn$map$gene_id)$gene_map
    # Merge the datasets
    snp_gene = merge(x = snpMapAnn$rsid_map,y = snpGeneMapAnn$map[,c("snp", "gene_id")],by="snp", all.x=TRUE)
    snp_gene_2 = merge(x = snp_gene,y = gene_mapped[,c("gene_id", "gene_name")],by="gene_id", all.x=TRUE)
    names(snp_gene_2)[names(snp_gene_2) == 'snp'] <- 'SNP'
    snp_gene_3 = merge(x = snp_gene_2,y = data[,c("A1", "A2", "N", "AF1","P","BETA", "SE", "INFO","SNP")],by="SNP", all.x=TRUE)
    # Get the final table with ordered pval
    final_gene_set <- snp_gene_3 %>%
     select(chr, ${snp}, pos, A1, A2, N, AF1, BETA, SE, ${pval}, INFO, gene_id, gene_name) %>%
     arrange(P)
    names(final_gene_set)[names(final_gene_set) == 'chr'] <- 'CHR'
    names(final_gene_set)[names(final_gene_set) == 'pos'] <- 'POS'
    # Write results to a table
    write.table(final_gene_set, '${_output}', sep = "\t", quote=FALSE, row.names=FALSE)
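
The merges above use `all.x=TRUE`, so a SNP that maps to no gene is kept with missing gene fields, and a SNP that maps to several genes produces one row per gene. The sketch below mimics that left-join behaviour in plain Python. `annotate_snps` and all identifiers in the example are hypothetical and do not come from snpGeneSets.

```python
def annotate_snps(snps, snp_gene_pairs, gene_names):
    """Left-join SNPs to genes: one row per (SNP, gene), None when unmapped."""
    genes_by_snp = {}
    for snp, gene_id in snp_gene_pairs:
        genes_by_snp.setdefault(snp, []).append(gene_id)
    rows = []
    for snp in snps:
        # An unmapped SNP still yields one row, with gene_id set to None
        for gene_id in genes_by_snp.get(snp, [None]):
            rows.append((snp, gene_id, gene_names.get(gene_id)))
    return rows

# Placeholder identifiers: rs1 maps to two genes, rs2 to none
pairs = [("rs1", "G1"), ("rs1", "G2")]
names = {"G1": "GENE_A", "G2": "GENE_B"}
for row in annotate_snps(["rs1", "rs2"], pairs, names):
    print(row)
```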