# PTWAS Implementation in R


This module contains the software implementation for transcriptome-wide association analysis (TWAS). These methods are designed to perform rigorous causal inference connecting genes to complex traits. The statistical models and the key algorithms are described in the [manuscript](https://www.biorxiv.org/content/10.1101/808295v1).  
**The original PTWAS uses the output from DAP-G and is written in C++. Here we use SuSiE objects instead and port the code to R.**

## Overview

The goal of this module is to perform PTWAS analysis from SuSiE objects, including:
1. Extract weights from eQTL SuSiE objects.
2. Convert GWAS summary statistics to a z-score format.
3. Run PTWAS with R code.
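
To make step 1 concrete, here is a minimal sketch of turning a fitted SuSiE object into per-variant weights. It assumes the weights are the SuSiE posterior mean effects, computed from the `alpha`, `mu` and `X_column_scale_factors` fields of the fit; the weight definition actually used by this module may differ. The function name `susie_weights_sketch`, the toy `fit` list and the variant IDs are hypothetical and only stand in for a real `susie_file`.

```r
# Hypothetical stand-in for a fitted SuSiE object:
# 2 single effects (rows) x 4 variants (columns).
fit <- list(
  alpha = rbind(c(0.90, 0.05, 0.03, 0.02),   # posterior inclusion probabilities per effect
                c(0.10, 0.10, 0.70, 0.10)),
  mu    = rbind(c(0.50, 0.10, 0.00, 0.00),   # posterior mean effect given inclusion
                c(0.00, 0.00, -0.30, 0.00)),
  X_column_scale_factors = rep(1, 4)         # scaling applied to genotype columns
)
variant_ids <- c("chr1_100_A_G", "chr1_200_C_T", "chr1_300_G_A", "chr1_400_T_C")

# Assumption: weight = posterior mean effect, summed over single effects
# and mapped back to the original genotype scale.
susie_weights_sketch <- function(fit, variant_ids) {
  w <- colSums(fit$alpha * fit$mu) / fit$X_column_scale_factors
  data.frame(variant = variant_ids, weight = as.numeric(w), stringsAsFactors = FALSE)
}

weights <- susie_weights_sketch(fit, variant_ids)
print(weights)
```

In practice the toy `fit` would be replaced by `readRDS()` on each `susie_file`, and the variant IDs would come from the fitted object itself.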


### Input
1. QTL susie table:
    - This table has two columns, `moleculart_trait_id` and `susie_file`: the target gene and the corresponding SuSiE output RDS file, respectively.
2. GWAS summary statistics (TSV format)
3. LD reference

### Output
1. susie weight table
2. re-formatted GWAS sumstats results
3. PTWAS results

In [None]:
[global]
# Workdir
parameter: cwd = path("output")
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 100
parameter: walltime = "1h"
parameter: mem = "16G"
parameter: numThreads = 1

## GWAS data_prep 



In [None]:
[gwas_ptwas_prep]
parameter: ptwas_weights=""
parameter: gwas_basepath=""
input: ptwas_weights
output: f"{cwd}/ptwas/{gwas_basepath.rstrip('/').split('/')[-1]}.gambitgwas.txt"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(tidyverse)
    target_variants <- read_delim(
        "${ptwas_weights}",
        delim = "\t",
        show_col_types = FALSE) %>%
        mutate(id = gsub(":", "_", variant)) %>%
        pull(id) %>%
        unique()

    gwas_basepath <- "${gwas_basepath}"
    gwas <- data.frame(matrix(ncol=12, nrow=0))

    for (gwasfile in list.files(gwas_basepath)) {
        gwaschr <- read_delim(
            file.path(gwas_basepath, gwasfile),  # works with or without a trailing slash
            delim = "\t",
            show_col_types = FALSE) %>%
            filter(variant %in% target_variants)
        gwas <- if (nrow(gwas) < 1) gwaschr else rbind(gwas, gwaschr)
    }

    #gwas$chromosome <- paste0('chr', gwas$chromosome)
    gwas$ZSCORE <- gwas$beta/gwas$se
    gwas$N <- gwas$n_cases + gwas$n_controls
    gwas$SNP_ID <- gwas$variant
    gambitgwas <- gwas %>% subset(select=c("chromosome", "position", "ref", "alt", "SNP_ID", "N", "ZSCORE"))
    colnames(gambitgwas) <- c("#CHR", "POS", "REF", "ALT", "SNP_ID", "N", "ZSCORE")

    write_delim(
        gambitgwas,
        "${_output}",
        delim = "\t",
        append = TRUE,
        col_names = TRUE,
        quote = "none")
  
  
  
    # Create Genotype SNPMatrix
    library(BEDMatrix)   # provides BEDMatrix()
    library(data.table)  # provides fread()
    genotype_data <- BEDMatrix("${_input:n}")
    genotype_bim<-fread(sub("\\.bed$",".bim","${_input}"))
    plink_snps <- colnames(genotype_data)<-genotype_bim$V2
    # Get Variants Present in Ref Panel
    ld_variants <- generate_index(colnames(genotype_data))
    input_dataframe <- read.csv(paste0("${cwd:a}","/ptwas/cache/input_dataframe.txt"),sep ='\t',check.names = F)
    #results <- input_dataframe %>%
    #    filter(gene_id == grep("ENSG", unlist(strsplit("${_input}", split = '\\.')), value = TRUE)) %>%
    #    filter(uber_id %in% ld_variants) %>%
    #    mutate(weight = as.double(weight)) %>%
    #    group_by(gene_id, tissue) 

    target_gene_ids <- str_extract_all("${_input}", "ENSG[0-9]+")[[1]]
    
    results <- input_dataframe %>%
            filter(sapply(gene_id, function(x) any(str_detect(x, target_gene_ids)))) %>%
            filter(uber_id %in% ld_variants) %>%
            mutate(weight = as.double(weight)) %>%
            group_by(gene_id, tissue)
    # in sQTL, there could be > 1 splicing events target the same gene
    gene_ids <- unique(results$gene_id)

    results_final <- data.frame()
    for(gene_id in gene_ids){
        results_tmp <- results[results$gene_id==gene_id,]
        results_tmp$nsnps = length(results_tmp$variant)
        results_tmp$burden_pval = burden(results_tmp$variant, results_tmp$SNP_ID, results_tmp$uber_id, genotype_data, ld_variants, results_tmp$weight, results_tmp$Z)
        results_tmp$stat = burden(results_tmp$variant, results_tmp$SNP_ID, results_tmp$uber_id, genotype_data, ld_variants, results_tmp$weight, results_tmp$Z,TRUE)
        results_tmp$class = "sQTL"
        results_tmp <- results_tmp %>% ungroup()
        results_final <- rbind(results_final, results_tmp)
    }

    write.table(
        results_final %>%
            group_by(gene_id, tissue) %>%
            mutate(POS = paste0(min(POS), "-", max(POS))) %>%
            ungroup() %>%
            subset(
                select=c("#CHR", "POS", "gene_id", "class", "tissue", "nsnps", "stat", "burden_pval")) %>%
            dplyr::rename(GENE=gene_id, CLASS=class, SUBCLASS=tissue, NSNPS=nsnps, STAT=stat, PVAL=burden_pval) %>%
            distinct(GENE, SUBCLASS, .keep_all = TRUE),
        "${_output}",
        sep = "\t",
        append = FALSE,
        quote = FALSE)

**output_2**: re-formatted GWAS sumstats results



In [7]:
head output/ADGWAS_Bellenguez_2022.gambitgwas.txt


#CHR	POS	REF	ALT	SNP_ID	N	ZSCORE
1	24080045	G	A	chr1_24080045_G_A	487511	-2.264705882352941
1	24080157	G	A	chr1_24080157_G_A	487511	2.1585365853658534
1	24080563	C	T	chr1_24080563_C_T	487511	2.012048192771084
1	24080644	C	T	chr1_24080644_C_T	487511	2.1463414634146343
1	24080863	G	A	chr1_24080863_G_A	487511	2.207317073170732
1	24080867	G	A	chr1_24080867_G_A	487511	2.2560975609756095
1	24081747	A	G	chr1_24081747_A_G	487511	2.170731707317073
1	24081924	C	T	chr1_24081924_C_T	487511	2.170731707317073
1	24082275	T	C	chr1_24082275_T_C	487511	2.1585365853658534


## PTWAS scan
This portion contains code for running the PTWAS scan as implemented in GAMBIT. 



### Input

- eQTL Weights
    File that contains eQTL weights (the exact format is still to be decided). One option: column 1 is the SNP and column 2 is the weight.
- GWAS Z-Scores
    File that contains GWAS z-scores (or the quantities needed to compute them). Column 1 is the SNP; column 2 can be the z-score.
- LD reference
- region list
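
Before these inputs are combined, the GWAS z-scores must refer to the same effect allele as the LD reference. The sketch below shows one simple way to handle allele flips: keep variants whose alleles match, flip the sign of `z` when ref and alt are swapped, and drop everything else. The function `align_alleles_sketch` and the column names `chrom`, `pos`, `ref`, `alt` and `z` are assumptions for illustration. Strand flips are not handled here.

```r
# Toy GWAS summary statistics and LD reference variant table (hypothetical data).
gwas <- data.frame(
  chrom = c(1, 1, 1), pos = c(100, 200, 300),
  ref = c("A", "C", "G"), alt = c("G", "T", "A"),
  z = c(1.5, -2.0, 0.7), stringsAsFactors = FALSE
)
ld_ref <- data.frame(
  chrom = c(1, 1, 1), pos = c(100, 200, 300),
  ref = c("A", "T", "G"), alt = c("G", "C", "C"),
  stringsAsFactors = FALSE
)

align_alleles_sketch <- function(gwas, ld_ref) {
  # Join on position; shared allele columns get the .gwas / .ld suffixes.
  m <- merge(gwas, ld_ref, by = c("chrom", "pos"), suffixes = c(".gwas", ".ld"))
  same    <- m$ref.gwas == m$ref.ld & m$alt.gwas == m$alt.ld
  flipped <- m$ref.gwas == m$alt.ld & m$alt.gwas == m$ref.ld
  # A swapped ref/alt pair means the effect was reported for the other allele.
  m$z[flipped] <- -m$z[flipped]
  # Variants that neither match nor flip cleanly are dropped.
  m <- m[same | flipped, ]
  data.frame(chrom = m$chrom, pos = m$pos, ref = m$ref.ld, alt = m$alt.ld,
             z = m$z, stringsAsFactors = FALSE)
}

aligned <- align_alleles_sketch(gwas, ld_ref)
print(aligned)
```

In this toy input the variant at position 200 is flipped, so its z-score changes sign, and the variant at position 300 is dropped because its alleles cannot be reconciled.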


### Output

The output has the same format as GAMBIT's.
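
For orientation, the gene-level statistic can be sketched from the three main inputs (weights, GWAS z-scores, LD matrix). This sketch assumes the commonly used summary-statistic burden form, z_gene = w'z / sqrt(w'Rw), with a two-sided normal p-value. Whether GAMBIT or the `twas_z()` function referenced in this module uses exactly this form is not confirmed here. The name `twas_z_sketch` and the toy inputs are hypothetical.

```r
# Assumed burden form: z_gene = (w' z) / sqrt(w' R w)
#   weights: eQTL weights per variant
#   z:       GWAS z-scores for the same variants, same allele orientation
#   R:       LD (correlation) matrix for those variants
twas_z_sketch <- function(weights, z, R) {
  stopifnot(length(weights) == length(z), all(dim(R) == length(z)))
  num <- sum(weights * z)
  den <- sqrt(as.numeric(t(weights) %*% R %*% weights))
  z_gene <- num / den
  list(z = z_gene, pval = 2 * pnorm(-abs(z_gene)))
}

# Toy example with three variants in mild LD.
w <- c(0.5, -0.2, 0.1)
z <- c(2.0, -1.0, 0.5)
R <- matrix(c(1.0, 0.2, 0.0,
              0.2, 1.0, 0.1,
              0.0, 0.1, 1.0), nrow = 3, byrow = TRUE)

res <- twas_z_sketch(w, z, R)
print(res)
```

Variants must be in the same order in `weights`, `z` and `R`, and z-scores must already be aligned to the reference alleles.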

In [None]:
[ptwas]
parameter: qtl_results_list = path
parameter: gwas_sumstats = path
parameter: ld_block_list = path
parameter: p_suggestive = 1E-5
input: [x.strip() for x in open(qtl_results_list).readlines() if x.strip()], group_by = 1, group_with = 'gwas_sumstats'
output: f'{cwd}/{_input:bnn}.ptwas.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    # 1. load QTL RDS file to figure out by variant ID the genomic region in question
    # 2. load GWAS z-score and the corresponding LD matrices -- potentially from multiple pre-computed LD block files -- put together the new LD matrix for the region
    # 3. use mungesumstats() to QC the LD matrix with summary stats especially allele flip; also look out for strand flip.
    # 4. use twas_z() to compute TWAS results from multiple weights
    # 5. use a p-value cutoff to loosely pick TWAS regions of interest; if a region passes the cutoff, save the QC-ed GWAS data in a format compatible with https://github.com/cumc/pecotmr/blob/main/R/mr.R


In [8]:
head output/DLPFC.ptwas.output


gene_id	tissue	nsnps	burden_pval
ENSG00000000457	DLPFC	3023	0.37256142427110756
ENSG00000000971	DLPFC	7465	0.35697314943086955
ENSG00000001460	DLPFC	1999	7.006837334425248e-71
ENSG00000001461	DLPFC	1999	9.897049424129262e-51
ENSG00000002016	DLPFC	2710	7.302886702775239e-20
ENSG00000003056	DLPFC	6052	3.7933865595690386e-23
ENSG00000003249	DLPFC	2747	4.742781958897929e-131
ENSG00000002834	DLPFC	4430	1.1702859075378423e-74
ENSG00000003137	DLPFC	4655	1.8083016266006384e-55
