# APEX QTL association testing

This notebook implements a workflow for QTL association analysis using [APEX](https://www.biorxiv.org/content/10.1101/2020.12.18.423490v1). APEX implements a linear mixed model (LMM) for association testing, which is potentially useful when the samples include related individuals.

**In our pilot cis-eQTL analysis with about 500 samples, we found APEX to be less robust than tensorQTL for association scans, and for linear regression using OLS it offers no speed advantage over tensorQTL. Gene-level p-values from APEX are on average smaller than those from tensorQTL, resulting in more genes discovered. However, comparison with univariate fine-mapping suggests that these p-values are somewhat inflated. We therefore recommend using tensorQTL, at least for analyses of samples without related individuals.**

## Other cautions

- The `--low-mem` option does not work for APEX, so we do not use it (https://github.com/corbinq/apex/issues/7).
- The command options used here differ from those in the APEX website documentation. The commands on the documentation page (last updated September 2021) do not work. The commands below were constructed and tested by our team based on our understanding of the program, without input from the APEX authors.

## Input

- List of molecular phenotype files: a list of `bed.gz` files, each containing a table describing the molecular phenotype. Each file should have a companion index file in `tbi` format.
- List of genotype files in VCF format, one per chromosome. The VCF files are converted beforehand from PLINK trios in the data_processing sections.
- List of GRM files: paths to the GRMs generated by the [GRM module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/genotype/GRM.html).
- Covariate file: a table whose header is `#id` followed by the sample names, with one covariate per row. It includes fixed and known covariates as well as hidden covariates recovered from factor analysis.
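
The workflow pairs the per-chromosome phenotype and genotype lists by joining them on a shared `#id` column (see the global section of this notebook). The sketch below illustrates that join using only the Python standard library. The column names `pheno_path` and `geno_path` and all file names are hypothetical, chosen for illustration rather than taken from the pipeline.

```python
import csv
import io

def read_list(handle):
    """Read a tab-separated list file with a '#id' column into {id: row}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["#id"]: row for row in reader}

def pair_inputs(pheno_handle, geno_handle):
    """Inner-join two list files on '#id', keeping only IDs present in both."""
    pheno = read_list(pheno_handle)
    geno = read_list(geno_handle)
    paired = []
    for chrom_id, pheno_row in pheno.items():
        if chrom_id in geno:
            merged = dict(pheno_row)
            merged.update(geno[chrom_id])
            paired.append(merged)
    return paired

# Hypothetical list contents; column and file names are made up for illustration.
pheno_list = io.StringIO("#id\tpheno_path\n1\tpheno.chr1.bed.gz\n2\tpheno.chr2.bed.gz\n")
geno_list = io.StringIO("#id\tgeno_path\n1\tgeno.chr1.vcf.gz\n3\tgeno.chr3.vcf.gz\n")
pairs = pair_inputs(pheno_list, geno_list)
print(pairs)
```

Only IDs present in both lists survive the join, so a chromosome missing from either list is dropped without warning. Comparing the number of paired rows against the length of each list is a cheap sanity check before launching jobs.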

## Output

For each chromosome, several summary statistics files are generated, including nominal test statistics for each test as well as region (gene) level association evidence.

The columns of nominal association result are as follows:

- #chrom : Variant chromosome.
- pos : Variant chromosomal position (basepairs).
- ref : Variant reference allele (A, C, T, or G).
- alt : Variant alternate allele.
- gene : Molecular trait identifier (as specified in --bed {trait-file}).
- beta : OLS regression slope for variant on trait.
- se : Standard error of regression slope.
- pval : Single-variant association nominal p-value.
- variant_id : ID of the variant, built by the reformatting step of this workflow as `chrom:pos_ref_alt`.
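
The reformatting step of this notebook adds `variant_id` by pasting the chromosome, position, and alleles together as `chrom:pos_ref_alt`. A minimal Python equivalent of that string construction is sketched below. The helper names `make_variant_id` and `split_variant_id` are hypothetical and not part of the workflow.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Build an ID of the form chrom:pos_ref_alt, mirroring the R reformatting step."""
    return f"{chrom}:{pos}_{ref}_{alt}"

def split_variant_id(variant_id):
    """Recover (chrom, pos, ref, alt) from an ID built by make_variant_id."""
    chrom, rest = variant_id.split(":", 1)
    pos, ref, alt = rest.split("_", 2)
    return chrom, int(pos), ref, alt

example_id = make_variant_id("chr1", 12345, "A", "G")
print(example_id)
```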

The columns of region (gene) level association evidence are as follows:

- #chrom : Molecular trait chromosome.
- start : Molecular trait start position.
- end : Molecular trait end position.
- gene : Molecular trait identifier.
- gene_pval : Trait-level p-value calculated across all variants in the cis region using the Cauchy combination test, comparable to beta-approximated permutation p-values.
- n_samples : Number of samples included in analysis.
- n_covar : Number of covariates included in analysis, including intercept.
- resid_sd : Square root of regression mean squared error under the null model.
- n_cis_variants : Number of variants in the cis region (which were used to calculate gene_pval).
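
To make the `gene_pval` column more concrete, here is a minimal sketch of the Cauchy combination test in its generic published form. It is not taken from the APEX source code, and equal weights across variants are an assumption; the function name `cauchy_combination` is hypothetical.

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test (equal weights by default)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    # Map each p-value to a standard Cauchy variate and take the weighted mean;
    # under the null the mean is again approximately standard Cauchy.
    stat = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for w, p in zip(weights, pvalues)
    )
    # Upper-tail probability of a standard Cauchy at the combined statistic.
    return 0.5 - math.atan(stat) / math.pi

combined = cauchy_combination([0.001, 0.8, 0.6])
print(combined)
```

The combined p-value is driven mostly by the smallest input p-values, which is why it behaves like a gene-level summary of the strongest cis signals. This plain formula loses floating-point precision for extremely small p-values, so production implementations typically treat that case separately.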

## Command interface

In [2]:
sos run APEX.ipynb -h

usage: sos run APEX.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  LMM_null
  APEX_cis
  cis
  APEX_trans
  trans

Global Workflow Options:
  --molecular-pheno-list VAL (as path, required)
                        Path to the input molecular phenotype file, per chrm, in
                        bed.gz format.
  --covariate VAL (as path, required)
                        Covariate file, in similar format as the molecular_pheno
  --genotype-file-list VAL (as path, required)
                        Genotype file in vcf format, per chrm
  --grm-list VAL (as path, required)
                        grm file in pairwise table format , per chrm
  --cwd . (as path)
                       

## Example



In [None]:
sos run APEX.ipynb cis   \
--genotype_list GRCh38_liftedover_sorted_all.add_chr.leftnorm.filtered.renamed.filtered.renamed.filtered.filtered.vcf_files_list.txt \
--phenotype_list /mnt/mfs/statgen/snuc_pseudo_bulk/eight_tissue_analysis/data_preprocessing/ALL/phenotype_data/ALL.log2cpm.bed.processed_phenotype.per_chrom.recipe   \
--grm_list  data_preprocessing/genotype/grm/plink_files_list.loco_grm_list.txt  \
--covariate_file data_preprocessing/ALL/covariates/ALL.covariate.pca.BiCV.cov.gz     \
--cwd /mnt/mfs/statgen/snuc_pseudo_bulk/eight_tissue_analysis/Association_scan/ALL/APEX    \
--container /mnt/mfs/statgen/containers/apex.sif -J 50 -c csg.yml -q csg -n --no-LMM &

## Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to the input molecular phenotype file, per chrom, in bed.gz format.
parameter: phenotype_list = path
# Covariate file
parameter: covariate_file = path
# Genotype file in VCF format, per chrom
parameter: genotype_list = path
# GRM file in plain text format, per chrom, for leave one chrom out analysis
parameter: grm_list = path()
# Use LMM or not
parameter: LMM = False
# Whether or not to apply rank normalization
parameter: rankNormal = False
# Path to the work directory of the analysis.
parameter: cwd = path('.')
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
# Prefix for the analysis output
parameter: name = f"{phenotype_list:bn}_{covariate_file:bn}"
# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 1000000
# Number of threads
parameter: numThreads = 1
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = '5h'
# Memory expected
parameter: mem = '80G'

import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(phenotype_list,sep = "\t")
geno_chr_inv = pd.read_csv(genotype_list,sep = "\t")
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on = "#id")
if LMM:
    grm_chr_inv =  pd.read_csv(grm_list,sep = "\t")
    input_inv = input_inv.merge(grm_chr_inv,on = "#id")
input_inv = input_inv.values.tolist()

## LMM Regression
This step precomputes and stores:
1. LMM null models and trait residuals
2. Spline terms for LMM genotypic variances, to speed up downstream analysis
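
For orientation, a generic linear mixed model of the kind fitted in this step can be written as below. This is the standard textbook form, with $K$ standing for the GRM; it is a sketch and not necessarily the exact parameterization APEX uses internally.

$$
y = X\beta + g + \varepsilon, \qquad g \sim \mathcal{N}(0, \sigma_g^2 K), \qquad \varepsilon \sim \mathcal{N}(0, \sigma_e^2 I)
$$

Here $y$ is the molecular trait, $X$ holds the covariates, $g$ is a random effect capturing relatedness among samples, and $\varepsilon$ is independent residual noise. The null model omits the tested variant, which is why it can be fitted once per trait and reused across variants.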

In [None]:
[LMM_null]
input: input_inv, group_by = 4
output: f'{cwd:a}/{_input[1]:bnn}.theta.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    apex lmm $["--rankNormal" if rankNormal else ""] --vcf $[_input[2]] \
    --bed $[_input[1]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nn] \
    --grm $[_input[3]] \
    --get-theta \
    --threads $[numThreads]

## QTL association testing
This step generates the cis-QTL and trans-QTL summary statistics, which can be used for downstream analyses based on summary statistics. The analysis is done per chromosome to reduce running time.

In [None]:
[cis_1]
if LMM:
    sos_run("LMM_null")
input: input_inv, group_by = 4
output:f'{cwd:a}/{path(_input[1]):bnn}{".LMM" if LMM else ".OLS"}.cis_long_table.txt.gz',
       f'{cwd:a}/{path(_input[1]):bnn}{".LMM" if LMM else ".OLS"}.cis_gene_table.txt.gz',
       f'{cwd:a}/{path(_input[1]):bnn}{".LMM" if LMM else ".OLS"}.cis_sumstats.txt.gz'
parameter: theta = path(f'{cwd:a}/{_input[1]:bnn}.theta.gz')
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    apex cis $["--rankNormal" if rankNormal else ""] --vcf $[_input[2]] \
    --bed $[_input[1]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nnn] \
    --long $[f'--theta {theta} --grm {_input[3]}' if LMM else ""] \
    --window $[window] \
    --threads $[numThreads]

In [None]:
[trans_1]
if LMM:
    sos_run("LMM_null")
parameter: keep_gene = ""
input: input_inv, group_by = 4
output: f'{cwd:a}/{path(_input[1]):bnn}{".LMM" if LMM else ".OLS"}.trans_long_table.txt.gz',
        f'{cwd:a}/{path(_input[1]):bnn}{".LMM" if LMM else ".OLS"}.trans_gene_table.txt.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    apex trans $["--rankNormal" if rankNormal else ""] --vcf $[_input[2]] \
    --bed $[_input[1]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nnn] \
    --gene "$[keep_gene]" \
    --long $[f'--theta {cwd:a}/{_input[1]:bnn}.theta.gz --grm {_input[3]}' if LMM else ""] \
    --threads $[numThreads]

In [None]:
[*_2]
output: f'{_input[0]:nn}.reformated.txt'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    data <- read_delim("$[_input[0]]",delim = "\t")
    data = data%>%mutate(variant_id = pmap_chr(list(a = `#chrom`,b = pos,c = ref, d = alt),function(a,b,c,d) paste(c(a,":",b,"_",c,"_",d),collapse = "")))%>%rename(chrom = `#chrom`)
    data %>%write_delim("$[_output]",delim = "\t")

In [None]:
[*_3]
input: group_by = "all"
output: f'{cwd}/APEX_QTL_recipe.tsv',f'{cwd:a}/APEX_column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#id" : [int(x.split(".")[-6].replace("chr","")) for x in  [$[_input:r,]]],
    "sumstat_dir" : [$[_input:r,]],\
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "GENE": "gene"}), columns = ["APEX"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )