# TensorQTL QTL association testing

This notebook implements a workflow for using [tensorQTL](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1836-7) to perform QTL association testing.

## Input

- List of molecular phenotype files: a list of `bed.gz` files containing the molecular phenotype tables. Each file should have a companion index file in `tbi` format.
- List of genotypes in PLINK binary format (`bed`/`bim`/`fam`) for each chromosome, previously processed through our genotype QC pipelines.
- Covariate file: a tab-delimited file whose header is `#id` followed by the sample names, with one covariate per row. It contains fixed and known covariates as well as hidden covariates recovered from factor analysis.
- Optionally, a list of traits (genes, regions of molecular features, etc.) to analyze.
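
The workflow's global section reads the phenotype and genotype lists as tab-separated tables and joins them on a shared `#id` column, so that each row pairs one phenotype file with one genotype file. The sketch below illustrates that pairing on toy data. Only the `#id` column name comes from the workflow; the path column names (`pheno_path`, `geno_path`) and the file names are hypothetical.

```python
import io
import pandas as pd

# Toy list files. Only "#id" is taken from the workflow;
# the other column names and all file names are made up.
phenotype_list = io.StringIO(
    "#id\tpheno_path\n"
    "chr21\tpheno.chr21.bed.gz\n"
    "chr22\tpheno.chr22.bed.gz\n"
)
genotype_list = io.StringIO(
    "#id\tgeno_path\n"
    "chr22\tgeno.chr22.bed\n"
    "chr21\tgeno.chr21.bed\n"
)

pheno = pd.read_csv(phenotype_list, sep="\t")
geno = pd.read_csv(genotype_list, sep="\t")

# Inner join on "#id": an entry missing from either list is dropped.
paired = pheno.merge(geno, on="#id")
chroms = paired["#id"].tolist()
files = paired.drop(columns="#id").values.tolist()
```

Because the join is an inner join, a chromosome listed in only one of the two files is silently skipped, so it may be worth comparing `len(paired)` with the length of each list before launching the analysis.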

## Output

For each chromosome, several summary statistics files are generated, including nominal test statistics for each variant-trait pair and region (gene) level association evidence.

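The columns of the nominal association results are as follows (in the long-table text output, the workflow code renames `phenotype_id` to `gene_ID` and writes the chromosome column as `chrom`):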

- phenotype_id: Molecular trait identifier (gene)
- variant_id: ID of the variant (rsid or chr:position:ref:alt)
- tss_distance: Distance of the SNP to the gene transcription start site (TSS)
- af: Allele frequency of the SNP
- ma_samples: Number of samples carrying the minor allele
- ma_count: Total number of minor alleles across individuals
- pval: Nominal P-value from linear regression
- beta: Slope of the linear regression
- se: Standard error of beta
- chr: Variant chromosome
- pos: Variant chromosomal position (base pairs)
- ref: Variant reference allele (A, C, T, or G)
- alt: Variant alternate allele
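
As an illustration of how these columns are commonly used together, the sketch below derives a z-score from `beta` and `se` and picks the variant with the smallest nominal P-value for each trait. The values are made up, and the column names are taken from the list above; adjust them if your output files use different names.

```python
import pandas as pd

# Toy nominal results; all values are invented for illustration.
nominal = pd.DataFrame({
    "phenotype_id": ["g1", "g1", "g2"],
    "variant_id": ["chr1:100_A_G", "chr1:200_C_T", "chr1:300_G_A"],
    "pval": [0.01, 1e-6, 0.2],
    "beta": [0.1, -0.5, 0.05],
    "se": [0.04, 0.1, 0.04],
})

# z-score of each association: effect size divided by its standard error.
nominal["z"] = nominal["beta"] / nominal["se"]

# Row with the smallest nominal P-value for each molecular trait.
top = nominal.loc[nominal.groupby("phenotype_id")["pval"].idxmin()]
```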


The columns of the region (gene) level association results are as follows:

- phenotype_id - Molecular trait identifier (gene)
- num_var - Total number of variants tested in cis
- beta_shape1 - First parameter value of the fitted beta distribution
- beta_shape2 - Second parameter value of the fitted beta distribution
- true_df - Effective degrees of freedom of the beta distribution approximation
- pval_true_df - Empirical P-value for the beta distribution approximation
- variant_id - ID of the top variant (rsid or chr:position:ref:alt)
- tss_distance - Distance of the SNP to the gene transcription start site (TSS)
- ma_samples - Number of samples carrying the minor allele
- ma_count - Total number of minor alleles across individuals
- maf - Minor allele frequency in MiGA cohort
- ref_factor - Flag indicating if the alternative allele is the minor allele in the cohort (1 if AF <= 0.5, -1 if not)
- pval_nominal - Nominal P-value from linear regression
- slope - Slope of the linear regression
- slope_se - Standard error of the slope
- pval_perm - First permutation P-value directly obtained from the permutations with the direct method
- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis
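
Since `pval_beta` is the gene-level P-value recommended for downstream use, a typical next step is multiple-testing correction across genes. The sketch below applies the Benjamini-Hochberg procedure to toy `pval_beta` values. This is a generic FDR method chosen for illustration, not necessarily the correction used by the authors of this pipeline; `bh_adjust` and the `fdr_bh` column are hypothetical names.

```python
import numpy as np
import pandas as pd

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

# Toy gene-level results; values are invented for illustration.
genes = pd.DataFrame({
    "phenotype_id": ["geneA", "geneB", "geneC", "geneD"],
    "pval_beta": [0.001, 0.04, 0.03, 0.5],
})
genes["fdr_bh"] = bh_adjust(genes["pval_beta"])
significant = genes[genes["fdr_bh"] <= 0.05]
```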

# Command interface 

In [1]:
sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  cis
  trans

Global Workflow Options:
  --phenotype-list VAL (as path, required)
                        Path to the input molecular phenotype file, per chrom,
                        in bed.gz format.
  --covariate-file VAL (as path, required)
                        Covariate file
  --genotype-list VAL (as path, required)
                        Genotype file in PLINK binary format (bed/bim/fam), per
                        chrom
  --region-list . (as path)
                        An optional subset of regions of molecular features to
                        analyze
  --cwd . (as path)
            

# Example

In [None]:
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-list plink_files_list.txt \
    --phenotype-list MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --container containers/TensorQTL.sif --MAC 5

## Global parameter settings

In [5]:
[global]
# Path to the input molecular phenotype file, per chrom, in bed.gz format.
parameter: phenotype_list = path
# Covariate file
parameter: covariate_file = path
# Genotype file in PLINK binary format (bed/bim/fam), per chrom
parameter: genotype_list = path
# An optional subset of regions of molecular features to analyze
parameter: region_list = path()
# Path to the work directory of the analysis.
parameter: cwd = path('.')
# Prefix for the analysis output
parameter: name = f"{phenotype_list:bn}_{covariate_file:bn}"
# Container option for software to run the analysis: docker or singularity
parameter: container = ''

# Specify the cis window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'

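import pandas as pd
# Pair each phenotype file with its genotype file by joining the two lists on "#id"
molecular_pheno_chr_inv = pd.read_csv(phenotype_list, sep = "\t")
geno_chr_inv = pd.read_csv(genotype_list, sep = "\t")
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on = "#id")
input_inv = input_inv.values.tolist()
# The first column (expected to be "#id") labels each group; the remaining columns are file paths
chr_inv = [x[0] for x in input_inv]
file_inv = [x[1:] for x in input_inv]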

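# Minor allele count (MAC) cutoff, converted to a MAF threshold below
parameter: MAC = 0
# Sample size, taken from the covariate file header because it is the intersection of genotype/phenotype/covariate samples
N = len(pd.read_csv(covariate_file, sep = "\t", nrows = 1).columns) - 1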
parameter: maf_threshold = MAC/(2*N)
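
The default `maf_threshold` is derived from the minor allele count cutoff as MAC / (2N), because N diploid samples carry 2N alleles. The sketch below works through that arithmetic on a toy covariate header; `mac_to_maf_threshold` is a hypothetical helper written for this illustration, and the covariate content is made up.

```python
import io
import pandas as pd

def mac_to_maf_threshold(mac, n_samples):
    """Convert a minor allele count cutoff into a frequency cutoff for diploid samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return mac / (2 * n_samples)

# Toy covariate file: "#id" followed by four sample columns.
covariates = io.StringIO("#id\tS1\tS2\tS3\tS4\nsex\t0\t1\t1\t0\n")

# Number of samples = number of header columns minus the "#id" column.
n_samples = len(pd.read_csv(covariates, sep="\t", nrows=1).columns) - 1

# A cutoff of 5 minor alleles among 100 samples (200 alleles).
threshold = mac_to_maf_threshold(5, 100)
```

With the example command's `--MAC 5`, a cohort of 100 samples would therefore filter variants at a minor allele frequency of 0.025.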

## cisQTL association testing

In [None]:
[cis_1]
input: file_inv, group_by = len(file_inv[0]), group_with = "chr_inv"
output: f'{cwd:a}/{name}.{_chr_inv}.cis_qtl_pairs.{_chr_inv}.parquet', # This design is necessary to match the pattern of map_nominal output
        f'{cwd:a}/{name}.{_chr_inv}.emprical.cis_sumstats.txt',
        long_table = f'{cwd:a}/{name}.{_chr_inv}.norminal.cis_long_table.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'

python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    import os, time    
    ## Define paths
    plink_prefix_path = $[_input[1]:nr]
    expression_bed = $[_input[0]:r]
    covariates_file = "$[covariate_file]"

    ## Load Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        regions = pd.read_csv("$[region_list]", sep="\t")["gene_ID"].to_list()
        # Phenotype IDs are stored in the index of both data frames
        phenotype_df = phenotype_df[phenotype_df.index.isin(regions)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(regions)]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
    ## Retaining only samples shared by phenotypes, covariates and genotypes
    common_samples = np.intersect1d(np.intersect1d(phenotype_df.columns, covariates_df.index), genotype_df.columns)
    phenotype_df = phenotype_df[common_samples]
    covariates_df = covariates_df.loc[common_samples]
    ## Match the chromosome naming of the phenotype table to that of the genotype data
    if "chr" not in variant_df.chrom.iloc[0]:
        phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]
    ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
    cis.map_nominal(genotype_df, variant_df,
                phenotype_df,
                phenotype_pos_df,
                "$[_output[0]:nnn]", covariates_df=covariates_df, window=$[window], maf_threshold = $[maf_threshold] )

    ## Load the parquet and save it as txt
    pairs_df = pd.read_parquet("$[_output[0]]")
    # Rename by position through rename(); writing into columns.values in place can leave the column index stale
    pairs_df = pairs_df.rename(columns={
        pairs_df.columns[0]: "gene_ID",
        pairs_df.columns[6]: "pval",
        pairs_df.columns[7]: "beta",
        pairs_df.columns[8]: "se"})
    # Parse variant IDs, assumed to be of the form chrom:pos_ref_alt
    variant_id = pairs_df["variant_id"]
    pairs_df = pairs_df.assign(
        alt = variant_id.str.split("_").str[-1],
        ref = variant_id.str.split("_").str[-2],
        pos = variant_id.str.split("_").str[0].str.split(":").str[1],
        chrom = variant_id.str.split(":").str[0])
    pairs_df.to_csv("$[_output[2]]", sep='\t',index = None)
    cis_df = cis.map_cis(genotype_df, variant_df, 
                     phenotype_df,
                     phenotype_pos_df,
                     covariates_df=covariates_df, seed=999, window=$[window], maf_threshold = $[maf_threshold] )
    cis_df.index.name = "gene_id"
    cis_df.to_csv("$[_output[1]]", sep='\t')
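
The variant ID parsing in this step assumes IDs shaped like `chrom:pos_ref_alt`, which is what the `split` calls imply. The sketch below shows the same parsing with a single regular expression on made-up IDs. IDs in another format (for example rsIDs) would not match this pattern and would need separate handling.

```python
import pandas as pd

# Made-up variant IDs in the assumed chrom:pos_ref_alt format.
pairs = pd.DataFrame({"variant_id": ["chr1:12345_A_G", "chr1:67890_CT_C"]})

# One named capture group per output column.
parts = pairs["variant_id"].str.extract(
    r"^(?P<chrom>[^:]+):(?P<pos>\d+)_(?P<ref>[^_]+)_(?P<alt>[^_]+)$"
)
parts["pos"] = parts["pos"].astype(int)
pairs = pd.concat([pairs, parts], axis=1)
```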

## TransQTL association testing

In [None]:
[trans_1]
input: file_inv,group_by = len(file_inv[0]), group_with = "chr_inv"
output: long_table = f'{cwd:a}/{path(_input[0]):bnnn}.long_table.trans_sumstats.txt'
parameter: batch_size = 10000
parameter: pval_threshold = 1e-5
task: trunk_workers = 1, trunk_size = 1, walltime = '12h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container =container 
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    ## paths
    plink_prefix_path = $[path(_input[1]):nr]
    expression_bed = $[path(_input[0]):r]
    covariates_file = "$[covariate_file]"

    ## Loading Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list]", sep="\t")
        keep_gene = region["gene_ID"].to_list()
        # Phenotype IDs are stored in the index of both data frames
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_gene)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_gene)]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
    ## Retaining only samples shared by phenotypes, covariates and genotypes, as in the cis step
    common_samples = np.intersect1d(np.intersect1d(phenotype_df.columns, covariates_df.index), genotype_df.columns)
    phenotype_df = phenotype_df[common_samples]
    covariates_df = covariates_df.loc[common_samples]
    ## Trans analysis
    trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, batch_size=$[batch_size],
                           return_sparse=True, pval_threshold=$[pval_threshold], maf_threshold=$[maf_threshold])
    ## Filter out cis signal
    trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=$[window])
    ## Output
    # Rename by position through rename(); writing into columns.values in place can leave the column index stale
    trans_df = trans_df.rename(columns=dict(zip(trans_df.columns[1:5], ["gene_ID", "pval", "beta", "se"])))
    # Parse variant IDs, assumed to be of the form chrom:pos_ref_alt
    variant_id = trans_df["variant_id"]
    trans_df = trans_df.assign(
        chrom = variant_id.str.split(":").str[0],
        alt = variant_id.str.split("_").str[2],
        ref = variant_id.str.split("_").str[1],
        pos = variant_id.str.split("_").str[0].str.split(":").str[1])
    trans_df.to_csv("$[_output]", sep='\t',index = None)
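
Conceptually, the cis-filtering step removes variant-gene pairs in which the variant lies on the gene's chromosome within the cis window, so that only trans signals remain. The sketch below shows that idea in plain pandas on toy data. It is not tensorQTL's implementation; `drop_cis_pairs` and the table layouts (`chr`/`tss` for genes, `chrom`/`pos` for variants) are assumptions made for this illustration.

```python
import pandas as pd

def drop_cis_pairs(pairs, gene_pos, variant_pos, window=1_000_000):
    """Drop pairs whose variant is on the gene's chromosome within `window` bp of its TSS."""
    merged = pairs.merge(gene_pos, on="gene_ID").merge(variant_pos, on="variant_id")
    same_chrom = merged["chr"] == merged["chrom"]
    nearby = (merged["pos"] - merged["tss"]).abs() <= window
    return merged.loc[~(same_chrom & nearby), pairs.columns].reset_index(drop=True)

# Toy inputs; all values are invented for illustration.
pairs = pd.DataFrame({
    "variant_id": ["v1", "v2", "v3"],
    "gene_ID": ["g1", "g1", "g1"],
    "pval": [1e-8, 1e-7, 1e-6],
})
gene_pos = pd.DataFrame({"gene_ID": ["g1"], "chr": ["chr1"], "tss": [2_000_000]})
variant_pos = pd.DataFrame({
    "variant_id": ["v1", "v2", "v3"],
    "chrom": ["chr1", "chr1", "chr2"],
    "pos": [2_500_000, 9_000_000, 2_000_100],
})

trans_only = drop_cis_pairs(pairs, gene_pos, variant_pos)
```

Here `v1` is dropped because it sits 500 kb from the TSS on the same chromosome, while `v2` (7 Mb away) and `v3` (different chromosome) are kept.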

In [None]:
[*_2]
input:  group_by = "all"
output: f'{cwd:a}/TensorQTL.{"trans" if len(_input["long_table"]) == len(_input) else "cis"}._recipe.tsv',
        f'{cwd:a}/TensorQTL.{"trans" if len(_input["long_table"]) == len(_input) else "cis"}._column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_temp = pd.DataFrame({
    "#chr" : [int(x.split(".")[-4].replace("chr","")) for x in  [$[_input["long_table"]:r,]]],
    "sumstat_dir" : [$[_input["long_table"]:r,]],
    "column_info" : $[_output[1]:r]
    })
    if "cis" in data_temp.sumstat_dir[0]:
        column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "TSS_D": "tss_distance",
          "AF": "af",
          "MA_SAMPLES": "ma_samples",
          "MA_COUNT": "ma_count",
          "GENE": "gene_ID"}), columns = ["TensorQTL"] )

    else:
        column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "AF": "af",
          "GENE": "gene_ID"}), columns = ["TensorQTL"] )
    data_temp.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )
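
The recipe step recovers the chromosome label from each long-table file name by taking the fourth dot-separated field from the end. The sketch below shows that extraction on a made-up cis file name; `chrom_from_long_table` is a hypothetical helper. It returns a string, since converting to `int` as the step does would fail for labels such as `X`.

```python
def chrom_from_long_table(path):
    """Return the chromosome label: the 4th dot-separated field from the end, without 'chr'."""
    return path.split(".")[-4].replace("chr", "")

# Made-up file names following the cis output pattern
# <prefix>.<chrom>.norminal.cis_long_table.txt
autosome = chrom_from_long_table("out/pheno_cov.chr21.norminal.cis_long_table.txt")
sex_chrom = chrom_from_long_table("out/pheno_cov.chrX.norminal.cis_long_table.txt")

# int() works for autosomes only; non-numeric labels raise ValueError.
try:
    int(sex_chrom)
    numeric = True
except ValueError:
    numeric = False
```

Note that the extraction counts fields from the end of the name, so extra dots in the output prefix do not affect it, whereas a change to the suffix pattern would.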

## Association results processing

In [None]:
#[cis_2]
input:  group_by = "all"
output: f'{cwd:a}/TensorQTL_recipe.tsv',f'{cwd:a}/TensorQTL_column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#chr" : [int(x.split(".")[-4].replace("chr","")) for x in  [$[_input["long_table"]:r,]]],
    "sumstat_dir" : [$[_input["long_table"]:r,]],
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "TSS_D": "tss_distance",
          "AF": "af",
          "MA_SAMPLES": "ma_samples",
          "MA_COUNT": "ma_count",
          "GENE": "phenotype_id"}), columns = ["TensorQTL"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )

In [1]:
#[trans_2]
input: group_by = "all"
output: f'{cwd:a}/TensorQTL_recipe.tsv',f'{cwd:a}/TensorQTL_column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#chr" : [int(x.split(".")[-3].replace("chr","")) for x in  [$[_input:r,]]],
    "sumstat_dir" : [$[_input:r,]],
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "AF": "af",
          "GENE": "gene_ID"}), columns = ["TensorQTL"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )