# TensorQTL QTL association testing

This pipeline conducts QTL association tests using the tensorQTL package.

## Input
- `--molecular-pheno`: the bed.gz file containing the molecular phenotype table. It must be accompanied by a tbi index.
- `genotype_list`: a list of whole-genome PLINK files, one per chromosome.
- `grm_list`: a file listing the GRM matrices generated by the GRM module of this pipeline.
- `covariate`: a file whose column names are `#id` followed by the sample names, with one covariate per row. It holds fixed and known covariates as well as hidden covariates recovered from factor analysis.
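
As a rough illustration of the covariate layout described above, the sketch below builds a tiny table in memory and transposes it the same way the workflow code does, so that samples become rows. The sample and covariate names are made up for the example.

```python
import io
import pandas as pd

# Hypothetical covariate table: first column "#id", then one column per sample.
covariate_text = (
    "#id\tsample1\tsample2\tsample3\n"
    "age\t71\t65\t80\n"
    "sex\t0\t1\t1\n"
    "PC1\t0.12\t-0.40\t0.03\n"
)

# Read with the covariate names as index, then transpose to samples x covariates.
covariates_df = pd.read_csv(io.StringIO(covariate_text), sep="\t", index_col=0).T

print(covariates_df.shape)          # number of samples, number of covariates
print(list(covariates_df.index))    # sample IDs
print(list(covariates_df.columns))  # covariate names
```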

## Output

For each chromosome, a set of summary statistics files, including nominal test statistics for each test as well as region (gene) level association evidence.

The columns of the nominal results are as follows:

- phenotype_id: Molecular trait identifier (gene)
- variant_id: ID of the variant (rsid or chr:position:ref:alt)
- tss_distance: Distance of the SNP to the gene transcription start site (TSS)
- af: Allele frequency of the SNP
- ma_samples: Number of samples carrying the minor allele
- ma_count: Total number of minor alleles across individuals
- pval: Nominal P-value from linear regression
- beta: Slope of the linear regression
- se: Standard error of beta
- chrom: Variant chromosome.
- pos: Variant chromosomal position (base pairs).
- ref: Variant reference allele (A, C, T, or G).
- alt: Variant alternate allele.


The columns of the region (gene) level association evidence are as follows:

- phenotype_id - Molecular trait identifier (gene)
- num_var - Total number of variants tested in cis
- beta_shape1 - First parameter value of the fitted beta distribution
- beta_shape2 - Second parameter value of the fitted beta distribution
- true_df - Effective degrees of freedom of the beta distribution approximation
- pval_true_df - Empirical P-value for the beta distribution approximation
- variant_id - ID of the top variant (rsid or chr:position:ref:alt)
- tss_distance - Distance of the SNP to the gene transcription start site (TSS)
- ma_samples - Number of samples carrying the minor allele
- ma_count - Total number of minor alleles across individuals
- maf - Minor allele frequency in the MiGA cohort
- ref_factor - Flag indicating if the alternative allele is the minor allele in the cohort (1 if AF <= 0.5, -1 if not)
- pval_nominal - Nominal P-value from linear regression
- slope - Slope of the linear regression
- slope_se - Standard error of the slope
- pval_perm - First permutation P-value directly obtained from the permutations with the direct method
- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis
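
One possible downstream use of `pval_beta` is multiple-testing correction across genes. The sketch below applies the Benjamini-Hochberg procedure to a few made-up `pval_beta` values; the choice of this procedure and the 0.05 cutoff are assumptions for illustration, not something the pipeline prescribes.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical pval_beta values, one per gene.
pval_beta = {"geneA": 0.001, "geneB": 0.04, "geneC": 0.03, "geneD": 0.8}
genes = list(pval_beta)
qvalues = dict(zip(genes, benjamini_hochberg([pval_beta[g] for g in genes])))

# Genes passing an assumed 5% false discovery rate threshold.
significant = [g for g in genes if qvalues[g] <= 0.05]
print(qvalues)
print(significant)
```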

# Command interface 

In [1]:
sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  TensorQTL_cis
  TensorQTL_trans

Global Workflow Options:
  --molecular-pheno-list VAL (as path, required)
                        Path to the input molecular phenotype file, per chrm, in
                        bed.gz format.
  --covariate VAL (as path, required)
                        Covariate file, in similar format as the molecular_pheno
  --genotype-file-list VAL (as path, required)
                        Genotype file in plink trio format, per chrm
  --region-list . (as path)
                        An optional subset of region list containing a column of
                        ENSG gene_id to lim

# Example

In [None]:
sos run pipeline/TensorQTL.ipynb TensorQTL_cis --genotype-list plink_files_list.txt \
--phenotype-list MWE.bed.recipe \
--covariate-file ALL.covariate.pca.BiCV.cov.gz \
--cwd ./ \
--container containers/apex.sif

## Global parameter settings

In [5]:
[global]
# Path to the input molecular phenotype file, per chrm, in bed.gz format.
parameter: phenotype_list = path
# Covariate file, in a format similar to the molecular phenotype file
parameter: covariate_file = path
# Genotype file in plink trio format, per chrm
parameter: genotype_list = path
# An optional subset of region list containing a column of ENSG gene_id to limit the analysis
parameter: region_list = path("./")
# Path to the work directory of the analysis.
parameter: cwd = path('./')
# Specify the number of jobs per run.
parameter: job_size = 2
# Container option for software to run the analysis: docker or singularity
parameter: container = ''

# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 1000000

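import pandas as pd
# Each list file is a tab-separated table whose "#id" column names the chromosome.
molecular_pheno_chr_inv = pd.read_csv(phenotype_list, sep = "\t")
geno_chr_inv = pd.read_csv(genotype_list, sep = "\t")
# Pair the phenotype and genotype files of each chromosome.
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on = "#id")
input_inv = input_inv.values.tolist()
# First column: chromosome label; remaining columns: the files for that chromosome.
chr_inv = [x[0] for x in input_inv]
file_inv = [x[1:] for x in input_inv]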
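
To make the pairing logic above concrete, here is a small sketch that runs the same merge on two in-memory list files. The column names `pheno_path` and `geno_path` and the file names are hypothetical; only the shared `#id` column is taken from the workflow code.

```python
import io
import pandas as pd

# Hypothetical list files: "#id" holds the chromosome, the second column a path.
pheno_text = "#id\tpheno_path\nchr21\tpheno.chr21.bed.gz\nchr22\tpheno.chr22.bed.gz\n"
geno_text = "#id\tgeno_path\nchr21\tgeno.chr21.bed\nchr22\tgeno.chr22.bed\n"

pheno_list = pd.read_csv(io.StringIO(pheno_text), sep="\t")
geno_list = pd.read_csv(io.StringIO(geno_text), sep="\t")

# Inner join on "#id" keeps only chromosomes present in both lists.
paired = pheno_list.merge(geno_list, on="#id").values.tolist()
chr_inv = [row[0] for row in paired]    # chromosome labels
file_inv = [row[1:] for row in paired]  # [phenotype file, genotype file] per chromosome

print(chr_inv)
print(file_inv)
```

With this layout, each step receives the phenotype file as `_input[0]` and the genotype file as `_input[1]`.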

## cisQTL association testing

In [None]:
[TensorQTL_cis_1]
input: file_inv,group_by = len(file_inv[0]), group_with = "chr_inv"
output: f'{cwd:a}/{path(_input[0]):bnnn}.cis_qtl_pairs.{_chr_inv}.parquet', # This design is necessary to match the file name pattern written by cis.map_nominal
        f'{cwd:a}/{path(_input[0]):bnnn}.emprical.cis_sumstats.txt',
        long_table = f'{cwd:a}/{path(_input[0]):bnnn}.norminal.cis_long_table.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '12h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', container = container,stdout = f'{_output[0]}.stdout'
    touch  $[_output[0]].time_stamp
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    ## Defining parameters
    plink_prefix_path = $[path(_input[1]):nr]
    expression_bed = $[path(_input[0]):r]
    covariates_file = "$[covariate_file]"
    Prefix = "$[_output[0]:nnn]"
    ## Loading Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    ### Filter by the optional list of genes to keep
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list]", sep="\t")
        keep_gene = region["gene_ID"].to_list()
        # query() needs the @ prefix to see the local variable keep_gene
        phenotype_df = phenotype_df.query('gene_ID in @keep_gene')
        phenotype_pos_df = phenotype_pos_df.query('gene_ID in @keep_gene')

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
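    # If the genotype chromosomes lack the "chr" prefix, strip it from the phenotype positions as well
    if "chr" not in variant_df.chrom.iloc[0]:
        phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]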
    ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
    cis.map_nominal(genotype_df, variant_df,
                phenotype_df,
                phenotype_pos_df,
                Prefix, covariates_df=covariates_df, window=$[window] )

    ## Load the parquet and save it as txt
    pairs_df = pd.read_parquet("$[_output[0]]")
    # Rename columns 6-8 (nominal p-value, slope, slope SE) without mutating the Index in place
    columns = pairs_df.columns.tolist()
    columns[6:9] = ["pval", "beta", "se"]
    pairs_df.columns = columns
    # variant_id is expected to look like chrom:pos_ref_alt
    pairs_df = pairs_df.assign(
        alt = lambda df: df['variant_id'].map(lambda v: v.split("_")[-1]),
        ref = lambda df: df['variant_id'].map(lambda v: v.split("_")[-2]),
        pos = lambda df: df['variant_id'].map(lambda v: v.split("_")[0].split(":")[1]),
        chrom = lambda df: df['variant_id'].map(lambda v: v.split(":")[0]))
    pairs_df.to_csv("$[_output[2]]", sep='\t', index = None)
    cis_df = cis.map_cis(genotype_df, variant_df, 
                     phenotype_df,
                     phenotype_pos_df,
                     covariates_df=covariates_df, seed=999, window=$[window] )
    cis_df.index.name = "gene_id"
    cis_df.to_csv("$[_output[1]]", sep='\t')
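
The variant ID parsing in the step above assumes IDs of the form `chrom:pos_ref_alt`. A minimal standalone sketch of that parsing, with an added format check, could look like the following; the helper name `parse_variant_id` is hypothetical, and unlike the workflow code it converts the position to an integer.

```python
def parse_variant_id(variant_id):
    """Split an ID such as 'chr1:12345_A_G' into chrom, pos, ref and alt."""
    parts = variant_id.split("_")
    if len(parts) != 3 or ":" not in parts[0]:
        raise ValueError(f"unexpected variant_id format: {variant_id!r}")
    chrom, pos = parts[0].split(":", 1)
    ref, alt = parts[1], parts[2]
    # The workflow keeps pos as a string; it is cast to int here for convenience.
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}

record = parse_variant_id("chr1:12345_A_G")
print(record)
```

IDs that do not follow this pattern, such as plain rsids, raise a `ValueError` in this sketch instead of producing misaligned columns.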

## TransQTL association testing

In [None]:
[TensorQTL_trans_1]
input: file_inv,group_by = len(file_inv[0]), group_with = "chr_inv"
output: f'{cwd:a}/{path(_input[0]):bnnn}.trans_sumstats.txt'
parameter: batch_size = 10000
parameter: pval_threshold = 1e-5
parameter: maf_threshold = 0.05
task: trunk_workers = 1, trunk_size = 1, walltime = '12h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', container = container,stdout = f'{_output[0]}.stdout'
    touch  $[_output[0]].time_stamp
python: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container =container 
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    ## Defining parameters
    plink_prefix_path = $[path(_input[1]):nr]
    expression_bed = $[path(_input[0]):r]
    covariates_file = "$[covariate_file]"
    Prefix = "$[_output[0]:nnn]"
    ## Loading Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)


    ##### Filter by the optional list of genes to keep
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list]", sep="\t")
        keep_gene = region["gene_ID"].to_list()
        # query() needs the @ prefix to see the local variable keep_gene
        phenotype_df = phenotype_df.query('gene_ID in @keep_gene')
        phenotype_pos_df = phenotype_pos_df.query('gene_ID in @keep_gene')


    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
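    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    # Also drop samples without genotypes, as in the cis step
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()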
    ## Trans analysis
    trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, batch_size=$[batch_size],
                           return_sparse=True, pval_threshold=$[pval_threshold], maf_threshold=$[maf_threshold])
    ## Filter out cis signal
    trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=$[window])
    ## Output
    # Rename columns 1-4 (phenotype ID, p-value, effect size, standard error) without mutating the Index in place
    columns = trans_df.columns.tolist()
    columns[1:5] = ["gene_ID", "pval", "beta", "se"]
    trans_df.columns = columns
    # variant_id is expected to look like chrom:pos_ref_alt
    trans_df = trans_df.assign(
        chrom = lambda df: df['variant_id'].map(lambda v: v.split(":")[0]),
        alt = lambda df: df['variant_id'].map(lambda v: v.split("_")[2]),
        ref = lambda df: df['variant_id'].map(lambda v: v.split("_")[1]),
        # Take the position only, not the whole chrom:pos prefix
        pos = lambda df: df['variant_id'].map(lambda v: v.split("_")[0].split(":")[1]))
    trans_df.to_csv("$[_output]", sep='\t')
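
The cis-filtering step above removes variant-gene pairs that lie close to the gene on the same chromosome. The sketch below illustrates that idea in plain Python; it is not the tensorQTL implementation, and the helper name `is_cis`, the inclusive boundary, and the example coordinates are assumptions.

```python
def is_cis(variant_chrom, variant_pos, gene_chrom, gene_tss, window=1_000_000):
    """True when the variant is on the gene's chromosome within `window` bp of its TSS."""
    return variant_chrom == gene_chrom and abs(variant_pos - gene_tss) <= window

# Hypothetical association hits: (chromosome, position) of variant and gene TSS.
hits = [
    {"variant": ("chr1", 1_500_000), "gene": ("chr1", 1_000_000)},  # 500 kb apart: cis
    {"variant": ("chr1", 9_000_000), "gene": ("chr1", 1_000_000)},  # 8 Mb apart: trans
    {"variant": ("chr2", 1_000_100), "gene": ("chr1", 1_000_000)},  # other chromosome: trans
]

# Keep only the pairs that are not cis under the assumed window.
trans_hits = [h for h in hits if not is_cis(*h["variant"], *h["gene"])]
print(len(trans_hits))
```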

## Association results processing

In [None]:
[TensorQTL_cis_2]
input:  group_by = "all"
output: f'{cwd:a}/TensorQTL_recipe.tsv',f'{cwd:a}/TensorQTL_column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#chr" : [int(x.split(".")[-4].replace("chr","")) for x in  [$[_input["long_table"]:r,]]],
    "sumstat_dir" : [$[_input["long_table"]:r,]],
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "TSS_D": "tss_distance",
          "AF": "af",
          "MA_SAMPLES": "ma_samples",
          "MA_COUNT": "ma_count",
          "GENE": "phenotype_id"}), columns = ["TensorQTL"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )
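
The `#chr` column of the recipe is derived from the long table file name. The sketch below reproduces that extraction on a made-up path, assuming the chromosome label sits four dot-separated fields from the end; the helper name and the example path are hypothetical.

```python
def chrom_from_long_table(path):
    """Return the chromosome number found four dot-separated fields from the end."""
    field = path.split(".")[-4]
    # int() only succeeds for numeric chromosomes once the "chr" prefix is removed.
    return int(field.replace("chr", ""))

chrom = chrom_from_long_table("/work/pheno.chr21.norminal.cis_long_table.txt")
print(chrom)
```

Because of the `int()` conversion, a non-numeric label such as `chrX` raises a `ValueError` under this scheme, so sex chromosomes would need separate handling.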

In [1]:
[TensorQTL_trans_2]
input: group_by = "all"
output: f'{cwd:a}/TensorQTL_recipe.tsv',f'{cwd:a}/TensorQTL_column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#chr" : [int(x.split(".")[-3].replace("chr","")) for x in  [$[_input:r,]]],
    "sumstat_dir" : [$[_input:r,]],
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "AF": "af",
          "GENE": "gene_ID"}), columns = ["TensorQTL"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )