# Advanced cis-QTL analysis with individual level data

## Description

This notebook performs advanced statistical analyses of multiple xQTL in a given region. Procedures currently implemented include:

1. Univariate analysis
    - SuSiE
    - Univariate TWAS weights: LASSO, Elastic net, mr.mash and SuSiE (optional)
    - Cross validation of TWAS methods (optional but highly recommended if TWAS weights are computed)
2. Functional data (epigenomic xQTL) analysis
    - fSuSiE
3. Multivariate analysis
    - mvSuSiE
    - mr.mash

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. Either a list of per-chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format.
3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` format with a `bed.gz.tbi` index.
4. Vector of covariate files corresponding to the lists above.
5. Customized cis-window file. If it is not provided, a fixed-size cis-window set by `--window` will be used.
6. Optionally a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3` separated with whitespace.

Inputs 2 and 3 should be outputs of the `genotype_per_region` and `annotate_coord` modules from previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.
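
When no customized cis-window file is given, the workflow code later in this notebook extends the phenotype region by `--window` bp on each side and clips the start at zero. The sketch below reproduces only that arithmetic; `fixed_cis_window` is an illustrative helper name, not a function of the pipeline.

```python
def fixed_cis_window(chrom, start, end, window):
    """Extend a phenotype region by `window` bp on each side, clipping the start at 0.

    Illustrative helper (not part of the pipeline); mirrors the start_cis / end_cis
    computation in the get_analysis_regions step.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return f"{chrom}:{max(start - window, 0)}-{end + window}"


print(fixed_cis_window("chr12", 752578, 752579, 100000))  # chr12:652578-852579
print(fixed_cis_window("chr1", 5000, 5001, 100000))       # chr1:0-105001
```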

### Example genotype data

```
#chr        path
chr21 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr21.bed
chr22 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr22.bed
```

Alternatively, simply use `protocol_example.genotype.chr21_22.bed` if all chromosomes are in the same file.
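
For a per-chromosome list, the workflow code below prefixes chromosome labels with `chr` where needed and accepts only `chr1` to `chr22` and `chrX`. A minimal sketch of that check, with `normalize_chrom` and `check_chroms` as illustrative names that do not exist in the pipeline:

```python
VALID_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX"}


def normalize_chrom(value):
    """Add the 'chr' prefix when the label does not already carry it."""
    text = str(value)
    return text if text.startswith("chr") else f"chr{text}"


def check_chroms(values):
    """Normalize labels and fail on anything outside chr1..chr22 and chrX."""
    normalized = [normalize_chrom(v) for v in values]
    invalid = [c for c in normalized if c not in VALID_CHROMS]
    if invalid:
        raise ValueError(f"Invalid chromosome values found: {invalid}")
    return normalized


print(check_chroms([21, "22", "chrX"]))  # ['chr21', 'chr22', 'chrX']
```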

### Example phenotype list

```
#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   990508  990509  ENSG00000082805  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   2794969 2794970 ENSG00000004478  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   4649113 4649114 ENSG00000139180  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6124769 6124770 ENSG00000110799  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6534516 6534517 ENSG00000111640  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
```
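
A phenotype list like the one above is whitespace-delimited, with the first line serving as the header. The workflow reads it with pandas; the sketch below does the same with the standard library only, to show the expected layout. `read_region_list` is an illustrative name and the file paths are shortened for the example.

```python
import io


def read_region_list(handle):
    """Parse a whitespace-delimited phenotype list whose first line is the header."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        rows.append(row)
    return rows


# Shortened paths; the real list points to the actual bed.gz files.
example = io.StringIO(
    "#chr    start   end ID  path\n"
    "chr12   752578  752579  ENSG00000060237  protocol_example.protein.bed.gz\n"
    "chr12   990508  990509  ENSG00000082805  protocol_example.protein.bed.gz\n"
)
regions = read_region_list(example)
print([r["ID"] for r in regions])  # ['ENSG00000060237', 'ENSG00000082805']
```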

### Example cis-window file

It should have exactly 4 columns, with the header given as a commented-out line:

```
#chr    start    end    gene_id
chr10   0    6480000    ENSG00000008128
chr1    0    6480000    ENSG00000008130
chr1    0    6480000    ENSG00000067606
chr1    0    7101193    ENSG00000069424
chr1    0    7960000    ENSG00000069812
chr1    0    6480000    ENSG00000078369
chr1    0    6480000    ENSG00000078808
```

The key is that the ID in the 4th column must match the ID in the 4th column of the phenotype list. Otherwise the cis-window to analyze will not be found.
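
Before launching a run, it can help to check that every phenotype ID has a cis-window entry. The workflow joins the two tables on chromosome and ID and stops if any region is left unmatched. The sketch below mimics that lookup with plain tuples; `find_missing_windows` is an illustrative name and the cis-window coordinates are made up.

```python
def find_missing_windows(phenotype_rows, cis_rows):
    """Return phenotype IDs that have no (chr, ID) match in the cis-window rows.

    Each row is a (chr, start, end, ID) tuple.
    """
    window_keys = {(chrom, region_id) for chrom, _start, _end, region_id in cis_rows}
    return [
        region_id
        for chrom, _start, _end, region_id in phenotype_rows
        if (chrom, region_id) not in window_keys
    ]


phenotype_rows = [
    ("chr12", 752578, 752579, "ENSG00000060237"),
    ("chr12", 990508, 990509, "ENSG00000082805"),
]
# Illustrative cis-window entry; coordinates are not taken from a real file.
cis_rows = [("chr12", 652578, 852579, "ENSG00000060237")]

missing = find_missing_windows(phenotype_rows, cis_rows)
print(missing)  # ['ENSG00000082805']
```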

### About indels

Option `--no-indel` removes indels from the analysis. FIXME: Gao needs to provide more guidelines on how to deal with indels in practice.

## Output

For each analysis region, the output is the fitted SuSiE model saved in RDS format.

## Minimal Working Example Steps

### i. SuSiE with TWAS weights

Timing [FIXME]

Below we pass the same phenotype and covariate files twice to demonstrate that, when multiple phenotypes share the same genotype data, this pipeline can analyze all of them together (more than two are accepted as well).

Here using `--region-name` we focus the analysis on 3 genes. In practice if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. Some regions may have no genotype data, in which case the pipeline will output RDS files with a warning message indicating the lack of genotype data to analyze.

**Note:** The suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.

In [None]:
sos run pipeline/cis_workhorse.ipynb susie_twas  \
    --name protocol_example_protein  \
    --genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed   \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz  \
    --customized-cis-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
    --phenotype-names trait_A trait_B \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

It is also possible to analyze a selected list of regions using option `--region-list`. The last column of this file will be used as the list of regions to analyze. Here, for example, we use the same list of regions as the customized cis-window file:

In [None]:
sos run xqtl-pipeline/pipeline/cis_workhorse.ipynb susie_twas  \
    --name protocol_example_protein  \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed   \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz  \
    --customized-cis-windows xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --phenotype-names trait_A trait_B \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

**Note:** When both `--region-name` and `--region-list` are used, the union of regions from these parameters will be analyzed. 
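
The selection rule can be sketched as follows: IDs are taken from the last column of the region list (comment lines are ignored) and merged with the names given on the command line. `select_region_ids` is an illustrative helper, the IDs are placeholders, and sorting the result is a choice made here only to keep the printed output stable.

```python
def select_region_ids(region_list_lines, region_names):
    """Union of last-column IDs from a region list and explicitly named regions."""
    ids = set(region_names)
    for line in region_list_lines:
        # Drop everything after '#', as a comment-aware reader would.
        fields = line.split("#", 1)[0].split()
        if fields:
            ids.add(fields[-1])
    return sorted(ids)


region_list_lines = [
    "#chr    start    end    gene_id",
    "chr21   0    6480000    GENE_A",
    "chr22   0    7101193    GENE_B",
]
print(select_region_ids(region_list_lines, ["GENE_B", "GENE_C"]))
# ['GENE_A', 'GENE_B', 'GENE_C']
```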

FIXME: We should probably just explain these parameters, will work better for conversion script


To perform fine-mapping only without TWAS weights,

```
sos run pipeline/cis_workhorse.ipynb susie_twas --no-twas-weights ... # rest of parameters the same. 
```

To perform fine-mapping and TWAS weights without cross validation,

```
sos run pipeline/cis_workhorse.ipynb susie_twas --twas-cv-folds 0 ... # rest of parameters the same. 
```

It is also possible to analyze a subset of samples using the `--keep-samples` parameter. For example, we create a file holding the IDs of 50 samples,

In [None]:
zcat output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz | head -1 | awk '{for (i=2; i<=51; i++) printf $i " "; print ""}'> output/keep_samples.txt

then use them in our analysis,

```
sos run xqtl-pipeline/pipeline/cis_workhorse.ipynb susie_twas --keep-samples output/keep_samples.txt ... # rest of parameters the same
```
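
For readers who prefer Python to the `zcat`/`awk` one-liner above, the same file can be produced as sketched below. This assumes, as the one-liner does, that the first header field of the covariate file is not a sample ID. `first_sample_ids` is an illustrative name, and the toy covariate file is created only so the example runs on its own.

```python
import gzip
import os
import tempfile


def first_sample_ids(covariate_gz, n_samples=50):
    """Return up to `n_samples` IDs from the header line, skipping the first column."""
    with gzip.open(covariate_gz, "rt") as handle:
        header = handle.readline().split()
    return header[1:n_samples + 1]


# Toy covariate file standing in for the real one (contents are made up).
tmp_dir = tempfile.mkdtemp()
toy_cov = os.path.join(tmp_dir, "toy_covariates.gz")
with gzip.open(toy_cov, "wt") as handle:
    handle.write("#id\tsample1\tsample2\tsample3\tsample4\n")
    handle.write("PC1\t0.1\t0.2\t0.3\t0.4\n")

keep = first_sample_ids(toy_cov, n_samples=3)
keep_file = os.path.join(tmp_dir, "keep_samples.txt")
with open(keep_file, "w") as handle:
    handle.write(" ".join(keep) + "\n")  # whitespace-separated IDs on one line
print(keep)  # ['sample1', 'sample2', 'sample3']
```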

### ii. fSuSiE

Timing [FIXME]

**Note:** The suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.

In [None]:
sos run pipeline/cis_workhorse.ipynb fsusie \
    --name protocol_example_methylation \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.plink_per_chrom.txt \
    --phenoFile output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt \
                output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt  \
    --covFile output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

### iii. mrmash

In [None]:
sos run ~/githubrepo/xqtl-pipeline/my_cis_workhorse.ipynb mrmash \
    --name multi_trait \
    --genoFile  output/protocol_example.genotype.chr21_22.bed \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-cis-windows output/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-list output/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --canonical_mats T \
    --cwd output/ \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command interface

In [None]:
sos run cis_workhorse.ipynb -h

## Workflow implementation

In [1]:
[global]
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself. 
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provided the analysis will be focused on the provided regions.
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided
# the analysis will be focused on the union of the provided region list and region names
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# An optional list documenting the custom cis window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# When this is zero, we will rely on customized_cis_windows
parameter: window = 0
# It is required to input the name of the analysis
parameter: name = str
# save data object or not
parameter: save_data = False
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
parameter: seed = 999

def group_by_region(lst, partition):
    # from itertools import accumulate
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
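
The lookup order used by `adapt_file_path` above can be made explicit by listing the candidate locations in the order they are tried. The sketch below is a restatement for illustration, not a replacement; `candidate_paths` and `resolve` are hypothetical names and the example paths are made up.

```python
import os


def candidate_paths(file_path, reference_file):
    """List the locations tried for `file_path`, in order."""
    reference_dir = os.path.dirname(reference_file)
    file_name = os.path.basename(file_path)
    return [
        file_path,                              # 1. path as given
        file_name,                              # 2. bare file name in the working directory
        os.path.join(reference_dir, file_name), # 3. file name next to the reference file
        os.path.join(reference_dir, file_path), # 4. given path relative to the reference file
    ]


def resolve(file_path, reference_file):
    """Return the first candidate that exists, else raise FileNotFoundError."""
    for candidate in candidate_paths(file_path, reference_file):
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"No valid path found for file: {file_path}")


candidates = candidate_paths("pheno/a.bed.gz", "lists/regions.txt")
for position, candidate in enumerate(candidates, start=1):
    print(position, candidate)
```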

In [1]:
[get_analysis_regions: shared = "regional_data"]
# input is genoFile, phenoFile, covFile and optionally region_list. If region_list is present then we only analyze what's contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
def adapt_file_paths_and_assign(pheno_df, pheno_path, cov_path, phenotype_name):
    pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], os.path.dirname(pheno_path))
    return pheno_df.assign( cov_path=str(cov_path), cond=phenotype_name)

def preload_id_map(id_map_files):
    id_maps = {}
    for id_map_file in id_map_files:
        if id_map_file is not None and os.path.isfile(id_map_file):
            df = pd.read_csv(id_map_file, sep=r'\s+', header=None, comment='#', names=['old_ID', 'new_ID'])
            id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
    return id_maps

def process_pheno_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
    '''
    Example output:
    #chr    start      end    ID  Original_ID   path     cov_path             cond
    0  chr12   752578   752579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B
    '''
    accumulated_pheno_df = pd.DataFrame()

    pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files

    for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
        if not os.path.isfile(cov_path):
            raise FileNotFoundError(f"No valid path found for file: {cov_path}")
        
        pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0)
        
        # Skip filtering by region_ids if region_ids list is empty
        if region_ids:
            pheno_df = pheno_df[pheno_df['ID'].isin(region_ids)]
        
        if pheno_df.empty:
            continue
    
        # Apply new ID
        pheno_df['Original_ID'] = pheno_df['ID']
        if id_map_path in preloaded_id_maps:
            id_map = preloaded_id_maps[id_map_path]
            pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
        # Adjust file paths and phenotype names
        pheno_df = adapt_file_paths_and_assign(pheno_df, pheno_path, cov_path, phenotype_name)
        # Take in new condition data
        accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
    
    # aggregate cross-condition data-sets into a flattened matrix
    accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr', 'start', 'end', 'ID', 'Original_ID'], as_index=False).agg({
                    'cond': ','.join,
                    'path': ','.join,
                    'cov_path': ','.join
                })
    # handle the case where multiple original IDs correspond to the same ID (at gene or region level): we simply aggregate them
    accumulated_pheno_df = accumulated_pheno_df.groupby(accumulated_pheno_df.columns.difference(['Original_ID']).tolist(), as_index=False).agg({'Original_ID': ','.join})

    if accumulated_pheno_df['Original_ID'].duplicated(keep=False).any():
        duplicated_rows = accumulated_pheno_df[accumulated_pheno_df['Original_ID'].duplicated(keep=False)]
        error_message = "Original phenotypic ID should be unique, but duplicates are found that map to more than one region of interest. " \
                        "Please check your phenotype data file and/or phenotype ID mapping files " \
                        "to ensure there are no duplicates in the original ID. " \
                        "Duplicated rows:\n{}".format(duplicated_rows)
        raise ValueError(error_message)
    
    return accumulated_pheno_df

# Load phenotype meta data
if len(phenoFile) != len(covFile):
    raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
    raise ValueError("Number of input phenotypes files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
    raise ValueError("Number of input phenotypes files must match the number of phenotype ID mapping files")

# Load genotype meta data
if f"{genoFile:x}" == ".bed":
    geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
else:
    geno_meta_data = pd.read_csv(f"{genoFile:a}", sep=r"\s+", header=0)
    geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
    geno_meta_data.columns = ['#chr', 'geno_path']
    geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')

# Checking the DataFrame
valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
    raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")

region_ids = []
# If region_list is provided, read the file and extract IDs
if region_list.is_file():
    region_list_df = pd.read_csv(region_list, sep=r"\s+", header=None, comment="#")
    # Extract the last column for IDs; convert to a plain list so that `if region_ids:` downstream is well defined
    region_ids = region_list_df.iloc[:, -1].unique().tolist()

# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))

preloaded_id_maps = preload_id_map(phenoIDFile)
meta_data = process_pheno_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preloaded_id_maps)
meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')

# Adjust cis-window
if os.path.isfile(customized_cis_windows):
    print(f"Loading customized cis-window data from {customized_cis_windows}")
    cis_list = pd.read_csv(customized_cis_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
    meta_data = pd.merge(meta_data, cis_list, on=['#chr', 'ID'], how='left', suffixes=('', '_cis')) 
    mismatches = meta_data[meta_data['start_cis'].isna()]
    if not mismatches.empty:
        print("First 5 mismatches:")
        print(mismatches[['ID']].head())
        raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_cis_windows}``. Please check your ``{customized_cis_windows}`` database to make sure it contains all cis-window definitions. ")
else:
    if window <= 0:
        raise ValueError("Please either input valid path to cis-window file via ``--customized-cis-windows``, or set ``--window`` to a positive integer")
    meta_data['start_cis'] = meta_data['start'].apply(lambda x: max(x - window, 0))
    meta_data['end_cis'] = meta_data['end'] + window

# Example meta_data:
# #chr    start      end    start_cis       end_cis           ID  Original_ID   path     cov_path             cond             coordinate     geno_path
# 0  chr12   752578   752579  652578   852579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B    chr12:752578-752579  protocol_example.genotype.chr21_22.bed       
# Create the final dictionary
regional_data = {
    'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
    'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotype region
                   f"{row['#chr']}:{row['start_cis']}-{row['end_cis']}", # this is the cis-window region
                   row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
}
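
Each entry of `regional_data['data']` is a flat tuple: one genotype file, then all phenotype files, then all covariate files. The analysis steps recover the three parts by position. A minimal sketch of that slicing, using the file names from the comment above; `split_region_inputs` is an illustrative name:

```python
def split_region_inputs(files):
    """Split (genotype, pheno_1..pheno_n, cov_1..cov_n) into its three parts."""
    n_conditions = len(files) // 2  # len(files) == 2 * n + 1
    genotype = files[0]
    phenotypes = list(files[1:n_conditions + 1])
    covariates = list(files[n_conditions + 1:])
    return genotype, phenotypes, covariates


entry = ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz",
         "covariate_1.gz", "covariate_2.gz")
genotype, phenotypes, covariates = split_region_inputs(entry)
print(genotype)    # genotype_2.bed
print(phenotypes)  # ['phenotype_1.bed.gz', 'phenotype_2.bed.gz']
print(covariates)  # ['covariate_1.gz', 'covariate_2.gz']
```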

## Univariate analysis: SuSiE and TWAS

In [1]:
[susie_twas_1]
# initial number of single effects for SuSiE
parameter: init_L = 8
# maximum number of single effects to use for SuSiE
parameter: max_L = 30
# remove a variant if its fraction of missing individual-level data exceeds imiss
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.0025
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: pip_cutoff = 0.025
# If this value is greater than 0, an initial single effect analysis will be performed 
# to determine if follow up analysis will be continued or to simply return NULL
parameter: skip_analysis_pip_cutoff = -1.0
parameter: coverage = [0.95, 0.7, 0.5]
# Perform Fine-mapping
parameter: fine_mapping = True
# Compute TWAS weights as well
parameter: twas_weights = True
# Perform K-fold cross validation (CV) for TWAS
# Set it to zero if this is to be skipped
parameter: twas_cv_folds = 5
parameter: twas_cv_threads = twas_cv_folds
# maximum number of variants to consider for CV
# We will randomly pick a subset of them for CV purposes
parameter: max_cv_variants = 5000
# Further limit CV to only using common variants
parameter: min_cv_maf = 0.05
parameter: ld_reference_meta_file = path()
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[2]}.univariate{"_susie" if fine_mapping else ""}{"_twas_weights" if twas_weights else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    options(warn=1)
    library(pecotmr)
    # extract subset of samples
    keep_samples = NULL
    if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
      keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
      message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
    }
    # Load regional association data
    tryCatch({
    fdat = load_regional_univariate_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])}),
                                          region = "${_meta_info[0]}",
                                          cis_window = "${_meta_info[1]}",
                                          conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])}),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"},
                                          keep_samples = keep_samples,
                                          extract_region_name = c(${",".join(['"%s"' % x for x in _meta_info[3].split(',')])}),
                                          phenotype_header = 4,
                                          region_name_col = 4,
                                          scale_residuals = FALSE)
    }, NoSNPsError = function(e) {
        message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
        #saveRDS(NULL, ${_output:ar})
        saveRDS(list("${_meta_info[2]}" = e$message), ${_output:ar}, compress='xz')
        quit(save="no")
    })
  
    if (${"TRUE" if not (fine_mapping or twas_weights) else "FALSE"}) {
      # only export data
      saveRDS(list("${_meta_info[2]}" = fdat), ${_output:ar}, compress='xz')
      quit(save="no")
    } else {
      if (${"TRUE" if save_data else "FALSE"}) {
          # save data object for debug purpose
          saveRDS(list(${_meta_info[2]} = fdat), "${_output:ann}.univariate.rds", compress='xz')
      }
    }
    # Univariate analysis suite
    run_univariate_pipeline <- function(X, Y, X_scalar, Y_scalar, maf, dropped_samples, pip_cutoff_to_skip = ${skip_analysis_pip_cutoff}) {
      if (pip_cutoff_to_skip>0) {
          # return a NULL set if the top loci model does not show any potentially significant variants
          top_model_pip = susie(X,Y,L=1)$pip
          if (!any(top_model_pip>pip_cutoff_to_skip)) {
              return(list())
          }
      }
      st = proc.time()
      if (${"TRUE" if fine_mapping else "FALSE"}) {
          res = susie_wrapper(X, Y, init_L=${init_L}, max_L=${max_L}, refine=TRUE, coverage = ${coverage[0]})
          res = susie_post_processor(res, X, Y, X_scalar, Y_scalar, maf,
                                 secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                 other_quantities = list(dropped_samples = dropped_samples))
      }
      else {
          res = list()
      }
      if ( ${"TRUE" if twas_weights else "FALSE"} ) {
        twas_weights_output <- twas_weights_pipeline(X, Y, maf, susie_fit=res$susie_result_trimmed, 
                                     ld_reference_meta_file = ${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
                                     X_scalar = X_scalar, y_scalar = Y_scalar,
                                     cv_folds = ${twas_cv_folds}, coverage=${coverage[0]}, secondary_coverage=c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                     min_cv_maf=${min_cv_maf}, max_cv_variants=${max_cv_variants}, cv_seed=${seed}, cv_threads=${twas_cv_threads})
        # clean up the output database
        res = c(res, twas_weights_output)
        res$twas_weights = lapply(res$twas_weights, function(x) { rownames(x) <- NULL; return(x) })
      }
      res$total_time_elapsed = proc.time() - st
      if ("${_meta_info[2]}" != "${_meta_info[3]}") {
          region_name = c("${_meta_info[2]}", ${",".join(['"%s"' % x for x in _meta_info[3].split(',')])})
      } else {
          region_name = "${_meta_info[2]}"
      }
      res$region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
      return (res)
    }
  
    fitted = list()
    condition_names = vector()
    r = 1
    while (r<=length(fdat$residual_Y)) {
      dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], 
                             y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                             covar=fdat$dropped_sample$dropped_samples_covar[[r]])
      if (ncol(fdat$residual_Y[[r]]) == 1) {
          condition_names = c(condition_names, names(fdat$residual_Y)[r])
      } else {
          new_names = colnames(fdat$residual_Y[[r]])
          if (is.null(new_names)) {
              # column names do not exist, create generic names instead
              new_names = 1:ncol(fdat$residual_Y[[r]])
          }
          new_names = paste(names(fdat$residual_Y)[r], new_names, sep="_") # DLPFC_iso1 DLPFC_iso2 
          condition_names = c(condition_names, new_names) # ACC DLPFC_iso1 DLPFC_iso2 
      }
      results <- lapply(1:ncol(fdat$residual_Y[[r]]), function(i) run_univariate_pipeline(fdat$residual_X[[r]], 
                                                                                    fdat$residual_Y[[r]][,i,drop=FALSE], 
                                                                                    fdat$residual_X_scalar[[r]], 
                                                                                    if (fdat$residual_Y_scalar[[r]] == 1) 1 else fdat$residual_Y_scalar[[r]][,i,drop=FALSE], 
                                                                                    fdat$maf[[r]], dropped_samples))
      fitted = c(fitted, results)
      # original data no longer relevant, set to NA to release memory
      fdat$residual_X[[r]] <- NA
      fdat$residual_Y[[r]] <- NA
      r = r + 1
    }
    names(fitted) <- condition_names
    saveRDS(list("${_meta_info[2]}" = fitted), ${_output:ar}, compress='xz')
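
The naming rule applied in the loop above is: a condition with a single phenotype column keeps its own name, while a condition with several columns yields one name per column, formed as `condition_column`, with column indices used when column names are missing. The Python sketch below restates this rule outside R; `build_condition_names` and the `PCC` condition are illustrative.

```python
def build_condition_names(blocks):
    """blocks: list of (condition, n_columns, column_names or None) tuples."""
    names = []
    for condition, n_columns, column_names in blocks:
        if n_columns == 1:
            names.append(condition)
            continue
        # Fall back to 1-based indices when no column names are available.
        labels = column_names if column_names is not None else range(1, n_columns + 1)
        names.extend(f"{condition}_{label}" for label in labels)
    return names


blocks = [
    ("ACC", 1, None),
    ("DLPFC", 2, ["iso1", "iso2"]),
    ("PCC", 2, None),  # hypothetical condition without column names
]
print(build_condition_names(blocks))
# ['ACC', 'DLPFC_iso1', 'DLPFC_iso2', 'PCC_1', 'PCC_2']
```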

## Multivariate analysis: mvSuSiE and mr.mash

### mvSuSiE

In [None]:
[mvsusie_1]
# Prior model file generated from mashr. 
# Default will be used if it does not exist.
parameter: mixture_prior = path()
parameter: max_L = 20
# remove a variant if its fraction of missing individual-level data exceeds imiss
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set the default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of the performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data where sample size is ~1,000 at most (as of 2022)
parameter: mac = 10
# Remove indels if indel = False
parameter: indel = True
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')


meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0]}.susie_fitted.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
   
    get_prior_indices <- function(Y, U) {
      # make sure the prior col/rows match the colnames of the Y matrix
      y_names = colnames(Y)
      u_names = colnames(U)
      if (is.null(y_names) || is.null(u_names)) {
          return(NULL)
      } else if (identical(y_names, u_names)) {
          return(NULL)
      } else {
          return(match(y_names, u_names))
      }
    }

    # Load regional association data
    fdat = load_regional_multivariate_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = ${'"%s:%s-%s"' % (_meta_info[1], _meta_info[2], _meta_info[3])},
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"})

    # univariate summary statistics
    non_missing = lapply(1:ncol(fdat$residual_Y), function(r) which(!is.na(fdat$residual_Y[,r])))
    univariate_res = lapply(1:ncol(fdat$residual_Y), function(r) susieR:::univariate_regression(fdat$X[non_missing[[r]], ], fdat$residual_Y[non_missing[[r]], r]))
    sumstat = list(bhat=do.call(cbind, lapply(1:ncol(fdat$residual_Y), function(r) univariate_res[[r]]$betahat)),
                   sbhat=do.call(cbind, lapply(1:ncol(fdat$residual_Y), function(r) univariate_res[[r]]$sebetahat)))
  
    # Multivariate fine-mapping
    # FIXME: handle it when prior does not exist
    prior = readRDS(${mixture_prior:r})
    print(paste("Number of components in the mixture prior:", length(prior$U)))
    prior = mvsusieR::create_mash_prior(mixture_prior=list(weights=prior$w, matrices=prior$U), include_indices = get_prior_indices(fdat$residual_Y, prior$U[[1]]), max_mixture_len=-1)   
    resid_Y = compute_cov_flash(fdat$residual_Y)
    st = proc.time()
    fitted = mvsusieR::mvsusie(fdat$X, 
                               fdat$residual_Y, 
                               L=${max_L}, 
                               prior_variance=prior, 
                               residual_variance=resid_Y, 
                               precompute_covariances=F, 
                               compute_objective=T, 
                               estimate_residual_variance=F, 
                               estimate_prior_variance=T, 
                               estimate_prior_method='EM',
                               max_iter = 200, 
                               n_thread=${numThreads}, 
                               approximate=F)
    fitted$analysis_time = proc.time() - st
    fitted$cs_corr = susieR::get_cs_correlation(fitted, X=fdat$X)
    fitted$cs_snps = names(fitted$X_column_scale_factors[unlist(fitted$sets$cs)])
    fitted$variable_name = names(fitted$pip)
    fitted$analysis_script = load_script()
    fitted$dropped_samples = fdat$dropped_sample
    fitted$sample_names = colnames(fdat$residual_Y)
    fitted$residual_y = resid_Y
    saveRDS(fitted, ${_output:ar})

### mr.mash
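
The R script in this step is a template: with `expand = '${ }'`, each `${...}` holds a Python expression whose value is substituted into the script text before R runs it. That is why booleans are converted by hand (for example `${"T" if update_w0 else "F"}`) and strings are wrapped in quotes. A rough sketch of that rendering, assuming hypothetical helpers `r_bool` and `r_string` that are not part of SoS or of this notebook:

```python
def r_bool(flag: bool) -> str:
    """Render a Python bool as an R logical literal."""
    return "TRUE" if flag else "FALSE"

def r_string(value: str) -> str:
    """Render a Python string as a double-quoted R string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

# Illustrative parameter values mirroring some defaults of this step
params = {"update_w0": True, "standardize": False, "B_init_method": "enet", "tol": 1e-2}

r_call_args = ", ".join([
    f"update_w0 = {r_bool(params['update_w0'])}",
    f"standardize = {r_bool(params['standardize'])}",
    f"B_init_method = {r_string(params['B_init_method'])}",
    f"tol = {params['tol']}",
])
print(r_call_args)
```

Spelling out `TRUE`/`FALSE` instead of `T`/`F` is slightly safer on the R side, because `T` and `F` are ordinary variables that can be reassigned.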

In [None]:
[mrmash_1]
# Prior model file generated from mashr. 
# Default will be used if it does not exist.
parameter: mixture_prior = path()
# Remove a variant if more than a fraction imiss of its individual-level data is missing
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.05
# MAC cutoff, on top of MAF cutoff
parameter: mac = 0
# Remove indels if indel = False
parameter: indel = True
# Path to prior grid data file: an RDS file with scaling factors
parameter: prior_grid = path('.')
# Path to prior weights data file: an RDS file with prior weights
parameter: prior_weights = path('.')
parameter: nthreads = 2
parameter: var_cutoff = 0.05
parameter: n_nonmiss_Y = 100
parameter: canonical_mats = False
parameter: standardize = False
parameter: update_w0 = True
parameter: w0_threshold = 0.0
parameter: update_V = True
parameter: update_V_method = "full"
parameter: B_init_method = "enet"
parameter: max_iter = 5000
parameter: tol = 1e-2
parameter: verbose = False
parameter: save_model = False
parameter: glmnet_pred = False
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0]}.mrmash.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(pecotmr)
    # Load regional association data

    fdat = load_regional_multivariate_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          conditions = c('${_meta_info[2]}'),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          xvar_cutoff = ${var_cutoff},
                                          imiss_cutoff = ${imiss},
                                          cis_window = "${_meta_info[1]}",
                                          matrix_y_min_complete = ${n_nonmiss_Y},
                                          keep_indel = ${"TRUE" if indel else "FALSE"})

    if (file_test("-f", ${prior_grid:r})) {prior_grid <- readRDS(${prior_grid:r})} else {prior_grid <- NULL}
    if (file_test("-f", ${mixture_prior:r})) {mixture_prior <- readRDS(${mixture_prior:r})} else {mixture_prior <- NULL}

    # X and Y are both matrices. X has no missing data; Y may have some.
    # Y is residualized (covariates removed); X is not.
    # Start the timer before the fit so that analysis_time covers the model fitting
    st = proc.time()
    res = mrmash_wrapper(fdat$X, fdat$residual_Y[[1]],
                            prior_grid = prior_grid,
                            prior_data_driven_matrices=mixture_prior, 
                            nthreads = ${nthreads},
                            prior_canonical_matrices = ${"T" if canonical_mats else "F"},
                            standardize = ${"T" if standardize else "F"},
                            update_w0 = ${"T" if update_w0 else "F"},
                            w0_threshold = ${w0_threshold},
                            update_V = ${"T" if update_V else "F"},
                            update_V_method = "${update_V_method}",
                            B_init_method = "${B_init_method}",
                            max_iter = ${max_iter},
                            tol = ${tol},
                            verbose = ${"T" if verbose else "F"}
                          )

    res$analysis_time = proc.time() - st
    res$analysis_script = load_script()
    res$dropped_samples = fdat$dropped_sample
    res$sample_names = colnames(fdat$residual_Y)
    saveRDS(res, ${_output:ar}, compress='xz')

## Functional regression fSuSiE for epigenomic QTL fine-mapping
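
Before fitting, this step drops any study (condition) whose phenotype has fewer than `epigenetics_mark_treshold` marks in the region, while leaving shared entries such as the genotype matrix untouched. The filtering logic can be sketched in Python as follows; the dictionary layout and the function name `filter_by_mark_count` are illustrative assumptions, not the structure returned by `load_regional_functional_data`:

```python
SHARED_KEYS = {"dropped_sample", "X", "chrom"}  # entries not indexed by study

def filter_by_mark_count(fdat: dict, min_marks: int) -> dict:
    """Keep only studies with at least `min_marks` rows of coordinates."""
    keep = [len(coords) >= min_marks for coords in fdat["Y_coordinates"]]
    filtered = {}
    for key, value in fdat.items():
        if key in SHARED_KEYS:
            filtered[key] = value  # passed through unchanged
        else:
            # per-study list: subset in parallel with Y_coordinates
            filtered[key] = [v for v, k in zip(value, keep) if k]
    return filtered

# Toy data: study 1 has 20 marks, study 2 has only 3
toy = {
    "Y_coordinates": [list(range(20)), list(range(3))],
    "residual_Y": ["Y_study1", "Y_study2"],
    "X": "genotype_matrix",
}
result = filter_by_mark_count(toy, min_marks=16)
print(result["residual_Y"])
```

If no study passes, the per-study lists come back empty, which is the case the script checks before skipping the region.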

In [2]:
[fsusie_1]
# initial number of single effects for SuSiE
parameter: init_L = 8
# maximum number of single effects to use for SuSiE
parameter: max_L = 20
# Remove a variant if more than a fraction imiss of its individual-level data is missing
# here we don't remove any because we have done QC before
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.005
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: pip_cutoff = 0.025
parameter: coverage = [0.95, 0.7, 0.5]
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior = "mixture_normal"
parameter: max_SNP_EM = 100
# max_scale sets the number of phenotypes in the transformed space to 2^max_scale. The default of 10 gives 2^10 = 1024. Don't change it unless you know what you are doing. max_scale should be larger than 5.
parameter: max_scale = 10
# Minimum purity used to call credible sets (CS)
parameter: min_purity = 0.5
# Epigenetics mark filter
parameter: epigenetics_mark_treshold = 16
# Run susie for top pc of the fsusie input
parameter: run_susie_top_pc = False
# Compute TWAS weights as well
parameter: twas_weights = True
# Perform K-fold cross-validation (CV) for TWAS
# Set it to zero if this is to be skipped
parameter: twas_cv_folds = 5
parameter: twas_cv_threads = twas_cv_folds
# maximum number of variants to consider for CV
# We will randomly pick a subset of it for CV purpose
parameter: max_cv_variants = 5000
# Further limit CV to only using common variants
parameter: min_cv_maf = 0.05
parameter: ld_reference_meta_file = path()
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0]}.fsusie_{prior}{"_weights_db" if twas_weights else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    options(warn=1)
    # extract subset of samples
    keep_samples = NULL
    if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
      keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
      message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
    }

    # Load regional functional data
    library(pecotmr)
    tryCatch({
    fdat = load_regional_functional_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])}),
                                          region = "${_meta_info[0]}",
                                          cis_window = "${_meta_info[1]}",
                                          conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])}),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"},
                                          keep_samples = keep_samples,
                                          tabix_header = TRUE,
                                          phenotype_header = 4,
                                          region_name_col = 4,
                                          scale_residuals = FALSE)
    }, NoSNPsError = function(e) {
        message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
        #saveRDS(NULL, ${_output:ar})
        saveRDS(list("${_meta_info[0]}" = e$message), ${_output:ar}, compress='xz')
        quit(save="no")
    })
    # Drop studies in fdat that have fewer epigenomic marks than the threshold.
    library(tidyverse)
    filter_fdat_except_specific_names <- function(fdat, n) {
        # Identify which studies meet the row count criterion
        indices_to_keep <- sapply(fdat$Y_coordinates, function(x) nrow(x) >= n)
        fdat_filtered <- map(fdat[!names(fdat) %in% c("dropped_sample", "X", "chrom")],~.x[indices_to_keep]) 
        return(c(fdat_filtered,fdat[names(fdat) %in% c("dropped_sample", "X", "chrom")]))
    }

    fdat = filter_fdat_except_specific_names(fdat, n = ${epigenetics_mark_treshold})
    # Check if Y_coordinates is empty after filtering
    if (length(fdat$Y_coordinates) == 0) {
        e_msg = paste0("None of the studies has at least ", ${epigenetics_mark_treshold}, " epigenetic marks; region skipped")
        message(e_msg)
        saveRDS(list("${_meta_info[0]}" = e_msg ),  ${_output:ar}, compress='xz')
        quit(save="no")
    }

    if (${"TRUE" if save_data else "FALSE"}) {
      # save data object for debug purpose
      saveRDS(list("${_meta_info[0]}" = fdat), "${_output:ann}.${epigenetics_mark_treshold}_marks.dataset.rds", compress='xz')
    }
  
    fitted = setNames(replicate(length(fdat$residual_Y), list(), simplify = FALSE), names(fdat$residual_Y))
    for (r in 1:length(fitted)) {
        st = proc.time()
        fitted[[r]] = list()
        message(paste("Dimension of Y matrix is ", nrow(fdat$residual_Y[[r]]), "rows by", ncol(fdat$residual_Y[[r]]), "columns."))
        
        # Get top PC data
        top_pc_data <- prcomp(fdat$residual_Y[[r]], center = TRUE, scale. = TRUE)$x[,1]
        
        # Run SuSiE on top PC
        if(${"TRUE" if (run_susie_top_pc or twas_weights) else "FALSE"}) {
            fitted[[r]]$susie_on_top_pc <- susie_wrapper(fdat$residual_X[[r]], top_pc_data, init_L=${init_L}, max_L=${max_L}, refine=TRUE, coverage = ${coverage[0]})
            fitted[[r]]$susie_on_top_pc <- susie_post_processor(fitted[[r]]$susie_on_top_pc, fdat$residual_X[[r]], top_pc_data, fdat$residual_X_scalar[[r]], 1, fdat$maf[[r]],
                                           secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                           other_quantities = list(dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], 
                                                                   y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                                                                   covar=fdat$dropped_sample$dropped_samples_covar[[r]])))
        }
        
        # Run TWAS weights on top PC
        # Same code as in the susie_twas step
        if ( ${"TRUE" if twas_weights else "FALSE"} ) {
            twas_weights_output <- twas_weights_pipeline(fdat$residual_X[[r]], top_pc_data, fdat$maf[[r]], susie_fit=fitted[[r]]$susie_on_top_pc$susie_result_trimmed, 
                                     ld_reference_meta_file = ${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
                                     X_scalar = fdat$residual_X_scalar[[r]], y_scalar = fdat$residual_Y_scalar[[r]],
                                     cv_folds = ${twas_cv_folds}, coverage=${coverage[0]}, secondary_coverage=c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                     min_cv_maf=${min_cv_maf}, max_cv_variants=${max_cv_variants}, cv_seed=${seed}, cv_threads=${twas_cv_threads})
            # clean up the output database
            fitted[[r]] = c(fitted[[r]], twas_weights_output)
            fitted[[r]]$twas_weights = lapply(fitted[[r]]$twas_weights, function(x) { rownames(x) <- NULL; return(x) })
        }
          
        # Run fSuSiE -- this can take a while
        fitted[[r]]$fsusie_result <- fsusie_wrapper(X = fdat$residual_X[[r]],
                                      Y = fdat$residual_Y[[r]],
                                      pos=fdat$Y_coordinates[[r]]$start,
                                      L=${max_L},
                                      prior="${prior}",
                                      max_SNP_EM=${max_SNP_EM}, 
                                      max_scale = ${max_scale},
                                      min.purity = ${min_purity},
                                      cov_lev = ${coverage[0]})
        fitted[[r]]$fsusie_summary <- susie_post_processor(fitted[[r]]$fsusie_result, fdat$residual_X[[r]], top_pc_data, fdat$residual_X_scalar[[r]], 1, fdat$maf[[r]], 
                                                          secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                                          other_quantities = list(dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                                                                                  covar=fdat$dropped_sample$dropped_samples_covar[[r]])))
        fitted[[r]]$fsusie_summary$susie_result_trimmed = NULL
        fitted[[r]]$total_time_elapsed = proc.time() - st
        fitted[[r]]$region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name="${_meta_info[2]}")
        # original data no longer relevant, set to NA to release memory
        fdat$residual_X[[r]] <- NA
        fdat$residual_Y[[r]] <- NA
    }
    saveRDS(list("${_meta_info[0]}" = fitted), ${_output:ar}, compress='xz')

## Functional regression fSuSiE with other modalities
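
The script in this step reorders each phenotype matrix to follow the sample order of the genotype matrix, using a left join on sample IDs so that samples without a phenotype are kept as missing rows. A minimal sketch of that alignment with plain Python dictionaries (sample IDs, values and the function name `align_to_samples` are invented for illustration):

```python
def align_to_samples(sample_order, phenotype_by_sample, n_features):
    """Return rows ordered like `sample_order`; missing samples get None values."""
    missing_row = [None] * n_features
    return [phenotype_by_sample.get(s, missing_row) for s in sample_order]

genotype_samples = ["s1", "s2", "s3"]
phenotype = {"s3": [0.3, 1.2], "s1": [0.1, -0.4]}  # s2 has no phenotype

aligned = align_to_samples(genotype_samples, phenotype, n_features=2)
print(aligned)
```

Keeping every genotype sample, rather than taking the intersection, gives all modalities the same number of rows in the same order.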

In [None]:
[mvfsusie_1]
parameter: max_L = 30
# Remove a variant if more than a fraction imiss of its individual-level data is missing
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data where sample size is about 1,000 at most (as of 2022)
parameter: mac = 10
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior  = "mixture_normal_per_scale"
parameter: max_SNP_EM = 1000

depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')


meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0]}.mvfsusie_{prior}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
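    # pecotmr provides the data loader; tidyverse provides map(), left_join() and %>% used below
    library(pecotmr)
    library(tidyverse)
    # Load regional association data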
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = ${'"%s:%s-%s"' % (_meta_info[1], _meta_info[2], _meta_info[3])},
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss})
    # Fine-mapping with mvfSuSiE
    library("mvf.susie.alpha")
    Y = map(fdat$residual_Y, ~left_join(fdat$X[,1]%>%as.data.frame%>%rownames_to_column("rowname"), .x%>%t%>%as.data.frame%>%rownames_to_column("rowname") , by = "rowname")%>%select(-2)%>%column_to_rownames("rowname")%>%as.matrix )
    fitted <- multfsusie(Y_f = list(Y[[1]],Y[[3]]), 
                         Y_u = Reduce(cbind, Y[[2]]),
                         pos = list(pos1 =fdat$phenotype_coordiates[[1]], pos2 = fdat$phenotype_coordiates[[3]]),
                         X=fdat$X,
                         L=${max_L},
                         data.format="list_df")
    saveRDS(fitted, ${_output:ar})