# GWAS integration: enrichment and colocalization

This workflow takes xQTL fine-mapping results, generated by `susie_twas` in the `cis_analysis.ipynb` notebook for cis-xQTL, and GWAS fine-mapping results, generated by `susie_rss` in the `rss_analysis.ipynb` notebook, and performs enrichment and colocalization analysis. It handles the case where the two sets of fine-mapping results come from different regions, as is typical for cis-xQTL and GWAS, by integrating data across those regions. Although originally tailored to cis-xQTL and GWAS integration, the pipeline can be applied to other pairwise integrations. One example is trans analysis, where the fine-mapped regions may be identical between trans-xQTL and GWAS; this is a special case of the more general implementation.

## Input

Lists of SuSiE fine-mapping output objects, in RDS format, of `class(susie)` in R. 

- For GWAS, the list is a metadata table with columns `chrom`, `start`, `end`, `region_id`, `file_path`, where `file_path` is one or more comma-separated RDS files.
- For xQTL, the list is a metadata table with columns `chrom`, `start`, `end`, `region_id`, `condition`, `file_path`, where `file_path` is one or more comma-separated RDS files. `condition` is meant to be optional; when it is not specified, all conditions inside the xQTL dataset are analyzed. The current code still requires the column to be present.
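
The metadata layout can be checked before running the workflow. The sketch below is illustrative only and is not part of the pipeline: it uses the column names that the workflow code checks for (`chrom`, `start`, `end`, `region_id`, `file_path`, plus `condition` for xQTL), and `read_meta` is a hypothetical helper. The toy table and its file names are made up.

```python
import csv
import io

# Columns the workflow code checks for in each metadata table.
REQUIRED_GWAS = ["chrom", "start", "end", "region_id", "file_path"]
REQUIRED_XQTL = ["chrom", "start", "end", "region_id", "condition", "file_path"]

def read_meta(handle, required):
    """Read a tab-separated metadata table and split comma-separated file paths."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    rows = []
    for row in reader:
        # One region may list several RDS files separated by commas.
        row["files"] = [fp.strip() for fp in row["file_path"].split(",")]
        rows.append(row)
    return rows

# Toy table with made-up paths, for illustration only.
toy_xqtl = io.StringIO(
    "chrom\tstart\tend\tregion_id\tcondition\tfile_path\n"
    "1\t3000\t7000\tgeneA\tcohort1:tissue1:eQTL\teQTL.geneA.rds, sQTL.geneA.rds\n"
)
xqtl_rows = read_meta(toy_xqtl, REQUIRED_XQTL)
print(xqtl_rows[0]["region_id"], xqtl_rows[0]["files"])
```

Passing `REQUIRED_GWAS` instead validates a GWAS table in the same way.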

## Output

1. Enrichment analysis results: a global enrichment estimate that combines all input data
2. Colocalization results for regions of interest
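
The enrichment step writes its result as plain text, one `name:value` line per element, and the colocalization step later reads `p1`, `p2` and `p12` from that file. As a rough sketch of how such a file could be inspected outside the workflow (in Python, whereas the workflow itself reads it in R), assuming made-up prior values and a hypothetical helper `parse_enrichment`:

```python
def parse_enrichment(lines):
    """Parse 'name:value' lines into a dict of floats, skipping non-numeric values."""
    values = {}
    for line in lines:
        # Split on the last colon so that names containing ':' are kept intact.
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            values[name] = float(value)
        except ValueError:
            continue
    return values

# Made-up numbers, for illustration only.
toy_lines = ["p1:1e-4", "p2:2e-4", "p12:5e-6"]
priors = parse_enrichment(toy_lines)
print(priors["p1"], priors["p2"], priors["p12"])
```

Which other entries appear in the file depends on what `pecotmr::xqtl_enrichment_wrapper` returns; only the three priors are used downstream here.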

## Example

Enrichment:

```
sos run ~/codes/xqtl-pipeline/pipeline/SuSiE_enloc.ipynb xqtl_gwas_enrichment \
--gwas_finemapped_meta_data  gwas_meta.tsv \
--xqtl_meta_data  xqtl_meta.tsv \
--xqtl_finemapping_obj Mic susie_result_trimmed  \
--xqtl_varname_obj Mic variant_names
```
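
List-valued options such as `--xqtl_finemapping_obj Mic susie_result_trimmed` are rendered into an R character vector inside the enrichment step, or into `NULL` when nothing is given. The sketch below mirrors the Python expression used in that step; `to_r_vector` is a hypothetical name introduced only for illustration.

```python
def to_r_vector(names):
    """Render a list of names as an R c(...) expression, or c(NULL) if empty."""
    body = ",".join('"%s"' % x for x in names) if len(names) != 0 else "NULL"
    return f"c({body})"

print(to_r_vector(["Mic", "susie_result_trimmed"]))  # c("Mic","susie_result_trimmed")
print(to_r_vector([]))                               # c(NULL)
```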

Colocalization:

```
sos run ~/codes/xqtl-pipeline/pipeline/SuSiE_enloc.ipynb susie_coloc \
--gwas_finemapped_meta_data  gwas_meta.tsv \
--xqtl_meta_data  xqtl_meta.tsv \
--xqtl_finemapping_obj  susie_result_trimmed  \
--xqtl_varname_obj  variant_names \
--xqtl_region_obj  region_info \
--enrichment_data /mnt/vast/hpc/csg/rf2872/Work/pecotmr/encoloc_test/output/xqtl_meta.gwas_meta.enrichment.txt  \
--ld_meta_file_path /mnt/vast/hpc/csg/data_public/20240120_ADSP_LD_matrix/ld_meta_file.tsv
```

Example `gwas_meta.tsv`:


```
chrom    start    end    region_id    file_path
8        2000     6000   block1     /mnt/vast/hpc/homes/dmc2245/project/UKBB_GWAS_dev/code/python/output/SuSiE_RSS/study1.8_26225312-27515963.susie_rss.rds
8        3000     7000   block2     /mnt/vast/hpc/homes/dmc2245/project/UKBB_GWAS_dev/code/python/output/SuSiE_RSS/study1.8_25007602-26225312.susie_rss.rds
```


Example `xqtl_meta.tsv`:

```
chrom    start    end    region_id    condition    file_path
8        2000     6000   ENSG00000140090      cohort1:tissue1:eQTL     /mnt/vast/hpc/csg/rf2872/Work/test/susie_test/MWE_2024/Mic_example.ENSG00000092964.susie_weights_db.mod.rds
1        3000     7000   ENSG00000030582      cohort1:tissue1:eQTL      /mnt/vast/hpc/csg/rf2872/Work/Multivariate/susie_2024_new/rds_files/ROSMAP_eQTL.ENSG00000030582.susie_weights_db.rds, /mnt/vast/hpc/csg/rf2872/Work/Multivariate/susie_2024_new/rds_files/ROSMAP_sQTL.ENSG00000030582.susie_weights_db.rds
```
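
The colocalization step looks for GWAS blocks that overlap each xQTL region. A minimal sketch of that matching, written in Python for illustration (the workflow does it in R) and using the toy coordinates from the two example tables; `overlaps` is a hypothetical helper and the intervals are treated as closed:

```python
def overlaps(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """Two closed intervals overlap if they share a chromosome and each starts before the other ends."""
    return chrom_a == chrom_b and start_a <= end_b and end_a >= start_b

# (chrom, start, end) taken from the example metadata tables.
gwas_blocks = {"block1": (8, 2000, 6000), "block2": (8, 3000, 7000)}
xqtl_regions = {"ENSG00000140090": (8, 2000, 6000), "ENSG00000030582": (1, 3000, 7000)}

matches = {
    gene: [name for name, block in gwas_blocks.items() if overlaps(*region, *block)]
    for gene, region in xqtl_regions.items()
}
print(matches)
```

With these toy inputs the chromosome 8 gene matches both blocks, while the chromosome 1 gene matches none and would be reported as having no overlap.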

In [None]:
[global]
# Workdir
parameter: cwd = path("output")
# A list of file paths for fine-mapped GWAS results. 
parameter: gwas_finemapped_meta_data = path
# A list of file paths for fine-mapped xQTL results. 
parameter: xqtl_meta_data = path
# Optional: if a region list is provided, the enrichment analysis will focus on the listed regions.
# The LAST column of this list contains the IDs of the regions to focus on
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will focus on the union of the provided region list and region names
parameter: region_name = []
# Name of the analysis; defaults to a name built from the base names of the xQTL and GWAS metadata files
parameter: name = f"{xqtl_meta_data:bn}.{gwas_finemapped_meta_data:bn}"
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "5m"
# Memory expected: quite large for enrichment analysis but small for xQTL colocalization
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1

import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))

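def group_by_region(lst, partition):
    # `partition` is already a list of per-region file lists, so it is returned as-is.
    # The commented-out code below is an earlier version that sliced the flat `lst`.
    # from itertools import accumulate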
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

In [None]:
[get_analysis_regions: shared = "regional_data"]
from collections import OrderedDict

def check_required_columns(df, required_columns):
    """Check if the required columns are present in the dataframe."""
    missing_columns = [col for col in required_columns if col not in list(df.columns)]
    if missing_columns:
        raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")

def extract_regional_data(gwas_meta_data, xqtl_meta_data):
    """
    Extract fine-mapped result file paths and region information from the GWAS and xQTL metadata files.

    Args:
    - gwas_meta_data (str): File path to the GWAS metadata file.
    - xqtl_meta_data (str): File path to the xQTL weight metadata file.
    
    Returns:
    - Tuple of two dictionaries:
        - GWAS Dictionary: Nested dictionary with region IDs as keys
        - xQTL Dictionary: Nested dictionary with region IDs as keys.
    """
    required_gwas_columns = [ 'chrom', 'start', 'end', 'region_id','file_path']
    required_xqtl_columns = [ 'chrom', 'start', 'end', 'region_id','condition', 'file_path']

    # Process GWAS metadata
    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    check_required_columns(gwas_df, required_gwas_columns)
    #gwas_df['file_path'] = adapt_file_path_all(gwas_df, 'file_path', gwas_meta_data)
    #gwas_df['region_id'] = gwas_df.apply(lambda row: f"{row['chrom']}:{row['start']}-{row['end']}", axis=1)

    gwas_dict = OrderedDict()
    for _, row in gwas_df.iterrows():
        file_paths = [fp.strip() for fp in row['file_path'].split(',')]
        gwas_dict[row['region_id']] = {"meta_info": [row['chrom'], row['start'], row['end'], row['region_id']],
                                       "files": file_paths}

    # Process xQTL metadata
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    check_required_columns(xqtl_df, required_xqtl_columns)
    #xqtl_df['file_path'] = adapt_file_path_all(xqtl_df, 'file_path', xqtl_meta_data)

    xqtl_dict = OrderedDict()
    for _, row in xqtl_df.iterrows():
        file_paths = [fp.strip() for fp in row['file_path'].split(',')]
        xqtl_dict[row['region_id']] = {"meta_info": [row['chrom'], row['start'], row['end'], row['region_id'], row['condition']],
                                       "files": file_paths}
    return gwas_dict, xqtl_dict

gwas_dict, xqtl_dict = extract_regional_data(gwas_finemapped_meta_data, xqtl_meta_data)
regional_data = dict([("GWAS", gwas_dict), ("xQTL", xqtl_dict)])
print(regional_data)

In [None]:
[xqtl_gwas_enrichment]
depends: sos_variable("regional_data")
parameter: xqtl_finemapping_obj = []
parameter: xqtl_varname_obj = []
parameter: gwas_finemapping_obj = []
parameter: gwas_varname_obj = []
output: f'{cwd:a}/{name}.enrichment.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
  # RDS files for GWAS data
  gwas_finemapped_data = c(${paths([x["files"] for x in regional_data["GWAS"].values()]):r,})
  # RDS files for xQTL data
  xqtl_finemapped_data = c(${paths([x["files"] for x in regional_data["xQTL"].values()]):r,})
  result = pecotmr::xqtl_enrichment_wrapper(gwas_files = gwas_finemapped_data, xqtl_files = xqtl_finemapped_data, 
                                              xqtl_finemapping_obj =  c(${",".join(['"%s"' % x  for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}), 
                                              xqtl_varname_obj =   c(${",".join(['"%s"' % x  for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}), 
                                              gwas_finemapping_obj =  c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}), 
                                              gwas_varname_obj =  c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}))
  writeLines(paste(names(result), unlist(result), sep = ":"), ${_output:ar})

In [1]:
[susie_coloc]
depends: sos_variable("regional_data")
parameter: enrichment_data = path
parameter: xqtl_finemapping_obj = ""
parameter: xqtl_varname_obj = ""
parameter: gwas_finemapping_obj = ""
parameter: gwas_varname_obj = ""
parameter: xqtl_region_obj = ""
parameter: gwas_region_obj = ""
parameter: ld_meta_file_path=path()
meta_info = [x["meta_info"] for x in regional_data['xQTL'].values()]
xqtl_files = [x["files"] for x in regional_data['xQTL'].values()]
input: xqtl_files, group_by = lambda x: group_by_region(x, xqtl_files), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[3]}.coloc.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(tidyverse)
    library(pecotmr)
    pkgs <- list.files("/mnt/vast/hpc/homes/rf2872/codes/pecotmr/R", full.names = TRUE)
    for(i in pkgs){
        source(i)
    }
    concat_var <- function(var) {
      if (!is.null(var)) {
        return(c(con, var))
      } else {
        return(NULL)
      }
    }
    # RDS files for xQTL data
    xqtl_finemapped_datas = c(${paths([x for x in _input]):r,})
    chrom = ${_meta_info[0]}
    start = ${_meta_info[1]} 
    end = ${_meta_info[2]}
    region = "${_meta_info[3]}"
    xqtl_condition = "${_meta_info[4]}"
    gwas_regions = c(${', '.join([f'"{":".join(map(str, meta_info[:3]))}"' for meta_info in [info['meta_info'] for info in regional_data["GWAS"].values()]])})
    gwas_blocks = c(${', '.join([f'"{":".join(map(str, meta_info[3:4]))}"' for meta_info in [info['meta_info'] for info in regional_data["GWAS"].values()]])})
    gwas_paths =c(${', '.join([f'"{file}"' for info in regional_data["GWAS"].values() for file in info['files']])})
    
    # Step 1: find relevant GWAS regions that overlap with the xQTL region of interest
 
    gwas_regions <- unlist(strsplit(gwas_regions, ","))
    overlap_index <- NULL
    for (i in seq_along(gwas_regions)) {
      # Parse "chrom:start:end" for the i-th GWAS block
      split_region <- unlist(strsplit(gwas_regions[i], ":"))
      block_chrom <- as.numeric(split_region[1])
      block_start <- as.numeric(split_region[2])
      block_end <- as.numeric(split_region[3])
      # Two intervals overlap only if each one starts before the other ends
      if (chrom == block_chrom && start <= block_end && end >= block_start) {
        overlap_index <- c(overlap_index, i)
      }
    }

    if (!is.null(overlap_index)) {
        message("The region overlaps with ", c(gwas_blocks[overlap_index]))
        gwas_finemapped_data <- gwas_paths[overlap_index]

        # Step 2: load enrichment analysis results
        # coloc_priors = get_coloc_prior(${enrichment_data:r})
        # Function to extract the numeric value for a given parameter name
        get_coloc_prior <- function(param_name, lines) {
          line <- grep(paste0(param_name, ":"), lines, value = TRUE)
          numeric_part <- as.numeric(gsub(paste0(".*", param_name, ":"), "", line))
          return(numeric_part)
        }

        # Extract values for p1, p2, and p12, reading the enrichment file once
        enrichment_lines <- readLines(${enrichment_data:r})
        p1 <- get_coloc_prior("p1", enrichment_lines)
        p2 <- get_coloc_prior("p2", enrichment_lines)
        p12 <- get_coloc_prior("p12", enrichment_lines)

        message("Priors are p1: ", p1, "; p2: ", p2, "; p12: ", p12)
        
       # Step 3: Apply colocalization analysis between each condition and GWAS

       coloc_res <- list()
       for(xqtl_finemapped_data in xqtl_finemapped_datas){
         cons <- readRDS(xqtl_finemapped_data)[[1]] %>% names 
  
         for( con in cons ){
           xqtl_finemapping_obj =  c(${f'"{xqtl_finemapping_obj}"' if xqtl_finemapping_obj else "NULL"}) %>% concat_var
           gwas_finemapping_obj =  c(${f'"{gwas_finemapping_obj}"' if gwas_finemapping_obj else "NULL"}) %>% concat_var
           xqtl_varname_obj =  c(${f'"{xqtl_varname_obj}"' if xqtl_varname_obj else "NULL"})  %>% concat_var
           gwas_varname_obj =  c(${f'"{gwas_varname_obj}"' if gwas_varname_obj else "NULL"})  %>% concat_var
           xqtl_region_obj =  c(${f'"{xqtl_region_obj}"' if xqtl_region_obj else "NULL"})  %>% concat_var
           gwas_region_obj =  c(${f'"{gwas_region_obj}"' if gwas_region_obj else "NULL"})   %>% concat_var
          
           coloc_res[[con]] <- coloc_wrapper(xqtl_file = xqtl_finemapped_data, gwas_files = gwas_finemapped_data, 
                               xqtl_finemapping_obj = xqtl_finemapping_obj, gwas_finemapping_obj = gwas_finemapping_obj,
                               xqtl_varname_obj = xqtl_varname_obj, gwas_varname_obj = gwas_varname_obj,
                               xqtl_region_obj = xqtl_region_obj, gwas_region_obj = gwas_region_obj,
                               p1 = p1, p2 = p2, p12 = p12)
           if (${"TRUE" if ld_meta_file_path.is_file() else "FALSE"}) {
             coloc_res[[con]] <- coloc_post_processor(coloc_res[[con]], LD_meta_file_path = ${ld_meta_file_path:r}, analysis_region= coloc_res[[con]]$analysis_region)
           }

        }
     
      }
    } else {
      print("No overlap found")
      coloc_res <-  "No overlap found"
    }
    saveRDS(coloc_res, ${_output:r})