From 50c2413d9a18b79a2dc153927065fbe579f40561 Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 17:03:40 -0400 Subject: [PATCH 1/8] change readme --- README.md | 66 ++++++-------------------------------- index.md | 43 ------------------------- tmp_readme.md | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 99 insertions(+), 99 deletions(-) delete mode 100644 index.md create mode 100644 tmp_readme.md diff --git a/README.md b/README.md index 0206b49..a76123f 100644 --- a/README.md +++ b/README.md @@ -1,71 +1,35 @@ -# ColocBoost for multi-trait colocalization in molecular QTL and GWAS studies [![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main) [![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost) -This R package implements ColocBoost --- motivated and designed for colocalization analysis ([first formulated here](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004383)) of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables. +![](man/figures/colocboost.png) -## Installation -### CRAN +This R package implements ColocBoost --- motivated and designed for colocalization analysis of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables. 
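The multi-task selection described above — identifying which predictors affect which responses — can be illustrated with a small simulated run. This is a sketch on hypothetical toy data; the `colocboost()` and `get_cos_summary()` calls follow the usage shown elsewhere in this repository:

```r
library(colocboost)

# Toy illustration (hypothetical simulated data): two traits share one causal variant
set.seed(1)
N <- 500; P <- 50
X <- matrix(rnorm(N * P), N, P)
colnames(X) <- paste0("SNP", 1:P)
y1 <- 0.5 * X[, 5] + rnorm(N)   # trait 1: SNP5 causal
y2 <- 0.4 * X[, 5] + rnorm(N)   # trait 2: shares SNP5 (colocalized with trait 1)
y3 <- 0.6 * X[, 20] + rnorm(N)  # trait 3: a distinct causal variant

result <- colocboost(X = list(X), Y = list(y1, y2, y3))
get_cos_summary(result)   # summarize colocalization confidence sets (CoS)
```

With strong simulated effects like these, traits 1 and 2 would generally be grouped at SNP5 while trait 3 gets its own trait-specific signal, though the exact output depends on the simulated data.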
+ +## Quick Start + +### CRAN Installation Install released versions from CRAN - pre-built packages are available on macOS and Windows ```r install.packages("colocboost") ``` -### GitHub +### GitHub Installation Install the development version from GitHub ```r devtools::install_github("StatFunGen/colocboost") ``` -### Conda -Install major releases using pre-built conda package with a conda-compatible package manager (recommended) - -Global pixi installation is the easiest way to use the conda package -```bash -pixi global install r-base # Install r-base as a global package if not already installed -pixi global install --environment r-base r-colocboost # Inject r-colocboost into r-base global environment -``` -The package can also be added to a local pixi environment -```bash -pixi workspace channel add dnachun # Add the dnachun channel to the workspace -pixi add r-colocboost # Add r-colocboost as a dependency to the environment -``` -Micromamba is recommended instead of conda or mamba for traditional conda environments -```bash -micromamba install -c dnachun r-colocboost -mamba install -c dnachun r-colocboost -conda install -c dnachun r-colocboost -``` -## Usage - -### Multi-trait Colocalization -```r -# Basic multi-trait analysis -result <- colocboost(X=list(X), Y=list(y1, y2, y3)) - -# Using summary statistics -result <- colocboost(sumstat=list(sumstat1, sumstat2), LD=LD_matrix) +For a detailed installation guidance, please refer to [https://statfungen.github.io/colocboost/articles/installation.html](https://statfungen.github.io/colocboost/articles/installation.html). 
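After installing, a quick check that the package loads and runs can use the example data bundled with the package (dataset names and accessors as used in the package vignettes):

```r
library(colocboost)

# Example data shipped with the package
data("Ind_5traits")
data("Sumstat_5traits")

# LD computed from the example genotype matrix
LD <- get_cormat(Ind_5traits$X[[1]])

# Colocalization from summary statistics
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD)
res$cos_details$cos$cos_index   # variant indices in each colocalization confidence set
```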
-# View colocalization summary -summary <- get_cos_summary(result) -# Visualize results -colocboost_plot(result) -# Filter for stronger colocalization evidence -filtered <- get_strong_colocalization(result, cos_npc_cutoff = 0.5) -``` +## Tutorial Website -For more complex analyses involving multiple datasets mixing individual level and summary statistics data, we recommend using [this pipeline wrapper](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R) from the `pecotmr` package. The `pecotmr` package can be installed either from source or from our conda package at https://anaconda.org/dnachun/r-pecotmr. +Learn how to perform colocalization analysis with step-by-step examples. For detailed tutorials and use cases, explore our FIXME (link). -### Single-trait Fine-mapping (FineBoost) - Special Case -Run FineBoost for single-trait fine-mapping (similar interface to SuSiE) -```r -result <- colocboost(X=X, Y=y) -``` ## Citation @@ -73,16 +37,6 @@ If you use ColocBoost in your research, please cite: Cao X, Sun H, Feng R, Mazumder R, Najar CFB, Li YI, de Jager PL, Bennett D, The Alzheimer's Disease Functional Genomics Consortium, Dey KK, Wang G. (2025+). Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. bioRxiv. 
[https://doi.org/](https://doi.org/) -## Documentation - -For detailed documentation, use the R help system: - -```r -?colocboost -?colocboost_plot -?get_cos_summary -?get_strong_colocalization -``` ## License diff --git a/index.md b/index.md deleted file mode 100644 index a76123f..0000000 --- a/index.md +++ /dev/null @@ -1,43 +0,0 @@ -[![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main) -[![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost) - -![](man/figures/colocboost.png) - - -This R package implements ColocBoost --- motivated and designed for colocalization analysis of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables. - -## Quick Start - -### CRAN Installation -Install released versions from CRAN - pre-built packages are available on macOS and Windows - -```r -install.packages("colocboost") -``` - -### GitHub Installation -Install the development version from GitHub - -```r -devtools::install_github("StatFunGen/colocboost") -``` - -For a detailed installation guidance, please refer to [https://statfungen.github.io/colocboost/articles/installation.html](https://statfungen.github.io/colocboost/articles/installation.html). - - - -## Tutorial Website - -Learn how to perform colocalization analysis with step-by-step examples. For detailed tutorials and use cases, explore our FIXME (link). - - -## Citation - -If you use ColocBoost in your research, please cite: - -Cao X, Sun H, Feng R, Mazumder R, Najar CFB, Li YI, de Jager PL, Bennett D, The Alzheimer's Disease Functional Genomics Consortium, Dey KK, Wang G. (2025+). 
Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. bioRxiv. [https://doi.org/](https://doi.org/) - - -## License - -This package is released under the MIT License. diff --git a/tmp_readme.md b/tmp_readme.md new file mode 100644 index 0000000..0206b49 --- /dev/null +++ b/tmp_readme.md @@ -0,0 +1,89 @@ +# ColocBoost for multi-trait colocalization in molecular QTL and GWAS studies +[![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main) +[![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost) + +This R package implements ColocBoost --- motivated and designed for colocalization analysis ([first formulated here](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004383)) of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables. 
+ +## Installation + +### CRAN +Install released versions from CRAN - pre-built packages are available on macOS and Windows + +```r +install.packages("colocboost") +``` + +### GitHub +Install the development version from GitHub + +```r +devtools::install_github("StatFunGen/colocboost") +``` + +### Conda +Install major releases using pre-built conda package with a conda-compatible package manager (recommended) + +Global pixi installation is the easiest way to use the conda package +```bash +pixi global install r-base # Install r-base as a global package if not already installed +pixi global install --environment r-base r-colocboost # Inject r-colocboost into r-base global environment +``` +The package can also be added to a local pixi environment +```bash +pixi workspace channel add dnachun # Add the dnachun channel to the workspace +pixi add r-colocboost # Add r-colocboost as a dependency to the environment +``` +Micromamba is recommended instead of conda or mamba for traditional conda environments +```bash +micromamba install -c dnachun r-colocboost +mamba install -c dnachun r-colocboost +conda install -c dnachun r-colocboost +``` +## Usage + +### Multi-trait Colocalization +```r +# Basic multi-trait analysis +result <- colocboost(X=list(X), Y=list(y1, y2, y3)) + +# Using summary statistics +result <- colocboost(sumstat=list(sumstat1, sumstat2), LD=LD_matrix) + +# View colocalization summary +summary <- get_cos_summary(result) + +# Visualize results +colocboost_plot(result) + +# Filter for stronger colocalization evidence +filtered <- get_strong_colocalization(result, cos_npc_cutoff = 0.5) +``` + +For more complex analyses involving multiple datasets mixing individual level and summary statistics data, we recommend using [this pipeline wrapper](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R) from the `pecotmr` package. 
The `pecotmr` package can be installed either from source or from our conda package at https://anaconda.org/dnachun/r-pecotmr. + +### Single-trait Fine-mapping (FineBoost) - Special Case +Run FineBoost for single-trait fine-mapping (similar interface to SuSiE) +```r +result <- colocboost(X=X, Y=y) +``` + +## Citation + +If you use ColocBoost in your research, please cite: + +Cao X, Sun H, Feng R, Mazumder R, Najar CFB, Li YI, de Jager PL, Bennett D, The Alzheimer's Disease Functional Genomics Consortium, Dey KK, Wang G. (2025+). Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. bioRxiv. [https://doi.org/](https://doi.org/) + +## Documentation + +For detailed documentation, use the R help system: + +```r +?colocboost +?colocboost_plot +?get_cos_summary +?get_strong_colocalization +``` + +## License + +This package is released under the MIT License. From ad497931012add387c03331e2c040d857affccea Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 17:07:18 -0400 Subject: [PATCH 2/8] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index a76123f..b3a5a05 100644 --- a/README.md +++ b/README.md @@ -22,13 +22,13 @@ Install the development version from GitHub devtools::install_github("StatFunGen/colocboost") ``` -For a detailed installation guidance, please refer to [https://statfungen.github.io/colocboost/articles/installation.html](https://statfungen.github.io/colocboost/articles/installation.html). +For a detailed installation guidance, please refer to [Installation](https://statfungen.github.io/colocboost/articles/installation.html). ## Tutorial Website -Learn how to perform colocalization analysis with step-by-step examples. For detailed tutorials and use cases, explore our FIXME (link). +Learn how to perform colocalization analysis with step-by-step examples. 
For detailed tutorials and use cases, explore [Tutorials](https://statfungen.github.io/colocboost/articles/index.html).
 
 ## Citation
From 4bd8b8fc975043a7ac492ca1850f3d92c397b6f3 Mon Sep 17 00:00:00 2001
From: xuewei cao <36172337+xueweic@users.noreply.github.com>
Date: Wed, 16 Apr 2025 19:33:02 -0400
Subject: [PATCH 3/8] improve sumstat flexibility

allow one LD with superset variants and multiple sumstat
---
 R/colocboost_init.R                        | 217 ++++++++++++++-------
 vignettes/Input_Data_Format.Rmd            |   2 +-
 vignettes/Summary_Level_Colocalization.Rmd | 137 +++++++++++++
 3 files changed, 281 insertions(+), 75 deletions(-)
 create mode 100644 vignettes/Summary_Level_Colocalization.Rmd

diff --git a/R/colocboost_init.R b/R/colocboost_init.R
index a7cd1f3..10c78b1 100644
--- a/R/colocboost_init.R
+++ b/R/colocboost_init.R
@@ -136,84 +136,16 @@ colocboost_init_data <- function(X, Y, dict_YX,
   if (!is.null(Z) & !is.null(LD)) {
     ####################### need to consider more #########################
     # ------ only code up one sumstat
-
+    variant_lists <- keep_variables[c(flag:length(keep_variables))]
+    sumstat_formated <- process_sumstat(Z, N_sumstat, Var_y, SeBhat, LD,
+                                        variant_lists, dict_sumstatLD,
+                                        keep_variable_names)
     for (i in 1:length(Z)) {
-      tmp <- list(
-        "XtX" = NULL,
-        "XtY" = NULL,
-        "YtY" = NULL,
-        "N" = N_sumstat[[i]],
-        "variable_miss" = NULL
-      )
-
-      flag1 <- dict_sumstatLD[i] + ind_idx
-      # check LD
-      if (dict_sumstatLD[i] == i) {
-        # - intersect variables
-        tmp_variables <- intersect(keep_variables[[flag1]], colnames(LD[[i]]))
-        pos.final <- sort(match(tmp_variables, keep_variable_names))
-        final_variables <- keep_variable_names[pos.final] # keep the same rank with keep_variable_names
-        # - check missing variables
-        tmp$variable_miss <- setdiff(1:length(keep_variable_names), pos.final)
-        # - set 0 to LD
-        pos.LD <- match(final_variables, colnames(LD[[i]]))
-        LD_tmp <- LD[[i]][pos.LD, pos.LD]
-      } else {
-        variable_i <- keep_variables[[flag1]]
-        variable_LD <-
keep_variables[[dict_sumstatLD[i] + ind_idx]] - if (length(which(is.na(match(variable_i, variable_LD)))) != 0) { - # - intersect variables - tmp_variables <- intersect(keep_variables[[flag]], colnames(LD[[dict_sumstatLD[i]]])) - pos.final <- sort(match(tmp_variables, keep_variable_names)) - final_variables <- keep_variable_names[pos.final] # keep the same rank with keep_variable_names - # - check missing variables - tmp$variable_miss <- setdiff(1:length(keep_variable_names), pos.final) - # - set 0 to LD - pos.LD <- match(final_variables, colnames(LD[[dict_sumstatLD[i]]])) - LD_tmp <- LD[[dict_sumstatLD[i]]][pos.LD, pos.LD] - # - need to change LD - dict_sumstatLD[i] <- i - } else { - tmp$variable_miss <- cb_data$data[[dict_sumstatLD[i] + n_ind]]$variable_miss - LD_tmp <- NULL - } - } - - # - set 0 to Z - Z_extend <- rep(0, length(keep_variable_names)) # N - pos.z <- match(final_variables, keep_variables[[flag1]]) # M <- M2 - Z_extend[pos.final] <- Z[[i]][pos.z] - - # - organize data - if (is.null(N_sumstat[[i]])) { - tmp$XtX <- LD_tmp - tmp$XtY <- Z_extend - tmp$YtY <- 1 - } else { - if (!is.null(SeBhat[[i]]) & !is.null(Var_y[[i]])) { - # var_y, shat (and bhat) are provided, so the effects are on the - # *original scale*. 
- adj <- 1 / (Z_extend^2 + N_sumstat[[i]] - 2) - if (!is.null(LD_tmp)) { - XtXdiag <- Var_y[[i]] * adj / (SeBhat[[i]]^2) - xtx <- t(LD_tmp * sqrt(XtXdiag)) * sqrt(XtXdiag) - tmp$XtX <- (xtx + t(xtx)) / 2 - } - tmp$YtY <- (N_sumstat[[i]] - 1) * Var_y[[i]] - tmp$XtY <- Z_extend * sqrt(adj) * Var_y[[i]] / SeBhat[[i]] - } else { - if (!is.null(LD_tmp)) { - tmp$XtX <- LD_tmp - } - tmp$YtY <- (N_sumstat[[i]] - 1) - tmp$XtY <- sqrt(N_sumstat[[i]] - 1) * Z_extend - } - } - cb_data$data[[flag]] <- tmp + cb_data$data[[flag]] <- sumstat_formated$results[[i]] names(cb_data$data)[flag] <- paste0("sumstat_outcome_", i) flag <- flag + 1 } - cb_data$dict <- c(cb_data$dict, dict_sumstatLD + n_ind) + cb_data$dict <- c(cb_data$dict, sumstat_formated$unified_dict + n_ind) } # - if residual correlation matrix is not NULL, we need to adjust the study dependence if (is.null(residual_correlation)) { @@ -585,3 +517,140 @@ get_multiple_testing_correction <- function(z, miss_idx = NULL, func_multi_test stop("Invalid option for func_multi_test") } } + + +#' Process LD matrices and variant lists with optimized storage +#' +#' @param ld_matrices List of LD matrices (e.g., list(PP=PP_matrix, QQ=QQ_matrix)) +#' @param variant_lists List of variant lists (e.g., list(P1=P1_variants, P2=P2_variants, ...)) +#' @param dict Vector mapping variant lists to LD matrices (e.g., c(1,1,2,2,2)) +#' @param target_variants Vector of variants to be considered (M variants) +#' +#' @return List containing processed data with optimized LD submatrix storage +#' @noRd +#' ld_matrices = LD +#' variant_lists = variant_lists +#' dict = dict_sumstatLD +#' target_variants = keep_variable_names +#' N = N_sumstat +process_sumstat <- function(Z, N, Var_y, SeBhat, ld_matrices, variant_lists, dict, target_variants) { + + # Step 1: Identify unique combinations of (variant list, LD matrix) + unified_dict <- integer(length(variant_lists)) + + # First item is always assigned its own position + unified_dict[1] <- 1 + + # Process 
remaining items
+  if (length(variant_lists) > 1) {
+    for (i in 2:length(variant_lists)) {
+      # Check if current combination is a duplicate of any previous one
+      is_duplicate <- FALSE
+      for (j in 1:(i-1)) {
+        if (identical(variant_lists[[i]], variant_lists[[j]]) && dict[i] == dict[j]) {
+          unified_dict[i] <- unified_dict[j]
+          is_duplicate <- TRUE
+          break
+        }
+      }
+
+      if (!is_duplicate) {
+        # If not a duplicate, assign its exact index
+        unified_dict[i] <- i
+      }
+    }
+  }
+
+  # Step 2: Process each variant list
+  results <- list()
+
+  for (i in 1:length(variant_lists)) {
+
+    tmp <- list(
+      "XtX" = NULL,
+      "XtY" = NULL,
+      "YtY" = NULL,
+      "N" = N[[i]],
+      "variable_miss" = NULL
+    )
+
+    # Get current status
+    current_variants <- variant_lists[[i]]
+    current_z <- Z[[i]]
+    current_n <- N[[i]]
+
+    # Get corresponding LD matrix from original dictionary mapping
+    ld_index <- dict[i]
+    current_ld_matrix <- ld_matrices[[ld_index]]
+
+    # Find common variants between current list and target variants
+    common_variants <- intersect(current_variants, target_variants)
+
+    # Find variants in target but not in current list
+    missing_variants <- setdiff(target_variants, current_variants)
+    tmp$variable_miss <- which(target_variants %in% missing_variants)
+
+    # - create extended Z by setting 0 for missing variants
+    Z_extend <- rep(0, length(target_variants))
+    pos_z <- match(common_variants, current_variants)
+    pos_target <- match(common_variants, target_variants)
+    Z_extend[pos_target] <- current_z[pos_z]
+
+    # Calculate submatrix for each unique entry (not duplicates)
+    ld_submatrix <- NULL
+
+    if (length(common_variants) > 0) {
+      # Only include the submatrix if this entry is unique or is the first occurrence
+      if (i == unified_dict[i]) {
+        # Check if common_variants and rownames have identical order
+        if (identical(common_variants, rownames(current_ld_matrix))) {
+          # If order is identical, use the matrix directly without reordering
+          ld_submatrix <- current_ld_matrix
+        } else {
+          # If order is different, reorder using matched indices
+          matched_indices <- match(common_variants, rownames(current_ld_matrix))
+          ld_submatrix <- current_ld_matrix[matched_indices, matched_indices, drop = FALSE]
+          rownames(ld_submatrix) <- common_variants
+          colnames(ld_submatrix) <- common_variants
+        }
+      }
+    }
+
+    # Organize data
+    if (is.null(current_n)) {
+      tmp$XtX <- ld_submatrix
+      tmp$XtY <- Z_extend
+      tmp$YtY <- 1
+    } else {
+      if (!is.null(SeBhat[[i]]) & !is.null(Var_y[[i]])) {
+        # var_y, shat (and bhat) are provided, so the effects are on the
+        # *original scale*.
+        adj <- 1 / (Z_extend^2 + current_n - 2)
+        if (!is.null(ld_submatrix)) {
+          XtXdiag <- Var_y[[i]] * adj / (SeBhat[[i]]^2)
+          xtx <- t(ld_submatrix * sqrt(XtXdiag)) * sqrt(XtXdiag)
+          tmp$XtX <- (xtx + t(xtx)) / 2
+        }
+        tmp$YtY <- (current_n - 1) * Var_y[[i]]
+        tmp$XtY <- Z_extend * sqrt(adj) * Var_y[[i]] / SeBhat[[i]]
+      } else {
+        if (!is.null(ld_submatrix)) {
+          tmp$XtX <- ld_submatrix
+        }
+        tmp$YtY <- (current_n - 1)
+        tmp$XtY <- sqrt(current_n - 1) * Z_extend
+      }
+    }
+
+    # Store results for current list
+    results[[i]] <- tmp
+  }
+
+
+  # Return results with the unified dictionary
+  return(list(
+    results = results,
+    unified_dict = unified_dict,
+    original_dict = dict
+  ))
+}
diff --git a/vignettes/Input_Data_Format.Rmd b/vignettes/Input_Data_Format.Rmd
index 80b6e0c..f1ec390 100644
--- a/vignettes/Input_Data_Format.Rmd
+++ b/vignettes/Input_Data_Format.Rmd
@@ -2,7 +2,7 @@ title: "Input Data Format"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Input Data Format and Example Data}
+  %\VignetteIndexEntry{Input Data Format}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
---
diff --git a/vignettes/Summary_Level_Colocalization.Rmd b/vignettes/Summary_Level_Colocalization.Rmd
new file mode 100644
index 0000000..1d125b4
--- /dev/null
+++ b/vignettes/Summary_Level_Colocalization.Rmd
@@ -0,0 +1,137 @@
+---
+title: "Summary Level Data Colocalization"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Summary Level Data Colocalization}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+This vignette demonstrates how to perform multi-trait colocalization analysis using summary statistics data in ColocBoost,
+specifically focusing on the `Sumstat_5traits` dataset included in the package.
+
+
+
+```{r setup}
+library(colocboost)
+```
+
+
+# 1. The `Sumstat_5traits` Dataset
+
+The `Sumstat_5traits` dataset contains 5 simulated summary statistics, derived directly from the `Ind_5traits` dataset via marginal association.
+The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis with summary association data.
+
+- `sumstat`: A list of data.frames of summary statistics for different traits.
+- `true_effect_variants`: True effect variant indices for each trait.
+- Note that `LD` can be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.
+
+### Causal variant structure
+The dataset features two causal variants with indices 644 and 2289.
+
+- Causal variant 644 is associated with traits 1, 2, 3, and 4.
+- Causal variant 2289 is associated with traits 2, 3, and 5.
+
+This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
+
+```{r load-summary-data}
+# Loading the Dataset
+data("Sumstat_5traits")
+names(Sumstat_5traits)
+Sumstat_5traits$true_effect_variants
+```
+
+### Important data format for summary data
+Each summary statistics data.frame must include the following columns:
+
+- `z` or (`beta`, `sebeta`): either the z-score, or the effect size and its standard error.
+- `n`: sample size for the summary statistics; providing it is highly recommended.
+- `variant`: required if the sumstat for different outcomes do not have the same number of variables (multiple sumstat and multiple LD).
+
+
+```{r summary-data-format}
+class(Sumstat_5traits$sumstat[[1]])
+head(Sumstat_5traits$sumstat[[1]])
+```
+
+# 2. Run ColocBoost (Basic usage)
+
+
+The preferred format for colocalization analysis in ColocBoost using summary statistics data is one LD matrix provided for all traits,
+with the summary statistics organized in a list. The **Basic format** is:
+
+- `sumstat` is organized as a list of data.frames for all traits.
+- `LD` is a matrix of linkage disequilibrium (LD) information for all variants across all traits.
+
+```{r one-LD}
+# Extract genotype (X) and calculate LD matrix
+data("Ind_5traits")
+LD <- get_cormat(Ind_5traits$X[[1]])
+
+# Run colocboost
+res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD)
+
+# Identified CoS
+res$cos_details$cos$cos_index
+```
+
+
+### Results Interpretation
+
+For comprehensive tutorials on result interpretation and advanced visualization techniques, please visit our documentation portal at FIXME (link).
+
+
+# 3. Run ColocBoost (Advanced usage)
+
+## 3.1. Matched LD with multiple sumstat (Trait-specific LD)
+
+When studying multiple traits, each with its own trait-specific LD matrix, you can provide a list of LD matrices matched with the list of summary statistics.
+
+- **Basic format**: `sumstat` and `LD` are organized as lists, matched by trait index,
+  - `(sumstat[1], LD[1])` contains information for trait 1,
+  - `(sumstat[2], LD[2])` contains information for trait 2,
+  - And so on for each trait under analysis.
+- **Cross-trait flexibility**:
+  - There is no requirement for the same variants across different traits. This allows each trait to be analyzed with the variants available for it.
+  - This is particularly useful when you have a large dataset with many traits and want to focus on specific variants and trait-specific LD.
+ +```{r matched-LD} +# Duplicate LD with matched summary statistics +LD_multiple <- lapply(1:length(Sumstat_5traits$sumstat), function(i) LD ) + +# Run colocboost +res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple) + +# Identified CoS +res$cos_details$cos$cos_index +``` + + +## 3.2. LD matrix is a superset of variants across different summary statistics + +When the LD matrix includes a superset of variants across different summary statistics, with **Input Format**: +- `sumstat` is a list of data.frames for all traits +- `LD` is a matrix of linkage disequilibrium (LD) information for all variants across all traits. +- The LD matrix should contain all variants present in the summary statistics data frames. +- This is particularly useful when you have a large LD matrix from a reference panel and want to use it for multiple summary statistics datasets. +It allows for efficient analysis without redundancy. + + +```{r superset-LD} +# Create sumstat with different number of variants - remove 100 variants in each sumstat +set.seed(1) +LD_superset <- LD +sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE]) + +# Run colocboost +res <- colocboost(sumstat = sumstat, LD = LD_superset) + +# Identified CoS +res$cos_details$cos$cos_index +``` \ No newline at end of file From a676fd5c7a37b01af7c735311f4f01b91982aef8 Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 19:34:15 -0400 Subject: [PATCH 4/8] Update colocboost_init.R --- R/colocboost_init.R | 5 ----- 1 file changed, 5 deletions(-) diff --git a/R/colocboost_init.R b/R/colocboost_init.R index 10c78b1..ecb9fcd 100644 --- a/R/colocboost_init.R +++ b/R/colocboost_init.R @@ -528,11 +528,6 @@ get_multiple_testing_correction <- function(z, miss_idx = NULL, func_multi_test #' #' @return List containing processed data with optimized LD submatrix storage #' @noRd -#' ld_matrices = LD -#' variant_lists = 
variant_lists -#' dict = dict_sumstatLD -#' target_variants = keep_variable_names -#' N = N_sumstat process_sumstat <- function(Z, N, Var_y, SeBhat, ld_matrices, variant_lists, dict, target_variants) { # Step 1: Identify unique combinations of (variant list, LD matrix) From a65aac6d1d46fbb05ac740d2ec695166e1dc22b6 Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 19:45:34 -0400 Subject: [PATCH 5/8] Update ColocBoost_tutorial_advance.Rmd --- vignettes/ColocBoost_tutorial_advance.Rmd | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/vignettes/ColocBoost_tutorial_advance.Rmd b/vignettes/ColocBoost_tutorial_advance.Rmd index 7ecfee9..f7dd852 100644 --- a/vignettes/ColocBoost_tutorial_advance.Rmd +++ b/vignettes/ColocBoost_tutorial_advance.Rmd @@ -33,8 +33,8 @@ Consider a scenario where all phenotypes share the same genotype matrix, as demo data("Ind_5traits") X <- Ind_5traits$X[[1]] Y <- do.call(cbind, Ind_5traits$Y) -res <- colocboost(X = X, Y = Y) -res$cos_summary +# res <- colocboost(X = X, Y = Y) +# res$cos_summary ``` **Example of using summary statistics** @@ -45,8 +45,8 @@ For summary statistics, similar optimization is applied using `Sumstat_5traits` data("Sumstat_5traits") sumstat <- Sumstat_5traits$sumstat LD <- get_cormat(Ind_5traits$X[[1]]) -res <- colocboost(sumstat = sumstat, LD = LD) -res$cos_summary +# res <- colocboost(sumstat = sumstat, LD = LD) +# res$cos_summary ``` **Example of combining individual-level data and summary statistics** @@ -60,9 +60,9 @@ X <- Ind_5traits$X[[1]] Y <- do.call(cbind, Ind_5traits$Y[1:3]) data("Sumstat_5traits") sumstat <- Sumstat_5traits$sumstat[4:5] -LD <- get_cormat(Ind_5traits$X[[1]]) -res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD) -res$cos_summary +L# D <- get_cormat(Ind_5traits$X[[1]]) +# res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD) +# res$cos_summary ``` From b9552e9507189071ec23af769507cbc551a19e50 
Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 19:52:59 -0400 Subject: [PATCH 6/8] sumstat --- vignettes/ColocBoost_tutorial_advance.Rmd | 2 +- vignettes/Summary_Level_Colocalization.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/vignettes/ColocBoost_tutorial_advance.Rmd b/vignettes/ColocBoost_tutorial_advance.Rmd index f7dd852..731c477 100644 --- a/vignettes/ColocBoost_tutorial_advance.Rmd +++ b/vignettes/ColocBoost_tutorial_advance.Rmd @@ -60,7 +60,7 @@ X <- Ind_5traits$X[[1]] Y <- do.call(cbind, Ind_5traits$Y[1:3]) data("Sumstat_5traits") sumstat <- Sumstat_5traits$sumstat[4:5] -L# D <- get_cormat(Ind_5traits$X[[1]]) +# LD <- get_cormat(Ind_5traits$X[[1]]) # res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD) # res$cos_summary ``` diff --git a/vignettes/Summary_Level_Colocalization.Rmd b/vignettes/Summary_Level_Colocalization.Rmd index 1d125b4..fc9ea18 100644 --- a/vignettes/Summary_Level_Colocalization.Rmd +++ b/vignettes/Summary_Level_Colocalization.Rmd @@ -116,6 +116,7 @@ res$cos_details$cos$cos_index ## 3.2. LD matrix is a superset of variants across different summary statistics When the LD matrix includes a superset of variants across different summary statistics, with **Input Format**: + - `sumstat` is a list of data.frames for all traits - `LD` is a matrix of linkage disequilibrium (LD) information for all variants across all traits. - The LD matrix should contain all variants present in the summary statistics data frames. @@ -125,7 +126,6 @@ It allows for efficient analysis without redundancy. 
 ```{r superset-LD}
 # Create sumstat with different number of variants - remove 100 variants in each sumstat
-set.seed(1)
 LD_superset <- LD
 sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE])

From 995c830fbaa38bf18a126f51f7de902ce8a7895 Mon Sep 17 00:00:00 2001
From: xuewei cao <36172337+xueweic@users.noreply.github.com>
Date: Wed, 16 Apr 2025 20:00:48 -0400
Subject: [PATCH 7/8] Update Summary_Level_Colocalization.Rmd

---
 vignettes/Summary_Level_Colocalization.Rmd | 32 +++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/vignettes/Summary_Level_Colocalization.Rmd b/vignettes/Summary_Level_Colocalization.Rmd
index fc9ea18..61448ab 100644
--- a/vignettes/Summary_Level_Colocalization.Rmd
+++ b/vignettes/Summary_Level_Colocalization.Rmd
@@ -134,4 +134,34 @@ res <- colocboost(sumstat = sumstat, LD = LD_superset)
 
 # Identified CoS
 res$cos_details$cos$cos_index
-```
\ No newline at end of file
+```
+
+
+## 3.3. Arbitrary LD and sumstat with dictionary provided
+
+When studying multiple traits with arbitrary LD matrices for different summary statistics,
+we also provide an interface for arbitrary LD matrices with multiple sumstat.
+This particularly benefits meta-analysis across heterogeneous datasets where,
+for different subsets of summary statistics, the LD comes from different populations.
+
+- **Input Format**:
+  - `sumstat` is a list of data.frames for all traits.
+  - `LD` is a list of LD matrices.
+  - `dict_sumstatLD` is a dictionary matrix that maps the index of each sumstat to the index of its LD matrix.
+
+
+```{r dictionary-mapped}
+# Create a simple dictionary for demonstration purposes
+LD_arbitrary <- list(LD, LD) # traits 1 and 2 matched to the first LD matrix; traits 3, 4, and 5 matched to the second LD matrix.
+dict_sumstatLD = cbind(c(1:5), c(1,1,2,2,2)) + +# Display the dictionary +dict_sumstatLD + +# Run colocboost +res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD) + +# Identified CoS +res$cos_details$cos$cos_index +``` + From e54919994e6b77ccc17b5410e6d0f90f9b07e42a Mon Sep 17 00:00:00 2001 From: xuewei cao <36172337+xueweic@users.noreply.github.com> Date: Wed, 16 Apr 2025 20:13:00 -0400 Subject: [PATCH 8/8] add minimal example --- R/colocboost_inference.R | 10 ++++++ R/colocboost_output.R | 62 +++++++++++++++++++++++++++++++- R/colocboost_plot.R | 21 +++++++++++ man/colocboost_plot.Rd | 22 ++++++++++++ man/get_cormat.Rd | 11 ++++++ man/get_cos.Rd | 21 +++++++++++ man/get_cos_summary.Rd | 21 +++++++++++ man/get_strong_colocalization.Rd | 23 +++++++++++- 8 files changed, 189 insertions(+), 2 deletions(-) diff --git a/R/colocboost_inference.R b/R/colocboost_inference.R index 44b7979..e79b701 100644 --- a/R/colocboost_inference.R +++ b/R/colocboost_inference.R @@ -33,6 +33,16 @@ colocboost_post_inference <- function() { #' @param intercepte A logical value indicating whether to include an intercept in the model. Default is FALSE. #' #' @return A correlation matrix (LD matrix). 
+#' +#' @examples +#' # colocboost example +#' set.seed(1) +#' N = 1000 +#' P = 100 +#' # Generate X with LD structure +#' sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +#' X <- MASS::mvrnorm(N, rep(0, P), sigma) +#' cormat <- get_cormat(X) #' #' @family colocboost_utilities #' @export diff --git a/R/colocboost_output.R b/R/colocboost_output.R index 002c761..43fb7b0 100644 --- a/R/colocboost_output.R +++ b/R/colocboost_output.R @@ -22,6 +22,26 @@ #' \item{colocalized_index}{Indices of colocalized variables} #' \item{colocalized_variables}{List of colocalized variables} #' \item{colocalized_variables_vcp}{Variant colocalization probabilities for all colocalized variables} +#' +#' @examples +#' # colocboost example +#' set.seed(1) +#' N = 1000 +#' P = 100 +#' # Generate X with LD structure +#' sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +#' X <- MASS::mvrnorm(N, rep(0, P), sigma) +#' colnames(X) <- paste0("SNP", 1:P) +#' L = 3 +#' true_beta <- matrix(0, P, L) +#' true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +#' true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +#' true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +#' true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +#' Y <- matrix(0, N, L) +#' for (l in 1:L){ Y[, l] <- X %*% true_beta[, l] + rnorm(N, 0, 1) } +#' res <- colocboost(X = X, Y = Y) +#' get_cos_summary(res) #' #' @family colocboost_inference #' @export @@ -128,11 +148,31 @@ get_cos_summary <- function(cb_output, #' \item{data_info}{A object with detailed information from input data} #' \item{model_info}{A object with detailed information for colocboost model} #' +#' @examples +#' # colocboost example +#' set.seed(1) +#' N = 1000 +#' P = 100 +#' # Generate X with LD structure +#' sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +#' X <- MASS::mvrnorm(N, rep(0, P), sigma) +#' colnames(X) <- paste0("SNP", 1:P) +#' L = 3 +#' true_beta <- matrix(0, P, L) +#' true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +#' true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 
(colocalized) +#' true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +#' true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +#' Y <- matrix(0, N, L) +#' for (l in 1:L){ Y[, l] <- X %*% true_beta[, l] + rnorm(N, 0, 1) } +#' res <- colocboost(X = X, Y = Y) +#' filter_res <- get_strong_colocalization(res, cos_npc_cutoff = 0.5, npc_outcome_cutoff = 0.2) +#' #' @family colocboost_inference #' @export get_strong_colocalization <- function(cb_output, cos_npc_cutoff = 0.5, - npc_outcome_cutoff = 0.25, + npc_outcome_cutoff = 0.2, pvalue_cutoff = NULL, weight_fudge_factor = 1.5, coverage = 0.95) { @@ -421,6 +461,26 @@ get_ucos_summary <- function(cb_output, outcome_names = NULL, region_name = NULL #' #' @return A list of indices of variables in each CoS. #' +#' @examples +#' # colocboost example +#' set.seed(1) +#' N = 1000 +#' P = 100 +#' # Generate X with LD structure +#' sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +#' X <- MASS::mvrnorm(N, rep(0, P), sigma) +#' colnames(X) <- paste0("SNP", 1:P) +#' L = 3 +#' true_beta <- matrix(0, P, L) +#' true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +#' true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +#' true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +#' true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +#' Y <- matrix(0, N, L) +#' for (l in 1:L){ Y[, l] <- X %*% true_beta[, l] + rnorm(N, 0, 1) } +#' res <- colocboost(X = X, Y = Y) +#' get_cos(res, coverage = 0.75) +#' #' @family colocboost_utilities #' @export get_cos <- function(cb_output, coverage = 0.95) { diff --git a/R/colocboost_plot.R b/R/colocboost_plot.R index 15f0616..cb423f4 100644 --- a/R/colocboost_plot.R +++ b/R/colocboost_plot.R @@ -33,6 +33,27 @@ #' @param ... Additional parameters passed to `plot` functions #' #' @return Visualization plot for each colcoalization event. 
+#' +#' @examples +#' # colocboost example +#' set.seed(1) +#' N = 1000 +#' P = 100 +#' # Generate X with LD structure +#' sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +#' X <- MASS::mvrnorm(N, rep(0, P), sigma) +#' colnames(X) <- paste0("SNP", 1:P) +#' L = 3 +#' true_beta <- matrix(0, P, L) +#' true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +#' true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +#' true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +#' true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +#' Y <- matrix(0, N, L) +#' for (l in 1:L){ Y[, l] <- X %*% true_beta[, l] + rnorm(N, 0, 1) } +#' res <- colocboost(X = X, Y = Y) +#' colocboost_plot(res, plot_cols = 1) +#' #' #' @importFrom utils head tail #' @importFrom graphics abline axis legend mtext par points text diff --git a/man/colocboost_plot.Rd b/man/colocboost_plot.Rd index 318d514..6972966 100644 --- a/man/colocboost_plot.Rd +++ b/man/colocboost_plot.Rd @@ -94,5 +94,27 @@ Visualization plot for each colcoalization event. } \description{ \code{colocboost_plot} generates visualization plots for colocalization events from a ColocBoost analysis. +} +\examples{ +# colocboost example +set.seed(1) +N = 1000 +P = 100 +# Generate X with LD structure +sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +X <- MASS::mvrnorm(N, rep(0, P), sigma) +colnames(X) <- paste0("SNP", 1:P) +L = 3 +true_beta <- matrix(0, P, L) +true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +Y <- matrix(0, N, L) +for (l in 1:L){ Y[, l] <- X \%*\% true_beta[, l] + rnorm(N, 0, 1) } +res <- colocboost(X = X, Y = Y) +colocboost_plot(res, plot_cols = 1) + + } \concept{colocboost_plot} diff --git a/man/get_cormat.Rd b/man/get_cormat.Rd index 1c36951..f01aed9 100644 --- a/man/get_cormat.Rd +++ b/man/get_cormat.Rd @@ -16,6 +16,17 @@ A correlation matrix (LD matrix). 
} \description{ This function calculates the correlation matrix (LD matrix) from individual level data. +} +\examples{ +# colocboost example +set.seed(1) +N = 1000 +P = 100 +# Generate X with LD structure +sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +X <- MASS::mvrnorm(N, rep(0, P), sigma) +cormat <- get_cormat(X) + } \seealso{ Other colocboost_utilities: diff --git a/man/get_cos.Rd b/man/get_cos.Rd index a38068f..453ada6 100644 --- a/man/get_cos.Rd +++ b/man/get_cos.Rd @@ -16,6 +16,27 @@ A list of indices of variables in each CoS. } \description{ \code{get_cos} get the colocalization confidence sets (CoS) with different coverage. +} +\examples{ +# colocboost example +set.seed(1) +N = 1000 +P = 100 +# Generate X with LD structure +sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +X <- MASS::mvrnorm(N, rep(0, P), sigma) +colnames(X) <- paste0("SNP", 1:P) +L = 3 +true_beta <- matrix(0, P, L) +true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +Y <- matrix(0, N, L) +for (l in 1:L){ Y[, l] <- X \%*\% true_beta[, l] + rnorm(N, 0, 1) } +res <- colocboost(X = X, Y = Y) +get_cos(res, coverage = 0.75) + } \seealso{ Other colocboost_utilities: diff --git a/man/get_cos_summary.Rd b/man/get_cos_summary.Rd index bd3a193..30ee54a 100644 --- a/man/get_cos_summary.Rd +++ b/man/get_cos_summary.Rd @@ -37,6 +37,27 @@ A summary table for colocalization events with the following columns: } \description{ \code{get_cos_summary} get the colocalization summary table with or without the outcomes of interest. 
+} +\examples{ +# colocboost example +set.seed(1) +N = 1000 +P = 100 +# Generate X with LD structure +sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +X <- MASS::mvrnorm(N, rep(0, P), sigma) +colnames(X) <- paste0("SNP", 1:P) +L = 3 +true_beta <- matrix(0, P, L) +true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +Y <- matrix(0, N, L) +for (l in 1:L){ Y[, l] <- X \%*\% true_beta[, l] + rnorm(N, 0, 1) } +res <- colocboost(X = X, Y = Y) +get_cos_summary(res) + } \seealso{ Other colocboost_inference: diff --git a/man/get_strong_colocalization.Rd b/man/get_strong_colocalization.Rd index 1b9dfa4..39636b9 100644 --- a/man/get_strong_colocalization.Rd +++ b/man/get_strong_colocalization.Rd @@ -7,7 +7,7 @@ get_strong_colocalization( cb_output, cos_npc_cutoff = 0.5, - npc_outcome_cutoff = 0.25, + npc_outcome_cutoff = 0.2, pvalue_cutoff = NULL, weight_fudge_factor = 1.5, coverage = 0.95 @@ -37,6 +37,27 @@ A \code{"colocboost"} object with some or all of the following elements: } \description{ \code{get_strong_colocalization} get the colocalization by discarding the weaker colocalization events or colocalized outcomes +} +\examples{ +# colocboost example +set.seed(1) +N = 1000 +P = 100 +# Generate X with LD structure +sigma <- 0.9^abs(outer(1:P, 1:P, "-")) +X <- MASS::mvrnorm(N, rep(0, P), sigma) +colnames(X) <- paste0("SNP", 1:P) +L = 3 +true_beta <- matrix(0, P, L) +true_beta[5, 1] <- 0.5 # SNP5 affects trait 1 +true_beta[5, 2] <- 0.4 # SNP5 also affects trait 2 (colocalized) +true_beta[10, 2] <- 0.3 # SNP10 only affects trait 2 +true_beta[20, 3] <- 0.6 # SNP20 only affects trait 3 +Y <- matrix(0, N, L) +for (l in 1:L){ Y[, l] <- X \%*\% true_beta[, l] + rnorm(N, 0, 1) } +res <- colocboost(X = X, Y = Y) +filter_res <- get_strong_colocalization(res, cos_npc_cutoff = 0.5, npc_outcome_cutoff = 0.2) + } \seealso{ Other 
colocboost_inference:
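
---

PATCH 7/8 above wires the `dict_sumstatLD` interface into the vignettes. A minimal end-to-end sketch of that interface, assuming the colocboost package with its bundled `Ind_5traits` and `Sumstat_5traits` example data (all names taken from the patches above), could look like:

```r
# Sketch of the dictionary-mapped LD interface from PATCH 7/8;
# assumes the colocboost package and its bundled example data are installed.
library(colocboost)
data("Ind_5traits")
data("Sumstat_5traits")

# Build one LD matrix from the example genotypes
LD <- get_cormat(Ind_5traits$X[[1]])

# Two copies stand in for LD estimated in two different populations
LD_arbitrary <- list(LD, LD)

# Column 1: sumstat index; column 2: index of the LD matrix it maps to
# (traits 1-2 use the first LD matrix, traits 3-5 the second)
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))

res <- colocboost(
  sumstat = Sumstat_5traits$sumstat,
  LD = LD_arbitrary,
  dict_sumstatLD = dict_sumstatLD
)

# Identified CoS
res$cos_details$cos$cos_index
```

This mirrors the `dictionary-mapped` vignette chunk added in PATCH 7/8; only the explicit data loading and LD construction steps are filled in here.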