[Important update] Design Shared pecotmr QC and ColocBoost Analysis Work by xueweic · Pull Request #494 · StatFunGen/pecotmr

xueweic · 2026-05-24T16:26:00Z

1. Interface Separation

Rationale: Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic.

Key changes:

Added/standardized colocboost_analysis() as the direct user-facing wrapper.
- Accepts the same core inputs as colocboost() through ...
- Adds optional QC/imputation parameters.
- If no QC is requested, it behaves as a minimal call-through to colocboost().
- If QC is requested, it prepares cleaned inputs first, then calls colocboost().
Added/standardized colocboost_pipeline() as the protocol-oriented wrapper.
- Starts from region_data produced by load_multitask_regional_data().
- Designed for xQTL-protocol workflows.
- Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses.

2. RegionalData Conversion Layer

Rationale: Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic.

Key changes:

Added reusable adapters:
- region_data_to_rss_input() for summary-stat/RSS workflows.
- region_data_to_ind_input() for individual-level workflows.
- region_data_to_colocboost_input() for ColocBoost-specific assembly.
Preserved shared heavy objects such as X, LD, and X_ref.
Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage.

3. Shared QC Refactor

Rationale: QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline.

Key changes:

Added qc_individual_data() for individual-level QC.
Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior.
Clarified qc_method behavior:
- NULL, "none", and "rss_qc" mean basic summary-stat harmonization only.
- "slalom" and "dentist" add optional LD-mismatch outlier QC.
Added QC progress messages for stage tracking, retained counts, skips, and warnings.

4. X_ref-Based Imputation

Rationale: When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior.

Key changes:

Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation.
The temporary scaled matrix is used only inside imputation.
The original X_ref remains the reference passed into the final colocboost() call.
The implementation avoids full LD construction for X_ref imputation.
Added tests showing:
- raw X_ref SVD can differ from LD-based RAISS;
- scaled X_ref SVD matches LD scale in a controlled single-matrix case;
- colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD().

Important report observation:

Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7.
Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13.
This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior.

5. Protocol Compatibility and Minimal Loader Fixes

Rationale: The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal.

Key changes:

Kept all changes within pecotmr; no changes were made to the core colocboost() function.
Made only minimal loader-related fixes required by smoke testing:
- multi-genotype-source loading via match_geno_pheno;
- header handling for tabix-backed summary-stat files;
- whitespace-padded covariate numeric parsing.
Updated protocol-facing naming to use colocboost_pipeline().

6. Documentation and Tests

Rationale: The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity.

Key changes:

Updated documentation/man pages and vignette references for the new ColocBoost interfaces.
Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision.
Added focused regression tests for:
- colocboost_analysis() direct-input behavior;
- colocboost_pipeline() protocol-level behavior;
- reusable summary-stat QC behavior;
- RAISS raw/scaled genotype-reference behavior;
- X_ref-backed imputation avoiding compute_LD().
Focused tests passed:
- test_raiss.R
- test_sumstats_qc.R
- test_colocboost_pipeline.R
Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes.

Changed `colocboost_analysis_pipeline` to `colocboost_pipeline`. See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.

See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.

gaow · 2026-05-24T19:49:03Z

 #' @param LD_data A list containing combined LD variants data that is generated by load_LD_matrix.
 #' @param skip_region A character vector specifying regions to be skipped in the analysis (optional).
 #'                    Each region should be in the format "chrom:start-end" (e.g., "1:1000000-2000000").
+#' @param return_LD_mat Logical; if \code{FALSE}, return only harmonized


should be return_LD_data if this is returning the LD object which can be either LD matrix or genotype matrix?

this is a good point. i did not change this code, but this code seems tricky. In current pipeline, we should handle both Xref and LD matrix. Let me revisit this code and try to modify as the minimal changes and impact for other functions.

gaow · 2026-05-24T19:53:42Z

    extract_region_name = NULL, region_name_col = NULL,
-    qc_method = c("slalom", "dentist"),
+    qc_method = c("slalom", "dentist", "rss_qc", "none"),
    finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"),


I dont think this logic is necessary now. We just do susie_rss with EB. But maybe this par Haochen should weigh in

danielnachun · 2026-05-25T00:50:01Z

I tried to push a fix to this but it closed for some reason. I am fixing this!

xueweic added a commit to xueweic/xqtl-protocol that referenced this pull request May 24, 2026

Update colocboost.ipynb

529090c

Changed `colocboost_analysis_pipeline` to `colocboost_pipeline`. See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.

xueweic mentioned this pull request May 24, 2026

Update colocboost.ipynb StatFunGen/xqtl-protocol#1324

Open

xueweic added a commit to xueweic/colocboost that referenced this pull request May 24, 2026

Changed colocboost_analysis_pipeline to colocboost_pipeline

6f10589

See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.

gaow reviewed May 24, 2026

View reviewed changes

Comment thread R/sumstats_qc.R Outdated

gaow reviewed May 24, 2026

View reviewed changes

Comment thread R/univariate_pipeline.R Outdated

gaow reviewed May 24, 2026

View reviewed changes

danielnachun closed this May 25, 2026

danielnachun force-pushed the main branch from 90ea691 to 27d1412 Compare May 25, 2026 00:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Important update] Design Shared pecotmr QC and ColocBoost Analysis Work#494

[Important update] Design Shared pecotmr QC and ColocBoost Analysis Work#494
xueweic wants to merge 0 commit into
StatFunGen:mainfrom
xueweic:main

xueweic commented May 24, 2026

Uh oh!

Uh oh!

gaow May 24, 2026

Uh oh!

xueweic May 24, 2026

Uh oh!

Uh oh!

gaow May 24, 2026

Uh oh!

danielnachun commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xueweic commented May 24, 2026

1. Interface Separation

2. RegionalData Conversion Layer

3. Shared QC Refactor

4. X_ref-Based Imputation

5. Protocol Compatibility and Minimal Loader Fixes

6. Documentation and Tests

Uh oh!

Uh oh!

gaow May 24, 2026

Choose a reason for hiding this comment

Uh oh!

xueweic May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gaow May 24, 2026

Choose a reason for hiding this comment

Uh oh!

danielnachun commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants