Skip to content

[Important update] Design Shared pecotmr QC and ColocBoost Analysis Work#494

Closed
xueweic wants to merge 0 commit into
StatFunGen:mainfrom
xueweic:main
Closed

[Important update] Design Shared pecotmr QC and ColocBoost Analysis Work#494
xueweic wants to merge 0 commit into
StatFunGen:mainfrom
xueweic:main

Conversation

@xueweic
Copy link
Copy Markdown
Contributor

@xueweic xueweic commented May 24, 2026

1. Interface Separation

Rationale: Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic.

Key changes:

  • Added/standardized colocboost_analysis() as the direct user-facing wrapper.
    • Accepts the same core inputs as colocboost() through ...
    • Adds optional QC/imputation parameters.
    • If no QC is requested, it behaves as a minimal call-through to colocboost().
    • If QC is requested, it prepares cleaned inputs first, then calls colocboost().
  • Added/standardized colocboost_pipeline() as the protocol-oriented wrapper.
    • Starts from region_data produced by load_multitask_regional_data().
    • Designed for xQTL-protocol workflows.
    • Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses.

2. RegionalData Conversion Layer

Rationale: Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic.

Key changes:

  • Added reusable adapters:
    • region_data_to_rss_input() for summary-stat/RSS workflows.
    • region_data_to_ind_input() for individual-level workflows.
    • region_data_to_colocboost_input() for ColocBoost-specific assembly.
  • Preserved shared heavy objects such as X, LD, and X_ref.
  • Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage.

3. Shared QC Refactor

Rationale: QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline.

Key changes:

  • Added qc_individual_data() for individual-level QC.
  • Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior.
  • Clarified qc_method behavior:
    • NULL, "none", and "rss_qc" mean basic summary-stat harmonization only.
    • "slalom" and "dentist" add optional LD-mismatch outlier QC.
  • Added QC progress messages for stage tracking, retained counts, skips, and warnings.

4. X_ref-Based Imputation

Rationale: When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior.

Key changes:

  • Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation.
  • The temporary scaled matrix is used only inside imputation.
  • The original X_ref remains the reference passed into the final colocboost() call.
  • The implementation avoids full LD construction for X_ref imputation.
  • Added tests showing:
    • raw X_ref SVD can differ from LD-based RAISS;
    • scaled X_ref SVD matches LD scale in a controlled single-matrix case;
    • colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD().

Important report observation:

  • Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7.
  • Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13.
  • This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior.

5. Protocol Compatibility and Minimal Loader Fixes

Rationale: The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal.

Key changes:

  • Kept all changes within pecotmr; no changes were made to the core colocboost() function.
  • Made only minimal loader-related fixes required by smoke testing:
    • multi-genotype-source loading via match_geno_pheno;
    • header handling for tabix-backed summary-stat files;
    • whitespace-padded covariate numeric parsing.
  • Updated protocol-facing naming to use colocboost_pipeline().

6. Documentation and Tests

Rationale: The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity.

Key changes:

  • Updated documentation/man pages and vignette references for the new ColocBoost interfaces.
  • Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision.
  • Added focused regression tests for:
    • colocboost_analysis() direct-input behavior;
    • colocboost_pipeline() protocol-level behavior;
    • reusable summary-stat QC behavior;
    • RAISS raw/scaled genotype-reference behavior;
    • X_ref-backed imputation avoiding compute_LD().
  • Focused tests passed:
    • test_raiss.R
    • test_sumstats_qc.R
    • test_colocboost_pipeline.R
  • Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes.

xueweic added a commit to xueweic/xqtl-protocol that referenced this pull request May 24, 2026
Changed `colocboost_analysis_pipeline` to `colocboost_pipeline`.  See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.
xueweic added a commit to xueweic/colocboost that referenced this pull request May 24, 2026
See PR/494 in StatFunGen/pecotmr (StatFunGen/pecotmr#494) for rationale and detailed changes.
Comment thread R/sumstats_qc.R Outdated
Comment thread R/sumstats_qc.R Outdated
#' @param LD_data A list containing combined LD variants data that is generated by load_LD_matrix.
#' @param skip_region A character vector specifying regions to be skipped in the analysis (optional).
#' Each region should be in the format "chrom:start-end" (e.g., "1:1000000-2000000").
#' @param return_LD_mat Logical; if \code{FALSE}, return only harmonized
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be return_LD_data if this is returning the LD object which can be either LD matrix or genotype matrix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point. i did not change this code, but this code seems tricky. In current pipeline, we should handle both Xref and LD matrix. Let me revisit this code and try to modify as the minimal changes and impact for other functions.

Comment thread R/univariate_pipeline.R Outdated
Comment thread R/univariate_pipeline.R
extract_region_name = NULL, region_name_col = NULL,
qc_method = c("slalom", "dentist"),
qc_method = c("slalom", "dentist", "rss_qc", "none"),
finemapping_method = c("susie_rss", "single_effect", "bayesian_conditional_regression"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this logic is necessary now. We just do susie_rss with EB. But maybe this par Haochen should weigh in

@danielnachun
Copy link
Copy Markdown
Collaborator

I tried to push a fix to this but it closed for some reason. I am fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants