Skip to content

recover pr 494#496

Merged
danielnachun merged 5 commits into
mainfrom
xueweic/recover-pr494
May 25, 2026
Merged

recover pr 494#496
danielnachun merged 5 commits into
mainfrom
xueweic/recover-pr494

Conversation

@danielnachun
Copy link
Copy Markdown
Collaborator

Final Report

  1. Interface Separation

Rationale:
Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic.

Key changes:

  • Added/standardized colocboost_analysis() as the direct user-facing wrapper.
    • Accepts the same core inputs as colocboost() through ....
    • Adds optional QC/imputation parameters.
    • If no QC is requested, it behaves as a minimal call-through to colocboost().
    • If QC is requested, it prepares cleaned inputs first, then calls colocboost().
  • Added/standardized colocboost_pipeline() as the protocol-oriented wrapper.
    • Starts from region_data produced by load_multitask_regional_data().
    • Designed for xQTL-protocol workflows.
    • Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses.
  1. RegionalData Conversion Layer

Rationale:
Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic.

Key changes:

  • Added reusable adapters:
    • region_data_to_rss_input() for summary-stat/RSS workflows.
    • region_data_to_ind_input() for individual-level workflows.
    • region_data_to_colocboost_input() for ColocBoost-specific assembly.
  • Preserved shared heavy objects such as X, LD, and X_ref.
  • Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage.
  1. Shared QC Refactor

Rationale:
QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline.

Key changes:

  • Added qc_individual_data() for individual-level QC.
  • Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior.
  • Clarified qc_method behavior:
    • NULL, "none", and "rss_qc" mean basic summary-stat harmonization only.
    • "slalom" and "dentist" add optional LD-mismatch outlier QC.
  • Added QC progress messages for stage tracking, retained counts, skips, and warnings.
  1. X_ref-Based Imputation

Rationale:
When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior.

Key changes:

  • Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation.
  • The temporary scaled matrix is used only inside imputation.
  • The original X_ref remains the reference passed into the final colocboost() call.
  • The implementation avoids full LD construction for X_ref imputation.
  • Added tests showing:
    • raw X_ref SVD can differ from LD-based RAISS;
    • scaled X_ref SVD matches LD scale in a controlled single-matrix case;
    • colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD().

Important report observation:

  • Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7.
  • Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13.
  • This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior.
  1. Protocol Compatibility and Minimal Loader Fixes

Rationale:
The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal.

Key changes:

  • Kept all changes within pecotmr; no changes were made to the core colocboost() function.
  • Made only minimal loader-related fixes required by smoke testing:
    • multi-genotype-source loading via match_geno_pheno;
    • header handling for tabix-backed summary-stat files;
    • whitespace-padded covariate numeric parsing.
  • Updated protocol-facing naming to use colocboost_pipeline().
  1. Documentation and Tests

Rationale:
The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity.

Key changes:

  • Updated documentation/man pages and vignette references for the new ColocBoost interfaces.
  • Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision.
  • Added focused regression tests for:
    • colocboost_analysis() direct-input behavior;
    • colocboost_pipeline() protocol-level behavior;
    • reusable summary-stat QC behavior;
    • RAISS raw/scaled genotype-reference behavior;
    • X_ref-backed imputation avoiding compute_LD().
  • Focused tests passed:
    • test_raiss.R
    • test_sumstats_qc.R
    • test_colocboost_pipeline.R
  • Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes.

xueweic and others added 5 commits May 24, 2026 11:18
Final Report

1. Interface Separation

Rationale:
Direct ColocBoost users and xQTL protocol users need different entrypoints. Mixing both workflows in one function made the API confusing and encouraged duplicated QC logic.

Key changes:
- Added/standardized colocboost_analysis() as the direct user-facing wrapper.
  - Accepts the same core inputs as colocboost() through ....
  - Adds optional QC/imputation parameters.
  - If no QC is requested, it behaves as a minimal call-through to colocboost().
  - If QC is requested, it prepares cleaned inputs first, then calls colocboost().
- Added/standardized colocboost_pipeline() as the protocol-oriented wrapper.
  - Starts from region_data produced by load_multitask_regional_data().
  - Designed for xQTL-protocol workflows.
  - Runs QC once, then reuses cleaned inputs across xQTL-only, joint-GWAS, and separate-GWAS analyses.

2. RegionalData Conversion Layer

Rationale:
Loaded regional data should be converted into reusable analysis inputs without reloading files or duplicating conversion logic.

Key changes:
- Added reusable adapters:
  - region_data_to_rss_input() for summary-stat/RSS workflows.
  - region_data_to_ind_input() for individual-level workflows.
  - region_data_to_colocboost_input() for ColocBoost-specific assembly.
- Preserved shared heavy objects such as X, LD, and X_ref.
- Used dictionaries such as dict_YX and dict_sumstatLD to avoid duplicated matrix storage.

3. Shared QC Refactor

Rationale:
QC logic should be reusable across ColocBoost, SuSiE RSS, and individual-level SuSiE workflows instead of being embedded separately in each pipeline.

Key changes:
- Added qc_individual_data() for individual-level QC.
- Expanded summary_stats_qc() to support combined RSS/ColocBoost summary-stat QC while preserving its historical LD-mismatch QC behavior.
- Clarified qc_method behavior:
  - NULL, "none", and "rss_qc" mean basic summary-stat harmonization only.
  - "slalom" and "dentist" add optional LD-mismatch outlier QC.
- Added QC progress messages for stage tracking, retained counts, skips, and warnings.

4. X_ref-Based Imputation

Rationale:
When genotype reference X_ref is available, imputation should use the genotype reference directly rather than deriving LD blocks, because LD block partitioning can alter imputation behavior.

Key changes:
- Final behavior uses whole-region scaled X_ref with RAISS genotype/SVD imputation.
- The temporary scaled matrix is used only inside imputation.
- The original X_ref remains the reference passed into the final colocboost() call.
- The implementation avoids full LD construction for X_ref imputation.
- Added tests showing:
  - raw X_ref SVD can differ from LD-based RAISS;
  - scaled X_ref SVD matches LD scale in a controlled single-matrix case;
  - colocboost_analysis(..., X_ref=..., impute=TRUE) uses scaled genotype/SVD and does not call compute_LD().

Important report observation:
- Old LD-block RAISS retained fewer variants for ENSG00000155304: 166 -> 7 and 167 -> 7.
- Whole-region scaled X_ref SVD retained more: 166 -> 16 and 167 -> 13.
- This difference is expected and should be highlighted as evidence that LD block partitioning changes imputation behavior.

5. Protocol Compatibility and Minimal Loader Fixes

Rationale:
The implementation should support the uploaded chr21 test data and xQTL-protocol use case while keeping loader changes minimal.

Key changes:
- Kept all changes within pecotmr; no changes were made to the core colocboost() function.
- Made only minimal loader-related fixes required by smoke testing:
  - multi-genotype-source loading via match_geno_pheno;
  - header handling for tabix-backed summary-stat files;
  - whitespace-padded covariate numeric parsing.
- Updated protocol-facing naming to use colocboost_pipeline().

6. Documentation and Tests

Rationale:
The new interfaces, QC behavior, and imputation decision need clear user-facing documentation and regression coverage so future changes do not reintroduce ambiguity.

Key changes:
- Updated documentation/man pages and vignette references for the new ColocBoost interfaces.
- Added chr21 smoke-test deliverable notes emphasizing the X_ref imputation decision.
- Added focused regression tests for:
  - colocboost_analysis() direct-input behavior;
  - colocboost_pipeline() protocol-level behavior;
  - reusable summary-stat QC behavior;
  - RAISS raw/scaled genotype-reference behavior;
  - X_ref-backed imputation avoiding compute_LD().
- Focused tests passed:
  - test_raiss.R
  - test_sumstats_qc.R
  - test_colocboost_pipeline.R
- Chr21 smoke tests completed on two genes across six scenarios, covering no-AD xQTL-only, basic QC, imputation, PIP screening, SLALOM QC, and combined modes.
Remove the legacy rss_qc alias so summary_stats_qc only accepts none, slalom, and dentist.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 82.89474% with 208 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.95%. Comparing base (789cf97) to head (d86da99).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
R/colocboost_pipeline.R 81.02% 137 Missing ⚠️
R/sumstats_qc.R 82.69% 45 Missing ⚠️
R/file_utils.R 86.24% 26 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #496      +/-   ##
==========================================
- Coverage   82.98%   82.95%   -0.03%     
==========================================
  Files          36       36              
  Lines        8146     9130     +984     
==========================================
+ Hits         6760     7574     +814     
- Misses       1386     1556     +170     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@danielnachun danielnachun merged commit c638868 into main May 25, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants