Exit gracefully if all gwas fail QC by kal26 · Pull Request #431 · StatFunGen/pecotmr

kal26 · 2026-01-15T23:38:57Z

Summary

Adds graceful handling when all GWAS studies are skipped during QC in the colocboost analysis pipeline.

Problem

When all GWAS studies are skipped during QC (e.g., due to "No variants in region"), the pipeline would crash with a "subscript out of bounds" error when trying to access dict_sumstatLD[i_gwas, ] with an invalid index.

Solution

Added check for empty sumstats before running joint_gwas analysis
Added check for NULL or empty dict_sumstatLD before running separate_gwas analysis
Returns NULL results with informative message instead of crashing

Changes

Modified colocboost_analysis_pipeline() to check if GWAS data exists before attempting analysis

- Check for missing phenotype, covariate, genotype, sumstat, column mapping, and LD metadata files - Error (not warn) when files are missing to catch path issues early - Fix genotype file check to look for .bed/.bim/.fam or .pgen/.pvar/.psam files instead of base filename - Improve error messages to show which files are missing with condition names - Filter phenotypes without data for specific region before loading

- Centralize normalize_variant_id function in misc.R and remove duplicates - Normalize variant IDs consistently across sumstats, LD matrices, and genotype data - Fix LD matrix usage: use processed LD matrices after normalization in summary_stats_qc - Normalize ref_panel variant IDs before imputation to match processed sumstats - Add try-catch error handling in summary_stats_qc_multitask loop - Add try-catch error handling in separate_gwas focal analysis loop - Fix pip_cutoff_to_skip_sumstat handling for single values, named vectors, and unnamed vectors - Remove debug statements

…alyses - Use normalize_variant_id() consistently for sumstats and LD matrices to handle variant ID format mismatches - Remove double 'chr' prefix issue and build suffix (e.g., :b38) mismatches - Add tryCatch error handling for joint_gwas and separate_gwas analyses - Add concise warning messages for colocboost validation failures - Fix sumstat_studies handling when sumstats list is empty

- Check if sumstats is empty before running joint_gwas analysis - Check if dict_sumstatLD is NULL or has 0 rows before running separate_gwas analysis - Prevents 'subscript out of bounds' error when all GWAS studies are skipped during QC - Returns NULL results with informative message instead of crashing

gaow · 2026-03-01T15:32:06Z

Hi @kal26 @danielnachun

Re: Your Open PRs (#418, #419, #427, #429, #431)

Thanks so much for all the work you put into these PRs. After some recent code changes, I asked Claude Code (Opus 4.6) to go through every single change across all five PRs and cross-reference them against the current main branch. The good news is that Claude reports we've addressed the vast majority of the issues you raised through recent refactoring. Here's what it found what's been resolved so you can verify on your end.

You will need to update pecotmr and susieR both to the latest main branch. We have not yet made updated conda packages.

What's been resolved in `main`

Variant ID normalization (all PRs)

This was a big one across your PRs — the chr prefix ping-pong, build suffix issues, inconsistent ID formats between sumstats and LD matrices. Claude did a comprehensive overhaul of variant ID handling in commit 9e2c10a. The new system in R/misc.R includes:

normalize_variant_id() — single canonical format everywhere
parse_variant_id() — robust parser that handles 2-, 3-, 4-, and 5-part IDs
detect_variant_convention() — auto-detects format conventions
format_variant_id() — reconstructs IDs with configurable conventions

These are now used consistently across R/LD.R, R/allele_qc.R, R/colocboost_pipeline.R, R/file_utils.R, R/susie_wrapper.R, and R/sumstats_qc.R. The align_variant_names() function also has your remove_build_suffix parameter (default TRUE). The old manual paste0("chr", ...) / ifelse(startsWith(...)) patterns are gone.

Graceful error handling / exit when all GWAS fail QC (PRs #431, #418)

All three colocboost() calls (xQTL, joint GWAS, separate GWAS) are wrapped in tryCatch with informative error messages
is_valid_sumstat_entry() and filter_valid_sumstats() validate entries before analysis (checks for NULL, non-data.frame, too-few variants, all-NA z-scores, non-positive n)
When all sumstats are filtered, the pipeline sets them to NULL and returns early with a message instead of crashing
NA z-scores are filtered out before constructing the sumstat data frames

QC method defaults (PR #419)

match.arg() is now called in colocboost_analysis_pipeline(), qc_regional_data(), and summary_stats_qc(). The default QC method is now a less aggressive "slalom" (first element in the vector), for reasons discussed in https://statfungen.github.io/pecotmr/articles/dentist.html

PIP cutoff validation (PRs #419, #427)

The robust handling you proposed is in place:

Scalar pip_cutoff_to_skip_ind gets recycled with rep() and length-validated
Scalar pip_cutoff_to_skip_sumstat expands to a named vector; named vectors get missing studies filled with 0
Mismatched lengths produce clear error messages

Region naming in `load_phenotype_data` (PR #419)

When extract_region_name is NULL, colnames are now assigned from region_name_col if available. This was your fix for being unable to analyze multiple phenotypes within region files without passing both parameters.

LD variant ID alignment in `rss_basic_qc` (PRs #418, #427)

R/sumstats_qc.R now has a fallback alignment path: if sumstats and LD variants don't match directly, it strips build suffixes and chr prefixes from both sides and re-attempts matching.

NA handling in `filter_raiss_output` (PRs #418, #427)

Claude confirmed the na.rm = TRUE on the four sum() calls was already present in our codebase.

File existence validation (PR #427)

We added file.exists() checks at the individual function level — tabix_region(), load_genotype_region() (validates both PLINK1 and PLINK2 file sets), load_covariate_data(), and load_rss_data(). Since each loading function already validates its inputs, errors get caught with clear messages either way — just slightly later in execution. The net result is the same: bad paths produce clear errors before any analysis runs.

Phenotype pre-filtering and alignment (PRs #427, #431)

Your check_phenotype_has_data() probed tabix before loading to skip phenotypes with no data in a region. We handle this slightly differently but with the same outcome — load_phenotype_data() loads the file and if it's empty for the region, returns NULL with a message. It then tracks kept_indices for which phenotypes survived, and load_regional_association_data() uses those to align covariates and conditions accordingly. So empty phenotypes are skipped gracefully and downstream data stays consistent.

Colname preservation through residuals

add_Y_residuals() now preserves column names (gene IDs) through the residual computation: colnames(res) <- colnames(.x).

What we decided not to implement (and why)

`num_threads` propagation to vroom/fread/read_delim (PR #429)

We understand the HPC motivation — you don't want vroom spawning 64 threads inside a SLURM job allocated 4 cores. However, threading the num_threads parameter through every single function in the call chain adds a lot of API complexity. The better approach for HPC environments is to set the environment variables that these libraries already respect:

VROOM_THREADS=4 (for vroom)
R_DATATABLE_NUM_THREADS=4 (for data.table::fread)
OMP_NUM_THREADS=4 (for C++ OpenMP code)

These can be set in your SLURM job script or .Renviron file and they work globally without needing to pass parameters through every function call. We'd recommend this approach for your HPC workflows.

Debug thread logging in C++ (PR #429)

The timestamp/OMP_NUM_THREADS logging you added to dentist_iterative_impute.cpp and src/mr_ash.h was debugging instrumentation to diagnose thread behavior on HPC, but not something we'd include in release code. Also, mr_ash has since been moved out of pecotmr entirely — it now lives in susieR/src/mr_ash_rss, so changes to src/mr_ash.h here wouldn't apply anymore.

`sanitize_names` helper in `allele_qc` (PR #427)

The column name sanitization after merge() in allele_qc() was to handle edge cases with unnamed or NA columns. Our refactored allele_qc now uses format_variant_id() and the new normalization system which doesn't produce these edge cases, so the sanitization isn't needed.

If you find any remaining issues after testing against current main, please open a new focused issue and we'll address it right away.

Thanks again for all the careful debugging work — it directly informed a lot of the improvements we made!

Best,
Gao & Claude Code (Opus 4.6)

kal26 added 5 commits December 19, 2025 14:08

add num_threads control to colocboost pipeline

022e3be

gaow closed this Mar 1, 2026

This was referenced Mar 1, 2026

Add multi-threading control for colocboost pipeline #429

Closed

Feat/filter empty phenotypes by region #427

Closed

Set defaults and improve region naming in loaders #419

Closed

Skip invalid GWAS in focal mode; validate joint GWAS inputs #418

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exit gracefully if all gwas fail QC#431

Exit gracefully if all gwas fail QC#431
kal26 wants to merge 5 commits intoStatFunGen:mainfrom
kal26:feat/graceful-exit-no-gwas

kal26 commented Jan 15, 2026

Uh oh!

gaow commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kal26 commented Jan 15, 2026

Summary

Problem

Solution

Changes

Uh oh!

gaow commented Mar 1, 2026

Re: Your Open PRs (#418, #419, #427, #429, #431)

What's been resolved in main

Variant ID normalization (all PRs)

Graceful error handling / exit when all GWAS fail QC (PRs #431, #418)

QC method defaults (PR #419)

PIP cutoff validation (PRs #419, #427)

Region naming in load_phenotype_data (PR #419)

LD variant ID alignment in rss_basic_qc (PRs #418, #427)

NA handling in filter_raiss_output (PRs #418, #427)

File existence validation (PR #427)

Phenotype pre-filtering and alignment (PRs #427, #431)

Colname preservation through residuals

What we decided not to implement (and why)

num_threads propagation to vroom/fread/read_delim (PR #429)

Debug thread logging in C++ (PR #429)

sanitize_names helper in allele_qc (PR #427)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

What's been resolved in `main`

Region naming in `load_phenotype_data` (PR #419)

LD variant ID alignment in `rss_basic_qc` (PRs #418, #427)

NA handling in `filter_raiss_output` (PRs #418, #427)

`num_threads` propagation to vroom/fread/read_delim (PR #429)

`sanitize_names` helper in `allele_qc` (PR #427)