Skip to content

Exit gracefully if all gwas fail QC#431

Closed
kal26 wants to merge 5 commits intoStatFunGen:mainfrom
kal26:feat/graceful-exit-no-gwas
Closed

Exit gracefully if all gwas fail QC#431
kal26 wants to merge 5 commits intoStatFunGen:mainfrom
kal26:feat/graceful-exit-no-gwas

Conversation

@kal26
Copy link
Contributor

@kal26 kal26 commented Jan 15, 2026

Summary

Adds graceful handling when all GWAS studies are skipped during QC in the colocboost analysis pipeline.

Problem

When all GWAS studies are skipped during QC (e.g., due to "No variants in region"), the pipeline would crash with a "subscript out of bounds" error when trying to access dict_sumstatLD[i_gwas, ] with an invalid index.

Solution

  • Added check for empty sumstats before running joint_gwas analysis
  • Added check for NULL or empty dict_sumstatLD before running separate_gwas analysis
  • Returns NULL results with informative message instead of crashing

Changes

  • Modified colocboost_analysis_pipeline() to check if GWAS data exists before attempting analysis

- Check for missing phenotype, covariate, genotype, sumstat, column mapping, and LD metadata files
- Error (not warn) when files are missing to catch path issues early
- Fix genotype file check to look for .bed/.bim/.fam or .pgen/.pvar/.psam files instead of base filename
- Improve error messages to show which files are missing with condition names
- Filter phenotypes without data for specific region before loading
- Centralize normalize_variant_id function in misc.R and remove duplicates
- Normalize variant IDs consistently across sumstats, LD matrices, and genotype data
- Fix LD matrix usage: use processed LD matrices after normalization in summary_stats_qc
- Normalize ref_panel variant IDs before imputation to match processed sumstats
- Add try-catch error handling in summary_stats_qc_multitask loop
- Add try-catch error handling in separate_gwas focal analysis loop
- Fix pip_cutoff_to_skip_sumstat handling for single values, named vectors, and unnamed vectors
- Remove debug statements
…alyses

- Use normalize_variant_id() consistently for sumstats and LD matrices to handle variant ID format mismatches
- Remove double 'chr' prefix issue and build suffix (e.g., :b38) mismatches
- Add tryCatch error handling for joint_gwas and separate_gwas analyses
- Add concise warning messages for colocboost validation failures
- Fix sumstat_studies handling when sumstats list is empty
- Check if sumstats is empty before running joint_gwas analysis
- Check if dict_sumstatLD is NULL or has 0 rows before running separate_gwas analysis
- Prevents 'subscript out of bounds' error when all GWAS studies are skipped during QC
- Returns NULL results with informative message instead of crashing
@gaow
Copy link
Contributor

gaow commented Mar 1, 2026

Hi @kal26 @danielnachun

Re: Your Open PRs (#418, #419, #427, #429, #431)

Thanks so much for all the work you put into these PRs. After some recent code changes, I asked Claude Code (Opus 4.6) to go through every single change across all five PRs and cross-reference them against the current main branch. The good news is that Claude reports we've addressed the vast majority of the issues you raised through recent refactoring. Here's what it found what's been resolved so you can verify on your end.

You will need to update pecotmr and susieR both to the latest main branch. We have not yet made updated conda packages.

What's been resolved in main

Variant ID normalization (all PRs)

This was a big one across your PRs — the chr prefix ping-pong, build suffix issues, inconsistent ID formats between sumstats and LD matrices. Claude did a comprehensive overhaul of variant ID handling in commit 9e2c10a. The new system in R/misc.R includes:

  • normalize_variant_id() — single canonical format everywhere
  • parse_variant_id() — robust parser that handles 2-, 3-, 4-, and 5-part IDs
  • detect_variant_convention() — auto-detects format conventions
  • format_variant_id() — reconstructs IDs with configurable conventions

These are now used consistently across R/LD.R, R/allele_qc.R, R/colocboost_pipeline.R, R/file_utils.R, R/susie_wrapper.R, and R/sumstats_qc.R. The align_variant_names() function also has your remove_build_suffix parameter (default TRUE). The old manual paste0("chr", ...) / ifelse(startsWith(...)) patterns are gone.

Graceful error handling / exit when all GWAS fail QC (PRs #431, #418)

  • All three colocboost() calls (xQTL, joint GWAS, separate GWAS) are wrapped in tryCatch with informative error messages
  • is_valid_sumstat_entry() and filter_valid_sumstats() validate entries before analysis (checks for NULL, non-data.frame, too-few variants, all-NA z-scores, non-positive n)
  • When all sumstats are filtered, the pipeline sets them to NULL and returns early with a message instead of crashing
  • NA z-scores are filtered out before constructing the sumstat data frames

QC method defaults (PR #419)

match.arg() is now called in colocboost_analysis_pipeline(), qc_regional_data(), and summary_stats_qc(). The default QC method is now a less aggressive "slalom" (first element in the vector), for reasons discussed in https://statfungen.github.io/pecotmr/articles/dentist.html

PIP cutoff validation (PRs #419, #427)

The robust handling you proposed is in place:

  • Scalar pip_cutoff_to_skip_ind gets recycled with rep() and length-validated
  • Scalar pip_cutoff_to_skip_sumstat expands to a named vector; named vectors get missing studies filled with 0
  • Mismatched lengths produce clear error messages

Region naming in load_phenotype_data (PR #419)

When extract_region_name is NULL, colnames are now assigned from region_name_col if available. This was your fix for being unable to analyze multiple phenotypes within region files without passing both parameters.

LD variant ID alignment in rss_basic_qc (PRs #418, #427)

R/sumstats_qc.R now has a fallback alignment path: if sumstats and LD variants don't match directly, it strips build suffixes and chr prefixes from both sides and re-attempts matching.

NA handling in filter_raiss_output (PRs #418, #427)

Claude confirmed the na.rm = TRUE on the four sum() calls was already present in our codebase.

File existence validation (PR #427)

We added file.exists() checks at the individual function level — tabix_region(), load_genotype_region() (validates both PLINK1 and PLINK2 file sets), load_covariate_data(), and load_rss_data(). Since each loading function already validates its inputs, errors get caught with clear messages either way — just slightly later in execution. The net result is the same: bad paths produce clear errors before any analysis runs.

Phenotype pre-filtering and alignment (PRs #427, #431)

Your check_phenotype_has_data() probed tabix before loading to skip phenotypes with no data in a region. We handle this slightly differently but with the same outcome — load_phenotype_data() loads the file and if it's empty for the region, returns NULL with a message. It then tracks kept_indices for which phenotypes survived, and load_regional_association_data() uses those to align covariates and conditions accordingly. So empty phenotypes are skipped gracefully and downstream data stays consistent.

Colname preservation through residuals

add_Y_residuals() now preserves column names (gene IDs) through the residual computation: colnames(res) <- colnames(.x).

What we decided not to implement (and why)

num_threads propagation to vroom/fread/read_delim (PR #429)

We understand the HPC motivation — you don't want vroom spawning 64 threads inside a SLURM job allocated 4 cores. However, threading the num_threads parameter through every single function in the call chain adds a lot of API complexity. The better approach for HPC environments is to set the environment variables that these libraries already respect:

  • VROOM_THREADS=4 (for vroom)
  • R_DATATABLE_NUM_THREADS=4 (for data.table::fread)
  • OMP_NUM_THREADS=4 (for C++ OpenMP code)

These can be set in your SLURM job script or .Renviron file and they work globally without needing to pass parameters through every function call. We'd recommend this approach for your HPC workflows.

Debug thread logging in C++ (PR #429)

The timestamp/OMP_NUM_THREADS logging you added to dentist_iterative_impute.cpp and src/mr_ash.h was debugging instrumentation to diagnose thread behavior on HPC, but not something we'd include in release code. Also, mr_ash has since been moved out of pecotmr entirely — it now lives in susieR/src/mr_ash_rss, so changes to src/mr_ash.h here wouldn't apply anymore.

sanitize_names helper in allele_qc (PR #427)

The column name sanitization after merge() in allele_qc() was to handle edge cases with unnamed or NA columns. Our refactored allele_qc now uses format_variant_id() and the new normalization system which doesn't produce these edge cases, so the sanitization isn't needed.

If you find any remaining issues after testing against current main, please open a new focused issue and we'll address it right away.

Thanks again for all the careful debugging work — it directly informed a lot of the improvements we made!

Best,
Gao & Claude Code (Opus 4.6)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants