Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the MADC→VCF conversion workflow (especially madc2vcf_targets) by adding structured verbose messaging, expanding MADC sanity checks, and introducing count-collapsing options—along with substantial new tests and documentation updates.
Changes:
- Add
vmsg()utility and wire verbose progress messages through MADC/VCF conversion and count-extraction helpers. - Expand
check_madc_sanity()with additional validations and integrate it intomadc2vcf_targets()/madc2vcf_all(). - Add
collapse_matches_countssupport to aggregate|AltMatch/|RefMatchcounts, and update docs/tests accordingly.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| R/utils.R | Adds vmsg() and url_exists() utilities. |
| R/check_madc_sanity.R | Expands sanity checks (IUPAC/lowercase/indels/ChromPos/NA rows+cols/etc.) and moves/extends check_botloci(). |
| R/get_countsMADC.R | Adds madc_object, collapse_matches_counts, and verbose messaging; refactors get_counts(). |
| R/madc2vcf_targets.R | Adds input validation, integrates sanity checks, supports collapsing counts, improves REF/ALT recovery logic, and writes metadata into VCF headers. |
| R/madc2vcf_all.R | Integrates sanity checks and adds markers_info parameter. |
| tests/testthat/test-madc2vcf_targets.R | Removes REF/ALT cross-check vs madc2vcf_all; adds extensive new target conversion tests (incl. external fixtures). |
| tests/testthat/test-check_madc_sanity.R | Replaces local fixture with GitHub-hosted CSVs and adds expected check vectors. |
| man/vmsg.Rd | New documentation for vmsg(). |
| man/check_madc_sanity.Rd | Updates documentation for expanded sanity checks. |
| man/get_counts.Rd | New internal documentation for get_counts(). |
| man/get_countsMADC.Rd | Updates docs for new args/behavior. |
| man/madc2vcf_targets.Rd | Updates signature/docs for new args and updated requirements. |
| man/madc2vcf_all.Rd | Adds markers_info to usage (docs incomplete). |
| man/imputation_concordance.Rd | Updates API docs and adds plot/printing options (example currently non-runnable). |
| man/filterVCF.Rd | Adds quality.rates to usage (docs incomplete). |
| NAMESPACE | Exports vmsg and adds several imports (incl. ggplot2). |
| NEWS.md | Adds release notes for 0.6.4 changes. |
Comments suppressed due to low confidence (1)
R/madc2vcf_targets.R:161
report <- read.csv(madc_file)uses the defaultcheck.names = TRUE, which can silently modify sample column names (e.g., prefixing withX), changing sample IDs in the resulting VCF matrices. Usecheck.names = FALSEwhen reading the MADC to preserve original sample identifiers.
# MADC checks
report <- read.csv(madc_file)
checks <- check_madc_sanity(report)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| checks <- check_madc_sanity(report) | ||
|
|
||
| messages_results <- mapply(function(check, message) { | ||
| if (check) message[1] else message[2] |
There was a problem hiding this comment.
Same NA-handling issue as in madc2vcf_targets(): if (check) inside mapply() will error if any check is NA. check_madc_sanity() uses NA for skipped checks, so this needs isTRUE()/isFALSE() handling before building messages_results.
| if (check) message[1] else message[2] | |
| if (isTRUE(check)) { | |
| message[1] | |
| } else if (isFALSE(check)) { | |
| message[2] | |
| } else { | |
| NA_character_ | |
| } |
| out_vcf = NULL, | ||
| markers_info = NULL, | ||
| verbose = TRUE | ||
| ) |
There was a problem hiding this comment.
markers_info was added to the function signature here, but it is not documented in the \arguments section below (only verbose is). Please add an \item{markers_info}{...} entry so the Rd matches the updated API.
| @@ -106,18 +164,121 @@ madc2vcf_targets <- function(madc_file, | |||
| if (check) message[1] else message[2] | |||
There was a problem hiding this comment.
messages_results <- mapply(function(check, message) { if (check) ... }, checks$checks, ...) will throw missing value where TRUE/FALSE needed when any check is NA (which check_madc_sanity() explicitly uses for skipped checks). Use isTRUE(check) / isFALSE(check) handling (or default message) so NA checks don't error before you can stop() on Columns.
| if (check) message[1] else message[2] | |
| if (isTRUE(check)) message[1] else message[2] |
| "verbose= ", verbose,')">') | ||
|
|
||
| if(!is.null(madc)) report <- read.csv(madc, check.names = FALSE) else stop("Please provide a MADC file") | ||
| report <- read.csv(madc, check.names = FALSE) | ||
| checks <- check_madc_sanity(report) |
There was a problem hiding this comment.
madc is allowed to be NULL by the signature, but the function now unconditionally calls read.csv(madc, ...). When madc is NULL, this will error with an unhelpful message (and your input checks currently don't stop on NULL). Add an explicit is.null(madc) guard (or make madc required).
| export(updog2vcf) | ||
| export(vmsg) | ||
| import(dplyr) | ||
| import(ggplot2) |
There was a problem hiding this comment.
import(ggplot2) was added, but ggplot2 is not declared in DESCRIPTION Imports/Depends (it's not currently listed). This will fail R CMD check (namespace imports must be declared). Either add ggplot2 to DESCRIPTION Imports, or move it to Suggests and use requireNamespace() conditionally in plotting code.
| import(ggplot2) |
| filterVCF( | ||
| vcf.file, | ||
| quality.rates = F, | ||
| filter.OD = NULL, | ||
| filter.BIAS.min = NULL, |
There was a problem hiding this comment.
quality.rates was added to the function usage but is missing from the \arguments list, and the default is shown as F instead of FALSE. Please document the parameter and use FALSE for consistency/clarity in generated docs.
| test_that("ALFALFA — clean fixed allele ID MADC", { | ||
| out <- tempfile(fileext = ".vcf") | ||
| expect_no_error( | ||
| madc2vcf_targets(madc_file = alfalfa_madc, | ||
| output.file = out, |
There was a problem hiding this comment.
testthat::test_that() blocks are nested here (a test_that() inside another test_that()). testthat does not support nesting and this will error during test execution. Flatten these into separate top-level test_that() calls (or use local_* helpers/setup code instead of nesting).
| github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/" | ||
|
|
||
| # External alfalfa test files | ||
| alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv") | ||
| alfalfa_madc_wrongID <- paste0(github_path, "test_madcs/alfalfa_madc_wrongID.csv") | ||
| alfalfa_madc_raw <- paste0(github_path, "test_madcs/alfalfa_madc_raw.csv") # raw DArT format (7-row header) | ||
| alfalfa_iupac <- paste0(github_path, "test_madcs/alfalfa_IUPAC.csv") | ||
| alfalfa_lowercase <- paste0(github_path, "test_madcs/alfalfa_lowercase.csv") | ||
| alfalfa_botloci <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa | ||
| alfalfa_markers_info <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt | ||
| alfalfa_markers_info_ChromPos <- paste0(github_path, "test_madcs/alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos | ||
|
|
||
|
|
||
| # External potato test files | ||
| potato_indel_madc <- paste0(github_path, "test_madcs/potato_indel_madc.csv") | ||
| potato_indel_iupac <- paste0(github_path, "test_madcs/potato_indel_IUPAC.csv") | ||
| potato_indel_lowercase <- paste0(github_path, "test_madcs/potato_indel_lowercase.csv") | ||
| potato_more_indels_chrompos_false <- paste0(github_path, "test_madcs/potato_more_indels_madc_ChromPosFALSE.csv") | ||
| potato_botloci <- paste0(github_path, "potato/potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci") |
There was a problem hiding this comment.
These tests pull fixtures directly from GitHub at runtime. Even with skip_if_offline(), this makes CI/test results dependent on external availability and file immutability. Prefer committing small, versioned fixtures under inst/extdata (or tests/testthat/fixtures) and reading locally so tests are deterministic.
| github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/" | |
| # External alfalfa test files | |
| alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv") | |
| alfalfa_madc_wrongID <- paste0(github_path, "test_madcs/alfalfa_madc_wrongID.csv") | |
| alfalfa_madc_raw <- paste0(github_path, "test_madcs/alfalfa_madc_raw.csv") # raw DArT format (7-row header) | |
| alfalfa_iupac <- paste0(github_path, "test_madcs/alfalfa_IUPAC.csv") | |
| alfalfa_lowercase <- paste0(github_path, "test_madcs/alfalfa_lowercase.csv") | |
| alfalfa_botloci <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa | |
| alfalfa_markers_info <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt | |
| alfalfa_markers_info_ChromPos <- paste0(github_path, "test_madcs/alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos | |
| # External potato test files | |
| potato_indel_madc <- paste0(github_path, "test_madcs/potato_indel_madc.csv") | |
| potato_indel_iupac <- paste0(github_path, "test_madcs/potato_indel_IUPAC.csv") | |
| potato_indel_lowercase <- paste0(github_path, "test_madcs/potato_indel_lowercase.csv") | |
| potato_more_indels_chrompos_false <- paste0(github_path, "test_madcs/potato_more_indels_madc_ChromPosFALSE.csv") | |
| potato_botloci <- paste0(github_path, "potato/potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci") | |
| fixtures_dir <- testthat::test_path("fixtures") | |
| # External alfalfa test files (now local fixtures) | |
| alfalfa_madc <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc.csv") | |
| alfalfa_madc_wrongID <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc_wrongID.csv") | |
| alfalfa_madc_raw <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc_raw.csv") # raw DArT format (7-row header) | |
| alfalfa_iupac <- file.path(fixtures_dir, "test_madcs", "alfalfa_IUPAC.csv") | |
| alfalfa_lowercase <- file.path(fixtures_dir, "test_madcs", "alfalfa_lowercase.csv") | |
| alfalfa_botloci <- file.path(fixtures_dir, "alfalfa", "20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa | |
| alfalfa_markers_info <- file.path(fixtures_dir, "alfalfa", "20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt | |
| alfalfa_markers_info_ChromPos <- file.path(fixtures_dir, "test_madcs", "alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos | |
| # External potato test files (now local fixtures) | |
| potato_indel_madc <- file.path(fixtures_dir, "test_madcs", "potato_indel_madc.csv") | |
| potato_indel_iupac <- file.path(fixtures_dir, "test_madcs", "potato_indel_IUPAC.csv") | |
| potato_indel_lowercase <- file.path(fixtures_dir, "test_madcs", "potato_indel_lowercase.csv") | |
| potato_more_indels_chrompos_false <- file.path(fixtures_dir, "test_madcs", "potato_more_indels_madc_ChromPosFALSE.csv") | |
| potato_botloci <- file.path(fixtures_dir, "potato", "potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci") |
| has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE) | ||
| has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE) | ||
| checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002 |
There was a problem hiding this comment.
check_madc_sanity() reads report$AlleleID to compute has_0001/has_0002 before confirming that the required columns exist. If AlleleID is missing, this will error instead of returning Columns = FALSE as intended. Validate required columns (or guard column access) before using report$AlleleID/report$CloneID/report$AlleleSequence.
| has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE) | |
| has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE) | |
| checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002 | |
| if ("AlleleID" %in% names(report)) { | |
| has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE) | |
| has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE) | |
| checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002 | |
| } else { | |
| has_0001 <- NA | |
| has_0002 <- NA | |
| # leave checks["FixAlleleIDs"] as initialized (NA) when AlleleID is missing | |
| } |
|
|
||
| # --- All NA ---- | ||
| checks["allNArow"] <- any(apply(report, 1, function(x) all(is.na(x) | x == ""))) | ||
| checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x)) | x == "")) |
There was a problem hiding this comment.
The allNAcol check is incorrect: all(is.na(x)) | x == "" vectorizes over x and will often return TRUE if a column contains any empty string, not only when all values are missing. Use all(is.na(x) | x == "") (matching the allNArow logic).
| checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x)) | x == "")) | |
| checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x) | x == ""))) |
Updates functions madc2vcf
Details:
Still in progress:
Users now have the option to generate multiallelic VCF - new function madc2vcf_multi
madc2vcf_all and madc2vcf_multi doesn’t run if:
See the table for
madc2vcf_allandmadc2vcf_multirequirements accordingly to MADC content: