Skip to content

Madc2vcf updates#53

Merged
Cristianetaniguti merged 5 commits intoped_indels_updatefrom
madc2vcf_all_updates
Mar 27, 2026
Merged

Madc2vcf updates#53
Cristianetaniguti merged 5 commits intoped_indels_updatefrom
madc2vcf_all_updates

Conversation

@Cristianetaniguti
Copy link
Copy Markdown
Collaborator

Updates functions madc2vcf

Details:

  • If verbose = TRUE, the functions output informative messages along the process
  • both functions (targets and all (targets + off-targets) markers now have check_madc_sanity function implemented. It tests:
    • [ChromPos] If MADC has the expected columns
    • [allNArow | allNAcol] Presence of columns and rows with all NA (happens often when people open the MADC in excel before loading in R)
    • [IUPACcodes] Presence of IUPAC codes on AlleleSequence
    • [LowerCase] Presence of lower case bases on AlleleSequence
    • [Indels] Presence of Indels
    • [ChromPos] If CloneID follows the format Chr_Pos
    • [RefAltSeqs] If all Ref Allele has corresponding Alt and vice-versa
    • [OtherAlleles] If "Other" exists in the MADC AlleleID
  • madc2vcf_targets doesn’t run if:
    • MADC Column names are not correct
    • Ignore Other alleles - but inform the user if they exist or not
  • See the table for madc2vcf_targets requirements accordingly to MADC content:
  check status get_REF_ALT include_others Requires
IUPAC TRUE TRUE - markers_info REF/ALT
  TRUE FALSE - -
  FALSE TRUE - botloci or markers_info REF/ALT
  FALSE FALSE - -
Indels TRUE TRUE - markers_info REF/ALT
  TRUE FALSE - -
  FALSE TRUE - botloci or markers_info REF/ALT
  FALSE FALSE - -
ChromPos TRUE TRUE - botloci or markers_info REF/ALT
  TRUE FALSE - -
  FALSE TRUE - markers_info CHR/POS/REF/ALT or markers_info CHR/POS/ + botloci
  FALSE FALSE - markers_info CHR/POS
FixAlleleIDs TRUE TRUE - botloci or markers_info REF/ALT
  TRUE FALSE - -
  FALSE TRUE - markers_info REF/ALT
  FALSE FALSE - -
  • madc2vcf_targets got two new arguments: collapse_matches_counts

Still in progress:

  • Users now have the option to generate multiallelic VCF - new function madc2vcf_multi

  • madc2vcf_all and madc2vcf_multi doesn’t run if:

    • MADC Column names are not correct
    • If it is raw MADC
    • If it has IUPAC codes
  • See the table for madc2vcf_all and madc2vcf_multi requirements accordingly to MADC content:

  Check status Requires
Indels TRUE markers_info REF/ALT/IndelPos/IndelLenght + botloci
  FALSE botloci
ChromPos TRUE botloci
  FALSE markers_info CHR/POS + botloci
RefAltSeqs TRUE botloci
  FALSE botloci + hapdb
  • Help pages updated with all these new information

@Cristianetaniguti Cristianetaniguti merged commit df6fe92 into ped_indels_update Mar 27, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the MADC→VCF conversion workflow (especially madc2vcf_targets) by adding structured verbose messaging, expanding MADC sanity checks, and introducing count-collapsing options—along with substantial new tests and documentation updates.

Changes:

  • Add vmsg() utility and wire verbose progress messages through MADC/VCF conversion and count-extraction helpers.
  • Expand check_madc_sanity() with additional validations and integrate it into madc2vcf_targets() / madc2vcf_all().
  • Add collapse_matches_counts support to aggregate |AltMatch/|RefMatch counts, and update docs/tests accordingly.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
R/utils.R Adds vmsg() and url_exists() utilities.
R/check_madc_sanity.R Expands sanity checks (IUPAC/lowercase/indels/ChromPos/NA rows+cols/etc.) and moves/extends check_botloci().
R/get_countsMADC.R Adds madc_object, collapse_matches_counts, and verbose messaging; refactors get_counts().
R/madc2vcf_targets.R Adds input validation, integrates sanity checks, supports collapsing counts, improves REF/ALT recovery logic, and writes metadata into VCF headers.
R/madc2vcf_all.R Integrates sanity checks and adds markers_info parameter.
tests/testthat/test-madc2vcf_targets.R Removes REF/ALT cross-check vs madc2vcf_all; adds extensive new target conversion tests (incl. external fixtures).
tests/testthat/test-check_madc_sanity.R Replaces local fixture with GitHub-hosted CSVs and adds expected check vectors.
man/vmsg.Rd New documentation for vmsg().
man/check_madc_sanity.Rd Updates documentation for expanded sanity checks.
man/get_counts.Rd New internal documentation for get_counts().
man/get_countsMADC.Rd Updates docs for new args/behavior.
man/madc2vcf_targets.Rd Updates signature/docs for new args and updated requirements.
man/madc2vcf_all.Rd Adds markers_info to usage (docs incomplete).
man/imputation_concordance.Rd Updates API docs and adds plot/printing options (example currently non-runnable).
man/filterVCF.Rd Adds quality.rates to usage (docs incomplete).
NAMESPACE Exports vmsg and adds several imports (incl. ggplot2).
NEWS.md Adds release notes for 0.6.4 changes.
Comments suppressed due to low confidence (1)

R/madc2vcf_targets.R:161

  • report <- read.csv(madc_file) uses the default check.names = TRUE, which can silently modify sample column names (e.g., prefixing with X), changing sample IDs in the resulting VCF matrices. Use check.names = FALSE when reading the MADC to preserve original sample identifiers.
  # MADC checks
  report <- read.csv(madc_file)
  checks <- check_madc_sanity(report)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

checks <- check_madc_sanity(report)

messages_results <- mapply(function(check, message) {
if (check) message[1] else message[2]
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same NA-handling issue as in madc2vcf_targets(): if (check) inside mapply() will error if any check is NA. check_madc_sanity() uses NA for skipped checks, so this needs isTRUE()/isFALSE() handling before building messages_results.

Suggested change
if (check) message[1] else message[2]
if (isTRUE(check)) {
message[1]
} else if (isFALSE(check)) {
message[2]
} else {
NA_character_
}

Copilot uses AI. Check for mistakes.
Comment on lines 16 to 19
out_vcf = NULL,
markers_info = NULL,
verbose = TRUE
)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

markers_info was added to the function signature here, but it is not documented in the \arguments section below (only verbose is). Please add an \item{markers_info}{...} entry so the Rd matches the updated API.

Copilot uses AI. Check for mistakes.
@@ -106,18 +164,121 @@ madc2vcf_targets <- function(madc_file,
if (check) message[1] else message[2]
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

messages_results <- mapply(function(check, message) { if (check) ... }, checks$checks, ...) will throw missing value where TRUE/FALSE needed when any check is NA (which check_madc_sanity() explicitly uses for skipped checks). Use isTRUE(check) / isFALSE(check) handling (or default message) so NA checks don't error before you can stop() on Columns.

Suggested change
if (check) message[1] else message[2]
if (isTRUE(check)) message[1] else message[2]

Copilot uses AI. Check for mistakes.
Comment on lines 99 to +102
"verbose= ", verbose,')">')

if(!is.null(madc)) report <- read.csv(madc, check.names = FALSE) else stop("Please provide a MADC file")
report <- read.csv(madc, check.names = FALSE)
checks <- check_madc_sanity(report)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

madc is allowed to be NULL by the signature, but the function now unconditionally calls read.csv(madc, ...). When madc is NULL, this will error with an unhelpful message (and your input checks currently don't stop on NULL). Add an explicit is.null(madc) guard (or make madc required).

Copilot uses AI. Check for mistakes.
export(updog2vcf)
export(vmsg)
import(dplyr)
import(ggplot2)
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import(ggplot2) was added, but ggplot2 is not declared in DESCRIPTION Imports/Depends (it's not currently listed). This will fail R CMD check (namespace imports must be declared). Either add ggplot2 to DESCRIPTION Imports, or move it to Suggests and use requireNamespace() conditionally in plotting code.

Suggested change
import(ggplot2)

Copilot uses AI. Check for mistakes.
Comment on lines 7 to 11
filterVCF(
vcf.file,
quality.rates = F,
filter.OD = NULL,
filter.BIAS.min = NULL,
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quality.rates was added to the function usage but is missing from the \arguments list, and the default is shown as F instead of FALSE. Please document the parameter and use FALSE for consistency/clarity in generated docs.

Copilot uses AI. Check for mistakes.
Comment on lines +121 to +125
test_that("ALFALFA — clean fixed allele ID MADC", {
out <- tempfile(fileext = ".vcf")
expect_no_error(
madc2vcf_targets(madc_file = alfalfa_madc,
output.file = out,
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testthat::test_that() blocks are nested here (a test_that() inside another test_that()). testthat does not support nesting and this will error during test execution. Flatten these into separate top-level test_that() calls (or use local_* helpers/setup code instead of nesting).

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +115
github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/"

# External alfalfa test files
alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv")
alfalfa_madc_wrongID <- paste0(github_path, "test_madcs/alfalfa_madc_wrongID.csv")
alfalfa_madc_raw <- paste0(github_path, "test_madcs/alfalfa_madc_raw.csv") # raw DArT format (7-row header)
alfalfa_iupac <- paste0(github_path, "test_madcs/alfalfa_IUPAC.csv")
alfalfa_lowercase <- paste0(github_path, "test_madcs/alfalfa_lowercase.csv")
alfalfa_botloci <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa
alfalfa_markers_info <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt
alfalfa_markers_info_ChromPos <- paste0(github_path, "test_madcs/alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos


# External potato test files
potato_indel_madc <- paste0(github_path, "test_madcs/potato_indel_madc.csv")
potato_indel_iupac <- paste0(github_path, "test_madcs/potato_indel_IUPAC.csv")
potato_indel_lowercase <- paste0(github_path, "test_madcs/potato_indel_lowercase.csv")
potato_more_indels_chrompos_false <- paste0(github_path, "test_madcs/potato_more_indels_madc_ChromPosFALSE.csv")
potato_botloci <- paste0(github_path, "potato/potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci")
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests pull fixtures directly from GitHub at runtime. Even with skip_if_offline(), this makes CI/test results dependent on external availability and file immutability. Prefer committing small, versioned fixtures under inst/extdata (or tests/testthat/fixtures) and reading locally so tests are deterministic.

Suggested change
github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/"
# External alfalfa test files
alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv")
alfalfa_madc_wrongID <- paste0(github_path, "test_madcs/alfalfa_madc_wrongID.csv")
alfalfa_madc_raw <- paste0(github_path, "test_madcs/alfalfa_madc_raw.csv") # raw DArT format (7-row header)
alfalfa_iupac <- paste0(github_path, "test_madcs/alfalfa_IUPAC.csv")
alfalfa_lowercase <- paste0(github_path, "test_madcs/alfalfa_lowercase.csv")
alfalfa_botloci <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa
alfalfa_markers_info <- paste0(github_path, "alfalfa/20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt
alfalfa_markers_info_ChromPos <- paste0(github_path, "test_madcs/alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos
# External potato test files
potato_indel_madc <- paste0(github_path, "test_madcs/potato_indel_madc.csv")
potato_indel_iupac <- paste0(github_path, "test_madcs/potato_indel_IUPAC.csv")
potato_indel_lowercase <- paste0(github_path, "test_madcs/potato_indel_lowercase.csv")
potato_more_indels_chrompos_false <- paste0(github_path, "test_madcs/potato_more_indels_madc_ChromPosFALSE.csv")
potato_botloci <- paste0(github_path, "potato/potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci")
fixtures_dir <- testthat::test_path("fixtures")
# External alfalfa test files (now local fixtures)
alfalfa_madc <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc.csv")
alfalfa_madc_wrongID <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc_wrongID.csv")
alfalfa_madc_raw <- file.path(fixtures_dir, "test_madcs", "alfalfa_madc_raw.csv") # raw DArT format (7-row header)
alfalfa_iupac <- file.path(fixtures_dir, "test_madcs", "alfalfa_IUPAC.csv")
alfalfa_lowercase <- file.path(fixtures_dir, "test_madcs", "alfalfa_lowercase.csv")
alfalfa_botloci <- file.path(fixtures_dir, "alfalfa", "20201030-BI-Alfalfa_SNPs_DArTag-probe-design_f180bp.botloci") # botloci for alfalfa
alfalfa_markers_info <- file.path(fixtures_dir, "alfalfa", "20201030-BI-Alfalfa_SNPs_DArTag-probe-design_snpID_lut.csv") # markers_info: CloneID/BI_markerID, Chr, Pos, Ref, Alt
alfalfa_markers_info_ChromPos <- file.path(fixtures_dir, "test_madcs", "alfalfa_marker_info_ChromPos.csv") # markers_info: CloneID/BI_markerID, Chr, Pos
# External potato test files (now local fixtures)
potato_indel_madc <- file.path(fixtures_dir, "test_madcs", "potato_indel_madc.csv")
potato_indel_iupac <- file.path(fixtures_dir, "test_madcs", "potato_indel_IUPAC.csv")
potato_indel_lowercase <- file.path(fixtures_dir, "test_madcs", "potato_indel_lowercase.csv")
potato_more_indels_chrompos_false <- file.path(fixtures_dir, "test_madcs", "potato_more_indels_madc_ChromPosFALSE.csv")
potato_botloci <- file.path(fixtures_dir, "potato", "potato_dartag_v2_3915markers_rm7dupTags_6traitMarkers_f150bp_ref_alt.botloci")

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +97
has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE)
has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE)
checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_madc_sanity() reads report$AlleleID to compute has_0001/has_0002 before confirming that the required columns exist. If AlleleID is missing, this will error instead of returning Columns = FALSE as intended. Validate required columns (or guard column access) before using report$AlleleID/report$CloneID/report$AlleleSequence.

Suggested change
has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE)
has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE)
checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002
if ("AlleleID" %in% names(report)) {
has_0001 <- any(grepl("_0001", report$AlleleID, fixed = TRUE), na.rm = TRUE)
has_0002 <- any(grepl("_0002", report$AlleleID, fixed = TRUE), na.rm = TRUE)
checks["FixAlleleIDs"] <- (!all_blank_or_star) & has_0001 & has_0002
} else {
has_0001 <- NA
has_0002 <- NA
# leave checks["FixAlleleIDs"] as initialized (NA) when AlleleID is missing
}

Copilot uses AI. Check for mistakes.

# --- All NA ----
checks["allNArow"] <- any(apply(report, 1, function(x) all(is.na(x) | x == "")))
checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x)) | x == ""))
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allNAcol check is incorrect: all(is.na(x)) | x == "" vectorizes over x and will often return TRUE if a column contains any empty string, not only when all values are missing. Use all(is.na(x) | x == "") (matching the allNArow logic).

Suggested change
checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x)) | x == ""))
checks["allNAcol"] <- any(apply(report, 2, function(x) all(is.na(x) | x == "")))

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants