Enhanced dedup by TNRiley · Pull Request #259 · ESHackathon/CiteSource

TNRiley · 2026-06-01T15:12:16Z

See the 0.2.1 section of NEWS.md.

Enable an auto-dedup-now, review-later workflow so users can pause after automatic deduplication, export, and complete manual review on re-import.

Package:
- export_dedup_candidates() / reimport_dedup_candidates(): persist and restore the $manual_dedup candidate pairs across an export/reimport boundary (IDs kept as character for re-merging).
- export_csv() gains a manual_dedup_complete flag (written as a column on full exports; read back by reimport_csv()) as a UX guard.
- reimport_csv() now reads all columns as character, matching the canonical all-character types from dedup_citations(). This is required so a reimported set can re-enter dedup_citations_add_manual() without column-type clashes (read.csv otherwise infers integer ids/years).
- Tests in test-reimport.R cover the round-trip and merge.

Shiny app:
- The file_reimport observer now handles multiple files and routes by content (candidate-pairs CSV vs deduplicated citation set vs RIS), fixing a latent length>1 condition error on multi-file selection.
- Restoring candidate pairs repopulates the Manual deduplication tab on a reimported set; the result column is dropped so merges follow the user's row selection.
- Export tab: Candidate Pairs (CSV) download; CSV export sets the manual_dedup_complete flag based on whether pairs remain pending.
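To illustrate the type issue behind the reimport_csv() change, here is a minimal base R sketch. It is not the package's implementation: the file contents and column names are made up for the example, and it only shows how `read.csv()` infers integer columns unless `colClasses = "character"` is passed.

```r
# Hypothetical stand-in for an exported citation set (column names are assumptions).
path <- tempfile(fileext = ".csv")
writeLines(
  c("duplicate_id,year,title",
    "1,2019,Alpha",
    "2,2020,Beta",
    "10,2021,Gamma"),
  path
)

# Default behaviour: read.csv() guesses types, so ids and years become integer.
inferred <- read.csv(path, stringsAsFactors = FALSE)

# Forcing every column to character keeps types stable across the round-trip.
as_text <- read.csv(path, colClasses = "character")

print(sapply(inferred, class))
print(sapply(as_text, class))
```

With all columns read as character, a reimported set should line up with other all-character data frames when rows are combined by functions that are strict about column types.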

dedup_citations_add_sources(existing, new_raw) adds new raw citations to a previously deduplicated set and re-deduplicates across both, preserving prior auto/manual merge decisions and the original record_ids provenance. For the same data it produces the same unique set as deduplicating everything from scratch (validated on the gambling-harms vignette data: 163 existing + 431 new -> 278 unique, == from-scratch; 645 underlying record_ids preserved).

Implementation:
- Reconciles IDs (existing duplicate_id -> record_id; new records get fresh non-colliding ids based on the max underlying id).
- Drops duplicate_id/record_ids so the engine's format_rerun rename can't clash on record_id.
- Re-runs dedup_citations().
- Expands record_ids back to the original underlying IDs via a provenance lookup.
- Works in manual = TRUE mode to surface new candidate pairs.

Shiny app:
- file_reimport sets rv$existing_dedup_present when a deduplicated set is re-imported.
- identify_dups: with an existing re-imported set present, "Find duplicates" merges new uploads in via dedup_citations_add_sources(); otherwise it deduplicates the uploads as before. Uploads (and the upload form) are cleared after a merge to prevent adding the same records twice.
- The Deduplicate tab hint describes the add-sources flow.

Tests in test-add-sources.R.
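The ID reconciliation and provenance expansion described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code in R/dedup.R: the `", "` separator inside record_ids, the toy data, the helper `expand_ids()`, and the hard-coded "rerun" result are all hypothetical.

```r
# Existing deduplicated set: one row per unique record; record_ids lists the
# underlying raw ids that were merged (the ", " separator is an assumption).
existing <- data.frame(
  duplicate_id = c("1", "3"),
  record_ids   = c("1, 2", "3"),
  stringsAsFactors = FALSE
)
new_raw <- data.frame(title = c("X", "Y"), stringsAsFactors = FALSE)

# Fresh ids for new records start after the largest underlying id.
underlying <- unlist(strsplit(existing$record_ids, ",\\s*"))
max_id <- max(as.numeric(underlying))
new_raw$record_id <- as.character(max_id + seq_len(nrow(new_raw)))

# Provenance lookup: id used during the rerun -> original underlying ids.
provenance <- c(
  setNames(existing$record_ids, existing$duplicate_id),
  setNames(new_raw$record_id, new_raw$record_id)
)

# Pretend the rerun merged existing record "1" with new record "4".
rerun_record_ids <- c("1, 4", "3", "5")

# Hypothetical helper: replace each rerun id with its underlying ids.
expand_ids <- function(x, lookup) {
  parts <- strsplit(x, ",\\s*")[[1]]
  paste(unlist(strsplit(unname(lookup[parts]), ",\\s*")), collapse = ", ")
}

expanded <- vapply(rerun_record_ids, expand_ids, character(1),
                   lookup = provenance, USE.NAMES = FALSE)
print(expanded)
```

In this toy case the merged row ends up listing the two original underlying ids plus the new one, which is the provenance property the commit message describes.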

On the File upload tab, re-importing a previously deduplicated/exported set now renders a view-only summary card listing per-source (and label/string) record counts, so users can see what is already in the set before adding more references. Tokens are de-duplicated within each record, so each unique record counts once per distinct source/label/string. The card also notes the total record count and whether manual deduplication was marked complete. It is kept separate from the new-uploads metadata form and does not allow editing source/label/string.
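As a rough sketch of the counting rule above (each record counts once per distinct source), the following base R snippet de-duplicates tokens within each record before tallying. The `cite_source` values and the `", "` separator are assumptions made for the example; the app's actual code may differ.

```r
# Hypothetical source strings, one per unique record; note the repeated
# "Scopus" token in the first record.
cite_source <- c("Scopus, WoS, Scopus", "WoS", "Scopus, PubMed")

# Split each record into tokens, trim whitespace, and drop repeats within
# the record so a record contributes at most once per source.
tokens_per_record <- lapply(
  strsplit(cite_source, ",\\s*"),
  function(x) unique(trimws(x))
)

# Tally how many records mention each source.
counts <- table(unlist(tokens_per_record))
print(counts)
```

Without the `unique()` step the first record would be counted twice for Scopus, inflating the per-source totals shown on the card.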

…ad page

User Guide (in-app www/user_guide.md):
- Step 1: re-import section rewritten. Re-importing is no longer a dead end; the section describes the read-only source-overview card and the three paths (continue to analysis, add new sources, or finish manual review by re-uploading candidate pairs). Adds a "growing a review over time" note.
- Step 2: note that Find duplicates merges new uploads into a re-imported set.
- Step 3: how to pause and finish manual review later via candidate-pairs export.
- Step 6: document Dedup Log and Candidate Pairs downloads and the manual-dedup-complete flag in the full CSV.

File upload page (app.R sidebar): clearer labels and helper text distinguishing "upload new files to deduplicate" from "re-upload a CiteSource export" (and that the two can be combined to add sources or resume manual review).

README: note incremental add-sources and deferred manual review on re-import.

- DESCRIPTION: Version 0.2.0 -> 0.2.1, Date 2026-06-01.
- NEWS.md: add 0.2.1 section covering incremental deduplication (dedup_citations_add_sources), deferred manual deduplication (export/reimport_dedup_candidates + manual_dedup_complete flag), the Shiny re-import source overview and multi-file content-routed re-upload, the all-character reimport_csv fix, and the doc updates. These were moved out of the released 0.2.0 section.
- CITATION.cff: version 0.2.1, date-released 2026-06-01.

Copilot

Pull request overview

This PR introduces incremental deduplication and a deferred manual-review workflow, allowing users to (1) add new sources to an already-deduplicated set and (2) export/import manual candidate pairs to finish review later. It updates the R API, adds tests, and wires the workflows into the Shiny app plus documentation/release metadata.

Changes:

- Add dedup_citations_add_sources() for incremental deduplication while preserving record_ids provenance.
- Add candidate-pair export/import helpers (export_dedup_candidates() / reimport_dedup_candidates()) and make reimport_csv() read all columns as character for type-stable round-trips.
- Update Shiny app UX and exports to support re-uploading deduped sets, restoring candidate pairs, and exporting a manual_dedup_complete flag.

Reviewed changes

Copilot reviewed 12 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `R/dedup.R` | Adds `dedup_citations_add_sources()` incremental dedup + provenance restoration. |
| `R/export.R` | Adds `manual_dedup_complete` export flag + `export_dedup_candidates()`. |
| `R/reimport.R` | Forces all-character CSV reimport + adds `reimport_dedup_candidates()`. |
| `inst/shiny-app/CiteSource/app.R` | Supports multi-file re-import routing (dedup set vs candidate pairs), merge-new-sources flow, and candidate-pair download. |
| `inst/shiny-app/CiteSource/www/user_guide.md` | Documents incremental + deferred manual-review workflows in the app. |
| `README.md` | Documents incremental + deferred manual-review workflows in the package README. |
| `tests/testthat/test-add-sources.R` | Adds automated tests for incremental deduplication behavior. |
| `tests/testthat/test-reimport.R` | Adds round-trip tests for CSV reimport types, flags, and candidate-pair workflows. |
| `NAMESPACE` | Exports newly added public functions. |
| `man/dedup_citations_add_sources.Rd` | Generated docs for `dedup_citations_add_sources()`. |
| `man/export_csv.Rd` | Generated docs for new `export_csv()` parameter. |
| `man/export_dedup_candidates.Rd` | Generated docs for `export_dedup_candidates()`. |
| `man/reimport_dedup_candidates.Rd` | Generated docs for `reimport_dedup_candidates()`. |
| `NEWS.md` | Adds 0.2.1 release notes covering new workflows and fixes. |
| `DESCRIPTION` | Bumps package version/date to 0.2.1. |
| `CITATION.cff` | Bumps citation version/date-released to 0.2.1. |

Files not reviewed (4)

man/dedup_citations_add_sources.Rd: Language not supported
man/export_csv.Rd: Language not supported
man/export_dedup_candidates.Rd: Language not supported
man/reimport_dedup_candidates.Rd: Language not supported

TNRiley · 2026-06-01T16:58:43Z

+  max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))
+
+  nw <- dplyr::mutate(new_citations, dplyr::across(dplyr::everything(), as.character))
+  nw <- dplyr::select(nw, -dplyr::any_of(c("duplicate_id", "record_ids", "record_id")))
+  nw$record_id <- if (is.finite(max_id)) {
+    as.character(max_id + seq_len(nrow(nw)))
+  } else {
+    paste0("new_", seq_len(nrow(nw)))
+  }


record_ids would never be non-numeric
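For context on the branch being questioned, here is a small standalone sketch of the same pattern. The wrapper `next_ids()` is hypothetical and exists only for this example; it shows when the `is.finite()` check passes and when the `"new_"` fallback would fire. Whether the fallback is reachable with real CiteSource data is exactly what the comment above raises.

```r
# Hypothetical wrapper around the pattern in the diff: allocate n fresh ids
# after the largest numeric id, falling back to "new_<i>" labels otherwise.
next_ids <- function(existing_ids, n) {
  # max() over an empty or all-NA numeric vector returns -Inf (with a warning),
  # which is why the result is tested with is.finite().
  max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))
  if (is.finite(max_id)) {
    as.character(max_id + seq_len(n))
  } else {
    paste0("new_", seq_len(n))
  }
}

numeric_case <- next_ids(c("1", "2", "10"), 2)  # ids parse as numbers
text_case    <- next_ids(c("a", "b"), 2)        # nothing parses as a number
empty_case   <- next_ids(character(0), 2)       # no existing ids at all

print(numeric_case)
print(text_case)
print(empty_case)
```

If record_ids are always numeric, only the first branch ever runs for a non-empty set, and the fallback acts as a defensive guard rather than a code path users would hit.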

R CMD check --as-cran is now 0 errors | 0 warnings | 1 note (the note is the expected "New submission" feasibility notice plus transient URL-check resets).
- R/dedup.R: replace non-ASCII em-dashes (incl. one in a stop() string) with ASCII hyphens; regenerate affected .Rd files. Clears the "non-ASCII characters in R code" WARNING.
- .Rbuildignore: exclude CLAUDE.md, guide/, and .tmp* so dev/session files are not bundled into the build tarball. Clears the "non-standard top-level files" NOTE.
- cran-comments.md: rewritten for the 0.2.1 feature update.

Note: networkD3 remains in Suggests but is unused anywhere in the package (harmless on CRAN; flagged for optional removal).

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

TNRiley added 6 commits June 1, 2026 10:05

Merge remote-tracking branch 'origin/dev' into enhanced-dedup

1132f02

TNRiley requested a review from TRileyNOAA June 1, 2026 15:13

TRileyNOAA requested a review from Copilot June 1, 2026 15:13

Copilot started reviewing on behalf of TRileyNOAA June 1, 2026 15:13

Copilot AI reviewed Jun 1, 2026

TNRiley and others added 2 commits June 1, 2026 11:47

Potential fix for pull request finding

5aa432a

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

TNRiley merged commit 6c98cf8 into dev Jun 1, 2026
2 checks passed

TNRiley deleted the enhanced-dedup branch June 1, 2026 18:01

TNRiley merged 8 commits into dev from enhanced-dedup
