Enhanced dedup #259
Merged
Enable an auto-dedup-now, review-later workflow so users can pause after automatic deduplication, export, and complete manual review on re-import.

Package:
- export_dedup_candidates() / reimport_dedup_candidates(): persist and restore the $manual_dedup candidate pairs across an export/reimport boundary (IDs kept as character for re-merging).
- export_csv() gains a manual_dedup_complete flag (written as a column on full exports; read back by reimport_csv()) as a UX guard.
- reimport_csv() now reads all columns as character, matching the canonical all-character types from dedup_citations(). Required so a reimported set can re-enter dedup_citations_add_manual() without column-type clashes (read.csv otherwise infers integer ids/years).
- Tests in test-reimport.R cover the round-trip and merge.

Shiny app:
- file_reimport observer now handles multiple files and routes by content (candidate-pairs CSV vs deduplicated citation set vs RIS), fixing a latent length>1 condition error on multi-file selection.
- Restoring candidate pairs repopulates the Manual deduplication tab on a reimported set; the result column is dropped so merges follow the user's row selection.
- Export tab: Candidate Pairs (CSV) download; CSV export sets the manual_dedup_complete flag based on whether pairs remain pending.
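The type clash that the all-character reimport avoids can be reproduced with base R alone. The following is a minimal sketch, not the package's `reimport_csv()` implementation: the column names and values are made up for illustration, and `colClasses = "character"` is simply one way to get an all-character read.

```r
# Write a tiny CSV whose ids and years look numeric (illustrative data only).
tmp <- tempfile(fileext = ".csv")
original <- data.frame(
  duplicate_id = c("1", "2"),
  year = c("2019", "2021"),
  title = c("First title", "Second title"),
  stringsAsFactors = FALSE
)
write.csv(original, tmp, row.names = FALSE)

# Default read.csv() infers types, so ids and years come back as integers.
inferred <- read.csv(tmp, stringsAsFactors = FALSE)

# Forcing every column to character keeps types stable across the round trip.
as_character <- read.csv(tmp, colClasses = "character", stringsAsFactors = FALSE)

inferred_types <- vapply(inferred, class, character(1))
character_types <- vapply(as_character, class, character(1))
print(inferred_types)
print(character_types)
```

With inferred types, binding the reimported rows to an all-character data frame can fail or silently coerce, which is the clash the commit message describes.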
dedup_citations_add_sources(existing, new_raw) adds new raw citations to a previously deduplicated set and re-deduplicates across both, preserving prior auto/manual merge decisions and the original record_ids provenance. For the same data it produces the same unique set as deduplicating everything from scratch (validated on the gambling-harms vignette data: 163 existing + 431 new -> 278 unique, == from-scratch; 645 underlying record_ids preserved).

Implementation reconciles IDs (existing duplicate_id -> record_id; new records get fresh non-colliding ids based on the max underlying id), drops duplicate_id/record_ids so the engine's format_rerun rename can't clash on record_id, re-runs dedup_citations(), then expands record_ids back to the original underlying IDs via a provenance lookup. Works in manual = TRUE mode to surface new candidate pairs.

Shiny app:
- file_reimport sets rv$existing_dedup_present when a deduplicated set is re-imported.
- identify_dups: with an existing re-imported set present, "Find duplicates" merges new uploads in via dedup_citations_add_sources(); otherwise deduplicates the uploads as before. Uploads (and the upload form) are cleared after a merge to prevent adding the same records twice.
- Deduplicate tab hint describes the add-sources flow.

Tests in test-add-sources.R.
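The ID reconciliation and provenance expansion described above can be sketched in base R. This is an illustration under stated assumptions, not the code in R/dedup.R: it assumes `record_ids` is stored as a comma-separated string, and the merge result in `rerun_record_ids` is invented rather than produced by `dedup_citations()`.

```r
# Hypothetical prior result: each unique record lists the raw ids it merged.
existing <- data.frame(
  duplicate_id = c("1", "3"),
  record_ids = c("1, 2", "3"),
  stringsAsFactors = FALSE
)

# Provenance lookup: working id -> original underlying ids.
provenance <- setNames(
  strsplit(existing$record_ids, ",\\s*"),
  existing$duplicate_id
)

# New raw records get ids above the largest underlying id, so none collide.
max_id <- max(as.numeric(unlist(provenance)))
new_ids <- as.character(max_id + seq_len(2))
provenance[new_ids] <- as.list(new_ids)

# Assumed outcome of the re-run: working id "3" merged with new record "4".
rerun_record_ids <- list(c("1"), c("3", "4"), c("5"))

# Expand each working id back to the underlying ids it stands for.
expanded <- vapply(
  rerun_record_ids,
  function(ids) paste(unlist(provenance[ids], use.names = FALSE), collapse = ", "),
  character(1)
)
print(expanded)
```

Basing the new ids on the maximum underlying id, rather than on the maximum `duplicate_id`, matters because a merged record's `duplicate_id` can be smaller than some of the raw ids it absorbed.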
On the File upload tab, re-importing a previously deduplicated/exported set now renders a view-only summary card listing per-source (and label/string) record counts, so users can see what is already in the set before adding more references. Tokens are de-duplicated within each record, so each unique record counts once per distinct source/label/string. The card also notes the total record count and whether manual deduplication was marked complete. It is kept separate from the new-uploads metadata form and does not allow editing source/label/string.
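The counting rule for the summary card (each unique record counts once per distinct source) can be sketched as follows. The `cite_source` column name, the comma separator, and the sample values are assumptions for illustration; the app's actual rendering code is not shown in this PR text.

```r
# Hypothetical re-imported set with comma-separated source tokens per record.
records <- data.frame(
  duplicate_id = c("1", "2", "3"),
  cite_source = c("WoS, Scopus, WoS", "Scopus", "WoS, Embase"),
  stringsAsFactors = FALSE
)

# Split tokens, trim whitespace, and de-duplicate within each record.
tokens_per_record <- lapply(
  strsplit(records$cite_source, ",", fixed = TRUE),
  function(tokens) unique(trimws(tokens))
)

# Count how many unique records mention each source.
source_counts <- table(unlist(tokens_per_record))
print(source_counts)
```

Without the per-record `unique()`, the first record would contribute two counts to WoS and the per-source totals could exceed the number of records.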
…ad page

User Guide (in-app www/user_guide.md):
- Step 1: re-import section rewritten; re-importing is no longer a dead end. Describes the read-only source-overview card and the three paths (continue to analysis, add new sources, or finish manual review by re-uploading candidate pairs). Adds a "growing a review over time" note.
- Step 2: note that Find duplicates merges new uploads into a re-imported set.
- Step 3: how to pause and finish manual review later via candidate-pairs export.
- Step 6: document Dedup Log and Candidate Pairs downloads and the manual-dedup-complete flag in the full CSV.

File upload page (app.R sidebar): clearer labels and helper text distinguishing "upload new files to deduplicate" from "re-upload a CiteSource export" (and that the two can be combined to add sources or resume manual review).

README: note incremental add-sources and deferred manual review on re-import.
- DESCRIPTION: Version 0.2.0 -> 0.2.1, Date 2026-06-01.
- NEWS.md: add 0.2.1 section covering incremental deduplication (dedup_citations_add_sources), deferred manual deduplication (export/reimport_dedup_candidates + manual_dedup_complete flag), the Shiny re-import source overview and multi-file content-routed re-upload, the all-character reimport_csv fix, and the doc updates. These were moved out of the released 0.2.0 section.
- CITATION.cff: version 0.2.1, date-released 2026-06-01.
Pull request overview
This PR introduces incremental deduplication and a deferred manual-review workflow, allowing users to (1) add new sources to an already-deduplicated set and (2) export/import manual candidate pairs to finish review later. It updates the R API, adds tests, and wires the workflows into the Shiny app plus documentation/release metadata.
Changes:
- Add `dedup_citations_add_sources()` for incremental deduplication while preserving `record_ids` provenance.
- Add candidate-pair export/import helpers (`export_dedup_candidates()` / `reimport_dedup_candidates()`) and make `reimport_csv()` read all columns as character for type-stable round-trips.
- Update Shiny app UX and exports to support re-uploading deduped sets, restoring candidate pairs, and exporting a `manual_dedup_complete` flag.
Reviewed changes
Copilot reviewed 12 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| R/dedup.R | Adds dedup_citations_add_sources() incremental dedup + provenance restoration. |
| R/export.R | Adds manual_dedup_complete export flag + export_dedup_candidates(). |
| R/reimport.R | Forces all-character CSV reimport + adds reimport_dedup_candidates(). |
| inst/shiny-app/CiteSource/app.R | Supports multi-file re-import routing (dedup set vs candidate pairs), merge-new-sources flow, and candidate-pair download. |
| inst/shiny-app/CiteSource/www/user_guide.md | Documents incremental + deferred manual-review workflows in the app. |
| README.md | Documents incremental + deferred manual-review workflows in the package README. |
| tests/testthat/test-add-sources.R | Adds automated tests for incremental deduplication behavior. |
| tests/testthat/test-reimport.R | Adds round-trip tests for CSV reimport types, flags, and candidate-pair workflows. |
| NAMESPACE | Exports newly added public functions. |
| man/dedup_citations_add_sources.Rd | Generated docs for dedup_citations_add_sources(). |
| man/export_csv.Rd | Generated docs for new export_csv() parameter. |
| man/export_dedup_candidates.Rd | Generated docs for export_dedup_candidates(). |
| man/reimport_dedup_candidates.Rd | Generated docs for reimport_dedup_candidates(). |
| NEWS.md | Adds 0.2.1 release notes covering new workflows and fixes. |
| DESCRIPTION | Bumps package version/date to 0.2.1. |
| CITATION.cff | Bumps citation version/date-released to 0.2.1. |
Files not reviewed (4)
- man/dedup_citations_add_sources.Rd: Language not supported
- man/export_csv.Rd: Language not supported
- man/export_dedup_candidates.Rd: Language not supported
- man/reimport_dedup_candidates.Rd: Language not supported
Comment on lines +245 to +253

    max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))

    nw <- dplyr::mutate(new_citations, dplyr::across(dplyr::everything(), as.character))
    nw <- dplyr::select(nw, -dplyr::any_of(c("duplicate_id", "record_ids", "record_id")))
    nw$record_id <- if (is.finite(max_id)) {
      as.character(max_id + seq_len(nrow(nw)))
    } else {
      paste0("new_", seq_len(nrow(nw)))
    }
Reply from the PR author (Member):
record_ids would never be non-numeric
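For readers following this thread, the two branches of the id assignment can be exercised in isolation. The sketch below wraps the quoted logic in a helper whose name, `assign_new_ids()`, is hypothetical; the sample ids are made up. It only shows when the `new_` fallback would trigger, which per the author's reply is not expected with real `record_ids`.

```r
# Hypothetical helper mirroring the id-assignment logic quoted in the review.
assign_new_ids <- function(existing_ids, n_new) {
  # max() over an all-NA or empty vector yields -Inf (warnings suppressed).
  max_id <- suppressWarnings(max(as.numeric(existing_ids), na.rm = TRUE))
  if (is.finite(max_id)) {
    as.character(max_id + seq_len(n_new))
  } else {
    paste0("new_", seq_len(n_new))
  }
}

# Numeric ids: new ids continue after the current maximum.
numeric_case <- assign_new_ids(c("7", "12", "3"), 2)   # "13" "14"

# Non-numeric ids: nothing parses, so the fallback prefix is used.
fallback_case <- assign_new_ids(c("a", "b"), 2)        # "new_1" "new_2"

print(numeric_case)
print(fallback_case)
```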
R CMD check --as-cran is now 0 errors | 0 warnings | 1 note (the note is the expected "New submission" feasibility notice plus transient URL-check resets).
- R/dedup.R: replace non-ASCII em-dashes (incl. one in a stop() string) with ASCII hyphens; regenerate affected .Rd files. Clears the "non-ASCII characters in R code" WARNING.
- .Rbuildignore: exclude CLAUDE.md, guide/, and .tmp* so dev/session files are not bundled into the build tarball. Clears the "non-standard top-level files" NOTE.
- cran-comments.md: rewritten for the 0.2.1 feature update.

Note: networkD3 remains in Suggests but is unused anywhere in the package (harmless on CRAN; flagged for optional removal).
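The non-ASCII cleanup can be illustrated with a short base R check. This is a sketch, not the procedure used in the commit: the two `stop()` lines are invented examples, and the em-dash is written as the escape `\u2014` so that the snippet itself stays ASCII.

```r
# Two illustrative source lines; the second contains an em-dash (U+2014).
lines <- c(
  "stop(\"ids must be unique - see docs\")",
  "stop(\"ids must be unique \u2014 see docs\")"
)

# Flag any line containing a code point outside the ASCII range.
has_non_ascii <- vapply(
  lines,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Replace the em-dash with an ASCII hyphen.
fixed <- gsub("\u2014", "-", lines, fixed = TRUE)

print(has_non_ascii)
print(fixed)
```

Base R also ships `tools::showNonASCIIfile()` for scanning a whole file, which may be more convenient than a hand-rolled check.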
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
See 0.2.1 news.md