Skip to content

ENH: Add Utilities/Maintenance/RemoteModuleIngest tooling#6098

Merged
hjmjohnson merged 3 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson:remote-module-ingest-tooling
Apr 22, 2026
Merged

ENH: Add Utilities/Maintenance/RemoteModuleIngest tooling#6098
hjmjohnson merged 3 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson:remote-module-ingest-tooling

Conversation

@hjmjohnson
Copy link
Copy Markdown
Member

Separates the remote-module ingestion tooling from any specific module ingest (see #6093 for the first concrete use case). This PR adds a new directory Utilities/Maintenance/RemoteModuleIngest/ containing the driver script, strategy documents, and AI-agent guidance. Relates to #6060.

Directory layout
Utilities/Maintenance/RemoteModuleIngest/
├── ingest-remote-module.sh   # v3 whitelist-based driver (executable, 280 lines)
├── README.md                 # Human-facing quickstart + convention
├── AGENTS.md                 # AI-agent pre-flight + decision tree + escalation triggers
├── INGESTION_STRATEGY.md     # Policy: whitelist, modes, thresholds, CID normalization, examples policy
├── AUDIT_DESIGN.md           # Pre-ingest audit pipeline + recommend_mode() pseudocode
└── CLEANUP_CHECKLIST.md      # Safety-net list for the post-merge STYLE commit
What the script does
  1. Clones upstream (git clone --no-local --mirror).
  2. Runs git filter-repo --paths include/ src/ test/ wrapping/ CMakeLists.txt itk-module.cmake --to-subdirectory-filter Modules/<Group>/<Module> --prune-empty always.
  3. A second filter-repo pass strips CTestConfig.cmake (always points at a standalone CDash project, never applies in-tree).
  4. Reports a content-link inventory (.md5 / .shaNNN / .cid) and warns if non-.cid links remain.
  5. In non-dry-run mode: merges the filter-repo'd upstream into the current ITK branch with --allow-unrelated-histories --no-ff, auto-populating Co-authored-by: trailers from the upstream log.
  6. Prints the recommended follow-on commits (README, remote.cmake delete, CI opt-in, optional CID norm).

--dry-run --keep-tempdir lets reviewers preview the post-filter tree before committing anything.

Consensus decisions captured in the strategy doc
How this PR relates to #6093

Reviewers can land this PR first (it's small, self-contained, touches only Utilities/Maintenance/), then the follow-on module ingest PRs can reference Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh <Module> <Group> as the reproducible command that produced the merge commit.

@github-actions github-actions Bot added type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Enhancement Improvement of existing methods or implementation labels Apr 22, 2026
Copy link
Copy Markdown
Member

@dzenanz dzenanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on a glance.

Comment thread Utilities/Maintenance/RemoteModuleIngest/AGENTS.md Outdated
Comment thread Utilities/Maintenance/RemoteModuleIngest/README.md
@hjmjohnson hjmjohnson force-pushed the remote-module-ingest-tooling branch from 54953f0 to 94ed775 Compare April 22, 2026 15:51
@dzenanz
Copy link
Copy Markdown
Member

dzenanz commented Apr 22, 2026

The described changes sound good. I did not take a good look at new CID-related scripts.

@hjmjohnson hjmjohnson marked this pull request as ready for review April 22, 2026 16:27
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR adds Utilities/Maintenance/RemoteModuleIngest/ — a self-contained tooling directory with a whitelist-based git filter-repo driver script and companion strategy/agent guidance documents for reproducibly ingesting ITK remote modules into the main source tree.

  • P1 — ingest-remote-module.sh line 273: git fetch ingest-src-tmp main:ingest-src-tmp-main hardcodes the branch name main; many older remote-module repos still default to master, causing the fetch (and therefore the entire ingest) to fail silently under "Merge failed" even when the filter-repo pass succeeds.
  • P2 — exit code table (ingest-remote-module.sh lines 43–48): documented codes 2 = filter-repo failure and 3 = merge failure don't match the implementation, where both error paths call die (exit 1) and code 2 is actually the CID-normalization gate.
  • P2 — cid-normalize.sh line 190: trap ... RETURN has no effect outside a shell function body; temp file cleanup on unexpected exits is unreliable.

Confidence Score: 4/5

The P1 hardcoded main branch will break ingests from any master-default upstream repo; fix before merging to avoid silent failures in the first real use case.

One real P1 defect (hardcoded branch name) that will break the script for any upstream using a non-main default branch. All other findings are P2. The documentation and strategy files are thorough and well-structured. Score reflects a single blocking issue that is easy to fix.

Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh — hardcoded main branch name (line 273) and exit code table mismatch.

Important Files Changed

Filename Overview
Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh Main driver script (280 lines): hardcodes branch name main when fetching from the filter-repo'd mirror (P1 – breaks ingests from master-default repos); exit code comments document codes 2 and 3 that don't match the implementation (P2).
Utilities/Maintenance/RemoteModuleIngest/cid-normalize.sh Content-link converter: trap ... RETURN used outside a function has no effect, leaving temp files unclean on unexpected exits (P2); logic and fallback chain are otherwise correct.
Utilities/Maintenance/RemoteModuleIngest/verify-cid-access.sh Pre-push CID gateway checker: correct flow; uses printf "$g_fmt" "$cid" with a known-safe base32 CID (SC2059 suppressed intentionally); all exit paths covered.
Utilities/Maintenance/RemoteModuleIngest/AGENTS.md AI-agent guidance document: thorough decision tree, escalation triggers, and post-merge validation steps; no code issues.
Utilities/Maintenance/RemoteModuleIngest/README.md Human-facing quickstart and convention overview; documentation only.
Utilities/Maintenance/RemoteModuleIngest/INGESTION_STRATEGY.md Policy document capturing whitelist rationale, CID-normalization mandate, examples routing, and author-preservation decisions; documentation only.
Utilities/Maintenance/RemoteModuleIngest/AUDIT_DESIGN.md Audit pipeline design and audit.json schema; documentation only.
Utilities/Maintenance/RemoteModuleIngest/CLEANUP_CHECKLIST.md Post-merge style-commit safety checklist; documentation only.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([ingest-remote-module.sh]) --> B[Preflight checks\ngit-filter-repo, clean tree, ITK root]
    B --> C[Step 1: git clone --mirror upstream]
    C --> D[Step 2: filter-repo whitelist pass\ninclude/ src/ test/ wrapping/\nCMakeLists.txt itk-module.cmake\n--to-subdirectory-filter Modules/Group/Module]
    D --> E{CTestConfig.cmake\nin result?}
    E -- yes --> F[Step 3: filter-repo invert-paths\nstrip CTestConfig.cmake]
    E -- no --> G
    F --> G[Step 4: inventory\n.md5 / .shaNNN / .cid counts]
    G --> H{--dry-run?}
    H -- yes --> I([Exit 0: report only])
    H -- no --> J[Step 6: collect authors\nbuild Co-authored-by trailers]
    J --> K[Step 7: git merge --allow-unrelated-histories\n⚠ hardcodes branch 'main']
    K --> L[Print follow-up commit checklist]
    L --> M{non-.cid\nlinks remain?}
    M -- no --> N([Exit 0: success])
    M -- yes --> O([Exit 2: ACTION REQUIRED\nrun cid-normalize.sh])
Loading

Reviews (1): Last reviewed commit: "DOC: Add strategy + agent guidance for R..." | Re-trigger Greptile

Comment thread Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh Outdated
Comment thread Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh
Comment thread Utilities/Maintenance/RemoteModuleIngest/cid-normalize.sh Outdated
Introduces ingest-remote-module.sh, a driver script that moves an ITK
remote module (Modules/Remote/<Name>.remote.cmake) into the main
source tree at Modules/<Group>/<Name>/ while preserving authorship
and keeping ITK's git pack small.

The script implements the v3 whitelist strategy agreed on PR InsightSoftwareConsortium#6093:

  * filter-repo --paths restricts history to include/, src/, test/,
    wrapping/, CMakeLists.txt, itk-module.cmake -- everything else
    (Old/, examples/, docs/, paper/, .github/, pyproject.toml,
    CTestConfig.cmake, LICENSE, .clang-format, ...) stays in the
    archived upstream repo.
  * --to-subdirectory-filter rewrites paths under the destination.
  * --prune-empty always drops commits whose changes are entirely
    outside the whitelist.
  * A second pass strips CTestConfig.cmake specifically (points at
    a standalone CDash project that does not apply in-tree).
  * The resulting merge is --allow-unrelated-histories --no-ff; the
    commit message carries the upstream URL + tip SHA plus
    Co-authored-by: trailers for every upstream contributor derived
    from the filter-repo'd git log.

README.md gives the human operator the quick-start recipe (five
commits per module, one PR per module) and a --dry-run walkthrough
for previewing an ingest before committing.

Intended follow-ups handled by the caller, not by this script:

  * DOC: the in-tree README pointing at the archived upstream.
  * COMP: deletion of Modules/Remote/<Name>.remote.cmake.
  * ENH: -DModule_<Name>:BOOL=ON in pyproject.toml configure-ci.
  * STYLE: optional .md5/.shaNNN -> .cid content-link conversion
    (may be deferred to a tree-wide sweep PR following the
    f3899ce precedent).
Four long-form documents landing alongside ingest-remote-module.sh:

  INGESTION_STRATEGY.md
    Policy document.  Whitelist definition, mode selection (full /
    filtered / squash), attribution floor, CID-normalization pipeline,
    examples-relocation policy.  Codifies the PR InsightSoftwareConsortium#6093 consensus that
    commit count is NOT a gate -- only size metrics are -- so modules
    with hundreds of genuine upstream commits can land in full-history
    mode as long as pack-delta and blob-size thresholds are met.

  AUDIT_DESIGN.md
    Design notes for the pre-ingest audit pass: blob-size histogram,
    strip-candidate path detection (paths present in pre-tip history
    but absent in tip), copyright-review flag for PDFs / videos /
    large images, recommend_mode() pseudocode.

  CLEANUP_CHECKLIST.md
    What the post-merge STYLE commit checks (now a safety-net since
    the whitelist handles the common case at graft time).  Still used
    for copyright review and for Mode B residual-blob stripping.

  AGENTS.md
    Guidance for AI coding agents running this workflow.  Pre-flight
    gates, decision points (non-Apache license, raw binary test assets,
    non-whitelisted paths the module needs, CID-normalization gap,
    examples/ routing), escalation triggers for handing back to the
    human.  Explicit "don't do these things" section covers common
    pitfalls (re-squash-silently, widen-whitelist-without-documenting,
    force-push-ingest-PR).

These documents were developed and iterated on across PR InsightSoftwareConsortium#6061,
InsightSoftwareConsortium#6085, InsightSoftwareConsortium#6086, and especially InsightSoftwareConsortium#6093 (the thread that produced the v3
whitelist + CID-normalization approach).  Landing them in-tree under
Utilities/Maintenance/RemoteModuleIngest/ makes future changes to the
strategy reviewable via standard PR process rather than through PR
comment updates on long-running threads.
cid-normalize.sh: require a CID backend (w3, ipfs, or Python
multiformats) before fetching, with install guidance; fix mirror URLs
to use uppercase algorithm names (MD5, SHA512, ...) as required by
ITKExternalData.cmake and prioritize the GitHub Pages mirror, which is
the most reliable endpoint now that data.kitware.com returns 400 for
the hashsum path.

verify-cid-access.sh: parallelize gateway probes via xargs -P (default
16), reorder gateways to try the pinned itk.mypinata.cloud first and
the slow CID-subdomain form last, switch from HEAD to 1-byte ranged
GET for better gateway compatibility, and add a permanent success
cache under XDG_CACHE_HOME since CIDs are immutable.
@hjmjohnson hjmjohnson force-pushed the remote-module-ingest-tooling branch from bcc62d1 to febe0a7 Compare April 22, 2026 21:24
@hjmjohnson hjmjohnson merged commit 92cd95b into InsightSoftwareConsortium:main Apr 22, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:Enhancement Improvement of existing methods or implementation type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants