ENH: Add Utilities/Maintenance/RemoteModuleIngest tooling#6098
Conversation
54953f0 to
94ed775
Compare
|
The described changes sound good. I did not take a good look at new CID-related scripts. |
|
| Filename | Overview |
|---|---|
| Utilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh | Main driver script (280 lines): hardcodes branch name main when fetching from the filter-repo'd mirror (P1 – breaks ingests from master-default repos); exit code comments document codes 2 and 3 that don't match the implementation (P2). |
| Utilities/Maintenance/RemoteModuleIngest/cid-normalize.sh | Content-link converter: trap ... RETURN used outside a function has no effect, leaving temp files unclean on unexpected exits (P2); logic and fallback chain are otherwise correct. |
| Utilities/Maintenance/RemoteModuleIngest/verify-cid-access.sh | Pre-push CID gateway checker: correct flow; uses printf "$g_fmt" "$cid" with a known-safe base32 CID (SC2059 suppressed intentionally); all exit paths covered. |
| Utilities/Maintenance/RemoteModuleIngest/AGENTS.md | AI-agent guidance document: thorough decision tree, escalation triggers, and post-merge validation steps; no code issues. |
| Utilities/Maintenance/RemoteModuleIngest/README.md | Human-facing quickstart and convention overview; documentation only. |
| Utilities/Maintenance/RemoteModuleIngest/INGESTION_STRATEGY.md | Policy document capturing whitelist rationale, CID-normalization mandate, examples routing, and author-preservation decisions; documentation only. |
| Utilities/Maintenance/RemoteModuleIngest/AUDIT_DESIGN.md | Audit pipeline design and audit.json schema; documentation only. |
| Utilities/Maintenance/RemoteModuleIngest/CLEANUP_CHECKLIST.md | Post-merge style-commit safety checklist; documentation only. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([ingest-remote-module.sh]) --> B[Preflight checks\ngit-filter-repo, clean tree, ITK root]
B --> C[Step 1: git clone --mirror upstream]
C --> D[Step 2: filter-repo whitelist pass\ninclude/ src/ test/ wrapping/\nCMakeLists.txt itk-module.cmake\n--to-subdirectory-filter Modules/Group/Module]
D --> E{CTestConfig.cmake\nin result?}
E -- yes --> F[Step 3: filter-repo invert-paths\nstrip CTestConfig.cmake]
E -- no --> G
F --> G[Step 4: inventory\n.md5 / .shaNNN / .cid counts]
G --> H{--dry-run?}
H -- yes --> I([Exit 0: report only])
H -- no --> J[Step 6: collect authors\nbuild Co-authored-by trailers]
J --> K[Step 7: git merge --allow-unrelated-histories\n⚠ hardcodes branch 'main']
K --> L[Print follow-up commit checklist]
L --> M{non-.cid\nlinks remain?}
M -- no --> N([Exit 0: success])
M -- yes --> O([Exit 2: ACTION REQUIRED\nrun cid-normalize.sh])
Reviews (1): Last reviewed commit: "DOC: Add strategy + agent guidance for R..." | Re-trigger Greptile
1bc6858 to
bcc62d1
Compare
Introduces ingest-remote-module.sh, a driver script that moves an ITK remote module (Modules/Remote/<Name>.remote.cmake) into the main source tree at Modules/<Group>/<Name>/ while preserving authorship and keeping ITK's git pack small. The script implements the v3 whitelist strategy agreed on PR InsightSoftwareConsortium#6093: * filter-repo --paths restricts history to include/, src/, test/, wrapping/, CMakeLists.txt, itk-module.cmake -- everything else (Old/, examples/, docs/, paper/, .github/, pyproject.toml, CTestConfig.cmake, LICENSE, .clang-format, ...) stays in the archived upstream repo. * --to-subdirectory-filter rewrites paths under the destination. * --prune-empty always drops commits whose changes are entirely outside the whitelist. * A second pass strips CTestConfig.cmake specifically (points at a standalone CDash project that does not apply in-tree). * The resulting merge is --allow-unrelated-histories --no-ff; the commit message carries the upstream URL + tip SHA plus Co-authored-by: trailers for every upstream contributor derived from the filter-repo'd git log. README.md gives the human operator the quick-start recipe (five commits per module, one PR per module) and a --dry-run walkthrough for previewing an ingest before committing. Intended follow-ups handled by the caller, not by this script: * DOC: the in-tree README pointing at the archived upstream. * COMP: deletion of Modules/Remote/<Name>.remote.cmake. * ENH: -DModule_<Name>:BOOL=ON in pyproject.toml configure-ci. * STYLE: optional .md5/.shaNNN -> .cid content-link conversion (may be deferred to a tree-wide sweep PR following the f3899ce precedent).
Four long-form documents landing alongside ingest-remote-module.sh:
INGESTION_STRATEGY.md
Policy document. Whitelist definition, mode selection (full /
filtered / squash), attribution floor, CID-normalization pipeline,
examples-relocation policy. Codifies the PR InsightSoftwareConsortium#6093 consensus that
commit count is NOT a gate -- only size metrics are -- so modules
with hundreds of genuine upstream commits can land in full-history
mode as long as pack-delta and blob-size thresholds are met.
AUDIT_DESIGN.md
Design notes for the pre-ingest audit pass: blob-size histogram,
strip-candidate path detection (paths present in pre-tip history
but absent in tip), copyright-review flag for PDFs / videos /
large images, recommend_mode() pseudocode.
CLEANUP_CHECKLIST.md
What the post-merge STYLE commit checks (now a safety-net since
the whitelist handles the common case at graft time). Still used
for copyright review and for Mode B residual-blob stripping.
AGENTS.md
Guidance for AI coding agents running this workflow. Pre-flight
gates, decision points (non-Apache license, raw binary test assets,
non-whitelisted paths the module needs, CID-normalization gap,
examples/ routing), escalation triggers for handing back to the
human. Explicit "don't do these things" section covers common
pitfalls (re-squash-silently, widen-whitelist-without-documenting,
force-push-ingest-PR).
These documents were developed and iterated on across PR InsightSoftwareConsortium#6061,
InsightSoftwareConsortium#6085, InsightSoftwareConsortium#6086, and especially InsightSoftwareConsortium#6093 (the thread that produced the v3
whitelist + CID-normalization approach). Landing them in-tree under
Utilities/Maintenance/RemoteModuleIngest/ makes future changes to the
strategy reviewable via standard PR process rather than through PR
comment updates on long-running threads.
cid-normalize.sh: require a CID backend (w3, ipfs, or Python multiformats) before fetching, with install guidance; fix mirror URLs to use uppercase algorithm names (MD5, SHA512, ...) as required by ITKExternalData.cmake and prioritize the GitHub Pages mirror, which is the most reliable endpoint now that data.kitware.com returns 400 for the hashsum path. verify-cid-access.sh: parallelize gateway probes via xargs -P (default 16), reorder gateways to try the pinned itk.mypinata.cloud first and the slow CID-subdomain form last, switch from HEAD to 1-byte ranged GET for better gateway compatibility, and add a permanent success cache under XDG_CACHE_HOME since CIDs are immutable.
bcc62d1 to
febe0a7
Compare
Separates the remote-module ingestion tooling from any specific module ingest (see #6093 for the first concrete use case). This PR adds a new directory
Utilities/Maintenance/RemoteModuleIngest/containing the driver script, strategy documents, and AI-agent guidance. Relates to #6060.Directory layout
What the script does
git clone --no-local --mirror).git filter-repo --paths include/ src/ test/ wrapping/ CMakeLists.txt itk-module.cmake --to-subdirectory-filter Modules/<Group>/<Module> --prune-empty always.CTestConfig.cmake(always points at a standalone CDash project, never applies in-tree)..md5/.shaNNN/.cid) and warns if non-.cidlinks remain.--allow-unrelated-histories --no-ff, auto-populatingCo-authored-by:trailers from the upstream log.--dry-run --keep-tempdirlets reviewers preview the post-filter tree before committing anything.Consensus decisions captured in the strategy doc
.md5/.shaNNNcontent-link converts to.cidbefore merge. The tree-wide sweep PR path is preferred over per-module conversions, following the precedent at ITK commitf3899ce8c6.examples/directories are NOT ingested inline; they either stay in the archived upstream (default) or relocate toExamples/via a separate follow-up PR.Old/,paper/,docs/figures/, demos, and CI scaffolding can never enter ITK's pack because they are outside the whitelist by construction.Co-authored-by:trailers to the merge commit body.How this PR relates to #6093
Reviewers can land this PR first (it's small, self-contained, touches only
Utilities/Maintenance/), then the follow-on module ingest PRs can referenceUtilities/Maintenance/RemoteModuleIngest/ingest-remote-module.sh <Module> <Group>as the reproducible command that produced the merge commit.