Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts#737

Open
cailmdaley wants to merge 21 commits into develop from cleanup/candide-scripts-container

@cailmdaley cailmdaley commented May 30, 2026

Update: the user-facing docs that were originally part of this PR (README front door, the candide cluster walkthrough, the MPI run docs) have been moved to the docs-rework PR #739 so this PR stays focused on code/infra. It retains the CLAUDE.md build-loop note, which documents the container build workflow these changes introduce.


What

Makes the candide example job scripts actually run ShapePipe through the published container. The container is the supported source of truth (docs/source/container.md), so the example scripts should use it rather than module load intelpython/3 + source activate $HOME/.conda/envs/shapepipe. Verifying the MPI script end-to-end turned up three things in the long-unexercised MPI path that had to be fixed for it to run — a launcher/PMIx mismatch, a code bug, and a stale config — each hidden behind the previous one.

Changes

Job scripts (example/pbs/candide_{smp,mpi}.sh)

  • Drop conda entirely. The pipeline runs via apptainer exec against the runtime image, pulled once to a SIF whose path is overridable via $SP_IMAGE.
  • SLURM, not PBS. candide moved to SLURM — qsub/#PBS are gone. The scripts now use #SBATCH / sbatch and mpirun -n $SLURM_NTASKS.
  • Bind-mount the host clone ($SPDIR) at the same path inside the container so the configs' $SPDIR-relative INPUT_DIR/OUTPUT_DIR resolve identically in- and outside the container.
  • Fix a stale config path; propagate the real pipeline exit code instead of always exit 0.
  • Un-rot example/pbs/config_mpi.ini (last touched 2020): its MODULE / section names lacked the _runner suffix the loader now requires, so the run died with No module named 'shapepipe.modules.python_example'. Updated to the current names (matching example/config.ini).
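
The stale-name problem in the last bullet can be caught before a job is submitted. The following is a minimal sketch, not part of this PR: it assumes the module list lives under a `MODULE` key in an `[EXECUTION]` section (the section name is an assumption), and `stale_module_names` is a hypothetical helper.

```python
import configparser

# Example config text; the [EXECUTION] section name is an assumption.
CONFIG_TEXT = """
[EXECUTION]
MODULE = python_example, serial_example_runner, execute_example_runner
"""


def stale_module_names(text, section="EXECUTION", key="MODULE", suffix="_runner"):
    """Return the module names in `text` that lack the required suffix."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    names = [name.strip() for name in parser.get(section, key).split(",")]
    return [name for name in names if not name.endswith(suffix)]


print(stale_module_names(CONFIG_TEXT))
```

Here only `python_example` is reported, because the other two names already carry the `_runner` suffix.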

Container (Dockerfile)

  • Build OpenMPI 5.0.x from source (--with-pmix=internal --with-prrte=internal --disable-dlopen), dropping libopenmpi-dev. The image previously shipped OpenMPI 4.1.4 / PMIx 2, which is wire-incompatible with candide's OpenMPI 5.0.x / PMIx 5 host launcher — so hybrid MPI silently degraded to N independent "rank 0 of 1" processes (wrong science, no error). The stock mpi4py wheel dlopens libmpi.so.40 (ABI-stable across OMPI 4↔5), so uv.lock is untouched.

ShapePipe MPI code (src/shapepipe/pipeline/mpi_run.py, src/shapepipe/run.py)

  • Thread module_config_sec through run_mpi → submit_mpi_jobs → worker(). By git history this dates to #415 ("Multiple Module Runs"): worker() gained a module_config_sec parameter, but the MPI submit path wasn't updated in step, so it passes 7 args where 8 are required and every run fails with: worker() missing 1 required positional argument: 'module_runner'. On candide this path wasn't reachable until now (the PMIx mismatch meant MPI never wired up), so fixing the launcher is what surfaced it.
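
A minimal sketch of that failure mode, assuming a hypothetical eight-parameter `worker()`. Only `module_config_sec` and `module_runner` are names from this PR; the other parameters are invented placeholders, not ShapePipe's real signature. A call site written for the old seven-parameter form shifts every later argument one slot early and leaves the last slot empty, which is why the error names `module_runner` rather than `module_config_sec`.

```python
def worker(process, job_name, log_name, run_dirs, config,
           module_config_sec, timeout, module_runner):
    """Hypothetical stand-in for the shared worker (8 positional parameters)."""
    return f"{job_name}:{module_config_sec}:{module_runner}"


def old_submit(process):
    # Pre-fix call site: 7 arguments, written before module_config_sec existed.
    return worker(process, "job0", "log", {}, "config", 10, "python_example_runner")


def new_submit(process, module_config_sec):
    # Post-fix call site: module_config_sec is threaded through.
    return worker(process, "job0", "log", {}, "config",
                  module_config_sec, 10, "python_example_runner")


try:
    old_submit("proc")
except TypeError as err:
    message = str(err)
    print(message)

print(new_submit("proc", "PYTHON_EXAMPLE_RUNNER"))
```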

Exit-code propagation (src/shapepipe/shapepipe_run.py)

  • main() called run(args) but didn't return it, so exit(main()) was always exit(None) → 0. That means every caught error in ShapePipe — not just MPI — has been exiting 0, invisible to exit $? and CI. Now return run(args), with a regression test. (This is independent of MPI and worth landing on its own.)
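
A minimal sketch of the bug and the fix, using hypothetical stand-ins for `run()` and `main()` rather than the real ShapePipe functions. `sys.exit(None)` is reported by the interpreter as status 0, so a discarded return value turns every handled failure into an apparent success.

```python
import sys


def run(args):
    """Hypothetical stand-in: returns 1 when an error was caught, else 0."""
    return 1 if "fail" in args else 0


def main_before(args):
    run(args)  # return value discarded, so this function returns None


def main_after(args):
    return run(args)  # the one-word fix: forward run()'s status


def exit_status(main, args):
    """Report the status a script ending in sys.exit(main(args)) would exit with."""
    try:
        sys.exit(main(args))
    except SystemExit as exc:
        # An exit code of None means "success" and is reported as 0.
        return 0 if exc.code is None else exc.code


print(exit_status(main_before, ["fail"]))
print(exit_status(main_after, ["fail"]))
```

The first call reports 0 despite the failure; the second reports 1, which is what `exit $?` in the job scripts needs to see.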

CI (.github/workflows/deploy-image.yml)

  • Publish images on every branch push, so any PR has a cluster-pullable image to test against before merge.

Verification (on candide)

  • MPI runs end-to-end — the unmodified candide_mpi.sh against the published runtime image. 2-node / 4-task hybrid run: ranks distribute across two nodes (process 0–3 of 4 on n23+n25, not the pre-fix singletons), all three *_example_runner modules produce output, real exit 0 (the script's exit $?), and shapepipe.log records "A total of 0 errors were recorded."
  • candide_smp.sh runs the SMP example pipeline end-to-end through the container, 0 errors.
  • bash -n clean; structural shell-syntax test tier passes.
  • Caught errors now propagate a non-zero exit code (the main() fix), so a failed run no longer looks like success to exit $?.

Question for review — is MPI a used dependency of ShapePipe at all?

We got MPI working on candide (the three fixes above) — but the bigger question this surfaced, @martinkilbinger, is whether it's worth keeping at all. Its entire footprint in the repo is tiny:

  • A hard dependency (mpi4py>=4.0 in pyproject.toml) plus the core code (run_mpi in run.py, mpi_run.py).
  • Two example job scripts, candide_mpi.sh (candide) and cc_mpi.sh (ccin2p3), and one active example config (config_mpi.ini).
  • Zero production paths: every canfar/candide submission script runs SMP (N_SMP).

And functionally it adds nothing: SMP and MPI call the identical worker() with identical arguments — the same computation behind two dispatchers (worker_handler.py has no MPI; the MPI path just scatters the independent job-list and gathers results). The workload is embarrassingly parallel, so MPI buys only an ergonomic convenience over "many per-node SMP jobs under SLURM."
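
To illustrate the "same computation behind two dispatchers" point, here is a sketch under stated assumptions: `worker` is a hypothetical stand-in, a thread pool stands in for the on-node SMP dispatcher, and the MPI scatter/gather is simulated in-process so the example runs without mpi4py. It is not ShapePipe's code.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(job):
    """Hypothetical stand-in for the shared worker: one independent job in, one result out."""
    return job * job


def run_smp(jobs, n_smp=2):
    # On-node dispatcher: a pool maps the worker over the whole job list.
    with ThreadPoolExecutor(max_workers=n_smp) as pool:
        return list(pool.map(worker, jobs))


def run_scatter_gather(jobs, n_ranks=4):
    # Simulated MPI dispatcher: split the list round-robin into one chunk
    # per rank, let each "rank" loop over its chunk, then gather.
    chunks = [jobs[rank::n_ranks] for rank in range(n_ranks)]
    gathered = [[worker(job) for job in chunk] for chunk in chunks]
    # Undo the round-robin split so results line up with the input order.
    results = [None] * len(jobs)
    for rank, chunk_results in enumerate(gathered):
        results[rank::n_ranks] = chunk_results
    return results


jobs = list(range(10))
print(run_smp(jobs) == run_scatter_gather(jobs))
```

Because the jobs never communicate with each other, both dispatchers produce the same results; only the way the job list is divided differs.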

So: is MPI actually used anywhere, or can the mode (and the mpi4py dependency) be dropped? If it's used, this PR restores it to working order and there's a ready follow-up to harden it against silent desync; if not, removing it (mpi_run.py, run_mpi, the import_mpi branches, mpi4py, the two *_mpi.sh scripts, config_mpi.ini) is clean and contained. Your call — you'd know whether anyone depends on it.

Out of scope (deliberately untouched)

The production submission scripts — scripts/sh/run_scratch_local.sh, init_run_exclusive_canfar.sh, job_sp_canfar*.bash (canfar + modern candide) — and the ccin2p3 scripts. These target workflows we can't fully verify here; note they are still SMP + conda and not yet containerized (a separate future cleanup).

🤖 Generated with Claude Code

— Claude on behalf of Cail

candide_smp.sh and candide_mpi.sh activated a personal conda env
(module load intelpython/3; source activate $HOME/.conda/envs/shapepipe)
and called $SPENV/bin/shapepipe_run. Convert them to run the pipeline
through the published container image, matching the supported workflow
(the container is the source of truth; see docs/source/container.md).

- Drop the conda environment entirely. The pipeline runs via
  `apptainer exec` against the slim runtime image
  (ghcr.io/cosmostat/shapepipe:develop-runtime), pulled once to a SIF
  whose path is overridable via $SP_IMAGE.
- Bind-mount the host clone ($SPDIR) at the same path inside the
  container so the example configs' $SPDIR-relative input/output
  directories resolve identically in- and outside the container.
- MPI uses the standard "hybrid" Apptainer pattern: host mpiexec
  (module load openmpi) launches one container rank per slot, the
  in-image mpi4py/OpenMPI handle communication.
- Fix a stale path: candide_mpi.sh pointed at example/config_mpi.ini,
  which does not exist; the file is example/pbs/config_mpi.ini.
- Propagate the pipeline exit code to the batch system (exit $?)
  instead of always exiting 0.
- Make $SPDIR overridable for testing.

Tested on candide (c03): candide_smp.sh runs the SMP example pipeline
end-to-end through the container with 0 errors. The MPI hybrid launch
needs a real multi-node allocation to verify end-to-end (it hangs on a
login node); the image's MPI stack (mpiexec + mpi4py 4.1.1) and the
shared container invocation are verified via the SMP run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 30, 2026
…t question

Worker delivered both cleanups as open PRs against develop (not merged):
#736 (rho-stats/PSF-plot removal) and #737 (candide container scripts).
Resolves the open question — random_cat is a general LSS random-catalogue
generator, not rho-stats, so it was kept. Records that stile was already
vestigial and the MPI-verification gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley and others added 8 commits May 31, 2026 12:53
… as #737)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Hybrid Apptainer MPI was broken on candide: the image shipped Debian
bookworm's OpenMPI 4.1.4 (PMIx 2.x) while candide's host launcher is now
OpenMPI 5.0.x (PMIx 5.x). A PMIx 2 client cannot handshake with a PMIx 5
server, so every rank degraded to a standalone "rank 0 of 1" -- N
singletons instead of one N-rank job (the textbook Apptainer symptom).

- Dockerfile: drop libopenmpi-dev/openmpi-bin; build OpenMPI 5.0.8 from
  source with bundled PMIx 5 / PRRTE (--with-pmix=internal etc.) and
  --disable-dlopen (static MCA -- fixes an internal-openpmix pdl configure
  failure and is the right posture for a container). The stock mpi4py
  wheel dlopens libmpi.so.40, which this build provides, so uv.lock is
  untouched.
- example/pbs/candide_{mpi,smp}.sh: candide migrated PBS -> SLURM (qsub is
  gone), so convert #PBS -> #SBATCH and launch with
  `mpirun -n $SLURM_NTASKS apptainer exec ... shapepipe_run`. Load the
  cluster-default `openmpi` (any 5.0.x is PMIx-compatible).
- docs + CLAUDE.md: document the hybrid-MPI run pattern and the
  build-remotely / pull-locally container workflow.

Empirically verified on candide: the 4.1.4 image gives 4x "rank 0 of 1";
an OpenMPI 5.0.8 build wires up correctly. See .felt shapepipe/mpi-hybrid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tag each pushed branch's image with the branch name so any open PR has a
pullable image (apptainer pull ...:<branch>-runtime) that can be tested on a
real cluster before merge. Same-repo branch pushes always carry a
registry-write token, so this is safe; fork PRs still only build+test via the
pull_request trigger (they have no token to publish with).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the public .felt/ team-facing rather than personal collaboration notes:
- shapepipe.md root: drop first-person role framing, the 'working agreement
  with Martin' section, private ~/.claude memory-note pointers, and royal-we
  voice convention; rewrite as a person-generic gateway (stack division,
  repo conventions incl. corrected rho-stats/meanshapes boundary, threads).
- Delete fabian-coord-bug (body-less personal reminder) and prs-in-flight
  (personal PR dashboard); rephrase the 3 inbound wikilinks.
- Neutralize ngmix-update + docker-uv-revert: strip collaborator names and
  'mine'/'we agreed' framing, keep the technical why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MPI execution path was broken since #415 ("Multiple Module Runs"):
WorkerHandler.worker() gained a `module_config_sec` parameter, but
`submit_mpi_jobs` in mpi_run.py was never updated to pass it. So the MPI
path called worker() with 7 args where 8 are required, failing every run
with:

    WorkerHandler.worker() missing 1 required positional argument:
    'module_runner'

This stayed invisible for 16 months because MPI is a legacy execution
mode (SMP is the production path), and on candide MPI couldn't even wire
up due to a PMIx version mismatch -- which masked the code bug beneath.
Fixing the launcher (OpenMPI 5.0.x in the image) exposed it.

Thread `module_config_sec` from run_mpi (root rank, broadcast to all
ranks) into submit_mpi_jobs and on to worker(), matching the SMP/serial
call sites. Verified end-to-end on candide: 2-node / 4-rank hybrid MPI
run of the example pipeline, all three modules complete, 0 errors
recorded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lers

The earlier mpi-hybrid close claimed the full pipeline ran clean under MPI.
It did not — that run hit a latent ShapePipe code bug and the sbatch
RUN_EXIT=0 was a hardcoded echo. Rewrite the empirical close to the true
two-layer story: launcher (PMIx) fixed and verified, which then exposed the
module_config_sec bug (#415), now fixed in e599973 and re-verified e2e.
Reopen status (fix not yet in the published image, #737 not merged).

Add exec-modes-schedulers: a reference fiber mapping smp/mpi (execution
modes) and PBS/SLURM (schedulers) — what's production (SMP+SLURM) vs legacy
(MPI, PBS) — the context that explains why this bug survived 16 months.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ripts-container

# Conflicts:
#	.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
@cailmdaley cailmdaley changed the title Run candide PBS job scripts through the container instead of conda Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts May 31, 2026
cailmdaley and others added 10 commits May 31, 2026 17:24
The MPI example config still used the pre-suffix module names
(`python_example`, `serial_example`, `execute_example`) and section
headers from 2019-2020; the module loader needs the full runner names
(`*_runner`), as example/config.ini uses. With the stale names, rank 0
failed with "No module named 'shapepipe.modules.python_example'" and the
other ranks deadlocked in the collective until the wall-clock timeout.

Third layer of MPI bit-rot beneath the launcher and the module_config_sec
fix, same root cause: nobody runs MPI, so its example config rotted too.

Verified: the unmodified candide_mpi.sh against the published runtime image
now runs the example pipeline end-to-end (4 ranks / 2 nodes, all three
modules, 0 errors, exit 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nknown)

Walk back 'nobody runs MPI / invisible for 16 months' across both fibers.
What we observed: MPI needed three fixes to run on candide; the code bug
dates to #415 by git history; the canfar/candide tooling is SMP-only. What
we cannot see: how MPI was actually used, especially on canfar where most
processing ran. State the evidence, not the inference about practice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtin

SMP and MPI call the identical worker() with identical args — same computation,
two dispatchers (joblib-on-node vs MPI scatter/gather). worker_handler has no
MPI; the workload is embarrassingly parallel. So MPI is an ergonomic convenience,
not a computational need. Defer to Martin (in #737) whether MPI earns its keep on
candide vs just using SMP; don't retire the documented mode unilaterally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`main()` called `run(args)` but discarded its return value, so `exit(main())`
was always `exit(None)` → 0. `run()` returns 1 when it catches an error
(`catch_error` + `return 1`), so *every* handled failure has been exiting 0 —
invisible to `exit $?` in the job scripts and to any CI/automation. One-word
fix: `return run(args)`. Add a regression test that main forwards run's value.

Surfaced while end-to-end testing the MPI singleton guard: the guard fired and
logged loudly but the job still exited 0 until this fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…etons"

When the host MPI launcher and the container's MPI/PMIx stack are incompatible,
every process initialises standalone (COMM_WORLD size 1, rank 0). ShapePipe
then treats each as master, hands each the full job list, and runs N
uncoordinated copies of the pipeline into the same output directory — silently,
with exit 0. This is the failure the OpenMPI-5 image fix prevents on candide,
but nothing guarded against a future recurrence on another cluster.

check_mpi_world() compares the size that actually wired up (COMM_WORLD) against
the size the launcher intended (OMPI_COMM_WORLD_SIZE, which is set per process
even when the world fails to form) and aborts on a mismatch. Empirically
verified on candide: SLURM_NTASKS is NOT reliable for this (reads 1 on
remote-node ranks even in a healthy run) — OMPI_COMM_WORLD_SIZE is. Tested both
ways on a real allocation: healthy OMPI-5 run passes and completes; OMPI-4
image under the OMPI-5 host launcher fires the check and exits non-zero
(together with the exit-code fix). Also catches partial wire-up (N-1 of N).
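
The detection recipe can be sketched as follows. This is an illustrative reconstruction, not the check_mpi_world() code from this branch: `check_world` is a hypothetical name, and the actual world size is passed in as a parameter (in a real run it would come from mpi4py's `MPI.COMM_WORLD.Get_size()`) so the sketch runs without MPI.

```python
import os


def check_world(actual_size, env=None):
    """Return an error string if the world that formed differs from the intended size.

    actual_size: the size that actually wired up (COMM_WORLD size in a real run).
    env: mapping to read OMPI_COMM_WORLD_SIZE from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    intended = env.get("OMPI_COMM_WORLD_SIZE")
    if intended is None:
        # Not started by OpenMPI's launcher; nothing to compare against.
        return None
    if int(intended) != actual_size:
        return f"MPI world size {actual_size} != launcher size {intended}"
    return None


# Healthy 4-rank run: no error.
print(check_world(4, {"OMPI_COMM_WORLD_SIZE": "4"}))
# Singleton symptom: each process sees a world of size 1.
print(check_world(1, {"OMPI_COMM_WORLD_SIZE": "4"}))
```

A caller would abort with a non-zero status when the returned value is not None; that policy is left out of the sketch.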

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 'warning sign' pass: added check_mpi_world preflight (OMPI_COMM_WORLD_SIZE
vs COMM_WORLD size; SLURM_NTASKS proven unreliable) and, found while testing it
e2e, fixed main() swallowing run()'s exit code (every caught error had exited 0).
Both tested on a real allocation. Distinct remaining gap: mid-setup rank-0
failure still deadlocks the other ranks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion)

Reverts the check_mpi_world preflight added earlier in this branch. It guards a
real but narrow failure (the "rank 0 of N singletons" desync → silent wrong
results), which is already designed out on candide by the OpenMPI-5 image match,
and adds a runtime check to core run.py — scope creep for what is an
example-script modernization PR, especially while MPI's future is an open
question for Martin (it's a hard mpi4py dependency used only by two example
scripts and one config, by zero production paths).

Keeps the exit-code propagation fix (33494d7), which is broad and unrelated.
The guard's detection recipe (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size;
SLURM_NTASKS is unreliable) is preserved in the mpi-hybrid fiber as a ready
follow-up if MPI is kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Guard pulled from #737 (scope creep on a maybe-retired mode); exit-code fix kept.
Recipe preserved in Layer 4. Question to Martin sharpened to 'is MPI a used
dependency at all?' with the full footprint: hard mpi4py dep, 2 example scripts,
1 config, 0 production paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job scripts assumed the runtime SIF already existed; the pull step and
candide's home-quota gotcha (point APPTAINER_CACHEDIR at a data partition or
`apptainer pull` dies on $HOME) lived only in CLAUDE.md and felt. Add a
copy-paste "Quickstart on a cluster (candide)" to the README — the on-ramp a
newcomer actually reads — covering clone -> quota-safe pull -> sbatch the
example, pointing at example/pbs and docs/source/container.md for depth.

Verified end to end on candide (c03, apptainer 1.4.5, SLURM): the exact
quickstart command form runs candide_smp.sh against the published
:develop-runtime image -> job COMPLETED, ExitCode 0:0, "A total of 0 errors
were recorded".

Also refresh the stale python-3.9 badge to 3.12 (the shipped interpreter) and
drop a stray character from its target URL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cailmdaley commented May 31, 2026

@fabianhervaspeters @martinkilbinger — beyond fixing MPI, this PR doubles as a readiness test for the container itself, and as far as I can tell it's ready for other people to pull it, run it, and try to break it.

Verified end to end on candide today (c03, apptainer 1.4.5, SLURM) against the published :develop-runtime image — observed, not inferred:

  • Bundled example via direct apptainer exec → exit 0, "A total of 0 errors were recorded".
  • apptainer shell resolves the entry point (/app/.venv/bin/shapepipe_run).
  • Real SLURM job: candide_smp.sh → COMPLETED, ExitCode 0:0, 0 errors, bind-mount + exit-code propagation working.
  • Branch CI is green.

The user-facing docs for all this — a one-command Quickstart, the candide cluster walkthrough (quota-safe pull → sbatch example/pbs/candide_smp.sh, including the $HOME-quota gotcha), and the MPI run pattern — now live in the docs-rework PR #739, so this PR stays focused on the code/infra. See the README front door and the candide cluster docs there.

So: please pull it and try to break it. The separate question above, whether to keep MPI at all, still stands.

— Claude on behalf of Cail

…ner.md

The README had tunnel-visioned onto a candide-specific SLURM walkthrough —
wrong altitude for the project's landing page, where the broader community
arrives. Restructure it as a front door: a one-sentence-deeper description of
what ShapePipe does, a Quickstart that runs the bundled example straight from
the published container in one command (apptainer or docker, no install, no
cluster specifics), the image tag scheme, and a Documentation signpost to the
published pages.

Move the candide cluster walkthrough (quota-safe pull -> sbatch candide_smp.sh,
the SPDIR bind-mount, the MPI PMIx note) into a new "Running on a cluster
(SLURM)" section in container.md, which the README links to. Drop the
test-assertion prose ("logs ... and exits 0") that read like a CI check rather
than user docs.

Both quickstart commands verified on candide against :develop-runtime
(including the no-pre-pull `apptainer exec docker://...` form): the bundled
example runs to completion, 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
Done-state re-verified fresh: PR #736 (rho-stats removal) and #737 (candide
scripts → container) both OPEN against develop, no reviews, not merged. Noted
PR #636 (rho-stats feature PR) needs Martin to reconcile against #736's removal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
…open & green

Cold survey found zero remaining worker work. PR #736 (rho-stats removal)
and #737 (candide scripts via apptainer) are both open, mergeable, and
CI-green. The only remaining step is Martin's review, which is outside
worker scope. Includes mechanical felt reindex churn.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The README front door, the container.md 'Running on a cluster' section, and the
basic_execution.md MPI docs are relocated to #739, which owns the full docs
story (cluster docs now live in a dedicated clusters.md, so keeping the
walkthrough here too would duplicate it). This PR keeps only the code/infra and
the CLAUDE.md build-loop note that the container changes here introduce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
Unify all user-facing docs in this PR (relocated from #737, which is now pure
code/infra):
- README front door (Quickstart + Documentation signpost). The signpost now
  has a dedicated 'Running on a cluster' entry pointing at clusters.html, and
  the container-workflow entry no longer claims to carry the cluster example
  (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the
  OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.

This removes the container.md/clusters.md duplication at the source rather than
reconciling it after merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
Audited every narrative docs page against the current code. The install /
container / testing / API pages were already fresh; the staleness concentrated
in cluster docs and a few content errors. This rework:

**Machine-specific cluster tree.** Cluster guidance was scattered and half
of it invisible (candide lived only inside container.md on a feature branch;
canfar was split across orphaned pages; none of canfar/candide were in the
sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree
caption: the shared pattern (container = unit of execution, bind-mount, keep
SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM,
the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR
(the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3
stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md,
linked, and is now in the toctree too.

**Delete obsolete pages.** canfar.md (the old curl-VM submission model,
superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing
script), and work_flow_v2.0.md (an unrealized planning wishlist) — all three
orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt
store rather than lost.

**Fix content errors.**
- dependencies.md: rewritten against pyproject.toml. Reframed around the
  abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points
  at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the
  phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar,
  cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead
  prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that
  moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing
  with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST),
  contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).

Verified with a local sphinx-book-theme build: succeeds; the only new warning
the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings
are all pre-existing (the autosummary API page needs the installed package;
multiple-toctree notices on every page).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>