Skip to content

V2.0#706

Open
martinkilbinger wants to merge 25 commits intoCosmoStat:developfrom
martinkilbinger:v2.0
Open

V2.0#706
martinkilbinger wants to merge 25 commits intoCosmoStat:developfrom
martinkilbinger:v2.0

Conversation

@martinkilbinger
Copy link
Copy Markdown
Contributor

Summary

  • New directory structure to better handle runs with O(100,000) images.
  • Updated job scripts to run tile-based jobs including required exposures.
  • Dockerfile.jupyter to deply jupyter notebook session to canfar science portal

Reviewer Checklist

Reviewers should tick the following boxes before approving and merging the PR.

  • The PR targets the develop branch
  • The PR is assigned to the developer
  • The PR has appropriate labels
  • The PR is included in appropriate projects and/or milestones
  • The PR includes a clear description of the proposed changes
  • If the PR addresses an open issue the description includes "closes #"
  • The code and documentation style match the current standards
  • Documentation has been added/updated consistently with the code
  • All CI tests are passing
  • API docs have been built and checked at least once (if relevant)
  • All changed files have been checked and comments provided to the developer
  • All of the reviewer's comments have been satisfactorily addressed by the developer

@martinkilbinger martinkilbinger self-assigned this Apr 7, 2026
@martinkilbinger martinkilbinger added the enhancement New feature or request label Apr 7, 2026
@cailmdaley
Copy link
Copy Markdown
Contributor

Full disclosure: this is a Claude-only review — no human second pass. Bugs and blockers should be real, but the nitpicks may be overzealous in places. Take the polish-level items with a grain of salt.

Overview

Restructures the pipeline for O(100k)-image runs: per-exposure work dirs, a new run_job_sp_canfar_v2.0.bash driver dispatching bit-coded jobs across tile- and exposure-level runners, and a new exp_utils.get_exp_output_files helper that lets tile-level modules discover files produced by per-exposure runners. Also: read_ext_cat module (ASCII SExtractor → FITS-LDAC, enables using Stephen Gwyn's external UNIONS catalogue for tile detection); Dockerfile rewrite (base swapped to images.canfar.net/skaha/astroml:latest, cdsclientastroquery, new Dockerfile.jupyter); Vizier retry loop with server fallback.

Scope: 54 files, +26.5k/−306. Stripping the 23k-line r-band tile list leaves ~3400 lines of actual code change.

Blocking

  • scripts/sh/init_run_v2.0.sh:108 — syntax error, script will not parse. echo " ├── exp/ missing the closing quote; bash -n init_run_v2.0.sh reports syntax error near unexpected token '(' on line 111. Since this is the first script users run for v2.0, it's a hard block.
  • Dockerfile.jupyter:4FROM shapepipe-base references an image that isn't built by anything in the repo or CI. Commit history shows Dockerfile.base was intentionally removed without updating this. Clean builds will fail. Either restore Dockerfile.base and wire it into the workflow, or change to FROM ghcr.io/cosmostat/shapepipe:<tag>.
  • pyproject.toml:33"setuptools<81" with no rationale. Almost certainly a workaround for a transitive build regression (skyproj / pyccl / similar). Add an inline # comment naming the cause, so it can be unpinned later when upstream catches up.
  • pyproject.toml:13 — stray #"shear_psf_leakage", dangling above the dependencies = [...] list. Delete or move inside with a reason.

Bugs & risks

  • src/shapepipe/modules/mask_package/mask.py:512Vizier.SERVER = server mutates class-level state. Fine in a single process, but under SMP parallelism two workers can stomp on each other mid-query. Either construct a fresh Vizier(server=...) per call (if supported by the installed astroquery), or serialize Vizier access. Same pattern in scripts/python/create_star_cat.py.
  • src/shapepipe/modules/read_ext_cat_package/read_ext_cat.py:223–224 — silent ID overflow. tile_id = int(parts[0]) * 1000 + int(parts[1]) assumes parts[1] < 1000. CFIS dec indices are 3-digit today but that's a floor, not a ceiling — a future parts[1] == 1000 would collide with parts[0] + 1. Add an assertion or widen the multiplier.
  • src/shapepipe/pipeline/dependency_handler.py:30def __init__(..., exe_to_module={}) mutable default arg. The existing dependencies=[], executables=[] also have this, but new code shouldn't propagate. Use None and normalize inside.
  • scripts/sh/job_sp_canfar_v2.0.bash:170 — fragile path walk. export SP_EXP=$(realpath "$SP_RUN/../../../exp") assumes exactly three directories between SP_RUN and the v2.0 root. If invoked from a scratch copy or test tree, SP_EXP silently points elsewhere. Pass explicitly via env var or argument.
  • Duplicated Vizier retry logic. create_star_cat.py and mask.py carry near-identical server lists, timeouts, and backoff loops. Factor into one helper (cs_util or shapepipe.utilities.vizier) before they drift.

Code quality

  • scripts/sh/job_sp_canfar_v2.0.bash:206–226 — the command function's else branch reads $4, $5, $6 from the caller. But all call sites pass 2 args (via command_sp). The branch appears dead. Either remove or document what it was for.
  • scripts/sh/job_sp_canfar_v2.0.bash:250–255command_sp is a pure passthrough to command. Delete the wrapper.
  • src/shapepipe/modules/read_ext_cat_runner.py:33–34 — docstring says "runs multi-epoch post-processing to add per-exposure HDUs", but make_post_process is only called when MAKE_POST_PROCESS = True in the config. Note the optionality.
  • Module name read_ext_cat is vague — this is specifically an ASCII-SExtractor → FITS-LDAC converter. Something like read_ext_sexcat_runner would signal scope.
  • scripts/sh/init_run_v2.0.sh:61sed 's/CFIS\.\([0-9]*\)\..*/\1/' silently emits the original line for non-matches. A grep -oE pipeline would fail loudly.
  • scripts/sh/run_job_sp_canfar_v2.0.bash:427–428 — hardcoded CONDA_PREFIX=$HOME/.conda/envs/shapepipe is fine for CANFAR but silently not-exists elsewhere.

Performance

  • read_ext_cat.py:232 loads each tile image fully into RAM via hdul[0].data.astype(np.float32). CFIS tiles are ~320 MB, so OK, but the .astype forces a full copy. memmap=True + per-vignet slicing would halve peak memory per worker.
  • get_exp_output_files does a glob per exposure per tile-level invocation. For O(100k) images this is O(N_exp) globs per tile per job. Fine on fast storage; worth watching on slow shared mounts.

Tests

Zero new tests. exp_utils.get_exp_output_files is trivial to test with tmpdirs; read_ext_cat.make_ldac_from_ascii can round-trip a synthetic catalog. Same gap we've flagged on #702 and #699 — worth naming as a pattern and resolving.

Positives worth naming

  • The exp_utils abstraction is clean and well-documented.
  • _check_executable now including the module name in the error message is a nice UX win.
  • merge_headers_runner's dual-mode (tile-level via EXP_BASE_DIR vs per-exposure) is a clean extension rather than a fork.
  • Dockerfile base switch to images.canfar.net/skaha/astroml is sensible — avoids rebuilding the scientific stack and cuts build time significantly.
  • The Vizier retry logic itself (servers + backoff) is a correct fix for flaky astroquery behaviour, pending the concurrency caveat above.

Recommendation

Request changes on the two blockers (init_run_v2.0.sh syntax, Dockerfile.jupyter base) and the two pyproject.toml hygiene items. The rest is worth filing but not gating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: To do

Development

Successfully merging this pull request may close these issues.

2 participants