Skip to content

Ngmix v2.0 (CI mirror of #740)#741

Open
cailmdaley wants to merge 62 commits into
developfrom
ngmix_v2.0
Open

Ngmix v2.0 (CI mirror of #740)#741
cailmdaley wants to merge 62 commits into
developfrom
ngmix_v2.0

Conversation

@cailmdaley
Copy link
Copy Markdown
Contributor

What this is

A same-repo mirror of #740 (@martinkilbinger's "Ngmix v2.0"), pushed to a branch on CosmoStat/shapepipe so that CI actually runs. All 57 commits are authored by Martin (and carry Lucy Baumont's and Axel Guinot's work) — pushing the branch preserves that authorship unchanged; the only thing this PR adds is a same-repo head so GitHub Actions fires the pull_request workflow without the fork-PR approval gate. #740 received no CI runs at all for this reason.

Substance is identical to #740 — see that PR for the full description. In short: upgrade ngmix to 2.4.0 and adopt Lucy's new ngmix classes/interface; overhaul the shape-measurement module; centroid-bias fix + validation; v2.0 production-run plumbing.

Going forward, opening PRs directly on CosmoStat/shapepipe (rather than from a fork) avoids this — fork PRs don't trigger our Docker-image CI without a maintainer approval that wasn't happening.

Closes/supersedes #740 once CI is green (leaving that call to Martin).

Review

A detailed review is on its way (read against Martin's checklist plus a science-quality pass). Headline from exercising the new fitter against real ngmix 2.4.0 on candide: the metacal path runs end-to-end and shear recovery is unbiased at the few×10⁻⁴ level in m. Full notes to follow.

— Claude on behalf of Cail

Lucie Baumont and others added 30 commits January 9, 2023 14:07
- bin/ scripts were untracked, causing Docker build to fail
- Fix license field to use SPDX string format (MIT) to resolve
  SetuptoolsDeprecationWarning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
martinkilbinger and others added 26 commits May 7, 2026 14:41
Fix check on zero-valued pixels (most did not pass), now all vignets with
at least one != 0 pixel pass.
Syncs the branch with develop to pull in the modernized CI (so the full
test suite + example pipeline run on this PR) and resolve drift. Only
conflict was uv.lock, regenerated with `uv lock` against the
auto-merged pyproject.toml (ngmix 2.4.0 source preserved).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
prepare_ngmix_weights drew the noise image and masked-pixel fill from the
unseeded global numpy.random, defeating the per-tile self._rng =
RandomState(seed) the module sets for reproducibility. Same tile + same seed
gave different shears (observed max|dg1| ~ 3e-5). Thread rng through
make_ngmix_observation -> prepare_ngmix_weights and draw via rng.standard_normal.

Adds test_metacal_is_reproducible_with_fixed_seed (same seed -> identical
noshear g); fails before this fix, passes after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_imports.py walks the shapepipe package, but standalone scripts/ files are
outside it and never get imported. This extends import-smoke coverage to the
validation scripts added with ngmix v2.0. Currently fails on
centroid_bias.py, which imports get_guess — a helper removed in the module
rewrite; that script's status (restore get_guess / fix import / relocate as
legacy) is a developer decision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
centroid_bias.py is the old-API (ngmix 1.x) bias-validation script: it imports
get_guess, a helper the v2.0 rewrite removed (v2.0 seeds fits via ngmix's own
guessers). It is only ever run against the test_centroid_bug worktree — the
centroid_bug/centroid_fix cases invoke it as SHAPEPIPE_PATH/TEST_SCRIPT with
SHAPEPIPE_PATH pointing at that worktree (see run_bias_test.sh), which has its
own copy. The duplicate on this branch was never executed and only broke
import-smoke. centroid_bias_v2.py supersedes it for ngmix_v2.0.

Drops it from the scripts import-smoke list accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@cailmdaley cailmdaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the v2.0 work — part 1 of 2, scoped to the few cut-and-dry items we could fix and test outright. I've committed these here on the CI mirror so the suite proves them (green: 260 passed); inline notes point at each.

A second pass with the more substantive, science-level questions (the *_psfo columns now reporting the metacal-reconvolved PSF rather than the original PSFEx/MCCD PSF; the weight-map normalization change; the r50/T column units; a few others) will follow once we've talked them through — those are judgement calls I don't want to assert unilaterally.

Context for this mirror PR is in the description above. Drafted by Claude, directed by me through the review.

— Claude on behalf of Cail

gal_guess_flag = True
wsum = 0
for n_e in range(n_epoch):
noise_img = rng.standard_normal(gal.shape) * sig_noise
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reproducibility fix (done in this branch). The module sets self._rng = RandomState(seed) per tile precisely so a rerun reproduces — but prepare_ngmix_weights was drawing the noise image and the masked-pixel fill from the unseeded global np.random.randn. Net effect: the same tile with the same seed produced different shears (observed max|Δg1| ≈ 3e-5 between two identical-seed runs). Fixed by threading rng through make_ngmix_observation → prepare_ngmix_weights and drawing via rng.standard_normal. Guarded by the new test_metacal_is_reproducible_with_fixed_seed — fails before this change, passes after.

— Claude on behalf of Cail

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. This probably was previous code before we changed to a random seed per tile.

return np.asarray(res["noshear"]["g"])


def test_metacal_is_reproducible_with_fixed_seed():
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New property test for the reproducibility fix: same seed → identical noshear g. Red on the pre-fix code (global-RNG leak), green after threading rng. Kept cheap (51 px, 2 epochs) so it runs comfortably in the suite.



@pytest.mark.parametrize("relpath", VALIDATION_SCRIPTS)
def test_validation_script_imports_cleanly(relpath):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New smoke test + a removal it triggered. tests/unit/test_imports.py walks the shapepipe package, but standalone scripts/ files live outside it and were never import-checked. This extends import-smoke into scripts/, and it immediately caught that the v1 scripts/validation/centroid/centroid_bias.py imported get_guess — a helper the v2.0 rewrite removed (v2.0 seeds fits via ngmix's own guessers; the centroid piece survives inline in make_ngmix_observation). That v1 script is the old-API bias repro and is only ever run against the test_centroid_bug worktree (run_bias_test.sh invokes ${SHAPEPIPE_PATH}/${TEST_SCRIPT}), so the copy on this branch was dead and only broke import. We removed the stranded copy here; centroid_bias_v2.py supersedes it for v2.0. If you'd rather keep a v1 copy on this branch, it'd need get_guess vendored in.

— Claude on behalf of Cail

Copy link
Copy Markdown
Contributor Author

@cailmdaley cailmdaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — part 2 of 2 (methodology + remaining items)

Provenance: this review was generated by Claude, working on candide against the actual module code; I haven't hand-edited it line by line, so please read it with that lens. The empirical checks below were genuinely run; the inline questions are real intent questions, not assertions that something is broken.

Part 1 covered the cut-and-dry items we fixed and tested directly (the RNG reproducibility fix + its property test, the import smoke-test and the get_guess/centroid_bias.py removal it surfaced). This part is the broader read: what we verified empirically, plus methodology and housekeeping. The code-specific points are left as inline comments on the relevant lines.

What we verified empirically ✅

We installed ngmix 2.4.0 (from this PR's own esheldon/ngmix@v2.4.0 source) over the container stack and ran the actual module code:

  • All 13 ngmix-2.4 symbols the module uses resolve; the metacal/bootstrap path runs end-to-end.
  • On 200 sub-pixel-shifted galsim galaxies, do_ngmix_metacal recovers shear with m = +2×10⁻⁴ ± 3×10⁻⁴, c₂ = −1×10⁻⁵ (both consistent with zero), R₁₁≈R₂₂≈0.924, 200/200 OK. Your centroid_bias_v2.py (noiseless) gives m = +3.7×10⁻⁴.
  • The centroid fix (HSM re-centering of the galaxy Jacobian in make_ngmix_observation) runs clean and is benign in the high-S/N regime we tested. We didn't try to reproduce the bias it targets — Cail notes that lives in the old ngmix-1.x path, so current code can't exhibit it. Just confirming the new fix doesn't introduce anything.

The 2.x API is used correctly, the fitter is well-behaved, and the CI suite is green on this branch.

Methodology questions (for you / Lucy)

Five places where v2.0 behaviour differs from the validated v1 in ways that may be intentional but aren't documented in the diff — flagged inline so the choice is explicit:

  • weight-map normalization (inline at prepare_ngmix_weights)
  • *_psfo columns now carrying the metacal-reconvolved PSF, and per-type PSF columns collapsing (inline at average_multiepoch_psf)
  • the galaxy zero-pixel guard loosening anyall (inline at prepare_postage_stamps)
  • the PSF fit reusing the galaxy prior + FLUX_AUTO flux guess (inline at do_ngmix_metacal)

Checklist / housekeeping

  • closes # links — none on the PR. Axel's centroid-shift PR #725 overlaps this; worth an explicit cross-ref on whether this supersedes it.
  • Docs — no docs/ changes, and the ngmix-pin note in CLAUDE.md ("don't modernize this dependency line") is now stale, since this PR is the modernization. Worth updating on/with merge.
  • pyproject.toml — ngmix is pinned to a git tag (esheldon/ngmix@v2.4.0). If 2.4.0 is on PyPI, ngmix==2.4.0 would be more reproducible and drop the git dependency — worth a check.
  • Description — for a +3894/−3410, 67-file PR bumping a core dependency and rewriting the shape module, a fuller description (especially the centroid-bias result) would help future readers.

A couple of pre-existing cleanups (not changed here, low priority)

In save_results: the error string ~ngmix.py:547 is non-f, so it prints literal {ext_name}; there's leftover print(...) debug just below it; and the np.append-per-batch FITS write is O(n²) over a tile. Worth a sweep if you're in there.

Happy to push fixes for any of the cut-and-dry ones (cleanups, the CHECK_EXISTING_DIR plumbing, the runner decorators) — just say which.

— Claude on behalf of Cail

self._zero_point = zero_point
self._pixel_scale = pixel_scale

# MKDEBUG check whether still used
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK_EXISTING_DIR resume looks silently dropped. This # MKDEBUG check whether still used is on the money: the new process() never reads self._check_existing_dir and never calls get_last_id (both now dead). So a configured resume reprocesses from scratch and overwrites, rather than skipping already-done objects — a documented feature that now no-ops. Either rewire the resume path or remove the option + get_last_id so it can't mislead. (And the # MKDEBUG marker can go.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a hack to resume interrupted ngmix runs, bookkeeping done in scripts outside modules. Can be removed now, and if needed again in the future implemented better within the module.

output_dict[name]["Tpsf"].append(results[idx]["T_PSFo"])
output_dict[name]["g1_psf"].append(results[idx]["g_PSFo"][0])
output_dict[name]["g2_psf"].append(results[idx]["g_PSFo"][1])
output_dict[name]['r50'].append(results[idx][name]['pars'][4])
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r50 column naming. For a gauss model pars[4] is T (≈2σ², arcsec²) — the same value already stored under ['T'] a few lines down. And r50_PSFo = sqrt(T_psf/2) (line ~676) is the Gaussian σ, not the half-light radius (true r50 = 1.1774·σ). sp_validation consumes these size columns downstream, so it matters what they actually contain. Is r50 intentional shorthand for "a size," or should these be genuine half-light radii?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intend is to resolve the existing confusion about 'T', which in the DES papers is the area measure 2 * sigma^2, but in our ngmix implementation was the half-light radius, so the name was confusing. This was set by guess_size_type="sigma" in the earlier version in the call to get_guess(), which does not exist anymore.

We need to check that pars[4] is indeed 'T', and that r50 should be 1.1774·σ and not just σ. Would be good to not hard-code these transformations. There exist some functions in sp_validation:galaxy.py, we would have to move them to e.g. cs_util.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poking @lbaumo

)

error_msg = hsm_shape.error_message
if np.all(gal_vign == 0):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero-pixel guard loosened anyall. v1 skipped an epoch if any pixel was exactly 0; this now skips only if np.all(gal_vign == 0). Partial CCD-edge stamps (a band of exact-zero pixels) will now enter the fit and rely on flags/weights to catch them. Intended loosening, or should the partial-zero case still be excluded?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was added in commit #d01503d0: the any == 0 test had pass close to no postage stamps. I though this was too restrictive.

flag_vign[np.where(tile_vign == -1e30)] = 2**10
v_flag_tmp = flag_vign.ravel()
# remove objects that are more than 1/3 masked
if len(np.where(v_flag_tmp != 0)[0]) / (51 * 51) > 1 / 3.0:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded 51 * 51 denominator. The masked-fraction cut divides by a literal 51 * 51. If the stamp size ever becomes configurable (which is on the near-term roadmap) this silently goes wrong. v_flag_tmp.size would track the actual stamp.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes would be good to not hard-code this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have in other module's config files the entries
STAMP_SIZE = 51
or VIGNET_SIZE = 51


if n_epoch == 0:
raise ValueError("0 epoch to process")
sig_noise = sigma_mad(gal)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weight-map normalization changed vs v1. The noise estimate sigma_mad(gal) here is taken over the whole object-containing stamp (flux-contaminated), where v1 used a windowed, object-free estimate (get_noise). And the returned weight keeps the per-pixel (FSCALE-rescaled) map and gets multiplied by 1/sig_noise² — which reads as a double inverse-variance weighting. Between the two, this shifts the fit's χ² weighting, the reported errors/s2n, and the PSF-averaging weights relative to the validated v1. Is this the intended new weighting?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is this discussion about noise and weight maps: #604

}

Tguess = np.mean(T_guess_psf)
def do_ngmix_metacal(stamp, prior, flux_guess, rng):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PSF fit reuses the galaxy joint prior and the galaxy FLUX_AUTO as the PSF flux guess. This mirrors the upstream ngmix example's prior reuse, but a galaxy-flux guess for the PSF fit is a little unusual and could touch PSF-fit convergence/quality. Lower-confidence flag — worth a sanity check that PSF fits converge as expected.

if ccd_nb < 18 or ccd_nb in [36, 37]:
# swap x axis so origin is on top-right
return np.rot90(vign, k=2)
print('rotating megapipe image')
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup: this print('rotating megapipe image') is after the return, so it's unreachable — can be dropped.

runner=runner,
psf_runner=psf_runner,
ignore_failed_psf=True,
rng=rng,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup: sextractor_e1e2 looks unused here, and its docstring appears copy-pasted from another function (doesn't describe this one). Remove, or fix the docstring if it's kept.

# When post-processing is enabled the sqlite WCS file is the last input;
# remove it before passing the image files to SExtractorCaller.
if config.getboolean(module_config_sec, "MAKE_POST_PROCESS"):
f_wcs_path = input_file_list[-1]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runner input-contract now positional, but the decorator wasn't updated. This pulls the WCS path from input_file_list[-1] — the WCS moved from a named config option into the positional input list. The same positional move happened in read_ext_sexcat_runner, psfex_interp_runner, and vignetmaker_runner, but (unlike ngmix_runner) their @module_runner decorators weren't updated to match. The shipped v2.0 example configs are correct, so this isn't a live bug — but the real input contract now lives entirely in each config's FILE_PATTERN/FILE_EXT ordering, so a hand-written or v1-style config would silently mis-map inputs or IndexError with nothing to catch it. ngmix_runner is the model — worth extending these four decorators to match so the contract is self-documenting.

Comment thread scripts/python/fitting.py Outdated
@@ -0,0 +1,265 @@
"""
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to ship? This reads as a near-verbatim copy of esheldon's fitting_bd_empsf.py ngmix example (bulge+disk fracdev, espy/images refs in the docstring) and doesn't look connected to the pipeline. It also redefines its own make_data/get_prior, duplicating the new shapepipe.testing.simulate helpers added in this same PR (whose point is to be the single stable-across-branches source). Suggest dropping it, or trimming it to import the canonical helpers. scripts/jupyter/test_centroid_shift.py has the same duplication and could likewise import simulate.make_data.

@cailmdaley
Copy link
Copy Markdown
Contributor Author

Next steps — triage of the v2.0 review

Provenance: like the two-part review above, this summary was generated by Claude working on candide against the current branch head; I haven't hand-edited it line by line, so please read it with that lens. The re-review was genuinely run — every finding below was re-checked against the code as it stands on this branch.

A follow-up now that #740 is closed and everything lives here on #741. I re-read the ngmix module and the four runners against the current head: no commits have landed since the part-2 review, so all eleven line-level findings still stand exactly where they were anchored — nothing was silently addressed or drifted. #741 is green and mergeable; the question is just which findings to land before merge and which can wait.

The useful cut isn't open-vs-resolved (they're all open) — it's what unblocks each one:

A · Cut-and-dry — mechanical, no science (one cleanup commit)

  • ngmix.py:293 — unreachable print after return; delete.
  • ngmix.py:1140sextractor_e1e2 is unused with a copy-pasted docstring; drop or fix.
  • ngmix.py:766 — masked-fraction cut divides by a literal 51 * 51; use v_flag_tmp.size so it survives a configurable stamp size.
  • sextractor_runner.py:74 (+ read_ext_sexcat, psfex_interp, vignetmaker) — WCS moved to a positional input but the @module_runner decorators weren't updated to match (only ngmix_runner was). Shipped configs are correct so it's not a live bug, but the decorator is the input contract.
  • scripts/python/fitting.py — reads as a verbatim esheldon example duplicating the new testing/simulate.py; confirm it's not meant to ship, then drop or trim to import the canonical helpers.

Happy to push these as a single commit here if that's useful — they keep CI green and narrow the review surface to the science.

B · Decisions — need your / Lucy's intent

Three are light (a one-line "intended" or a small fix): zero-pixel guard anyall (:737), PSF fit reusing the galaxy prior + FLUX_AUTO flux guess (:1068), and the r50 column naming (:415/:676 — these carry T and sqrt(T/2)=σ, not half-light radii; worth reconciling with sp_validation, which consumes them).

Two genuinely change the meaning of output columns and are worth a real answer:

  • Weight-map normalization (:949)sigma_mad is now over the whole (flux-contaminated) stamp rather than the windowed object-free get_noise, and the per-pixel map is then multiplied by 1/sig_noise² (reads as double inverse-variance). This moves the χ²-weighting, the reported errors/s2n, and the PSF-averaging weights.
  • *_psfo columns (:1045) — these now carry the metacal-reconvolved PSF and collapse the per-type columns to one value, where the ρ-stat / null tests consume them as the input PSF (and the column doc still says "original image psf").

C · Resume feature

  • CHECK_EXISTING_DIR (:254/:257/:464)_check_existing_dir is set but never read and get_last_id is never called, so a configured resume silently reprocesses from scratch. Rewire it, or remove the option + get_last_id (the latter folds into the Bucket-A commit). The adjacent # MKDEBUG sits on _f_wcs_path, which is used — that marker can just go.

Suggested sequence to merge

  1. Land the Bucket-A cleanups (I can push these).
  2. Decide the resume question (rewire vs remove).
  3. Confirm the Bucket-B science questions — the two above are the only ones I'd treat as genuine merge-gates; the rest can merge-then-iterate.
  4. Reconcile r50/T naming with sp_validation.
  5. Merge.

Honest read: CI is green and nothing here breaks the suite, so #741 can merge today — but the weight-norm and *_psfo changes silently alter columns that sp_validation and the null tests rely on, so those two are worth a conscious accept rather than a default one.

— Claude on behalf of Cail

Mechanical cleanups from the #741 review that Martin greenlit / are
unambiguous — no behaviour change on current configs:

- ngmix.py: remove unreachable print() after return in MegaCamFlip (293)
- ngmix.py: remove unused sextractor_e1e2() (copy-pasted docstring, no
  callers anywhere in src/) (1140)
- ngmix.py + ngmix_runner.py: remove the dead CHECK_EXISTING_DIR resume
  path (_check_existing_dir set but never read; get_last_id defined but
  never called) and the stale '# MKDEBUG check whether still used' marker
  on _f_wcs_path (which IS used). Martin: 'a hack to resume interrupted
  runs ... can be removed now.' (254/257/464)
- ngmix.py: masked-fraction cut divides by v_flag_tmp.size instead of the
  literal 51*51 — identical for 51x51 stamps, auto-tracks any future
  stamp size (766)
- scripts/python/fitting.py: remove — verbatim copy of esheldon's ngmix
  example, no callers, duplicates testing/simulate.py

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants