Adds CrvCosmic/, an XGBoost-based classifier for vetoing cosmic-ray-induced backgrounds, ported from sam-grant/mu2e-cosmic. The model takes per-coincidence CRV and tracker features and outputs the probability that the coincidence matched to the track is cosmic-induced. Trained on CosmicCRYSignalAllOnSpillTriggered (pure cosmics) vs. CeEndpointMix2BBTriggered (beam + pileup).
Only the cuts and helpers actually used by the ML preselection are ported across from sam-grant/mu2e-cosmic; the non-ML parts of the upstream cosmic-background framework are excluded.
Pipeline
Process: Read ROOT EventNtuple files, apply the preprocessing cutset, flatten per-coincidence features into awkward output.
Assemble: Load processed CRY and CE-mix outputs, label, combine, and produce K-fold indices for nested cross-validation (CV).
Train: Fit XGBoost on each fold, find the per-fold operating threshold at a target veto efficiency, then retrain on the full set with the CV-averaged threshold and export to .ubj.
Validate: ROC AUC, score distributions, feature importance, and a "money table" comparing the ML model against the simple ∆t cut baseline.
Optimise (optional): grid search over XGBoost hyperparameters via k-fold nested CV, minimising deadtime at the target veto efficiency.
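The threshold-averaging step in Train can be sketched as follows — a minimal NumPy stand-in, not the actual Train code; threshold_at_efficiency and the Beta-distributed stand-in scores are illustrative assumptions:

```python
import numpy as np

def threshold_at_efficiency(scores_cosmic, target_eff=0.995):
    # Smallest threshold t such that the fraction of cosmic scores >= t
    # still reaches the target veto efficiency (hypothetical helper).
    s = np.sort(scores_cosmic)[::-1]          # sort descending
    k = int(np.ceil(target_eff * len(s)))     # need at least k events vetoed
    return s[k - 1]

# Per-fold thresholds averaged into one operating point, mirroring the
# "retrain on the full set with the CV-averaged threshold" step above.
rng = np.random.default_rng(0)
fold_thresholds = []
for fold in range(5):
    scores = rng.beta(5, 2, size=1000)        # stand-in for model P(cosmic)
    fold_thresholds.append(threshold_at_efficiency(scores))
final_threshold = float(np.mean(fold_thresholds))
```

Each per-fold threshold meets the target by construction (at least k of n scores lie at or above s[k - 1]); the averaged value is then a single deployable operating point.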
Layout
config/cuts.yaml: Cutset definitions; MLPreprocess is the only one defined here
src/core/: Cut flow & feature helpers, postprocessing combiners
src/ml/: MLProcessor, LoadML, AssembleDataset, Train, Validate, Optimise
src/utils/: IO, histogram booking, plotting, mu2e.mplstyle
run/run_ml_prep.py: Entry point for processing ROOT files into parquet/pkl for training
notebooks/: assemble, feature engineering, optimise, train, validate
Scope: +9,162 / -0 lines across 24 new files — adds a self-contained CrvCosmic/ ML pipeline (XGBoost cosmic veto), ported from sam-grant/mu2e-cosmic
Risk: Medium — new isolated subdirectory, but big surface area with several latent bugs and reproducibility concerns
Issues found
🔴 Bugs
optimise.py uses an incompatible data shape in grid_search (line ~50). It calls Train(self.data, ...) where self.data is the dict from assemble_dataset() containing X, y, outer_folds, etc., but Train.__init__ indexes data["X_train"], data["X_test"], data["y_train"], ... — those keys don't exist on the assemble dict, so grid_search will raise a KeyError on first run. Only the grid_search_cv path goes through X_train/metadata_train correctly. → Either remove grid_search or require callers to pass a get_full_train_data(data) / get_fold_data(...) dict.
optimise.py: grid_search_cv also assumes X_train/y_train/metadata_train keys (lines ~135–140), but AssembleDataset.assemble_dataset() returns top-level X/y/metadata — same KeyError. The notebook must be doing one of these conversions implicitly; the public API is inconsistent.
process.py logger overwritten (lines ~138–142 inside __init__): logger is created as [MLPreprocess], then re-created as [MLPreprocessor] with no verbosity argument — silently drops the user-provided verbosity for the second instance. Remove the duplicate block.
run_ml_prep.py: the cuts_to_toggle argument is unused — run() accepts it but never forwards it to MLProcessor.
run_ml_prep.py is missing a newline at EOF; train.py likewise.
assemble.py hardcoded col_to_drop (line ~190): comment explicitly flags this as a smell. If the upstream process.py changes feature names, training silently drops/keeps wrong columns. Make this driven by feature_set or assert columns exist.
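One way to make the drop list feature_set-driven, as the bullet suggests — FEATURE_SETS and the column names below are placeholders, not the real process.py names:

```python
# Hypothetical sketch: derive the training columns from the feature set
# instead of hardcoding col_to_drop, and fail loudly on a mismatch.
FEATURE_SETS = {
    "crv": ["PEs", "nHits", "dt", "crv_z"],       # assumed feature names
    "trk": ["trk_mom", "trk_t0", "trk_d0"],
}

def training_columns(df_columns, feature_set):
    wanted = FEATURE_SETS[feature_set]
    missing = [c for c in wanted if c not in df_columns]
    if missing:
        # Upstream process.py renamed or dropped a feature: stop here
        # rather than silently training on the wrong columns.
        raise KeyError(f"feature_set '{feature_set}' expects missing columns: {missing}")
    return [c for c in df_columns if c in wanted]
```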
train.py: _fit_keras_model reuses the same compiled model instance across folds — train_cv instantiates a new Train per fold, but if a user passes a Keras model instance (rather than a class), all folds share weights. Not triggered for the default XGBoost path, but a footgun.
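A factory-based sketch that sidesteps the shared-instance footgun — train_cv here is a hypothetical signature, not the actual train.py API:

```python
def train_cv(make_model, folds):
    """Fit one fresh model per fold.

    make_model: zero-argument callable (a class, lambda, or functools.partial)
    that returns an unfitted estimator, so folds can never share weights.
    """
    models = []
    for X_train, y_train in folds:
        model = make_model()          # fresh, unfitted instance per fold
        model.fit(X_train, y_train)
        models.append(model)
    return models
```

Accepting a callable instead of a ready-made model guarantees per-fold isolation for Keras and XGBoost alike.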
validate.py: find_threshold ties handling — the scan uses event_max_proba >= thr and then picks above_target[-1] (the highest threshold still meeting the target). With a grid of 10,000 points this is fine, and since the signal/veto efficiency values are computed with the same >=, the operating-point math is self-consistent. However, signal_efficiency = 1 - deadtime here means 1 - FPR, not signal efficiency in the usual ML sense — misleading naming.
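The scan described above, re-implemented as a toy to make the >= convention and the 1 - FPR naming explicit (scan_operating_point is an illustrative stand-in for validate.py's find_threshold, not the real code):

```python
import numpy as np

def scan_operating_point(p_cosmic, p_signal, target_eff=0.995, n_grid=10_000):
    thresholds = np.linspace(0.0, 1.0, n_grid)
    # Veto fires when score >= thr (same convention as the scan above).
    veto_eff = np.array([(p_cosmic >= t).mean() for t in thresholds])
    above_target = thresholds[veto_eff >= target_eff]
    thr = above_target[-1]                 # highest threshold still meeting target
    deadtime = (p_signal >= thr).mean()    # fraction of signal falsely vetoed
    return thr, 1.0 - deadtime             # "signal efficiency" is really 1 - FPR
```

Because veto_eff is non-increasing in the threshold, the thresholds meeting the target form a prefix of the grid, so above_target[-1] is well-defined whenever target_eff <= 1.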
analyse.py: the within_t0err cut is registered only via the at_trk_mid mask, but the if self.on_spill: block gating within_t0 means the two cuts can disagree about which segments are masked. within_t0err is always defined regardless of on_spill, while within_t0 exists only on-spill — a cuts.yaml with within_t0: true, on_spill: false would silently skip the cut at runtime (no error).
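A fail-fast guard of the kind this bullet calls for — the key names mirror the cuts.yaml fields mentioned above, but validate_cutset itself is a sketch, not existing code:

```python
def validate_cutset(cfg):
    """Reject cut/config combinations that would otherwise be silently ignored."""
    if cfg.get("within_t0") and not cfg.get("on_spill"):
        # within_t0 is only registered inside the on_spill branch, so this
        # combination would silently skip the cut at runtime.
        raise ValueError("within_t0: true requires on_spill: true")
    return cfg
```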
🟡 Reproducibility / hygiene
No tests. The description says "ran end-to-end on the gpvms" — that's a manual check. At minimum, add a smoke test that imports each module and runs MLProcessor on the --test fixture.
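A minimal smoke-test helper along these lines (the module names would come from the real package; smoke_import itself is a sketch):

```python
import importlib
import importlib.util

def smoke_import(modules):
    """Try to import each named module; return (ok, missing) lists.

    Modules that are not on sys.path are reported rather than raised,
    so the same helper works in CI and in partially set-up environments.
    """
    ok, missing = [], []
    for name in modules:
        if importlib.util.find_spec(name) is None:
            missing.append(name)
            continue
        importlib.import_module(name)   # raises on broken module code
        ok.append(name)
    return ok, missing
```

A pytest wrapper could parametrise over the CrvCosmic module list and fail the build on any import error.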
warnings.filterwarnings("ignore") in assemble.py (module-level) suppresses warnings globally for any importer. Move into a context manager or remove.
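A scoped alternative using only the standard library — wrap the noisy call instead of muting the whole process (load_quietly is a hypothetical helper name):

```python
import warnings

def load_quietly(loader, *args, **kwargs):
    """Suppress warnings only for the duration of one call.

    catch_warnings() saves and restores the global filter state, so
    importers of this module keep their own warning configuration.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return loader(*args, **kwargs)
```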
sys.path.extend([...]) everywhere — src/core/analyse.py, src/ml/process.py, src/ml/assemble.py, src/ml/load.py, src/ml/optimise.py, run/run_ml_prep.py all mutate sys.path. Make CrvCosmic a proper package with __init__.py re-exports and use relative imports; today these __init__.py are empty.
Hardcoded run_str = "k" in run_ml_prep.py and default run="h" in LoadML, run="j" in Train/Optimise, run="k" in AssembleDataset — inconsistent defaults invite silently reading the wrong directory. Pick one source of truth (env var, CLI arg, or config file).
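One possible single source of truth — CRV_COSMIC_RUN is an invented environment-variable name, shown only to illustrate the precedence order (CLI over environment over one shared default):

```python
import os

def get_run_tag(cli_value=None, default="k"):
    """CLI argument wins, then the environment, then a single default.

    Every module (run_ml_prep.py, LoadML, Train, Optimise, AssembleDataset)
    would call this instead of carrying its own hardcoded run string.
    """
    return cli_value or os.environ.get("CRV_COSMIC_RUN", default)
```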
LoadML.get_results() hardcodes tags ("CRY_onspill-LH_aw", "CE_mix2BB_onspill-LH_aw") that must match run_ml_prep.py configs. Pull these from a shared config.
feature_set arg defaults differ: MLProcessor defaults to "trk", but the run_ml_prep.py test configs use "crv". The "trk" branch in _process_array doesn't set event/subrun, which downstream assemble.py requires → broken if anyone uses the default.
Large amount of commented-out code in process.py (the coinc_idx, _test, alternate branches sections). Strip before merging.
.ipynb files committed with outputs (5,320 lines total). Strip outputs (nbstripout) — notebooks dominate the diff and bloat the repo.
Broad except Exception as e handlers everywhere in analyse.py and process.py, all re-raised after a log — they add noise without adding context. Let exceptions propagate naturally except where you can actually recover.
No license header consistency — some files have Sam Grant 2025, most don't.
🟢 Nits
analyse.py: logger print_prefix mismatch in MLProcessor ([MLPreprocess] then [MLPreprocessor]).
optimise.py imports matplotlib.pyplot as plt at module top — the Mu2e style is only applied via Plot(verbosity=0), so plot results depend on import order.
validate.py: plot_score_distribution_cv uses alpha=0.1 per fold — invisible for >10 folds.
Not ready as-is. The two KeyError bugs in optimise.py and the silent-default feature_set="trk" issue are blockers — the public API doesn't match what notebooks/runners actually pass.
Suggested next steps
Fix grid_search / grid_search_cv data contract.
Make feature_set="crv" the default (or remove the "trk" branch).
Convert CrvCosmic into an importable package; drop sys.path hacks.
Strip notebook outputs and dead/commented code.
Add a minimal pytest that imports all modules + runs the --test path on tiny stub inputs.
Want me to (1) open issues for the blocker bugs, (2) draft a PR with the import/package cleanup, or (3) walk through a specific module (e.g. validate.py threshold math) in more depth?
Cosmic Ray Veto ML training code
Dependencies
Requires Mu2e/pyutils (provided by pyenv ana) for pyutils.pycut.CutManager, plus xgboost, scikit-learn, awkward, pandas, hist, pyarrow, h5py, joblib, and pyyaml.
Testing
Ran end-to-end on the gpvms (commit b78cf8f).