Skip to content

Fix bug in CPR#1813

Merged
oksuzian merged 1 commit intoMu2e:mainfrom
michaelmackenzie:CprBug
Apr 29, 2026
Merged

Fix bug in CPR#1813
oksuzian merged 1 commit intoMu2e:mainfrom
michaelmackenzie:CprBug

Conversation

@michaelmackenzie
Copy link
Copy Markdown
Contributor

A recently added divide by zero protection revealed a bug in CPR where a helix found with the wrong dphi/dz slope was still being passed to the final helix saving stage. These helices had uninitialized dphi/dz values, and are therefore now skipped. This lead to a seg fault when trying to access a pointer that is no longer created. I fixed this bug and added some printout in case a nullptr appears again, but we can now safely continuing processing (most importantly Online) in this case.

@FNALbuild
Copy link
Copy Markdown
Collaborator

Hi @michaelmackenzie,
You have proposed changes to files in these packages:

  • CalPatRec

which require these tests: build.

@Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main.

⌛ The following tests have been triggered for a15b23b: build (Build queue - API unavailable)

About FNALbuild. Code review on Mu2e/Offline.

if ( (Helix._szphi.qn() < minNFitHits) ||
((Helix._szphi.qn() >= minNFitHits) && (Helix._szphi.dfdz()*_dfdzsign < 0.)) ) {
if ( (Helix._szphi.qn() < minNFitHits) || // too few hits
(Helix._szphi.dfdz()*_dfdzsign < 0.) ) { // wrong slope
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above you changed Helix._szphi.dfdz() to Helix._dfdz. What is the difference between these 2 ways of accessing dfdz, and how is a user to know the right one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case the fit is not successful in the phi vs. z fit stage, it doesn't update the Helix_.dfdz parameters but the line fit object can still have non-zero results. The Helix._dfdz is what's actually used in the helix downstream. In cases where the fit is successful, I believe they only have numerical effect differences (this is what I saw through printouts at least). This is why this failure wasn't caught here though, this line fitter converged to a reasonable value but with the wrong slope, and that failure wasn't being properly checked. I'm not an expert on this algorithm unfortunately, so it's possible I misunderstood something in the code...

@FNALbuild
Copy link
Copy Markdown
Collaborator

☀️ The build tests passed at a15b23b.

Test Result Details
test with Command did not list any other PRs to include
merge Merged a15b23b at bca4e82
build (prof) Log file. Build time: 04 min 08 sec
ceSimReco Log file.
g4test_03MT Log file.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file. Return Code 1.
check_cmake Log file.
FIXME, TODO ➡️ TODO (0) FIXME (29) in 3 files
clang-tidy ➡️ 4 errors 741 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at a15b23b after being merged into the base branch at bca4e82.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@brownd1978
Copy link
Copy Markdown
Collaborator

brownd1978 commented Apr 28, 2026 via email

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

I would need to spend some time reading the code to better understand if the fitter ever diverges from the values desired to be used in the final helix before removing the final fit parameters from the internal data struct. I'll try to review the code more carefully this week to understand if it's fine to only use the fitter object as the source of fit results or if we still need to separately store the final values (and if so, I'll clean up the code to only use the fitter values while performing the fits).

One caveat is this currently causes a seg fault when running reconstruction. Do we want to merge this and make this improvement a separate PR or live with the seg fault for a few days? I can also make this my top priority for today if we want it solved (and solved correctly) today.

@brownd1978
Copy link
Copy Markdown
Collaborator

brownd1978 commented Apr 28, 2026 via email

@oksuzian
Copy link
Copy Markdown
Collaborator

PR #1813 Review — Fix bug in CPR

Summary

Fix bug in CPR by @michaelmackenzie addresses a segfault in the Calorimeter Pattern Recognition (CPR) helix finder. A previously added divide-by-zero guard exposed a latent bug: helices with the wrong dphi/dz slope were being passed to the final saving stage with uninitialized values, then dereferenced as a now-null pointer downstream.

Item Value
State Open, mergeable: clean, not draft
Files 3 changed (+23 / −10), 1 commit
Reviews 2 inline comments (1 thread with @brownd1978), no formal approval yet — but @brownd1978 said "I can approve this as a bug fix"
CI FNALbuild ✅ (build prof, all sim/reco tests pass; trigger job RC=1 ⚠️; clang-tidy 4 errors / 741 warnings — informational)
Risk Low–Medium — narrow, defensive change; semantics of Helix._dfdz vs Helix._szphi.dfdz() deserve follow-up

Core changes

1. CalHelixFinderData.hh — default-initialize raw pointers

- const TimeCluster*  _timeCluster;
+ const TimeCluster*  _timeCluster = nullptr;
- HelixTraj*          _helix;
+ HelixTraj*          _helix       = nullptr;

Good defensive change — eliminates the UB that was the proximate cause of the segfault.

2. CalHelixFinderAlg.cc — propagate the bad-helix signal

  • When radius == 0 or dfdz == 0, set Helix._helix = nullptr and emit a diagnostic instead of silently returning.
  • findHelix return value is now retval = Helix.helix() != nullptr; (was unconditional true).
  • New early-out before rescueHitsBeforeSeed: if (!rc) goto PATTERN_RECOGNITION_END; so a failed linear fit doesn't leak into the rescue stage.
  • Slope check switched from Helix._szphi.dfdz() to Helix._dfdz (see the brownd1978 thread — the latter is what's actually used downstream, the former can have stale/non-zero values when the φ-z fit didn't update them).
  • Drive-by fix: "...../n""...\\n" in a debug printf (real bug — was emitting /n instead of newline).

3. CalHelixFinder_module.cc — guard the consumer

int rc = _hfinder.findHelix(tmpResult);
if (!rc) continue;
+ if (!tmpResult.helix()) {
+   std::cout << "[CalHelixFinder::" << __func__ << "] " << event.id()
+             << " Helix found but nullptr returned!!\n";
+   continue;
+ }

Defense in depth — even if findHelix returns success, the consumer skips the seed when the underlying HelixTraj* is null.

Issues / concerns found

🟡 1. Logic change in the slope check is semantically broader than stated

In the original:

if ((qn < minNFitHits) || ((qn >= minNFitHits) && (slope*_dfdzsign < 0.))) success = false;

This was equivalent to if (qn < minNFitHits || slope*_dfdzsign < 0.) — the qn >= minNFitHits guard was redundant. The new code simplifies to that, but also switches _szphi.dfdz()_dfdz. As @brownd1978 flagged, having two sources of truth (Helix._dfdz vs Helix._szphi.dfdz()) is itself a smell. The author acknowledged it and said it'll be cleaned up in a follow-up. Recommend filing a tracking issue so this doesn't get lost.

🟡 2. std::cout for error diagnostics

Both new prints ("Bad Helix parameters!" and "Helix found but nullptr returned!!") use std::cout and have no rate-limiting or _debug gate. In production / Online, these conditions appear to be event-rate-correlated (this is exactly the bug they're catching) — they could spam stdout. Consider:

  • Using mf::LogWarning / mf::LogError (art's MessageFacility) instead of std::cout, or
  • Gating on _debug > 0, or
  • Using an mf::LogWarning with a category so operators can throttle.

🟡 3. goto PATTERN_RECOGNITION_END chain extended

A third goto was added. The function already uses goto, so this is locally consistent, but the cumulative control flow into PATTERN_RECOGNITION_END now spans many scenarios. Not a blocker, but a candidate for a small refactor (early-return helper) in a follow-up.

🟢 4. _helix = nullptr is set on bad-param path, but is it cleared on every other failure path?

The new guarantee — findHelix returns false ⇒ tmpResult.helix() == nullptr — is enforced via retval = Helix.helix() != nullptr; at the end of defineHelixParams(Helix). That's the right place, but worth a quick verification that no earlier return paths leave _helix in a stale non-null state from a previous iteration. Since tmpResult is a fresh local in the module loop, and _helix now defaults to nullptr, this looks correct — just call out for the reviewer.

🟢 5. Minor: typo-style printf bugfix is silently bundled

The "...../n""...\\n" fix is correct but unrelated to the segfault. Consider a one-line note in the PR description so reviewers don't wonder. Not a blocker.

🟢 6. Trigger CI job

The FNALbuild table shows trigger with Return Code 1 (❓). Worth confirming whether that's pre-existing flakiness or related — the reset of CalPatRec behavior in a trigger context is exactly what this PR touches.

Merge readiness

  • Blocking? No formal blocking review. @brownd1978 indicated willingness to approve as a bug fix.
  • Recommended before merge:
    1. Confirm the trigger CI failure is unrelated (or document why).
    2. Either downgrade the new std::cout prints to mf::LogWarning, or accept and open an issue.
  • Recommended after merge (follow-up PR, per author): consolidate Helix._dfdz and Helix._szphi.dfdz() to a single source of truth.

Given it's a real segfault in production reconstruction, merging the bug fix and tracking the cleanup as a follow-up issue (as the author proposed) is reasonable.

Want me to:

  1. Open a tracking issue for the _dfdz / _szphi.dfdz() consolidation cleanup.
  2. Investigate the trigger CI failure (RC=1) on this build.
  3. Suggest a concrete patch swapping std::cout for mf::LogWarning.
  4. Trace _helix lifetime across all failure paths in findHelix to confirm the new invariant holds.

@oksuzian
Copy link
Copy Markdown
Collaborator

Investigation: Do Helix._szphi and Helix._dfdz ever diverge?

Short answer: yes — transiently by design, and persistently on every failure path. The persistent case is exactly the bug class PR #1813 fixes. So Michael's instinct is right, and the consolidation is not a one-line refactor.

Below is the trace.


The two storage locations

::LsqSums4   _sxy;
::LsqSums4   _szphi;          // stateful line-fitter accumulator
...
float        _dfdz;           // "official" stored slope used downstream
float        _fz0;            // "official" stored phase
  • _szphi is a stateful accumulator (LsqSums4). addPoint / removePoint mutate it; dfdz()/phi0() are derived from its current sums.
  • _dfdz / _fz0 are plain scalar snapshots intended as the canonical post-fit result.

There is no enforced invariant linking them.


Where each is written, in pipeline order

doPatternRecognitionfindDfDzdoLinearFitPhiZdefineHelixParams

1. Initial estimate (CalHelixFinderAlg.cc:~2944)

Helix._dfdz = dfdz;
Helix._fz0  = phi0 - dfdz*z_phi0;

Writes _dfdz only. _szphi not touched.

2. findDfDz (an MPV-style estimator, line ~2967)

int rs = findDfDz(Helix, SeedIndex);
if (rs == 1) {
  Helix._dfdz = _hdfdz;
  Helix._fz0  = _hphi0;
}

Again writes _dfdz only. At this point _szphi is stale (from a previous iteration or empty). They are now out of sync, but it doesn't matter yet because doLinearFitPhiZ will rebuild _szphi from scratch.

3. doLinearFitPhiZ — the iterative line fit (lines ~1011–1080)

This function repeatedly calls Helix._szphi.addPoint(...) / .removePoint(...). Throughout this loop, _szphi.dfdz() evolves continuously while Helix._dfdz stays frozen at the value from step 2.

The writeback to _dfdz/_fz0 happens only on success, at the very end:

if ( (Helix._szphi.qn() < minNFitHits) ||
     ((Helix._szphi.qn() >= minNFitHits) && (Helix._szphi.dfdz()*_dfdzsign < 0.)) ) {
  success = false;
}
else if (success) {                               // update helix results
  Helix._fz0  = Helix._szphi.phi0();
  Helix._dfdz = Helix._szphi.dfdz();
}

So:

  • Success branch: _dfdz == _szphi.dfdz() exactly, by construction.
  • Failure branch: _dfdz retains its pre-doLinearFitPhiZ value (a stale estimate from findDfDz or the initial computation), while _szphi.dfdz() is the slope of the failed line fit. They diverge and stay diverged.

This is the smoking gun for PR #1813's bug:

The line fitter converged to a reasonable-looking slope but with the wrong sign (_szphi.dfdz()*_dfdzsign < 0). The check used to pass on _szphi.dfdz() so the wrong-sign condition triggered success = false correctly — but then the code in the helix-finder module kept using Helix._dfdz (still the OK pre-fit estimate) and propagated downstream as if all were well. Later code paths assumed _helix != nullptr and dereferenced it.

4. Downstream consumer: defineHelixParams (lines 73–98)

if (radius == 0. || Helix._dfdz == 0.) return;          // ← guard added
...
pvec[HelixTraj::tanDipIndex] = amsign/(radius*Helix._dfdz);   // only _dfdz
float dphi = deltaPhi(Helix._fz0+amsign*halfpi, ...);          // only _fz0

The final HelixTraj is constructed exclusively from _dfdz/_fz0. _szphi is never read here. That's important for the consolidation discussion below.


Where _szphi.dfdz() is read elsewhere

A grep across CalPatRec finds _szphi.dfdz() / _szphi.phi0() called in:

Site Purpose Diverges from _dfdz?
doLinearFitPhiZ internal residual queries (findGoodFaceHitInFitPhiZ, findWorstChi2HitInFitPhiZ, findWorstResidHitInFitPhiZ) Live working state during iteration Yes (intentional) — the loop needs the live value
doLinearFitPhiZ end-of-loop wrong-slope check Validation Yes on failure — this is the divergence the PR moved away from
Helix._diag.dfdz_8 = Helix._szphi.dfdz(); (diagnostics block) Histogramming Yes — diagnostics see the live fit value, not the stored one
_data.chi2ZPhi[loc] = _hfResult._szphi.chi2DofLine(); in CalHelixFinder_module.cc Quality metric reporting N/A (chi2, not slope)
CalHelixFinderData::print Debug printout Logs both side-by-side — even the print function tacitly admits they can differ

The most telling evidence is print() itself — it prints _szphi(phi0, df/dz, chi2) and _dfdz, _fz0 on separate lines. Whoever wrote it knew they could disagree.


So: does the fitter ever diverge from the values desired in the final helix?

Scenario _szphi.dfdz() vs _dfdz Used downstream?
Inside doLinearFitPhiZ mid-iteration Diverge — by design (_szphi is live, _dfdz is stale) No — only the inner helpers read _szphi here
doLinearFitPhiZ success path, after writeback Equal _dfdz only
doLinearFitPhiZ failure path Diverge persistently_szphi = failed-fit slope, _dfdz = previous-stage estimate This was the bug: failure not recognized, _dfdz propagated, _helix=nullptr dereferenced
Between findDfDz and doLinearFitPhiZ rebuild Diverge Doesn't matter — _szphi gets cleared and refilled

The "values desired in the final helix" are unambiguously _dfdz and _fz0 — that's the only thing defineHelixParams consumes. So the question reduces to: can _dfdz carry a value that does not reflect the most recent successful line fit? Yes — whenever doLinearFitPhiZ fails, by construction.


Implications for the consolidation refactor

This is why Michael correctly hesitated. There are two superficially symmetric paths and only one is safe:

❌ "Delete _dfdz, always use _szphi.dfdz()"

Wrong. _szphi reflects the state of whichever fit ran last, including failed ones. It also doesn't carry the findDfDz MPV estimate at all (that's not produced by LsqSums4). Routing defineHelixParams through _szphi.dfdz() would introduce the wrong-slope bug into the final helix instead of fixing it.

❌ "Delete _szphi, always use _dfdz"

Also wrong. _szphi is a working accumulator: addPoint/removePoint during refinement, chi2DofLine() for quality, residual queries. You can't replace it with a scalar.

✅ The correct shape of the refactor

Treat them as having distinct roles and enforce that in code:

  1. _szphi is internal scratch state of doLinearFitPhiZ. It should not be read for "result" parameters anywhere outside that function and its helpers. Keep it for chi2 and residuals.
  2. _dfdz / _fz0 are the canonical fit result. All downstream code (defineHelixParams, the module loop, HelixSeed construction) reads only these.
  3. Establish an exit invariant for every helix-finding stage:
    On exit, either Helix._dfdz/_fz0 reflect a successful, sign-correct fit and Helix._helix != nullptr, OR Helix._fit carries a failure code and Helix._helix == nullptr.
  4. Reset on entry to each stage that might overwrite: set _dfdz to a sentinel (e.g. NaN) so a failure leaves no usable stale value lying around — failures become loud, not silent.
  5. Audit the diagnostics writes_diag.dfdz_8 = _szphi.dfdz() should probably be _diag.dfdz_8 = _dfdz so plots reflect the helix that was actually emitted. (Today they secretly tell you about the line-fit's last state, even on failed fits — useful for debugging the algorithm, misleading for physics.)

PR #1813 already enacts (3) and parts of (1) — it switched the wrong-slope check to read _dfdz (consistent with downstream), it sets _helix = nullptr on bad params, and it makes the consumer skip on helix() == nullptr. The remaining work is tightening the invariant in the failure paths of doLinearFitPhiZ itself (set _dfdz to a sentinel and _fit to a failure code unconditionally on success == false), and de-duplicating the diagnostics reads.


Concrete answer to feed back to David

"Yes, _szphi and _dfdz diverge — both transiently during the line-fit iteration (intentional) and persistently on every failure path of doLinearFitPhiZ (unintentional, and the root of the segfault). The consolidation cannot collapse them into one field — _szphi is a stateful accumulator with its own job. What we can do, and what this PR partially does, is make _dfdz/_fz0 the sole canonical result and rule that _szphi is internal scratch state. The remaining cleanup is a follow-up to (a) explicitly invalidate _dfdz on every failure exit of doLinearFitPhiZ, (b) move the diagnostics writes (_diag.dfdz_8 etc.) from _szphi.dfdz() to _dfdz, and (c) remove any other external readers of _szphi.dfdz()/_szphi.phi0()."

Want me to draft that follow-up PR / tracking issue with the specific call sites enumerated?

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

From the AI reviews, it seems that overall these changes are reasonable and could be merged. The next step will be cleaning and potentially refactoring some of this code to ensure the state is being properly reset when exiting, only final values are being used outside of the fitting functions, and perhaps some organizational changes to make it more maintainable. I've started these next stage improvements and will hopefully be able to finish this week.

Does this sound like a reasonable plan, @brownd1978 and @oksuzian?

Copy link
Copy Markdown
Collaborator

@brownd1978 brownd1978 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve as an urgent bug fix, with additional restructuring to come in future PRs.

@oksuzian oksuzian merged commit 1ceaf43 into Mu2e:main Apr 29, 2026
14 checks passed
@michaelmackenzie michaelmackenzie deleted the CprBug branch April 29, 2026 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants