Hi @michaelmackenzie,
which require these tests: build. @Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main. ⌛ The following tests have been triggered for a15b23b: build (Build queue - API unavailable) |
- if ( (Helix._szphi.qn() < minNFitHits) ||
-      ((Helix._szphi.qn() >= minNFitHits) && (Helix._szphi.dfdz()*_dfdzsign < 0.)) ) {
+ if ( (Helix._szphi.qn() < minNFitHits) || // too few hits
+      (Helix._szphi.dfdz()*_dfdzsign < 0.) ) { // wrong slope
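The old and new forms of this condition are logically equivalent: in `A || (!A && B)` the right-hand side is only evaluated when `A` is false, so the extra `qn() >= minNFitHits` guard adds nothing. A minimal sketch that checks this exhaustively; the helper names `oldCondition`, `newCondition`, and `conditionsAgree` are hypothetical and not part of Offline:

```cpp
// Hypothetical helpers (not Offline code) mirroring the two condition shapes.
//   tooFewHits  stands for  Helix._szphi.qn() < minNFitHits
//   wrongSlope  stands for  Helix._szphi.dfdz()*_dfdzsign < 0.
bool oldCondition(bool tooFewHits, bool wrongSlope) {
  return tooFewHits || (!tooFewHits && wrongSlope);
}

bool newCondition(bool tooFewHits, bool wrongSlope) {
  return tooFewHits || wrongSlope;
}

// Returns true if both forms agree on all four input combinations.
bool conditionsAgree() {
  for (int a = 0; a < 2; ++a) {
    for (int b = 0; b < 2; ++b) {
      if (oldCondition(a != 0, b != 0) != newCondition(a != 0, b != 0)) {
        return false;
      }
    }
  }
  return true;
}
```

So this part of the change is a readability cleanup only; it does not alter which helices are rejected.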
Above you changed Helix._szphi.dfdz() to Helix._dfdz. What is the difference between these 2 ways of accessing dfdz, and how is a user to know the right one?
If the phi vs. z fit stage is not successful, it doesn't update the Helix._dfdz parameter, but the line fit object (Helix._szphi) can still hold non-zero results. Helix._dfdz is what's actually used in the helix downstream. When the fit is successful, I believe the two differ only at the level of numerical effects (this is what I saw through printouts, at least). That is why this failure wasn't caught here: the line fitter converged to a reasonable value but with the wrong slope, and that failure wasn't being properly checked. I'm not an expert on this algorithm, unfortunately, so it's possible I misunderstood something in the code...
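To make the distinction concrete, here is a minimal sketch of how a live fitter value and a stored result can drift apart after a failed fit. The types `LineFit` and `HelixData` and the function `fitPhiZ` are simplified, hypothetical stand-ins, not the actual CalPatRec classes, and the success test is reduced to the slope-sign check only:

```cpp
// Hypothetical stand-ins, not the real Mu2e classes.
struct LineFit {             // plays the role of the _szphi fitter object
  double slope = 0.0;
  double dfdz() const { return slope; }
};

struct HelixData {           // plays the role of the helix finder data
  LineFit szphi;             // live fit state, overwritten on every attempt
  double  dfdz = 0.0;        // stored result, written back only on success
};

// One fit attempt: the fitter always records its slope, but the stored value
// is updated only when the slope has the expected sign.
bool fitPhiZ(HelixData& helix, double fittedSlope, int dfdzSign) {
  helix.szphi.slope = fittedSlope;
  const bool success = (fittedSlope * dfdzSign >= 0.0);
  if (success) {
    helix.dfdz = helix.szphi.dfdz();
  }
  return success;
}
```

In this sketch, a successful attempt followed by a wrong-slope attempt leaves `helix.dfdz` at the earlier value while `helix.szphi.dfdz()` reports the failed slope, so a check made against one accessor says nothing about the other.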
|
☀️ The build tests passed at a15b23b.
N.B. These results were obtained from a build of this Pull Request at a15b23b after being merged into the base branch at bca4e82. For more information, please check the job page here. |
|
Redundant information is a known source of bugs. I suggest using this update to consolidate the data payload and eliminate the possibility of inconsistency.
--
David Brown ***@***.***
Office Phone (510) 486-7261
Lawrence Berkeley National Lab
M/S 50R5008 (50-6026C) Berkeley, CA 94720
|
|
I would need to spend some time reading the code to better understand if the fitter ever diverges from the values desired to be used in the final helix before removing the final fit parameters from the internal data struct. I'll try to review the code more carefully this week to understand if it's fine to only use the fitter object as the source of fit results or if we still need to separately store the final values (and if so, I'll clean up the code to only use the fitter values while performing the fits). One caveat is this currently causes a seg fault when running reconstruction. Do we want to merge this and make this improvement a separate PR or live with the seg fault for a few days? I can also make this my top priority for today if we want it solved (and solved correctly) today. |
|
I can approve this as a bug fix.
|
PR #1813 Review: Fix bug in CPR
Summary
"Fix bug in CPR" by @michaelmackenzie addresses a segfault in the Calorimeter Pattern Recognition (CPR) helix finder. A previously added divide-by-zero guard exposed a latent bug: helices with the wrong
Core changes
1.
|
Investigation: Do
|
| Site | Purpose | Diverges from _dfdz? |
|---|---|---|
| doLinearFitPhiZ internal residual queries (findGoodFaceHitInFitPhiZ, findWorstChi2HitInFitPhiZ, findWorstResidHitInFitPhiZ) | Live working state during iteration | Yes (intentional): the loop needs the live value |
| doLinearFitPhiZ end-of-loop wrong-slope check | Validation | Yes on failure: this is the divergence the PR moved away from |
| Helix._diag.dfdz_8 = Helix._szphi.dfdz(); (diagnostics block) | Histogramming | Yes: diagnostics see the live fit value, not the stored one |
| _data.chi2ZPhi[loc] = _hfResult._szphi.chi2DofLine(); in CalHelixFinder_module.cc | Quality metric reporting | N/A (chi2, not slope) |
| CalHelixFinderData::print | Debug printout | Logs both side by side; even the print function tacitly admits they can differ |
The most telling evidence is print() itself — it prints _szphi(phi0, df/dz, chi2) and _dfdz, _fz0 on separate lines. Whoever wrote it knew they could disagree.
So: does the fitter ever diverge from the values desired in the final helix?
| Scenario | _szphi.dfdz() vs _dfdz | Used downstream? |
|---|---|---|
| Inside doLinearFitPhiZ mid-iteration | Diverge, by design (_szphi is live, _dfdz is stale) | No: only the inner helpers read _szphi here |
| doLinearFitPhiZ success path, after writeback | Equal | _dfdz only |
| doLinearFitPhiZ failure path | Diverge persistently: _szphi = failed-fit slope, _dfdz = previous-stage estimate | This was the bug: failure not recognized, _dfdz propagated, _helix=nullptr dereferenced |
| Between findDfDz and doLinearFitPhiZ rebuild | Diverge | Doesn't matter: _szphi gets cleared and refilled |
The "values desired in the final helix" are unambiguously _dfdz and _fz0 — that's the only thing defineHelixParams consumes. So the question reduces to: can _dfdz carry a value that does not reflect the most recent successful line fit? Yes — whenever doLinearFitPhiZ fails, by construction.
Implications for the consolidation refactor
This is why Michael correctly hesitated. There are two superficially symmetric paths and only one is safe:
❌ "Delete _dfdz, always use _szphi.dfdz()"
Wrong. _szphi reflects the state of whichever fit ran last, including failed ones. It also doesn't carry the findDfDz MPV estimate at all (that's not produced by LsqSums4). Routing defineHelixParams through _szphi.dfdz() would introduce the wrong-slope bug into the final helix instead of fixing it.
❌ "Delete _szphi, always use _dfdz"
Also wrong. _szphi is a working accumulator: addPoint/removePoint during refinement, chi2DofLine() for quality, residual queries. You can't replace it with a scalar.
✅ The correct shape of the refactor
Treat them as having distinct roles and enforce that in code:
1. _szphi is internal scratch state of doLinearFitPhiZ. It should not be read for "result" parameters anywhere outside that function and its helpers. Keep it for chi2 and residuals.
2. _dfdz/_fz0 are the canonical fit result. All downstream code (defineHelixParams, the module loop, HelixSeed construction) reads only these.
3. Establish an exit invariant for every helix-finding stage: on exit, either Helix._dfdz/_fz0 reflect a successful, sign-correct fit and Helix._helix != nullptr, OR Helix._fit carries a failure code and Helix._helix == nullptr.
4. Reset on entry to each stage that might overwrite: set _dfdz to a sentinel (e.g. NaN) so a failure leaves no usable stale value lying around; failures become loud, not silent.
5. Audit the diagnostics writes: _diag.dfdz_8 = _szphi.dfdz() should probably be _diag.dfdz_8 = _dfdz so plots reflect the helix that was actually emitted. (Today they secretly tell you about the line-fit's last state, even on failed fits: useful for debugging the algorithm, misleading for physics.)
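The exit invariant and the reset-on-entry sentinel from the list above could look roughly like the following. This is only a sketch under stated assumptions: `Helix`, `HelixData`, `invalidate`, `invariantHolds`, and `runStage` are hypothetical names, a plain `bool` stands in for the status carried by `_fit`, and the only failure modeled is the wrong-slope case.

```cpp
#include <cmath>
#include <limits>
#include <memory>

// Hypothetical stand-ins, not the real Mu2e classes.
struct Helix { double dfdz; double fz0; };

struct HelixData {
  double dfdz  = std::numeric_limits<double>::quiet_NaN();
  double fz0   = std::numeric_limits<double>::quiet_NaN();
  bool   fitOk = false;              // stands in for a status code in _fit
  std::unique_ptr<Helix> helix;      // stands in for _helix

  // Reset on entry so that a failed stage leaves no usable stale value.
  void invalidate() {
    dfdz = fz0 = std::numeric_limits<double>::quiet_NaN();
    fitOk = false;
    helix.reset();
  }

  // Exit invariant: success implies finite parameters and a helix;
  // failure implies no helix.
  bool invariantHolds() const {
    const bool finite = std::isfinite(dfdz) && std::isfinite(fz0);
    return fitOk ? (finite && helix != nullptr) : (helix == nullptr);
  }
};

// One stage: invalidate first, publish the result only on a sign-correct fit.
bool runStage(HelixData& data, double slope, double intercept, int dfdzSign) {
  data.invalidate();
  if (slope * dfdzSign < 0.0) return false;   // wrong slope: stay invalid
  data.dfdz  = slope;
  data.fz0   = intercept;
  data.helix = std::make_unique<Helix>(Helix{slope, intercept});
  data.fitOk = true;
  return true;
}
```

With this shape, any downstream code that reads the stored slope after a failed stage gets NaN rather than a plausible number left over from an earlier stage.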
PR #1813 already enacts (3) and parts of (1) — it switched the wrong-slope check to read _dfdz (consistent with downstream), it sets _helix = nullptr on bad params, and it makes the consumer skip on helix() == nullptr. The remaining work is tightening the invariant in the failure paths of doLinearFitPhiZ itself (set _dfdz to a sentinel and _fit to a failure code unconditionally on success == false), and de-duplicating the diagnostics reads.
Concrete answer to feed back to David
"Yes, _szphi and _dfdz diverge, both transiently during the line-fit iteration (intentional) and persistently on every failure path of doLinearFitPhiZ (unintentional, and the root of the segfault). The consolidation cannot collapse them into one field: _szphi is a stateful accumulator with its own job. What we can do, and what this PR partially does, is make _dfdz/_fz0 the sole canonical result and rule that _szphi is internal scratch state. The remaining cleanup is a follow-up to (a) explicitly invalidate _dfdz on every failure exit of doLinearFitPhiZ, (b) move the diagnostics writes (_diag.dfdz_8 etc.) from _szphi.dfdz() to _dfdz, and (c) remove any other external readers of _szphi.dfdz()/_szphi.phi0()."
Want me to draft that follow-up PR / tracking issue with the specific call sites enumerated?
|
From the AI reviews, it seems that overall these changes are reasonable and could be merged. The next step will be cleaning and potentially refactoring some of this code to ensure the state is being properly reset when exiting, only final values are being used outside of the fitting functions, and perhaps some organizational changes to make it more maintainable. I've started these next stage improvements and will hopefully be able to finish this week. Does this sound like a reasonable plan, @brownd1978 and @oksuzian? |
brownd1978
left a comment
Approve as an urgent bug fix, with additional restructuring to come in future PRs.
A recently added divide-by-zero protection revealed a bug in CPR where a helix found with the wrong dphi/dz slope was still being passed to the final helix saving stage. These helices had uninitialized dphi/dz values, and are therefore now skipped. This led to a seg fault when trying to access a pointer that is no longer created. I fixed this bug and added some printout in case a nullptr appears again, but we can now safely continue processing (most importantly Online) in this case.
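As an illustration of the skip-and-report behavior described here, a consumer loop can guard against a missing helix along these lines. This is a sketch only: `Helix`, `HelixResult`, and `saveHelices` are hypothetical names and do not reproduce the actual module code or its printout.

```cpp
#include <iostream>
#include <memory>
#include <vector>

// Hypothetical stand-ins, not the real Mu2e classes.
struct Helix { double dfdz; };

struct HelixResult {
  std::unique_ptr<Helix> helixPtr;   // empty when no valid helix was built
  const Helix* helix() const { return helixPtr.get(); }
};

// Save only candidates that carry a helix; report and skip the others.
std::vector<double> saveHelices(const std::vector<HelixResult>& results) {
  std::vector<double> saved;
  for (const auto& r : results) {
    if (r.helix() == nullptr) {
      std::cerr << "saveHelices: null helix, skipping candidate\n";
      continue;
    }
    saved.push_back(r.helix()->dfdz);
  }
  return saved;
}
```

The point of the guard is that processing continues past a bad candidate instead of crashing, while the message keeps the condition visible.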