Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix subtle floating point bug in ProbabilityTree #1080

Merged
merged 6 commits into from
May 14, 2020
Merged

Conversation

rmjarvis
Copy link
Member

@jchiang87 found a bug in drawing InterpolatedImage objects when there were a bunch of pixels with values around 1.e-16 -- 1.e-12 of the total flux where it would seg fault on the draw command.

It turned out to be due to a subtle bug in the ProbabilityTree code where some sums could have rounding errors that caused some of the totals to get out of sync, eventually leading a pointer to go one higher than it should have been able to. Once that happened, it went into a loop that sent it off into the depths of the computer's memory dereferencing things willy nilly until it seg faulted.

Anyway, the solution was to add an extra check to make sure that the variable (mid) doesn't go past the part of the array being worked on (end), rather than relying on the floating point math to ensure that, since it doesn't necessarily do so, apparently.

Along the way, I was also getting frustrated trying to get the asserts to trigger, since the python compiler automatically sets -DNDEBUG, which turns off asserts, and it's hard to undo that it turns out. Plus, the regular assert bombs out with an abort, which isn't very python-friendly.

So I rolled my own assert statement (copied from TreeCorr actually), which will now always run some checks regardless of NDEBUG status to make sure we don't get seg faults if things aren't what we expect. And if they do hit, it raises a runtime_error, rather than abort, so a bit less catastrophic when running from python.

We have some other asserts throughout the code, and I checked that they are all not doing any significant calculations. Most of them check for non-NULL pointers and similar things that would lead to badness if they were false. So probably worth keeping them all turned on all the time.

There is also an xassert we use in various places which is (still) not run unless DEBUGLOGGING is turned on. So those are mostly for trying to track down bugs. Some of these do have arithmetic in them, but they turn into a no op in normal running, so that's ok.

@rmjarvis
Copy link
Member Author

Note: This is listed to merge into releases/2.2. Once merged, I'll also cherry-pick it to master. But we should probably cut a bugfix release in the 2.2 series with this fix.

# But I'm finding S in the reference, and U in info.
c = info.dtype[name].char
c = 'S' if c == 'U' else c
assert dt.char == c, "column %s is the wrong type"%name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is unrelated to the topic of this PR. There was apparently a change in behavior of either meds or fitsio that caused this test to start failing. I already fixed it on master, but not yet on releases/2.2.

@@ -170,19 +175,22 @@ namespace galsim {
// stops quickly with the large flux Elements on the left.
double half_tot = absFlux/2.;
double leftSum=0.;
for (; leftSum < half_tot; ++mid) leftSum += std::abs((*mid)->getFlux());
for (; leftSum < half_tot && mid < end; ++mid) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specifically is where the bug was. Added && mid < end here, since the leftSum with rounding can remain less than half_tot, even though 2*half_tot is supposed to be the sum of everything (in exact arithmetic).

Copy link

@jchiang87 jchiang87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I guess it's not worth trying to address this with arbitrary precision arithmetic since this concerns extremely low probability draws.

@rmjarvis rmjarvis merged commit 7d9f733 into releases/2.2 May 14, 2020
@rmjarvis rmjarvis deleted the glagn branch May 14, 2020 03:57
@rmjarvis rmjarvis added this to the v2.2 milestone Jun 11, 2020
@rmjarvis rmjarvis added bug report Bug report numerics Involving details of the numerical algorithms for performing some calculation(s) labels Jun 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug report Bug report numerics Involving details of the numerical algorithms for performing some calculation(s)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants