hypothesis should provide assert_almost_equal() #2180
Comments
Rather than providing any particular support for this, it seems like this would fit in well with the new targeted property-based testing features. You could set the test to maximize the error. Currently we don't do anything very sensible about integrating targeted testing with multiple errors, but it's probably a thing we should do.
Maybe I don't understand how the targeted testing works, but it seems like it is oriented towards finding failures; that's not the problem I am trying to address here. Hypothesis is doing a perfectly acceptable job of finding failures. My problem is that the examples it reports only barely fail, so I can't tell how serious the failure really is.
Hey @aarchiba - thanks so much for bringing this to our attention! I think shipping assertion helpers should probably remain out of scope for Hypothesis, but if shrinking is making it harder to diagnose failures, that's definitely a problem. In this case, I think we should fix #1704 and see if that's sufficient.
The goal is to provide a "degree of failure" indication to hypothesis, because often finding an example that just barely fails is not as convincing as finding an example that fails badly. I realize this is sometimes in competition with the desire to find the simplest possible failure, but it is a real frustration. Frequently I have floating-point tests that fail because I only allowed for 2 ULP of error and this one case can produce 3. So I twiddle the threshold and rerun the test. By the time I get up to 8 ULP I start to suspect it's a more serious failure and start trying tens and hundreds. If those fail, it's a different kind of problem than I thought it was. This iterative process is frustrating, and will not be improved by different methods of simplification or better means of finding failures. I agree that assertion helpers don't seem particularly on-topic for hypothesis. But how else is it supposed to obtain degree-of-failure information? Obviously not every test has a meaningful notion of degree of failure, but floating-point ones often do. (Array-based tests may as well, for example disagreement in shape, in many places, or in only one.) I was pleased to find D. R. MacIver's article on the threshold problem, as it described a problem I had very well, and I am attempting to offer a solution along the lines of the one already adopted for deadline failures (where hypothesis has easy access to a degree-of-failure measure).
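For concreteness, here is a minimal sketch (mine, not from the thread) of the kind of ULP-threshold test being described; the computation, strategy bounds, and threshold are placeholders, and a test like this can easily fail at 2 ULP, which is exactly the twiddling loop described above:

```python
import math

import numpy as np
from hypothesis import given, strategies as st

MAX_ULPS = 2  # the knob that gets twiddled upwards (3, 8, tens, hundreds...) while investigating


@given(st.floats(min_value=1e-6, max_value=1e6))
def test_log_exp_roundtrip(x):
    result = np.exp(np.log(x))                  # stand-in for the computation under test
    error_ulps = abs(result - x) / math.ulp(x)  # error measured in units in the last place
    assert error_ulps <= MAX_ULPS
```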
To be clear, I definitely want to fix the problem at the level of user experience - if shrinking or other improvements do so, that's great; if they don't, we still have a problem. In brainstorming mode:
I guess the question is whether hypothesis has a use for a degree-of-failure measure, as distinct from a complexity-of-example measure. In a way this does connect to the targeted PBT stuff, in the sense that the utility value used for targeted property-based testing is also a sort of degree-of-failure measure, though the TPBT utility value is for things that don't fail yet, while the degree-of-failure measure is for things that already fail, and it's a sort of proxy for nature-of-failure. As for numpy picking up a multiple-assertions version of its tests, I think it unlikely they will be interested, as the extra assertions are redundant outside the context of hypothesis; and as implemented above it has nothing to do with any numpy machinery (in fact I am using its polymorphism on astropy Time and TimeDelta objects). Perhaps the apparent expansion of scope here is real: this is hypothesis expanding from discrete tests (e.g. string processing, graph properties) into continuous ones (floating-point calculations and accuracy budgets). So it shouldn't be surprising if hypothesis needs new kinds of tools if it wants to work well in this new context.
An example of this in use is at astropy/astropy#9532. Of course the iterative problem-hunting stage is not very visible in PRs.
What I had in mind in suggesting targeted property-based testing is that the targets can kinda serve both purposes, and I think to some degree we need something that supports both purposes if we're going to handle this: if you want to know how large an error can be, then it makes sense to try to drive that error upwards even if it's currently finding failures. My suggestion would be that we change how tests with targeting are reported: we currently will only report distinct errors, but we could expand the scope to report both the distinct errors and also, for any given target score, the largest failing example. So e.g. if you had:

```python
error = abs(a - b)
target(error)
assert error <= 10 ** (-3)
```

then Hypothesis would report both the minimal failing example and the failing example with the largest value of the target.
We might need to think a bit more about how to communicate that this is what's going on, but unless I'm misunderstanding your use case I think this would cover it?
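To make the suggested pattern concrete, here is a hedged, self-contained sketch of what such a test might look like; the property, strategy, and threshold are invented for illustration:

```python
import math

from hypothesis import given, strategies as st, target


@given(st.floats(min_value=0.0, max_value=1e6))
def test_sqrt_squares_back(x):
    error = abs(math.sqrt(x) ** 2 - x)
    target(error, label="roundtrip error")  # let Hypothesis try to drive the error upwards
    assert error <= 1e-3
```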
Actually this would be another change to targeted property-based testing: currently it doesn't do that; it stops trying to target as soon as it hits a failure. Easy enough to fix though - I think it's just a one-line change.
This is a sort of scope expansion I'm very keen for us to have, BTW. I'm not super keen on the original proposal because it seems very specific, but I'd be delighted for us to have a generalisable toolkit of things that help in this sort of scenario.
Yes, I think that would work. I hadn't suggested any specific mechanism, and the targeted testing docs didn't seem too optimistic about it working in a reasonable amount of time. But target does provide a way to report degrees of (near-)failure, though the papers didn't seem to describe it that way. There is a distinction, though - maybe many of the errors one might use TPBT on don't actually benefit from exploring serious failures? Certainly there's a limit to how much effort should go into finding them. But floating-point accuracy problems definitely do benefit from a notion of serious failure, and they generally do provide a way to report the degree of failure. It may take some doing to develop ways to drive errors upward when the inputs are also floating-point - I am seeing failures in weird corner cases like near the ends of days with leap seconds or moments when the difference between
I've just converted my astropy testing code to use target(). Does it handle multiple targets together, say by tracking a Pareto frontier of the best examples, or does it treat each label separately? Simulated annealing is also not necessarily the best option, even using only the neighbourhood/temperature primitive - if you are going to have a large number of candidates floating around anyway you can run multiple temperatures simultaneously, for example ("parallel tempering"). I actually ended up using
At the moment it only handles them independently, though I don't think this would be too hard to change. In particular I was thinking of tracking the convex hull, but a Pareto frontier seems like a much better choice! We're not really using simulated annealing either - more of a hill-climbing variant which is designed to find quick wins or bail out early, because we have only a few chances to run the test function compared to most optimisers. Could be interesting to (adaptively?) swap in a more sophisticated approach depending on the max_examples, though it would need a lot of evaluation work.
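For intuition, a toy sketch of "hill climbing that bails out early" under a small evaluation budget - my illustration, not Hypothesis's actual optimiser:

```python
def hill_climb(score, mutate, start, budget=50, patience=5):
    """Greedy hill climbing with an early bail-out, for when evaluations are scarce."""
    best, best_score, stale = start, score(start), 0
    for _ in range(budget):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:   # take quick wins...
            best, best_score, stale = candidate, candidate_score, 0
        else:
            stale += 1
            if stale >= patience:          # ...or give up early and save the budget
                break
    return best, best_score
```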
Yeah, my plan when I have the bandwidth (hopefully in the next month or two) was to switch it over to tracking the Pareto frontier, including both the targets and also the shortlex ordering on the underlying buffers. Step one of this is just building the data structure for keeping track of an (approximation to the) Pareto frontier, which, if we store them in the example database, is useful even without any further optimisation beyond the current approach. After that I was probably going to use some sort of genetic algorithm or crossover method as a next step for exploring that frontier, but it needs investigation.
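As a rough illustration of the bookkeeping involved (my sketch, not the planned Hypothesis data structure), maintaining an approximate Pareto frontier over per-target scores could look like:

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


class ParetoFront:
    def __init__(self):
        self.entries = []  # (scores, example) pairs, none of which dominates another

    def add(self, scores, example):
        if any(dominates(existing, scores) for existing, _ in self.entries):
            return False  # dominated by something we already keep
        # Drop anything the new entry dominates, then keep it.
        self.entries = [(s, e) for s, e in self.entries if not dominates(scores, s)]
        self.entries.append((scores, example))
        return True
```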
(If anyone wants to preempt that work, BTW, please feel free. That's not me taking out a lock on it and I'd be delighted for someone else to give it a go!)
OK, I've written up a meta-issue #2192 with some discussion of significant new features that I think we should consider (modulo the usual disclaimer: it'll happen if and when someone feels like volunteering or paying for it). For this issue, we have a plan based on target(). For that implementation: we might as well do everything, right? Target the discrepancy as well as asserting on it?
Well, there's an implementation at the top of https://github.com/astropy/astropy/pull/9532/files#diff-a3e7f7d88dc4d767215294708809de38 and it gets a little involved, in this case because it's trying to deal with Time objects and Quantities (that is, with units) and to ensure that the error reporting is as intelligible as possible (in a way orthogonal to hypothesis; for example, if you ask that something be smaller than a nanosecond, the thing should be expressed in nanoseconds in the assertion failure). I suggested reporting the worst failing example rather than only the most-shrunk one.
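A much-simplified, hypothetical version of such a helper (ignoring the units and Time handling, and with invented names) might combine the assertion with a `target()` call roughly like this; labels should be distinct if it is called more than once per test case:

```python
from hypothesis import target


def assert_within(actual, expected, atol, label="absolute error"):
    """Assert closeness while reporting the discrepancy to Hypothesis as a target score."""
    error = abs(actual - expected)
    target(float(error), label=label)  # degree of failure, for Hypothesis to drive upwards
    assert error <= atol, f"|actual - expected| = {error!r} exceeds atol = {atol!r}"
```

A test would then call something like `assert_within(computed, reference, atol=1e-9)` inside an `@given` body.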
And as of #2193 it is! Now if you call target() in a test that fails, Hypothesis will show you the failing example with the highest score. We can think further about how to have this interact with multiple failures - right now it will only show you one - so this is only a partial solution, but I think it fixes the most pressing bit. What do you think?
I think I need to go back and hammer out some more bugs from astropy.time to evaluate how useful this is in practice, but it sounds promising. I expect in practice there is often a competition between maximizing the error and shrinking the example (when there isn't, we have no problem). The previous behaviour was to show the most shrunken example, which was useful in that it sometimes showed you that it didn't take weirdly specific values to trigger a problem. The new behaviour, as I understand it, is to favour degree of failure, which is great for showing how bad the problem can be. But yes, both would be nice - and also, when we have multiple targets (pretty much always with the a-b and b-a trick!) which maximum is preferred? Perhaps an example maximizing each label and also an example shrunk as much as possible but still failing? I have to admit that even single-failure backtraces can be a handful, and multiple-failure ones can be tough to read (just, a zillion lines of output really, especially for a whole suite of tests). So I'm not sure how best to show the user all this newly available information.
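For reference, the "a-b and b-a trick" as I read it - targeting the signed error in both directions so Hypothesis tries to push it away from zero either way; the test body here is invented for illustration:

```python
from hypothesis import given, strategies as st, target


@given(st.floats(min_value=-1e6, max_value=1e6))
def test_roundtrip_error_small(x):
    y = float(repr(x))            # stand-in for the computation being checked
    target(x - y, label="x - y")  # drive positive errors upwards
    target(y - x, label="y - x")  # drive negative errors upwards
    assert abs(x - y) <= 1e-9
```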
Hmm, it does not appear to be working. In https://github.com/astropy/astropy/pull/9532/files#diff-a3e7f7d88dc4d767215294708809de38 I have two tests (the last) where I got reported discrepancies of 1.6 ns (on a 1 ns threshold), but when I enabled the two-step testing I immediately got a double-failure message showing a 106 ns failure on the first test and a 1.6 ns failure on the second. So it seems I'm not getting a worst-failure report? Maybe I don't have the current version, though I did
Er, oops, PR is still open. I'll go comment over there. |
master...Zac-HD:report-target-stats is a lighter approach; you don't get additional test failures - but it reports the highest scores observed from failing tests in the statistics. Could be useful?
The `numpy` testing infrastructure provides `assert_almost_equal`, which is useful for floating-point tests; its `atol` and `rtol` arguments are (or can be) carefully enough defined that you can even use it for `abs(a-b) <= 0.5` tests. Of course I can just use it from `numpy.testing`. But if `hypothesis` provides the function, it can use the discrepancy to avoid the threshold problem: where such examples are available, it can produce examples that fail the test badly rather than just marginally.

I routinely find myself writing tests that go like the escalating-thresholds pattern sketched below, because when I do this `hypothesis` often reports three different errors, with each failing one of the assertions by a little bit. This is particularly valuable when the test I actually want is pushing the limits of the numerical accuracy I can hope for (for example `2*np.finfo(float).eps`), and I want to know whether I was just optimistic or the code really is abjectly wrong.
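A hedged sketch of that escalating-thresholds pattern; the computation, strategy, and exact thresholds are my own illustrative choices. The loosest assertion comes first, so different examples trip different assertions and produce distinct errors:

```python
import numpy as np
from hypothesis import given, strategies as st

tol = 2 * np.finfo(float).eps  # the accuracy actually hoped for


@given(st.floats(min_value=0.25, max_value=4.0))
def test_reciprocal_roundtrip(x):
    error = abs(1.0 / (1.0 / x) - x) / abs(x)  # stand-in computation; relative error
    # Escalating thresholds: which of these fails says how bad the problem is.
    assert error < 100 * tol
    assert error < 10 * tol
    assert error < tol
```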
In fact a simple implementation of `assert_almost_equal` would simply follow this pattern, and existing `hypothesis` machinery would do the rest. But I think `hypothesis` could be smarter around such a test, as it is around failed deadlines.