
hypothesis should provide assert_almost_equal() #2180


Closed · aarchiba opened this issue Nov 5, 2019 · 21 comments · Fixed by #2393
Labels: legibility (make errors helpful and Hypothesis grokable), test-case-reduction (about efficiently finding smaller failing examples)

Comments

aarchiba (Contributor) commented Nov 5, 2019

The numpy testing infrastructure provides assert_almost_equal, which is useful for floating-point tests; its atol and rtol arguments are (or can be) defined carefully enough that you can even use it for abs(a-b)<=0.5 tests. Of course I can just use it from numpy.testing. But if hypothesis provides the function, it can use the size of the discrepancy to avoid the threshold problem: when such examples exist, it can produce examples that fail the test badly rather than just marginally.

I routinely find myself writing tests that go like

...
assert abs(b-a) < 1e-3
assert abs(b-a) < 1e-6
assert abs(b-a) < 1e-9

because when I do this hypothesis often reports three different errors, each failing one of the assertions by a little bit. This is particularly valuable when the test I actually want is pushing the limits of the numerical accuracy I can hope for (for example 2*np.finfo(float).eps), and I want to know whether I was just being optimistic or whether the code is abjectly wrong.

In fact a simple implementation of assert_almost_equal would simply follow this pattern and existing hypothesis machinery would do the rest. But I think hypothesis could be smarter around such a test, as it is around failed deadlines.
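
A minimal sketch of such a helper, following the multi-threshold pattern above (the severity ladder and thresholds are illustrative; this sketch is not the numpy function of the same name, nor an existing Hypothesis API):

    # Hypothetical helper: raise a different assertion depending on how badly
    # the tolerance is exceeded, so Hypothesis treats "fails marginally" and
    # "fails badly" as distinct bugs and reports an example of each.
    def assert_almost_equal(a, b, atol=1e-9):
        error = abs(a - b)
        assert error <= 1e6 * atol, f"error {error} exceeds atol={atol} by over 1e6x"
        assert error <= 1e3 * atol, f"error {error} exceeds atol={atol} by over 1e3x"
        assert error <= atol, f"error {error} exceeds atol={atol}"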

DRMacIver (Member) commented:

Rather than providing any particular support for this, it seems like this would fit in well with the new targeted property based testing features. You could set the test to maximize abs(a - b).

Currently we don't do anything very sensible with integrating targeted testing with multiple errors, but it's probably a thing we should do.
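
A sketch of what that might look like using the real hypothesis.target call (the computation being tested here is just a stand-in):

    from hypothesis import given, strategies as st, target

    @given(st.floats(min_value=0.0, max_value=1e9, allow_nan=False))
    def test_roundtrip_is_accurate(x):
        a = x
        b = (x + 1.0) - 1.0  # stand-in for the computation under test
        # Ask the targeted search to drive the discrepancy upwards, so failing
        # examples tend to exceed the bound by as much as possible.
        target(abs(a - b))
        assert abs(a - b) <= 1e-9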

aarchiba (Contributor, Author) commented Nov 5, 2019

Maybe I don't understand how the targeted testing works, but it seems like it is oriented towards finding failures; that's not the problem I am trying to address here. Hypothesis is doing a perfectly acceptable job of finding failures.

My problem is that hypothesis shrinks the failures down so that I can't tell whether my code just slightly exceeds the bounds or fails them completely, because the shrunken examples always just slightly exceed the bounds. I want simple examples, but if one fails badly I want to know that, and maybe get the simplest example that fails badly. Whatever that means exactly.

Zac-HD (Member) commented Nov 6, 2019

Hey @aarchiba - thanks so much for bringing this to our attention! I think shipping assertion helpers should probably remain out of scope for Hypothesis, but if shrinking is making it harder to diagnose failures that's definitely a problem.

In this case, I think we should fix #1704 and see if that's sufficient.

Zac-HD added the legibility and test-case-reduction labels on Nov 6, 2019
aarchiba (Contributor, Author) commented Nov 6, 2019

The goal is to provide a "degree of failure" indication to hypothesis, because often finding an example that just barely fails is not as convincing as finding an example that fails badly. I realize this is sometimes in competition with the desire to find the simplest possible failure, but it is a real frustration. Frequently I have floating-point tests that fail because I only allowed for 2 ULP error and this one case can produce 3. So I twiddle the threshold and rerun the test. By the time I get up to 8 ULP I start to suspect it's a more serious failure and start trying tens and hundreds. If those fail, it's a different kind of problem than I thought it was. This iterative process is frustrating, and will not be improved by different methods of simplification or better means of finding failures.

I agree that assertion helpers don't seem particularly on-topic for hypothesis. But how else is it supposed to obtain degree-of-failure information? Obviously not every test has a meaningful notion of degree of failure, but floating-point ones often do. (Array-based tests may as well: for example, disagreement in shape, in many elements, or in only one.)

I was pleased to find D. R. MacIver's article on the threshold problem as it described a problem I had very well, and I am attempting to offer a solution along the lines of the solution already adopted for deadline failures (where hypothesis has easy access to a degree-of-failure measure).

Zac-HD (Member) commented Nov 6, 2019

> This iterative process is frustrating, and will not be improved by different methods of simplification or better means of finding failures.

To be clear, I definitely want to fix the problem at the level of user experience - if shrinking or other improvements do so that's great; if they don't we still have a problem. In brainstorming mode:

  • If you're in that iterative mode, passing pytest --hypothesis-verbosity=verbose (or otherwise turning up the verbosity) might help a bit by printing each example as it goes.
  • I agree that a numpy-like assert_almost_equal function which raises different errors based on the distance would be really useful, I'd just prefer to have someone else maintain it 😜 We could put a copy-pastable implementation in our docs and see if Numpy would be interested in picking up the idea?
  • As a feature, Hypothesis printing the minimal example seems to be always useful, but optionally printing a smallish and a medium-sized example as well could help. I remember seeing (discussion of?) a system which would present a small collection of both passing and failing examples for each test, which can make diagnosis really easy. (I like this idea enough that I'll write up a proper issue for it in the next few days)

aarchiba (Contributor, Author) commented Nov 6, 2019

I guess the question is whether hypothesis has a use for a degree-of-failure measure, as distinct from a complexity-of-example measure. In a way this does connect to the targeted PBT stuff, in the sense that the utility value used for targeted property-based testing is also a sort of degree-of-failure measure, though the TPBT UV is for things that don't fail yet, while the degree-of-failure measure is for things that already fail, and it's a sort of proxy for nature-of-failure.

As for numpy picking up a multiple-assertions version of its tests, I think it unlikely they will be interested as the extra assertions are redundant outside the context of hypothesis; and as implemented above it has nothing to do with any numpy machinery (in fact I am using its polymorphism on astropy Time and TimeDelta objects).

Perhaps the apparent expansion of scope here is real: this is hypothesis expanding from discrete tests (e.g. string processing, graph properties) into continuous ones (floating-point calculations and accuracy budgets). So it shouldn't be surprising if hypothesis needs new kinds of tool if it wants to work well in this new context.

aarchiba (Contributor, Author) commented Nov 6, 2019

An example of this in use is at astropy/astropy#9532 . Of course the iterative problem-hunting stage is not very visible in PRs.

DRMacIver (Member) commented:

> I guess the question is whether hypothesis has a use for a degree-of-failure measure, as distinct from a complexity-of-example measure. In a way this does connect to the targeted PBT stuff, in the sense that the utility value used for targeted property-based testing is also a sort of degree-of-failure measure, though the TPBT UV is for things that don't fail yet, while the degree-of-failure measure is for things that already fail, and it's a sort of proxy for nature-of-failure.

What I had in mind in suggesting targeted property based testing is that the targets can kinda serve both purposes, and I think to some degree we need something that supports both purposes if we're going to handle this: If you want to know how large an error can be then it makes sense to try to drive that error upwards even if it's currently finding failures.

My suggestion would be that we change how tests with targeting are reported, in that we currently will only report distinct errors, but we could expand the scope to report both distinct errors and also the largest failing example of any given target score. So e.g. if you had:

error = abs(a - b)
target(error)
assert error <= 10 ** (-3)

then Hypothesis would:

  1. Attempt to drive the error upwards
  2. Reduce the failing test case as normal and report that
  3. Reduce the failing test case subject to the condition that the error score is the maximum one seen and also report that

We might need to think a bit more about how to communicate that this is what's going on, but unless I'm misunderstanding your use case I think this would cover it?

DRMacIver (Member) commented:

> it makes sense to try to drive that error upwards even if it's currently finding failures.

Actually this would be another change to targeted property-based testing: Currently it doesn't do that, it stops trying to target as soon as it hits a failure. Easy enough to fix though - I think it's just a one-line change.

> Perhaps the apparent expansion of scope here is real: this is hypothesis expanding from discrete tests (e.g. string processing, graph properties) into continuous ones (floating-point calculations and accuracy budgets). So it shouldn't be surprising if hypothesis needs new kinds of tool if it wants to work well in this new context.

This is a sort of scope expansion I'm very keen for us to have BTW. I'm not super keen on the original proposal because it seems very specific, but I'd be delighted for us to have a generalisable toolkit of things that help in this sort of scenario.

aarchiba (Contributor, Author) commented Nov 6, 2019

> My suggestion would be that we change how tests with targeting are reported, in that we currently will only report distinct errors, but we could expand the scope to report both distinct errors and also the largest failing example of any given target score. So e.g. if you had:
>
> error = abs(a - b)
> target(error)
> assert error <= 10 ** (-3)
>
> then Hypothesis would:
>
> 1. Attempt to drive the error upwards
>
> 2. Reduce the failing test case as normal and report that
>
> 3. Reduce the failing test case subject to the condition that the error score is the maximum one seen and also report _that_
>
> We might need to think a bit more about how to communicate that this is what's going on, but unless I'm misunderstanding your use case I think this would cover it?

Yes, I think that would work. I hadn't suggested any specific mechanism, and the targeted testing docs didn't seem too optimistic about working in a reasonable amount of time. But target does provide a way to report degrees of (near-)failure, though the papers didn't seem to describe it that way.

There is a distinction, though - maybe many of the errors one might use TPBT on don't actually benefit from exploring serious failures? Certainly there's a limit to how much effort should go into finding them. But floating-point accuracy problems definitely do benefit from a notion of serious failure, and they generally do provide a way to report the degree of failure.

It may take some doing to develop ways to drive errors upward when the inputs are also floating-point - I am seeing failures in weird corner cases like near the ends of days with leap seconds or moments when the difference between tdb and tai bends the day fraction from less than 0.5 to more than. But it would already be a big improvement to show the worst failure ever encountered on the way to a simplified result.

aarchiba (Contributor, Author) commented Nov 7, 2019

I've just converted my astropy testing code to use target where appropriate. (I realize the exploring-already-failing and reporting-worst-failure stuff isn't in there yet.) In several cases I have been naughty and combined several tests, usually several related ways of checking the same underlying property. There I have used different labels for each of the targets. Does hypothesis take advantage of multi-objective optimizers? Mostly this means keeping track of multiple "best" candidates along the Pareto front.
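
For reference, combining related checks with separately-labelled targets looks roughly like this (the property tested is only illustrative):

    import math
    from hypothesis import given, strategies as st, target

    @given(st.floats(min_value=1e-6, max_value=1e6))
    def test_log_exp_roundtrip(x):
        # Two related metrics for the same underlying property; each target()
        # call within a test case needs its own label so the scores are
        # tracked (and maximised) separately.
        abs_error = abs(math.exp(math.log(x)) - x)
        rel_error = abs_error / x
        target(abs_error, label="absolute error")
        target(rel_error, label="relative error")
        assert rel_error <= 1e-12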

Simulated annealing is also not necessarily the best option, even using only the neighbourhood/temperature primitive - if you are going to have a large number of candidates floating around anyway you can run multiple temperatures simultaneously, for example ("parallel tempering"). I actually ended up using emcee as a more-or-less tuning-free optimizer that just happened to also be a Bayesian MCMC algorithm. It's not necessarily the best choice, but something like its concept of affine stretch-moves might be useful for multidimensional floating-point exploration.

Zac-HD (Member) commented Nov 7, 2019

At the moment it only handles them independently, though I don't think this would be too hard to change. In particular I was thinking of tracking the convex hull, but a Pareto frontier seems like a much better choice!

We're not really using simulated annealing either, more of a hill-climbing variant which is designed to find quick wins or bail out early because we have only a few chances to run the test function compared to most optimisers. Could be interesting to (adaptively?) swap in a more sophisticated approach depending on the max_examples, though it would need a lot of evaluation work.

DRMacIver (Member) commented:

Yeah my plan when I have the bandwidth (hopefully in the next month or two) was to switch it over to tracking the Pareto frontier, including both the targets and also the shortlex ordering on the underlying buffers.

Step one of this is just building the data structure for keeping track of an (approximation to the) Pareto frontier, which, if we store its entries in the example database, is useful even without any further optimisation beyond the current approach. After that I was probably going to use some sort of genetic algorithm or crossover method as a next step for exploring that frontier, but it needs investigation.
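
For concreteness, a toy version of such a frontier tracker might look like the following (purely illustrative; it is not how Hypothesis's internals or database work):

    def dominates(a, b):
        # a dominates b if it is at least as good on every objective and
        # strictly better on at least one (here, higher scores are better).
        return all(x >= y for x, y in zip(a, b)) and a != b

    class ParetoFront:
        """Track the non-dominated (scores, example) pairs seen so far."""

        def __init__(self):
            self.entries = []

        def add(self, scores, example):
            if any(dominates(other, scores) for other, _ in self.entries):
                return False  # dominated by an existing entry; discard it
            # Drop anything the new entry dominates, then keep it.
            self.entries = [(s, e) for s, e in self.entries if not dominates(scores, s)]
            self.entries.append((scores, example))
            return True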

DRMacIver (Member) commented:

(If anyone wants to preempt that work BTW please feel free. That's not me taking out a lock on it and I'd be delighted for someone else to give it a go!)

Zac-HD (Member) commented Nov 8, 2019

OK, I've written up a meta-issue #2192 with some discussion of significant new features that I think we should consider (modulo the usual disclaimer; it'll happen if and when someone feels like volunteering or paying for it).

For this issue, we have a hypothesis.provisional namespace 🤦‍♂ and I'd be perfectly happy to stick assert_almost_equal in there. That way we can decide to move or remove it in minor versions if we decide to e.g. add more assertion helpers (new hypothesis.helpers) or whatever else, and that solves my only objection to the whole idea. This makes it a specific - though widely useful - application of a generally useful pattern, and we can document it as such and recommend reading the implementation.

For that implementation: we might as well do everything, right? Target a-b and b-a; use a couple of assertions to report severe as well as minimal failures; and throw in the logic to make Numpy arrays work with target() (for arrays, a_sub_b = float(a_sub_b.max())). Writing this is now on my todo-eventually stack but I'd be very happy to review someone else's PR instead 😄
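
A rough sketch along those lines (the helper name, thresholds, and severity split are illustrative; target() is the only real Hypothesis API used):

    import numpy as np
    from hypothesis import target

    def assert_allclose_targeted(a, b, atol=1e-9):
        # Hypothetical helper: report both signed discrepancies to target()
        # and use separate assertions so severe failures are reported as
        # distinct bugs from marginal ones. Assumes it is called from inside
        # an @given test, since target() only works there.
        a_sub_b = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        high = float(a_sub_b.max())    # largest a - b
        low = float((-a_sub_b).max())  # largest b - a
        target(high, label="a - b")
        target(low, label="b - a")
        error = max(high, low)
        assert error <= 1000 * atol, f"severe: max |a - b| = {error}, atol = {atol}"
        assert error <= atol, f"max |a - b| = {error} exceeds atol = {atol}"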

aarchiba (Contributor, Author) commented Nov 8, 2019

Well, there's an implementation at the top of https://github.com/astropy/astropy/pull/9532/files#diff-a3e7f7d88dc4d767215294708809de38 and it gets a little involved, in this case because it's trying to deal with Time objects and Quantities (that is, with units) and ensure that the error reporting is as intelligible as possible (in a way orthogonal to hypothesis, for example if you ask that something be smaller than a nanosecond, the thing should be expressed in nanoseconds in the assertion failure).

I suggested assert_almost_equal as a way to signal degree of failure, but if there is another one hypothesis can actually use (which target will hopefully be) then indeed this may be out of scope for hypothesis.

DRMacIver (Member) commented:

> if there is another one hypothesis can actually use (which target will hopefully be)

And as of #2193, it is! Now if you do target(degree_of_error) and have an assertion that provides an upper bound on valid errors, it will give you shrunk failures that maximize the error.

We can think further about how to have this interact with multiple failures - right now it will only show you one - so this is only a partial solution, but I think it fixes the most pressing bit. What do you think?

aarchiba (Contributor, Author) commented Nov 8, 2019

I think I need to go back and hammer out some more bugs from astropy.time to evaluate how useful this is in practice, but it sounds promising.

I expect in practice there is often a competition between maximizing the error and shrinking the example (when there isn't we have no problem). The previous behaviour was to show the most shrunken example, which was useful in that it sometimes showed you that it didn't take weirdly specific values to trigger a problem. The new behaviour, as I understand it, is to favour degree of failure, which is great for showing how bad the problem can be. But yes, both would be nice - and also, when we have multiple targets (pretty much always with the a-b and b-a trick!) which maximum is preferred? Perhaps an example maximizing each label and also an example shrunk as much as possible but still failing?

I have to admit that even single-failure backtraces can be a handful, and multiple-failure ones can be tough to read (just, a zillion lines of output really, especially for a whole suite of tests). So I'm not sure how best to show the user all this newly available information.

aarchiba (Contributor, Author) commented Nov 8, 2019

Hmm, it does not appear to be working. In https://github.com/astropy/astropy/pull/9532/files#diff-a3e7f7d88dc4d767215294708809de38 I have two tests (the last two) where I got reported discrepancies of 1.6 ns (on a 1 ns threshold), but when I enabled the two-step testing I immediately got a double-failure message showing a 106 ns failure on the first test and a 1.6 ns failure on the second. So it seems I'm not getting a worst-failure report?

Maybe I don't have the current version, though I did pip install -U hypothesis

$ pip show hypothesis
Name: hypothesis
Version: 4.43.8
Summary: A library for property based testing
Home-page: https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python
Author: David R. MacIver
Author-email: david@drmaciver.com
License: MPL v2
Location: /home/archibald/.virtualenvs/astropy/lib/python3.7/site-packages
Requires: attrs
Required-by: 

aarchiba (Contributor, Author) commented Nov 8, 2019

Er, oops, PR is still open. I'll go comment over there.

Zac-HD (Member) commented Nov 13, 2019

master...Zac-HD:report-target-stats is a lighter approach; you don't get additional test failures - but it reports the highest scores observed from failing tests in the statistics. Could be useful?
