Multi-bug discovery during generation #847

Closed
DRMacIver opened this issue Sep 11, 2017 · 5 comments · Fixed by #1781
@DRMacIver (Member)

As a consequence of #836 we can now find and display multiple bugs if these occur during either shrinking or retrieving examples from the database.

Amusingly though, we can't currently find multiple bugs in the phase that is actually designed to find bugs: generation! We should probably fix that.

However it is not sufficient to just run generation to completion, because a reasonable usage pattern is "Run generation until you find a bug and then stop and tell me that bug", which could be done by e.g. setting max_examples absurdly high and turning off the time limit. In this case we would not want to keep running indefinitely after we've found a bug.

I think the correct logic here is:

  1. Skip generation altogether if we found a bug in example replay
  2. Always respect max_examples in the sense that we stop running generation after we've generated that many valid or interesting examples.
  3. Stop running generation at some point "reasonably soon" after we've found our first bug.

The devil is of course in the details of what "reasonably soon" means. A non-exhaustive list of possibilities:

  1. Just do it time-based - run for at most a few seconds after the first bug is found.
  2. Stop when we've run twice as many examples as it took to find the first bug.
  3. Stop when we've run twice as many examples as it took to find the last bug.
  4. Do something clever and statistical to estimate the number of bugs we can expect to find, and stop when we've found that many.
  5. Some hybrid of the above.

My preference is probably some hybrid of 1 and 3.
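For concreteness, here is a minimal sketch of how such a hybrid of 1 and 3 might look inside the generation loop. The should_stop helper and its counters are purely illustrative assumptions, not the actual engine internals:

import time

GRACE_SECONDS = 2.0  # option 1: only a short time budget after the first bug

def should_stop(valid_examples, max_examples, first_bug_time, examples_at_last_bug):
    # Hypothetical stopping rule; every name here is illustrative.
    # first_bug_time is None until a bug is found; examples_at_last_bug is
    # the number of examples run when the most recent bug was discovered.
    if valid_examples >= max_examples:
        return True  # point 2: always respect max_examples
    if first_bug_time is None:
        return False  # no bug yet, keep generating
    if time.monotonic() - first_bug_time > GRACE_SECONDS:
        return True  # option 1: stop a few seconds after the first bug
    # option 3: stop once we've run twice as many examples as it took
    # to find the most recent bug
    return valid_examples >= 2 * examples_at_last_bug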

Zac-HD added the enhancement label on Sep 11, 2017
@Zac-HD (Member) commented Sep 11, 2017

We should indeed fix this!

I'm much less eager to skip generation though - particularly I don't see much use in "run until you find a bug, then stop". This seems to fall into an awkward spot where it's mostly a case for AFL, and if Hypothesis can't find the bug you'll just waste a lot of cycles on it - especially if your tests are running approximately serially rather than interleaved.

I agree with the 'correct logic'; not much more to say there.

I don't like the idea of reintroducing time into the number of examples run. If we're not simply running until we hit max_examples - which is, after all, the expected behaviour in all non-buggy runs! - I favor option three as a simple version of option four. If you found one or several bugs in examples 0 .. n but none in n+1 .. 2n, that's a reasonable place to stop looking (not counting examples run while shrinking, and possibly slipping between bugs).
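As a rough illustration of that rule (the names below are hypothetical, not engine internals): if the most recent bug turned up on example n, generation continues until 2n examples have run and then stops.

def past_stopping_point(examples_run, last_bug_at):
    # Bugs were found within the first last_bug_at examples; if the next
    # last_bug_at examples (up to 2 * last_bug_at) found nothing new,
    # that's a reasonable place to stop looking.
    return examples_run >= 2 * last_bug_at

# e.g. last bug found on example 150: keep generating until example 300
assert not past_stopping_point(200, last_bug_at=150)
assert past_stopping_point(300, last_bug_at=150)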

@DRMacIver (Member, Author)

> I'm much less eager to skip generation though - particularly I don't see much use in "run until you find a bug, then stop". This seems to fall into an awkward spot where it's mostly a case for AFL, and if Hypothesis can't find the bug you'll just waste a lot of cycles on it - especially if your tests are running approximately serially rather than interleaved.

I find myself running with large numbers of examples fairly often. It's not quite "run until you hit a bug", but it's still going to be running for minutes or tens of minutes. When I'm doing this I would definitely find it annoying if Hypothesis found a bug and then spent ages looking for another one.

I agree that something like AFL would be more useful here, but for now Hypothesis scratches a lot of itches that AFL can't (and Hypothesis should get more AFL-like over time, ideally).

> I don't like the idea of reintroducing time into the number of examples run.

I don't think it's too problematic. The problem with timeout was that it gave users a false sense of confidence, because it looked like you were running more examples than you were. In this case that's a non-issue: either Hypothesis runs the number of examples you asked for, or it found a bug, in which case there was never any guarantee of how many examples would be run anyway.

@Zac-HD (Member) commented Sep 12, 2017

To make the point, we should also update our headline example in the readme:

from hypothesis import given, strategies as st

@given(st.lists(st.floats(allow_nan=False)))
def test_mean(xs):
    mean = sum(xs) / len(xs)
    assert min(xs) <= mean <= max(xs)
Falsifying example: test_mean(xs=[9.9792015476736e+291, 1.7976931348623157e+308])
Traceback (most recent call last):
    ...
AssertionError: mean=inf

Falsifying example: test_mean(xs=[])
Traceback (most recent call last):
    ...
ZeroDivisionError: division by zero

---------------------------------------------------------------------------
MultipleFailures: Hypothesis found 2 distinct failures.

(Note that this is on the current released version! We already find two distinct bugs, just by shrinking the list 😄)

@pazzarpj

I'm going to start working on this

@pazzarpj

I'm not working on this anymore. Can a maintainer grab this?
