
Display all slipped errors found during shrinking #836

Merged
merged 26 commits into master from DRMacIver/failure-tagging on Sep 11, 2017

Conversation

4 participants

DRMacIver commented Sep 6, 2017

There's a phenomenon often called "slippage", where shrinking one bug results in finding another bug. This often happens when rarer bugs shrink to more common ones.

This isn't a huge problem in practice for Hypothesis, because the example database still saves the original failures and will replay them, but it's annoying. I ran into at least one example of it while debugging #785, which reminded me that I've been meaning to do some work on trying to avoid the problem.

So this is that work. It adds a notion of interestingness_origin to ConjectureData and forces the engine to shrink subject to the additional constraint that the origin is preserved.

For tests we use the somewhat imprecise but probably good enough partitioning that two failures are "the same bug" if they throw the same exception type from the same line. We could use a more precise notion based on the entire stack trace, but it's likely that would be too precise (even after normalising to account for recursion).
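​
For concreteness, a minimal Python 3 sketch of that classification (the function name and key shape here are illustrative assumptions, not Hypothesis's actual internals):

import traceback


def interesting_origin(exception):
    # Classify a failure by its exception type plus the file and line the
    # exception was raised from, as described above.
    tb = traceback.extract_tb(exception.__traceback__)
    if tb:
        filename, lineno = tb[-1][:2]
        return (type(exception), filename, lineno)
    # Fall back to the exception type alone if there is no usable traceback.
    return (type(exception), None)

Shrinking is then only allowed to accept a smaller example if it reproduces the same origin tuple.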

This then has a bunch of knock-on effects for database management: We absolutely want to save all these examples that we're finding but not slipping to, but if we do that under the current scheme then when we rerun the test we might get a slipped example rather than the previous test failure. So we split the storage into two segments, primary and secondary. Primary is where everything that was an actual target bug goes, secondary is where all other bugs we find go. We then make sure to replay primary failures first (see mini-essay comment inline in the code for more details of logic here).
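​
A minimal sketch of that storage scheme (the key derivation and replay helper are illustrative assumptions; only the save/fetch interface of the example database is taken as given):

def secondary_key(key):
    # The secondary corpus lives under a key derived from the primary one.
    return key + b".secondary"


def examples_to_replay(db, key):
    # Replay previously reported (primary) failures first, so a rerun
    # reproduces the bug the user actually saw rather than one we slipped
    # to, then fall back to the secondary corpus of other finds.
    return list(db.fetch(key)) + list(db.fetch(secondary_key(key)))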

As well as the actual feature this is for, the two main implementation details of this pull request are also useful stepping stones in my ~~sinister long term plans for world domination~~ better testing tools. In particular:

  • Being able to distinguish different types of failure is pretty vital for the use of Hypothesis as a long-running fuzzer tool. We don't currently have a workflow or implementation that supports that, so this is just step one, but it's a vital step and it will be useful to see how well this works in the wild.
  • The primary/secondary corpus thing is a split I've been meaning to make for a while. Once Hypothesis starts growing coverage features we'll need somewhere to stash a corpus of non-failures that produce interesting coverage results.

DRMacIver commented Sep 6, 2017

(@silentbicycle this might be of interest to you as we've talked about this feature quite recently. Though I don't think much in the actual pull request would port over naturally to theft)


silentbicycle commented Sep 6, 2017

I already have it implemented for theft's non-forking mode on the develop-failure_tagging branch (https://github.com/silentbicycle/theft/compare/develop-failure_tagging). It was a pretty small change, and seems like a big improvement. Updating the failure tag from a forked worker process will require a command handling loop in the supervisor process (because of COW), but that's already part of multicore searching and shrinking, so I'm working on finishing that first.


DRMacIver commented Sep 6, 2017

> seems like a big improvement

I'm curious, do you have a sense of how big? IIRC theft does a thing where, once it's shrunk, it backs up to a previous shrink and tries again, avoiding previously seen examples, which I would expect to work around slippage in a broadly similar way to how Hypothesis's example database does.


Zac-HD commented Sep 6, 2017

This reminds me of @regehr's post Reducers are Fuzzers - to summarise, slipping between bugs can be bad if you care about minimising a particular bug, but good if you care about finding bugs in general.

This may seem trivially true, but it has some interesting implications. Namely, why do we want to work around slippage at all? For future fuzzing efforts, it's obviously very useful to be able to distinguish failures. However, is primary/secondary keying really enough for this?

For most current uses of Hypothesis, i.e. unit testing with a fairly short detect-fix loop, we currently only report one failing input - and I'm not sure how hard it would be to do more under current test runners! As-is, avoiding slippage seems mostly to be good at building up a corpus of secondary bugs to report once the first is fixed - useful, but pretty poor UX!

Questions:

  • Why do we prioritise the first bug that we find rather than the one with the simplest reproducing case?
  • Why have primary/secondary storage rather than a distinct subkey for each error?
  • The textual line number seems too unstable to use as a distinguishing key between runs. What effect would using the function name instead have?
  • Wouldn't we want another corpus category for non-failing coverage examples?

(Also CC @SamHames; after our day at PyConAU I thought this might interest you.)


silentbicycle commented Sep 6, 2017

Rather than responding here, I included it in a post to the randomized-testing google group: https://groups.google.com/forum/#!topic/randomized-testing/YJeQQMSEwzk

> For most current uses of Hypothesis, i.e. unit testing with a fairly short detect-fix loop

Generally speaking, that doesn't really apply for theft: I'm just as often running it in CI, or as part of an audit of existing code, and don't want to assume it's being used in tandem with initial development.


DRMacIver commented Sep 6, 2017

> This reminds me of regehr's post Reducers are Fuzzers - to summarise, slipping between bugs can be bad if you care about minimising a particular bug, but good if you care about finding bugs in general.

Indeed, but note that if anything this approach is better at taking advantage of that - we still save all of the slipped bugs, and because the shrink target is harder we have more opportunity for different slippage to happen.

> As-is, avoiding slippage seems mostly to be good at building up a corpus of secondary bugs to report once the first is fixed - useful, but pretty poor UX!

As per the above, I think it's also good at prioritising the more interesting bugs.

But I would like us to move towards reporting multiple bugs if we find multiple bugs. The UI side of things is relatively feasible, though it will interact poorly with things like pdb integration in pytest. I think this is a reasonable first step towards doing that, and a self-contained enough change to be worth it on its own.

> Why do we prioritise the first bug that we find rather than the one with the simplest reproducing case?

Anecdotally, bugs tend to slip from more interesting to less interesting, because the shrinker is throwing away some part of what makes the example important.

> Why have primary/secondary storage rather than a distinct subkey for each error?
> The textual line number seems too unstable to use as a distinguishing key between runs. What effect would using the function name instead have?

The second question is partially an answer to the first. But it's also because it would require changes to the database API/format - we don't currently have a good way to find all the different keys there.

> Wouldn't we want another corpus category for non-failing coverage examples?

Maybe. Not sure. The current logic would extend reasonably well to a tertiary corpus anyway, and the secondary corpus might work well as a generic dumping ground for things that might be interesting but aren't the last bug found. I think we should cross this bridge when we come to it.


Zac-HD commented Sep 7, 2017

I certainly support finding and reporting multiple bugs, and this PR as a first step towards that goal. However:

  • I would usually prefer the bugs to be ranked from least to most interesting, because the trivial violation is usually easiest to fix and often closes off the paths that lead to later bugs. This seems to be the opposite of your intuition - I care less about rare bugs than common bugs!
  • Two keys still doesn't seem like enough - what happens if you find three bugs? If you end up with '.secondary' chained multiple times, I would prefer to improve the database API.
  • It would be nice if the bug key was stable against unrelated code changes, and preferably reasonably stable against attempted bug fixes. The latter might not be possible without using something like RedBaron.
  • With or without the key enhancements above, what happens when traceback.extract_tb returns None for some entries, or simply fails? (I get an internal AttributeError trying to use it in an IPython prompt) We need a reasonable fallback key and default behaviour anyway, so why not try for a better version?

DRMacIver commented Sep 7, 2017

> I would usually prefer the bugs to be ranked from least to most interesting, because the trivial violation is usually easiest to fix and often closes off the paths that lead to later bugs. This seems to be the opposite of your intuition - I care less about rare bugs than common bugs!

It's less about caring and more about lost opportunities. Common bugs are easy to refind, while rare bugs are easy to lose (especially if people are running without persisting the DB, which still seems to be the common case).

There's also the slightly selfish reason that interesting bugs make Hypothesis look cooler. ;-)

> Two keys still doesn't seem like enough - what happens if you find three bugs? If you end up with '.secondary' chained multiple times, I would prefer to improve the database API.

Two keys handles this case fine. However, see the update I'll explain below.

> It would be nice if the bug key was stable against unrelated code changes, and preferably reasonably stable against attempted bug fixes. The latter might not be possible without using something like RedBaron.

My version is stable against unrelated code changes, because it doesn't use the classifier as a persistent key. This is only a problem with your proposal to use the classifications as anything other than an opaque way to partition the search space (which I'm against anyway).

> With or without the key enhancements above, what happens when traceback.extract_tb returns None for some entries, or simply fails?

Huh. I didn't know that returning None was a thing there. It's easy enough to handle (we just use (ExceptionType, None) as the key there), but I can't seem to figure out how to actually trigger it - nothing in the source suggests it's possible, and just raising the exception in an exec doesn't do it (the source turns up as ''). Any ideas?

> (I get an internal AttributeError trying to use it in an IPython prompt) We need a reasonable fallback key and default behaviour anyway, so why not try for a better version?

I'd be happy to go for a better version, but using the function name rather than the line number seems strictly worse: it conflates bugs that are easily distinguishable and doesn't bring any benefit to offset that.


DRMacIver commented Sep 7, 2017

OK. So I've switched this to a new implementation (roughly sketched in code after the list below):

  • We minimize every bug we find (up to the shrink limit if appropriate)
  • We store only the minimized bugs in the primary storage
  • We store the non-minimized ones in the secondary
  • We always replay every distinct minimized bug
  • At the end we decide which of multiple bugs to display. This is currently "most interesting" but could be switched easily enough.
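​
A rough sketch of that flow, purely for illustration (the names and API here are assumptions rather than the engine's actual code):

def report_failures(runner):
    # Shrink each distinct failure, keyed by its interestingness origin,
    # subject to the constraint that the origin is preserved.
    for origin, example in runner.interesting_examples.items():
        shrunk = runner.shrink(example, preserve=origin)

        # Minimized failures go into primary storage; the unminimized
        # originals are kept in the secondary corpus.
        runner.database.save(runner.key, shrunk)
        runner.database.save(runner.secondary_key, example)

    # Every distinct minimized bug is replayed; at the end we pick which
    # one to display (currently the "most interesting" one).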

Zac-HD commented Sep 7, 2017

Ah, I was missing the impermanence of the key - as a way to partition state-space it makes perfect sense, and then we rediscover structure based on saved examples on the next run. (Those dot-points are helpful!)

I don't have a better proposal for the extract_tb part - my point about None is based on my interpretation of fairly ambiguous docs, but the AttributeError is something I saw. As long as we catch it, I think we're OK to just proceed - at worst we get the old behaviour of undistinguished errors.

And finally, I can certainly see your point about reporting harder-to-find bugs, especially since people don't e.g. transfer their bug DBs across CI runs. However, I'm just not comfortable with not reporting the simplest failing input. How difficult would it be to put together basic reporting of multiple errors as part of this pull?


DRMacIver commented Sep 7, 2017

> How difficult would it be to put together basic reporting of multiple errors as part of this pull?

Hmm. It's probably feasible. Let me try.


DRMacIver commented Sep 8, 2017

OK. Here's a version that displays all encountered errors. If we only find one error, the behaviour is as before. If we find multiple errors then we handle printing of them ourselves and raise a MultipleFailures exception.
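​
The test file itself isn't shown in this comment, but a hypothetical reconstruction from the tracebacks below would look roughly like this (the exact conditions are an assumption; the point is simply two assertion failures with distinct origins):

import hypothesis.strategies as st
from hypothesis import given


@given(st.integers())
def test(i):
    if i >= 10000:
        # First failure: shrinks to i=10000.
        assert False
    if i == 1:
        # Second, distinct failure, encountered via slippage while shrinking.
        assert False


if __name__ == '__main__':
    test()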

Example raw output:

Falsifying example: test(i=10000)
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 631, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 116, in run
    return test(*args, **kwargs)
  File "test_multi.py", line 13, in test
    assert False
AssertionError

Falsifying example: test(i=1)
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 631, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 116, in run
    return test(*args, **kwargs)
  File "test_multi.py", line 17, in test
    assert False
AssertionError

Traceback (most recent call last):
  File "test_multi.py", line 23, in <module>
    test()
  File "test_multi.py", line 9, in test
    @given(st.integers())
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 774, in wrapped_test
    state.run()
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 692, in run
    len(self.falsifying_examples,)))
hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.

Or from pytest:

test_multi.py:9: in test
    @given(st.integers())
E   hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.
------------------------------------------------------------------------------------------------------------------ Hypothesis ------------------------------------------------------------------------------------------------------------------
Falsifying example: test(i=10000)
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 631, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 116, in run
    return test(*args, **kwargs)
  File "/home/david/hypothesis-python/test_multi.py", line 13, in test
    assert False
AssertionError: assert False

Falsifying example: test(i=1)
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 631, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 116, in run
    return test(*args, **kwargs)
  File "/home/david/hypothesis-python/test_multi.py", line 17, in test
    assert False
AssertionError: assert False

The stack trace of the MultipleFailures exception is kinda annoyingly irrelevant, but I'm otherwise relatively happy with this.


Zac-HD commented Sep 9, 2017

This is great - good handling of multiple distinct errors is a substantial upgrade! IMO this deserves a blog post and some publicity 😄

Only comments at this stage: fix tests, bump minor instead of patch version and rewrite release notes, and in core.py "preior" is not a word. Once that's done, I'll do a final review and we can merge!


DRMacIver commented Sep 9, 2017

> Only comments at this stage: fix tests, bump minor instead of patch version and rewrite release notes, and in core.py "preior" is not a word. Once that's done, I'll do a final review and we can merge!

Done.

(also, obviously your time is your own to do with as you please, but if it comes down to a choice between reviewing this and reviewing #785 I'd much prefer the latter :-) )

DRMacIver changed the title from "Avoid slipping to a new error when shrinking" to "When multiple errors are found during shrinking, display them all" Sep 10, 2017

DRMacIver changed the title from "When multiple errors are found during shrinking, display them all" to "Display all errors found during shrinking when slippage occurs" Sep 10, 2017

DRMacIver changed the title from "Display all errors found during shrinking when slippage occurs" to "Display all slipped errors found during shrinking" Sep 10, 2017

@alexwlchan

Two quick suggestions from a skim, but this is too large for me to have time to review today.

I agree with @Zac-HD – this is a really nice change, and definitely deserves some talking about when it’s released! :D


alexwlchan commented Sep 10, 2017

One brief anecdote: I was testing the new behaviour with the following program:

from hypothesis import given
from hypothesis.strategies import integers


@given(integers())
def test_x(x):
    if x == 0:
        raise ValueError("foo")
    assert x != 1

If I change that 0 to a 2, it only finds the x=1 counterexample. It’s not a massive issue (that’s no worse than the current behaviour), but could we find x=2 as well?


DRMacIver commented Sep 10, 2017

> could we find x=2 as well?

We could, but I think it might make sense to do it as a subsequent patch (and exact equality tests are hard to find anyway).

Right now this pull request is focused on the problem of slippage, so the only extra bugs found will be ones where the later bugs are naturally on the simplification path of the originally found ones. So for shrinking an arbitrarily large integer it will try 0, 1, and then go "Oh I'm done. Stop". If Hypothesis had happened to try x=2 first then it would have seen both bugs because it would have shrunk that to 1 and seen the second bug.
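​
To make the shrink-path point concrete, here are both variants of alexwlchan's example side by side; the behavioural notes restate the explanation in this thread rather than being guarantees:

from hypothesis import given
from hypothesis.strategies import integers


# The ValueError at x == 0 lies on the simplification path of the x == 1
# assertion failure (the shrinker tries 0 on its way down), so both bugs
# get found and reported.
@given(integers())
def test_both_bugs_found(x):
    if x == 0:
        raise ValueError("foo")
    assert x != 1


# With x == 2 instead, 2 is not on the path from x=1 downwards, so typically
# only the x=1 counterexample is reported.
@given(integers())
def test_only_one_bug_found(x):
    if x == 2:
        raise ValueError("foo")
    assert x != 1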

The solution to this is probably to not stop the test generation loop immediately on finding an interesting bug. But I think there are enough fiddly heuristics to tune when doing that that it bears further thinking about. In particular:

  1. What happens for example replay? Do we want to run generation at all then?
  2. How much of the generation loop do we keep running? In particular, what happens if max_examples is something silly, like 10**6?

(probably more to think about, but those are the two obvious ones).

I'm keen to do something along these lines, but I'm also keen to not make this pull request any larger or keep it waiting on hashing out these details. :-)

Zac-HD approved these changes Sep 10, 2017

Looks good to me, and a substantial upgrade! Optional requests:

  • Give @alexwlchan a chance to review it too
  • Extra cleanup per comment below
  • Write up some thoughts on improving the multiple-discovery story in a new issue (847)
  • Write up a blog post announcing this important new feature (and maybe how our new merge process and sponsors have increased development speed)

😄 🎉

@@ -33,17 +33,17 @@ def has_a_non_zero_byte(x):
def test_saves_incremental_steps_in_database():
key = b"a database key"
database = SQLiteExampleDatabase(':memory:')

Zac-HD Sep 10, 2017

👍 I'm always a fan of de-crufting things, so I did a git grep for this pattern and found a few more. Can you check the remaining uses and remove them if (as it seems to me) the new in-memory DB would be simpler?
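​
A minimal sketch of the suggested de-crufting, assuming the "new in-memory DB" referred to here is hypothesis.database.InMemoryExampleDatabase:

from hypothesis.database import InMemoryExampleDatabase

# Instead of SQLiteExampleDatabase(':memory:'), the plain in-memory database
# avoids the SQLite layer entirely in tests.
database = InMemoryExampleDatabase()
database.save(b"a database key", b"some value")
assert list(database.fetch(b"a database key")) == [b"some value"]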


DRMacIver Sep 11, 2017

I propose a separate PR in which I deprecate the SQLiteExampleDatabase entirely (it's pure legacy and there's no good reason to use it any more) which will result in sorting all of that out.


Zac-HD Sep 11, 2017

Sounds good to me.



DRMacIver commented Sep 11, 2017

@alexwlchan would you like an opportunity to review this still or are you OK for me to merge?


alexwlchan commented Sep 11, 2017

> would you like an opportunity to review this still or are you OK for me to merge?

I don’t have time to review this – merge and PR away!

DRMacIver merged commit 52807ca into master Sep 11, 2017

3 checks passed:

  • ci/circleci: Your tests passed on CircleCI!
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed

DRMacIver deleted the DRMacIver/failure-tagging branch Sep 11, 2017
