Ease bug diagnosis by reporting generalised or multiple failing examples #2192
Comments
https://agroce.github.io/issta17.pdf is an example of this. We're already effectively doing the normalization bit, but it would definitely be interesting to add in generalisation.
I'd like to make the following tentative proposal:
Replaced by the newer and more actionable issue linked just above this comment.
How can we make it easier to diagnose failing tests?
Shrinking examples is great: the first failures Hypothesis finds can be so complicated that they're not that helpful. However, sometimes this can go too far and shrinking can make a failure look less important than it really is - the canonical example is when we shrink floating-point examples until they look like rounding errors rather than serious problems (see e.g. #2180 or this essay).
Why do I care? Because as we push the Pareto frontier of easy and powerful testing tools further out, users can write better software and more people will use the tools. IMO this is an especially important topic because it helps both novices with easy problems and experts with hard problems!
We already support multi-bug discovery, but here I'm talking about how we report each bug. Currently we print exactly one (minimal, failing) example to demonstrate each bug, but there might be more informative options.
Generalising counterexamples
In Haskell, Extrapolate and SmartCheck try to generalise examples after minimising them. Translating into Python, it seems worth experimenting with the idea of shrinking to a minimal example and then varying parts of the strategy so that we can tell the user which parts are unimportant. We could also start from non-minimal examples, jump-start the search based on previously tried examples, or overengineer our way out with inductive programming.
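To make that concrete, here's a rough sketch of the idea - generalise() is a made-up helper, not Hypothesis API, and a real implementation would reuse the shrinker's machinery rather than .example():
```python
# Hypothetical sketch: hold the shrunk value for every argument except one,
# re-draw that argument from its strategy a few times, and report arguments
# whose value never seems to matter.  `generalise` is not part of Hypothesis.
from hypothesis import strategies as st


def generalise(test, arg_strategies, shrunk, trials=20):
    """Return argument names that appear irrelevant to the failure."""
    irrelevant = set()
    for name, strategy in arg_strategies.items():
        for _ in range(trials):
            # .example() is real but meant for interactive exploration; a real
            # implementation would drive Hypothesis's engine directly instead.
            candidate = dict(shrunk, **{name: strategy.example()})
            try:
                test(**candidate)
            except Exception:
                continue  # this variation still fails
            break  # found a passing variation, so `name` is relevant
        else:
            irrelevant.add(name)
    return irrelevant


def divide(a, b):
    return a / b


# With the shrunk example a=0, b=0 this typically reports {'a'}: only b == 0
# is essential, so we could tell the user "a=integers(), b=just(0)".
print(generalise(divide, {"a": st.integers(), "b": st.integers()}, {"a": 0, "b": 0}))
```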
We could try to display such a generalisation by calculating the least-constrained strategy that always fails the test. For example, testing a / b with a=integers(), b=integers() could show something like a=integers(), b=just(0) -> ZeroDivisionError. Abstracting Failure-Inducing Inputs (pdf) and AlHazen (pdf) both take almost exactly this approach, though both are restricted to strings matching a context-free grammar.
Presenting multiple examples
I'm pretty sure I've also seen some prior art somewhere which presented a set of passing and a set of failing examples to help diagnose test failure. I have enough experience looking at shrunk examples that I can implicitly see what must have passed (because it would be a valid shrink otherwise), but it would be nice to print this explicitly for the benefit of new users and anyone using complex strategies.
We don't even need to execute variations on the minimal example for this - in many cases we could just choose a few of the intermediate examples from our shrinking process. It also seems easier to display examples than an algebraic repr in complicated cases like interactive data(), because we already do this for the final example!
Another valuable source of examples to display comes from targeted PBT: in addition to showing the minimal example, we could show the highest-scoring failing example for each label (subject to a size cap, for both performance and interpretability). This suggests that we might want to keep searching until the score stops growing, much as we shrink to a fixpoint. And why not also show the highest-scoring passing example?
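For context, here's roughly what targeted PBT looks like today - target() is real Hypothesis API, but reporting the highest-scoring failing example per label as suggested above would be new:
```python
# target() is Hypothesis's existing targeted-PBT API; the per-label reporting
# of highest-scoring failing examples suggested above is hypothetical.
from hypothesis import given, target, strategies as st

finite = st.floats(min_value=-1e16, max_value=1e16)


@given(finite, finite)
def test_addition_is_exactly_associative(a, b):
    error = abs((a + b) - b - a)
    # Guide the search towards inputs that maximise the observed error.
    target(error, label="rounding error")
    assert error == 0.0
```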
What now?
This issue is designed to start a conversation - personally I think that a setting to report multiple examples would be useful enough that we might want it on-by-default or part of verbosity >= normal; generalising counterexamples seems like a really cool trick but I'm not convinced that it's worth the trouble if we have multi-example reporting.
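For reference, the verbosity knob already exists via the settings API - here's a minimal sketch of how it's set today; the tie-in to multi-example reporting is the hypothetical part:
```python
# The Verbosity setting is existing Hypothesis API; making it control how many
# examples are reported per bug is the proposal above, not current behaviour.
from hypothesis import Verbosity, given, settings, strategies as st


@settings(verbosity=Verbosity.verbose)  # quiet / normal / verbose / debug
@given(st.integers(), st.integers())
def test_division(a, b):
    a / b  # ZeroDivisionError whenever b == 0
```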
If or when we have concrete proposals I'll split out focussed issues and close this one.