-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider splitting observability report status for Status.OVERRUN and Status.INVALID
#3827
Comments
|
This is really interesting! I see a few options for how to make sure downstream tools can surface this info.
Taking a step back, I'm also very curious why this is happening at all! In retrospect, are you surprised you're overrunning the buffer? Was it a bug in your generator, something internal to Hypothesis, or actually "expected" behavior? If this is the kind of thing that is just expected to happen sometimes, we might not want to surface it at all; if it's a bug in Hypothesis, we may want to find a way to separate "expected" discards from "hey, your framework is broken" discards. (Also, selfishly, I'd love to be able to say "Observable PBT helped us find ___ kind of error", so I want to understand how to accurately fill in that blank.) |
|
I did consider both of these options, but ultimately decided that they required such knowledge of our internals as to be unhelpful for almost all users. The main problem with splitting out OVERRUN is that we use a different max length when replaying examples from the database, which substantially changes how it should be interpreted. Additionally, I expect this to change noticeably when we cut out over to the IR as our main source of truth. For number of discards, this doesn't correspond to anything in the user-facing API and can happen for a variety of weird internal reasons. I'd be OK with including it in the metadata (not as a feature) though, and we already collect the count as an attribute of iirc ConjectureData. |
|
If you're sure that a user won't be able to do anything with the overrun data, maybe we shouldn't even report those generator runs? Instead of But my main concern is that then someone like @tybug won't have the opportunity to inspect their generator and find a potential inefficiency. Our goal here should be to provide all of the info they need to inspect how and why their testing may be less effective than they thought, and if the framework is resampling hundreds of times because it's overrunning the buffer, that seems like something the user should be able to find out about. Implementation inefficiencies aside, one big thing I want to make sure folks can learn from this data is when an |
We can't usefully distinguish it from rejection sampling, but it's still useful to report that a partial run happened for the same reasons in each case: tests can have arbitrary side-effects, and aborts can come very late in a test.
I'd be surprised if there are ten people on the planet with @tybug's hands-on experience with Hypothesis internals, but point definitely taken - it's worth serving even small groups if we can do so without confusing more people, and there are far more than ten people who might find it helpful. ISTM that the "status" entry should stay pretty general for compatibility with other frameworks, but we could show that it was due to an overrun using the "status_reason" field.
Ah - the raw discards/rejection data won't actually give us that, but I think we can collect it. (have There are very many internal reasons and locations for discards though, and they have a cumulative effect such that the final abort-triggering discard may not be informative - explicit calls to So... I endorse @tybug's alternative proposal in the OP 😅 |
|
OK, I'm on board with the idea that actually distinguishing the various |
|
sorry for the delayed responses here - I've just come out of vaguely-in-vacation mode 🙂
@hgoldstein95 I was very surprised that I was overrunning the buffer even once, much less >50% of the time. Here's the strategy, which is not doing anything unusual - a few Zac's comment about database buffer size was prescient, because disabling the database via I'm not sure why this is. I'm not at all familiar with the database code in Hypothesis yet, so I don't dare guess. What is clear to me is that these overruns were not "normal" in the sense of a strategy generating too much data. This is certainly still worth reporting to the user. In fact, one of the things that makes me — as an average software developer — uneasy about the Hypothesis database is that it may silently cause slowdowns due to large databases. I have no idea what examples are being pulled from the database, standard generation, or explicit I hear @Zac-HD's concern about requiring internal knowledge. The above would probably be confusing to developers with no knowledge of Hypothesis internals, because they don't know that it's ok for database examples to overrun, and not a problem with their code. But I argue that reporting discard count with no elaboration is more confusing. The same knowledge of internals was still required to diagnose this particular problem. It just doesn't appear so because it is obfuscated by limited information (unknown generation source). The good news is we already report @Zac-HD, keeping If I'm not beaten to it (feel free!), I'll open a PR for this. I expect we've done most of the grunt work in the discussion here and the change itself will be small. |
|
Hmm, we should also see if uncapping the max_buffer_length for database replays helps 🤔 Better to replay something somewhat different than abort, it seems to me - we'd only really want the abort behaviour when replaying a fully-shrunk example. |
|
@tybug - master...Zac-HD:hypothesis:uncap-replay-buffers should fix your excess discards. Confirm and I'll open a PR? |
|
I think you've identified a separate potential issue (and I support the proposed solution), but that branch actually does not fix my issue. I looked further into it my issue above today and have a better understand of what's going on. While it's true that This looks to be a perennial issue of having discards even for simple strategies. For instance, here's the tyche report for from hypothesis import *
from hypothesis.strategies import *
@given(lists(integers()))
def test_f(a):
pass
test_f()
where the discards are from the pareto optimiser, or presumably really from the shrinker called by the optimiser. It showed up in much larger quantities in the strategy that prompted me to open this issue:
because the pareto optimiser is doing a lot more work there, for whatever reason. But all of these discards are from the optimiser. I don't have any I'm reminded of this comment + #2985: hypothesis/hypothesis-python/tests/nocover/test_lstar.py Lines 23 to 28 in 022b1f2
|
|
Ah, I recognise the same forces at work! Very briefly:
It'd be great to have this discussion written up as an issue, with bonus points for suggesting a path forward. I'll get to that in the next few days if you don't 😁 |


I tried out Tyche with @hgoldstein95 recently (which was really cool!). Tyche was reporting a high discard rate for the property test we looked at, which was confusing to us, because I wrote that test and I was definitely not doing any rejection sampling or using
assume.Well, I looked into it afterwards, and it turns out the test was generating a fair amount of data and so was overrunning the buffer some reasonable portion of the time. Hypothesis reports both
Status.OVERRUNandStatus.INVALIDasgave_up— and Tyche was picking up on the former.hypothesis/hypothesis-python/src/hypothesis/internal/observability.py
Lines 49 to 54 in 6da7c6f
Overruns and invalids both seem like useful things to report to the user. But since the cause/remedy for each is distinct, maybe they should each be reported with a separate status?
As an alternative to adding a new status, we could do a better job of supplying a
status_reasontomark_invalidin e.g. the case ofUnsatisfiedAssumptionbeing raised, which would allow Tyche to distinguish these scenarios.cc @Zac-HD @hgoldstein95 - I'm curious what your thoughts are here.
The text was updated successfully, but these errors were encountered: