
Consider splitting observability report status for Status.OVERRUN and Status.INVALID #3827

Closed
tybug opened this issue Dec 28, 2023 · 10 comments · Fixed by #3830
Labels
enhancement (it's not broken, but we want it to be better) · legibility (make errors helpful and Hypothesis grokable)

Comments

@tybug
Member

tybug commented Dec 28, 2023

I tried out Tyche with @hgoldstein95 recently (which was really cool!). Tyche was reporting a high discard rate for the property test we looked at, which was confusing to us, because I wrote that test and I was definitely not doing any rejection sampling or using assume.

Well, I looked into it afterwards, and it turns out the test was generating a fair amount of data and so was overrunning the buffer some reasonable portion of the time. Hypothesis reports both Status.OVERRUN and Status.INVALID as gave_up — and Tyche was picking up on the former.

"status": {
Status.OVERRUN: "gave_up",
Status.INVALID: "gave_up",
Status.VALID: "passed",
Status.INTERESTING: "failed",
}[data.status],

Overruns and invalids both seem like useful things to report to the user. But since the cause/remedy for each is distinct, maybe they should each be reported with a separate status?
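For concreteness, a hypothetical variant of the mapping above that keeps the other statuses as-is but gives INVALID its own label (a sketch of the idea, not a proposed patch):

"status": {
    Status.OVERRUN: "gave_up",  # ran out of buffer while generating
    Status.INVALID: "invalid",  # rejected, e.g. by a failed assume() or filter
    Status.VALID: "passed",
    Status.INTERESTING: "failed",
}[data.status],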

As an alternative to adding a new status, we could do a better job of supplying a status_reason to mark_invalid in e.g. the case of UnsatisfiedAssumption being raised, which would allow Tyche to distinguish these scenarios.
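A rough sketch of that alternative, assuming mark_invalid could accept a reason string that ends up in the report's status_reason; the names and control flow here are illustrative, not Hypothesis's actual internals:

# Hypothetical: surface *why* the test case was invalid, so downstream tools
# can tell a failed assume() apart from an overrun.
try:
    run_test_body(data)  # illustrative stand-in for executing the test
except UnsatisfiedAssumption:
    data.mark_invalid("failed to satisfy assume() in the test body")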

cc @Zac-HD @hgoldstein95 - I'm curious what your thoughts are here.

@hgoldstein95
Contributor

This is really interesting! I see a few options for how to make sure downstream tools can surface this info.

  1. I know we have status_reason already, but I've always imagined that as a human-readable way of explaining what happened. If we want to tell users the proportion of invalid vs. overrun vs. valid, we can just inject a categorical feature called "discard_reason" or something with the appropriate categories (see the sketch after this list).
  2. We could treat status_reason as machine-readable and then interpret it as a feature downstream, but this seems like a foot-gun for other PBT library maintainers.
  3. We could add another top-level status like invalid. Then OVERRUN would become gave_up and INVALID would be invalid. This is OK, but it feels like it may slightly dilute the information for the user — I'd prefer to give them one signal, "here's how many examples we had to re-try generation for", and then break it down further with features.
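To make option 1 concrete, here is a minimal sketch of injecting such a feature into the observation before it is written out; the feature name, categories, and the observation variable are illustrative assumptions rather than existing Hypothesis code:

# Hypothetical: tag each test case with a categorical "discard_reason" feature
# so downstream tools like Tyche can chart the breakdown of gave_up cases.
reason = {
    Status.OVERRUN: "overrun",
    Status.INVALID: "failed assume()/filter",
}.get(data.status)
if reason is not None:
    observation["features"]["discard_reason"] = reason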

Taking a step back, I'm also very curious why this is happening at all! In retrospect, are you surprised you're overrunning the buffer? Was it a bug in your generator, something internal to Hypothesis, or actually "expected" behavior? If this is the kind of thing that is just expected to happen sometimes, we might not want to surface it at all; if it's a bug in Hypothesis, we may want to find a way to separate "expected" discards from "hey, your framework is broken" discards. (Also, selfishly, I'd love to be able to say "Observable PBT helped us find ___ kind of error", so I want to understand how to accurately fill in that blank.)

@Zac-HD
Member

Zac-HD commented Dec 29, 2023

I did consider both of these options, but ultimately decided that they required such knowledge of our internals as to be unhelpful for almost all users.

The main problem with splitting out OVERRUN is that we use a different max length when replaying examples from the database, which substantially changes how it should be interpreted. Additionally, I expect this to change noticeably when we cut over to the IR as our main source of truth.

As for the number of discards, this doesn't correspond to anything in the user-facing API and can happen for a variety of weird internal reasons. I'd be OK with including it in the metadata (not as a feature) though, and we already collect the count as an attribute of ConjectureData, iirc.

@hgoldstein95
Contributor

If you're sure that a user won't be able to do anything with the overrun data, maybe we shouldn't even report those generator runs? Instead of gave_up maybe there shouldn't be a test_case line at all?

But my main concern is that then someone like @tybug won't have the opportunity to inspect their generator and find a potential inefficiency. Our goal here should be to provide all of the info they need to inspect how and why their testing may be less effective than they thought, and if the framework is resampling hundreds of times because it's overrunning the buffer, that seems like something the user should be able to find out about.

Implementation inefficiencies aside, one big thing I want to make sure folks can learn from this data is when an assume statement they've inserted is rejecting a high proportion of values. This is really critical because it's tempting to use strategies like from_type in situations where the resulting generator doesn't produce many valid values. That's why I thought it might be good to pull out discards separately from overruns and other internal issues.

@Zac-HD
Member

Zac-HD commented Jan 3, 2024

> If you're sure that a user won't be able to do anything with the overrun data, maybe we shouldn't even report those generator runs? Instead of gave_up maybe there shouldn't be a test_case line at all?

We can't usefully distinguish it from rejection sampling, but it's still useful to report that a partial run happened for the same reasons in each case: tests can have arbitrary side-effects, and aborts can come very late in a test.

> But my main concern is that then someone like @tybug won't have the opportunity to inspect their generator and find a potential inefficiency. Our goal here should be to provide all of the info they need to inspect how and why their testing may be less effective than they thought, and if the framework is resampling hundreds of times because it's overrunning the buffer, that seems like something the user should be able to find out about.

I'd be surprised if there are ten people on the planet with @tybug's hands-on experience with Hypothesis internals, but point definitely taken - it's worth serving even small groups if we can do so without confusing more people, and there are far more than ten people who might find it helpful.

ISTM that the "status" entry should stay pretty general for compatibility with other frameworks, but we could show that it was due to an overrun using the "status_reason" field.

> Implementation inefficiencies aside, one big thing I want to make sure folks can learn from this data is when an assume statement they've inserted is rejecting a high proportion of values. This is really critical because it's tempting to use strategies like from_type in situations where the resulting generator doesn't produce many valid values. That's why I thought it might be good to pull out discards separately from overruns and other internal issues.

Ah - the raw discards/rejection data won't actually give us that, but I think we can collect it. (have assume() stash the source location of the call into the ConjectureData's extra_information attribute, and/or propagate it through into the stopped_because string somehow)
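A very rough sketch of the "stash the source location" idea; the helper name, message format, and control flow are illustrative assumptions rather than Hypothesis's actual implementation:

# Hypothetical: record where a failing assume() was called from, so the
# eventual status_reason (or extra_information) can point at that line.
import traceback

from hypothesis.errors import UnsatisfiedAssumption

def assume_with_location(condition):
    if not condition:
        caller = traceback.extract_stack(limit=2)[0]  # the caller's frame
        raise UnsatisfiedAssumption(
            f"failed assume() at {caller.filename}:{caller.lineno}"
        )
    return True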

There are very many internal reasons and locations for discards though, and they have a cumulative effect such that the final abort-triggering discard may not be informative - explicit calls to assume() in user code being an important exception, from which we can only recover if it happens while drawing from a strategy (which we can also check). Since something like a causal profiler seems totally out of scope, I'm not convinced that offering more detailed internal information is a good idea, although "aborted due to failing this assume() call in user code" would be great.

So... I endorse @tybug's alternative proposal in the OP 😅

Zac-HD added the enhancement and legibility labels on Jan 3, 2024
@hgoldstein95
Contributor

OK, I'm on board with the idea that actually distinguishing the various gave_up cases may just not make practical sense. I'm fine with just doing a better job of setting status_reason. Users will need to inspect individual examples to diagnose discard issues, rather than being able to do any significant analysis or visualization, but if other frameworks want to communicate more robust discard categorizations they can just make them features.

@tybug
Member Author

tybug commented Jan 5, 2024

sorry for the delayed responses here - I've just come out of vaguely-in-vacation mode 🙂

> Taking a step back, I'm also very curious why this is happening at all! In retrospect, are you surprised you're overrunning the buffer? Was it a bug in your generator, something internal to Hypothesis, or actually "expected" behavior?

@hgoldstein95 I was very surprised that I was overrunning the buffer even once, much less >50% of the time. Here's the strategy, which is not doing anything unusual - a few datetime, text, and list draws, none with a large min_size. My statement that it was generating a "fair amount" of data was pretty generous. I thought that maybe I just didn't have a good sense of how large Hypothesis' max buffer was.

Zac's comment about database buffer size was prescient, because disabling the database via @settings(database=None) reduced discards to the expected zero. I thought this was because I had many previous failures saved to the database whose new interpretation happened to overrun. But deleting .hypothesis (where I understand database examples are stored) still shows a large discard rate, unless the database is also disabled via @settings.

I'm not sure why this is. I'm not at all familiar with the database code in Hypothesis yet, so I don't dare guess. What is clear to me is that these overruns were not "normal" in the sense of a strategy generating too much data.

This is certainly still worth reporting to the user. In fact, one of the things that makes me — as an average software developer — uneasy about the Hypothesis database is that a large database may silently cause slowdowns. I have no idea which examples are being pulled from the database, which come from standard generation, and which come from explicit @example decorators (ok, I guess I can look at the source code for @example, in most cases). If Tyche told me "you have lots of discards" and also "here is the origin of those discards" (how_generated, in our json schema), I would have been able to diagnose the above issue immediately.
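As a hedged illustration of the kind of breakdown that would have helped, a small script tallying observability test cases by status and how_generated; the output directory below is an assumption about where the observation JSONL files land:

# Tally test-case observations by (status, how_generated) to see where any
# gave_up cases are coming from.
import glob
import json
from collections import Counter

counts = Counter()
for path in glob.glob(".hypothesis/observed/*.jsonl"):
    with open(path) as f:
        for line in f:
            obs = json.loads(line)
            if obs.get("type") == "test_case":
                counts[(obs["status"], obs.get("how_generated", "?"))] += 1

for (status, origin), n in counts.most_common():
    print(f"{n:6d}  {status:10s}  {origin}")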

I hear @Zac-HD's concern about requiring internal knowledge. The above would probably be confusing to developers with no knowledge of Hypothesis internals, because they don't know that it's ok for database examples to overrun and that it's not a problem with their code. But I'd argue that reporting a discard count with no elaboration is more confusing: the same knowledge of internals was still required to diagnose this particular problem; it just doesn't look that way, because the limited information (an unknown generation source) hides it.

The good news is we already report how_generated, which should help the above if properly visualized. I don't think there are immediate Hypothesis-side changes needed here.


@Zac-HD, keeping status general and using status_reason as a distinguisher makes sense to me. I would argue that "we retried generation because of an assume() statement in user code", while maybe not fundamentally different from "we retried generation", is nonetheless a particularly useful distinction to make for users. Even more useful if we can report exactly which assume statement was the cause, as you say. But I accept that this may be out of scope for Tyche or any current work 😄

If I'm not beaten to it (feel free!), I'll open a PR for this. I expect we've done most of the grunt work in the discussion here and the change itself will be small.

@Zac-HD
Member

Zac-HD commented Jan 5, 2024

Hmm, we should also see if uncapping the max_buffer_length for database replays helps 🤔

Better to replay something somewhat different than abort, it seems to me - we'd only really want the abort behaviour when replaying a fully-shrunk example.

@Zac-HD
Member

Zac-HD commented Jan 22, 2024

@tybug - master...Zac-HD:hypothesis:uncap-replay-buffers should fix your excess discards. Confirm and I'll open a PR?

@tybug
Member Author

tybug commented Jan 22, 2024

I think you've identified a separate potential issue (and I support the proposed solution), but that branch actually does not fix my issue.

I looked further into my issue above today and have a better understanding of what's going on. While it's true that @settings(database=None) does fix my issue, that's somewhat of a red herring. It's really the target phase that is causing the discards here, and the target phase happens to be disabled when no database is available (confirmed by reproducing the same behavior when disabling only the target phase).
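For reference, disabling just the target phase can be done via the public settings API, as in the sketch below (the test itself is a stand-in; the pattern matches the snippet quoted at the end of this comment):

# Disable only the target phase, keeping the database and all other phases.
from hypothesis import Phase, given, settings, strategies as st

@settings(phases=set(settings.default.phases) - {Phase.target})
@given(st.lists(st.integers()))
def test_without_target_phase(a):
    pass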

This looks to be a perennial issue of having discards even for simple strategies. For instance, here's the Tyche report for lists(integers()):

from hypothesis import *
from hypothesis.strategies import *

@given(lists(integers()))
def test_f(a):
    pass

test_f()
[screenshot: Tyche report for test_f, showing a nontrivial number of discarded (gave_up) test cases]

where the discards are from the pareto optimiser, or presumably really from the shrinker called by the optimiser.

It showed up in much larger quantities in the strategy that prompted me to open this issue:

[screenshot: Tyche report for the original strategy, showing a much larger number of discards]

because the pareto optimiser is doing a lot more work there, for whatever reason. But all of these discards are from the optimiser.

I don't have any target calls in this strategy, fwiw, so I'm not entirely sure why the optimiser is running in the first place — maybe an opportunity for us to optimize (pardon) our own code here?

I'm reminded of this comment + #2985:

@example({0}, [1])
@given(st.sets(st.integers(0, 255)), byte_order())
# This test doesn't even use targeting at all, but for some reason the
# pareto optimizer makes it much slower.
@settings(phases=set(settings.default.phases) - {Phase.target})
def test_learning_always_changes_generation(chars, order):

@Zac-HD
Member

Zac-HD commented Jan 23, 2024

Ah, I recognise the same forces at work! Very briefly:

  • We're seeing Overrun for essentially the same reason: the target phase is built on the shrinker, which similarly caps the length of candidate buffers. That makes sense in the shrink phase (any longer buffer is not a shrink, so we might as well abort), but not in the target phase (longer buffers might be good!).
  • So the solution is probably going to look like "add some 'what phase are we in' state tracking, and if it's < Phase.shrink, then allow extending buffers for the target phase" (see the sketch after this list). Or possibly a deeper overhaul of the targeting engine, which exists but is basically an MVP at the moment.
    • if we start tracking phase accurately, let's make sure to thread this through to the how_generated observability key. Currently that uses a not-very-good heuristic.
  • why run the target phase when there are no target calls? It also targets shorter buffers, and saves those for later replay if there are few failing examples. However I doubt that there's been rigorous analysis of this, so we could change it as part of Better heuristic for when to optimize target() metrics #3176.
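A hypothetical sketch of the phase-tracking idea above; the enum and helper are illustrative and not Hypothesis's real internals:

# Only cap candidate buffer length once we are shrinking; earlier phases
# (including target) are allowed to extend the buffer.
from enum import IntEnum

class EnginePhase(IntEnum):
    generate = 1
    target = 2
    shrink = 3

def max_candidate_length(phase, base_buffer_size, current_best_length):
    if phase >= EnginePhase.shrink:
        # Any longer buffer cannot be a shrink, so abort early.
        return current_best_length
    return base_buffer_size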

It'd be great to have this discussion written up as an issue, with bonus points for suggesting a path forward. I'll get to that in the next few days if you don't 😁
