
settings rationalization #535

Closed
DRMacIver opened this issue Apr 20, 2017 · 29 comments
Labels: enhancement (it's not broken, but we want it to be better) · legibility (make errors helpful and Hypothesis grokable) · opinions-sought (tell us what you think about these ones!)

Comments

DRMacIver (Member) commented Apr 20, 2017

The Hypothesis settings system is currently a confusing mess of confusingness. It asks users to make decisions they can't possibly have sufficient information to make, and ties those decisions very closely to Hypothesis internals.

I think we should give serious consideration to deprecating most of them. This ties in with #534.

Here is a rough list of what I think about specific settings:

  • buffer_size - very bad. Totally about Hypothesis internals, totally undocumented as to what it really means. Hypothesis should figure out something sensible to do here and then do it.
  • database_file - not intrinsically bad, but redundant with database. May be worth considering making the database API more public and just using the database setting. deprecated in Various improvements to the database setting #1196
  • database - good. Sensible operational decision that the user can reasonably have strong opinions about and Hypothesis shouldn't.
  • derandomize - Good though arguably redundant with seed.
  • max_examples - mostly reasonable but confusingly named (conflicts with example decorator while hilariously ignoring it entirely). I feel like having it be a max might also be a bad idea.
  • max_iterations - bad, often trips people up when they forget to set it, there's no real way people could have enough information to set it. Hypothesis should grow a better heuristic here.
  • max_shrinks - tuning heuristic. Only really useful to make testing strategies faster, but crucial there (~10x slowdown in our tests without it).
  • min_satisfying_examples - Same. See also Towards true non-flakiness #534 and comments on min_satisfying_examples doesn't consider uniqueness #518. This is very tied to Hypothesis's internal notion of what an example is.
  • perform_health_check - sensible operational decision in which the user deliberately decides to ignore warnings, but I feel like it might be better dropped in favour of asking people to explicitly suppress the health checks they don't want using suppress_health_check.
  • phases - this is a useful feature that I think supplants a lot of the use cases for things like max_shrinks, max_examples, etc and apparently is completely undocumented! That should be fixed.
  • stateful_step_count - totally bad. Why is this even here? Hypothesis should just do something sensible automatically and provide a function argument to override that.
  • strict - originally listed here as good (a sensible operational decision as to how the user wants to respond to Hypothesis deprecations), but now bad - deprecated in favour of normal warning control.
  • suppress_health_check - good. Explicit request to suppress health checks. Totally reasonable to want.
  • timeout - not intrinsically bad from a user point of view, but should probably go anyway due to Towards true non-flakiness #534. Now deprecated.
  • verbosity - Good. Useful debugging tool, totally sensible thing to want to tune.

All of this should be done with a very careful deprecation process which preserves the current default behaviour, warns when you execute a behaviour that will change in future, and gives people an option to opt in to the future behaviour.

Pending Deprecation Plans

The following are suggested routes for some deprecations:

  • Deprecate perform_health_check=False and suggest suppress_health_check=list(HealthCheck)
  • Deprecate max_iterations, and default to not_set - meaning five times max_examples (the longstanding factor before the default for that was reduced to speed up under coverage).
  • Deprecate stateful_step_count and set it to not_set by default. Add a max_steps property on GenericStateMachine which also defaults to not_set (or maybe None?). Determine the actual max_steps in order of "look on the object, look on settings, use a default" - a sketch of this lookup order follows the list. When the deprecation plan moves to dropping it,
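
A minimal sketch of that proposed lookup order, assuming hypothetical names (a not_set sentinel, a max_steps attribute on the machine, and an illustrative default of 50) - not Hypothesis's actual implementation:

not_set = object()  # sentinel meaning "no value was chosen"

DEFAULT_MAX_STEPS = 50  # assumed fallback; illustrative only

def resolve_max_steps(machine, settings):
    # 1. Look on the state machine object itself.
    candidate = getattr(machine, "max_steps", not_set)
    if candidate is not not_set:
        return candidate
    # 2. Fall back to the (deprecated) settings value.
    if settings.stateful_step_count is not not_set:
        return settings.stateful_step_count
    # 3. Use the built-in default.
    return DEFAULT_MAX_STEPS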
alexwlchan (Contributor) commented

I’m adding reraise_exceptions in #515 as well. I haven’t looked into documentation specifically – but I sort of assumed settings were automatically documented?

DRMacIver (Member, Author) commented

I sort of assumed settings were automatically documented?

That's not the case - they have to be explicitly added to the autoclass declaration in settings.rst. It might be possible to make it the case though - I don't remember why there's an explicit list.

Daenyth commented May 10, 2017

I'd really like timeout to not go away, because I've used Hypothesis before for integration tests where a single example can realistically take 10 seconds. Controlling timeout is really important for those - it lets me have fast vs thorough profiles.

DRMacIver (Member, Author) commented

it lets me have fast vs thorough profiles

@Daenyth, is there a reason max_examples doesn't work for this use case? It's a significantly more robust way of controlling this sort of thing - the problem with timeout for this use case is that it causes the amount of testing you do to silently degrade as your tests get slower.
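For example, profiles built on max_examples can give the same suite-wide fast-vs-thorough switch (a sketch using the real settings profile API; the profile names and example counts here are illustrative):

import os

from hypothesis import settings

settings.register_profile("fast", settings(max_examples=10))
settings.register_profile("thorough", settings(max_examples=1000))

# Select a profile suite-wide, e.g. HYPOTHESIS_PROFILE=thorough pytest tests/
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "default"))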

Daenyth commented May 10, 2017

Yes - I can set timeout globally, for all tests in the suite. With max_examples I need to tune each individual test to acceptable tolerances experimentally. I hear what you're saying about silently degrading the test quality, but I'd prefer a warning section in the docs suggesting that I not do it rather than just making it impossible.

As for degrading tests - if we could hook into pytest's --durations flag at a per-example level, it would really help me actually profile and tune my tests. Right now the speed of tests is a bit of a black box: it's hard to tell whether it's my strategy that's slow, or the test body, or something else.

Zac-HD (Member) commented May 14, 2017

The whole issue is related to #197.

  • database_file - not intrinsically bad, but redundant with database. May be worth considering making the database API more public and just using the database setting.
  • database - good. Sensible operational decision that the user can reasonably have strong opinions about and Hypothesis shouldn't.

See #175, and maybe #177.

a3kov commented Jun 18, 2017

Also related: #697

Zac-HD added the enhancement and opinions-sought labels on Oct 29, 2017
bukzor (Contributor) commented Nov 22, 2017

I've run into buffer_size today for the first time. My strategy mysteriously wasn't generating any examples in the allotted time, even though it's pretty linear and has no assume() statements. Stripping a couple of layers off the onion revealed a StopTest exception. Reading the code revealed that Hypothesis stopped the test because the buffer of its ConjectureData was below a required size. Simply deleting this check seemed to allow things to continue as expected. I eventually found the setting, tried it, and found I had to double it five times to get the required buffer size.

Because it's a hypothesis utility/helper strategy that is generating this giant structure, I'm at a loss as to how to make this remotely user-friendly. If nothing is done, I have to add @settings(buffer_size=2**18) to all tests that make use of this utility, and new users of this utility will run into the same mysterious issue. A settings decorator on the strategy does nothing, and I feel sure that changing settings from within the strategy's body is futile, since the buffer is part of the input.
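
For reference, that per-test workaround looks like this (a sketch, assuming a Hypothesis 3.x release where buffer_size is still a settings attribute; giant_structures is a hypothetical stand-in for the helper strategy in question):

from hypothesis import given, settings

from myproject.testing import giant_structures  # hypothetical helper strategy

@settings(buffer_size=2 ** 18)  # 32x the 8KB default, i.e. doubled five times
@given(giant_structures())
def test_handles_giant_structure(structure):
    ...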

Recommendations? Could hypothesis auto-expand its buffer in this case? Is that a huge or small patch?

bukzor (Contributor) commented Nov 22, 2017

The patch below seems to make things better. I feel sure you can think of something better though; it's only meant to serve as a strawman.

--- src/hypothesis/internal/conjecture/engine.py
+++ src/hypothesis/internal/conjecture/engine.py
@@ -122,6 +122,7 @@
             self._test_function(data)
             data.freeze()
         except StopTest as e:
+            self.debug('%s %s' % (e, data.status))
             if e.testcounter != data.testcounter:
                 self.save_buffer(data.buffer)
                 raise e
@@ -652,6 +653,7 @@
         self.test_function(zero_data)
 
         count = 0
+        buffer_size = self.settings.buffer_size
         while count < 10 and not self.interesting_examples:
             def draw_bytes(data, n):
                 return self.__rewrite_for_novelty(
@@ -661,11 +663,13 @@
             targets_found = len(self.covering_examples)
 
             self.last_data = ConjectureData(
-                max_length=self.settings.buffer_size,
+                max_length=buffer_size,
                 draw_bytes=draw_bytes
             )
             self.test_function(self.last_data)
             self.last_data.freeze()
+            if self.last_data.status == Status.OVERRUN:
+                buffer_size *= 2
 
             if len(self.covering_examples) > targets_found:
                 count = 0
@@ -712,7 +716,7 @@
 
                 data = ConjectureData(
                     draw_bytes=draw_bytes,
-                    max_length=self.settings.buffer_size,
+                    max_length=buffer_size,
                 )
                 self.test_function(data)
                 data.freeze()
@@ -723,7 +727,7 @@
                 prev_data = self.last_data
                 data = ConjectureData(
                     draw_bytes=mutator,
-                    max_length=self.settings.buffer_size
+                    max_length=buffer_size,
                 )
                 self.test_function(data)
                 data.freeze()

Zac-HD (Member) commented Nov 23, 2017

Hmm. @DRMacIver can probably comment on whether this patch is a good idea - I suspect not, on the basis that it could lead to hard-to-find performance problems when it kicks in. (we'd also have to deprecate the setting, obviously)

With Hypothesis as it is now, though, needing 256KB of entropy is a huge amount! It's going to be super-slow, and may not minimise well. If you can share the code I'd be happy to comment directly; otherwise the generic advice includes things like checking limits in recursive and giving users size parameters to guide whatever strategy you define.

If you also have no control over the utility/strategy/helper, then you'll just have to live with applying the setting to each test (or changing settings.default as a quick hack).

DRMacIver (Member, Author) commented

Recommendations? Could hypothesis auto-expand its buffer in this case? Is that a huge or small patch?

I do have plans to do a certain amount of auto-expanding of the buffer at some point, but most of this is so that the buffer size can start low. In particular doubling it each time you exceed it as in your patch is not a good idea - the whole point of having the setting at all is to stop data getting unreasonably big.

My main recommendation is that if you need this much buffer size, your users are going to have a bad time regardless of what the settings are, because the performance is going to be awful - you're also going to have to turn off most of the health checks to make it run for starters.

What are you actually generating? Can you make it smaller somehow?

DRMacIver (Member, Author) commented

What are you actually generating? Can you make it smaller somehow?

One thing to look into here is that you could be routinely exceeding the buffer size simply because your data shrinks badly - what's supposed to happen is that if you exceed half the max buffer size, Hypothesis essentially starts shrinking your example, but if your example doesn't shrink well this can backfire. One thing to check, for example: do you have smaller examples towards the left in one_of calls? (That this is useful is currently not well documented at all.)
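
To illustrate that advice: put the smallest/simplest alternatives first, so shrinking can move towards them (a sketch; the particular strategies chosen are only an example):

import hypothesis.strategies as st

values = st.one_of(
    st.none(),                # simplest possible example first
    st.booleans(),
    st.integers(),
    st.text(),
    st.lists(st.integers()),  # larger, compound data last
)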

Zac-HD (Member) commented Nov 23, 2017

do you have smaller examples towards the left in one_of calls? (That this is useful is currently not well documented at all)

#984

bukzor (Contributor) commented Nov 23, 2017 via email

DRMacIver (Member, Author) commented

I think I can make a good patch from this, and I have some rebuttals for your objections (that I hope you'll like)

I think it's really very unlikely that you're going to be able to make this approach work in a way that doesn't break a lot of existing well-supported use cases. This isn't an easy problem, and the fixed buffer size is currently carrying a lot of weight in terms of how various cases are handled (deeply nested and recursive structures in particular).

You're welcome to give it a try if you really want to, but I would be very surprised if this proved to be a promising direction because fundamentally the outcome you are trying to achieve is something that looks like an error for normal operation.

Zac-HD added the legibility label on Apr 1, 2018
Zac-HD (Member) commented Apr 1, 2018

@DRMacIver - can you update the top of this thread? e.g. strict is called out as a good idea, but has actually been deprecated for a while now!


I'm thinking about doing some aggressive pruning, with a view to removing some old settings entirely in Hypothesis 4. How does this plan sound:

  • Deprecate perform_health_check=False in favor of suppress_health_check=list(HealthCheck) (with appropriate docs; a sketch of the replacement usage follows this list)
  • Deprecate max_shrinks, and default to not_set - meaning 500 shrinks as a non-user-visible tuning parameter.
  • Deprecate max_iterations, and default to not_set - meaning ten times max_examples (the current factor).
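
A minimal sketch of the proposed migration: the decorator form, given, and the HealthCheck enum are Hypothesis's real API, while treating list(HealthCheck) as "suppress everything" is what's proposed above:

from hypothesis import HealthCheck, given, settings, strategies as st

# Before (to be deprecated):  @settings(perform_health_check=False)
@settings(suppress_health_check=list(HealthCheck))
@given(st.integers())
def test_without_health_checks(x):
    ...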

In the absence of any proposal, I'm inclined to leave buffer_size, stateful_step_count, etc. alone. I'd be happy to just make these internal, as for max_shrinks, but they seem somewhat more useful to me; given the tight timeline we can come back to them later.

DRMacIver (Member, Author) commented

In the absence of any proposal, I'm inclined to leave buffer_size, stateful_step_count, etc alone. Happy to just make these internal as for max_shrinks, but they seem somewhat more useful to me;

I actually have a concrete proposal for deprecating stateful_step_count. I'll add that. Agreed that we don't currently have a good plan for buffer_size and that it should stay for now.

with a view to removing some old settings entirely in Hypothesis 4.
given the tight timeline we can come back later.

Note that I'm still very much against us dropping things with anything less than a six-month deprecation timeline, and that opinion hasn't somehow changed while I was in quiet mode, so the tightness of the timeline for a release is somewhat irrelevant to the question of deprecating these.

Zac-HD (Member) commented Apr 4, 2018

And another thing: deprecated settings using their default values should be omitted from the __repr__ - at least when the default is "pretend this doesn't exist anymore" (i.e. not timeout). There's no (reasonable and) principled way to do this, but IMO an awful hack is worth it to guide users away from things they shouldn't touch.

DRMacIver (Member, Author) commented

There's no (reasonable and) principled way to do this, but IMO an awful hack is worth it to guide users away from things they shouldn't touch.

I think the thing to do might be to add a repr argument to the settings definition, default it to not_set, and when it's not_set treat it as False if deprecated and True otherwise.
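
A minimal sketch of that rule, assuming hypothetical internals - a per-setting definition object with a deprecated flag and the proposed show_in_repr flag (none of these names are Hypothesis's real internals):

not_set = object()  # sentinel for "no explicit choice made"

def include_in_repr(setting):
    # An explicit repr flag on the setting definition wins.
    if setting.show_in_repr is not not_set:
        return setting.show_in_repr
    # Otherwise: hide deprecated settings from the repr, show everything else.
    return not setting.deprecated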

Zac-HD (Member) commented Apr 5, 2018

That's the elegant version - the problem is that e.g. the timeout setting can still affect your test with the default value. Easily solvable if you're OK with changing the current default timeout to unlimited, though!

DRMacIver (Member, Author) commented

problem is that e.g. the timeout setting can still affect your test with the default value.

That's why there's a flag you can set, so we can make some deprecated settings opt in to being included in the repr!

But also we're well outside the six-month period for timeout's deprecation, so we could just get rid of it in 4.0 (or rather switch it over to the "doesn't do anything and is still deprecated, but assigning unlimited to it isn't an error" mode which we promised to do).

Zac-HD (Member) commented Apr 6, 2018

To summarise a really annoying afternoon: it's tedious but fairly easy to remove max_shrinks, but unfortunately doing so makes our test suite at least an order of magnitude slower.

Removing that setting is therefore impractical without building an equivalent mechanism for use in our tests, and at that point I'm not actually convinced we should remove it at all.

DRMacIver (Member, Author) commented

To summarise a really annoying afternoon: it's tedious but fairly easy to remove max_shrinks; but unfortunately this makes our test suite at least an order of magnitude slower.

I think there are two senses in which we can remove it, and we'd probably been implicitly assuming opposite ones (unless I'm misunderstanding you) - I'd interpreted remove as "hard-code the value", and it sounds like you'd interpreted it as removing the limit?

Note BTW that there's an alternative equivalent to max_shrinks=0 which is that you can remove it from the phases setting.
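
For illustration, disabling shrinking via the phases setting looks like this (Phase, given, and the decorator are Hypothesis's real API; the test itself is only an example):

from hypothesis import Phase, given, settings, strategies as st

@settings(phases=[Phase.explicit, Phase.reuse, Phase.generate])  # no Phase.shrink
@given(st.integers())
def test_without_shrinking(x):
    ...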

bukzor (Contributor) commented Apr 6, 2018

(I hope I'm being helpful.) I would like to see a system that derives the "max shrinks" from the deadline setting: the test would stop shrinking once the deadline-derived budget passes. This would neither remove the limit nor set a hard-coded value. I believe the slowdown that @Zac-HD encountered wouldn't occur, as long as good deadlines are set.

Closer to current default behavior would be to allow up to 5x the deadline to be spent on the shrink process.
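
A minimal sketch of that time-budgeted shrinking, assuming a hypothetical shrinker loop - none of this is Hypothesis's real shrinker, and the 5x factor is just the suggestion above:

import time

SHRINK_BUDGET_FACTOR = 5  # "up to 5x the deadline" suggested above

def shrink_within_budget(shrinker, deadline_seconds):
    budget = deadline_seconds * SHRINK_BUDGET_FACTOR
    start = time.monotonic()
    while shrinker.has_more_candidates():  # hypothetical shrinker API
        if time.monotonic() - start > budget:
            break  # desist once the time budget is exhausted
        shrinker.try_next_candidate()
    return shrinker.best_example()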

If this is a good idea, I would spend time to contribute this patch.

Zac-HD (Member) commented Apr 7, 2018

I think there are two senses in which we can remove it and we'd probably been implicitly assuming the opposite ones (unless I'm misunderstanding why) - I'd interpreted remove as "hard code the value" and it sounds like you'd interpreted as removing the limit? Note BTW that there's an alternative equivalent to max_shrinks=0 which is that you can remove it from the phases setting.

I'd also interpreted it as "hard-code the current default"... but missed the phases setting. That should help a lot!

If this is a good idea, I would spend time to contribute this patch.

I've already got most of a patch for #1211, so leave this one to me 😁

bukzor (Contributor) commented Apr 8, 2018

@Zac-HD I don't see max_shrinks in #1211. It sounds like you plan to add it?

Zac-HD (Member) commented Apr 8, 2018

Yep, not quite finished over there - our test suite turns out to be really subtle sometimes. At least I'm confident that it'll work for anyone once it passes!

Zac-HD (Member) commented Apr 14, 2018

Note BTW that there's an alternative equivalent to max_shrinks=0 which is that you can remove it from the phases setting.

Turns out that max_shrinks=0 behaves exactly like max_shrinks=1, not phases=list(Phase)[:Phase.shrink]. Oops?

Since this difference does in fact get caught by our tests, I'm going to drop max_shrinks from #1211 for now so that can actually be merged.

Zac-HD (Member) commented Apr 14, 2018

It turns out that (assuming my pull will be merged) there are only three settings left on the to-deprecate list! Due to the selection effects they are also the trickiest ones to get right, so I've moved our notes over to individual issues for discussion and hopefully action ✨
