
Generating test data without using @given decorator #3790

Closed
karlicoss opened this issue Nov 15, 2023 · 3 comments
Labels
question not sure it's a bug? questions welcome

Comments

@karlicoss

What I want to achieve: I'm trying to use Hypothesis to generate large amounts of randomized test data -- not to run tests, just to use it in a standalone script.
I found that I can call a strategy's .example() method to generate data. I've intentionally simplified my use case; let's say we want to generate 1000 integers:

from hypothesis.strategies import lists, integers

TOTAL = 1000
minint = 0
maxint = 2 ** 31

gen = lists(integers(min_value=minint, max_value=maxint), min_size=TOTAL, max_size=TOTAL)
ints = gen.example()
assert len(ints) == TOTAL  # just to check

This works, but I have two issues:

  • It takes noticeable time to run (about 10 seconds).
    If I use custom code with random.Random.randint to generate 1000 integers, it completes instantly, as expected.
    If I use Hypothesis via @given, defining a test function, etc., it also runs instantly. So I don't really understand why there is such a performance difference.
  • I couldn't find a way to force it to use a fixed random seed (this makes sense in my case, since I'm interested in data generation rather than fuzzing/finding a minimal failing example).
    I tried using register_random, but it had no effect.
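For comparison, the plain-random baseline mentioned above might look like the sketch below (the make_ints helper and the seed value 42 are my own illustrative choices, not from the original post):

```python
import random

def make_ints(seed, n=1000, lo=0, hi=2 ** 31):
    """Generate n integers in [lo, hi] from a seeded, standalone RNG."""
    rng = random.Random(seed)  # fixed seed -> reproducible output
    return [rng.randint(lo, hi) for _ in range(n)]

ints = make_ints(42)
assert len(ints) == 1000
# The same seed always yields the same data:
assert make_ints(42) == ints
```

This runs effectively instantly, which is the performance gap the question is about.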

So the questions are:

  • Why is this bit of code so slow? I looked at the code, and it seems there could be some overhead from extra filtering etc. (even when none is defined, as in my case), but I wouldn't expect it to be that slow.
  • Is this a completely wrong way to use Hypothesis? It feels like it could be useful to benefit from the Hypothesis machinery for data generation without necessarily using decorators, etc.

Apologies if this isn't the best forum to ask -- I did read the docs and searched through the source code but couldn't really figure this out. Thanks!

@Zac-HD added the "question: not sure it's a bug? questions welcome" label on Nov 15, 2023
@Zac-HD
Member

Zac-HD commented Nov 15, 2023

@given() is the only way to draw data from strategies - the .example() method just wraps that up for you internally! Supporting meaningfully different interfaces just isn't technically feasible with our limited volunteer time 🙁

For determinism and number of examples, you'll want to use @settings(max_examples=..., derandomize=True).
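A minimal sketch of that suggestion, collecting the generated values into a list (the collected list and generate function are illustrative names of mine; the @settings parameters are from Hypothesis's public API):

```python
# Deterministic bulk data generation via @given + @settings, per the
# suggestion above. Requires the `hypothesis` package.
from hypothesis import HealthCheck, Phase, given, settings
from hypothesis.strategies import integers

collected = []

@settings(
    max_examples=1000,          # how many values to generate
    derandomize=True,           # deterministic data, no random seed needed
    database=None,
    deadline=None,
    phases=[Phase.generate],    # generation only, no shrinking
    suppress_health_check=list(HealthCheck),
)
@given(integers(min_value=0, max_value=2 ** 31))
def generate(x):
    collected.append(x)

generate()
# collected now holds the deterministically generated integers
```

Because derandomize=True fixes the data generation, re-running the script produces the same values every time, which addresses the fixed-seed question.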

It's slower than plain random.randint() because we're doing much more under the hood which is useful in testing. If your data is simple that's probably a poor tradeoff; if it's complex then the convenient API probably wins out and the performance gap will be smaller.

Finally, I'll note that Hypothesis' data is drawn from a really weird distribution, full of edge cases and weird correlations. That's great for finding bugs, but may or may not be what you want here - if not, I've heard good things about the mimesis library for non-testing use cases (but I haven't used it myself). I hope that helps!

@Zac-HD Zac-HD closed this as completed Nov 15, 2023
@karlicoss
Author

Thanks for such a quick response, this helps!

@tybug
Member

tybug commented Nov 15, 2023

Just to answer a concrete question: .example() in your case is slow because it generates and caches 100 examples ahead of time, not just one:

@settings(
    database=None,
    max_examples=100,
    deadline=None,
    verbosity=Verbosity.quiet,
    phases=(Phase.generate,),
    suppress_health_check=list(HealthCheck),
)
def example_generating_inner_function(ex):
    self.__examples.append(ex)

example_generating_inner_function()
shuffle(self.__examples)
return self.__examples.pop()
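A consequence of the caching behavior described above is that only the first .example() call on a given strategy pays the full generation cost; later calls pop from the pre-filled cache. A small sketch (the timing variable names are mine):

```python
# Illustrates the cache-fill cost of the first .example() call.
# Requires the `hypothesis` package; .example() also emits a warning
# reminding you it is meant for interactive exploration, not production.
import time
from hypothesis.strategies import integers

strategy = integers(min_value=0, max_value=2 ** 31)

t0 = time.perf_counter()
first = strategy.example()   # slow: fills the example cache up front
t_first = time.perf_counter() - t0

t0 = time.perf_counter()
second = strategy.example()  # fast: drawn from the existing cache
t_second = time.perf_counter() - t0
```

In practice t_second is typically far smaller than t_first, which is why a loop of repeated .example() calls amortizes much better than its first call suggests.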


3 participants