
Generating test data without using @given decorator #3790

Closed
karlicoss opened this issue Nov 15, 2023 · 3 comments
Labels
question not sure it's a bug? questions welcome

Comments

@karlicoss

What I want to achieve: I'm trying to use Hypothesis to generate large amounts of randomized test data -- not to run tests, just to use it in a standalone script.
I found that I can call a strategy's .example() method to generate data. I've intentionally simplified my use case; let's say we want to generate 1000 integers:

from hypothesis.strategies import lists, integers

TOTAL = 1000
minint = 0
maxint = 2 ** 31

gen = lists(integers(min_value=minint, max_value=maxint), min_size=TOTAL, max_size=TOTAL)
ints = gen.example()
assert len(ints) == TOTAL  # just to check

This works, but I have two issues:

  • It takes noticeable time to run (about 10 seconds).
    If I use custom code with random.Random.randint to generate 1000 integers, it completes instantly, as expected.
    If I use Hypothesis via @given, defining a test function, etc., it also runs instantly. So I don't really understand why there is such a performance difference.
  • I couldn't find a way to force it to use a fixed random seed (this makes sense in my case, since I'm interested in data generation rather than fuzzing/finding a minimal failing example).
    I tried using register_random, but it had no effect.
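For comparison, the plain-random baseline mentioned above might look like the sketch below (the make_ints helper and the seed value 42 are my own illustrative choices, not from the original post):

```python
import random

def make_ints(seed, n=1000, lo=0, hi=2 ** 31):
    """Generate n integers in [lo, hi] from a seeded, standalone RNG."""
    rng = random.Random(seed)  # fixed seed -> reproducible output
    return [rng.randint(lo, hi) for _ in range(n)]

ints = make_ints(42)
assert len(ints) == 1000
# The same seed always yields the same data:
assert make_ints(42) == ints
```

This runs effectively instantly, which is the performance gap the question is about.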

So the questions are:

  • Why is this bit of code so slow? I looked at the code, and it seems there could be some overhead from extra filtering etc. (even when none is defined, as in my case), but I wouldn't expect it to be that slow.
  • Is this a completely wrong way to use Hypothesis? It feels like it could be useful to benefit from the Hypothesis machinery for data generation without necessarily using decorators, etc.

Apologies if this isn't the best forum to ask -- I did read the docs and searched through the source code but couldn't really figure this out. Thanks!

@Zac-HD added the "question: not sure it's a bug? questions welcome" label on Nov 15, 2023
@Zac-HD
Member

Zac-HD commented Nov 15, 2023

@given() is the only way to draw data from strategies - the .example() method just wraps that up for you internally! Supporting meaningfully different interfaces just isn't technically feasible with our limited volunteer time 🙁

For determinism and number of examples, you'll want to use @settings(max_examples=..., derandomize=True).
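A minimal sketch of that suggestion, collecting the generated values into a list (the collected list and generate function are illustrative names of mine; the @settings parameters are from Hypothesis's public API):

```python
# Deterministic bulk data generation via @given + @settings, per the
# suggestion above. Requires the `hypothesis` package.
from hypothesis import HealthCheck, Phase, given, settings
from hypothesis.strategies import integers

collected = []

@settings(
    max_examples=1000,          # how many values to generate
    derandomize=True,           # deterministic data, no random seed needed
    database=None,
    deadline=None,
    phases=[Phase.generate],    # generation only, no shrinking
    suppress_health_check=list(HealthCheck),
)
@given(integers(min_value=0, max_value=2 ** 31))
def generate(x):
    collected.append(x)

generate()
# collected now holds the deterministically generated integers
```

Because derandomize=True fixes the data generation, re-running the script produces the same values every time, which addresses the fixed-seed question.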

It's slower than plain random.randint() because we're doing much more under the hood which is useful in testing. If your data is simple that's probably a poor tradeoff; if it's complex then the convenient API probably wins out and the performance gap will be smaller.

Finally, I'll note that Hypothesis' data is drawn from a really weird distribution, full of edge cases and weird correlations. That's great for finding bugs, but may or may not be what you want here - if not, I've heard good things about the mimesis library for non-testing use cases (but I haven't used it myself). I hope that helps!

@Zac-HD Zac-HD closed this as completed Nov 15, 2023
@karlicoss
Author

Thanks for such a quick response, this helps!

@tybug
Member

tybug commented Nov 15, 2023

Just to answer a concrete question: .example() in your case is slow because it generates and caches 100 examples ahead of time, not just one:

@settings(
    database=None,
    max_examples=100,
    deadline=None,
    verbosity=Verbosity.quiet,
    phases=(Phase.generate,),
    suppress_health_check=list(HealthCheck),
)
def example_generating_inner_function(ex):
    self.__examples.append(ex)

example_generating_inner_function()
shuffle(self.__examples)
return self.__examples.pop()
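A consequence of the caching behavior described above is that only the first .example() call on a given strategy pays the full generation cost; later calls pop from the pre-filled cache. A small sketch (the timing variable names are mine):

```python
# Illustrates the cache-fill cost of the first .example() call.
# Requires the `hypothesis` package; .example() also emits a warning
# reminding you it is meant for interactive exploration, not production.
import time
from hypothesis.strategies import integers

strategy = integers(min_value=0, max_value=2 ** 31)

t0 = time.perf_counter()
first = strategy.example()   # slow: fills the example cache up front
t_first = time.perf_counter() - t0

t0 = time.perf_counter()
second = strategy.example()  # fast: drawn from the existing cache
t_second = time.perf_counter() - t0
```

In practice t_second is typically far smaller than t_first, which is why a loop of repeated .example() calls amortizes much better than its first call suggests.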


3 participants