mutually_broadcastable_shapes much more constrained than expected #3170

Closed
asmeurer opened this issue Dec 1, 2021 · 18 comments
Labels: enhancement (it's not broken, but we want it to be better)

asmeurer (Contributor) commented Dec 1, 2021

I've been using the mutually_broadcastable_shapes strategy to generate some shapes. My test has been working well, but it appears the strategy is not really generating very interesting inputs by default. I only ended up noticing this by accident, because a line of code wasn't being covered. I am basically using mutually_broadcastable_shapes(num_shapes=n), where n is chosen between 1 and 32 (see #3151).

After playing with the strategy interactively, I found:

  • The documentation mentions a keyword argument shape but no such keyword argument exists. I'm assuming this is actually supposed to be base_shape.
  • The base_shape defaults to (), which "corresponds to a scalar and thus does not constrain broadcasting at all." However, this appears to constrain the shapes that are actually generated by the strategy by quite a bit.
  • The max_dims argument defaults to 2, if I understand the documentation correctly. Two dimensions is not very many for getting good tests of mutually broadcastable shapes.
  • The max_side argument defaults to 3.
  • The default behavior does not produce shapes where the broadcasting occurs against a size 1 dimension. That is, it can never generate shapes like (2, 2) and (2, 1). It only generates shapes that broadcast against missing dimensions.

For example

>>> from hypothesis.extra.numpy import mutually_broadcastable_shapes as mbs
>>> strat = mbs(num_shapes=3)
>>> for _ in range(10):
...     print(strat.example())
BroadcastableShapes(input_shapes=((), (), (3,)), result_shape=(3,))
BroadcastableShapes(input_shapes=((2,), (), ()), result_shape=(2,))
BroadcastableShapes(input_shapes=((), (), ()), result_shape=())
BroadcastableShapes(input_shapes=((3,), (2, 3), (3,)), result_shape=(2, 3))
BroadcastableShapes(input_shapes=((3,), (3,), ()), result_shape=(3,))
BroadcastableShapes(input_shapes=((1,), (1,), ()), result_shape=(1,))
BroadcastableShapes(input_shapes=((), (), ()), result_shape=())
BroadcastableShapes(input_shapes=((1,), (1,), ()), result_shape=(1,))
BroadcastableShapes(input_shapes=((3, 2), (), ()), result_shape=(3, 2))
BroadcastableShapes(input_shapes=((1,), (), ()), result_shape=(1,))

It would seem that to get good behavior from this strategy, you must first generate a base shape and then run the strategy against that. But 1) this is not really documented at all, 2) it seems pointless, dangerous in fact, to have the default behavior generate almost useless outputs, since I expect most users won't notice this as I have, and 3) I'm not sure, but couldn't chaining strategies like "generate base_shape" -> mutually_broadcastable_shapes(base_shape=base_shape) produce the same sorts of problems I ran into in #3151?
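
For concreteness, a minimal sketch of that chaining approach, using flatmap; the bounds and the name chained_shapes here are placeholders, not a recommended recipe:

from hypothesis.extra.numpy import array_shapes, mutually_broadcastable_shapes

# Draw a base shape first, then ask for shapes that broadcast against it.
chained_shapes = array_shapes(max_dims=4, max_side=5).flatmap(
    lambda base: mutually_broadcastable_shapes(num_shapes=3, base_shape=base)
)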

To me, just having mutually_broadcastable_shapes() produce reasonable but sufficiently complex shapes by default would be much better. I already expect to have to filter the strategy based on prod(result_shape), which I generally expect to do for any shape-generating strategy whenever I create an actual array of the given shape.

CC @honno

honno (Member) commented Dec 1, 2021

@Zac-HD Is the documented defaulting of None args considered public API? e.g.

max_dims is the largest length that the generated shape can possess, defaulting to max(len(shape), min_dims) + 2.

No idea yet on how to improve mutually_broadcastable_shapes, but I imagine that would be a good start... personally, I think documented defaulting is public API, though.

Zac-HD (Member) commented Dec 2, 2021

  • The documentation mentions a keyword argument shape but no such keyword argument exists. I'm assuming this is actually supposed to be base_shape.

Thanks for pointing this out! git blame confirms that this was a stray dot-point from way back and I've now removed it.

  • The max_dims argument defaults to 2, if I understand the documentation correctly. 2 dimensions is not very many for getting good tests for mutually broadcastable shapes.
  • The max_side argument defaults to 3.

These points are because mutually_broadcastable_shapes() is often used to choose shapes for generated arrays - in which case slightly larger sides or numbers of dimensions get very expensive. If the defaults don't suit your use-case though, you can just pass your preferred arguments!

  • The base_shape defaults to (), which "corresponds to a scalar and thus does not constrain broadcasting at all." However, this appears to constrain the shapes that are actually generated by the strategy by quite a bit.
  • The default behavior does not produce shapes where the broadcasting occurs against a size 1 dimension. That is, it can never generate shapes like (2, 2) and (2, 1). It only generates shapes that broadcast against missing dimensions. ...

This is a bug that we should fix. Thanks again for reporting!

I'd love to get your feedback on the diversity of the examples it generates again once the fix is merged; I think this might address most of your issues, but I will re-engage on the remaining points once you confirm them.

I already expect to have to filter the strategy based on the prod(result_shape), which I generally already expect to do for any shape generating strategy in any instance where I create an actual array of the given shape.

Do you mean that you need all-nonzero side lengths, in which case passing min_side=1 (per default) is faster, or that you want to broadcast to an exact target shape? The latter is going to be quite expensive and you might need to write a custom strategy to get good performance and example-diversity.


Is the documented defaulting of None args considered public API? personally I think [it is]

Yes, it is - though it's hard to imagine a non-backwards-compatible change here which would still fit our API style guide. Separately I'm reluctant to change it - @rsokl and I put a lot of work into the design and tuning of this strategy.

rsokl (Contributor) commented Dec 2, 2021

The default behavior does not produce shapes where the broadcasting occurs against a size 1 dimension. That is, it can never generate shapes like (2, 2) (2, 1). It only generates shapes that broadcast against missing dimensions.

Thanks for bringing this to our attention. I just confirmed that this is a bug in the strategy, and I have a fix on the way.

rsokl (Contributor) commented Dec 2, 2021

@asmeurer the patch has been merged. You should now find that there are no constraints on where singletons can manifest in the drawn values.

rsokl (Contributor) commented Dec 9, 2021

I am going to go ahead and close this. The fix in #3172 should have remedied the fundamental issues here (thank you for bringing them to our attention!). The other points raised can all be remedied by simply modifying the default values that are provided to the strategy.

asmeurer (Contributor, Author) commented:

Sorry for being slow to get back on this.

I just tested the latest release, and I think this should stay open. Indeed the bug with not generating shapes that broadcast against a size-1 dimension has been fixed, but the more general issue of mutually_broadcastable_shapes not generating very interesting examples by default has not.

I find it very surprising when the default behavior for this strategy doesn't produce "adversarial" draws. When I use floats(), for instance, I know it will always tell me if some strange floating point value that I failed to consider will break my code. If my code doesn't actually work with nan or whatever, I have to tell it to not draw nan. If I instead had to remember to tell it to draw nan, I probably would never remember that, and the strategy would be useless for cases where I actually want to find bugs involving nans. Similarly if I do lists(integers()) with no arguments, I expect it to generate large integers (within reason) and large lists (again, within reason). The default behavior of mutually_broadcastable_shapes would be like if lists(integers()) only generated lists up to length 2 with integers from 0 to 3.

The default values of the parameters may make sense when combined with other parameters provided explicitly. But for the case when no parameters are provided (which I would expect to be the most common case), they don't make sense. Obviously the dimension sizes and number of dimensions should be limited so that if an actual array is created with the given shape it will fit in memory, but things could be increased many orders of magnitude before running into that limitation.

As another aside, why does it not generate size 0 dimensions by default? This is another corner case situation I would expect to have to turn off, not on.

I already expect to have to filter the strategy based on the prod(result_shape), which I generally already expect to do for any shape generating strategy in any instance where I create an actual array of the given shape.

Do you mean that you need all-nonzero side lengths, in which case passing min_side=1 (per default) is faster, or that you want to broadcast to an exact target shape? The latter is going to be quite expensive and you might need to write a custom strategy to get good performance and example-diversity.

Either I'm not understanding you or you're not understanding me. prod(shape) is the number of elements of the array. It limits the size of the array. I always expect to have to do filter(lambda shape: prod(shape) < MAX_ARRAY_SIZE) for some MAX_ARRAY_SIZE, because otherwise hypothesis might generate a shape that's a terabyte in memory (I also have to do it because of numpy/numpy#15753, but never mind that).
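
Concretely, the filter I have in mind looks something like this sketch; MAX_ARRAY_SIZE and the bounds here are arbitrary placeholders:

from math import prod
from hypothesis.extra.numpy import mutually_broadcastable_shapes

MAX_ARRAY_SIZE = 10_000  # placeholder cap on elements in the broadcast result

# Discard draws whose broadcast result would be too big to instantiate cheaply.
capped = mutually_broadcastable_shapes(num_shapes=3, max_dims=6, max_side=20).filter(
    lambda bs: prod(bs.result_shape) < MAX_ARRAY_SIZE
)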

To be sure, it would be great if I could pass that as a parameter to a strategy rather than filtering, but that in itself isn't a huge deal. Again, filtering when hypothesis goes too far is fine. But the converse, hypothesis not going far enough in the first place, leaves me very worried that I'm not actually testing as much as I think I am.

honno (Member) commented Dec 13, 2021

Small note on

As another aside, why does it not generate size 0 dimensions by default? This is another corner case situation I would expect to have to turn off, not on.

What are the use cases of 0-sized dims? I've asked this before and I believe SciPy uses them, but outside the core ecosystem I'm not sure whether they're useful or just annoying, i.e. we'd be finding "bugs" in array-consuming functions where, 99% of the time, 0-sized dims would be erroneous input anyway and users would always have to write min_side=1.

I think that with, say, st.floats() generating NaNs and infinities by default, those inputs are 1) much more interesting to generate even if ultimately disabled, and 2) much more prevalent as inputs.

asmeurer (Contributor, Author) commented:

To be sure, it would be great if I could pass that as a parameter to a strategy rather than filtering, but that in itself isn't a huge deal.

Actually this is looking like it would be more useful than I had previously thought. It's hard for me to use the existing parameters to create a result shape that isn't too large. The problem is that you can either make the average side length larger, or add more dimensions, but you can't do both because the total size is roughly side_length**n_dimensions. Correct me if I'm wrong on this, but this seems like it would be much more tractable if done at the generation stage, because you could keep track of the size so far when constructing the shape and avoid adding dimensions that would exceed the total size (it might be a little more complicated than that to avoid introducing bias and to keep shrinking working).
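
A rough sketch of what I mean (this is not how Hypothesis actually does it; MAX_ELEMENTS and bounded_shape are hypothetical names):

from math import prod
from hypothesis import strategies as st

MAX_ELEMENTS = 10_000  # hypothetical cap on the total element count

@st.composite
def bounded_shape(draw, max_dims=8, max_side=64):
    # Build the shape one side at a time, tracking the running element count so
    # that many dimensions and large sides never combine to exceed the cap.
    shape = []
    for _ in range(draw(st.integers(0, max_dims))):
        budget = MAX_ELEMENTS // max(prod(shape), 1)
        shape.append(draw(st.integers(1, min(max_side, budget))))
    return tuple(shape)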

asmeurer (Contributor, Author) commented:

I think I've found another issue with the generation in mutually_broadcastable_shapes. If I have something like mutually_broadcastable_shapes(num_shapes=3, max_dims=10), it will generate result_shapes spanning the full range of dimensions up to the maximum. But if you instead do mutually_broadcastable_shapes(num_shapes=32, max_dims=10), it seems to only generate result_shapes with 10 dimensions, only occasionally generating a shape of dimension 0, 1, or 9. Consider, for instance, this:

from hypothesis.extra.numpy import mutually_broadcastable_shapes

strat = mutually_broadcastable_shapes(num_shapes=32, max_dims=10)
for _ in range(100):
    print(len(strat.example().result_shape))

Furthermore, the generation seems to be quite slow. The above code, which generates 100 examples, takes over 2 seconds to run.

I am generating 32 shapes for the reason described at #3151. However, given this issue, I may have to switch back to the old way of generating num_shapes first, and just dealing with the occasional bad shrinking.

Zac-HD (Member) commented Dec 14, 2021

Either I'm not understanding you or you're not understanding me. ... I always expect to have to do filter(lambda shape: prod(shape) < MAX_ARRAY_SIZE)

Ah, yep, just a miscommunication. I thought you meant .filter(np.prod), which is equivalent to .filter(lambda shape: 0 not in shape), or maybe .filter(lambda shape: prod(shape) == EXACT_SIZE). Capping the element count makes perfect sense.

Zac-HD (Member) commented Dec 14, 2021

@honno has our reasoning for excluding side-length-zero arrays exactly right. While some super-generic code needs to handle such cases, this is so rare that we leave it off by default. A large part of that is that often the problems it reveals are upstream bugs which most users can't do much about.

Concretely, how much does skipping these problems matter for most users when even Numpy itself leaves bug reports like numpy/numpy#15753 and numpy/numpy#16410 unfixed for so long? inf or nan or floating-point overflow are enormously more common problems, and more likely to be fixable too.

Zac-HD (Member) commented Dec 14, 2021

To be sure, it would be great if I could pass that as a parameter to a strategy rather than filtering, but that in itself isn't a huge deal.

Actually this is looking like it would be more useful than I had previously thought. It's hard for me to use the existing parameters to create a result shape that isn't too large. The problem is that you can either make the average side length larger, or add more dimensions, but you can't do both because the total size is roughly side_length**n_dimensions. Correct me if I'm wrong on this, but this seems like it would be much more tractable if done at the generation stage, because you could keep track of the size so far when constructing the shape and avoid adding dimensions that would exceed the total size (it might be a little more complicated than that to avoid introducing bias and to keep shrinking working).

Yeah, we can't do it that way precisely because shrinking would become intractably difficult.

One idea that might work is to choose a side length of one with probability related to the minimum dimensionality of the array shape, to constrain the elements-in-expectation. You'd still need to filter, but it would have much better scaling properties - literally picking a random-dimensional subspace to fill out.
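
A hypothetical sketch of that idea; the weighting is illustrative only, not tuned behavior:

from hypothesis import strategies as st

@st.composite
def sparse_shape(draw, max_dims=10, max_side=10):
    ndims = draw(st.integers(0, max_dims))
    # Over-represent singleton sides in proportion to the dimensionality, so the
    # expected element count stays bounded as the number of dimensions grows.
    side = st.sampled_from([1] * ndims + list(range(1, max_side + 1)))
    return tuple(draw(side) for _ in range(ndims))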


Overall, I think your main issue here is simply that you have a very different workload to almost all users of Hypothesis' array support. I regret that this means our API doesn't help you as much as we both want it to, but I just don't see a way to improve your workflow without compromising that of many more users. If you get stuck again, I'd suggest recreating our strategies using @composite as your baseline, and then tuning them however suits you.

rsokl (Contributor) commented Dec 15, 2021

mutually_broadcastable_shapes not generating very interesting examples

@asmeurer I can assure you that, if mutually_broadcastable_shapes was typically used to test properties that solely depend on the shapes themselves, then we would have designed its defaults to be as broad/general as possible. Unlike strategies like floats, most people are using these shapes to feed array strategies. Thus we really want the default behavior for chaining these strategies to be reasonable (How many functions are out there that need 32 shapes -- that are mutually broadcastable -- to be tested?). The same consideration went into the defaults of the shapes strategy. Ultimately, the default behavior of mutually_broadcastable_shapes actually generates pretty interesting shapes! They exercise the most common things that can go wrong with broadcasting, and they are suitable for downstream array strategies. Please feel free to wrap the strategy and provide your own defaults for your personal use.

it seems to only generate result_shapes with 10 dimensions, only occasionally generating a shape of dimension 0, 1, or 9.

Looking at:

from hypothesis.extra.numpy import mutually_broadcastable_shapes

strat = mutually_broadcastable_shapes(num_shapes=32, max_dims=10)

for _ in range(10):
    print([len(x) for x in strat.example().input_shapes])

the diversity of shape-lengths generated is impressive!

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[9, 7, 3, 0, 4, 1, 5, 6, 2, 7, 3, 0, 3, 1, 1, 8, 6, 3, 10, 6, 10, 9, 9, 8, 5, 2, 9, 7, 3, 5, 3, 4]
[4, 0, 8, 5, 9, 5, 3, 1, 3, 6, 3, 5, 3, 1, 2, 2, 4, 8, 6, 6, 0, 6, 4, 2, 6, 3, 5, 5, 1, 3, 7, 9]
[4, 10, 0, 0, 10, 9, 7, 1, 9, 4, 3, 3, 1, 4, 5, 5, 3, 7, 7, 0, 7, 9, 0, 1, 3, 3, 3, 4, 1, 9, 9, 7]
[5, 2, 10, 10, 9, 2, 9, 0, 6, 0, 3, 10, 3, 5, 3, 3, 9, 3, 1, 7, 5, 10, 4, 6, 10, 5, 1, 9, 0, 5, 3, 4]
[5, 2, 4, 5, 7, 1, 6, 7, 3, 2, 3, 1, 0, 7, 3, 1, 7, 9, 7, 5, 9, 7, 2, 2, 0, 7, 10, 2, 1, 7, 7, 5]
[9, 8, 9, 1, 10, 3, 0, 1, 3, 4, 0, 9, 8, 5, 9, 7, 10, 0, 0, 4, 0, 8, 10, 0, 1, 2, 5, 1, 6, 7, 9, 4]
[6, 5, 7, 0, 6, 10, 9, 0, 10, 0, 5, 8, 1, 5, 7, 2, 10, 2, 10, 1, 4, 0, 0, 4, 8, 6, 6, 4, 8, 5, 10, 1]
[8, 8, 1, 4, 0, 8, 5, 9, 5, 10, 8, 1, 6, 0, 1, 5, 0, 6, 10, 2, 6, 6, 10, 6, 6, 9, 3, 3, 3, 10, 10, 6]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The length of the result shape reflects the maximum length within each set of 32 shapes, which leads to the extreme statistics. Consider randomly sub-selecting from the collection of 32 shapes to improve your chances of getting more diversity in the maximum shape length.
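
For instance, a hypothetical sketch of such sub-selection; the test body is only a placeholder:

from hypothesis import given, strategies as st
from hypothesis.extra.numpy import mutually_broadcastable_shapes

@given(
    bs=mutually_broadcastable_shapes(num_shapes=32, max_dims=10),
    data=st.data(),
)
def test_with_shape_subset(bs, data):
    # Draw the full set of 32 shapes, then use only a random subset of them.
    k = data.draw(st.integers(1, len(bs.input_shapes)))
    shapes = data.draw(st.permutations(bs.input_shapes))[:k]
    # ... exercise the code under test with `shapes` ...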

asmeurer (Contributor, Author) commented:

Yeah, we can't do it that way precisely because shrinking would become intractably difficult.

I'm obviously pretty ignorant of how these things work, but how does this affect shrinking, given that a shrunk example will almost always be well below the size limit? Or do you mean specifically shrinking in the cases where it is near that boundary?

@asmeurer I can assure you that, if mutually_broadcastable_shapes was typically used to test properties that solely depend on the shapes themselves, then we would have designed its defaults to be as broad/general as possible. Unlike strategies like floats, most people are using these shapes to feed array strategies. Thus we really want the default behavior for chaining these strategies to be reasonable (How many functions are out there that need 32 shapes -- that are mutually broadcastable -- to be tested?)

The 32 shapes thing that I'm using is irrelevant to my point here. That's why I used 3 in my examples. My point is about the size of the resulting shapes, which has nothing to do with how many shapes are generated (none of the defaults are dependent on num_shapes).

I'll reiterate what I said before: "Obviously the dimension sizes and number of dimensions should be limited so that if an actual array is created with the given shape it will fit in memory, but things could be increased many orders of magnitude before running into that limitation."

I fail to see how the current defaults are useful for the people they were designed for. Again, please look at the shapes that are generated. The biggest shape that can be generated is (3, 3). 9 elements. That's it. My own code had bugs that weren't being caught because the input shapes were only 2-dimensional. Only 2 dimensions with side length <= 3 is hardly what I would call "as broad/general as possible" or "pretty interesting".

By the way, in my use-case, I am generating arrays of the given shapes. Actually, I'm looping (in Python) over every element of the arrays in my test. So the size constraints you are talking about definitely apply to my own code.

Allowing a max size to be specified instead of the other parameters also seems to go along with "generating shapes that are useful for creating arrays", which we both seem to be on board with.

The length of the result shape reflects the maximum length within each set of 32 shapes, which leads to the extreme statistics. Consider randomly sub-selecting from the collection of 32 shapes to improve your chances of getting more diversity in the maximum shape length.

I guess that makes sense. I think I'll try generating a base shape and using it instead. I'm actually having a pretty hard time finding a middle ground of input parameters to mutually_broadcastable_shapes that generates interesting enough examples while not generating shapes that are too large or creating health check errors from too much filtering.

I also want to mention again the performance problem I noted in #3170 (comment) in case you didn't see it.

rsokl (Contributor) commented Dec 15, 2021

dimensions should be limited so that if an actual array is created with the given shape it will fit in memory, but things could be increased many orders of magnitude before running into that limitation

What we are worried about is that the subsequent arrays will be large and thus draws from the arrays() strategy will be slow. This would rear its head far before we hit memory footprint issues. Note that the default behavior of st.lists is to build lists up to length 6.

I also want to mention again the performance problem I noted in #3170 (comment) in case you didn't see it.

I don't think there is much for us to do here. Drawing up to 100*32*10 integers is a relatively heavy lift - keep in mind that Hypothesis has to generate and parse a byte string for each integer that it draws. It is not merely sampling from a random distribution. I don't see anywhere in the strategy's implementation where we can provide any meaningful speedup.

asmeurer (Contributor, Author) commented:

What we are worried about is that the subsequent arrays will be large and thus draws from the arrays() strategy will be slow. This would rear its head far before we hit memory footprint issues. Note that the default behavior of st.lists is to build lists up to length 6.

But even array_shapes() itself draws more interesting shapes by default than mutually_broadcastable_shapes() (I would say ndim=3 is still too small to catch many bugs, but it's better than 2).

Again, it seems the more appropriate fix here is to limit the final array size, so that arrays with large dimensionalities or large sides can be generated, just not both at the same time. Limiting both at the same time means you can't catch a bug that only triggers for larger dimensionality or a bug that only triggers for larger side length.

Worth mentioning also that I've never noticed any performance issues with using arrays() with much larger shapes than the array_shapes() defaults.

I don't think there is much for us to do here. Drawing up to 100*32*10 integers is a relatively heavy lift - keep in mind that Hypothesis has to generate and parse a byte string for each integer that it draws. It is not merely sampling from a random distribution. I don't see anywhere in the strategy's implementation where we can provide any meaningful speedup.

I think something else is going on here. It actually takes 2 seconds to draw just the first example (this is evident if you actually run my code). All subsequent examples are drawn instantly. Does hypothesis have some sort of precomputation?

Zac-HD (Member) commented Dec 16, 2021

Note that the default behavior of st.lists() is to build lists up to length 6.

@rsokl, nope, we can generate much longer lists than that; the average size before heuristics is min_size + 5, but the geometric tail is over-represented due to mutations and element-cloning.

Again, it seems the more appropriate fix here is to limit the final array size, so that arrays with large dimensionalities or large sides can be generated, just not both at the same time. Limiting both at the same time means you can't catch a bug that only triggers for larger dimensionality or a bug that only triggers for larger side length.

This is a good point, and I think making 1 disproportionately more common for higher-dimensional arrays would be a good way to implement it (plus a filter if you have a strict element cap).

That said, I think we're always going to be working with smaller arrays than you're talking about - Hypothesis is literally unable to generate more than 4K array elements, so much larger arrays are going to be very sparse indeed (we assign an average of sqrt(elements) non-fill values for large arrays). This is not as bad as it sounds, because bugs per minute of runtime is more important than bugs per example, and smaller examples tend to be proportionally faster but almost as likely to find bugs (aside from size-related bugs we probably can't trigger anyway).

It actually takes 2 seconds to draw the just first example (this is evident if you actually run my code). All subsequent examples are drawn instantly. Does hypothesis have some sort of precomputation?

If you want to time generation, use --hypothesis-show-statistics instead of s.example(); the latter runs a test internally and then pops off the list of generated values:

try:
    return self.__examples.pop()
except (AttributeError, IndexError):
    self.__examples: List[Ex] = []

from hypothesis.core import given

# Note: this function has a weird name because it might appear in
# tracebacks, and we want users to know that they can ignore it.
@given(self)
@settings(
    database=None,
    max_examples=100,
    deadline=None,
    verbosity=Verbosity.quiet,
    phases=(Phase.generate,),
    suppress_health_check=HealthCheck.all(),
)
def example_generating_inner_function(ex):
    self.__examples.append(ex)

example_generating_inner_function()
shuffle(self.__examples)
return self.__examples.pop()

rsokl (Contributor) commented Dec 16, 2021

@rsokl, nope, we can generate much longer lists than that; the average size before heuristics is min_size + 5

Oh whoops! I realized there was a mistake in the code I was using to probe this. You are right, a batch of 100 examples can often produce lists of length 25 (or higher).
