Clean-up and slightly optimise extra.array_api.ArrayStrategy
#3105
Conversation
Force-pushed from 58db14d to 5c95616
Very nice! It took me a while to be convinced that the edge cases all worked, but with the comments below done I'll be happy to merge such a perf improvement. Thanks again 🤩
```python
    f"of dtype {type(val)} to array of dtype {result.dtype}."
) from e
self.check_set_value(val, result[i], strategy)
self.check_hist = set()
```
This is pretty tricky!

- It's safe to share between multiple test cases if and only if we never add disallowed elements to the set. A comment to this effect would be very useful; it took me a while to convince myself it was OK.
- We have potentially unbounded memory usage, if e.g. you set `max_examples=10**9` and leave int64 arrays overnight. This can be cheaply mitigated by checking once per array (at the start of `do_draw()`) whether the set has more than say 100K elements, and clearing it if so.
I've added a comment which hopefully clarifies how `check_hist` works, hope that's okay.

Thanks for catching the potential memory problem: I've added a reset mechanism at the start of `do_draw()` to that effect:

```python
if len(self.check_hist) >= 100_000:
    self.check_hist = set()
```
This does hurt the performance gains. I tried tracking the cache size in a separate attribute, but `len()` on a set is fast enough that it's not worth the cognitive load, and using a list plus a size attribute was comparable in performance. So right now this PR seems to improve performance by 5-10% for typical uses of `xps.arrays()`.
I wrote test cases to cover both points... they are both admittedly a bit funky heh.
Hmm, I'd rather not have that drop. Try a range up to say 10M? I think 80MB overhead is probably acceptable if it's a nontrivial perf improvement for arrays!
If you're benchmarking it would also be interesting to see perf without any checks at all - in some cases we could manage up-front analysis proving that we don't need them.
Finally, sharing the `check_hist` between strategies would also improve the hit rate: perhaps a global `defaultdict(set)` keyed off the dtype, and assign to an instance variable in `__init__`? (and then `self.check_hist.clear()` instead of assigning a new set)
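That suggestion might look something like this minimal sketch; the names `_CHECK_HIST` and `ArrayStrategy` here are illustrative stand-ins, not the real Hypothesis internals:

```python
from collections import defaultdict

# Module-level cache shared by every strategy instance, keyed by dtype, so
# distinct strategies over the same dtype benefit from each other's draws.
_CHECK_HIST = defaultdict(set)

class ArrayStrategy:
    def __init__(self, dtype):
        self.dtype = dtype
        # Bind the shared per-dtype set to the instance; mutations here are
        # visible to every other strategy with the same dtype.
        self.check_hist = _CHECK_HIST[dtype]

    def reset_cache(self):
        # clear() in place, rather than rebinding a new set, so all
        # strategies sharing this set see the reset.
        self.check_hist.clear()
```

Clearing in place matters: `self.check_hist = set()` would silently detach one instance from the shared cache, while `clear()` keeps every instance pointing at the same set.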
> perhaps a global `defaultdict(set)` keyed off the dtype
Just noting there's a potential hashing problem here, although @asmeurer mentioned to me recently that it's probably safe to assume dtypes will be hashable. I could always explore using the dtype name (strings) too.
Co-authored-by: Zac Hatfield-Dodds <zac.hatfield.dodds@gmail.com>
For future reference, something like this can reach all the error messages:

```python
>>> xps.arrays(dtype=xp.uint8, shape=10, unique=True, elements=st.characters(), fill=st.nothing()).example()
InvalidArgument: Generated elements [H, , ..., 裖, ] from strategy <hypothesis.strategies._internal.strings.OneCharStringStrategy object at 0x7f4a514e6ee0> could not be converted to array of dtype uint8. Consider if elements of type <class 'str'> are compatible with uint8.
>>> xps.arrays(dtype=xp.uint8, shape=5, unique=True, elements=st.characters() | st.floats(), fill=st.nothing()).example()
InvalidArgument: Generated elements ['5', '@', 'g', 2.2250738585072014e-308, '𫌓'] from strategy one_of(characters(), floats()) could not be converted to array of dtype uint8. Consider if elements of types (<class 'float'>, <class 'str'>) are compatible with uint8.
>>> xps.arrays(dtype=xp.uint8, shape=10, elements=st.integers(500, 510)).example()
InvalidArgument: Generated array element 500 from strategy integers(500, 510) cannot be represented with dtype uint8. Array module numpy.array_api instead represents the element as 244. Consider using a more precise elements strategy, for example passing the width argument to floats().
>>> xps.arrays(dtype=xp.uint8, shape=10, fill=st.characters()).example()
InvalidArgument: Could not create full array of dtype=uint8 with fill value 'I'
>>> xps.arrays(dtype=xp.uint8, shape=10, fill=st.characters()).example()
InvalidArgument: Generated array element '0' from strategy <hypothesis.strategies._internal.strings.OneCharStringStrategy object at 0x7fc6e39dbd60> cannot be represented with dtype uint8. Array module numpy.array_api instead represents the element as 0. Consider using a more precise elements strategy, for example passing the width argument to floats().
>>> xps.arrays(dtype=xp.uint8, shape=10, elements=st.characters(), fill=st.just(10)).example()
InvalidArgument: Could not add generated array element '0' of type <class 'str'> to array of dtype uint8.
```
Force-pushed from 0b99c6e to da5a173
Are there any concerns that users might provide an elements strategy that produces unhashable values? E.g. this would no longer work, I believe:

```python
>>> hnp.arrays(
...     dtype="float",
...     shape=(4, 2),
...     elements=st.just(np.array(1)),
...     fill=st.nothing(),
... ).example()
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])
```

Perhaps it is worthwhile to have this optimization be skipped over gracefully in this sort of scenario (although I see that the same issue would occur when …). I doubt that this would occur very often, but I could imagine a strategy that serves up 0D arrays as elements, for example, being used to "feed" another strategy.
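One way to skip the optimization gracefully for unhashable values (a sketch with assumed names, not the actual PR code) is to treat a `TypeError` from the set lookup as "uncacheable" and fall back to the full check every time:

```python
# `seen` caches values that already passed the (expensive) full check;
# unhashable values such as 0d ndarrays simply bypass the cache.
def check_value(val, seen, full_check):
    try:
        if val in seen:
            return  # already verified, skip the expensive check
    except TypeError:
        full_check(val)  # unhashable: cannot cache, always check
        return
    full_check(val)
    seen.add(val)
```

Note that only values which pass `full_check` without raising are ever added to `seen`, preserving the invariant that makes sharing the set safe.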
Oh yeah, nice catch! So we could cache the casted builtin we find anyway, but that would mean using array equalities with builtins in the …

Regarding set behaviour, I do think you're onto something, but yeah, it requires rethinking how we check values. I'll have to sleep on both things anywho.

If folk want, I'm happy to close this and open a PR tomorrow with just the clean-up, w/ array-as-elements-strategy test(s).
High-level suggestion: let's try splitting into "ArrayStrategy which may need (expensive) checks" and "FastArrayStrategy which we know does not need any checks on element and fill strategies". How much faster is the latter? (Just make it unsafe for now; if it's +20% or better as I expect, I'll write or help with the element-strategy-inspection logic.)
Using this benchmark script with …

Results: …

I was using smaller benchmark loops and more precise strategies before (e.g. …). In terms of benchmarking, I have no idea how Hypothesis generally caches things; e.g. if Hypothesis is caching something, could I benchmark repeated loops of mock test cases without triggering such mechanisms? As we can expect, …
Is that with a shared, or a per-strategy cache? Because the latter would be much less effective with varied shapes/dtypes (desugars to separate underlying strategies!).
Per-strategy! Yep, your shared suggestion seems promising; I might not get round to an implementation this week, but will keep you posted when I do.
If that's per-strategy and we still get 2-5% gains, I'm pretty optimistic again! A quick sanity-check is that I'd expect the gains to increase with …

I have a fair backlog of other issues I've been meaning to work on, so I'm going to leave this one to you unless you ask me to step in and help with something.
I've implemented @Zac-HD's shared cache idea. I benchmarked by timing …

Also interestingly, here is my PR and current master with no checks. (Note …)

Of course there's some variation in … So unless our... hypothesis... that shared caching would really help is in fact wrong, I've probably messed something up. I did write a test case that the cache does in fact get shared in …

Note I kept the bounded performance checks when benchmarking …

Also next time I'll benchmark with just …
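For reference, the general shape of such a benchmark loop (a generic sketch; `draw_array` here is a trivial placeholder for whatever mock test-case draw is being timed, not Hypothesis code):

```python
from timeit import repeat

def draw_array():
    # Placeholder workload; substitute e.g. a single mock test-case draw.
    return [x * x for x in range(100)]

def bench(fn, number=1_000, repeats=5):
    # Taking min() over several repeats damps scheduler/allocator noise,
    # giving a per-call cost estimate in seconds.
    return min(repeat(fn, number=number, repeat=repeats)) / number

per_draw = bench(draw_array)
```

Comparing `per_draw` across branches (PR vs. master, checks vs. no checks) is then a like-for-like comparison as long as the same workload and loop counts are used.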
Force-pushed from 0c496ee to 06e2593
```python
def do_draw(self, data):
    if 0 in self.shape:
        return self.xp.zeros(self.shape, dtype=self.dtype)

    # We reset check_hist when it reaches an arbitrarily large size to
    # prevent unbounded memory usage.
    if len(self.check_hist[self.dtype]) >= 100_000:
```
Have you considered resetting the set at 157284 items? A python set reallocates for new elements following the pattern 0 --> 4 --> 18 --> 76 --> 306 --> 1228 --> 4914 --> 19660 --> 78642 --> 157285 --> 314572 --> ..., where once the set reaches 0, 4, 18, etc items, it resizes the set behind the scenes to prepare for using more memory. By resetting right before the allocation at 157285 items, it allows the cache to grow larger without allowing it to resize and prepare space for another 150k more items.
You could also reset at 78641 of course; I haven't tested the speed either way. Unless it's set to 75_000, 100_000, 150_000, or some other nice-looking number, it'd be useful to include in the comments why you chose the number you did. I found out these numbers with some basic brute-force tests, as I can't be bothered to put effort into understanding the C source code; if somebody has time, they can check out https://github.com/python/cpython/blob/f25f2e2e8c8e48490d22b0cdf67f575608701f6f/Objects/setobject.c#L177. Be warned though, it doesn't follow a clean doubling after 50k items, and I'm not sure why.
`75_000` would be a good reset point then; it's > `2**16` and below the realloc threshold. I'd be very surprised if this made a measurable difference to performance, though.
Performance-wise, it won't make a huge difference, at most a few hundredths of a second every time it clears. Memory-wise though, it prevents python from allocating a bunch more space than it needs to, which can make a difference of 4 MB of RAM for each such set that is prevented from resizing larger (a set with 150k or so elements takes up 8MB of space, not including the elements in it, while a set with 75k takes up 4MB).
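The resize behaviour described above is easy to observe empirically; this sketch records the element counts at which CPython grows a set's backing table (exact thresholds and byte counts vary across CPython versions, so only the stepwise growth itself should be relied on):

```python
import sys

resize_points = {}  # element count -> new getsizeof() result in bytes
s = set()
last = sys.getsizeof(s)
for i in range(200_000):
    s.add(i)
    now = sys.getsizeof(s)
    if now != last:  # the backing table just grew
        resize_points[i + 1] = now
        last = now
```

On a typical CPython build this shows a handful of discrete jumps roughly matching the 4 → 18 → 76 → ... pattern quoted above, with a set near 150k elements occupying several MB before its contents are even counted.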
Hmm, if nochecks is +16% and shared-cache is +15.5% then we seem to be very close to the maximum possible gain from caching (and in particular it doesn't seem worth adding a no-cache special case).
There's a cache on …

Overall: I'd be happy to merge a fairly simple caching strategy for any >= 10% speedup.
Benchmarking …, I also tried putting the …

I'm confused because …
When I try to run the benchmark scripts with numpy 1.21.2, I get this error:

```python
from numpy import array_api as xp
ImportError: cannot import name 'array_api' from 'numpy'
```

What version of numpy are you using?

Edit: replacing that with …
@Theelx Ah, the Array API isn't in a release yet; you'll have to build it yourself, or get the nightly build: …

Note …
I replaced my use of … I'll work out a benchmark today/tomorrow.
Good news: I'm pretty happy with this benchmarking script now. It basically mocks running …

Bad news: caching might not be worth it. Same goes for doing away with checks completely.
@Zac-HD Anything else you'd like me to explore? I could see about benchmarking high-collision scenarios, although as I said before I don't know if they're used much. If you'd like to ditch this for now, I'll be sure to benchmark a no-checks scenario again when other array libraries get compliant enough for …

```shell
$ python -c "from numpy import array_api as xp; from timeit import timeit; print(timeit(lambda: int(xp.asarray(42, dtype=xp.uint8))))"
3.403678069000307
$ python -c "import torch; from timeit import timeit; print(timeit(lambda: int(torch.as_tensor(42, dtype=torch.uint8))))"
3.1453401199978543
```

Hopefully you still want the clean-up part of this PR; if so I can drop the caching stuff here, or start a new one with just that. I'll save what I've done here anyway, but it might be a good idea to keep this PR closed as-is, as especially the tests could be interesting for the future. I also need to explore element strategies generating 0d arrays.
I think our conclusions are that caching doesn't give enough of a speedup to justify the code complexity, and a no-checks path probably doesn't either (though I might experiment with this later). On that basis, let's close this PR to leave it as a linkable reference. I'll be happy to take a PR inlining …
Following up from #3065 where I mentioned cleaning some internal logic up (no public API changes). Personally I find it easier to follow now, although if folk disagree then that'd be good to know!
I also thought to not repeat `check_set_value()`'s logic on values that have already been seen, as I think we can be pretty assured that values will be deterministically promoted once assigned into an array. With `numpy.array_api` this gives a ~15% performance boost locally when I compared and averaged the results of typical `xps.arrays()` runs using this scrappy script... note I'm generally new to benchmarking and Hypothesis is obviously a tricky beast here, so I might need to revisit this.
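The idea reduces to memoising the per-value check; a minimal sketch with assumed names and a fake check standing in for the real dtype round-trip logic, not the actual implementation:

```python
class ArrayStrategySketch:
    def __init__(self):
        self.check_hist = set()  # values already accepted by the full check

    def check_set_value(self, val):
        # Stand-in for the real "does this value survive assignment into
        # an array of this dtype?" check; here we just insist on ints.
        if not isinstance(val, int):
            raise TypeError(f"cannot represent {val!r}")

    def draw_element(self, val):
        if val not in self.check_hist:
            self.check_set_value(val)  # pay the cost only once per value
            self.check_hist.add(val)
        return val
```

Because values are added to `check_hist` only after the check succeeds, the set never contains disallowed elements, which is the invariant that makes reusing it across test cases safe.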