
Clean-up and slightly optimise extra.array_api.ArrayStrategy #3105

Closed
wants to merge 9 commits

Conversation


@honno honno commented Sep 21, 2021

Following up from #3065 where I mentioned cleaning some internal logic up (no public API changes). Personally I find it easier to follow now, although if folk disagree then that'd be good to know!

I also decided not to repeat check_set_value()'s logic on values that have already been seen, since I think we can be fairly confident that values are deterministically promoted once assigned into an array. With numpy.array_api this gives a ~15% performance boost locally when I compared and averaged the results of typical xps.arrays() runs using this scrappy script... note I'm fairly new to benchmarking and Hypothesis is obviously a tricky beast here, so I might need to revisit this.


@Zac-HD Zac-HD left a comment


Very nice! It took me a while to be convinced that the edge cases all worked, but with the comments below done I'll be happy to merge such a perf improvement. Thanks again 🤩

f"of dtype {type(val)} to array of dtype {result.dtype}."
) from e
self.check_set_value(val, result[i], strategy)
self.check_hist = set()
Member

This is pretty tricky!

  • It's safe to share between multiple test cases if and only if we never add disallowed elements to the set.
    A comment to this effect would be very useful; it took me a while to convince myself it was OK.
  • We have potentially unbounded memory usage, if e.g. you set max_examples=10**9 and leave int64 arrays overnight. This can be cheaply mitigated by checking once per array (at the start of do_draw) whether the set has more than say 100K elements, and clearing it if so.

Member Author

I've made a comment which hopefully clarifies how check_hist works, hope that's okay.

Thanks for catching the potential memory problems—I've added a reset mechanism at the start of do_draw() to that effect.

if len(self.check_hist) >= 100_000:
    self.check_hist = set()

This does hurt the performance gains. I tried tracking the cache size in a separate attribute, but len() on sets is so fast that it isn't worth the cognitive load, and using a list plus a size attribute performed comparably anyway. So right now this PR seems to improve performance by 5%-10% for typical uses of xps.arrays().

I wrote test cases to cover both points... they are both admittedly a bit funky heh.

Member

Hmm, I'd rather not have that drop. Try a range up to say 10M? I think 80MB overhead is probably acceptable if it's a nontrivial perf improvement for arrays!

If you're benchmarking it would also be interesting to see perf without any checks at all - in some cases we could manage up-front analysis proving that we don't need them.

Finally, sharing the check_hist between strategies would also improve the hit rate: perhaps a global defaultdict(set) keyed off the dtype, assigned to an instance variable in __init__? (and then self.check_hist.clear() instead of assigning a new set)
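The shared-cache idea could be sketched roughly as follows. All names here (_CHECK_HIST, ArrayStrategySketch, the method names) are illustrative, not Hypothesis internals:

```python
from collections import defaultdict

# Module-level cache shared by every strategy instance, keyed by dtype.
_CHECK_HIST = defaultdict(set)

class ArrayStrategySketch:
    def __init__(self, dtype):
        self.dtype = dtype
        # Instances constructed with equal dtypes alias the same set.
        self.check_hist = _CHECK_HIST[dtype]

    def needs_check(self, val):
        return val not in self.check_hist

    def record_checked(self, val):
        self.check_hist.add(val)
```

Two strategies built for the same dtype then share cache hits, while different dtypes stay isolated. This assumes dtype objects are hashable (see the caveat raised below).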

Member Author

@honno honno Sep 23, 2021

perhaps a global defaultdict(set) keyed off the dtype

Just noting there's a potential hashing problem here, although @asmeurer mentioned to me recently that it's probably safe to assume dtypes will be hashable. I could always explore using the dtype name (strings) too.

hypothesis-python/src/hypothesis/extra/array_api.py (review thread resolved, outdated)
hypothesis-python/tests/array_api/test_arrays.py (review thread resolved, outdated)

honno commented Sep 22, 2021

For future reference, something like this can reach all the error messages.
>>> xps.arrays(dtype=xp.uint8, shape=10, unique=True, elements=st.characters(), fill=st.nothing()).example()
InvalidArgument: Generated elements [H, , ..., , 򘶍] from strategy <hypothesis.strategies._internal.strings.OneCharStringStrategy object at 0x7f4a514e6ee0> could not be converted to array of dtype uint8. Consider if elements of type <class 'str'> are compatible with uint8.
>>> xps.arrays(dtype=xp.uint8, shape=5, unique=True, elements=st.characters() | st.floats(), fill=st.nothing()).example()
InvalidArgument: Generated elements ['5', '@', 'g', 2.2250738585072014e-308, '𫌓'] from strategy one_of(characters(), floats()) could not be converted to array of dtype uint8. Consider if elements of types (<class 'float'>, <class 'str'>) are compatible with uint8.
>>> xps.arrays(dtype=xp.uint8, shape=10, elements=st.integers(500, 510)).example()
InvalidArgument: Generated array element 500 from strategy integers(500, 510) cannot be represented with dtype uint8. Array module numpy.array_api instead represents the element as 244. Consider using a more precise elements strategy, for example passing the width argument to floats().
>>> xps.arrays(dtype=xp.uint8, shape=10, fill=st.characters()).example()
InvalidArgument: Could not create full array of dtype=uint8 with fill value 'I'
>>> xps.arrays(dtype=xp.uint8, shape=10, fill=st.characters()).example()
InvalidArgument: Generated array element '0' from strategy <hypothesis.strategies._internal.strings.OneCharStringStrategy object at 0x7fc6e39dbd60> cannot be represented with dtype uint8. Array module numpy.array_api instead represents the element as 0. Consider using a more precise elements strategy, for example passing the width argument to floats().
>>> xps.arrays(dtype=xp.uint8, shape=10, elements=st.characters(), fill=st.just(10)).example()
InvalidArgument: Could not add generated array element '0' of type <class 'str'> to array of dtype uint8.

@honno honno requested a review from Zac-HD September 22, 2021 17:38

rsokl commented Sep 22, 2021

Are there any concerns that users might provide an elements strategy that produces unhashable values?

E.g. this would no longer work, I believe:

>>> hnp.arrays(
...     dtype="float",
...     shape=(4, 2),
...     elements=st.just(np.array(1)),
...     fill=st.nothing(),
... ).example()
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

Perhaps it is worthwhile to have this optimization be skipped over gracefully in this sort of scenario (although I see that the same issue would occur when unique=True).

I doubt that this would occur very often, but I could imagine a strategy that serves up 0D arrays as elements, for example, being used to "feed" another strategy.
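The failure mode here is easy to reproduce without NumPy: membership tests against a set hash the candidate value, so any unhashable element (a 0d ndarray, for instance) raises TypeError. A sketch of skipping the cache gracefully in that case — the names and fallback behaviour are assumptions, not the PR's actual code:

```python
class UnhashableElement:
    # Stand-in for a 0d ndarray, which is likewise unhashable.
    __hash__ = None

def seen_before(val, cache):
    """Cache lookup that degrades gracefully for unhashable values."""
    try:
        return val in cache
    except TypeError:
        # Unhashable: bypass the cache and fall back to a full check.
        return False

cache = {1.0, 2.5}
```

With this guard, hashable values still benefit from the cache while unhashable ones simply take the slow path every time.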


honno commented Sep 22, 2021

Are there any concerns that users might provide an elements strategy that produces unhashable values?

Oh yeah, nice catch! We could cache the cast builtin we find anyway, but that would mean using array equalities with builtins in the if val in self.check_hist: line... which would defeat the purpose of avoiding expensive checks!

Regarding set behaviour, I do think you're onto something, but yeah, it requires rethinking how we check values. I'll have to sleep on both things anyhow.

  • Write a test case for 0d arrays as the element strategy
  • Figure out how 1d+ arrays should work, if at all (see what extra.numpy does)

If folk want, I'm happy to close this and open a PR tomorrow with just the clean-up, w/ array-as-elements-strategy test(s).


Zac-HD commented Sep 23, 2021

High-level suggestion: let's try splitting into "ArrayStrategy which may need (expensive) checks", and "FastArrayStrategy which we know does not need any checks on element and fill strategies". How much faster is the latter?

(just make it unsafe for now; if it's +20% or better as I expect I'll write or help with the element-strategy-inspection logic)


honno commented Sep 23, 2021

Using this benchmark script with -n 50, I get a ~6.5% performance gain with no checks at all (the "FastArrayStrategy" route, compared to master), and ~2.5% with the current bounded cached checks (~4.5% when unbounded).

Results
# master
normal          16.675665184000536
unique          24.683974191000743
no fill         22.35953766900002
no fill unique  26.520580290999533
total           90.23975733500083

# no checks / "FastArrayStrategy" 
normal          16.634187734999614
unique          23.857498895999925
no fill         19.61512034900079
no fill unique  24.16638457299996
total           84.27319155300029

# this PR with bounded memory check
normal          17.363103243000296
unique          25.210598065999875
no fill         20.39611107300061
no fill unique  25.09125650900023
total           88.06106889100101

# this PR without bounded memory check
normal          17.36311352499979
unique          24.74212677700052
no fill         20.376063377999344
no fill unique  23.49789137100015
total           85.9791950509998

I was using smaller benchmark loops and more precise strategies before (e.g. dtype=xp.uint8, shape=10 instead of dtype=xps.numeric_dtypes(), shape=xps.array_shapes()), which is interesting to note, but it does mean the performance gain isn't as significant as I initially thought 😅

In terms of benchmarking, I have no idea about how Hypothesis generally caches things, e.g. if Hypothesis is caching something, could I benchmark repeated loops of mock test cases without triggering such mechanisms?


As expected, check_hist caching helps high-collision scenarios like no fill and small element ranges, e.g. xps.from_dtype(xp.uint8). In practical terms, that seems like a relatively rare use case?


Zac-HD commented Sep 23, 2021

Is that with a shared, or a per-strategy cache? Because the latter would be much less effective with varied shapes/dtypes (desugars to separate underlying strategies!).


honno commented Sep 23, 2021

Is that with a shared, or a per-strategy cache?

Per-strategy! Yep your shared suggestion seems promising—might not get round to an implementation this week, will keep you posted when I do.


Zac-HD commented Sep 23, 2021

If that's per-strategy and we still get 2-5% gains, I'm pretty optimistic again! A quick sanity-check is that I'd expect the gains to increase with max_examples for a while (probably only for shared caches) as the hit-rate improves.

I have a fair backlog of other issues I've been meaning to work on, so I'm going to leave this one to you unless you ask me to step in and help with something.


honno commented Sep 26, 2021

I've implemented @Zac-HD's shared cache idea. I benchmarked by timing pytest tests/array_api/ (avg. of 3 runs). I removed the test_check_hist_* tests before benchmarking HEAD and HEAD~1.

commit                       gain    time
HEAD (shared cache)          +15.5%  25.27
HEAD~1 (per-strategy cache)  +15%    25.42
master (no cache)            +0%     29.89

Also interestingly, here is my PR and current master with no checks. (Note test_may_not_use_overflowing_integers expectedly fails, but it only takes ~0.08s overall in HEAD when I look at "slowest 20 durations".)

commit               gain    time
HEAD~1 w/ no checks  +16%    25.12
master w/ no checks  +14.4%  25.60

Of course there's some variation in pytest runs, so I wouldn't care about one or two percent differences in benchmarks.


So unless our... hypothesis... that shared caching would really help is in fact wrong, I've probably messed something up. I did write a test case checking that the cache does in fact get shared, test_check_hist_shared_between_instances, hmm. I'll have to sleep on it!

Note I kept the bounded memory check when benchmarking HEAD and HEAD~1. I'll have to see how much it hurts performance again, and experiment with improvements.

Also next time I'll benchmark with just test_arrays.py.


def do_draw(self, data):
    if 0 in self.shape:
        return self.xp.zeros(self.shape, dtype=self.dtype)

    # We reset check_hist when it reaches an arbitrarily large size to
    # prevent unbounded memory usage.
    if len(self.check_hist[self.dtype]) >= 100_000:

Have you considered resetting the set at 157284 items? A Python set reallocates for new elements following the pattern 0 --> 4 --> 18 --> 76 --> 306 --> 1228 --> 4914 --> 19660 --> 78642 --> 157285 --> 314572 --> ..., where once the set reaches 4, 18, 76, etc. items, it resizes behind the scenes to prepare for using more memory. By resetting right before the reallocation at 157285 items, the cache can grow as large as possible without resizing to make room for another ~150k items.


You could also reset at 78641 of course; I haven't tested the speed either way. Unless it's set to 75_000, 100_000, 150_000, or some other nice-looking number, it'd be useful to include in the comments why you chose the number you did. I found out these numbers with some basic brute-force tests, as I can't be bothered to put effort into understanding the C source code, but if somebody has time they can check out https://github.com/python/cpython/blob/f25f2e2e8c8e48490d22b0cdf67f575608701f6f/Objects/setobject.c#L177. Be warned, it doesn't follow a clean doubling after 50k items, and I'm not sure why.

Member

75_000 would be a good reset point then; it's > 2**16 and below the realloc threshold. I'd be very surprised if this made a measurable difference to performance though.

@Theelx Theelx Sep 27, 2021

Performance-wise, it won't make a huge difference, at most a few hundredths of a second every time it clears. Memory-wise though, it prevents python from allocating a bunch more space than it needs to, which can make a difference of 4 MB of RAM for each such set that is prevented from resizing larger (a set with 150k or so elements takes up 8MB of space, not including the elements in it, while a set with 75k takes up 4MB).
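The memory figures above can be sanity-checked with sys.getsizeof, which reports the size of the set's hash table (excluding the elements themselves). The exact resize thresholds are CPython implementation details and vary between versions, so treat the specific numbers as approximate:

```python
import sys

def table_bytes(n):
    # Size of the set object's backing table for n small ints,
    # not counting the int objects themselves.
    return sys.getsizeof(set(range(n)))

# A set held at 75k elements keeps a much smaller table than one
# that has grown past the next reallocation threshold.
print(table_bytes(75_000), table_bytes(150_000))
```

On a typical CPython build the 150k-element table is roughly twice the size of the 75k one, matching the 4 MB vs 8 MB estimate above.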


Zac-HD commented Sep 27, 2021

Hmm, if no-checks is +16% and shared-cache is +15.5% then we seem to be very close to the maximum possible gain from caching (and in particular it doesn't seem worth adding a no-cache special case).

So unless our... hypothesis... that shared caching would really help is infact wrong, I've probably messed something up. I did write a test case that the cache does infact get shared in test_check_hist_shared_between_instances, hmm. I'll have to sleep on it!

There's a cache on arrays(), so if you construct with the same arguments you'll get the same underlying ArrayStrategy object! This probably explains why the non-shared cache didn't degrade much, though it still might if you had e.g. varied elements strategies.

Overall: I'd be happy to merge a fairly simple caching strategy for any >= 10% speedup.
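The same-arguments, same-object behaviour described above is ordinary memoisation. A toy version (not Hypothesis's actual @cacheable machinery) shows why a per-instance cache already gets reused across repeated arrays(...) calls with equal arguments:

```python
from functools import lru_cache

class StrategySketch:
    def __init__(self, dtype, shape):
        self.dtype, self.shape = dtype, shape
        self.check_hist = set()  # per-instance cache

@lru_cache(maxsize=None)
def arrays_sketch(dtype, shape):
    # Equal (hashable) arguments return the same cached strategy object.
    return StrategySketch(dtype, shape)

s1 = arrays_sketch("uint8", 10)
s2 = arrays_sketch("uint8", 10)
s3 = arrays_sketch("int64", 10)
```

Because s1 and s2 are literally the same object, their "per-instance" check_hist is shared too; only differing arguments (s3) produce a separate instance and cache.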


honno commented Sep 27, 2021

Benchmarking pytest tests/array_api/test_arrays.py (excluding test_check_hist_*, avg. of 5 runs) doesn't show as great a performance gain 😕

I also tried putting the check_hist variable at module level (so as not to interfere with the strategy caching stuff... right?), as opposed to it being an ArrayStrategy class attribute; that seems significantly better (I just couldn't figure out how to test it today).

version                     gain    time
HEAD w/ module-level cache  +5.6%   14.06
HEAD                        +1.5%   14.56
master                      +0.0%   14.80

I'm confused, because xps.arrays() is barely used outside of test_arrays.py... again, I need to revisit how I benchmark this.


Theelx commented Sep 27, 2021

When I try to run the benchmark scripts with numpy 1.21.2, I get this error:

from numpy import array_api as xp
ImportError: cannot import name 'array_api' from 'numpy'

What version of numpy are you using?

Edit: replacing that with import numpy as xp works fine.


honno commented Sep 27, 2021

What version of numpy are you using?

@Theelx Ah the Array API isn't in a release yet, you'll have to build it yourself, or get the nightly build:

pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy

Edit: replacing that with import numpy as xp works fine.

Note numpy is very similar to numpy.array_api, but there are tons of subtle differences. For the purposes of how it's used in Hypothesis, those differences don't matter (we in fact mock the Array API with just numpy proper; see mock_xp at the bottom of array_api.py).


honno commented Sep 28, 2021

I replaced my use of self.check_hist with ArrayStrategy.check_hist inside ArrayStrategy, which I believe allows value-checking caching to bypass the existing caching of strategy calls with identical arguments; I think test_check_hist_persists_between_instances correctly tests that this is the case.

I'll work out a benchmark today/tomorrow.
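A class attribute gives this sharing for free: every instance reads the same set, regardless of which cached strategy object it came from. A minimal illustration (the class name is hypothetical; the key point is mutating via the class, since assignment through self would shadow the shared set with an instance attribute):

```python
class ArrayStrategySketch:
    # One set shared by all instances of the class.
    check_hist = set()

    def mark_checked(self, val):
        # Mutate the class-level set; `self.check_hist = ...` would
        # instead create a per-instance attribute and break sharing.
        ArrayStrategySketch.check_hist.add(val)

a = ArrayStrategySketch()
b = ArrayStrategySketch()
a.mark_checked(42)
```

Any value marked through one instance is immediately visible from every other, which is exactly the persistence-between-instances property the test checks.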


honno commented Sep 28, 2021

Good news: I'm pretty happy with this benchmarking script now. It basically mocks running pytest test/array_api/test_arrays a few times... also I make sure to leave my computer alone while it's running haha.

Bad news: caching might not be worth it. The same goes for doing away with checks completely.

version                          gain    time
HEAD but no checks               +5.1%   13.52 (+0.08 for failing tests)
HEAD (ArrayStrategy.check_hist)  +2.0%   14.04
HEAD with no caching (control)   +0.0%   14.33
Raw
HEAD (ArrayStrategy.check_hist)
 14.40
 14.22
 13.73
 13.93
 13.85
 14.04
 14.08
 14.00
 14.09
 14.05
------
 14.04  (min=13.73, max=14.40, dev=0.18)

HEAD but no caching

 14.47
 14.00
 14.27
 14.47
 14.39
 14.31
 14.14
 14.72
 14.33
 14.24
------
 14.33  (min=14.00, max=14.72, dev=0.19)

HEAD but no set value checking
 13.90
 13.67
 13.63
 13.57
 13.42
 13.40
 13.38
 13.38
 13.42
 13.49
------
 13.52  (min=13.38, max=13.90, dev=0.16)

HEAD~1 but self.check_hist

 14.16
 13.93
 13.79
 14.04
 14.04
 13.98
 14.08
 14.06
 13.89
 14.04
------
 14.00  (min=13.79, max=14.16, dev=0.10)

HEAD but module-level check_hist
 14.18
 13.99
 13.89
 13.77
 13.94
 13.97
 14.09
 14.05
 14.07
 14.06
------
 14.00  (min=13.77, max=14.18, dev=0.11)

@Zac-HD Anything else you'd like me to explore? I could see about benchmarking high-collision scenarios, although as I said before I don't know if they're used much.

If you'd like to ditch this for now, I'll be sure to benchmark a no-checks scenario again when other array libraries get compliant enough for extra.array_api. That would be interesting on its own, and could lead to exploring caching. Not to be pre-emptive, but just to keep in mind that benchmarking builtin casting would be a good starting point:

$ python -c "from numpy import array_api as xp; from timeit import timeit; print(timeit(lambda: int(xp.asarray(42, dtype=xp.uint8))))"
3.403678069000307
$ python -c "import torch; from timeit import timeit; print(timeit(lambda: int(torch.as_tensor(42, dtype=torch.uint8))))"
3.1453401199978543

Hopefully you still want the clean-up part of this PR—if so, I can drop the caching stuff here, or start a new one with just that. I'll save what I've done here anyway, but it might be a good idea to keep this PR around closed as-is, as the tests especially could be interesting in future. I also need to explore element strategies generating 0d arrays.


Zac-HD commented Sep 29, 2021

I think our conclusions are that caching doesn't give enough of a speedup to justify the code complexity, and a no-checks path probably doesn't either (though I might experiment with this later).

On that basis, let's close this PR to leave it as a linkable reference. I'll be happy to take a PR inlining set_element() for the cleanup part at your convenience 😀

@Zac-HD Zac-HD closed this Sep 29, 2021
@honno honno deleted the optimise-array-strat branch February 28, 2024 11:10