Migrate our core representation to an IR layer #3921
Comments
This is super interesting!
Welcome, Jonathan! We'd love to have you continue contributing - I already really appreciate the type-annotation-improvements for our numpy and pandas extras, so this would be a third contribution 😻
@tybug might have some ideas here, but my impression is that the "refactor for an IR" project in this issue is more-or-less a serialized set of tasks and so adding a second person is unlikely to help much - even with just one we've had a few times where there were two or three PRs stacked up and accumulating merge conflicts between them.
As an alternative, #3764 should be a fairly self-contained bugfix. On the more ambitious side, #3914 would also benefit from ongoing work on that - testing, observability, reporting whatever bugs you surface, etc. Or of course you're welcome to work on any other open issue which appeals to you!
I was thinking that we'd still serialize to a bytestring - that's the ultimate interop format, and when we need to handle weird unicode and floats like subnormals or non-standard bitpatterns for |
yeah, this is a hard one to parallelize 😄. Some of the steps may subtly depend on others in ways that aren't obvious until one is knee deep in implementing it.
Nice! I agree with the reasoning here. Added a task for this. This probably needs to be the absolute last thing to switch to the ir.
Definitely the last thing to switch, I just got nerdsniped 😅
This comment was marked as resolved.
I'm working on migrating shrinker block programs. Our upweighting for large integer ranges is giving the shrinker trouble, because it means that a simpler tree can result in a longer buffer: the buffer runs through the weighted distribution and draws
Real example of this:
from hypothesis import strategies as st
from hypothesis.internal.conjecture.data import ConjectureData

b1 = b'\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00'
b2 = b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
s = st.lists(st.integers(0, 2**40))
print("complex result, smaller buffer", ConjectureData.for_buffer(b1).draw(s))
# complex result, smaller buffer [0, 0, 0, 0, 0]
print("simpler result, larger buffer", ConjectureData.for_buffer(b2).draw(s))
# simpler result, larger buffer [0, 0, 0, 0]
As a result I'd like to look at moving that weighting logic into |
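To make the ordering problem concrete, here is a minimal sketch that compares the two buffers from the example under shortlex order (shorter first, ties broken lexicographically). It assumes the shrinker prefers whichever buffer is shortlex-smaller, as the issue description suggests; `shortlex_key` is a hypothetical helper for illustration, not a Hypothesis function.

```python
def shortlex_key(buffer: bytes) -> tuple[int, bytes]:
    """Sort key for shortlex order: shorter buffers first, ties broken lexicographically."""
    return (len(buffer), buffer)


# The two buffers from the example above, copied verbatim.
b1 = b'\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00'
b2 = b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'

# b1 encodes the longer list but is the shorter buffer, so it sorts first.
ordered = sorted([b2, b1], key=shortlex_key)
print("shortlex-smallest is b1:", ordered[0] == b1)
```

Under this ordering the buffer that decodes to the more complex list wins, which is the mismatch described above.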
What if we forced even more instead? If we choose a smaller |
We could do that! I'm fairly confident exactly what you stated, or some small variation, would work. I was thinking of killing two birds with one stone here, though. Do you think the upweighting belongs in the ir or in |
I think doing it 'below' the IR, so we just represent a single integer value with a minimum of redundancy, is the principled approach here. "Literally just give me an integer" feels like it should be bijective 😅
The concern is that moving the weighting to
class IntegersStrategy(SearchStrategy):
    ...
    def do_draw(self, data):
        weights = None
        if self.end is not None and self.start is not None:
            bits = (self.end - self.start).bit_length()
            # For large ranges, we combine the uniform random distribution from draw_bits
            # with a weighting scheme with moderate chance. Cutoff at 2 ** 24 so that our
            # choice of unicode characters is uniform but the 32bit distribution is not.
            if bits > 24:
                def weighted():
                    # INT_SIZES = (8, 16, 32, 64, 128)
                    # INT_SIZES_SAMPLER = Sampler((4.0, 8.0, 1.0, 1.0, 0.5), observe=False)
                    total = 4.0 + 8.0 + 1.0 + 1.0 + 0.5
                    return (
                        (4.0 / total) * (-2**8, 2**8),
                        # ...except split these into two ranges to avoid double counting bits=8
                        (8.0 / total) * (-2**16, 2**16),
                        (1.0 / total) * (-2**32, 2**32),
                        (1.0 / total) * (-2**64, 2**64),
                        (0.5 / total) * (-2**128, 2**128),
                    )
                weights = (
                    (7 / 8) * weighted()
                    + (1 / 8) * uniform()
                )
            # for bounded integers, make the near-bounds more likely
            weights = (
                weights
                + (2 / 128) * self.start
                + (1 / 64) * self.end
                + (1 / 128) * (self.start + 1)
                + (1 / 128) * (self.end - 1)
            )
            # ... also renormalize weights to p=1, or have the ir do that
        return data.draw_integer(
            min_value=self.start, max_value=self.end, weights=weights
        )
Now the ir |
That would work! I'm also fine with the IR |
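For readers who want to experiment with the mixture in the pseudocode above, here is a rough runnable sketch. It assumes the 7/8 weighted versus 1/8 uniform split and the size weights quoted in the pseudocode comments. `draw_upweighted` is a hypothetical helper, and clamping the offset to the range is a simplification made for this sketch. It leaves out the near-bounds weighting and is not how Hypothesis implements the distribution.

```python
import random

# Bit sizes and their relative weights, as quoted in the pseudocode comments.
INT_SIZES = (8, 16, 32, 64, 128)
SIZE_WEIGHTS = (4.0, 8.0, 1.0, 1.0, 0.5)


def draw_upweighted(rng: random.Random, start: int, end: int) -> int:
    """Draw from [start, end], favouring small offsets from start for large ranges."""
    span = end - start
    span_bits = span.bit_length()
    # Only large ranges are upweighted, and only 7/8 of the time.
    if span_bits > 24 and rng.random() < 7 / 8:
        bits = min(span_bits, rng.choices(INT_SIZES, weights=SIZE_WEIGHTS)[0])
        offset = rng.getrandbits(bits)
        # Clamp so the result never exceeds the upper bound (sketch simplification).
        return start + min(offset, span)
    # Otherwise fall back to a plain uniform draw.
    return rng.randint(start, end)


rng = random.Random(0)
samples = [draw_upweighted(rng, 0, 2**40) for _ in range(2000)]
small = sum(s < 2**16 for s in samples)
print(f"{small} of {len(samples)} samples are below 2**16")
```

A uniform draw over `[0, 2**40]` would almost never land below `2**16`, so the count printed here shows how strongly the mixture favours small values.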
This epic-style issue tracks our work on refactoring Hypothesis to use an IR layer in our engine.
Motivation
So far, most things in Hypothesis have been built to work at the level of a bitstream.
DataTree, which tracks what inputs we have previously tried in order to avoid redundancy, works at the level of blocks: logically related continuous segments of bits, e.g. perhaps from the same strategy.
"" < "0" < "1" < "00" < "01" < "11") smallest bitstream which is still a counterexample.
However, in many cases, a bitstream is too low-level of a representation to make intelligent decisions.
DataTree sees these as distinct inputs and can't deduplicate them. Ever wondered why we try 0 so many times for @given(st.integers())? It's not because we want to!
In a completely unrelated train of thought, we would like Hypothesis to support backends: the ability to specify a custom distribution over strategies, overriding Hypothesis' pseudo-randomness. The original motivation here was supporting CrossHair (#3086), a concolic execution tool, but many other such backends are possible. (I personally have some ideas).
Happily, we can address both of these concerns with the same refactoring. That refactoring is migrating much of Hypothesis, which currently operates on bitstreams, to instead operate on an IR layer.
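As a toy illustration of the deduplication problem, the sketch below uses a made-up decoder in which several distinct buffers map to the same integer. `decode_int` and its modulo encoding are assumptions for illustration only, not Hypothesis's actual byte-to-integer mapping. It shows why tracking inputs at the buffer level sees more "distinct" inputs than tracking them at the value level.

```python
def decode_int(buffer: bytes, max_value: int = 9) -> int:
    """Toy decoder: map the first byte onto [0, max_value] by modulo.

    This is a stand-in for any non-injective bitstream-to-value mapping.
    """
    return buffer[0] % (max_value + 1)


buffers = [b"\x03", b"\x0d", b"\x17"]  # bytes 3, 13 and 23

# Tracking at the bitstream level: every buffer looks new.
seen_buffers = set(buffers)

# Tracking at the value (IR) level: all three decode to the same integer.
seen_values = {decode_int(b) for b in buffers}

print(len(seen_buffers), "distinct buffers,", len(seen_values), "distinct value")
# 3 distinct buffers, 1 distinct value
```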
The Plan
The IR will comprise five nodes:
draw_integer
draw_float
draw_boolean
draw_string
draw_bytes
All strategies will draw from these five functions at the base level, rather than from a bitstream. From this, we get better DataTree deduplication (the mapping for arbitrary strategies is still not guaranteed to be injective, but it's much closer!), more intelligent shrinking, and backend support.
To implement a backend, implement PrimitiveProvider and override each of these methods. That's it. Hypothesis will take care of the rest, including shrinking and database support.
The original IR design is described here: #3086 (comment), though some small interface details have since changed.
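To give a feel for the shape of a backend, here is a standalone sketch of a provider exposing the five draw functions. `ToyProvider`, its simplified signatures, and the `draw_int_list` helper are assumptions for illustration. The real `PrimitiveProvider` interface takes more parameters and should be checked against the Hypothesis source before writing an actual backend.

```python
import random


class ToyProvider:
    """Hypothetical backend answering the five IR draws from a seeded PRNG."""

    def __init__(self, seed: int = 0) -> None:
        self._random = random.Random(seed)

    def draw_integer(self, min_value: int, max_value: int) -> int:
        return self._random.randint(min_value, max_value)

    def draw_float(self, min_value: float, max_value: float) -> float:
        return self._random.uniform(min_value, max_value)

    def draw_boolean(self, p: float = 0.5) -> bool:
        return self._random.random() < p

    def draw_string(self, alphabet: str, min_size: int, max_size: int) -> str:
        size = self._random.randint(min_size, max_size)
        return "".join(self._random.choice(alphabet) for _ in range(size))

    def draw_bytes(self, size: int) -> bytes:
        return bytes(self._random.randrange(256) for _ in range(size))


def draw_int_list(provider: ToyProvider, max_len: int = 5) -> list[int]:
    """A strategy-like function built only from the primitive draws."""
    result = []
    # A boolean draw decides whether to continue; an integer draw yields each element.
    while len(result) < max_len and provider.draw_boolean(0.8):
        result.append(provider.draw_integer(0, 100))
    return result


provider = ToyProvider(seed=0)
print(draw_int_list(provider))
```

Because `draw_int_list` only ever talks to the five primitives, replacing `ToyProvider` with a different provider changes where values come from without touching the strategy code.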
Implementation
Completed:
float, integer, string, byte, and boolean drawing logic into PrimitiveProvider #3788
DataTree to the new IR #3818
crosshair backend #3806
Float shrinker to the ir #3899
Ongoing work, roughly in order of expected completion:
weights - see Migrate pass_to_descendant and redistribute_block_pairs shrinker passes #3929 (comment)
generate_mutations_from (and consider if the ir unlocks any improvements)
Optimiser (used by target())
explain phase), see Inquisitor sometimes fails to report arguments as freely varying #3864
ParetoOptimiser