Unreliable test timings when using data.draw with text strategy #2108
|
I think this is a real problem, similar to #911. I'm not entirely sure why it's happening, but I really appreciate your writeup - thanks! My guess is that some expensive setup work is happening on the first draw and being charged to the test. If you want to investigate, I'd start by reading through the draw-timing and deadline-handling code.
Because of the way strategies are evaluated lazily I think this shouldn't be the case - the load time should be part of the first draw time - but it's possible something is going wrong, or that it's a function of time measurement on computers being weird. In general I wonder if we should modify the deadline code to be a little more forgiving of flakiness, to cope with this sort of lazy loading of global resources in general - say, give the code some small bounded number of deadline-exceeding test runs which it will rerun to see if they're still deadline-exceeding, and just shrug and ignore them if they're not.
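A minimal sketch of that forgiving-deadline idea, assuming a hypothetical `run_once()` callable that executes one example and returns its runtime in milliseconds (these names are illustrative, not Hypothesis APIs):

```python
MAX_FORGIVEN_RUNS = 3  # small bounded number of deadline-exceeding reruns

def within_deadline(run_once, deadline_ms):
    """Rerun deadline-exceeding examples a few times; only report a
    failure if the example is reproducibly over the deadline."""
    for _ in range(MAX_FORGIVEN_RUNS + 1):
        if run_once() < deadline_ms:
            return True  # one-off slowness (e.g. lazy setup) is forgiven
    return False  # consistently over the deadline: a genuine failure
```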
|
@Zac-HD You're more than welcome. @DRMacIver I agree that the load time should be part of the first draw time - otherwise slow draws due to genuinely complex generation could lead to slow tests that don't trigger the hypothesis deadline. Being more forgiving does sound like a reasonable solution, though I'm not sure exactly what the limit of deadline-exceeding runs should be.
|
Adding a regression test to the hypothesis test suite has uncovered some more interesting behaviour 😄. When added to the existing suite, the test passes:

```python
# hypothesis/hypothesis-python/tests/cover/test_regressions.py
# ... existing tests

@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```

However, moving the test to its own file outside of the hypothesis test suite does fail:

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```

Furthermore, the same issue appears for `st.characters()`:

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())
```

However, if both tests are placed in the same file, only the first one fails:

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

# fails
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

# passes
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())
```

or

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

# fails
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())

# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```

Setting an increased deadline on only the first test causes both to pass:

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, settings, strategies as st

# passes
@settings(deadline=900)
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())

# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```

Only the very first usage of either strategy triggers the failure.
|
The slow calls are in the `charmap` module.
|
Yep, looks like the tests pass when run within the test suite because the charmap has already been initialised by the time they run.

Adding a call to `charmap()` at module level makes both tests pass outside the suite as well:

```python
# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st
from hypothesis.internal.charmap import charmap

charmap()

# passes
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())

# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```
|
It might be worth creating a dummy strategy that is deliberately slow on its first invocation to test with. You could do something like:

```python
import time

from hypothesis.strategies import composite, integers

slow_down_call = True

@composite
def flaky_timing_strategy(draw):
    global slow_down_call
    if slow_down_call:
        time.sleep(1)
        slow_down_call = False
    # dummy draw to make sure we draw some data and don't hit
    # the "this test doesn't actually draw anything" logic
    return draw(integers(0, 10))
```

Using this you can control the logic of whether it's slow independently from the text logic. This doesn't necessarily have the same behaviour as for text, but if it doesn't, that is in and of itself interesting information!
|
Ah yes, that's a good idea, thank you - I'll investigate this behaviour and whether it represents the behaviour of `text` 👍
|
Using a deliberately slow strategy doesn't trigger the error, as the time spent in the draw is correctly factored out in `ConjectureData.__draw`.

However, I think I've tracked the issue down to the lazy evaluation of the cached strategy when using `data.draw`:

- Strategies are wrapped in a `LazyStrategy`, so the underlying strategy isn't constructed until it is first used.
- The character and text strategies make use of `OneCharStringStrategy`, which calls `charmap.categories` upon initialisation.
- The first such call builds the charmap, as the global `_charmap` isn't initialised yet.
- Using `data.draw` means that this initialisation of the charmap takes place within the test function and is therefore included in the test duration.

The only way I can see around this is to either pre-initialise the charmap somehow before entering the test function / factor out the time taken to do so, or, as @DRMacIver suggested, be more forgiving with flaky test timings 🤔
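To see the one-off cost directly, a quick timing check along these lines can be used (timings are machine-dependent; `charmap` is the internal helper referenced above):

```python
import time

from hypothesis.internal.charmap import charmap

start = time.perf_counter()
charmap()  # first call builds (or loads) the Unicode category table
first_call = time.perf_counter() - start

start = time.perf_counter()
charmap()  # subsequent calls return the cached global _charmap
cached_call = time.perf_counter() - start

print("first: %.4fs, cached: %.6fs" % (first_call, cached_call))
```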
|
The thing that's surprising to me is that the lazy instantiation doesn't happen inside the first draw, where its cost would already be excluded from the timing.
|
Some more debugging shows that the interactive draw within the test function isn't being added to the excluded draw times. This test (hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/data.py, lines 842-843 @ ce3c734) isn't `True` when the draw from `characters` is evaluated within the test function. As such the draw time is not added to `ConjectureData.draw_times` (hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/data.py, lines 844-855 @ ce3c734) and is therefore not excluded from the test runtime measured in `execute`.

So I think that either interactive draw times need to be excluded as well, or the test for top-level draws is incorrect.
Ooh. Good catch. That's definitely a bug.
I think it's the former. It's correct that interactive draws are not at the top level, but interactive draws should definitely be excluded from the total time.
|
OK, excellent - I'll look into how the draw times can be excluded for interactive draws in addition to top-level draws 👍
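A rough sketch of that direction, independent of the actual Hypothesis internals (`DrawTimer` and its methods are illustrative names): time every draw, top-level or interactive, and let the deadline check subtract the total.

```python
import time

class DrawTimer:
    """Accumulate wall-clock time for every draw - top-level or
    interactive - so a deadline check can subtract all of it."""

    def __init__(self):
        self.draw_times = []

    def timed_draw(self, do_draw):
        # wrap any draw callable, regardless of where it is invoked
        start = time.perf_counter()
        try:
            return do_draw()
        finally:
            self.draw_times.append(time.perf_counter() - start)

    def excluded_seconds(self):
        return sum(self.draw_times)
```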
|
I had an epiphany this morning whilst making breakfast and I think there are two bugs here 😄

The first is the already discovered fact that the interactive draw times are not taken into account when excluding draw times from the test run time (#2108 (comment)).

The second is that because the draw time is calculated within `ConjectureData.__draw`, the time spent in its caller `ConjectureData.draw` is not included. For most strategies this is not a problem, as the time spent in `ConjectureData.draw` outside of `__draw` is negligible. For `text` and `characters`, however, the lazy strategy is evaluated inside `ConjectureData.draw` (e.g. when `strategy.is_empty` is checked), which initialises `charmap._charmap` and takes a significant amount of time. Since only the internal draw times (those measured by `ConjectureData.__draw`) are factored out of the overall test time, the `_charmap` initialisation is included - causing the deadline failure on the first run.
To validate this hypothesis (pun intended), I put in a quick and dirty hack to start the draw timer in `ConjectureData.draw` instead:

```python
# src/hypothesis/internal/conjecture/data.py
# ...
draw_start = None

class ConjectureData:
    # ...

    def draw(self, strategy, label=None):
        # ...
        # start timing the draw here, so that the lazy strategy evaluation
        # (and hence _charmap initialisation for characters/text) is included
        global draw_start
        draw_start = benchmark_time()
        if strategy.is_empty:
            self.mark_invalid()
        if self.depth >= MAX_DEPTH:
            self.mark_invalid()
        return self.__draw(strategy, label=label)

    def __draw(self, strategy, label):
        at_top_level = self.depth == 0
        if label is None:
            label = strategy.label
        self.start_example(label=label)
        try:
            if not True:  # force all draw times to be measured
                return strategy.do_draw(self)
            else:
                try:
                    strategy.validate()
                    try:
                        return strategy.do_draw(self)
                    finally:
                        # measure from the start of ConjectureData.draw,
                        # not from the start of __draw
                        self.draw_times.append(benchmark_time() - draw_start)
                except BaseException as e:
                    mark_for_escalation(e)
                    raise
        finally:
            self.stop_example()
```
```python
# src/hypothesis/core.py

class StateForActualGivenExecution(object):
    # ...
    def execute(self, data, is_final=False):  # other arguments elided
        # ...
        @proxies(self.test)
        def test(*args, **kwargs):
            self.__test_runtime = None
            initial_draws = len(data.draw_times)
            start = benchmark_time()
            result = self.test(*args, **kwargs)
            finish = benchmark_time()
            # force all draw times to be taken into account, not just the
            # 'internal' ones recorded after this point - not a general
            # solution
            internal_draw_time = sum(data.draw_times)
            runtime = datetime.timedelta(
                seconds=finish - start - internal_draw_time
            )
            self.__test_runtime = runtime
            current_deadline = self.settings.deadline
            if not is_final:
                current_deadline = (current_deadline // 4) * 5
            if runtime >= current_deadline:
                raise DeadlineExceeded(runtime, self.settings.deadline)
            return result
```

So I think the overall solution comes in two main parts:

1. Exclude interactive draw times from the test runtime, in addition to top-level draws.
2. Start the draw timer early enough (or otherwise factor out the lazy strategy evaluation) so that the `_charmap` initialisation isn't counted against the test.
Please do tell me if you think I've misdiagnosed the issue or the proposed (high-level) solution isn't correct 😄
|
I've found what looks like a bug related to this, tested in Python 3.6.9, in hypothesis/hypothesis-python/src/hypothesis/internal/charmap.py (lines 60-65 @ 020df5c).

The cached charmap is opened as one type but handled by the surrounding code as another. This type mismatch occurs on every run, on the first call to something that uses the charmap, regardless of whether a cached file is present or not.
|
I was hit by this bug as well. For what it's worth, and if this wasn't obvious or mentioned already, here's a simple workaround for pytest. In `conftest.py`:

```python
from hypothesis import strategies as st

def pytest_configure(config):
    st.text().example()
```

EDIT: Actually, the following is better, as it won't complain about using `.example()`:

```python
from hypothesis import given, strategies as st

def pytest_configure(config):
    @given(st.text())
    def foo(x):
        pass

    foo()
```
|
@robertknight Awesome spot, and well done on getting that PR in! Correct me if I'm wrong, but this issue still remains despite that fix, right?

@jluttine Thanks for the workaround - I should have posted a similar example that I've been using. I haven't had much time to work on a fix for this after my initial debugging frenzy 😆 However, I'm hoping to get some time this week to look into it further 👍
|
@SteadBytes - that's correct; the PR from @robertknight fixes our unicode cache, which is a great mitigation, but the underlying timing bug remains. For that, I think your diagnosis and proposed fix from #2108 (comment) are exactly right! Very happy to leave it to you if you'd like to keep working on this; equally, please feel free to ask for a hand or even throw it back to a maintainer if you're sick of it. Now that you've worked out what's happening, the fix should be relatively easy!
|
@Zac-HD Haha, you beat me to implementing the fix - I was planning to look at it tonight, so it was close 😂
|
Thanks 😂 If you want to try another issue, #1116 or #2116 or anything tagged "good first issue" should be doable in an evening 😉
From a later commit message referencing this issue:

* Remove pandas import from test_bytes_df
* Switch to using `mem://` for this test
* Remove data exclusion workarounds for previously-unsupported write path
* Change test deadline setup to address `test_bytes_{numpy,df} produces inconsistent results` errors: update hypothesis deadlines to `None` and add an internal timing of only the test body itself, which fails the test if it exceeds 2s (a sketch of this pattern follows below). After adding some debug info, I'm about 98% sure this is `https://github.com/HypothesisWorks/hypothesis/issues/2108` or `https://github.com/HypothesisWorks/hypothesis/issues/3369` or similar: the function body in all observed failures completes very quickly, so, barring something extremely spooky, the only place the time could be hiding is in the hypothesis setup for the function (draws are supposed to be excluded, but the issues above demonstrate there still appear to be corner cases).
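A minimal sketch of that deadline setup, assuming a placeholder test body (`st.binary()` and the 2-second threshold stand in for the real test; `deadline=None` disables Hypothesis's own check so the explicit timer governs):

```python
import time

from hypothesis import given, settings, strategies as st

@settings(deadline=None)  # hand timing control to the test itself
@given(st.binary())
def test_bytes_roundtrip(payload):
    start = time.perf_counter()
    # ... actual test body goes here ...
    elapsed = time.perf_counter() - start
    # time only the body: draws and Hypothesis setup are excluded
    assert elapsed < 2.0, "test body took %.2fs" % elapsed
```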

Using `strategies.data` to draw `strategies.text` consistently raises `hypothesis.errors.Flaky` due to `Unreliable test timings` for minimal examples such as `''`.

Minimal example:
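This is the same failing test quoted in the comments above:

```python
from hypothesis import given, strategies as st

@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
```

Running it produces a `Flaky` error reporting `Unreliable test timings`.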
Using `strategies.text` directly with `@given` does not produce the error; the equivalent direct test passes consistently (see the sketch below).

System Information
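A sketch of the passing variant described above, assuming a placeholder body (the original snippet was a direct `@given(st.text())` test):

```python
from hypothesis import given, strategies as st

@given(st.text())
def test_text_direct(x):
    # the draw happens inside @given's machinery, where its time is excluded
    pass
```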
Happy to look into a fix if this can be determined to be a real problem and not an isolated issue 👍