
Unreliable test timings when using data.draw with text strategy #2108

Closed
SteadBytes opened this issue Sep 30, 2019 · 21 comments · Fixed by #2197

Comments


@SteadBytes SteadBytes commented Sep 30, 2019

Using strategies.data to draw strategies.text consistently raises hypothesis.errors.Flaky due to "Unreliable test timings", even for minimal examples such as the empty string ''.

Minimal example:

from hypothesis import given, strategies as st

@given(st.data())
def test_hypothesis_flaky_text(data):
    data.draw(st.text())
    assert 1

Produces:

============================================== FAILURES ==============================================
_____________________________________ test_hypothesis_flaky_text _____________________________________

    @given(st.data())
>   def test_hypothesis_flaky_text(data):

natural_merge_sort.py:78: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/ben/.local/lib/python3.7/site-packages/hypothesis/core.py:592: in execute
    % (test.__name__, text_repr[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypothesis.core.StateForActualGivenExecution object at 0x7fbc41de3c88>
message = 'Hypothesis test_hypothesis_flaky_text(data=data(...)) produces unreliable results: Falsified on the first call but did not on a subsequent one'

    def __flaky(self, message):
        if len(self.falsifying_examples) <= 1:
>           raise Flaky(message)
E           hypothesis.errors.Flaky: Hypothesis test_hypothesis_flaky_text(data=data(...)) produces unreliable results: Falsified on the first call but did not on a subsequent one

/home/ben/.local/lib/python3.7/site-packages/hypothesis/core.py:751: Flaky
--------------------------------------------- Hypothesis ---------------------------------------------
Falsifying example: test_hypothesis_flaky_text(data=data(...))
Draw 1: ''
Unreliable test timings! On an initial run, this test took 902.74ms, which exceeded the deadline of 200.00ms, but on a subsequent run it took 0.04 ms, which did not. If you expect this sort of variability in your test timings, consider turning deadlines off for this test by setting deadline=None.
==================================== 1 failed, 1 skipped in 1.00s ====================================

Using strategies.text directly with @given does not produce the error. The following passes consistently:

@given(st.text())
def test_hypothesis_flaky_text(txt):
    assert 1

System Information

  • Python 3.7.3
  • Hypothesis 4.36.2
  • pytest 5.1.2
  • Ubuntu 19.04

Happy to look into a fix if this can be determined to be a real problem and not an isolated issue 👍

@Zac-HD Zac-HD commented Oct 1, 2019

I think this is a real problem, similar to #911. I'm not entirely sure why it's happening, but I really appreciate your writeup - thanks!

My guess is that text() specifically is doing a lot of work calculating Unicode spans etc. at instantiation, so excluding the draw-time proper isn't really helping. On the other hand excluding instantiation time from measurement might not help much (and would be intractable in general). Or perhaps the unreliable timings detection isn't using our draw-aware timer?

If you want to investigate I'd start by reading through core.py, especially the parts touched by #976. And feel free to ask questions as you go!

@DRMacIver DRMacIver commented Oct 1, 2019

My guess is that text() specifically is doing a lot of work calculating Unicode spans etc. at instantiation, so excluding the draw-time proper isn't really helping. On the other hand excluding instantiation time from measurement might not help much (and would be intractable in general). Or perhaps the unreliable timings detection isn't using our draw-aware timer?

Because of the way strategies are evaluated lazily I think this shouldn't be the case - the load time should be part of the first draw time - but it's possible something is going wrong, or that it's a function of time measurement on computers being weird.

I wonder if we should modify the deadline code to be a little more forgiving of flakiness, to cope with this sort of lazy loading of global resources in general - say, allow some small bounded number of deadline-exceeding test runs, which we rerun to see if they're still deadline-exceeding, and just shrug and ignore them if they're not.
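
To make that concrete, here is a minimal, self-contained sketch of the idea (this is not Hypothesis code; the function name and the retry bound are invented purely for illustration):

import time

MAX_DEADLINE_RETRIES = 2  # hypothetical bound on how forgiving to be

def exceeds_deadline(run_test, deadline_seconds):
    # Report a deadline failure only if the test is over the deadline on the
    # initial run *and* on every retry; a single fast re-run means we shrug
    # and treat the slow run as one-off startup cost.
    for _ in range(1 + MAX_DEADLINE_RETRIES):
        start = time.perf_counter()
        run_test()
        if time.perf_counter() - start <= deadline_seconds:
            return False
    return True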

@SteadBytes SteadBytes commented Oct 1, 2019

@Zac-HD You're more than welcome.
That's my initial thinking too; the fact that it's almost instant on the subsequent run suggests that the time is initially spent doing work at instantiation 🤔 Thanks for the pointers - I'll take a look at core.py.

@DRMacIver I agree that the load time should be part of the first draw time, otherwise slow draws due to genuinely complex generation could lead to slow tests that don't trigger the Hypothesis deadline. Being more forgiving does sound like a reasonable solution; exactly what the limit of deadline-exceeding runs should be, I'm not sure. Would the Unreliable test timings error still be raised if the deadline wasn't exceeded but the time between runs varied widely, e.g. 180ms on the first run and 0.04ms on the second?

@SteadBytes SteadBytes commented Oct 2, 2019

Adding a regression test to the hypothesis test suite has uncovered some more interesting behaviour 😄.

When added as a test to tests/cover/test_regressions.py, the example I originally gave would not fail:

# hypothesis/hypothesis-python/tests/cover/test_regressions.py
#... existing tests

@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())
$ python -m pytest hypothesis-python/tests/cover/test_regressions.py
======================================= 7 passed in 1.05s =======================================

However, moving the test to its own file outside of the hypothesis test suite does fail:

# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st


@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

Furthermore, the same issue appears for characters:

# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st


@given(st.data())
def test_flaky_text(data):
    data.draw(st.characters())

However, if both the text test and characters test are in the same test file only the first fails:

# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

#  fails
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

#  passes
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())

or

# hypothesis/test_flaky_text.py
from hypothesis import given, strategies as st

#  fails
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())


#  passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

Setting an increased deadline on only the first test causes both to pass:

# hypothesis/test_flaky_text.py
from hypothesis import given, settings, strategies as st

# passes
@settings(deadline=900)
@given(st.data())
def test_flaky_characters(data):
    data.draw(st.characters())


# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

Only the very first use of either text or characters seems to be subject to the issue; all subsequent uses are fine. Being unable to reproduce this within the existing test suite implies that somewhere during the pytest run the text and/or characters strategies are being instantiated before the tests run (even targeting the single test file directly didn't change this), perhaps in a conftest.py 🤔

@Zac-HD Zac-HD commented Oct 2, 2019

The slow calls are in charmap.py, within characters(), but with a couple of caches - we know that calculating all that stuff from Unicode chars is slow, so we try to only do it once. (While not actually a solution, it could probably be made faster by calculating more ahead of time.)
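
For anyone who wants to see that one-off cost directly, here is a rough sketch (charmap is an internal Hypothesis module, so this is purely for investigation; the exact timings will vary by machine):

import time
from hypothesis.internal.charmap import charmap

start = time.perf_counter()
charmap()  # first call: loads or computes the Unicode category table
print("first call: %.3f s" % (time.perf_counter() - start))

start = time.perf_counter()
charmap()  # second call: served from the module-level cache
print("second call: %.6f s" % (time.perf_counter() - start))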

@SteadBytes SteadBytes commented Oct 2, 2019

Yep, looks like the tests pass when run within the test suite because charmap.charmap() is being called during setup.

Adding a call to charmap.charmap() at the top of the test file outside of the test suite allows the tests to pass:

# hypothesis/test_flaky_text.py
from hypothesis import given, settings, strategies as st
from hypothesis.internal.charmap import charmap

charmap()

# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.characters())


# passes
@given(st.data())
def test_flaky_text(data):
    data.draw(st.text())

@DRMacIver DRMacIver commented Oct 2, 2019

It might be worth creating a dummy strategy that is deliberately slow on its first invocation to test with. You could do something like:

import time

from hypothesis.strategies import composite, integers

slow_down_call = True

@composite
def flaky_timing_strategy(draw):
    global slow_down_call
    if slow_down_call:
        time.sleep(1)
        slow_down_call = False
    # dummy draw to make sure we draw some data and don't hit
    # the "this test doesn't actually draw anything" logic
    return draw(integers(0, 10))

Using this you can control whether it's slow independently of the text logic. This doesn't necessarily have the same behaviour as text, but if it doesn't, that is in and of itself interesting information!
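
For example, a sketch of how the dummy strategy might be exercised both interactively and at the top level (the test names here are just illustrative):

from hypothesis import given, strategies as st

@given(st.data())
def test_interactive_draw_slow_once(data):
    # interactive draw: the slow first call happens inside the test body
    data.draw(flaky_timing_strategy())

@given(flaky_timing_strategy())
def test_top_level_draw_slow_once(value):
    # top-level draw: the slow first call happens before the test body runs
    assert 0 <= value <= 10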

@SteadBytes SteadBytes commented Oct 2, 2019

Ah yes, that's a good idea, thank you - I'll investigate this behaviour and whether it represents the behaviour of text 👍

@SteadBytes SteadBytes commented Oct 3, 2019

Using a deliberately slow strategy doesn't trigger the error, as the time spent in the draw is correctly factored out in execute here:

internal_draw_time = sum(data.draw_times[initial_draws:])

However, I think I've tracked the issue down to the lazy evaluation of the cached strategy when using draw interactively within a test. During the first run there is always a cache miss in the @cacheable decorator (as expected):

result = fn(*args, **kwargs)
The cached result, however, is wrapped in a LazyStrategy:
result = LazyStrategy(strategy_definition, args, kwargs)
and has therefore not been evaluated yet. Both the characters and text strategies make use of OneCharStringStrategy, which calls charmap.categories() upon initialisation:
assert set(whitelist_categories or ()).issubset(charmap.categories())
This will then call charmap(), as the global _charmap isn't initialised yet:

_categories.remove("Cc")  # Other, Control

Using data.draw means that this initialisation of charmap takes place within the test function and is therefore included in the test duration.

Using @given(st.characters()) or @given(st.text()) directly instead of via data.draw doesn't produce the same issue, as the initialisation of charmap takes place during the draw phase before the test function itself begins, and is thus factored out of the test duration as expected.

The only ways I can see around this are to pre-initialise charmap somehow before entering the test function, to factor out the time it takes, or, as @DRMacIver suggested, to be more forgiving with flaky test timings 🤔

@DRMacIver DRMacIver commented Oct 3, 2019

The thing that's surprising to me is that the lazy instantiation doesn't happen inside the first draw call. What should happen (from memory, I haven't checked the code at the moment) is that the LazyStrategy instantiates wrapped_strategy the first time it is accessed, which should be inside draw. Maybe it's being evaluated earlier than that, when evaluating a strategy property? e.g. is it possible that our is_empty check is outside the excluded period for calculating draw times?

@SteadBytes SteadBytes commented Oct 3, 2019

The wrapped_strategy is instantiated the first time it's accessed:

if self.__wrapped_strategy is None:

Which does happen during the first draw call within the test function. Here's a screenshot from debugging the test run:
[debugger screenshot omitted]
The call stack on the left shows that it's within the test function test_flaky_draw_characters. The is_empty check is indeed within the draw call (the second draw stack frame).

Because this is within the test function itself I don't think the time for the draw is being excluded.

A bit of good old print-statement debugging confirms that the characters strategy is indeed first initialised within the test function:

# test_regressions.py

@settings(max_examples=1)
@given(st.data())
def test_flaky_draw_characters(data):
    print("test function starts")
    data.draw(st.characters())


# src/hypothesis/internal/conjecture/data.py
class ConjectureData(object):
    def draw(self, strategy, label=None):
        # ....
            print(f"is_empty check for {strategy}")
            if strategy.is_empty:
                self.mark_invalid()
        # ...

# src/hypothesis/internal/charmap.py
def charmap():
    # ...
    global _charmap
    # Best-effort caching in the face of missing files and/or unwritable
    # filesystems is fairly simple: check if loaded, else try loading,
    # else calculate and try writing the cache.
    if _charmap is None:
        print("_charmap initialisation")
        # ...

Running the test outputs:

collected 1 item

hypothesis-python/tests/cover/test_regressions.py
is_empty check for TupleStrategy((just(()), fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))))
is_empty check for just(())
is_empty check for fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))
is_empty check for fixed_dictionaries({'data': data()})
is_empty check for FixedKeysDictStrategy(('data',), TupleStrategy((data(),)))
is_empty check for TupleStrategy((data(),))
is_empty check for data()
test function starts
is_empty check for characters()
_charmap initialisation
is_empty check for <hypothesis.searchstrategy.strings.OneCharStringStrategy object at 0x7fd07f296208>
is_empty check for TupleStrategy((just(()), fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))))
is_empty check for just(())
is_empty check for fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))
is_empty check for fixed_dictionaries({'data': data()})
is_empty check for FixedKeysDictStrategy(('data',), TupleStrategy((data(),)))
is_empty check for TupleStrategy((data(),))
is_empty check for data()
test function starts
is_empty check for characters()
is_empty check for <hypothesis.searchstrategy.strings.OneCharStringStrategy object at 0x7fd07f296208>
is_empty check for TupleStrategy((just(()), fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))))
is_empty check for just(())
is_empty check for fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))
is_empty check for fixed_dictionaries({'data': data()})
is_empty check for FixedKeysDictStrategy(('data',), TupleStrategy((data(),)))
is_empty check for TupleStrategy((data(),))
is_empty check for data()
test function starts
is_empty check for characters()
is_empty check for <hypothesis.searchstrategy.strings.OneCharStringStrategy object at 0x7fd07f296208>
is_empty check for TupleStrategy((just(()), fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))))
is_empty check for just(())
is_empty check for fixed_dictionaries({'data': data()}).map(lambda args: dict(args, **kwargs))
is_empty check for fixed_dictionaries({'data': data()})
is_empty check for FixedKeysDictStrategy(('data',), TupleStrategy((data(),)))
is_empty check for TupleStrategy((data(),))
is_empty check for data()

@SteadBytes SteadBytes commented Oct 4, 2019

Some more debugging shows that the interactive draw within the test function isn't being added to the excluded draw times. This check

if not at_top_level:
    return strategy.do_draw(self)
evaluates to True when the draw from characters happens within the test function. As such, the draw time is not added to ConjectureData.draw_times here:
else:
    try:
        strategy.validate()
        start_time = benchmark_time()
        try:
            return strategy.do_draw(self)
        finally:
            self.draw_times.append(benchmark_time() - start_time)
    except BaseException as e:
        mark_for_escalation(e)
        raise
finally:
Therefore it is not excluded in execute here
internal_draw_time = sum(data.draw_times[initial_draws:])

So I think that either the at_top_level check isn't the right criterion for deciding whether the draw time should be excluded, or the value of at_top_level is being calculated incorrectly in this case.

@DRMacIver DRMacIver commented Oct 4, 2019

Some more debugging shows that the interactive draw within the test function isn't being added to the excluded draw times.

Ooh. Good catch. That's definitely a bug.

So I think that either the test for at_top_level isn't correct for determining whether the draw time should be excluded, or the value of at_top_level is being calculated wrong in this case.

I think it's the former. It's correct that interactive draws are not at the top level, but interactive draws should definitely be excluded from the total time.

@SteadBytes SteadBytes commented Oct 4, 2019

Ok excellent, I'll look into how the draw times can be excluded for interactive draws in addition to top level draws 👍

@SteadBytes SteadBytes commented Oct 5, 2019

I had an epiphany this morning whilst making breakfast and I think there are two bugs here 😄

The first is the already discovered fact that the interactive draw times are not taken into account when excluding draw times from the test run time (#2108 (comment)).

The second is that, because the draw time is calculated within ConjectureData.__draw, the time spent in its caller ConjectureData.draw is not included:

return self.__draw(strategy, label=label)

For most strategies this is not a problem, as the time spent in ConjectureData.draw is minimal. However, as previously mentioned (#2108 (comment)), for the characters/text strategies the is_empty check causes initialisation of charmap._charmap, which takes a significant amount of time. Since only the internal draw times (those measured by ConjectureData.__draw) are factored out of the overall test time here:

internal_draw_time = sum(data.draw_times[initial_draws:])

the _charmap initialisation is included - causing the timeout on the first run.

To validate this hypothesis (pun intended), I put in a quick and dirty hack to start the draw timer in ConjectureData.draw instead of ConjectureData.__draw and to always measure all draw times (not a solution of course, just a way to narrow down the issue). With it, the test passes without the unreliable timings error:

# src/hypothesis/internal/conjecture/data.py

#...
draw_start = None

class ConjectureData:
    # ...
    def draw(self, strategy, label=None):
        #...
        # start timing the draw
        global draw_start
        draw_start = benchmark_time()
        
        # _charmap initialisation will be included in draw time 
        # for characters/text strategy
        if strategy.is_empty:
            self.mark_invalid()

        if self.depth >= MAX_DEPTH:
            self.mark_invalid()

        return self.__draw(strategy, label=label)

    def __draw(self, strategy, label):
        at_top_level = self.depth == 0
        if label is None:
            label = strategy.label
        self.start_example(label=label)

        try:
            if not True: # force all draw times to be measured
                return strategy.do_draw(self)
            else:
                try:
                    strategy.validate()
                    global draw_start
                    try:
                        return strategy.do_draw(self)
                    finally:
                        # measure from draw_start, set in ConjectureData.draw above
                        self.draw_times.append(benchmark_time() - draw_start)
                except BaseException as e:
                    mark_for_escalation(e)
                    raise
        finally:
            self.stop_example()

# src/hypothesis/core.py

class StateForActualGivenExecution(object):
    # ...
    def execute(self):  # other arguments elided
        # ...
            @proxies(self.test)
            def test(*args, **kwargs):
                self.__test_runtime = None
                initial_draws = len(data.draw_times)
                start = benchmark_time()
                result = self.test(*args, **kwargs)
                finish = benchmark_time()
                # force all draw times to be taken into account not just 'internal'
                # not a general solution
                internal_draw_time = sum(data.draw_times)
                runtime = datetime.timedelta(
                    seconds=finish - start - internal_draw_time
                )
                self.__test_runtime = runtime
                current_deadline = self.settings.deadline
                if not is_final:
                    current_deadline = (current_deadline // 4) * 5
                if runtime >= current_deadline:
                    raise DeadlineExceeded(runtime, self.settings.deadline)
                return result    

So I think the overall solution comes in two main parts:

  1. Include the full ConjectureData.draw call in the draw timing measurements
  2. Include interactive draws when calculating the internal draw time to factor out of the overall test runtime in the proxied test method within StateForActualGivenExecution.execute

Please do tell me if you think I've misdiagnosed the issue or the proposed (high level) solution isn't correct 😄

@robertknight robertknight commented Oct 31, 2019

I've found what looks like a bug related to this, tested in Python 3.6.9. In hypothesis/internal/charmap.py:

try:
    with gzip.GzipFile(f, "rb") as i:
        tmp_charmap = dict(json.loads(i))
except Exception:
    tmp_charmap = {}

The cached charmap is opened as a GzipFile, which is then passed to json.loads (which expects a string or byte array) rather than json.load (which expects a file object). As a result, this raises an exception which is then silently caught, leading to the generation of a temporary charmap. That involves an expensive loop which iterates from 0 to sys.maxunicode (over a million iterations).

This type mismatch occurs on every run, on the first call to something that uses the charmap, regardless of whether a cached file is present or not.
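
For reference, a sketch of how that call could be corrected (not necessarily the exact change that landed; f is the path to the cache file, as in charmap.py):

import gzip
import json

try:
    with gzip.GzipFile(f, "rb") as i:
        # json.load accepts a file-like object; dict(json.loads(i.read())) would also work
        tmp_charmap = dict(json.load(i))
except Exception:
    tmp_charmap = {}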

@jluttine jluttine commented Nov 3, 2019

I was hit by this bug as well. For what it's worth, and if this wasn't obvious or mentioned already, here's a simple workaround for pytest: in conftest.py, just create a text example so that all the initialisation happens before the tests are actually run.

from hypothesis import strategies as st

def pytest_configure(config):
    st.text().example()
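
A coarser alternative, if the deadline check itself isn't important for the suite, is the deadline=None route suggested by the error message, applied globally via a settings profile (a sketch, assuming a project-wide conftest.py):

# conftest.py
from hypothesis import settings

settings.register_profile("no-deadline", deadline=None)
settings.load_profile("no-deadline")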

@SteadBytes SteadBytes commented Nov 3, 2019

@robertknight Awesome spot, and well done on getting that PR in! Correct me if I'm wrong, but this issue still remains despite that fix, right?

@jluttine Thanks for the workaround, I should have posted a similar example that I've been using.

I haven't had much time to work on a fix for this after my initial debugging frenzy 😆 However I'm hoping to get some time this week to look into it further 👍

@Zac-HD Zac-HD commented Nov 5, 2019

@SteadBytes - that's correct; the PR from @robertknight fixes our Unicode cache, which is a great mitigation for text(), but we want to fix the underlying issue too.

For that, I think your diagnosis and proposed fix from #2108 (comment) are exactly right! Very happy to leave it to you if you'd like to keep working on this; equally please feel free to ask for a hand or even throw it back to a maintainer if you're sick of it. Now that you've worked out what's happening the fix should be relatively easy!

@SteadBytes SteadBytes commented Nov 11, 2019

@Zac-HD Haha you beat me to implementing the fix - I was planning to look at it tonight so it was close 😂
Nice work 👍

@Zac-HD Zac-HD commented Nov 11, 2019

Thanks 😂

If you want to try another issue, #1116 or #2116 or anything tagged "good first issue" should be doable in an evening 😉
