Healthchecks triggered in CI #3859

AdrianSosic · 2024-01-24T16:51:13Z

Hi, I know that this issue has been already discussed in one way or the other in previous issues, and I also saw the corresponding explanations in the doc, i.e.

If this is expected, e.g. when generating large arrays or dataframes, you can selectively disable them with the suppress_health_check setting.

Still, I thought opening this new issue could be useful because my example is really not about "large dataframes" (as is mentioned in the docs) but rather really tiny ones. So perhaps my problem has a different cause.

Here's the strategy I've implemented, creating really small dataframes of sizes up to 5x2 (and I think the issue persists even for smaller numbers):

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

_index_strategy = st.one_of(st.text(), st.integers(), st.floats())
"""A strategy for generating dataframe indexes."""


@st.composite
def random_dataframes(draw: st.DrawFn):
    """Generate pandas dataframes of random shape and content."""
    # Generate the dataframe shape
    num_rows = draw(st.integers(min_value=0, max_value=2))
    num_cols = draw(st.integers(min_value=0, max_value=5))

    # Generate the column names/types and index
    col_names = draw(
        st.lists(_index_strategy, min_size=num_cols, max_size=num_cols, unique=True)
    )
    col_types = draw(
        st.lists(
            st.sampled_from([int, float, str]), min_size=num_cols, max_size=num_cols
        )
    )
    index = st.lists(_index_strategy, min_size=num_rows, max_size=num_rows)

    # Define the column content
    columns = [
        column(name=name, dtype=dtype) for name, dtype in zip(col_names, col_types)
    ]

    # Generate the dataframe
    return draw(data_frames(columns=columns, index=index))

The tests run without problems on my local machine, but when executed in CI, I get an error like:
Only produced 1 valid examples in 2.81 seconds

And 2.81 seconds for creating a single dataframe of size 5x2 seams really fishy, so I suspect something is not running as intended!? If it helps, here is a link to one of the failed CI runs:
https://github.com/emdgroup/baybe/actions/runs/7621511880/job/20769011261

Happy to hear your thoughts! 😃

The text was updated successfully, but these errors were encountered:

Zac-HD · 2024-01-26T21:45:16Z

I'd guess the problem here is that you're imposing several interacting constraints, by picking a num_rows and num_cols up front. It's more efficient to let Hypothesis handle that internally, where it can incrementally choose whether to add each additional row. I think an efficient version of your strategy would look something like:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, indexes

_index_strategy = st.one_of(st.text(), st.integers(), st.floats())
"""A strategy for generating dataframe indexes."""

@st.composite
def random_dataframes(draw: st.DrawFn):
    """Generate pandas dataframes of random shape and content."""
    _col_strategy = st.builds(column, name=_index_strategy, dtype=st.sampled_from([int, float, str]))
    columns = draw(st.lists(_col_strategy, max_size=5, unique_by=lambda c: c.name))
    index_strategy = indexes(elements=_index_strategy, max_size=2)
    return draw(data_frames(columns=columns, index=index_strategy))

and omitting the index_strategy (in favor of setting the first column as the index) and constraints on the number of rows would probably make this even faster.

HypothesisWorks/hypothesis#3859

AdrianSosic · 2024-01-31T21:57:33Z

Hi @Zac-HD, thanks a lot for your answer. Now that you explained the problem, it makes a lot of sense to me, and it indeed solved the issue! I was wondering: wouldn't it make sense to link something like the document you shared also in the hypothesis documentation? In fact, I was looking several times for resources about the used shrinking concepts to better understand what is happening behind the scenes (it would have probably helped my prevent the problem in the first place) but I couldn't really find much ...

HypothesisWorks/hypothesis#3859

Fixes the dataframe strategy as proposed in HypothesisWorks/hypothesis#3859

Zac-HD added question not sure it's a bug? questions welcome performance go faster! use less memory! labels Jan 26, 2024

Zac-HD closed this as completed Jan 26, 2024

AdrianSosic added a commit to emdgroup/baybe that referenced this issue Jan 31, 2024

Fix dataframe hypothesis strategy

d008c1b

HypothesisWorks/hypothesis#3859

AdrianSosic mentioned this issue Jan 31, 2024

Fix dataframe hypothesis strategy emdgroup/baybe#108

Merged

Scienfitz pushed a commit to emdgroup/baybe that referenced this issue Feb 1, 2024

Fix dataframe hypothesis strategy

cce61cc

HypothesisWorks/hypothesis#3859

Scienfitz added a commit to emdgroup/baybe that referenced this issue Feb 1, 2024

Merge: Fix dataframe hypothesis strategy (#108)

33cbb8f

Fixes the dataframe strategy as proposed in HypothesisWorks/hypothesis#3859

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Healthchecks triggered in CI #3859

Healthchecks triggered in CI #3859

AdrianSosic commented Jan 24, 2024

Zac-HD commented Jan 26, 2024

AdrianSosic commented Jan 31, 2024

Healthchecks triggered in CI #3859

Healthchecks triggered in CI #3859

Comments

AdrianSosic commented Jan 24, 2024

Zac-HD commented Jan 26, 2024

AdrianSosic commented Jan 31, 2024