Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Healthchecks triggered in CI #3859

Closed
AdrianSosic opened this issue Jan 24, 2024 · 2 comments
Closed

Healthchecks triggered in CI #3859

AdrianSosic opened this issue Jan 24, 2024 · 2 comments
Labels
performance go faster! use less memory! question not sure it's a bug? questions welcome

Comments

@AdrianSosic
Copy link

Hi, I know that this issue has been already discussed in one way or the other in previous issues, and I also saw the corresponding explanations in the doc, i.e.

If this is expected, e.g. when generating large arrays or dataframes, you can selectively disable them with the suppress_health_check setting.

Still, I thought opening this new issue could be useful because my example is really not about "large dataframes" (as is mentioned in the docs) but rather really tiny ones. So perhaps my problem has a different cause.

Here's the strategy I've implemented, creating really small dataframes of sizes up to 5x2 (and I think the issue persists even for smaller numbers):

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

_index_strategy = st.one_of(st.text(), st.integers(), st.floats())
"""A strategy for generating dataframe indexes."""


@st.composite
def random_dataframes(draw: st.DrawFn):
    """Generate pandas dataframes of random shape and content."""
    # Generate the dataframe shape
    num_rows = draw(st.integers(min_value=0, max_value=2))
    num_cols = draw(st.integers(min_value=0, max_value=5))

    # Generate the column names/types and index
    col_names = draw(
        st.lists(_index_strategy, min_size=num_cols, max_size=num_cols, unique=True)
    )
    col_types = draw(
        st.lists(
            st.sampled_from([int, float, str]), min_size=num_cols, max_size=num_cols
        )
    )
    index = st.lists(_index_strategy, min_size=num_rows, max_size=num_rows)

    # Define the column content
    columns = [
        column(name=name, dtype=dtype) for name, dtype in zip(col_names, col_types)
    ]

    # Generate the dataframe
    return draw(data_frames(columns=columns, index=index))

The tests run without problems on my local machine, but when executed in CI, I get an error like:
Only produced 1 valid examples in 2.81 seconds

And 2.81 seconds for creating a single dataframe of size 5x2 seams really fishy, so I suspect something is not running as intended!? If it helps, here is a link to one of the failed CI runs:
https://github.com/emdgroup/baybe/actions/runs/7621511880/job/20769011261

Happy to hear your thoughts! 😃

@Zac-HD Zac-HD added question not sure it's a bug? questions welcome performance go faster! use less memory! labels Jan 26, 2024
@Zac-HD
Copy link
Member

Zac-HD commented Jan 26, 2024

I'd guess the problem here is that you're imposing several interacting constraints, by picking a num_rows and num_cols up front. It's more efficient to let Hypothesis handle that internally, where it can incrementally choose whether to add each additional row. I think an efficient version of your strategy would look something like:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, indexes

_index_strategy = st.one_of(st.text(), st.integers(), st.floats())
"""A strategy for generating dataframe indexes."""

@st.composite
def random_dataframes(draw: st.DrawFn):
    """Generate pandas dataframes of random shape and content."""
    _col_strategy = st.builds(column, name=_index_strategy, dtype=st.sampled_from([int, float, str]))
    columns = draw(st.lists(_col_strategy, max_size=5, unique_by=lambda c: c.name))
    index_strategy = indexes(elements=_index_strategy, max_size=2)
    return draw(data_frames(columns=columns, index=index_strategy))

and omitting the index_strategy (in favor of setting the first column as the index) and constraints on the number of rows would probably make this even faster.

@AdrianSosic
Copy link
Author

Hi @Zac-HD, thanks a lot for your answer. Now that you explained the problem, it makes a lot of sense to me, and it indeed solved the issue! I was wondering: wouldn't it make sense to link something like the document you shared also in the hypothesis documentation? In fact, I was looking several times for resources about the used shrinking concepts to better understand what is happening behind the scenes (it would have probably helped my prevent the problem in the first place) but I couldn't really find much ...

Scienfitz pushed a commit to emdgroup/baybe that referenced this issue Feb 1, 2024
Scienfitz added a commit to emdgroup/baybe that referenced this issue Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance go faster! use less memory! question not sure it's a bug? questions welcome
Projects
None yet
Development

No branches or pull requests

2 participants