hypothesis.extra.pandas.column, errors out when elements = lists and unique = True #3144

Closed
chrisli-livongo opened this issue Nov 9, 2021 · 4 comments · Fixed by #3204
Labels: legibility (make errors helpful and Hypothesis grokable)

Comments


chrisli-livongo commented Nov 9, 2021

When defining a hypothesis.extra.pandas column whose elements are lists:

from hypothesis.extra.pandas import column
from hypothesis.strategies import integers, lists

column(name='group_array', elements=lists(integers(min_value=-2147483648, max_value=2147483647), max_size=5, min_size=5, unique=True), dtype=list, fill=None, unique=True)

We got this error:

...
/usr/local/lib/python3.8/dist-packages/hypothesis/internal/conjecture/data.py:884: in draw
    return strategy.do_draw(self)
/usr/local/lib/python3.8/dist-packages/hypothesis/strategies/_internal/lazy.py:168: in do_draw
    return data.draw(self.wrapped_strategy)
/usr/local/lib/python3.8/dist-packages/hypothesis/internal/conjecture/data.py:884: in draw
    return strategy.do_draw(self)
/usr/local/lib/python3.8/dist-packages/hypothesis/strategies/_internal/lazy.py:168: in do_draw
    return data.draw(self.wrapped_strategy)
/usr/local/lib/python3.8/dist-packages/hypothesis/internal/conjecture/data.py:884: in draw
    return strategy.do_draw(self)
/usr/local/lib/python3.8/dist-packages/hypothesis/strategies/_internal/core.py:1393: in do_draw
    return self.definition(data.draw, *self.args, **self.kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

draw = <bound method ConjectureData.draw of ConjectureData(VALID, 1961 bytes, frozen)>

    @st.composite
    def just_draw_columns(draw):
        index = draw(index_strategy)
        local_index_strategy = st.just(index)
    
        data = OrderedDict((c.name, None) for c in rewritten_columns)
    
        # Depending on how the columns are going to be generated we group
        # them differently to get better shrinking. For columns with fill
        # enabled, the elements can be shrunk independently of the size,
        # so we can just shrink by shrinking the index then shrinking the
        # length and are generally much more free to move data around.
    
        # For columns with no filling the problem is harder, and drawing
        # them like that would result in rows being very far apart from
        # each other in the underlying data stream, which gets in the way
        # of shrinking. So what we do is reorder and draw those columns
        # row wise, so that the values of each row are next to each other.
        # This makes life easier for the shrinker when deleting blocks of
        # data.
        columns_without_fill = [c for c in rewritten_columns if c.fill.is_empty]
    
        if columns_without_fill:
            for c in columns_without_fill:
                data[c.name] = pandas.Series(
                    np.zeros(shape=len(index), dtype=c.dtype), index=index
                )
            seen = {c.name: set() for c in columns_without_fill if c.unique}
    
            for i in range(len(index)):
                for c in columns_without_fill:
                    if c.unique:
                        for _ in range(5):
                            value = draw(c.elements)
>                           if value not in seen[c.name]:
E                           TypeError: unhashable type: 'numpy.ndarray'

/usr/local/lib/python3.8/dist-packages/hypothesis/extra/pandas/impl.py:578: TypeError

But it works fine when we set unique=False on the column:
column(name='group_array', elements=lists(integers(min_value=-2147483648, max_value=2147483647), max_size=5, min_size=5, unique=True), dtype=list, fill=None, unique=False)

Member

honno commented Nov 9, 2021

I have no idea myself, sorry; here's just a reproducer for folks.

>>> from hypothesis import strategies as st
>>> from hypothesis.extra import pandas as pdst
>>> col1 = pdst.column(elements=st.lists(st.integers()), dtype=list, unique=False)
>>> pdst.data_frames([col1]).example()
    0
0  []
>>> col2 = pdst.column(elements=st.lists(st.integers()), dtype=list, unique=True)
>>> pdst.data_frames([col2]).example()
TypeError: unhashable type: 'numpy.ndarray'

@rsokl rsokl added the bug something is clearly wrong here label Nov 10, 2021
Member

Zac-HD commented Nov 10, 2021

Well, the problem is just that we track a dict[str, set] of seen items, keyed by column name, but if the generated element is not hashable (and indeed ndarray is not) then we get a TypeError.

You get the same error from st.lists(st.lists(st.none()), unique=True).example(), so I think this is a docs issue rather than a bug.
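
A minimal standard-library sketch of why the seen-set check fails. `draw_unique` is a hypothetical helper that mirrors the `value not in seen[c.name]` test from the traceback; it is not Hypothesis's actual code, and plain lists stand in for the ndarray from the issue.

```python
def draw_unique(values):
    """Keep first occurrences, tracking seen values in a set.

    Hypothetical helper for illustration only; it mimics the
    membership test shown in the traceback.
    """
    seen = set()
    kept = []
    for value in values:
        # The membership test hashes `value`, so unhashable values raise TypeError.
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


# Hashable elements (ints, tuples) work as expected.
ints = draw_unique([1, 2, 1])
tuples = draw_unique([(1, 2), (1, 2), (3,)])

# Unhashable elements fail on the very first membership test.
try:
    draw_unique([[1, 2], [1, 2]])
    error = None
except TypeError as exc:
    error = str(exc)

print(ints, tuples, error)
```

The failure comes from hashing the element, not from anything pandas-specific, which matches the observation that nested `st.lists(..., unique=True)` hits the same error.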

Contributor

rsokl commented Nov 11, 2021

I had labeled this a bug because I think that pdst.column(elements=st.lists(st.integers()), dtype=list, unique=True) should immediately raise since it is not supported.

Even if the user has insight into the fact that we require "hashability" in order to draw unique elements, and used pdst.column(elements=st.lists(st.integers()).map(tuple), dtype=tuple, unique=True) instead, they would still hit the same error because pdst.data_frames will convert the sequence to an ndarray under the hood, which of course is not hashable.

@Zac-HD Zac-HD added legibility make errors helpful and Hypothesis grokable and removed bug something is clearly wrong here labels Nov 11, 2021
Member

Zac-HD commented Dec 30, 2021

The underlying problem here is that there's no such thing as dtype=list or dtype=tuple; both are aliases of dtype=object. So the only thing we can really do is emit a warning when we see this and suspect a misunderstanding!
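
A quick check of the dtype aliasing described above, assuming NumPy is installed. This is only a sketch of the NumPy behaviour involved, not of how Hypothesis detects the case or what warning it emits.

```python
import numpy as np

# NumPy has no dedicated dtype for Python's list or tuple,
# so both resolve to the generic object dtype.
list_dtype = np.dtype(list)
tuple_dtype = np.dtype(tuple)
both_object = list_dtype == np.dtype(object) and tuple_dtype == np.dtype(object)

# Allocating with dtype=list therefore yields an ordinary object array,
# as in the np.zeros(...) call visible in the traceback.
arr = np.zeros(shape=2, dtype=list)

# ndarrays are unhashable, so they cannot be stored in a set.
try:
    hash(arr)
    hash_error = None
except TypeError as exc:
    hash_error = str(exc)

print(list_dtype, tuple_dtype, both_object, arr.dtype, hash_error)
```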
