Error generating dataframes with a column of sets #3133

Closed
andreareina opened this issue Nov 3, 2021 · 3 comments · Fixed by #3204
Labels
legibility (make errors helpful and Hypothesis grokable)

Comments

@andreareina

First off, love love love hypothesis! I know storing non-scalars in a dataframe isn't exactly the intended use case, but it's something we're not well-placed to fix right this moment. I also understand that probably means this isn't high priority.

Environment:

attrs            21.2.0 Classes Without Boilerplate
hypothesis       6.24.1 A library for property-based testing
more-itertools   8.10.0 More routines for operating on iterables, beyond it...
numpy            1.21.3 NumPy is the fundamental package for array computin...
packaging        21.2   Core utilities for Python packages
pandas           1.3.4  Powerful data structures for data analysis, time se...
pluggy           0.13.1 plugin and hook calling mechanisms for python
py               1.10.0 library with cross-python path, ini-parsing, io, co...
pyparsing        2.4.7  Python parsing module
pytest           5.4.3  pytest: simple powerful testing with Python
python-dateutil  2.8.2  Extensions to the standard Python datetime module
pytz             2021.3 World timezone definitions, modern and historical
six              1.16.0 Python 2 and 3 compatibility utilities
sortedcontainers 2.4.0  Sorted Containers -- Sorted List, Sorted Dict, Sort...
wcwidth          0.2.5  Measures the displayed width of unicode strings in ...

Code:

from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column


@given(data_frames(columns=[column(elements=st.sets(st.text(), min_size=1))]))
def test_dataframe_with_set(df):
    pass

output.txt:

============================= test session starts ==============================
platform linux -- Python 3.8.11, pytest-5.4.3, py-1.10.0, pluggy-0.13.1
rootdir: /home/andrea/omnistream/bugs/hypothesis-dataframes-sets
plugins: hypothesis-6.24.1
collected 1 item

repro.py F                                                               [100%]

=================================== FAILURES ===================================
___________________________ test_dataframe_with_set ____________________________

self = 0    0.0
dtype: float64, key = 0, value = {'', '0'}

    def __setitem__(self, key, value) -> None:
        key = com.apply_if_callable(key, self)
        cacher_needs_updating = self._check_is_chained_assignment_possible()
    
        if key is Ellipsis:
            key = slice(None)
    
        try:
>           self._set_with_engine(key, value)

.venv/lib/python3.8/site-packages/pandas/core/series.py:1062: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 0    0.0
dtype: float64, key = 0, value = {'', '0'}

    def _set_with_engine(self, key, value) -> None:
        # fails with AttributeError for IntervalIndex
        loc = self.index._engine.get_loc(key)
        # error: Argument 1 to "validate_numeric_casting" has incompatible type
        # "Union[dtype, ExtensionDtype]"; expected "dtype"
        validate_numeric_casting(self.dtype, value)  # type: ignore[arg-type]
>       self._values[loc] = value
E       TypeError: float() argument must be a string or a number, not 'set'

.venv/lib/python3.8/site-packages/pandas/core/series.py:1099: TypeError

During handling of the above exception, another exception occurred:

    @given(data_frames(columns=[column(elements=st.sets(st.text(), min_size=1))]))
>   def test_dataframe_with_set(df):

repro.py:6: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.8/site-packages/pandas/core/series.py:1088: in __setitem__
    self._set_with(key, value)
.venv/lib/python3.8/site-packages/pandas/core/series.py:1123: in _set_with
    self._set_labels(key, value)
.venv/lib/python3.8/site-packages/pandas/core/series.py:1135: in _set_labels
    self._set_values(indexer, value)
.venv/lib/python3.8/site-packages/pandas/core/series.py:1141: in _set_values
    self._mgr = self._mgr.setitem(indexer=key, value=value)
.venv/lib/python3.8/site-packages/pandas/core/internals/managers.py:355: in setitem
    return self.apply("setitem", indexer=indexer, value=value)
.venv/lib/python3.8/site-packages/pandas/core/internals/managers.py:327: in apply
    applied = getattr(b, f)(**kwargs)
.venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py:927: in setitem
    return self.coerce_to_target_dtype(value).setitem(indexer, value)
.venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py:943: in setitem
    check_setitem_lengths(indexer, value, values)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

indexer = array([0]), value = {'', '0'}, values = array([0.0], dtype=object)

    def check_setitem_lengths(indexer, value, values) -> bool:
        """
        Validate that value and indexer are the same length.
    
        An special-case is allowed for when the indexer is a boolean array
        and the number of true values equals the length of ``value``. In
        this case, no exception is raised.
    
        Parameters
        ----------
        indexer : sequence
            Key for the setitem.
        value : array-like
            Value for the setitem.
        values : array-like
            Values being set into.
    
        Returns
        -------
        bool
            Whether this is an empty listlike setting which is a no-op.
    
        Raises
        ------
        ValueError
            When the indexer is an ndarray or list and the lengths don't match.
        """
        no_op = False
    
        if isinstance(indexer, (np.ndarray, list)):
            # We can ignore other listlikes because they are either
            #  a) not necessarily 1-D indexers, e.g. tuple
            #  b) boolean indexers e.g. BoolArray
            if is_list_like(value):
                if len(indexer) != len(value) and values.ndim == 1:
                    # boolean with truth values == len of the value is ok too
                    if isinstance(indexer, list):
                        indexer = np.array(indexer)
                    if not (
                        isinstance(indexer, np.ndarray)
                        and indexer.dtype == np.bool_
                        and len(indexer[indexer]) == len(value)
                    ):
>                       raise ValueError(
                            "cannot set using a list-like indexer "
                            "with a different length than the value"
                        )
E                       ValueError: cannot set using a list-like indexer with a different length than the value

.venv/lib/python3.8/site-packages/pandas/core/indexers.py:176: ValueError
=========================== short test summary info ============================
FAILED repro.py::test_dataframe_with_set - ValueError: cannot set using a lis...
============================== 1 failed in 0.52s ===============================
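The two exceptions in the traceback above can be illustrated outside Hypothesis. The sketch below is not the code path pandas actually takes; it only assumes NumPy and the public `pandas.api.types.is_list_like` helper. It shows that a float64 array rejects a set, that an object array stores it unchanged, and that pandas counts a set as list-like (which is why the length check against the one-element indexer fires). The variable names are made up for illustration.

```python
import numpy as np
from pandas.api.types import is_list_like

value = {"", "0"}  # the set from the failing example

# A float64 array cannot hold a set: NumPy tries to convert it to a float.
float_values = np.zeros(1, dtype=float)
try:
    float_values[0] = value
    float_rejected = False
except (TypeError, ValueError):
    float_rejected = True

# An object array stores the set as a single Python object.
object_values = np.zeros(1, dtype=object)
object_values[0] = value

# pandas treats sets as list-like, so a 2-element set is compared
# against the 1-element indexer in check_setitem_lengths.
set_is_list_like = is_list_like(value)

print(float_rejected, object_values[0] == value, set_is_list_like)
```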
@Zac-HD added the bug (something is clearly wrong here) and legibility (make errors helpful and Hypothesis grokable) labels Nov 3, 2021
@andreareina
Author

Addendum: st.lists(st.text(), min_size=1) works, st.lists(st.text(), min_size=1).map(set) doesn't.

@Zac-HD
Member

Zac-HD commented Nov 3, 2021

I'm glad you like Hypothesis, and thanks for an excellent reproducible bug report 🥰

This looks like a real bug to me, and a use-case that I'd like to support if that's practical - even if dataframes-of-sets are pretty unusual, Pandas is happy to represent them and so Hypothesis ought to generate them.

As you note though, without any resources to do so this is going to be a low-priority issue - I'm happy to help external contributors, but over the next few months I'm more likely to focus my OSS time on filter rewrites, the 3.6 EOL, and better support for external randomness.

@Zac-HD removed the bug (something is clearly wrong here) label Dec 30, 2021
@Zac-HD
Member

Zac-HD commented Dec 30, 2021

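Turns out that this was just a missing error message. Without an explicit dtype the column is created as float64 (see the traceback above), which cannot hold a set; passing dtype=object works: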

data_frames(columns=[column(elements=st.sets(st.text(), min_size=1))])  # bad, mysterious error
data_frames(columns=[column(elements=st.sets(st.text(), min_size=1), dtype=object)])  # works!

I've opened a PR to give a better error message in this case.
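For reference, here is a minimal sketch of the original reproduction with the workaround applied. The column name "tags", the `settings` values, and the assertion in the test body are illustrative assumptions and not part of the original report.

```python
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

# dtype=object lets the column hold arbitrary Python objects such as sets.
# The name "tags" is a hypothetical label chosen for this example.
set_column = column(
    name="tags",
    elements=st.sets(st.text(), min_size=1),
    dtype=object,
)


@settings(max_examples=25, deadline=None)  # assumed values, kept small for speed
@given(data_frames(columns=[set_column]))
def test_dataframe_with_set(df):
    # The generated frame has the single requested column with object dtype.
    assert list(df.columns) == ["tags"]
    assert df["tags"].dtype == object


if __name__ == "__main__":
    # A @given-decorated test can be called directly, outside of pytest.
    test_dataframe_with_set()
```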

2 participants