BUG internal indexing tools trigger error with pandas < 2.0.0 #28931

Open
lorentzenchr opened this issue May 2, 2024 · 2 comments

@lorentzenchr
Member

lorentzenchr commented May 2, 2024

#28375 triggers errors for pandas < 2.0.0, even though it only uses scikit-learn internal functionality.

As documented in https://scikit-learn.org/dev/install.html, the minimum supported version is pandas >= 1.1.3.

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 2, 2024
@lorentzenchr lorentzenchr changed the title Statement about supported versions of pandas BUG internal indexing tools trigger error with pandas < 2.0.0 May 2, 2024
@mayer79
Contributor

mayer79 commented May 2, 2024

Hello @lorentzenchr and @ogrisel

Digging into this, it seems that this is actually not a bug in _safe_assign(), but rather a misuse on my side: I am passing a pd.DataFrame as values, but according to the documentation, it should be an "ndarray". For pandas>=2.0, the problem vanishes.

Curiously, for pandas<2, the problem only happens when values is a pd.DataFrame with duplicated indices, no matter what index X has.

So I think we can close the issue. And I will try to fix the corresponding problem in the PR #28375 via a .reset_index().
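
A minimal sketch of such a workaround, assuming the goal is only to honour the documented `values : ndarray` contract: strip the index from `values` before the positional assignment. The helper `as_plain_values` is hypothetical (it is not part of scikit-learn), and it uses `.to_numpy()` rather than the `.reset_index()` route mentioned above; both aim to keep label alignment out of the `iloc` assignment.

```python
import numpy as np
import pandas as pd


def as_plain_values(values):
    """Hypothetical helper: return `values` as an ndarray, dropping any index."""
    # DataFrame and Series both expose to_numpy(); other inputs go through asarray.
    return values.to_numpy() if hasattr(values, "to_numpy") else np.asarray(values)


X = pd.DataFrame({"x": [1, 2] * 2, "y": ["a", "a"] * 2})
values_pd = pd.DataFrame({"x": [1] * 4}, index=[0, 1] * 2)  # duplicated index

# Positional assignment of a plain (4, 1) array: no index alignment is involved.
X.iloc[:, [0]] = as_plain_values(values_pd)
print(X)
```

Because the array carries no labels, the duplicated index of `values_pd` never reaches the alignment step, so this should behave the same way regardless of the pandas version.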

When we add polars support for _safe_assign(), you might consider relaxing the constraint that values must be a numpy array and allow it to be of the same type as X, e.g., in order to avoid mixing numpy and arrow.

import warnings
import pandas as pd
import numpy as np

# Can't install scikit-learn under Python 3.8 with pandas 1.5, so I am
# simply copy-pasting the code from the development version
def _safe_assign(X, values, *, row_indexer=None, column_indexer=None):
    """Safe assignment to a numpy array, sparse matrix, or pandas dataframe.

    Parameters
    ----------
    X : {ndarray, sparse-matrix, dataframe}
        Array to be modified. It is expected to be 2-dimensional.

    values : ndarray
        The values to be assigned to `X`.

    row_indexer : array-like, dtype={int, bool}, default=None
        A 1-dimensional array to select the rows of interest. If `None`, all
        rows are selected.

    column_indexer : array-like, dtype={int, bool}, default=None
        A 1-dimensional array to select the columns of interest. If `None`, all
        columns are selected.
    """
    row_indexer = slice(None, None, None) if row_indexer is None else row_indexer
    column_indexer = (
        slice(None, None, None) if column_indexer is None else column_indexer
    )

    if hasattr(X, "iloc"):  # pandas dataframe
        with warnings.catch_warnings():
            # pandas >= 1.5 raises a warning when using iloc to set values in a column
            # that does not have the same type as the column being set. It happens
            # for instance when setting a categorical column with a string.
            # In the future the behavior won't change and the warning should disappear.
            # TODO(1.3): check if the warning is still raised or remove the filter.
            warnings.simplefilter("ignore", FutureWarning)
            X.iloc[row_indexer, column_indexer] = values
    else:  # numpy array or sparse matrix
        X[row_indexer, column_indexer] = values


X = pd.DataFrame({"x": [1, 2] * 2, "y": ["a", "a"] * 2})

# Works if values is an ndarray (as per the documentation of _safe_assign())
values_np = np.array([1] * 4)
#_safe_assign(X, values=values_np, column_indexer=[0])  # works

# Fails if values is pd.DataFrame
values_pd = pd.DataFrame({"x": values_np}, index=[0, 1] * 2)
_safe_assign(X, values=values_pd, column_indexer=[0])
X

# Error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 52
     48 #_safe_assign(X, values=values_np, column_indexer=[0])  # works
     49
     50 # Fails if values is pd.DataFrame
     51 values_pd = pd.DataFrame({"x": values_np}, index=[0, 1] * 2)
---> 52 _safe_assign(X, values=values_pd, column_indexer=[0])
     53 X

Cell In[17], line 39
     32     with warnings.catch_warnings():
     33         # pandas >= 1.5 raises a warning when using iloc to set values in a column
     34         # that does not have the same type as the column being set. It happens
     35         # for instance when setting a categorical column with a string.
     36         # In the future the behavior won't change and the warning should disappear.
     37         # TODO(1.3): check if the warning is still raised or remove the filter.
     38         warnings.simplefilter("ignore", FutureWarning)
---> 39         X.iloc[row_indexer, column_indexer] = values
     40 else:  # numpy array or sparse matrix
     41     X[row_indexer, column_indexer] = values

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:670, in _LocationIndexer.__setitem__(self, key, value)
    667 self._has_valid_setitem_indexer(key)
    669 iloc = self if self.name == "iloc" else self.obj.iloc
--> 670 iloc._setitem_with_indexer(indexer, value)

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:1711, in _iLocIndexer._setitem_with_indexer(self, indexer, value)
   1709 if item in value:
   1710     sub_indexer[info_axis] = item
-> 1711     v = self._align_series(
   1712         tuple(sub_indexer), value[item], multiindex_indexer
   1713     )
   1714 else:
   1715     v = np.nan

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:1935, in _iLocIndexer._align_series(self, indexer, ser, multiindex_indexer)
   1932     if ser.index.equals(new_ix) or not len(new_ix):
   1933         return ser._values.copy()
-> 1935     return ser.reindex(new_ix)._values
   1937 # 2 dims
   1938 elif single_aligner:
   1939
   1940     # reindex along index

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\series.py:4399, in Series.reindex(self, index, **kwargs)
   4391 @doc(
   4392     NDFrame.reindex,
   4393     klass=_shared_doc_kwargs["klass"],
   (...)
   4397 )
   4398 def reindex(self, index=None, **kwargs):
-> 4399     return super().reindex(index=index, **kwargs)

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py:4452, in NDFrame.reindex(self, *args, **kwargs)
   4449     return self._reindex_multi(axes, copy, fill_value)
   4451 # perform the reindex on the axes
-> 4452 return self._reindex_axes(
   4453     axes, level, limit, tolerance, method, fill_value, copy
   4454 ).__finalize__(self, method="reindex")

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py:4472, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   4467     new_index, indexer = ax.reindex(
   4468         labels, level=level, limit=limit, tolerance=tolerance, method=method
   4469     )
   4471     axis = self._get_axis_number(a)
-> 4472     obj = obj._reindex_with_indexers(
   4473         {axis: [new_index, indexer]},
   4474         fill_value=fill_value,
   4475         copy=copy,
   4476         allow_dups=False,
   4477     )
   4479 return obj

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py:4515, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   4512     indexer = ensure_int64(indexer)
   4514 # TODO: speed up on homogeneous DataFrame objects
-> 4515 new_data = new_data.reindex_indexer(
   4516     index,
   4517     indexer,
   4518     axis=baxis,
   4519     fill_value=fill_value,
   4520     allow_dups=allow_dups,
   4521     copy=copy,
   4522 )
   4523 # If we've made a copy once, no need to make another one
   4524 copy = False

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\internals\managers.py:1243, in BlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
   1241 # some axes don't allow reindexing with dups
   1242 if not allow_dups:
-> 1243     self.axes[axis]._can_reindex(indexer)
   1245 if axis >= self.ndim:
   1246     raise IndexError("Requested axis not found in manager")

File c:\Users\Michael\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexes\base.py:3283, in Index._can_reindex(self, indexer)
   3281 # trying to reindex on an axis with duplicates
   3282 if not self.is_unique and len(indexer):
-> 3283     raise ValueError("cannot reindex from a duplicate axis")

ValueError: cannot reindex from a duplicate axis
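
The last frames of the traceback come down to a `Series.reindex` call on a Series whose index contains duplicates. Here is a minimal sketch that isolates this step, independent of `_safe_assign()`. The exact error message may differ between pandas versions, so the example only relies on the exception type; the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([1, 1, 1, 1], index=[0, 1] * 2)  # duplicated index labels
new_index = pd.RangeIndex(4)  # labels 0..3, like the index of X in the reproducer

try:
    ser.reindex(new_index)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# With unique labels the same call succeeds and aligns the values by label.
unique_ser = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])
aligned = unique_ser.reindex(new_index)
print(aligned.tolist())
```

This is consistent with the observation that only the index of `values` matters: the duplicate check is applied to the Series being reindexed, not to the target index.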

@ogrisel ogrisel removed the Needs Triage Issue requires triage label May 3, 2024
@ogrisel
Member

ogrisel commented May 3, 2024

I have the feeling that we should not close this straight away as discussed in #28375 (comment).
