`reset_index` followed by `groupby` causes exception in some cases #4522

devin-petersohn · 2022-06-02T19:24:24Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
Modin version (modin.__version__): latest
Python version: 3.8
Code we can use to reproduce:

import modin.pandas as pd

df = pd.read_csv("some.csv", index_col=[0,1,2])
df.reset_index().groupby(df.columns[:1]).count()  # error

Describe the problem

This happens only in a narrow corner case: when the `by` argument to `groupby` contains two or more columns that were added by the `reset_index` call.

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

RehanSD · 2022-06-02T23:41:52Z

Quick addition: the repro script accidentally groups by columns that weren't added by the reset. To see the bug, try:

df = pd.read_csv("some.csv", index_col=[0,1,2]).reset_index()
df.groupby(df.columns[:2]).count()  # error

It has to be more than one column to trigger the bug.
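
For anyone without a suitable `some.csv`, a stand-in can be built in memory. This is a minimal sketch in plain pandas with made-up column names (`a`, `b`, `c`, `d`) and values; it only shows which columns `reset_index` adds back, which are the ones the `by` argument has to cover. Swapping the import for `modin.pandas` to exercise Modin is an assumption and has not been checked here.

```python
import io

import pandas as pd  # assumption: replace with `import modin.pandas as pd` to test Modin

# Hypothetical stand-in for "some.csv": three index columns plus one data column.
csv_text = "a,b,c,d\n1,10,x,0.1\n1,10,y,0.2\n2,20,z,0.3\n"

indexed = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])
df = indexed.reset_index()

# The former index levels come back as the leading columns.
added_by_reset = list(indexed.index.names)
print(added_by_reset)        # names of the columns added by reset_index
print(list(df.columns[:2]))  # the two keys used in the failing groupby
```

The failing call from this thread, `df.groupby(df.columns[:2]).count()`, is deliberately left out of the block so that it runs cleanly.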

RehanSD · 2022-06-03T00:29:57Z

I did some digging, and I believe the error occurs in this specific case because the results of `reset_index` aren't propagated, i.e. the DataFrames in `GroupbyReduce.map` still have the MultiIndex. If we propagate the index before calling `groupby`, the bug disappears.

RehanSD · 2022-06-03T00:33:32Z

If I create a DataFrame where the index and a column label have the same name and try a groupby, that errors out in pandas:

In [11]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])

In [12]: df.groupby(df.columns[:2]).count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 df.groupby(df.columns[:2]).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    888         in_axis = False
    890     # create the Grouping
    891     # allow us to passing the actual Grouping as the gpr
    892     ping = (
--> 893         Grouping(
    894             group_axis,
    895             gpr,
    896             obj=obj,
    897             level=level,
    898             sort=sort,
    899             observed=observed,
    900             in_axis=in_axis,
    901             dropna=dropna,
    902         )
    903         if not isinstance(gpr, Grouping)
    904         else gpr
    905     )
    907     groupings.append(ping)
    909 if len(groupings) == 0 and len(obj):

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
    479 self.level = level
    480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
    482 self._all_grouper = None
    483 self._index = index

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
    935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
    936     if len(grouper) != len(axis):
--> 937         raise ValueError("Grouper and axis must be same length")
    939     if isinstance(grouper, (list, tuple)):
    940         grouper = com.asarray_tuplesafe(grouper)

ValueError: Grouper and axis must be same length

It also errors in Modin, but for a different reason.

ray::_apply_list_of_funcs() (pid=56846, ip=127.0.0.1)
  File "/Users/rehandurrani/Documents/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 417, in _apply_list_of_funcs
    partition = func(partition.copy(), *args, **kwargs)
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 360, in map_func
    return apply_func(df, **{other_name: other})
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 449, in _map
    result = wrapper(df.copy(), other if other is None else other.copy())
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 432, in wrapper
    return cls.map(
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 141, in map
    df.groupby(by=by_part, axis=axis, **groupby_kwargs), *agg_args, **agg_kwargs
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 7712, in groupby
    return DataFrameGroupBy(
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 882, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 872, in get_grouper
    obj._check_label_or_level_ambiguity(gpr, axis=axis)
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py", line 1794, in _check_label_or_level_ambiguity
    raise ValueError(msg)
ValueError: 'so' is both an index level and a column label, which is ambiguous.

If I change it to be

df.groupby(df.columns[:1]).count()

both succeed.
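
The ambiguity error at the bottom of the Modin traceback can be reproduced in plain pandas by grouping on column labels (a list) rather than an Index. The sketch below uses the same one-row frame as this comment; `reset_index(drop=True)` is used only as one way to remove the name clash for illustration, not as the fix adopted for this issue.

```python
import pandas as pd

# Same frame as in this comment: the index is named "so" and so is the first column.
df = pd.DataFrame(
    [[1, 2, 3]],
    index=pd.Index([0], name="so"),
    columns=["so", "b", "c"],
)

# Grouping by label: pandas cannot tell the column "so" from the index level "so".
try:
    df.groupby(["so", "b"]).count()
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Replacing the named index with an unnamed default one removes the clash,
# so the same groupby goes through.
unnamed = df.reset_index(drop=True)
result = unnamed.groupby(["so", "b"]).count()
print(result)
```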

RehanSD · 2022-06-03T00:35:28Z

If I try the repro script in pandas:

In [23]: import pandas as pd

In [24]: df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
    ...: df.groupby(df.columns[:2]).count()  # error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [24], in <cell line: 2>()
      1 df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
----> 2 df.groupby(df.columns[:2]).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    888         in_axis = False
    890     # create the Grouping
    891     # allow us to passing the actual Grouping as the gpr
    892     ping = (
--> 893         Grouping(
    894             group_axis,
    895             gpr,
    896             obj=obj,
    897             level=level,
    898             sort=sort,
    899             observed=observed,
    900             in_axis=in_axis,
    901             dropna=dropna,
    902         )
    903         if not isinstance(gpr, Grouping)
    904         else gpr
    905     )
    907     groupings.append(ping)
    909 if len(groupings) == 0 and len(obj):

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
    479 self.level = level
    480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
    482 self._all_grouper = None
    483 self._index = index

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
    935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
    936     if len(grouper) != len(axis):
--> 937         raise ValueError("Grouper and axis must be same length")
    939     if isinstance(grouper, (list, tuple)):
    940         grouper = com.asarray_tuplesafe(grouper)

ValueError: Grouper and axis must be same length

it fails.

RehanSD · 2022-06-03T00:38:34Z

Never mind: converting the `Index` passed as `by` to a list works in pandas.
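
A minimal sketch of the difference in plain pandas, using a made-up three-row frame (column names and values are assumptions). An `Index` passed as `by` is treated as an array of per-row group keys, so its length is compared against the number of rows; a list of the same labels is treated as column names. The first call is wrapped in `try`/`except` because it raised `ValueError` in the pandas version shown in the traceback, and other versions may behave differently.

```python
import pandas as pd

# Hypothetical data: "a" and "b" play the role of the columns added by reset_index.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 10, 20], "c": [0.1, 0.2, 0.3]})

# Index as `by`: two labels versus three rows, so the lengths do not match.
try:
    df.groupby(df.columns[:2]).count()
    index_raised = False
except ValueError:
    index_raised = True
print(index_raised)

# List as `by`: the labels are looked up as columns and the groupby succeeds.
by_list = df.groupby(list(df.columns[:2])).count()
print(by_list)
```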

mvashishtha added the bug 🦗 Something isn't working label Jun 2, 2022

devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jun 2, 2022

FIX-modin-project#4522: Correct multiindex metadata with groupby

13395f0

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

devin-petersohn linked a pull request Jun 2, 2022 that will close this issue: FIX-#4522: Correct multiindex metadata with groupby #4523 (Open, 8 tasks)

vnlitvinov added the P2 Minor bugs or low-priority feature requests label Sep 6, 2022
