reset_index followed by groupby causes exception in some cases #4522

Open
devin-petersohn opened this issue Jun 2, 2022 · 5 comments · May be fixed by #4523
Labels
bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests

Comments

@devin-petersohn
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
  • Modin version (modin.__version__): latest
  • Python version: 3.8
  • Code we can use to reproduce:
import modin.pandas as pd

df = pd.read_csv("some.csv", index_col=[0,1,2])
df.reset_index().groupby(df.columns[:1]).count()  # error

Describe the problem

This only happens in a narrow corner case: when the groupby `by` parameter contains two or more columns added by the reset_index call.

@mvashishtha mvashishtha added the bug 🦗 Something isn't working label Jun 2, 2022
devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jun 2, 2022
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
@devin-petersohn devin-petersohn linked a pull request Jun 2, 2022 that will close this issue
@RehanSD
Collaborator

RehanSD commented Jun 2, 2022

Quick addition: the repro script accidentally groups by columns that aren't added by the reset. To see the bug, try:

df = pd.read_csv("some.csv", index_col=[0,1,2]).reset_index()
df.groupby(df.columns[:2]).count()  # error

It has to be more than one column to trigger the bug.
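
To make the difference between the two scripts concrete, here is a small sketch in plain pandas (not Modin). The CSV content and the column names `a`, `b`, `c`, `x`, `y` are made up for illustration, since the thread does not show "some.csv". It only shows which columns each slice selects; it does not reproduce the Modin error itself.

```python
import io

import pandas as pd

# Made-up stand-in for "some.csv": three index columns (a, b, c), two value columns.
csv_text = "a,b,c,x,y\n1,2,3,10,20\n1,2,4,30,40\n"

indexed = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])

# Original script: the slice is taken BEFORE reset_index, so it only sees
# the value columns, none of which come from the index.
original_keys = list(indexed.columns[:1])

# Corrected script: the slice is taken AFTER reset_index, so the first
# columns are the former index levels.
flat = indexed.reset_index()
corrected_keys = list(flat.columns[:2])

print(original_keys)   # ['x']
print(corrected_keys)  # ['a', 'b']
```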

@RehanSD
Collaborator

RehanSD commented Jun 3, 2022

I did some digging, and I believe the error occurs in this specific case because the result of the reset_index isn't propagated, i.e. the DataFrames in GroupbyReduce.map still have the MultiIndex. If we propagate the index before calling groupby, the bug disappears.
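
A rough way to picture this hypothesis in plain pandas, without touching Modin internals: the frame below is a hypothetical stand-in for a partition whose columns already contain the former index levels while the stale MultiIndex (with the same names) is still attached. The names `a`, `b`, `x` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for a partition: columns "a" and "b" exist,
# but a MultiIndex with the same level names is still attached.
stale = pd.DataFrame(
    {"a": [1, 1], "b": [2, 2], "x": [10, 30]},
    index=pd.MultiIndex.from_arrays([[1, 1], [2, 2]], names=["a", "b"]),
)

# Grouping by names that are both index levels and column labels is ambiguous.
try:
    stale.groupby(["a", "b"]).count()
    stale_error = None
except ValueError as err:
    stale_error = str(err)

# Dropping the stale index (a stand-in for "propagating" the reset) removes
# the ambiguity, and the same groupby goes through.
fresh = stale.reset_index(drop=True)
counts = fresh.groupby(["a", "b"]).count()

print(stale_error)
print(counts)
```

This is only an analogy for the behaviour described above; how Modin actually materializes the index on its partitions is not shown here.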

@RehanSD
Collaborator

RehanSD commented Jun 3, 2022

If I create a dataframe where the index and a column label have the same name and try a groupby, that errors out:

In [11]: df = pd.DataFrame([[1, 2, 3]], index=pd.Index([0], name="so"), columns=['so', 'b', 'c'])

In [12]: df.groupby(df.columns[:2]).count()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 df.groupby(df.columns[:2]).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    888         in_axis = False
    890     # create the Grouping
    891     # allow us to passing the actual Grouping as the gpr
    892     ping = (
--> 893         Grouping(
    894             group_axis,
    895             gpr,
    896             obj=obj,
    897             level=level,
    898             sort=sort,
    899             observed=observed,
    900             in_axis=in_axis,
    901             dropna=dropna,
    902         )
    903         if not isinstance(gpr, Grouping)
    904         else gpr
    905     )
    907     groupings.append(ping)
    909 if len(groupings) == 0 and len(obj):

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
    479 self.level = level
    480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
    482 self._all_grouper = None
    483 self._index = index

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
    935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
    936     if len(grouper) != len(axis):
--> 937         raise ValueError("Grouper and axis must be same length")
    939     if isinstance(grouper, (list, tuple)):
    940         grouper = com.asarray_tuplesafe(grouper)

ValueError: Grouper and axis must be same length

It also errors in Modin, but for a different reason.

ray::_apply_list_of_funcs() (pid=56846, ip=127.0.0.1)
  File "/Users/rehandurrani/Documents/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 417, in _apply_list_of_funcs
    partition = func(partition.copy(), *args, **kwargs)
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 360, in map_func
    return apply_func(df, **{other_name: other})
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 449, in _map
    result = wrapper(df.copy(), other if other is None else other.copy())
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 432, in wrapper
    return cls.map(
  File "/Users/rehandurrani/Documents/modin/modin/core/dataframe/algebra/groupby.py", line 141, in map
    df.groupby(by=by_part, axis=axis, **groupby_kwargs), *agg_args, **agg_kwargs
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 7712, in groupby
    return DataFrameGroupBy(
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 882, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 872, in get_grouper
    obj._check_label_or_level_ambiguity(gpr, axis=axis)
  File "/Users/rehandurrani/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/generic.py", line 1794, in _check_label_or_level_ambiguity
    raise ValueError(msg)
ValueError: 'so' is both an index level and a column label, which is ambiguous.

If I change the call to

df.groupby(df.columns[:1]).count()

both succeed.
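
One possible reading of why the single-column call succeeds, sketched in plain pandas: the example frame has exactly one row, so a one-element Index passed as `by` has the same length as the axis, and pandas seems to treat it as an array of group values rather than as a column name. This is my interpretation of the traceback's length check, not something confirmed in the thread.

```python
import pandas as pd

# Same frame as in the comment above: one row, index named "so", column "so".
df = pd.DataFrame(
    [[1, 2, 3]],
    index=pd.Index([0], name="so"),
    columns=["so", "b", "c"],
)

# A one-element Index has the same length as the one-row axis, so the
# "same length" check passes and the element itself becomes the group key.
single = df.groupby(df.columns[:1]).count()

print(single)
```

Note that the resulting group label is the string "so" itself, not a value taken from the `so` column, which suggests the call succeeds for a different reason than a real single-column groupby would.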

@RehanSD
Collaborator

RehanSD commented Jun 3, 2022

If I try the repro script in pandas:

In [23]: import pandas as pd

In [24]: df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
    ...: df.groupby(df.columns[:2]).count()  # error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [24], in <cell line: 2>()
      1 df = pd.read_csv("b.csv", index_col=[0,1,2]).reset_index()
----> 2 df.groupby(df.columns[:2]).count()

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py:7712, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7707 axis = self._get_axis_number(axis)
   7709 # https://github.com/python/mypy/issues/7642
   7710 # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7711 # "Union[bool, NoDefault]"; expected "bool"
-> 7712 return DataFrameGroupBy(
   7713     obj=self,
   7714     keys=by,
   7715     axis=axis,
   7716     level=level,
   7717     as_index=as_index,
   7718     sort=sort,
   7719     group_keys=group_keys,
   7720     squeeze=squeeze,  # type: ignore[arg-type]
   7721     observed=observed,
   7722     dropna=dropna,
   7723 )

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    888         in_axis = False
    890     # create the Grouping
    891     # allow us to passing the actual Grouping as the gpr
    892     ping = (
--> 893         Grouping(
    894             group_axis,
    895             gpr,
    896             obj=obj,
    897             level=level,
    898             sort=sort,
    899             observed=observed,
    900             in_axis=in_axis,
    901             dropna=dropna,
    902         )
    903         if not isinstance(gpr, Grouping)
    904         else gpr
    905     )
    907     groupings.append(ping)
    909 if len(groupings) == 0 and len(obj):

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:481, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
    479 self.level = level
    480 self._orig_grouper = grouper
--> 481 self.grouping_vector = _convert_grouper(index, grouper)
    482 self._all_grouper = None
    483 self._index = index

File ~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/groupby/grouper.py:937, in _convert_grouper(axis, grouper)
    935 elif isinstance(grouper, (list, tuple, Index, Categorical, np.ndarray)):
    936     if len(grouper) != len(axis):
--> 937         raise ValueError("Grouper and axis must be same length")
    939     if isinstance(grouper, (list, tuple)):
    940         grouper = com.asarray_tuplesafe(grouper)

ValueError: Grouper and axis must be same length

It fails.

@RehanSD
Collaborator

RehanSD commented Jun 3, 2022

Never mind: converting the Index to a list works in pandas.
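
For reference, a minimal pandas-only sketch of that difference. The CSV content and column names are made up, since "b.csv" is not shown in the thread. Passing the sliced Index directly hits the length check from the traceback, while passing a list of the same labels groups by those columns.

```python
import io

import pandas as pd

# Made-up stand-in for "b.csv": three index columns (a, b, c), one value column.
csv_text = "a,b,c,x\n1,2,3,10\n1,2,4,30\n5,6,7,50\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2]).reset_index()

keys = df.columns[:2]  # an Index holding the labels "a" and "b"

# Passing the Index itself: pandas treats it as an array of group values,
# and two values do not match the three rows.
try:
    df.groupby(keys).count()
    index_error = None
except ValueError as err:
    index_error = str(err)

# Passing a list of labels: pandas groups by the columns "a" and "b".
result = df.groupby(list(keys)).count()

print(index_error)
print(result)
```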

@vnlitvinov vnlitvinov added the P2 Minor bugs or low-priority feature requests label Sep 6, 2022
4 participants