Make group_mean compatible with NaT
NaT is the datetime equivalent of NaN and is represented as the lowest possible 64-bit integer, -(2**63). Previously, we could not support this value in any groupby.mean() calculations, which led to pandas-dev#43132.

At a high level, we slightly modify `group_mean` so that it does not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `_treat_as_na` to work with float64.
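
To illustrate the idea of not counting NaT, here is a minimal pure-Python sketch. It is not the Cython implementation: `group_mean_sketch`, `is_missing`, and `NAT_SENTINEL` are hypothetical names chosen for this example, and the data mirrors a timedelta array that has been viewed as int64 and cast to float64.

```python
import math

# Assumption: NaT is stored as the lowest int64, as the commit message states.
NAT_SENTINEL = -(2**63)


def is_missing(value, is_datetimelike):
    """Return True if `value` should be excluded from the mean."""
    if is_datetimelike and value == float(NAT_SENTINEL):
        return True
    return math.isnan(value)


def group_mean_sketch(values, labels, ngroups, is_datetimelike=False):
    """Mean per group label; missing entries are neither summed nor counted."""
    sums = [0.0] * ngroups
    counts = [0] * ngroups
    for value, label in zip(values, labels):
        if is_missing(value, is_datetimelike):
            continue  # guard clause: skip NaN (and NaT when datetimelike)
        sums[label] += value
        counts[label] += 1
    # A group with no valid observations yields NaN.
    return [s / n if n else math.nan for s, n in zip(sums, counts)]


# 2 ns and 4 ns plus one NaT, all assigned to group 0.
data = [2.0, 4.0, float(NAT_SENTINEL)]
labels = [0, 0, 0]

skipping = group_mean_sketch(data, labels, ngroups=1, is_datetimelike=True)
naive = group_mean_sketch(data, labels, ngroups=1, is_datetimelike=False)
# skipping -> [3.0]; naive is a huge negative number dominated by the sentinel
```

Without the flag, the sentinel is treated as an ordinary number and swamps the result, which is the kind of failure the issue describes.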

This PR adds an integration test and a unit test for the new functionality. Unlike other tests in the test classes, I've tried to keep each individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).
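
For readers who, like me, were puzzled by the compensation term: below is a generic sketch of Kahan summation in plain Python. It is an illustration of the technique only, not the code in `groupby.pyx`, and `kahan_sum` is a hypothetical name.

```python
def kahan_sum(values):
    """Sum floats while carrying a running compensation for rounding error."""
    total = 0.0
    compensation = 0.0  # rounding error introduced by the previous addition
    for x in values:
        y = x - compensation            # correct the input by the last error
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # error of the addition just performed
        total = t
    return total


# One large value followed by many values too small to register on their own.
values = [1.0] + [1e-16] * 100_000

naive_total = 0.0
for x in values:
    naive_total += x  # each 1e-16 is below half an ulp of 1.0 and is lost

compensated_total = kahan_sum(values)
# naive_total stays at 1.0, compensated_total is close to 1.0 + 1e-11
```

The compensation variable remembers what each addition rounded away and feeds it back into the next one, so small contributions accumulate instead of vanishing.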

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy committed Sep 10, 2021
1 parent 8954bf1 commit 85f2d92
Showing 2 changed files with 7 additions and 5 deletions.
9 changes: 6 additions & 3 deletions pandas/_libs/groupby.pyx
@@ -676,16 +676,18 @@ def group_mean(floating[:, ::1] out,
                ndarray[floating, ndim=2] values,
                const intp_t[::1] labels,
                Py_ssize_t min_count=-1,
-               bint is_datetimelike = False) -> None:
+               bint is_datetimelike=False) -> None:
     """
-    Compute the mean per label given a label assignment for each value. NaN values are ignored.
+    Compute the mean per label given a label assignment for each value.
+    NaN values are ignored.

     Parameters
     ----------
     out : np.ndarray[floating]
         Values into which this method will write its results.
     counts : np.ndarray[int64]
-        A zeroed array of the same shape as labels, populated by group sizes during algorithm.
+        A zeroed array of the same shape as labels,
+        populated by group sizes during algorithm.
     values : np.ndarray[floating]
         2-d array of the values to find the mean of.
     labels : np.ndarray[np.intp]
@@ -750,6 +752,7 @@ def group_mean(floating[:, ::1] out,
                     continue
                 out[i, j] = sumx[i, j] / count


+
 @cython.wraparound(False)
 @cython.boundscheck(False)
 def group_ohlc(floating[:, ::1] out,
3 changes: 1 addition & 2 deletions pandas/tests/groupby/test_libgroupby.py
@@ -238,12 +238,11 @@ def test_cython_group_transform_algos():


 def test_cython_group_mean_timedelta():
-    is_datetimelike = True
     actual = np.zeros(shape=(1, 1), dtype="float64")
     counts = np.array([0], dtype="int64")
     data = (
         np.array(
-            [np.datetime64(2, "ns"), np.datetime64(4, "ns"), np.datetime64("NaT")],
+            [np.timedelta64(2, "ns"), np.timedelta64(4, "ns"), np.timedelta64("NaT")],
             dtype="m8[ns]",
         )[:, None]
         .view("int64")