-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Pyarrow timestamp support for map() function #61236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: Pyarrow timestamp support for map() function #61236
Conversation
pandas/core/algorithms.py
Outdated
try: | ||
# Convert elements to pandas.Timestamp (or datetime64[ns]) | ||
arr = arr.astype("datetime64[ns]") | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wrong place to fix this. This should be fixed in ArrowExtensionArray.map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks but the required fix is in ArrowExtensionArray
pandas/core/arrays/arrow/array.py
Outdated
@@ -1483,6 +1483,8 @@ def to_numpy( | |||
def map(self, mapper, na_action: Literal["ignore"] | None = None): | |||
if is_numeric_dtype(self.dtype): | |||
return map_array(self.to_numpy(), mapper, na_action=na_action) | |||
elif self.dtype == "timestamp[ns][pyarrow]": | |||
return map_array(self.to_numpy(dtype=object), mapper, na_action=na_action) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you avoid the type cast to object
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using datetime64[ns]
instead of object
, but some tests expect Python objects (pd.Timestamp, ) and do not pass. I think keeping object
helps preserve that expected behavior. Let me know if you'd prefer adjusting the test instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the failing test would need adjustment (we get a better result when we don't return object
)
|
||
def test_map_arrow_timestamp_dict(): | ||
# GH 61231 | ||
pytest.importorskip("pyarrow", minversion="10.0.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest.importorskip("pyarrow", minversion="10.0.1") | |
pytest.importorskip("pyarrow") |
pandas/core/arrays/arrow/array.py
Outdated
@@ -1483,6 +1483,10 @@ def to_numpy( | |||
def map(self, mapper, na_action: Literal["ignore"] | None = None): | |||
if is_numeric_dtype(self.dtype): | |||
return map_array(self.to_numpy(), mapper, na_action=na_action) | |||
elif self.dtype == "timestamp[ns][pyarrow]": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding an elif
you can modify the existing if
statement as if is_numeric_dtype(self.dtype) or self.dtype.kind in "mM":
pandas/tests/extension/test_arrow.py
Outdated
result = data_missing.map(lambda x: x, na_action=na_action) | ||
expected = data_missing.to_numpy(dtype=object) | ||
tm.assert_numpy_array_equal(result, expected) | ||
mapped = data_missing.map(lambda x: x, na_action=na_action) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to include all this logic? Ideally this should be
def test_map(...):
if data_missing.dtype == "float32[pyarrow]":
result = data_missing.map(lambda x: x, na_action=na_action)
# map roundtrips through objects, which converts to float64
expected = data_missing.to_numpy(dtype="float64", na_value=np.nan)
tm.assert_numpy_array_equal(result, expected)
else:
super().test_map(data_missing, na_action)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I agree it’d be ideal to keep this logic minimal, but this test fails specifically for timestamp[...]
and duration[...]
, which seem to require additional normalization after .map()
due to dtype coercion.
In particular, .map()
on PyArrow-backed datetime/duration types returns a float64
result when pd.NA
is present. To make the comparison meaningful, we cast both result
and expected
back to their logical dtypes (datetime64[ns]
or timedelta64[ns]
).
I will update the test to make this more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns a float64 result when pd.NA is present.
This does not seem correct behavior (might be a secondary bug). I would expect to_numpy
on those types to return their associated datetime/timedelta64
types with NaT
as the missing value
temp = Series(datelike, dtype=datelike.dtype) | ||
mapped = temp.map(mapper, na_action=na_action) | ||
return mapped._values | ||
|
||
if is_numeric_dtype(self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, this line should allow datelike types so to_numpy
is called
Added type annotations to new arguments/methods/functions.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.