Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement literal np.timedelta64 coding #10101

Draft
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

spencerkclark
Copy link
Member

@spencerkclark spencerkclark commented Mar 6, 2025

This PR implements @shoyer's suggested approach for "literal" coding of np.timedelta64 values. Accordingly, it provides a pathway for roundtripping np.timedelta64 data without a FutureWarning, and preserves the encoding of variables that were encoded on disk with the previous approach.

I still want to reflect a little more on whether we want any more tests, but this seems functional at the moment—i.e. this example runs without a warning1:

>>> import xarray; import numpy as np
>>> deltas = np.array([1, 2, 3], dtype='timedelta64[D]').astype('timedelta64[s]')
>>> ds = xarray.Dataset({'lead_time': deltas})
>>> xarray.open_dataset(ds.to_netcdf())
<xarray.Dataset> Size: 24B
Dimensions:    (lead_time: 3)
Coordinates:
  * lead_time  (lead_time) timedelta64[s] 24B 1 days 2 days 3 days
Data variables:
    *empty*

@kmuehlbauer let me know if you have any initial thoughts, particularly with respect to possible interaction with other coders.

Footnotes

  1. Note I needed to move away from nanosecond resolution here, since creating bytes with to_netcdf will attempt to cast int64 data to int32 which leads to overflow.

@kmuehlbauer
Copy link
Contributor

@spencerkclark Thanks! I'll try to look into this tomorrow. The major interaction issues are with CFMaskCoder/CFScaleOffsetCoder and CFDatetimeCoder. So I do not expect too much issues with other coders here, but I'll check anyway.

if np.issubdtype(variable.data.dtype, np.timedelta64):
dims, data, attrs, encoding = unpack_for_encoding(variable)
resolution, _ = np.datetime_data(variable.dtype)
attrs["dtype"] = f"timedelta64[{resolution}]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about also including units in attrs?

That would make timedelta64 encoding still specify units in the style of CF conventions, which could make us a little more compatible with non-Xarray tools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be possible, but this would need an additional check inside CFTimedeltaCoder to prevent premature encoding and decoding if both attributes are attached.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a good idea, though does increase the complexity a little. I gave it a try in my latest push.

Comment on lines 800 to 802
# overwrite (!) dtype in encoding, and remove from attrs
# needed for correct subsequent encoding
encoding["dtype"] = attrs.pop("dtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically we use the pop_to() helper to do this safely, e.g., dtype = pop_to(encoding, attrs, "dtype", name=name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up! I somewhat blindly inherited this from the BooleanCoder. I added some better tests around this issue.

@shoyer
Copy link
Member

shoyer commented Mar 7, 2025

Thanks @spencerkclark for taking a look at this! I left a couple of suggestions.

Incidentally, I think everyone will be happier the sooner we can relax the nanosecond precision restriction in Xarray :).

@kmuehlbauer
Copy link
Contributor

@kmuehlbauer let me know if you have any initial thoughts, particularly with respect to possible interaction with other coders.

@spencerkclark I do not immediately see any issues with other coders with the current implementation. Taking @shoyer's suggestion into account, we would need to make sure that xarray always gives preference to dtype-attribute instead of units (if both are given).

@shoyer
Copy link
Member

shoyer commented Mar 8, 2025

Looking at the implementation, maybe we don't need the separate new coder class and could keep this all in one objects?

I think the behavior could probably fit into a single coder:

  • Encoding time units (to disk):
    • Convert data timedelta64 -> integer
    • Write both units and dtype attributes
  • Decoding time units (from disk):
    • Convert data: integer -> timedelta64
    • If dtype != 'timedelta64', issue a future warning.

@spencerkclark
Copy link
Member Author

Thanks @shoyer—indeed it's probably simpler to have this all live in CFTimedeltaCoder, and we can re-use some existing code there. I went ahead and made that change.

The remaining awkward bit relates to interaction with masking and scaling. Since we are overloading the dtype encoding, we do not have a way of retaining the numerical dtype of the data that was written to disk during a round trip. I pushed a failing test in 503db4a to provide an example. Maybe I am just not thinking about this the proper way, however.

@spencerkclark
Copy link
Member Author

The remaining awkward bit relates to interaction with masking and scaling. Since we are overloading the dtype encoding, we do not have a way of retaining the numerical dtype of the data that was written to disk during a round trip. I pushed a failing test in 503db4a to provide an example. Maybe I am just not thinking about this the proper way, however.

For now I ended up punting on this in favor of forbidding providing any other encoding parameters when encoding timedeltas this new way. The previous encoding path supports this, and that functionality is still maintained in this PR.

Another approach would be to use a different name for the attribute we assign the timedelta dtype to, say "xarray_dtype" instead of "dtype", but this strays from how things are handled for boolean values, and I am not sure we are losing much by not supporting these other encoding parameters ("_FillValue", "missing_value", "add_offset", and "scale_factor") in this code path. I'm open to discuss, however.

So possibly this is closer now—I added some more tests and laid the groundwork for eventually turning off automatic decoding of variables with time-like units attributes (we would set CFTimedeltaCoder.decode_via_units to False by default instead of True). Ultimately though this is a delicate PR that should be carefully discussed / reviewed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timedelta64 data cannot be round-tripped to netCDF files without a warning
3 participants