Use pyarrow dtype_backend #1781

Conversation
I'd like to see some extra tests with adding/setting a default pandas DataFrame, without the correct dtype and time. edit: Basically trying to break it ;)

Tested it on the De Dommel workflow and got a model that is written with the Python API and read and simulated without ValidationErrors or Exceptions. I didn't test/check any results, but the model + results are here: https://we.tl/t-Uxk9xra8i1 However, I repeatedly get this FutureWarning, which is messing up my logging quite a lot:
```diff
-df = gpd.read_file(path, layer=table, fid_as_index=True)
+df = pyogrio.read_dataframe(
+    path,
+    layer=table,
+    fid_as_index=True,
+    use_arrow=True,
+    arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype},
+)
```
Only pyogrio supported this directly, geopandas not yet.
Best to add an inline comment on the specific kwarg that's not supported yet?
Actually I misunderstood that all extra kwargs were passed from gpd.read_file to pyogrio.read_dataframe if engine="pyogrio", so I went back to gpd.read_file and added a comment on arrow_to_pandas_kwargs, which ends up in pyarrow, not pyogrio.
```diff
-time=pd.date_range(start="2020-01-01", end="2021-01-01", periods=100),
+time=pd.date_range(
+    start="2020-01-01", end="2021-01-01", periods=100, unit="ms"
+),
```
This needed unit="ms"; otherwise the nanosecond-precision times now trigger a ValidationError.
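For illustration (assuming pandas ≥ 2.0, where `date_range` gained the `unit` argument), the difference shows up directly in the resulting dtype:

```python
import pandas as pd

# Default: nanosecond precision.
idx_ns = pd.date_range(start="2020-01-01", end="2021-01-01", periods=100)
print(idx_ns.dtype)  # datetime64[ns]

# Explicit millisecond precision, matching what the core can represent.
idx_ms = pd.date_range(start="2020-01-01", end="2021-01-01", periods=100, unit="ms")
print(idx_ms.dtype)  # datetime64[ms]
```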
Can't we automatically convert such timestamps? I fear that users might hit this and have to find this fix manually.
The only place where we got nanosecond precision was the date_range-with-periods pattern. I feel like this pattern will rarely be used for real models, maybe only for synthetic ones like in our docs. I don't think I've ever come across hydrological data with nanosecond precision.
Nanosecond units will automatically convert to milliseconds as long as no data is lost. The ValidationError is also quite good. I prefer not to throw a warning on this as it usually doesn't matter. If you know an easy way to round or truncate to the nearest millisecond, I'd be OK with that as well. But I doubt many users will run into this.
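One easy way to round, as a sketch (pure pandas; `.dt.round` exists on datetime Series, and casting to `datetime64[ms]` is lossless once the values are ms-aligned):

```python
import pandas as pd

# 100 evenly spaced points over a leap year do not fall on whole milliseconds.
time = pd.Series(pd.date_range(start="2020-01-01", end="2021-01-01", periods=100))

# Round to the nearest millisecond, then store with millisecond resolution.
time_ms = time.dt.round("ms").astype("datetime64[ms]")
print(time_ms.dtype)  # datetime64[ms]
```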
evetion left a comment
Looking good, but I have my doubts about how we handle the pandas defaults (both for time and for the Arrow dtype).
In terms of dtype, I think it's automatically converted, so we should be fine there. However, for time I'd rather we convert (maybe with a warning) than error out. That still seems an improvement over silently dropping precision. That said, it's mostly a usability concern for me here.
python/ribasim/ribasim/schemas.py
Outdated
```diff
-node_id: Series[Int32] = pa.Field(nullable=False, default=0)
 area: Series[float] = pa.Field(nullable=False)
 level: Series[float] = pa.Field(nullable=False)
+node_id: Series[Annotated[pd.ArrowDtype, pyarrow.int32()]] = pa.Field(nullable=True)
```
You have lost the defaults here, was that on purpose?
Yes, 0 was a dummy default since np.int32 doesn't support missing values, but pyarrow.int32 does.
Well, it's also because node_id is a required field and can't be missing schema-wise?
This fixes #2120; we can update to the latest pandera release (0.25) again. There is still one open bug, unionai-oss/pandera#2016. That triggered me to essentially revert #1781 and switch back to the default pandas dtypes, because with them we are less likely to hit various compatibility issues in the ecosystem. Because we like nullable strings, integers and Booleans, we do use these now:

- Nullable integer data type: https://pandas.pydata.org/docs/user_guide/integer_na.html
- Nullable Boolean data type: https://pandas.pydata.org/docs/user_guide/boolean.html
- StringDtype: https://pandas.pydata.org/docs/user_guide/text.html

Furthermore, this gets rid of Series in the type annotations using https://pandera.readthedocs.io/en/stable/dataframe_models.html#using-data-types-directly-for-column-type-annotations. This was needed because otherwise MyPy would complain about not allowing certain dtypes, but it is cleaner anyway.

This also switches back to the default Timestamp with ns precision for the same reason. Possibly we need to support other units in the future if we do very long simulations.

---------

Co-authored-by: Maarten Pronk <git@evetion.nl>
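A sketch of the three nullable pandas dtypes listed above, with illustrative values; all of them represent missing entries as pd.NA without changing dtype:

```python
import pandas as pd

# Nullable integers, Booleans and strings, as used after the revert.
ints = pd.Series([1, 2, pd.NA], dtype="Int32")
bools = pd.Series([True, False, pd.NA], dtype="boolean")
strings = pd.Series(["a", "b", pd.NA], dtype="string")

print(ints.dtype, bools.dtype, strings.dtype)  # Int32 boolean string
print(ints.isna().tolist())  # [False, False, True]
```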
Fixes #1721
Fixes #1394
The dtype_backend part of this is not breaking. The only part that is technically breaking is that we specify a unit of milliseconds for the Arrow time type. Previously we used the default nanosecond precision, which was then truncated to milliseconds in the core. I think it is better to disallow precision higher than milliseconds if we cannot distinguish it in the core.