Conversation

Member

@visr visr commented Aug 30, 2024

Fixes #1721
Fixes #1394

The dtype_backend part of this is not breaking. The only technically breaking part is that we now specify a unit of milliseconds for the Arrow time type. Previously we used the default nanosecond precision, which was then truncated to milliseconds in the core. I think it is better to disallow precision higher than milliseconds, since the core cannot distinguish it.

@visr visr added the breaking A change that breaks existing models label Sep 23, 2024
Member

evetion commented Sep 24, 2024

I'd like to see some extra tests that add/set a default pandas DataFrame without the correct dtype and time. edit: basically trying to break it ;)

@DanielTollenaar
Contributor

Tested it on the De Dommel workflow: the model is written with the Python API, then read and simulated without ValidationErrors or Exceptions. I didn't test/check any results, but the model + results are here: https://we.tl/t-Uxk9xra8i1

However, I do repeatedly get this FutureWarning, which is messing up my logging quite a lot:

(file:///D:/repositories/Ribasim/python/ribasim/ribasim/model.py:239): FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  pd.concat(df_chunks)
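The workaround the warning itself suggests can be sketched as follows (hypothetical `df_chunks`, not the actual `model.py` code): filter out empty or all-NA chunks before concatenating.

```python
import pandas as pd

# Hypothetical chunks: one regular, one empty (the kind that triggers
# the FutureWarning inside pd.concat).
df_chunks = [
    pd.DataFrame({"a": [1.0, 2.0]}),
    pd.DataFrame({"a": pd.Series([], dtype=float)}),
]

# Exclude empty or all-NA chunks before the concat, as the warning advises.
non_empty = [df for df in df_chunks if not df.empty and not df.isna().all().all()]
result = pd.concat(non_empty, ignore_index=True)
```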

@visr visr marked this pull request as ready for review September 24, 2024 12:46
Comment on lines 377 to 387
-df = gpd.read_file(path, layer=table, fid_as_index=True)
+df = pyogrio.read_dataframe(
+    path,
+    layer=table,
+    fid_as_index=True,
+    use_arrow=True,
+    arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype},
+)
Member Author

Only pyogrio supports this directly; geopandas doesn't yet.

Member

Best to place an inline comment on the specific kwarg that's not supported yet?

Member Author

Actually I had missed that all extra kwargs are passed from gpd.read_file to pyogrio.read_dataframe when engine="pyogrio", so I went back to gpd.read_file and added a comment on arrow_to_pandas_kwargs, which ends up in pyarrow, not pyogrio.
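The resulting pattern might look like this (a sketch only; `read_table` is an illustrative name, and path/layer are placeholders):

```python
import geopandas as gpd
import pandas as pd

def read_table(path: str, table: str) -> gpd.GeoDataFrame:
    # With engine="pyogrio", geopandas forwards extra kwargs to
    # pyogrio.read_dataframe; arrow_to_pandas_kwargs is in turn handed
    # to pyarrow when converting to pandas, not to pyogrio itself.
    return gpd.read_file(
        path,
        layer=table,
        engine="pyogrio",
        fid_as_index=True,
        use_arrow=True,
        arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype},
    )
```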

Comment on lines -28 to +30
-time=pd.date_range(start="2020-01-01", end="2021-01-01", periods=100),
+time=pd.date_range(
+    start="2020-01-01", end="2021-01-01", periods=100, unit="ms"
+),
Member Author

This needed unit="ms"; otherwise the nanosecond-precision times now trigger a ValidationError.

Member

Can't we automatically convert such timestamps? I fear that users might hit this and have to find this fix manually.

Member Author

The only place where we got nanosecond precision was the date_range-with-periods pattern. I feel like this pattern will rarely be used for real models, maybe only for synthetic ones like our docs. I don't think I've ever come across hydrological data with nanosecond precision.

Nanosecond timestamps will automatically convert to milliseconds as long as that doesn't lose data, and the ValidationError is also quite good. I prefer not to throw a warning on this, as it usually doesn't matter. If you know an easy way to round or truncate to the nearest millisecond, I'd be OK with that as well. But I doubt many users will run into this.
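If rounding were wanted, one option (a sketch, not part of the PR) is pandas' `round` followed by `as_unit`, available since pandas 2.0:

```python
import pandas as pd

# The date_range-with-periods pattern produces nanosecond precision by default.
times = pd.date_range(start="2020-01-01", end="2021-01-01", periods=100)
assert str(times.dtype) == "datetime64[ns]"

# Round to the nearest millisecond, then store at ms precision; after
# rounding, the unit conversion is lossless.
rounded = times.round("ms").as_unit("ms")
```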

@visr visr requested a review from evetion September 24, 2024 13:01
Member

@evetion evetion left a comment

Looking good, but I have my doubts about how we handle the pandas defaults (both for time and for the Arrow dtype).

In terms of dtype I think it's converted automatically, so we should be fine there. For time, however, I'd rather we convert (maybe with a warning) than error out; that still seems an improvement over silently dropping precision. That said, it's mostly a usability concern for me here.

-node_id: Series[Int32] = pa.Field(nullable=False, default=0)
 area: Series[float] = pa.Field(nullable=False)
 level: Series[float] = pa.Field(nullable=False)
+node_id: Series[Annotated[pd.ArrowDtype, pyarrow.int32()]] = pa.Field(nullable=True)
Member

You have lost the defaults here; was that on purpose?

Member Author

Yes, 0 was a dummy default, since np.int32 doesn't support missing values, but pyarrow.int32 does.

Member

Well, it's also because node_id is a required field and can't be missing schema-wise?

@visr visr requested a review from evetion September 26, 2024 11:05
@visr visr merged commit f5bfb50 into main Sep 26, 2024
@visr visr deleted the pyarrow branch September 26, 2024 12:12
visr added a commit that referenced this pull request Jul 11, 2025
This fixes #2120, so we can update to the latest Pandera release (0.25)
again. There is still one open bug,
unionai-oss/pandera#2016. That triggered me to
essentially revert #1781 and switch back to the default pandas dtypes,
because with them we are less likely to hit various compatibility issues
in the ecosystem.

Because we like nullable strings, integers and Booleans, we do use these
now:
- Nullable integer data type:
https://pandas.pydata.org/docs/user_guide/integer_na.html
- Nullable Boolean data type:
https://pandas.pydata.org/docs/user_guide/boolean.html
- StringDtype: https://pandas.pydata.org/docs/user_guide/text.html
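A minimal sketch of the three nullable extension dtypes listed above, all of which use pd.NA as the missing-value marker:

```python
import pandas as pd

# Nullable integers, Booleans, and strings; missing entries are pd.NA,
# unlike the numpy-backed dtypes, which have no integer/Boolean NA.
ints = pd.Series([1, None], dtype="Int32")
bools = pd.Series([True, None], dtype="boolean")
strs = pd.Series(["a", None], dtype="string")
```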

Furthermore, this gets rid of Series in the type annotations, using
https://pandera.readthedocs.io/en/stable/dataframe_models.html#using-data-types-directly-for-column-type-annotations.
This was needed because MyPy would otherwise complain about certain
dtypes not being allowed, but it is cleaner anyway.

This also switches back to the default Timestamp with ns precision for
the same reason. Possibly we need to support other units in the future
if we do very long simulations.

---------

Co-authored-by: Maarten Pronk <git@evetion.nl>
Development

Successfully merging this pull request may close these issues:

- Use Pandas Arrow backend
- Mark table columns as optional as described in the docs