Issue #1698 Model write optimization #1700
Conversation
…eading op the nc and zarr files
…al applicable and some unittests are failing
# Conflicts: # pixi.lock
… the way the solution is updated in the split model
…idn't expect to receive dask objects. Optimize the flow-transport model matcher (cherry picked from commit 9c5134a)
# Conflicts: # pixi.lock
like = ones_like(active)
bottom = like * bottom
top_2d = (like * top).sel(layer=1)
top_3d = bottom.shift(layer=1).fillna(top_2d)
Interesting that this is necessary! I think we can probably load top and bottom in the beginning of the function? Then it is more explicit and performance for consecutive computations will be faster.
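The suggestion could look like the sketch below: load `top` and `bottom` once at the start, so the consecutive computations run on in-memory arrays. This is a hypothetical sketch, not the actual imod code; the function and argument names are assumptions.

```python
import xarray as xr

def top_per_layer(active: xr.DataArray, top: xr.DataArray, bottom: xr.DataArray) -> xr.DataArray:
    # Load once up front (the reviewer's suggestion): subsequent
    # computations then operate on in-memory arrays instead of
    # repeatedly pulling dask chunks from disk.
    top = top.load()
    bottom = bottom.load()
    like = xr.ones_like(active)
    bottom_3d = like * bottom
    top_2d = (like * top).sel(layer=1)
    # Each layer's top is the bottom of the layer above; layer 1
    # falls back to the model top.
    return bottom_3d.shift(layer=1).fillna(top_2d)
```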
@dispatch
def is_equal(array1: xu.UgridDataArray, array2: xu.UgridDataArray) -> bool:
Nice optimization shortcut!
imod/mf6/simulation.py
Outdated
  path = f"(unknown).nc"
- exchange_package.dataset.to_netcdf(directory / path)
+ exchange_package.dataset.to_netcdf(
+     directory / path, format="NETCDF4"
+ )
I'm not sure whether we should hardcode the "NETCDF4" format. Maybe we should add a netcdf_kwargs argument to dump, which accepts a dictionary with settings for to_netcdf? In pkgbase I see we forward kwargs; I think we can do that here as well?
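One possible shape for this suggestion is sketched below. The names `dump` and `netcdf_kwargs` come from the comment itself, but the body is a hypothetical illustration, not the actual imod implementation:

```python
from pathlib import Path

import xarray as xr

def dump(dataset, directory, name, netcdf_kwargs=None):
    # Hypothetical sketch: keep a sane default format, but let
    # callers override how the dataset is written by forwarding
    # keyword arguments to Dataset.to_netcdf.
    kwargs = {"format": "NETCDF4"}
    if netcdf_kwargs is not None:
        kwargs.update(netcdf_kwargs)
    dataset.to_netcdf(Path(directory) / f"{name}.nc", **kwargs)
```

A caller who prefers, say, the scipy backend could then pass `netcdf_kwargs={"format": "NETCDF3_CLASSIC", "engine": "scipy"}` instead of being locked into the netcdf4 library.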
imod/mf6/rch.py
Outdated
vars = [
    "species",
]
for var in vars:
    if var in instance.dataset:
        instance.dataset[var] = instance.dataset[var].astype(str)

return instance
I think this string conversion should be part of the pkgbase class. Or BoundaryCondition class, as species can also be part of other packages.
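Moved into a shared base class, the conversion could be sketched roughly as below. `Package` stands in for the pkgbase class and the helper name is an assumption; only the `species`/`astype(str)` logic comes from the diff:

```python
import xarray as xr

class Package:
    # Hypothetical sketch of the reviewer's suggestion: perform the
    # string coercion once in the shared base class instead of in
    # each package's from_file.
    _string_variables = ("species",)

    def __init__(self, dataset):
        self.dataset = dataset

    def _coerce_string_variables(self):
        # Convert known string-valued variables so to_netcdf does not
        # choke on unsupported dtypes.
        for var in self._string_variables:
            if var in self.dataset:
                self.dataset[var] = self.dataset[var].astype(str)
```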
imod/mf6/hfb.py
Outdated
| """ | ||
| kwargs.update({"encoding": self._netcdf_encoding()}) | ||
| kwargs.update({"format": "NETCDF4"}) |
See my comment about the format in pkgbase
imod/mf6/pkgbase.py
Outdated
| """ | ||
| kwargs.update({"encoding": self._netcdf_encoding()}) | ||
| kwargs.update({"format": "NETCDF4"}) |
We should probably add a netcdf_kwargs argument to the dump method so that users can specify how datasets should be written themselves. It could be they don't want to write with the netcdf4 library but with one of the many other options.
JoerivanEngelen
left a comment
I've fixed my comments, thanks for creating this. This indeed speeds up write times considerably for the example!
Fixes #1698
Description
This is part 2 of fixing the performance issues with large models.
In part 1 (#1693) the model splitter was optimized. In this PR the focus is on writing the partitioned model.
As @Huite pointed out in #1686, the performance bottleneck had to do with the fact that the same package had to be loaded from file multiple times, while only a part of the file is actually needed.
After digging around for a while I discovered that this was caused by how we open the dataset:
dataset = xr.open_dataset(path, **kwargs)

In the line above we don't specify anything chunk-related. As a result, accessing the dataset loads the entire file from disk. By simply adding chunks="auto" this is no longer the case, and a huge performance gain is achieved.
There are some other changes related to setting chunking to "auto". Some parts of the code didn't expect to receive dask arrays. For instance, you can't use .item() on a dask array; instead I now use .values[()].
I was also getting some errors when the to_netcdf method was called on a package. All of them had to do with wrong or unsupported datatypes. In this PR you will find that an encoding is added for float16 types, and that in some packages the from_file method has been updated to ensure that the loaded type is converted to a supported type.
An unrelated but performance-wise significant change has been applied to the _get_transport_models_per_flow_model method. This method is used to match gwf models to gwt models so that gwfgwt exchanges can be created. It was doing a full comparison of domains, which is expensive. There is also a method available that does the comparison at domain level; by switching to it, the matching algorithm becomes almost instantaneous.

NOTE: This PR has issue #1699 as a base. The base needs to be altered to master once that PR is in.
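The cheap, domain-level comparison described above could be sketched as follows. `domains_match` is a hypothetical name and the body is an assumption, not the actual matcher: it compares dimensions and coordinates instead of every cell, so full (possibly dask-backed) arrays never need to be computed.

```python
import numpy as np
import xarray as xr

def domains_match(domain_a: xr.DataArray, domain_b: xr.DataArray) -> bool:
    # Hypothetical sketch: a cell-by-cell comparison of two domains
    # forces a full load/compute. Comparing dims, shape, and
    # coordinate labels is enough to match flow and transport models
    # and is essentially free.
    if domain_a.dims != domain_b.dims or domain_a.shape != domain_b.shape:
        return False
    return all(
        np.array_equal(domain_a[dim].values, domain_b[dim].values)
        for dim in domain_a.dims
        if dim in domain_a.coords and dim in domain_b.coords
    )
```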
NOTE: This PR also improves the dump method.

NOTE: Some timings: