
support forcing without exact model grid definition #162

Closed
verseve opened this issue Sep 13, 2021 · 14 comments

Comments

@verseve
Member

verseve commented Sep 13, 2021

Many wflow applications use forcing (precipitation, potential evapotranspiration and temperature) at a (much) coarser resolution than the actual wflow model grid. Supporting regridding of forcing during a wflow run (in memory) could be a nice addition to wflow for these kinds of applications.

@hcwinsemius

I would rephrase "nice addition" as something imperatively important where very large runs are concerned :-)

@visr
Member

visr commented Sep 14, 2021

I suppose that if we set up the cell-to-cell mapping once at the beginning and reuse it, it doesn't have to add much runtime overhead. It would be good, though, to refine what exactly is needed:

  • What kind of regridding methods are needed? Just nearest, or also others?
  • Does it need to be able to handle different projections? If so, we'd probably need a GDAL / PROJ dependency. If not, we can probably stick to something like https://github.com/JuliaMath/Interpolations.jl
  • Is this only for forcing, or do we want this for the staticmaps as well?
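
For the nearest case, the precomputed mapping could be as simple as the sketch below (plain Julia, no Wflow internals; all names and sizes are made up for illustration):

```julia
# Precompute, once at initialization, the index of the nearest forcing cell for
# every model cell. Per timestep the "regridding" is then just an array lookup.
nearest_indices(model_coords, forcing_coords) =
    [argmin(abs.(forcing_coords .- x)) for x in model_coords]

# Example: 1 km model cell centres inside a 10 km forcing grid (one axis shown;
# a 2D grid would use one index vector per dimension).
model_x   = collect(0.5:1.0:99.5)    # 100 model cell centres
forcing_x = collect(5.0:10.0:95.0)   # 10 forcing cell centres
ix = nearest_indices(model_x, forcing_x)

# Per timestep: map a coarse forcing vector onto the model grid.
forcing_values = rand(length(forcing_x))
model_values = forcing_values[ix]
```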

@verseve
Member Author

verseve commented Sep 14, 2021

I agree, some additional info about requirements would be good. @hcwinsemius or @markhegnauer could you please add (see also questions from @visr)? For now, I would say this is only for forcing, I don't see much added value to implement this for staticmaps.

@markhegnauer
Contributor

Agreed, for now this is only needed for forcing. A simple interpolation (nearest) would already be very useful. In the future, reprojection would be a nice-to-have.

@verseve
Member Author

verseve commented Sep 15, 2021

And what about using a lapse rate to correct temperature with the DEM?

@Huite
Contributor

Huite commented Sep 15, 2021

Just to butt in, uninvited (thanks to Github notifications): I'm rather confident that in-memory runtime regridding is a poisoned chalice.

I feel confident because in-memory runtime regridding has been a staple of iMODFLOW models for as long as I've used them, and I've honestly always hated it. The primary reason is that when debugging your model, it's no longer possible to inspect your model input directly and it gives me a distinct feeling of lacking control -- maybe this is just a personal defect.

Secondly, as Martijn mentions, it can also quickly spiral out of control in terms of complexity, especially if you consider your staticmaps. In this case, the appropriate regridding method depends on the forcing or parameter in question (with many possible valid schemes: nearest, mean, median, max, min, sum, first-order conserving, etc., etc.; all area-weighted or not). Automatic reprojection is probably ten times worse due to the generally poor understanding of coordinate systems and incomplete CRS specification by users. They will get confused, and they will blame you. In comparison, if Wflow expects a single set of coordinates, a user can simply look for the odd one out and fix it.

Note that this is also a form of feature creep, which has nasty implications if you want to support unstructured topologies in the longer term. Regridding rasters is quite straightforward, but doing it efficiently for triangular meshes (or other convex cell types) is not: https://github.com/deltares/numba_celltree

I see two reasons why this is desirable (at first glance):

  • For the maximalist approach (also staticmaps): it reduces the preprocessing burden on the user. They can just throw their data into the model and see it run. I think this is penny-wise, pound-foolish, because in the long term you will run into nasty inconsistencies that are created implicitly (and are then hard to debug). Furthermore, there is already a solution for this: good & flexible preprocessing tools (HydroMT).
  • Reducing storage footprint: copying hundreds of gigabytes (or even terabytes) is miserable. However, netCDF already has excellent options for data compression, and this is very easy to use with xarray (ctrl+F "compression" in the xarray docs). These options are part of the netCDF library, so they are also supported by GDAL (and hence QGIS), xarray, and Julia's netCDF packages. For nearest neighbor interpolations, compression is obviously very efficient (it can basically store the value once, plus start and end indices).

It's also worth noting there are a number of exciting developments in compression schemes (with sufficient CPU cores, reading and decompressing compressed data can actually be faster than reading uncompressed data, since memory access has become relatively slow compared to how fast CPUs are):
https://www.blosc.org/

Zarr is an ND array storage convention, using blosc:
https://zarr.readthedocs.io/en/stable/

Some Zarr support in the netCDF library:
https://www.unidata.ucar.edu/blogs/developer/entry/overview-of-zarr-support-in

I'd recommend:

  • Keep all regridding and reprojection logic in HydroMT and its dependencies; Wflow's job is to do hydrology, not to change cell sizes.
  • Just try the default compression schemes to start with. From HydroMT and xarray, this comes down to specifying an encoding dict before writing, something like: ds["my_variable"].encoding = {"zlib": True, "complevel": 4}. This requires no changes in Wflow.jl: NCDatasets.jl uses the netCDF-C lib, which supports this.
  • A minor change: consider precision reduction or dtype casting before compression. I.e. there's no need to store (uncertain) rainfall with 0.001 mm accuracy; you might be able to fit it in 16-bit integers. Wflow.jl would have to do a typecast and check the appropriate _FillValue.
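
For reference, the same deflate options are also available from NCDatasets.jl when writing; a minimal sketch (the variable name, sizes and deflate level are made up for illustration):

```julia
using NCDatasets

# Write a dummy precipitation field with zlib/deflate compression enabled.
# Casting to Float32 (or packing into Int16 with CF scale_factor/add_offset
# attributes) before writing reduces the size further.
data = rand(Float32, 100, 100)

NCDataset("forcing.nc", "c") do ds
    defDim(ds, "x", 100)
    defDim(ds, "y", 100)
    precip = defVar(ds, "precipitation", Float32, ("x", "y");
                    deflatelevel = 4, shuffle = true,
                    attrib = ["units" => "mm"])
    precip[:, :] = data
end
```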

On a slightly longer term:

  • Investigate the capabilities of the current Zarr support in the netCDF library.
  • If this is insufficient / buggy / whatever, consider adding Zarr as a Wflow-supported format. The Zarr spec is straightforward, and its data model is nearly identical to netCDF (where it matters). Zarr is extremely easy to write from xarray (and thus from HydroMT): .to_zarr instead of .to_netcdf. Martijn is already involved with the Julia package: https://github.com/meggart/Zarr.jl; I believe this is considerably less effort than full regridding support for Wflow (which I've been working on in Python, on-and-off).

@markhegnauer
Contributor

Hi @Huite, thanks for your input! Just to give you an idea: we're talking about 10,000 years of simulation for the whole of Europe at 1x1 (or 2x2) km2. Based on a very rough estimate using some input data for the Meuse catchment, I estimate we would need around 50+ TB of storage. An in-between solution could also be to regrid using hydromt directly in the workflow.

Next to the project's needs, I do see added value in providing an easy step-in for new wflow users.

@verseve, good point about lapse-rate correction. This could maybe be added by supplying a lapse-rate correction grid in the staticmaps, as was also available (but not working) in the Python code.

@verseve
Member Author

verseve commented Sep 15, 2021

@Huite: yes, indeed thanks for your input and recommendations. I share your concerns. In my view staticmaps are out of scope: that belongs in HydroMT, is mostly applied to data of higher resolution than a typical wflow model, and makes use of parameter-specific upscale operators. For the forcing part, I am not sure about the added value (the reason I wrote "could be a nice addition"). In the process of converting Wflow Python to Julia we have stripped away most pre-processing (e.g. calculation of slope from the DEM), so there is a clear boundary between data and computations. Also, once we start with this (simple at first), we can most probably expect requests to add different regridding methods, reprojection, etc., all functionality that is already part of HydroMT.

A solution where you indeed use the forcing part of HydroMT directly (in the cloud?) in a workflow is most probably more practical. Also, you could do this in a loop (say, per 100 years) to reduce the data storage.

For an easy step-in for new wflow users, I think having examples of how to run HydroMT without the Deltares project drive data is probably more beneficial.

@DirkEilander
Contributor

DirkEilander commented Sep 15, 2021

I'm in favor of this feature request, but it is indeed good to consider its scope (e.g. only downscaling of forcing with nearest interpolation, no reprojection, etc.). Such a feature is also available in other models like SFINCS and CaMa-Flood, and I'm sure many more.

Just two small remarks regarding lapse rates based on the current hydroMT workflows.

  • We use lapse rates to downscale both the temperature and the pressure (used for PET calculations). Doing the PET calculations in hydroMT at the input data (instead of model) resolution might yield different results. It might be good to consider where (wflow or hydroMT) to do the PET calculations. The basic temperature correction itself is sketched just after this list.
  • Furthermore, we require a DEM that covers the complete spatial extent of the selected input data cells, which is likely larger than the model extent. This avoids errors when deriving the input data elevation from the model elevation at boundary input data cells that overlap with only a few model cells. It could be solved by requiring an additional input data DEM.
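
For illustration, the temperature correction itself is just a linear adjustment with the elevation difference between the forcing cell and the model cell; a rough sketch (the constant lapse rate and the names are only placeholders, the actual hydroMT workflow may differ):

```julia
# Downscale a coarse-resolution temperature to the model cell elevation using a
# constant environmental lapse rate (-0.0065 K per metre is a common default).
const LAPSE_RATE = -0.0065  # K per metre

downscale_temperature(t_forcing, z_forcing, z_model; lapse = LAPSE_RATE) =
    t_forcing + lapse * (z_model - z_forcing)

# Example: forcing cell at 200 m reporting 10.0 degrees C, model cell at 800 m.
downscale_temperature(10.0, 200.0, 800.0)  # ≈ 6.1 degrees C
```

This is also where the DEM coverage mentioned above comes in: the forcing-cell elevation has to be derived for every input data cell that overlaps the model domain.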

An alternative solution could be to use BMI to pass forcing to the model and downscale it on the fly (with hydroMT). This would not require additional methods in wflow, but it is likely slower and definitely more complex; it does, however, come with greater flexibility in the choice of downscaling methods. This could be implemented by making a BMI model adapter on top of an xarray Dataset object.

@verseve
Member Author

verseve commented Sep 15, 2021

@DirkEilander: I think if you break up the problem into time slices of ~100 years, you have an alternative solution that does not require the use of BMI.

@verseve
Member Author

verseve commented Sep 17, 2021

Wflow is a computational engine for hydrology, thus without a focus on pre- or post-processing of input data (forcing, parameters) and output data. We're happy to leave that part to other software such as HydroMT, and we want to keep that boundary very clear. Furthermore, there are alternative solutions available for this specific problem (very long runs, large data storage), mentioned in some of the comments here.

@verseve
Member Author

verseve commented Sep 17, 2021

So, the conclusion is that we will not support this kind of functionality in Wflow.

@Huite
Contributor

Huite commented Sep 17, 2021

For the lazy reader, to enumerate these solutions in a single list:

  1. For large data, just slice it into more manageable bits. This has the added benefit that a model crash (for whatever reason) does not invalidate entire runs.
  2. Regrid the data in the preprocessing step, e.g. with HydroMT, and use a compression scheme when writing to netCDF.
  3. Make use of the fact that Wflow.jl is a Julia package. Users can import Wflow into a Julia script and write their own run loops. These can easily be turned into separate Julia packages, which simply depend on Wflow.

The last option also provides a natural proving ground for new features. If after some time a significant fraction of use is via custom run logic, it stands to reason to then consider merging that logic into Wflow.jl proper.

@visr
Member

visr commented Sep 17, 2021

For custom run loops, point 3, it essentially comes down to copy-pasting the run function from here and making the changes you need. It would be good to document this, and perhaps we can make it a bit easier still.

It works quite well; I've already used it, for instance, to create a Makie plot that is updated on each timestep to visually show the state of the model as it runs. That is an example of something we don't want by default in the computational core (heavy plotting dependencies), but that can be useful for some applications.
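
To give an impression, such a custom run loop could look roughly like the sketch below. It leans on the BMI-style initialize/update/finalize entry points rather than copying the internal run function; the exact function names should be checked against the Wflow.jl source and are used here as placeholders only:

```julia
using Wflow

# Rough sketch of a custom run loop; function names follow the BMI interface
# that Wflow implements and may differ between versions.
toml_path = "path/to/config.toml"  # placeholder

model = Wflow.BMI.initialize(Wflow.Model, toml_path)
while Wflow.BMI.get_current_time(model) < Wflow.BMI.get_end_time(model)
    Wflow.BMI.update(model)
    # custom per-timestep logic goes here, e.g. updating a Makie plot
    # or writing extra diagnostics
end
Wflow.BMI.finalize(model)
```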

@verseve closed this as completed Sep 22, 2021
@visr mentioned this issue Aug 2, 2022