Use iris' regridder caching for faster regridding? #2341

Closed
schlunma opened this issue Feb 23, 2024 · 10 comments · Fixed by #2344

@schlunma
Contributor

Some regridding schemes in iris allow reusing the regridding weights (see here for an overview of them), which might significantly speed up our regrid preprocessor when multiple variables of the same dataset are regridded. This is described here and basically boils down to using the following pattern:

import iris.analysis

regridder = iris.analysis.Nearest().regridder(source_cube, target_cube)
for cube in list_of_cubes_on_source_grid:
    result = regridder(cube)  # reuses the weights computed above

instead of

for cube in list_of_cubes_on_source_grid:
    result = cube.regrid(target_cube, iris.analysis.Nearest())  # recomputes the weights every time

This should be fairly easy to implement with a dict cache that uses the source and target latitude and longitude coordinates as keys (we cannot use functools.lru_cache, since the hash of an iris coordinate is simply its id, so cached results would never be reused across different cube objects); see the sketch below.
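For illustration, a minimal sketch of such a dict cache, assuming a hypothetical helper _coord_key and a module-level _REGRIDDER_CACHE (this is not the actual implementation in #2344):

import iris.analysis

_REGRIDDER_CACHE = {}

def _coord_key(cube):
    """Build a hashable key from a cube's horizontal coordinates."""
    # Coordinate hashes are just object ids, so key on name, shape and
    # point values instead.
    return tuple(
        (coord.name(), coord.shape, coord.points.tobytes())
        for coord in (cube.coord('latitude'), cube.coord('longitude'))
    )

def cached_regrid(cube, target_cube, scheme):
    """Regrid cube onto target_cube, reusing a cached regridder if possible."""
    # Note: keying on the scheme's class name ignores scheme parameters;
    # good enough for a sketch, not for production.
    key = (_coord_key(cube), _coord_key(target_cube), type(scheme).__name__)
    if key not in _REGRIDDER_CACHE:
        _REGRIDDER_CACHE[key] = scheme.regridder(cube, target_cube)
    return _REGRIDDER_CACHE[key](cube)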

One drawback here is that each regrid call incurs an additional comparison of the coordinates, which makes it slightly slower when just a single variable from many datasets is analyzed.

@ESMValGroup/esmvaltool-coreteam is this something we want? I can open a draft PR so that the discussion is easier.

@schlunma schlunma added the enhancement New feature or request label Feb 23, 2024
@schlunma schlunma added this to the v2.11.0 milestone Feb 23, 2024
@schlunma schlunma self-assigned this Feb 23, 2024
@bouweandela
Member

I'd recommend using a profiler, e.g. py-spy, first to profile a run of a typical recipe that uses regrid and then see where most of the time is spent. If it turns out that it is indeed regridding, then we could consider this.
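For reference, a typical py-spy invocation that records a speedscope-format profile might look like this (the recipe name is a placeholder; --subprocesses may be needed since ESMValTool runs tasks in separate processes):

py-spy record --format speedscope --subprocesses -o esmvaltool-profile.json -- esmvaltool run recipe_example.yml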

@schlunma
Contributor Author

schlunma commented Feb 23, 2024

That's a very good suggestion.

Here is a run with ~20 variables of 1 year of ICON data (regridding only, nothing else).

No caching (current main): esmvaltool-no_caching.json
With caching: esmvaltool-caching.json
(these files can be opened with https://www.speedscope.app/)

The function that calculates the weights is _regrid_unstructured_to_rectilinear__prepare. Without caching, this step takes up 11% of the runtime (50 s); with caching, this drops to 0.53% (2 s).

For that specific use case (multiple variables from one model) this can save quite some time with minimal effort, so I think it's worth including.

@bouweandela
Member

Maybe you could also check whether it still saves time when running with #2316, as it looks like that's where we're headed. How much memory do the weights take up?

@schlunma
Contributor Author

The weights are 2D arrays whose size scales with the number of source/target points. For example, for a 1000x1000 grid (which is much larger than commonly used for CMIP6 models), this amounts to ~8 MiB assuming float64. Of course, for very high-resolution grids you might end up with ~1 GiB, but I don't think that matters on machines where you want to evaluate that kind of data.
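For concreteness, the arithmetic behind that estimate (assuming one float64 weight per grid point):

# 1000 x 1000 grid points, one 8-byte float64 weight per point
n_points = 1000 * 1000
size_bytes = n_points * 8
print(size_bytes / 2**20)  # ~7.6 MiB, i.e. the "~8 MiB" quoted above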

Also, the improvement shown here is only a lower bound for that use case, as this is data with a very low resolution. For higher resolutions, calculating the regridding weights will take even more time. For other use cases (like one variable, many models), this should not change the runtime at all.

Why would you think that #2316 changes something here?

@bouweandela
Member

Because there, all preprocessor functions are run on Dask workers (with limited memory), so the cache will likely not be available on the worker you're running on. Of course, there are smarter ways to manage the cache that would also work in a Dask cluster, but then you would need to copy it to the workers again.

@schlunma
Contributor Author

Sorry, I can't get #2316 to run. With the same setup that I used for the other tests (4 workers with 7 GiB of memory each), the tool fails with a wild Dask error (I guess because it runs out of memory), and after increasing the resources to 2 workers with 15 GiB each, it has been running for an hour but is still at 0% (the other tests finished in ~3 min).

It would also be nice to discuss #2316 at some point. It looks like a drastic change, and it's not 100% clear to me why "we're headed" that way and what the benefits are.

On the other hand, I really think that we should implement this. Under the described circumstances, not reusing regridding weights is really bad practice (it is the most computationally intensive part of regridding). In all other cases, this won't change performance at all. I opened a PR about this (#2344), and it really doesn't require a lot of changes. We basically rely on the regridder caching of iris (or other regridding libraries).

@bouweandela
Member

Interesting, could you share the recipe (and required data) so I can try to find out why it does not work with #2316? So far, I've tested it with examples/recipe_easy_ipcc.yml and gotten very good performance (it ran in half an hour instead of the original 3-4 hours). Of course, we won't adopt that approach if it turns out that it does not work after all.

@bouweandela
Member

The result of investigating the issue with #2316 was that the FileLock that prevents downloading the grid multiple times is blocking the computation (fairly easy to solve), and that esmpy crashes if used from multiple threads (I reported this to the esmpy developers and hope they will be willing to do something about it), so there is some more work to do before #2316 is usable.

@schlunma
Contributor Author

Thanks for investigating! I still think it's worth including this PR. What do you think?

@valeriupredoi
Contributor

> and esmpy crashes if used from multiple threads

surely having to do with HDF5's lack of thread safety!

Well, I reviewed and approved Manu's PR, which I think is a good idea - caching is always a nice alternative to computing on the fly, as long as it's not a bulky, memory-hogging cache 🍺
