Cache for area-weighted regridder #2472
The same is true for all of the regridders currently in iris I believe (Linear/Nearest/AreaWeighted). No caching of weights and indices takes place.
I would like to put this together into a use case for some planned enhancement work we have for later this year. Further comments and thoughts on this (to keep it in my notifications) are much appreciated.
Use case: AreaWeighted is currently the only conservative scheme, and it is ~1000 times slower than Linear or Nearest. Is it necessary to have two ways of doing regridding?
If the weight creation is the most expensive step, a regridding function could return the weights explicitly, so they can be fed back into the regridding function the next time. This would add no more complexity for the user than using Regridder objects, but would potentially be much simpler. Testing the current caching method seems difficult. (Is it caching? Is it using the cache? Is there a benefit to using the cache?)
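The suggestion above (compute the weights once, pass them back in for every later regrid) can be sketched in plain numpy. This is only an illustrative sketch, not iris API: compute_weights and regrid are hypothetical names, and the row-normalised random matrix stands in for the real area-overlap calculation.

```python
import numpy as np

def compute_weights(n_src, n_tgt, rng):
    # Stand-in for the expensive area-overlap calculation: a row-normalised
    # matrix mapping n_src source cells onto n_tgt target cells.
    w = rng.random((n_tgt, n_src))
    return w / w.sum(axis=1, keepdims=True)

def regrid(src_data, weights):
    # The cheap, repeatable step: a weighted mean of source cells
    # for each target cell.
    return weights @ src_data

rng = np.random.default_rng(0)
weights = compute_weights(n_src=16, n_tgt=4, rng=rng)  # expensive, done once
out_a = regrid(rng.random(16), weights)                # cheap, reuses weights
out_b = regrid(rng.random(16), weights)                # no recomputation
```

Because the rows of the weights matrix sum to one, a constant source field regrids to the same constant, which is a quick sanity check on the weights.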
@tv3141 the current UI allows for a generic interface to regridding/interpolation, with the implementation details hidden from the user. This framework is what allows caching to be developed without changing the UI for users.
The bottleneck is averaging the source grid points. The area-weighted regridding is done in a large loop over the target grid points. The weights are calculated inside the loop for each target grid point. The regridding is done in one step for all other dimensions (time, levels).
https://github.com/SciTools/iris/blob/v1.12.x/lib/iris/experimental/regrid.py#L503
def _regrid_area_weighted_array(src_data, x_dim, y_dim,
src_x_bounds, src_y_bounds,
grid_x_bounds, grid_y_bounds,
grid_x_decreasing, grid_y_decreasing,
area_func, circular=False, mdtol=0):
[...]
# Simple for loop approach.
for j, (y_0, y_1) in enumerate(grid_y_bounds):
[...]
for i, (x_0, x_1) in enumerate(grid_x_bounds):
[...]
data = src_data[tuple(indices)]
# Calculate weights based on areas of cropped bounds.
weights = area_func(y_bounds, x_bounds)
[...]
# Calculate weighted mean taking into account missing data.
new_data_pt = _weighted_mean_with_mdtol(data,
weights=weights, axis=axis, mdtol=mdtol)
# Insert data (and mask) values into new array.
[...]
new_data[tuple(indices)] = new_data_pt
[...]
return new_data
def _weighted_mean_with_mdtol(data, weights, axis=None, mdtol=0):
"""
...
"""
res = ma.average(data, weights=weights, axis=axis)
[...]
return res

Profiling
import cProfile
import pstats
import iris
areaweighted_regridder = iris.analysis.AreaWeighted().regridder(create_cube(*resolution['N768']), \
create_cube(*resolution['N96']))
cube = create_cube(*resolution['N768'], levels=3)
cProfile.run('areaweighted_regridder(cube)', 'aw_regrid.prof')
stats = pstats.Stats('aw_regrid.prof')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats()
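If the per-target-cell weights were lifted out of the loop above, the whole regrid would collapse to one matrix product over all extra dimensions at once. A dense-numpy sketch (hypothetical; a real implementation would use a sparse matrix, since each target cell overlaps only a few source cells):

```python
import numpy as np

rng = np.random.default_rng(42)
n_src, n_tgt, n_levels = 64, 9, 3

# Hypothetical cached weights: one row per target cell, mostly zeros,
# normalised so each row computes a weighted mean over overlapping cells.
W = rng.random((n_tgt, n_src)) * (rng.random((n_tgt, n_src)) < 0.2)
W[:, 0] += 1e-9          # guard against an all-zero row in this toy example
W /= W.sum(axis=1, keepdims=True)

# All vertical levels regridded in one matmul instead of a per-cell loop.
src = rng.random((n_levels, n_src))
tgt = src @ W.T          # shape (n_levels, n_tgt)
print(tgt.shape)
```

With normalised rows, a constant field maps to the same constant, which is the local-mean property the looped code computes point by point.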
Any ideas what makes the locally copied version of np.ma.average so much faster?
import inspect
import numpy as np
print(np.__version__)  # 1.12.1

# imports for copied code
from numpy import asarray
from numpy.ma import getmask, nomask

def ma_average(a, axis=None, weights=None, returned=False):
    # body copied from the np.ma.average source, found via:
    # print(inspect.getsourcefile(np.ma.average))
    ...

arr = np.random.rand(1, 8, 8, 3)
w = np.random.rand(1, 8, 8, 3)

print('np.ma.average')
print(np.ma.average(arr, weights=w, axis=(1, 2)))
%timeit a = np.ma.average(arr, weights=w, axis=(1, 2))
print()

print('ma_average')
print(ma_average(arr, weights=w, axis=(1, 2)))
%timeit a = ma_average(arr, weights=w, axis=(1, 2))
print()

print('np.average')
print(np.average(arr, weights=w, axis=(1, 2)))
%timeit a = np.average(arr, weights=w, axis=(1, 2))
print()
The bug: from numpy import asarray when it should have been from numpy.ma import asarray. This also makes the 'local'
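The difference between the two imports is easy to demonstrate: numpy.asarray silently discards the mask of a masked array, so any fill value leaks into subsequent statistics, while numpy.ma.asarray preserves it. A minimal illustration:

```python
import numpy as np
from numpy import asarray as np_asarray
from numpy.ma import asarray as ma_asarray

# A masked array whose masked entry holds a junk value.
m = np.ma.masked_array([1.0, 99.0, 3.0], mask=[False, True, False])

# numpy.asarray drops the mask, returning a plain ndarray:
print(isinstance(np_asarray(m), np.ma.MaskedArray))  # False
# numpy.ma.asarray preserves it:
print(isinstance(ma_asarray(m), np.ma.MaskedArray))  # True

# With the mask lost, the junk value leaks into any statistic:
print(np_asarray(m).mean())  # includes the masked 99.0
print(ma_asarray(m).mean())  # 2.0, masked entry excluded
```

This also explains the speed difference seen above: operating on plain ndarrays skips all the mask bookkeeping, at the cost of wrong answers on masked data.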
I realised I perhaps neglected to mention that I would also really like to see caching implemented in iris :)
This section has been reviewed in the "area_weighted_regridding" feature branch.
Update from refinement meeting: the current status of the work was discussed. We think completing it is achievable with approx. This nicely aligns with the 5 days of dev work that Emma/Andy are providing.
Optimisation AC: the time to process a single cube is quicker than previously (see #2370).
The feature branch,
Hi @abooton, apologies if this isn't useful or I'm too late, but for what it's worth:
I'm not 100% sure I have understood the context here, but I would have thought it reasonable/flexible for area-weighted regridding to return data whose dtype relates to the source dtype by the following relation: tgt_dtype = np.promote_types(src_dtype, 'float16'). Hope this helps in some way. Merry Christmas!
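For reference, the suggested relation behaves as follows under NumPy's standard promotion rules: floating source dtypes pass through unchanged, while integer dtypes are lifted to the smallest float wide enough to hold them.

```python
import numpy as np

# Promoting against float16 leaves floats alone and lifts integers:
print(np.promote_types(np.int8,    np.float16))  # float16
print(np.promote_types(np.int32,   np.float16))  # float64
print(np.promote_types(np.float32, np.float16))  # float32
print(np.promote_types(np.float64, np.float16))  # float64
```

So under this rule an int8 land-cover field would come back as float16, not float64, which is the point debated in the later comments.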
The following quick check indicates that the ASV tests should have seen a noticeable improvement:
Hi @cpelley, thanks for your thoughts.
We'd not specifically looked at this, as the user requirement was for reducing computation time. The regridder object now retains the weights (which may be similar in size to the 2D data array, or bigger, as they are float64). I would think the memory required to average the data would be more, particularly for 2D diagnostics, because 1) the weights are pushed into an additional array, and 2) the source data is initially stored in an array larger than that of the target grid prior to computing the mean. (Previously the weights and mean were computed independently by looping through every point on the target grid, so I think the memory would have been released between each calculation.) It should be noted that the total memory usage of the new code could easily be optimised in a future PR, by initially storing the weights in the numpy array structure required by the regridding "perform" code. This further refactoring hasn't been done yet, as we didn't wish to introduce a bug into the weights calculations.
To clarify, the changes here retain the current dtype behaviour. However, I was concerned that integer source data can be returned as type float.
Awesome! Yes, this is how we made ESMF regridding approachable (the memory overhead there really is too large). Do feel free to give me a shout for a chat if/when you think it might be useful to do so.
I don't think AreaWeighted/bilinear regridding should ever return integer types. Performing area-conservative regridding implies to me floats in the return, since area calculations should be float. The result needs to be conservative, both locally and globally.
@cpelley, using memory_profiler I generated the following plots by loading a cube, profiling using 738093f, then profiling using the feature branch. Edit: There is an ~15% increase in the average memory usage, but an ~90% decrease in the total time taken to perform the regridding.
Generating data using float16 vs float64 reveals huge differences. In hindsight, I don't think data recorded as integers should be treated as anything but the highest precision (float64). Integer data doesn't mean the precision is only to an integer. An example here is land cover type fraction ancillary generation (int8 -> float64).
I think this is probably addressed by #3617 |
This has been addressed by #3660 - asv now demonstrates a change in performance at the appropriate commit. |
Closed by #3623. The outstanding (numba) acceptance criterion is a "nice to have".
This has been a major issue for several people, here are a few links to provide some context:
#2370
https://exxconfigmgmt:6391/browse/EPM-1542
@abooton - Updating the description 03/12/2019.
As a user of the area-weighted regridder I would like to cache the area-weights (as well as snapshot the grid info) so that I can reduce the time taken to regrid multiple cubes.
Description:
As described in the iris documentation:
https://scitools.org.uk/iris/docs/latest/userguide/interpolation_and_regridding.html?highlight=regridding#caching-a-regridder
"If you need to regrid multiple cubes with a common source grid onto a common target grid you can ‘cache’ a regridder to be used for each of these regrids. This can shorten the execution time of your code as the most computationally intensive part of a regrid is setting up the regridder."
Unfortunately the weights are not currently cached, so the benefit described is not realised when carrying out area-weighted regridding.
It is noted that although, at present, the majority of the time is spent calculating the weighted mean, computing the weights can be significant, e.g. ~25% for non-masked arrays (the stats reported below are for the non-masked case).
See #2370 for a good example of setting the regridder up.
Acceptance Criteria:
- The regrid_area_weighted_rectilinear_src_and_grid API should be maintained as it is: iris.analysis.AreaWeighted().regridder(cube1, cube2) (see example in Area weighted regridder caching #2370)
- The weights should be calculated in __prepare and applied in __perform (rather than recomputed on each regrid)
- If numba is used for code speed-up, it should be implemented on an "if available - use it" basis

Note:
The weights are currently computed alongside the weighted-mean calculation, in the loop. If the weights calculation is refactored out, the grid points will be looped over twice instead of once. It is therefore suggested that the work is developed in a feature branch, and merged once code refactoring and optimisation are both complete.
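The two-pass structure described in the Note can be sketched in plain numpy. Everything here is illustrative: the contiguous-block overlaps are a toy stand-in for the real bounds intersection, and perform is a hypothetical name for the repeatable step.

```python
import numpy as np

rng = np.random.default_rng(1)
src = rng.random((2, 12))   # (levels, source cells)
n_tgt = 4

# Pass 1 (the refactored weights calculation): for each target cell,
# record which source cells overlap it and with what normalised weight.
# Toy overlaps: contiguous blocks of 3 source cells per target cell.
cached = []
for j in range(n_tgt):
    idx = np.arange(3 * j, 3 * j + 3)
    w = rng.random(3)
    cached.append((idx, w / w.sum()))

# Pass 2 (the regrid itself): apply the cached indices and weights.
# Only this pass needs repeating for each cube on the same grid.
def perform(data):
    return np.stack([data[:, idx] @ w for idx, w in cached], axis=-1)

out = perform(src)
print(out.shape)  # (2, 4)
```

Looping twice costs one extra traversal of the grid on the first regrid, but every subsequent cube skips pass 1 entirely, which is the caching benefit the issue asks for.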