# Lazy calculations and efficient evaluation with Dask

Summary :
  * Iris cubes use Dask to provide "lazy" data arrays for delayed work and efficient out-of-core processing
  * Many cube operations are 'lazy-preserving', producing lazy results from lazy input
  * Where Iris does not have a suitable function, "custom" deferred calculations can be made with Dask, and then put into Iris cubes
  * Multiple calculations on the same data can be computed in parallel for increased performance


In [1]:
import iris
print iris.__version__

2.0.0


## First fetch some test data
and adjust it to look like (fake) U and V windspeeds.

In [2]:
from iris import sample_data_path
import glob

filepaths = glob.glob(sample_data_path('UM', '*.pp'))
filepath = filepaths[0]
u_cube = iris.load_cube(filepaths)

u_cube.coord('time').bounds = None
u_cube.attributes.clear()
u_cube.cell_methods = None

v_cube = u_cube.copy()
v_cube.coord('time').points = u_cube.coord('time').points
v_cube.rename('eastward_sea_ice_velocity')

print 'U cube : \n', u_cube
print
print 'V cube : \n', v_cube.summary(shorten=True)

U cube : 
northward_sea_ice_velocity / (m s-1) (time: 120; latitude: 215; longitude: 360)
     Dimension coordinates:
          time                            x              -               -
          latitude                        -              x               -
          longitude                       -              -               x
     Auxiliary coordinates:
          forecast_period                 x              -               -
     Scalar coordinates:
          forecast_reference_time: 1859-09-01 00:00:00

V cube : 
eastward_sea_ice_velocity / (m s-1) (time: 120; latitude: 215; longitude: 360)


## Built-in lazy operations.

Let's calculate windspeeds.

The vector magnitude can be computed from the U and V cubes using the Iris built-in "cube arithmetic" operators  
(i.e. `+`, `-`, `*`, `/`, `**` etc.)

In [3]:
cube_windspeed = (u_cube * u_cube + v_cube * v_cube) ** 0.5
cube_windspeed.rename('wind_speed')
print cube_windspeed

wind_speed / (m.s-1)                (time: 120; latitude: 215; longitude: 360)
     Dimension coordinates:
          time                           x              -               -
          latitude                       -              x               -
          longitude                      -              -               x
     Auxiliary coordinates:
          forecast_period                x              -               -
     Scalar coordinates:
          forecast_reference_time: 1859-09-01 00:00:00


### This result is itself a lazy cube 

Iris cube arithmetic operators are lazy-preserving   
(i.e. results will be lazy if the inputs are)

Thus, the actual data has not yet been fetched, and the result values are not yet calculated.

In [4]:
print 'cube_windspeed.has_lazy_data() = ', cube_windspeed.has_lazy_data()

cube_windspeed.has_lazy_data() =  True
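
The same lazy-preserving behaviour can be seen at the Dask level, without Iris. The following is a minimal sketch, assuming small made-up arrays (`u_np`, `v_np`) in place of the real U and V data: arithmetic on Dask arrays only builds a task graph, and values appear when `compute()` is called.

```python
import numpy as np
import dask.array as da

# Small stand-in arrays (made-up values, not the sample data).
u_np = np.array([[3.0, 0.0], [6.0, 1.0]])
v_np = np.array([[4.0, 2.0], [8.0, 1.0]])

# Wrap them as lazy, chunked Dask arrays.
u_lazy = da.from_array(u_np, chunks=(1, 2))
v_lazy = da.from_array(v_np, chunks=(1, 2))

# Same expression as the cube arithmetic: this only builds a task graph.
speed_lazy = (u_lazy * u_lazy + v_lazy * v_lazy) ** 0.5
is_lazy = isinstance(speed_lazy, da.Array)

# Values are only calculated on request.
speed_real = speed_lazy.compute()

print("lazy before compute: {}".format(is_lazy))
print("type after compute : {}".format(type(speed_real).__name__))
```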


## Data realisation

When actual values are required, the data is fetched from disk and the calculations are performed.  
The result is then stored in the cube as 'real' data (i.e. a numpy array). This process is known as **realisation**.

This can be shown working with a ***copy*** of the above cube (to avoid realising the original).

In [5]:
windspeed_copy = cube_windspeed.copy()
print 'BEFORE windspeed_copy.has_lazy_data() = ', windspeed_copy.has_lazy_data()
print ' some data : \n', windspeed_copy.data[10, 20:22, 300:303]
print 'AFTER windspeed_copy.has_lazy_data() = ', windspeed_copy.has_lazy_data()

BEFORE windspeed_copy.has_lazy_data() =  True
 some data : 
[[ 0.02622383  0.05331731  0.05291287]
 [ 0.01845959  0.05698491  0.04619173]]
AFTER windspeed_copy.has_lazy_data() =  False


### Fetching data has 'realised' this cube.
However, the original windspeed cube and the U and V cubes it is based on are all unaffected.  
This means that when results are fetched from those cubes, all the data will have to be loaded again.

In [6]:
def cubevar_datastates(*varnames):
    for varname in varnames:
        cube = globals()[varname]
        print 'cube {}, {} : data lazy = {}'.format(varname.rjust(20),
                                               ('"' + cube.name() + '"').rjust(40),
                                               cube.has_lazy_data())

cubevar_datastates('windspeed_copy', 'cube_windspeed', 'u_cube', 'v_cube')

cube       windspeed_copy,                             "wind_speed" : data lazy = False
cube       cube_windspeed,                             "wind_speed" : data lazy = True
cube               u_cube,             "northward_sea_ice_velocity" : data lazy = True
cube               v_cube,              "eastward_sea_ice_velocity" : data lazy = True
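
The cost of leaving the sources lazy can be illustrated with plain Dask. This is a sketch under stated assumptions: `fake_load` is a hypothetical stand-in for reading a file, and `load_log` simply counts how often it runs. Each separate `compute()` re-runs the load, because Dask keeps no results between calls.

```python
import numpy as np
import dask
import dask.array as da

load_log = []  # one entry per simulated "read from disk"

def fake_load():
    # Hypothetical loader: stands in for reading a file.
    load_log.append("read")
    return np.arange(12.0).reshape(3, 4)

# A lazy array whose content comes from the deferred loader call.
source = da.from_delayed(dask.delayed(fake_load)(), shape=(3, 4), dtype=float)
derived = source * 2.0

first = derived.compute()
loads_after_first = len(load_log)

# Computing again starts from scratch: the loader runs a second time.
second = derived.compute()
loads_after_second = len(load_log)

print("loads after first compute : {}".format(loads_after_first))
print("loads after second compute: {}".format(loads_after_second))
```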


## Benefits
The benefit is clearest when the data is very large, e.g. too big to fit in memory.  
In that case, the result of a calculation cannot be computed as one complete array in memory.

Instead:
  * selected smaller regions can be extracted and realised for use, e.g. to make a plot
    * (which can be repeated to cover a dataset in sections)
  * statistical summaries can be calculated, which are usually much smaller than the full data and can be worked with directly.
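
Both points can be sketched with a Dask array directly. The array below is an assumption for illustration (all ones, about 3.2 GB if it were realised in full); only the requested region, or the small statistical result, is ever held as a numpy array.

```python
import dask.array as da

# A lazy array that would occupy about 3.2 GB if fully realised.
big = da.ones((20000, 20000), chunks=(1000, 1000))
size_gb = big.nbytes / 1e9

# Realise only a small region: just the chunks it touches are created.
region = big[20:22, 300:303].compute()

# A statistical summary is computed chunk by chunk; the result is small.
col_mean = big.mean(axis=0).compute()

print("full size if realised (GB): {:0.1f}".format(size_gb))
print("region shape  : {}".format(region.shape))
print("summary shape : {}".format(col_mean.shape))
```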

# User-defined lazy calculations

Now let's calculate wind directions (angles).

There is no cube arctan function in Iris, but we can use Dask instead.  
We construct the result just as if we were using numpy.

In [7]:
import dask
import dask.array as da

# Calculate arctan2(u, v): the angle (in radians) whose tangent is u / v
lazy_winddirs = da.arctan2(u_cube.lazy_data(), v_cube.lazy_data())
# Convert to degrees
lazy_winddirs = da.rad2deg(lazy_winddirs)
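
As a quick check of the argument order, here is a minimal sketch with four made-up (u, v) pairs. `da.arctan2(u, v)` follows the numpy convention, so the first argument plays the role of "y" and the second of "x".

```python
import numpy as np
import dask.array as da

# Made-up component values, chosen so the angles are easy to verify.
u = da.from_array(np.array([1.0, 0.0, -1.0, 1.0]), chunks=2)
v = da.from_array(np.array([0.0, 1.0, 0.0, 1.0]), chunks=2)

# Still lazy: nothing is calculated until compute() is called.
angles_lazy = da.rad2deg(da.arctan2(u, v))

angles = angles_lazy.compute()
print(angles)  # 90, 0, -90 and 45 degrees
```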

### Put these angles into a cube.
We copy the windspeed cube as it will have the right shape and other metadata.  
Then just put the lazy data inside, and fix up the name and units.

In [8]:
# Make a cube suitable to hold wind direction angles.
cube_winddir = cube_windspeed.copy()
cube_winddir.rename('wind_direction')
cube_winddir.units = 'degrees'

# Insert lazy values as cube data.
cube_winddir.data = lazy_winddirs

print cube_winddir

wind_direction / (degrees)          (time: 120; latitude: 215; longitude: 360)
     Dimension coordinates:
          time                           x              -               -
          latitude                       -              x               -
          longitude                      -              -               x
     Auxiliary coordinates:
          forecast_period                x              -               -
     Scalar coordinates:
          forecast_reference_time: 1859-09-01 00:00:00


## Statistics

We now make some statistics based on the calculated windspeed and direction, using the Iris MEAN and STD_DEV operations.

Note that these statistical operations are lazy-preserving.

(But not ***all*** Iris statistics are lazy-preserving:  
in this version of Iris (2.0.0), MIN and MAX have yet to be made lazy, and various other operations may require 'real' data.  
In such cases, *applying the statistic will load all the data*.)


In [9]:
cube_mean_windspeed = cube_windspeed.collapsed(('latitude', 'longitude'), iris.analysis.MEAN)
cube_stdev_windspeed = cube_windspeed.collapsed(('latitude', 'longitude'), iris.analysis.STD_DEV)
cube_mean_ang = cube_winddir.collapsed(('latitude', 'longitude'), iris.analysis.MEAN)
cube_stdev_ang = cube_winddir.collapsed(('latitude', 'longitude'), iris.analysis.STD_DEV)
print 'Sample cube = mean windspeed:\n', cube_mean_windspeed

Sample cube = mean windspeed:
wind_speed / (m.s-1)                (time: 120)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          forecast_period                x
     Scalar coordinates:
          forecast_reference_time: 1859-09-01 00:00:00
          latitude: 3.8147e-06 degrees, bound=(-90.0, 90.0) degrees
          longitude: 180.5 degrees, bound=(0.5, 360.5) degrees
     Cell methods:
          mean: latitude, longitude




### Check : these are all lazy results
No data has yet been loaded, and no actual calculations have been done.

In [10]:
cubevar_datastates('cube_mean_windspeed', 'cube_stdev_windspeed', 'cube_mean_ang', 'cube_stdev_ang')

cube  cube_mean_windspeed,                             "wind_speed" : data lazy = True
cube cube_stdev_windspeed,                             "wind_speed" : data lazy = True
cube        cube_mean_ang,                         "wind_direction" : data lazy = True
cube       cube_stdev_ang,                         "wind_direction" : data lazy = True
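
The equivalent reductions on a bare Dask array behave the same way. This is a sketch with a small made-up array of shape (time, latitude, longitude); `ddof=1` is an assumption made here to give a sample standard deviation.

```python
import numpy as np
import dask.array as da

# Made-up data with dimensions (time, latitude, longitude).
data = np.arange(24.0).reshape(2, 3, 4)
lazy = da.from_array(data, chunks=(1, 3, 4))

# Collapse the two spatial dimensions: the results are still lazy.
mean_lazy = lazy.mean(axis=(1, 2))
std_lazy = lazy.std(axis=(1, 2), ddof=1)
still_lazy = isinstance(mean_lazy, da.Array) and isinstance(std_lazy, da.Array)

# Only now is anything calculated.
mean_real = mean_lazy.compute()
std_real = std_lazy.compute()

print("lazy before compute: {}".format(still_lazy))
print(mean_real)  # one value per time step: 5.5 and 17.5
```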


## Note that realising *any* of these cubes takes a little time

This is because it reads all the data from disk.  
For example:

In [11]:
temp_mean_ang_cube = cube_mean_ang.copy()

In [12]:
%%timeit -r 1 -n 1
print 'Some data : ', temp_mean_ang_cube.data[10]

Some data :  -4.52267
1 loop, best of 1: 412 ms per loop


### Result :
Although this operation had to scan all the data, only the statistical result is saved.

It makes sense that the source data is not stored, since it could be larger than memory.  
However, it means that the data is re-scanned from disk each time one of the statistics is calculated.

## Time the calculation of each statistic :

In [13]:
import datetime
def timeop(op):
    t0 = datetime.datetime.now()
    op()
    t1 = datetime.datetime.now()
    return (t1 - t0).total_seconds()

stats_cubes = [cube_mean_windspeed, cube_stdev_windspeed, cube_mean_ang, cube_stdev_ang]
assert all([cube.has_lazy_data() for cube in stats_cubes])
time_total = 0.0
for cube in stats_cubes:
    time = timeop(lambda: cube.copy().data)
    print 'Fetch {} = {:0.2f}'.format(cube.name().rjust(30), time)
    time_total += time

print 'Total individual fetch+calculate times        = {:0.2f}'.format(time_total)

Fetch                     wind_speed = 0.47
Fetch                     wind_speed = 0.54
Fetch                 wind_direction = 0.45
Fetch                 wind_direction = 0.47
Total individual fetch+calculate times        = 1.93


# Optimise the calculation of multiple results
Telling Dask to calculate these results at the same time avoids this problem:  
multiple statistics can then be calculated in a single pass through the data.

To do this, we pass the lazy data of all the cubes to Dask in a single `compute` call.

### Time the combined calculation + compare with separate operation timings.

In [14]:
time_combined = timeop(lambda: da.compute(*[cube.lazy_data() for cube in stats_cubes]))
percent_speedup = 100.0*(1.0 - time_combined / time_total)
print 'Combined fetch+calculate time = {:0.2f}'.format(time_combined)
print '  --> speedup ~{:02.0f}%'.format(percent_speedup)

Combined fetch+calculate time = 1.24
  --> speedup ~36%
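
Why the combined call is faster can be sketched with a counting loader. The names here are assumptions for illustration: `fake_load` is a hypothetical stand-in for reading the source data, and `load_log` records each time it runs. When both statistics go into one `dask.compute` call, their graphs share the load step, so it runs once.

```python
import numpy as np
import dask
import dask.array as da

load_log = []  # one entry per simulated "read from disk"

def fake_load():
    # Hypothetical loader: stands in for reading the source data.
    load_log.append("read")
    return np.arange(12.0).reshape(3, 4)

source = da.from_delayed(dask.delayed(fake_load)(), shape=(3, 4), dtype=float)

# Two different lazy statistics that depend on the same source.
mean_lazy = source.mean()
std_lazy = source.std()

# One combined call: the shared load task is executed only once.
mean_val, std_val = dask.compute(mean_lazy, std_lazy)
loads = len(load_log)

print("loads for combined compute: {}".format(loads))
```

Computing `mean_lazy` and `std_lazy` in two separate calls would instead run the loader twice.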
