In [None]:
%matplotlib inline

## Iris introduction course
# 7. Advanced Concepts

**Learning Outcome**: by the end of this section, you will be able to utilise some more advanced parts of Iris's functionality.

**Duration:** 1 hour

**Overview:**<br>
7.1 [Load Callbacks](#callbacks)<br>
7.2 [Categorised Statistics](#categorical)<br>
7.3 [Out-of-core Processing and Lazy Data](#out_of_core_processing)<br>
7.4 [Performance Tricks](#performance)<br>
7.5 [Summary of the Section](#summary)

## Setup

In [None]:
import iris
import iris.quickplot as qplt
import matplotlib.pyplot as plt
import numpy as np

----

## 7.1 Load Callbacks<a id='callbacks'></a>

Sometimes important data exists in a filename rather than in the file itself, and it is desirable for it to become part of the cube's metadata.

For example, some early GloSea4 model runs recorded the "ensemble member number" (or "realization" in CF terms) in the filename, but not in the PP metadata itself.

As a result, loading the following data yields two cubes, rather than a single, fully merged, cube.

In [None]:
fname = iris.sample_data_path('GloSea4', 'ensemble_00[34].pp')
for cube in iris.load(fname, 'surface_temperature'):
    print(cube, '\n', '--' * 40)

To resolve this we can define a function that gets called during the load process. This load callback function must take the following as arguments:

 * a cube,
 * a 2D field - either a PP field, a NetCDF variable or a GRIB message depending on the file format being loaded, and
 * a filename.

In our example, some cubes are missing the `realization` coordinate, so we define a function that parses the filename to identify the ensemble member number and adds this value to the cube as a `realization` coordinate. We pass this function to the load call via the `callback` keyword, and the result is a single, successfully merged cube:

In [None]:
import os
def realization_callback(cube, field, fname):
    basename = os.path.basename(fname)
    if not cube.coords('realization') and basename.startswith('ensemble_'):
        cube.add_aux_coord(iris.coords.DimCoord(np.int32(basename[-6:-3]),
                                                'realization'))

print(iris.load_cube(fname, callback=realization_callback))
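
The filename parsing inside the callback can be tried in isolation. The following is a minimal sketch using only the standard library: the helper name `realization_from_filename` and the path are hypothetical, and the slice assumes filenames of the form `ensemble_NNN.pp`, as in the sample data above.

```python
import os


def realization_from_filename(fname):
    """Return the ensemble member number encoded in fname, or None."""
    basename = os.path.basename(fname)
    if not basename.startswith('ensemble_'):
        return None
    # For 'ensemble_003.pp', the slice [-6:-3] picks out '003':
    # the three characters immediately before the '.pp' extension.
    return int(basename[-6:-3])


# Hypothetical path, used only to illustrate the parsing.
print(realization_from_filename('/data/GloSea4/ensemble_003.pp'))
```

Because the slice counts back from the end of the name, it only works while the extension is three characters long and the member number is zero-padded to three digits.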

----

## 7.2 Categorical Coordinates for Grouped Statistics<a id='categorical'></a>

Sometimes we want to be able to categorise data before performing statistical operations on it.

For example, we might want to categorise our data by "daylight maximum" or "seasonal mean" etc. Both of these categorisations would be based on the time coordinate.

The <a href='https://scitools.org.uk/iris/docs/latest/iris/iris/coord_categorisation.html'>iris.coord_categorisation</a> module provides convenience functions to add some common categorical coordinates, and provides a generalised function to allow easy creation of custom categorisations.

Let's load in a cube that represents the monthly surface temperature from April 2006 through to October 2010.

In [None]:
import iris.coord_categorisation as coord_cat

filename = iris.sample_data_path('ostia_monthly.nc')
cube = iris.load_cube(filename, 'surface_temperature')
print(cube)

Let's add a categorisation coordinate to this cube to identify the climatological season (i.e. "djf", "mam", "jja" or "son") of each time point:

In [None]:
coord_cat.add_season(cube, 'time', name='clim_season')
print(cube)

As you can see in the printout of the cube above, we now have an extra coordinate called `clim_season`.

Let's print the coordinate out to take a closer look:

In [None]:
print(cube.coord('clim_season'))

Now that we have a coordinate representing the climatological season, we can use the cube's ``aggregated_by`` method to "group by and aggregate" on the season, to produce a new cube that represents the seasonal mean:

In [None]:
seasonal_mean = cube.aggregated_by('clim_season', iris.analysis.MEAN)
print(seasonal_mean)
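
Conceptually, "group by and aggregate" means collecting all the values that share a category label and reducing each group to a single number. The following is a plain-Python sketch of that idea only, not Iris's implementation: the function name `seasonal_means`, the lookup table and the sample values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Map each calendar month to its climatological season label.
SEASON_OF_MONTH = {12: 'djf', 1: 'djf', 2: 'djf',
                   3: 'mam', 4: 'mam', 5: 'mam',
                   6: 'jja', 7: 'jja', 8: 'jja',
                   9: 'son', 10: 'son', 11: 'son'}


def seasonal_means(months, values):
    """Group values by the season of their month and average each group."""
    groups = defaultdict(list)
    for month, value in zip(months, values):
        groups[SEASON_OF_MONTH[month]].append(value)
    return {season: mean(group) for season, group in groups.items()}


# Made-up monthly values, purely for illustration.
months = [12, 1, 2, 6, 7, 8]
values = [2.0, 1.0, 3.0, 15.0, 17.0, 16.0]
print(seasonal_means(months, values))
```

Note that every December, January and February value lands in the same `'djf'` group regardless of year, which is what makes the result a climatological seasonal mean.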

We can take this further by extracting the winter season, using our newly created coordinate, and producing a plot of the winter zonal mean:

In [None]:
winter = seasonal_mean.extract(iris.Constraint(clim_season='djf'))

qplt.plot(winter.collapsed('latitude', iris.analysis.MEAN))
# Raw string, so the LaTeX backslashes are not treated as escape sequences.
plt.title(r'Winter zonal mean surface temperature at $\pm5^{\circ}$ latitude')
plt.show()

<div class="alert alert-block alert-warning">
    <b><font color="brown">Exercise: </font></b>
    <p>Calculate the yearly maximum surface_temperature.</p>
    <p>Take a look at the documentation for <a href='https://scitools.org.uk/iris/docs/latest/iris/iris/coord_categorisation.html'>iris.coord_categorisation</a> to work out how to add a coordinate that represents the year to <font face='courier'>cube</font>, then calculate the maximum.</p>
</div>

In [None]:
#
# edit space for user code ...
#

In [None]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.2a

-----

## 7.3 Out-of-core Processing<a id='out_of_core_processing'></a>

[Out-of-core processing](https://en.wikipedia.org/wiki/External_memory_algorithm) is a technical term that describes being able to process datasets that are too large to fit in memory at once. In Iris, this functionality is referred to as **lazy data**. It means that you can use Iris to load, process and save datasets that are too large to fit in memory without running out of memory. This is achieved by loading only the dataset's metadata and not the data array, unless this is specifically requested.

To determine whether your cube has lazy data, you can use the `has_lazy_data` method:

In [None]:
fname = iris.sample_data_path('air_temp.pp')
cube = iris.load_cube(fname)
print(cube.has_lazy_data())

Iris tries to maintain lazy data as much as possible. We refer to the operation of loading a cube's lazy data as 'realising' the cube's data. A cube's lazy data will only be loaded in a limited number of cases, including:

* when the user directly requests the cube's data using `cube.data`,
* when there is no lazy data processing algorithm available to perform the requested data processing, such as for peak finding, and
* where actual data values are necessary, such as for cube plotting.

In [None]:
cube.data
print(cube.has_lazy_data())

Above we have triggered the data to be loaded into memory by calling `cube.data`.
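
The behaviour above can be mimicked with a toy class. This is a conceptual sketch only and not how Iris implements lazy data: the class `LazyValues` and its loader function are hypothetical, and just the names `has_lazy_data` and `data` are borrowed from the cube API for familiarity.

```python
class LazyValues:
    """Toy stand-in for a cube: holds a loader, not the values themselves."""

    def __init__(self, loader):
        self._loader = loader   # callable that fetches the real values
        self._values = None     # nothing has been loaded yet

    def has_lazy_data(self):
        return self._values is None

    @property
    def data(self):
        # The first access "realises" the data; later accesses reuse it.
        if self._values is None:
            self._values = self._loader()
        return self._values


# The lambda stands in for an expensive read from disk.
lazy = LazyValues(lambda: [280.5, 281.0, 279.8])
print(lazy.has_lazy_data())   # nothing loaded yet
lazy.data                     # touching .data triggers the load
print(lazy.has_lazy_data())
```

Only the metadata-like wrapper exists until `.data` is touched, which is why a cube can be loaded and inspected cheaply.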

<div class="alert alert-block alert-warning">
    <b><font color="brown">Exercise: </font></b>
    <p>Load the <font face='courier'>sea_water_potential_temperature</font> cube from the file <font face='courier'>iris.sample_data_path('atlantic_profiles.nc')</font>. Does this cube have lazy data?<br>Calculate the mean over the depth coordinate. Does the cube still have lazy data?<br>Create a blockplot (pcolormesh) of the resulting 2D cube. Does the cube still have lazy data?</p> 
</div>

In [None]:
#
# edit space for user code ...
#

In [None]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.3a

----

## 7.4 Performance Tricks<a id='performance'></a>

This section details a few common tricks to improve the performance of your Iris code:

 * Data loading.
 * Load once, extract many times.

### Make Use of Deferred Loading of Data

Sometimes it makes sense to load data before doing operations on it; at other times it makes sense to reduce the data before loading it.

We define a simple function that applies some processing to the cube:

In [None]:
def zonal_sum(cube):
    """
    A really silly function to calculate the sum of the grid_longitude
    dimension.
    Don't use this in real life, instead consider doing:
    
        cube.collapsed('grid_longitude', iris.analysis.SUM)
    
    """
    total = 0
    for i, _ in enumerate(cube.coord('grid_longitude')):
        total += cube[..., i].data
    return total

First, let's try loading in our cube, and then applying the `zonal_sum` function to the cube whilst it still has lazy data.

In [None]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
result = zonal_sum(pt)

Now let's try doing the same thing, but this time we tell Iris to load the data into memory prior to applying the function.

In [None]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
pt.data
result = zonal_sum(pt)

As you can see, loading all the data upfront was much faster.
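
A likely explanation is that each `cube[..., i].data` on a lazy cube has to go back to the file, whereas a realised cube is sliced in memory. The toy model below illustrates that difference by counting reads. It is a sketch under that assumption, not a reproduction of Iris internals, and `CountingSource`, `row_totals_lazy` and `row_totals_realised` are hypothetical names.

```python
class CountingSource:
    """Toy stand-in for a file: counts how often it is read in full."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def read(self):
        self.reads += 1
        return [list(row) for row in self._rows]


def _accumulate(total, column):
    # Add one column onto the running per-row totals.
    return column if total is None else [t + c for t, c in zip(total, column)]


def row_totals_lazy(source, n_columns):
    total = None
    for i in range(n_columns):
        # Every column access goes back to the source, as a lazy slice would.
        total = _accumulate(total, [row[i] for row in source.read()])
    return total


def row_totals_realised(source, n_columns):
    rows = source.read()  # read everything once, then slice in memory
    total = None
    for i in range(n_columns):
        total = _accumulate(total, [row[i] for row in rows])
    return total


rows = [[1, 2, 3], [4, 5, 6]]
lazy_source, eager_source = CountingSource(rows), CountingSource(rows)
assert row_totals_lazy(lazy_source, 3) == row_totals_realised(eager_source, 3)
print(lazy_source.reads, eager_source.reads)
```

Both functions return the same totals, but the lazy variant reads the source once per column while the realised variant reads it once in total.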

### Load Once, Extract Many Times

Iris loading can be slow, particularly if the format stores 2D fields of a conceptually higher-dimensional dataset, as is the case with GRIB and PP.

To maximise load speed and avoid unnecessary processing, it is worth constraining to the fields of interest *at load time*. However, there is no caching, so loading the same file twice takes twice as long.

Let's compare loading data on a select number of model levels.

In [None]:
fname = iris.sample_data_path('uk_hires.pp')
model_levels = [1, 4, 7, 16]

First, let's constrain to our chosen model levels at load time:

In [None]:
%%timeit
for model_level in model_levels:
    pt = iris.load_cube(fname,
                        iris.Constraint('air_potential_temperature',
                                        model_level_number=model_level))

Now, let's first load in the file, then extract each of our model levels.

In [None]:
%%timeit
cubes = iris.load(fname)
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint('air_potential_temperature',
                                       model_level_number=model_level))

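As you can see, loading the file once and then extracting each model level is faster than reloading the file for every model level.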

For files with lots of different phenomena this can be improved further by loading only the phenomenon of interest (and in this case just the model levels of interest):

In [None]:
%%timeit
# iris.load returns a CubeList, so name the result accordingly.
cubes = iris.load(fname,
                  iris.Constraint('air_potential_temperature',
                                  model_level_number=model_levels))
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint(model_level_number=model_level))

## 7.5 Summary of Section: Advanced Concepts<a id='summary'></a>

In this section we learnt:
* load callbacks can be used to capture additional metadata during loading
* special facilities are provided for performing categorical statistics
* Iris uses lazy data and out-of-core processing to handle data that is too large to fit into memory
* lazy loading can be used to enhance code performance
