Prerequisites:
 - Data is available via a local file path (e.g. using S3FS to mount the S3 bucket).
 - Iris, dask and distributed are installed.
 - A dask distributed cluster is running.

In [26]:
import iris
from distributed import Client

c = Client('172.31.18.5:8786')
c

<Client: scheduler="172.31.18.5:8786" processes=40 cores=40>

## Defining which files to load
One model run is made up of 696 files and ~25GB.
There are model runs at 00, 06, 12, 18 each day.

On network storage, Iris loads roughly one file every 3 seconds per core, so it's important to work on an appropriately sized cluster.

If you're working on a small, fixed dataset (e.g. always analysing the same model run) it will probably be faster to download the whole dataset once and work on local disks.
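
As a rough guide to cluster sizing, the figures above can be turned into a back-of-envelope estimate. This is only a sketch: it assumes loading scales linearly with the number of cores and ignores scheduling overhead, and `estimate_load_seconds` is a hypothetical helper, not part of Iris or dask.

```python
def estimate_load_seconds(n_files, n_cores, seconds_per_file=3.0):
    """Rough wall-clock time to load n_files, assuming perfect scaling across cores."""
    return n_files * seconds_per_file / n_cores


# One model run (696 files) on a single core versus a 40-core cluster.
single_core = estimate_load_seconds(696, 1)
cluster = estimate_load_seconds(696, 40)

print('1 core:   {:.0f} s (about {:.0f} min)'.format(single_core, single_core / 60))
print('40 cores: {:.0f} s'.format(cluster))
```

Under those assumptions a single core needs about 35 minutes for one run, while 40 cores need under a minute.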

In [3]:
month = '01'
day = '01'
run = '00'
prefix = 'prods_op_mogreps-g_2016{}{}_{}'.format(month, day, run)
prefix

'prods_op_mogreps-g_20160101_00'
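
Since there are runs at 00, 06, 12 and 18 each day, the same pattern extends to every run of a given day. A minimal sketch, assuming the prefix format shown above holds for all runs; `run_prefixes` is a hypothetical helper introduced only for illustration.

```python
def run_prefixes(month, day, runs=('00', '06', '12', '18'), year='2016'):
    """Build one key prefix per model run for the given day."""
    template = 'prods_op_mogreps-g_{}{}{}_{}'
    return [template.format(year, month, day, run) for run in runs]


prefixes = run_prefixes('01', '01')
for p in prefixes:
    print(p)
```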

We use boto to list the available keys, as it's much faster than listing files on the network mount. We then map the keys found onto local file paths.

(Warning: some boto commands return results in pages of 1000 keys without saying so, so check that a listing is complete.)
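
One way to guard against a silently truncated listing is to keep requesting pages until the result reports that it is no longer truncated. The sketch below assumes boto 2's `Bucket.get_all_keys(prefix=..., marker=...)`, whose result set carries an `is_truncated` flag, as far as we know. `FakeKey`, `FakePage` and `FakeBucket` are stand-ins invented here so the loop can run without S3 access, and `list_all_key_names` is a hypothetical helper.

```python
from collections import namedtuple

FakeKey = namedtuple('FakeKey', ['name'])  # stand-in for a boto Key


class FakePage(list):
    """Stand-in for a boto result set: a list of keys plus a truncation flag."""
    def __init__(self, keys, is_truncated):
        super().__init__(keys)
        self.is_truncated = is_truncated


class FakeBucket:
    """Stand-in bucket that serves sorted keys in pages of `page_size`."""
    def __init__(self, names, page_size=1000):
        self.names = sorted(names)
        self.page_size = page_size

    def get_all_keys(self, prefix='', marker=''):
        matches = [n for n in self.names if n.startswith(prefix) and n > marker]
        page = matches[:self.page_size]
        return FakePage([FakeKey(n) for n in page], len(matches) > len(page))


def list_all_key_names(bucket, prefix=''):
    """Follow the marker from page to page until the listing is complete."""
    names, marker = [], ''
    while True:
        page = bucket.get_all_keys(prefix=prefix, marker=marker)
        names.extend(key.name for key in page)
        if not page or not page.is_truncated:
            return names
        marker = names[-1]  # resume after the last key seen


bucket = FakeBucket(['run_{:04d}'.format(i) for i in range(2500)])
first_page = bucket.get_all_keys(prefix='run_')
all_names = list_all_key_names(bucket, prefix='run_')
print(len(first_page), len(all_names))
```

A single call sees only the first page, whereas the loop collects every key. Comparing `len()` of a listing against the expected 696 files per run is a cheap additional check.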

In [None]:
from boto.s3.connection import S3Connection
import os

os.environ['S3_USE_SIGV4'] = 'True'

def list_files(bucket_name, prefix, local_path='/usr/local/share/notebooks/data/mogreps-g/'):
    # List the keys through the S3 API, then map each key onto its path under the local mount.
    conn = S3Connection(host='s3.eu-west-2.amazonaws.com')
    bucket = conn.get_bucket(bucket_name)
    return ['{}{}'.format(local_path, k.key) for k in bucket.list(prefix=prefix)]


in_files = list_files('mogreps-g', prefix)
print(len(in_files))
in_files[:10]

## Loading data with dask

Here we create a dask bag; in this case it's a list of instructions to run the 'load_cubes' function on each input file.

If we wanted to run this locally, we could just run iris.load directly on the list of file paths.

In [13]:
# Create a dask bag (db): a list of instructions to run 'load_cubes' on each input file.
from iris.cube import CubeList
from dask import delayed
import dask.bag as db

@delayed
def load_cubes(address):
    def add_realization(cube, field, filename):
        # Cubes without a realization coord get one, parsed from the file name.
        if not cube.coords('realization'):
            realization = int(filename.split('_')[-2])
            realization_coord = iris.coords.AuxCoord(realization, standard_name='realization', var_name='realization')
            cube.add_aux_coord(realization_coord)
        # Normalise the realization dtype and the coordinate var_names.
        realization_coord = cube.coord('realization')
        realization_coord.points = realization_coord.points.astype('int32')
        cube.coord('time').var_name = 'time'
        cube.coord('forecast_period').var_name = 'forecast_period'
    return iris.load(address, callback=add_realization)

delayed_cubes = db.from_delayed([load_cubes(f) for f in in_files])
delayed_cubes

dask.bag<bag-fro..., npartitions=696>
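
The callback above takes the realization (ensemble member) from the file name: it is the second-to-last underscore-separated field. The sketch below isolates that step. The example path is hypothetical, chosen only to be consistent with the prefix format used earlier, and `realization_from_filename` is not part of Iris.

```python
def realization_from_filename(filename):
    """Parse the ensemble member number using the same rule as the load callback."""
    return int(filename.split('_')[-2])


# Hypothetical file name: <prefix>_<realization>_<last field>.nc
example = ('/usr/local/share/notebooks/data/mogreps-g/'
           'prods_op_mogreps-g_20160101_00_03_006.nc')
member = realization_from_filename(example)
print(member)
```

If the files followed a different naming scheme, the index in `split('_')[-2]` would need adjusting accordingly.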

In [14]:
delayed_cubes.take(1)

(<iris 'Cube' of wet_bulb_freezing_level_altitude / (m) (latitude: 600; longitude: 800)>,)
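
The deferred behaviour of the bag is easier to see on a toy function. In this sketch `fake_load` is a hypothetical stand-in for `load_cubes`, and the synchronous scheduler is used so that it runs without a cluster; nothing is executed until a result is requested.

```python
import dask.bag as db
from dask import delayed

calls = []  # records when fake_load actually runs


@delayed
def fake_load(path):
    calls.append(path)
    # Each delayed call must return an iterable; it becomes one bag partition.
    return [path.upper()]


paths = ['a.nc', 'b.nc', 'c.nc']
bag = db.from_delayed([fake_load(p) for p in paths])
n_calls_before = len(calls)  # building the bag has not run anything yet

loaded = bag.compute(scheduler='synchronous')
print(n_calls_before, len(calls), loaded)
```

As with the 696-partition bag above, each input file maps to one partition, so the work can be spread across however many cores the cluster provides.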

We know we want to load each of these files, so we'll tell the cluster to compute and persist the bag:

In [20]:
c.restart()

<Client: scheduler="172.31.18.5:8786" processes=0 cores=0>

In [21]:
p_cubes = c.persist(delayed_cubes)

In [16]:
p_cubes.take(1)

(<iris 'Cube' of wet_bulb_freezing_level_altitude / (m) (latitude: 600; longitude: 800)>,)

By default Iris loads only the metadata and defers reading the data itself, so as long as we don't touch the data we can work with these cubes locally.

In [18]:
list(p_cubes)[:20]

distributed.utils - ERROR - ("('bag-from-delayed-1cf94a67065d9276052e98d5f8f0c93d', 319)", '172.31.31.51:42224')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/site-packages/distributed/utils.py", line 149, in f
    result[0] = yield gen.maybe_future(func(*args, **kwargs))
  File "/opt/conda/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/opt/conda/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/opt/conda/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.5/site-packages/distributed/client.py", line 896, in _gather
    st.traceback)
  File "/opt/conda/lib/python3.5/site-packages/six.py", line 686, in reraise
    raise value
distributed.scheduler.KilledWorker: ("('bag-from-delayed-1cf94a67065d9276052e98d5f8f0c93d', 319)", 

KilledWorker: ("('bag-from-delayed-1cf94a67065d9276052e98d5f8f0c93d', 319)", '172.31.31.51:42224')

We publish this loaded dataset to the cluster, meaning future users don't need to repeat the loading computations:

In [None]:
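# Remove any previously published copy first, then publish the persisted bag.
if 'mogreps' in c.list_datasets():
    c.unpublish_dataset('mogreps')
c.publish_dataset(mogreps=p_cubes)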

In [None]:
c.list_datasets()