
Conversation

@corinnebosley
Member

In this new page of the user guide, I have attempted to cover:

  • What is real and lazy data?
  • When does my lazy data become real?
  • What is 'core data'?
  • Different methods of changing a cube's data (cube maths, copy, replace) and their respective consequences
  • Some dask processing options and what they might be useful for (very briefly of course, because I don't fully understand myself...)
  • Some links to the dask and distributed docs.

@corinnebosley
Member Author

corinnebosley commented May 25, 2017

@dkillick @pp-mo @bjlittle Apologies for my previous PR, clearly not used to working on the dask branch.

Feel free to tear into this draft of the docs, that's what it's there for. Don't beat around the bush, I won't take anything personally.

I will also leave open the option to allow edits directly to this PR. It may turn out to be a more efficient method of working.

Member

@DPeterK DPeterK left a comment

@corinnebosley this is a great starting point for the new userguide chapter 👍 There's plenty to think about still, but there always is when updating user docs!

>>> cube = iris.load_cube(filename, 'air_temp.pp')
>>> cube.has_lazy_data()
True
>>> _ = cube.data
Member

You don't actually need to assign this call into anything - just calling cube.data is sufficient to touch (and load) the data.
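The point above can be sketched with plain dask standing in for iris' lazy backend (this is an illustrative analogy, not iris API):

```python
import dask.array as da

# A lazy array, analogous to a freshly loaded cube's data.
x = da.zeros((3, 3), chunks=(3, 3))

# Simply evaluating the expression is enough to do the work; the result
# does not need to be assigned to anything (mirroring a bare `cube.data`).
x.compute()
```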

What is Real and Lazy Data?
---------------------------

Every Iris cube contains an n-dimensional data array, which could be real or
Member

You've got two spaces between "dimensional" and "data".

Every Iris cube contains an n-dimensional data array, which could be real or
lazy.

Real data is contained in an array which has a shape, a data type, some other
Member

Might be worth mentioning "NumPy" in this line.

about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.

Arrays in Iris can be converted flexibly(?) between their real and lazy states,
Member

I think you could get away without "flexibly" here.

Arrays in Iris can be converted flexibly(?) between their real and lazy states,
although there are some limits to this process. The advantage of using lazy
data is that it has a small memory footprint, so certain operations
(such as...?) can be much faster. However, in order to perform other
Member

The advantage is quite right, but the application is not so accurate: lazy data allows you to load and manipulate datasets that you would otherwise not be able to fit into memory.

As it happens, it is likely that operations will be sped up, but this could be considered an indirect benefit:

  • the calculations will not be performed immediately as you execute the lines of code that define the calculations. Instead, they are deferred until you request a realised result (this happens when you call compute on a dask array)
  • dask's parallel-processing schedulers allow the calculations to be performed in parallel when the realised result is requested, which can bring a significant performance improvement. However, it is not directly related to the smaller memory footprint.

I think both of these points are also worth adding here as they build the picture of the benefits of lazy data and its processing.
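These two behaviours can be sketched with plain dask (assuming dask is installed; the array and its size are illustrative):

```python
import dask.array as da

# Building the expression is cheap: nothing is computed yet, only a
# task graph describing the calculation is constructed.
x = da.ones((1000, 1000), chunks=(100, 100))
total = (x * 2).sum()          # still lazy

# Realisation happens here; dask's scheduler may process chunks in parallel.
result = total.compute()
print(result)                  # 2000000.0
```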

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> the_data = cube.core_data
Member

I believe core_data() is now a method that needs to be called.

Member Author

It seems to work both with and without the brackets. That's pretty confusing.

Please refer to the Whitepaper for further details.


Dask Processing Options
Member

You know, I think we could do with a little section here on coords, too. They've been updated now as well, and while DimCoords can only have real data (because of the monotonicity checks that must be run), AuxCoords can now have lazy data, which is a bit of a big deal, I reckon.


.. doctest::

>>> dask.set_options(get=dask.get)
Member

I think this should read dask.async.get_sync

Member Author

@corinnebosley corinnebosley May 26, 2017

Which bit? Do you mean like this:

>>> dask.set_options(get=dask.async.get_sync)

?

Member Author

That was not how I expected that comment to look...


There are some default values which are set by Dask and passed through to Iris.
If you wish to change these options, you can override them globally or using a
context manager.
Member

Dask options are only controllable by calling dask.set_options, which isn't quite clear from this sentence. The changes you specify with this call will apply to all uses of dask for the lifetime of your "session" (be it a Python session or the extent of the context manager).
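A sketch of the session-wide vs context-manager distinction. Note this PR predates dask's configuration overhaul: `dask.set_options` was later replaced by `dask.config.set`, which is what current dask releases provide, so that is what the sketch uses (the scheduler names are standard dask options):

```python
import dask

# Session-wide: applies to all subsequent dask computations.
dask.config.set(scheduler="synchronous")

# Scoped: only computations inside the `with` block are affected.
with dask.config.set(scheduler="threads"):
    pass  # work here would use the threaded scheduler

# Outside the block, the session-wide setting is back in force.
```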


As well as Dask offering the benefit of a smaller memory footprint through the
handling of lazy arrays, it can significantly speed up performance by allowing
Iris to use multiprocessing.
Member

I think I'd call this parallel processing rather than multiprocessing, which is one parallel processing option available to users.

@QuLogic QuLogic added this to the dask milestone May 25, 2017
Member

@pp-mo pp-mo left a comment

All really good work, but I'm not convinced that most of this detail needs to go in the User Guide.

I think we need to keep the User Guide absolutely as short as practical.

In this case, I believe that actually most users don't need to know anything about lazy operations except: (1) roughly what 'lazy' means; (2) it's only relevant to performance with big data; (3) there is more detail elsewhere.

So, although it is quite long + gentle, maybe most of this belongs in the Whitepaper space instead?

'touch' the data. This means any way of directly accessing the data, such as
assigning it to a variable or simply using 'cube.data' as in the example above.

Any action which requires the use of actual data values (such as cube maths)
Member

Cube maths is not a good example: arithmetic operations don't realise.
Example non-lazy operations could be plotting or regridding?

>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
Member

I think "back into" is misleading, because it implies that it puts the cube "back the way it was", which is absolutely not the case: if you convert real data to lazy, it's only a wrapper: the data is still in memory and the cube no longer gets its data from the file.
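The "only a wrapper" point can be sketched with plain numpy and dask (illustrative, not iris API):

```python
import numpy as np
import dask.array as da

real = np.arange(10.0)                 # real data, in memory
lazy = da.from_array(real, chunks=5)   # a lazy wrapper around that array

# The wrapper is lazy, but `real` has not been freed: the data is still
# in memory, and the lazy array no longer comes from any file.
assert lazy.sum().compute() == real.sum()
```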

>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
Member

In my head "realized data" is maybe not good grammar, and simply "real data" is more accurate: so I'd prefer it to say that a cube or coord's points/bounds can be "realized" -- the content is just a "real array".

>>> cube.has_lazy_data()
True

Core data refers to the current state of the cube's data, be it real or
Member

@pp-mo pp-mo May 29, 2017

Don't much like "state of the cube's data", it sounds like it is telling you what state it is in.
Perhaps prefer "Core data returns the cube's current data array, of whichever type, be it real or lazy."

Member

So "state" is my suggested terminology for core_data, although maybe its use here is a little clunky. I mean, the state of the cube's data is likely to change over the lifetime of the cube, between being in a lazy state and a real state.

I'll have a go at using "state" differently here and if it still isn't liked then maybe we could go for something a little more like what @pp-mo has suggested.


Core data refers to the current state of the cube's data, be it real or
lazy. This can be used if you wish to refer to the data array but are
indifferent to its current state. If the cube's data is lazy, it will not be
Member

@pp-mo pp-mo May 29, 2017

Again, don't much like this use of "state".
Maybe "This can be used when you want to use the data array with some operation that applies regardless of whether it is real or lazy, such as indexing it".


You can copy a cube and assign a completely new data array to the copy. All the
original cube's metadata will be the same as the new cube's metadata. However,
the new cube's data array will not be lazy if you replace it with a real array:
Member

You could make it clear here that you can provide a lazy array.
However, in certain cases you will need to use 'replace()' instead, described next...

[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]]
>>> new_cube = cube.copy(data=data)
Member

Needs a replace!

Coordinate Arrays
-----------------

Cubes possess coordinate arrays as well as data arrays, so these also benefit
Member

I'd put this as "Like cubes, coordinates also contain data arrays (points and bounds)."

Dask applies some default values to certain aspects of the parallel processing
that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.

Member

Worth describing exactly what Dask options Iris sets first, linking it directly to the account in the Dask docs.
It may also be worth explaining how Iris "steps aside" if you make your own set_options call.

that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.

You can use this as a global variable if you wish to use your chosen option for
Member

@pp-mo pp-mo May 29, 2017

I'm worried that, while maybe useful, all this remaining text is basically duplicating Dask documentation.
Apart from the general problems with copying and coupling, I think we really want the user to refer to Dask docs at this point + not just copy our magic recipes without a deeper understanding.


Lazy data is contained in a conceptual array which retains the information
about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.
Member

@corinnebosley Perhaps format Dask as a link to the dask readthedocs, so the user can follow this if they want ...


The advantage of using lazy data is that it has a small memory footprint which
enables the user to load and manipulate datasets that you would otherwise not
be able to fit into memory.
Member

@corinnebosley Perhaps introduce the concepts of deferred loading i.e. keeping large data on disk and not in memory

data values) the real data must be realized. Using Dask, the operation will
be deferred until you request the result, at which point it will be executed
using Dask's parallel processing schedulers. The combination of these two
behaviours can offer a significant performance boost.
Member

@corinnebosley Introduce the concept (and use the term) lazy evaluation ...

>>> cube.has_lazy_data()
False

When does my Data Become Real?
Member

@bjlittle bjlittle May 30, 2017

When does my data become real? ... capitalization police

Member

Actually, this is written in title case, which is fine... other than the fact that the rest of the user guide doesn't appear to use title case.


Cubes possess coordinate arrays as well as data arrays, so these also benefit
from Dask's functionality, although there are some distinctions between how
the different coordinate types are treated.
Member

I think that we need to make a clear statement that coordinates etc are now lazy, just like the data of a cube, rather than mentioning that we use dask, which implies lazy behaviour is supported ... subtle, but perhaps important.

You can check whether the data array on your cube is lazy using the Iris
function 'has_lazy_data'. For example:

.. doctest::
Member

We hardly use the doctest directive anywhere else in the userguide, and this implementation is incorrect (no accompanying testsetup directive). We could do with a decision on how much, if at all, we use doctest in this userguide chapter.

@DPeterK DPeterK added the dask label May 31, 2017
A dask array also has a shape and data type
but typically the dask array's data points are not loaded into memory.
Instead the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary.
Member

small chunks -> did we want a link here to the dask readthedocs section on chunks? If we use dask terms, we should point users off to their documentation.
absolutely necessary -> did we want to qualify this somehow here, i.e. explicitly loading, or crunching numbers to do an actual computation (not a lazy computation)?
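For reference, chunking is visible directly on a dask array (an illustrative sketch; the sizes are arbitrary):

```python
import dask.array as da

# 16 chunks of 1000x1000; each chunk can be loaded and computed
# independently, so only a few chunks need be in memory at any one time.
x = da.ones((4000, 4000), chunks=(1000, 1000))
print(x.chunks)        # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))
print(x.npartitions)   # 16
```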

small chunks when absolutely necessary.

The primary advantage of using lazy data is that it enables the loading and manipulating
of datasets that would otherwise not fit into memory.
Member

Reference out-of-core processing i.e. it's a real term that means exactly what we want e.g. see https://en.wikipedia.org/wiki/Out-of-core_algorithm

True
>>> cube.data
>>> cube.has_lazy_data()
False
Member

@bjlittle bjlittle Jun 2, 2017

Did we want to add a note which reinforces that cube.data will make lazy data real by loading it into memory? Users should know this already, but it just makes it explicit.

Did we want to introduce the term realising the data to mean making lazy data real, and use it throughout?

Member

Oh, that would just tee up the next section nicely ...

------------------------------

When you load a dataset using Iris the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
Member

details -> discusses?

Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
Member

Highlight deferred as it's key to this sentence? Is there a link to an explanation in dask for deferred operations? Might be a good opportunity to cross-reference into the dask documentation ...
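The real-vs-lazy contrast described above can be sketched with numpy and dask side by side (illustrative, not iris API):

```python
import numpy as np
import dask.array as da

real = np.arange(6.0)
lazy = da.arange(6.0, chunks=3)

r1 = real * 2          # runs immediately: r1 already holds the results
r2 = lazy * 2          # deferred: r2 is still a lazy dask array

result = r2.compute()  # the deferred work is only performed here
```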


* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
Member

will not realise -> will never realise is a stronger contract ...

will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.
Member

real -> in-memory ? ... or real -> real (or in-memory) ...

-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain
points and bounds arrays. Coordinate points and bounds arrays can also be real or lazy:
Member

bounds are optional ... don't know if "optional" is the right word, but something that implies that a coordinate doesn't need to have bounds but it must have points (you get my drift, right)

can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances that the coordinate is derived from have
real points and bounds then the derived coordinate will also have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.
Member

If all ... Difficult sentence. Totally understand what you mean, just wonder if we can make it simpler ... e.g.

Note that a derived coordinate will only have real points and bounds if all of its associated AuxCoord instances have real points and bounds.

True
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
Member

Would be nice to now show that touching the points doesn't affect the laziness of the coordinate's bounds, i.e. they are still lazy ... that's the case, right?

-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
Member

Should we actually mention the real dask thing, i.e. dask.array, with a link to dask-land?

@bjlittle bjlittle self-assigned this Jun 2, 2017
@bjlittle
Member

bjlittle commented Jun 2, 2017

@corinnebosley and @dkillick Great work! 👍

We've got an outstanding issue with a doc-test failure in the new user guide. I vote that we only merge this PR once we resolve that issue (and rebase this PR with the fix).

You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

>>> cube = iris.load_cube(filename, 'air_temp.pp')
Member

Whoops, we should probably fix 'filename' in this line...

Member Author

Thanks @dkillick, good spot.

@bjlittle
Member

bjlittle commented Jun 7, 2017

@corinnebosley It's all lookin' good!

I forgot to merge #2590 before you did your rebase, duh! 🙄 ... but now that I've done that, could you please rebase again? This should make the failing doc-test pass!

@corinnebosley
Member Author

@bjlittle All passing now.

@bjlittle bjlittle merged commit 9c9cd2f into SciTools:dask Jun 7, 2017
bjlittle added a commit to bjlittle/iris that referenced this pull request Jun 9, 2017
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017