
Conversation

@corinnebosley
Member

In this new page of the user guide, I have attempted to cover:

  • What is real and lazy data?
  • When does my lazy data become real?
  • What is 'core data'?
  • Different methods of changing a cube's data (cube maths, copy, replace) and their respective consequences
  • Some dask processing options and what they might be useful for (very briefly of course, because I don't fully understand myself...)
  • Some links to the dask and distributed docs.

@corinnebosley
Member Author

corinnebosley commented May 25, 2017

@dkillick @pp-mo @bjlittle Apologies for my previous PR, clearly not used to working on the dask branch.

Feel free to tear into this draft of the docs, that's what it's there for. Don't beat around the bush, I won't take anything personally.

I will also leave open the option to allow edits directly to this PR. It may turn out to be a more efficient method of working.

Member

@DPeterK DPeterK left a comment

@corinnebosley this is a great starting point for the new userguide chapter 👍 There's plenty to think about still, but there always is when updating user docs!

>>> cube = iris.load_cube(filename, 'air_temp.pp')
>>> cube.has_lazy_data()
True
>>> _ = cube.data
Member

You don't actually need to assign this call into anything - just calling cube.data is sufficient to touch (and load) the data.
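The point above can be sketched with plain dask standing in for iris' lazy backend (this is an illustrative analogy, not iris API):

```python
import dask.array as da

# A lazy array, analogous to a freshly loaded cube's data.
x = da.zeros((3, 3), chunks=(3, 3))

# Simply evaluating the expression is enough to do the work; the result
# does not need to be assigned to anything (mirroring a bare `cube.data`).
x.compute()
```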

What is Real and Lazy Data?
---------------------------

Every Iris cube contains an n-dimensional data array, which could be real or
Member

You've got two spaces between "dimensional" and "data".

Every Iris cube contains an n-dimensional data array, which could be real or
lazy.

Real data is contained in an array which has a shape, a data type, some other
Member

Might be worth mentioning "NumPy" in this line.

about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.

Arrays in Iris can be converted flexibly(?) between their real and lazy states,
Member

I think you could get away without "flexibly" here.

Arrays in Iris can be converted flexibly(?) between their real and lazy states,
although there are some limits to this process. The advantage of using lazy
data is that it has a small memory footprint, so certain operations
(such as...?) can be much faster. However, in order to perform other
Member

The advantage is quite right, but the application is not so accurate: lazy data allows you to load and manipulate datasets that you would otherwise not be able to fit into memory.

As it happens, it is likely that operations will be sped up, but this could be considered an indirect benefit:

  • the calculations will not be performed immediately as you execute the lines of code that define the calculations. Instead, they are deferred until you request a realised result (this happens when you call compute on a dask array)
  • dask's parallel-processing schedulers allow the calculations to be performed in parallel when the realised result is requested, which can bring a significant performance improvement. However, it is not directly related to the smaller memory footprint.

I think both of these points are also worth adding here as they build the picture of the benefits of lazy data and its processing.
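These two behaviours can be sketched with plain dask (assuming dask is installed; the array and its size are illustrative):

```python
import dask.array as da

# Building the expression is cheap: nothing is computed yet, only a
# task graph describing the calculation is constructed.
x = da.ones((1000, 1000), chunks=(100, 100))
total = (x * 2).sum()          # still lazy

# Realisation happens here; dask's scheduler may process chunks in parallel.
result = total.compute()
print(result)                  # 2000000.0
```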

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> the_data = cube.core_data
Member

I believe core_data() is now a method that needs to be called.

Member Author

It seems to work both with and without the brackets. That's pretty confusing.

Please refer to the Whitepaper for further details.


Dask Processing Options
Member

You know, I think we could do with a little section here on coords, too. They've been updated now as well, and while DimCoords can only have real data (because of the monotonicity checks that must be run), AuxCoords can now have lazy data, which is a bit of a big deal, I reckon.


.. doctest::

>>> dask.set_options(get=dask.get)
Member

I think this should read dask.async.get_sync

Member Author

@corinnebosley corinnebosley May 26, 2017

Which bit? Do you mean like this:

>>> dask.set_options(get=dask.async.get_sync)

?

Member Author

That was not how I expected that comment to look...


There are some default values which are set by Dask and passed through to Iris.
If you wish to change these options, you can override them globally or using a
context manager.
Member

Dask options are only controllable by calling dask.set_options, which isn't quite clear from this sentence. The changes you specify with this call will apply to all uses of dask for the lifetime of your "session" (be it a Python session or the extent of the context manager).
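A sketch of the session-wide vs context-manager distinction. Note this PR predates dask's configuration overhaul: `dask.set_options` was later replaced by `dask.config.set`, which is what current dask releases provide, so that is what the sketch uses (the scheduler names are standard dask options):

```python
import dask

# Session-wide: applies to all subsequent dask computations.
dask.config.set(scheduler="synchronous")

# Scoped: only computations inside the `with` block are affected.
with dask.config.set(scheduler="threads"):
    pass  # work here would use the threaded scheduler

# Outside the block, the session-wide setting is back in force.
```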


As well as Dask offering the benefit of a smaller memory footprint through the
handling of lazy arrays, it can significantly speed up performance by allowing
Iris to use multiprocessing.
Member

I think I'd call this parallel processing rather than multiprocessing, which is one parallel processing option available to users.

@QuLogic QuLogic added this to the dask milestone May 25, 2017
Member

@pp-mo pp-mo left a comment

All really good work, but I'm not convinced that most of this detail needs to go in the User Guide.

I think we need to keep the User Guide absolutely as short as practical.

In this case, I believe that actually most users don't need to know anything about lazy operations except: (1) roughly what 'lazy' means; (2) it's only relevant to performance with big data; (3) there is more detail elsewhere.

So, although it is quite long + gentle, maybe most of this belongs in the Whitepaper space instead?

'touch' the data. This means any way of directly accessing the data, such as
assigning it to a variable or simply using 'cube.data' as in the example above.

Any action which requires the use of actual data values (such as cube maths)
Member

Cube maths is not a good example: arithmetic operations don't realise.
Example non-lazy operations could be plotting or regridding?

>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
Member

I think "back into" is misleading, because it implies that it puts the cube "back the way it was", which is absolutely not the case: if you convert real data to lazy, it's only a wrapper: the data is still in memory and the cube no longer gets its data from the file.
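The "only a wrapper" point can be sketched with plain numpy and dask (illustrative, not iris API):

```python
import numpy as np
import dask.array as da

real = np.arange(10.0)                 # real data, in memory
lazy = da.from_array(real, chunks=5)   # a lazy wrapper around that array

# The wrapper is lazy, but `real` has not been freed: the data is still
# in memory, and the lazy array no longer comes from any file.
assert lazy.sum().compute() == real.sum()
```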

>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
Member

In my head "realized data" is maybe not good grammar, and simply "real data" is more accurate: so I'd prefer it to say that a cube or coord's points/bounds can be "realized" -- the content is just a "real array".

>>> cube.has_lazy_data()
True

Core data refers to the current state of the cube's data, be it real or
Member

@pp-mo pp-mo May 29, 2017

Don't much like "state of the cube's data", it sounds like it is telling you what state it is in.
Perhaps prefer "Core data returns the cube's current data array, of whichever type, be it real or lazy."

Member

So "state" is my suggested terminology for core_data, although maybe its use here is a little clunky. I mean, the state of the cube's data is likely to change over the lifetime of the cube, between being in a lazy state and a real state.

I'll have a go at using "state" differently here and if it still isn't liked then maybe we could go for something a little more like what @pp-mo has suggested.


Core data refers to the current state of the cube's data, be it real or
lazy. This can be used if you wish to refer to the data array but are
indifferent to its current state. If the cube's data is lazy, it will not be
Member

@pp-mo pp-mo May 29, 2017

Again, don't much like this use of "state".
Maybe "This can be used when you want to use the data array with some operation that applies regardless of whether it is real or lazy, such as indexing it".


You can copy a cube and assign a completely new data array to the copy. All the
original cube's metadata will be the same as the new cube's metadata. However,
the new cube's data array will not be lazy if you replace it with a real array:
Member

You could make it clear here that you can provide a lazy array.
However, in certain cases you will need to use 'replace()' instead, described next...

[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]]
>>> new_cube = cube.copy(data=data)
Member

Needs a replace!

Coordinate Arrays
-----------------

Cubes possess coordinate arrays as well as data arrays, so these also benefit
Member

I'd put this as "Like cubes, coordinates also contain data arrays (points and bounds)."

Dask applies some default values to certain aspects of the parallel processing
that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.

Member

Worth describing exactly what Dask options Iris sets first, linking it directly to the account in the Dask docs.
It may also be worth explaining how Iris "steps aside" if you make your own set_options call.

that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.

You can use this as a global variable if you wish to use your chosen option for
Member

@pp-mo pp-mo May 29, 2017

I'm worried that, while maybe useful, all this remaining text is basically duplicating Dask documentation.
Apart from the general problems with copying and coupling, I think we really want the user to refer to Dask docs at this point + not just copy our magic recipes without a deeper understanding.


Lazy data is contained in a conceptual array which retains the information
about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.
Member

@corinnebosley Perhaps format Dask as a link to the dask readthedocs, so the user can follow this if they want ...


The advantage of using lazy data is that it has a small memory footprint which
enables the user to load and manipulate datasets that you would otherwise not
be able to fit into memory.
Member

@corinnebosley Perhaps introduce the concepts of deferred loading i.e. keeping large data on disk and not in memory

data values) the real data must be realized. Using Dask, the operation will
be deferred until you request the result, at which point it will be executed
using Dask's parallel processing schedulers. The combination of these two
behaviours can offer a significant performance boost.
Member

@corinnebosley Introduce the concept (and use the term) lazy evaluation ...

>>> cube.has_lazy_data()
False

When does my Data Become Real?
Member

@bjlittle bjlittle May 30, 2017

When does my data become real? ... capitalization police

Member

Actually, this is written in title case, which is fine... other than the fact that the rest of the user guide doesn't appear to use title case.


Cubes possess coordinate arrays as well as data arrays, so these also benefit
from Dask's functionality, although there are some distinctions between how
the different coordinate types are treated.
Member

I think that we need to make a clear statement that coordinates etc are now lazy, just like the data of a cube, rather than mentioning that we use dask, which implies lazy behaviour is supported ... subtle, but perhaps important.

You can check whether the data array on your cube is lazy using the Iris
function 'has_lazy_data'. For example:

.. doctest::
Member

We hardly use the doctest directive anywhere else in the userguide, and this implementation is incorrect (no accompanying testsetup directive). We could do with a decision on how much, if at all, we use doctest in this userguide chapter.

@DPeterK DPeterK added the dask label May 31, 2017
A dask array also has a shape and data type
but typically the dask array's data points are not loaded into memory.
Instead the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary.
Member

small chunks -> did we want a link here to the dask readthedocs section on chunks? If we use dask terms, we should point users off to their documentation.
absolutely necessary -> did we want to qualify this somehow here, i.e. explicitly loading, or crunching numbers to do an actual computation (not a lazy computation)?
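For reference, chunking is visible directly on a dask array (an illustrative sketch; the sizes are arbitrary):

```python
import dask.array as da

# 16 chunks of 1000x1000; each chunk can be loaded and computed
# independently, so only a few chunks need be in memory at any one time.
x = da.ones((4000, 4000), chunks=(1000, 1000))
print(x.chunks)        # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))
print(x.npartitions)   # 16
```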

small chunks when absolutely necessary.

The primary advantage of using lazy data is that it enables the loading and manipulating
of datasets that would otherwise not fit into memory.
Member

Reference out-of-core processing i.e. it's a real term that means exactly what we want e.g. see https://en.wikipedia.org/wiki/Out-of-core_algorithm

True
>>> cube.data
>>> cube.has_lazy_data()
False
Member

@bjlittle bjlittle Jun 2, 2017

Did we want to add a note which reinforces that cube.data will make lazy data real by loading it into memory? Users should know this already, but it just makes it explicit.

Did we want to introduce the term realising the data to mean making lazy data real, and use it throughout?

Member

Oh, that would just tee up the next section nicely ...

------------------------------

When you load a dataset using Iris the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
Member

details -> discusses?

Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
Member

Highlight deferred as it's key to this sentence? Is there a link to an explanation in dask for deferred operations? Might be a good opportunity to cross-reference into the dask documentation ...
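The real-vs-lazy contrast described above can be sketched with numpy and dask side by side (illustrative, not iris API):

```python
import numpy as np
import dask.array as da

real = np.arange(6.0)
lazy = da.arange(6.0, chunks=3)

r1 = real * 2          # runs immediately: r1 already holds the results
r2 = lazy * 2          # deferred: r2 is still a lazy dask array

result = r2.compute()  # the deferred work is only performed here
```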


* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
Member

will not realise -> will never realise is a stronger contract ...

will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.
Member

real -> in-memory ? ... or real -> real (or in-memory) ...

-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain
points and bounds arrays. Coordinate points and bounds arrays can also be real or lazy:
Member

bounds are optional ... don't know if "optional" is the right word, but something that implies that a coordinate doesn't need to have bounds but it must have points (you get my drift, right)

can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances that the coordinate is derived from have
real points and bounds then the derived coordinate will also have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.
Member

If all ... Difficult sentence. Totally understand what you mean, just wonder if we can make it simpler ... e.g.

Note that a derived coordinate will only have real points and bounds if all of its associated AuxCoord instances have real points and bounds.

True
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
Member

Would be nice to now show that touching the points doesn't affect the laziness of the coordinate's bounds, i.e. they are still lazy ... that's the case, right?

-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
Member

Should we actually mention the real dask thing, i.e. dask.array, with a link to dask-land?

@bjlittle bjlittle self-assigned this Jun 2, 2017
@bjlittle
Member

bjlittle commented Jun 2, 2017

@corinnebosley and @dkillick Great work! 👍

We've got an outstanding issue with a doc-test failure in the new user guide. I vote that we only merge this PR once we resolve that issue (and rebase this PR with the fix).

You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

>>> cube = iris.load_cube(filename, 'air_temp.pp')
Member

Whoops, we should probably fix 'filename' in this line...

Member Author

Thanks @dkillick, good spot.

@bjlittle
Member

bjlittle commented Jun 7, 2017

@corinnebosley It's all lookin' good!

I forgot to merge #2590 before you did your rebase, duh! 🙄 ... but now that I've done that, could you please rebase again? This should make the failing doc-test pass!

@corinnebosley
Member Author

@bjlittle All passing now.

@bjlittle bjlittle merged commit 9c9cd2f into SciTools:dask Jun 7, 2017
bjlittle added a commit to bjlittle/iris that referenced this pull request Jun 9, 2017
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017