Dask docs, first draft #2583
Conversation
@dkillick @pp-mo @bjlittle Apologies for my previous PR, clearly not used to working on the dask branch. Feel free to tear into this draft of the docs, that's what it's there for. Don't beat around the bush, I won't take anything personally. I will also leave open the option to allow edits directly to this PR. It may turn out to be a more efficient method of working.
DPeterK left a comment
@corinnebosley this is a great starting point for the new userguide chapter 👍 There's plenty to think about still, but there always is when updating user docs!
>>> cube = iris.load_cube(filename, 'air_temp.pp')
>>> cube.has_lazy_data()
True
>>> _ = cube.data
You don't actually need to assign this call into anything - just calling cube.data is sufficient to touch (and load) the data.
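For illustration, the suggested form might look something like this (a sketch only, assuming the standard Iris sample data is installed):

import iris

cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
print(cube.has_lazy_data())    # True - loading leaves the data lazy

cube.data                      # simply touching the data loads it; no assignment needed
print(cube.has_lazy_data())    # False - the data is now real (in memory)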
What is Real and Lazy Data?
---------------------------

Every Iris cube contains an n-dimensional data array, which could be real or
You've got two spaces between "dimensional" and "data".
Every Iris cube contains an n-dimensional data array, which could be real or
lazy.

Real data is contained in an array which has a shape, a data type, some other
Might be worth mentioning "NumPy" in this line.
about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.

Arrays in Iris can be converted flexibly(?) between their real and lazy states,
I think you could get away without "flexibly" here.
Arrays in Iris can be converted flexibly(?) between their real and lazy states,
although there are some limits to this process. The advantage of using lazy
data is that it has a small memory footprint, so certain operations
(such as...?) can be much faster. However, in order to perform other
The advantage is quite right, but the application is not so accurate: lazy data allows you to load and manipulate datasets that you would otherwise not be able to fit into memory.
As it happens, it is likely that operations will be sped up, but this could be considered an indirect benefit:
- the calculations will not be performed immediately as you execute the lines of code that define them. Instead they are deferred until you request a realised result (this happens when you call `compute` on a dask array)
- dask's parallel-processing schedulers allow the calculations to be performed in parallel when the realised result is requested, which can bring a significant performance improvement. However, it is not directly related to the smaller memory footprint.
I think both of these points are also worth adding here as they build the picture of the benefits of lazy data and its processing.
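For example, a minimal dask-only sketch of those two points (plain dask.array, not Iris; the names here are just for illustration):

import dask.array as da

# Defining a calculation on a lazy array performs no work yet...
x = da.ones((10000, 10000), chunks=(1000, 1000))
result = (x + 1).mean()        # still lazy: just a graph of deferred operations

# ...the work happens only when a realised result is requested, and the
# scheduler can then evaluate the chunked calculations in parallel.
print(result.compute())        # 2.0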
>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> the_data = cube.core_data
I believe core_data() is now a method that needs to be called.
It seems to work both with and without the brackets. That's pretty confusing.
Please refer to the Whitepaper for further details.


Dask Processing Options
You know, I think we could do with a little section here on coords, too. They've been updated now as well, and while DimCoords can only have real data (because of the monotonicity checks that must be run), AuxCoords can now have lazy data, which is a bit of a big deal, I reckon.
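Something along these lines might illustrate the distinction (a sketch only; the filename and coordinate names are hypothetical, and it assumes the aux coord is loaded lazily from netCDF):

import iris

cube = iris.load_cube('my_data.nc')            # hypothetical file
dim_coord = cube.coord('time')                 # hypothetical DimCoord
aux_coord = cube.coord('surface_altitude')     # hypothetical AuxCoord

print(dim_coord.has_lazy_points())   # False - DimCoords always hold real data
print(aux_coord.has_lazy_points())   # True - AuxCoords may now be lazy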
.. doctest::

    >>> dask.set_options(get=dask.get)
I think this should read dask.async.get_sync
Which bit? Do you mean like this:
>>> dask.set_options(get=dask.async.get_sync)
?
That was not how I expected that comment to look...
There are some default values which are set by Dask and passed through to Iris.
If you wish to change these options, you can override them globally or using a
context manager.
Dask options are only controllable by calling dask.set_options, which isn't quite clear from this sentence. The changes you specify with this call will apply to all uses of dask for the lifetime of your "session" (be it a Python session or the extent of the context manager).
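For example, something along these lines might make that clearer (a sketch using the dask API names discussed in this thread; not checked against the current dask release):

import dask
import dask.array as da
from dask.async import get_sync    # the synchronous scheduler mentioned above

lazy = da.ones((1000, 1000), chunks=(100, 100))

# Set globally: this affects every dask computation for the rest of the session.
dask.set_options(get=get_sync)

# Or scoped with a context manager: only computations inside the block are affected.
with dask.set_options(get=get_sync):
    result = lazy.mean().compute()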
As well as Dask offering the benefit of a smaller memory footprint through the
handling of lazy arrays, it can significantly speed up performance by allowing
Iris to use multiprocessing.
I think I'd call this parallel processing rather than multiprocessing, which is one parallel processing option available to users.
All really good work, but I'm not convinced that most of this detail needs to go in the User Guide.
I think we need to keep the User Guide absolutely as short as practical.
In this case, I believe that actually most users don't need to know anything about lazy operations except: (1) roughly what 'lazy' means; (2) it's only relevant to performance with big data; (3) there is more detail elsewhere.
So, although it is quite long + gentle, maybe most of this belongs in the Whitepaper space instead?
'touch' the data. This means any way of directly accessing the data, such as
assigning it to a variable or simply using 'cube.data' as in the example above.

Any action which requires the use of actual data values (such as cube maths)
Cube maths is not a good example : arithmetic operations don't realise.
Example non-lazy operations could be plotting or regridding ?
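e.g. a rough sketch of the plotting case (assuming the standard sample data, and that plotting touches cube.data and so realises it in place - worth double-checking):

import iris
import iris.quickplot as qplt
import matplotlib.pyplot as plt

cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
print(cube.has_lazy_data())    # True

qplt.contourf(cube)            # plotting needs the actual data values...
plt.show()
print(cube.has_lazy_data())    # False - the data has been realised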
>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
I think "back into" is misleading, because it implies that it puts the cube "back the way it was", which is absolutely not the case : If you convert real data to lazy, it's only a wrapper : the data is still in memory and the cube no longer gets its data from the file.
>>> cube.has_lazy_data()
False

You can also convert realized data back into a lazy array:
In my head "realized data" is maybe not good grammar, and simply "real data" is more accurate. : so I'd prefer it to say that a cube or coord points/bounds can be "realized" -- the content is just a "real array".
>>> cube.has_lazy_data()
True

Core data refers to the current state of the cube's data, be it real or
Don't much like "state of the cube's data", it sounds like it is telling you what state it is in.
? Prefer "Core data returns the cube's current data array, of whichever type, be it real or lazy."
So "state" is my suggested terminology for core_data, although maybe its use here is a little clunky. I mean, the state of the cube's data is likely to change over the lifetime of the cube, between being in a lazy state and a real state.
I'll have a go at using "state" differently here and if it still isn't liked then maybe we could go for something a little more like what @pp-mo has suggested.
Core data refers to the current state of the cube's data, be it real or
lazy. This can be used if you wish to refer to the data array but are
indifferent to its current state. If the cube's data is lazy, it will not be
Again, don't much like this use of "state".
Maybe "This can be used when you want to use the data array with some operation that applies regardless of whether it is real or lazy, such as indexing it".
You can copy a cube and assign a completely new data array to the copy. All the
original cube's metadata will be the same as the new cube's metadata. However,
the new cube's data array will not be lazy if you replace it with a real array:
You could make it clear here that you can provide a lazy array.
However, in certain cases you will need to use 'replace()' instead, described next...
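Something like this, perhaps (a sketch; it assumes cube.copy() accepts a dask array for data, which is what the comment above implies):

import dask.array as da
import iris

cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
lazy_replacement = da.zeros(cube.shape, chunks=cube.shape)   # a lazy array of the right shape

new_cube = cube.copy(data=lazy_replacement)
print(new_cube.has_lazy_data())    # True - the copy keeps the lazy array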
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]
[-- -- -- ..., -- -- --]]
>>> new_cube = cube.copy(data=data)
Needs a replace !
Coordinate Arrays
-----------------

Cubes possess coordinate arrays as well as data arrays, so these also benefit
I'd put this as "Like cubes, coordinates also contain data arrays (points and bounds)."
Dask applies some default values to certain aspects of the parallel processing
that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.
worth describing exactly what Dask options Iris sets first, linking it directly to the account in the Dask docs.
It may also be worth explaining how Iris "steps aside" if you make your own set_options call.
that it offers with Iris. It is possible to change these values and override
the defaults by using 'dask.set_options(option)' in your script.

You can use this as a global variable if you wish to use your chosen option for
I'm worried that, while maybe useful, all this remaining text is basically duplicating Dask documentation.
Apart from the general problems with copying and coupling, I think we really want the user to refer to Dask docs at this point + not just copy our magic recipes without a deeper understanding.
Lazy data is contained in a conceptual array which retains the information
about its real counterpart but has no actual data points, so its memory
allocation is much smaller. This will be in the form of a Dask array.
@corinnebosley Perhaps format Dask as a link to the dask readthedocs, so the user can follow this if they want ...
The advantage of using lazy data is that it has a small memory footprint which
enables the user to load and manipulate datasets that you would otherwise not
be able to fit into memory.
@corinnebosley Perhaps introduce the concepts of deferred loading i.e. keeping large data on disk and not in memory
data values) the real data must be realized. Using Dask, the operation will
be deferred until you request the result, at which point it will be executed
using Dask's parallel processing schedulers. The combination of these two
behaviours can offer a significant performance boost.
@corinnebosley Introduce the concept (and use the term) lazy evaluation ...
>>> cube.has_lazy_data()
False

When does my Data Become Real?
When does my data become real? ... capitalization police
Actually, this is written in title case, which is fine... other than the fact that the rest of the user guide doesn't appear to use title case.
Cubes possess coordinate arrays as well as data arrays, so these also benefit
from Dask's functionality, although there are some distinctions between how
the different coordinate types are treated.
I think that we need to make a clear statement that coordinates etc are now lazy, just like the data of a cube, rather than mentioning that we use dask, which implies lazy behaviour is supported ... subtle, but perhaps important.
You can check whether the data array on your cube is lazy using the Iris
function 'has_lazy_data'. For example:

.. doctest::
We hardly use the doctest directive anywhere else in the userguide, and this implementation is incorrect (no accompanying testsetup directive). We could do with a decision on how much, if at all, we use doctest in this userguide chapter.
A dask array also has a shape and data type
but typically the dask array's data points are not loaded into memory.
Instead the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary.
small chunks -> did we want a link here to the dask readthedocs section on chunks? if we use dask terms, we should point users off to their documentation.
absolutely necessary -> did we want to qualify this somehow here, i.e. explicitly loading or crunching numbers to do an actual computation (not a lazy computation)?
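If we do link out for 'chunks', a one-liner might help anchor the term here (plain dask, not Iris):

import numpy as np
import dask.array as da

arr = np.random.random((4000, 4000))

# 'chunks' splits the array into 1000x1000 blocks; dask only pulls individual
# blocks into a computation as they are needed.
lazy = da.from_array(arr, chunks=(1000, 1000))
print(lazy.chunks)    # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))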
small chunks when absolutely necessary.

The primary advantage of using lazy data is that it enables the loading and manipulating
of datasets that would otherwise not fit into memory.
Reference out-of-core processing i.e. it's a real term that means exactly what we want e.g. see https://en.wikipedia.org/wiki/Out-of-core_algorithm
True
>>> cube.data
>>> cube.has_lazy_data()
False
Did we want to add a note which reinforces that cube.data will make lazy data real by loading it into memory? Users should know this already, but it just makes it explicit.
Did we want to introduce the term realising the data to mean making lazy data real, and use it throughout?
Oh, that would just tee up the next section nicely ...
------------------------------

When you load a dataset using Iris the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
details -> discusses ?
Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
highlight deferred as it's key to this sentence ? is there a link to an explanation in dask for deferred operations ? might be a good opportunity to cross-reference into the dask documentation ...
* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
  will return the cube's lazy dask array. Calling the cube's
  :meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
will not realise -> will never realise is a stronger contract ...
  will return the cube's lazy dask array. Calling the cube's
  :meth:`~iris.cube.Cube.core_data` method **will not realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
  will return the cube's real NumPy array.
real -> in-memory ? ... or real -> real (or in-memory) ...
-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain
points and bounds arrays. Coordinate points and bounds arrays can also be real or lazy:
bounds are optional ... don't know if "optional" is the right word, but something that implies that a coordinate doesn't need to have bounds but it must have points (you get my drift, right)
can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances that the coordinate is derived from have
real points and bounds then the derived coordinate will also have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.
If all ... Difficult sentence. Totally understand what you mean, just wonder if we can make it simpler ... e.g.
Note that a derived coordinate will only have real points and bounds if all of its associated AuxCoord instances have real points and bounds.
True
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
Would be nice to now show that touching the points doesn't affect the laziness of the coordinate's bounds, i.e. they are still lazy ... that's the case, right?
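i.e. continuing the example above, something like this - if that is indeed the behaviour (sketch only; assumes aux_coord is a bounded AuxCoord with lazy points and bounds):

>>> print(aux_coord.has_lazy_bounds())
True
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
>>> print(aux_coord.has_lazy_bounds())
True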
-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
Should we actually mention the real dask thing, i.e. dask.array, with a link to dask-land?
@corinnebosley and @dkillick Great work! 👍 We've got an outstanding issue with a doc-test failure in the new user guide. I vote that we only merge this PR once we resolve that issue first (and rebase this PR with the fix).
You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

    >>> cube = iris.load_cube(filename, 'air_temp.pp')
Whoops, we should probably fix 'filename' in this line...
Thanks @dkillick, good spot.
Force-pushed from ee60e28 to 41c0341
@corinnebosley It's all lookin' good! I forgot to merge #2590 before you did your rebase, duh! 🙄 ... but now that I've done that, could you please rebase again? This should make the failing doc-test pass!
Force-pushed from 41c0341 to e033395
@bjlittle All passing now.
In this new page of the user guide, I have attempted to cover: