Allow cubes/coords/etc to share data #3172

Open
DPeterK opened this issue Sep 17, 2018 · 4 comments
Labels: Dragon 🐉, Experience: High, Status: Decision Required

Comments

DPeterK (Member) commented Sep 17, 2018

In some cases it would be highly advantageous for cubes to be able to share a data object, which Iris currently cannot handle. Sharing would allow Iris, in some cases, to produce views of data rather than copies.

Here's @pp-mo's take on this topic:

IMHO we should aim to be "like numpy".
In this context, that means in the worst cases (e.g. indexing) :

"Result is usually a view, but in some cases a copy."
"it's too complicated to explain exactly when."
"it might change in future releases"

There is some prior work on this topic, including #2261, #2549, #2584, #2681 and #2691, which reflects its importance. However, given the potential for unexpected behaviour that this change would bring, further thought is still required.

pp-mo (Member) commented Sep 17, 2018

"unexpected behaviour"

Some key points from my prior thought on this ...

  • the key practicality + API design question is "in what context may an Iris operation produce a result which shares data with another Iris object?"
  • the key goal is to control it, so that it only happens when you expect it or have asked for it.
  • lazy content could confuse this : when does it get evaluated, and can that encapsulate a behaviour switch from when it was created (e.g.) ?
  • the biggy IMHO : once it is at all possible to have (e.g.) 2 cubes which share some data, then any operation which can modify its inputs might produce different results. You just can't logically avoid that. Even something as simple as "a = a + b" is potentially affected (see the sketch after this list).
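
A minimal numpy-only sketch of that last point, with two plain arrays standing in for hypothetical cubes that share a data object:

>>> import numpy as np
>>> data = np.zeros(3)
>>> a, b = data, data      # think: two cubes holding the same data array
>>> a += 1.0               # an in-place operation silently changes b as well
>>> b
array([1., 1., 1.])
>>> a = a + b              # whereas this allocates a new array: the sharing is broken
>>> np.shares_memory(a, b)
False

Which of the two forms a user (or an internal Iris routine) happens to use would silently determine whether the other object is affected.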

pp-mo (Member) commented Feb 1, 2022

Iris 3.2 and the unstructured data model

Since v3.2 and the unstructured data model, we do finally get cubes which share some components : namely, any cube.mesh

Summary of some relevant facts about the new data-model objects

Basic relevant facts + ideas

Mesh

  • does not support copy : we expect multiple things that use it to cross-refer (see the sketch after this list)
  • is mapped to only one Cube data dimension, only via a MeshCoord, and is therefore not :
    • a cube component (like Coord/Ancil/CellMeasure)
    • a _DimensionalMetadata subclass
    • indexable as part of sub-indexing a cube
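
For illustration, a hedged sketch of that cross-referencing, using the Iris 3.2-era experimental UGRID loader; "mesh_data.nc" is a hypothetical file containing two data variables on one mesh, so the details of the load call are assumptions:

>>> import iris
>>> from iris.experimental.ugrid import PARSE_UGRID_ON_LOAD
>>> # "mesh_data.nc" is a hypothetical UGRID file with two variables on one mesh
>>> with PARSE_UGRID_ON_LOAD.context():
...     cubes = iris.load("mesh_data.nc")
...
>>> cube_a, cube_b = cubes[0], cubes[1]
>>> cube_a.mesh is cube_b.mesh   # one Mesh object, cross-referenced by both cubes
True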

MeshCoords

Are a sort of "convenience" component ..

  • they "just" represent a relationship between a cube (and its dims) and a Mesh
  • they are AuxCoords, but don't represent anything in a CF dataset
    • thus, they have standard/long/varname and units/attributes ..
    • .. but these are basically non-functional, don't "mean" anything, aren't used for anything
    • so, there is clearly an argument for these to not be AuxCoords but some distinct, more limited class : the current arrangement is pragmatic (as for Connectivity being a _DimensionalMetadata -- see below).
  • they are not shared between cubes (but in future could be, if any Coords are ?)
  • they support copying, and are copied on cube copy
  • they do not support sub-indexing ..
    • .. but are replaced with ordinary AuxCoords on cube indexing (see above, and the sketch below)
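
Continuing the hypothetical cubes from the sketch above, a rough illustration of that replacement-on-indexing (the coordinate name "latitude" is an assumption about the file contents):

>>> sub = cube_a[..., :10]          # sub-index the (assumed last) mesh dimension
>>> sub.mesh is None                # the Mesh connection is gone ...
True
>>> type(sub.coord("latitude"))     # ... and the MeshCoord is now a plain AuxCoord
<class 'iris.coords.AuxCoord'>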

Mesh location coordinates and Connectivities

  • are not attached to a cube, or its dims, but only to the Mesh
  • therefore, implicitly, shared + not copied (between cubes of the same mesh)
  • so, like a Mesh, they aren't a cube component ..
  • .. but they are dimensional, and mapped to a Mesh dimension
  • unlike MeshCoords, they do represent objects in a CF dataset
    • so they do have meaningful standard/long/var-name + units + attributes
  • Mesh location coordinates : are just ordinary AuxCoords (for now at least)
  • Mesh Connectivities : at present are a subclass of _DimensionalMetadata
    • but this is not logical, really just a convenience / anomaly and could reasonably change
    • so .. they are in principle indexable and copyable, but this is not really useful or used anywhere at present

Sharing of dimensional components (potentially big arrays)

This is a relevant issue, simply because unstructured data comes with a lot of associated mesh information (large coordinate and connectivity arrays), typically much larger than the structured equivalents for the same size of data.

Mesh Coordinates and Connectivities are effectively shared between cubes, since they belong to the Mesh, which is itself shared.
-- though, identical meshes loaded from different files cannot currently be identified and shared

Any related AuxCoord/CellMeasure/Ancil on the unstructured dimension cannot be shared.
They can be lazy, of course, but each Cube will have its own copy (see the sketch after this list)

  • like regular (structured data) Coords
  • unlike the Mesh coords + connectivities
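
A hedged sketch of that asymmetry, again using the hypothetical cube_a from above; "cell_quality" is a made-up coordinate name, and the exact copy/sharing behaviour may vary between Iris versions:

>>> import numpy as np
>>> from iris.coords import AuxCoord
>>> # attach an ordinary AuxCoord on the mesh (here assumed last) dimension;
>>> # "cell_quality" is a made-up name, purely for illustration
>>> aux = AuxCoord(np.zeros(cube_a.shape[-1]), long_name="cell_quality")
>>> cube_a.add_aux_coord(aux, cube_a.ndim - 1)
>>> copied = cube_a.copy()
>>> copied.mesh is cube_a.mesh   # the Mesh and its big arrays are shared, not copied
True
>>> # ... whereas the ordinary AuxCoord gets a fresh points array on cube copy
>>> np.shares_memory(copied.coord("cell_quality").points, aux.points)
False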

@trexfeathers trexfeathers changed the title Allow cubes to share data Allow cubes/coords/etc to share data Sep 15, 2023
pp-mo (Member) commented Sep 15, 2023

Discussed briefly offline with @hdyson, since he and (IIRC) @cpelley were the original users most concerned about the inefficiency of this.

His recollection of the problem to be addressed was somewhat different : he thinks it was in the context of combining multiple results into a single array to then be saved, rather than to do with sharing of components in loaded data.

The thing is, sharing of partial data arrays by multiple cubes is already possible.
For example:

>>> import numpy as np
>>> from iris.cube import Cube
>>> # three cubes wrapping overlapping views of a single array
>>> data = np.zeros((10,))
>>> c1, c2, c99 = Cube(data[:5]), Cube(data[5:]), Cube(data[4:8])
>>> c1.data[3] = 7
>>> c2.data[:4] = 99
>>> c99.data[:] = 50
>>> data
array([ 0.,  0.,  0.,  7., 50., 50., 50., 50., 99.,  0.])
>>> c1.data
array([ 0.,  0.,  0.,  7., 50.])
>>> c2.data
array([50., 50., 50., 99.,  0.])
>>> c99.data
array([50., 50., 50., 50.])

pp-mo (Member) commented Sep 15, 2023

In the course of the above discussion, I rather revised my thoughts.

My understanding is that the major opportunity for inefficiency is where multiple cubes contain identical components, such as aux-coords, ancillary-variables or cell measures.
It doesn't really apply to cube data, since we don't generally expect cube data to be linked.

If all of those cube components' data may be realised, then there is an obvious inefficiency
( e.g. there was a period when saving cubes realised all aux-coords -- though that is now fixed).
If these components contain real data, then it could easily be shared, as the cube data examples above show.
However, normally, when loaded from file, these components would contain multiple lazy arrays, all referencing the same data in the file.

So, in the lazy case, it is quite possible that some cube operations might load all that data, or at least transiently fetch it multiple times (e.g. within computation of a lazy result, or a save).
I think there is no clean way to "link" the separate lazy arrays, but it should be possible for the cubes to share either the cube components themselves -- i.e. the objects, such as aux-coords -- or, within those, their DataManagers. Effectively, this is already happening with Meshes.
With that provision, realising the components would "cache" the data and not re-read it (still less allocate additional array space). However, that in itself would still not improve lazy operations -- including lazy streaming during netCDF writes -- since dask does not cache results, and the lazy content would still be re-fetched multiple times.
To address that, it would be possible to implement a caching feature within NetCDFDataProxy objects, but that approach is not very controllable -- and could itself cause problems if the total data size of a single object is large (in which case, storing only one chunk at a time may be highly desirable). A sketch of the object-sharing idea follows below.
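
Here is a rough sketch of the component-sharing idea (not the NetCDFDataProxy caching), assuming that add_aux_coord stores the given coordinate by reference rather than copying it -- "shared_aux" and the array sizes are made up for illustration:

>>> import dask.array as da
>>> import numpy as np
>>> from iris.coords import AuxCoord
>>> from iris.cube import Cube
>>> shared = AuxCoord(da.zeros(1000, chunks=100), long_name="shared_aux")
>>> c1, c2 = Cube(np.zeros(1000)), Cube(np.zeros(1000))
>>> c1.add_aux_coord(shared, 0)
>>> c2.add_aux_coord(shared, 0)                # the same object, not a copy
>>> c1.coord("shared_aux") is c2.coord("shared_aux")
True
>>> pts = c1.coord("shared_aux").points        # realise the points via one cube ...
>>> c2.coord("shared_aux").has_lazy_points()   # ... and both cubes now see real data
False

Note this only avoids repeated realisation of the shared component; as described above, it would not by itself stop dask re-fetching the lazy content during streamed operations such as netCDF saves.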

In short, we may need to focus more carefully on what the common problem cases actually are, since I think there has been some confusion here in the past, and all of the solutions proposed so far have potential drawbacks.
