Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Realm/Legion: Add support for structured (~affine) indirect copies #705

Open
streichler opened this issue Dec 9, 2019 · 27 comments
Open
Assignees
Labels
best effort indicates the milestone tag for an issue is a goal rather than a commitment enhancement Legion Issues pertaining to Legion planned Feature/fix to be actively worked on - needs release target Realm Issues pertaining to Realm
Milestone

Comments

@streichler
Copy link
Contributor

With unstructured scatter/gather just about done, we still need to handle the structured cases for things like stencils, GMG, etc.

@streichler streichler added enhancement Legion Issues pertaining to Legion Realm Issues pertaining to Realm planned Feature/fix to be actively worked on - needs release target labels Dec 9, 2019
@streichler streichler added this to the 20.03 milestone Dec 9, 2019
@streichler streichler modified the milestones: 20.03, 20.06 Mar 20, 2020
@streichler streichler added the best effort indicates the milestone tag for an issue is a goal rather than a commitment label Apr 1, 2020
@streichler streichler modified the milestones: 20.06, 20.09 May 31, 2020
@streichler streichler modified the milestones: 20.09, 20.12 Sep 23, 2020
@streichler streichler modified the milestones: 20.12, 21.03 Dec 27, 2020
@lightsighter
Copy link
Contributor

lightsighter commented May 8, 2021

@streichler How would you feel about Realm supporting a "mixed mode" indirect copy? For example where the source indirection is a transform and the destination indirection is a field. See this Legate issue:
nv-legate/cunumeric#16

@streichler
Copy link
Contributor Author

The intent is that each indirection used can independently be chosen as structured or unstructured, so this case should work.

@manopapad
Copy link
Contributor

Legate.numpy would also appreciate it if you could supply both an unstructured and a structured transformation on each direction, to be applied one after the other. The motivating use-case would be something like this:

a[10:15][ [4,2,0,1,3] ] = ...

where we need to use [4,2,0,1,3] as an unstructured indirection, and apply the affine transformation 0..5 -> 10..15 on top of that.

@manopapad
Copy link
Contributor

After thinking further on this, we realized cuNumeric actually requires up to 3 transformations on each direction, affine -> explicit -> affine. Consider this example:

x = np.arange(20) + 100   # [100, 101, ..., 119]
xx = x[10:]               # [110, 111, ..., 119]
y = np.arange(9, -1, -1)  # [9, 8, ..., 0]
yy = y[2:7]               # [7, 6, 5, 4, 3]
z = xx[yy]                # [117, 116, 115, 114, 113]

To implement this with a single indirect copy, we would need to proceed as follows:

start: copy index space                       = 0   1   2   3   4
src_indirect_1: apply yy transformation (+2)  = 2   3   4   5   6
src_indirect_2: dereference region backing y  = 7   6   5   4   3
src_indirect_3: apply xx transformation (+10) = 17  16  15  14  13
dereference src, i.e. region backing x        = 117 116 115 114 113

Alternatively, we could attach affine transformations to each field on the copy (source, target, source indirection, target indirection), to mirror how we attach affine transformations to Stores passed to tasks (in that case we do it by defining custom accessors).

@lightsighter
Copy link
Contributor

Right so we effectively need the ability to have an affine transform applied to any of the region requirements to a copy operation in Legion. The place where this gets interesting is that this now will require there to effectively be an "abstract" iteration space for any kind of copy because the source/destination indirection region requirements no longer implicitly represents the iteration space for the copy. We'll need a way to specify that in the new interface, although I suppose that was going to be true anyway just to deal with having affine transforms for the source/indirect parts anyway.

@streichler
Copy link
Contributor Author

now will require there to effectively be an "abstract" iteration space for any kind of copy

The iteration space of the copy has always been explicit (copy is a method of the IndexSpace class) at the Realm API level. However, the chaining of indirections is not something the current API can handle, and I think will also create some challenges for the Legion analysis (e.g. in a affine->explicit case, the subregions of the indirection field that are read during the copy won't necessarily have names in the region tree).

@lightsighter
Copy link
Contributor

I think will also create some challenges for the Legion analysis (e.g. in a affine->explicit case, the subregions of the indirection field that are read during the copy won't necessarily have names in the region tree).

We're just going to do what we do with the other indirection fields in that case and have the application provide region requirements that can optionally over-approximate the needed data for that field, just like we would do on the indirection for a current gather/scatter copy. The application can provide as tight of bounds as it wants, but over-approximate is fine, especially since the indirection fields are read-only.

@lightsighter
Copy link
Contributor

Here is a working merge request for an interface for this issue:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/467

Also, @manopapad did a small worked example to understand how to map NumPy copies down to the interface:

  a = np.arange(10) * 2
  c = np.arange(9, -1, -1)
  x = a[2:7][ c[5:10] ]

which would translate to the interface as:

  copy index space = Rect(lo=0, hi=4)
  src = a.regionfield, src_transform = \x -> x + 2
    => src index space = Rect(lo=2, hi=6)
    => src values = [4, 6, 8, 10, 12]
  dst = x.regionfield, dst_transform = None
  src_indirect = c.regionfield, src_indirect_transform = \x -> x + 5
    => src_indirect index space = Rect(lo=5, hi=9)
    => src_indirect values = [4, 3, 2, 1, 0]
  dst_indirect = None

Translated: for each point in the copy index space, you do the source indirect transform on the point, using the result you read the value out of the source indirect array, then you apply the source transform to the read value, then you use that value to read out of the source array and write into the destination array.

@streichler and @apryakhin to verify that conforms with their expectations of what Realm will eventually be able to support.

@manopapad
Copy link
Contributor

Correcting my example, after some clarifications from @lightsighter:

a = np.arange(10) * 2
c = np.arange(9, -1, -1)
x = a[2:7][ c[5:10] ]

becomes the following copy operation:

copy index space = Rect(lo=0, hi=4)
dst = x.regionfield, dst_transform = None
dst_indirect = None
src = a.regionfield, src_transform = \x -> x + 2
src_indirect = c.regionfield, src_indirect_transform = \x -> x + 5

which works as follows:

starting index space = Rect(lo=0, hi=4)
=> indices = [0, 1, 2, 3, 4]
apply src_indirect_transform = \x -> x + 5
=> indices = [5, 6, 7, 8, 9]
dereference src_indirect
=> indices = [4, 3, 2, 1, 0]
apply src_transform = \x -> x + 2
=> indices = [6, 5, 4, 3, 2]

@lightsighter
Copy link
Contributor

@manopapad Can you extent you example above by writing down what the arguments to the index copy launcher in Legion would be for this case from cuNumeric's perspective? Let's assume we do a index copy launch with two point copy operations. I'm betting that Legion's current semantics won't quite do what we want here, but I'd like to see how you want it to work. Feel free to describe any partitions you want on the data, no need to write down explicit projection functions, just give names to subregions for the arrays and say which subregions each point copy operation is going to use.

@manopapad
Copy link
Contributor

Here is a possible execution:

We have all the required information at the cunumeric level to partition the indirection fields such that each point copy operation will find all its indirection information locally (even if there are transformations on the indirection fields).

Whether or not the partitions of the src and dst fields will match up with the partition of the copy operation itself depends on the contents of the indirection fields. In this example our partitioning happens to be quite off.

Copy index space partitions:

Icopy = Rect(lo=0, hi=4)
p_Icopy[0] = Rect(lo=0, hi=1) # 2 indices
p_Icopy[1] = Rect(lo=2, hi=4) # 3 indices

Partitions of a's index space:

Ia = Rect(lo=0, hi=9)
p_Ia[0] = Rect(lo=0, hi=3)
p_Ia[1] = Rect(lo=4, hi=9)

Partitions of c's index space:

Ic = Rect(lo=0, hi=9)
p_Ic[0] = Rect(lo=0, hi=6) # covers at least the first 2 copy indices, after transformation
p_Ic[1] = Rect(lo=7, hi=9) # covers at least the last 3 copy indices, after transformation

Partitions of c's contents:

Rc = [0->9, 1->8, 2->7, 3->6, 4->5, 5->4, 6->3, 7->2, 8->1, 9->0]
p_Rc[0] = [0->9, 1->8, 2->7, 3->6, 4->5, 5->4, 6->3]
p_Rc[1] = [7->2, 8->1, 9->0]

Point copy 0:

starting index space = Rect(lo=0, hi=1)
=> src_indices = [0, 1]
apply src_indirect_transform = \x -> x + 5
=> src_indices = [5, 6]
dereference p_Rsrc_indirect[0] = p_Rc[0] -- all indices are within bounds
=> src_indices = [4, 3]
apply src_transform = \x -> x + 2
=> src_indices = [6, 5]

The way the source array a was partitioned, these indices are not in p_Isrc[0] = p_Ra[0] = Rect(lo=0, hi=3).

Point copy 1:

starting index space = Rect(lo=2, hi=4)
=> indices = [2, 3, 4]
apply src_indirect_transform = \x -> x + 5
=> indices = [7, 8, 9]
dereference p_Rsrc_indirect[1] = p_Rc[1] -- all indices are within bounds
=> indices = [2, 1, 0]
apply src_transform = \x -> x + 2
=> indices = [4, 3, 2]

The way the source array a was partitioned, some of these indices are not in p_Isrc[1] = p_Ia[1] = Rect(lo=4, hi=9).

@lightsighter
Copy link
Contributor

Whether or not the partitions of the src and dst fields will match up with the partition of the copy operation itself depends on the contents of the indirection fields. In this example our partitioning happens to be quite off.

Right this is kind of what I was expecting to happen. I suspect that we'll need some extra support in legion to support "collective" behavior of the point copy operations in the case when there is a transform on the indirection fields. It will require more preimages, but that'll be the price we'll pay for supporting the additional functionality.

@apryakhin
Copy link
Contributor

apryakhin commented May 5, 2022

Thanks for the discussion!

I think we need to reach a consensus on what functionality needs to be supported by Realm as the end state. We are getting close to merging this PR which adds a native support for structured indirect copies in Realm. The set of use cases can be found here:


When PR is merged, Realm will be able to apply any kind of affine transform on both src and dst index spaces, with a single level of indirection. It will be taking the fast path since a transform is not applied on a set of points individually but directly on lo/hi points of an index space bounds.

The example below is effectively a structured copy with an affine transform on array c "S(ATc)" and an unstructured copy with an affine transform on a set of points through the source indirect field (U(ATa))…so S(ATc)U(ATa). The existing API and the data path (when PR is applied) at this point do not support combining structured and unstructured copies together.


  a = np.arange(10) * 2
  c = np.arange(9, -1, -1)
  x = a[2:7][c[5:10]]

An even simpler version which is also not supported:

  a = np.arange(10) * 2
  c = np.arange(9, -1, -1)
  x = a[c[5:10]]

A structured copy with an affine transform followed by an unstructured copy using the set of transformed points.

After giving it some thoughts, I think it can be done but have several questions:

  • How critical is it for us to support the "mixed" copies natively in Realm?
    • I understand that Legate is able to emulate this by splitting the operation on parts and materializing the intermediate results? If yes, why is it not sufficient?
  • What percentage of operations required for advanced indexing in CuNumeric are mixed? (not sure if that's easy to estimate)
  • Are there other operation types which needs to be discussed?
    • Perhaps we can list more examples here.

Assuming we need this going forwards, we can certainly design an API (Realm) which enables us to incrementally add functionality without the need to update the call sites...e.g. Legion -> Legate -> CuNumeric.

@manopapad
Copy link
Contributor

When PR is merged, Realm will be able to apply any kind of affine transform on both src and dst index spaces, with a single level of indirection.

Could you walk us through an example with some small arrays, showing what we can do with the new interface?

How critical is it for us to support the "mixed" copies natively in Realm? I understand that Legate is able to emulate this by splitting the operation on parts and materializing the intermediate results? If yes, why it is not sufficient?

I have updated nv-legate/cunumeric#41 with the logic followed in the current implementation. As you suggest, we can handle such cases by making intermediate copies, but we could avoid some copies if Realm could handle affine transformations directly (which AFAIU is also useful for other usecases). We certainly don't need Realm to support the full range of cases.

What percentage of operations required for advanced indexing in CuNumeric are mixed? (not sure if that's easy to estimate)

So far we haven't seen any "real-world" instances of mixed indexing (but haven't really seen many workloads that use advanced indexing in general).

Are there other operation types which needs to be discussed?

The cases listed on nv-legate/cunumeric#41 are the only ones that should arise in cuNumeric, assuming we are not fusing operations.

@apryakhin
Copy link
Contributor

apryakhin commented May 9, 2022

Could you walk us through an example with some small arrays, showing what we can do with the new interface?

Yes, certainly but please note that there is no new API (yet). The PR is based on the existing Ream API for affine indirect copies.

So far we haven't seen any "real-world" instances of mixed indexing (but haven't really seen many workloads that use advanced indexing in general).

Thanks! Effectively we don't have any data points yet. Will it be possible for us to estimate nevertheless what set of operations are going to be the most critical? By the critical I mean things such as: provide good performance, low memory footprint and a frequency of their use in applications.

I believe the description below would be a gather and the step#2 materializes to a new intermediate array?

For __get_item__:
1. use newly created result array as dst
2. if the index array is a view, make a copy to remove its transform
3. use (copy of) index array as src_indirect
4. if the base array is a view, make a copy to remove its transform
5. use (copy of) base array as src

For example:

starting index space = Rect(lo=0, hi=4)
=> indices = [0, 1, 2, 3, 4]
apply src_indirect_transform = \x -> x + 5
=> indices = [5, 6, 7, 8, 9]
dereference src_indirect
=> indices = [4, 3, 2, 1, 0]

I took a look at the tests provided by tests/integration/test_advanced_indexing.py.

I think most of the cases come down to the following patterns?

S = structured, SA = structured + affine transform
U = unstructured, UA = unstructured + affine transform
indices = [1, 3, 5, 8]
x = [...]

- transform on src (SA)
  - y = x[1:] 
- transform on src_indirect (UA)
  - y = x[indices[1:]]
- transform on src (SA + U)
  - y = x[1:][indices] 
- transform on src and src_indirect (SA + UA)
  - y = x[1:][indices[1:]] 
  ...
- transform on dst (SA)
  - y[1:] = x 
- transform on dst_indirect (UA)
  - y[indices[1:]] = x
- transform on dst (SA + U)
  - y[1:][indices] = x
- transform on dst and dst_indirect (SA + UA)
  - y[1:][indices[1:]] = x

Am I missing anything here? @manopapad @lightsighter

I don't quite understand in detail how we are going to ensure that dimensionality matches up between different transforms (single copy operation)? Is that something Legion/CuNumeric should/can handle? In addition, for cases where we deal with the distributed data I assume that it's a pre-condition that a copy is split on independent operations on local data only before it's fed to the Realm API?

@ipdemes
Copy link

ipdemes commented May 9, 2022

@apryakhin :

I believe the nv-legate/cunumeric#41 (comment) below would be a gather and the step#2 materializes to a new intermediate array?

Yes it will do gather operation at 3) and copy of the index_array at step 2 materializes transformed array to a new intermediate array.

I think most of the cases come down to the following patterns?

This is correct, but keep in mind that indirect can have multiple transforms.
For the example below indices will have 2 transforms (slice and transpose):

index = [1,2,3]
indices = [index, : , index, 2:]

@manopapad
Copy link
Contributor

Will it be possible for us estimate nevertheless what set of operations are going to be the most critical?

I don't think we can make a meaningful estimate with the information that we have.

I don't quite understand in detail how we are going to ensure that dimensionality matches up between different transforms (single copy operation)? Is that something Legion/CuNumeric should/can handle?

You can assume that the dimensionality of arrays and transforms will be checked before the operation makes it to Legion/Realm (e.g. Legate will never emit a copy where src is a 2d array with a 3d->3d affine transformation).

In addition, for cases where we deal with the distributed data I assume that it's a pre-condition that a copy is split on independent operations on local data only before it's fed to the Realm API?

As analyzed on #705 (comment), I think we can guarantee at the Legate level that each point copy operation will find all the relevant indirection information locally. However, the src/dst_indirect field can contain arbitrary data, which can point outside each point's local portion of src/dst.

AFAIK we are not planning to inspect the indirection data at the Legate level, so Legion and/or Realm will have to handle out-of-partition accesses.

For the example below indices will have 2 transforms (slice and transpose):

I think we are operating under the assumption that all transformations can be collapsed into a single affine transformation.

@apryakhin
Copy link
Contributor

apryakhin commented May 10, 2022

This is correct, but keep in mind that indirect can have multiple transforms.
For the example below indices will have 2 transforms (slice and transpose):

Yes, I believe that multiple affine transforms can be folded into a single one.

Could you walk us through an example with some small arrays, showing what we can do with the new interface?

This API we provide today by Realm (C++):

template <int N2, typename T2 = int>
    class Affine : public CopyIndirection<N,T>::Base {
    public:
      virtual ~Affine(void) {}

      Matrix<N,N2,T2> transform;
      Point<N2,T2> offset_lo, offset_hi;
      Point<N2,T2> divisor;
      Rect<N2,T2> wrap;
      std::vector<IndexSpace<N2,T2> > spaces;
      std::vector<RegionInstance> insts;

      REALM_INTERNAL_API_EXTERNAL_LINKAGE
      virtual IndirectionInfo *create_info(const IndexSpace<N,T>& is) const;
    };

    template <int N2, typename T2 = int>
    class Unstructured : public CopyIndirection<N,T>::Base {
    public:
      virtual ~Unstructured(void) {}

      FieldID field_id;
      RegionInstance inst;
      bool is_ranges;
      bool oor_possible;  // can any pointers fall outside all the target spaces?
      bool aliasing_possible;  // can multiple pointers go to the same element?
      size_t subfield_offset;
      std::vector<IndexSpace<N2,T2> > spaces;
      std::vector<RegionInstance> insts;

      REALM_INTERNAL_API_EXTERNAL_LINKAGE
      virtual IndirectionInfo *create_info(const IndexSpace<N,T>& is) const;
    };
  };

It enables us to handle the following cases (common patterns for structured copy with an affine transform).

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = x[2:7] # translation
y = x[::2]  # scaling
y = x[:, None] # transpose
...

Essentially we can express any transform as a function of A * p + b where:

  • A (N2, N) - transform matrix
  • p (N) - point inside a copy index space
  • b (N2) - offset_lo, offset_hi
  • N - copy domain
  • N2 - destination domain

In addition, the API allows us to express an unstructured indirect copy (however without an affine transform). For example:

indices = [0, 2, 4, 6]
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = x[indices]
...

Besides the mixed indirect copies could you think of any examples that are required for advanced indexing but will not be handled by the current API? @ipdemes @lightsighter @manopapad Thanks!

@manopapad
Copy link
Contributor

It enables us to handle the following cases (common patterns for structured copy with an affine transform).

Note that simply declaring a variable like y = x[2:7] in cuNumeric will not trigger a copy operation, because NumPy semantics dictate that such operations create writable views. Instead we create a lightweight object that carries the transformation info. If y is used in an operation that is implemented using a task (e.g. an arithmetic operation, like z = 2 * y), then we will still not create a new copy of y. Instead we will pass the transformation information over to the C++ task implementation, so it can read at an offset from its assigned chunk.

I can think of a couple of cases where having direct support for one affine indirection on each direction of a copy (but not mixed unstructured + affine) would help us:

  • "Basic" copies on arrays with transformations, e.g. a[1:6] = b[2:7]: These could now be implemented using a copy, instead of a custom copy task.
  • __set_item__ with advanced indexing, e.g. a[2:7][ c[5:10] ] = b[3:8]: We don't have to make a copy to remove the transformation from the value array b[3:8], instead we can pass its transformation as an affine src_indirect field.

@apryakhin
Copy link
Contributor

Just wanted to summarize what we have discussed today:

  • We can start implementing advanced indexing using Legion API (@lightsighter ). It can at least start using an existing Realm support for unstructured copies and will enable us to check for existing bugs in the "plumbing".
  • We need to evaluate the GPU shallow-water solver to come up with a list of use cases for advanced indexing.
  • We need to find and evaluate other applications..to get more data points.

Please find more links below:

  • The proposed version of a new Realm API here
  • The google doc we used for the technical discussion here

@ipdemes
Copy link

ipdemes commented May 18, 2022

@apryakhin :
I have looked through the TorchSW code and these are the advanced indexing APIs that are used here (in the order of # of calls):

y[:, a:b, c:d] = x
y[indices]=x
y[:,a,b:c]=x
x=y[indices]

@apryakhin
Copy link
Contributor

Thanks Irina, there are effectively:

  1. Structured copy, scatter with an affine transform on y and z coordinates
  2. Unstructured copy, scatter
  3. Structured copy, scatter with an affine transform on y and z coordinates
  4. Unstructured copy, gather

We should be able to handle all 4 cases when we merge an "affine" branch into the master.

@manopapad
Copy link
Contributor

@ipdemes: based on what @magnatelee showed in the meeting yesterday, there are cases in the TorchSWE codebase where we have basic indexing mixed with advanced indexing, e.g. something like:

y[:, 0, indices] = x

AFAIU the way we handle this internally results in a scatter-gather copy with a transform on the dst array. First we get rid of the constant 0 by creating a view:

z = y[:, 0]

and then we emit the scatter-gather copy for:

z[:, indices] = x

Currently, because Legion copies cannot handle transformations, the array z is materialized before being passed to a copy. This is one of those cases where we would benefit from having some support for "mixed indexing" at the Legion/Realm level.

Is it possible to collect information on how often this sort of pattern occurs?

Also, in an expression like this x=y[indices], is it possible to tell how often (if at all) x and y are views?

It may not be possible to extract this level of detail by analyzing the text of the code. Assuming we were able to run TorchSWE on one rank, it might be feasible to add some instrumentation to get_item and set_item in deferred.py, and record information on whether self/rhs contains transformations, and what kind of key we're dealing with (basic indexing only, only a boolean array and :, only integer/boolean arrays and :, both basic and advanced indexing) (and whether any integer/boolean index arrays used in the key are themselves views).

@ipdemes
Copy link

ipdemes commented May 19, 2022

I went thought Torch SWE code one time and here is an updated list of Advanced Indexing operations used there:

  Operation                               count
  x[1,…] = y[0,a:b, c:d]                   1
 
  x[a:, indices,indices] +=y.              1
  x= y[a,b:c, d:e]                         1 + N_iterations/constant
  x=y[a,b:c, d:e][tuple of indices]        N_iterations
  x=y[indices]                             2 + N_ interations
  x[indices]=y                             15 * N_iterations
  x=y[a, b:c, d]                           16 +4* some constan
  x[indices]=y[indices]+b[indices]         ~5










here `x` and `y` are output/input arrays




`
a`,`b`,`c`,`d`,`e` - scalars




`
indices` -array of scalar indices

@elliottslaughter elliottslaughter removed this from the 21.09 milestone Mar 14, 2024
@apryakhin apryakhin assigned apryakhin and unassigned lightsighter Sep 17, 2024
@apryakhin apryakhin changed the title realm/legion: support for structured (~affine) indirect copies Realm/Legion: Add support for structured (~affine) indirect copies Sep 17, 2024
@apryakhin apryakhin added this to the realm-24.11 milestone Sep 17, 2024
@apryakhin
Copy link
Contributor

@magnatelee for visibility I am going to be resurrecting this github issue

@apryakhin
Copy link
Contributor

The work is done on the Realm side, however we still need Legion plumbing and testing

@lightsighter
Copy link
Contributor

@apryakhin Can you provide a pointer to the branch where the Realm code is ready?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
best effort indicates the milestone tag for an issue is a goal rather than a commitment enhancement Legion Issues pertaining to Legion planned Feature/fix to be actively worked on - needs release target Realm Issues pertaining to Realm
Projects
None yet
Development

No branches or pull requests

6 participants