
Realm Instance Redistricting #1078

Open
lightsighter opened this issue May 12, 2021 · 15 comments
Labels: enhancement · planned (Feature/fix to be actively worked on - needs release target) · Realm (Issues pertaining to Realm)

@lightsighter (Contributor)

In several cases, we have a need for taking an existing Realm instance and splitting it apart into new Realm instances with different layouts. The API for this feature is flexible. The only requirements are that it be able to produce new instances from an existing one with specified layout constraints. Whether the prior instance is completely destroyed or is left with remaining memory is open for discussion. We can also discuss whether this should be a deferred operation in Realm or whether it can be done only eagerly.
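For concreteness, here is one possible shape for such a call, purely as a sketch (these names and signatures are hypothetical illustrations of the request, not an existing or committed Realm API):

```cpp
#include "realm.h"
using namespace Realm;

// Hypothetical sketch of the requested feature: destroy (or shrink) 'inst'
// and carve one or more new instances, each with its own layout, out of the
// memory it occupied. Nothing here is an existing Realm interface.
Event redistrict(RegionInstance inst,                        // instance to split up
                 std::vector<RegionInstance> &new_insts,     // results, one per layout
                 const std::vector<InstanceLayoutGeneric *> &layouts,
                 const ProfilingRequestSet &prs,
                 Event wait_on = Event::NO_EVENT);           // optional deferral
```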

@lightsighter added the enhancement, planned, and Realm labels May 12, 2021
@apryakhin (Contributor) commented Jan 11, 2024

As discussed offline, we are going to evaluate the scope of work to make sure it's planned ahead so that it won't block the control_replication merge.

@lightsighter (Contributor, Author) commented Jan 11, 2024

Just to be clear, you guys won't be blocking the control replication merge no matter what. This is needed for the refactoring of how Legion manages "temporary" memory for tasks, which will be worked on after control replication merges. So you're not on the critical path. 😉

@apryakhin (Contributor) commented Jan 11, 2024

@lightsighter I am re-writing your original problem statement here just to see whether I got it right.

From a given instance A we should be able to create instances B and C (possibly any number of them, limited by the size of A's allocated memory), where B and C can have different layouts. B and C will be created from the same memory allocation.

When either B or C is destroyed, the space can be reclaimed by creating a new instance D from the original instance A.

While B and C exist, instance A should also exist. If A is destroyed (e.g. by calling A.destroy()), instances B and C should also be destroyed, which means A should keep references to B and C. Alternatively, B and C could continue living independently (copy-on-destroy as an option).

While B and C exist, instance A should not be used as it is considered a "split-instance". Instance A remains a split-instance as long as B and C exist.

While B and C exist and originate from the split-instance A, B and C cannot become split-instances themselves, as we restrict splitting to a single level of indirection.

When B and C are deleted and there are no more instances created from A, the split-instance A becomes a normal instance again and can be used to store data for its original layout.

If B has been deleted and C is about to be deleted as the last instance of A, we (optionally) allow the deletion of C to trigger the deletion of A.

The interface:

```cpp
RegionInstance::create_instance(A, memory, instance_layout, ...);

Event e1 = A.create_child_instance(B, instance_layout_b);
Event e2 = A.create_child_instance(C, instance_layout_c);

std::vector<RegionInstance> insts;
std::vector<Event> es = A.create_child_instances(insts, {instance_layout_a, instance_layout_b});
```

I obviously made a lot of assumptions. Please drop comments on what you agree/disagree with, and we can gradually hash out the spec for what we are going to design and implement here.

@lightsighter (Contributor, Author)

> While B and C exist, instance A should also exist.

No, at the point of redistricting, instance A ceases to exist (its instance ID should be reclaimed, and its meta-data deleted). Instance B and C then come into existence as completely new and independent instances, just using the same memory that instance A used to occupy.

> When either B or C is destroyed, the space can be reclaimed by creating a new instance D from the original instance A.

After an instance is redistricted, the new instances that come into existence should be normal instances, and when they are deleted their memory should be reclaimed and available for new instances to be made from it. That should be completely independent for B and C though, so B should be able to be deleted and its memory reused independently of the lifetime of C.

I think that should greatly simplify the design. Instances should be instances; there shouldn't be any special kinds of them. We're just deleting one instance and then reusing its memory to create one or more new instances, potentially with different layouts than the original instance.

@eddy16112 (Contributor)

Are we going to keep some of the data of A in B or C? If not, why do we explicitly want to reuse the memory of A for B and C?

@lightsighter (Contributor, Author)

> Are we going to keep some of the data of A in B or C?

No, at least not explicitly. Realm doesn't have to go through and reinitialize the memory when it does the redistricting, so whatever data was there when A existed will still happen to be there when B and C are created, but that won't be important to these semantics.

> If not, why do we explicitly want to reuse the memory of A for B and C?

To avoid resource deadlocks. Legion needs to reserve memory in advance of running a task for various objects that might need to create instances of their own during the execution of the task (e.g. deferred values/buffers, output regions, futures, etc.). Legion doesn't know in advance what those things are or what their layouts will look like, so we can't make instances for them before the task starts. So instead we reserve a single instance to be the memory pool for the task, and then as the task runs, we'll create new instances from the memory pool instance using redistricting. If we tried to make instances on the fly, we would be exposing ourselves to resource deadlocks. We work around those deadlocks today with the two-pools approach, but we've already determined that is non-viable going forward for a number of reasons. So Legion is going to make a 1-D instance to be the memory pool of each task (size controlled by the task variant/mapper) and then gradually make new instances by breaking off parts of that 1-D pool as the task is running, giving the new instances different layouts from the 1-D one-field layout of the pool instance.
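As a rough sketch of that flow, reusing the hypothetical redistrict signature from the top of the thread (all names here, e.g. deferred_buffer_layout and remaining_pool_layout, are illustrative assumptions, and the create_instance details are elided):

```cpp
// Sketch only: Legion's intended memory-pool flow, with hypothetical names.

// Mapping stage: pre-reserve one 1-D, single-field instance as the task's
// dynamic-memory pool, sized by the task variant/mapper.
RegionInstance pool; // created via RegionInstance::create_instance(...) over a
                     // Rect<1>(0, pool_bytes - 1) index space with one 1-byte field

// Task body: each dynamic allocation (deferred buffer, output region, future
// backing store, ...) is satisfied by breaking a piece off the pool with the
// layout it actually needs, rather than asking the memory for a fresh
// allocation that could become a deferred-allocation back-edge.
std::vector<RegionInstance> pieces;
Event ready = redistrict(pool, pieces,
                         {deferred_buffer_layout, remaining_pool_layout},
                         ProfilingRequestSet());
// pieces[0]: the new instance with its real (e.g. 2-D, multi-field) layout
// pieces[1]: the shrunken 1-D pool, available to be split again later
```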

@eddy16112 (Contributor)

OK, the rationale of avoiding resource deadlocks makes sense to me. I have two other questions:

  1. Once A.create_child_instance(B, instance_layout_b); is called, applications should not be able to apply copy/fill operations to A, right? Because A is not a complete instance any more.
  2. What if A is sparse? I do not know whether Realm will allocate contiguous or sparse memory for A; if the underlying memory is sparse and cannot satisfy the layout of B, what should we do?

BTW, to avoid resource deadlocks, there could be another, simpler solution than nested region instances. We could propose a new API named create_group_instances(std::vector<RegionInstance> &inst, Memory m, ...), which makes sure all the instances are created at the same time.
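A sketch of what that alternative signature might look like (hypothetical, not an existing Realm call; the layouts/event parameters are assumptions filled in for illustration):

```cpp
// Hypothetical batch-creation API: either all of the requested instances are
// allocated together (possibly deferred as a unit), or the request fails as
// a unit, so no subset can be stranded waiting on the others.
Event create_group_instances(std::vector<RegionInstance> &insts,
                             Memory m,
                             const std::vector<InstanceLayoutGeneric *> &layouts,
                             const ProfilingRequestSet &prs,
                             Event wait_on = Event::NO_EVENT);
```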

@lightsighter (Contributor, Author)

> Once A.create_child_instance(B, instance_layout_b); is called, applications should not be able to apply copy/fill operations to A, right? Because A is not a complete instance any more.

Redistricting instance A should completely destroy it. It should be as if we had called destroy on A. There should be no outstanding operations using A at the point of redistricting (unless we want redistricting to be a deferred operation, more on this shortly) and no operations (including copies/fills/accessors) can refer to A again after the point of redistricting.

Note: we could optionally make redistricting a deferrable operation just like destroy is. In that case A would still be viable until the event triggered, and Realm would hand back an event for when the new instances are safe to use, which it would trigger after the redistricting was complete.
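Sketched usage of both flavors, again reusing the hypothetical redistrict signature from the top of the thread (new_insts, layouts, prs, and precondition are assumed to be declared elsewhere):

```cpp
// Immediate form: no precondition; A must have no outstanding operations and
// becomes invalid at the point of the call.
Event ready = redistrict(A, new_insts, layouts, prs, Event::NO_EVENT);

// Deferred form: A remains usable until 'precondition' triggers; 'ready'
// triggers once the redistricting is done and the new instances are safe to use.
Event ready_deferred = redistrict(A, new_insts, layouts, prs, precondition);
```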

> What if A is sparse? I do not know whether Realm will allocate contiguous or sparse memory for A; if the underlying memory is sparse and cannot satisfy the layout of B, what should we do?

I don't think it matters what the layout of A was. For the purposes of redistricting it is just a contiguous block of memory. Even compact-sparse instance layouts guarantee that the pieces of the instance are compact in memory. So Realm is just going to take that block of memory, split it up between the new instances, and then reclaim any leftover memory. If there is not enough space for all the new instances to be satisfied, that can be an error. The layout of A, though, should be meaningless when it comes to doing redistricting; all that matters is whether there is enough space in A to satisfy the instances B, C, etc.

> BTW, to avoid resource deadlocks, there could be another, simpler solution than nested region instances. We could propose a new API named create_group_instances(std::vector&lt;RegionInstance&gt; &inst, Memory m, ...), which makes sure all the instances are created at the same time.

The ability to create grouped instances is not actually the problem from a deadlock perspective. Deadlocks occur when Realm does a deferred allocation (meaning the instance won't be available until some other instance is deleted) for an earlier operation that depends on the deletion of an instance allocated for a later operation in program order. Say we have two tasks A and B with B depending on the results of A (a true data dependence). Task A maps and starts running. Then task B maps and allocates an instance X that it will use when it runs. Next, as task A is running, it decides it needs to create an on-the-fly eager instance. It asks Realm to do this, but memory is tight. Realm can see that instance X will be deleted after task B runs, so it happily says to task A: here is instance Y for you to use, but you can only use it once this event triggers, and I'll trigger it as soon as instance X is deleted. We've now created a resource deadlock cycle: B has a true data dependence on A, but A cannot continue until B runs and deletes instance X. To avoid this, Legion needs to pre-reserve an instance during the mapping stage of the pipeline for task A to use as a memory pool for any dynamic allocations that A needs to do when it is running, to avoid creating the back-edge in the dependence graph illustrated above.

Now you might say: why not make an allocation interface in Realm that only succeeds if the allocation can be immediately satisfied, with no deferred allocations? It's true that it would avoid the cycle, but remember that memory was tight in the example above. Realm couldn't satisfy A's request for an instance eagerly, which is why it had to do a deferred allocation. So if you make an interface that only succeeds for eager allocations, then the example above will still fail, just with a perceived out-of-memory error. Better than a deadlock, but still suboptimal. This program can actually fit in memory; we just need A to pre-reserve its memory so that B is the one that gets the deferred allocation. Legion will pre-reserve the memory pool for task A, but then it needs to be able to split that pool up into new instances with arbitrary layouts.
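To make the back-edge explicit, the scenario above written out as a timeline (illustration only):

```
t0: task B maps; Realm allocates instance X for B (B will run after A finishes)
t1: task A, already running, asks for an on-the-fly instance Y
t2: memory is tight, so Realm defers: "Y is usable once X is destroyed"
t3: A blocks waiting on Y's readiness
t4: X is destroyed only after B runs, and B waits on A's results

cycle: A waits on Y -> Y waits on X's destruction -> X waits on B -> B waits on A
```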

@eddy16112 (Contributor)

Yeah, I understand the case of resource deadlock. So I assume that the Legion runtime does not even know that there is an eager instance during the mapping stage of task A, right? I am thinking about whether we can let every Realm Memory have an eager and a deferred pool, such that all create_eager_instance calls use the eager pool. This should solve the deadlock without introducing instance redistricting.

@lightsighter (Contributor, Author) commented Jan 13, 2024

> So I assume that the Legion runtime does not even know that there is an eager instance during the mapping stage of task A, right?

I don't want to ask users to enumerate all the ways that they will need to create an instance during the execution of the task. Asking them to put an upper bound on how much dynamic memory they will need during the execution of the task seems like a reasonable ask though.

> I am thinking about whether we can let every Realm Memory have an eager and a deferred pool, such that all create_eager_instance calls use the eager pool.

That doesn't actually fix the problem. The problem, as @manopapad and @magnatelee will tell you, is that having two pools is absolutely terrible for the user experience. The whole reason that we're doing this is to get rid of having two pools, so users don't need to decide how to split their memories up, because they really don't know how to do that well and they are almost guaranteed to make the wrong split of their memory into the two pools.

@eddy16112 (Contributor)

> Asking them to put an upper bound on how much dynamic memory they will need during the execution of the task seems like a reasonable ask though.

It means Legion will allocate a fixed-size eager instance for all tasks. I think it may waste resources because not all tasks will need an eager memory pool of the same size. If we have two global pools for each type of memory, then each task can allocate instances when necessary. Actually, I do not think it is terrible for the user experience, because applications will only need to use a separate API (create_eager_instance) to create eager instances when needed.

@lightsighter (Contributor, Author)

> It means Legion will allocate a fixed-size eager instance for all tasks.

No, Legion will never guess how much dynamic memory to allocate for a task.

> I think it may waste resources because not all tasks will need an eager memory pool of the same size.

That's why Legion will never guess how much dynamic memory a task will require: it will mandate that the application/mapper specify an upper bound. For every single task, either the application will need to put a static upper bound on the amount of dynamic memory a task variant requires when it is registered with Legion, or the mapper will have to specify an upper bound when it maps the task before it runs. Either way, how much "waste" there is will be a function of how tight the upper bound provided by the application/mapper is relative to the actual amount of memory used by the task. Applications/mappers can be as precise/imprecise as they want to be in bounding the amount of dynamic memory they might ultimately need. Legion will never guess how much dynamic memory a task might need; it will always be under the control of the application/mapper.

> If we have two global pools for each type of memory, then each task can allocate instances when necessary. Actually, I do not think it is terrible for the user experience, because applications will only need to use a separate API (create_eager_instance) to create eager instances when needed.

That's not the only thing they need to do. Before the program starts, every single memory in the machine where eager-only instances need to be made has to be partitioned into two pools. Specifying the percentage of memory that goes into each of the two pools is very hard, and the right split is often different for many different memories across the machine. Furthermore, users don't understand what it is for, and therefore they almost never size it correctly. This is what we have today, and the Legate team has already said categorically that it is intolerable from a user experience perspective and needs to be abolished. You're welcome to try to convince them otherwise. 😉 I personally think having two pools is ugly from an implementation side of things, but I probably wouldn't have prioritized addressing it for quite a while since it works. However, the user experience is so bad and so many people have complained about it that the Legate team has mandated that removing the two pools should be the next most important thing to do in Legion after merging the control replication branch, which should give you some indication of how horrible it is for them.

I'll also say that I'm not opposed to having Realm add a way to only allocate an instance if it can be allocated immediately (e.g. create_eager_instance), but Legion is never going to use it. As I pointed out in the scenario above, if we happened to use that method to allocate the instance for task A as it was running, then we would have failed the allocation. That is a false-positive out-of-memory error. If we had been more disciplined in the way we allocated memory, the program would have fit in memory. I don't want Legion to ever give a false-positive out-of-memory error.

@apryakhin (Contributor) commented Jan 16, 2024

> Redistricting instance A should completely destroy it. It should be as if we had called destroy on A. There should be no outstanding operations using A at the point of redistricting (unless we want redistricting to be a deferred operation, more on this shortly) and no operations (including copies/fills/accessors) can refer to A again after the point of redistricting.

The redistricting operation should be deferred, in my opinion, like most of the operations Realm does today - it is consistent with the programming model. You would be able to turn it off by providing a NO_EVENT handle. You may not need it for the case you described and can just plan for a particular instance to be redistricted from the start. However, I believe we can think of use cases that have some in-flight operations going into A when an application decides that it needs to be redistricted. At the very least, we may design for deferred redistricting and not necessarily implement it right away.

> It means Legion will allocate a fixed-size eager instance for all tasks. I think it may waste resources because not all tasks will need an eager memory pool of the same size. If we have two global pools for each type of memory, then each task can allocate instances when necessary. Actually, I do not think it is terrible for the user experience, because applications will only need to use a separate API (create_eager_instance) to create eager instances when needed.

Let's not go down this road. I don't think we need a two-pool system here, and most importantly it does not solve the problem that we are trying to solve. Assuming the available memory for the overall application is tight, the eager pool will only "steal" part of it from the regular pool (used by task B in this example), which may result in an OOM failure when scheduling task B far out. It's also possible that task A will need all the available memory to run (again, assuming memory is really tight), which means the eager pool would have to be sized to the maximum, leaving the regular pool with nothing; if we size them wrong, this whole example will again result in an OOM failure. Overall the two-pool approach sounds complicated, and I am not in favor of it. What we need here (in the example with tasks A and B) is just a single reusable allocation and a forward dependency (as opposed to the backward edge).

> Say we have two tasks A and B with B depending on the results of A (a true data dependence). Task A maps and starts running. Then task B maps and allocates an instance X that it will use when it runs. Next, as task A is running, it decides it needs to create an on-the-fly eager instance. It asks Realm to do this, but memory is tight. Realm can see that instance X will be deleted after task B runs, so it happily says to task A: here is instance Y for you to use, but you can only use it once this event triggers, and I'll trigger it as soon as instance X is deleted. We've now created a resource deadlock cycle: B has a true data dependence on A, but A cannot continue until B runs and deletes instance X. To avoid this, Legion needs to pre-reserve an instance during the mapping stage of the pipeline for task A to use as a memory pool for any dynamic allocations that A needs to do when it is running, to avoid creating the back-edge in the dependence graph illustrated above.

@lightsighter Can you think of any other examples that would be good as a case-study? I think we should document them as much as possible right now to make sure we design it all the right way from the start.

@lightsighter (Contributor, Author)

> However, I believe we can think of use cases that have some in-flight operations going into A when an application decides that it needs to be redistricted. At the very least, we may design for deferred redistricting and not necessarily implement it right away.

Yes, I'm fine with the interface being deferrable, and then in the Legion case I'll just call the version with a NO_EVENT precondition so it happens immediately.

> Can you think of any other examples that would be good as a case-study? I think we should document them as much as possible right now to make sure we design it all the right way from the start.

At least all of the Legion examples are some variation on the one I presented above. Deferred values/buffers, output regions, futures: they all have the same flavor as the case above. I suppose one other case that I've considered, although I don't have a plan to implement it right now, is to allow mappers to try to "realloc" an instance from an existing instance, which would also involve redistricting, although in that case the current interface would work as well. Other than that I can't think of any other use cases off the top of my head that couldn't be handled by the interface described above.

@apryakhin (Contributor)

We are getting close to getting the MR through review and merging:
