Deterministic reductions #788

Open
elliottslaughter opened this issue Mar 26, 2020 · 12 comments

@elliottslaughter

There are two different use cases I can think of:

  • Debugging: if you suspect there is something wrong with your reduction operator, it's potentially easier to check and debug this way
  • Expressiveness: some reduction operators may not be commutative (associative? I can never keep track of which one we care about)

For the former it might be sufficient for this to be a flag (like -lg:inorder), but if we want to handle both use cases we'd probably want a way to express this in the code. We've suggested in the past this might be associated with REDUCE EXCLUSIVE, but that would be a big performance degradation for existing codes.
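For reference, this is roughly how a reduction requirement with exclusive coherence is written in today's C++ API; the open question is where a determinism request would attach. A minimal sketch: REDUCER_TASK_ID and REDOP_SUM are placeholder ids, and the launcher setup is abbreviated.

#include "legion.h"
using namespace Legion;

enum { REDUCER_TASK_ID = 1 };  // placeholder task id
enum { REDOP_SUM = 1 };        // placeholder reduction op id

void launch_reducer(Runtime *rt, Context ctx, LogicalRegion lr)
{
  TaskLauncher launcher(REDUCER_TASK_ID, TaskArgument());
  // Reduction privilege is requested by naming the reduction operator;
  // EXCLUSIVE is the coherence mode this discussion centers on.
  launcher.add_region_requirement(
      RegionRequirement(lr, REDOP_SUM, EXCLUSIVE, lr /*parent*/));
  rt->execute_task(ctx, launcher);
}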

@streichler

streichler commented Mar 26, 2020

Another problem with using the coherence modes is that we don't actually have a mode that corresponds to the current (fast, and generally desirable) behavior - SIMULTANEOUS would prohibit a tree-shaped folding of reduction instances for example.

My vote would be to add two options to reduction-op registration:

  1. commutative - defaults to true (whether it's really associative or commutative depends on your choice of formalism for reductions, but most folks expect to see commutative I think)
  2. exclusive-only - defaults to false - set to true if the "non-exclusive" form of the reduction cannot be made thread-safe and heavier-weight locking is required by Legion or Realm

This handles your second case on a per-reduction basis, but then new command-line options could be added to force everything to non-commutative and/or exclusive-only for debugging.
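A sketch of what that registration could look like. Neither flag exists in Legion today; the shape below just extends the existing Runtime::register_reduction_op pattern, and the operator and ids are illustrative.

#include "legion.h"
using namespace Legion;

// Illustrative float-sum operator in the usual REDOP shape.
struct FloatSum {
  typedef float LHS;
  typedef float RHS;
  static constexpr float identity = 0.f;
  template<bool EXCLUSIVE> static void apply(LHS &lhs, RHS rhs) {
    lhs += rhs;  // a real non-exclusive variant would need an atomic update
  }
  template<bool EXCLUSIVE> static void fold(RHS &rhs1, RHS rhs2) {
    rhs1 += rhs2;
  }
};

enum { REDOP_FLOAT_SUM = 1 };  // placeholder reduction op id

void register_ops() {
  // Hypothetical extra arguments per the proposal above:
  //   commutative    - defaults to true
  //   exclusive_only - defaults to false
  Runtime::register_reduction_op<FloatSum>(
      REDOP_FLOAT_SUM,
      /*commutative=*/false,      // float + is order-sensitive
      /*exclusive_only=*/false);  // non-exclusive form can be thread-safe
}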

@lightsighter lightsighter added backlog Feature/fix that is desirable, but not currently planned enhancement Legion Issues pertaining to Legion labels Mar 27, 2020
@RAMitchell

In machine learning or linear algebra applications it is important that floating point operations be reproducible. This aids the debugging and experimentation process, as floating point non-determinism can be ruled out. It also makes unit testing easier.

Full determinism under a changing number of machines is extremely difficult to achieve, requiring (emulated) extended precision operations. A lesser and more useful requirement is to have reproducible results on the same machine configuration (e.g. same number of GPUs, CPUs). This would mean the reduction tree should be the same on subsequent runs.

@RAMitchell

I believe it would be sufficient to specify determinism as a template parameter to ReductionAccessor, ensuring that the generated reduction tree is consistent for the given inputs.

@magnatelee is this correct? I don't think it's necessary (or practical) to ask for determinism in other scenarios, e.g. concurrent writes/reads to a region.

The end goal would be to have operations like cuNumeric matrix multiplies, which depend on Legion reductions, return consistent results.

@lightsighter

So I think there's a bit more nuance to this than just telling Legion we want reductions to be deterministic, because that has ramifications that ripple back to the user. There are two sources of non-determinism when it comes to reductions. The first source is the runtime system, in how it stores the names of reduction instances and applies those reductions to normal instances. There are benign races right now to record the names of reduction instances with equivalence sets. These can be fixed (if the user is willing to pay the cost of having the runtime provide a deterministic order, more on this later); the application of reductions from the equivalence sets is already deterministic. The more concerning source of non-determinism though comes from the mapper. Consider the following Legion program:

t1: reduce<+> exclusive r
t2: reduce<+> exclusive r
t3: reduce<+> exclusive r
t4: read r

On the surface this appears like it should be quite easy to provide determinism for this program. However, because reductions are accumulated into reduction instances, different choices of mapping can produce non-deterministic results. Consider the following two mapping choices:

Mapping choice 1:

t1: r -> reduction instance A
t2: r -> reduction instance A
t3: r -> reduction instance B
t4: r -> normal instance C

Mapping choice 2:

t1: r -> reduction instance A
t2: r -> reduction instance B
t3: r -> reduction instance B
t4: r -> normal instance C

Even if we execute these tasks in program order (literally running each one to completion before starting the next one), and have the runtime apply the reduction instances to the normal instance in the same order, you can still get different results from one run to the next because the mapper changed its decisions and the way reductions were grouped created different results (e.g. because floating point math is not associative). Therefore, the sequential specification of the application and a deterministic runtime system is insufficient to guarantee deterministic reductions. Something needs to constrain the mapping decisions too.
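To make the grouping hazard concrete, here is a tiny self-contained demonstration (plain C++, not Legion code) that folding the same three contributions under the two mappings above can legitimately produce different floating point answers:

#include <cstdio>

int main() {
  float t1 = 1e8f, t2 = -1e8f, t3 = 1.0f;
  // Mapping choice 1: t1 and t2 fold into instance A, t3 into B.
  float first = (t1 + t2) + t3;   // 0.0f + 1.0f == 1.0f
  // Mapping choice 2: t1 into A, t2 and t3 into B.
  float second = t1 + (t2 + t3);  // t3 is lost to rounding, result 0.0f
  printf("mapping 1: %f, mapping 2: %f\n", first, second);
  return 0;
}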

Now that we've established what the sources of non-determinism are, we can start to discuss potential solutions. I think there are actually four different classes of reduction semantics we might consider supporting in Legion. Two of them have real users, while the other two are more hypothetical (although I have met non-Legion users who want these semantics).

  1. Parallel + Non-Deterministic: This is the semantics we have currently. The runtime is allowed to execute all tasks performing reduction operations (of the same kind of reduction operator) in parallel and group/apply those reductions in any way it sees fit as long as all values eventually contribute once to the value read by the next reader.
  2. Parallel + Deterministic: I believe (but am not totally sure) that this is the semantics being requested in this issue. The runtime is allowed to execute all tasks performing reduction operations (of the same kind of reduction operator) in parallel, but the way that the reductions are accumulated must result in the same value being produced across runs (regardless of mapping decisions).
  3. Sequential: Legion must maintain the original sequential semantics of the program and make it appear as though all of the reductions are applied in exactly program order.
  4. Precise: Legion should sort reductions into the order that guarantees the resulting value read by the next reader has maximal precision (at least one user has floated the idea of having Legion sort floating point values before performing summations).

I'm going to go ahead and rule out semantics (4) since it is expensive to implement and (almost) nobody uses it. The next thing to notice is that if your reduction operator is both associative and commutative (e.g. integer summation modulo over/under-flow), then the implementation for (1) will also produce the same result, consistent with the semantics of (2) and (3). So what we're really talking about here is how to provide options (2) and/or (3) for reduction operators that are not associative, not commutative, or both, and how a user is going to specify that. I'd like both @elliottslaughter and @RAMitchell to indicate whether they want (2) or (3) or both.

Now for the rest of this comment I'm going to go ahead and assume that we're talking about providing (2).

I personally don't like the idea of specifying whether a reduction operator is deterministic or not. Reduction operators are not inherently deterministic or non-deterministic; it's how they are used that determines whether they are deterministic or not. Consider floating point summation: you can make it deterministic by ensuring that all floating point values are grouped and summed in the same way from one run to the next, and you'll have a deterministic reduction even though floating point summation is neither commutative nor associative. I do think we need to add a way for applications to signal to the runtime, when a particular reduction operator is registered, whether it is both associative and commutative. For the rest of this we'll assume the runtime has some way to know whether a reduction operator is associative and commutative or not.

Now, when it comes to specifying determinism, I do think coherence modes are the right way to specify this, as it allows the application to control semantics at different points in the program in different ways. Here are my proposed semantics for each of the coherence modes, depending on whether the reduction operator is associative and commutative (A&C) or not (a usage sketch follows the list):

  • Exclusive (Reduction operator is A&C): Each task performing reductions gets exclusive access to the reduction instance it maps to. The runtime can use a non-deterministic algorithm for combining reduction results as the result will always be the same.
  • Exclusive (Reduction operator is not A&C): Each task must use a "fresh" reduction instance (one not previously used in the current reduction epoch) to avoid the mapping non-determinism problem outlined above (it will be up to the mappers to pick a new reduction instance for each of these, but the runtime will check they abide by the rules). The runtime will ensure that there is a deterministic application of these reduction instances so the result appears deterministic across runs.
  • Atomic (regardless of reduction operator A&C): Tasks can map to the same reduction instance and will be given coarse-grained atomic access for the lifetime of their task to perform their reductions. The runtime can use a parallel and potentially non-deterministic algorithm for applying reductions since the application already relaxed ordering constraints for tasks performing reductions.
  • Simultaneous (regardless of reduction operator A&C): Tasks can map to the same reduction instance and be executed in parallel on the same reduction instance concurrently. The runtime can use a parallel and potentially non-deterministic algorithm for applying reductions since the application already relaxed ordering constraints for tasks performing reductions.
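Here is a sketch of how the proposal might look from the application side, assuming the hypothetical A&C flag at registration; task ids, redop ids, and the launch domain are placeholders. The same (not A&C) operator is used deterministically in one launch and in the relaxed mode in another:

#include "legion.h"
using namespace Legion;

enum { HPC_TASK_ID = 1, ML_TASK_ID = 2, REDOP_FLOAT_SUM = 1 };

void launch_both(Runtime *rt, Context ctx, LogicalRegion lr,
                 LogicalPartition lp, Domain launch_domain)
{
  // Deterministic use: under the proposal, EXCLUSIVE on a not-A&C operator
  // forces fresh reduction instances and a deterministic fold order.
  IndexTaskLauncher hpc(HPC_TASK_ID, launch_domain,
                        TaskArgument(), ArgumentMap());
  hpc.add_region_requirement(
      RegionRequirement(lp, 0 /*projection*/, REDOP_FLOAT_SUM,
                        EXCLUSIVE, lr));
  rt->execute_index_space(ctx, hpc);

  // Relaxed use of the same operator elsewhere in the application:
  // ATOMIC keeps today's fast, possibly non-deterministic grouping.
  IndexTaskLauncher ml(ML_TASK_ID, launch_domain,
                       TaskArgument(), ArgumentMap());
  ml.add_region_requirement(
      RegionRequirement(lp, 0 /*projection*/, REDOP_FLOAT_SUM,
                        ATOMIC, lr));
  rt->execute_index_space(ctx, ml);
}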

I like this approach for two reasons:

  1. Reduction operators that are marked associative and commutative suffer no performance degradation relative to the current implementation.
  2. Reduction operators that are marked not associative or not commutative can be used with both deterministic and non-deterministic settings even in the same application. You might have a numerical algorithm that you want to be deterministic in your HPC simulation, but you might also want your machine learning data analytics code to use the same reduction operator in a more relaxed, non-deterministic way in the same application.

The only thing I don't like about this approach is the requirement that it places on the mapper for mapping exclusive reductions with reduction operators that are not associative and commutative. They effectively need to be able to determine when a reduction instance has previously been used in the same epoch and then avoid reusing it. Perhaps we can modify the acquire_instance semantics to support this. Unfortunately I don't see another way to guarantee the determinism of reductions for non-associative or non-commutative operators other than forcing each task to use its own reduction instance across all iterations, thereby allowing the runtime to reduce all those instances in the same way each time. Anything else risks allowing the mapper to combine reductions in a non-deterministic way across runs.

Responding to some of the other considerations brought up before:

We've suggested in the past this might be associated with REDUCE EXCLUSIVE, but that would be a big performance degradation for existing codes.

At least in my proposal above, that would only be true if reduction operators were marked as not associative and commutative. For summations on floating point types, though, we would need to educate folks to pay attention to that, and they would need to switch to atomic coherence to get the semantics that they are getting today.

SIMULTANEOUS would prohibit a tree-shaped folding of reduction instances for example.

I don't think that's the case. For non-reduction cases, simultaneous just says that two tasks running concurrently must be using the same instance so they can see each other's updates. We still allow tasks with simultaneous coherence to map to different instances; it's just that when they do, we serialize their execution (e.g. one task runs, we copy the data from the instance used by the first task to the instance used by the second task, and then run the second task). For simultaneous reductions, my interpretation is that two tasks using the same reduction instance with simultaneous coherence don't have any guarantees about atomicity to the instance and will need to use fine-grained read-modify-write reductions to safely update the reduction instance. The runtime can still reorder tasks as it sees fit for simultaneous reductions, as there is no implied ordering of execution by simultaneous coherence and there's no way to "read" with reduction privileges.
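For what fine-grained read-modify-write reductions could look like inside such a task, here is a generic sketch (not Legion-provided code) of an atomic float sum via a compare-and-swap loop. It assumes the target supports lock-free 32-bit atomics and uses the usual bit-reinterpretation idiom:

#include <atomic>
#include <cstdint>
#include <cstring>

// Atomically perform *addr += value without coarse-grained locking.
void atomic_float_add(float *addr, float value) {
  auto *bits = reinterpret_cast<std::atomic<uint32_t>*>(addr);
  uint32_t expected = bits->load(std::memory_order_relaxed);
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &expected, sizeof(float));
    float new_val = old_val + value;
    uint32_t desired;
    std::memcpy(&desired, &new_val, sizeof(float));
    // On failure, 'expected' is refreshed with the current bits; retry.
    if (bits->compare_exchange_weak(expected, desired,
                                    std::memory_order_relaxed))
      return;
  }
}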

@RAMitchell

Yes, (2). (3) seems overly restrictive and, in my applications, unnecessary.

@elliottslaughter

I have no horse in this race because Regent (like Legion) doesn't care. The people we need to hear from are application folks like @mariodirenzo and @jpietarilagraham.

But this is incorrect:

Even if we execute these tasks in program order (literally running each one to completion before starting the next one), and have the runtime apply the reduction instances to the normal instance in the same order, you can still get different results from one run to the next because the mapper changed its decisions and the way reductions were grouped created different results (e.g. because floating point math is not associative).

Because Legion can execute the reduction as follows:

In the first example:

Legion fills A with the identity
t1: r -> reduction instance A
Legion applies A to C
Legion fills A with the identity
t2: r -> reduction instance A
Legion applies A to C
Legion fills B with the identity
t3: r -> reduction instance B
Legion applies B to C
t4: r -> normal instance C

And the second example:

Legion fills A with the identity
t1: r -> reduction instance A
Legion applies A to C
Legion fills B with the identity
t2: r -> reduction instance B
Legion applies B to C
Legion fills B with the identity
t3: r -> reduction instance B
Legion applies B to C
t4: r -> normal instance C

That is, if Legion wants to enforce a truly sequential semantics, Legion must (a) fill each reduction instance prior to use, and (b) flush it to a normal instance (which then becomes the valid copy of the data) after each reduction task completes. This reduces parallelism, but not necessarily to zero. I.e., the second example produces an event graph like (in Graphviz syntax):

digraph events {
  "fill A" -> "t1" -> "apply A to C";
  "fill B (first)" -> "t2" -> "apply B to C (first)";
  "apply B to C (first)" -> "fill B (second)" -> "t3" -> "apply B to C (second)" -> "t4";
  "apply A to C" -> "apply B to C (first)";
}

Therefore, no mapper changes are required and the mapper only needs to think about this to the extent it wants to improve parallelism (like with breaking WAR dependencies).

@lightsighter

@elliottslaughter There's an issue with your logic. Tasks must map in order with respect to their mapping dependences to avoid resource deadlocks. There are mapping dependences between all the reduction tasks and the reading task. Therefore you cannot safely map task t4 until all prior tasks are mapped. There is often no instance into which you can apply reductions early. For example, the program often looks like this:

fill r
t1: reduce<+> exclusive r
t2: reduce<+> exclusive r
t3: reduce<+> exclusive r
t4: read r

Instance C will not come into existence until t4 is mapped, which cannot happen until after all prior tasks have mapped. Therefore the runtime has no guaranteed place where it can apply reduction instances early. The runtime cannot predict the future, and mapping out of dependence order risks resource deadlocks.

@elliottslaughter

To be fair, this is a tradeoff we can bend. Maybe the total complexity of this solution isn't lower overall, but we can at least shift the burden around.

If I understand your suggestion, the runtime must check when the mapper has asked for two sequential reductions to the same instance, and issue a hard error in that case. It's just a mapper error, and mappers are expected not to request that mapping (possibly with the help of a runtime query to figure out whether a reduction instance is safe for reuse or not).

The other option is when the runtime detects this scenario, we have an additional mapper call where we ask the mapper to concretize the current reduction. I.e., we essentially insert an implicit read into the task stream and force the mapper to resolve this read before moving forward. Keep in mind that if we reach this point, the mapper already asked for two sequential reductions back-to-back in the same instance. So all your parallel cases continue to work unmodified. But in a sequential case, this gives the mapper an opportunity to materialize the region state rather than being forced to pick N different reduction instances (potentially at higher overall memory usage) in the case where the application issues that many reductions in a row.

I don't like increasing the mapper surface area but it feels more "correct" if this is, in fact, a use case we want Legion to support, and potentially gives us memory savings in some scenarios.
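For concreteness, the extra mapper call being floated might have a shape like the following. This is purely hypothetical: nothing like it exists in the current mapper interface, and all names here are invented.

#include <vector>
#include "legion.h"
using namespace Legion;

// Hypothetical input/output structs for a "concretize" mapper call.
struct ConcretizeReductionInput {
  Mapping::PhysicalInstance reduction_instance;  // state to materialize
};
struct ConcretizeReductionOutput {
  std::vector<Mapping::PhysicalInstance> chosen_instances;  // destination
};

class ExampleMapper {
public:
  // The runtime would invoke this when it detects back-to-back reductions
  // to the same instance and needs the current state materialized before
  // the next reduction task can proceed.
  virtual void concretize_reduction(const Mapping::MapperContext ctx,
                                    const Task &task,
                                    const ConcretizeReductionInput &input,
                                    ConcretizeReductionOutput &output);
};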

The other option I see is to declare that all of this is too much work and avoid supporting a feature that very few users will use anyhow.

@lightsighter

If I understand your suggestion, the runtime must check when the mapper has asked for two sequential reductions to the same instance, and issue a hard error in that case.

To be clear, I wasn't actually suggesting it necessarily had to be a hard error. It could also fail to do the "acquire" in that case so the mapper would be free to select a different reduction instance. However, ensuring that each reduction task updates its own reduction instance was how I was going to be able to provide determinism. The mapper can still shuffle which instances are used by which tasks from run to run, but the way that reductions would be accumulated would be deterministic based on the tasks that are launched.

The other option is when the runtime detects this scenario, we have an additional mapper call where we ask the mapper to concretize the current reduction.

Not sure if you or anyone else remembers, but the version of the Legion mapper from 2011-2016 actually had a mapper call like this (you can still see the residual input and output structs for backward compatibility, e.g. for tasks). It was called something like "create temporary instance" and the runtime could invoke it anytime it needed to materialize an intermediate result unrelated to the mapping of the current operation. Everybody hated it because it was really hard to have any context for how the instance would ultimately be used by downstream tasks, so it was hard to predict where to make the instance, how big it should be, and what its layout should be. It's a bit of an unwritten rule of the mapper interface today, based on that prior experience, that we don't do this: all mapper calls should be directly related to some operation that the application created, so the mapper can clearly make decisions about why certain things are being done for that operation. I'm not completely against adding a new mapper call like this, but I do think we should think carefully about it, especially since we have the unique experience of having had this exact mapper call before and deciding it was really bad and should be removed.

The other option I see is to declare that all of this is too much work and avoid supporting a feature that very few users will use anyhow.

So I think this is probably untenable. The request here at least is coming from the cuNumeric team as they would like to support deterministic floating point reductions. I personally think it is a reasonable thing for Legion to provide deterministic (but still parallel) reductions of non-associative and non-commutative operators for people willing to pay the performance cost of that determinism. I would probably have pushed back on requests for semantics (3) and (4) from above, but (2) seems like a reasonable thing to support.

@manopapad

cuNumeric can live with the separate-reduction-instance-per-point-task restriction.

Unfortunately I don't see another way to guarantee the determinism of reductions for non-associative or non-commutative operators other than forcing each task to use its own reduction instance across all iterations, thereby allowing the runtime to reduce all those instances in the same way each time. Anything else risks allowing the mapper to combine reductions in a non-deterministic way across runs.

To play devil's advocate, you could support this by building the reduction network in two stages: first serialize (in point task id order) all the point tasks that reduce into the same instance, then build the tree on top of that. I.e. this situation:

point task:              0 1 2 3 4 5 6 7
uses reduction instance: a a a b b c c d

produces this dependence tree:

0 -> 1 -> 2 ---*-----*
              /     /
     3 -> 4 -*     /
                  /
     5 -> 6 ---*-*
              /
          7 -*

Whether this is too much overhead to be worth the trouble, that's another story :-)
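A small self-contained sketch (plain C++, not Legion internals) of the two-stage construction: chain point tasks sharing an instance in point-id order, then build a pairwise tree over the per-instance tails, printing the resulting dependence edges:

#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
  // point task id -> reduction instance, as in the example above
  std::vector<char> inst = {'a','a','a','b','b','c','c','d'};

  // Stage 1: serialize within each instance (chain in point id order).
  std::map<char, std::string> tail;  // last event per instance
  for (size_t p = 0; p < inst.size(); ++p) {
    std::string ev = "t" + std::to_string(p);
    auto it = tail.find(inst[p]);
    if (it != tail.end())
      printf("%s -> %s\n", it->second.c_str(), ev.c_str());
    tail[inst[p]] = ev;
  }

  // Stage 2: pairwise tree over the per-instance tails, in instance order.
  std::vector<std::string> level;
  for (auto &kv : tail) level.push_back(kv.second);
  while (level.size() > 1) {
    std::vector<std::string> next;
    for (size_t i = 0; i + 1 < level.size(); i += 2) {
      std::string merged = "(" + level[i] + "+" + level[i+1] + ")";
      printf("%s -> %s\n%s -> %s\n", level[i].c_str(), merged.c_str(),
             level[i+1].c_str(), merged.c_str());
      next.push_back(merged);
    }
    if (level.size() % 2) next.push_back(level.back());  // odd one out
    level = std::move(next);
  }
  return 0;
}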

@lightsighter

lightsighter commented Oct 19, 2023

To play devil's advocate, you could support this by building the reduction network in two stages: first serialize (in point task id order) all the point tasks that reduce into the same instance,

What happens when the mapper changes the number and/or grouping of point tasks that map to the same instance from one run to the next (e.g. because they implemented a dynamic load balancing strategy)? (See the example in this comment.)

@manopapad

Fair, you would only be able to guarantee the same order of reductions modulo sharing of reduction instances, which is a mapper decision. And since you want to support this at the task privilege level, you wouldn't want the mapper's decisions to affect it. So scratch my previous post then.
