Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large Volume Overflow #1639

Open
mpokorny opened this issue Feb 23, 2024 · 15 comments
Open

Large Volume Overflow #1639

mpokorny opened this issue Feb 23, 2024 · 15 comments
Assignees
Labels
bug Legion Issues pertaining to Legion Realm Issues pertaining to Realm

Comments

@mpokorny
Copy link
Contributor

The following program:

#include <cassert>
#include <legion.h>
#include <limits>
#include <vector>

Legion::coord_t min;
Legion::coord_t max;

void
main_task(
  const Legion::Task* task,
  const std::vector<Legion::PhysicalRegion>& /*regions*/,
  Legion::Context ctx,
  Legion::Runtime* rt) {

  std::vector<Legion::Rect<3>> rects{
    Legion::Rect<3>{{12, min, min}, {12, max, max}},
    Legion::Rect<3>{{14, min, min}, {14, max, max}}};
  auto is = rt->create_index_space(ctx, rects);
  Legion::Domain dom = rt->get_index_space_domain(is);
  assert(!dom.contains(Legion::Point<3>{13, 0, 0}));
}

int
main(int argc, char* argv[]) {
  assert(argc >= 3);
  min = std::strtoll(argv[1], nullptr, 0);
  max = std::strtoll(argv[2], nullptr, 0);
  Legion::TaskVariantRegistrar registrar(0, "DomainTest");
  registrar.add_constraint(
    Legion::ProcessorConstraint(Legion::Processor::LOC_PROC));
  Legion::Runtime::preregister_task_variant<main_task>(registrar, "DomainTest");

  Legion::Runtime::set_top_level_task_id(0);
  return Legion::Runtime::start(argc, argv);
}

Succeeds or fails depending on the values of min and max (well within the limits of the index coordinate type) provided on the command line. For example

$ test/DomainTest -1000000000 1000000000
$ test/DomainTest -10000000000 10000000000
...DomainTest.cpp:19: void main_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*): Assertion `!dom.contains(Legion::Point<3>{13, 0, 0})' failed.
Aborted (core dumped)
@lightsighter
Copy link
Contributor

This is actually a Realm bug. Here's an even smaller Realm-only reproducer:

#include <cassert>
#include <realm.h>
#include <vector>

using namespace Realm;

typedef long long coord_t;

int main(int argc, char **argv) {

  Runtime rt;
  rt.init(&argc, &argv);
  
  std::vector<Rect<3,coord_t> > rects{
    Rect<3,coord_t>{{12, -10000000000, -10000000000}, {12, 10000000000, 10000000000}},
    Rect<3,coord_t>{{14, -10000000000, -10000000000}, {14, 10000000000, 10000000000}}};

  IndexSpace<3,coord_t> is(rects);
  IndexSpace<3,coord_t> tight = is.tighten();

  Point<3,coord_t> point{13, 0, 0};
  assert(!is.contains(point));
  assert(!tight.contains(point));

  rt.shutdown();
  return rt.wait_for_shutdown();
}

Something is going wrong in the tighten function call of the index space because the first assertion in this program succeeds, but the second assertion will fail. The tightening operation shouldn't change the set of points in the index space so both assertions should succeed. If I look at the two index spaces in the debugger, the first one correctly has a sparsity map, but the second one does not.

(gdb) p is
$1 = {
  bounds = {
    lo = {
      x = 12,
      y = -10000000000,
      z = -10000000000
    },
    hi = {
      x = 14,
      y = 10000000000,
      z = 10000000000
    }
  },
  sparsity = {
    id = 3458764513820540929
  }
}
(gdb) p tight
$2 = {
  bounds = {
    lo = {
      x = 12,
      y = -10000000000,
      z = -10000000000
    },
    hi = {
      x = 14,
      y = 10000000000,
      z = 10000000000
    }
  },
  sparsity = {
    id = 0
  }
}

It shouldn't be possible for Realm to get rid of the sparsity map here as the two rectangles are not touching in space, so clearly the tighten call has a bug in it.

@eddy16112 @artempriakhin @muraj: Can one of you guys take a look at this failure?

@mpokorny
Copy link
Contributor Author

FYI, although this is clearly a bug, there's no urgency to fix this issue from my immediate perspective. I identified a reasonable workaround in my application where the failure occurred originally while creating the reproducer I provided here.

@lightsighter
Copy link
Contributor

I identified the issue in Realm. It's actually the code that computes the volume of a rectangle that is at "fault" here:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/point.inl?ref_type=heads#L498-520

The volume of this rectangle is larger than what can be represented by a 64-bit unsigned integer (size_t) so we get an unsigned integer overflow which is defined, but does cause the tighten logic to fail when it gets bad rectangle volumes.

@apryakhin
Copy link
Contributor

apryakhin commented Feb 26, 2024

Okay, so I meant to write new unit tests for IndexSpace/Rect that includes tighten and volume anyway, so am I going to do that and part of the PR will be having test that reproduces and fixes the bug.

@lightsighter
Copy link
Contributor

I suspect this will actually go deeper than just fixing volume (conceptually there's nothing wrong with the logic in tighten, it just doesn't pick a big enough integral type to represent the volume of the rectangles). There are lots of places in Realm and Legion that we compute the volume of rectangles/index spaces and right now we assume those volumes can be stored in a size_t (usually a 64-bit unsigned integer). However, especially in higher dimensions, those volumes can get much bigger than size_t, e.g. the max volume of a rectangle expressed mathematically in terms of size_t (for long long coordinate types) can be:

DIM 1: std::numeric_limits<size_t>+1
DIM 2: (std::numeric_limits<size_t>+1)^2
DIM 3: (std::numeric_limits<size_t>+1)^3

So clearly we need a bigger type for representing those potential volumes, and that has to be true both inside the volume method and on the caller side as well. Unfortunately c++ doesn't provide a built in type large enough for that...

Perhaps for now just detecting the overflow and issuing an error will be enough to not give silently bad results.

@lightsighter
Copy link
Contributor

Here's a diff that would catch two different kinds of overflow in the volume computation:

diff --git a/runtime/realm/point.inl b/runtime/realm/point.inl
index 1ce9defef..3df98967f 100644
--- a/runtime/realm/point.inl
+++ b/runtime/realm/point.inl
@@ -504,8 +504,14 @@ namespace Realm {
       else {
         // have to convert both 'hi' and 'lo' to size_t before subtracting
         //  to avoid potential signed integer overflow
-       v *= (static_cast<size_t>(hi[i]) -
-              static_cast<size_t>(lo[i]) + 1);
+        size_t range = static_cast<size_t>(hi[i]) - static_cast<size_t>(lo[i]);
+        // Watch for overflow on the range since we have to increment by one
+        assert(range < SIZE_MAX);
+        range++;
+        size_t vnew = v * range;
+        // Watch for overflow on the multiplication
+        assert((vnew/range) == v);
+        v = vnew;
       }
     return v;
   }

@apryakhin
Copy link
Contributor

@lightsighter do you want to send out the patch? I think for now it's okay to acknowledge and check-fail in volume(). We can discuss on how to actually do a proper fix as we go..with a custom type to hold the sizes...etc.

There are probably multiple places where this need attention

@apryakhin
Copy link
Contributor

unless this is actually blocking @mpokorny now in some ways?

@mpokorny
Copy link
Contributor Author

@apryakhin, this issue isn't blocking me, I have a very reasonable workaround in my application code in place, and I don't foresee having any problems with it for a while.

@lightsighter
Copy link
Contributor

lightsighter commented Feb 26, 2024

@mpokorny Do you expect to have rectangles like this often that have volumes that are bigger than 2^64?

@mpokorny
Copy link
Contributor Author

No. Initially I chose a volume that large to be as conservative as possible in the construction of that index space, which ultimately gets intersected with another index space. But in my application the bounds on the latter index space are known to be significantly smaller -- even smaller than the range I used in the passing test case above -- so in practice a volume of 2^64 is more than sufficient.

@lightsighter
Copy link
Contributor

That's actually very helpful to hear. I was pretty worried there for a moment because there's going to be a lot of code in both Legion and Realm that will need to be updated to deal with this. 😅 I think what we'll probably do is put in some detection code for now to detect these overflows as we gradually move to handling the large volume cases. For now keeping your rects under 2^64 total points should be a good rule of thumb to avoiding these issues, but thanks for finding it and reporting it.

@eddy16112 eddy16112 added the Realm Issues pertaining to Realm label Feb 28, 2024
@apryakhin
Copy link
Contributor

As discussed offline we are going to work on initial planning for a long-term fix to handle large volume sizes..across realm. This won't be P0 though at the moment. The immediate action would be to add overflow-checkers across the code base.

@lightsighter lightsighter added bug Legion Issues pertaining to Legion labels Feb 28, 2024
@lightsighter
Copy link
Contributor

Adding myself also since Legion also needs to check for large volume overflow.

@lightsighter lightsighter changed the title Sparse index space construction failure using large index values Large Volume Overflow Feb 28, 2024
@apryakhin
Copy link
Contributor

The draft is out:

Once I fix CI etc...will send it out for the full review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Legion Issues pertaining to Legion Realm Issues pertaining to Realm
Projects
None yet
Development

No branches or pull requests

4 participants