S3D: creating hundreds of instances #1678

Open
syamajala opened this issue Apr 5, 2024 · 14 comments

@syamajala
Contributor

I can see S3D creating hundreds of small instances in system memory and I have no idea why or where they're coming from.

I tried the logging wrapper and can see things like:

[0 - 1554f8036000]    8.970248 {2}{inst}: creating new local instance: 4000000000000022
[0 - 1554f8036000]    8.970253 {1}{inst}: instance layout: inst=4000000000000022 layout=Layout(bytes=3504, align=7008, fields={0=0+0}, lists=[[<0>..<0>->affine(<3504>+0)]])
[0 - 1554f8036000]    8.970255 {1}{inst}: allocation completed: inst=4000000000000022 offset=133152
[0 - 1554f8036000]    8.970257 {2}{inst}: instance created: inst=4000000000000022 external=memory(base=154c8e020820, size=3504) ready=0
...
[0 - 1554f8053000]   15.294676 {2}{inst}: instance destroyed: inst=4000000000000022 wait_on=0
[0 - 1554f8053000]   15.294679 {1}{inst}: deallocation completed: inst=4000000000000022
[0 - 1554f8053000]   15.294771 {2}{inst}: releasing local instance: 4000000000000022

but I never see the instances get used anywhere.

The output of the logging wrapper is here: http://sapling2.stanford.edu/~seshu/s3d_tdb/instances/run_0.log

A profile is here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

@syamajala added the S3D label Apr 5, 2024
@elliottslaughter
Contributor

Could these be future instances?

@elliottslaughter
Contributor

Or else, could it be the deferred buffers we create when the kernel launch arguments overflow the limit?

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L227
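
For reference, a rough C++ analogue of what the generated code does there (just a sketch, not the actual Terra; `arg_size` and `packed_args` are illustrative names):

```cpp
#include "legion.h"
#include <cstring>

// Sketch: stage oversized launch arguments in a zero-copy deferred
// buffer and hand the kernel a single pointer instead of N parameters.
char *spill_args(const void *packed_args, size_t arg_size) {
  Legion::DeferredBuffer<char, 1> spill(
      Legion::Rect<1>(0, static_cast<Legion::coord_t>(arg_size) - 1),
      Legion::Memory::Z_COPY_MEM);
  char *base = spill.ptr(Legion::Point<1>(0));
  std::memcpy(base, packed_args, arg_size);  // host writes; kernel reads via base
  return base;
}
```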

@lightsighter
Contributor

Could be either. Both would show up as external instances. Future instances would occur if they were buffers being returned from the application for Legion to take ownership of as the future result. Deferred buffers would look like external instances made on top of the eager allocation pool. Given the size quoted here of 3504 bytes, I'd guess the second case is the more likely one.
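
For the future case, a minimal illustration (a hypothetical task, not S3D's code; 438 doubles happens to be exactly the 3504 bytes in the log, though that match may be coincidental):

```cpp
#include "legion.h"
#include <vector>

// Sketch: a task returning a large result by value. The runtime has to
// materialize the return value as a "future instance" in host memory;
// 438 * sizeof(double) = 3504 bytes, the size quoted in the log.
struct BigResult { double data[438]; };

BigResult big_task(const Legion::Task *task,
                   const std::vector<Legion::PhysicalRegion> &regions,
                   Legion::Context ctx, Legion::Runtime *runtime) {
  BigResult r{};
  // ... fill in r ...
  return r;
}
```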

@lightsighter
Contributor

Although if they are in system memory, that means they are not being used for the GPU and might therefore be more likely to be futures.

@syamajala
Contributor Author

Is there some way we could get provenance for these instances?

@lightsighter
Contributor

Legion is logging the creator operation for each instance, so the profiler can look up the provenance string for that operation. It doesn't do that today, but it could without any changes to the logging interface. Every instance has the name of the operation that created it:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L367
and every operation has a provenance string:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L185
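
So the join is mechanical once the profiler has both records; something like this (illustrative C++ only, not the actual profiler's data structures):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch: instance id -> creator op id -> provenance string.
std::unordered_map<uint64_t, uint64_t> inst_creator;
std::unordered_map<uint64_t, std::string> op_provenance;

std::string provenance_for_instance(uint64_t inst_id) {
  auto op = inst_creator.find(inst_id);
  if (op == inst_creator.end()) return "<unknown instance>";
  auto prov = op_provenance.find(op->second);
  return (prov == op_provenance.end()) ? "<no provenance>" : prov->second;
}
```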

I'm hesitant to add provenance strings to the task postamble and deferred buffer classes. In the case of the postamble, the instances don't get made right away, so we'd have to copy the provenance string and store it on the heap for an indeterminate amount of time. We might also need to copy it around between nodes if the future data moves without ever creating an instance. The case is better for the deferred buffers, since their instances get made right away, but their interface is already a mess and I don't want to make it messier than it already is.

@elliottslaughter
Contributor

If the profiling change is sufficient, let's do that?

@lightsighter
Contributor

It should at least tell you which operation is making the instances. It won't tell you exactly which line of code is responsible, but maybe that's close enough.

Are we sure these instances are actually in the system memory and not in the zero-copy memory?
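
For what it's worth, one way to check where an instance actually landed (a sketch against Realm's C++ API, assuming a valid `RegionInstance`):

```cpp
#include "realm.h"

// Sketch: inspect the kind of the memory an instance was allocated in.
bool in_zero_copy(Realm::RegionInstance inst) {
  Realm::Memory mem = inst.get_location();
  // Z_COPY_MEM = pinned host memory mapped into the device's address
  // space; SYSTEM_MEM = plain host memory, not directly GPU-visible.
  return mem.kind() == Realm::Memory::Z_COPY_MEM;
}
```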

@elliottslaughter
Contributor

I think we confirmed via logging in the compiler that these instances are the result of spilling arguments for CUDA kernels, and there is a fairly straightforward path to splitting the kernels up so we don't need to spill so much.
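
For context, the spill exists because CUDA caps the total size of a kernel's by-value parameters (historically 4 KB). A sketch of the pattern, with a hypothetical kernel:

```cuda
// Sketch: arguments too big for CUDA's kernel-parameter limit are packed
// into a device-visible struct and passed as one pointer. Splitting the
// kernel shrinks the struct until the arguments fit by value again.
struct PackedArgs {
  double coefficients[800];  // 6400 bytes: over the classic 4 KB limit
};

__global__ void big_kernel(const PackedArgs *args, double *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 800) out[i] = args->coefficients[i];
}
```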

@lightsighter
Contributor

Ok, so they were going into the zero-copy memory then, instead of the system memory, right? That way they were visible on the host for scribbling and on the device for reading.
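
In plain CUDA terms that's pinned, mapped host memory; a sketch using the CUDA runtime API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch: zero-copy memory is pinned host memory mapped into the device
// address space: the host writes ("scribbles") through host_ptr while a
// kernel reads through dev_ptr, with no explicit cudaMemcpy in between.
void *alloc_zero_copy(size_t bytes, void **dev_ptr) {
  void *host_ptr = nullptr;
  cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(dev_ptr, host_ptr, 0);
  return host_ptr;
}
```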

@elliottslaughter
Contributor

Yes, Regent places spill arguments into zero-copy memory. I'm not sure why Seshu would have seen them in system memory; the code to put them in zero-copy is right here:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L1103

@lightsighter
Contributor

If @syamajala can confirm that he was actually seeing them in the sysmem and not the zero-copy memory, I suspect there might actually be a bug in Realm. These instances would have been eagerly allocated out of the eager pool for the zero-copy memory by getting a pointer into the zero-copy memory. To make an instance for use by the deferred buffer object, Legion would ask Realm to do an external instance creation based on the pointer. Legion asks Realm to pick the "suggested memory" for that instance to go into. My guess is that Realm is failing to recognize that the pointer is actually contained within the zero-copy memory when it does the look-up for the suggested memory based on the pointer.
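
Concretely, the lookup I suspect is going wrong looks roughly like this (a sketch against Realm's external-resource API; `ptr` and `size` stand in for the eager-pool allocation):

```cpp
#include "realm.h"
#include <cstdint>

// Sketch: describe the eager-pool pointer as an external resource and ask
// Realm which memory it belongs to. The suspicion is that this returns a
// SYSTEM_MEM memory even when ptr lies inside the zero-copy memory.
Realm::Memory suggest_for(void *ptr, size_t size) {
  Realm::ExternalMemoryResource res(
      reinterpret_cast<uintptr_t>(ptr), size);
  return res.suggested_memory();  // expected: Z_COPY_MEM; observed: SYSTEM_MEM
}
```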

@syamajala
Contributor Author

Yes, they are in system memory; you can see them in the profile I linked above: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

I guess there are a lot in zero-copy as well, but those all have provenance.

@lightsighter
Contributor

I decided I'm not actually going to ask the Realm team to fix this particular issue. It's a pretty obscure case, and it's not clear we should be aliasing Realm instances this way. Once we have instance redistricting and I can redo the memory management, we won't encounter this problem.
