S3D: creating hundreds of instances #1678

Open
syamajala opened this issue Apr 5, 2024 · 14 comments

@syamajala
Contributor

I can see S3D creating hundreds of small instances in system memory and I have no idea why or where they're coming from.

I tried the logging wrapper and can see things like:

[0 - 1554f8036000]    8.970248 {2}{inst}: creating new local instance: 4000000000000022
[0 - 1554f8036000]    8.970253 {1}{inst}: instance layout: inst=4000000000000022 layout=Layout(bytes=3504, align=7008, fields={0=0+0}, lists=[[<0>..<0>->affine(<3504>+0)]])
[0 - 1554f8036000]    8.970255 {1}{inst}: allocation completed: inst=4000000000000022 offset=133152
[0 - 1554f8036000]    8.970257 {2}{inst}: instance created: inst=4000000000000022 external=memory(base=154c8e020820, size=3504) ready=0
...
[0 - 1554f8053000]   15.294676 {2}{inst}: instance destroyed: inst=4000000000000022 wait_on=0
[0 - 1554f8053000]   15.294679 {1}{inst}: deallocation completed: inst=4000000000000022
[0 - 1554f8053000]   15.294771 {2}{inst}: releasing local instance: 4000000000000022

but I never see the instances get used anywhere.

The output of the logging wrapper is here: http://sapling2.stanford.edu/~seshu/s3d_tdb/instances/run_0.log

A profile is here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

@syamajala added the S3D label Apr 5, 2024
@elliottslaughter
Contributor

Could these be future instances?

@elliottslaughter
Contributor

Or else, could it be the deferred buffers we create when the kernel launch arguments overflow the limit?

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L227
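
For reference, a rough C++ analogue of what the generated code does there (just a sketch, not the actual Terra; `arg_size` and `packed_args` are illustrative names):

```cpp
#include "legion.h"
#include <cstring>

// Sketch: stage oversized launch arguments in a zero-copy deferred
// buffer and hand the kernel a single pointer instead of N parameters.
char *spill_args(const void *packed_args, size_t arg_size) {
  Legion::DeferredBuffer<char, 1> spill(
      Legion::Rect<1>(0, static_cast<Legion::coord_t>(arg_size) - 1),
      Legion::Memory::Z_COPY_MEM);
  char *base = spill.ptr(Legion::Point<1>(0));
  std::memcpy(base, packed_args, arg_size);  // host writes; kernel reads via base
  return base;
}
```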

@lightsighter
Contributor

Could be either. Both would show up as external instances. Future instances would occur if they were buffers being returned from the application for Legion to take ownership of as the future result. Deferred buffers would look like external instances made on top of the eager allocation pool. Given the size quoted here of 3504 bytes, I'd guess the second case is the more likely one.
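
For the future case, a minimal illustration (a hypothetical task, not S3D's code; 438 doubles happens to be exactly the 3504 bytes in the log, though that match may be coincidental):

```cpp
#include "legion.h"
#include <vector>

// Sketch: a task returning a large result by value. The runtime has to
// materialize the return value as a "future instance" in host memory;
// 438 * sizeof(double) = 3504 bytes, the size quoted in the log.
struct BigResult { double data[438]; };

BigResult big_task(const Legion::Task *task,
                   const std::vector<Legion::PhysicalRegion> &regions,
                   Legion::Context ctx, Legion::Runtime *runtime) {
  BigResult r{};
  // ... fill in r ...
  return r;
}
```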

@lightsighter
Contributor

Although if they are in system memory, that means they are not being used for the GPU and might therefore be more likely to be futures.

@syamajala
Contributor Author

Is there some way we could get provenance for these instances?

@lightsighter
Contributor

Legion is logging the creator operation for each instance, so the profiler can look up the provenance string for that operation. It doesn't do that today, but it could without any changes to the logging interface. Every instance has the name of the operation that created it:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L367
and every operation has a provenance string:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L185
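
So the join is mechanical once the profiler has both records; something like this (illustrative C++ only, not the actual profiler's data structures):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch: instance id -> creator op id -> provenance string.
std::unordered_map<uint64_t, uint64_t> inst_creator;
std::unordered_map<uint64_t, std::string> op_provenance;

std::string provenance_for_instance(uint64_t inst_id) {
  auto op = inst_creator.find(inst_id);
  if (op == inst_creator.end()) return "<unknown instance>";
  auto prov = op_provenance.find(op->second);
  return (prov == op_provenance.end()) ? "<no provenance>" : prov->second;
}
```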

I'm hesitant to add provenance strings to the task postamble and deferred buffer classes. In the case of the postamble, the instances don't get made right away, so we'd have to copy the provenance string and store it on the heap for an indeterminate amount of time. We might also need to copy it around between nodes if the future data moves without ever creating an instance. The case is better for the deferred buffers, since their instances get made right away, but their interface is already a mess and I don't want to make it messier than it already is.

@elliottslaughter
Contributor

If the profiling change is sufficient, let's do that?

@lightsighter
Contributor

It should at least tell you which operation is making the instances. It won't tell you exactly which line of code is responsible, but maybe that's close enough.

Are we sure these instances are actually in the system memory and not in the zero-copy memory?
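
For what it's worth, one way to check where an instance actually landed (a sketch against Realm's C++ API, assuming a valid `RegionInstance`):

```cpp
#include "realm.h"

// Sketch: inspect the kind of the memory an instance was allocated in.
bool in_zero_copy(Realm::RegionInstance inst) {
  Realm::Memory mem = inst.get_location();
  // Z_COPY_MEM = pinned host memory mapped into the device's address
  // space; SYSTEM_MEM = plain host memory, not directly GPU-visible.
  return mem.kind() == Realm::Memory::Z_COPY_MEM;
}
```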

@elliottslaughter
Contributor

I think we confirmed via logging in the compiler that these instances are the result of spilling arguments for CUDA kernels, and there is a fairly straightforward path to splitting the kernels up so we don't need to spill so much.
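
For context, the spill exists because CUDA caps the total size of a kernel's by-value parameters (historically 4 KB). A sketch of the pattern, with a hypothetical kernel:

```cuda
// Sketch: arguments too big for CUDA's kernel-parameter limit are packed
// into a device-visible struct and passed as one pointer. Splitting the
// kernel shrinks the struct until the arguments fit by value again.
struct PackedArgs {
  double coefficients[800];  // 6400 bytes: over the classic 4 KB limit
};

__global__ void big_kernel(const PackedArgs *args, double *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 800) out[i] = args->coefficients[i];
}
```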

@lightsighter
Contributor

Ok, so they were going into the zero-copy memory then, instead of the system memory, right? That way they were visible on the host for scribbling and on the device for reading.
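
In plain CUDA terms that's pinned, mapped host memory; a sketch using the CUDA runtime API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch: zero-copy memory is pinned host memory mapped into the device
// address space: the host writes ("scribbles") through host_ptr while a
// kernel reads through dev_ptr, with no explicit cudaMemcpy in between.
void *alloc_zero_copy(size_t bytes, void **dev_ptr) {
  void *host_ptr = nullptr;
  cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(dev_ptr, host_ptr, 0);
  return host_ptr;
}
```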

@elliottslaughter
Contributor

Yes, Regent places spill arguments into zero-copy memory. I'm not sure why Seshu would have seen them in system memory; the code to put them in zero-copy is right here:

https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L1103

@lightsighter
Contributor

If @syamajala can confirm that he was actually seeing them in the sysmem and not the zero-copy memory, I suspect there might actually be a bug in Realm. These instances would have been eagerly allocated out of the eager pool for the zero-copy memory by getting a pointer into the zero-copy memory. To make an instance for use by the deferred buffer object, Legion would ask Realm to do an external instance creation based on the pointer. Legion asks Realm to pick the "suggested memory" for that instance to go into. My guess is that Realm is failing to recognize that the pointer is actually contained within the zero-copy memory when it does the look-up for the suggested memory based on the pointer.
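
Concretely, the lookup I suspect is going wrong looks roughly like this (a sketch against Realm's external-resource API; `ptr` and `size` stand in for the eager-pool allocation):

```cpp
#include "realm.h"
#include <cstdint>

// Sketch: describe the eager-pool pointer as an external resource and ask
// Realm which memory it belongs to. The suspicion is that this returns a
// SYSTEM_MEM memory even when ptr lies inside the zero-copy memory.
Realm::Memory suggest_for(void *ptr, size_t size) {
  Realm::ExternalMemoryResource res(
      reinterpret_cast<uintptr_t>(ptr), size);
  return res.suggested_memory();  // expected: Z_COPY_MEM; observed: SYSTEM_MEM
}
```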

@syamajala
Contributor Author

Yes, they are in system memory; you can see them in the profile I linked above: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/

I guess there are a lot in zero-copy as well, but those all have provenance.

@lightsighter
Contributor

I decided I'm not actually going to ask the Realm team to fix this particular issue. It's a pretty obscure case, and it's not clear we should be aliasing Realm instances this way. Once we have instance redistricting and I can redo the memory management, we won't encounter this problem.
