S3D: creating hundreds of instances #1678
Could these be future instances?
Or else, could it be the deferred buffers we create when the kernel launch arguments overflow the limit? https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L227
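For context, the overflow the comment refers to comes from CUDA's limit on the total size of kernel launch parameters (historically 4 KiB): when the packed argument struct exceeds the limit, the arguments are spilled into a separate host-visible buffer and the kernel receives only a pointer. A minimal sketch of that decision, with illustrative names (`kKernelParamLimit`, `plan_launch`) that are not Regent's actual helper code:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed CUDA launch-parameter limit, in bytes (4 KiB on older toolkits).
constexpr std::size_t kKernelParamLimit = 4096;

struct LaunchPlan {
  bool spilled;                    // true if args went to a side buffer
  std::vector<unsigned char> buf;  // the spill buffer (in the real system,
                                   // a deferred buffer in zero-copy memory)
};

// Decide whether the argument block fits in the launch parameters or must
// be packed into a spill buffer the kernel will read through a pointer.
LaunchPlan plan_launch(const void* args, std::size_t arg_bytes) {
  LaunchPlan plan{false, {}};
  if (arg_bytes > kKernelParamLimit) {
    plan.spilled = true;
    plan.buf.resize(arg_bytes);
    std::memcpy(plan.buf.data(), args, arg_bytes);
  }
  return plan;
}
```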
Could be either. Both would show up as external instances. Future instances would occur if they were buffers being returned from the application for Legion to take ownership of as the future result. Deferred buffers would look like external instances made on top of the eager allocation pool. Given the size quoted here of 3504 bytes, I'd guess the second case is more likely.
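A minimal sketch of what "an external instance made on top of the eager allocation pool" could look like. `EagerPool` and `ExternalInstance` are hypothetical names for illustration, not Legion's or Realm's actual API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bump allocator standing in for the eager allocation pool.
struct EagerPool {
  std::vector<unsigned char> storage;
  std::size_t offset = 0;
  explicit EagerPool(std::size_t bytes) : storage(bytes) {}
  // Hand out a pointer into the pool, or nullptr when exhausted.
  void* allocate(std::size_t bytes) {
    if (offset + bytes > storage.size()) return nullptr;
    void* p = storage.data() + offset;
    offset += bytes;
    return p;
  }
};

// An "external" instance is a view over memory that was not allocated
// through the normal instance-creation path: a raw pointer plus a size.
struct ExternalInstance {
  void* base;
  std::size_t bytes;
};

// A deferred buffer's backing instance: carve bytes out of the pool and
// wrap the resulting pointer as an external instance.
ExternalInstance make_deferred_buffer(EagerPool& pool, std::size_t bytes) {
  return ExternalInstance{pool.allocate(bytes), bytes};
}
```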
Although if they are in system memory, that means they are not being used for the GPU and might therefore be more likely to be futures.
Is there some way we could get provenance for these instances?
Legion is logging the creator operation for each instance, so the profiler can look up the provenance string for that operation. It doesn't do that today, but it could without any changes to the logging interface: every instance already records the name of the operation that created it. I'm hesitant to add provenance strings to the task postamble and deferred buffer classes. In the case of the postamble, the instances don't get made right away, so we'd have to copy the provenance string and store it on the heap for an indeterminate amount of time. We might also need to copy it around between nodes if the future data moves without ever creating an instance. The case is better for deferred buffers, since their instances get made right away, but their interface is already a mess and I don't want to make it any messier.
If the profiling change is sufficient, let's do that? |
It should at least tell you which operation is making the instances. It won't tell you exactly which line of code is responsible, but maybe that's close enough. Are we sure these instances are actually in the system memory and not in the zero-copy memory?
I think we confirmed via logging in the compiler that these instances are the result of spilling arguments for CUDA kernels, and there is a fairly straightforward path to splitting the kernels up so we don't need to spill so much. |
Ok, so they were going into the zero-copy memory then instead of the system memory, right? That way they were visible on the host for scribbling and on the device for reading.
Yes, Regent places spill arguments into zero-copy memory. I'm not sure why Seshu would have seen them in system memory; the code to put them in zero-copy is right here: https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/gpu/helper.t#L1103
If @syamajala can confirm that he was actually seeing them in the sysmem and not the zero-copy memory, I suspect there might actually be a bug in Realm. These instances would have been eagerly allocated out of the eager pool for the zero-copy memory by getting a pointer into the zero-copy memory. To make an instance for use by the deferred buffer object, Legion would ask Realm to do an external instance creation based on the pointer. Legion asks Realm to pick the "suggested memory" for that instance to go into. My guess is that Realm is failing to recognize that the pointer is actually contained within the zero-copy memory when it does the look-up for the suggested memory based on the pointer. |
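The suspected bug amounts to a pointer-to-memory-range lookup that misses. A minimal sketch of that lookup, assuming illustrative names (`MemoryRange`, `suggest_memory`) rather than Realm's actual interface; a lookup whose registered ranges fail to cover the zero-copy pool would fall back to system memory, which would match what the profile shows:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of a memory registered with the runtime.
struct MemoryRange {
  std::uintptr_t base;
  std::size_t size;
  std::string name;  // e.g. "zero-copy" or "sysmem"
};

// Return the name of the memory whose address range contains ptr.
// If no registered range matches, fall back to "sysmem" -- the failure
// mode suspected here: the zero-copy range is not recognized, so the
// external instance gets suggested into system memory.
std::string suggest_memory(const std::vector<MemoryRange>& mems,
                           std::uintptr_t ptr) {
  for (const auto& m : mems)
    if (ptr >= m.base && ptr < m.base + m.size) return m.name;
  return "sysmem";
}
```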
Yes, they are in system memory; you can see them in the profile I linked above: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/ I guess there are a lot in zero-copy as well, but those all have provenance.
I decided I'm not actually going to ask the Realm team to fix this particular issue. It's a pretty obscure case and it's not clear we should be aliasing Realm instances this way. Once we have instance redistricting and I can redo the memory management then we won't be encountering this problem. |
I can see S3D creating hundreds of small instances in system memory and I have no idea why or where they're coming from.
I tried the logging wrapper and can see things like:
but I never see the instances get used anywhere.
The output of the logging wrapper is here: http://sapling2.stanford.edu/~seshu/s3d_tdb/instances/run_0.log
A profile is here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_tdb/instances/legion_prof/