I'm seeing these logs:

2741087.out:[192 - 20143958f8b0] 154.205468 {4}{runtime}: [warning 1098] LEGION WARNING: Internal runtime performance warning: equivalence set e00000000001d1b of region (1283,256,3) has 64 different users which is the same as the sampling rate of 64. Region requirement 3 of operation task_5 (UID 155584) triggered this warning. Please report this application use case to the Legion developers mailing list. (from file /g/g15/yadav2/taco/legion/legion/runtime/legion/legion_analysis.cc:11156)
Based on our conversations about this from last time, I have a pretty clear diagnosis of why this behavior is occurring. I'm trying to implement the algorithm here: https://ieeexplore.ieee.org/document/8425209, which performs a tensor contraction of the following form: A(i, l) = B(i, j, k) * C(j, l) * D(k, l). At a high level, the algorithm creates a 3-d processor grid and partitions the B tensor across the processors in the grid, one piece per processor. Next, it partitions the A, C, and D matrices into rows and places them on different axes of the processor grid. The algorithm proceeds with a 3-d index launch over the grid, where every processor in a slice of the grid along the i dimension receives the same piece of A, every j slice receives the same piece of C, and every k slice receives the same piece of D.
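To make the launch structure concrete, here is a simplified sketch of what the index launch looks like. This is not the actual taco-generated code: the task ID, the PROJ_I/PROJ_J/PROJ_K projection functor IDs, and the use of the built-in sum reduction on A are placeholders, and field IDs and functor registration are omitted.

```cpp
#include "legion.h"
using namespace Legion;

// Placeholder IDs for illustration -- the real taco-generated code uses its
// own task IDs and registers custom projection functors for the row
// partitions of A, C, and D.
enum TaskIDs { TID_CONTRACT = 5 };
enum ProjIDs { PROJ_I = 1, PROJ_J = 2, PROJ_K = 3 };

void launch_contraction(Context ctx, Runtime *runtime,
                        LogicalRegion A, LogicalPartition A_rows,
                        LogicalRegion B, LogicalPartition B_grid,
                        LogicalRegion C, LogicalPartition C_rows,
                        LogicalRegion D, LogicalPartition D_rows)
{
  // 3-d launch domain matching the processor grid (8x8x4 at 256 nodes).
  Rect<3> grid(Point<3>(0, 0, 0), Point<3>(7, 7, 3));
  IndexTaskLauncher launcher(TID_CONTRACT, grid,
                             TaskArgument(NULL, 0), ArgumentMap());

  // B is partitioned over the full 3-d grid: the identity projection (0)
  // gives each point task its own disjoint piece.
  launcher.add_region_requirement(
      RegionRequirement(B_grid, 0, READ_ONLY, EXCLUSIVE, B));

  // A, C, and D are partitioned into row blocks. PROJ_I/PROJ_J/PROJ_K stand
  // for projection functors (registered elsewhere) that select a block using
  // only one coordinate of the launch point, so every task in an i-slice
  // maps to the same piece of A, every j-slice to the same piece of C, and
  // every k-slice to the same piece of D.
  launcher.add_region_requirement(      // partial sums reduced into A
      RegionRequirement(A_rows, PROJ_I, LEGION_REDOP_SUM_FLOAT64,
                        EXCLUSIVE, A));
  launcher.add_region_requirement(
      RegionRequirement(C_rows, PROJ_J, READ_ONLY, EXCLUSIVE, C));
  launcher.add_region_requirement(
      RegionRequirement(D_rows, PROJ_K, READ_ONLY, EXCLUSIVE, D));
  // (Field IDs omitted for brevity.)

  runtime->execute_index_space(ctx, launcher);
}
```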
At 256 nodes, the processor cube I'm running on is 8x8x4. A slice in the k dimension has 64 processors in it, so if a single piece of a region is replicated among those 64 nodes then it seems like the equivalence set for that region will hit the 64 different users cap.
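Spelling out the counts (the names here just mirror the description above): the number of point tasks that share one piece of a region is the product of the other two grid extents, so only the piece shared across a k-slice reaches the sampling rate of 64 from the warning.

```cpp
// Processor grid extents at 256 nodes: 8 x 8 x 4.
const int GRID_I = 8, GRID_J = 8, GRID_K = 4;

// Point tasks sharing a single piece of each region:
const int users_per_A_piece = GRID_J * GRID_K;  // 8 * 4 = 32 (an i-slice)
const int users_per_C_piece = GRID_I * GRID_K;  // 8 * 4 = 32 (a j-slice)
const int users_per_D_piece = GRID_I * GRID_J;  // 8 * 8 = 64 (a k-slice),
                                                // equal to the sampling rate
```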
Let me know if my analysis sounds right. I'm not sure there's anything to change in my code -- it's a flat index launch with tight bounds on the subregions that each task accesses. There probably isn't an easy fix, but I wanted to report the use case as the warning asks.
That analysis looks accurate. This will be resolved in the future by choosing to use collective instances, which will also permit the runtime to replicate the equivalence set meta-data to avoid unnecessary communication.