@YonsiG
Profiling shows that the kernel time for createTripletsInGPUv2 decreased by about 25%. createQuintupletsInGPUv2 also sped up by ~25%, and createSegmentsInGPUv2 by ~3%.

Adding the high-edge and low-edge phi does not seem to help according to sdl_timing. However, the profiler reports show that it decreases the LS kernel time by ~0.01 ms, the T3 kernel time by ~0.1 ms, and the T5 kernel time by ~0.15 ms, at the cost of increasing the MD kernel time by 0.1 ms. At least it is going in the right direction.
@YonsiG I am unclear on the timeline for this PR. What else needs to be done, and when do you foresee it being ready for review? Or is it ready already?
Hi Manos, this PR is ready for review. I have finished adding the two variables saved in the MD for later use: HighEdgePhi and LowEdgePhi.

@YonsiG As I am starting to review this, could you post some screenshots of the profiler results, so that people get a general picture of what you improved and know where to look for more in the profiler? For example, it would be interesting to compare how the stalls change in the lines you modified.
I can paste a few profiler reports here for quick reference. This is a comparison of the T5 kernel in master and after the change. Green is the master baseline and blue is the current code after the change. The "stall wait" and "stall no instructions" contributions are reduced substantially, while the third-largest, "stall long scoreboard", increased slightly.
Thanks for the profiler screenshots, they are useful. Could you edit the comment that shows specific lines so that it explains the color code and the different numbers?
mdsInGPU.anchorHighEdgePhi[idx] = atan2f(mdsInGPU.anchorHighEdgeY[idx], mdsInGPU.anchorHighEdgeX[idx]);
mdsInGPU.anchorLowEdgePhi[idx] = atan2f(mdsInGPU.anchorLowEdgeY[idx], mdsInGPU.anchorLowEdgeX[idx]);
Should we be applying the phi function here now?
I think we can, using phi(x,y)





The dPhi calculation is very time-consuming (it accounts for the largest warp stalls in the Triplet kernel). We already store the phi values in mdsInGPU, so we can skip part of the dPhi computation when creating the T3 and use the stored variable directly.
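For readers who want to see the idea in isolation, here is a minimal host-side C++ sketch, not the actual SDL code. The `MiniDoublets` struct, `cacheAnchorPhi`, `wrapPhi`, `deltaPhiFromXY`, and `deltaPhiFromCache` are hypothetical names introduced only for illustration; the real mdsInGPU container has many more fields, and the real kernels would use `atan2f` on the device. It contrasts recomputing phi from x and y for every pair with reading a phi that was cached once per mini-doublet.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for the mini-doublet container.
struct MiniDoublets {
    std::vector<float> anchorX;
    std::vector<float> anchorY;
    std::vector<float> anchorPhi;  // cached once, when the MD is created
};

// Wrap an angle difference into (-pi, pi].
inline float wrapPhi(float dphi) {
    const float kPi = 3.14159265f;
    while (dphi > kPi) dphi -= 2.0f * kPi;
    while (dphi <= -kPi) dphi += 2.0f * kPi;
    return dphi;
}

// Pay the atan2 cost once per mini-doublet instead of once per pair.
inline void cacheAnchorPhi(MiniDoublets& mds) {
    mds.anchorPhi.resize(mds.anchorX.size());
    for (std::size_t i = 0; i < mds.anchorX.size(); ++i) {
        mds.anchorPhi[i] = std::atan2(mds.anchorY[i], mds.anchorX[i]);
    }
}

// Baseline: two atan2 calls every time a pair is examined.
inline float deltaPhiFromXY(const MiniDoublets& mds, std::size_t i, std::size_t j) {
    const float phiI = std::atan2(mds.anchorY[i], mds.anchorX[i]);
    const float phiJ = std::atan2(mds.anchorY[j], mds.anchorX[j]);
    return wrapPhi(phiJ - phiI);
}

// After the change: two loads and a subtraction, no trigonometry.
inline float deltaPhiFromCache(const MiniDoublets& mds, std::size_t i, std::size_t j) {
    return wrapPhi(mds.anchorPhi[j] - mds.anchorPhi[i]);
}
```

Under these assumptions both functions return the same value for any pair, which is the property the validation plots are meant to confirm. The trade-off is one extra stored float per mini-doublet and slightly more work when the MD is built.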


After this change, the T3 kernel time decreases. The decrease is visible in both single-stream and multi-stream running.
As a validation check, using the stored phi instead of computing it from x and y gives the same physics performance as before.
master: http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/master_again_again_PU200_NEVT-1_b498de-PU200/compare/TC_eff_base_0.html
after this PR: http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/T3T5_removedPhi_PU200_NEVT-1_61a75dD-PU200/compare/TC_eff_base_0.html
The master timing is:
The timing after using anchorPhi in the Triplet kernels:
Furthermore, I applied this change to the Quintuplet.cu kernels. The timing after using anchorPhi in the Quintuplet kernels:

The Segments kernels do not use deltaPhi much, so optimizing it there does not give an obvious performance gain. Timing after the change:
