Skip to content
This repository was archived by the owner on Dec 9, 2024. It is now read-only.

avoid the dPhi calculation in createTriplet#296

Merged
VourMa merged 6 commits intomasterfrom
fix-dPhi
Jul 8, 2023
Merged

avoid the dPhi calculation in createTriplet#296
VourMa merged 6 commits intomasterfrom
fix-dPhi

Conversation

@YonsiG
Copy link
Copy Markdown
Contributor

@YonsiG YonsiG commented Jun 22, 2023

The dPhi calculation is very time consuming from the compiler. (Which is the largest warp stalls of the Triplet kernel). We store already the dPhi information in the mdsInGPU, so we can avoid some of the dPhi calculation in creating the T3 and use the variable directly.
After this change, the timing of T3 kernel is decreased. From single stream to multi stream, the timing decrease is visible.
As a validation check, using Phi instead of computed from XY has the same physics performance as before.
master: http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/master_again_again_PU200_NEVT-1_b498de-PU200/compare/TC_eff_base_0.html
after this PR: http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/T3T5_removedPhi_PU200_NEVT-1_61a75dD-PU200/compare/TC_eff_base_0.html
The master timing is
Screen Shot 2023-06-22 at 7 00 53 PM
The timing after using anchorPhi in Triplet kernels
Screen Shot 2023-06-22 at 6 57 15 PM

Furthermore, apply this change to the Quintuplet.cu kernels. The timing after using anchorPhi in Quintuplet kernels:
Screen Shot 2023-06-22 at 9 03 58 PM

The Segments kernels do not have many usage of the deltaPhi, optimizing it does not give obvious performance gains. Timing after change:
Screen Shot 2023-06-28 at 6 10 36 PM

@slava77
Copy link
Copy Markdown
Contributor

slava77 commented Jun 22, 2023

@YonsiG
please prepare/post the profiler reports

Comment thread SDL/Hit.cuh Outdated
@YonsiG
Copy link
Copy Markdown
Contributor Author

YonsiG commented Jun 22, 2023

@YonsiG
please prepare/post the profiler reports

Seems like from profiling results, kernel time for createTripletsInGPUv2 is decreased by 25%. createQuintupletsInGPUv2 also speed up ~25%, createSegmentsInGPUv2 speeds up ~3%
profiler results in cgpu
profiler results in simplify dPhi:
http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/profiling_fixdPhiLST3T5.ncu-rep
profiler results in master:
http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/profiling_master_cgpu2.ncu-rep

Comment thread SDL/Quintuplet.cu Outdated
@YonsiG
Copy link
Copy Markdown
Contributor Author

YonsiG commented Jun 28, 2023

Adding the highedge and lowedge phi does not seem help from sdl_timing
Screen Shot 2023-06-28 at 7 32 33 PM

But if we read the profiler reports, it is decreasing the LS kernel time by ~0.01ms, T3 for ~0.1ms, T5 for ~0.15ms, at the cost of increase MD kernel time by 0.1ms. Going at the correct direction at least.
http://uaf-10.t2.ucsd.edu/~yagu/SDL_GPU_plots/fix_dphi/profiling_precomputedPhi.ncu-rep

@VourMa
Copy link
Copy Markdown
Contributor

VourMa commented Jun 30, 2023

@YonsiG I am unclear on the timeline for this PR. What else needs to be done and when do you foresee it to be ready for review? Or is it ready already?

@YonsiG
Copy link
Copy Markdown
Contributor Author

YonsiG commented Jun 30, 2023

@YonsiG I am unclear on the timeline for this PR. What else needs to be done and when do you foresee it to be ready for review? Or is it ready already?

Hi Manos, this PR is ready to review, and I have finished adding 2 variables saved in MD for usage: HighEdgePhi and LowEdgePhi

@VourMa
Copy link
Copy Markdown
Contributor

VourMa commented Jun 30, 2023

@YonsiG As I am starting to review this, could you post some screenshots from the profiler results, so that people have a general picture of what you improved and can know where to look for more in the profiler? For example, it would be interesting to compare how the stalls change in the lines you modified.

Comment thread SDL/Hit.cuh Outdated
Comment thread SDL/Quintuplet.cu Outdated
@YonsiG
Copy link
Copy Markdown
Contributor Author

YonsiG commented Jul 4, 2023

@YonsiG As I am starting to review this, could you post some screenshots from the profiler results, so that people have a general picture of what you improved and can know where to look for more in the profiler? For example, it would be interesting to compare how the stalls change in the lines you modified.

I can paste a few profiler reports here for quick reference. This is a comparison of the T5 in master and createT5 after change. The green is the baseline master while the blue is current after change. The "stalls wait" and "stalls no instructions" have been reduced a lot, while the third longest "stall long scoreboard" increased a bit.
Screen Shot 2023-07-04 at 9 19 53 PM

The overall memory throughput is increased
Screen Shot 2023-07-04 at 9 22 38 PM

@YonsiG
Copy link
Copy Markdown
Contributor Author

YonsiG commented Jul 4, 2023

This is the atan line in Hit.cuh, it has warp stats and instructions executed in create T5 kernels. It's a bit hard to read from the lines, since those number doesn't precisely match the warp usage in each line and they move around up and down. But still can see that the numbers generally decreased after changing
In master
Screen Shot 2023-07-07 at 5 28 39 PM

after the change
Screen Shot 2023-07-07 at 5 29 24 PM

@VourMa
Copy link
Copy Markdown
Contributor

VourMa commented Jul 4, 2023

Thanks for the profiler screenshots, they are useful. Could you modify the comment with the one where you show specific lines, so that you explain the color code and the different numbers?

Copy link
Copy Markdown
Contributor

@VourMa VourMa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the updates, @YonsiG! Merging this...

@VourMa VourMa merged commit 6e6e2d0 into master Jul 8, 2023
Comment thread SDL/Hit.cuh
Comment thread SDL/MiniDoublet.cu
Comment on lines +201 to +202
mdsInGPU.anchorHighEdgePhi[idx] = atan2f(mdsInGPU.anchorHighEdgeY[idx], mdsInGPU.anchorHighEdgeX[idx]);
mdsInGPU.anchorLowEdgePhi[idx] = atan2f(mdsInGPU.anchorLowEdgeY[idx], mdsInGPU.anchorLowEdgeX[idx]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be applying the phi function here now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can, using phi(x,y)

@YonsiG YonsiG mentioned this pull request Jul 27, 2023
@ariostas ariostas deleted the fix-dPhi branch May 8, 2024 21:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants