Store the array length next to its dimensions. #1303
Conversation
Local performance testing didn't reveal any regressions.
Codecov Report

```
@@           Coverage Diff           @@
##           master    #1303   +/-   ##
=======================================
  Coverage   78.94%   78.94%
=======================================
  Files         119      119
  Lines        8650     8650
=======================================
  Hits         6829     6829
  Misses       1821     1821
=======================================
```

Continue to review the full report at Codecov.
Nice, eager to run the benchmark again. FYI, here is a real-world tensor network contraction benchmark: https://github.com/TensorBFS/TensorNetworkBenchmarks. We still have a 5x gap with PyTorch on tensor network contraction.
Hi, I just benchmarked and confirmed this PR fixed the problem of […]. Now, I printed the […]
It requires loading 28 integers (for ndims=28, as in your example) to compute the length before it can proceed with a kernel that only performs a single permutation, so the overhead is not unexpected. I don't have time for a performance analysis right now, but I'd recommend using NSight Compute to compare both kernels (you can conveniently 'add a baseline' so that you can compare them side by side). Register pressure in particular, and the occupancy that results from it, can significantly affect performance. See https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb for an example. From a quick glance at that code: you could probably get rid of most of those exception branches using […]
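To make the trade-off above concrete, here is a minimal C sketch (not the PR's actual Julia code; the struct and function names are hypothetical). Without a cached length, every query must load all `ndims` extents and multiply them; storing the length next to the dimensions turns that into a single load.

```c
#include <stddef.h>

#define MAXDIMS 28  /* matches the ndims=28 case mentioned above */

/* Hypothetical device-array descriptor. The `len` field is the idea of
   this PR: the element count is stored next to the dimensions instead
   of being recomputed from them. */
typedef struct {
    size_t dims[MAXDIMS];  /* extent of each dimension */
    int    ndims;          /* number of dimensions actually in use */
    size_t len;            /* cached product of dims[0..ndims-1] */
} array_desc;

/* Without the cache: ndims loads plus ndims multiplies per query --
   for ndims=28 that is 28 memory loads before the kernel can proceed. */
size_t length_recomputed(const array_desc *a) {
    size_t n = 1;
    for (int i = 0; i < a->ndims; i++)
        n *= a->dims[i];
    return n;
}

/* With the cache: a single load. */
size_t length_cached(const array_desc *a) {
    return a->len;
}
```

For a kernel that does very little work per element (like a single permutation), the difference between one load and 28 loads per length query is plausibly visible in a profile, which is what the Nsight Compute comparison would confirm.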
Thank you very much for showing me the link; this repo is a treasure!
This is an alternative to JuliaGPU/GPUArrays.jl#385 and should fix #1301.
cc @GiggleLiu