
Store the array length next to its dimensions. #1303

Merged (1 commit) on Jan 3, 2022

Conversation

maleadt
Member

@maleadt maleadt commented Jan 3, 2022

Alternative to JuliaGPU/GPUArrays.jl#385, should fix #1301.

cc @GiggleLiu
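
For context, the idea in the PR title can be sketched as follows (a hypothetical illustration; the actual field layout of CUDA.jl's device array type may differ):

```julia
# Illustrative sketch only: cache prod(dims) at construction time so that
# device code reads the length with a single load instead of reducing
# over all N dimensions on every call.
struct DeviceArraySketch{T,N}
    ptr::Ptr{T}
    dims::NTuple{N,Int}
    len::Int                 # precomputed prod(dims), stored next to dims
end

# The length is computed once, on the host, at construction time.
DeviceArraySketch(ptr::Ptr{T}, dims::NTuple{N,Int}) where {T,N} =
    DeviceArraySketch{T,N}(ptr, dims, prod(dims))

# length is now O(1): one field load, no per-call multiplication chain.
Base.length(A::DeviceArraySketch) = A.len
```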

@maleadt maleadt added cuda kernels Stuff about writing CUDA kernels. performance How fast can we go? labels Jan 3, 2022
@maleadt
Member Author

maleadt commented Jan 3, 2022

Local performance testing didn't reveal any regressions.

@codecov

codecov bot commented Jan 3, 2022

Codecov Report

Merging #1303 (fe8f5cf) into master (bf26270) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1303   +/-   ##
=======================================
  Coverage   78.94%   78.94%           
=======================================
  Files         119      119           
  Lines        8650     8650           
=======================================
  Hits         6829     6829           
  Misses       1821     1821           

Last update 94ba745...fe8f5cf.

@maleadt maleadt merged commit 57de639 into master Jan 3, 2022
@maleadt maleadt deleted the tb/array_len branch January 3, 2022 20:05
@GiggleLiu
Contributor

GiggleLiu commented Jan 3, 2022

Nice, I'm eager to run the benchmark again. FYI, here is a real-world tensor network contraction benchmark: https://github.com/TensorBFS/TensorNetworkBenchmarks . We still have a 5x gap with PyTorch on tensor network contraction.

@GiggleLiu
Contributor

GiggleLiu commented Jan 5, 2022

Hi, I just benchmarked and confirmed that this PR fixes the @linearidx problem, thank you. But I am still confused: why is prod(::Dims) so slow? It is just a product over the dimensions, which amounts to a few multiplications. Are there bad design patterns that users should avoid in CUDA programming?

Now, I printed the @device_code_llvm output for the permutedims kernel in this file: https://github.com/TensorBFS/TensorNetworkBenchmarks/blob/debug-permutedims.jl/scripts/NOTE.md
Can you see anything obviously strange that makes it so much slower than the one in PyTorch?

@maleadt
Member Author

maleadt commented Jan 5, 2022

But I am still confused why prod(::Dims) is so slow?

It requires loading 28 integers (for ndims=28, as in your example) to compute the length before the kernel can proceed, and the kernel itself only performs a single permutation. So the overhead is not unexpected.
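
To make that cost concrete, here is a host-side illustration (the shape is hypothetical):

```julia
# For an ndims=28 array, prod over the dimensions tuple has to read all
# 28 integers and chain 27 multiplications; before this PR, kernels using
# @linearidx repeated that work on every invocation, whereas the cached
# length introduced here is a single load.
dims = ntuple(_ -> 2, 28)    # hypothetical 28-dimensional shape
len = prod(dims)             # 27 multiplies over 28 loaded values
```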

I don't have the time right now to do a performance analysis, but I'd recommend using NSight Compute to compare both kernels (you can conveniently 'add a baseline' so that you can compare them side by side). Register pressure and the resulting occupancy, in particular, can significantly affect performance. See https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb for an example. From a quick glance at that code: you could probably get rid of most of those exception branches using assume (though that shouldn't affect performance much), but also everything is Int64, so the PyTorch code may be able to pack twice as many threads by using 32-bit indices.
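
The 32-bit indexing suggestion can be sketched like this (hedged: a minimal toy kernel, not PyTorch's or CUDA.jl's actual permutedims implementation, and whether the Int32 arithmetic is preserved depends on the index types of your CUDA.jl version):

```julia
using CUDA

# Toy copy kernel doing its index arithmetic in Int32 to reduce register
# pressure; assumes the array length fits in the Int32 range.
function copy32_kernel!(dst, src)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(dst) % Int32   # % Int32: cheap conversion of the bound
        @inbounds dst[i] = src[i]
    end
    return nothing
end

# Hypothetical launch:
# @cuda threads=256 blocks=cld(length(dst), 256) copy32_kernel!(dst, src)
```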

@GiggleLiu
Contributor

GiggleLiu commented Jan 5, 2022

Thank you very much for showing me the link; this repo is a treasure!


Successfully merging this pull request may close these issues.

Unreasonably slow copy kernel