
Store the array length next to its dimensions. #1303

Merged (1 commit) on Jan 3, 2022

Conversation

maleadt
Member

@maleadt maleadt commented Jan 3, 2022

Alternative to JuliaGPU/GPUArrays.jl#385, should fix #1301.

cc @GiggleLiu
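
For context, the idea in the PR title can be sketched as follows (a hypothetical illustration; the actual field layout of CUDA.jl's device array type may differ):

```julia
# Illustrative sketch only: cache prod(dims) at construction time so that
# device code reads the length with a single load instead of reducing
# over all N dimensions on every call.
struct DeviceArraySketch{T,N}
    ptr::Ptr{T}
    dims::NTuple{N,Int}
    len::Int                 # precomputed prod(dims), stored next to dims
end

# The length is computed once, on the host, at construction time.
DeviceArraySketch(ptr::Ptr{T}, dims::NTuple{N,Int}) where {T,N} =
    DeviceArraySketch{T,N}(ptr, dims, prod(dims))

# length is now O(1): one field load, no per-call multiplication chain.
Base.length(A::DeviceArraySketch) = A.len
```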

@maleadt maleadt added cuda kernels Stuff about writing CUDA kernels. performance How fast can we go? labels Jan 3, 2022
@maleadt
Member Author

maleadt commented Jan 3, 2022

Local performance testing didn't reveal any regressions.

@codecov

codecov bot commented Jan 3, 2022

Codecov Report

Merging #1303 (fe8f5cf) into master (bf26270) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1303   +/-   ##
=======================================
  Coverage   78.94%   78.94%           
=======================================
  Files         119      119           
  Lines        8650     8650           
=======================================
  Hits         6829     6829           
  Misses       1821     1821           

Last update 94ba745...fe8f5cf.

@maleadt maleadt merged commit 57de639 into master Jan 3, 2022
@maleadt maleadt deleted the tb/array_len branch January 3, 2022 20:05
@GiggleLiu
Contributor

GiggleLiu commented Jan 3, 2022

Nice, I'm eager to run the benchmark again. FYI, here is a real-world tensor network contraction benchmark: https://github.com/TensorBFS/TensorNetworkBenchmarks . We still have a 5x gap with PyTorch on tensor network contraction.

@GiggleLiu
Contributor

GiggleLiu commented Jan 5, 2022

Hi, I just benchmarked and confirmed that this PR fixes the @linearidx problem, thank you. But I am still confused: why is prod(::Dims) so slow? It is just a product over the dimensions, which amounts to a few multiplications. Are there bad design patterns that users should avoid in CUDA programming?

Now, I printed the @device_code_llvm output for the permutedims kernel in this file: https://github.com/TensorBFS/TensorNetworkBenchmarks/blob/debug-permutedims.jl/scripts/NOTE.md
Can you see anything obviously strange that makes it so much slower than the one in PyTorch?

@maleadt
Member Author

maleadt commented Jan 5, 2022

But I am still confused why prod(::Dims) is so slow?

It requires loading 28 integers (for ndims=28, as in your example) to compute the length before the kernel can proceed, and the kernel itself only performs a single permutation. So the overhead is not unexpected.
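
To make that cost concrete, here is a host-side illustration (the shape is hypothetical):

```julia
# For an ndims=28 array, prod over the dimensions tuple has to read all
# 28 integers and chain 27 multiplications; before this PR, kernels using
# @linearidx repeated that work on every invocation, whereas the cached
# length introduced here is a single load.
dims = ntuple(_ -> 2, 28)    # hypothetical 28-dimensional shape
len = prod(dims)             # 27 multiplies over 28 loaded values
```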

I don't have the time right now to do a performance analysis, but I'd recommend using NSight Compute to compare both kernels (you can conveniently 'add a baseline' so that you can compare them side by side). Register pressure and the resulting occupancy, in particular, can significantly affect performance. See https://github.com/JuliaComputing/Training/blob/master/AdvancedGPU/2-2-kernel_analysis_optimization.ipynb for an example. From a quick glance at that code: you could probably get rid of most of those exception branches using assume (though that shouldn't affect performance much), but also everything is Int64, so the PyTorch code may be able to pack twice as many threads by using 32-bit indices.
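
The 32-bit indexing suggestion can be sketched like this (hedged: a minimal toy kernel, not PyTorch's or CUDA.jl's actual permutedims implementation, and whether the Int32 arithmetic is preserved depends on the index types of your CUDA.jl version):

```julia
using CUDA

# Toy copy kernel doing its index arithmetic in Int32 to reduce register
# pressure; assumes the array length fits in the Int32 range.
function copy32_kernel!(dst, src)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(dst) % Int32   # % Int32: cheap conversion of the bound
        @inbounds dst[i] = src[i]
    end
    return nothing
end

# Hypothetical launch:
# @cuda threads=256 blocks=cld(length(dst), 256) copy32_kernel!(dst, src)
```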

@GiggleLiu
Contributor

GiggleLiu commented Jan 5, 2022

Thank you very much for showing me the link; this repo is a treasure!


Successfully merging this pull request may close these issues.

Unreasonably slow copy kernel