Use lazy broadcasting for `Statistics.var` #443

mcabbott · 2023-01-04T04:20:23Z

I realised that `var(x; dims)` allocates as much as `copy(x)`, but this is easily fixed by using `mapreduce` and lazy broadcasting.
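For readers unfamiliar with the pattern, here is a minimal CPU-only sketch of the idea, not the code from this PR. The name `lazy_var` is hypothetical, and the sketch assumes a precomputed mean `μ` whose shape matches the reduced dimensions. It builds an unmaterialised `Broadcasted` object and reduces over it, so the only array allocated is the reduced-size result. An explicit `init` is passed because that is the form I am confident works for `Broadcasted` arguments on plain `Array`s.

```julia
using Statistics

# Hypothetical helper for illustration only (not the PR's implementation).
# Computes the variance of `x` along `dims`, given the mean `μ` along the same dims.
function lazy_var(x::AbstractArray{<:Real}, μ::AbstractArray{<:Real};
                  dims, corrected::Bool=true)
    T = float(eltype(x))
    n = prod(size(x, d) for d in dims)      # number of elements per reduced slice
    λ = convert(T, inv(n - corrected))      # 1/(n-1) if corrected, else 1/n
    # Lazy broadcast: describes λ * |a - m|^2 elementwise without allocating it.
    bc = Broadcast.instantiate(
        Broadcast.broadcasted((a, m) -> λ * abs2(a - m), x, μ))
    # Reduce over the lazy object; only the reduced-size output is allocated.
    return mapreduce(identity, +, bc; dims=dims, init=zero(T))
end

x = rand(Float32, 50, 40)
μ1 = mean(x; dims=1)
v1 = lazy_var(x, μ1; dims=1)   # 1×40, should agree with var(x; dims=1)
```

An eager version would instead compute something like `abs2.(x .- μ)` first, which materialises a temporary the same size as `x` before reducing it.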

Before, 3.822 MiB of GPU memory allocated per `var` call:

julia> let x = CUDA.rand(1000, 1000)
           @btime CUDA.@sync copy($x)  # just as a baseline
           CUDA.@time copy(x) 
           println()

           μ1 = mean(x; dims=1)
           μ2 = mean(x; dims=2)
           @btime CUDA.@sync var($x, dims=1, mean=$μ1)
           @btime CUDA.@sync var($x, dims=2, mean=$μ2)
           CUDA.@time var(x, dims=1, mean=μ1)
           CUDA.@time var(x, dims=2, mean=μ2)
       end;
  29.824 μs (13 allocations: 400 bytes)
  0.014674 seconds (403 CPU allocations: 18.891 KiB) (1 GPU allocation: 3.815 MiB, 0.10% memmgmt time)

  71.751 μs (136 allocations: 6.11 KiB)
  76.459 μs (136 allocations: 6.11 KiB)
  0.000113 seconds (145 CPU allocations: 6.531 KiB) (3 GPU allocations: 3.822 MiB, 12.62% memmgmt time)
  0.000102 seconds (145 CPU allocations: 6.531 KiB) (3 GPU allocations: 3.822 MiB, 14.61% memmgmt time)

After, 3.906 KiB of GPU memory allocated per `var` call:

  29.953 μs (13 allocations: 400 bytes)
  0.000046 seconds (16 CPU allocations: 608 bytes) (1 GPU allocation: 3.815 MiB, 12.81% memmgmt time)

  65.186 μs (85 allocations: 4.72 KiB)
  69.830 μs (85 allocations: 4.72 KiB)
  0.007978 seconds (124 CPU allocations: 6.797 KiB) (1 GPU allocation: 3.906 KiB, 0.16% memmgmt time)
  0.000120 seconds (94 CPU allocations: 5.141 KiB) (1 GPU allocation: 3.906 KiB, 6.69% memmgmt time)
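
The reported sizes can be checked by hand. The following is only arithmetic on the `Float32` sizes involved; the reading that the three "before" allocations are one full-size temporary plus two reduced-size arrays is an inference that happens to match the numbers, not something the timings state directly.

```julia
# Float32 elements are 4 bytes each.
full    = sizeof(Float32) * 1000 * 1000   # bytes in a 1000×1000 array
reduced = sizeof(Float32) * 1000          # bytes in a 1×1000 or 1000×1 array

full_MiB    = full / 2^20                   # 3.814697265625, the `copy(x)` baseline
before_MiB  = (full + 2 * reduced) / 2^20   # about 3.8223, matching the 3 allocations before
after_KiB   = reduced / 2^10                # 3.90625, matching the single allocation after
```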

mcabbott · 2023-01-04T04:38:18Z

I'm less sure this one is a good idea, but the second commit changes `mean` to allocate one array instead of two.

Before:

julia> let x = CUDA.ones(1000, 10_000) * pi
         CUDA.@time mean(x; dims=1)
         CUDA.@time mean(sqrt, x; dims=1)
       end
  0.000336 seconds (101 CPU allocations: 4.625 KiB) (2 GPU allocations: 78.125 KiB, 2.61% memmgmt time)
  0.000342 seconds (101 CPU allocations: 4.625 KiB) (2 GPU allocations: 78.125 KiB, 3.62% memmgmt time)
1×10000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.77245  1.77245  1.77245  1.77245  1.77245  …  1.77245  1.77245  1.77245  1.77245  1.77245

After:

  0.000374 seconds (89 CPU allocations: 3.531 KiB) (1 GPU allocation: 39.062 KiB, 1.53% memmgmt time)
  0.000351 seconds (55 CPU allocations: 2.453 KiB) (1 GPU allocation: 39.062 KiB, 1.94% memmgmt time)
1×10000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.77245  1.77245  1.77245  1.77245  1.77245  …  1.77245  1.77245  1.77245  1.77245  1.77245
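
One way to go from two allocations to one is to fold the `1/n` scaling into the reduction instead of dividing the summed result afterwards. The sketch below illustrates that idea on CPU arrays; it is an assumption about the general approach, not the commit's code, and `scaled_mean` is a hypothetical name. In the timings above, 78.125 KiB is exactly two 1×10000 `Float32` arrays and 39.062 KiB is one.

```julia
using Statistics

# Hypothetical helper for illustration only (not the commit's implementation).
# Mean of `f.(x)` along `dims`, scaling each element by 1/n inside the reduction
# so that no separate "sum, then divide" result array is needed.
function scaled_mean(f, x::AbstractArray; dims)
    n = prod(size(x, d) for d in dims)          # elements per reduced slice
    λ = convert(float(eltype(x)), inv(n))       # 1/n in the array's float type
    return sum(a -> f(a) * λ, x; dims=dims)
end
scaled_mean(x::AbstractArray; dims) = scaled_mean(identity, x; dims=dims)

y = rand(Float32, 100, 20)
m1 = scaled_mean(y; dims=1)          # 1×20
m2 = scaled_mean(sqrt, y; dims=1)    # 1×20
```

Scaling every element before summing can round differently from summing first and dividing once, so results may differ from `sum(x; dims) ./ n` in the last few bits.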

maleadt · 2023-01-18T12:36:57Z

Thanks (sorry for the delay); both seem like a good idea.

mcabbott added 2 commits January 3, 2023 23:16

use lazy broadcasting for var

e457a00

allocate once for mean(x; dims)

077f26a

maleadt enabled auto-merge January 18, 2023 12:37

maleadt disabled auto-merge January 18, 2023 12:37

maleadt merged commit 829f433 into JuliaGPU:master Jan 18, 2023

mcabbott deleted the variance branch January 18, 2023 13:20

mcabbott mentioned this pull request Feb 14, 2023

Fix rounding problems in _mean function #453

Open
