Julia 1.6.2 was able to optimize out the memory allocation in the following code, but version 1.7.0-beta3.0 is not:
using StaticArrays

function cpu_kernel(a, i)
    SIZE = 15
    v = ones(MVector{SIZE, UInt32})
    for x in v
        a[i] += x
    end
end

a = zeros(UInt32, 1)
cpu_kernel(a, 1)        # first call triggers compilation
@time cpu_kernel(a, 1)  # second call shows the allocation
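For anyone trying to reproduce this, `@allocated` may be a more direct check than reading the `@time` output. The following is only a sketch: `measure_allocations` is a hypothetical helper (not part of the original report) that wraps the call in a function so the non-constant global `a` does not affect the measurement. No expected value is shown because the result depends on the Julia version.

```julia
using StaticArrays

function cpu_kernel(a, i)
    SIZE = 15
    v = ones(MVector{SIZE, UInt32})
    for x in v
        a[i] += x
    end
end

# Hypothetical helper: returns the bytes allocated by one kernel call.
measure_allocations(a) = @allocated cpu_kernel(a, 1)

a = zeros(UInt32, 1)
measure_allocations(a)          # warm-up call, so compilation is not measured
bytes = measure_allocations(a)  # bytes allocated by the second call
println("allocated bytes: ", bytes)
```

A result of zero would indicate that the `MVector` was optimized away; a nonzero result would correspond to the regression described above.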
This was originally noticed here: JuliaGPU/CUDA.jl#38 (comment)
(Maybe related to #41512?)