I ran the following code on a Grace Hopper system with CUDA-enabled Reactant:
using Reactant
x = Reactant.to_rarray(randn(Float32, 100, 2))
W = Reactant.to_rarray(randn(Float32, 10, 100))
b = Reactant.to_rarray(randn(Float32, 10))
linear(x, W, b) = (W * x) .+ b
Reactant.with_profiler("./"; create_perfetto_link=true) do
    mylinear = Reactant.@compile linear(x, W, b)
    [mylinear(x, W, b) for i in 1:1111]
end
However, the profile shows frequent host-device communication:

Here is the profile trace.