In [18]:
using CUDA, LinearAlgebra, CUDA.CUSPARSE, CUDA.CUBLAS, SparseArrays, BenchmarkTools,  Random

### Tests

**Projection function**

In [36]:
# Orthogonal projection of p₀ onto the hyperplane {x : ⟨u, x⟩ = β}.
function proj_CPU(p₀, u, β)
    return p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
end

proj_CPU (generic function with 1 method)
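
As a quick sanity check on the formula, the projection p of p₀ onto the hyperplane {x : ⟨u, x⟩ = β} should satisfy ⟨u, p⟩ ≈ β. The sketch below repeats `proj_CPU` so that it runs on its own; the sample vectors are made up for illustration and are not part of the original notebook.

```julia
using LinearAlgebra

# Same formula as proj_CPU above, repeated so this block is self-contained.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u

p₀ = Float32[3, 1, 2]      # illustrative point
u  = Float32[1, 0, 0]      # normal vector: the hyperplane is x₁ = β
β  = 1.0f0

p = proj_CPU(p₀, u, β)     # only the first coordinate moves

@assert dot(u, p) ≈ β      # p lies on the hyperplane
```

Because the function only uses `dot` and broadcasting, the same definition accepts both `Array` and `CuArray` arguments, which is what the timing cells rely on.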

**Test variables**

In [28]:
n = Int32(2^20)
X = CUDA.rand(n)
Y = CUDA.rand(n)
x = Array(X)
y = Array(Y)
β = Float32(1.0)


1.0f0

**Timing the projection function**

In [37]:
@benchmark proj_CPU(x, y, β) 

BenchmarkTools.Trial: 3508 samples with 1 evaluation.
 Range (min … max):  1.098 ms …   4.356 ms  ┊ GC (min … max):  0.00% … 44.71%
 Time  (median):     1.120 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.411 ms ± 778.553 μs  ┊ GC (mean ± σ):  13.12% ± 16.29%

In [38]:
@benchmark proj_CPU(X, Y, β) 

BenchmarkTools.Trial: 9639 samples with 1 evaluation.
 Range (min … max):  338.800 μs … 31.368 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     396.406 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   517.279 μs ±  1.442 ms  ┊ GC (mean ± σ):  1.52% ± 4.08%

**Reflection function**

In [39]:
# Reflection of p₀ across the hyperplane {x : ⟨u, x⟩ = β}: 2·proj(p₀) - p₀.
function reflexao(p₀, u, β)
    return 2 .* proj_CPU(p₀, u, β) .- p₀
end

reflexao (generic function with 1 method)
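
The reflection can be checked against three properties it should have: the midpoint between a point and its reflection is the projection, reflecting twice gives back the original point, and the reflected point sits at the same signed distance on the other side of the hyperplane. The following is a minimal sketch with made-up sample data; both functions are repeated so that the block is self-contained.

```julia
using LinearAlgebra

# Definitions repeated from the cells above.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
reflexao(p₀, u, β) = 2 .* proj_CPU(p₀, u, β) .- p₀

p₀ = [3.0, 1.0, 2.0]       # illustrative point
u  = [1.0, 2.0, -1.0]      # illustrative normal vector
β  = 0.5

r = reflexao(p₀, u, β)

# The midpoint between a point and its reflection is the projection.
@assert (p₀ .+ r) ./ 2 ≈ proj_CPU(p₀, u, β)
# Reflecting twice returns the original point.
@assert reflexao(r, u, β) ≈ p₀
# Same residual ⟨u, x⟩ - β, with the opposite sign.
@assert dot(u, r) - β ≈ -(dot(u, p₀) - β)
```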

**Timing the reflection function**

In [40]:
@benchmark reflexao(x, y, β) 

BenchmarkTools.Trial: 1919 samples with 1 evaluation.
 Range (min … max):  1.857 ms … 5.892 ms  ┊ GC (min … max):  0.00% … 32.21%
 Time  (median):     1.919 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.589 ms ± 1.124 ms  ┊ GC (mean ± σ):  14.72% ± 18.31%

In [41]:
@benchmark reflexao(X, Y, β) 

BenchmarkTools.Trial: 6106 samples with 1 evaluation.
 Range (min … max):  417.690 μs … 21.554 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     580.573 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   816.050 μs ±  1.989 ms  ┊ GC (mean ± σ):  1.80% ± 5.28%

**Function that needs to be parallelized**

In [None]:
#using CUDAKernels, KernelAbstractions

In [42]:
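# Average of the reflections of xₖ across the n hyperplanes ⟨A[i,:], x⟩ = b[i];
# r is the length of xₖ (the number of columns of A).
function reflexao_simultanea_CPU(xₖ, A, b, n, r)
    rₖ = zeros(r)
    for i = 1:n
        rₖ .+= reflexao(xₖ, A[i, :], b[i])
    end
    return rₖ ./ n
end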

reflexao_simultanea_CPU (generic function with 1 method)
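
One way to expose the parallelism is to write the average of the reflections in matrix form. Each reflection is xₖ - 2cᵢaᵢ with cᵢ = (⟨aᵢ, xₖ⟩ - bᵢ)/‖aᵢ‖², where aᵢ is row i of A, so the average over the n rows is xₖ - (2/n)Aᵀc. The sketch below compares that form with the loop version on random CPU data. The name `reflexao_simultanea_matricial` is hypothetical and not part of the original notebook. Since it uses only a matrix-vector product, broadcasting, and a reduction, it would presumably also accept `CuArray` inputs, but that is an assumption that was not tested here.

```julia
using LinearAlgebra, Random

# Definitions repeated from the cells above.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
reflexao(p₀, u, β) = 2 .* proj_CPU(p₀, u, β) .- p₀

function reflexao_simultanea_CPU(xₖ, A, b, n, r)
    rₖ = zeros(r)
    for i = 1:n
        rₖ .+= reflexao(xₖ, A[i, :], b[i])
    end
    return rₖ ./ n
end

# Hypothetical matrix form: xₖ - (2/n) Aᵀc, with cᵢ = (⟨aᵢ, xₖ⟩ - bᵢ) / ‖aᵢ‖².
function reflexao_simultanea_matricial(xₖ, A, b)
    T = eltype(xₖ)
    n = size(A, 1)
    row_norms2 = vec(sum(abs2, A; dims = 2))   # squared norm of each row of A
    c = (A * xₖ .- b) ./ row_norms2
    return xₖ .- T(2 / n) .* (A' * c)
end

Random.seed!(73)
n, r = 200, 100
A = randn(n, r); b = randn(n); xₖ = randn(r)

@assert reflexao_simultanea_CPU(xₖ, A, b, n, r) ≈ reflexao_simultanea_matricial(xₖ, A, b)
```

The design choice is to replace n small, allocating slice operations with two large products (`A * xₖ` and `A' * c`), which is the shape of work that BLAS and cuBLAS routines handle well.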

In [None]:
a = CuArray{Float32}(1:100000)
b = CuArray{Float32}(2:2:200000)
c = similar(a)

100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [None]:

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)    # guard: 98 blocks × 1024 threads = 100352 > 100000 elements
        @inbounds c[i] = a[i] + b[i]
    end
    return
end
CUDA.@sync begin 
    @cuda threads=1024 blocks=cld(length(a),1024) vadd!(c, a, b)
end


CUDA.HostKernel{typeof(vadd!), Tuple{CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}}}(vadd!, CuContext(0x0000000003349d80, instance 4534bdbc47de504c), CuModule(Ptr{Nothing} @0x0000000006bd1700, CuContext(0x0000000003349d80, instance 4534bdbc47de504c)), CuFunction(Ptr{Nothing} @0x0000000005ec3dc0, CuModule(Ptr{Nothing} @0x0000000006bd1700, CuContext(0x0000000003349d80, instance 4534bdbc47de504c))))

In [None]:
Random.seed!(73)
CUDA.seed!(73)    # CUDA.randn draws from the GPU generator, which Random.seed! does not seed
n=200
r=100
A = CUDA.randn(n, r)
b = CUDA.randn(n)
xₖ = CUDA.randn(r)
A₀ = Matrix(A)
b₀ = Array(b)
x₀ = Array(xₖ)
rₖ = CUDA.zeros(r,n);

In [43]:
function reflexao_simultanea_GPU!(xₖ, rₖ, A, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i > size(A, 1) && return nothing       # one thread per row of A; extra threads exit
    s = zero(eltype(A)); q = zero(eltype(A))
    @inbounds for j in axes(A, 2)          # s = ⟨aᵢ, xₖ⟩, q = ⟨aᵢ, aᵢ⟩, without slicing A
        s += A[i, j] * xₖ[j]; q += A[i, j]^2
    end
    @inbounds c = 2 * (s - b[i]) / q
    @inbounds for j in axes(A, 2)          # column i of rₖ holds the reflection across row i
        rₖ[j, i] = xₖ[j] - c * A[i, j]
    end
    return nothing
end
CUDA.@sync begin
    @cuda threads=256 blocks=cld(size(A, 1), 256) reflexao_simultanea_GPU!(xₖ, rₖ, A, b)
end


In [None]:
reflexao_simultanea_GPU(x, A, b, n, r)

│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays /home/tainasilva/.julia/packages/GPUArrays/8dzSJ/src/host/indexing.jl:56


100-element CuArray{Float32, 1}:
 -0.26272383
  0.35282493
  1.1438211
 -0.39701682
 -0.08849038
 -0.6344901
  0.13485298
  0.6545772
 -0.93717253
 -0.63921946
  2.652239
 -0.6923158
 -0.36071926
  ⋮
 -1.5494885
 -0.7311042
 -0.38845024
  0.976345
  0.18359241
 -1.3031749
  0.35674506
 -1.3254977
  0.13730527
 -1.1412529
  1.057222
  0.66788656

**Timing the simultaneous reflection**

In [None]:
@benchmark reflexao_simultanea_CPU(x₀, A₀, b₀, n, r)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  46.378 μs …   1.842 ms  ┊ GC (min … max):  0.00% … 92.36%
 Time  (median):     54.813 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   65.054 μs ± 112.008 μs  ┊ GC (mean ± σ):  13.66% ±  7.64%

In [None]:
@benchmark reflexao_simultanea_GPU(x, A, b, n, r)

BenchmarkTools.Trial: 313 samples with 1 evaluation.
 Range (min … max):  14.487 ms … 51.557 ms  ┊ GC (min … max): 0.00% … 20.82%
 Time  (median):     14.617 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.009 ms ±  6.739 ms  ┊ GC (mean ± σ):  2.50% ±  4.05%

### Commands

CUDA.reclaim() - releases cached GPU memory that is no longer in use

CUDA.memory_status() - reports how much GPU memory is in use and how much is available

using Cthulhu - this package helps to understand errors in GPU functions

@device_code_warntype interactive=true @cuda proj_GPU(X,Y, β, n)

rₖ = CuArray{Float32}(undef, 1_000) - allocates uninitialized memory for a vector of length 1_000 on the GPU (Vector{Float32}(undef, 1_000) would allocate it on the CPU)

CuArray{Int}(undef, 2) - creates an uninitialized column vector with 2 entries

CuArray{Int}(undef, (1,2)) - creates an uninitialized 1×2 matrix (a row) with 2 entries

fill!(rₖ, 0.) - sets every entry of the already allocated rₖ to zero

CUDA.@sync - blocks the CPU until the GPU work inside the block has finished

@btime f(args...) - BenchmarkTools macro that times the call and prints the minimum time and the allocations

@cuprintln("thread $index, block $stride") - prints from inside a kernel

synchronize() - waits for the GPU to finish; needed for the output of @cuprint() to appear

broadcast - applies an operation elementwise to arguments of different dimensions, for example adding a vector to each column of a matrix; with strings, broadcasting * concatenates

map(f, c) -> collection - transforms the collection c by applying f to each element. With several collections as arguments, f is applied element by element. Example: applying a function to the entries of a vector, or combining vectors of the same length.
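
The difference between `broadcast` and `map` is easier to see on small inputs. The values below are illustrative only and run on the CPU.

```julia
M = [1 2; 3 4]
v = [10, 20]

# broadcast expands the vector across the columns of the matrix.
@assert broadcast(+, M, v) == [11 12; 23 24]
@assert M .+ v == broadcast(+, M, v)          # dot syntax is the same operation

# For strings, * is concatenation, so broadcasting * concatenates elementwise.
@assert broadcast(*, "a", ["x", "y"]) == ["ax", "ay"]

# map applies f elementwise over collections of matching length.
@assert map(x -> x^2, [1, 2, 3]) == [1, 4, 9]
@assert map(+, [1, 2], [3, 4]) == [4, 6]
```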

a = reshape(Vector(1:16), (4,4)) - creates the vector from 1 to 16 and reshapes it into a 4×4 matrix, filled column by column

reduce(max, a, dims=2) - returns the maximum of each row, as a 4×1 matrix (a column)

reduce(max, a, dims=1) - returns the maximum of each column, as a 1×4 matrix (a row)

reduce(*, [2; 3; 4]) - returns the product of the entries of the vector (24)

reduce(*, [2; 3; 4]; init=-1) - returns the same product started from init=-1, giving -24; init is normally the neutral element of the operation (1 for multiplication)
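
The `reshape` and `reduce` calls above can be checked directly on the CPU. The assertions are a small sketch of what each call returns for this particular matrix.

```julia
a = reshape(Vector(1:16), (4, 4))   # filled column by column

@assert a[:, 1] == [1, 2, 3, 4]                                        # first column
@assert reduce(max, a, dims = 2) == reshape([13, 14, 15, 16], 4, 1)    # row maxima, 4×1
@assert reduce(max, a, dims = 1) == [4 8 12 16]                        # column maxima, 1×4
@assert reduce(*, [2; 3; 4]) == 24
@assert reduce(*, [2; 3; 4]; init = -1) == -24                         # init takes part in the product
```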

tamanho = cld(length(a), 1024) - number of blocks, where 1024 is the number of threads per block
@cuda threads=1024 blocks=tamanho função(a) - splits the work among the threads
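
The split between threads and blocks is plain integer arithmetic and can be worked out on the CPU. The sketch below uses the sizes of the `vadd!` example (100000 elements, 1024 threads per block). `global_index` is a hypothetical helper that mirrors the index formula used inside the kernels.

```julia
# Launch geometry for a 1-D kernel, computed on the CPU (no GPU needed).
len     = 100_000               # number of elements, as in the vadd! example
threads = 1024                  # threads per block
blocks  = cld(len, threads)     # ceiling division, so every element is covered

launched = threads * blocks     # total number of threads started
excess   = launched - len       # threads with no element to process

# Global 1-based index of a thread, mirroring
# threadIdx().x + (blockIdx().x - 1) * blockDim().x inside a kernel.
global_index(thread, block, blockdim) = thread + (block - 1) * blockdim

@assert blocks == 98
@assert excess == 352
@assert global_index(1024, 98, threads) > len   # the last launched thread is out of range
```

Because the ceiling division rounds up, the last block usually contains threads whose index is past the end of the array, which is why kernels typically test the index against the array length before reading or writing.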

a = CuArray([1,2]) - array on the GPU

b = Array(a) - copy of a on the CPU

copyto!(b, a) - copies the contents of a (on the GPU) into the already allocated b (on the CPU)

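**SparseArrays in CUDA**

A1 = sprand(10,10,0.2) - creates a 10×10 sparse matrix with density 0.2 (about 80% zeros); the nonzero values are uniform in [0, 1), use sprandn for normally distributed values

x1 = sprand(10,0.2) - creates a sparse vector of length 10 with density 0.2 (about 80% zeros); the nonzero values are uniform in [0, 1)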
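
A short CPU sketch of the two constructors, plus `sprandn` for the normally distributed case. The seed and the name `B1` are arbitrary choices for illustration.

```julia
using SparseArrays, Random

Random.seed!(0)
A1 = sprand(10, 10, 0.2)    # 10×10, each entry is stored with probability 0.2
x1 = sprand(10, 0.2)        # sparse vector of length 10
B1 = sprandn(10, 10, 0.2)   # same density, standard normal values

@assert all(0 .<= nonzeros(A1) .< 1)   # sprand draws its values from [0, 1)
@assert A1 isa SparseMatrixCSC && x1 isa SparseVector

# With CUDA.CUSPARSE loaded, CuSparseMatrixCSR(A1) copies the matrix to the GPU
# (left as a comment so that this block runs without a GPU).
```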