In [2]:
using CUDA, LinearAlgebra, CUDA.CUSPARSE, CUDA.CUBLAS, SparseArrays, BenchmarkTools, Random

### Tests

**Projection function**

In [3]:
function proj_CPU(p₀, u, β)
    # Orthogonal projection of p₀ onto the hyperplane {x : dot(u, x) = β}
    return p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
end

proj_CPU (generic function with 1 method)
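As a quick sanity check, the projected point should satisfy dot(u, p) = β, and projecting it a second time should change nothing. The sketch below restates the function so it runs on its own; the vectors `p0`, `u` and the value `beta` are small hypothetical test data, not values from this notebook.

```julia
using LinearAlgebra

# Restated from the cell above so this block runs on its own.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u

# Hypothetical test data (not from the notebook).
p0   = [3.0, -1.0, 2.0]
u    = [1.0, 2.0, 2.0]
beta = 1.0

p = proj_CPU(p0, u, beta)

# The projection lies on the hyperplane {x : dot(u, x) = β} ...
on_plane = isapprox(dot(u, p), beta; atol = 1e-12)
# ... and projecting a second time leaves the point unchanged.
idempotent = isapprox(proj_CPU(p, u, beta), p; atol = 1e-12)
```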

**Test variables**

In [4]:
n = Int32(2^20)
X = CUDA.rand(n)
Y = CUDA.rand(n)
x = Array(X)
y = Array(Y)
β = Float32(1.0)


1.0f0

**Timing the projection function**

In [5]:
@benchmark proj_CPU(x, y, β) 

BenchmarkTools.Trial: 3614 samples with 1 evaluation.
 Range (min … max):  1.146 ms …   3.282 ms  ┊ GC (min … max): 0.00% … 28.85%
 Time  (median):     1.167 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.368 ms ± 519.872 μs  ┊ GC (mean ± σ):  7.23% ± 12.05%

In [6]:
@benchmark proj_CPU(X,Y, β) 

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  332.031 μs …   8.935 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     392.916 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   428.073 μs ± 479.508 μs  ┊ GC (mean ± σ):  0.78% ± 3.05%

**Reflection function**

In [7]:
function reflexao(p₀, u, β)
    # Reflection of p₀ across the hyperplane {x : dot(u, x) = β}
    return 2 .* proj_CPU(p₀, u, β) .- p₀
end

reflexao (generic function with 1 method)
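Two properties make a convenient check of the reflection: applying it twice should return the starting point, and the signed residual dot(u, p) - β should flip sign. The sketch below restates both functions so it is self-contained; `p0`, `u` and `beta` are hypothetical test values, not data from this notebook.

```julia
using LinearAlgebra

# Restated from the cells above so this block runs on its own.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
reflexao(p₀, u, β) = 2 .* proj_CPU(p₀, u, β) .- p₀

# Hypothetical test data (not from the notebook).
p0   = [3.0, -1.0, 2.0]
u    = [1.0, 2.0, 2.0]
beta = 1.0

q = reflexao(p0, u, beta)

# Reflecting twice gives back the original point.
involution = isapprox(reflexao(q, u, beta), p0; atol = 1e-12)
# The residual dot(u, ·) - β changes sign under the reflection.
mirrored = isapprox(dot(u, q) - beta, -(dot(u, p0) - beta); atol = 1e-12)
```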

**Timing the reflection function**

In [8]:
@benchmark reflexao(x, y, β) 

BenchmarkTools.Trial: 1939 samples with 1 evaluation.
 Range (min … max):  1.956 ms …   5.400 ms  ┊ GC (min … max):  0.00% … 25.33%
 Time  (median):     2.001 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.559 ms ± 927.816 μs  ┊ GC (mean ± σ):  10.80% ± 15.13%

In [9]:
@benchmark reflexao(X,Y, β) 

BenchmarkTools.Trial: 7816 samples with 1 evaluation.
 Range (min … max):  366.621 μs …   9.220 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     573.630 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   637.126 μs ± 672.409 μs  ┊ GC (mean ± σ):  0.96% ± 3.78%

**Test variables**

In [37]:
Random.seed!(73)
n = 200
r = 100
A = CUDA.randn(n, r)
b = CUDA.randn(n)
x = CUDA.randn(r)
A₀ = Matrix(A)
b₀ = Array(b)
x₀ = Array(x)


100-element Vector{Float32}:
 -0.617854
  0.88156813
  0.64799774
 -1.159574
 -0.51960874
  0.17684749
 -0.49456695
 -1.3229952
  0.5434727
 -0.054395057
 -0.6752088
 -0.10279365
  0.8141797
  ⋮
 -0.51613015
  0.15553147
  0.75520635
 -1.7799246
  0.61415666
 -1.0246894
  1.5173224
 -0.5294779
  2.0692415
 -1.5487236
  0.63029295
  0.67391145

**Function that needs to be parallelized**

In [11]:
function reflexao_simultanea_CPU(xₖ, A, b, n, r)
    rₖ = zeros(r)
    for i=1:n
        rₖ .+= reflexao(xₖ, A[i,:], b[i])
    end
    return rₖ./n
end

reflexao_simultanea_CPU (generic function with 1 method)

In [50]:
function reflexao_simultanea_GPU(xₖ, A, b, n, r)
    # Start from zeros: `undef` memory holds arbitrary values that would leak into the sum
    rₖ = CUDA.zeros(Float32, r)
    for i=1:n
        rₖ .+= reflexao(xₖ, A[i,:], b[i])
    end
    return rₖ./n
end

reflexao_simultanea_GPU (generic function with 1 method)

In [52]:
reflexao_simultanea_GPU(x, A, b, n, r)

100-element CuArray{Float32, 1}:
 -0.58813906
  0.8818405
  0.63639057
 -1.1584798
 -0.5148546
  0.15514593
 -0.4607878
 -1.2969103
  0.5415919
 -0.04806526
 -0.68358284
 -0.099011555
  0.80679953
  ⋮
 -0.50324214
  0.14415102
  0.75824493
 -1.7563723
  0.6345063
 -0.99202055
  1.5000345
 -0.54730326
  2.0267031
 -1.5332632
  0.61127895
  0.6765529
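Both versions above call reflexao once per row of A, so the GPU variant issues a sequence of small operations (plus one scalar read of b[i]) on every iteration. One possible way to remove the loop is to expand the average algebraically: with wᵢ = (dot(aᵢ, xₖ) - bᵢ) / dot(aᵢ, aᵢ), the mean of the n reflections equals xₖ - (2/n) Aᵀw. The sketch below checks this on the CPU against the loop version. The name `reflexao_simultanea_vec` is hypothetical, and the test data are generated here rather than taken from the notebook. Because it uses only a matrix-vector product, a reduction and broadcasts, the same function would plausibly accept CuArray inputs, but that is an assumption that has not been run here.

```julia
using LinearAlgebra, Random

# Restated from the notebook so this block runs on its own.
proj_CPU(p₀, u, β) = p₀ .- ((dot(u, p₀) - β) / dot(u, u)) .* u
reflexao(p₀, u, β) = 2 .* proj_CPU(p₀, u, β) .- p₀

function reflexao_simultanea_CPU(xₖ, A, b, n, r)
    rₖ = zeros(r)
    for i = 1:n
        rₖ .+= reflexao(xₖ, A[i, :], b[i])
    end
    return rₖ ./ n
end

# Hypothetical loop-free variant (name and signature are assumptions).
function reflexao_simultanea_vec(xₖ, A, b)
    n = size(A, 1)
    # wᵢ = (dot(aᵢ, xₖ) - bᵢ) / dot(aᵢ, aᵢ) for every row aᵢ of A
    w = (A * xₖ .- b) ./ vec(sum(abs2, A; dims = 2))
    # mean of the n reflections: xₖ - (2/n) * Aᵀw
    return xₖ .- (2 / n) .* (A' * w)
end

# CPU test data with the same shapes as the notebook (values differ).
Random.seed!(73)
n, r = 200, 100
A = randn(n, r)
b = randn(n)
x = randn(r)

agree = isapprox(reflexao_simultanea_vec(x, A, b),
                 reflexao_simultanea_CPU(x, A, b, n, r); rtol = 1e-8)
```

If the two results agree, the vectorized form replaces n separate reflections by two passes over A, which is usually the shape of computation a GPU handles well.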

**Timing the simultaneous reflection**

In [38]:
@benchmark reflexao_simultanea_CPU(x₀, A₀, b₀, n, r)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  46.378 μs …   1.842 ms  ┊ GC (min … max):  0.00% … 92.36%
 Time  (median):     54.813 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   65.054 μs ± 112.008 μs  ┊ GC (mean ± σ):  13.66% ±  7.64%

In [51]:
@benchmark reflexao_simultanea_GPU(x, A, b, n, r)

BenchmarkTools.Trial: 313 samples with 1 evaluation.
 Range (min … max):  14.487 ms … 51.557 ms  ┊ GC (min … max): 0.00% … 20.82%
 Time  (median):     14.617 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.009 ms ±  6.739 ms  ┊ GC (mean ± σ):  2.50% ±  4.05%

### Commands

CUDA.reclaim() - frees cached GPU memory

CUDA.memory_status() - reports how much GPU memory is in use and how much is free

using Cthulhu - this package helps to understand errors in GPU functions

@device_code_warntype interactive=true @cuda proj_GPU(X,Y, β, n)

rₖ = Vector{Float32}(undef, 1_000) - allocates uninitialized memory for a vector of length 1_000 on the CPU; CuArray{Float32}(undef, 1_000) does the same on the GPU

CuArray{Int}(undef, 2) - creates an uninitialized column vector with 2 entries

CuArray{Int}(undef, (1,2)) - creates an uninitialized 1x2 row array with 2 entries

fill!(rₖ, 0.) - sets every entry of the allocated vector to zero
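The allocation notes above can be tried without a GPU, since `Array{T}(undef, dims)` and `CuArray{T}(undef, dims)` share the same constructor pattern. The sketch below uses plain CPU arrays as stand-ins; the variable names `col` and `row` are illustrative only.

```julia
# CPU arrays stand in for CuArray here so the block runs without a GPU.
col = Array{Int}(undef, 2)        # 2-element column vector, contents arbitrary
row = Array{Int}(undef, (1, 2))   # 1×2 matrix (a "row"), contents arbitrary

rₖ = Vector{Float32}(undef, 1_000)  # uninitialized memory, values are whatever was there
fill!(rₖ, 0.0)                      # overwrite every entry with zero (converted to Float32)

shapes = (size(col), size(row))
```

Reading from an `undef` array before writing to it gives unpredictable values, which is why an accumulator should be zeroed (with fill! or zeros) before the first `.+=`.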

CUDA.@sync - blocks the CPU until the GPU work inside the block has finished

@btime function call - measures the run time; it is part of BenchmarkTools, like @benchmark

@cuprintln("thread $index, block $stride") - prints from inside a GPU kernel

synchronize() - synchronizes the GPU; needed after using @cuprint() so that the output appears

broadcast - performs elementwise operations on operands whose dimensions differ, for example adding a vector to each column of a matrix; with strings, broadcasting * concatenates

map(f, c) -> collection - transforms the collection c by applying f to each element. With several collection arguments, f is applied elementwise. Ex.: applies a function to the entries of a vector, or combines vectors of the same size.
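A small CPU example of the broadcast and map notes above. The arrays `M` and `v` are made-up values chosen only to make the shapes easy to follow.

```julia
M = [1 2; 3 4; 5 6]      # 3×2 matrix
v = [10, 20, 30]         # length-3 vector

added = M .+ v               # v is added to each column of M
same  = broadcast(+, M, v)   # the dot syntax is shorthand for broadcast

joined = ["a", "b"] .* "x"   # * on strings is concatenation, applied elementwise

doubled = map(x -> 2x, v)            # one collection: f applied to each entry
summed  = map(+, [1, 2], [10, 20])   # two collections: f applied pairwise
```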

a = reshape(Vector(1:16), (4,4)) - creates the vector from 1 to 16, then reshapes it into a 4x4 matrix filled column by column

reduce(max, a, dims=2) - returns the maximum of each row, as a 4x1 column matrix

reduce(max, a, dims=1) - returns the maximum of each column, as a 1x4 row matrix

reduce(*, [2; 3; 4]) - returns the product of the vector's entries, starting from the neutral element of multiplication, 1

reduce(*, [2; 3; 4]; init=-1) - returns the product of the vector's entries with -1 as the initial value instead of the neutral element, which gives -24
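The reduce notes above can be checked directly; the sketch below only reruns those calls and stores the results under illustrative names.

```julia
a = reshape(Vector(1:16), (4, 4))   # filled column by column: first column is 1, 2, 3, 4

row_max = reduce(max, a, dims = 2)  # 4×1 matrix: maximum of each row
col_max = reduce(max, a, dims = 1)  # 1×4 matrix: maximum of each column

p1 = reduce(*, [2; 3; 4])             # plain product of the entries
p2 = reduce(*, [2; 3; 4]; init = -1)  # product with -1 as the starting value
```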

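tamanho = cld(length(a), 1024) - number of blocks, where 1024 is the number of threads per block (cld rounds up and returns an integer, unlike /)
@cuda threads=1024 blocks=tamanho função(a) - splits the work across the threads (função stands for the kernel)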
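The thread and block counts are ordinary integer arithmetic, so they can be worked out on the CPU before any kernel is launched. The helper `launch_config` below is hypothetical (it is not part of CUDA.jl or of this notebook), and 1024 threads per block is taken from the note above; the actual limit depends on the device.

```julia
# Hypothetical helper, not from the notebook: pick a 1-D launch configuration.
function launch_config(len::Integer; max_threads::Integer = 1024)
    threads = min(len, max_threads)   # never request more threads than elements
    blocks  = cld(len, threads)       # ceiling division keeps the last partial block
    return threads, blocks
end

threads, blocks = launch_config(1500)

# With CUDA.jl the launch would then look like this (not executed here,
# `kernel` is a placeholder name):
#   @cuda threads=threads blocks=blocks kernel(a)
# Inside the kernel, the usual guard skips indices past the end of `a`:
#   i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#   i <= length(a) || return
```

When length(a) is not a multiple of the thread count, the last block has idle threads, which is why the bounds guard in the kernel matters.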

a = CuArray([1,2]) - array on the GPU

b = Array(a) - copy of the array on the CPU

copyto!(b, a) - copies the contents of a into the existing array b, without allocating a new array
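The semantics of copyto! are the same for CPU and GPU arrays: the destination must already exist, and only its contents change. A minimal CPU-only sketch, where a plain Array stands in for the CuArray so that no GPU is needed:

```julia
a = [1, 2]               # stands in for the GPU array
b = zeros(Int, 2)        # pre-allocated destination

same_b = copyto!(b, a)   # fills b with the contents of a and returns b itself

b[1] = 99                # b is an independent copy, so a is unaffected
```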

**SparseArrays in CUDA**

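A1 = sprand(10,10,0.2) - creates a sparse 10x10 matrix with density 0.2 (sparsity 0.8); the nonzero entries are uniform on [0, 1), use sprandn for normally distributed values

x1 = sprand(10,0.2) - creates a sparse vector of length 10 with density 0.2 (sparsity 0.8) and uniformly distributed nonzero entries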
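The sketch below builds the same sparse objects on the CPU and inspects them. The third argument of sprand is the probability that an entry is nonzero, so the realized density only approximates 0.2 for a matrix this small. Moving the result to the GPU is left as a comment because it needs a device; CUDA.CUSPARSE offers constructors such as CuSparseMatrixCSR for that, but that step has not been run here.

```julia
using SparseArrays, Random

Random.seed!(73)
A1 = sprand(10, 10, 0.2)    # each entry is nonzero with probability 0.2
x1 = sprand(10, 0.2)        # sparse vector of length 10

density = nnz(A1) / length(A1)                          # realized fraction of stored entries
in_unit_interval = all(v -> 0 <= v < 1, nonzeros(A1))   # sprand draws uniform values

N1 = sprandn(10, 10, 0.2)   # same pattern rule, normally distributed nonzeros

# On a machine with a GPU (assumption, not executed here):
#   using CUDA, CUDA.CUSPARSE
#   dA1 = CuSparseMatrixCSR(A1)
```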