
Support for LinearAlgebra.pinv #883

@dovfurman

Describe the bug

CUDA.pinv does not work on a CuArray (stored in GPU memory). It works fine on a Matrix (stored in CPU memory).

To reproduce

a=CUDA.rand(Float32,(4,4))
CUDA.pinv(a)

Actual results:

ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}} which 
is not isbits.
    .1 is of type Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}} which is not isbits.
      .x is of type Vector{Float32} which is not isbits.


Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob, entry::LLVM.Function)
    @ GPUCompiler C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\validation.jl:66
  [2] macro expansion
    @ C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\driver.jl:301 [inlined]
  [3] macro expansion
    @ C:\Users\User\.julia\packages\TimerOutputs\4QAIk\src\TimerOutput.jl:206 [inlined]
  [4] macro expansion
    @ C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\driver.jl:300 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA C:\Users\User\.julia\packages\CUDA\k52QH\src\compiler\execution.jl:301
  [7] check_cache
    @ C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\cache.jl:47 [inlined]
  [8] cached_compilation
    @ C:\Users\User\.julia\packages\GPUArrays\0ShDd\src\host\broadcast.jl:57 [inlined]
  [9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#16", Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\User\.julia\packages\GPUCompiler\8sSXl\src\cache.jl:0
 [10] cufunction(f::GPUArrays.var"#broadcast_kernel#16", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{Vector{Float32}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\User\.julia\packages\CUDA\k52QH\src\compiler\execution.jl:289
 [11] cufunction
    @ C:\Users\User\.julia\packages\CUDA\k52QH\src\compiler\execution.jl:283 [inlined]
 [12] macro expansion
    @ C:\Users\User\.julia\packages\CUDA\k52QH\src\compiler\execution.jl:102 [inlined]
 [13] #launch_heuristic#309
    @ C:\Users\User\.julia\packages\CUDA\k52QH\src\gpuarrays.jl:17 [inlined]
 [14] launch_heuristic
    @ C:\Users\User\.julia\packages\CUDA\k52QH\src\gpuarrays.jl:17 [inlined]
 [15] copyto!
    @ C:\Users\User\.julia\packages\GPUArrays\0ShDd\src\host\broadcast.jl:63 [inlined]
 [16] copyto!
    @ .\broadcast.jl:936 [inlined]
 [17] materialize!
    @ .\broadcast.jl:894 [inlined]
 [18] materialize!
    @ .\broadcast.jl:891 [inlined]
 [19] lmul!(D::LinearAlgebra.Diagonal{Float32, Vector{Float32}}, B::CuArray{Float32, 2})
    @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\diagonal.jl:212
 [20] *
    @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\diagonal.jl:275 [inlined]
 [21] pinv(A::CuArray{Float32, 2}; atol::Float64, rtol::Float32)
    @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\dense.jl:1395
 [22] pinv(A::CuArray{Float32, 2})
    @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\dense.jl:1367
 [23] top-level scope
    @ none:1
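
The failing frame is lmul!(D::Diagonal{Float32, Vector{Float32}}, B::CuArray{Float32, 2}): the generic LinearAlgebra.pinv builds a Diagonal backed by a plain CPU Vector{Float32} and multiplies it into the GPU matrix, so the broadcast captures a non-isbits CPU array that cannot be compiled into a kernel. A minimal reproduction of just that step (an illustrative sketch, not taken from the issue) should hit the same error:

using CUDA, LinearAlgebra

B = CUDA.rand(Float32, 4, 4)
D = Diagonal(rand(Float32, 4))  # Diagonal backed by a CPU Vector{Float32}
lmul!(D, B)                     # presumably raises the same "non-bitstype argument" KernelError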

CUDA version:
[052768ef] CUDA v3.1.0

Expected behavior
CUDA.pinv should operate on a CuArray and return a CuArray.
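
One way to provide this would be to compute the pseudoinverse from the CUSOLVER-backed SVD so every intermediate stays on the device. The sketch below is hypothetical (gpu_pinv is not an existing function, and the tolerance handling is simplified relative to LinearAlgebra.pinv); it assumes svd(::CuMatrix) is available through CUSOLVER:

using CUDA, LinearAlgebra

function gpu_pinv(A::CuMatrix{T}; rtol = eps(real(T)) * minimum(size(A))) where {T}
    F = svd(A)                                        # CUSOLVER-backed SVD; U, S, Vt all remain on the GPU
    tol = rtol * maximum(F.S)
    Sinv = map(s -> s > tol ? inv(s) : zero(s), F.S)  # invert only the significant singular values
    return (Sinv .* F.Vt)' * F.U'                     # equals V * Diagonal(Sinv) * U'
end

Because Sinv is a CuVector, the scaling broadcast and the final CUBLAS multiplications never mix host and device memory, avoiding the kernel error above.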

Version info

Details on Julia:

# please post the output of:
versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\User\AppData\Local\atom\app-1.56.0\atom.exe"  -a
  JULIA_NUM_THREADS = 8

Details on CUDA:

# please post the output of:
CUDA.versioninfo()
CUDA toolkit 11.2.2, artifact installation
CUDA driver 11.3.0
NVIDIA driver 465.89.0

Libraries:
- CUBLAS: 11.4.1
- CURAND: 10.2.3
- CUFFT: 10.4.1
- CUSOLVER: 11.1.0
- CUSPARSE: 11.4.1
- CUPTI: 14.0.0
- NVML: 11.0.0+465.89
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA Quadro P2200 (sm_61, 4.272 GiB / 5.000 GiB available)

Additional context

Currently, CUDA.pinv does work on matrices stored in CPU memory and returns the result as a CPU array. Assuming the computation should be performed on the GPU, this does not make sense: transfers between CPU and GPU memory are very inefficient and should be left to the developer's discretion.
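
For reference, the round-trip that does work today (and that this issue argues should not be forced on the user) looks roughly like this:

a = CUDA.rand(Float32, (4, 4))
p = CuArray(pinv(Array(a)))  # copy to the host, compute there, copy the result back to the GPU

Both copies are exactly the overhead described above.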
