
Error in build step: Inconsistency detected by ld.so #68

Closed · Cvikli opened this issue Oct 21, 2020 · 23 comments
Labels
bug (Something isn't working), build, upstream

Comments

Cvikli commented Oct 21, 2020

Dear @jpsamaroo,
What is the best source for setting up this library? I tried to follow the basic setup described in the docs, but it is a little confusing for me. I don't know whether I missed something, but I have been trying hard for an hour already without success. 😞

May I ask for a cleaner list of install instructions?

I would be glad to use it, it looks damn promising!

jpsamaroo (Member)

I'm happy to help you figure this out, but I need you to post what you tried to do, where it failed, and the errors and stacktrace you got. Otherwise I don't know what doesn't work.

Cvikli changed the title from "I just can't get is work" to "I just can't get it work" on Oct 21, 2020
Cvikli (Author) commented Oct 21, 2020

I went through this documentation again:
https://juliagpu.gitlab.io/AMDGPU.jl/

The code I tried to run:

using AMDGPU

@show AMDGPU.agents()

N=32
@time a = rand(Float64, N)

@time a_d = AMDGPU.ROCArray(a)

The results:

┌ Warning: HSA runtime has not been built, runtime functionality will be unavailable.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/lrlUy/src/AMDGPU.jl:152
┌ Warning: ROCm-Device-Libs have not been downloaded, device intrinsics will be unavailable.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/lrlUy/src/AMDGPU.jl:160
AMDGPU.agents() = HSAAgent[]
  0.059273 seconds (84.48 k allocations: 4.414 MiB)
ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty at ./Base.jl:33 [inlined]
 [2] getindex at ./refvalue.jl:32 [inlined]
 [3] get_default_agent at /home/hm/.julia/packages/AMDGPU/lrlUy/src/agent.jl:109 [inlined]
 [4] #alloc#2 at /home/hm/.julia/packages/AMDGPU/lrlUy/src/memory.jl:223 [inlined]
 [5] alloc at /home/hm/.julia/packages/AMDGPU/lrlUy/src/memory.jl:223 [inlined]
 [6] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:93 [inlined]
 [7] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:107 [inlined]
 [8] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:134 [inlined]
 [9] ROCArray(::Array{Float64,1}) at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:140
 [10] top-level scope at ./timing.jl:174 [inlined]
 [11] top-level scope at /home/hm/repo/amd/tests/test_AMD.jl:0
 [12] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088
 [13] include_string(::Module, ::String, ::String) at ./loading.jl:1096
 [14] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./essentials.jl:710
 [15] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N) at ./essentials.jl:709
 [16] inlineeval(::Module, ::String, ::Int64, ::Int64, ::String; softscope::Bool) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:132
 [17] (::VSCodeServer.var"#50#53"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:93
 [18] withpath(::VSCodeServer.var"#50#53"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams}, ::String) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:119
 [19] (::VSCodeServer.var"#49#52"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:91
 [20] hideprompt(::VSCodeServer.var"#49#52"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams}) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:36
 [21] (::VSCodeServer.var"#48#51"{VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:71
 [22] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [23] invokelatest(::Any) at ./essentials.jl:709
 [24] macro expansion at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:27 [inlined]
 [25] (::VSCodeServer.var"#46#47")() at ./task.jl:356
in expression starting at /home/hm/repo/amd/tests/test_AMD.jl:8

That is why I tried:
(@v1.5) pkg> build AMDGPU

 Building AMDGPU → `~/.julia/packages/AMDGPU/lrlUy/deps/build.log`
┌ Error: Error building `AMDGPU`: 
│ Inconsistency detected by ld.so: dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

Which I guess is somehow related to LLVM? I tried changing the ld.lld package, but I just don't understand what is going on, so I don't know whether that was necessary.
I updated these packages:
apt-get install clang-format clang-tidy clang-tools clang clangd libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python-clang

I also changed the permissions of /dev/kfd and tried setting LD_LIBRARY_PATH, without success.

jpsamaroo (Member)

Ahh yes I recall you showing me this error. This is almost definitely not an issue with Julia (as far as I can tell), but an issue with one of your ROCm libraries. You can try adding some @info statements into your deps/build.jl file to see where it happens. It's probably occurring somewhere in this region:

AMDGPU.jl/deps/build.jl

Lines 138 to 180 in f681252

config[:libhsaruntime_path] = find_hsa_library("libhsa-runtime64", roc_dirs)
if config[:libhsaruntime_path] == nothing
    build_error("Could not find HSA runtime library.")
end

# initializing the library isn't necessary, but flushes out errors that otherwise would
# happen during `version` or, worse, at package load time.
status = init_hsa(config[:libhsaruntime_path])
if status != 0
    build_error("Initializing HSA runtime failed with code $status.")
end

config[:libhsaruntime_version] = version_hsa(config[:libhsaruntime_path])

# also shutdown just in case
status = shutdown_hsa(config[:libhsaruntime_path])
if status != 0
    build_error("Shutdown of HSA runtime failed with code $status.")
end

# find the ld.lld program for linking kernels
ld_path = find_ld_lld()
if ld_path == ""
    build_error("Couldn't find ld.lld, please install it with your package manager")
end
config[:ld_lld_path] = ld_path
config[:hsa_configured] = true

for name in ("rocblas", "rocsparse", "rocalution", "rocfft", "rocrand", "MIOpen")
    lib = Symbol("lib$(lowercase(name))")
    config[lib] = find_roc_library("lib$name")
    if config[lib] == nothing
        build_warning("Could not find library '$name'")
    end
end

lib_hip = Symbol("libhip")
_paths = String[]
config[lib_hip] = Libdl.find_library(["libhip_hcc","libamdhip64"], _paths)
config[lib_hip] == nothing && build_warning("Could not find library HIP")
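
For example (a hypothetical debugging sketch, not code that exists in build.jl), interleaving @info calls with the steps in the excerpt above would show which step triggers the ld.so assertion:

@info "Searching for libhsa-runtime64"
config[:libhsaruntime_path] = find_hsa_library("libhsa-runtime64", roc_dirs)
@info "Initializing HSA runtime"
status = init_hsa(config[:libhsaruntime_path])
@info "Querying HSA runtime version"
config[:libhsaruntime_version] = version_hsa(config[:libhsaruntime_path])
@info "Shutting down HSA runtime"
status = shutdown_hsa(config[:libhsaruntime_path])
@info "Searching for ld.lld"
ld_path = find_ld_lld()
# the last @info message printed before the crash points at the offending step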

jpsamaroo changed the title from "I just can't get it work" to "Error in build step: Inconsistency detected by ld.so" on Oct 21, 2020
jpsamaroo added the bug (Something isn't working), build, and upstream labels on Oct 21, 2020
jpsamaroo (Member)

Please retry this with AMDGPU#master, we've merged support for ROCT and ROCR artifacts, so this might help alleviate this error.

Cvikli (Author) commented Feb 26, 2021

Hey! I will try it tomorrow! Thank you for your efforts! :)

(I have been watching the package and following its updates, by the way; I just didn't know whether I could try again yet, so this is a great notification!) :)

Cvikli (Author) commented Feb 27, 2021

I reinstalled everything by following these steps:

So it has definitely improved: I could allocate arrays!

For the code:

using AMDGPU

@show AMDGPU.agents()

N=2^20
a = rand(Float64, N)
b = rand(Float64, N)
c_cpu = a + b
a_d = ROCArray(a)
b_d = ROCArray(b)
c_d = similar(a_d)

function vadd!(c, a, b)
	i = workitemIdx().x
	c[i] = a[i] + b[i]
	return
end

@roc groupsize=32 vadd!(c_d, a_d, b_d)

I get the following error message:

ERROR: LoadError: MethodError: no method matching cached_compilation(::Dict{UInt64, Any}, ::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.ROCCompilerParams}, ::typeof(AMDGPU.rocfunction_compile), ::typeof(AMDGPU.rocfunction_link))
Closest candidates are:
  cached_compilation(::Dict, ::Function, ::Function, ::GPUCompiler.FunctionSpec{f, tt}; kwargs...) where {f, tt} at /home/hm/.julia/packages/GPUCompiler/AdCnd/src/cache.jl:65
Stacktrace:
 [1] rocfunction(f::Function, tt::Type; name::Nothing, device::AMDGPU.RuntimeDevice{HSAAgent}, global_hooks::NamedTuple{(), Tuple{}}, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:291
 [2] rocfunction(f::Function, tt::Type)
   @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:286
 [3] top-level scope
   @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:165
in expression starting at /home/username/repo/TestScripts/tests/test_AMD.jl:20

Is this error message helpful to you?
I have GPUCompiler v0.10.0 installed, which might be useful to know.

jpsamaroo (Member)

Can you pull AMDGPU again? I don't know how you ended up in this situation (that version of AMDGPU shouldn't allow you to use GPUCompiler 0.10), but the latest master of AMDGPU has explicit support for GPUCompiler 0.10.

Cvikli (Author) commented Feb 27, 2021

I pulled AMDGPU again, and now it doesn't crash anymore; we are getting closer!

julia> @roc groupsize=N vadd!(c_d, a_d, b_d)
AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007fdccae6e780))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x00000000091a3950)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00  …  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}(:__global_malloc_hostcall => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae27000, 40, GPU: Vega 20 [Radeon VII] (gfx906), true), :__global_exception_ring => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae25000, 8, GPU: Vega 20 [Radeon VII] (gfx906), true), :__global_exception_flag => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae23000, 16, GPU: Vega 20 [Radeon VII] (gfx906), true)))))

After running the command, the arithmetic doesn't seem to be performed, so:

c = Array(c_d)

isapprox(c, c_cpu) # == false

Running julia> wait(@roc groupsize=N vadd!(c_d, a_d, b_d)) hangs forever, I guess?

jpsamaroo (Member)

So you should always do wait(@roc ...), because @roc doesn't wait for the kernel to complete. Does wait actually hang (give Julia a minute or two to compile the first time)? If it does hang, can you tell me which GPU you're using?
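
For clarity, a minimal sketch of that pattern, reusing the vadd! example from earlier in the thread:

ev = @roc groupsize=32 vadd!(c_d, a_d, b_d)  # launches asynchronously and returns an event immediately
wait(ev)                                     # blocks until the kernel has actually finished
c = Array(c_d)                               # only now is it safe to read the result on the host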

Cvikli (Author) commented Feb 27, 2021

I didn't select any GPU, and yes, it hangs forever.
Maybe I should select a GPU? How do I do that? :o

jpsamaroo (Member)

You don't necessarily need to set a GPU; AMDGPU selects the first available GPU for you, which you can see with @show AMDGPU.get_default_agent(). There isn't really a great way to set the current GPU right now, but you could do something like:

agents = AMDGPU.agents()
AMDGPU.DEFAULT_AGENT[] = agents[2] # Make the 2nd GPU the default

The reason I asked what GPU you're using (or really, what GPU AMDGPU selects for you) is that you could be using an unsupported GPU, which can possibly hang when trying to use it. I've had that happen with my Raven Ridge integrated GPU.

Cvikli (Author) commented Feb 27, 2021

Hey,

julia> AMDGPU.get_default_agent()
GPU: Vega 20 [Radeon VII] (gfx906)

I selected GPU 3 to test another one, as you described, and I think it worked.

Running this: wait(@roc groupsize=N vadd!(c_d, a_d, b_d))
this is the error I get now:

julia> wait(@roc groupsize=N vadd!(c_d, a_d, b_d))
ERROR: InexactError: trunc(UInt16, 1048576)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt16}, val::UInt32)
    @ Core ./boot.jl:602
  [2] checked_trunc_uint
    @ ./boot.jl:632 [inlined]
  [3] toUInt16
    @ ./boot.jl:709 [inlined]
  [4] UInt16
    @ ./boot.jl:755 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] KernelDispatchPacket
    @ ~/.julia/packages/AMDGPU/XxCm7/src/hsa/libhsa_types.jl:112 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:0 [inlined]
  [8] _setproperties
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:60 [inlined]
  [9] setproperties
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:57 [inlined]
 [10] set
    @ ~/.julia/packages/Setfield/XM37G/src/lens.jl:110 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/Setfield/XM37G/src/sugar.jl:182 [inlined]
 [12] (::AMDGPU.var"#22#23"{AMDGPU.ROCDim3, AMDGPU.ROCDim3, HSAKernelInstance{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, HSASignal})(_packet::AMDGPU.HSA.KernelDispatchPacket)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:141
 [13] _launch!(f::AMDGPU.var"#22#23"{AMDGPU.ROCDim3, AMDGPU.ROCDim3, HSAKernelInstance{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, HSASignal}, T::Type, queue::HSAQueue, signal::HSASignal)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:178
 [14] #launch!#21
    @ ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:139 [inlined]
 [15] #launch_kernel#31
    @ ~/.julia/packages/AMDGPU/XxCm7/src/runtime.jl:114 [inlined]
 [16] #launch_kernel#30
    @ ~/.julia/packages/AMDGPU/XxCm7/src/runtime.jl:109 [inlined]
 [17] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:203 [inlined]
 [18] _launch
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:180 [inlined]
 [19] launch
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:160 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:131 [inlined]
 [21] roccall(::ROCFunction, ::Type{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, ::ROCDeviceVector{Float64, 1}, ::ROCDeviceVector{Float64, 1}, ::ROCDeviceVector{Float64, 1}; queue::AMDGPU.RuntimeQueue{HSAQueue}, signal::AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}, groupsize::Int64, gridsize::Int64)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:109
 [22] #roccall#208
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:265 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:246 [inlined]
 [24] #call#196
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:222 [inlined]
 [25] #_#236
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:412 [inlined]
 [26] top-level scope
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:167

julia> 

Interesting error; now I don't know what the problem could be.

jpsamaroo (Member)

Hmm, I've never personally tested any of the gfx906 chips, although they should probably work. You might consider updating your Linux kernel and making sure you're on the latest ROCm packages (currently we distribute ROCm 3.8; we should probably get those updated to 4.0).

What value of N did you specify? The groupsize is generally only 1024 on Vega, as far as I can recall, but it is definitely limited to what can fit in a UInt16. I'll file an issue to add a check for invalid workgroup sizes.
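
(For reference, a quick REPL check reproduces the conversion failure itself: the HSA dispatch packet stores the workgroup size as a UInt16, and the earlier snippet used N = 2^20.)

julia> UInt16(2^20)
ERROR: InexactError: trunc(UInt16, 1048576)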

Cvikli (Author) commented Mar 1, 2021

I used N=32, but yeah, I got a strange error at bigger values. :D

I can hardly believe different chips can make such a big difference. It would be a hell of a lot of work if there were no common interface for them. :o

Do you think I should update to 4.0 to solve this, then?

jpsamaroo (Member)

Yeah, that one's on me, it really should have been an error.

Just remember, ROCm is basically beta software right now (even though they're on version >4.0). Bugs and broken configurations are easy to stumble upon.

I never asked you: which Linux kernel version are you using?

Cvikli (Author) commented Mar 1, 2021

I think:

➜  ~ uname -a
Linux user 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I will try to install 4.0.1, but I see it isn't a one-click process for now, and I don't want to disrupt my whole workspace because of a lot of other work. I will do it at the end of the week, maybe! :)

jpsamaroo (Member)

That kernel is probably recent enough for most cards, but the VII might be too new to work on that kernel. I'd consider upgrading to something newer, if you can.

Cvikli (Author) commented Mar 5, 2021

Hey, I am really surprised: I don't even know how I installed 4.0, since I didn't find any description of how to install specific versions. But now everything seems to work after another fresh install.

Weeeellll, I guess installing the master branch and the 4.0 rocm-dkms solved it, then. :)

Well done! ;)

Ok, let's check speeds!! :D

Cvikli (Author) commented Mar 5, 2021

I am just writing down my ideas for improvements as I run into problems while writing a basic speed-testing script. Sorry for using this thread:

  • The basic example uses a length-32 array. It would be nice to also have a slightly bigger, timed test, like N=1<<26.

  • I didn't find @rocprintf easy enough to use, because I couldn't find a format specifier for "any" type (like "%d" or "%s"), so I couldn't print types, for example. Sadly, I couldn't explore the AMD environment because I didn't know which types and fields I have, and I had a really hard time figuring anything out. But I could get some test code to print, like:

@rocprint "%d" workitemIdx().x
@rocprint "\n" 

but of course this isn't as nice as it could be, and it could be a little more convenient if possible, so it would be easier for anyone to discover. Also, this is the function that gets used a lot in the beginning, so it could improve the start for every single developer. :)

  • There is a typo in https://juliagpu.github.io/AMDGPU.jl/stable/printing/#Printing where the brackets don't match: the closing "end" is written as "ed", so the first copy-paste gave an error.
  • Any time I run a kernel I get this long output:
    AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007f7c04007a00))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x000000000ae30c00)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00 … 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}())))
    but I am on #master, so I guess this could be some debug output?
  • It was hard to figure out what I have available in the kernel environment; I would definitely add some example code that shows all the things I can use.
function kernel(y)
	@rocprintf "\n"
	wgdim = workgroupDim().x
	for i in 1:wgdim
		gr = gridDimWG().x
		grd = gridDim().x
		gidx = workgroupIdx().x
		idx = workitemIdx().x
		@rocprintf "workgroup: %3d/%-3d  idx: %-3d  groupidx: %-3d  grid: %-3d  grididk: %-3d\n" i wgdim idx gidx gr grd
	end
	return nothing
end
  • Somehow we could create a @RocMax [device] fn() that just runs the function with the maximum possible capacity. Later on, nobody wants to waste time finding the appropriate config; it should be handled by the @roc command. Or maybe create some device-config object that points to the device with the maximum capacity and handles @roc {device_withtheMAXPOWER} fn(), so we can bridge this assembly-level tuning of the whole workgroup. I know this sounds crazy, but... if someone wants to bother with the low level, they can at any time... yet 98% of developers would just go with an instant command that ROCKS. :D
  • A how-to-use example for gridsize is missing.
  • Why does a kernel function have to have a parameter? I didn't see that in the documentation.

Sadly, I couldn't make c .= a .+ b run in parallel with a kernel function. But I made this... It works, but it is slow because I couldn't make it run in parallel:

function vadd!(c, a, b)
	wgdim = workgroupDim().x
	i = workitemIdx().x
	batch = Int(size(a,1)/wgdim)
	for j in (i-1)*batch+1:i*batch
		c[j] = a[j] + b[j]
	end
	return
end
@time wait(@roc groupsize=1024 vadd!(c_d, a_d, b_d))

I guess it does something wrong somehow; I don't yet get how this is supposed to work. :)

But really great work all in all! I just added some notes so you can see how a beginner fails when following the documentation.

jpsamaroo (Member)

I didn't find @rocprintf easy enough to use, because I couldn't find a format specifier for "any" type (like "%d" or "%s"), so I couldn't print types, for example. Sadly, I couldn't explore the AMD environment because I didn't know which types and fields I have, and I had a really hard time figuring anything out.

So that's my fault: I should have documented that @rocprintf just calls Julia's Printf.@printf, so for any type, you could just do @rocprintf("Value: %s\n", myvalue) and it will probably interpret it correctly.

@rocprint "%d" workitemIdx().x
@rocprint "\n"

I don't recommend using @rocprint anymore, since you can't interpolate values like you're trying to here, and @rocprintf is just generally more flexible. They were really just implemented as a proof-of-concept. I'll probably remove them soon. What you want is:

@rocprintf("%d\n", workitemIdx().x)
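
As a small, hypothetical sketch (the kernel name is made up) of that same call used inside a kernel, mirroring the dummy-argument style of the kernel example above:

function print_idx_kernel(y)                 # y is an unused dummy argument, as in the earlier example
    @rocprintf("%d\n", workitemIdx().x)      # prints each work-item's x index
    return
end

wait(@roc groupsize=4 print_idx_kernel(c_d))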

There is a typo in https://juliagpu.github.io/AMDGPU.jl/stable/printing/#Printing where the brackets don't match: the closing "end" is written as "ed", so the first copy-paste gave an error.

Fixed, thanks!

Any time I run a kernel I get this long output:
AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007f7c04007a00))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x000000000ae30c00)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00 … 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}()))) but I am on #master so I guess this can be some debug?

As pointed out in the docs (near the end of https://juliagpu.github.io/AMDGPU.jl/stable/quickstart/#Running-a-simple-kernel), you need to wait on the result of @roc. The object you're seeing here is not an error, but the object that comes out of @roc, and it should probably print more nicely. I'll file an issue for that.

It was hard to figure out what I have available in the kernel environment; I would definitely add some example code that shows all the things I can use.

Can you elaborate on what you mean by this? Are you thinking that we should have a function that lets you see all the useful information about a thread's location in a kernel, which can then be printed? If so, I agree, and I'd be happy to accept a PR that implements this 🙂

Somehow we could create a @RocMax [device] fn() that just runs the function with the maximum possible capacity. Later on, nobody wants to waste time finding the appropriate config; it should be handled by the @roc command. Or maybe create some device-config object that points to the device with the maximum capacity and handles @roc {device_withtheMAXPOWER} fn(), so we can bridge this assembly-level tuning of the whole workgroup. I know this sounds crazy, but... if someone wants to bother with the low level, they can at any time... yet 98% of developers would just go with an instant command that ROCKS. :D

I appreciate the enthusiasm! I think what would be best is the ability to pass groupsize=auto to @roc, and to implement a simple occupancy estimator that will pick some valid value automatically. I've filed an issue about this.

A how-to-use example for gridsize is missing.

Issue filed; feel free to help add these docs if you know how grids and groups work (they're the same as OpenCL workgroups and grids).

Why does a kernel function have to have a parameter? I didn't see that in the documentation.

That's no longer the case on v0.2.3, maybe you need to update AMDGPU.jl?

Sadly, I couldn't make c .= a .+ b run in parallel with a kernel function. But I made this... It works, but it is slow because I couldn't make it run in parallel.

Try this:

function vadd!(c, a, b)
  idx = (workgroupDim().x * (workgroupIdx().x-1)) + workitemIdx().x
  c[idx] = a[idx] + b[idx]
  nothing
end
@time wait(@roc groupsize=min(1024,length(c_d)) gridsize=length(c_d) vadd!(c_d, a_d, b_d))

I get 0.011611 seconds (120 allocations: 4.328 KiB), which is pretty OK for creating and launching a kernel (although the GPU is far faster than this; I'm aware of the fact that we spend too long waiting for a kernel to complete).

But really great work all in all! I just added some notes so you can see how a beginner fails when following the documentation.

Thanks for reporting all of these! If you get the chance to help fix some of these issues, I would greatly appreciate it 😄

Cvikli (Author) commented Mar 6, 2021

Oh I see, so the grid size is the size of the arithmetic operation, WOW! Very cool! That sounds really efficient!

I realised I actually used @rocprintf, so sorry for the typo.

groupsize=auto would be great; maybe it would be nice to consider making it the default. Also, if you have ever used @everywhere [workers list] fn(), it would be nice to be able to specify the device in a similar way, like a "worker", but I know this is harder than I make it sound.

I am working at a company, and I think it would be more beneficial if I tried to build a team and adopt AMDGPU in our open packages. Also, I think we have the best machine learning library on the way, and adding AMDGPU support to it would be amazing. I know there are Flux, Zygote, and many more out there, but they all have a seriously hard time and real limitations because of the core.

On the figuring-out-the-environment topic: what you described is really nice. I would be satisfied just to have a ten-line piece of code that shows all the information available to me during a kernel run. Just one complete example that shows everything from my runtime environment, so that by running the code and reading the docs I could learn the whole of kernel programming in 15 seconds and understand the details. :) I know this is just an idea about what could be the best way to make the whole of AMDGPU simple and easy to learn.

Btw, I would be glad if you could tell me whether you think it is possible to do 10 or 100 kernel operations in a row, with a syntax like this:

function fn(x, y)
    x .= x .+ x .* x .* 2 .+ y
    y .= y .- 1 .+ y .^ 2
    return x .+ y
end
fn(x, y) # where x, y are ROCArrays

So like NVIDIA does by defining CuArray arithmetic? Or is this on the way already?

(Sent from mobile)

Cvikli (Author) commented Mar 9, 2021

Hey,

It is interesting to see that the timing of the example code you wrote is similar in my case too. If everything is measured correctly, then this is a 10x speedup at the moment.

I think it would be a good idea to update the basic example to the one you wrote here. It also explains a lot and shows how this whole system works.

Also, what do you think about the "CuArray" approach of defining the arithmetic operations between ROCArrays? Is that possible, like CUDA did it? I feel like broadcasting could let the gridsize be handled automatically when the arithmetic is redefined. I am just asking whether all of this is possible in the future, because that would allow ROCArray to be a one-to-one replacement for Array.
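
(For context, a hedged sketch of the CuArray-style usage being asked about; this assumes ROCArray participates in GPUArrays.jl broadcasting in the installed version, which this thread does not confirm.)

using AMDGPU

a_d = ROCArray(rand(Float64, 2^20))
b_d = ROCArray(rand(Float64, 2^20))
c_d = a_d .+ b_d            # elementwise add, compiled to a GPU kernel by broadcasting
c_d .= a_d .* 2 .+ b_d      # fused in-place broadcast, no hand-written @roc needed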

jpsamaroo (Member)

Closing since this issue meandered over too many unrelated things; further discussion can continue on Discourse, or specific issues should be filed separately.
