ERROR: HIPError(code hipErrorOutOfMemory, out of memory) #591

Closed
markerrors opened this issue Feb 16, 2024 · 7 comments · Fixed by #594

Comments

@markerrors

markerrors commented Feb 16, 2024

  • AMDGPU.jl version: v0.8.7
  • ROCm version: 5.5
  • GPU: Radeon 500 Series

Running the following standard vector-addition example produces an error:
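    using AMDGPU, Test

    # Element-wise vector addition: each work-item handles one index.
    function vadd(a, b, c)
        i = workitemIdx().x
        c[i] = a[i] + b[i]
        sync_workgroup()
        return nothing
    end

    dims = (8,)
    a = round.(rand(Float32, dims) * 100)
    b = round.(rand(Float32, dims) * 100)

    # Copy the inputs to the GPU and allocate the output there.
    d_a = ROCArray(a)
    d_b = ROCArray(b)
    d_c = similar(d_a)
    len = prod(dims)

    # Launch one workgroup of `len` work-items, then copy the result back.
    @roc groupsize=len vadd(d_a, d_b, d_c)
    c = Array(d_c)
    @test a + b ≈ c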
*ERROR: HIPError(code hipErrorOutOfMemory, out of memory)*

Stacktrace:
  [1] check
    @ AMDGPU.Runtime.Mem D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\hip\error.jl:149 [inlined]
  [2] |>
    @ AMDGPU.Runtime.Mem .\operators.jl:915 [inlined]
  [3] AMDGPU.Runtime.Mem.HostBuffer(bytesize::Int64, flags::UInt8; stream::HIPStream)
    @ AMDGPU.Runtime.Mem D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\runtime\memory\hip.jl:183
  [4] HostBuffer
    @ AMDGPU D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\runtime\memory\hip.jl:177 [inlined]
  [5] AMDGPU.ExceptionHolder()
    @ AMDGPU D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\exception_handler.jl:42
  [6] #56
    @ Base D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\exception_handler.jl:69 [inlined]
  [7] get!(default::AMDGPU.var"#56#57", h::Dict{UInt64, AMDGPU.ExceptionHolder}, key::UInt64)
    @ Base .\dict.jl:477
  [8] exception_holder
    @ AMDGPU D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\exception_handler.jl:69 [inlined]
  [9] has_exception
    @ AMDGPU D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\exception_handler.jl:73 [inlined]
 [10] throw_if_exception(dev::HIPDevice)
    @ AMDGPU D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\exception_handler.jl:118
 [11] (::AMDGPU.Runtime.HIPKernel{typeof(vadd), Tuple{AMDGPU.Device.ROCDeviceVector{Float32, 1}, AMDGPU.Device.ROCDeviceVector{Float32, 1}, AMDGPU.Device.ROCDeviceVector{Float32, 1}}})(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}; stream::HIPStream, call_kwargs::@Kwargs{groupsize::Int64})
    @ AMDGPU.Runtime D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\runtime\hip-execution.jl:45
 [12] top-level scope
    @ D:\programfile\Julia10\pkg\packages\AMDGPU\68yOW\src\highlevel.jl:160

Additionally, calling AMDGPU.exception_holder(AMDGPU.device()) directly produces a similar error.

@luraess
Collaborator

luraess commented Feb 16, 2024

What's the output of AMDGPU.versioninfo()?

@pxl-th
Collaborator

pxl-th commented Feb 16, 2024

And also the output of the rocminfo command.

@markerrors
Author

@luraess
(screenshot attached in the original issue; not reproduced here)

@pxl-th The ROCm version is 5.5.

@pxl-th
Collaborator

pxl-th commented Feb 19, 2024

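Does ROCm work with C++?
E.g. try allocating device memory:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int n_elements = 8;
    size_t size = n_elements * sizeof(int);

    int *d_a = nullptr;
    // Allocate device memory and report any HIP error instead of ignoring it.
    hipError_t err = hipMalloc(&d_a, size);
    if (err != hipSuccess) {
        std::printf("hipMalloc failed: %s\n", hipGetErrorString(err));
        return 1;
    }

    hipFree(d_a);
    return 0;
}

Compile it with hipcc main.cpp and run the resulting executable.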

@markerrors
Author

Yes, the program built with hipcc main.cpp compiles and runs correctly.

My hipinfo is:

HIP version : 5.5.0-3e8db564

== hipconfig
HIP_PATH : C:/Program Files/AMD/ROCm/5.5/
ROCM_PATH : C:\Program Files\AMD\ROCm\5.5
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I"C:/Program Files/AMD/ROCm/5.5//include" -I"C:/Program Files/AMD/ROCm/5.5//bin/../lib/clang/17.0.0"

== hip-clang
HIP_CLANG_PATH : C:/Program Files/AMD/ROCm/5.5//bin
clang version 17.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project e3201662d21c48894f2156d302276eb1cf47c7be)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\AMD\ROCm\5.5\bin
AOMP-16.0-45 (http://github.com/ROCm-Developer-Tools/aomp):
Source ID:16.0-45-6b875fb548b9ded0f07df02bc2af6e12568504a9
LLVM version 17.0.0git
Optimized build.
Default target: x86_64-pc-windows-msvc
Host CPU: skylake

Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -isystem "C:/Program Files/AMD/ROCm/5.5//include" -O3 --hip-path="C:/Program Files/AMD/ROCm/5.5/"
hip-clang-ldflags : --driver-mode=g++ -fuse-ld=lld --ld-path="C:/Program Files/AMD/ROCm/5.5//bin/lld-link.exe" -O3 --hip-path="C:/Program Files/AMD/ROCm/5.5/" --hip-link

=== Environment Variables
PATH=/c/Program Files/dotnet:/d/programfile/CTEX1/miktex/bin/x64:/d/programfile/CTEX/MiKTeX/miktex/bin:/c/Windows/system32:/mingw64/bin:/usr/bin:/d/programfile/Mosek/7/tools/platform/win64x86/bin:/mingw64/bin:/d/programfile/julia/pkgs/Conda:/d/programfile/julia/pkgs/Conda/Lib:/d/programfile/julia/pkgs/Conda/Scripts:/d/programfile/julia/pkgs/Conda/Library/bin:/d/programfile/CTEX/MiKTeX/miktex/bin/x64:/d/programfile/matlab2014/bin:/d/programfile/julia/pkgs/Conda/pkgs/sqlite-3.33.0-h2a8f88b_0/Library/bin:/d/programfile/Git/cmd:/c/Program Files/AMD/ROCm/5.5/bin:/c/Users/lenovo/.cargo/bin:/c/Users/lenovo/.windows-build-tools/python27:/c/Users/lenovo/AppData/Local/Microsoft/WindowsApps:/d/programfile/MicrosoftVSCode/bin:/e/packages/bin:/c/Users/lenovo/AppData/Local/Pandoc:/d/programfile/CTEX/MiKTeX/miktex/bin/xelatex.exe:/c/Program Files/Git/cmd/git.exe:/c/Program Files/pandoc:/d/programfile/Julia10/bin:/d/programfile/Git/usr/bin:/d/programfile/R4/bin:/d/programfile/Arduino:/d/programfile/CTEX/MiKTex/miktex/bin/x64:/d/programfile/MiKTex/miktex/bin/x64:/c/Users/lenovo/.dotnet/tools:/d/programfile/VSCodium/bin:/d/programfile/julia/pkgs/Conda/Lib/site-packages/playwright/driver:/c/Program Files/AMD/ROCm
HIPCONFIG='"C:\Program Files\AMD\ROCm\5.5\bin\hipconfig"'
HIP_PATH='C:\Program Files\AMD\ROCm\5.5'
HIP_PATH_55='C:\Program Files\AMD\ROCm\5.5'

== Windows Display Drivers
Hostname : mark
Advanced Micro Devices, Inc. C:\Windows\System32\DriverStore\FileRepository\u0399660.inf_amd64_d7fa3539ce499e50\B399655\aticfx64.dll,C:\Windows\System32\DriverStore\FileRepository\u0399660.inf_amd64_d7fa3539ce499e50\B399655\aticfx64.dll,C:\Windows\System32\DriverStore\FileRepository\u0399660.inf_amd64_d7fa3539ce499e50\B399655\aticfx64.dll,C:\Windows\System32\DriverStore\FileRepository\u0399660.inf_amd64_d7fa3539ce499e50\B399655\amdxc64.dll Radeon 500 Series

Calling AMDGPU.HIP.status_message(0) returns
"the operation completed successfully"

So I think HIP itself is working correctly.

After some experimenting, I found that the error comes from:
ptr_ref = Ref{Ptr{Cvoid}}()
AMDGPU.HIP.hipHostMalloc(ptr_ref, 4, AMDGPU.HIP.hipHostAllocMapped)
When I replace AMDGPU.HIP.hipHostAllocMapped with AMDGPU.HIP.hipHostAllocDefault, the error disappears.
Is such a modification correct?

@pxl-th
Collaborator

pxl-th commented Feb 20, 2024

So your system probably does not support the hipHostAllocMapped flag.
As a temporary workaround, you can replace AMDGPU.HIP.hipHostAllocMapped with AMDGPU.HIP.hipHostAllocDefault, but I haven't tested that it works correctly.

Note that the change will only have an effect when you encounter errors/exceptions, though.

I will come up with a fix soon.
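
The workaround above amounts to "try the mapped flag first, fall back to the default flag". A minimal sketch of that fallback logic is given below. It is not the code from AMDGPU.jl or from the eventual fix: `first_working_flag` and `fake_alloc` are hypothetical names introduced only for illustration, and the stand-in allocator merely imitates the behaviour reported in this issue so that the snippet runs without a GPU.

```julia
# Try each candidate flag in order and return the first one for which
# `try_alloc(flag)` does not throw, or `nothing` if every flag fails.
# `try_alloc` is a caller-supplied function (hypothetical helper interface).
function first_working_flag(try_alloc, flags)
    for flag in flags
        try
            try_alloc(flag)
            return flag
        catch err
            # Record the failure and move on to the next candidate.
            @info "Allocation failed, trying next flag" flag err
        end
    end
    return nothing
end

# Stand-in allocator (assumption): the mapped flag fails and the default
# flag succeeds, as reported above. It does not call into HIP at all.
fake_alloc(flag) = flag === :mapped ? error("out of memory") : nothing

chosen = first_working_flag(fake_alloc, (:mapped, :default))
println(chosen)
```

On a machine with AMDGPU.jl installed, `try_alloc` could wrap the `AMDGPU.HIP.hipHostMalloc(ptr_ref, 4, flag)` call from the report, with `AMDGPU.HIP.hipHostAllocMapped` and `AMDGPU.HIP.hipHostAllocDefault` as the candidate flags, and throw when the returned status indicates failure. A real probe should also release each successful test allocation again (HIP provides `hipHostFree` for this).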

@pxl-th
Collaborator

pxl-th commented Feb 20, 2024

This should be fixed by PR #594.

It is probably fine to use the default flag, since we always pass the device pointer to the GPU (which differs from the host pointer with the default flag).
