[Issue]: Creating two HIP streams causes 100% GPU utilization #3388

Closed
pxl-th opened this issue Jan 12, 2024 · 6 comments

pxl-th commented Jan 12, 2024

Problem Description

Creating two HIP streams causes 100% GPU utilization, even though no work is submitted to either stream.
This is observed on ROCm 5.7 through 6.0, on at least the RX 7600, RX 7800 XT and RX 7900 XTX.

Here is the GPU utilization graph, as shown by resources, during execution of the C++ MWE below (rocm-smi reports the same):

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 7 5800X 8-Core Processor

GPU

AMD Radeon RX 7900 XT

ROCm Version

ROCm 6.0.0

Steps to Reproduce

C++ MWE:

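#include <hip/hip_runtime.h>
#include <chrono>
#include <iostream>
#include <thread>

// Report a failed HIP call. Execution continues so the sleep below still runs.
void check(hipError_t res) {
    if (res != hipSuccess) {
        std::cerr << "Fail: " << hipGetErrorString(res) << std::endl;
    }
}

int main() {
    // Create two streams with default flags (0) and priority 0.
    hipStream_t s1;
    check(hipStreamCreateWithPriority(&s1, 0, 0));

    hipStream_t s2;
    check(hipStreamCreateWithPriority(&s2, 0, 0));

    // No work is submitted; utilization is observed during this idle window.
    std::this_thread::sleep_for(std::chrono::seconds(5));
    return 0;
}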

Compile with hipcc main.cpp, run ./a.out, and observe GPU utilization while the program is running.

Contributor

cjatin commented Jan 15, 2024

I could not reproduce it on Navi 21 (RX 6900 XT).

rocm-smi reads its utilization percentage from the driver. I will forward this to the relevant teams to get more information.
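To cross-check the reported value without any monitoring tool, the driver-side counter can be read directly. The following is a minimal sketch, assuming the counter in question is the `gpu_busy_percent` file that the amdgpu kernel driver exposes in sysfs. The helper name `read_busy_percent` is hypothetical, and the card index in the path varies between systems.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads an integer utilization percentage (0-100)
// from a sysfs-style text file such as
//   /sys/class/drm/card0/device/gpu_busy_percent
// (assumed path; the card index depends on the system).
// Returns -1 if the file is missing, unparsable, or out of range.
int read_busy_percent(const std::string& path) {
    std::ifstream file(path);
    int percent = -1;
    if (!(file >> percent) || percent < 0 || percent > 100) {
        return -1;
    }
    return percent;
}
```

Calling `read_busy_percent` on that path about once per second while the MWE sleeps should show whether the raw counter itself reads 100, which would suggest the value originates below the monitoring tools.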

Author

pxl-th commented Jan 15, 2024

This looks to be a Navi 3 issue. I was also unable to reproduce it on an RX 6700 XT.

Author

pxl-th commented Jan 16, 2024

If this is not a monitoring bug, it might partially explain why we are seeing random hangs in our AMDGPU.jl CI only on Navi 3: the tests run on multiple workers, each using multiple streams.

Author

pxl-th commented Jan 23, 2024

Hi! Just curious whether there's any update on this issue?

Contributor

cjatin commented Jan 24, 2024

Nothing as of now. I will update here once we have a solution.

Author

pxl-th commented Feb 20, 2024

This issue seems to be fixed with ROCm 6.0.2 and Linux 6.5.0-18.
I'm not sure where the fix came from, though.

pxl-th closed this as completed Feb 20, 2024