Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SplitCounters() susceptible to hangs (inifinite loop) #69

Open
jcortell68 opened this issue Jan 6, 2023 · 0 comments
Open

SplitCounters() susceptible to hangs (inifinite loop) #69

jcortell68 opened this issue Jan 6, 2023 · 0 comments

Comments

@jcortell68
Copy link
Contributor

The various SplitCounters() method implementation have a loop that is susceptible to hangs given bad GPU data. E.g., for the GPU I was adding, there was a goof in the auto generated files that left a counter group with a max-counters of zero. When I ran the color cube app, it hung in GpaSplitCountersConsolidated::SplitSingleCounter() because CanCounterBeAdded() always returns false in that case, and so it just keeps trying to add it to the next pass, which will go on for billions of iterations.

while (done_allocating_counter == false)
{
 ....
}

The hang was time consuming to debug. I think we can probably pick some reasonable maximum pass count which is impractical in the real world, and error out (with a log statement) if the looping exceeds that.

I've seen this problematic pattern in at least one other method: GpaContextCounterMediator::ScheduleCounters()
There may be more.

jcortell68 added a commit to jcortell68/gpu_performance_api that referenced this issue Jan 6, 2023
A counter group having a counter max of zero is invalid and
will ultimately result in a hang when we try to split counters
into multiple passes. This is one of various scenarios that
result in a hang during counter splitting; see

 GPUOpen-Tools#69

This fixes only that specific scenario. We now check that the
group max isn't zero, and if it is, we give up trying to split
a public counter's HW counters into multiple passes. We log
an error, too.

Again, this isn't a comprehensive fix for issue 69. There could
be other cases of bad data that result in a hang. Issue 69 should
be fixed with a pass cap limit to cover all cases. But this
commit still adds value in that it flags the specific invalid GPU
counter metadata in addition to avoiding the hang.

Change-Id: I56d7d2043ba92c1b6088f0fdd68f5ec844e7b823
PLohrmannAMD pushed a commit that referenced this issue Jul 1, 2024
* Error out of counter splitting if group max is zero

A counter group having a counter max of zero is invalid and
will ultimately result in a hang when we try to split counters
into multiple passes. This is one of various scenarios that
result in a hang during counter splitting; see

 #69

This fixes only that specific scenario. We now check that the
group max isn't zero, and if it is, we give up trying to split
a public counter's HW counters into multiple passes. We log
an error, too.

Again, this isn't a comprehensive fix for issue 69. There could
be other cases of bad data that result in a hang. Issue 69 should
be fixed with a pass cap limit to cover all cases. But this
commit still adds value in that it flags the specific invalid GPU
counter metadata in addition to avoiding the hang.

Change-Id: I56d7d2043ba92c1b6088f0fdd68f5ec844e7b823
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant