Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profiler: goroutineswait saw-stop mechanism #942

Merged
merged 8 commits into from
Jun 30, 2021

Conversation

felixge
Copy link
Member

@felixge felixge commented Jun 2, 2021

This is a safety mechanism for the new goroutineWaitProfile that
automatically disables the profile if the number of goroutines is too
high in order to avoid excessive stop-the-world pauses.

The default limit is 1000 which is somewhat arbitrary and should limit
STW pauses to ~30ms. It can be overwritten with
DD_PROFILING_WAIT_PROFILE_MAX_GOROUTINES.

WARNING: The goroutineWaitProfile is still experimental and not meant to
be used by users outside of datadog.

Fixes PROF-3234

@felixge felixge added this to the 1.32.0 milestone Jun 2, 2021
profiler/profile.go Outdated Show resolved Hide resolved
@felixge
Copy link
Member Author

felixge commented Jun 22, 2021

@pmbauer @knusbaum this still needs a quick look at some point. Thanks :)

@r1viollet
Copy link

I liked these changes. Especially the addition of a test to check this setting.
Things I would have considered :

  • having a dedicated error count to see if this actually happens in real life (this can easily be done in another PR).
  • splitting the test file (but that is a personal preference)

@felixge
Copy link
Member Author

felixge commented Jun 22, 2021

I liked these changes. Especially the addition of a test to check this setting.
Things I would have considered :

Thanks for the suggestions.

  • having a dedicated error count to see if this actually happens in real life (this can easily be done in another PR).

Will ping you on slack about this.

  • splitting the test file (but that is a personal preference)

Not sure how you'd split it? One file per test? The way it's currently done is pretty idiomatic in Go, i.e. the tests for code implement in x.go will be in x_test.go.

r1viollet
r1viollet previously approved these changes Jun 22, 2021
knusbaum
knusbaum previously approved these changes Jun 22, 2021
Copy link
Contributor

@knusbaum knusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple questions/suggestions.
Looks good in general.

profiler/profile_test.go Outdated Show resolved Hide resolved
profiler/profile.go Outdated Show resolved Hide resolved
launched <- struct{}{}
stopped <- struct{}{}
}()
<-launched
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of launched? It looks to me like if you removed it, this would function the same way, since the goroutines would all block on stopped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intend behind launched is to ensure that the goroutines have started executing before spawnGoroutines() returns. Without launched this invariant would be subject to a race condition.

In practice launched could be removed without repercussions because the go runtime immediately registers newly created goroutines as _Grunnable. That being said, this behavior seems to be an implementation detail that isn't covered by the language specification. I.e. a future version of Go might decide to put the goroutines into a new _Glaunchable state that isn't included in runtime.NumGoroutines(). So launched could prevent our test from breaking in the future.

That being said, I don't expect the runtime details of this to change anytime soon, so I don't feel strongly about it. I can remove the launched if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. That makes sense and seems to be a good solution.

My only comment is that it does not seem obvious why this code is like it is unless you know the go runtime internals. It might be good to make a small comment here about launched and how we need the goroutines to be running before spawnGoroutines returns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, PTAL at 2e95053 to see if this does a good enough job in clarifying the intend.

@felixge felixge dismissed stale reviews from knusbaum and r1viollet via d87eb26 June 23, 2021 09:36
This is a safety mechanism for the new goroutineWaitProfile that
automatically disables the profile if the number of goroutines is too
high in order to avoid excessive stop-the-world pauses.

The default limit is 1000 which is somewhat arbitrary and should limit
STW pauses to ~30ms. It can be overwritten with
DD_PROFILING_WAIT_PROFILE_MAX_GOROUTINES.

WARNING: The goroutineWaitProfile is still experimental and not meant to
be used by users outside of datadog.

Fixes PROF-3234
@felixge felixge force-pushed the felix.geisendoerfer/goroutineswait-sawstop branch from d87eb26 to 768a8b3 Compare June 23, 2021 09:37
Copy link
Member Author

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knusbaum thanks for the review, I applied 2/3 changes you suggested. PTAL.

launched <- struct{}{}
stopped <- struct{}{}
}()
<-launched
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intend behind launched is to ensure that the goroutines have started executing before spawnGoroutines() returns. Without launched this invariant would be subject to a race condition.

In practice launched could be removed without repercussions because the go runtime immediately registers newly created goroutines as _Grunnable. That being said, this behavior seems to be an implementation detail that isn't covered by the language specification. I.e. a future version of Go might decide to put the goroutines into a new _Glaunchable state that isn't included in runtime.NumGoroutines(). So launched could prevent our test from breaking in the future.

That being said, I don't expect the runtime details of this to change anytime soon, so I don't feel strongly about it. I can remove the launched if you like.

profiler/profile_test.go Outdated Show resolved Hide resolved
@felixge
Copy link
Member Author

felixge commented Jun 23, 2021

@knusbaum I'm struggling to get CI to pass, it seems like test-contrib is failing because of FAIL gopkg.in/DataDog/dd-trace-go.v1/contrib/hashicorp/consul [build failed]. Any idea?

knusbaum
knusbaum previously approved these changes Jun 23, 2021
Copy link
Contributor

@knusbaum knusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

launched <- struct{}{}
stopped <- struct{}{}
}()
<-launched
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. That makes sense and seems to be a good solution.

My only comment is that it does not seem obvious why this code is like it is unless you know the go runtime internals. It might be good to make a small comment here about launched and how we need the goroutines to be running before spawnGoroutines returns.

@knusbaum
Copy link
Contributor

knusbaum commented Jun 23, 2021

@felixge Yeah, I think it's because consul just released a version that's incompatible with <go1.16
See: hashicorp/consul#10470

I might pin the version in CI until they fix this.

The new name is technically more accurate as the goroutine may not have
stopped yet when a message appears in the channel.
@felixge
Copy link
Member Author

felixge commented Jun 28, 2021

@knusbaum @gbbr this is ready for final review/approval, PTAL. (I hit the re-run workflow button on GitLab until the flaky TestWorker test passed)

@felixge felixge merged commit 1389fb2 into v1 Jun 30, 2021
@felixge felixge deleted the felix.geisendoerfer/goroutineswait-sawstop branch June 30, 2021 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants