
[PROF-7307] Add support for allocation samples to ThreadContext #2657

Merged 4 commits into master on Mar 3, 2023

Conversation

@ivoanjo (Member) commented Mar 1, 2023

**What does this PR do?**:

This PR adds the `thread_context_collector_sample_allocation` API to the `ThreadContext` collector.

This will be used to support allocation profiling (but it's not yet called automatically when allocations happen).
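To make the intended usage concrete, here's a hedged sketch of how a caller might eventually drive this API; the Ruby-level method name and arguments below are illustrative assumptions, not the actual native interface:

```ruby
# Hypothetical sketch only: the real entry point added by this PR is the
# native thread_context_collector_sample_allocation function; the Ruby-level
# method and keyword arguments here are illustrative assumptions.
def on_new_object_allocated(thread_context_collector, new_object)
  # Attribute the allocation to the currently-running thread's context.
  thread_context_collector.sample_allocation(
    weight: 1, # the scaling/weighting strategy is still being decided
    allocated_class: new_object.class.name
  )
end
```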

**Motivation**:

This is an incremental step towards supporting allocation profiling.

I'm still working on the scaling/weighting strategy, and may end up adding that logic directly to the `ThreadContext`; I haven't decided yet.

**Additional Notes**:

I'm opening this PR on top of #2654 to avoid merge conflicts, but it is otherwise completely independent from it.

**How to test the change?**:

Change includes test coverage. End-to-end testing will come later, once this is wired up in the `CpuAndWallTimeWorker`.

@ivoanjo ivoanjo requested a review from a team March 1, 2023 18:43
@github-actions github-actions bot added the profiling Involves Datadog profiling label Mar 1, 2023
@codecov-commenter commented Mar 3, 2023

Codecov Report

Merging #2657 (7d22e54) into master (324d56f) will increase coverage by 0.00%.
The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##           master    #2657   +/-   ##
=======================================
  Coverage   98.07%   98.07%
=======================================
  Files        1166     1166
  Lines       63832    63871   +39
  Branches     2849     2849
=======================================
+ Hits        62603    62643   +40
+ Misses       1229     1228    -1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| ...filing/collectors/cpu_and_wall_time_worker_spec.rb | 95.45% <100.00%> (+0.01%) ⬆️ |
| ...atadog/profiling/collectors/thread_context_spec.rb | 97.75% <100.00%> (-0.24%) ⬇️ |
| lib/datadog/core/diagnostics/environment_logger.rb | 98.46% <0.00%> (+0.76%) ⬆️ |
| ...atadog/tracing/contrib/grpc/support/grpc_helper.rb | 100.00% <0.00%> (+1.75%) ⬆️ |
| lib/datadog/core/workers/polling.rb | 100.00% <0.00%> (+3.44%) ⬆️ |


Base automatically changed from ivoanjo/prof-7307-rename-cpu-and-wall-time-collector to master March 3, 2023 09:44
In `spec/spec_helper.rb`:

```
@@ -1,5 +1,8 @@
$LOAD_PATH.unshift File.expand_path('..', __dir__)
$LOAD_PATH.unshift File.expand_path('../lib', __dir__)

Thread.main.name = 'Thread.main' unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3')
```
A reviewer (Member) asked:
Why are we adding this change?

@ivoanjo (Member Author) replied:

It's to make it easier to debug when we have multiple threads -- if you print/list the threads, it's sometimes not obvious which thread is the main one (and main is the one driving the RSpec framework). It's unrelated to the other changes, which is why I put it in a separate commit -- 67b7fa8.

@ivoanjo (Member Author) added:

Here's one example where this is useful. This spec failed and printed the list of threads:

```
[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>]
```

This test failed in a flaky way in CI
(<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/9211/workflows/8431a38e-fa1d-4a93-a937-b8488b960a89/jobs/342577/tests#failed-test-0>).

Interesting observation in the output:

```
Failure/Error: expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
  sample_count: 7, stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8, :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8, :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586, :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094, :sampling_time_ns_avg=>523442.0}, debug_failures: {:thread_list=>[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>],
```

There was at least one sample that was super slow
(`sampling_time_ns_max` -- almost 3ms).

Because of that, the profiler throttled sampling down (using the dynamic
sampling rate component), and so did not take as many samples as we
were expecting. This is by design, although it's really hard to know
_why_ the profiler slowed down -- maybe the CI machine was busy?
Maybe that leftover webrick thread, leaked from another test, was
the culprit?
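To make the throttling mechanics concrete, here's a simplified, self-contained sketch of the idea (this is not the profiler's actual implementation, and the target overhead value is an illustrative assumption):

```ruby
# Illustrative sketch: keep time spent sampling below a target fraction of
# wall-clock time, so a slow sample delays the next one proportionally.
class DynamicSamplingRate
  TARGET_OVERHEAD = 0.02 # assumption: spend at most ~2% of time sampling

  def initialize
    @allowed_to_sample_at = Time.now
  end

  def should_sample?(now = Time.now)
    now >= @allowed_to_sample_at
  end

  def after_sample(sampling_time_seconds, now = Time.now)
    # A sample that took T seconds "pays" for T / TARGET_OVERHEAD seconds of
    # wall-clock time before sampling is allowed again.
    @allowed_to_sample_at = now + (sampling_time_seconds / TARGET_OVERHEAD)
  end
end
```

Under a model like this, the ~3ms sample above would hold off further sampling for on the order of 150ms, which is how a few slow samples can push the observed sample count below the spec's threshold.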

Right now the dynamic sampling rate is by design not very configurable,
so I've chosen to lower the spec's expected sample count yet again.

If this last adjustment isn't enough, I'll bite the bullet and make it
possible to disable the dynamic sampling rate (or tweak it), so as to
not cause issues for this spec.

(Like I mentioned in the comment -- I still think this test is worth
the trouble, as it validates an important behavior of the profiler
that otherwise may not get exercised in most validations we run --
profiling while idle).
@ivoanjo ivoanjo merged commit a644ca3 into master Mar 3, 2023
@ivoanjo ivoanjo deleted the ivoanjo/prof-7307-allocation-samples-support branch March 3, 2023 16:55
@github-actions github-actions bot added this to the 1.10.0 milestone Mar 3, 2023
@TonyCTHsu TonyCTHsu mentioned this pull request Mar 6, 2023
ivoanjo added a commit that referenced this pull request Sep 1, 2023
**What does this PR do?**

In #2657 we added support for the `Collectors::ThreadContext` to
sample a given object allocation.

But until now, these samples were only triggered by our test suite.

This PR adds the missing part: actually sampling objects as Ruby
allocates them.

In the `Collectors::CpuAndWallTimeWorker`, we already
had a `TracePoint` that is called on every new object allocation
(`RUBY_INTERNAL_EVENT_NEWOBJ`), but we were only using it to
count allocations.

This PR changes the `TracePoint` code to actually call into the
`Collectors::ThreadContext` and thus allow allocation sampling to
happen.
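Note that `RUBY_INTERNAL_EVENT_NEWOBJ` is only reachable from native code (via `rb_tracepoint_new` in the C API); a plain Ruby `TracePoint.new` cannot subscribe to it. As a rough Ruby-level analogue of observing allocations (illustrative only, and not what the profiler uses), the stdlib `objspace` extension can record allocation metadata:

```ruby
require 'objspace'

# Records file/line metadata for objects allocated inside the block.
# The profiler instead registers a C-level tracepoint that receives a
# callback on every single allocation.
ObjectSpace.trace_object_allocations do
  obj = Object.new
  puts ObjectSpace.allocation_sourcefile(obj) # file where obj was allocated
  puts ObjectSpace.allocation_sourceline(obj) # line where obj was allocated
end
```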

**Motivation:**

This PR is another step towards the allocation profiling feature.

**Additional Notes:**

It's not yet possible to turn on allocation profiling via configuration.

I am reserving that change for later, but it's easy to do:
change `profiling/component.rb` to use `alloc_samples_enabled: true`
and `allocation_sample_every: 1`.

This will have a big overhead impact on the application. I'll be
working on reducing this overhead in later PRs.

In particular, the sampling approach we're using
(`allocation_sample_every`) is only a placeholder; it will get
replaced before we get to beta. This is why I don't want to add it
as a real user-available setting -- it's temporary.
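For illustration, the placeholder amounts to something like the following sketch (plain Ruby, with an assumed `sample_allocation` callback; the real logic lives in the native worker):

```ruby
# Illustrative only: sample every Nth allocation, weighting each sample by
# the stride so aggregated counts still approximate the real totals.
class EveryNthAllocationSampler
  def initialize(allocation_sample_every, collector)
    @allocation_sample_every = allocation_sample_every
    @collector = collector
    @allocation_count = 0
  end

  def on_allocation
    @allocation_count += 1
    return unless (@allocation_count % @allocation_sample_every).zero?

    @collector.sample_allocation(weight: @allocation_sample_every) # assumed API
  end
end
```

With `allocation_sample_every: 1`, every single allocation gets sampled, which is where the overhead impact mentioned above comes from.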

**How to test the change?**

This change includes test coverage. You can also check it manually
by following the instructions in the "Additional Notes".

(There's a feature flag needed for this data to show up in the
Datadog UX).