
[PROF-7307] Add support for allocation samples to ThreadContext #2657

Merged 4 commits into master on Mar 3, 2023

Conversation

@ivoanjo (Member) commented Mar 1, 2023

**What does this PR do?**:

This PR adds the `thread_context_collector_sample_allocation` API to the `ThreadContext` collector.

This will be used to support allocation profiling (but it's not yet called automatically when allocations happen).
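To make the intended usage concrete, here's a hedged sketch of how a caller might eventually drive this API; the Ruby-level method name and arguments below are illustrative assumptions, not the actual native interface:

```ruby
# Hypothetical sketch only: the real entry point added by this PR is the
# native thread_context_collector_sample_allocation function; the Ruby-level
# method and keyword arguments here are illustrative assumptions.
def on_new_object_allocated(thread_context_collector, new_object)
  # Attribute the allocation to the currently-running thread's context.
  thread_context_collector.sample_allocation(
    weight: 1, # the scaling/weighting strategy is still being decided
    allocated_class: new_object.class.name
  )
end
```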

**Motivation**:

This is an incremental step towards supporting allocation profiling.

I'm still working on the scaling/weighting strategy, and may end up adding that logic directly to the `ThreadContext`; I haven't decided yet.

**Additional Notes**:

I'm opening this PR on top of #2654 to avoid merge conflicts, but it is otherwise completely independent from it.

**How to test the change?**:

Change includes test coverage. End-to-end testing will come later, once this is wired up in the `CpuAndWallTimeWorker`.

@ivoanjo ivoanjo requested a review from a team March 1, 2023 18:43
@github-actions github-actions bot added the profiling Involves Datadog profiling label Mar 1, 2023
@codecov-commenter commented Mar 3, 2023

Codecov Report

Merging #2657 (7d22e54) into master (324d56f) will increase coverage by 0.00%.
The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##           master    #2657   +/-   ##
=======================================
  Coverage   98.07%   98.07%
=======================================
  Files        1166     1166
  Lines       63832    63871   +39
  Branches     2849     2849
=======================================
+ Hits        62603    62643   +40
+ Misses       1229     1228    -1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| ...filing/collectors/cpu_and_wall_time_worker_spec.rb | 95.45% <100.00%> (+0.01%) ⬆️ |
| ...atadog/profiling/collectors/thread_context_spec.rb | 97.75% <100.00%> (-0.24%) ⬇️ |
| lib/datadog/core/diagnostics/environment_logger.rb | 98.46% <0.00%> (+0.76%) ⬆️ |
| ...atadog/tracing/contrib/grpc/support/grpc_helper.rb | 100.00% <0.00%> (+1.75%) ⬆️ |
| lib/datadog/core/workers/polling.rb | 100.00% <0.00%> (+3.44%) ⬆️ |


Base automatically changed from ivoanjo/prof-7307-rename-cpu-and-wall-time-collector to master March 3, 2023 09:44
In `spec/spec_helper.rb`:

```
@@ -1,5 +1,8 @@
$LOAD_PATH.unshift File.expand_path('..', __dir__)
$LOAD_PATH.unshift File.expand_path('../lib', __dir__)

Thread.main.name = 'Thread.main' unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3')
```
A reviewer (Member) asked:
Why are we adding this change?

@ivoanjo (Member Author) replied:

It's to make it easier to debug when we have multiple threads -- if you print/list the threads, it's sometimes not obvious which thread is the main one (and main is the one driving the RSpec framework). It's unrelated to the other changes, which is why I put it in a separate commit -- 67b7fa8.

@ivoanjo (Member Author) added:

Here's one example where this is useful. This spec failed and printed the list of threads:

```
[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>]
```

This test failed in a flaky way in CI
(<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/9211/workflows/8431a38e-fa1d-4a93-a937-b8488b960a89/jobs/342577/tests#failed-test-0>).

Interesting observation in the output:

```
Failure/Error: expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
  sample_count: 7, stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8, :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8, :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586, :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094, :sampling_time_ns_avg=>523442.0}, debug_failures: {:thread_list=>[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>],
```

There was at least one sample that was super slow
(`sampling_time_ns_max` -- almost 3ms).

Because of that, the profiler throttled sampling down (using the dynamic
sampling rate component), and so did not take as many samples as we
were expecting. This is by design, although it's really hard to know
_why_ the profiler slowed down -- maybe the CI machine was busy?
Maybe that leftover webrick thread, leaked from another test, was
the culprit?
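To make the throttling mechanics concrete, here's a simplified, self-contained sketch of the idea (this is not the profiler's actual implementation, and the target overhead value is an illustrative assumption):

```ruby
# Illustrative sketch: keep time spent sampling below a target fraction of
# wall-clock time, so a slow sample delays the next one proportionally.
class DynamicSamplingRate
  TARGET_OVERHEAD = 0.02 # assumption: spend at most ~2% of time sampling

  def initialize
    @allowed_to_sample_at = Time.now
  end

  def should_sample?(now = Time.now)
    now >= @allowed_to_sample_at
  end

  def after_sample(sampling_time_seconds, now = Time.now)
    # A sample that took T seconds "pays" for T / TARGET_OVERHEAD seconds of
    # wall-clock time before sampling is allowed again.
    @allowed_to_sample_at = now + (sampling_time_seconds / TARGET_OVERHEAD)
  end
end
```

Under a model like this, the ~3ms sample above would hold off further sampling for on the order of 150ms, which is how a few slow samples can push the observed sample count below the spec's threshold.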

Right now the dynamic sampling rate is by design not very configurable,
so I've chosen to lower the spec's expected sample count yet again.

If this last adjustment isn't enough, I'll bite the bullet and make it
possible to disable the dynamic sampling rate (or tweak it), so as to
not cause issues for this spec.

(Like I mentioned in the comment -- I still think this test is worth
the trouble, as it validates an important behavior of the profiler
that otherwise may not get exercised in most validations we run --
profiling while idle).
@ivoanjo ivoanjo merged commit a644ca3 into master Mar 3, 2023
@ivoanjo ivoanjo deleted the ivoanjo/prof-7307-allocation-samples-support branch March 3, 2023 16:55
@github-actions github-actions bot added this to the 1.10.0 milestone Mar 3, 2023
@TonyCTHsu TonyCTHsu mentioned this pull request Mar 6, 2023
ivoanjo added a commit that referenced this pull request Sep 1, 2023
**What does this PR do?**

In #2657 we added support for the `Collectors::ThreadContext` to
sample a given object allocation.

But until now, these samples were only triggered by our test suite.

This PR adds the missing part: actually sampling objects as Ruby
allocates them.

In the `Collectors::CpuAndWallTimeWorker`, we already
had a `TracePoint` that is called on every new object allocation
(`RUBY_INTERNAL_EVENT_NEWOBJ`), but we were only using it to
count allocations.

This PR changes the `TracePoint` code to actually call into the
`Collectors::ThreadContext` and thus allow allocation sampling to
happen.
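Note that `RUBY_INTERNAL_EVENT_NEWOBJ` is only reachable from native code (via `rb_tracepoint_new` in the C API); a plain Ruby `TracePoint.new` cannot subscribe to it. As a rough Ruby-level analogue of observing allocations (illustrative only, and not what the profiler uses), the stdlib `objspace` extension can record allocation metadata:

```ruby
require 'objspace'

# Records file/line metadata for objects allocated inside the block.
# The profiler instead registers a C-level tracepoint that receives a
# callback on every single allocation.
ObjectSpace.trace_object_allocations do
  obj = Object.new
  puts ObjectSpace.allocation_sourcefile(obj) # file where obj was allocated
  puts ObjectSpace.allocation_sourceline(obj) # line where obj was allocated
end
```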

**Motivation:**

This PR is another step towards the allocation profiling feature.

**Additional Notes:**

It's not yet possible to turn on allocation profiling via configuration.

I am reserving that change for later, but it's easy to do:
change `profiling/component.rb` to use `alloc_samples_enabled: true`
and `allocation_sample_every: 1`.

This will have a big overhead impact on the application. I'll be
working on reducing this overhead in later PRs.

In particular, the sampling approach we're using
(`allocation_sample_every`) is only a placeholder; it will get
replaced before we get to beta. This is why I don't want to add it
as a real user-available setting -- it's temporary.
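For illustration, the placeholder amounts to something like the following sketch (plain Ruby, with an assumed `sample_allocation` callback; the real logic lives in the native worker):

```ruby
# Illustrative only: sample every Nth allocation, weighting each sample by
# the stride so aggregated counts still approximate the real totals.
class EveryNthAllocationSampler
  def initialize(allocation_sample_every, collector)
    @allocation_sample_every = allocation_sample_every
    @collector = collector
    @allocation_count = 0
  end

  def on_allocation
    @allocation_count += 1
    return unless (@allocation_count % @allocation_sample_every).zero?

    @collector.sample_allocation(weight: @allocation_sample_every) # assumed API
  end
end
```

With `allocation_sample_every: 1`, every single allocation gets sampled, which is where the overhead impact mentioned above comes from.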

**How to test the change?**

This change includes test coverage. You can also check it manually
by following the instructions in the "Additional Notes".

(There's a feature flag needed for this data to show up in the
Datadog UX).