
[PROF-7307] Avoid triggering allocation sampling during sampling #2690

Merged
merged 3 commits into master from ivoanjo/prof-7307-allocation-sampling on Mar 16, 2023

Conversation

@ivoanjo ivoanjo (Member) commented Mar 13, 2023

What does this PR do?:

In rare cases (such as when throwing exceptions), the profiler may need to allocate Ruby objects during sampling.

This object allocation will trigger the `object_allocation_tracepoint`, but we don't want to try to sample in the middle of sampling.

To avoid any issues altogether, I've introduced a `during_sample` flag that can be used to detect this situation and avoid sampling in the middle of sampling.

Note that the modified functions all run while holding the global VM lock, which is why this flag does not need any special handling for concurrency.

I've also chosen to keep a counter of how often this happens; it may come in handy when debugging any issues around this behavior, and should be cheap enough to maintain.
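
For illustration, here's a minimal sketch of the pattern described above, in C to match the profiler's native extension. The struct, field, and function names (`sampler_state`, `record_sample`, `on_object_allocation`) are placeholders made up for the example, not the exact identifiers used in the codebase:

```c
#include <stdbool.h>

// Sketch of the during_sample guard. Everything below is assumed to run while
// holding the Ruby global VM lock, so a plain bool is enough; no atomics or
// extra locking are needed.
typedef struct {
  bool during_sample;                       // true while a sample is in progress
  unsigned long allocations_during_sample;  // times we skipped re-entrant sampling
} sampler_state;

static void record_sample(sampler_state *state) {
  (void) state;
  // ... walk threads and record stack traces; this may allocate Ruby objects,
  // for example when an exception gets raised ...
}

static void sample(sampler_state *state) {
  state->during_sample = true;
  record_sample(state);
  state->during_sample = false;
}

// Allocation tracepoint handler: if the allocation happened while a sample was
// already being taken, count it and return early instead of recursing into
// another sample.
static void on_object_allocation(sampler_state *state) {
  if (state->during_sample) {
    state->allocations_during_sample++;
    return;
  }
  // TODO: Sampling goes here
}
```

(In the actual change the counter lives nested inside a `stats` struct, as can be seen in the diff excerpt quoted further down in this conversation.)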

Motivation:

This PR is one more piece on the path to providing allocation profiling. The key missing part is the `// TODO: Sampling goes here`, which is what will actually trigger recording of stack traces when objects get allocated.

In the spirit of small, incremental changes, I'm opening a PR with just this, and will later follow up with the rest of the implementation.

Additional Notes:

I also slightly moved around the declarations of fields inside `struct cpu_and_wall_time_worker_state` and in the constructor. This change is in a separate commit, so I recommend reviewing this PR commit-by-commit, as that makes the changes much easier to follow.

How to test the change?:

This change is a bit hard to test, since it affects only the C-level internal state of the profiler. I can't think of a great way of testing this that doesn't involve adding a lot of complexity to the profiler C code just for the purpose of enabling testing.

Instead, I plan to test this via integration testing, e.g. by making sure that allocation profiling can run at the same time as other kinds of profiling.

This new grouping hopefully makes a bit more sense than the somewhat
arbitrary order we were using before.
@ivoanjo ivoanjo requested a review from a team March 13, 2023 17:40
@github-actions github-actions bot added the profiling Involves Datadog profiling label Mar 13, 2023
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-7307-allocation-sampling branch from 16e3dde to 7317ee4 Compare March 13, 2023 17:41
// In a few cases, we may actually be allocating an object as part of profiler sampling. We don't want to recursively
// sample, so we just return early
if (state->during_sample) {
  state->stats.allocations_during_sample++;
  return;
}
Member

Are we tracking it "just in case"?

Member Author

This is a good question. No, that wasn't my intention; I vaguely intended to use it for validation, but hadn't thought very far after that.

In fact, I initially implemented this on my working branch before I did the change in #2644, so I've been sitting on this for quite some time.

TL;DR: now that #2644 is in, rather than relying on vague intentions, I can actually add specs asserting that common operations in the profiler don't cause allocations.

So, let me revisit this PR and add a few tests where I actually assert on this value. That seems like a nice low-hanging win that can already go in with this change.

Member Author

Added a spec that makes use of this in 1d16535

@codecov-commenter commented Mar 13, 2023

Codecov Report

Merging #2690 (1d16535) into master (70f59e6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2690   +/-   ##
=======================================
  Coverage   98.08%   98.09%           
=======================================
  Files        1168     1168           
  Lines       64221    64230    +9     
  Branches     2857     2858    +1     
=======================================
+ Hits        62994    63004   +10     
+ Misses       1227     1226    -1     
Impacted Files Coverage Δ
...filing/collectors/cpu_and_wall_time_worker_spec.rb 95.56% <100.00%> (+0.11%) ⬆️

... and 3 files with indirect coverage changes


As discussed during PR review, let's use this counter for correctness
validation.
@ivoanjo ivoanjo merged commit a05d2eb into master Mar 16, 2023
@ivoanjo ivoanjo deleted the ivoanjo/prof-7307-allocation-sampling branch March 16, 2023 12:23
@github-actions github-actions bot added this to the 1.11.0 milestone Mar 16, 2023
@lloeki lloeki modified the milestones: 1.11.0, 1.11.0.beta1 Apr 14, 2023