Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-7251] Exclude disabled profiling sample value types from output #2634

Merged
merged 5 commits into from
Feb 22, 2023

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Feb 20, 2023

What does this PR do?:

Up until now, the profiler always included all sample value types in the resulting profile, even if some value is was expected to always be zero.

(Why would a value always be zero? The cpu-time value is only available on Linux, and would always be zero on macOS; and we plan to have a way to disable allocation profiling, and thus alloc-samples will also be zero.)

This PR:

  • Abstracts away from all components other than the StackRecorder which sample value types are enabled, by introducing a new sample_values struct that every other component can use instead of having to deal with the old array and other implementation details.

  • Extends the StackRecorder to allow cpu-time and alloc-samples to be disabled

  • Updates the profiler component creation to configure StackRecorder correctly.

Motivation:

Having the extra sample value types always being included as we did so far is mostly harmless.

But, it does make one thing difficult -- detection of which profile types are actually enabled in an application, in general.

Thus, the main reason for doing this change is so that the Datadog profiling backend can easily detect which profile types are enabled, and which aren't, so it can display better guidance and error messages to customers.

Additional Notes:

N/A

How to test the change?:

The change includes test coverage.

To manually validate this, you can download a pprof profile via the Datadog UX and confirm that only the enabled sample value types are available in it, for instance by running go tool pprof -raw rubyprofile.pprof and observing the columns shown after the Samples: header -- disabled value types will not show up there.

A lot of the knowledge for the values was spread out everywhere --
the collectors had to care about a lot of libdatadog details that they
were getting from the `stack_recorder.h` header.

As part of the work to make some values optional, I've replaced
the previous `metric_values` array with a very concrete struct, and
moved the knowledge about how to turn it into the array inside the
`StackRecorder`.
… configurable

In the previous commit the `sample_values` struct was introduced, which
abstracted how values are passed to libdatadog away from everywhere
else in the profiler, and centralized this into the `StackRecorder`.

In this commit, I reimplemented the `record_sample` function to,
instead of using a hardcoded position for every value type, rely
on two extra indirections:
* `state->position_for`
* `state->enabled_values_count`

By default (e.g. when all profile types are enabled), this new
strategy behaves exactly as before.

The interesting thing happens when some profile types are disabled
(via the constructor). When profile types are disabled,
the two indirections above are reconfigured: `enabled_values_count`
becomes less than `ALL_VALUE_TYPES_COUNT`, and `position_for` is
updated to account for this as well.

In pratice, the `position_for` array is treated as if it was a
hashmap -- the key is a given profile type, and the value is the
position that libdatadog is expected it to be written to.

Thus, converting a `sample_values` to an array for libdatadog
is as simple as

```c
  metric_values[position_for[CPU_TIME_VALUE_ID]]      = values.cpu_time_ns;
  metric_values[position_for[CPU_SAMPLES_VALUE_ID]]   = values.cpu_samples;
  metric_values[position_for[WALL_TIME_VALUE_ID]]     = values.wall_time_ns;
  metric_values[position_for[ALLOC_SAMPLES_VALUE_ID]] = values.alloc_samples;
```

The trick here, is that when certain profile_types are disabled
their `position_for` is changed, so they are put at the end of the
`metric_values` array.

For instance, when we disable both `CPU_TIME_VALUE` and
`ALLOC_SAMPLES_VALUE` the `position_for` "hashmap" will look something
like

```ruby
{
  CPU_SAMPLES_VALUE_ID: 0,
  WALL_TIME_VALUE_ID: 1,
  CPU_TIME_VALUE_ID: 2,
  ALLOC_SAMPLES_VALUE_ID: 3,
}
```

And thus, given

```ruby
{ 'cpu-time' => 123, 'cpu-samples' => 456, 'wall-time' => 789, 'alloc-samples' => 4242 }
```

We will produce a `metrics_values` array with

```
+-----+-----+-----+------+
| 456 | 789 | 123 | 4242 |
+-----+-----+-----+------+
```

...but we'll tell libdatadog to only use the first 2 positions of
the array, which contain the values for the enabled profile types!

To be honest, this was more boilerplate than I wanted, but I'm happy
that most of the complexity lies in `_native_initialize` around the
creation of `position_for` and the values list for libdatadog, and
everywhere else still looks kinda sane.
…om `StackRecorder`

I plan to remove the default values for the `StackRecorder`
`cpu_time_enabled:` and `alloc_samples_enabled:` args.

Thus, this refactors all tests and one benchmark where the
`StackRecorder` was being manually created to make sure it provides its
own values for those arguments.

Why remove the default values? I want to force callers in production
code to be explicit about their decisions on the profile types in use,
as this is important to get right so the profiling backend can
distinguish which types of profiles are enabled.

Thus, I want to avoid accidentally enabling or disabling some profile
type because that was the default on the constructor.
**What does this PR do?**:

Up until now, the profiler always included all sample value types in
the resulting profile, even if some value is was expected to always be
zero.

(Why would a value always be zero? The `cpu-time` value is only
available on Linux, and would always be zero on macOS; and we plan
to have a way to disable allocation profiling, and thus
`alloc-samples` will also be zero.)

This PR:

* Abstracts away from all components other than the `StackRecorder`
  which sample value types are enabled, by introducing a new
  `sample_values` struct that every other component can use instead
  of having to deal with the old array and other implementation
  details.

* Extends the `StackRecorder` to allow `cpu-time` and
  `alloc-samples` to be disabled

* Updates the profiler component creation to configure `StackRecorder`
  correctly.

**Motivation**:

Having the extra sample value types always being included as we did
so far is mostly harmless.

But, it does make one thing difficult -- detection of which profile
types are actually enabled in an application, in general.

Thus, the main reason for doing this change is so that the Datadog
profiling backend can easily detect which profile types are enabled,
and which aren't, so it can display better guidance and error
messages to customers.

**Additional Notes**:

N/A

**How to test the change?**:

The change includes test coverage.

To manually validate this, you can download a pprof profile via the
Datadog UX and confirm that only the enabled sample value types are
available in it, for instance by running
`go tool pprof -raw rubyprofile.pprof` and observing the columns
shown after the `Samples:` header -- disabled value types will not
show up there.
In this case it's equivalent, but `Qfalse` and `Qnil` are different in
Ruby, so let's make sure we actually test for `Qtrue`.
@ivoanjo ivoanjo requested a review from a team February 20, 2023 11:19
@github-actions github-actions bot added the profiling Involves Datadog profiling label Feb 20, 2023
@codecov-commenter
Copy link

Codecov Report

Merging #2634 (5e1dcf5) into master (3ffd9bf) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 5e1dcf5 differs from pull request most recent head a8729a9. Consider uploading reports for the commit a8729a9 to get more accurate results

@@           Coverage Diff           @@
##           master    #2634   +/-   ##
=======================================
  Coverage   98.07%   98.07%           
=======================================
  Files        1153     1153           
  Lines       63126    63164   +38     
  Branches     2813     2813           
=======================================
+ Hits        61908    61950   +42     
+ Misses       1218     1214    -4     
Impacted Files Coverage Δ
lib/datadog/profiling/component.rb 95.45% <100.00%> (ø)
lib/datadog/profiling/stack_recorder.rb 77.41% <100.00%> (+0.75%) ⬆️
spec/datadog/core/configuration/components_spec.rb 99.87% <100.00%> (+<0.01%) ⬆️
spec/datadog/core/workers/runtime_metrics_spec.rb 93.88% <100.00%> (-0.04%) ⬇️
...dog/profiling/collectors/cpu_and_wall_time_spec.rb 97.98% <100.00%> (ø)
...filing/collectors/cpu_and_wall_time_worker_spec.rb 94.73% <100.00%> (-0.04%) ⬇️
spec/datadog/profiling/collectors/stack_spec.rb 97.29% <100.00%> (ø)
spec/datadog/profiling/spec_helper.rb 95.34% <100.00%> (+0.22%) ⬆️
spec/datadog/profiling/stack_recorder_spec.rb 100.00% <100.00%> (ø)
... and 4 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ivoanjo ivoanjo merged commit f53e993 into master Feb 22, 2023
@ivoanjo ivoanjo deleted the ivoanjo/prof-7251-exclude-disabled-profiling-types branch February 22, 2023 10:08
@github-actions github-actions bot added this to the 1.10.0 milestone Feb 22, 2023
@TonyCTHsu TonyCTHsu mentioned this pull request Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants