
Add support for annotating profiler stacks with the resource of the active web trace, if any #1623

Merged
merged 10 commits into from
Aug 4, 2021

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Jul 30, 2021

This PR extends the existing support for linking traces to profiles with one extra feature: the resource from root spans representing inbound web requests (aka "the endpoint") will be included as part of the profiling data.

This will enable the profiler to expose this as part of the profiling UX (which is still WIP/under discussion).

Getting the correct resource

This feature would've been really easy to add ~~if it weren't for those meddling kids~~ if it weren't for the fact that quite a few integrations set the resource on the span at the end of the request, not at the beginning. This creates a problem: the profiler can (and probably will) observe the span before the resource is set, and would record a missing or incorrect resource if it just captured the current value at sampling time.

To tackle this issue, I introduced a new ResourceContainer object in Span that serves only to keep a mutable reference to the latest resource. This way, during sampling, the profiler records not the resource it sees but the ResourceContainer (which stays the same throughout the span's lifetime). Only when the profiling data is encoded does the profiler actually read the resource from inside the ResourceContainer. (This is the equivalent of the profiler holding on to a reference to the span, but without actually doing that.)
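As a minimal sketch (not the actual dd-trace-rb implementation, and with simplified class shapes), the ResourceContainer idea boils down to sampling a stable holder object and deferring the read until encode time:

```ruby
# A tiny mutable holder; one instance lives for the span's whole lifetime.
ResourceContainer = Struct.new(:latest)

# Simplified stand-in for a Span, for illustration only.
class Span
  attr_reader :resource_container

  def initialize
    @resource_container = ResourceContainer.new(nil)
  end

  # Integrations may set the resource at any point, including at the end
  # of the request; the container is updated in place.
  def resource=(value)
    @resource_container.latest = value
  end
end

span = Span.new

# The profiler samples early: it records the container, not its
# (currently empty) contents.
sampled = span.resource_container

# The integration only sets the resource at the end of the request.
span.resource = 'GET /users/:id'

# At encode time the profiler dereferences the container and sees the
# final value, even though the sample was taken before it was set.
sampled.latest # => "GET /users/:id"
```

The same trick avoids keeping the whole span alive from the profiler: only the small container is retained.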

While the above doesn't guarantee that the profiler will always observe the correct resource -- for instance, we may observe a request that started just as the profiler was about to report the data, and is still ongoing -- I think this should be a "good enough" approximation to start. To improve on this, I plan to put aside some time to go through our most common tracing integrations and see if I can change any that set the resource at the end to set it at the beginning instead. Worst case, the correct data will be sent with the trace, so we can always choose to correct this information on the backend by querying the APM data.

Personal Identifiable Information (PII) in the resource?

When I started this work, there were some doubts about whether the resource could contain PII. While discussing with @ericmustin and @marcotc, and inspired by DataDog/dd-trace-py#2573, it became clear the resource would not include this information, especially not in traces with type web.

Of course, customers with really weird custom integrations can still "break" this, but that's the same with the rest of the profiling data -- you can always create a class, file name or method name with PII that will then be gathered by the profiler.

Nevertheless, I've added an option to disable the gathering of the resource, if needed (see below).

Configuration

I've created a new advanced section in the profiling settings, and added a new extract_trace_resource option that can be used to control this feature (defaults to enabled).

I also moved two other obscure profiler settings -- max_events and max_frames -- into this group. This move is technically backwards-incompatible, but these settings were never documented, and with the profiler still in beta, I think we should just rip off the band-aid now, rather than waiting for a later GA/1.0 to change them.
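A sketch of the resulting settings layout, assuming the option paths described above (exact names and defaults may differ between dd-trace-rb versions; the values shown here are illustrative):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true

  # New option in the advanced group; defaults to enabled, so this line
  # is only needed to opt out of resource gathering.
  c.profiling.advanced.extract_trace_resource = false

  # max_events and max_frames now live in the same advanced group
  # (previously they sat at the top level of the profiling settings).
  c.profiling.advanced.max_events = 32_768
  c.profiling.advanced.max_frames = 400
end
```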

Final note

This PR is on top of #1620 to preemptively fix a merge conflict. Once that one is merged, GitHub will change the target branch of this PR to master.

Some integrations only set the resource name at the end of the request.
When the profiler is taking a sample of a given thread with an active
span, the resource name may not yet be available, or it may be
incorrect.

By introducing the `ResourceContainer` object in a span, the profiler
can keep a reference to the `ResourceContainer` objects of all root
spans it observes during sampling, and thus delay getting the
resource name to as late as possible.

Hopefully by the time the profiler is preparing a submission to the
backend, the final correct resource name is available, and it can
be used by the profiler.

Note that this still doesn't guarantee 100% correctness, since the
profiler will not delay sending a profile while a span is still
ongoing, but this approach should provide us with a reasonable
approximation.
This commit only modifies the `Event::Stack` and its users, but does
not yet add support to write this information to the pprof nor
gathering it from the tracer.
This extends the previous behavior of TraceIdentifiers::Ddtrace to also
return the root span resource container, which will be used to enable
aggregation-by-endpoint functionality in the profiler.
…vents/max_frames inside

These settings should in general not be messed with, and I doubt any customers are changing them. While this is a backwards-incompatible change, since the profiler is still in beta and few (if any) customers are impacted, I'll make the change right away rather than waiting for GA or 1.0.
This (advanced, not-recommended) option can be used to disable the gathering of trace resources (usually the endpoint names) that get reported together with profiles.
@ivoanjo ivoanjo requested a review from a team July 30, 2021 13:57
@codecov-commenter

Codecov Report

Merging #1623 (2cca1dc) into ivoanjo/profiler-cleanups (c885016) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                     Coverage Diff                     @@
##           ivoanjo/profiler-cleanups    #1623    +/-   ##
===========================================================
  Coverage                      98.30%   98.30%            
===========================================================
  Files                            901      901            
  Lines                          43253    43368   +115     
===========================================================
+ Hits                           42520    42635   +115     
  Misses                           733      733            
Impacted Files Coverage Δ
lib/ddtrace/configuration/components.rb 98.21% <100.00%> (+0.01%) ⬆️
lib/ddtrace/configuration/settings.rb 100.00% <100.00%> (ø)
lib/ddtrace/ext/profiling.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/collectors/stack.rb 98.92% <100.00%> (ø)
lib/ddtrace/profiling/events/stack.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/pprof/stack_sample.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/trace_identifiers/ddtrace.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/trace_identifiers/helper.rb 100.00% <100.00%> (ø)
lib/ddtrace/span.rb 97.07% <100.00%> (+0.07%) ⬆️
lib/ddtrace/tracer.rb 95.78% <100.00%> (-0.03%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c885016...2cca1dc. Read the comment docs.

@marcotc
Member

marcotc commented Jul 30, 2021

I'm glad the new struct fits in a VALUE slot:

```ruby
ObjectSpace.memsize_of(Struct.new(:latest).new('resource'))
# => 40
```

@marcotc marcotc added the profiling Involves Datadog profiling label Jul 30, 2021
@ivoanjo
Member Author

ivoanjo commented Aug 2, 2021

I'm glad the new struct fits in a VALUE slot

ᵃˡˡ ʰᵃᶦˡ ᵒᵘʳ ᵗᶦⁿʸ ᵒᵛᵉʳˡᵒʳᵈˢ

ivoanjo added a commit that referenced this pull request Aug 2, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Rails
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for the Rails (action_pack/action_controller)
integration.
ivoanjo added a commit that referenced this pull request Aug 3, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Sinatra
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for Sinatra.

Note that I did not touch the existing code to set the resource at the
end, which had a quite scary TODO (lines 47-64) and seemed like
something I shouldn't touch.

Note 2: This is similar to what we did in #1626 for the Rails
integration.
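A toy illustration (hypothetical code, not the real integration changes) of why setting the resource at the start of the request matters: even with the container indirection, the profiler only sees values that are in place by the time it serializes a profile.

```ruby
# Mutable holder sampled by the profiler, as described in #1623.
ResourceContainer = Struct.new(:latest)

# Stand-in for profile serialization: containers are dereferenced here.
def serialize(containers)
  containers.map(&:latest)
end

# Request whose integration sets the resource only at the end:
late = ResourceContainer.new(nil)
# ... the profiler samples this request, keeping a reference to `late` ...
profile = serialize([late])   # profile ships while the request is ongoing
late.latest = 'GET /widgets'  # too late; this profile already went out

# Request whose integration sets the resource at the start, as in the
# Rails/Sinatra/Grape changes referenced above:
early = ResourceContainer.new('GET /widgets')
profile2 = serialize([early])

profile  # => [nil]
profile2 # => ["GET /widgets"]
```

In the first case the trace id and span id are still attached to the samples, so the data can be corrected later from the APM side; only the eagerly-set resource survives into the profile itself.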
Base automatically changed from ivoanjo/profiler-cleanups to master August 4, 2021 07:50
@ivoanjo ivoanjo merged commit 5e9d8ad into master Aug 4, 2021
@ivoanjo ivoanjo deleted the ivoanjo/profiler-trace-endpoint-v3 branch August 4, 2021 07:55
@github-actions github-actions bot added this to the 0.52.0 milestone Aug 4, 2021
ivoanjo added a commit that referenced this pull request Aug 4, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Grape
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for this integration.

(Note: This is similar to what we did for Rails in #1626 and Sinatra
in #1628)
@ivoanjo
Member Author

ivoanjo commented Aug 4, 2021

Just as an update on my promised plan of moving as many integrations as possible to set the resource at the beginning: I went through all integrations that generate spans of type web (Ext::HTTP::TYPE_INBOUND) and the results were:

So I think we're in pretty good shape, better than I expected! :)

ivoanjo added a commit to DataDog/documentation that referenced this pull request Jan 12, 2024
This PR updates the Profiler supported versions page with the new
Ruby profile types (allocation and heap) that we just released this
week.

I also tweaked two additional things:

* Recommended runtime version for Ruby: 2.3 => 3.1 (other Rubies are
  no longer kept up-to-date by the Ruby developers, and thus may not
  provide the best experience, although we still support and test
  them)
* Minimum version for endpoint profiling for Ruby: 0.48.0 => 0.52.0
  (the previous value was incorrect, this feature was initially added
  in DataDog/dd-trace-rb#1623 )