
Add support for annotating profiler stacks with the resource of the active web trace, if any #1623

Merged
merged 10 commits into from
Aug 4, 2021

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Jul 30, 2021

This PR extends the existing support for linking traces to profiles with one extra feature: the resource from root spans representing inbound web requests (aka "the endpoint") will be included as part of the profiling data.

This will enable the profiler to expose this as part of the profiling UX (which is still WIP/under discussion).

Getting the correct resource

This feature would've been really easy to add ~~if it weren't for those meddling kids~~ if it weren't for the fact that quite a few integrations set the resource on the span at the end of the request, not at the beginning. This creates a problem: the profiler can (and probably will) observe the span before the resource is set, and would record a missing or incorrect resource if it just captured the current value at sampling time.

To tackle this issue, I introduced a new ResourceContainer object in Span that serves only to keep a mutable reference to the latest resource. This way, during sampling, the profiler records not the resource it sees but the ResourceContainer (which stays the same throughout the span's lifetime). Only when the profiling data is encoded does the profiler actually read the resource from inside the ResourceContainer. (This is the equivalent of the profiler holding on to a reference to the span, but without actually doing that.)
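As a minimal sketch (not the actual dd-trace-rb implementation, and with simplified class shapes), the ResourceContainer idea boils down to sampling a stable holder object and deferring the read until encode time:

```ruby
# A tiny mutable holder; one instance lives for the span's whole lifetime.
ResourceContainer = Struct.new(:latest)

# Simplified stand-in for a Span, for illustration only.
class Span
  attr_reader :resource_container

  def initialize
    @resource_container = ResourceContainer.new(nil)
  end

  # Integrations may set the resource at any point, including at the end
  # of the request; the container is updated in place.
  def resource=(value)
    @resource_container.latest = value
  end
end

span = Span.new

# The profiler samples early: it records the container, not its
# (currently empty) contents.
sampled = span.resource_container

# The integration only sets the resource at the end of the request.
span.resource = 'GET /users/:id'

# At encode time the profiler dereferences the container and sees the
# final value, even though the sample was taken before it was set.
sampled.latest # => "GET /users/:id"
```

The same trick avoids keeping the whole span alive from the profiler: only the small container is retained.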

While the above doesn't guarantee that the profiler will always observe the correct resource -- for instance, we may observe a request that started just as the profiler was about to report the data, and is still ongoing -- I think this should be a "good enough" approximation to start. To improve on this, I plan to put aside some time to go through our most common tracing integrations and see if I can change any that set the resource at the end to set it at the beginning instead. Worst case, the correct data will be sent with the trace, so we can always choose to correct this information on the backend by querying the APM data.

Personal Identifiable Information (PII) in the resource?

When I started this work, there were some doubts about whether the resource could contain PII. While discussing with @ericmustin and @marcotc, and inspired by DataDog/dd-trace-py#2573, it became clear the resource would not include this information, especially not in traces with type web.

Of course, customers with really weird custom integrations can still "break" this, but that's the same with the rest of the profiling data -- you can always create a class, file name or method name with PII that will then be gathered by the profiler.

Nevertheless, I've added an option to disable the gathering of the resource, if needed (see below).

Configuration

I've created a new advanced section in the profiling settings, and added a new extract_trace_resource option that can be used to control this feature (defaults to enabled).

I also moved two other obscure profiler settings -- max_events and max_frames -- into this group. This move is technically backwards-incompatible, but these settings were never documented, and with the profiler still in beta, I think we should just rip off the band-aid now, rather than waiting for a later GA/1.0 to change them.
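A sketch of the resulting settings layout, assuming the option paths described above (exact names and defaults may differ between dd-trace-rb versions; the values shown here are illustrative):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true

  # New option in the advanced group; defaults to enabled, so this line
  # is only needed to opt out of resource gathering.
  c.profiling.advanced.extract_trace_resource = false

  # max_events and max_frames now live in the same advanced group
  # (previously they sat at the top level of the profiling settings).
  c.profiling.advanced.max_events = 32_768
  c.profiling.advanced.max_frames = 400
end
```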

Final note

This PR is on top of #1620 to preemptively fix a merge conflict. Once that one is merged, GitHub will change the target branch of this PR to master.

Some integrations only set the resource name at the end of the request.
When the profiler is taking a sample of a given thread with an active
span, the resource name may not yet be available, or it may be
incorrect.

By introducing the `ResourceContainer` object in a span, the profiler
can keep a reference to the `ResourceContainer` objects of all root
spans it observes during sampling, and thus delay getting the
resource name to as late as possible.

Hopefully by the time the profiler is preparing a submission to the
backend, the final correct resource name is available, and it can
be used by the profiler.

Note that this still doesn't guarantee 100% correctness, since the
profiler will not delay sending a profile while a span is still
ongoing, but this approach should provide us with a reasonable
approximation.
This commit only modifies the `Event::Stack` and its users, but does
not yet add support to write this information to the pprof nor
gathering it from the tracer.
This extends the previous behavior of TraceIdentifiers::Ddtrace to also
return the root span resource container, which will be used to enable
aggregation-by-endpoint functionality in the profiler.
…vents/max_frames inside

These settings should in general not be messed with, and I doubt any customers are changing them. While this is a backwards-incompatible change, since the profiler is still in beta and few (if any) customers are impacted, I'll make the change right away rather than waiting for GA or 1.0.
This (advanced, not-recommended) option can be used to disable the gathering of trace resources (usually the endpoint names) that get reported together with profiles.
@ivoanjo ivoanjo requested a review from a team July 30, 2021 13:57
@codecov-commenter

Codecov Report

Merging #1623 (2cca1dc) into ivoanjo/profiler-cleanups (c885016) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                     Coverage Diff                     @@
##           ivoanjo/profiler-cleanups    #1623    +/-   ##
===========================================================
  Coverage                      98.30%   98.30%            
===========================================================
  Files                            901      901            
  Lines                          43253    43368   +115     
===========================================================
+ Hits                           42520    42635   +115     
  Misses                           733      733            
Impacted Files Coverage Δ
lib/ddtrace/configuration/components.rb 98.21% <100.00%> (+0.01%) ⬆️
lib/ddtrace/configuration/settings.rb 100.00% <100.00%> (ø)
lib/ddtrace/ext/profiling.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/collectors/stack.rb 98.92% <100.00%> (ø)
lib/ddtrace/profiling/events/stack.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/pprof/stack_sample.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/trace_identifiers/ddtrace.rb 100.00% <100.00%> (ø)
lib/ddtrace/profiling/trace_identifiers/helper.rb 100.00% <100.00%> (ø)
lib/ddtrace/span.rb 97.07% <100.00%> (+0.07%) ⬆️
lib/ddtrace/tracer.rb 95.78% <100.00%> (-0.03%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c885016...2cca1dc. Read the comment docs.

@marcotc
Member

marcotc commented Jul 30, 2021

I'm glad the new struct fits in a VALUE slot:

```ruby
ObjectSpace.memsize_of(Struct.new(:latest).new('resource'))
# => 40
```

@marcotc marcotc added the profiling Involves Datadog profiling label Jul 30, 2021
@ivoanjo
Member Author

ivoanjo commented Aug 2, 2021

I'm glad the new struct fits in a VALUE slot

ᵃˡˡ ʰᵃᶦˡ ᵒᵘʳ ᵗᶦⁿʸ ᵒᵛᵉʳˡᵒʳᵈˢ

ivoanjo added a commit that referenced this pull request Aug 2, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Rails
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for the Rails (action_pack/action_controller)
integration.
ivoanjo added a commit that referenced this pull request Aug 3, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Sinatra
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for Sinatra.

Note that I did not touch the existing code to set the resource at the
end, which had a quite scary TODO (lines 47-64) and seemed like
something I shouldn't touch.

Note 2: This is similar to what we did in #1626 for the Rails
integration.
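A toy illustration (hypothetical code, not the real integration changes) of why setting the resource at the start of the request matters: even with the container indirection, the profiler only sees values that are in place by the time it serializes a profile.

```ruby
# Mutable holder sampled by the profiler, as described in #1623.
ResourceContainer = Struct.new(:latest)

# Stand-in for profile serialization: containers are dereferenced here.
def serialize(containers)
  containers.map(&:latest)
end

# Request whose integration sets the resource only at the end:
late = ResourceContainer.new(nil)
# ... the profiler samples this request, keeping a reference to `late` ...
profile = serialize([late])   # profile ships while the request is ongoing
late.latest = 'GET /widgets'  # too late; this profile already went out

# Request whose integration sets the resource at the start, as in the
# Rails/Sinatra/Grape changes referenced above:
early = ResourceContainer.new('GET /widgets')
profile2 = serialize([early])

profile  # => [nil]
profile2 # => ["GET /widgets"]
```

In the first case the trace id and span id are still attached to the samples, so the data can be corrected later from the APM side; only the eagerly-set resource survives into the profile itself.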
Base automatically changed from ivoanjo/profiler-cleanups to master August 4, 2021 07:50
@ivoanjo ivoanjo merged commit 5e9d8ad into master Aug 4, 2021
@ivoanjo ivoanjo deleted the ivoanjo/profiler-trace-endpoint-v3 branch August 4, 2021 07:55
@github-actions github-actions bot added this to the 0.52.0 milestone Aug 4, 2021
ivoanjo added a commit that referenced this pull request Aug 4, 2021
As of #1623, the profiler records the `resource` of the root span,
on top of the span id and trace id during sampling.

Also, as discussed in the description of #1623 (in the
"Getting the correct `resource`" section), integrations where the
`resource` is only set at the end pose an extra challenge, as the
request may not be finished in time for the `resource` to be included
in the profiler payload.
This means that the profiler may miss the `resource` for some of the
requests, depending on timing of when the sampling happens and when the
request finishes.
(Note that in this case, the trace id and span id will still be
propagated, so the samples will always be able to be tied back to the
trace that originated them; it's just the `resource` that will not be
included in the profiler data).

To avoid this issue for many of our customers, let's move the Grape
`resource` setting to the beginning of the request, thus avoiding
the issue altogether for this integration.

(Note: This is similar to what we did for Rails in #1626 and Sinatra
in #1628)
@ivoanjo
Member Author

ivoanjo commented Aug 4, 2021

Just as an update on my promised plan of moving as many integrations as possible to set the resource at the beginning: I went through all integrations that generate spans of type web (Ext::HTTP::TYPE_INBOUND) and the results were:

So I think we're in pretty good shape, better than I expected! :)

ivoanjo added a commit to DataDog/documentation that referenced this pull request Jan 12, 2024
This PR updates the Profiler supported versions page with the new
Ruby profile types (allocation and heap) that we just released this
week.

I also tweaked two additional things:

* Recommended runtime version for Ruby: 2.3 => 3.1 (other Rubies are
  no longer kept up-to-date by the Ruby developers, and thus may not
  provide the best experience, although we still support and test
  them)
* Minimum version for endpoint profiling for Ruby: 0.48.0 => 0.52.0
  (the previous value was incorrect, this feature was initially added
  in DataDog/dd-trace-rb#1623 )