feat(llmobs): support distributed tracing #9152

Merged (14 commits) on May 24, 2024


@Yun-Kim Yun-Kim commented May 3, 2024

TLDR for Reviewers

This PR adds support for distributed tracing in LLM Observability by propagating an LLMObs parent ID tag on distributed requests. Note that the LLMObs parent ID is not the same as the APM parent ID, because we only submit llm type spans to LLM Observability.

The files to review are (in order of importance):

  • ddtrace/propagation/http.py - inject LLMObs parent ID in distributed contexts
  • ddtrace/llmobs/_utils.py - implementation details of how we determine / inject LLMObs parent ID
  • ddtrace/llmobs/_llmobs.py - add edge case handling, explained below
  • tests/llmobs/test_propagation.py - test cases showing (hopefully) what is expected behavior
  • tests/tracer/test_propagation.py - test cases for general propagation cases, mostly ensuring that our behavior can be turned on/off.

Context

LLM Obs parent ID

LLM Observability workflows involve marking tracer-generated spans with a specific llm span type, then extracting information from the given span (such as span ID, trace ID, start/end timestamps, errors, etc.) to create an LLMObs-specific span event to be submitted to LLM Obs.

However, since LLM Obs only accepts spans of type llm (or spans generated by the openai/langchain/bedrock integrations), we do not send other auto-generated spans to LLM Obs. This means that parent IDs can get tricky, especially when non-LLMObs spans sit between LLMObs spans (see example below):

Span 1 (LLM Obs)
|--> Span 2 (Non-LLM Obs)
        | --> Span 3 (LLM Obs) 

Technically span 3's parent is span 2, but since we only submit LLMObs type spans to LLM Obs (i.e. span 1 and span 3), LLM Observability's trace structure breaks down as it has no information about span 2. Therefore, we need to set span 3's llmobs_parent_id to be span 1, in other words the nearest ancestor of type llm.

The current solution is to go up the span's ancestor tree (which is connected via span._parent) until we hit a span of type llm, and use that span's span ID as the llmobs_parent_id. This works fine in most cases, except for distributed scenarios.
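
The ancestor walk described above can be sketched as follows. This is a minimal illustration, not the ddtrace implementation: the `Span` dataclass and the `nearest_llmobs_ancestor` helper are hypothetical stand-ins, and only the `_parent` link and the `llm` span type come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracer span (not the ddtrace class)."""
    span_id: int
    span_type: Optional[str] = None
    _parent: Optional["Span"] = None


def nearest_llmobs_ancestor(span: Span) -> Optional[Span]:
    """Walk up the `_parent` links and return the first ancestor of type `llm`."""
    parent = span._parent
    while parent is not None:
        if parent.span_type == "llm":
            return parent
        parent = parent._parent
    # No LLMObs ancestor is reachable in this process.
    return None


# Span 1 (LLM Obs) --> Span 2 (Non-LLM Obs) --> Span 3 (LLM Obs)
span_1 = Span(span_id=1, span_type="llm")
span_2 = Span(span_id=2, span_type="web", _parent=span_1)
span_3 = Span(span_id=3, span_type="llm", _parent=span_2)

ancestor = nearest_llmobs_ancestor(span_3)
```

Here `ancestor` is `span_1`, so span 3 would report span 1's ID as its llmobs_parent_id even though its direct parent is span 2.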

Problem in distributed tracing

In distributed scenarios, a trace can span multiple services connected via requests. While the immediate trace and parent IDs are automatically propagated on request headers, the downstream service has no access to the span objects of the upstream service: for the first span in the downstream service, span._parent is None. This means that in distributed scenarios involving LLMObs spans, the first LLMObs span in each service will consider itself the root.

===================== Service A
Span 1 (LLM Obs)
|--> Span 2 (Non-LLM Obs)
===================== Service B (No access to spans from service A, just distributed request headers)
        | --> Span 3 (LLM Obs) 

Solution

This PR adds three things:

  1. Injects the context propagated in a distributed request with the llmobs_parent_id.

Once the context carries the _dd.p.llmobs_parent_id tag, the tag is automatically propagated on distributed request headers/tracecontexts and then set on all spans in the next called service at span start. Note that the _dd.p.llmobs_parent_id tag also gets set on the local root span of the original service at span finish time.
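
A rough sketch of this round trip, under stated assumptions: the `inject` and `extract` functions below are hypothetical helpers rather than the code in ddtrace/propagation/http.py, the context is modelled as a plain dict, and the header name and `key=value` encoding are simplified assumptions made for illustration. Only the `_dd.p.llmobs_parent_id` key comes from the description.

```python
from typing import Dict

PROPAGATED_PREFIX = "_dd.p."
PARENT_ID_KEY = "_dd.p.llmobs_parent_id"
TAGS_HEADER = "x-datadog-tags"  # assumed header name for this sketch


def inject(context_meta: Dict[str, str], llmobs_parent_id: str, headers: Dict[str, str]) -> None:
    """Service A: store the parent ID on the context, then serialize `_dd.p.*` tags."""
    context_meta[PARENT_ID_KEY] = llmobs_parent_id
    propagated = {k: v for k, v in context_meta.items() if k.startswith(PROPAGATED_PREFIX)}
    headers[TAGS_HEADER] = ",".join(f"{k}={v}" for k, v in sorted(propagated.items()))


def extract(headers: Dict[str, str]) -> Dict[str, str]:
    """Service B: rebuild the propagated tags from the incoming request headers."""
    meta: Dict[str, str] = {}
    for pair in headers.get(TAGS_HEADER, "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.startswith(PROPAGATED_PREFIX):
            meta[key] = value
    return meta


# Service A: the active LLMObs span has span ID 1; "env" is not a `_dd.p.*` tag.
outgoing_headers: Dict[str, str] = {}
inject({"env": "dev"}, "1", outgoing_headers)

# Service B: the tag is available before any local span exists.
incoming_meta = extract(outgoing_headers)
```

In this sketch only keys with the `_dd.p.` prefix cross the service boundary, which is why the parent ID is stored under that prefix.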

  2. On LLMObs manual span start, if there is no propagated parent ID on the span, we assume that it belongs to the first service in a distributed trace and set the span's llmobs_parent_id manually as a tag.

Since _dd.p.llmobs_parent_id tags are set at span finish time in the original service, we need to add this manual check to avoid relying exclusively on the propagated _dd.p.llmobs_parent_id tag for non-distributed cases or spans in the original service.

  3. On span processing to be exported to LLMObs, do three checks:
  • Check if the span has manually set _ml_obs.parent_id tags. If so, use that as the parent ID.
  • Go up the span's ancestor tree to find the nearest LLMObs type ancestor span. If we find one, then use its span ID as the parent ID.
  • If no spans are available in the ancestor tree, then it must be a distributed case. Use the local root span (i.e. the first span in the service)'s propagated _dd.p.llmobs_parent_id tag as the parent ID.
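
The three checks can be sketched as a single resolution function. This is an illustrative sketch, not the code in ddtrace/llmobs/_utils.py: the `Span` dataclass, `local_root`, `resolve_llmobs_parent_id`, and the `"undefined"` root sentinel are assumptions, while the two tag names are taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

ML_OBS_PARENT_ID = "_ml_obs.parent_id"
PROPAGATED_PARENT_ID = "_dd.p.llmobs_parent_id"
ROOT_PARENT_ID = "undefined"  # hypothetical sentinel meaning "no LLMObs parent"


@dataclass
class Span:
    """Hypothetical stand-in for a tracer span (not the ddtrace class)."""
    span_id: int
    span_type: Optional[str] = None
    parent: Optional["Span"] = None
    tags: Dict[str, str] = field(default_factory=dict)


def local_root(span: Span) -> Span:
    """Return the first span of this service in the trace."""
    while span.parent is not None:
        span = span.parent
    return span


def resolve_llmobs_parent_id(span: Span) -> str:
    # Check 1: a manually set tag takes precedence over everything else.
    manual = span.tags.get(ML_OBS_PARENT_ID)
    if manual is not None:
        return manual
    # Check 2: nearest in-process ancestor of type `llm`.
    ancestor = span.parent
    while ancestor is not None:
        if ancestor.span_type == "llm":
            return str(ancestor.span_id)
        ancestor = ancestor.parent
    # Check 3: distributed case, fall back to the tag propagated onto the local root.
    return local_root(span).tags.get(PROPAGATED_PARENT_ID, ROOT_PARENT_ID)


# Service B: the local root carries the propagated tag from service A.
b_root = Span(span_id=10, span_type="web", tags={PROPAGATED_PARENT_ID: "1"})
b_llm = Span(span_id=11, span_type="llm", parent=b_root)

# Service A root: the propagated tag is also set at finish time, but the manual tag wins.
a_root = Span(span_id=1, span_type="llm",
              tags={ML_OBS_PARENT_ID: ROOT_PARENT_ID, PROPAGATED_PARENT_ID: "1"})
```

The `a_root` case shows why the manual tag is checked first: without it, the root span of the original service would pick up its own propagated ID as its parent.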

Testing

Unit tests were added to ensure injection is performed as expected when called explicitly. In addition, manual local testing was performed to ensure that propagation happens as expected, using 3 test services (A, which calls B, which calls C), each instrumented with an LLMObs span, running on FastAPI and making requests via the requests library.

APM Trace (for context)

[Screenshot: APM trace]
This trace includes non-LLMObs spans, including FastAPI spans, requests spans, and manually constructed `APM` spans.

LLMObs trace(s) before changes

[Screenshots: two separate LLMObs traces]
All LLMObs spans in each service are the roots of their own trace.

LLMObs trace after changes

[Screenshot: single LLMObs trace]
All spans in the distributed trace are collected together in LLMObs.

Considerations

This solution relies on the tracer's internal _dd.p.* context tag propagation for distributed traces. One caveat of this internal functionality is that the context tag is not only propagated on the request to the spans in the next service at span start time, but is also set on the local root span in the original service at span finish time. This is the reason for manually adding a _ml_obs.parent_id tag in these cases, which takes precedence over the propagated _dd.p.llmobs_parent_id tag.

Additionally, this PR ensures that when LLMObs.enable() is called, we implicitly patch the below distributed tracing integrations, as they are required for propagating llmobs parent IDs via distributed request headers. Note that we only support distributed tracing through our current list of supported request integrations, including:

  • "aiohttp",
  • "asgi",
  • "bottle",
  • "celery",
  • "cherrypy",
  • "django",
  • "falcon",
  • "fastapi",
  • "flask",
  • "grpc",
  • "httplib",
  • "httpx",
  • "molten",
  • "pyramid",
  • "requests",
  • "rq",
  • "sanic",
  • "starlette",
  • "tornado",
  • "urllib3",
  • "wsgi"

For users that make requests outside of these libraries, we do not currently support distributed tracing. We will need to create a custom LLMObsPropagator class to enable them to inject/extract the parent IDs, similar to what the ddtrace documentation describes here: https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#custom
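
One possible shape for such a propagator is sketched below. This class does not exist in ddtrace as of this PR: the class body, the method signatures, and the header name are all hypothetical, chosen only to show what manual inject/extract could look like for an unsupported request library.

```python
from typing import Dict, Optional


class LLMObsPropagator:
    """Hypothetical sketch; not an existing ddtrace class."""

    HEADER = "x-example-llmobs-parent-id"  # made-up header name for illustration

    @staticmethod
    def inject(parent_id: str, headers: Dict[str, str]) -> None:
        """Caller side: attach the LLMObs parent ID to the outgoing request headers."""
        headers[LLMObsPropagator.HEADER] = parent_id

    @staticmethod
    def extract(headers: Dict[str, str]) -> Optional[str]:
        """Callee side: read the LLMObs parent ID, matching the header case-insensitively."""
        for name, value in headers.items():
            if name.lower() == LLMObsPropagator.HEADER:
                return value
        return None


# Caller: add the header before sending with an unsupported client.
outgoing: Dict[str, str] = {}
LLMObsPropagator.inject("1234", outgoing)

# Callee: frameworks often normalize header casing, so extract ignores case.
received = LLMObsPropagator.extract({"X-Example-LLMObs-Parent-Id": "1234"})
```

The extracted value would then need to be handed to whatever starts the first LLMObs span in the receiving service.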

Checklist

  • Change(s) are motivated and described in the PR description
  • Testing strategy is described if automated tests are not included in the PR
  • Risks are described (performance impact, potential for breakage, maintainability)
  • Change is maintainable (easy to change, telemetry, documentation)
  • Library release note guidelines are followed or label changelog/no-changelog is set
  • Documentation is included (in-code, generated user docs, public corp docs)
  • Backport labels are set (if applicable)
  • If this PR changes the public interface, I've notified @DataDog/apm-tees.

Reviewer Checklist

  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Description motivates each change
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Change is maintainable (easy to change, telemetry, documentation)
  • Release note makes sense to a user of the library
  • Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-parent-id-propagation branch from 17d03e7 to b79a52a Compare May 15, 2024 19:50

pr-commenter bot commented May 15, 2024

Benchmarks

Benchmark execution time: 2024-05-23 20:40:34

Comparing candidate commit 8786fb5 in PR branch yunkim/llmobs-parent-id-propagation with baseline commit 3de0cf5 in branch main.

Found 0 performance improvements and 37 performance regressions! Performance is the same for 172 metrics, 9 unstable metrics.

scenario:httppropagationextract-all_styles_all_headers

  • 🟥 max_rss_usage [+2.248MB; +2.303MB] or [+10.658%; +10.920%]

scenario:httppropagationextract-b3_headers

  • 🟥 max_rss_usage [+1.556MB; +1.937MB] or [+7.219%; +8.987%]

scenario:httppropagationextract-datadog_tracecontext_tracestate_not_propagated_on_trace_id_no_match

  • 🟥 max_rss_usage [+1.586MB; +1.636MB] or [+7.287%; +7.518%]

scenario:httppropagationextract-datadog_tracecontext_tracestate_propagated_on_trace_id_match

  • 🟥 max_rss_usage [+1.788MB; +2.261MB] or [+8.353%; +10.564%]

scenario:httppropagationextract-full_t_id_datadog_headers

  • 🟥 max_rss_usage [+1.675MB; +1.965MB] or [+7.784%; +9.133%]

scenario:httppropagationextract-invalid_priority_header

  • 🟥 max_rss_usage [+1.554MB; +1.625MB] or [+7.131%; +7.457%]

scenario:httppropagationextract-invalid_span_id_header

  • 🟥 max_rss_usage [+1.590MB; +1.652MB] or [+7.298%; +7.580%]

scenario:httppropagationextract-large_valid_headers_all

  • 🟥 max_rss_usage [+1.600MB; +1.655MB] or [+7.339%; +7.593%]

scenario:httppropagationextract-medium_valid_headers_all

  • 🟥 max_rss_usage [+1.608MB; +1.660MB] or [+7.381%; +7.619%]

scenario:httppropagationextract-none_propagation_style

  • 🟥 max_rss_usage [+1.565MB; +1.610MB] or [+7.423%; +7.640%]

scenario:httppropagationextract-valid_headers_all

  • 🟥 max_rss_usage [+2.231MB; +2.316MB] or [+10.559%; +10.960%]

scenario:httppropagationextract-wsgi_invalid_span_id_header

  • 🟥 max_rss_usage [+1.524MB; +1.589MB] or [+7.222%; +7.527%]

scenario:httppropagationextract-wsgi_invalid_trace_id_header

  • 🟥 max_rss_usage [+1.563MB; +1.605MB] or [+7.428%; +7.628%]

scenario:httppropagationextract-wsgi_large_header_no_matches

  • 🟥 max_rss_usage [+1.772MB; +2.091MB] or [+8.245%; +9.727%]

scenario:httppropagationextract-wsgi_large_valid_headers_all

  • 🟥 max_rss_usage [+1.595MB; +1.650MB] or [+7.312%; +7.564%]

scenario:httppropagationextract-wsgi_medium_header_no_matches

  • 🟥 max_rss_usage [+1.639MB; +1.922MB] or [+7.588%; +8.897%]

scenario:httppropagationextract-wsgi_medium_valid_headers_all

  • 🟥 max_rss_usage [+1.618MB; +1.666MB] or [+7.424%; +7.647%]

scenario:httppropagationextract-wsgi_valid_headers_all

  • 🟥 max_rss_usage [+1.598MB; +1.650MB] or [+7.333%; +7.571%]

scenario:httppropagationextract-wsgi_valid_headers_basic

  • 🟥 max_rss_usage [+1.615MB; +1.736MB] or [+7.701%; +8.278%]

scenario:httppropagationinject-with_all

  • 🟥 max_rss_usage [+2.294MB; +2.332MB] or [+10.895%; +11.077%]

scenario:httppropagationinject-with_priority_and_origin

  • 🟥 max_rss_usage [+1.526MB; +1.560MB] or [+7.007%; +7.164%]

scenario:httppropagationinject-with_tags

  • 🟥 max_rss_usage [+2.264MB; +2.301MB] or [+10.765%; +10.940%]

scenario:httppropagationinject-with_tags_invalid

  • 🟥 max_rss_usage [+2.597MB; +2.641MB] or [+12.518%; +12.730%]

scenario:httppropagationinject-with_tags_max_size

  • 🟥 max_rss_usage [+1.637MB; +1.674MB] or [+7.801%; +7.976%]

scenario:sethttpmeta-all-disabled

  • 🟥 max_rss_usage [+1.621MB; +1.674MB] or [+7.336%; +7.573%]

scenario:sethttpmeta-all-enabled

  • 🟥 max_rss_usage [+1.594MB; +1.644MB] or [+7.210%; +7.436%]

scenario:sethttpmeta-no-collectipvariant

  • 🟥 max_rss_usage [+2.066MB; +2.315MB] or [+9.629%; +10.790%]

scenario:sethttpmeta-obfuscation-disabled

  • 🟥 max_rss_usage [+1.638MB; +1.688MB] or [+7.388%; +7.617%]

scenario:sethttpmeta-obfuscation-no-query

  • 🟥 max_rss_usage [+2.064MB; +2.314MB] or [+9.642%; +10.811%]

scenario:sethttpmeta-obfuscation-send-querystring-disabled

  • 🟥 max_rss_usage [+1.609MB; +1.683MB] or [+7.546%; +7.891%]

scenario:sethttpmeta-obfuscation-worst-case-explicit-query

  • 🟥 max_rss_usage [+1.602MB; +1.839MB] or [+7.390%; +8.482%]

scenario:sethttpmeta-obfuscation-worst-case-implicit-query

  • 🟥 max_rss_usage [+1.650MB; +1.703MB] or [+7.736%; +7.985%]

scenario:sethttpmeta-useragentvariant_exists_1

  • 🟥 max_rss_usage [+1.539MB; +1.706MB] or [+7.171%; +7.949%]

scenario:sethttpmeta-useragentvariant_exists_2

  • 🟥 max_rss_usage [+1.636MB; +1.681MB] or [+7.642%; +7.852%]

scenario:sethttpmeta-useragentvariant_exists_3

  • 🟥 max_rss_usage [+1.618MB; +1.668MB] or [+7.552%; +7.784%]

scenario:sethttpmeta-useragentvariant_not_exists_1

  • 🟥 max_rss_usage [+1.667MB; +1.721MB] or [+7.546%; +7.790%]

scenario:sethttpmeta-useragentvariant_not_exists_2

  • 🟥 max_rss_usage [+2.117MB; +2.233MB] or [+9.824%; +10.363%]

@Yun-Kim Yun-Kim marked this pull request as ready for review May 22, 2024 22:53
@Yun-Kim Yun-Kim requested review from a team as code owners May 22, 2024 22:53
@Yun-Kim Yun-Kim requested a review from emmettbutler May 22, 2024 22:53
@Yun-Kim Yun-Kim changed the title wip: propagate llmobs parent feat(llmobs): support distributed tracing May 22, 2024
@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-parent-id-propagation branch from 015385c to d631d0b Compare May 22, 2024 22:59
@Yun-Kim Yun-Kim added the changelog/no-changelog A changelog entry is not required for this PR. label May 22, 2024

datadog-dd-trace-py-rkomorn bot commented May 22, 2024

Datadog Report

Branch report: yunkim/llmobs-parent-id-propagation
Commit report: a65b91f
Test service: dd-trace-py

✅ 0 Failed, 18622 Passed, 41085 Skipped, 3h 31m 35.56s Total duration (4h 39m 4.86s time saved)

@Yun-Kim Yun-Kim enabled auto-merge (squash) May 24, 2024 17:08
@Yun-Kim Yun-Kim merged commit 9b632b7 into main May 24, 2024
181 of 184 checks passed
@Yun-Kim Yun-Kim deleted the yunkim/llmobs-parent-id-propagation branch May 24, 2024 19:28
github-actions bot pushed a commit that referenced this pull request May 24, 2024
(cherry picked from commit 9b632b7)

Yun-Kim commented May 24, 2024

Note: The http propagation benchmark tests have reported a significant increase in memory usage from this commit: #9152. However, the changes from that commit do not justify such a memory increase, and LLMObs should not be enabled in benchmark tests. Additionally, the benchmarks have not run on this branch since that commit. This will need more investigation, but I do not believe that this PR is responsible for such a memory increase.

EDIT: After further investigation, I've found that importing from ddtrace.llmobs could be responsible for the memory usage increase, because it indirectly initializes the LLMObs service (even though it is disabled).

Yun-Kim added a commit that referenced this pull request May 24, 2024
Backport 9b632b7 from #9152 to 2.9.

Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Yun-Kim added a commit that referenced this pull request May 24, 2024
This PR is a follow-up to #9152 and attempts to minimize any added
memory overhead by moving the `llmobs` utility import inside the
conditional check that `LLMObs` is enabled.

Importing anything from the `ddtrace.llmobs` package implicitly runs the
`ddtrace/llmobs/__init__.py` code, which involves instantiating a
`LLMObs` instance. This is likely the largest contributor to the memory
overhead.

Moving the import so that it only happens when LLMObs is enabled should
avoid that added overhead, given that LLMObs is only running in a select
few customer applications at the moment.
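
The deferred-import pattern can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the function name, the `llmobs_enabled` parameter, and the header name are hypothetical, and `urllib.parse` merely stands in for the `llmobs` utility module.

```python
from typing import Dict

# Hypothetical header name, chosen only for this example.
PARENT_ID_HEADER = "x-example-llmobs-parent-id"


def inject_llmobs_parent_id(
    headers: Dict[str, str], parent_id: str, llmobs_enabled: bool
) -> Dict[str, str]:
    """Add the parent ID header only when LLMObs is enabled."""
    if not llmobs_enabled:
        # Fast path: the helper module below is never imported here.
        return headers
    # Deferred import: the module's top-level code runs on first use,
    # not when the enclosing module is loaded. `urllib.parse` is a
    # stand-in for the llmobs utility module.
    from urllib.parse import quote

    headers[PARENT_ID_HEADER] = quote(parent_id)
    return headers
```

Applications that never enable the feature return early and do not pay for the helper's import-time side effects through this code path.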

github-actions bot pushed a commit that referenced this pull request May 24, 2024
(cherry picked from commit be9936b)
Yun-Kim added a commit that referenced this pull request May 28, 2024
Backport be9936b from #9387 to 2.9.

Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Yun-Kim added a commit that referenced this pull request May 29, 2024
…9417)

This PR adds handling for integration-generated spans in LLMObs parent
ID propagation. #9152 added edge-case handling in
`LLMObs._start_span()` for spans that are part of the first service in a
distributed trace. In that case we need to check
`span.get_tag(PROPAGATED_PARENT_KEY)`, because the distributed header is
propagated upwards to the local root of the original service at span
finish time (whereas in subsequent services it is always propagated to
all spans at span start time).

Integration-generated spans (openai, bedrock, langchain) use
`BaseLLMIntegration.trace(...)` instead of `LLMObs._start_span()`, so we
needed to add the same handling there.

github-actions bot pushed a commit that referenced this pull request May 29, 2024
…9417)

(cherry picked from commit 538a024)
Yun-Kim added a commit that referenced this pull request May 29, 2024
…backport 2.9] (#9422)

Backport 538a024 from #9417 to 2.9.

Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
Labels
backport 2.9 changelog/no-changelog A changelog entry is not required for this PR.