-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(llmobs): support distributed tracing #9152
Conversation
17d03e7
to
b79a52a
Compare
BenchmarksBenchmark execution time: 2024-05-23 20:40:34 Comparing candidate commit 8786fb5 in PR branch Found 0 performance improvements and 37 performance regressions! Performance is the same for 172 metrics, 9 unstable metrics. scenario:httppropagationextract-all_styles_all_headers
scenario:httppropagationextract-b3_headers
scenario:httppropagationextract-datadog_tracecontext_tracestate_not_propagated_on_trace_id_no_match
scenario:httppropagationextract-datadog_tracecontext_tracestate_propagated_on_trace_id_match
scenario:httppropagationextract-full_t_id_datadog_headers
scenario:httppropagationextract-invalid_priority_header
scenario:httppropagationextract-invalid_span_id_header
scenario:httppropagationextract-large_valid_headers_all
scenario:httppropagationextract-medium_valid_headers_all
scenario:httppropagationextract-none_propagation_style
scenario:httppropagationextract-valid_headers_all
scenario:httppropagationextract-wsgi_invalid_span_id_header
scenario:httppropagationextract-wsgi_invalid_trace_id_header
scenario:httppropagationextract-wsgi_large_header_no_matches
scenario:httppropagationextract-wsgi_large_valid_headers_all
scenario:httppropagationextract-wsgi_medium_header_no_matches
scenario:httppropagationextract-wsgi_medium_valid_headers_all
scenario:httppropagationextract-wsgi_valid_headers_all
scenario:httppropagationextract-wsgi_valid_headers_basic
scenario:httppropagationinject-with_all
scenario:httppropagationinject-with_priority_and_origin
scenario:httppropagationinject-with_tags
scenario:httppropagationinject-with_tags_invalid
scenario:httppropagationinject-with_tags_max_size
scenario:sethttpmeta-all-disabled
scenario:sethttpmeta-all-enabled
scenario:sethttpmeta-no-collectipvariant
scenario:sethttpmeta-obfuscation-disabled
scenario:sethttpmeta-obfuscation-no-query
scenario:sethttpmeta-obfuscation-send-querystring-disabled
scenario:sethttpmeta-obfuscation-worst-case-explicit-query
scenario:sethttpmeta-obfuscation-worst-case-implicit-query
scenario:sethttpmeta-useragentvariant_exists_1
scenario:sethttpmeta-useragentvariant_exists_2
scenario:sethttpmeta-useragentvariant_exists_3
scenario:sethttpmeta-useragentvariant_not_exists_1
scenario:sethttpmeta-useragentvariant_not_exists_2
|
015385c
to
d631d0b
Compare
Datadog ReportBranch report: ✅ 0 Failed, 18622 Passed, 41085 Skipped, 3h 31m 35.56s Total duration (4h 39m 4.86s time saved) |
### TLDR for Reviewers This PR adds support for distributed tracing for LLM Observability, by propagating a LLMObs parent ID tag on distributed requests. Note that the notion of a `LLMObs parent ID != APM parent ID` because we only submit `llm` type spans to LLM Observability. The files to review are (in order of importance): - `ddtrace/propagation/http.py` - inject LLMobs parent ID in distributed contexts - `ddtrace/llmobs/_utils.py` - implementation details of how we determine / inject LLMObs parent ID - `ddtrace/llmobs/_llmobs.py` - add edge case handling, explained below - `tests/llmobs/test_propagation.py` - test cases showing (hopefully) what is expected behavior - `tests/tracer/test_propagation.py` - test cases for general propagation cases, mostly ensuring that our behavior can be turned on/off. ## Context ### LLM Obs parent ID LLM Observability workflows involve marking tracer-generated spans with a specific `llm` span type, then extracting information from the given span (such as span ID, trace ID, start/end timestamps, errors, etc) to create a LLMObs-specific span event to be submitted to LLM Obs. However since LLM Obs only accepts spans of type `llm` (or spans generated by the `openai/langchain/bedrock` integrations), we do not send other auto-generated spans to LLM Obs. This means that parent IDs can get tricky, especially in cases with non-LLMObs spans sprinkled between LLMObs spans (see example below): ``` Span 1 (LLM Obs) |--> Span 2 (Non-LLM Obs) | --> Span 3 (LLM Obs) ``` Technically span 3's parent is span 2, but since we only submit LLMObs type spans to LLM Obs (i.e. span 1 and span 3), LLM Observability's trace structure breaks down as it has no information about span 2. Therefore, we need to set span 3's *llmobs_parent_id* to be span 1, in other words the nearest ancestor of type `llm`. The current solution is to go up the span's ancestor tree (which is connected via `span._parent`) until we hit a span of type `llm`, and use that span's span ID as the *llmobs_parent_id*. This works fine in most cases, except for distributed scenarios. ### Problem in distributed tracing In distributed scenarios, a trace can encompass multiple services connected via requests. However, while the immediate trace/parent IDs are automatically propagated on request headers, we have no access to the spans beyond the immediate parent, i.e. `span._parent is None`. This means that in distributed scenarios involving LLMObs spans, the first LLMObs span in each service will consider itself the root. ``` ===================== Service A Span 1 (LLM Obs) |--> Span 2 (Non-LLM Obs) ===================== Service B (No access to spans from service A, just distributed request headers) | --> Span 3 (LLM Obs) ``` ## Solution This PR adds three things: 1. Injects the context propagated in a distributed request with the *llmobs_parent_id*. By injecting the context with `_dd.p.llmobs_parent_id`, this will automatically be propagated on distributed request headers/tracecontexts, then be tagged on all spans in the next called service at span start. Note that at span finish time in the original service, the `_dd.p.llmobs_parent_id` tag also gets set on the local root span as well. 2. On LLMObs manual span startup, if there are no propagated parent IDs on the span, we assume that it is the first service in a distributed trace, i.e. set the span's *llmobs_parent_id* parent ID manually as a tag. Since `_dd.p.llmobs_parent_id` tags are set at span finish time in the original service, we need to add this manual check to avoid relying exclusively on the propagated `_dd.p.llmobs_parent_id` tag for non-distributed cases or spans in the original service. 3. On span processing to be exported to LLMObs, do three checks: - Check if the span has manually set `_ml_obs.parent_id` tags. If so, use that as the parent ID. - Go up the span's ancestor tree to find the nearest LLMObs type ancestor span. If we find one, then use its span ID as the parent ID. - If no spans are available in the ancestor tree, then it must be a distributed case. Use the local root span (i.e. the first span in the service)'s propagated `_dd.p.llmobs_parent_id` tag as the parent ID. ## Testing Unit tests were added to ensure injection is performed as expected when called explicitly. However, there are some manual local testing that was performed to ensure that propagation does indeed happen as expected, with 3 test services (A which calls B, which calls C), all of which were instrumented with an LLMObs span, running using `FastAPI` and making requests via the `requests` library. ### APM Trace (for context) <img width="1130" alt="Screenshot 2024-05-22 at 5 57 11 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/f0e3e605-24a8-46c5-8378-45c62724bc63"> This trace includes non-LLMObs spans, including FastAPI spans, requests spans, and manually constructed `APM` spans. ### LLMObs trace(s) before changes <img width="980" alt="Screenshot 2024-05-22 at 6 03 04 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/6035ccd7-974c-4c88-9798-7f9ceab22e04"> <img width="1441" alt="Screenshot 2024-05-22 at 6 00 16 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/975a6943-6e4b-42de-9a52-89cf8de1f47e"> All LLMObs spans in each service are the roots of their own trace. ### LLMObs trace after changes <img width="983" alt="Screenshot 2024-05-22 at 6 01 10 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/108862af-894f-4ef9-a8c6-64c1a386f2b7"> All spans in the distributed trace are collected together in LLMObs. ## Considerations This solution relies on using the `_dd.p.*` context tag propagation for distributed traces in the tracer internals. However, one caveat of using this internal functionality is that the context tag not only gets propagated in the request to the spans in the next service at span start time, but this also gets set on the local root span in the original service at span finish time. This is the reason for manually adding a `_ml_obs.parent_id` tag in these cases, which takes precedence over the propagated `_dd.p.llmobs_parent_id` tag. Additionally, this PR ensures that when `LLMObs.enable()` is called, we implicitly patch the below distributed tracing integrations, as they are required for propagating llmobs parent IDs via distributed request headers. Note that we only support distributed tracing through our current list of supported request integrations, including: - "aiohttp", - "asgi", - "bottle", - "celery", - "cherrypy", - "django", - "falcon", - "fastapi", - "flask", - "grpc", - "httplib", - "httpx", - "molten", - "pyramid", - "requests", - "rq", - "sanic", - "starlette", - "tornado", - "urllib3", - "wsgi" For users that make requests outside of these libraries, we do not provide support for distributed tracing currently. We will need to create a custom `LLMObsPropagator` class to enable them to inject/extract the parent IDs, similar to how ddtrace documents here: https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#custom ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) (cherry picked from commit 9b632b7)
Note: The http propagation benchmark tests have reported a significant increase in memory usage from this commit: #9152. However the changes from that commit do not justify such a memory increase, and EDIT: after further investigation I've found that importing from ddtrace.llmobs could be responsible for memory usage increase due to indirectly initializing the |
Backport 9b632b7 from #9152 to 2.9. ### TLDR for Reviewers This PR adds support for distributed tracing for LLM Observability, by propagating a LLMObs parent ID tag on distributed requests. Note that the notion of a `LLMObs parent ID != APM parent ID` because we only submit `llm` type spans to LLM Observability. The files to review are (in order of importance): - `ddtrace/propagation/http.py` - inject LLMobs parent ID in distributed contexts - `ddtrace/llmobs/_utils.py` - implementation details of how we determine / inject LLMObs parent ID - `ddtrace/llmobs/_llmobs.py` - add edge case handling, explained below - `tests/llmobs/test_propagation.py` - test cases showing (hopefully) what is expected behavior - `tests/tracer/test_propagation.py` - test cases for general propagation cases, mostly ensuring that our behavior can be turned on/off. ## Context ### LLM Obs parent ID LLM Observability workflows involve marking tracer-generated spans with a specific `llm` span type, then extracting information from the given span (such as span ID, trace ID, start/end timestamps, errors, etc) to create a LLMObs-specific span event to be submitted to LLM Obs. However since LLM Obs only accepts spans of type `llm` (or spans generated by the `openai/langchain/bedrock` integrations), we do not send other auto-generated spans to LLM Obs. This means that parent IDs can get tricky, especially in cases with non-LLMObs spans sprinkled between LLMObs spans (see example below): ``` Span 1 (LLM Obs) |--> Span 2 (Non-LLM Obs) | --> Span 3 (LLM Obs) ``` Technically span 3's parent is span 2, but since we only submit LLMObs type spans to LLM Obs (i.e. span 1 and span 3), LLM Observability's trace structure breaks down as it has no information about span 2. Therefore, we need to set span 3's *llmobs_parent_id* to be span 1, in other words the nearest ancestor of type `llm`. The current solution is to go up the span's ancestor tree (which is connected via `span._parent`) until we hit a span of type `llm`, and use that span's span ID as the *llmobs_parent_id*. This works fine in most cases, except for distributed scenarios. ### Problem in distributed tracing In distributed scenarios, a trace can encompass multiple services connected via requests. However, while the immediate trace/parent IDs are automatically propagated on request headers, we have no access to the spans beyond the immediate parent, i.e. `span._parent is None`. This means that in distributed scenarios involving LLMObs spans, the first LLMObs span in each service will consider itself the root. ``` ===================== Service A Span 1 (LLM Obs) |--> Span 2 (Non-LLM Obs) ===================== Service B (No access to spans from service A, just distributed request headers) | --> Span 3 (LLM Obs) ``` ## Solution This PR adds three things: 1. Injects the context propagated in a distributed request with the *llmobs_parent_id*. By injecting the context with `_dd.p.llmobs_parent_id`, this will automatically be propagated on distributed request headers/tracecontexts, then be tagged on all spans in the next called service at span start. Note that at span finish time in the original service, the `_dd.p.llmobs_parent_id` tag also gets set on the local root span as well. 2. On LLMObs manual span startup, if there are no propagated parent IDs on the span, we assume that it is the first service in a distributed trace, i.e. set the span's *llmobs_parent_id* parent ID manually as a tag. Since `_dd.p.llmobs_parent_id` tags are set at span finish time in the original service, we need to add this manual check to avoid relying exclusively on the propagated `_dd.p.llmobs_parent_id` tag for non-distributed cases or spans in the original service. 3. On span processing to be exported to LLMObs, do three checks: - Check if the span has manually set `_ml_obs.parent_id` tags. If so, use that as the parent ID. - Go up the span's ancestor tree to find the nearest LLMObs type ancestor span. If we find one, then use its span ID as the parent ID. - If no spans are available in the ancestor tree, then it must be a distributed case. Use the local root span (i.e. the first span in the service)'s propagated `_dd.p.llmobs_parent_id` tag as the parent ID. ## Testing Unit tests were added to ensure injection is performed as expected when called explicitly. However, there are some manual local testing that was performed to ensure that propagation does indeed happen as expected, with 3 test services (A which calls B, which calls C), all of which were instrumented with an LLMObs span, running using `FastAPI` and making requests via the `requests` library. ### APM Trace (for context) <img width="1130" alt="Screenshot 2024-05-22 at 5 57 11 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/f0e3e605-24a8-46c5-8378-45c62724bc63"> This trace includes non-LLMObs spans, including FastAPI spans, requests spans, and manually constructed `APM` spans. ### LLMObs trace(s) before changes <img width="980" alt="Screenshot 2024-05-22 at 6 03 04 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/6035ccd7-974c-4c88-9798-7f9ceab22e04"> <img width="1441" alt="Screenshot 2024-05-22 at 6 00 16 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/975a6943-6e4b-42de-9a52-89cf8de1f47e"> All LLMObs spans in each service are the roots of their own trace. ### LLMObs trace after changes <img width="983" alt="Screenshot 2024-05-22 at 6 01 10 PM" src="https://github.com/DataDog/dd-trace-py/assets/35776586/108862af-894f-4ef9-a8c6-64c1a386f2b7"> All spans in the distributed trace are collected together in LLMObs. ## Considerations This solution relies on using the `_dd.p.*` context tag propagation for distributed traces in the tracer internals. However, one caveat of using this internal functionality is that the context tag not only gets propagated in the request to the spans in the next service at span start time, but this also gets set on the local root span in the original service at span finish time. This is the reason for manually adding a `_ml_obs.parent_id` tag in these cases, which takes precedence over the propagated `_dd.p.llmobs_parent_id` tag. Additionally, this PR ensures that when `LLMObs.enable()` is called, we implicitly patch the below distributed tracing integrations, as they are required for propagating llmobs parent IDs via distributed request headers. Note that we only support distributed tracing through our current list of supported request integrations, including: - "aiohttp", - "asgi", - "bottle", - "celery", - "cherrypy", - "django", - "falcon", - "fastapi", - "flask", - "grpc", - "httplib", - "httpx", - "molten", - "pyramid", - "requests", - "rq", - "sanic", - "starlette", - "tornado", - "urllib3", - "wsgi" For users that make requests outside of these libraries, we do not provide support for distributed tracing currently. We will need to create a custom `LLMObsPropagator` class to enable them to inject/extract the parent IDs, similar to how ddtrace documents here: https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#custom ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
This PR is a follow up of #9152, and attempts to minimize any added memory overhead by moving the `llmobs` utility import to inside the conditional check that `LLMObs` is enabled. By importing inside the `ddtrace.llmobs.` directory we are implicitly running the `ddtrace.llmobs.__init__.py` code, which involves instantiating a `LLMObs` instance. This is likely the largest culprit of the memory overhead. Moving the import to only happening if LLMObs is enabled should avoid that added overhead, given that LLMObs is only running in a select few customer applications at the moment. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
This PR is a follow up of #9152, and attempts to minimize any added memory overhead by moving the `llmobs` utility import to inside the conditional check that `LLMObs` is enabled. By importing inside the `ddtrace.llmobs.` directory we are implicitly running the `ddtrace.llmobs.__init__.py` code, which involves instantiating a `LLMObs` instance. This is likely the largest culprit of the memory overhead. Moving the import to only happening if LLMObs is enabled should avoid that added overhead, given that LLMObs is only running in a select few customer applications at the moment. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) (cherry picked from commit be9936b)
Backport be9936b from #9387 to 2.9. This PR is a follow up of #9152, and attempts to minimize any added memory overhead by moving the `llmobs` utility import to inside the conditional check that `LLMObs` is enabled. By importing inside the `ddtrace.llmobs.` directory we are implicitly running the `ddtrace.llmobs.__init__.py` code, which involves instantiating a `LLMObs` instance. This is likely the largest culprit of the memory overhead. Moving the import to only happening if LLMObs is enabled should avoid that added overhead, given that LLMObs is only running in a select few customer applications at the moment. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
…9417) This PR adds a fix to add handling for integration-generated spans for LLMObs parent ID propagation. #9152 added a edge case handling in `LLMObs._start_span()` if the span was part of the first service in a distributed trace, in which case we would need to check the `span.get_tag(PROPAGATED_PARENT_KEY)` due to the distributed header being propagated upwards to the local root of the original service at span finish time (but would always be propagated to all spans in subsequent services at span start time). Integration (openai, bedrock, langchain) generated spans use `BaseLLMIntegration.trace(...)` instead of `LLMObs._start_span()` so we needed to add handling here. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
…9417) This PR adds a fix to add handling for integration-generated spans for LLMObs parent ID propagation. #9152 added a edge case handling in `LLMObs._start_span()` if the span was part of the first service in a distributed trace, in which case we would need to check the `span.get_tag(PROPAGATED_PARENT_KEY)` due to the distributed header being propagated upwards to the local root of the original service at span finish time (but would always be propagated to all spans in subsequent services at span start time). Integration (openai, bedrock, langchain) generated spans use `BaseLLMIntegration.trace(...)` instead of `LLMObs._start_span()` so we needed to add handling here. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) (cherry picked from commit 538a024)
…backport 2.9] (#9422) Backport 538a024 from #9417 to 2.9. This PR adds a fix to add handling for integration-generated spans for LLMObs parent ID propagation. #9152 added a edge case handling in `LLMObs._start_span()` if the span was part of the first service in a distributed trace, in which case we would need to check the `span.get_tag(PROPAGATED_PARENT_KEY)` due to the distributed header being propagated upwards to the local root of the original service at span finish time (but would always be propagated to all spans in subsequent services at span start time). Integration (openai, bedrock, langchain) generated spans use `BaseLLMIntegration.trace(...)` instead of `LLMObs._start_span()` so we needed to add handling here. ## Checklist - [x] Change(s) are motivated and described in the PR description - [x] Testing strategy is described if automated tests are not included in the PR - [x] Risks are described (performance impact, potential for breakage, maintainability) - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)) - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) - [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`. ## Reviewer Checklist - [x] Title is accurate - [x] All changes are related to the pull request's stated goal - [x] Description motivates each change - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - [x] Testing strategy adequately addresses listed risks - [x] Change is maintainable (easy to change, telemetry, documentation) - [x] Release note makes sense to a user of the library - [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) Co-authored-by: Yun Kim <35776586+Yun-Kim@users.noreply.github.com>
TLDR for Reviewers
This PR adds support for distributed tracing for LLM Observability, by propagating a LLMObs parent ID tag on distributed requests. Note that the notion of a
LLMObs parent ID != APM parent ID
because we only submitllm
type spans to LLM Observability.The files to review are (in order of importance):
ddtrace/propagation/http.py
- inject LLMobs parent ID in distributed contextsddtrace/llmobs/_utils.py
- implementation details of how we determine / inject LLMObs parent IDddtrace/llmobs/_llmobs.py
- add edge case handling, explained belowtests/llmobs/test_propagation.py
- test cases showing (hopefully) what is expected behaviortests/tracer/test_propagation.py
- test cases for general propagation cases, mostly ensuring that our behavior can be turned on/off.Context
LLM Obs parent ID
LLM Observability workflows involve marking tracer-generated spans with a specific
llm
span type, then extracting information from the given span (such as span ID, trace ID, start/end timestamps, errors, etc) to create a LLMObs-specific span event to be submitted to LLM Obs.However since LLM Obs only accepts spans of type
llm
(or spans generated by theopenai/langchain/bedrock
integrations), we do not send other auto-generated spans to LLM Obs. This means that parent IDs can get tricky, especially in cases with non-LLMObs spans sprinkled between LLMObs spans (see example below):Technically span 3's parent is span 2, but since we only submit LLMObs type spans to LLM Obs (i.e. span 1 and span 3), LLM Observability's trace structure breaks down as it has no information about span 2. Therefore, we need to set span 3's llmobs_parent_id to be span 1, in other words the nearest ancestor of type
llm
.The current solution is to go up the span's ancestor tree (which is connected via
span._parent
) until we hit a span of typellm
, and use that span's span ID as the llmobs_parent_id. This works fine in most cases, except for distributed scenarios.Problem in distributed tracing
In distributed scenarios, a trace can encompass multiple services connected via requests. However, while the immediate trace/parent IDs are automatically propagated on request headers, we have no access to the spans beyond the immediate parent, i.e.
span._parent is None
. This means that in distributed scenarios involving LLMObs spans, the first LLMObs span in each service will consider itself the root.Solution
This PR adds three things:
By injecting the context with
_dd.p.llmobs_parent_id
, this will automatically be propagated on distributed request headers/tracecontexts, then be tagged on all spans in the next called service at span start. Note that at span finish time in the original service, the_dd.p.llmobs_parent_id
tag also gets set on the local root span as well.Since
_dd.p.llmobs_parent_id
tags are set at span finish time in the original service, we need to add this manual check to avoid relying exclusively on the propagated_dd.p.llmobs_parent_id
tag for non-distributed cases or spans in the original service._ml_obs.parent_id
tags. If so, use that as the parent ID._dd.p.llmobs_parent_id
tag as the parent ID.Testing
Unit tests were added to ensure injection is performed as expected when called explicitly. However, there are some manual local testing that was performed to ensure that propagation does indeed happen as expected, with 3 test services (A which calls B, which calls C), all of which were instrumented with an LLMObs span, running using
FastAPI
and making requests via therequests
library.APM Trace (for context)
This trace includes non-LLMObs spans, including FastAPI spans, requests spans, and manually constructed `APM` spans.LLMObs trace(s) before changes
All LLMObs spans in each service are the roots of their own trace.LLMObs trace after changes
All spans in the distributed trace are collected together in LLMObs.Considerations
This solution relies on using the
_dd.p.*
context tag propagation for distributed traces in the tracer internals. However, one caveat of using this internal functionality is that the context tag not only gets propagated in the request to the spans in the next service at span start time, but this also gets set on the local root span in the original service at span finish time. This is the reason for manually adding a_ml_obs.parent_id
tag in these cases, which takes precedence over the propagated_dd.p.llmobs_parent_id
tag.Additionally, this PR ensures that when
LLMObs.enable()
is called, we implicitly patch the below distributed tracing integrations, as they are required for propagating llmobs parent IDs via distributed request headers. Note that we only support distributed tracing through our current list of supported request integrations, including:For users that make requests outside of these libraries, we do not provide support for distributed tracing currently. We will need to create a custom
LLMObsPropagator
class to enable them to inject/extract the parent IDs, similar to how ddtrace documents here: https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#customChecklist
changelog/no-changelog
is set@DataDog/apm-tees
.Reviewer Checklist