
auto_instrument() resource attrs are dropped on pods with a pre-installed (e.g. OTel-Operator-injected) TracerProvider #81

@AshishThakurSAP


Summary

On Python pods where an OpenTelemetry auto-instrumentation wrapper has already installed an OTel TracerProvider before user code runs (e.g. when an OpenTelemetry-Operator Instrumentation CR injects an init-container that copies its bundled SDK into /otel-auto-instrumentation-python and prepends it to PYTHONPATH), sap_cloud_sdk.core.telemetry.auto_instrument() silently fails to deliver its resource attributes to the globally active TracerProvider. As a result, the sap-cloud-sdk attrs (sap.cloud_sdk.*, sap.solution_area, mlflow.experiment_id, sap.cld.*, deployment.environment.name, cloud.region) are missing from emitted spans.

This affects every sap-cloud-sdk consumer running on a managed Kubernetes runtime that auto-injects Python OTel auto-instrumentation. SAP App Foundation is one such environment — every Python pod gets the operator wrapper via a Kyverno ClusterPolicy that matches on the otel.instrumentation/enabled: python label, which the runtime's CI/CD workflow stamps automatically. Platform tracking ticket: AFSDK-2840.

Root cause

auto_instrument() calls Traceloop.init(..., resource_attributes=resource, ...). Internally Traceloop builds its own TracerProvider with the supplied Resource and calls trace.set_tracer_provider(...) to install it globally. But OTel's set_tracer_provider honours only the first call per process — it's gated by _TRACER_PROVIDER_SET_ONCE, with no override=True parameter (upstream issue thread). When a wrapper has already called set_tracer_provider during Python startup, Traceloop's call is silently dropped, and the resource_attributes we passed never reach the globally active provider.

Reproduction

  1. Deploy a Python application that calls sap_cloud_sdk.core.telemetry.auto_instrument() from its startup path.
  2. Run it on a Kubernetes cluster with an OTel-Operator Instrumentation CR that auto-injects Python auto-instrumentation (or set instrumentation.opentelemetry.io/inject-python: "true" on the pod manually). The pod will get an init-container that mounts an OTel SDK at /otel-auto-instrumentation-python and prepends it to PYTHONPATH.
  3. Read trace.get_tracer_provider().resource.attributes after auto_instrument() returns.

Expected: the active provider's Resource carries the full sap-cloud-sdk enrichment — sap.cloud_sdk.*, sap.solution_area, mlflow.experiment_id, sap.cld.*, deployment.environment.name, cloud.region, plus service.name from APPFND_CONHOS_APP_NAME.

Observed: the active provider's Resource carries only operator-supplied attrs (telemetry.sdk.*, telemetry.auto.version, k8s.*, service.namespace, service.instance.id, and service.name derived from the k8s deployment name). All sap-cloud-sdk attrs are missing.
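The Expected/Observed comparison can be automated with a small checker. This is a sketch only: missing_sap_attrs is a hypothetical helper, not part of sap-cloud-sdk, and the key list is copied from the Expected line above (service.name is left out because it is present in both cases and only its value differs).

```python
from typing import List, Mapping

# Keys and prefixes taken from the "Expected" list above.
EXPECTED_KEYS = (
    "sap.solution_area",
    "mlflow.experiment_id",
    "deployment.environment.name",
    "cloud.region",
)
EXPECTED_PREFIXES = ("sap.cloud_sdk.", "sap.cld.")


def missing_sap_attrs(attributes: Mapping[str, object]) -> List[str]:
    """Return the expected keys/prefixes that have no match in `attributes`."""
    missing = [key for key in EXPECTED_KEYS if key not in attributes]
    for prefix in EXPECTED_PREFIXES:
        # A prefix counts as present if at least one attribute starts with it.
        if not any(name.startswith(prefix) for name in attributes):
            missing.append(prefix + "*")
    return missing
```

With the OTel SDK installed, pass dict(trace.get_tracer_provider().resource.attributes) after auto_instrument() returns. An empty list means the enrichment reached the active provider; on an affected pod every expected entry should be reported as missing.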

App Foundation reproducer

For SAP App Foundation tenants the trigger chain is automatic on every Python deploy:

  1. CI auto-detects Python and stamps otel.instrumentation/enabled: python onto the workload CR — see ci-cd-workflow/.github/actions/detect-otel-runtime/detect.py and inject-otel-app-yaml/inject.py:39-47.
  2. The workload chart propagates the CR label onto the rendered Deployment + Pod template — see helm-templates/charts/agent/templates/_helpers.tpl:170-193 ("Kyverno matches Deployment labels").
  3. The cluster's Kyverno otel-inject-python-pod ClusterPolicy fires the OTel-Operator webhook → init-container injection → wrapper-installed TracerProvider → bug above.

Concrete evidence

OTEL resource attributes on a deployed App Foundation pod (sap-cloud-sdk==0.11.6, OTel-Operator bundle telemetry.auto.version=0.62b1, OTel SDK 1.41.1):

{
  "resource_attribute_count": 15,
  "resource_attributes": {
    "telemetry.sdk.language": "python",
    "telemetry.sdk.name": "opentelemetry",
    "telemetry.sdk.version": "1.41.1",
    "service.version": "0.0.1",
    "sap.service.display_name": "buyer-agent-evals-fina",
    "k8s.container.name": "buyer-agent-evals-fina",
    "k8s.deployment.name": "buyer-agent-evals-fina-deployment",
    "k8s.namespace.name": "buyer-agent-evals-fsmcba",
    "k8s.node.name": "ip-10-250-152-52.eu-central-1.compute.internal",
    "k8s.pod.name": "buyer-agent-evals-fina-deployment-97d4f6795-7r2vz",
    "k8s.replicaset.name": "buyer-agent-evals-fina-deployment-97d4f6795",
    "service.instance.id": "buyer-agent-evals-fsmcba.buyer-agent-evals-fina-deployment-97d4f6795-7r2vz.buyer-agent-evals-fina",
    "service.namespace": "buyer-agent-evals-fsmcba",
    "service.name": "buyer-agent-evals-fina-deployment",
    "telemetry.auto.version": "0.62b1"
  }
}

For comparison, auto_instrument() on a single-tenant pod (no wrapper active) produces a Resource with 23 attributes, including all expected sap-cloud-sdk keys. The resource-building path is therefore fine; the problem is purely that those attrs never reach the globally active provider when something else got there first.

Impact

  • MLflow trace routing breaks. The mlflow.experiment_id resource attribute is the routing key on the collector side; without it, spans land in the wrong (or default) experiment.
  • Solution-area / sub-account attribution breaks. sap.solution_area, sap.cld.subaccount_id, sap.cld.system_role — used for filtering / quota / chargeback — are absent.
  • Cloud-SDK provenance signals lost. sap.cloud_sdk.{name,language,version} no longer identify spans as coming from a sap-cloud-sdk-instrumented workload.
  • Service identity is incorrect. service.name ends up as <appname>-deployment (k8s deployment name) rather than the cloud-sdk-supplied <appname> from APPFND_CONHOS_APP_NAME.

The bug is silent — auto_instrument() returns successfully and emits the "Cloud auto instrumentation initialized successfully" log line — so it's easy to miss without specifically inspecting the resulting provider's Resource.

Versions

  • sap-cloud-sdk 0.11.x (current).
  • traceloop-sdk 0.54.0.
  • OTel SDK 1.41.x (any).
  • Reproduces on any cluster with an OTel-Operator Instrumentation CR auto-injecting Python auto-instrumentation. Confirmed on App Foundation's Kyma runtime.

Proposed fix

Detect the wrapper-installed-provider case via the standard upstream OTel-Operator marker telemetry.auto.version on the active provider's Resource, and merge the sap-cloud-sdk attrs onto it via provider._resource = provider.resource.merge(Resource.create(sap_attrs)). Right-side wins on collisions. No new public API; no parameter additions; existing single-tenant flows unaffected (no marker → no merge). Auto-detection means existing callers pick up the fix transparently on upgrade. PR forthcoming.
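A minimal sketch of the detect-and-merge step, written before the PR and not taken from it. merge_into_wrapped_provider and OPERATOR_MARKER are hypothetical names; the function is duck-typed, so it only assumes the provider exposes a resource with attributes and merge(), and that the Resource is stored in the private _resource attribute as described above.

```python
OPERATOR_MARKER = "telemetry.auto.version"


def merge_into_wrapped_provider(provider, sap_resource) -> bool:
    """Merge `sap_resource` into the provider's Resource if the marker is present.

    Returns True when a merge happened, False otherwise.
    """
    resource = getattr(provider, "resource", None)
    if resource is None or OPERATOR_MARKER not in resource.attributes:
        # No wrapper-installed SDK provider detected: leave it untouched.
        return False
    # Assumption: the provider keeps its Resource in the private `_resource`
    # attribute, which is what the proposed fix relies on.
    provider._resource = resource.merge(sap_resource)
    return True
```

With the real SDK this would be called as merge_into_wrapped_provider(trace.get_tracer_provider(), Resource.create(sap_attrs)). Because it writes to a private attribute, the approach may need revisiting if the SDK's internals change between versions.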
