feat(telemetry): support custom resource attributes and add metric cardinality controls#4367
Conversation
…rdinality controls Resolves #4365. Adds two coupled OpenTelemetry capabilities to make qwen-code's telemetry production-ready in multi-team / multi-tenant deployments: 1. Custom resource attributes via standard `OTEL_RESOURCE_ATTRIBUTES` and `OTEL_SERVICE_NAME` env vars and a new `telemetry.resourceAttributes` setting. Operators can now tag every span / log / metric with `team`, `env`, `cost_center`, or anything else their backend needs. 2. Metric cardinality controls. `session.id` is moved off the OpenTelemetry Resource (where it auto-attached to every metric data point and caused unbounded time-series fan-out on Prometheus / ARMS Metric / etc.) and gated behind a new opt-in `telemetry.metrics.includeSessionId` toggle. Spans and logs still carry `session.id` for trace and log correlation. Reserved keys (`service.version`, `session.id`) are stripped from both env and settings sources with a `diag.warn`. `OTEL_SERVICE_NAME` follows the OTel spec precedence (highest priority for `service.name`). Settings JSON values are runtime-coerced to strings as defense against hand-edited non-conforming JSON. Breaking change: metrics no longer carry `session.id` by default. Operators who need it can restore the previous behavior with `QWEN_TELEMETRY_METRICS_INCLUDE_SESSION_ID=true` or `telemetry.metrics.includeSessionId: true` in settings.json; recommended only for short-term debugging since it re-introduces the cardinality problem. For long-term session-level analysis, prefer trace and log backends which handle per-event data without cardinality pressure. Design doc: docs/design/telemetry-resource-attributes-design.md 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
📋 Review SummaryThis PR introduces two coupled OpenTelemetry capabilities: custom resource attributes (via standard 🔍 General Feedback
🎯 Specific Feedback🔴 CriticalNo critical issues identified — security, breaking changes, and major functionality are well-handled. 🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
Adds production-oriented OpenTelemetry controls by letting operators define custom Resource attributes (from OTEL_RESOURCE_ATTRIBUTES / OTEL_SERVICE_NAME and settings) and by preventing high-cardinality session.id from attaching to metrics unless explicitly enabled.
Changes:
- Add synchronous parsing/merging for custom OTel Resource attributes with reserved-key stripping.
- Move
session.idoff the OTelResourceand gate metric emission ofsession.idbehindtelemetry.metrics.includeSessionId. - Update config/schema/docs and expand unit tests around resolver precedence and metric cardinality behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/vscode-ide-companion/schemas/settings.schema.json | Exposes new telemetry settings (resourceAttributes, metrics.includeSessionId) to the VS Code companion schema. |
| packages/core/src/telemetry/sdk.ts | Merges user Resource attributes into SDK Resource, strips reserved keys, removes session.id from Resource. |
| packages/core/src/telemetry/sdk.test.ts | Adds coverage for Resource attribute behavior (reserved keys, service.name/version precedence). |
| packages/core/src/telemetry/resource-attributes.ts | New helper module: env parsing, reserved-key stripping, settings value coercion. |
| packages/core/src/telemetry/resource-attributes.test.ts | New unit tests for parsing/coercion/stripping edge cases. |
| packages/core/src/telemetry/metrics.ts | Makes metric session.id emission opt-in via config toggle. |
| packages/core/src/telemetry/metrics.test.ts | Updates existing tests and adds toggle on/off coverage for metric attributes. |
| packages/core/src/telemetry/config.ts | Adds resolver logic for Resource attributes + metrics.includeSessionId env/settings merge. |
| packages/core/src/telemetry/config.test.ts | Adds resolver tests for attribute precedence, reserved key stripping, and metric toggle resolution. |
| packages/core/src/config/config.ts | Extends TelemetrySettings to include resourceAttributes and metrics plus new getters. |
| packages/cli/src/config/settingsSchema.ts | Adds JSON schema entries for telemetry.resourceAttributes and telemetry.metrics.includeSessionId. |
| docs/developers/development/telemetry.md | Documents Resource attributes, cardinality controls, and migration guidance for metric session.id. |
| docs/design/telemetry-resource-attributes-design.md | Adds design doc describing layering/precedence and the cardinality rationale. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // keys (service.version) are stripped from both sources with a warning. | ||
| // OTEL_SERVICE_NAME is a standard escape hatch that overrides | ||
| // service.name from any other source. |
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Round 1 review fixes (#4367). After session.id was added to RESERVED_RESOURCE_ATTRIBUTE_KEYS in Codex review, four user-facing descriptions still claimed only service.version was reserved: - packages/core/src/telemetry/config.ts (merge comment) - packages/core/src/config/config.ts (TelemetrySettings JSDoc) - packages/cli/src/config/settingsSchema.ts (schema description) - packages/vscode-ide-companion/schemas/settings.schema.json (regenerated) Also corrects scope claim: resource attributes apply to every signal the SDK exports (OTLP and file outfile share the same Resource), not just OTLP. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Review feedback addressedPushed
Other auto-bot suggestions — declined with reasons
All 1193 telemetry + config tests pass; CI green. |
…ding hint Round 2 self-review fixes (#4367). Two small but real UX gaps: 1. Reserved-key / malformed-pair / coerce warnings route to the debug log (per #3986), not the console — so a user who types `OTEL_RESOURCE_ATTRIBUTES=service.version=2.0` sees no feedback that the value was silently dropped. Adds a "Troubleshooting" section in telemetry.md telling users where to look, and a note in the parser docstring documenting where warns go. 2. A literal (unencoded) comma in an env var value is a common foot-gun: the parser splits on it, producing a malformed second half that is silently dropped. Updates the warn text to include a "hint: percent-encode literal commas as %2C" callout, and adds the same guidance to the docs. Deferred to a follow-up: startup-time stderr summary of dropped attributes. Stderr during TUI render could break Ink rendering, so the right surface needs separate design. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…IBUTES parser Per review feedback on #4367. The parser uses `indexOf('=')` so the first `=` separates key and value while subsequent `=` stay in the value. The behavior was correct but untested; a future refactor to `split('=')` would silently break base64-padded, JWT, or connection-string values. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
Suggestion: New Config getters (getTelemetryResourceAttributes, getTelemetryMetricsIncludeSessionId) at config.ts:2834-2838 lack direct unit tests — they are only tested indirectly via mocks in sdk.test.ts and metrics.test.ts. Add dedicated tests in config.test.ts that construct a real Config instance and verify the ?? {} / ?. ?? false fallback behavior.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
…p summary Adopts review feedback from #4367 (wenshao via Qwen Code /review). Five accepted suggestions, bundled because they all touch the same parse/coerce/strip pipeline: 1. Key percent-decoding (CRITICAL). `parseOtelResourceAttributes` now percent-decodes both keys and values per the OTel / W3C Baggage spec. Without this, `OTEL_RESOURCE_ATTRIBUTES=service%2Eversion=99` lands on Resource as the literal key `service%2Eversion`, bypassing the reserved-key filter; a collector that decodes keys downstream could then resurrect `service.version` and spoof the version label. 2. Startup summary of dropped attributes. Every `diag.warn` in resource-attributes.ts routes only to the OTel debug log (per #3986), giving operators zero feedback when their attributes are silently dropped. Helpers now optionally accumulate diagnostics into a `ResourceAttributeWarnings` array; the resolver collects them and the SDK emits a one-time console summary at init (before Ink renders, so no TUI conflict). 3. `||` instead of `??` for service.name fallback. Settings can put an empty string through `??`, producing a blank `service.name` that some backends reject. `||` falls through to the default. 4. `coerceStringResourceAttributes` now trims keys and skips empty/whitespace-only keys, matching `parseOtelResourceAttributes`. Previously `{" ": "x"}` or `{"team ": "y"}` from settings.json would land as malformed Resource attributes. 5. `OTEL_SERVICE_NAME` is trimmed before the truthy check, so values like `' '` or `'\t'` are treated as unset rather than producing a whitespace-only service name on Resource. One suggestion declined (in-thread reply on PR): - "Redundant `?? {}` in sdk.ts:160" — intentional defense-in-depth for `vi.mock('../config/config.js')` callers in `telemetry.test.ts` where auto-stub returns undefined. The reviewer is right that production code paths never hit it, but tests do. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…ncoding test Adopts two review suggestions on #4367 (wenshao via Qwen Code /review): 1. `service.name` fallback uses `.trim() || SERVICE_NAME` instead of plain `||`. Plain `||` lets whitespace-only values (`" "`, `"\t"`) through as truthy, producing a blank service name on Resource that some backends reject. Both settings (no value trimming) and env (`%20` decodes to `" "`) can deliver such values. Test added. 2. Adds `key%ZZ=val` to the parameterized parser test to cover the invalid-percent-encoding-on-key catch branch. Previously only the value-side catch was tested. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
Both Round 3 Suggestions addressed correctly in 1439b38:
- Whitespace-only
service.namebypass →userServiceName?.trim() || SERVICE_NAMEcatches""and" "paths. Test added. - Invalid percent-encoding in key →
key%ZZ=valtest case covers the key-sideURIErrorcatch branch. Test added.
No new findings. LGTM! ✅
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
Closes #4365. Sub-issue of #3731 (P3 line: resource attribute policy and cardinality controls).
Summary
Two coupled OpenTelemetry capabilities to make qwen-code's telemetry production-ready in multi-team / multi-tenant deployments:
OTEL_RESOURCE_ATTRIBUTESandOTEL_SERVICE_NAMEenv vars now work, plus a newtelemetry.resourceAttributessetting. Operators can tag every span / log / metric withteam,env,cost_center, or anything else their backend cares about. (Previously:autoDetectResources: falsesilently dropped both standard env vars.)session.idis moved off the OpenTelemetryResource(where it auto-attached to every metric data point, causing unbounded time-series fan-out on Prometheus / ARMS Metric / etc.) and gated behind a new opt-intelemetry.metrics.includeSessionIdtoggle. Spans and logs still carrysession.idfor trace and log correlation.Reserved keys (
service.version,session.id) are stripped from both env and settings sources with adiag.warn—service.versionbecause users shouldn't be able to spoof telemetry version,session.idbecause it's runtime-injected and Resource-level placement would bypass the cardinality toggle.Design
Full design doc at
docs/design/telemetry-resource-attributes-design.mdcovers layering, precedence chains (includingOTEL_SERVICE_NAMEspec compliance), reserved-key policy, edge cases (malformed env values, percent-encoding, duplicate keys), and a comparison with the Claude Code implementation.Breaking change
Metrics no longer carry
session.idby default. To restore previous behavior (short-term debugging only): setQWEN_TELEMETRY_METRICS_INCLUDE_SESSION_ID=trueortelemetry.metrics.includeSessionId: truein settings.json. For long-term session-level analysis, prefer trace / log backends — they handle per-event data without cardinality pressure.Documented in the new ${"
"}Migration${""} section ofdocs/developers/development/telemetry.md.Test plan
resource-attributes.test.ts,config.test.ts,metrics.test.ts,sdk.test.ts— covers parser edge cases (malformed input, percent-encoding of keys and values, duplicates), precedence chain (env / settings / OTEL_SERVICE_NAME / reserved keys), Resource construction (session.id absent + defense-in-depth strip), metric toggle on/off, settings-value coercion, startup-summary emissionpackages/core/src/telemetry/andpackages/core/src/config/passsettingsSchema.test.tspasses人工验证
本地settings文件增加customAttributes

aliyun telemetry 验证预期应该有user.id和tenant.id

E2E verification script
/tmp/verify-telemetry-pr-4367.mjs(drivesresolveTelemetrySettings → initializeTelemetry → emit span/log/metric → shutdown → parse FileExporter output → assert) plus a runner shell script that invokes one process per scenario. Both scripts are self-contained and re-runnable from a fresh checkout.E2E scenario matrix
OTEL_RESOURCE_ATTRIBUTES=team=platform,env=prod,my%20key=val,k=base64==team,env, decodedmy key,k=base64==(first-=split contract) on Resource; defaults forservice.name/service.version; nosession.idOTEL_RESOURCE_ATTRIBUTES=service.version=spoofed,session.id=spoofed,service%2Eversion=alsoSpoofed,bogus,team=okservice%2Eversion) stripped; malformedbogusskipped without crash;team=okapplied; startup-summary emittedsettings.resourceAttributes={team, " tier ", "", service.version, session.id}tier; empty key dropped; reserved keys strippedOTEL_SERVICE_NAME=svc-from-env+ env attrs + settings overlapOTEL_SERVICE_NAMEwins forservice.name; settings wins over env forrole; settings-onlyzoneappliedsession.id; spans/logs DO carrysession.idsettings.metrics.includeSessionId=truesession.id; spans still carrysession.idsession.id=spoof+metrics.includeSessionId=truesession.idabsent from Resource (both spoofs blocked); metricsession.idis real value not spoof; startup summary lists both blocked attemptsVerified evidence (sample output)
Scenario 2 — env reserved-key + malformed handling (note the startup summary appearing on stderr per T8 fix):
Scenario 6 — metric toggle ON re-attaches session.id to metric data points without polluting Resource:
Scenario 7 — defense-in-depth: env+settings both try to spoof
session.id, even with toggle ON. Result: Resource stays clean, metrics get the real runtime session.id (not spoof), spans/logs use the real session.id:Final result: 7 passed, 0 failed (
Result: 7 passed, 0 failed; EXIT=0).Files
packages/core/src/telemetry/resource-attributes.ts(+ test) — parser, reserved-key set, settings-value coercerpackages/core/src/telemetry/sdk.ts— Resource construction (folds user attrs, removes session.id, defense-in-depth strip)packages/core/src/telemetry/config.ts— resolver merge logicpackages/core/src/telemetry/metrics.ts—getCommonAttributes()toggle gatepackages/core/src/config/config.ts—TelemetrySettingsinterface + getterspackages/cli/src/config/settingsSchema.ts— JSON schema (+ regeneratedsettings.schema.json)docs/developers/development/telemetry.md— Resource attributes / Cardinality controls sections + migration guide + 5 example configsdocs/design/telemetry-resource-attributes-design.md🤖 Generated with Qwen Code