Skip to content

feat: comprehensive telemetry instrumentation#39

Merged
anandgupta42 merged 2 commits intomainfrom
feat/comprehensive-telemetry-instrumentation
Mar 5, 2026
Merged

feat: comprehensive telemetry instrumentation#39
anandgupta42 merged 2 commits intomainfrom
feat/comprehensive-telemetry-instrumentation

Conversation

@anandgupta42
Copy link
Contributor

@anandgupta42 anandgupta42 commented Mar 5, 2026

Summary

  • 25 telemetry event types instrumented across the CLI lifecycle — up from 10 defined (many never emitted) to full coverage
  • Error visibility: auth failures, MCP server errors, Python engine crashes, provider errors, permission denials, doom loops
  • Privacy-safe helpers: categorizeToolName() for tool classification, bucketCount() for anonymized counts
  • Telemetry docs page with event table, privacy policy, opt-out instructions, and contributor guide
  • 25 unit tests covering all helpers, event types, privacy validation, and naming conventions

Files changed (15)

File Changes
src/telemetry/index.ts 25 event types, categorizeToolName(), bucketCount(), sequence_index/previous_tool on tool_call
src/session/processor.ts tool_call sequencing, context_utilization, error_recovered, provider_error, doom_loop_detected
src/session/prompt.ts agent_outcome tracking
src/mcp/index.ts mcp_server_status + mcp_server_census
src/bridge/engine.ts engine_started, engine_error
src/cli/cmd/auth.ts auth_login (7 paths), auth_logout
src/installation/index.ts upgrade_attempted
src/session/index.ts session_forked
src/session/compaction.ts compaction_triggered (was defined, never emitted)
src/permission/next.ts permission_denied
src/tool/project-scan.ts environment_census
test/telemetry/telemetry.test.ts 25 tests, 81 assertions
docs/docs/configure/telemetry.md New telemetry docs page
docs/docs/network.md AppInsights firewall endpoint
docs/mkdocs.yml Nav entry for telemetry docs

Test plan

  • npx tsc --noEmit — no new type errors (2 pre-existing)
  • bun test — 1413 pass, 0 fail across 85 files (including 25 new telemetry tests)
  • Privacy: no SQL/code/credentials/file paths sent; error messages truncated to 500 chars; counts bucketed
  • All events use snake_case domain_action naming convention
  • Manual: start altimate, trigger tool calls, verify tool_category and sequence_index appear in events
  • Manual: disconnect network, verify telemetry failures are silent

🤖 Generated with Claude Code

…d data moat

Instrument 25 event types across the CLI lifecycle — auth, MCP servers, Python
engine, provider errors, permissions, upgrades, context utilization, agent
outcomes, workflow sequencing, and environment census. Adds telemetry docs page,
firewall endpoint, 25 unit tests, and privacy-safe helpers (categorizeToolName,
bucketCount). No behavioral changes; all telemetry is opt-out via config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@jontsai jontsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comprehensive telemetry instrumentation with good privacy practices. Verified:

✓ 25 well-typed event types with consistent snake_case naming
✓ Privacy-safe: no SQL/code/credentials/paths, error messages truncated to 500 chars
categorizeToolName() and bucketCount() are smart privacy helpers
sequence_index + previous_tool on tool_call enables workflow pattern analysis
✓ Engine error tracking across phases (uv_download, venv_create, pip_install) — good for debugging install issues
✓ Docs page with event table, opt-out instructions, and contributor guide
✓ 25 tests covering helpers, events, and privacy validation

Minor observations:

  • agent_outcome has doom_loops: 0, permission_denials: 0 hardcoded — worth wiring up actual counters in a follow-up
  • categorizeToolName uses substring matching which could miscategorize (e.g., a tool named sql_warehouse_connection hits sql first) — ordering matters, current priority seems reasonable

Ship it! 🚢


export async function process(input: {
parentID: string
messages: MessageV2.WithParts[]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: compactionAttempt is a module-level let — it persists across sessions in the same process. If a user runs multiple sessions (or the CLI is long-lived), the counter leaks between them. Should this be scoped inside process() or reset per-session?

permission_denials: 0,
outcome,
})
Telemetry.track({
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: doom_loops: 0 and permission_denials: 0 are hardcoded — they'll never reflect actual counts. Consider tracking these via counters in the prompt loop (increment on doom_loop_detected / permission_denied events) and wiring them in here, or remove the fields until they're real to avoid misleading dashboards.

const remoteResourcesList = await client.listResources().catch(() => ({ resources: [] }))
Telemetry.track({
type: "mcp_server_census",
timestamp: Date.now(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: listTools() and listResources() are awaited inline during the connect path. If an MCP server is slow to respond to these, it adds latency to the connection sequence. Consider firing these asynchronously (fire-and-forget with .catch) since census data is purely telemetry and shouldn't block the critical path.

cache_hit_ratio: totalInput > 0 ? Math.round((cacheRead / totalInput) * 1000) / 1000 : 0,
})
}
if (snapshot) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: generation_number is set to toolCallCounter, but these count different things — toolCallCounter tracks tool calls, not generations/steps. This will overcount in multi-tool turns and undercount in zero-tool turns. Should this use a separate generation counter (or step from the prompt loop)?

if (type === "mcp") return "mcp"
const n = name.toLowerCase()
if (n.includes("sql") || n.includes("query")) return "sql"
if (n.includes("schema") || n.includes("column") || n.includes("table")) return "schema"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The substring match ordering can cause edge-case miscategorization. For example, warehouse_usage_stats matches warehouse before reaching the warehouse_usage → finops check. Currently the finops check with warehouse_usage comes before the warehouse check so it works — but this ordering dependency is fragile. Consider using a more explicit matching approach (exact prefixes or a lookup map) to make this order-independent.

Telemetry.track({
type: "upgrade_attempted",
timestamp: Date.now(),
session_id: Telemetry.getContext().sessionId || "cli",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The as "npm" | "bun" | "brew" cast silently maps unknown methods (like choco, yarn, etc.) to "npm". Consider adding an "other" bucket to the type union and using it as the fallback, so dashboards can distinguish actual npm installs from unknown methods.

Copy link

@jontsai jontsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #39 Review Tour Guide 🗺️

Overview

+979/-9 across 15 files — comprehensive telemetry instrumentation adding 25 event types across the CLI lifecycle. Single commit by @anandgupta42 + Claude.

File Order

# File +/- Purpose
1 docs/docs/configure/telemetry.md +122 New telemetry docs page
2 docs/docs/network.md +1 AppInsights firewall endpoint
3 docs/mkdocs.yml +1 Nav entry
4 src/bridge/engine.ts +49/-3 engine_started, engine_error events
5 src/cli/cmd/auth.ts +83 auth_login (7 paths), auth_logout
6 src/installation/index.ts +22 upgrade_attempted
7 src/mcp/index.ts +75/-1 mcp_server_status + mcp_server_census
8 src/permission/next.ts +19/-1 permission_denied
9 src/session/compaction.ts +10 compaction_triggered
10 src/session/index.ts +10 session_forked
11 src/session/processor.ts +70/-4 Tool sequencing, context_utilization, error_recovered, provider_error, doom_loop_detected
12 src/session/prompt.ts +24 agent_outcome
13 src/telemetry/index.ts +168 15 new event types + categorizeToolName() + bucketCount()
14 src/tool/project-scan.ts +41 environment_census
15 test/telemetry/telemetry.test.ts +284 25 tests, 81 assertions

Tour Highlights

🟢 What's Good

  • Privacy design is solid — no SQL/code/creds, errors truncated to 500 chars, counts bucketed via bucketCount(), file paths excluded
  • Type system — 25 event types as a discriminated union, compile-time safe
  • Comprehensive coverage — auth (7 paths!), MCP lifecycle, engine, permissions, compaction, context utilization
  • Tests — 284 lines covering helpers, privacy validation, naming conventions, and all event types
  • Docs — clear telemetry page with event table, privacy policy, opt-out, and contributor guide

🟡 Issues Found (6 inline comments)

# Severity File Issue
1 issue compaction.ts Module-level compactionAttempt counter leaks across sessions
2 suggestion prompt.ts doom_loops: 0 and permission_denials: 0 hardcoded — misleading data
3 thought mcp/index.ts listTools()/listResources() census calls block the MCP connect path
4 issue processor.ts generation_number uses toolCallCounter — wrong semantic
5 suggestion telemetry/index.ts categorizeToolName substring ordering is fragile
6 nit installation/index.ts Unknown upgrade methods silently map to "npm"

🟢 No Red Flags

  • No security concerns — telemetry is opt-out, privacy-safe
  • No breaking changes to existing behavior
  • Error paths re-throw properly after tracking

Summary

Solid PR. The 6 inline comments are mostly minor follow-ups — #1 (module-level state) and #4 (wrong counter) are the most impactful. None are blockers for shipping.

- compaction.ts: scope compactionAttempt per-session via Map instead of
  module-level counter that leaked across sessions
- prompt.ts: remove hardcoded doom_loops/permission_denials fields from
  agent_outcome (were always 0); doom loops and permission denials are
  already tracked as separate events
- mcp/index.ts: make mcp_server_census fire-and-forget so listTools/
  listResources never block the MCP connect critical path
- processor.ts: use dedicated generationCounter for context_utilization
  instead of toolCallCounter which counts a different thing
- telemetry/index.ts: refactor categorizeToolName to use a pattern array
  with explicit ordering, making the match logic order-independent and
  easier to maintain
- installation/index.ts: add "other" to upgrade method union instead of
  silently casting unknown methods to "npm"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anandgupta42 anandgupta42 merged commit 91923fe into main Mar 5, 2026
4 checks passed
Copy link
Contributor

@suryaiyer95 suryaiyer95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Model Code Review — 6 AI Models

Verdict: REQUEST CHANGES
Critical: 1 | Major: 7 | Minor: 4 | Nits: 2

Reviewed by Claude Opus 4.6, Kimi K2.5, Grok 4, MiniMax M2.5, GLM-5, Qwen 3.5 Plus — converged in 1 round.


Critical Issues

1. Docs claim path scrubbing + 500 char truncation, but code doesn't match

Security / Documentationsrc/session/processor.ts:459, docs/docs/configure/telemetry.md:80

The telemetry docs state: "Error messages are truncated to 500 characters and scrubbed of file paths before sending."

Two problems:

(a) In processor.ts:459, the error event truncates to 1000 characters, not 500:

error_message: (e?.message ?? String(e)).slice(0, 1000),  // line 459

While tool_call error (line 263) and provider_error (line 488) correctly truncate to 500.

(b) There is no path scrubbing implementation anywhere. Error messages containing full filesystem paths (e.g., /Users/john/projects/secret-client/...) are sent as-is.

Fix: Standardize to 500 chars with a const MAX_ERROR_LENGTH = 500, implement a scrubPaths() helper, or update docs to reflect reality.

Flagged by: Claude, Kimi (Consensus)


Major Issues

2. agent_outcome hardcodes doom_loops: 0 and permission_denials: 0

Logic Errorsrc/session/prompt.ts:808-809

These fields always send zero despite doom_loop_detected and permission_denied events being emitted elsewhere. The outcome summary loses correlation with these events.

Fix: Add session-scoped counters and pass actual values, or remove these fields.

Flagged by: Kimi, MiniMax, Qwen, GLM-5 (Consensus)

3. ALTIMATE_TELEMETRY_DISABLED env var documented but not implemented

Bug / Documentationdocs/docs/configure/telemetry.md:63-64, src/telemetry/index.ts:360-383

The docs say export ALTIMATE_TELEMETRY_DISABLED=true disables telemetry, but init() only checks userConfig.telemetry?.disabled. The env var is never read.

Fix: Add || process.env.ALTIMATE_TELEMETRY_DISABLED === 'true' to the init check.

Flagged by: Claude (Unique)

4. Module-level compactionAttempt counter never resets

Logic Errorsrc/session/compaction.ts:156

compactionAttempt persists across sessions. If Session A triggers 3 compactions, Session B's first compaction reports attempt: 4.

Fix: Make the counter session-scoped.

Flagged by: MiniMax, Kimi, Qwen, GLM-5 (Consensus)

5. generation_number in context_utilization uses tool call count

Logic Errorsrc/session/processor.ts:359

generation_number: toolCallCounter,  // counts tool invocations, not generations

A single generation can produce multiple tool calls. The field name implies generation count.

Fix: Add a separate generationCounter incremented in finish-step.

Flagged by: Claude (Unique)

6. MCP disconnect hardcodes transport: "stdio"

Logic Errorsrc/mcp/index.ts:634

transport: "stdio",  // Wrong for SSE/streamable-http servers

The disconnect function always reports "stdio" regardless of actual transport type.

Fix: Track the transport type when connecting and look it up on disconnect.

Flagged by: GLM-5 (Convergence fix)

7. upgrade_attempted silently loses method data

Logic Errorsrc/installation/index.ts:180, 201

const telemetryMethod = (["npm", "bun", "brew"].includes(method) ? method : "npm") as "npm" | "bun" | "brew"

Methods like pnpm, yarn, curl, scoop, choco are silently mapped to "npm".

Fix: Expand the type to include all methods.

Flagged by: GLM-5 (Convergence fix)

8. agent_outcome never produces "error" outcome

Logic Errorsrc/session/prompt.ts:793-797

const outcome = abort.aborted ? "abandoned"
  : sessionTotalCost === 0 && toolCallCount === 0 ? "abandoned"
  : "completed"  // "error" is never produced

Sessions ending with errors are incorrectly reported as "completed".

Fix: Track a sessionHadError flag and use it in outcome determination.

Flagged by: GLM-5 (Convergence fix)


Minor Issues

9. Session context not updated after fork

src/session/index.ts:275 — After session_forked is tracked, Telemetry.setContext() isn't called with the new session ID. Subsequent events may be attributed to the parent session. (MiniMax)

10. categorizeToolName has order-dependent keyword overlap

src/telemetry/index.ts:259-270 — A tool named warehouse_query_stats matches "query""sql" before "warehouse". (Claude)

11. Hardcoded session_id: "cli" in auth events

src/cli/cmd/auth.ts (10+ locations) — Should use Telemetry.getContext().sessionId || "cli" for consistency. (Kimi, Qwen, GLM-5)

12. Privacy tests verify type shapes, not runtime behavior

test/telemetry/telemetry.test.ts:174-246 — Tests construct events manually and check field absence, but don't verify actual track() call sites truncate error messages. (Claude, Kimi, Qwen)


Nits

  • Redundant .toLowerCase() at src/telemetry/index.ts:268n is already lowercased on line 261.
  • Unnecessary as const at src/session/processor.ts:214, 250 — TypeScript infers literal types from string literal ternaries.

Positive Observations

  • Privacy-by-design: categorizeToolName() and bucketCount() are thoughtful privacy primitives. Tool arguments/outputs never included.
  • Fail-safe telemetry: Silent catch in flush(), unref() on timer, !enabled guard — telemetry never breaks the CLI.
  • Strong type safety: Discriminated union Telemetry.Event ensures compile-time correctness for all 25 event variants.
  • Comprehensive documentation: Event table, privacy policy, opt-out instructions, contributor guide.
  • Clean App Insights integration: Proper envelope serialization, connection string parsing, batched flush model.

Missing Tests

  • toAppInsightsEnvelopes() serialization
  • parseConnectionString() edge cases
  • flush() / shutdown() behavior
  • Buffer overflow (> 200 events)
  • init() respecting disabled config / env var
  • Network failure silent handling

Finding Attribution
Issue Origin Type
1. Error truncation + path scrubbing mismatch Claude, Kimi Consensus
2. Hardcoded zeros in agent_outcome Kimi, MiniMax, Qwen, GLM-5 Consensus
3. ALTIMATE_TELEMETRY_DISABLED env var not implemented Claude Unique
4. compactionAttempt counter never resets MiniMax, Kimi, Qwen, GLM-5 Consensus
5. generation_number uses wrong counter Claude Unique
6. MCP disconnect hardcodes transport: "stdio" GLM-5 Convergence fix
7. upgrade_attempted loses method data GLM-5 Convergence fix
8. agent_outcome never produces "error" GLM-5 Convergence fix
9. Session context not updated after fork MiniMax Unique
10. categorizeToolName order-dependent Claude Unique
11. Hardcoded session_id: "cli" Kimi, Qwen, GLM-5 Consensus
12. Privacy tests don't verify runtime Claude, Kimi, Qwen Consensus
13. Redundant .toLowerCase() Claude, Kimi, GLM-5 Consensus
14. Unnecessary as const Qwen Unique

anandgupta42 added a commit that referenced this pull request Mar 5, 2026
- Standardize error message truncation to 500 chars (was 1000 in processor.ts)
- Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init()
- Track MCP transport type on connect, use correct transport in disconnect
- Add sessionHadError flag so agent_outcome produces "error" outcome
- Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
anandgupta42 added a commit that referenced this pull request Mar 5, 2026
* fix: address telemetry review issues from PR #39

- Standardize error message truncation to 500 chars (was 1000 in processor.ts)
- Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init()
- Track MCP transport type on connect, use correct transport in disconnect
- Add sessionHadError flag so agent_outcome produces "error" outcome
- Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: initialize telemetry early and buffer pre-init events

Telemetry.init() was only called in SessionPrompt.next(), but track()
calls from MCP connect, engine startup, and auth commands fire earlier.
Events tracked before init() were silently dropped.

- Buffer events in track() regardless of enabled state (pre-init events
  are kept and flushed once init() completes)
- Clear buffer when init() determines telemetry is disabled
- Call init() in main CLI middleware (covers auth, serve, etc.)
- Call init() in worker thread startup (covers MCP/engine events)
- Keep existing init() in session prompt as idempotent safety net

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: robust telemetry init with promise deduplication and pre-init buffering

The previous commit moved init() earlier but had subtle issues:
- No init-once guard when telemetry disabled (init ran repeatedly)
- Events during async init() were dropped
- Concurrent non-awaited init() calls could race

Fix:
- Use promise deduplication: init() returns same promise on repeated calls
- Use initDone flag: track() buffers during init, drops after disable
- Clear buffer on every disabled/error exit path in doInit()
- Reset initPromise/initDone in shutdown() for test isolation
- Add tests for pre-init buffering, post-disable dropping, idempotency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle Config.get() and Control.account() failures in telemetry init

Config.get() throws "No context found for instance" when called before
Instance.provide() (e.g. CLI middleware, worker thread startup). This
caused the outer catch to clear the buffer and disable telemetry entirely.

Fix: wrap Config.get() and Control.account() in individual try/catch
blocks so failures don't prevent telemetry initialization. The env var
ALTIMATE_TELEMETRY_DISABLED is the early-init escape hatch.

Verified end-to-end: pre-init events are buffered and flushed to App
Insights after init completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: hash user email before sending to telemetry

Raw email is PII — replace with SHA-256 hash to preserve anonymous
user correlation without leaking personally identifiable information.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@kulvirgit
Copy link
Collaborator

Code Review: PR #39 — Comprehensive Telemetry Instrumentation

Verdict: REQUEST CHANGES (2 critical, 5 major issues)
Reviewed by: 8 AI models (Claude, Codex, Gemini, Kimi, Grok, MiniMax, GLM-5, Qwen) with convergence round


Critical Issues

1. PII Leakage — Raw Email in ai.user.id Tag

Category: Security / Privacy
Location: src/telemetry/index.ts:330-331

The App Insights envelope sets ai.user.id to the raw user email from Installation.fingerprint(). Every telemetry event sent to Azure contains the user's email in plaintext, creating GDPR/compliance risk.

Fix: Hash the email before sending (ai.user.id: sha256(email)) or use an anonymous installation fingerprint.

2. Missing Timeout for MCP listPrompts() / listResources()

Category: Reliability
Location: src/mcp/index.ts

client.listTools() is correctly wrapped in withTimeout, but client.listPrompts() and client.listResources() are not. Since MCP.prompts() is awaited during command initialization (src/command/index.ts), a remote MCP server that connects but hangs on listPrompts will permanently hang the entire CLI. This defeats the command loading resilience added in PR #36.

Fix: Wrap listPrompts() and listResources() in withTimeout() the same way listTools() is wrapped.


Major Issues

3. Hardcoded App Insights Key with Insufficient Opt-Out Documentation

Category: Privacy / UX
Location: src/telemetry/index.ts:365-366

The DEFAULT_CONNECTION_STRING is hardcoded and telemetry initializes by default regardless of Control account state. While the key being public is acceptable for client-side telemetry (like Sentry DSNs), there's no documented env-var-based opt-out mechanism. The telemetry docs reference ALTIMATE_TELEMETRY_DISABLED but the code only checks userConfig.telemetry?.disabled — the env var path doesn't exist.

Fix:

  1. Implement the documented ALTIMATE_TELEMETRY_DISABLED env var in init()
  2. Document the data model and opt-out clearly in docs/docs/configure/telemetry.md
  3. Set up Azure-side rate limiting

4. MCP Disconnect Hardcodes transport: "stdio" for All Servers

Category: Logic Error
Location: src/mcp/index.ts:627

mcp.disconnect() always emits mcp_server_status with transport: "stdio", even for SSE or streamable-http servers. This produces incorrect telemetry.

Fix: Read the actual transport from stored client status: s.status[name]?.transport ?? "stdio"

5. MCP Tool Type Detection Mismatch

Category: Data Integrity
Location: src/session/processor.ts (tool type check) and src/mcp/index.ts (tool registration)

processor.ts checks if a tool is MCP using tool.startsWith("mcp__"), but mcp/index.ts registers tools with clientName_toolName format (single underscore, not double). This means all MCP tools are incorrectly tracked as standard tools in telemetry, invalidating all MCP vs standard tool usage metrics.

Fix: Align the MCP tool prefix check with the actual naming format used in MCP tool registration.

6. MCP Prompt Template Getter Anti-Pattern

Category: Bug / Code Quality
Location: src/command/index.ts:122-135

Uses new Promise(async (resolve, reject) => {...}) anti-pattern. When MCP.getPrompt() returns undefined (server failure), the user gets a blank prompt with no error feedback. The .catch(reject) is dead code since MCP.getPrompt() internally catches errors and returns undefined.

Fix: Simplify to standard Promise chain and throw when template is undefined:

get template() {
  return MCP.getPrompt(prompt.client, prompt.name,
    prompt.arguments
      ? Object.fromEntries(prompt.arguments.map((a, i) => [a.name, `$${i + 1}`]))
      : {},
  ).then((template) => {
    if (!template) throw new Error(`Failed to load MCP prompt: ${prompt.name}`);
    return template.messages
      .map((m) => (m.content.type === "text" ? m.content.text : ""))
      .join("\n");
  });
}

7. Truncate.cleanup() Lacks Per-File Fault Isolation

Category: Reliability
Location: src/tool/truncation.ts:36 and src/id/id.ts:58

Identifier.timestamp() uses a weak marker check that can throw on legacy/malformed filenames. Since cleanup() calls this on every tool_* file without try/catch, a single malformed filename aborts the entire cleanup process.

Fix: Wrap timestamp extraction per-entry in try/catch and skip malformed IDs.


Minor Issues

8. Telemetry Session Context Race — Empty session_id in Early Events

Location: src/session/prompt.ts:303-304, src/mcp/index.ts (census)

Events like mcp_server_status and engine_started can fire before setContext() sets the session ID. MCP census uses Telemetry.getContext().sessionId which may be empty during startup. Auth events use session_id: "cli" which pollutes analytics.

Fix: Use fallback session IDs for early events (e.g., "startup") or buffer events until context is initialized.

9. No session_end Tracking on Early Process Termination

Location: src/session/prompt.ts:806-820

session_end and agent_outcome are only tracked in the finally block of SessionPrompt.loop(). If the process crashes or is killed before reaching this point, these events never fire.

Fix: Use process.on('beforeExit', ...) to attempt a final flush, or track session_end optimistically at session start with duration updated on completion.

10. compactionAttempts Map Grows Unbounded

Location: src/session/compaction.ts:156

The map grows for each session ID and is never cleaned up. In long-lived daemonized usage, this leaks memory.

Fix: Delete the session entry when compaction completes or session ends.

11. ai.cloud.role Still Says "altimate-code" After Binary Rename

Location: src/telemetry/index.ts (App Insights tags)

The binary was renamed to altimate but the App Insights cloud role tag still says "altimate-code".

Fix: Update to "altimate".

12. Error Message Truncation Inconsistent

Location: src/session/processor.ts:264 vs src/mcp/index.ts:434

processor.ts uses .toString() on error objects (can leak stack traces with file paths), while MCP uses .message (safer). Truncation length also varies (500 vs 1000 chars) across event types.

Fix: Standardize on .message and 500-char truncation everywhere.

13. environment_census Hardcodes skill_count: 0

Location: src/tool/project-scan.ts:588

Emits inaccurate observability data that will skew skill adoption metrics.

Fix: Populate from real skill registry or omit the field.

14. Context Utilization Calculation Includes cache_write Tokens

Location: src/session/processor.ts:348

totalTokens includes cache_write tokens which aren't part of context window consumption, inflating utilization_pct.

Fix: Use tokens.input + tokens.cache.read for context utilization.

15. Buffer Overflow Drops Oldest Events Silently

Location: src/telemetry/index.ts:402-408

When buffer exceeds 200 events, oldest are silently dropped. Critical events like session_start could be lost.

Fix: Track a dropped event counter to include in the next flush, or trigger an early flush at capacity.


Positive Observations

  • Comprehensive event taxonomy: 25 well-structured event types covering the full CLI lifecycle — sessions, generations, tool calls, bridge, auth, MCP, context management, engine, and error recovery
  • Privacy-conscious design: No tool arguments, no user content, truncated errors, bucketed counts in environment_census
  • Resilient telemetry: Designed to never crash the CLI — silent drops, buffer overflow protection, timeout handling, graceful shutdown
  • Clean test coverage: Validates all 25 event types, naming conventions, privacy guarantees. Command resilience tests cover success, partial, and total failure scenarios
  • Good resilience patterns: Command loading and TUI sync use try/catch and safe() wrappers to prevent cascading failures

Missing Tests

  1. App Insights envelope format (toAppInsightsEnvelopes()) — core serialization with no test
  2. Flush failure resilience — network errors, timeout handling
  3. Buffer overflow behavior — dropping oldest events at capacity
  4. parseConnectionString edge cases — missing keys, malformed strings
  5. Truncate.cleanup with mixed-format filenames — legacy vs current ID formats
  6. Auth flow telemetry — auth_login/auth_logout events
  7. Telemetry shutdown flush — verifying shutdown() sends buffered events

Finding Attribution

Issue Origin Type
PII: raw email in ai.user.id MiniMax, Kimi Consensus
Missing timeout for listPrompts/listResources Gemini Unique
Hardcoded App Insights key / opt-out gap Claude, Codex, Grok, Kimi, MiniMax, GLM-5, Qwen Consensus
MCP disconnect hardcodes "stdio" Codex, Kimi, MiniMax Consensus
MCP tool type detection mismatch (mcp__ prefix) Gemini Unique
MCP prompt template getter anti-pattern Claude Unique
Truncate.cleanup fault isolation Codex Unique
Empty session_id in early events Gemini, Kimi, MiniMax, GLM-5, Qwen Consensus
No session_end on process crash GLM-5 Unique
compactionAttempts unbounded Codex, GLM-5, Qwen Consensus
ai.cloud.role "altimate-code" after rename Gemini, Kimi, MiniMax Consensus
Error truncation inconsistent GLM-5 Unique
skill_count: 0 hardcoded Codex Unique
Context utilization includes cache_write Claude Unique
Buffer overflow drops silently Claude, Kimi, Qwen Consensus

Reviewed by 8 models: Claude, Codex (OpenAI), Gemini (Google), Kimi (Moonshot), Grok (xAI), MiniMax, GLM-5 (Zhipu), Qwen (Alibaba). 1 convergence round with corrections incorporated.

anandgupta42 added a commit that referenced this pull request Mar 5, 2026
Fixes 12 issues identified in the comprehensive code review of PR #39:

Critical:
- Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs

Major:
- Fix MCP tool type detection: add MCP.isMcpTool() instead of broken
  startsWith("mcp__") prefix check
- Fix MCP prompt template getter: replace Promise constructor anti-pattern
  with .then() chain, throw on undefined template
- Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch
  to prevent single malformed filename from aborting cleanup

Minor:
- Fix ai.cloud.role tag: "altimate-code" → "altimate"
- Fix error truncation: use .message instead of .toString() to avoid
  leaking stack traces with file paths
- Fix skill_count: populate from Skill.all() instead of hardcoded 0
- Fix context utilization: remove cache_write from token count
- Fix buffer overflow: add dropped events counter for telemetry
- Fix compactionAttempts: delete session entry on completion/error
- Fix empty session_id: use "startup" fallback for early events
- Fix session_end: add beforeExit handler for early process termination

Docs:
- Fix config path in telemetry docs (was ~/.config/altimate/config.json)
- Add telemetry field to config schema table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
anandgupta42 added a commit that referenced this pull request Mar 5, 2026
* fix: address remaining code review issues from PR #39

Fixes 12 issues identified in the comprehensive code review of PR #39:

Critical:
- Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs

Major:
- Fix MCP tool type detection: add MCP.isMcpTool() instead of broken
  startsWith("mcp__") prefix check
- Fix MCP prompt template getter: replace Promise constructor anti-pattern
  with .then() chain, throw on undefined template
- Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch
  to prevent single malformed filename from aborting cleanup

Minor:
- Fix ai.cloud.role tag: "altimate-code" → "altimate"
- Fix error truncation: use .message instead of .toString() to avoid
  leaking stack traces with file paths
- Fix skill_count: populate from Skill.all() instead of hardcoded 0
- Fix context utilization: remove cache_write from token count
- Fix buffer overflow: add dropped events counter for telemetry
- Fix compactionAttempts: delete session entry on completion/error
- Fix empty session_id: use "startup" fallback for early events
- Fix session_end: add beforeExit handler for early process termination

Docs:
- Fix config path in telemetry docs (was ~/.config/altimate/config.json)
- Add telemetry field to config schema table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve session_end coverage for process.exit() and add dedup guard

Address review comment about beforeExit limitations:
- Add process.once("exit") handler to cover process.exit() calls
- Add dedup flag to prevent double-firing when finally block also runs
- Document why SIGINT/SIGTERM are NOT handled (abort controller already
  triggers loop exit → finally block → normal session_end)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@kulvirgit kulvirgit deleted the feat/comprehensive-telemetry-instrumentation branch March 10, 2026 21:06
anandgupta42 added a commit that referenced this pull request Mar 17, 2026
* feat: comprehensive telemetry instrumentation for error visibility and data moat

Instrument 25 event types across the CLI lifecycle — auth, MCP servers, Python
engine, provider errors, permissions, upgrades, context utilization, agent
outcomes, workflow sequencing, and environment census. Adds telemetry docs page,
firewall endpoint, 25 unit tests, and privacy-safe helpers (categorizeToolName,
bucketCount). No behavioral changes; all telemetry is opt-out via config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback on telemetry instrumentation

- compaction.ts: scope compactionAttempt per-session via Map instead of
  module-level counter that leaked across sessions
- prompt.ts: remove hardcoded doom_loops/permission_denials fields from
  agent_outcome (were always 0); doom loops and permission denials are
  already tracked as separate events
- mcp/index.ts: make mcp_server_census fire-and-forget so listTools/
  listResources never block the MCP connect critical path
- processor.ts: use dedicated generationCounter for context_utilization
  instead of toolCallCounter which counts a different thing
- telemetry/index.ts: refactor categorizeToolName to use a pattern array
  with explicit ordering, making the match logic order-independent and
  easier to maintain
- installation/index.ts: add "other" to upgrade method union instead of
  silently casting unknown methods to "npm"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
anandgupta42 added a commit that referenced this pull request Mar 17, 2026
* fix: address telemetry review issues from PR #39

- Standardize error message truncation to 500 chars (was 1000 in processor.ts)
- Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init()
- Track MCP transport type on connect, use correct transport in disconnect
- Add sessionHadError flag so agent_outcome produces "error" outcome
- Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: initialize telemetry early and buffer pre-init events

Telemetry.init() was only called in SessionPrompt.next(), but track()
calls from MCP connect, engine startup, and auth commands fire earlier.
Events tracked before init() were silently dropped.

- Buffer events in track() regardless of enabled state (pre-init events
  are kept and flushed once init() completes)
- Clear buffer when init() determines telemetry is disabled
- Call init() in main CLI middleware (covers auth, serve, etc.)
- Call init() in worker thread startup (covers MCP/engine events)
- Keep existing init() in session prompt as idempotent safety net

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: robust telemetry init with promise deduplication and pre-init buffering

The previous commit moved init() earlier but had subtle issues:
- No init-once guard when telemetry disabled (init ran repeatedly)
- Events during async init() were dropped
- Concurrent non-awaited init() calls could race

Fix:
- Use promise deduplication: init() returns same promise on repeated calls
- Use initDone flag: track() buffers during init, drops after disable
- Clear buffer on every disabled/error exit path in doInit()
- Reset initPromise/initDone in shutdown() for test isolation
- Add tests for pre-init buffering, post-disable dropping, idempotency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle Config.get() and Control.account() failures in telemetry init

Config.get() throws "No context found for instance" when called before
Instance.provide() (e.g. CLI middleware, worker thread startup). This
caused the outer catch to clear the buffer and disable telemetry entirely.

Fix: wrap Config.get() and Control.account() in individual try/catch
blocks so failures don't prevent telemetry initialization. The env var
ALTIMATE_TELEMETRY_DISABLED is the early-init escape hatch.

Verified end-to-end: pre-init events are buffered and flushed to App
Insights after init completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: hash user email before sending to telemetry

Raw email is PII — replace with SHA-256 hash to preserve anonymous
user correlation without leaking personally identifiable information.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
anandgupta42 added a commit that referenced this pull request Mar 17, 2026
* fix: address remaining code review issues from PR #39

Fixes 12 issues identified in the comprehensive code review of PR #39:

Critical:
- Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs

Major:
- Fix MCP tool type detection: add MCP.isMcpTool() instead of broken
  startsWith("mcp__") prefix check
- Fix MCP prompt template getter: replace Promise constructor anti-pattern
  with .then() chain, throw on undefined template
- Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch
  to prevent single malformed filename from aborting cleanup

Minor:
- Fix ai.cloud.role tag: "altimate-code" → "altimate"
- Fix error truncation: use .message instead of .toString() to avoid
  leaking stack traces with file paths
- Fix skill_count: populate from Skill.all() instead of hardcoded 0
- Fix context utilization: remove cache_write from token count
- Fix buffer overflow: add dropped events counter for telemetry
- Fix compactionAttempts: delete session entry on completion/error
- Fix empty session_id: use "startup" fallback for early events
- Fix session_end: add beforeExit handler for early process termination

Docs:
- Fix config path in telemetry docs (was ~/.config/altimate/config.json)
- Add telemetry field to config schema table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve session_end coverage for process.exit() and add dedup guard

Address review comment about beforeExit limitations:
- Add process.once("exit") handler to cover process.exit() calls
- Add dedup flag to prevent double-firing when finally block also runs
- Document why SIGINT/SIGTERM are NOT handled (abort controller already
  triggers loop exit → finally block → normal session_end)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants