feat: comprehensive telemetry instrumentation by anandgupta42 · Pull Request #39 · AltimateAI/altimate-code

anandgupta42 · 2026-03-05T01:47:06Z

Summary

25 telemetry event types instrumented across the CLI lifecycle — up from 10 defined (many never emitted) to full coverage
Error visibility: auth failures, MCP server errors, Python engine crashes, provider errors, permission denials, doom loops
Privacy-safe helpers: categorizeToolName() for tool classification, bucketCount() for anonymized counts
Telemetry docs page with event table, privacy policy, opt-out instructions, and contributor guide
25 unit tests covering all helpers, event types, privacy validation, and naming conventions

Files changed (15)

File	Changes
`src/telemetry/index.ts`	25 event types, `categorizeToolName()`, `bucketCount()`, `sequence_index`/`previous_tool` on tool_call
`src/session/processor.ts`	tool_call sequencing, context_utilization, error_recovered, provider_error, doom_loop_detected
`src/session/prompt.ts`	agent_outcome tracking
`src/mcp/index.ts`	mcp_server_status + mcp_server_census
`src/bridge/engine.ts`	engine_started, engine_error
`src/cli/cmd/auth.ts`	auth_login (7 paths), auth_logout
`src/installation/index.ts`	upgrade_attempted
`src/session/index.ts`	session_forked
`src/session/compaction.ts`	compaction_triggered (was defined, never emitted)
`src/permission/next.ts`	permission_denied
`src/tool/project-scan.ts`	environment_census
`test/telemetry/telemetry.test.ts`	25 tests, 81 assertions
`docs/docs/configure/telemetry.md`	New telemetry docs page
`docs/docs/network.md`	AppInsights firewall endpoint
`docs/mkdocs.yml`	Nav entry for telemetry docs

Test plan

npx tsc --noEmit — no new type errors (2 pre-existing)
bun test — 1413 pass, 0 fail across 85 files (including 25 new telemetry tests)
Privacy: no SQL/code/credentials/file paths sent; error messages truncated to 500 chars; counts bucketed
All events use snake_case domain_action naming convention
Manual: start altimate, trigger tool calls, verify tool_category and sequence_index appear in events
Manual: disconnect network, verify telemetry failures are silent

🤖 Generated with Claude Code

…d data moat Instrument 25 event types across the CLI lifecycle — auth, MCP servers, Python engine, provider errors, permissions, upgrades, context utilization, agent outcomes, workflow sequencing, and environment census. Adds telemetry docs page, firewall endpoint, 25 unit tests, and privacy-safe helpers (categorizeToolName, bucketCount). No behavioral changes; all telemetry is opt-out via config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jontsai

LGTM

Comprehensive telemetry instrumentation with good privacy practices. Verified:

✓ 25 well-typed event types with consistent snake_case naming
✓ Privacy-safe: no SQL/code/credentials/paths, error messages truncated to 500 chars
✓ categorizeToolName() and bucketCount() are smart privacy helpers
✓ sequence_index + previous_tool on tool_call enables workflow pattern analysis
✓ Engine error tracking across phases (uv_download, venv_create, pip_install) — good for debugging install issues
✓ Docs page with event table, opt-out instructions, and contributor guide
✓ 25 tests covering helpers, events, and privacy validation

Minor observations:

agent_outcome has doom_loops: 0, permission_denials: 0 hardcoded — worth wiring up actual counters in a follow-up
categorizeToolName uses substring matching which could miscategorize (e.g., a tool named sql_warehouse_connection hits sql first) — ordering matters, current priority seems reasonable

Ship it! 🚢

jontsai · 2026-03-05T02:06:49Z

packages/altimate-code/src/session/compaction.ts

+
  export async function process(input: {
    parentID: string
    messages: MessageV2.WithParts[]


issue: compactionAttempt is a module-level let — it persists across sessions in the same process. If a user runs multiple sessions (or the CLI is long-lived), the counter leaks between them. Should this be scoped inside process() or reset per-session?

jontsai · 2026-03-05T02:06:51Z

packages/altimate-code/src/session/prompt.ts

+        permission_denials: 0,
+        outcome,
+      })
      Telemetry.track({


suggestion: doom_loops: 0 and permission_denials: 0 are hardcoded — they'll never reflect actual counts. Consider tracking these via counters in the prompt loop (increment on doom_loop_detected / permission_denied events) and wiring them in here, or remove the fields until they're real to avoid misleading dashboards.

jontsai · 2026-03-05T02:06:52Z

packages/altimate-code/src/mcp/index.ts

+          const remoteResourcesList = await client.listResources().catch(() => ({ resources: [] }))
+          Telemetry.track({
+            type: "mcp_server_census",
+            timestamp: Date.now(),


thought: listTools() and listResources() are awaited inline during the connect path. If an MCP server is slow to respond to these, it adds latency to the connection sequence. Consider firing these asynchronously (fire-and-forget with .catch) since census data is purely telemetry and shouldn't block the critical path.

jontsai · 2026-03-05T02:06:53Z

packages/altimate-code/src/session/processor.ts

+                      cache_hit_ratio: totalInput > 0 ? Math.round((cacheRead / totalInput) * 1000) / 1000 : 0,
+                    })
+                  }
                  if (snapshot) {


issue: generation_number is set to toolCallCounter, but these count different things — toolCallCounter tracks tool calls, not generations/steps. This will overcount in multi-tool turns and undercount in zero-tool turns. Should this use a separate generation counter (or step from the prompt loop)?

jontsai · 2026-03-05T02:06:54Z

packages/altimate-code/src/telemetry/index.ts

+    if (type === "mcp") return "mcp"
+    const n = name.toLowerCase()
+    if (n.includes("sql") || n.includes("query")) return "sql"
+    if (n.includes("schema") || n.includes("column") || n.includes("table")) return "schema"


suggestion: The substring match ordering can cause edge-case miscategorization. For example, warehouse_usage_stats matches warehouse before reaching the warehouse_usage → finops check. Currently the finops check with warehouse_usage comes before the warehouse check so it works — but this ordering dependency is fragile. Consider using a more explicit matching approach (exact prefixes or a lookup map) to make this order-independent.

jontsai · 2026-03-05T02:06:55Z

packages/altimate-code/src/installation/index.ts

+      Telemetry.track({
+        type: "upgrade_attempted",
+        timestamp: Date.now(),
+        session_id: Telemetry.getContext().sessionId || "cli",


nit: The as "npm" | "bun" | "brew" cast silently maps unknown methods (like choco, yarn, etc.) to "npm". Consider adding an "other" bucket to the type union and using it as the fallback, so dashboards can distinguish actual npm installs from unknown methods.

jontsai

PR #39 Review Tour Guide 🗺️

Overview

+979/-9 across 15 files — comprehensive telemetry instrumentation adding 25 event types across the CLI lifecycle. Single commit by @anandgupta42 + Claude.

File Order

#	File	+/-	Purpose
1	`docs/docs/configure/telemetry.md`	+122	New telemetry docs page
2	`docs/docs/network.md`	+1	AppInsights firewall endpoint
3	`docs/mkdocs.yml`	+1	Nav entry
4	`src/bridge/engine.ts`	+49/-3	`engine_started`, `engine_error` events
5	`src/cli/cmd/auth.ts`	+83	`auth_login` (7 paths), `auth_logout`
6	`src/installation/index.ts`	+22	`upgrade_attempted`
7	`src/mcp/index.ts`	+75/-1	`mcp_server_status` + `mcp_server_census`
8	`src/permission/next.ts`	+19/-1	`permission_denied`
9	`src/session/compaction.ts`	+10	`compaction_triggered`
10	`src/session/index.ts`	+10	`session_forked`
11	`src/session/processor.ts`	+70/-4	Tool sequencing, `context_utilization`, `error_recovered`, `provider_error`, `doom_loop_detected`
12	`src/session/prompt.ts`	+24	`agent_outcome`
13	`src/telemetry/index.ts`	+168	15 new event types + `categorizeToolName()` + `bucketCount()`
14	`src/tool/project-scan.ts`	+41	`environment_census`
15	`test/telemetry/telemetry.test.ts`	+284	25 tests, 81 assertions

Tour Highlights

🟢 What's Good

Privacy design is solid — no SQL/code/creds, errors truncated to 500 chars, counts bucketed via bucketCount(), file paths excluded
Type system — 25 event types as a discriminated union, compile-time safe
Comprehensive coverage — auth (7 paths!), MCP lifecycle, engine, permissions, compaction, context utilization
Tests — 284 lines covering helpers, privacy validation, naming conventions, and all event types
Docs — clear telemetry page with event table, privacy policy, opt-out, and contributor guide

🟡 Issues Found (6 inline comments)

#	Severity	File	Issue
1	issue	`compaction.ts`	Module-level `compactionAttempt` counter leaks across sessions
2	suggestion	`prompt.ts`	`doom_loops: 0` and `permission_denials: 0` hardcoded — misleading data
3	thought	`mcp/index.ts`	`listTools()`/`listResources()` census calls block the MCP connect path
4	issue	`processor.ts`	`generation_number` uses `toolCallCounter` — wrong semantic
5	suggestion	`telemetry/index.ts`	`categorizeToolName` substring ordering is fragile
6	nit	`installation/index.ts`	Unknown upgrade methods silently map to `"npm"`

🟢 No Red Flags

No security concerns — telemetry is opt-out, privacy-safe
No breaking changes to existing behavior
Error paths re-throw properly after tracking

Summary

Solid PR. The 6 inline comments are mostly minor follow-ups — #1 (module-level state) and #4 (wrong counter) are the most impactful. None are blockers for shipping.

- compaction.ts: scope compactionAttempt per-session via Map instead of module-level counter that leaked across sessions - prompt.ts: remove hardcoded doom_loops/permission_denials fields from agent_outcome (were always 0); doom loops and permission denials are already tracked as separate events - mcp/index.ts: make mcp_server_census fire-and-forget so listTools/ listResources never block the MCP connect critical path - processor.ts: use dedicated generationCounter for context_utilization instead of toolCallCounter which counts a different thing - telemetry/index.ts: refactor categorizeToolName to use a pattern array with explicit ordering, making the match logic order-independent and easier to maintain - installation/index.ts: add "other" to upgrade method union instead of silently casting unknown methods to "npm" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

suryaiyer95

Multi-Model Code Review — 6 AI Models

Verdict: REQUEST CHANGES
Critical: 1 | Major: 7 | Minor: 4 | Nits: 2

Reviewed by Claude Opus 4.6, Kimi K2.5, Grok 4, MiniMax M2.5, GLM-5, Qwen 3.5 Plus — converged in 1 round.

Critical Issues

1. Docs claim path scrubbing + 500 char truncation, but code doesn't match

Security / Documentation — src/session/processor.ts:459, docs/docs/configure/telemetry.md:80

The telemetry docs state: "Error messages are truncated to 500 characters and scrubbed of file paths before sending."

Two problems:

(a) In processor.ts:459, the error event truncates to 1000 characters, not 500:

error_message: (e?.message ?? String(e)).slice(0, 1000),  // line 459

While tool_call error (line 263) and provider_error (line 488) correctly truncate to 500.

(b) There is no path scrubbing implementation anywhere. Error messages containing full filesystem paths (e.g., /Users/john/projects/secret-client/...) are sent as-is.

Fix: Standardize to 500 chars with a const MAX_ERROR_LENGTH = 500, implement a scrubPaths() helper, or update docs to reflect reality.

Flagged by: Claude, Kimi (Consensus)

Major Issues

2. `agent_outcome` hardcodes `doom_loops: 0` and `permission_denials: 0`

Logic Error — src/session/prompt.ts:808-809

These fields always send zero despite doom_loop_detected and permission_denied events being emitted elsewhere. The outcome summary loses correlation with these events.

Fix: Add session-scoped counters and pass actual values, or remove these fields.

Flagged by: Kimi, MiniMax, Qwen, GLM-5 (Consensus)

3. `ALTIMATE_TELEMETRY_DISABLED` env var documented but not implemented

Bug / Documentation — docs/docs/configure/telemetry.md:63-64, src/telemetry/index.ts:360-383

The docs say export ALTIMATE_TELEMETRY_DISABLED=true disables telemetry, but init() only checks userConfig.telemetry?.disabled. The env var is never read.

Fix: Add || process.env.ALTIMATE_TELEMETRY_DISABLED === 'true' to the init check.

Flagged by: Claude (Unique)

4. Module-level `compactionAttempt` counter never resets

Logic Error — src/session/compaction.ts:156

compactionAttempt persists across sessions. If Session A triggers 3 compactions, Session B's first compaction reports attempt: 4.

Fix: Make the counter session-scoped.

Flagged by: MiniMax, Kimi, Qwen, GLM-5 (Consensus)

5. `generation_number` in `context_utilization` uses tool call count

Logic Error — src/session/processor.ts:359

generation_number: toolCallCounter,  // counts tool invocations, not generations

A single generation can produce multiple tool calls. The field name implies generation count.

Fix: Add a separate generationCounter incremented in finish-step.

Flagged by: Claude (Unique)

6. MCP disconnect hardcodes `transport: "stdio"`

Logic Error — src/mcp/index.ts:634

transport: "stdio",  // Wrong for SSE/streamable-http servers

The disconnect function always reports "stdio" regardless of actual transport type.

Fix: Track the transport type when connecting and look it up on disconnect.

Flagged by: GLM-5 (Convergence fix)

7. `upgrade_attempted` silently loses method data

Logic Error — src/installation/index.ts:180, 201

const telemetryMethod = (["npm", "bun", "brew"].includes(method) ? method : "npm") as "npm" | "bun" | "brew"

Methods like pnpm, yarn, curl, scoop, choco are silently mapped to "npm".

Fix: Expand the type to include all methods.

Flagged by: GLM-5 (Convergence fix)

8. `agent_outcome` never produces `"error"` outcome

Logic Error — src/session/prompt.ts:793-797

const outcome = abort.aborted ? "abandoned"
  : sessionTotalCost === 0 && toolCallCount === 0 ? "abandoned"
  : "completed"  // "error" is never produced

Sessions ending with errors are incorrectly reported as "completed".

Fix: Track a sessionHadError flag and use it in outcome determination.

Flagged by: GLM-5 (Convergence fix)

Minor Issues

9. Session context not updated after fork

src/session/index.ts:275 — After session_forked is tracked, Telemetry.setContext() isn't called with the new session ID. Subsequent events may be attributed to the parent session. (MiniMax)

10. `categorizeToolName` has order-dependent keyword overlap

src/telemetry/index.ts:259-270 — A tool named warehouse_query_stats matches "query" → "sql" before "warehouse". (Claude)

11. Hardcoded `session_id: "cli"` in auth events

src/cli/cmd/auth.ts (10+ locations) — Should use Telemetry.getContext().sessionId || "cli" for consistency. (Kimi, Qwen, GLM-5)

12. Privacy tests verify type shapes, not runtime behavior

test/telemetry/telemetry.test.ts:174-246 — Tests construct events manually and check field absence, but don't verify actual track() call sites truncate error messages. (Claude, Kimi, Qwen)

Nits

Redundant .toLowerCase() at src/telemetry/index.ts:268 — n is already lowercased on line 261.
Unnecessary as const at src/session/processor.ts:214, 250 — TypeScript infers literal types from string literal ternaries.

Positive Observations

Privacy-by-design: categorizeToolName() and bucketCount() are thoughtful privacy primitives. Tool arguments/outputs never included.
Fail-safe telemetry: Silent catch in flush(), unref() on timer, !enabled guard — telemetry never breaks the CLI.
Strong type safety: Discriminated union Telemetry.Event ensures compile-time correctness for all 25 event variants.
Comprehensive documentation: Event table, privacy policy, opt-out instructions, contributor guide.
Clean App Insights integration: Proper envelope serialization, connection string parsing, batched flush model.

Missing Tests

toAppInsightsEnvelopes() serialization
parseConnectionString() edge cases
flush() / shutdown() behavior
Buffer overflow (> 200 events)
init() respecting disabled config / env var
Network failure silent handling

Finding Attribution

Issue	Origin	Type
1. Error truncation + path scrubbing mismatch	Claude, Kimi	Consensus
2. Hardcoded zeros in `agent_outcome`	Kimi, MiniMax, Qwen, GLM-5	Consensus
3. `ALTIMATE_TELEMETRY_DISABLED` env var not implemented	Claude	Unique
4. `compactionAttempt` counter never resets	MiniMax, Kimi, Qwen, GLM-5	Consensus
5. `generation_number` uses wrong counter	Claude	Unique
6. MCP disconnect hardcodes `transport: "stdio"`	GLM-5	Convergence fix
7. `upgrade_attempted` loses method data	GLM-5	Convergence fix
8. `agent_outcome` never produces `"error"`	GLM-5	Convergence fix
9. Session context not updated after fork	MiniMax	Unique
10. `categorizeToolName` order-dependent	Claude	Unique
11. Hardcoded `session_id: "cli"`	Kimi, Qwen, GLM-5	Consensus
12. Privacy tests don't verify runtime	Claude, Kimi, Qwen	Consensus
13. Redundant `.toLowerCase()`	Claude, Kimi, GLM-5	Consensus
14. Unnecessary `as const`	Qwen	Unique

- Standardize error message truncation to 500 chars (was 1000 in processor.ts) - Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init() - Track MCP transport type on connect, use correct transport in disconnect - Add sessionHadError flag so agent_outcome produces "error" outcome - Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address telemetry review issues from PR #39 - Standardize error message truncation to 500 chars (was 1000 in processor.ts) - Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init() - Track MCP transport type on connect, use correct transport in disconnect - Add sessionHadError flag so agent_outcome produces "error" outcome - Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: initialize telemetry early and buffer pre-init events Telemetry.init() was only called in SessionPrompt.next(), but track() calls from MCP connect, engine startup, and auth commands fire earlier. Events tracked before init() were silently dropped. - Buffer events in track() regardless of enabled state (pre-init events are kept and flushed once init() completes) - Clear buffer when init() determines telemetry is disabled - Call init() in main CLI middleware (covers auth, serve, etc.) - Call init() in worker thread startup (covers MCP/engine events) - Keep existing init() in session prompt as idempotent safety net Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: robust telemetry init with promise deduplication and pre-init buffering The previous commit moved init() earlier but had subtle issues: - No init-once guard when telemetry disabled (init ran repeatedly) - Events during async init() were dropped - Concurrent non-awaited init() calls could race Fix: - Use promise deduplication: init() returns same promise on repeated calls - Use initDone flag: track() buffers during init, drops after disable - Clear buffer on every disabled/error exit path in doInit() - Reset initPromise/initDone in shutdown() for test isolation - Add tests for pre-init buffering, post-disable dropping, idempotency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: handle Config.get() and Control.account() failures in telemetry init Config.get() throws "No context found for instance" when called before Instance.provide() (e.g. CLI middleware, worker thread startup). This caused the outer catch to clear the buffer and disable telemetry entirely. Fix: wrap Config.get() and Control.account() in individual try/catch blocks so failures don't prevent telemetry initialization. The env var ALTIMATE_TELEMETRY_DISABLED is the early-init escape hatch. Verified end-to-end: pre-init events are buffered and flushed to App Insights after init completes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: hash user email before sending to telemetry Raw email is PII — replace with SHA-256 hash to preserve anonymous user correlation without leaking personally identifiable information. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

kulvirgit · 2026-03-05T03:22:43Z

Code Review: PR #39 — Comprehensive Telemetry Instrumentation

Verdict: REQUEST CHANGES (2 critical, 5 major issues)
Reviewed by: 8 AI models (Claude, Codex, Gemini, Kimi, Grok, MiniMax, GLM-5, Qwen) with convergence round

Critical Issues

1. PII Leakage — Raw Email in `ai.user.id` Tag

Category: Security / Privacy
Location: src/telemetry/index.ts:330-331

The App Insights envelope sets ai.user.id to the raw user email from Installation.fingerprint(). Every telemetry event sent to Azure contains the user's email in plaintext, creating GDPR/compliance risk.

Fix: Hash the email before sending (ai.user.id: sha256(email)) or use an anonymous installation fingerprint.

2. Missing Timeout for MCP `listPrompts()` / `listResources()`

Category: Reliability
Location: src/mcp/index.ts

client.listTools() is correctly wrapped in withTimeout, but client.listPrompts() and client.listResources() are not. Since MCP.prompts() is awaited during command initialization (src/command/index.ts), a remote MCP server that connects but hangs on listPrompts will permanently hang the entire CLI. This defeats the command loading resilience added in PR #36.

Fix: Wrap listPrompts() and listResources() in withTimeout() the same way listTools() is wrapped.

Major Issues

3. Hardcoded App Insights Key with Insufficient Opt-Out Documentation

Category: Privacy / UX
Location: src/telemetry/index.ts:365-366

The DEFAULT_CONNECTION_STRING is hardcoded and telemetry initializes by default regardless of Control account state. While the key being public is acceptable for client-side telemetry (like Sentry DSNs), there's no documented env-var-based opt-out mechanism. The telemetry docs reference ALTIMATE_TELEMETRY_DISABLED but the code only checks userConfig.telemetry?.disabled — the env var path doesn't exist.

Fix:

Implement the documented ALTIMATE_TELEMETRY_DISABLED env var in init()
Document the data model and opt-out clearly in docs/docs/configure/telemetry.md
Set up Azure-side rate limiting

4. MCP Disconnect Hardcodes `transport: "stdio"` for All Servers

Category: Logic Error
Location: src/mcp/index.ts:627

mcp.disconnect() always emits mcp_server_status with transport: "stdio", even for SSE or streamable-http servers. This produces incorrect telemetry.

Fix: Read the actual transport from stored client status: s.status[name]?.transport ?? "stdio"

5. MCP Tool Type Detection Mismatch

Category: Data Integrity
Location: src/session/processor.ts (tool type check) and src/mcp/index.ts (tool registration)

processor.ts checks if a tool is MCP using tool.startsWith("mcp__"), but mcp/index.ts registers tools with clientName_toolName format (single underscore, not double). This means all MCP tools are incorrectly tracked as standard tools in telemetry, invalidating all MCP vs standard tool usage metrics.

Fix: Align the MCP tool prefix check with the actual naming format used in MCP tool registration.

6. MCP Prompt Template Getter Anti-Pattern

Category: Bug / Code Quality
Location: src/command/index.ts:122-135

Uses new Promise(async (resolve, reject) => {...}) anti-pattern. When MCP.getPrompt() returns undefined (server failure), the user gets a blank prompt with no error feedback. The .catch(reject) is dead code since MCP.getPrompt() internally catches errors and returns undefined.

Fix: Simplify to standard Promise chain and throw when template is undefined:

get template() {
  return MCP.getPrompt(prompt.client, prompt.name,
    prompt.arguments
      ? Object.fromEntries(prompt.arguments.map((a, i) => [a.name, `$${i + 1}`]))
      : {},
  ).then((template) => {
    if (!template) throw new Error(`Failed to load MCP prompt: ${prompt.name}`);
    return template.messages
      .map((m) => (m.content.type === "text" ? m.content.text : ""))
      .join("\n");
  });
}

7. `Truncate.cleanup()` Lacks Per-File Fault Isolation

Category: Reliability
Location: src/tool/truncation.ts:36 and src/id/id.ts:58

Identifier.timestamp() uses a weak marker check that can throw on legacy/malformed filenames. Since cleanup() calls this on every tool_* file without try/catch, a single malformed filename aborts the entire cleanup process.

Fix: Wrap timestamp extraction per-entry in try/catch and skip malformed IDs.

Minor Issues

8. Telemetry Session Context Race — Empty `session_id` in Early Events

Location: src/session/prompt.ts:303-304, src/mcp/index.ts (census)

Events like mcp_server_status and engine_started can fire before setContext() sets the session ID. MCP census uses Telemetry.getContext().sessionId which may be empty during startup. Auth events use session_id: "cli" which pollutes analytics.

Fix: Use fallback session IDs for early events (e.g., "startup") or buffer events until context is initialized.

9. No `session_end` Tracking on Early Process Termination

Location: src/session/prompt.ts:806-820

session_end and agent_outcome are only tracked in the finally block of SessionPrompt.loop(). If the process crashes or is killed before reaching this point, these events never fire.

Fix: Use process.on('beforeExit', ...) to attempt a final flush, or track session_end optimistically at session start with duration updated on completion.

10. `compactionAttempts` Map Grows Unbounded

Location: src/session/compaction.ts:156

The map grows for each session ID and is never cleaned up. In long-lived daemonized usage, this leaks memory.

Fix: Delete the session entry when compaction completes or session ends.

11. `ai.cloud.role` Still Says "altimate-code" After Binary Rename

Location: src/telemetry/index.ts (App Insights tags)

The binary was renamed to altimate but the App Insights cloud role tag still says "altimate-code".

Fix: Update to "altimate".

12. Error Message Truncation Inconsistent

Location: src/session/processor.ts:264 vs src/mcp/index.ts:434

processor.ts uses .toString() on error objects (can leak stack traces with file paths), while MCP uses .message (safer). Truncation length also varies (500 vs 1000 chars) across event types.

Fix: Standardize on .message and 500-char truncation everywhere.

13. `environment_census` Hardcodes `skill_count: 0`

Location: src/tool/project-scan.ts:588

Emits inaccurate observability data that will skew skill adoption metrics.

Fix: Populate from real skill registry or omit the field.

14. Context Utilization Calculation Includes `cache_write` Tokens

Location: src/session/processor.ts:348

totalTokens includes cache_write tokens which aren't part of context window consumption, inflating utilization_pct.

Fix: Use tokens.input + tokens.cache.read for context utilization.

15. Buffer Overflow Drops Oldest Events Silently

Location: src/telemetry/index.ts:402-408

When buffer exceeds 200 events, oldest are silently dropped. Critical events like session_start could be lost.

Fix: Track a dropped event counter to include in the next flush, or trigger an early flush at capacity.

Positive Observations

Comprehensive event taxonomy: 25 well-structured event types covering the full CLI lifecycle — sessions, generations, tool calls, bridge, auth, MCP, context management, engine, and error recovery
Privacy-conscious design: No tool arguments, no user content, truncated errors, bucketed counts in environment_census
Resilient telemetry: Designed to never crash the CLI — silent drops, buffer overflow protection, timeout handling, graceful shutdown
Clean test coverage: Validates all 25 event types, naming conventions, privacy guarantees. Command resilience tests cover success, partial, and total failure scenarios
Good resilience patterns: Command loading and TUI sync use try/catch and safe() wrappers to prevent cascading failures

Missing Tests

App Insights envelope format (toAppInsightsEnvelopes()) — core serialization with no test
Flush failure resilience — network errors, timeout handling
Buffer overflow behavior — dropping oldest events at capacity
parseConnectionString edge cases — missing keys, malformed strings
Truncate.cleanup with mixed-format filenames — legacy vs current ID formats
Auth flow telemetry — auth_login/auth_logout events
Telemetry shutdown flush — verifying shutdown() sends buffered events

Finding Attribution

Issue	Origin	Type
PII: raw email in ai.user.id	MiniMax, Kimi	Consensus
Missing timeout for listPrompts/listResources	Gemini	Unique
Hardcoded App Insights key / opt-out gap	Claude, Codex, Grok, Kimi, MiniMax, GLM-5, Qwen	Consensus
MCP disconnect hardcodes "stdio"	Codex, Kimi, MiniMax	Consensus
MCP tool type detection mismatch (mcp__ prefix)	Gemini	Unique
MCP prompt template getter anti-pattern	Claude	Unique
Truncate.cleanup fault isolation	Codex	Unique
Empty session_id in early events	Gemini, Kimi, MiniMax, GLM-5, Qwen	Consensus
No session_end on process crash	GLM-5	Unique
compactionAttempts unbounded	Codex, GLM-5, Qwen	Consensus
ai.cloud.role "altimate-code" after rename	Gemini, Kimi, MiniMax	Consensus
Error truncation inconsistent	GLM-5	Unique
skill_count: 0 hardcoded	Codex	Unique
Context utilization includes cache_write	Claude	Unique
Buffer overflow drops silently	Claude, Kimi, Qwen	Consensus

Reviewed by 8 models: Claude, Codex (OpenAI), Gemini (Google), Kimi (Moonshot), Grok (xAI), MiniMax, GLM-5 (Zhipu), Qwen (Alibaba). 1 convergence round with corrections incorporated.

Fixes 12 issues identified in the comprehensive code review of PR #39: Critical: - Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs Major: - Fix MCP tool type detection: add MCP.isMcpTool() instead of broken startsWith("mcp__") prefix check - Fix MCP prompt template getter: replace Promise constructor anti-pattern with .then() chain, throw on undefined template - Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch to prevent single malformed filename from aborting cleanup Minor: - Fix ai.cloud.role tag: "altimate-code" → "altimate" - Fix error truncation: use .message instead of .toString() to avoid leaking stack traces with file paths - Fix skill_count: populate from Skill.all() instead of hardcoded 0 - Fix context utilization: remove cache_write from token count - Fix buffer overflow: add dropped events counter for telemetry - Fix compactionAttempts: delete session entry on completion/error - Fix empty session_id: use "startup" fallback for early events - Fix session_end: add beforeExit handler for early process termination Docs: - Fix config path in telemetry docs (was ~/.config/altimate/config.json) - Add telemetry field to config schema table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address remaining code review issues from PR #39 Fixes 12 issues identified in the comprehensive code review of PR #39: Critical: - Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs Major: - Fix MCP tool type detection: add MCP.isMcpTool() instead of broken startsWith("mcp__") prefix check - Fix MCP prompt template getter: replace Promise constructor anti-pattern with .then() chain, throw on undefined template - Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch to prevent single malformed filename from aborting cleanup Minor: - Fix ai.cloud.role tag: "altimate-code" → "altimate" - Fix error truncation: use .message instead of .toString() to avoid leaking stack traces with file paths - Fix skill_count: populate from Skill.all() instead of hardcoded 0 - Fix context utilization: remove cache_write from token count - Fix buffer overflow: add dropped events counter for telemetry - Fix compactionAttempts: delete session entry on completion/error - Fix empty session_id: use "startup" fallback for early events - Fix session_end: add beforeExit handler for early process termination Docs: - Fix config path in telemetry docs (was ~/.config/altimate/config.json) - Add telemetry field to config schema table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve session_end coverage for process.exit() and add dedup guard Address review comment about beforeExit limitations: - Add process.once("exit") handler to cover process.exit() calls - Add dedup flag to prevent double-firing when finally block also runs - Document why SIGINT/SIGTERM are NOT handled (abort controller already triggers loop exit → finally block → normal session_end) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* feat: comprehensive telemetry instrumentation for error visibility and data moat Instrument 25 event types across the CLI lifecycle — auth, MCP servers, Python engine, provider errors, permissions, upgrades, context utilization, agent outcomes, workflow sequencing, and environment census. Adds telemetry docs page, firewall endpoint, 25 unit tests, and privacy-safe helpers (categorizeToolName, bucketCount). No behavioral changes; all telemetry is opt-out via config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review feedback on telemetry instrumentation - compaction.ts: scope compactionAttempt per-session via Map instead of module-level counter that leaked across sessions - prompt.ts: remove hardcoded doom_loops/permission_denials fields from agent_outcome (were always 0); doom loops and permission denials are already tracked as separate events - mcp/index.ts: make mcp_server_census fire-and-forget so listTools/ listResources never block the MCP connect critical path - processor.ts: use dedicated generationCounter for context_utilization instead of toolCallCounter which counts a different thing - telemetry/index.ts: refactor categorizeToolName to use a pattern array with explicit ordering, making the match logic order-independent and easier to maintain - installation/index.ts: add "other" to upgrade method union instead of silently casting unknown methods to "npm" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address telemetry review issues from PR #39 - Standardize error message truncation to 500 chars (was 1000 in processor.ts) - Implement ALTIMATE_TELEMETRY_DISABLED env var check in telemetry init() - Track MCP transport type on connect, use correct transport in disconnect - Add sessionHadError flag so agent_outcome produces "error" outcome - Use Telemetry.getContext().sessionId in auth events instead of hardcoded "cli" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: initialize telemetry early and buffer pre-init events Telemetry.init() was only called in SessionPrompt.next(), but track() calls from MCP connect, engine startup, and auth commands fire earlier. Events tracked before init() were silently dropped. - Buffer events in track() regardless of enabled state (pre-init events are kept and flushed once init() completes) - Clear buffer when init() determines telemetry is disabled - Call init() in main CLI middleware (covers auth, serve, etc.) - Call init() in worker thread startup (covers MCP/engine events) - Keep existing init() in session prompt as idempotent safety net Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: robust telemetry init with promise deduplication and pre-init buffering The previous commit moved init() earlier but had subtle issues: - No init-once guard when telemetry disabled (init ran repeatedly) - Events during async init() were dropped - Concurrent non-awaited init() calls could race Fix: - Use promise deduplication: init() returns same promise on repeated calls - Use initDone flag: track() buffers during init, drops after disable - Clear buffer on every disabled/error exit path in doInit() - Reset initPromise/initDone in shutdown() for test isolation - Add tests for pre-init buffering, post-disable dropping, idempotency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: handle Config.get() and Control.account() failures in telemetry init Config.get() throws "No context found for instance" when called before Instance.provide() (e.g. CLI middleware, worker thread startup). This caused the outer catch to clear the buffer and disable telemetry entirely. Fix: wrap Config.get() and Control.account() in individual try/catch blocks so failures don't prevent telemetry initialization. The env var ALTIMATE_TELEMETRY_DISABLED is the early-init escape hatch. Verified end-to-end: pre-init events are buffered and flushed to App Insights after init completes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: hash user email before sending to telemetry Raw email is PII — replace with SHA-256 hash to preserve anonymous user correlation without leaking personally identifiable information. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address remaining code review issues from PR #39 Fixes 12 issues identified in the comprehensive code review of PR #39: Critical: - Wrap MCP listPrompts/listResources in withTimeout() to prevent CLI hangs Major: - Fix MCP tool type detection: add MCP.isMcpTool() instead of broken startsWith("mcp__") prefix check - Fix MCP prompt template getter: replace Promise constructor anti-pattern with .then() chain, throw on undefined template - Fix Truncate.cleanup: wrap Identifier.timestamp() per-entry in try/catch to prevent single malformed filename from aborting cleanup Minor: - Fix ai.cloud.role tag: "altimate-code" → "altimate" - Fix error truncation: use .message instead of .toString() to avoid leaking stack traces with file paths - Fix skill_count: populate from Skill.all() instead of hardcoded 0 - Fix context utilization: remove cache_write from token count - Fix buffer overflow: add dropped events counter for telemetry - Fix compactionAttempts: delete session entry on completion/error - Fix empty session_id: use "startup" fallback for early events - Fix session_end: add beforeExit handler for early process termination Docs: - Fix config path in telemetry docs (was ~/.config/altimate/config.json) - Add telemetry field to config schema table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve session_end coverage for process.exit() and add dedup guard Address review comment about beforeExit limitations: - Add process.once("exit") handler to cover process.exit() calls - Add dedup flag to prevent double-firing when finally block also runs - Document why SIGINT/SIGTERM are NOT handled (abort controller already triggers loop exit → finally block → normal session_end) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

jontsai approved these changes Mar 5, 2026

View reviewed changes

jontsai reviewed Mar 5, 2026

View reviewed changes

anandgupta42 merged commit 91923fe into main Mar 5, 2026
4 checks passed

suryaiyer95 requested changes Mar 5, 2026

View reviewed changes

anandgupta42 mentioned this pull request Mar 5, 2026

fix: address telemetry review issues from PR #39 #40

Merged

3 tasks

anandgupta42 mentioned this pull request Mar 5, 2026

fix: address remaining code review issues from PR #39 #43

Merged

5 tasks

kulvirgit deleted the feat/comprehensive-telemetry-instrumentation branch March 10, 2026 21:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: comprehensive telemetry instrumentation#39

feat: comprehensive telemetry instrumentation#39
anandgupta42 merged 2 commits intomainfrom
feat/comprehensive-telemetry-instrumentation

anandgupta42 commented Mar 5, 2026 •

edited

Loading

Uh oh!

jontsai left a comment

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai Mar 5, 2026

Uh oh!

jontsai left a comment

Uh oh!

Uh oh!

suryaiyer95 left a comment

Uh oh!

kulvirgit commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

anandgupta42 commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files changed (15)

Test plan

Uh oh!

jontsai left a comment

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

jontsai left a comment

Choose a reason for hiding this comment

PR #39 Review Tour Guide 🗺️

Overview

File Order

Tour Highlights

🟢 What's Good

🟡 Issues Found (6 inline comments)

🟢 No Red Flags

Summary

Uh oh!

Uh oh!

suryaiyer95 left a comment

Choose a reason for hiding this comment

Multi-Model Code Review — 6 AI Models

Critical Issues

1. Docs claim path scrubbing + 500 char truncation, but code doesn't match

Major Issues

2. agent_outcome hardcodes doom_loops: 0 and permission_denials: 0

3. ALTIMATE_TELEMETRY_DISABLED env var documented but not implemented

4. Module-level compactionAttempt counter never resets

5. generation_number in context_utilization uses tool call count

6. MCP disconnect hardcodes transport: "stdio"

7. upgrade_attempted silently loses method data

8. agent_outcome never produces "error" outcome

Minor Issues

9. Session context not updated after fork

10. categorizeToolName has order-dependent keyword overlap

11. Hardcoded session_id: "cli" in auth events

12. Privacy tests verify type shapes, not runtime behavior

Nits

Positive Observations

Missing Tests

Uh oh!

kulvirgit commented Mar 5, 2026

Code Review: PR #39 — Comprehensive Telemetry Instrumentation

Critical Issues

1. PII Leakage — Raw Email in ai.user.id Tag

2. Missing Timeout for MCP listPrompts() / listResources()

Major Issues

3. Hardcoded App Insights Key with Insufficient Opt-Out Documentation

4. MCP Disconnect Hardcodes transport: "stdio" for All Servers

5. MCP Tool Type Detection Mismatch

6. MCP Prompt Template Getter Anti-Pattern

7. Truncate.cleanup() Lacks Per-File Fault Isolation

Minor Issues

8. Telemetry Session Context Race — Empty session_id in Early Events

9. No session_end Tracking on Early Process Termination

10. compactionAttempts Map Grows Unbounded

11. ai.cloud.role Still Says "altimate-code" After Binary Rename

12. Error Message Truncation Inconsistent

13. environment_census Hardcodes skill_count: 0

anandgupta42 commented Mar 5, 2026 •

edited

Loading

2. `agent_outcome` hardcodes `doom_loops: 0` and `permission_denials: 0`

3. `ALTIMATE_TELEMETRY_DISABLED` env var documented but not implemented

4. Module-level `compactionAttempt` counter never resets

5. `generation_number` in `context_utilization` uses tool call count

6. MCP disconnect hardcodes `transport: "stdio"`

7. `upgrade_attempted` silently loses method data

8. `agent_outcome` never produces `"error"` outcome

10. `categorizeToolName` has order-dependent keyword overlap

11. Hardcoded `session_id: "cli"` in auth events

1. PII Leakage — Raw Email in `ai.user.id` Tag

2. Missing Timeout for MCP `listPrompts()` / `listResources()`

4. MCP Disconnect Hardcodes `transport: "stdio"` for All Servers

7. `Truncate.cleanup()` Lacks Per-File Fault Isolation

8. Telemetry Session Context Race — Empty `session_id` in Early Events

9. No `session_end` Tracking on Early Process Termination

10. `compactionAttempts` Map Grows Unbounded

11. `ai.cloud.role` Still Says "altimate-code" After Binary Rename

13. `environment_census` Hardcodes `skill_count: 0`

14. Context Utilization Calculation Includes `cache_write` Tokens