Skip to content

feat: launch Agent Inspector from azd ai agent run#8264

Open
anchenyi wants to merge 13 commits into
Azure:mainfrom
anchenyi:feat/inspector-extension
Open

feat: launch Agent Inspector from azd ai agent run#8264
anchenyi wants to merge 13 commits into
Azure:mainfrom
anchenyi:feat/inspector-extension

Conversation

@anchenyi
Copy link
Copy Markdown

@anchenyi anchenyi commented May 20, 2026

Summary

azd ai agent run now automatically launches Agent Inspector for the local agent run. Users can opt out with --no-inspector.

Changes

  • Add --no-inspector to azd ai agent run.
  • Launch Agent Inspector through the azd workflow service.
  • Suppress Inspector CLI output when launched from azd ai agent run, so the terminal keeps showing the agent run output.
  • Show install guidance when azure.ai.inspector is not installed:
    • azd extension install azure.ai.inspector
  • Keep azd ai inspector as a command group; Inspector only starts from azd ai inspector launch.

Inspector Web Demo

image

@microsoft-github-policy-service microsoft-github-policy-service Bot added the customer-reported identify a customer issue label May 20, 2026
@anchenyi anchenyi marked this pull request as ready for review May 20, 2026 07:31
Copilot AI review requested due to automatic review settings May 20, 2026 07:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds automatic Agent Inspector launch when running a local agent via azd ai agent run, with an opt-out flag, and introduces a standalone azure.ai.inspector extension that serves the Inspector SPA and proxies JSON-RPC/WebSocket + HTTP/SSE traffic to the local agent.

Changes:

  • Add --no-inspector to azd ai agent run and launch Inspector via the azd workflow service using a silent mode.
  • Implement the Inspector extension server/runtime (embedded SPA, WS JSON-RPC bridge, HTTP fetch/invoke + SSE proxying) and add tests.
  • Update extension packaging (version bump, metadata/README/changelog, build scripts, lint config) and include bundled SPA assets.

Reviewed changes

Copilot reviewed 34 out of 119 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cli/azd/extensions/azure.ai.inspector/version.txt Bumps inspector extension version.
cli/azd/extensions/azure.ai.inspector/main.go Updates module import path for extension entrypoint.
cli/azd/extensions/azure.ai.inspector/internal/version/version.go Adds build-time version/commit/date variables.
cli/azd/extensions/azure.ai.inspector/internal/inspector/server_test.go Adds server routing test for SPA index fallback.
cli/azd/extensions/azure.ai.inspector/internal/inspector/server.go Adds local HTTP server + SPA asset handler + WS endpoint.
cli/azd/extensions/azure.ai.inspector/internal/inspector/rpc.go Implements JSON-RPC handling over WebSocket for the SPA.
cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_sse_test.go Adds test for upstream SSE cancellation on WS close.
cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_sse.go Implements SSE proxying and streaming to SPA (and optional sink).
cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_fetch.go Implements HTTP fetch/invoke proxying and response shaping.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/index.html Adds embedded SPA entrypoint HTML.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/qwen-T0CAGeOv.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/qwen-B0layTYq.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/minimax20-MO7AnRq7.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/minimax-1g4txH6T.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/kimi_o_80-BtOGfzOQ.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/kimi_o_20-DhxCfzjk.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/hljs/github.css Adds bundled SPA syntax highlighting theme.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/hljs/github-dark.css Adds bundled SPA syntax highlighting theme.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/foundry-openai-tmS3rrL9.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/foundry-openai-kEJt47xx.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-openai-OsxdyxhS.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-openai-Da_qp4ay.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-minimax-cFev7e0G.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-minimax-BCqYI7Go.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-kimi-BZ4xkefv.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-kimi-B-UniIay.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-deepseek-DDRLQatW.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/fireworks-deepseek-BG-COsKx.svg Adds bundled SPA asset (composite icon).
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets/assets/ai21-labs-D2iYCjBT.svg Adds bundled SPA asset.
cli/azd/extensions/azure.ai.inspector/internal/inspector/assets.go Embeds SPA assets into the extension binary.
cli/azd/extensions/azure.ai.inspector/internal/cmd/version.go Switches version reporting to new internal/version package.
cli/azd/extensions/azure.ai.inspector/internal/cmd/sse_render.go Adds SSE rendering for terminal mirroring.
cli/azd/extensions/azure.ai.inspector/internal/cmd/root_test.go Adds tests ensuring launch is an explicit subcommand + metadata.
cli/azd/extensions/azure.ai.inspector/internal/cmd/root.go Refactors root command, adds launch/listen/metadata, adds debug logging hook.
cli/azd/extensions/azure.ai.inspector/internal/cmd/metadata.go Removes redundant metadata command wrapper.
cli/azd/extensions/azure.ai.inspector/internal/cmd/inspector.go Adds azd ai inspector launch command implementation.
cli/azd/extensions/azure.ai.inspector/internal/cmd/debug.go Adds debug logging routing to file (or discard).
cli/azd/extensions/azure.ai.inspector/internal/cmd/context.go Removes context command (no longer part of extension surface).
cli/azd/extensions/azure.ai.inspector/go.sum Updates module dependency checksums.
cli/azd/extensions/azure.ai.inspector/go.mod Renames module path and updates dependencies for inspector runtime.
cli/azd/extensions/azure.ai.inspector/extension.yaml Updates extension metadata, version, examples, and required azd version.
cli/azd/extensions/azure.ai.inspector/cspell.yaml Adds dictionary words for extension-specific terms.
cli/azd/extensions/azure.ai.inspector/ci-build.ps1 Updates build flags and linker vars to new version package + hardening.
cli/azd/extensions/azure.ai.inspector/build.sh Updates linker vars path for version embedding.
cli/azd/extensions/azure.ai.inspector/build.ps1 Updates linker vars path for version embedding.
cli/azd/extensions/azure.ai.inspector/README.md Adds detailed usage and development docs.
cli/azd/extensions/azure.ai.inspector/CHANGELOG.md Updates changelog for 0.1.0-preview.
cli/azd/extensions/azure.ai.inspector/AGENTS.md Adds extension-specific conventions/instructions.
cli/azd/extensions/azure.ai.inspector/.golangci.yaml Adds lint configuration for the extension module.
cli/azd/extensions/azure.ai.inspector/.gitattributes Disables whitespace normalization for bundled SPA artifacts.
cli/azd/extensions/azure.ai.agents/internal/cmd/run_test.go Adds tests for --no-inspector and workflow-based inspector launch.
cli/azd/extensions/azure.ai.agents/internal/cmd/run.go Launches inspector by default after agent is ready; adds opt-out and install guidance.
Comments suppressed due to low confidence (5)

cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_sse.go:1

  • This code emits user-facing output directly to stdout/stderr (printUserInput, fmt.Printf, fmt.Fprintln) even when the inspector is launched with --silent from azd ai agent run, which contradicts the PR’s goal to suppress inspector output. Additionally, the error message hardcodes POST even though the request method can vary. Route all output through the configurable logger/SSESink (or gate it on a verbosity/debug flag), and construct the error string using the resolved request method.
    cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_sse.go:1
  • This code emits user-facing output directly to stdout/stderr (printUserInput, fmt.Printf, fmt.Fprintln) even when the inspector is launched with --silent from azd ai agent run, which contradicts the PR’s goal to suppress inspector output. Additionally, the error message hardcodes POST even though the request method can vary. Route all output through the configurable logger/SSESink (or gate it on a verbosity/debug flag), and construct the error string using the resolved request method.
    cli/azd/extensions/azure.ai.inspector/internal/inspector/proxy_fetch.go:1
  • Echoing raw user input directly to stderr can leak potentially sensitive content into terminals/log capture (and currently bypasses --silent). Consider removing this entirely or only emitting it via the logger under an explicit debug/verbose flag so it is opt-in.
    cli/azd/extensions/azure.ai.inspector/internal/inspector/server.go:1
  • The SPA fallback currently serves index.html for any fs.Stat error, not just missing files. That can mask real problems (e.g., permission/corrupt embed) and turn them into misleading 200 responses. Prefer falling back only on not-exist errors (e.g., errors.Is(err, fs.ErrNotExist)), and return a 500 for other errors.
    cli/azd/extensions/azure.ai.inspector/internal/inspector/rpc.go:1
  • Spawning an unbounded goroutine per inbound WS message can create avoidable memory/CPU pressure if the client misbehaves (or if the UI sends bursts). Consider handling non-streaming methods inline, and only offloading streaming-related methods; or introduce a bounded worker/semaphore to limit concurrent handlers.

Comment thread cli/azd/extensions/azure.ai.inspector/internal/cmd/debug.go
Comment thread cli/azd/extensions/azure.ai.inspector/ci-build.ps1 Outdated
Comment thread cli/azd/extensions/azure.ai.inspector/internal/cmd/inspector.go
Copy link
Copy Markdown
Contributor

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the Inspector auto-launch is a great UX. I've focused this review on the new attack surface, concurrency, and the default-behavior change. Findings are grouped by priority.

Existing Copilot review findings (proxy_sse stdout bypass, proxy_fetch user-input echo, server.go SPA fallback, rpc.go unbounded goroutine, debug.go log file leak, inspector.go scanner.Err) still apply at HEAD — please resolve those alongside the items below; several chain together.


🔴 Blockers

1. Local SSRF via unauthenticated WS + arbitrary-URL HTTP proxy

internal/inspector/server.go (handleWS) + proxy_fetch.go / proxy_sse.go

/agentdev/ws/rpc is exposed with no token / no handshake — only an Origin check (bypassable, see #2). Once connected, webviewProxy/fetch, webviewProxy/invoke, and webviewProxy/fetchSSE accept a client-supplied URL and forward it via http.NewRequest with no scheme/host validation.

Any local process (poisoned npm postinstall, malicious VS Code extension, a browser tab once #2 is exploited) can:

  • Exfiltrate IMDS managed-identity tokens (http://169.254.169.254/metadata/identity/...)
  • Read intranet services / loopback dev databases / other CLIs' RPC ports
  • POST arbitrary payloads to the local agent or any other localhost service

Auto-launching this from azd ai agent run by default makes the dev-machine surface materially larger.

Suggested fix:

  1. Mint a per-session random token on Server.Start, embed it in index.html, require it on WS upgrade (Sec-WebSocket-Protocol or query param). Reject upgrades without it.
  2. Constrain proxy URLs: scheme in {http, https}, host resolves to 127.0.0.1/::1/localhost, ideally port == cfg.AgentPort.

2. DNS rebinding bypasses CheckOrigin

internal/inspector/server.go:62-68

CheckOrigin: func(r *http.Request) bool {
    origin := r.Header.Get("Origin")
    if origin == "" { return true }
    return origin == "http://"+r.Host || origin == "https://"+r.Host
}

r.Host is attacker-controlled. Classic DNS-rebinding: developer visits evil.example → JS rebinds the host to 127.0.0.1 → browser sends Host: evil.example:8087, Origin: http://evil.example, check passes. Binding to 127.0.0.1 does not prevent this — the browser resolves locally itself.

Combined with #1, any browser visit to a malicious page while Inspector is running becomes click-zero exploitation for the full inspector lifetime.

Suggested fix: Validate r.Host against an explicit allowlist (127.0.0.1:<port>, [::1]:<port>, localhost:<port>); reject empty Origin on WS upgrades; ideally also gate on the session token from #1.


3. Nil-map panic race: registerStream after cleanup()

internal/inspector/rpc.go:94, 295proxy_sse.go:46, proxy_fetch.go:138

handleWS dispatches each message in a new goroutine (go sess.handleMessage(&msg)). On WS close, defer sess.cleanup() sets s.streams = nil. An already-scheduled webviewProxy/fetchSSE or proxyInvoke goroutine then calls s.registerStream(id, cancel)s.streams[id] = cancel on a nil map → process-wide panic. Reproducible on any disconnect-during-request scenario.

Suggested fix:

func (s *rpcSession) registerStream(id string, cancel context.CancelFunc) {
    s.streamsMu.Lock()
    defer s.streamsMu.Unlock()
    if s.closed { cancel(); return }
    s.streams[id] = cancel
}

…and in cleanup(): clear(s.streams) + s.closed = true under the same mutex.


4. No recover in per-message goroutines

internal/inspector/rpc.go:94

go sess.handleMessage(&msg) runs against client-controlled JSON; any nil-deref / OOB in any handler crashes the entire inspector process — and in auto-launch mode also propagates back to the parent azd ai agent run.

Suggested fix: Wrap the goroutine body in defer func() { if r := recover(); r != nil { /* log + send RPC error if msg.ID != nil */ } }().


🟠 Major

  1. Empty Origin accepted unconditionally (server.go:64-66) — if origin == "" { return true } lets any non-browser local process reach #1's SSRF without needing the rebinding chain.
  2. openUrlInBrowser is an unauth'd URI-handler launcher (rpc.go:186-201) — Any WS client can drive browser.OpenURL with arbitrary schemes (ms-msdt:, file:, vscode:, etc.). Pairs poorly with the Follina-class URI-handler vulnerabilities that ship regularly. Allowlist http/https only; reject embedded credentials.
  3. Unbounded WS frame size & no deadlines → memory DoS (rpc.go:59-95) — No conn.SetReadLimit, no SetReadDeadline, no ping/pong. A single 10 GB frame OOMs the process. Set conn.SetReadLimit(1<<20), periodic ping/pong, cap concurrent sessions.
  4. Body logging bypasses --silent via s.logger.Printf (proxy_fetch.go:111, 152) — s.logger.Printf("invoke ... body: %s", ..., p.Body) is not gated by cfg.Silent. With auto-launch, every prompt + every model response lands on the parent's stderr. (Distinct from Copilot's fmt.Fprintln finding in proxy_sse.go — this is the structured logger path.) When Silent, log only metadata (requestID, method, status, length).
  5. Auto-launched inspector lifetime not bound to parent process (extensions/azure.ai.agents/internal/cmd/run.go:258-326 + inspector.go:124-132) — workflow.Run(ctx, ...) spawns inspector with no PID capture, no explicit Kill on cancel(). Please verify: after Ctrl+C on azd ai agent run, does netstat -ano | findstr :8087 show port 8087 still bound? If yes, inspector keeps proxying with no UI, subsequent runs collide on the port, and a stale proxy stays reachable. Capture child PID and kill on cleanup, or pass --parent-pid for self-termination.
  6. streamSSELines SSESink deadlock / goroutine leak (proxy_sse.go:116-165inspector.go:141 injectSSEEvents) — readSSEStream returns early on response.completed, dropping the inner pipe reader; injectSSEEvents blocks on Fprintln into an unbuffered pipe with no reader → goroutine + HTTP body pinned until streamCtx is cancelled. Happens on normal completion when trailing events arrive. Close the sink writer on completion, or make injectSSEEvents ctx-aware.
  7. proxyInvoke streaming branch can leak response body on panic (proxy_fetch.go:137-145) — go s.pumpSSE(p.RequestID, resp, true) has no top-level defer resp.Body.Close(); body closure depends on the cancel-driven forcing goroutine that may not fire on panic / early return. Add defer resp.Body.Close() at top of pumpSSE (idempotent with existing path).
  8. Default-behavior UX change is ungated and undocumented (extensions/azure.ai.agents/internal/cmd/run.go + azure.ai.agents/README.md) — Auto-launching Inspector is a material UX shift with no env-var/global opt-out (only per-invocation --no-inspector), no Preview gating per docs/reference/feature-status.md, no README update on the agents extension explaining the new default / --no-inspector / headless behavior, and no telemetry distinguishing auto-launch attempts/successes/failures. CI and headless dev environments will see surprising behavior. Add an AZD_AI_AGENT_AUTO_INSPECTOR=false (or similar) global opt-out, document the new default, mark as Preview, emit telemetry.
  9. No Content-Security-Policy or hardening headers on the served SPA (server.go:129-137) — Only Content-Type + Cache-Control set. Given the 2.6 MB upstream JS bundle, a supply-chain compromise or XSS gets full origin powers. Add Content-Security-Policy: default-src 'self'; connect-src 'self' ws://localhost:<port>, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: no-referrer.
  10. Missing OSS attribution for embedded third-party SPA bundle — Embeds ~2.6 MB from qidon/tryinspector (KaTeX fonts, highlight.js themes, etc.) with no NOTICE / THIRD_PARTY_NOTICES / LICENSE references. Microsoft OSS policy generally requires attribution for redistributed third-party code, and incompatible licenses (GPL/AGPL) would block release. Please confirm upstream license compatibility and add the appropriate notices to cli/azd/extensions/azure.ai.inspector/.
  11. No integration test that Inspector failure doesn't break agent run (extensions/azure.ai.agents/internal/cmd/run_test.go) — The core safety property of the new default-on behavior — "inspector launch failure must not crash agent run" — has no test. Mock the workflow client to return RPC errors and panics; assert the agent continues. Also add an explicit test that --no-inspector skips the workflow call entirely.

🟡 Minor

  1. isInspectorExtensionMissingMessage brittle string match (run.go:369-373) — Substring-matches "unknown command"/"inspector"/"unknown flag: --port". If azd's workflow runner changes wording, install guidance silently regresses to a generic warning. Prefer a typed sentinel/status code, or rely on the existing pre-check.
  2. Windows Ctrl+C kills agent child ungracefully (run.go:203-245) — exec.CommandContext on Windows = TerminateProcess; no chance to flush OTel/logs. Use CREATE_NEW_PROCESS_GROUP + GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT) with grace timeout.
  3. loadAzdEnvironment error silently swallowed (run.go:177-182) — User gets confusing downstream "endpoint is empty" errors. Mirror the warning pattern used right below for resolveConnectionCredentials.
  4. No size limit on error/buffered io.ReadAll(resp.Body) (proxy_sse.go:~76, proxy_fetch.go:~55) — Malicious/buggy upstream can OOM Inspector. Use io.LimitReader (4 KB for error bodies, 16 MB for buffered).
  5. streamHTTPClient has no response-header timeout (proxy_fetch.go:58) — Same client is used for proxyInvoke before content-type is known; a hung non-SSE upstream blocks the goroutine. Set Transport.ResponseHeaderTimeout: 30s.
  6. apim-request-id to stdout in standalone mode (proxy_sse.go:75-77) — fmt.Printf("Trace ID: %s\n", ...) writes to stdout; breaks | jq / stdout-as-data flows. Use stderr.
  7. .golangci.yaml line-length 220 vs azd core 125 — Inherit from core or use file-scoped //nolint:lll for justified long lines (mostly the embedded asset SHA-stamped paths).
  8. runInspector browser-open goroutine leak on Start failure (inspector.go:124-132) — Goroutine blocks on <-ready forever if Start fails before binding. Close ready in error path, or select on ctx.Done().
  9. Defense-in-depth on assetsHandler path handling (server.go:139-150) — http.ServeMux cleans paths today, but a future re-mount with StripPrefix could bypass that. Explicit path.Clean + escape rejection.
  10. Test coverage gaps — Missing tests for: SSE upstream 4xx/5xx + read-error paths, malformed SSE lines, SPA fallback for real fs.Stat errors vs ErrNotExist, JSON-RPC error response structure, very long data: lines (>64 KB scanner buffer), --no-inspector skips workflow call.
  11. Test flakiness — time.Sleep / hardcoded timeouts in proxy_sse_test.go, server_test.go — replace with synchronization primitives or t.Context() deadlines.
  12. Install warning has no rate-limit (run.go ~200) — Prints on every run if extension absent. Consider a once-per-day marker file.
  13. CHANGELOG entry minimal — Agents extension uses richer entries linking PR/feature.
  14. requiredAzdVersion accuracy (extension.yaml:8) — Verify >1.23.13 matches the minimum azd version supporting workflow.Run for extensions.
  15. AGENTS.md error-handling guidance gap (extensions/azure.ai.inspector/AGENTS.md:51) — Clarify this extension uses plain wrapped errors (no exterrors), unlike the agents extension.

⚪ Nits

  • srv.URL() returns http://localhost:%d while listener binds 127.0.0.1 only (server.go:73-75, 85)
  • cspell.yaml — comment why azureaiinspector (Go module camelCase) is allowed
  • build.sh:12 — comment casing vs actual $EXTENSION_ID_SAFE
  • debug.go log file close error swallowed
  • root_test.go findCommand no recursion depth guard
  • run_test.go appendPortEnvVars relies on slice index ordering

Happy to discuss any of the above — particularly grateful for the contribution and the careful refactor of the extension internals. The blockers are all addressable with localized changes (session token + URL host check + nil-map fix + recover).

Copy link
Copy Markdown
Member

@trangevi trangevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a \azure-dev.github\workflows\approval-ext-azure-ai-inspector.yml file which controls a github pipeline to block PRs on a separate approval list from the standard codeowners file. This is intended to still get helpful reviews from the broader AZD group (in terms of sticking to AZD standards, GO best practices, etc.) while still blocking the PR on a smaller set of people who know the proper code flow for each extension. We should update that file with folks on your team for proper ownership of this new extension.

In .github/workflows/approval-ext-azure-ai-inspector.yml, add a REQUIRED_APPROVERS environment variable to the step:

       - name: Check for required team approval
         uses: actions/github-script@v9
         env:
           EXTENSION_PATH: "cli/azd/extensions/azure.ai.inspector/"
           WORKFLOW_PATH: ".github/workflows/approval-ext-azure-ai-inspector.yml"
           OVERRIDE_COMMAND: "/inspector-extension-approval override"
           REQUIRED_APPROVERS: '["user1", "user2", "user3"]'  # ← add this
         with:
           script: |
             const script = require('./.github/scripts/pr-approval-foundry-extensions-shared.js');
             await script({ github, context, core });

The shared script (line 31-33) checks for this env var and uses it instead of the default list.

Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/run.go
@JeffreyCA JeffreyCA added the ext-agents azure.ai.{agents,connections,inspector,projects,routines,skills,toolboxes} extensions label May 20, 2026
@anchenyi anchenyi requested a review from danieljurek as a code owner May 21, 2026 01:38
@anchenyi
Copy link
Copy Markdown
Author

Thanks for this — the Inspector auto-launch is a great UX. I've focused this review on the new attack surface, concurrency, and the default-behavior change. Findings are grouped by priority.

Existing Copilot review findings (proxy_sse stdout bypass, proxy_fetch user-input echo, server.go SPA fallback, rpc.go unbounded goroutine, debug.go log file leak, inspector.go scanner.Err) still apply at HEAD — please resolve those alongside the items below; several chain together.

🔴 Blockers

1. Local SSRF via unauthenticated WS + arbitrary-URL HTTP proxy

internal/inspector/server.go (handleWS) + proxy_fetch.go / proxy_sse.go

/agentdev/ws/rpc is exposed with no token / no handshake — only an Origin check (bypassable, see #2). Once connected, webviewProxy/fetch, webviewProxy/invoke, and webviewProxy/fetchSSE accept a client-supplied URL and forward it via http.NewRequest with no scheme/host validation.

Any local process (poisoned npm postinstall, malicious VS Code extension, a browser tab once #2 is exploited) can:

  • Exfiltrate IMDS managed-identity tokens (http://169.254.169.254/metadata/identity/...)
  • Read intranet services / loopback dev databases / other CLIs' RPC ports
  • POST arbitrary payloads to the local agent or any other localhost service

Auto-launching this from azd ai agent run by default makes the dev-machine surface materially larger.

Suggested fix:

  1. Mint a per-session random token on Server.Start, embed it in index.html, require it on WS upgrade (Sec-WebSocket-Protocol or query param). Reject upgrades without it.
  2. Constrain proxy URLs: scheme in {http, https}, host resolves to 127.0.0.1/::1/localhost, ideally port == cfg.AgentPort.

2. DNS rebinding bypasses CheckOrigin

internal/inspector/server.go:62-68

CheckOrigin: func(r *http.Request) bool {
    origin := r.Header.Get("Origin")
    if origin == "" { return true }
    return origin == "http://"+r.Host || origin == "https://"+r.Host
}

r.Host is attacker-controlled. Classic DNS-rebinding: developer visits evil.example → JS rebinds the host to 127.0.0.1 → browser sends Host: evil.example:8087, Origin: http://evil.example, check passes. Binding to 127.0.0.1 does not prevent this — the browser resolves locally itself.

Combined with #1, any browser visit to a malicious page while Inspector is running becomes click-zero exploitation for the full inspector lifetime.

Suggested fix: Validate r.Host against an explicit allowlist (127.0.0.1:<port>, [::1]:<port>, localhost:<port>); reject empty Origin on WS upgrades; ideally also gate on the session token from #1.

3. Nil-map panic race: registerStream after cleanup()

internal/inspector/rpc.go:94, 295proxy_sse.go:46, proxy_fetch.go:138

handleWS dispatches each message in a new goroutine (go sess.handleMessage(&msg)). On WS close, defer sess.cleanup() sets s.streams = nil. An already-scheduled webviewProxy/fetchSSE or proxyInvoke goroutine then calls s.registerStream(id, cancel)s.streams[id] = cancel on a nil map → process-wide panic. Reproducible on any disconnect-during-request scenario.

Suggested fix:

func (s *rpcSession) registerStream(id string, cancel context.CancelFunc) {
    s.streamsMu.Lock()
    defer s.streamsMu.Unlock()
    if s.closed { cancel(); return }
    s.streams[id] = cancel
}

…and in cleanup(): clear(s.streams) + s.closed = true under the same mutex.

4. No recover in per-message goroutines

internal/inspector/rpc.go:94

go sess.handleMessage(&msg) runs against client-controlled JSON; any nil-deref / OOB in any handler crashes the entire inspector process — and in auto-launch mode also propagates back to the parent azd ai agent run.

Suggested fix: Wrap the goroutine body in defer func() { if r := recover(); r != nil { /* log + send RPC error if msg.ID != nil */ } }().

🟠 Major

  1. Empty Origin accepted unconditionally (server.go:64-66) — if origin == "" { return true } lets any non-browser local process reach Test pipeline triggers #1's SSRF without needing the rebinding chain.
  2. openUrlInBrowser is an unauth'd URI-handler launcher (rpc.go:186-201) — Any WS client can drive browser.OpenURL with arbitrary schemes (ms-msdt:, file:, vscode:, etc.). Pairs poorly with the Follina-class URI-handler vulnerabilities that ship regularly. Allowlist http/https only; reject embedded credentials.
  3. Unbounded WS frame size & no deadlines → memory DoS (rpc.go:59-95) — No conn.SetReadLimit, no SetReadDeadline, no ping/pong. A single 10 GB frame OOMs the process. Set conn.SetReadLimit(1<<20), periodic ping/pong, cap concurrent sessions.
  4. Body logging bypasses --silent via s.logger.Printf (proxy_fetch.go:111, 152) — s.logger.Printf("invoke ... body: %s", ..., p.Body) is not gated by cfg.Silent. With auto-launch, every prompt + every model response lands on the parent's stderr. (Distinct from Copilot's fmt.Fprintln finding in proxy_sse.go — this is the structured logger path.) When Silent, log only metadata (requestID, method, status, length).
  5. Auto-launched inspector lifetime not bound to parent process (extensions/azure.ai.agents/internal/cmd/run.go:258-326 + inspector.go:124-132) — workflow.Run(ctx, ...) spawns inspector with no PID capture, no explicit Kill on cancel(). Please verify: after Ctrl+C on azd ai agent run, does netstat -ano | findstr :8087 show port 8087 still bound? If yes, inspector keeps proxying with no UI, subsequent runs collide on the port, and a stale proxy stays reachable. Capture child PID and kill on cleanup, or pass --parent-pid for self-termination.
  6. streamSSELines SSESink deadlock / goroutine leak (proxy_sse.go:116-165inspector.go:141 injectSSEEvents) — readSSEStream returns early on response.completed, dropping the inner pipe reader; injectSSEEvents blocks on Fprintln into an unbuffered pipe with no reader → goroutine + HTTP body pinned until streamCtx is cancelled. Happens on normal completion when trailing events arrive. Close the sink writer on completion, or make injectSSEEvents ctx-aware.
  7. proxyInvoke streaming branch can leak response body on panic (proxy_fetch.go:137-145) — go s.pumpSSE(p.RequestID, resp, true) has no top-level defer resp.Body.Close(); body closure depends on the cancel-driven forcing goroutine that may not fire on panic / early return. Add defer resp.Body.Close() at top of pumpSSE (idempotent with existing path).
  8. Default-behavior UX change is ungated and undocumented (extensions/azure.ai.agents/internal/cmd/run.go + azure.ai.agents/README.md) — Auto-launching Inspector is a material UX shift with no env-var/global opt-out (only per-invocation --no-inspector), no Preview gating per docs/reference/feature-status.md, no README update on the agents extension explaining the new default / --no-inspector / headless behavior, and no telemetry distinguishing auto-launch attempts/successes/failures. CI and headless dev environments will see surprising behavior. Add an AZD_AI_AGENT_AUTO_INSPECTOR=false (or similar) global opt-out, document the new default, mark as Preview, emit telemetry.
  9. No Content-Security-Policy or hardening headers on the served SPA (server.go:129-137) — Only Content-Type + Cache-Control set. Given the 2.6 MB upstream JS bundle, a supply-chain compromise or XSS gets full origin powers. Add Content-Security-Policy: default-src 'self'; connect-src 'self' ws://localhost:<port>, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: no-referrer.
  10. Missing OSS attribution for embedded third-party SPA bundle — Embeds ~2.6 MB from qidon/tryinspector (KaTeX fonts, highlight.js themes, etc.) with no NOTICE / THIRD_PARTY_NOTICES / LICENSE references. Microsoft OSS policy generally requires attribution for redistributed third-party code, and incompatible licenses (GPL/AGPL) would block release. Please confirm upstream license compatibility and add the appropriate notices to cli/azd/extensions/azure.ai.inspector/.
  11. No integration test that Inspector failure doesn't break agent run (extensions/azure.ai.agents/internal/cmd/run_test.go) — The core safety property of the new default-on behavior — "inspector launch failure must not crash agent run" — has no test. Mock the workflow client to return RPC errors and panics; assert the agent continues. Also add an explicit test that --no-inspector skips the workflow call entirely.

🟡 Minor

  1. isInspectorExtensionMissingMessage brittle string match (run.go:369-373) — Substring-matches "unknown command"/"inspector"/"unknown flag: --port". If azd's workflow runner changes wording, install guidance silently regresses to a generic warning. Prefer a typed sentinel/status code, or rely on the existing pre-check.
  2. Windows Ctrl+C kills agent child ungracefully (run.go:203-245) — exec.CommandContext on Windows = TerminateProcess; no chance to flush OTel/logs. Use CREATE_NEW_PROCESS_GROUP + GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT) with grace timeout.
  3. loadAzdEnvironment error silently swallowed (run.go:177-182) — User gets confusing downstream "endpoint is empty" errors. Mirror the warning pattern used right below for resolveConnectionCredentials.
  4. No size limit on error/buffered io.ReadAll(resp.Body) (proxy_sse.go:~76, proxy_fetch.go:~55) — Malicious/buggy upstream can OOM Inspector. Use io.LimitReader (4 KB for error bodies, 16 MB for buffered).
  5. streamHTTPClient has no response-header timeout (proxy_fetch.go:58) — Same client is used for proxyInvoke before content-type is known; a hung non-SSE upstream blocks the goroutine. Set Transport.ResponseHeaderTimeout: 30s.
  6. apim-request-id to stdout in standalone mode (proxy_sse.go:75-77) — fmt.Printf("Trace ID: %s\n", ...) writes to stdout; breaks | jq / stdout-as-data flows. Use stderr.
  7. .golangci.yaml line-length 220 vs azd core 125 — Inherit from core or use file-scoped //nolint:lll for justified long lines (mostly the embedded asset SHA-stamped paths).
  8. runInspector browser-open goroutine leak on Start failure (inspector.go:124-132) — Goroutine blocks on <-ready forever if Start fails before binding. Close ready in error path, or select on ctx.Done().
  9. Defense-in-depth on assetsHandler path handling (server.go:139-150) — http.ServeMux cleans paths today, but a future re-mount with StripPrefix could bypass that. Explicit path.Clean + escape rejection.
  10. Test coverage gaps — Missing tests for: SSE upstream 4xx/5xx + read-error paths, malformed SSE lines, SPA fallback for real fs.Stat errors vs ErrNotExist, JSON-RPC error response structure, very long data: lines (>64 KB scanner buffer), --no-inspector skips workflow call.
  11. Test flakiness — time.Sleep / hardcoded timeouts in proxy_sse_test.go, server_test.go — replace with synchronization primitives or t.Context() deadlines.
  12. Install warning has no rate-limit (run.go ~200) — Prints on every run if extension absent. Consider a once-per-day marker file.
  13. CHANGELOG entry minimal — Agents extension uses richer entries linking PR/feature.
  14. requiredAzdVersion accuracy (extension.yaml:8) — Verify >1.23.13 matches the minimum azd version supporting workflow.Run for extensions.
  15. AGENTS.md error-handling guidance gap (extensions/azure.ai.inspector/AGENTS.md:51) — Clarify this extension uses plain wrapped errors (no exterrors), unlike the agents extension.

⚪ Nits

  • srv.URL() returns http://localhost:%d while listener binds 127.0.0.1 only (server.go:73-75, 85)
  • cspell.yaml — comment why azureaiinspector (Go module camelCase) is allowed
  • build.sh:12 — comment casing vs actual $EXTENSION_ID_SAFE
  • debug.go log file close error swallowed
  • root_test.go findCommand no recursion depth guard
  • run_test.go appendPortEnvVars relies on slice index ordering

Happy to discuss any of the above — particularly grateful for the contribution and the careful refactor of the extension internals. The blockers are all addressable with localized changes (session token + URL host check + nil-map fix + recover).

@wbreza

Thanks for the detailed review. I went through the blocker, major, minor, and nit items and made targeted changes based on the threat model for this feature.

Agent Inspector is a local developer UI served on loopback and launched from azd ai agent run. The goal is to keep the local happy path simple while preventing the concrete abuse cases: using the inspector as an open proxy to arbitrary network targets, launching arbitrary URI handlers, leaking raw prompt/response bodies in silent mode, or leaving stale listeners after the agent run exits.

For items implemented exactly as requested, I marked them as resolved. For items that are partial or intentionally left out of this PR, I included the reason.

Blockers

# Status Reply
1. Local SSRF via WS + arbitrary proxy URL Partially addressed Proxy URLs are constrained to http/https, local hosts only (localhost, 127.0.0.1, ::1), and the configured agent port. This prevents proxying to IMDS, intranet hosts, unrelated local services, or a different localhost port. I did not add a per-session token in this PR because the proxy URL allowlist plus Host/Origin checks cover the concrete proxy-abuse path without changing the local request contract. If we want a stronger CSRF-style boundary for the loopback UI, I would track that as a follow-up design change.
2. DNS rebinding / CheckOrigin Resolved
3. Stream registration race after cleanup Resolved
4. Panic in per-message goroutines Resolved

Major

# Status Reply
5. Empty Origin accepted Resolved
6. openUrlInBrowser arbitrary URI launcher Resolved
7. WS frame size / deadlines Partially addressed Added SetReadLimit, read deadline, pong deadline refresh, periodic ping, and write deadline. The frame limit is bounded but set at 16 MB to avoid breaking larger local prompt/request payloads. I did not add a concurrent session cap because the server is loopback-only and now has bounded frame size/deadline behavior; we can add a cap later if there is a concrete stress case.
8. Body logging with --silent Resolved
9. Inspector lifetime after Ctrl+C Verified, no code change Verified with azd-dev ai agent run: Ctrl+C releases both 8087 and 8088. I did not observe a stale inspector process or port collision.
10. SSESink deadlock / goroutine leak Resolved
11. Streaming response body close Resolved
12. Default behavior docs / gating / telemetry Partially addressed Added agents README documentation for the default inspector launch and --no-inspector. I did not update the changelog because changelogs are maintained with release/version updates. I also did not add a global env opt-out, preview gate, or telemetry in this PR. Since this is a local developer convenience launched from an explicit local run command, I think the per-command --no-inspector opt-out is sufficient for this iteration.
13. CSP and hardening headers Resolved
14. OSS attribution Resolved
15. Tests for inspector launch failure / --no-inspector Partially addressed Added targeted tests that workflow launch failure only warns and does not fail agent run, and that --no-inspector skips workflow launch. I did not add full E2E coverage in this PR.

Minor

# Status Reply
16. Brittle missing-extension string match Not changed There is now an installed-extension pre-check before launch. The string match is only fallback handling for unexpected workflow errors, so I left it as-is instead of introducing a new typed workflow error contract in this PR.
17. Windows Ctrl+C graceful child shutdown Not changed This is broader process-control behavior. I verified the required lifecycle property for this PR: Ctrl+C releases both the agent and inspector ports.
18. loadAzdEnvironment error swallowed Resolved
19. No size limit on buffered/error bodies Resolved
20. streamHTTPClient response-header timeout Not changed The streaming client is used for long-lived SSE paths and is bound to the per-stream context. Adding a short response-header timeout risks breaking slow local agents; I left this out unless we have a concrete hang case to tune against.
21. apim-request-id to stdout Resolved
22. .golangci.yaml line length Not changed This is lint configuration churn and does not affect the security or runtime behavior addressed in this PR.
23. Browser-open goroutine on Start failure Resolved
24. Asset handler path/fallback hardening Resolved
25. Test coverage gaps Partially addressed Added targeted tests for the current fixes, including body limits and asset fallback behavior. I did not expand this into full E2E or exhaustive SSE/parser coverage in this PR.
26. Test sleep/timeouts Not changed Existing tests are passing; replacing the broader timeout style is test refactoring outside the current fix.
27. Install warning rate-limit Not changed This is product behavior/state management and would add marker-file complexity. The current warning is explicit and actionable.
28. CHANGELOG entry Not changed Changelog updates are release/version-bump work, not part of this PR fix.
29. requiredAzdVersion accuracy Verified, no code change WorkflowService.Run was introduced by #4985 and first appears in stable azd 1.15.0, so >1.23.13 is stricter than the workflow-service minimum. I left the current value because this extension is built and tested against the current azd extension SDK line and stays aligned with the existing azure.ai.agents extension requirement.
30. AGENTS.md error-handling guidance Resolved

Nits

Item Status Reply
srv.URL() uses localhost while listener binds 127.0.0.1 Resolved
cspell.yaml comment for azureaiinspector Resolved
build.sh comment casing Resolved
debug.go close error swallowed Resolved
root_test.go findCommand recursion guard Resolved
run_test.go appendPortEnvVars ordering Resolved

@anchenyi
Copy link
Copy Markdown
Author

There is a \azure-dev.github\workflows\approval-ext-azure-ai-inspector.yml file which controls a github pipeline to block PRs on a separate approval list from the standard codeowners file. This is intended to still get helpful reviews from the broader AZD group (in terms of sticking to AZD standards, GO best practices, etc.) while still blocking the PR on a smaller set of people who know the proper code flow for each extension. We should update that file with folks on your team for proper ownership of this new extension.

In .github/workflows/approval-ext-azure-ai-inspector.yml, add a REQUIRED_APPROVERS environment variable to the step:

       - name: Check for required team approval
         uses: actions/github-script@v9
         env:
           EXTENSION_PATH: "cli/azd/extensions/azure.ai.inspector/"
           WORKFLOW_PATH: ".github/workflows/approval-ext-azure-ai-inspector.yml"
           OVERRIDE_COMMAND: "/inspector-extension-approval override"
           REQUIRED_APPROVERS: '["user1", "user2", "user3"]'  # ← add this
         with:
           script: |
             const script = require('./.github/scripts/pr-approval-foundry-extensions-shared.js');
             await script({ github, context, core });

The shared script (line 31-33) checks for this env var and uses it instead of the default list.

Added "anchenyi", "XiaofuHuang", "swatDong" in the REQUIRED_APPROVERS list

@anchenyi anchenyi requested a review from trangevi May 21, 2026 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

customer-reported identify a customer issue ext-agents azure.ai.{agents,connections,inspector,projects,routines,skills,toolboxes} extensions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add azd ai agent Inspector auto-launch and structured event stream for the agent inner loop

6 participants