
feat: v2.0.0 — MQTT events, agent hardening, and JSON contract#6

Merged
chenliuyun merged 16 commits into main from feat/agent-hardening
Apr 19, 2026

Conversation

@chenliuyun
Collaborator

Summary

This PR bumps 1.3.2 → 2.0.0 and lands four phases of improvements plus a set of contract fixes discovered during review.

Breaking changes

  1. Top-level JSON envelope — every --json response is now {schemaVersion:'1.1', data:...} or {schemaVersion:'1.1', error:...}. Consumers that read parsed.foo must now read parsed.data.foo (or parsed.error on failure).
  2. batch.failed[].error shape: string → ErrorPayload object. Read .message for the old string content; use .transient / .retryAfterMs for retry decisions.
  3. HTTP MCP default bind: 0.0.0.0 → 127.0.0.1. Pass --bind 0.0.0.0 --auth-token <token> to restore external reachability.

New features (Phases A–I, already on branch since v1.3.2)

  • Phase A — HTTP auth (Bearer token), safe-by-default bind, CORS, rate limiting
  • Phase B — Idempotency keys end-to-end (CLI + batch + MCP)
  • Phase C — MQTT client + EventSubscriptionManager infrastructure
  • Phase D — Richer error payloads (kind, transient, retryAfterMs, errorClass)
  • Phase F: account_overview MCP tool for agent cold-start
  • Phase G — Health (/healthz, /ready), metrics (/metrics), structured logging (pino)
  • Phase H — Docker + systemd deployment artifacts
  • Phase I — Tool descriptions, schema-versioning docs, agent guide

Contract fixes (this fixup wave)

  • Phase 1 — Per-request profile routing via AsyncLocalStorage — multi-tenant HTTP now actually routes each request to the correct SwitchBot account.
  • Phase 2 — Top-level {schemaVersion, data|error} envelope (the breaking change). All ~20 printJson callsites now wrap automatically.
  • Phase 3: EventSubscriptionManager properly initialized from env vars (SWITCHBOT_MQTT_*); /ready returns 503 + reason:'mqtt disabled' when MQTT creds absent; /metrics adds switchbot_mqtt_state{state=...} gauge; switchbot://events MCP resource registered.
  • Phase 4 — Dead scheduleStableEvent timer removed; swallowed JSON parse errors in MQTT shadow handler replaced with log.debug; no-op try/rethrow removed.
  • Phase 5 — Type safety (NodeJS.ErrnoException, Device shape); new tests for IdempotencyCache, logger, EventSubscriptionManager defaults, /ready + /metrics health endpoints.

Migration guide (consumers of the JSON output)

| Before (≤ 1.3.2) | After (2.0.0) |
| --- | --- |
| parsed.foo | parsed.data.foo |
| parsed.error (on stderr) | parsed.error.message etc. |
| parsed.failed[i].error (string) | parsed.failed[i].error.message |
| mcp serve binds 0.0.0.0 | mcp serve binds 127.0.0.1; add --bind 0.0.0.0 --auth-token $T |
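
For consumers, the migration above mostly amounts to unwrapping one extra level. The following is a minimal sketch of a consumer-side helper, not code from this PR: the `Envelope` type and the `unwrap` function are hypothetical names, and only the field names (`schemaVersion`, `data`, `error`, `message`, `transient`, `retryAfterMs`) come from the description.

```typescript
// Hypothetical consumer-side types; only the field names are taken from the PR.
interface ErrorPayload {
  message: string;
  transient?: boolean;
  retryAfterMs?: number;
}

type Envelope<T> =
  | { schemaVersion: string; data: T }
  | { schemaVersion: string; error: ErrorPayload };

// Parse one --json response and return its payload, or throw on an error envelope.
function unwrap<T>(raw: string): T {
  const parsed = JSON.parse(raw) as Envelope<T>;
  if ("error" in parsed) {
    throw new Error(parsed.error.message);
  }
  return parsed.data;
}
```

Code that previously read `parsed.foo` would read `unwrap<{ foo: number }>(stdout).foo` instead.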

Test plan

  • npm run build — clean TypeScript compile, zero errors
  • npm test — 685 tests pass (41 test files)
  • All pre-existing tests updated for envelope (parsed.data.*)
  • New test file tests/commands/mcp-http-health.test.ts: /ready 503, /metrics state gauge, EventSubscriptionManager defaults
  • New test files tests/lib/idempotency.test.ts, tests/logger.test.ts

🤖 Generated with Claude Code

chenliuyun added 16 commits April 19, 2026 15:53
…ting

- MCP HTTP now binds 127.0.0.1 by default (not 0.0.0.0)
- Add --bind <host> flag to override (must have --auth-token for external)
- Add --auth-token <token> flag for Bearer auth (fallback: SWITCHBOT_MCP_TOKEN env)
- Add --cors-origin <url> flag (repeatable) for CORS preflight
- Add --rate-limit <n> flag (default 60 req/min) per profile
- Constant-time token comparison to prevent timing attacks
- Graceful shutdown on SIGTERM/SIGINT with 30s drain timeout
- Startup log now reports the actual bind address (e.g. 'listening on http://127.0.0.1:3030/mcp')
- All tests pass (659/659)
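
The commit does not show how the constant-time comparison is implemented. One common approach in Node is to hash both values and compare the digests with `crypto.timingSafeEqual`; the sketch below assumes that approach, and `tokensMatch` and `bearerToken` are hypothetical helper names.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare a presented token with the expected one without leaking how many
// leading characters matched. Hashing first gives both buffers the same
// length, which timingSafeEqual requires.
function tokensMatch(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Extract the token from an "Authorization: Bearer <token>" header value.
function bearerToken(header: string | undefined): string | null {
  const match = /^Bearer\s+(.+)$/i.exec(header ?? "");
  return match ? match[1] : null;
}
```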
- New src/lib/idempotency.ts with LRU cache (1024 entries, 60s TTL)
- Modify executeCommand() to accept optional { idempotencyKey } param
- Thread cache through idempotencyCache.run() for transparent dedup
- No key = always execute (backward compat)
- Expired/new keys trigger fresh execution and cache update
- All tests pass (659/659)
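
The dedup behaviour described above (LRU bound, TTL, no key means always execute) can be sketched as follows. This is an illustration under assumptions, not the contents of src/lib/idempotency.ts: the class shape, the injectable clock, and the choice to cache the in-flight promise are all hypothetical.

```typescript
// A small LRU + TTL cache keyed by idempotency key. Map preserves insertion
// order, so the first key is always the least recently used one.
class IdempotencyCache<T> {
  private entries = new Map<string, { value: Promise<T>; expiresAt: number }>();

  constructor(
    private maxEntries = 1024,
    private ttlMs = 60_000,
    private now: () => number = () => Date.now(), // injectable clock (assumption, for tests)
  ) {}

  run(key: string | undefined, fn: () => Promise<T>): Promise<T> {
    if (key === undefined) return fn(); // no key: always execute
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) {
      // Fresh hit: move the entry to the most-recent position and replay it.
      this.entries.delete(key);
      this.entries.set(key, hit);
      return hit.value;
    }
    // New or expired key: execute and cache the (possibly still pending) result.
    // Simplification: a rejected promise stays cached until it expires.
    const value = fn();
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
    return value;
  }
}
```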
…ency-key-prefix integration

Thread idempotency keys through the CLI interface:
- devices command: add --idempotency-key <key> to replay single commands safely
- devices batch: add --idempotency-key-prefix <prefix> to derive per-device keys

Examples:
  switchbot devices command BOT1 turnOn --idempotency-key abc123
  switchbot devices batch turnOn --ids A,B,C --idempotency-key-prefix batch-001

All 659 tests passing. Backward compatible — idempotency is opt-in.
…ructure

Lay foundation for real-time event streaming:
- src/mqtt/client.ts: New MQTT client with reconnect logic, auth refresh callbacks, state management (connecting/connected/reconnecting/failed)
- src/mcp/events-subscription.ts: Event subscription manager with ring buffer (1000 events), overflow detection, per-subscriber filtering, idle cleanup
- src/commands/mcp.ts: Integrate shared EventSubscriptionManager into HTTP serve mode, with graceful shutdown
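
As a rough illustration of the ring buffer with overflow detection, here is a minimal sketch. `EventRingBuffer`, `recent`, and `takeDropped` are hypothetical names; the actual manager in src/mcp/events-subscription.ts may be structured quite differently.

```typescript
// Bounded buffer that keeps the newest `capacity` events and counts how many
// were discarded, so a synthetic "events.dropped" notice could be emitted.
class EventRingBuffer<T> {
  private items: T[] = [];
  private dropped = 0;

  constructor(private capacity = 1000) {}

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // O(n), acceptable for a sketch at this size
      this.dropped++;
    }
  }

  // Return up to `limit` of the most recent events, oldest first.
  recent(limit: number): T[] {
    return limit > 0 ? this.items.slice(-limit) : [];
  }

  // Return the number of events dropped since the last call, then reset it.
  takeDropped(): number {
    const n = this.dropped;
    this.dropped = 0;
    return n;
  }
}
```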

Features:
- Auth refresh callbacks on reconnect failure for cert rotation scenarios
- Synthetic events for overflow notices (events.dropped) and reconnection (events.reconnected)
- Per-subscriber event filtering using existing filter grammar
- Idle subscriber cleanup after 10 minutes
- Exponential backoff for reconnection (1s, 2s, 4s, ...30s)
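
The backoff schedule listed above (1s, 2s, 4s, ... capped at 30s) corresponds to a doubling delay with a ceiling. A minimal sketch, where `reconnectDelayMs` is a hypothetical name and a zero-based attempt counter is assumed:

```typescript
// Delay before reconnect attempt `attempt` (0-based): doubles each time,
// never exceeding capMs.
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With the defaults, attempts 0 through 5 yield 1000, 2000, 4000, 8000, 16000, and 30000 ms.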

Note: MQTT credential resolution still TBD — awaiting SwitchBot MQTT endpoint documentation.
All 659 tests passing. Foundation ready for event streaming integration.
Add detailed error information to help agents make intelligent retry decisions:
- ErrorPayload: new fields retryAfterMs, transient, errorClass
- ApiError: track Retry-After header value and classify transience
- batch command: failed[] now returns {deviceId, error: ErrorPayload} instead of {deviceId, error: string}
- schemaVersion bumped to "1.1" (backward-compatible additive change)

Error classification:
- transient: true for 429, 5xx, connection timeouts (can retry)
- errorClass: network|api|device-offline|device-busy|guard|usage
- retryAfterMs: parsed from Retry-After header when available

All 659 tests passing. Agents can now examine error.errorClass to branch on error type and use retryAfterMs to determine backoff.
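
An agent-side retry decision based on these fields might look like the sketch below. The field names and errorClass values come from the commit message; the `retryDelayMs` helper, its attempt limit, and its fallback backoff are assumptions for illustration.

```typescript
type ErrorClass = "network" | "api" | "device-offline" | "device-busy" | "guard" | "usage";

interface ErrorPayload {
  message: string;
  transient?: boolean;
  retryAfterMs?: number;
  errorClass?: ErrorClass;
}

// Return how long to wait before retrying, or null when the call should not
// be retried (non-transient error or attempt budget exhausted).
function retryDelayMs(error: ErrorPayload, attempt: number, maxAttempts = 3): number | null {
  if (!error.transient || attempt >= maxAttempts) return null;
  // Prefer the server-provided hint; otherwise fall back to capped doubling.
  return error.retryAfterMs ?? Math.min(30_000, 1000 * 2 ** attempt);
}
```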
Add account_overview MCP tool and CLI command for bootstrap initialization:
- Bundles: device list, IR remotes, scenes, quota usage, cache status, MQTT state
- Single call replaces: list_devices + list_scenes + quota status + cache show
- Includes MQTT connection state in HTTP mode (eventManager.getState())
- schemaVersion 1.1, version 1.7.0 in response

Useful for:
- Agent cold-start (one call to understand account state)
- Periodic health checks (cache age, quota, MQTT connection)
- Integration debugging

All 659 tests passing.
Add observability infrastructure for production monitoring:
- src/logger.ts: pino logger factory (LOG_LEVEL, LOG_FORMAT env vars)
- /healthz endpoint: always 200, returns {ok, version, pid, uptimeSec}
- /ready endpoint: 200 when MQTT connected, 503 otherwise
- /metrics endpoint: Prometheus text format (0.0.4) with gauges:
  - switchbot_mqtt_connected
  - switchbot_mqtt_subscribers
  - process_uptime_seconds

No debug logging added yet (deferred to Phase G part 2 when needed).
Health endpoints bypass auth/rate limiting for orchestrator liveness probes.

All 659 tests passing.
Add production deployment files:
- Dockerfile: multi-stage build, Node 20-alpine, unprivileged user (10001), healthcheck
- docker-compose.example.yml: example setup with env vars, healthcheck
- contrib/systemd/switchbot-mcp.service: systemd unit with hardening (ProtectSystem, PrivateTmp)

Usage:
  docker build -t switchbot:1.7 .
  docker-compose --env-file .env up

Or systemd:
  sudo cp contrib/systemd/switchbot-mcp.service /etc/systemd/system/
  sudo systemctl enable --now switchbot-mcp

All 659 tests passing.
Improve agent developer experience with richer documentation:
- Upgraded tool descriptions for send_command and list_devices (120+ chars with context)
- docs/schema-versioning.md: explains v1→v1.1 backward-compatibility and migration path
- Clarified that schemaVersion "1.1" is backward-compatible with "1" parsers

Schema versioning policy:
- Additive changes (new optional fields) → minor bump (1.1, 1.2, ...)
- Breaking changes → major bump (2.0)
- Parsers pinning "1" continue to work on 1.1+ (backward-compatible)
- Migration guide included for v1.6 → v1.7 (batch error payload change)
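
Under this policy, a parser that pins a major version only needs to compare the part before the first dot. A minimal sketch, with `isCompatible` as a hypothetical helper name:

```typescript
// True when `schemaVersion` ("1", "1.1", "2.0", ...) has the pinned major version.
function isCompatible(schemaVersion: string, pinnedMajor: number): boolean {
  const [majorPart = ""] = schemaVersion.split(".");
  return Number.parseInt(majorPart, 10) === pinnedMajor;
}
```

A parser pinned to major version 1 would accept both "1" and "1.1" and reject "2.0".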

All 659 tests passing.
Bumps package.json 1.7.0 → 2.0.0 and refreshes the hard-coded version
strings inside the MCP server, /healthz, /ready, and account_overview.

Adds tsconfig.build.json (sourceMap:false, declaration:false) plus a
build:prod + clean + prepublishOnly pipeline so the published tarball
drops .js.map and .d.ts files. Result against the prior build:

- package size: 140.2 kB → 83.0 kB (−41%)
- unpacked:     622.7 kB → 328.1 kB (−47%)
- files:        144 → 45

A CLI binary has no consumers that import its types or need shipped
source maps; local dev still emits both via the default tsc target.

Version 2.0.0 is the first npm release after 1.3.2 and carries three
breaking changes that land over the following commits: JSON envelope
with top-level schemaVersion, batch.failed[].error shape from string
to object, and HTTP MCP default bind flipped to 127.0.0.1.
Previously, HTTP MCP requests extracted x-switchbot-profile / ?profile
but used the value only as a rate-limit bucket key. Every tool call
then resolved credentials via the process-global --profile flag in
loadConfig(), so multi-tenant HTTP deployments silently collapsed all
traffic onto the default account.

This change introduces src/lib/request-context.ts — a tiny
AsyncLocalStorage wrapper with withRequestContext() and
getActiveProfile(). loadConfig() and configFilePath() now read the
active profile via getActiveProfile(), which prefers the ALS context
and falls back to the CLI flag when no HTTP context is active. The
HTTP handler wraps each request in withRequestContext so tool calls
land in the right account.

Also rejects unknown profiles with 401 before entering MCP dispatch,
so probing for valid profile names is closed off and agents get a
clear error instead of a confusing credentials-missing exit.

Stdio mode is unchanged: no request context, so getActiveProfile()
goes straight to the flag lookup.
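
The routing described above can be sketched with Node's `AsyncLocalStorage`. The function names `withRequestContext` and `getActiveProfile` are from the commit message; their signatures, the `RequestContext` shape, and the `cliProfileFlag` stand-in for the process-global --profile flag are assumptions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface RequestContext {
  profile: string;
}

const storage = new AsyncLocalStorage<RequestContext>();

// Stand-in for the process-global --profile flag (assumption).
let cliProfileFlag: string | undefined = "default";

// Run `fn` with `ctx` visible to everything it calls, including async work
// started inside it.
function withRequestContext<T>(ctx: RequestContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Prefer the per-request profile; fall back to the CLI flag when no request
// context is active (for example in stdio mode).
function getActiveProfile(): string | undefined {
  return storage.getStore()?.profile ?? cliProfileFlag;
}
```

Because each `storage.run` call carries its own store, concurrent requests for different profiles do not observe each other's value.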

Tests: tests/lib/request-context.test.ts covers concurrent isolation,
nested contexts, and flag fallback.
…lope

Every --json response now emits {schemaVersion:'1.1', data:...} on
success and {schemaVersion:'1.1', error:...} on failure, fulfilling
the contract documented in docs/schema-versioning.md.

- src/utils/output.ts: printJson wraps payload in {schemaVersion, data};
  handleError JSON branch wraps in {schemaVersion, error}
- src/commands/capabilities.ts: switch raw console.log to printJson
- src/commands/schema.ts: drop non-json-mode raw branch, always use printJson
- docs/schema-versioning.md: add envelope shape examples, migration guide
  from v1.x, note that batch.summary.schemaVersion is the historical
  nested location kept for back-compat
- All test files updated to unwrap .data (success) or .error (failure)
  from the parsed envelope
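
On the producer side, the wrapping could be as small as the following sketch. It is not the code in src/utils/output.ts: the `successEnvelope` and `errorEnvelope` helpers and the injectable `write` parameter are assumptions made so the example is self-contained.

```typescript
const SCHEMA_VERSION = "1.1";

// Wrap a successful payload in the top-level envelope.
function successEnvelope<T>(data: T): { schemaVersion: string; data: T } {
  return { schemaVersion: SCHEMA_VERSION, data };
}

// Wrap a failure; only `message` is filled in here, the real ErrorPayload
// carries more fields.
function errorEnvelope(err: unknown): { schemaVersion: string; error: { message: string } } {
  const message = err instanceof Error ? err.message : String(err);
  return { schemaVersion: SCHEMA_VERSION, error: { message } };
}

// Every callsite goes through one function, so the envelope is applied uniformly.
function printJson(data: unknown, write: (line: string) => void = console.log): void {
  write(JSON.stringify(successEnvelope(data)));
}
```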
…//events resource

- src/mqtt/client.ts: add 'disabled' to MqttState
- src/mqtt/credential.ts: new file — resolve MQTT config from
  SWITCHBOT_MQTT_HOST / USERNAME / PASSWORD env vars; returns null
  when any are absent
- src/mcp/events-subscription.ts: getState() returns 'disabled' (not
  'idle') when no client; add getRecentEvents(limit) to expose ring
  buffer for MCP resource reads
- src/commands/mcp.ts:
  - import getMqttConfig and call eventManager.initialize() on startup
    if creds present; log a warning and leave manager disabled if not
  - remove dead mqttInitialized variable
  - /ready: returns 503 + {ready:false, reason:'mqtt disabled', mqtt:'disabled'}
    when MQTT is not configured; 503 + reason:'mqtt failed' on failure
  - /metrics: add switchbot_mqtt_state{state=...} gauge (one per state)
    so dashboards can distinguish disabled/connecting/connected/failed
  - register switchbot://events MCP resource backed by the ring buffer;
    returns {state, count, events[]} snapshot when read
  - add resources:{} to server capabilities
- tests/commands/mcp-http-health.test.ts: new file covering /ready 503
  + reason, /metrics state gauge, and EventSubscriptionManager defaults
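
The credential check, readiness response, and per-state gauge described in this commit can be sketched together. Several details are assumptions: the full env var names SWITCHBOT_MQTT_USERNAME and SWITCHBOT_MQTT_PASSWORD are inferred from the abbreviated list, the 200 body and the reason strings for states other than 'disabled' and 'failed' are guesses, and `readiness` and `mqttStateGauge` are hypothetical names.

```typescript
type MqttState = "disabled" | "connecting" | "connected" | "reconnecting" | "failed";

interface MqttConfig {
  host: string;
  username: string;
  password: string;
}

// Resolve MQTT config from the environment; null when any value is absent.
function getMqttConfig(env: Record<string, string | undefined>): MqttConfig | null {
  const host = env.SWITCHBOT_MQTT_HOST;
  const username = env.SWITCHBOT_MQTT_USERNAME; // assumed full name
  const password = env.SWITCHBOT_MQTT_PASSWORD; // assumed full name
  if (!host || !username || !password) return null;
  return { host, username, password };
}

// /ready: 200 only when connected, otherwise 503 with a reason.
function readiness(state: MqttState): { status: number; body: Record<string, unknown> } {
  if (state === "connected") return { status: 200, body: { ready: true, mqtt: state } };
  return { status: 503, body: { ready: false, reason: `mqtt ${state}`, mqtt: state } };
}

const STATES: MqttState[] = ["disabled", "connecting", "connected", "reconnecting", "failed"];

// /metrics: one gauge sample per state, 1 for the current state and 0 otherwise.
function mqttStateGauge(current: MqttState): string {
  const lines = ["# TYPE switchbot_mqtt_state gauge"];
  for (const s of STATES) {
    lines.push(`switchbot_mqtt_state{state="${s}"} ${s === current ? 1 : 0}`);
  }
  return lines.join("\n") + "\n";
}
```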
- src/mqtt/client.ts: delete scheduleStableEvent() and its caller in
  onConnect(); the timer body only nulled itself and never emitted
  anything. Also remove the unused stableThresholdMs field.
- src/mcp/events-subscription.ts: replace empty catch {} with
  log.debug({err, topic}, ...) so JSON parse failures on shadow payloads
  are visible at debug level instead of silently discarded; simplify the
  no-op try/rethrow in subscribe() to a direct parseFilter() call.
…qtt and events

Type safety:
- src/mqtt/client.ts: replace (err as any).code with (err as NodeJS.ErrnoException).code
- src/mcp/events-subscription.ts: import Device type and construct a
  Device-compatible shape instead of casting a partial object as any

New tests:
- tests/lib/idempotency.test.ts: LRU eviction, TTL expiry, concurrent
  same-key behavior, undefined-key passthrough, clear()
- tests/logger.test.ts: LOG_LEVEL=warn silences debug; LOG_LEVEL=debug
  enables it; setLogLevel/getLogLevel roundtrip
@chenliuyun chenliuyun merged commit f9e4ca3 into main Apr 19, 2026
3 checks passed
@chenliuyun chenliuyun deleted the feat/agent-hardening branch April 19, 2026 09:19
chenliuyun pushed a commit that referenced this pull request Apr 20, 2026
The v2.4.0 release notes claimed "MCP tools mirror the tier in
meta.agentSafetyTier" but only aggregate_device_history (added in 2.5.0
work) actually exposed it. This fix adds _meta: { agentSafetyTier: <tier> }
to all other 10 MCP tool registrations, matching their CLI safety tiers
from COMMAND_META:
- list_devices, get_device_status, get_device_history, query_device_history,
  list_scenes, search_catalog, describe_device, account_overview: read
- send_command, run_scene: action

Also adds tests/mcp/tool-meta.test.ts to verify every tool has _meta and
spot-check key tiers match expected values.
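
The tier assignment listed above can be expressed as a lookup table. This sketch only illustrates the mapping; `TOOL_TIERS` and `toolMeta` are hypothetical names, and the real registrations presumably derive tiers from COMMAND_META rather than from a literal table.

```typescript
type SafetyTier = "read" | "action";

// Tiers for the ten tools named in the commit message.
const TOOL_TIERS: Record<string, SafetyTier> = {
  list_devices: "read",
  get_device_status: "read",
  get_device_history: "read",
  query_device_history: "read",
  list_scenes: "read",
  search_catalog: "read",
  describe_device: "read",
  account_overview: "read",
  send_command: "action",
  run_scene: "action",
};

// Build the _meta object for a tool registration; fail loudly for tools
// without a registered tier so a missing entry cannot go unnoticed.
function toolMeta(name: string): { agentSafetyTier: SafetyTier } {
  const tier = TOOL_TIERS[name];
  if (!tier) throw new Error(`no safety tier registered for ${name}`);
  return { agentSafetyTier: tier };
}
```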

Fixes bug #6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
chenliuyun pushed a commit that referenced this pull request Apr 20, 2026
Document every fix landed in this branch beyond the history-aggregate
feature: bugs #1, #4, #5, #6, #8, #9, #10, #11, #12, #13, #14, #15,
#16, #17, #18 from the OpenClaw v2.4.0 smoke-test report. Call out
the deferred items (#2, #7) explicitly so readers don't assume they
were overlooked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>