feat(server): metrics instrumentation #909

@sjenning

Problem Statement

openshell-server currently has no metrics instrumentation — only structured tracing logs exist. As the gateway for all sandbox lifecycle operations, SSH tunneling, supervisor session management, and policy enforcement, it needs comprehensive Prometheus metrics to support production SLOs, alerting, capacity planning, and incident debugging.

Without metrics, operators cannot:

  • Define or track SLIs/SLOs
  • Set up meaningful alerting (beyond log-based)
  • Build operational dashboards
  • Debug production incidents with quantitative data
  • Plan capacity based on saturation trends

NOTE: The metrics listed below are ideas meant to encourage thinking like an SRE who has to support this workload. We do not have to implement them all; the goal is to start a discussion about which ones would be most valuable in a first phase.

Proposed Design

Crates & Exposition

Use the metrics facade crate with metrics-exporter-prometheus for Prometheus exposition — this mirrors the existing tracing facade pattern. Add a /metrics GET route to the existing health_router() in http.rs (outside auth, accessible to scrapers). Initialize PrometheusBuilder in run_server() and store PrometheusHandle in ServerState.
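
To make the /metrics payload concrete, here is a minimal std-only sketch of the Prometheus text exposition format a scraper would receive. In the proposed design the exporter crate renders this text itself; the helper names `escape_label_value` and `render_counter` and the `CreateSandbox` method value are hypothetical and exist only for illustration.

```rust
use std::fmt::Write;

/// Escape a label value for the Prometheus text format:
/// backslash, double quote, and newline must be escaped.
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Render a single counter sample preceded by its TYPE line.
/// (Hypothetical helper; the real exporter does this internally.)
fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    let mut out = String::new();
    writeln!(out, "# TYPE {name} counter").unwrap();
    out.push_str(name);
    if !labels.is_empty() {
        let pairs: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{}=\"{}\"", k, escape_label_value(v)))
            .collect();
        write!(out, "{{{}}}", pairs.join(",")).unwrap();
    }
    writeln!(out, " {value}").unwrap();
    out
}

fn main() {
    // "CreateSandbox" is an assumed RPC name, not taken from the protos.
    let body = render_counter(
        "openshell_grpc_requests_total",
        &[("method", "CreateSandbox"), ("code", "OK")],
        42,
    );
    print!("{body}");
}
```

A scraper only needs an unauthenticated GET that returns this text, which is why the route can sit on the health router.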

For gRPC and HTTP request metrics, implement a Tower middleware layer in multiplex.rs that wraps both the GrpcRouter and HTTP service — this is the single highest-value instrumentation point.
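
One detail the middleware layer would likely need is label normalization, so that the `method` and `path` labels stay low-cardinality. The sketch below is std-only and makes assumptions: the function names, the allowlist contents beyond `/metrics` and `/connect/ssh`, and the `openshell.v1.OpenShell` service name in the usage are hypothetical. It relies only on the fact that gRPC request paths have the form `/<package>.<Service>/<Method>`.

```rust
/// Extract the RPC name from a gRPC path of the form
/// "/<package>.<Service>/<Method>". Anything else maps to "unknown".
/// A production layer would probably also check the result against the
/// set of known RPCs, since clients can send arbitrary paths.
fn grpc_method_label(path: &str) -> &str {
    let Some(rest) = path.strip_prefix('/') else {
        return "unknown";
    };
    match rest.split_once('/') {
        Some((service, method))
            if !service.is_empty() && !method.is_empty() && !method.contains('/') =>
        {
            method
        }
        _ => "unknown",
    }
}

/// Assumed allowlist of HTTP routes that get their own label value.
const KNOWN_HTTP_PATHS: &[&str] = &["/metrics", "/connect/ssh"];

/// Map a request path to a bounded label value; unlisted paths
/// collapse into "other" so path parameters cannot create new series.
fn http_path_label(path: &str) -> &'static str {
    KNOWN_HTTP_PATHS
        .iter()
        .copied()
        .find(|known| *known == path)
        .unwrap_or("other")
}

fn main() {
    // The service name here is a placeholder, not read from the protos.
    println!("{}", grpc_method_label("/openshell.v1.OpenShell/CreateSandbox"));
    println!("{}", http_path_label("/connect/ssh"));
    println!("{}", http_path_label("/sandboxes/abc123"));
}
```

Collapsing unknown paths into a single value trades some detail for a hard upper bound on the number of time series the layer can create.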

Metrics Catalog (16 families, 3 priority tiers)

P0 — Critical (Day-1 Paging Metrics)

| # | Metric | Type | Labels | Instrumentation Point |
|---|--------|------|--------|-----------------------|
| 1 | openshell_grpc_requests_total | counter | method, code | multiplex.rs Tower layer |
| 1 | openshell_grpc_request_duration_seconds | histogram | method, code | multiplex.rs Tower layer |
| 2 | openshell_http_requests_total | counter | path, status | multiplex.rs Tower layer |
| 2 | openshell_http_request_duration_seconds | histogram | path, status | multiplex.rs Tower layer |
| 3 | openshell_supervisor_sessions_active | gauge | - | supervisor_session.rs |
| 3 | openshell_supervisor_session_connects_total | counter | superseded | supervisor_session.rs |
| 3 | openshell_supervisor_session_disconnects_total | counter | reason | supervisor_session.rs |
| 4 | openshell_sandboxes_by_phase | gauge | phase | compute/mod.rs |
| 5 | openshell_relay_opens_total | counter | result | supervisor_session.rs |
| 5 | openshell_relay_claims_total | counter | result | supervisor_session.rs |
| 5 | openshell_relay_pending_count | gauge | - | supervisor_session.rs |
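
Gauges such as openshell_supervisor_sessions_active drift if a decrement is missed on an error path. One common way to avoid that is an RAII guard that increments on creation and decrements on drop. The sketch below is std-only, with an atomic integer standing in for the real gauge; `ActiveGauge` and `ActiveGuard` are hypothetical names, not existing types in the codebase.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Stand-in for a gauge like openshell_supervisor_sessions_active.
#[derive(Clone, Default)]
struct ActiveGauge {
    count: Arc<AtomicI64>,
}

impl ActiveGauge {
    /// Current gauge value.
    fn value(&self) -> i64 {
        self.count.load(Ordering::Relaxed)
    }

    /// Increment now; the returned guard decrements when it is dropped,
    /// including on early returns and panics that unwind.
    #[must_use]
    fn track(&self) -> ActiveGuard {
        self.count.fetch_add(1, Ordering::Relaxed);
        ActiveGuard { gauge: self.clone() }
    }
}

/// Held for the lifetime of one session.
struct ActiveGuard {
    gauge: ActiveGauge,
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        self.gauge.count.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    let sessions = ActiveGauge::default();
    let first = sessions.track();
    {
        let _second = sessions.track();
        println!("active inside scope: {}", sessions.value());
    }
    drop(first);
    println!("active after all sessions ended: {}", sessions.value());
}
```

Storing the guard inside the session struct ties the gauge to the session's lifetime, so no explicit decrement call is needed at disconnect sites.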

P1 — Important (SLO Tracking, Capacity Planning)

| # | Metric | Type | Labels | Instrumentation Point |
|---|--------|------|--------|-----------------------|
| 6 | openshell_sandbox_phase_transition_duration_seconds | histogram | from_phase, to_phase | compute/mod.rs |
| 6 | openshell_sandbox_create_total | counter | result | compute/mod.rs |
| 6 | openshell_sandbox_delete_total | counter | result | compute/mod.rs |
| 7 | openshell_ssh_connections_active | gauge | dimension | ssh_tunnel.rs |
| 7 | openshell_ssh_connection_limit_rejections_total | counter | limit_type | ssh_tunnel.rs |
| 7 | openshell_ssh_tunnel_duration_seconds | histogram | - | ssh_tunnel.rs |
| 7 | openshell_ssh_sessions_active | gauge | - | ssh_tunnel.rs |
| 7 | openshell_ssh_sessions_reaped_total | counter | reason | ssh_tunnel.rs |
| 8 | openshell_relay_wait_for_session_duration_seconds | histogram | result | supervisor_session.rs |
| 8 | openshell_relay_claim_latency_seconds | histogram | - | supervisor_session.rs |
| 8 | openshell_relay_reaped_total | counter | - | supervisor_session.rs |
| 9 | openshell_compute_watch_restarts_total | counter | reason | compute/mod.rs |
| 9 | openshell_compute_reconcile_duration_seconds | histogram | - | compute/mod.rs |
| 9 | openshell_compute_orphans_pruned_total | counter | - | compute/mod.rs |
| 9 | openshell_compute_driver_rpc_duration_seconds | histogram | method | compute/mod.rs |
| 10 | openshell_db_operation_duration_seconds | histogram | operation, backend | persistence/mod.rs |
| 10 | openshell_db_errors_total | counter | operation, backend | persistence/mod.rs |
| 11 | openshell_policy_merge_attempts_total | counter | result | grpc/policy.rs |
| 11 | openshell_policy_merge_retries | histogram | - | grpc/policy.rs |

P2 — Nice to Have (Deep Operational Insight)

| # | Metric | Type | Instrumentation Point |
|---|--------|------|-----------------------|
| 12 | openshell_ws_tunnel_connections_active | gauge | ws_tunnel.rs |
| 12 | openshell_ws_tunnel_bytes_total | counter | ws_tunnel.rs |
| 13 | openshell_exec_duration_seconds | histogram | grpc/sandbox.rs |
| 13 | openshell_exec_total | counter | grpc/sandbox.rs |
| 14 | openshell_tcp_connections_active | gauge | lib.rs |
| 14 | openshell_tls_handshake_failures_total | counter | lib.rs |
| 15 | openshell_tracing_bus_subscribers | gauge | tracing_bus.rs |
| 15 | openshell_tracing_bus_messages_published_total | counter | tracing_bus.rs |
| 16 | Process metrics (RSS, FDs, CPU, uptime) | various | process collector such as the metrics-process crate (PrometheusBuilder does not emit these by itself) |

SLI/SLO Definitions

| SLI | PromQL | Target |
|-----|--------|--------|
| gRPC Availability | 1 - sum(rate(openshell_grpc_requests_total{code!="OK"}[5m])) / sum(rate(openshell_grpc_requests_total[5m])) | 99.9% / 30d |
| gRPC Latency (p99) | histogram_quantile(0.99, sum by (le) (rate(openshell_grpc_request_duration_seconds_bucket[5m]))) | <500ms (unary) |
| Sandbox Create Success | sum(rate(openshell_sandbox_create_total{result="success"}[1h])) / sum(rate(openshell_sandbox_create_total[1h])) | 99.5% |
| Time to Ready (p50) | histogram_quantile(0.50, sum by (le) (rate(openshell_sandbox_phase_transition_duration_seconds_bucket{to_phase="ready"}[1h]))) | <30s |
| SSH Tunnel Availability | 1 - sum(rate(openshell_http_requests_total{path="/connect/ssh",status=~"5.."}[5m])) / sum(rate(openshell_http_requests_total{path="/connect/ssh"}[5m])) | 99.9% |
| Relay Claim Rate | sum(rate(openshell_relay_claims_total{result="success"}[5m])) / sum(rate(openshell_relay_opens_total{result="success"}[5m])) | >99% |
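
To put the targets in perspective, the arithmetic behind an availability SLO can be sketched as follows. This is an illustration only: the function names are hypothetical, and the request counts in `main` are made-up inputs rather than measurements.

```rust
/// Minutes of full unavailability permitted by an availability target
/// over a window, e.g. target = 0.999 over 30 days.
fn error_budget_minutes(target: f64, window_days: f64) -> f64 {
    (1.0 - target) * window_days * 24.0 * 60.0
}

/// Availability SLI as the fraction of requests that did not fail.
/// Returns None when there was no traffic, since the ratio is undefined.
fn availability(total: u64, failed: u64) -> Option<f64> {
    if total == 0 {
        return None;
    }
    Some(1.0 - failed as f64 / total as f64)
}

fn main() {
    // 99.9% over 30 days leaves roughly 43.2 minutes of budget.
    println!("budget: {:.1} min", error_budget_minutes(0.999, 30.0));

    // Illustrative counts only.
    match availability(200_000, 150) {
        Some(sli) => println!("availability: {:.4}%", sli * 100.0),
        None => println!("no traffic in window"),
    }
}
```

The no-traffic case matters for alerting as well: a ratio query over an idle window returns no data rather than 100%, so rules built on these SLIs need to decide how to treat gaps.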

Key Alerting Rules

| Severity | Condition | Duration |
|----------|-----------|----------|
| Page | openshell_supervisor_sessions_active == 0 (while sandboxes exist) | 2m |
| Page | gRPC error ratio > 1% | 5m |
| Page | All sandboxes in error, none ready | 5m |
| Warn | openshell_relay_pending_count > 50 | 2m |
| Warn | SSH connection limit rejections firing | instant |
| Warn | Compute watch restarts > 3 in 10m | - |
| Warn | DB p99 > 100ms | 5m |
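
The Duration column reads as "the condition must hold continuously for this long before the alert fires", which is how Prometheus-style `for` clauses behave. The std-only sketch below models that behaviour; `HoldFor` is a hypothetical name, timestamps are plain seconds, and this is not how an alerting backend is actually implemented.

```rust
/// Tracks whether a condition has held without interruption for a
/// required duration (all times in seconds).
struct HoldFor {
    required_secs: u64,
    since: Option<u64>,
}

impl HoldFor {
    fn new(required_secs: u64) -> Self {
        Self { required_secs, since: None }
    }

    /// Feed one evaluation. Returns true once the condition has been
    /// true for at least `required_secs`; any false sample resets it.
    fn observe(&mut self, now_secs: u64, condition: bool) -> bool {
        if !condition {
            self.since = None;
            return false;
        }
        let start = *self.since.get_or_insert(now_secs);
        now_secs.saturating_sub(start) >= self.required_secs
    }
}

fn main() {
    // Page after 2m of zero active supervisor sessions.
    let mut no_sessions = HoldFor::new(120);
    for (t, sessions_active) in [(0u64, 0i64), (60, 0), (120, 0), (180, 3)] {
        let firing = no_sessions.observe(t, sessions_active == 0);
        println!("t={t}s sessions={sessions_active} firing={firing}");
    }
}
```

A required duration of zero corresponds to the "instant" row: the alert fires on the first evaluation where the condition is true.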

Implementation Phases

  1. Phase 1 — Infrastructure (single PR): Add metrics + metrics-exporter-prometheus crates, /metrics endpoint, Tower middleware for gRPC/HTTP RED metrics.
  2. Phase 2 — P0 Metrics (single PR): Instrument supervisor sessions, relay channels, sandbox phase gauge.
  3. Phase 3 — P1 Metrics (2-3 PRs): SSH tunnel saturation, DB latency, compute driver health, policy merge retries.
  4. Phase 4 — P2 Metrics (as needed): WebSocket, exec, TCP/TLS, log bus, process metrics.

Critical Files

  • crates/openshell-server/src/multiplex.rs: Tower middleware (highest-value single point)
  • crates/openshell-server/src/supervisor_session.rs: Session + relay gauges/counters
  • crates/openshell-server/src/lib.rs: PrometheusHandle init, ServerState
  • crates/openshell-server/src/http.rs: /metrics endpoint
  • crates/openshell-server/src/compute/mod.rs: Phase gauge, driver health
  • crates/openshell-server/src/ssh_tunnel.rs: Connection saturation metrics
  • crates/openshell-server/src/persistence/mod.rs: DB operation metrics
  • crates/openshell-server/src/grpc/policy.rs: Policy merge retry metrics

Alternatives Considered

  • OpenTelemetry (opentelemetry + opentelemetry-prometheus): Heavier dependency, but provides OTLP export for Datadog/Grafana Cloud. Only justified if the deployment stack already uses OTLP collectors. The metrics facade is lighter and more idiomatic for Rust.
  • tracing-derived metrics (e.g. tracing-opentelemetry): Could derive counters/histograms from existing tracing spans, but provides less control over label cardinality and histogram buckets. Better suited as a complement to explicit metrics, not a replacement.

Agent Investigation

Explored the full openshell-server codebase including:

  • All gRPC service definitions in proto/ (openshell.proto, compute_driver.proto, inference.proto, sandbox.proto)
  • Server architecture: lib.rs (ServerState), multiplex.rs (protocol multiplexing), grpc/ handlers
  • Supervisor session system: supervisor_session.rs (session registry, relay channels, heartbeats, reaper)
  • SSH tunnel: ssh_tunnel.rs (connection limits, session reaping, relay integration)
  • Compute driver abstraction: compute/mod.rs (reconciliation loop, watch stream, orphan pruning)
  • Persistence layer: persistence/mod.rs, persistence/postgres.rs, persistence/sqlite.rs
  • All Cargo.toml files: confirmed zero metrics-related dependencies exist today
  • Searched for any existing counter/histogram/gauge patterns: none found
