feat(server): metrics instrumentation #909

## Problem Statement

openshell-server currently has no metrics instrumentation — only structured tracing logs exist. As the gateway for all sandbox lifecycle operations, SSH tunneling, supervisor session management, and policy enforcement, it needs comprehensive Prometheus metrics to support production SLOs, alerting, capacity planning, and incident debugging.

Without metrics, operators cannot:
- Define or track SLIs/SLOs
- Set up meaningful alerting (beyond log-based)
- Build operational dashboards
- Debug production incidents with quantitative data
- Plan capacity based on saturation trends

**NOTE:** The list of metrics below is a set of ideas meant to encourage thinking like an SRE who has to support this workload. We do **not** have to implement them all; the goal is to start a discussion about which ones would be most valuable in a first phase.

## Proposed Design

### Crates & Exposition

Use the **`metrics`** facade crate with **`metrics-exporter-prometheus`** for Prometheus exposition — this mirrors the existing `tracing` facade pattern. Add a `/metrics` GET route to the existing `health_router()` in `http.rs` (outside auth, accessible to scrapers). Initialize `PrometheusBuilder` in `run_server()` and store `PrometheusHandle` in `ServerState`.
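
As a rough illustration of what a scraper would receive from `/metrics`, the sketch below renders one counter family in the Prometheus text exposition format using only the standard library. In the proposed design this text would come from the exporter, not from hand-written code; `render_counter`, `escape_label_value`, and the `CreateSandbox` method name are hypothetical and exist only for this example.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Escape a label value for the Prometheus text format:
/// backslash, double quote, and newline must be escaped.
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Render one counter family (hypothetical helper, illustration only).
/// Each sample is (labels, value); the BTreeMap keeps label order stable.
fn render_counter(name: &str, samples: &[(BTreeMap<&str, &str>, u64)]) -> String {
    let mut out = String::new();
    writeln!(out, "# TYPE {name} counter").unwrap();
    for (labels, value) in samples {
        let rendered: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{}\"", escape_label_value(v)))
            .collect();
        if rendered.is_empty() {
            writeln!(out, "{name} {value}").unwrap();
        } else {
            writeln!(out, "{name}{{{}}} {value}", rendered.join(",")).unwrap();
        }
    }
    out
}
```

Every distinct combination of label values becomes its own line (and its own time series), which is why the label sets in the catalog below are kept small.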

For gRPC and HTTP request metrics, implement a Tower middleware layer in `multiplex.rs` that wraps both the `GrpcRouter` and HTTP service — this is the single highest-value instrumentation point.
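
One detail the middleware would need to settle is how label values are derived from the request path. The sketch below assumes gRPC requests use the standard `/package.Service/Method` path shape and that HTTP paths are mapped onto a small fixed set so unknown paths cannot create unbounded series. The function names and the `openshell.v1.OpenShell` service name are hypothetical, and deciding whether a request is gRPC at all (for example by content type) is assumed to happen elsewhere in the layer.

```rust
/// Split a gRPC request path of the form `/package.Service/Method`
/// into (service, method). Returns None for any other shape.
fn grpc_method_label(path: &str) -> Option<(&str, &str)> {
    let rest = path.strip_prefix('/')?;
    let (service, method) = rest.split_once('/')?;
    if service.is_empty() || method.is_empty() || method.contains('/') {
        return None;
    }
    Some((service, method))
}

/// Map an HTTP path onto a bounded set of `path` label values.
/// Anything not explicitly listed collapses into "other".
fn http_path_label(path: &str) -> &'static str {
    match path {
        "/metrics" => "/metrics",
        "/connect/ssh" => "/connect/ssh",
        _ => "other",
    }
}
```

Collapsing unlisted paths into a single value trades some detail for a predictable series count, which matters most for endpoints that are reachable without authentication.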

### Metrics Catalog (16 families, 3 priority tiers)

#### P0 — Critical (Day-1 Paging Metrics)

| # | Metric | Type | Labels | Instrumentation Point |
|---|--------|------|--------|-----------------------|
| 1 | `openshell_grpc_requests_total` | counter | `method`, `code` | `multiplex.rs` Tower layer |
| 1 | `openshell_grpc_request_duration_seconds` | histogram | `method`, `code` | `multiplex.rs` Tower layer |
| 2 | `openshell_http_requests_total` | counter | `path`, `status` | `multiplex.rs` Tower layer |
| 2 | `openshell_http_request_duration_seconds` | histogram | `path`, `status` | `multiplex.rs` Tower layer |
| 3 | `openshell_supervisor_sessions_active` | gauge | — | `supervisor_session.rs` |
| 3 | `openshell_supervisor_session_connects_total` | counter | `superseded` | `supervisor_session.rs` |
| 3 | `openshell_supervisor_session_disconnects_total` | counter | `reason` | `supervisor_session.rs` |
| 4 | `openshell_sandboxes_by_phase` | gauge | `phase` | `compute/mod.rs` |
| 5 | `openshell_relay_opens_total` | counter | `result` | `supervisor_session.rs` |
| 5 | `openshell_relay_claims_total` | counter | `result` | `supervisor_session.rs` |
| 5 | `openshell_relay_pending_count` | gauge | — | `supervisor_session.rs` |
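
For the `*_active` gauges, one common way to keep the increment and decrement paired is a guard that decrements on drop, so early returns and error paths cannot leak a count. The sketch below uses a plain atomic as a stand-in for the real gauge; `Gauge`, `ActiveGuard`, and `track` are hypothetical names, not part of the `metrics` crate.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Stand-in for a gauge such as `openshell_supervisor_sessions_active`.
#[derive(Clone, Default)]
struct Gauge {
    value: Arc<AtomicI64>,
}

impl Gauge {
    fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }

    /// Increment now; the returned guard decrements when it is dropped.
    fn track(&self) -> ActiveGuard {
        self.value.fetch_add(1, Ordering::Relaxed);
        ActiveGuard { gauge: self.clone() }
    }
}

/// Held for the lifetime of a session; dropping it lowers the gauge.
struct ActiveGuard {
    gauge: Gauge,
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        self.gauge.value.fetch_sub(1, Ordering::Relaxed);
    }
}
```

A session handler would hold the guard for as long as the session is registered, for example as a field of the session struct.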

#### P1 — Important (SLO Tracking, Capacity Planning)

| # | Metric | Type | Labels | Instrumentation Point |
|---|--------|------|--------|-----------------------|
| 6 | `openshell_sandbox_phase_transition_duration_seconds` | histogram | `from_phase`, `to_phase` | `compute/mod.rs` |
| 6 | `openshell_sandbox_create_total` | counter | `result` | `compute/mod.rs` |
| 6 | `openshell_sandbox_delete_total` | counter | `result` | `compute/mod.rs` |
| 7 | `openshell_ssh_connections_active` | gauge | `dimension` | `ssh_tunnel.rs` |
| 7 | `openshell_ssh_connection_limit_rejections_total` | counter | `limit_type` | `ssh_tunnel.rs` |
| 7 | `openshell_ssh_tunnel_duration_seconds` | histogram | — | `ssh_tunnel.rs` |
| 7 | `openshell_ssh_sessions_active` | gauge | — | `ssh_tunnel.rs` |
| 7 | `openshell_ssh_sessions_reaped_total` | counter | `reason` | `ssh_tunnel.rs` |
| 8 | `openshell_relay_wait_for_session_duration_seconds` | histogram | `result` | `supervisor_session.rs` |
| 8 | `openshell_relay_claim_latency_seconds` | histogram | — | `supervisor_session.rs` |
| 8 | `openshell_relay_reaped_total` | counter | — | `supervisor_session.rs` |
| 9 | `openshell_compute_watch_restarts_total` | counter | `reason` | `compute/mod.rs` |
| 9 | `openshell_compute_reconcile_duration_seconds` | histogram | — | `compute/mod.rs` |
| 9 | `openshell_compute_orphans_pruned_total` | counter | — | `compute/mod.rs` |
| 9 | `openshell_compute_driver_rpc_duration_seconds` | histogram | `method` | `compute/mod.rs` |
| 10 | `openshell_db_operation_duration_seconds` | histogram | `operation`, `backend` | `persistence/mod.rs` |
| 10 | `openshell_db_errors_total` | counter | `operation`, `backend` | `persistence/mod.rs` |
| 11 | `openshell_policy_merge_attempts_total` | counter | `result` | `grpc/policy.rs` |
| 11 | `openshell_policy_merge_retries` | histogram | — | `grpc/policy.rs` |
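
Several P1 families pair a duration histogram with an error counter (for example the two `openshell_db_*` metrics). A small wrapper that times a fallible operation and records both is one way to keep call sites uniform. The sketch below stores observations in memory instead of calling a metrics backend; `OpStats`, `timed`, and the operation names `get` and `put` are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// In-memory stand-in for a duration histogram plus an error counter,
/// both keyed by an `operation` label.
#[derive(Default)]
struct OpStats {
    durations: HashMap<&'static str, Vec<Duration>>,
    errors: HashMap<&'static str, u64>,
}

impl OpStats {
    /// Run `f`, record how long it took, and count it as an error
    /// if it returned `Err`. The result is passed through unchanged.
    fn timed<T, E>(
        &mut self,
        operation: &'static str,
        f: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        let start = Instant::now();
        let result = f();
        self.durations
            .entry(operation)
            .or_default()
            .push(start.elapsed());
        if result.is_err() {
            *self.errors.entry(operation).or_default() += 1;
        }
        result
    }
}
```

Recording the duration for failed calls as well as successful ones keeps the histogram count comparable with the request count, so error ratios can be computed from either.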

#### P2 — Nice to Have (Deep Operational Insight)

| # | Metric | Type | Instrumentation Point |
|---|--------|------|-----------------------|
| 12 | `openshell_ws_tunnel_connections_active` | gauge | `ws_tunnel.rs` |
| 12 | `openshell_ws_tunnel_bytes_total` | counter | `ws_tunnel.rs` |
| 13 | `openshell_exec_duration_seconds` | histogram | `grpc/sandbox.rs` |
| 13 | `openshell_exec_total` | counter | `grpc/sandbox.rs` |
| 14 | `openshell_tcp_connections_active` | gauge | `lib.rs` |
| 14 | `openshell_tls_handshake_failures_total` | counter | `lib.rs` |
| 15 | `openshell_tracing_bus_subscribers` | gauge | `tracing_bus.rs` |
| 15 | `openshell_tracing_bus_messages_published_total` | counter | `tracing_bus.rs` |
| 16 | Process metrics (RSS, FDs, CPU, uptime) | various | separate collector (e.g. the `metrics-process` crate); `PrometheusBuilder` does not emit these on its own |

### SLI/SLO Definitions

| SLI | PromQL | Target |
|-----|--------|--------|
| gRPC Availability | `1 - sum(rate(openshell_grpc_requests_total{code!="OK"}[5m])) / sum(rate(openshell_grpc_requests_total[5m]))` | 99.9% / 30d |
| gRPC Latency (p99) | `histogram_quantile(0.99, sum by (le) (rate(openshell_grpc_request_duration_seconds_bucket[5m])))` | <500ms (unary) |
| Sandbox Create Success | `sum(rate(openshell_sandbox_create_total{result="success"}[1h])) / sum(rate(openshell_sandbox_create_total[1h]))` | 99.5% |
| Time to Ready (p50) | `histogram_quantile(0.50, sum by (le) (rate(openshell_sandbox_phase_transition_duration_seconds_bucket{to_phase="ready"}[1h])))` | <30s |
| SSH Tunnel Availability | `1 - sum(rate(openshell_http_requests_total{path="/connect/ssh",status=~"5.."}[5m])) / sum(rate(openshell_http_requests_total{path="/connect/ssh"}[5m]))` | 99.9% |
| Relay Claim Rate | `sum(rate(openshell_relay_claims_total{result="success"}[5m])) / sum(rate(openshell_relay_opens_total{result="success"}[5m]))` | >99% |
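
The latency SLIs depend on `histogram_quantile`, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified version of that interpolation, useful for reasoning about bucket boundaries; it is not the Prometheus implementation, and the bucket bounds used in the example are hypothetical.

```rust
/// Estimate quantile `q` from cumulative buckets `(upper_bound, cumulative_count)`,
/// sorted by bound, where the last bound is `f64::INFINITY`.
/// Simplified sketch of linear interpolation within the matching bucket.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if buckets.len() < 2 || total == 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // The rank falls into the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    let (upper, count) = buckets[idx];
    // The first bucket is assumed to start at zero.
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count == prev_count {
        return Some(lower);
    }
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}
```

With cumulative counts of 50, 90, 98, 100, 100 at bounds 0.1, 0.25, 0.5, 1.0, +Inf, the p99 estimate is 0.75s even though every slow request might have taken 0.51s. The estimate can only be as precise as the bucket layout, so placing a bucket boundary exactly at each SLO threshold (0.5s here) lets the target be checked directly from the bucket counts.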

### Key Alerting Rules

| Severity | Condition | Duration |
|----------|-----------|----------|
| **Page** | `openshell_supervisor_sessions_active == 0` (while sandboxes exist) | 2m |
| **Page** | gRPC error ratio > 1% | 5m |
| **Page** | All sandboxes in error, none ready | 5m |
| **Warn** | `openshell_relay_pending_count > 50` | 2m |
| **Warn** | SSH connection limit rejections firing | instant |
| **Warn** | Compute watch restarts > 3 in 10m | — |
| **Warn** | DB p99 > 100ms | 5m |
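
Conditions such as "restarts > 3 in 10m" are evaluated on the increase of a counter over a window, and counters restart from zero whenever the server process restarts. The sketch below shows the reset handling in its simplest form: a drop in value is treated as a reset, and the new value counts as increase since the reset. It ignores the extrapolation that Prometheus applies at window edges, and `counter_increase` is a hypothetical name.

```rust
/// Total increase across counter samples ordered by time.
/// A decrease between two samples is treated as a counter reset,
/// so the later value is counted in full.
fn counter_increase(samples: &[u64]) -> u64 {
    samples
        .windows(2)
        .map(|pair| {
            if pair[1] >= pair[0] {
                pair[1] - pair[0]
            } else {
                pair[1]
            }
        })
        .sum()
}
```

For samples `5, 7, 1, 2` the increase is 4 (two before the reset, then one at the reset and one after it), so a "more than 3" rule would still fire across a process restart.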

### Implementation Phases

1. **Phase 1 — Infrastructure** (single PR): Add `metrics` + `metrics-exporter-prometheus` crates, `/metrics` endpoint, Tower middleware for gRPC/HTTP RED metrics.
2. **Phase 2 — P0 Metrics** (single PR): Instrument supervisor sessions, relay channels, sandbox phase gauge.
3. **Phase 3 — P1 Metrics** (2-3 PRs): SSH tunnel saturation, DB latency, compute driver health, policy merge retries.
4. **Phase 4 — P2 Metrics** (as needed): WebSocket, exec, TCP/TLS, tracing bus, process metrics.

### Critical Files

- `crates/openshell-server/src/multiplex.rs` — Tower middleware (highest-value single point)
- `crates/openshell-server/src/supervisor_session.rs` — Session + relay gauges/counters
- `crates/openshell-server/src/lib.rs` — PrometheusHandle init, ServerState
- `crates/openshell-server/src/http.rs` — `/metrics` endpoint
- `crates/openshell-server/src/compute/mod.rs` — Phase gauge, driver health
- `crates/openshell-server/src/ssh_tunnel.rs` — Connection saturation metrics
- `crates/openshell-server/src/persistence/mod.rs` — DB operation metrics
- `crates/openshell-server/src/grpc/policy.rs` — Policy merge retry metrics

## Alternatives Considered

- **OpenTelemetry (`opentelemetry` + `opentelemetry-prometheus`)**: Heavier dependency, but provides OTLP export for Datadog/Grafana Cloud. Only justified if the deployment stack already uses OTLP collectors. The `metrics` facade is lighter and more idiomatic for Rust.
- **`tracing`-derived metrics (e.g. `tracing-opentelemetry`)**: Could derive counters/histograms from existing tracing spans, but provides less control over label cardinality and histogram buckets. Better suited as a complement to explicit metrics, not a replacement.

## Agent Investigation

Explored the full openshell-server codebase including:
- All gRPC service definitions in `proto/` (openshell.proto, compute_driver.proto, inference.proto, sandbox.proto)
- Server architecture: `lib.rs` (ServerState), `multiplex.rs` (protocol multiplexing), `grpc/` handlers
- Supervisor session system: `supervisor_session.rs` (session registry, relay channels, heartbeats, reaper)
- SSH tunnel: `ssh_tunnel.rs` (connection limits, session reaping, relay integration)
- Compute driver abstraction: `compute/mod.rs` (reconciliation loop, watch stream, orphan pruning)
- Persistence layer: `persistence/mod.rs`, `persistence/postgres.rs`, `persistence/sqlite.rs`
- All Cargo.toml files: confirmed zero metrics-related dependencies exist today
- Searched for any existing counter/histogram/gauge patterns: none found
