Merged
8 changes: 5 additions & 3 deletions architecture/gateway-single-node.md
@@ -188,9 +188,11 @@ After the container starts:
1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
-- Container exits during polling: error includes recent log lines.
-- Container has no HEALTHCHECK instruction: fails immediately.
-- HEALTHCHECK reports unhealthy on final attempt: error includes recent logs.
+   - Container exits during polling: error includes recent log lines.
+   - Container has no HEALTHCHECK instruction: fails immediately.
+   - HEALTHCHECK reports unhealthy on final attempt: error includes recent logs.
+
+The gateway StatefulSet also uses a Kubernetes `startupProbe` on the gRPC port before steady-state liveness and readiness checks begin. This gives single-node k3s boots extra time to absorb early networking and flannel initialization delay without restarting the gateway pod too aggressively.
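
The health wait described in step 3 above (up to 180 polls, 2 seconds apart, with three failure modes) can be sketched roughly as follows. This is a minimal illustration, not the actual `wait_for_gateway_ready()` implementation: `Health`, `WaitError`, and `wait_until_healthy` are hypothetical names, the probe is a plain closure instead of a Docker API call, and the log capture that the real errors include is left out.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical view of what one poll of the container can report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Starting,
    Healthy,
    Unhealthy,
    NoHealthcheck,
    Exited,
}

/// Hypothetical error type mirroring the three failure modes in the text.
#[derive(Debug, PartialEq, Eq)]
enum WaitError {
    ContainerExited,
    NoHealthcheck,
    TimedOut { last: Health },
}

/// Polls `probe` up to `attempts` times, sleeping `interval` between polls.
/// Returns the attempt number on which the container became healthy.
fn wait_until_healthy<F>(mut probe: F, attempts: u32, interval: Duration) -> Result<u32, WaitError>
where
    F: FnMut() -> Health,
{
    let mut last = Health::Starting;
    for attempt in 1..=attempts {
        last = probe();
        match last {
            Health::Healthy => return Ok(attempt),
            // These two cases fail immediately instead of waiting out the budget.
            Health::Exited => return Err(WaitError::ContainerExited),
            Health::NoHealthcheck => return Err(WaitError::NoHealthcheck),
            // Unhealthy is only fatal once the final attempt has been used.
            Health::Starting | Health::Unhealthy => {
                if attempt < attempts {
                    sleep(interval);
                }
            }
        }
    }
    Err(WaitError::TimedOut { last })
}

fn main() {
    // Simulated probe that becomes healthy on the third poll.
    let mut calls = 0;
    let result = wait_until_healthy(
        || {
            calls += 1;
            if calls < 3 { Health::Starting } else { Health::Healthy }
        },
        180,
        Duration::from_millis(1),
    );
    println!("{result:?}");
}
```

With the documented values (180 attempts, 2 second interval) the worst-case wait is about 180 × 2 s = 360 s, which matches the "6 min total" figure.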

### 5) mTLS bundle capture

42 changes: 40 additions & 2 deletions crates/openshell-server/src/lib.rs
@@ -25,9 +25,10 @@ mod ws_tunnel;

 use openshell_core::{Config, Error, Result};
 use std::collections::HashMap;
+use std::io::ErrorKind;
 use std::sync::{Arc, Mutex};
 use tokio::net::TcpListener;
-use tracing::{error, info};
+use tracing::{debug, error, info};

 pub use grpc::OpenShellService;
 pub use http::{health_router, http_router};
@@ -67,6 +68,13 @@ pub struct ServerState {
     pub ssh_connections_by_sandbox: Mutex<HashMap<String, u32>>,
 }

+fn is_benign_tls_handshake_failure(error: &std::io::Error) -> bool {
+    matches!(
+        error.kind(),
+        ErrorKind::UnexpectedEof | ErrorKind::ConnectionReset
+    )
+}
+
 impl ServerState {
     /// Create new server state.
     #[must_use]
@@ -198,7 +206,11 @@ pub async fn run_server(config: Config, tracing_log_bus: TracingLogBus) -> Resul
                         }
                     }
                     Err(e) => {
-                        error!(error = %e, client = %addr, "TLS handshake failed");
+                        if is_benign_tls_handshake_failure(&e) {
+                            debug!(error = %e, client = %addr, "TLS handshake closed early");
+                        } else {
+                            error!(error = %e, client = %addr, "TLS handshake failed");
+                        }
                     }
                 }
             });
@@ -211,3 +223,29 @@ pub async fn run_server(config: Config, tracing_log_bus: TracingLogBus) -> Resul
         }
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::is_benign_tls_handshake_failure;
+    use std::io::{Error, ErrorKind};
+
+    #[test]
+    fn classifies_probe_style_tls_disconnects_as_benign() {
+        for kind in [ErrorKind::UnexpectedEof, ErrorKind::ConnectionReset] {
+            let error = Error::new(kind, "probe disconnected");
+            assert!(is_benign_tls_handshake_failure(&error));
+        }
+    }
+
+    #[test]
+    fn preserves_real_tls_failures_as_errors() {
+        for kind in [
+            ErrorKind::InvalidData,
+            ErrorKind::PermissionDenied,
+            ErrorKind::Other,
+        ] {
+            let error = Error::new(kind, "real tls failure");
+            assert!(!is_benign_tls_handshake_failure(&error));
+        }
+    }
+}
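
To illustrate why a `tcpSocket` probe would land in the benign bucket, the sketch below reproduces the socket-level behavior with the standard library only: a client connects and closes without sending any bytes, and the server then tries to read the 5-byte TLS record header. This is an assumption-laden model, not the server's code path: there is no real TLS stack involved, `probe_style_disconnect_kind` and `is_benign` are hypothetical helpers, and an actual TLS library may wrap or map the error differently.

```rust
use std::io::{Error, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

/// Connects the way a TCP-only probe would (open, then close without sending
/// data) and returns the error kind the server side sees when it tries to
/// read the first TLS record header.
fn probe_style_disconnect_kind() -> std::io::Result<ErrorKind> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Client side: complete the TCP handshake, then close immediately.
    drop(TcpStream::connect(addr)?);

    // Server side: a TLS handshake begins by reading a 5-byte record header.
    let (mut stream, _) = listener.accept()?;
    let mut header = [0u8; 5];
    match stream.read_exact(&mut header) {
        Ok(()) => Err(Error::new(ErrorKind::Other, "probe unexpectedly sent data")),
        Err(e) => Ok(e.kind()),
    }
}

/// Same classification rule as the diff above, restated on `ErrorKind`
/// so this example stays self-contained.
fn is_benign(kind: ErrorKind) -> bool {
    matches!(kind, ErrorKind::UnexpectedEof | ErrorKind::ConnectionReset)
}

fn main() -> std::io::Result<()> {
    let kind = probe_style_disconnect_kind()?;
    println!("kind = {kind:?}, benign = {}", is_benign(kind));
    Ok(())
}
```

Keeping the rule as an allow-list of two kinds means any other handshake error, such as `InvalidData` from a malformed record, still reaches the `error!` branch.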
6 changes: 6 additions & 0 deletions deploy/helm/openshell/templates/statefulset.yaml
@@ -110,6 +110,12 @@ spec:
             - name: grpc
               containerPort: {{ .Values.service.port }}
               protocol: TCP
+          startupProbe:
+            tcpSocket:
+              port: grpc
+            periodSeconds: {{ .Values.probes.startup.periodSeconds }}
+            timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }}
+            failureThreshold: {{ .Values.probes.startup.failureThreshold }}
           livenessProbe:
             tcpSocket:
               port: grpc
4 changes: 4 additions & 0 deletions deploy/helm/openshell/values.yaml
@@ -43,6 +43,10 @@ podLifecycle:
   terminationGracePeriodSeconds: 5

 probes:
+  startup:
+    periodSeconds: 2
+    timeoutSeconds: 1
+    failureThreshold: 30
   liveness:
     initialDelaySeconds: 2
     periodSeconds: 5
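
As a quick check on the startup defaults above: Kubernetes lets a container keep failing its startup probe for roughly `failureThreshold × periodSeconds` before restarting it. The sketch below does only that arithmetic. `StartupProbe` and `max_startup_window_seconds` are hypothetical names for illustration, and the per-probe `timeoutSeconds` is deliberately left out of the estimate.

```rust
/// Hypothetical mirror of the two `probes.startup` values that set the budget.
#[derive(Debug, Clone, Copy)]
struct StartupProbe {
    period_seconds: u32,
    failure_threshold: u32,
}

impl StartupProbe {
    /// Approximate time the container has to start before it is restarted.
    fn max_startup_window_seconds(&self) -> u32 {
        self.failure_threshold * self.period_seconds
    }
}

fn main() {
    // Defaults taken from the values.yaml hunk above.
    let probe = StartupProbe {
        period_seconds: 2,
        failure_threshold: 30,
    };
    println!("startup budget: {} s", probe.max_startup_window_seconds());
}
```

With the chart defaults this gives 30 × 2 s = 60 s of startup budget before liveness and readiness checks take over.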