From 158054f46d4b756d41724bf668d4c9bd974e14b8 Mon Sep 17 00:00:00 2001 From: Russell Bryant Date: Fri, 15 May 2026 16:42:12 -0400 Subject: [PATCH] refactor(sandbox): replace iptables with nftables for network policy enforcement Migrate all sandbox and VM driver network policy enforcement from iptables to nftables. nftables provides atomic ruleset loading, a cleaner rule syntax, and is the standard netfilter interface in modern kernels. Sandbox bypass enforcement (openshell-sandbox): - Replace iptables chain of individual rule insertions with a single atomic nftables ruleset load via nft -f - New nft_ruleset module with pure functions for ruleset generation and unit tests - Combine log and reject rules in one inet family table (handles both IPv4 and IPv6 in a single ruleset) - Fall back to reject-only ruleset when kernel lacks nft_log support - Enable net.netfilter.nf_log_all_netns so log rules work from non-init network namespaces - Use temp file for nft ruleset loading instead of stdin for compatibility with minimal VM guest environments VM TAP networking (openshell-driver-vm): - Replace iptables NAT/forwarding rules with nftables equivalents - New nft_ruleset module for TAP network rule generation with unit tests - Atomic table-per-TAP-device lifecycle (create/destroy) - Host-side rules provide NAT infrastructure and defense-in-depth isolation (input chain restricts VM to gateway port only, forward chain blocks unsolicited inbound); primary security enforcement happens inside the VM guest via the sandbox supervisor's own rules VM init script: - Load nft kernel modules at sandbox init - Enable nf_log_all_netns sysctl for bypass detection logging OCSF / docs: - Update firewall rule engine references from iptables to nftables - Document host firewall interaction model and two-layer enforcement architecture in VM driver README and compute drivers reference Closes #1335 Signed-off-by: Russell Bryant --- crates/openshell-driver-podman/NETWORKING.md | 8 +- 
crates/openshell-driver-podman/README.md | 2 +- crates/openshell-driver-vm/README.md | 19 + .../scripts/openshell-vm-sandbox-init.sh | 7 + crates/openshell-driver-vm/src/lib.rs | 1 + crates/openshell-driver-vm/src/nft_ruleset.rs | 92 +++ crates/openshell-driver-vm/src/runtime.rs | 108 ++-- .../src/events/network_activity.rs | 2 +- crates/openshell-ocsf/src/format/shorthand.rs | 4 +- .../src/objects/firewall_rule.rs | 2 +- crates/openshell-sandbox/Cargo.toml | 1 + .../openshell-sandbox/src/bypass_monitor.rs | 16 +- crates/openshell-sandbox/src/lib.rs | 6 +- .../src/sandbox/linux/mod.rs | 1 + .../src/sandbox/linux/netns.rs | 540 ++++-------------- .../src/sandbox/linux/nft_ruleset.rs | 148 +++++ docs/reference/sandbox-compute-drivers.mdx | 6 + docs/security/best-practices.mdx | 2 +- examples/bring-your-own-container/Dockerfile | 4 +- 19 files changed, 464 insertions(+), 505 deletions(-) create mode 100644 crates/openshell-driver-vm/src/nft_ruleset.rs create mode 100644 crates/openshell-sandbox/src/sandbox/linux/nft_ruleset.rs diff --git a/crates/openshell-driver-podman/NETWORKING.md b/crates/openshell-driver-podman/NETWORKING.md index d7f5ed6be..2c976a1c7 100644 --- a/crates/openshell-driver-podman/NETWORKING.md +++ b/crates/openshell-driver-podman/NETWORKING.md @@ -178,7 +178,7 @@ Namespace 2: Rootless Podman network namespace, managed by pasta Namespace 3: Inner sandbox netns, created by supervisor | veth pair, such as 10.200.0.1 <-> 10.200.0.2 - iptables forces ordinary traffic through proxy + nftables forces ordinary traffic through proxy user workload runs here ``` @@ -270,7 +270,7 @@ Container on the Podman bridge | user code runs here | - iptables rules: + nftables rules: ACCEPT -> proxy TCP ACCEPT -> loopback ACCEPT -> established/related @@ -337,7 +337,7 @@ User code in inner netns HTTP_PROXY points at the local sandbox proxy | 2. 
TCP connect to proxy - allowed by iptables as the only ordinary egress destination + allowed by nftables as the only ordinary egress destination | 3. HTTP CONNECT api.example.com:443 | @@ -398,7 +398,7 @@ bind-mounted into sandbox containers by the Podman driver. | Port publishing | Not needed for relay | Ephemeral host port remains in the container spec for compatibility and debug paths. | | TLS | mTLS via Kubernetes secrets | mTLS via mounted client files, RPM defaults, or explicit configuration. | | DNS | Kubernetes CoreDNS | Podman bridge DNS through aardvark-dns when DNS is enabled. | -| Network policy | Kubernetes network policy for pod ingress plus supervisor policy | iptables inside inner sandbox netns plus supervisor policy. | +| Network policy | Kubernetes network policy for pod ingress plus supervisor policy | nftables inside inner sandbox netns plus supervisor policy. | | Supervisor delivery | Kubernetes driver managed pod image or template | OCI image volume mount. | | Secrets | Kubernetes Secret volume and env vars | Mounted TLS client materials from a Podman secret. | diff --git a/crates/openshell-driver-podman/README.md b/crates/openshell-driver-podman/README.md index 1906bd912..db4081194 100644 --- a/crates/openshell-driver-podman/README.md +++ b/crates/openshell-driver-podman/README.md @@ -55,7 +55,7 @@ The restricted agent child does not retain these supervisor privileges. | Capability | Purpose | |---|---| | `SYS_ADMIN` | seccomp filter installation, namespace creation, and Landlock setup. | -| `NET_ADMIN` | Network namespace veth setup, IP address assignment, routes, and iptables. | +| `NET_ADMIN` | Network namespace veth setup, IP address assignment, routes, and nftables. | | `SYS_PTRACE` | Reading `/proc//exe` and walking process ancestry for binary identity. | | `SYSLOG` | Reading `/dev/kmsg` for bypass-detection diagnostics. 
| | `DAC_READ_SEARCH` | Reading `/proc//fd/` across UIDs so the proxy can resolve the binary responsible for a connection. | diff --git a/crates/openshell-driver-vm/README.md b/crates/openshell-driver-vm/README.md index e9900f3bb..2d1f98337 100644 --- a/crates/openshell-driver-vm/README.md +++ b/crates/openshell-driver-vm/README.md @@ -200,6 +200,25 @@ RUST_LOG=openshell_server=debug,openshell_driver_vm=debug \ The VM guest's serial console is appended to `//console.log`. Sandbox IDs must match `[A-Za-z0-9._-]{1,128}` before the driver uses them in host paths. The gateway-owned compute-driver socket lives at `/run/compute-driver.sock`; OpenShell creates `run/` with owner-only permissions, removes same-owner stale sockets, and the gateway removes the socket on clean shutdown via `ManagedDriverProcess::drop`. UDS clients must match the driver UID and provide the expected gateway process PID by default. Standalone same-UID UDS mode requires the explicit `--allow-same-uid-peer` development flag. TCP mode is disabled by default because it is unauthenticated; use `--allow-unauthenticated-tcp --bind-address 127.0.0.1:50061` only for local development. +## Host-side nftables rules + +The VM driver creates a per-VM nftables table on the host (`openshell_vm_vmtap_`) with three chains. These rules serve two purposes: NAT infrastructure (required for VM connectivity) and defense-in-depth host isolation. Primary security enforcement — proxy-only egress and bypass detection — is handled by the sandbox supervisor's own nftables rules inside the VM guest. + +**`postrouting` (NAT):** Masquerades outbound VM traffic so it can be routed from the VM's private subnet to the external network. This chain handles forwarded traffic (VM → internet), not traffic destined for the host. + +**`forward` (defense-in-depth):** Accepts all outbound traffic from the VM (security enforcement happens guest-side) and accepts established/related response traffic back to the VM. 
Drops unsolicited inbound connections to the VM from the broader network. This chain handles forwarded traffic only — packets transiting the host between the TAP interface and other interfaces. + +**`input` (defense-in-depth):** Accepts traffic from the VM to the gateway port on the host. Drops all other traffic from the VM destined for the host itself. This limits what a compromised guest can reach on the host to the gateway service only. + +The `input` and `postrouting` chains handle different traffic paths: `input` covers packets addressed to the host (VM → host), while `postrouting` covers packets the host is forwarding on behalf of the VM (VM → internet). A packet from the VM goes through one path or the other, never both. + +All chains use `policy accept`, so non-TAP traffic is unaffected. Because nftables evaluates multiple base chains on the same hook independently, host firewalls interact with these rules as follows: + +- **Open host (no other firewall):** Our chains are the only filter. The defense-in-depth drop rules block unsolicited inbound and non-gateway host access. Non-TAP traffic passes through. +- **Restrictive host firewall (e.g. firewalld):** The host firewall's chains may additionally drop TAP traffic that our chains accept. A `drop` verdict from any chain is final — our `accept` cannot override it. If VM connectivity fails, verify that the host firewall allows forwarding and input for `vmtap-*` interfaces. + +Each table is created atomically via `nft -f` on VM start and torn down atomically via `nft delete table` when the VM is destroyed. 
+ ## Prerequisites - macOS on Apple Silicon, or Linux on aarch64/x86_64 with KVM diff --git a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh index a6aff191e..c1571dfd9 100644 --- a/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh +++ b/crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh @@ -585,6 +585,13 @@ run_post_overlay_setup() { mount -t cgroup2 cgroup2 "$(root_path /sys/fs/cgroup)" 2>/dev/null & wait + # Allow nftables LOG rules to work in non-init network namespaces. + # Without this, the kernel's nf_log_syslog silently suppresses output + # from the sandbox's network namespace. + if [ -f /proc/sys/net/netfilter/nf_log_all_netns ]; then + echo 1 > /proc/sys/net/netfilter/nf_log_all_netns 2>/dev/null || true + fi + setup_sandbox_workdir configure_hostname diff --git a/crates/openshell-driver-vm/src/lib.rs b/crates/openshell-driver-vm/src/lib.rs index 194dde43c..5b2ddc2bc 100644 --- a/crates/openshell-driver-vm/src/lib.rs +++ b/crates/openshell-driver-vm/src/lib.rs @@ -5,6 +5,7 @@ pub mod driver; mod embedded_runtime; mod ffi; pub mod gpu; +mod nft_ruleset; pub mod procguard; mod rootfs; mod runtime; diff --git a/crates/openshell-driver-vm/src/nft_ruleset.rs b/crates/openshell-driver-vm/src/nft_ruleset.rs new file mode 100644 index 000000000..fe3e86c90 --- /dev/null +++ b/crates/openshell-driver-vm/src/nft_ruleset.rs @@ -0,0 +1,92 @@ +// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + +use std::fmt::Write; + +/// Sanitize a TAP device name for use as an nftables table name suffix. +/// Assumes device names match `vmtap-[a-f0-9]+` (driver-controlled). +fn sanitize_table_name(device: &str) -> String { + device.replace('-', "_") +} + +/// Return the nftables table name for a TAP device. 
+pub fn teardown_table_name(device: &str) -> String { + format!("openshell_vm_{}", sanitize_table_name(device)) +} + +/// Generate the nftables ruleset for VM TAP networking. +pub fn generate_tap_ruleset(tap_device: &str, subnet: &str, gateway_port: u16) -> String { + let table_name = teardown_table_name(tap_device); + let mut ruleset = String::with_capacity(512); + + writeln!(ruleset, "table ip {table_name} {{").unwrap(); + writeln!(ruleset, " chain postrouting {{").unwrap(); + writeln!( + ruleset, + " type nat hook postrouting priority 100; policy accept;" + ) + .unwrap(); + writeln!(ruleset, " ip saddr {subnet} masquerade").unwrap(); + writeln!(ruleset, " }}").unwrap(); + writeln!(ruleset, " chain forward {{").unwrap(); + writeln!( + ruleset, + " type filter hook forward priority 0; policy accept;" + ) + .unwrap(); + writeln!(ruleset, " iifname \"{tap_device}\" accept").unwrap(); + writeln!( + ruleset, + " oifname \"{tap_device}\" ct state related,established accept" + ) + .unwrap(); + writeln!(ruleset, " oifname \"{tap_device}\" drop").unwrap(); + writeln!(ruleset, " }}").unwrap(); + writeln!(ruleset, " chain input {{").unwrap(); + writeln!( + ruleset, + " type filter hook input priority 0; policy accept;" + ) + .unwrap(); + writeln!( + ruleset, + " iifname \"{tap_device}\" tcp dport {gateway_port} accept" + ) + .unwrap(); + writeln!(ruleset, " iifname \"{tap_device}\" drop").unwrap(); + writeln!(ruleset, " }}").unwrap(); + writeln!(ruleset, "}}").unwrap(); + + ruleset +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn generates_tap_setup_ruleset() { + let ruleset = generate_tap_ruleset("vmtap-abcd", "10.0.128.0/30", 8080); + assert!(ruleset.contains("table ip openshell_vm_vmtap_abcd {")); + assert!(ruleset.contains("type nat hook postrouting priority 100; policy accept;")); + assert!(ruleset.contains("ip saddr 10.0.128.0/30 masquerade")); + assert!(ruleset.contains("type filter hook forward priority 0; policy accept;")); + 
assert!(ruleset.contains("iifname \"vmtap-abcd\" accept")); + assert!(ruleset.contains("oifname \"vmtap-abcd\" ct state related,established accept")); + assert!(ruleset.contains("oifname \"vmtap-abcd\" drop")); + assert!(ruleset.contains("type filter hook input priority 0; policy accept;")); + assert!(ruleset.contains("iifname \"vmtap-abcd\" tcp dport 8080 accept")); + } + + #[test] + fn table_name_sanitizes_device_name() { + let ruleset = generate_tap_ruleset("vmtap-abc-123", "10.0.128.0/30", 8080); + assert!(ruleset.contains("table ip openshell_vm_vmtap_abc_123 {")); + } + + #[test] + fn teardown_command_targets_correct_table() { + let cmd = teardown_table_name("vmtap-abcd"); + assert_eq!(cmd, "openshell_vm_vmtap_abcd"); + } +} diff --git a/crates/openshell-driver-vm/src/runtime.rs b/crates/openshell-driver-vm/src/runtime.rs index 4a9053c46..1ce6fb26b 100644 --- a/crates/openshell-driver-vm/src/runtime.rs +++ b/crates/openshell-driver-vm/src/runtime.rs @@ -10,7 +10,7 @@ use std::ptr; use std::sync::atomic::{AtomicI32, Ordering}; use std::time::{Duration, Instant}; -use crate::{embedded_runtime, ffi, procguard, rootfs}; +use crate::{embedded_runtime, ffi, nft_ruleset, procguard, rootfs}; pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR"; @@ -413,6 +413,12 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R enable_ip_forwarding()?; let subnet = tap_subnet_from_host_ip(host_ip); + let table_name = nft_ruleset::teardown_table_name(tap_device); + + // Delete any stale nftables table from a previous driver run. + let _ = run_cmd("nft", &["delete", "table", "ip", &table_name]); + + // Clean up legacy iptables rules from older driver versions. 
let _ = run_cmd( "iptables", &[ @@ -426,27 +432,10 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R "MASQUERADE", ], ); - run_cmd( - "iptables", - &[ - "-t", - "nat", - "-A", - "POSTROUTING", - "-s", - &subnet, - "-j", - "MASQUERADE", - ], - )?; let _ = run_cmd( "iptables", &["-D", "FORWARD", "-i", tap_device, "-j", "ACCEPT"], ); - run_cmd( - "iptables", - &["-A", "FORWARD", "-i", tap_device, "-j", "ACCEPT"], - )?; let _ = run_cmd( "iptables", &[ @@ -462,25 +451,6 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R "ACCEPT", ], ); - run_cmd( - "iptables", - &[ - "-A", - "FORWARD", - "-o", - tap_device, - "-m", - "state", - "--state", - "RELATED,ESTABLISHED", - "-j", - "ACCEPT", - ], - )?; - // Allow guest → host traffic only to the gateway gRPC port. - // Previous versions accepted ALL inbound traffic from the TAP - // interface; scope to the specific port so the guest cannot reach - // other host services. let port_str = gateway_port.to_string(); let _ = run_cmd( "iptables", @@ -488,17 +458,24 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R "-D", "INPUT", "-i", tap_device, "-p", "tcp", "--dport", &port_str, "-j", "ACCEPT", ], ); - run_cmd( + let _ = run_cmd( "iptables", - &[ - "-A", "INPUT", "-i", tap_device, "-p", "tcp", "--dport", &port_str, "-j", "ACCEPT", - ], - )?; + &["-D", "INPUT", "-i", tap_device, "-j", "ACCEPT"], + ); + + // Load nftables ruleset atomically. + let ruleset = nft_ruleset::generate_tap_ruleset(tap_device, &subnet, gateway_port); + run_nft_stdin(&ruleset)?; Ok(()) } fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) { + // Delete the entire nftables table — single atomic operation. + let table_name = nft_ruleset::teardown_table_name(tap_device); + let _ = run_cmd("nft", &["delete", "table", "ip", &table_name]); + + // Clean up legacy iptables rules from older driver versions. 
let subnet = tap_subnet_from_host_ip(host_ip); let _ = run_cmd( "iptables", @@ -519,8 +496,6 @@ fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) { "iptables", &["-D", "FORWARD", "-i", tap_device, "-j", "ACCEPT"], ); - // Remove the port-scoped INPUT rule. Also try the legacy blanket - // rule so stale rules from older driver versions are cleaned up. if gateway_port > 0 { let port_str = gateway_port.to_string(); let _ = run_cmd( @@ -547,6 +522,7 @@ fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) { "MASQUERADE", ], ); + let _ = run_cmd("ip", &["link", "set", tap_device, "down"]); let _ = run_cmd("ip", &["tuntap", "del", "dev", tap_device, "mode", "tap"]); } @@ -583,6 +559,35 @@ fn run_cmd(cmd: &str, args: &[&str]) -> Result<(), String> { } } +fn run_nft_stdin(ruleset: &str) -> Result<(), String> { + use std::io::Write; + + let mut child = StdCommand::new("nft") + .args(["-f", "-"]) + .stdin(Stdio::piped()) + .stdout(Stdio::piped()) + .stderr(Stdio::piped()) + .spawn() + .map_err(|e| format!("failed to run nft: {e}"))?; + + if let Some(mut stdin) = child.stdin.take() { + stdin + .write_all(ruleset.as_bytes()) + .map_err(|e| format!("failed to write nft ruleset: {e}"))?; + } + + let output = child + .wait_with_output() + .map_err(|e| format!("failed to wait for nft: {e}"))?; + + if output.status.success() { + Ok(()) + } else { + let stderr = String::from_utf8_lossy(&output.stderr); + Err(format!("nft -f - failed: {stderr}")) + } +} + /// RAII guard that tears down TAP networking on drop. struct TapGuard { tap_device: String, @@ -715,7 +720,7 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> { // on its own service ports (DNS:53, DHCP, HTTP API:80). 
// // That network plane is also what the sandbox supervisor's - // per-sandbox netns (veth pair + iptables, see + // per-sandbox netns (veth pair + nftables, see // `openshell-sandbox/src/sandbox/linux/netns.rs`) branches off of; // libkrun's built-in TSI socket impersonation would not satisfy // those kernel-level primitives. @@ -1481,4 +1486,17 @@ mod tests { assert_ne!(first, second); } + + #[test] + fn tap_subnet_from_host_ip_calculates_slash30_base() { + assert_eq!(tap_subnet_from_host_ip("10.0.128.1"), "10.0.128.0/30"); + assert_eq!(tap_subnet_from_host_ip("10.0.128.2"), "10.0.128.0/30"); + assert_eq!(tap_subnet_from_host_ip("10.0.128.5"), "10.0.128.4/30"); + } + + #[test] + fn tap_subnet_from_host_ip_handles_invalid_ip() { + let result = tap_subnet_from_host_ip("not-an-ip"); + assert_eq!(result, "not-an-ip/30"); + } } diff --git a/crates/openshell-ocsf/src/events/network_activity.rs b/crates/openshell-ocsf/src/events/network_activity.rs index 6cd125fdc..92450bbe8 100644 --- a/crates/openshell-ocsf/src/events/network_activity.rs +++ b/crates/openshell-ocsf/src/events/network_activity.rs @@ -11,7 +11,7 @@ use crate::objects::{Actor, ConnectionInfo, Endpoint, FirewallRule}; /// OCSF Network Activity Event [4001]. /// -/// Proxy CONNECT tunnel events and iptables-level bypass detection. +/// Proxy CONNECT tunnel events and nftables bypass detection. #[derive(Debug, Clone, PartialEq, Eq, Deserialize)] pub struct NetworkActivityEvent { /// Common base event fields. 
diff --git a/crates/openshell-ocsf/src/format/shorthand.rs b/crates/openshell-ocsf/src/format/shorthand.rs index 08b413429..0e50fc6c5 100644 --- a/crates/openshell-ocsf/src/format/shorthand.rs +++ b/crates/openshell-ocsf/src/format/shorthand.rs @@ -456,7 +456,7 @@ mod tests { actor: Some(Actor { process: Process::new("node", 1234), }), - firewall_rule: Some(FirewallRule::new("bypass-detect", "iptables")), + firewall_rule: Some(FirewallRule::new("bypass-detect", "nftables")), connection_info: Some(ConnectionInfo::new("tcp")), action: Some(ActionId::Denied), disposition: Some(DispositionId::Blocked), @@ -467,7 +467,7 @@ mod tests { let shorthand = event.format_shorthand(); assert_eq!( shorthand, - "NET:REFUSE [MED] DENIED node(1234) -> 93.184.216.34:443/tcp [policy:bypass-detect engine:iptables]" + "NET:REFUSE [MED] DENIED node(1234) -> 93.184.216.34:443/tcp [policy:bypass-detect engine:nftables]" ); } diff --git a/crates/openshell-ocsf/src/objects/firewall_rule.rs b/crates/openshell-ocsf/src/objects/firewall_rule.rs index fa8829275..2e242225b 100644 --- a/crates/openshell-ocsf/src/objects/firewall_rule.rs +++ b/crates/openshell-ocsf/src/objects/firewall_rule.rs @@ -11,7 +11,7 @@ pub struct FirewallRule { /// Rule name (e.g., "default-egress", "bypass-detect"). pub name: String, - /// Rule type / engine (e.g., "mechanistic", "opa", "iptables"). + /// Rule type / engine (e.g., "mechanistic", "opa", "nftables"). /// /// Kept as `String` because this is a project-specific extension field /// (not OCSF-enumerated) with runtime-dynamic values from the policy engine. 
diff --git a/crates/openshell-sandbox/Cargo.toml b/crates/openshell-sandbox/Cargo.toml index 29919ede4..b90a9221b 100644 --- a/crates/openshell-sandbox/Cargo.toml +++ b/crates/openshell-sandbox/Cargo.toml @@ -85,6 +85,7 @@ libc = "0.2" [target.'cfg(target_os = "linux")'.dependencies] landlock = "0.4" seccompiler = "0.5" +tempfile = "3" uuid = { version = "1", features = ["v4"] } [dev-dependencies] diff --git a/crates/openshell-sandbox/src/bypass_monitor.rs b/crates/openshell-sandbox/src/bypass_monitor.rs index 1a7ec5f99..9e37ef27c 100644 --- a/crates/openshell-sandbox/src/bypass_monitor.rs +++ b/crates/openshell-sandbox/src/bypass_monitor.rs @@ -5,15 +5,15 @@ //! detect and report direct connection attempts that bypass the HTTP CONNECT //! proxy. //! -//! When the sandbox network namespace has iptables LOG rules installed (see +//! When the sandbox network namespace has nftables log rules installed (see //! `NetworkNamespace::install_bypass_rules`), the kernel writes a log line for -//! each dropped packet. This module reads those messages, parses the iptables +//! each dropped packet. This module reads those messages, parses the nftables //! LOG format, and emits structured tracing events + denial aggregator entries. //! //! ## Graceful degradation //! //! If `/dev/kmsg` cannot be opened (e.g., restricted container environment), -//! the monitor logs a one-time warning and returns. The iptables REJECT rules +//! the monitor logs a one-time warning and returns. The nftables reject rules //! still provide fast-fail UX — the monitor only adds diagnostic visibility. use crate::denial_aggregator::DenialEvent; @@ -26,7 +26,7 @@ use std::sync::atomic::{AtomicU32, Ordering}; use tokio::sync::mpsc; use tracing::debug; -/// A parsed iptables LOG entry from `/dev/kmsg`. +/// A parsed nftables log entry from `/dev/kmsg`. #[derive(Debug, Clone, PartialEq, Eq)] pub struct BypassEvent { /// Destination IP address. 
@@ -41,7 +41,7 @@ pub struct BypassEvent { pub uid: Option, } -/// Parse an iptables LOG line from `/dev/kmsg`. +/// Parse a nftables log line from `/dev/kmsg`. /// /// Expected format (from the kernel LOG target): /// ```text @@ -74,7 +74,7 @@ pub fn parse_kmsg_line(line: &str, namespace_prefix: &str) -> Option &'static str { /// Spawn the bypass monitor as a background tokio task. /// -/// Uses `dmesg --follow` to tail the kernel ring buffer for iptables LOG +/// Uses `dmesg --follow` to tail the kernel ring buffer for nftables log /// entries matching the given namespace. Falls back gracefully if `dmesg` /// is not available. /// @@ -221,7 +221,7 @@ pub fn spawn( .severity(SeverityId::Medium) .dst_endpoint(dst_ep.clone()) .actor_process(Process::from_bypass(&binary, &binary_pid, &ancestors)) - .firewall_rule("bypass-detect", "iptables") + .firewall_rule("bypass-detect", "nftables") .observation_point(3) .message(format!( "BYPASS_DETECT {}:{} proto={} binary={binary} action=reject reason={reason}", diff --git a/crates/openshell-sandbox/src/lib.rs b/crates/openshell-sandbox/src/lib.rs index e297b9262..016b952bd 100644 --- a/crates/openshell-sandbox/src/lib.rs +++ b/crates/openshell-sandbox/src/lib.rs @@ -509,7 +509,7 @@ pub async fn run_sandbox( let netns = if matches!(policy.network.mode, NetworkMode::Proxy) { match NetworkNamespace::create() { Ok(ns) => { - // Install bypass detection rules (iptables LOG + REJECT). + // Install bypass detection rules (nftables log + reject). // This provides fast-fail UX and diagnostic logging for direct // connection attempts that bypass the HTTP CONNECT proxy. 
let proxy_port = policy @@ -550,7 +550,7 @@ pub async fn run_sandbox( let _netns: Option<()> = None; // Install the supervisor seccomp prelude after privileged startup helpers - // (network namespace setup, iptables probes) complete, but before the SSH + // (network namespace setup, nftables probes) complete, but before the SSH // listener and workload process are exposed. apply_supervisor_startup_hardening()?; @@ -620,7 +620,7 @@ pub async fn run_sandbox( }; // Spawn bypass detection monitor (Linux only, proxy mode only). - // Reads /dev/kmsg for iptables LOG entries and emits structured + // Reads /dev/kmsg for nftables log entries and emits structured // tracing events for direct connection attempts that bypass the proxy. #[cfg(target_os = "linux")] let _bypass_monitor = netns.as_ref().and_then(|ns| { diff --git a/crates/openshell-sandbox/src/sandbox/linux/mod.rs b/crates/openshell-sandbox/src/sandbox/linux/mod.rs index 848ab1e3b..a3a32c77a 100644 --- a/crates/openshell-sandbox/src/sandbox/linux/mod.rs +++ b/crates/openshell-sandbox/src/sandbox/linux/mod.rs @@ -5,6 +5,7 @@ mod landlock; pub mod netns; +mod nft_ruleset; mod seccomp; use crate::policy::SandboxPolicy; diff --git a/crates/openshell-sandbox/src/sandbox/linux/netns.rs b/crates/openshell-sandbox/src/sandbox/linux/netns.rs index 019036e53..433f70b1c 100644 --- a/crates/openshell-sandbox/src/sandbox/linux/netns.rs +++ b/crates/openshell-sandbox/src/sandbox/linux/netns.rs @@ -242,7 +242,7 @@ impl NetworkNamespace { self.ns_fd } - /// Install iptables rules for bypass detection inside the namespace. + /// Install nftables rules for bypass detection inside the namespace. /// /// Sets up OUTPUT chain rules that: /// 1. 
ACCEPT traffic destined for the proxy (`host_ip:proxy_port`) @@ -253,22 +253,21 @@ impl NetworkNamespace { /// This provides two benefits: /// - **Fast-fail UX**: applications get immediate ECONNREFUSED instead of /// a 30-second timeout when they bypass the proxy - /// - **Diagnostics**: iptables LOG entries are picked up by the bypass + /// - **Diagnostics**: nftables LOG entries are picked up by the bypass /// monitor to emit structured tracing events /// - /// Degrades gracefully if `iptables` is not available — the namespace + /// Degrades gracefully if `nft` is not available — the namespace /// still provides isolation via routing, just without fast-fail and /// diagnostic logging. pub fn install_bypass_rules(&self, proxy_port: u16) -> Result<()> { - // Check if iptables is available before attempting to install rules. - let Some(iptables_path) = find_iptables() else { + let Some(nft_path) = find_nft() else { openshell_ocsf::ocsf_emit!( openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) .severity(openshell_ocsf::SeverityId::Medium) .status(openshell_ocsf::StatusId::Failure) .state(openshell_ocsf::StateId::Disabled, "degraded") .message(format!( - "iptables not found; bypass detection rules will not be installed [ns:{}]", + "nft not found; bypass detection rules will not be installed [ns:{}]", self.name )) .build() @@ -277,49 +276,53 @@ impl NetworkNamespace { }; let host_ip_str = self.host_ip.to_string(); - let proxy_port_str = proxy_port.to_string(); let log_prefix = format!("openshell:bypass:{}:", &self.name); - // "Installing bypass detection rules" is a transient step — skip OCSF. - // The completion event below covers the outcome. + // The kernel's nf_log_syslog module suppresses log output from + // non-init network namespaces by default. Enable it so the bypass + // monitor can see log entries from the sandbox namespace. 
+ enable_nf_log_all_netns(); - // Install IPv4 rules - if let Err(e) = self.install_bypass_rules_for( - &iptables_path, + // Try combined ruleset with log rules first. Log rules must appear + // before reject rules in the chain so packets are logged before being + // rejected. If the kernel lacks nft_log support, fall back to the + // reject-only ruleset. + let ruleset_with_log = super::nft_ruleset::generate_bypass_ruleset( &host_ip_str, - &proxy_port_str, - &log_prefix, - ) { - openshell_ocsf::ocsf_emit!( - openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) - .severity(openshell_ocsf::SeverityId::Medium) - .status(openshell_ocsf::StatusId::Failure) - .state(openshell_ocsf::StateId::Disabled, "failed") - .message(format!( - "Failed to install IPv4 bypass detection rules [ns:{}]: {e}", - self.name - )) - .build() - ); - return Err(e); - } + proxy_port, + Some(&log_prefix), + ); - // Install IPv6 rules — best-effort. - // Skip the proxy ACCEPT rule for IPv6 since the proxy address is IPv4. 
- if let Some(ip6_path) = find_ip6tables(&iptables_path) - && let Err(e) = self.install_bypass_rules_for_v6(&ip6_path, &log_prefix) - { + if let Err(e) = run_nft_netns(&self.name, &nft_path, &ruleset_with_log) { openshell_ocsf::ocsf_emit!( openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) .severity(openshell_ocsf::SeverityId::Low) .status(openshell_ocsf::StatusId::Failure) .state(openshell_ocsf::StateId::Other, "degraded") .message(format!( - "Failed to install IPv6 bypass detection rules (non-fatal) [ns:{}]: {e}", + "Failed to install bypass log rules (non-fatal), falling back to reject-only [ns:{}]: {e}", self.name )) .build() ); + + let ruleset_no_log = + super::nft_ruleset::generate_bypass_ruleset(&host_ip_str, proxy_port, None); + + if let Err(e) = run_nft_netns(&self.name, &nft_path, &ruleset_no_log) { + openshell_ocsf::ocsf_emit!( + openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) + .severity(openshell_ocsf::SeverityId::Medium) + .status(openshell_ocsf::StatusId::Failure) + .state(openshell_ocsf::StateId::Disabled, "failed") + .message(format!( + "Failed to install bypass detection rules [ns:{}]: {e}", + self.name + )) + .build() + ); + return Err(e); + } } openshell_ocsf::ocsf_emit!( @@ -336,297 +339,6 @@ impl NetworkNamespace { Ok(()) } - - /// Install bypass detection rules for a specific iptables variant (iptables or ip6tables). 
- fn install_bypass_rules_for( - &self, - iptables_cmd: &str, - host_ip: &str, - proxy_port: &str, - log_prefix: &str, - ) -> Result<()> { - // Rule 1: ACCEPT traffic to the proxy - run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-d", - &format!("{host_ip}/32"), - "-p", - "tcp", - "--dport", - proxy_port, - "-j", - "ACCEPT", - ], - )?; - - // Rule 2: ACCEPT loopback traffic - run_iptables_netns( - &self.name, - iptables_cmd, - &["-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"], - )?; - - // Rule 3: ACCEPT established/related connections (response packets) - run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-m", - "conntrack", - "--ctstate", - "ESTABLISHED,RELATED", - "-j", - "ACCEPT", - ], - )?; - - // Rule 4: LOG TCP SYN bypass attempts (rate-limited) - // LOG rule failure is non-fatal — the REJECT rule still provides fast-fail. - if let Err(e) = run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "tcp", - "--syn", - "-m", - "limit", - "--limit", - "5/sec", - "--limit-burst", - "10", - "-j", - "LOG", - "--log-prefix", - log_prefix, - "--log-uid", - ], - ) { - openshell_ocsf::ocsf_emit!(openshell_ocsf::ConfigStateChangeBuilder::new( - crate::ocsf_ctx() - ) - .severity(openshell_ocsf::SeverityId::Low) - .status(openshell_ocsf::StatusId::Failure) - .state(openshell_ocsf::StateId::Other, "degraded") - .message(format!( - "Failed to install LOG rule for TCP (xt_LOG module may not be loaded) [ns:{}]: {e}", - self.name - )) - .build()); - } - - // Rule 5: REJECT TCP bypass attempts (fast-fail) - run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "tcp", - "-j", - "REJECT", - "--reject-with", - "icmp-port-unreachable", - ], - )?; - - // Rule 6: LOG UDP bypass attempts (rate-limited, covers DNS bypass) - if let Err(e) = run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "udp", - "-m", - "limit", - "--limit", - "5/sec", - 
"--limit-burst", - "10", - "-j", - "LOG", - "--log-prefix", - log_prefix, - "--log-uid", - ], - ) { - openshell_ocsf::ocsf_emit!( - openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) - .severity(openshell_ocsf::SeverityId::Low) - .status(openshell_ocsf::StatusId::Failure) - .state(openshell_ocsf::StateId::Other, "degraded") - .message(format!( - "Failed to install LOG rule for UDP [ns:{}]: {e}", - self.name - )) - .build() - ); - } - - // Rule 7: REJECT UDP bypass attempts (covers DNS bypass) - run_iptables_netns( - &self.name, - iptables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "udp", - "-j", - "REJECT", - "--reject-with", - "icmp-port-unreachable", - ], - )?; - - Ok(()) - } - - /// Install IPv6 bypass detection rules. - /// - /// Similar to `install_bypass_rules_for` but omits the proxy ACCEPT rule - /// (the proxy listens on an IPv4 address) and uses IPv6-appropriate - /// REJECT types. - fn install_bypass_rules_for_v6(&self, ip6tables_cmd: &str, log_prefix: &str) -> Result<()> { - // ACCEPT loopback traffic - run_iptables_netns( - &self.name, - ip6tables_cmd, - &["-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"], - )?; - - // ACCEPT established/related connections - run_iptables_netns( - &self.name, - ip6tables_cmd, - &[ - "-A", - "OUTPUT", - "-m", - "conntrack", - "--ctstate", - "ESTABLISHED,RELATED", - "-j", - "ACCEPT", - ], - )?; - - // LOG TCP SYN bypass attempts (rate-limited) - if let Err(e) = run_iptables_netns( - &self.name, - ip6tables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "tcp", - "--syn", - "-m", - "limit", - "--limit", - "5/sec", - "--limit-burst", - "10", - "-j", - "LOG", - "--log-prefix", - log_prefix, - "--log-uid", - ], - ) { - openshell_ocsf::ocsf_emit!( - openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) - .severity(openshell_ocsf::SeverityId::Low) - .status(openshell_ocsf::StatusId::Failure) - .state(openshell_ocsf::StateId::Other, "degraded") - .message(format!( - "Failed to install IPv6 LOG rule for TCP [ns:{}]: 
{e}", - self.name - )) - .build() - ); - } - - // REJECT TCP bypass attempts - run_iptables_netns( - &self.name, - ip6tables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "tcp", - "-j", - "REJECT", - "--reject-with", - "icmp6-port-unreachable", - ], - )?; - - // LOG UDP bypass attempts (rate-limited) - if let Err(e) = run_iptables_netns( - &self.name, - ip6tables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "udp", - "-m", - "limit", - "--limit", - "5/sec", - "--limit-burst", - "10", - "-j", - "LOG", - "--log-prefix", - log_prefix, - "--log-uid", - ], - ) { - openshell_ocsf::ocsf_emit!( - openshell_ocsf::ConfigStateChangeBuilder::new(crate::ocsf_ctx()) - .severity(openshell_ocsf::SeverityId::Low) - .status(openshell_ocsf::StatusId::Failure) - .state(openshell_ocsf::StateId::Other, "degraded") - .message(format!( - "Failed to install IPv6 LOG rule for UDP [ns:{}]: {e}", - self.name - )) - .build() - ); - } - - // REJECT UDP bypass attempts - run_iptables_netns( - &self.name, - ip6tables_cmd, - &[ - "-A", - "OUTPUT", - "-p", - "udp", - "-j", - "REJECT", - "--reject-with", - "icmp6-port-unreachable", - ], - )?; - - Ok(()) - } } impl Drop for NetworkNamespace { @@ -732,34 +444,43 @@ fn run_ip_netns(netns: &str, args: &[&str]) -> Result<()> { Ok(()) } -/// Run an iptables command inside a network namespace via `nsenter --net=`. +/// Load an nftables ruleset inside a network namespace via `nsenter --net=`. /// -/// Uses `nsenter` instead of `ip netns exec` to avoid the sysfs remount -/// that fails in rootless container runtimes. See `run_ip_netns` for details. -fn run_iptables_netns(netns: &str, iptables_cmd: &str, args: &[&str]) -> Result<()> { +/// Writes the ruleset to a temp file and loads it with `nft -f `. +/// A temp file is used instead of piping to stdin (`nft -f -`) because +/// `nft` resolves `-` to `/dev/stdin`, which may not exist in minimal +/// VM guest environments (e.g. virtiofs rootfs without /proc mounted +/// at nft invocation time). 
+fn run_nft_netns(netns: &str, nft_cmd: &str, ruleset: &str) -> Result<()> { + use std::io::Write; + let mut tmp = tempfile::Builder::new() + .prefix("openshell-nft-") + .suffix(".conf") + .tempfile() + .into_diagnostic()?; + tmp.write_all(ruleset.as_bytes()).into_diagnostic()?; + let ruleset_path = tmp.path().to_string_lossy().to_string(); + let nsenter_path = find_trusted_binary("nsenter", NSENTER_SEARCH_PATHS)?; let ns_path = format!("/var/run/netns/{netns}"); let net_flag = format!("--net={ns_path}"); - let mut full_args = vec![net_flag.as_str(), "--", iptables_cmd]; - full_args.extend(args); - debug!( - command = %format!("{nsenter_path} {}", full_args.join(" ")), - "Running iptables in namespace via nsenter" + command = %format!("{nsenter_path} {net_flag} -- {nft_cmd} -f {ruleset_path}"), + "Loading nftables ruleset in namespace" ); let output = Command::new(nsenter_path) - .args(&full_args) + .args([net_flag.as_str(), "--", nft_cmd, "-f", &ruleset_path]) .output() .into_diagnostic()?; + drop(tmp); + if !output.status.success() { let stderr = String::from_utf8_lossy(&output.stderr); return Err(miette::miette!( - "{nsenter_path} --net={} {} failed: {}", - ns_path, - iptables_cmd, + "nft ruleset load failed in netns {netns}: {}", stderr.trim() )); } @@ -767,11 +488,35 @@ fn run_iptables_netns(netns: &str, iptables_cmd: &str, args: &[&str]) -> Result< Ok(()) } -/// Well-known paths where iptables may be installed. -/// The sandbox container PATH often excludes `/usr/sbin`, so we probe -/// explicit paths rather than relying on `which`. -const IPTABLES_SEARCH_PATHS: &[&str] = - &["/usr/sbin/iptables", "/sbin/iptables", "/usr/bin/iptables"]; +const NF_LOG_ALL_NETNS_PATH: &str = "/proc/sys/net/netfilter/nf_log_all_netns"; + +/// Enable nftables logging from non-init network namespaces. +/// +/// The kernel's `nf_log_syslog` module silently suppresses log output from +/// non-init network namespaces unless `net.netfilter.nf_log_all_netns` is +/// set to 1. 
Since sandbox bypass rules live in a per-sandbox network +/// namespace, the bypass monitor can't see log entries without this. +fn enable_nf_log_all_netns() { + use std::path::Path; + if !Path::new(NF_LOG_ALL_NETNS_PATH).exists() { + debug!("nf_log_all_netns sysctl not available (may already be set by init)"); + return; + } + match std::fs::write(NF_LOG_ALL_NETNS_PATH, "1") { + Ok(()) => { + debug!("Enabled nf_log_all_netns for non-init namespace logging"); + } + Err(e) => { + debug!( + error = %e, + "Could not enable nf_log_all_netns; bypass log rules may not produce output" + ); + } + } +} + +/// Well-known paths where nft may be installed. +const NFT_SEARCH_PATHS: &[&str] = &["/usr/sbin/nft", "/sbin/nft", "/usr/bin/nft"]; fn find_trusted_binary<'a>(name: &str, paths: &'a [&str]) -> Result<&'a str> { paths @@ -789,100 +534,11 @@ fn find_trusted_binary<'a>(name: &str, paths: &'a [&str]) -> Result<&'a str> { }) } -/// Returns true if xt extension modules (e.g. `xt_comment`) cannot be used -/// via the given iptables binary. -/// -/// Some kernels have `nf_tables` but lack the `nft_compat` bridge that allows -/// xt extension modules to be used through the `nf_tables` path (e.g. Jetson -/// Linux 5.15-tegra). This probe detects that condition by attempting to -/// insert a rule using the `xt_comment` extension. If it fails, xt extensions -/// are unavailable and the caller should fall back to iptables-legacy. -fn xt_extensions_unavailable(iptables_path: &str) -> bool { - // Create a temporary probe chain. If this fails (e.g. no CAP_NET_ADMIN), - // we can't determine availability — assume extensions are available. - let created = Command::new(iptables_path) - .args(["-t", "filter", "-N", "_xt_probe"]) - .output() - .is_ok_and(|o| o.status.success()); - - if !created { - return false; - } - - // Attempt to insert a rule using xt_comment. Failure means nft_compat - // cannot bridge xt extension modules on this kernel. 
- let probe_ok = Command::new(iptables_path) - .args([ - "-t", - "filter", - "-A", - "_xt_probe", - "-m", - "comment", - "--comment", - "probe", - "-j", - "ACCEPT", - ]) - .output() - .is_ok_and(|o| o.status.success()); - - // Clean up — best-effort, ignore failures. - let _ = Command::new(iptables_path) - .args([ - "-t", - "filter", - "-D", - "_xt_probe", - "-m", - "comment", - "--comment", - "probe", - "-j", - "ACCEPT", - ]) - .output(); - let _ = Command::new(iptables_path) - .args(["-t", "filter", "-X", "_xt_probe"]) - .output(); - - !probe_ok -} - -/// Find the iptables binary path, checking well-known locations. -/// -/// If xt extension modules are unavailable via the standard binary and -/// `iptables-legacy` is available alongside it, the legacy binary is returned -/// instead. This ensures bypass-detection rules can be installed on kernels -/// where `nft_compat` is unavailable (e.g. Jetson Linux 5.15-tegra). -fn find_iptables() -> Option<String> { - let standard_path = IPTABLES_SEARCH_PATHS - .iter() - .find(|path| Path::new(path).exists()) - .copied()?; - - if xt_extensions_unavailable(standard_path) { - let legacy_path = standard_path.replace("iptables", "iptables-legacy"); - if Path::new(&legacy_path).exists() { - debug!( - legacy = legacy_path, - "xt extensions unavailable; using iptables-legacy" - ); - return Some(legacy_path); - } - } - - Some(standard_path.to_string()) -} - -/// Find the ip6tables binary path, deriving it from the iptables location. -fn find_ip6tables(iptables_path: &str) -> Option<String> { - let ip6_path = iptables_path.replace("iptables", "ip6tables"); - if Path::new(&ip6_path).exists() { - Some(ip6_path) - } else { - None - } +/// Find the nft binary path, checking well-known locations. 
+fn find_nft() -> Option<String> { + find_trusted_binary("nft", NFT_SEARCH_PATHS) + .ok() + .map(String::from) } #[cfg(test)] @@ -914,6 +570,16 @@ mod tests { assert!(err.to_string().contains("trusted nsenter helper not found")); } + #[test] + fn nft_search_paths_are_absolute() { + for path in NFT_SEARCH_PATHS { + assert!( + path.starts_with('/'), + "NFT_SEARCH_PATHS entry must be absolute: {path}" + ); + } + } + #[test] #[ignore = "requires root privileges"] fn test_create_and_drop_namespace() { diff --git a/crates/openshell-sandbox/src/sandbox/linux/nft_ruleset.rs b/crates/openshell-sandbox/src/sandbox/linux/nft_ruleset.rs new file mode 100644 index 000000000..ba7aeb936 --- /dev/null +++ b/crates/openshell-sandbox/src/sandbox/linux/nft_ruleset.rs @@ -0,0 +1,148 @@ +// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + +//! nftables ruleset generation for sandbox network bypass enforcement. +//! +//! This module provides pure functions to generate nftables rulesets that enforce +//! the sandbox network policy: all traffic must go through the proxy, with bypass +//! attempts logged and rejected. + +/// Generate a complete nftables ruleset for sandbox network bypass enforcement. +/// +/// Creates an `inet` family table (handles both IPv4 and IPv6) with rules that: +/// 1. Accept traffic to the proxy (IPv4 only) +/// 2. Accept loopback traffic +/// 3. Accept established/related connections +/// 4. Reject TCP and UDP bypass attempts (both IPv4 and IPv6) +/// +/// If `log_prefix` is provided, log rules are inserted before each reject rule +/// so that bypass attempts are recorded in the kernel ring buffer before being +/// rejected. The `log` expression requires kernel `nft_log` module support; +/// pass `None` for `log_prefix` as a fallback when that module is unavailable. 
+pub fn generate_bypass_ruleset(host_ip: &str, proxy_port: u16, log_prefix: Option<&str>) -> String { + let log_tcp = log_prefix + .map(|p| { + format!( + "\n tcp flags syn limit rate 5/second burst 10 packets log prefix \"{p}\" flags skuid" + ) + }) + .unwrap_or_default(); + let log_udp = log_prefix + .map(|p| { + format!( + "\n meta l4proto udp limit rate 5/second burst 10 packets log prefix \"{p}\" flags skuid" + ) + }) + .unwrap_or_default(); + + format!( + r#"table inet openshell_bypass {{ + chain output {{ + type filter hook output priority 0; policy accept; + + ip daddr {host_ip} tcp dport {proxy_port} accept + oifname "lo" accept + ct state established,related accept{log_tcp} + meta nfproto ipv4 meta l4proto tcp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto tcp reject with icmpv6 type port-unreachable{log_udp} + meta nfproto ipv4 meta l4proto udp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto udp reject with icmpv6 type port-unreachable + }} +}} +"# + ) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn generates_bypass_ruleset_with_proxy_rule() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, None); + assert!(ruleset.contains("table inet openshell_bypass")); + assert!(ruleset.contains("chain output")); + assert!(ruleset.contains("ip daddr 10.0.2.2 tcp dport 8080 accept")); + } + + #[test] + fn ruleset_has_inet_family_table_and_output_chain() { + let ruleset = generate_bypass_ruleset("192.168.1.1", 3128, None); + assert!(ruleset.contains("table inet openshell_bypass")); + assert!(ruleset.contains("type filter hook output priority 0; policy accept;")); + } + + #[test] + fn proxy_accept_rule_uses_provided_ip_and_port() { + let ruleset = generate_bypass_ruleset("172.16.0.1", 9999, None); + assert!(ruleset.contains("ip daddr 172.16.0.1 tcp dport 9999 accept")); + } + + #[test] + fn rules_are_ordered_accept_then_reject() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, None); 
+ let proxy_pos = ruleset.find("ip daddr").unwrap(); + let lo_pos = ruleset.find("oifname \"lo\"").unwrap(); + let ct_pos = ruleset.find("ct state established,related").unwrap(); + let reject_pos = ruleset.find("reject with icmp type").unwrap(); + + assert!(proxy_pos < lo_pos); + assert!(lo_pos < ct_pos); + assert!(ct_pos < reject_pos); + } + + #[test] + fn both_ipv4_and_ipv6_reject_types_are_present() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, None); + let icmp_count = ruleset + .matches("reject with icmp type port-unreachable") + .count(); + let icmpv6_count = ruleset + .matches("reject with icmpv6 type port-unreachable") + .count(); + assert_eq!(icmp_count, 2, "need IPv4 ICMP rejects for TCP + UDP"); + assert_eq!(icmpv6_count, 2, "need IPv6 ICMPv6 rejects for TCP + UDP"); + } + + #[test] + fn no_log_ruleset_omits_log_rules() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, None); + assert!( + !ruleset.contains("log prefix"), + "no-log ruleset must not contain log rules" + ); + } + + #[test] + fn log_ruleset_contains_prefix_for_tcp_and_udp() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, Some("openshell:bypass:test:")); + let count = ruleset + .matches("log prefix \"openshell:bypass:test:\"") + .count(); + assert_eq!(count, 2, "need log rules for both TCP and UDP"); + assert!(ruleset.contains("tcp flags syn limit rate 5/second burst 10 packets")); + assert!(ruleset.contains("meta l4proto udp limit rate 5/second burst 10 packets")); + } + + #[test] + fn log_rules_appear_before_reject_rules() { + let ruleset = generate_bypass_ruleset("10.0.2.2", 8080, Some("openshell:bypass:test:")); + let tcp_log_pos = ruleset.find("tcp flags syn").unwrap(); + let tcp_reject_pos = ruleset + .find("meta nfproto ipv4 meta l4proto tcp reject") + .unwrap(); + let udp_log_pos = ruleset.find("meta l4proto udp limit rate").unwrap(); + let udp_reject_pos = ruleset + .find("meta nfproto ipv4 meta l4proto udp reject") + .unwrap(); + + assert!( + 
tcp_log_pos < tcp_reject_pos, + "TCP log rule must come before TCP reject rule" + ); + assert!( + udp_log_pos < udp_reject_pos, + "UDP log rule must come before UDP reject rule" + ); + } +} diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx index 43b7fb81e..7ccf1faa7 100644 --- a/docs/reference/sandbox-compute-drivers.mdx +++ b/docs/reference/sandbox-compute-drivers.mdx @@ -95,6 +95,12 @@ The VM driver resolves sandbox images from a local container engine before falli systemctl --user start podman.socket ``` +### Host Firewall + +The VM driver creates nftables rules on the host for each sandbox VM's TAP network interface. These rules provide NAT for VM connectivity and defense-in-depth isolation: unsolicited inbound connections to the VM are dropped, and the VM can only reach the gateway port on the host. Primary security enforcement (proxy-only egress and bypass detection) is handled by the sandbox supervisor inside the VM guest. + +On hosts with restrictive firewalls (e.g. firewalld), the host firewall may additionally block VM traffic that the driver's rules accept. If VM sandboxes cannot reach the network, verify that the host firewall allows forwarding and input for `vmtap-*` interfaces. See the [VM driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-vm/README.md#host-side-nftables-rules) for details. + ## Kubernetes Driver Kubernetes-backed sandboxes run as pods in the configured sandbox namespace. Use Kubernetes for shared clusters, remote compute, GPU scheduling, and operator-managed environments. diff --git a/docs/security/best-practices.mdx b/docs/security/best-practices.mdx index 0c86069e1..faf1b5991 100644 --- a/docs/security/best-practices.mdx +++ b/docs/security/best-practices.mdx @@ -225,7 +225,7 @@ OpenShell applies seccomp in two phases. A narrow supervisor-startup prelude run The sandbox supervisor applies enforcement in a specific order during process startup. 
This ordering is intentional: named network-namespace setup still relies on privileged helpers, and privilege dropping still needs `/etc/group` and `/etc/passwd`, which Landlock subsequently restricts. -1. Privileged supervisor bootstrap helpers, including network-namespace setup and optional `iptables` probes. +1. Privileged supervisor bootstrap helpers, including network-namespace setup and optional `nft` probes. 2. Supervisor startup prelude seccomp (`PR_SET_NO_NEW_PRIVS` plus the early syscall denylist) synchronized across runtime threads. 3. Network namespace entry (`setns`) in child `pre_exec`. 4. Privilege drop (`initgroups` + `setgid` + `setuid`). diff --git a/examples/bring-your-own-container/Dockerfile b/examples/bring-your-own-container/Dockerfile index 17f8083df..61f283970 100644 --- a/examples/bring-your-own-container/Dockerfile +++ b/examples/bring-your-own-container/Dockerfile @@ -9,9 +9,9 @@ FROM python:3.13-slim # System tools useful for sandbox networking and debugging. # iproute2: required for network namespace management (ip netns, veth pairs) -# iptables: optional, enables bypass detection (LOG + REJECT for direct connections) +# nftables: optional, enables bypass detection (log + reject for direct connections) RUN apt-get update && apt-get install -y --no-install-recommends \ - curl iproute2 iptables \ + curl iproute2 nftables \ && rm -rf /var/lib/apt/lists/* # Create the sandbox user for non-root execution.
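Reviewer note: the two-stage load in `netns.rs` (try the ruleset with log rules first, then fall back to the reject-only ruleset) can be exercised without root if the loader is injected. The sketch below is not part of the patch. `BypassMode` and `install_with_fallback` are hypothetical names introduced only for illustration, and the closure stands in for `run_nft_netns`.

```rust
/// Hypothetical outcome type: which ruleset ended up loaded.
#[derive(Debug, PartialEq)]
enum BypassMode {
    /// Ruleset with log + reject rules loaded successfully.
    LogAndReject,
    /// Log rules failed (e.g. no nft_log support); reject-only ruleset loaded.
    RejectOnly,
}

/// Hypothetical helper mirroring the fallback order described in the patch:
/// 1. try the ruleset that includes log rules;
/// 2. on failure, try the reject-only ruleset;
/// 3. if that also fails, return the second error to the caller.
///
/// `load` is an assumption standing in for the real nft loader, so the
/// logic can be unit-tested without nft, nsenter, or CAP_NET_ADMIN.
fn install_with_fallback<F>(
    with_log: &str,
    without_log: &str,
    mut load: F,
) -> Result<BypassMode, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    if load(with_log).is_ok() {
        return Ok(BypassMode::LogAndReject);
    }
    // The first failure is treated as non-fatal; only the second one propagates.
    load(without_log)?;
    Ok(BypassMode::RejectOnly)
}
```

Passing a closure that rejects the first ruleset and accepts the second should yield `Ok(BypassMode::RejectOnly)`, which matches the "degraded" path in the patch; a closure that rejects both should surface the second error, matching the `return Err(e)` branch.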