8 changes: 4 additions & 4 deletions crates/openshell-driver-podman/NETWORKING.md
@@ -178,7 +178,7 @@ Namespace 2: Rootless Podman network namespace, managed by pasta
Namespace 3: Inner sandbox netns, created by supervisor
|
veth pair, such as 10.200.0.1 <-> 10.200.0.2
iptables forces ordinary traffic through proxy
nftables forces ordinary traffic through proxy
user workload runs here
```

@@ -270,7 +270,7 @@ Container on the Podman bridge
|
user code runs here
|
iptables rules:
nftables rules:
ACCEPT -> proxy TCP
ACCEPT -> loopback
ACCEPT -> established/related
@@ -337,7 +337,7 @@ User code in inner netns
HTTP_PROXY points at the local sandbox proxy
|
2. TCP connect to proxy
allowed by iptables as the only ordinary egress destination
allowed by nftables as the only ordinary egress destination
|
3. HTTP CONNECT api.example.com:443
|
@@ -398,7 +398,7 @@ bind-mounted into sandbox containers by the Podman driver.
| Port publishing | Not needed for relay | Ephemeral host port remains in the container spec for compatibility and debug paths. |
| TLS | mTLS via Kubernetes secrets | mTLS via mounted client files, RPM defaults, or explicit configuration. |
| DNS | Kubernetes CoreDNS | Podman bridge DNS through aardvark-dns when DNS is enabled. |
| Network policy | Kubernetes network policy for pod ingress plus supervisor policy | iptables inside inner sandbox netns plus supervisor policy. |
| Network policy | Kubernetes network policy for pod ingress plus supervisor policy | nftables inside inner sandbox netns plus supervisor policy. |
| Supervisor delivery | Kubernetes driver managed pod image or template | OCI image volume mount. |
| Secrets | Kubernetes Secret volume and env vars | Mounted TLS client materials from a Podman secret. |

2 changes: 1 addition & 1 deletion crates/openshell-driver-podman/README.md
@@ -55,7 +55,7 @@ The restricted agent child does not retain these supervisor privileges.
| Capability | Purpose |
|---|---|
| `SYS_ADMIN` | seccomp filter installation, namespace creation, and Landlock setup. |
| `NET_ADMIN` | Network namespace veth setup, IP address assignment, routes, and iptables. |
| `NET_ADMIN` | Network namespace veth setup, IP address assignment, routes, and nftables. |
| `SYS_PTRACE` | Reading `/proc/<pid>/exe` and walking process ancestry for binary identity. |
| `SYSLOG` | Reading `/dev/kmsg` for bypass-detection diagnostics. |
| `DAC_READ_SEARCH` | Reading `/proc/<pid>/fd/` across UIDs so the proxy can resolve the binary responsible for a connection. |
19 changes: 19 additions & 0 deletions crates/openshell-driver-vm/README.md
@@ -200,6 +200,25 @@ RUST_LOG=openshell_server=debug,openshell_driver_vm=debug \

The VM guest's serial console is appended to `<state-dir>/<sandbox-id>/console.log`. Sandbox IDs must match `[A-Za-z0-9._-]{1,128}` before the driver uses them in host paths. The gateway-owned compute-driver socket lives at `<state-dir>/run/compute-driver.sock`; OpenShell creates `run/` with owner-only permissions, removes same-owner stale sockets, and the gateway removes the socket on clean shutdown via `ManagedDriverProcess::drop`. UDS clients must match the driver UID and provide the expected gateway process PID by default. Standalone same-UID UDS mode requires the explicit `--allow-same-uid-peer` development flag. TCP mode is disabled by default because it is unauthenticated; use `--allow-unauthenticated-tcp --bind-address 127.0.0.1:50061` only for local development.

## Host-side nftables rules

The VM driver creates a per-VM nftables table on the host (`openshell_vm_vmtap_<id>`) with three chains. These rules serve two purposes: NAT infrastructure (required for VM connectivity) and defense-in-depth host isolation. Primary security enforcement — proxy-only egress and bypass detection — is handled by the sandbox supervisor's own nftables rules inside the VM guest.

**`postrouting` (NAT):** Masquerades outbound VM traffic so it can be routed from the VM's private subnet to the external network. This chain handles forwarded traffic (VM → internet), not traffic destined for the host.

**`forward` (defense-in-depth):** Accepts all outbound traffic from the VM (security enforcement happens guest-side) and accepts established/related response traffic back to the VM. Drops unsolicited inbound connections to the VM from the broader network. This chain handles forwarded traffic only — packets transiting the host between the TAP interface and other interfaces.

**`input` (defense-in-depth):** Accepts traffic from the VM to the gateway port on the host. Drops all other traffic from the VM destined for the host itself. This limits what a compromised guest can reach on the host to the gateway service only.

The `input` and `postrouting` chains handle different traffic paths: `input` covers packets addressed to the host (VM → host), while `postrouting` covers packets the host is forwarding on behalf of the VM (VM → internet). A packet from the VM goes through one path or the other, never both.

All chains use `policy accept`, so non-TAP traffic is unaffected. Because nftables evaluates multiple base chains on the same hook independently, host firewalls interact with these rules as follows:

- **Open host (no other firewall):** Our chains are the only filter. The defense-in-depth drop rules block unsolicited inbound and non-gateway host access. Non-TAP traffic passes through.
- **Restrictive host firewall (e.g. firewalld):** The host firewall's chains may additionally drop TAP traffic that our chains accept. A `drop` verdict from any chain is final — our `accept` cannot override it. If VM connectivity fails, verify that the host firewall allows forwarding and input for `vmtap-*` interfaces.

Each table is created atomically by piping the generated ruleset to `nft -f -` on VM start, and removed in a single `nft delete table ip <table>` call when the VM is destroyed.
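
The interaction with a host firewall described above can be modeled in a few lines. This is a minimal sketch, not driver code: the `Verdict` enum and `hook_verdict` function are hypothetical names introduced only for illustration. It encodes the rule that an `accept` ends evaluation of the chain that issued it, while a `drop` from any base chain on the hook is final.

```rust
/// Verdict that a single base chain reaches for one packet.
/// Hypothetical type, used only to illustrate the semantics above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Accept,
    Drop,
}

/// Combine the verdicts of every base chain registered on the same hook.
/// The packet survives only if no chain dropped it.
fn hook_verdict(chain_verdicts: &[Verdict]) -> Verdict {
    if chain_verdicts.iter().any(|v| *v == Verdict::Drop) {
        Verdict::Drop
    } else {
        Verdict::Accept
    }
}

fn main() {
    // Open host: the OpenShell chain is the only filter on the hook.
    assert_eq!(hook_verdict(&[Verdict::Accept]), Verdict::Accept);

    // Restrictive host firewall: its chain drops TAP traffic that the
    // OpenShell chain accepted, and the drop wins.
    assert_eq!(
        hook_verdict(&[Verdict::Accept, Verdict::Drop]),
        Verdict::Drop
    );

    // The OpenShell chain drops non-gateway traffic to the host even when
    // the host firewall would have accepted it.
    assert_eq!(
        hook_verdict(&[Verdict::Drop, Verdict::Accept]),
        Verdict::Drop
    );
}
```

Under this model the order of the chains does not change the outcome, which is why a permissive OpenShell chain cannot reopen traffic that firewalld has already dropped.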

## Prerequisites

- macOS on Apple Silicon, or Linux on aarch64/x86_64 with KVM
@@ -585,6 +585,13 @@ run_post_overlay_setup() {
mount -t cgroup2 cgroup2 "$(root_path /sys/fs/cgroup)" 2>/dev/null &
wait

# Allow nftables LOG rules to work in non-init network namespaces.
# Without this, the kernel's nf_log_syslog silently suppresses output
# from the sandbox's network namespace.
if [ -f /proc/sys/net/netfilter/nf_log_all_netns ]; then
echo 1 > /proc/sys/net/netfilter/nf_log_all_netns 2>/dev/null || true
fi

setup_sandbox_workdir

configure_hostname
1 change: 1 addition & 0 deletions crates/openshell-driver-vm/src/lib.rs
@@ -5,6 +5,7 @@ pub mod driver;
mod embedded_runtime;
mod ffi;
pub mod gpu;
mod nft_ruleset;
pub mod procguard;
mod rootfs;
mod runtime;
92 changes: 92 additions & 0 deletions crates/openshell-driver-vm/src/nft_ruleset.rs
@@ -0,0 +1,92 @@
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use std::fmt::Write;

/// Sanitize a TAP device name for use as an nftables table name suffix.
/// Assumes device names match `vmtap-[a-f0-9]+` (driver-controlled).
fn sanitize_table_name(device: &str) -> String {
device.replace('-', "_")
}

/// Return the nftables table name for a TAP device.
pub fn teardown_table_name(device: &str) -> String {
format!("openshell_vm_{}", sanitize_table_name(device))
}

/// Generate the nftables ruleset for VM TAP networking.
pub fn generate_tap_ruleset(tap_device: &str, subnet: &str, gateway_port: u16) -> String {
let table_name = teardown_table_name(tap_device);
let mut ruleset = String::with_capacity(512);

writeln!(ruleset, "table ip {table_name} {{").unwrap();
writeln!(ruleset, " chain postrouting {{").unwrap();
writeln!(
ruleset,
" type nat hook postrouting priority 100; policy accept;"
)
.unwrap();
writeln!(ruleset, " ip saddr {subnet} masquerade").unwrap();
writeln!(ruleset, " }}").unwrap();
writeln!(ruleset, " chain forward {{").unwrap();
writeln!(
ruleset,
" type filter hook forward priority 0; policy accept;"
)
.unwrap();
writeln!(ruleset, " iifname \"{tap_device}\" accept").unwrap();
writeln!(
ruleset,
" oifname \"{tap_device}\" ct state related,established accept"
)
.unwrap();
writeln!(ruleset, " oifname \"{tap_device}\" drop").unwrap();
writeln!(ruleset, " }}").unwrap();
writeln!(ruleset, " chain input {{").unwrap();
writeln!(
ruleset,
" type filter hook input priority 0; policy accept;"
)
.unwrap();
writeln!(
ruleset,
" iifname \"{tap_device}\" tcp dport {gateway_port} accept"
)
.unwrap();
writeln!(ruleset, " iifname \"{tap_device}\" drop").unwrap();
writeln!(ruleset, " }}").unwrap();
writeln!(ruleset, "}}").unwrap();

ruleset
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn generates_tap_setup_ruleset() {
let ruleset = generate_tap_ruleset("vmtap-abcd", "10.0.128.0/30", 8080);
assert!(ruleset.contains("table ip openshell_vm_vmtap_abcd {"));
assert!(ruleset.contains("type nat hook postrouting priority 100; policy accept;"));
assert!(ruleset.contains("ip saddr 10.0.128.0/30 masquerade"));
assert!(ruleset.contains("type filter hook forward priority 0; policy accept;"));
assert!(ruleset.contains("iifname \"vmtap-abcd\" accept"));
assert!(ruleset.contains("oifname \"vmtap-abcd\" ct state related,established accept"));
assert!(ruleset.contains("oifname \"vmtap-abcd\" drop"));
assert!(ruleset.contains("type filter hook input priority 0; policy accept;"));
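assert!(ruleset.contains("iifname \"vmtap-abcd\" tcp dport 8080 accept"));
// The input chain must also drop all other guest-to-host traffic.
assert!(ruleset.contains("iifname \"vmtap-abcd\" drop"));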
}

#[test]
fn table_name_sanitizes_device_name() {
let ruleset = generate_tap_ruleset("vmtap-abc-123", "10.0.128.0/30", 8080);
assert!(ruleset.contains("table ip openshell_vm_vmtap_abc_123 {"));
}

#[test]
fn teardown_command_targets_correct_table() {
let cmd = teardown_table_name("vmtap-abcd");
assert_eq!(cmd, "openshell_vm_vmtap_abcd");
}
}
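
`sanitize_table_name` relies on the documented assumption that device names match `vmtap-[a-f0-9]+`, and `generate_tap_ruleset` interpolates the device name into quoted `iifname`/`oifname` matches. A guard along the following lines could make that assumption explicit before the ruleset is rendered. This is a hedged sketch only: `is_valid_tap_device` is a hypothetical helper that does not appear in this change, and the 15-byte limit reflects the Linux interface-name maximum (`IFNAMSIZ` is 16 including the trailing NUL).

```rust
/// Hypothetical guard, not part of this change: accept only names of the
/// form `vmtap-` followed by one or more lowercase hex digits.
fn is_valid_tap_device(device: &str) -> bool {
    // Linux interface names are at most 15 bytes (IFNAMSIZ - 1).
    if device.len() > 15 {
        return false;
    }
    match device.strip_prefix("vmtap-") {
        Some(rest) => {
            !rest.is_empty()
                && rest
                    .bytes()
                    .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
        }
        None => false,
    }
}

fn main() {
    assert!(is_valid_tap_device("vmtap-abcd"));
    assert!(is_valid_tap_device("vmtap-0f3a9"));

    // A name carrying a quote could otherwise alter the rendered ruleset.
    assert!(!is_valid_tap_device("vmtap-ab\" drop"));
    // Empty suffix, wrong prefix, and over-long names are rejected.
    assert!(!is_valid_tap_device("vmtap-"));
    assert!(!is_valid_tap_device("eth0"));
    assert!(!is_valid_tap_device("vmtap-0123456789abcdef"));
}
```

Note that the `table_name_sanitizes_device_name` test uses `vmtap-abc-123`, which falls outside the documented pattern, so a strict guard like this one would reject it. Whether to widen the pattern or tighten the test is a design choice left to the author.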
108 changes: 63 additions & 45 deletions crates/openshell-driver-vm/src/runtime.rs
@@ -10,7 +10,7 @@ use std::ptr;
use std::sync::atomic::{AtomicI32, Ordering};
use std::time::{Duration, Instant};

use crate::{embedded_runtime, ffi, procguard, rootfs};
use crate::{embedded_runtime, ffi, nft_ruleset, procguard, rootfs};

pub const VM_RUNTIME_DIR_ENV: &str = "OPENSHELL_VM_RUNTIME_DIR";

@@ -413,6 +413,12 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R
enable_ip_forwarding()?;

let subnet = tap_subnet_from_host_ip(host_ip);
let table_name = nft_ruleset::teardown_table_name(tap_device);

// Delete any stale nftables table from a previous driver run.
let _ = run_cmd("nft", &["delete", "table", "ip", &table_name]);

// Clean up legacy iptables rules from older driver versions.
let _ = run_cmd(
"iptables",
&[
@@ -426,27 +432,10 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R
"MASQUERADE",
],
);
run_cmd(
"iptables",
&[
"-t",
"nat",
"-A",
"POSTROUTING",
"-s",
&subnet,
"-j",
"MASQUERADE",
],
)?;
let _ = run_cmd(
"iptables",
&["-D", "FORWARD", "-i", tap_device, "-j", "ACCEPT"],
);
run_cmd(
"iptables",
&["-A", "FORWARD", "-i", tap_device, "-j", "ACCEPT"],
)?;
let _ = run_cmd(
"iptables",
&[
@@ -462,43 +451,31 @@ fn setup_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) -> R
"ACCEPT",
],
);
run_cmd(
"iptables",
&[
"-A",
"FORWARD",
"-o",
tap_device,
"-m",
"state",
"--state",
"RELATED,ESTABLISHED",
"-j",
"ACCEPT",
],
)?;
// Allow guest → host traffic only to the gateway gRPC port.
// Previous versions accepted ALL inbound traffic from the TAP
// interface; scope to the specific port so the guest cannot reach
// other host services.
let port_str = gateway_port.to_string();
let _ = run_cmd(
"iptables",
&[
"-D", "INPUT", "-i", tap_device, "-p", "tcp", "--dport", &port_str, "-j", "ACCEPT",
],
);
run_cmd(
let _ = run_cmd(
"iptables",
&[
"-A", "INPUT", "-i", tap_device, "-p", "tcp", "--dport", &port_str, "-j", "ACCEPT",
],
)?;
&["-D", "INPUT", "-i", tap_device, "-j", "ACCEPT"],
);

// Load nftables ruleset atomically.
let ruleset = nft_ruleset::generate_tap_ruleset(tap_device, &subnet, gateway_port);
run_nft_stdin(&ruleset)?;

Ok(())
}

fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) {
// Delete the entire nftables table — single atomic operation.
let table_name = nft_ruleset::teardown_table_name(tap_device);
let _ = run_cmd("nft", &["delete", "table", "ip", &table_name]);

// Clean up legacy iptables rules from older driver versions.
let subnet = tap_subnet_from_host_ip(host_ip);
let _ = run_cmd(
"iptables",
@@ -519,8 +496,6 @@ fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) {
"iptables",
&["-D", "FORWARD", "-i", tap_device, "-j", "ACCEPT"],
);
// Remove the port-scoped INPUT rule. Also try the legacy blanket
// rule so stale rules from older driver versions are cleaned up.
if gateway_port > 0 {
let port_str = gateway_port.to_string();
let _ = run_cmd(
@@ -547,6 +522,7 @@ fn teardown_tap_networking(tap_device: &str, host_ip: &str, gateway_port: u16) {
"MASQUERADE",
],
);

let _ = run_cmd("ip", &["link", "set", tap_device, "down"]);
let _ = run_cmd("ip", &["tuntap", "del", "dev", tap_device, "mode", "tap"]);
}
@@ -583,6 +559,35 @@ fn run_cmd(cmd: &str, args: &[&str]) -> Result<(), String> {
}
}

fn run_nft_stdin(ruleset: &str) -> Result<(), String> {
use std::io::Write;

let mut child = StdCommand::new("nft")
.args(["-f", "-"])
.stdin(Stdio::piped())
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()
.map_err(|e| format!("failed to run nft: {e}"))?;

if let Some(mut stdin) = child.stdin.take() {
stdin
.write_all(ruleset.as_bytes())
.map_err(|e| format!("failed to write nft ruleset: {e}"))?;
}

let output = child
.wait_with_output()
.map_err(|e| format!("failed to wait for nft: {e}"))?;

if output.status.success() {
Ok(())
} else {
let stderr = String::from_utf8_lossy(&output.stderr);
Err(format!("nft -f - failed: {stderr}"))
}
}

/// RAII guard that tears down TAP networking on drop.
struct TapGuard {
tap_device: String,
@@ -715,7 +720,7 @@ fn run_libkrun_vm(config: &VmLaunchConfig) -> Result<(), String> {
// on its own service ports (DNS:53, DHCP, HTTP API:80).
//
// That network plane is also what the sandbox supervisor's
// per-sandbox netns (veth pair + iptables, see
// per-sandbox netns (veth pair + nftables, see
// `openshell-sandbox/src/sandbox/linux/netns.rs`) branches off of;
// libkrun's built-in TSI socket impersonation would not satisfy
// those kernel-level primitives.
@@ -1481,4 +1486,17 @@ mod tests {

assert_ne!(first, second);
}

#[test]
fn tap_subnet_from_host_ip_calculates_slash30_base() {
assert_eq!(tap_subnet_from_host_ip("10.0.128.1"), "10.0.128.0/30");
assert_eq!(tap_subnet_from_host_ip("10.0.128.2"), "10.0.128.0/30");
assert_eq!(tap_subnet_from_host_ip("10.0.128.5"), "10.0.128.4/30");
}

#[test]
fn tap_subnet_from_host_ip_handles_invalid_ip() {
let result = tap_subnet_from_host_ip("not-an-ip");
assert_eq!(result, "not-an-ip/30");
}
}
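
The two `tap_subnet_from_host_ip` tests pin down the expected behavior without showing the function body, which lies outside this hunk. The arithmetic they imply can be sketched as follows. This is an assumption-laden illustration, not the driver's implementation: `slash30_subnet` is a hypothetical name, and the fallback for unparseable input is inferred solely from the `not-an-ip/30` assertion.

```rust
use std::net::Ipv4Addr;

/// Hypothetical stand-in that reproduces the behavior the tests describe.
fn slash30_subnet(host_ip: &str) -> String {
    match host_ip.parse::<Ipv4Addr>() {
        Ok(ip) => {
            // A /30 keeps the top 30 bits, so clear the two lowest bits
            // to get the network base address.
            let base = u32::from(ip) & !0b11;
            format!("{}/30", Ipv4Addr::from(base))
        }
        // Assumed fallback: pass the input through with the suffix appended.
        Err(_) => format!("{host_ip}/30"),
    }
}

fn main() {
    // 1 = 0b001 and 2 = 0b010 both mask down to 0.
    assert_eq!(slash30_subnet("10.0.128.1"), "10.0.128.0/30");
    assert_eq!(slash30_subnet("10.0.128.2"), "10.0.128.0/30");
    // 5 = 0b101 masks down to 0b100 = 4, the next /30 block.
    assert_eq!(slash30_subnet("10.0.128.5"), "10.0.128.4/30");
    assert_eq!(slash30_subnet("not-an-ip"), "not-an-ip/30");
}
```

If the real function does fall back this way, an invalid host IP would reach `nft -f -` as `ip saddr not-an-ip/30 masquerade`, and the load would fail there rather than at parse time.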
2 changes: 1 addition & 1 deletion crates/openshell-ocsf/src/events/network_activity.rs
@@ -11,7 +11,7 @@ use crate::objects::{Actor, ConnectionInfo, Endpoint, FirewallRule};

/// OCSF Network Activity Event [4001].
///
/// Proxy CONNECT tunnel events and iptables-level bypass detection.
/// Proxy CONNECT tunnel events and nftables bypass detection.
#[derive(Debug, Clone, PartialEq, Eq, Deserialize)]
pub struct NetworkActivityEvent {
/// Common base event fields.
4 changes: 2 additions & 2 deletions crates/openshell-ocsf/src/format/shorthand.rs
@@ -456,7 +456,7 @@ mod tests {
actor: Some(Actor {
process: Process::new("node", 1234),
}),
firewall_rule: Some(FirewallRule::new("bypass-detect", "iptables")),
firewall_rule: Some(FirewallRule::new("bypass-detect", "nftables")),
connection_info: Some(ConnectionInfo::new("tcp")),
action: Some(ActionId::Denied),
disposition: Some(DispositionId::Blocked),
@@ -467,7 +467,7 @@ mod tests {
let shorthand = event.format_shorthand();
assert_eq!(
shorthand,
"NET:REFUSE [MED] DENIED node(1234) -> 93.184.216.34:443/tcp [policy:bypass-detect engine:iptables]"
"NET:REFUSE [MED] DENIED node(1234) -> 93.184.216.34:443/tcp [policy:bypass-detect engine:nftables]"
);
}
