
feat(sandbox): L7 content inspection hooks — scriptable request/response filtering #1272

@maruiz93

Description


Problem Statement

OpenShell enforces network policy at L4 (allow/deny by host:port) and L7 (method/path/query for REST, operation-type/fields for GraphQL). Neither layer inspects the content of request or response bodies for security-relevant signals like prompt injection, PII leakage, sensitive data exfiltration, or adversarial payloads.

Agents operating inside sandboxes receive LLM responses and make outbound API calls. An attacker who controls upstream content (e.g., a poisoned web page fetched by the agent, a malicious tool response, or a compromised API) can embed prompt injection payloads in responses. Conversely, a compromised or misguided agent can exfiltrate sensitive data in outbound request bodies. Today, neither vector is visible to the policy layer.

The inference proxy already buffers request/response bodies for GraphQL inspection (#1022) and credential injection (#689). This proposal adds a general-purpose content inspection hook system that lets operators run external scripts/classifiers against L7 traffic inline, per-route.

Relationship to Privacy Router

The Privacy Router (#1043) and content inspection hooks are complementary, not overlapping:

  • Privacy Router answers: where should this traffic go? It routes inference requests to local or external providers based on data sensitivity, PII classification, and operator policy. It controls the destination.
  • Content inspection hooks answer: should this traffic flow at all? They inspect request/response bodies for adversarial content (prompt injection), data exfiltration (secrets, PII in outbound calls), and policy violations. They gate the traffic.

A deployment might use both: the Privacy Router ensures sensitive prompts stay on a local NIM deployment, while content filters block prompt injection payloads in responses regardless of which provider served them. The router can't catch prompt injection (it classifies sensitivity, not adversarial intent), and content filters don't decide routing (they allow or deny, not redirect).

They share an interest in body content but serve different security objectives — routing policy vs. content policy.

Prior Art: fullsend Security Pipeline

The fullsend project has a production-grade, multi-layered security pipeline that validates this approach and should inform the design:

  • Input pipeline (InputPipeline): UnicodeNormalizer → ContextInjectionScanner — runs before untrusted text enters agent processing.
  • Output pipeline (OutputPipeline): SecretRedactor — runs before agent-generated text is posted to external APIs.

Key scanners:

| Scanner | Technique | What it catches |
| --- | --- | --- |
| ContextInjectionScanner | 27 regex patterns across 4 categories | Instruction override, credential exfiltration, hidden content, execution-via-translation |
| ONNXGuardScanner | ProtectAI DeBERTa-v3 ONNX model, sentence-level splitting | Social engineering, indirect prompt injection (83% detection, 0 false positives in eval) |
| UnicodeNormalizer | Strip + NFKC normalize | Zero-width chars, bidi overrides, ANSI escapes, tag characters with hidden text, fullwidth encoding |
| SecretRedactor | 20+ prefix patterns + structural patterns | API keys (OpenAI sk-, GitHub ghp_, AWS AKIA...), private key blocks, DB connection strings, auth headers |
| SSRFValidator | IP blocklist + DNS rebinding defense | RFC 1918, cloud metadata endpoints, dangerous URI schemes |

Additionally, fullsend ships runtime hooks (Python scripts as PreToolUse/PostToolUse hooks):

  • tirith_check.py — Tirith CLI for static command injection + unicode trick detection
  • canary_pretool.py — detects canary token exfiltration in tool inputs
  • secret_redact_posttool.py — redacts secrets from tool output before the LLM sees them
  • unicode_posttool.py — normalizes unicode in tool output

Experimental validation (experiments/guardrails-eval/): evaluated LLM Guard (DeBERTa-v3), NeMo Guardrails (YARA), and Model Armor (GCP). Key finding: ML sentence-level scanning + regex pattern matching is complementary — ML catches social engineering, regex catches structural attacks. Neither alone is sufficient.

Proposed Design

Core concept

Add a content_filters field to L7 endpoint policy. Each filter references an executable script that the supervisor (not the sandbox) runs against request and/or response bodies. Scripts receive body content on stdin and signal allow (exit 0) or deny (exit 1 + reason on stdout). The supervisor short-circuits with a 403 and the denial reason if any filter denies.

Scripts live outside the sandbox

Filter scripts run in the supervisor process context, not inside the agent container. This is a hard requirement — if the agent can modify the scripts that inspect it, the entire mechanism is bypassable. Scripts are mounted from the host or baked into the supervisor image, never from the sandbox filesystem.

Two inspection modes

Synchronous (outbound requests): The supervisor buffers the request body, pipes it to each filter script sequentially, and only forwards upstream if all filters pass. This catches sensitive data exfiltration and prompt injection in outbound calls before they leave the sandbox boundary.
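The synchronous path can be sketched as a sequential runner. This is a minimal illustration, not the proposed implementation: the function name, the `FilterResult` type, and the dict shape of each filter entry are hypothetical, mirroring the `content_filters` policy fields described below.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class FilterResult:
    allowed: bool
    reason: str = ""

def run_filters_sync(body: bytes, filters: list[dict]) -> FilterResult:
    """Pipe the buffered request body through each filter script in order.

    Each entry is a dict like {"script": path, "timeout_ms": 500,
    "on_timeout": "deny"}, mirroring the proposed policy surface.
    The first denial short-circuits; the supervisor would return 403.
    """
    for f in filters:
        try:
            proc = subprocess.run(
                [f["script"]],
                input=body,                       # raw body bytes on stdin
                capture_output=True,
                timeout=f["timeout_ms"] / 1000.0,
            )
        except subprocess.TimeoutExpired:
            if f.get("on_timeout", "deny") == "deny":
                return FilterResult(False, f"filter timeout: {f['script']}")
            continue  # fail-open: skip this filter and move on
        if proc.returncode != 0:
            # exit 1 (deny) and 2+ (script error) both short-circuit here
            reason = proc.stdout.decode(errors="replace").strip() or "denied"
            return FilterResult(False, reason)
    return FilterResult(True)
```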

Async streaming (inbound responses): For SSE/streaming inference responses, buffering the full response before returning it to the agent kills latency. Instead:

  1. Proxy chunks through to the agent in real-time.
  2. Simultaneously accumulate chunks and pipe to the filter script(s) asynchronously.
  3. If a filter flags content mid-stream, sever the connection — inject an SSE error frame and close the stream.
  4. Optionally: accumulate to a temp file outside the sandbox, run the full scan on completion, and only then decide whether to persist/allow the result.

The tradeoff: the agent may see partial content before denial. For prompt injection this is acceptable — the dangerous part is the agent acting on injected instructions, not reading partial tokens. Severing the stream causes most agent frameworks to treat the response as failed and not act on it.
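Steps 1–3 of the streaming mode can be sketched with a generator that tees chunks to an accumulator and severs on a flag. The `scan` callback signature is hypothetical; a real implementation would run the filter script asynchronously rather than inline per chunk.

```python
SEVER_FRAME = b"event: error\ndata: content filter denial\n\n"

def stream_with_inspection(chunks, scan):
    """Forward upstream chunks to the agent while accumulating for inspection.

    `scan(accumulated: bytes) -> str | None` returns a denial reason, or
    None to keep streaming. On a flag, inject an SSE error frame and close.
    """
    buf = bytearray()
    for chunk in chunks:
        yield chunk                  # step 1: proxy through in real time
        buf.extend(chunk)            # step 2: accumulate for the filter
        reason = scan(bytes(buf))
        if reason is not None:
            yield SEVER_FRAME        # step 3: sever mid-stream
            return
```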

Policy surface

endpoints:
  - host: api.openai.com
    port: 443
    protocol: rest
    enforcement: enforce
    content_filters:
      - script: /etc/openshell/filters/injection-scan.sh
        direction: response
        timeout_ms: 500
        on_timeout: deny
      - script: /etc/openshell/filters/onnx-guard.sh
        direction: response
        timeout_ms: 1000
        on_timeout: deny
      - script: /etc/openshell/filters/secret-redact.sh
        direction: request
        timeout_ms: 300
        on_timeout: deny
      - script: /etc/openshell/filters/unicode-normalize.sh
        direction: both
        timeout_ms: 200
        on_timeout: deny
  • script: Absolute path on the supervisor filesystem. Must be executable. Not accessible from inside the sandbox.
  • direction: Which body to inspect — request (outbound), response (inbound), or both.
  • timeout_ms: Per-script execution timeout. Prevents slow classifiers from blocking the proxy indefinitely.
  • on_timeout: Fail-closed (deny, default) or fail-open (allow) when the script exceeds its timeout.
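A sketch of how the supervisor might validate these fields when loading policy — `ContentFilter` and `parse_filters` are illustrative names, not existing OpenShell code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentFilter:
    script: str               # absolute path on the supervisor filesystem
    direction: str            # "request", "response", or "both"
    timeout_ms: int
    on_timeout: str = "deny"  # fail-closed by default

    def applies_to(self, direction: str) -> bool:
        return self.direction in (direction, "both")

def parse_filters(entries: list[dict]) -> list[ContentFilter]:
    """Validate raw content_filters entries (dicts parsed from the YAML)."""
    filters = []
    for e in entries:
        if e.get("direction") not in ("request", "response", "both"):
            raise ValueError(f"bad direction: {e.get('direction')!r}")
        if e.get("on_timeout", "deny") not in ("deny", "allow"):
            raise ValueError(f"bad on_timeout: {e.get('on_timeout')!r}")
        filters.append(ContentFilter(
            script=e["script"],
            direction=e["direction"],
            timeout_ms=int(e["timeout_ms"]),
            on_timeout=e.get("on_timeout", "deny"),
        ))
    return filters
```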

Script interface

  • stdin: Raw body bytes (for streaming mode: accumulated chunks so far).
  • stdout: On deny (exit 1), a single-line human-readable reason (e.g., "Prompt injection: instruction override pattern detected"). On allow (exit 0), stdout is ignored.
  • stderr: Logged by the supervisor at debug level for diagnostics.
  • Exit code: 0 = allow, 1 = deny, 2+ = script error (treated as deny when the filter is configured fail-closed, i.e. on_timeout: deny).
  • Environment variables: The supervisor injects metadata: OPENSHELL_FILTER_HOST, OPENSHELL_FILTER_PORT, OPENSHELL_FILTER_METHOD, OPENSHELL_FILTER_PATH, OPENSHELL_FILTER_DIRECTION (request/response).

Recommended Filter Stack

Based on fullsend's production pipeline and experimental results, the recommended default filter stack for OpenShell would be:

  1. UnicodeNormalizer (both directions, fast) — strip invisible characters, bidi overrides, tag chars before any other scanner sees the content. Pre-processing stage, not a deny gate.
  2. ContextInjectionScanner (response direction, regex) — 27 patterns covering instruction override, credential exfiltration, hidden content, execution-via-translation. Fast, deterministic, zero false positives on known patterns.
  3. ONNXGuardScanner (response direction, ML) — DeBERTa-v3 sentence-level classification for social engineering and indirect prompt injection that regex won't catch. Configurable threshold (default 0.92).
  4. SecretRedactor (request direction, regex) — prevent exfiltration of API keys, tokens, private keys, DB strings in outbound requests. 20+ prefix patterns + structural matching.
  5. SSRFValidator (request direction, URL extraction) — block requests to private networks, cloud metadata, dangerous schemes.

The ML + regex combination is critical: fullsend's evaluation showed ML alone misses structural attacks (unicode tricks, encoded exfiltration) while regex alone misses social engineering and indirect injection.
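As a sketch of stage 1, a UnicodeNormalizer-style pass might strip invisible and steering characters before NFKC folding. The character sets below are illustrative, not fullsend's exact lists:

```python
import unicodedata

# Illustrative character classes: zero-width characters, bidi
# override/isolate controls, and Unicode "tag" characters
# (U+E0000..U+E007F) used to smuggle hidden text.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
        "\u2066", "\u2067", "\u2068", "\u2069"}

def normalize(text: str) -> str:
    """Strip invisible/steering characters, then NFKC-fold the remainder."""
    cleaned = "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH
        and ch not in BIDI
        and not 0xE0000 <= ord(ch) <= 0xE007F   # tag characters
    )
    # NFKC folds fullwidth forms so downstream regex scanners see
    # canonical ASCII (e.g. "ｉｇｎｏｒｅ" -> "ignore").
    return unicodedata.normalize("NFKC", cleaned)
```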

Observability

Every filter execution must be fully auditable. Operators, security teams, and compliance workflows need to see what was inspected, what was flagged, and what was allowed through.

Every filter execution emits an OCSF event, regardless of outcome:

| Outcome | OCSF event | Severity | What is logged |
| --- | --- | --- | --- |
| Allow | HttpActivityBuilder | Informational | Filter name, script path, direction, host:port, execution time, body size |
| Deny | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | Medium | All of the above + denial reason from script stdout, body hash (SHA-256) |
| Timeout | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | Medium | All of the above + configured timeout, on_timeout action taken |
| Script error | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | High | All of the above + exit code, stderr (truncated) |
| Async stream severed | DetectionFindingBuilder | Medium | Filter name, bytes streamed before sever, accumulated chunk count, denial reason |

Key observability constraints:

  • Never log body content in OCSF events. Body bytes may contain secrets, PII, or credentials. Log a SHA-256 hash of the body for correlation, not the content itself. The OCSF JSONL file may be shipped to external systems.
  • Always log execution time. Filter latency is critical for debugging proxy performance. Emit filter_duration_ms on every event.
  • Correlation ID. Each request/response pair gets a unique ID so allow/deny decisions on the same HTTP transaction can be correlated across request-side and response-side filter events.
  • Structured filter metadata. OCSF events include: filter.name (script basename), filter.script_path, filter.direction, filter.timeout_ms, filter.exit_code, filter.duration_ms, filter.body_size_bytes, filter.body_hash (SHA-256), filter.denial_reason (on deny only).
  • Shorthand log line. The OCSF shorthand layer emits a grep-friendly summary: CONTENT_FILTER DENY injection-scan.sh response api.openai.com:443 "instruction override pattern" 12ms or CONTENT_FILTER ALLOW onnx-guard.sh response api.openai.com:443 45ms.

Integration with existing observability:

Alternatives Considered

  • In-sandbox filters (readonly mount): Simpler deployment but weaker security boundary. A sandbox escape or container breakout could tamper with the scripts. Rejected in favor of supervisor-side execution.
  • Built-in classifier (compiled into supervisor): Lower latency but rigid. Operators can't customize detection rules, add domain-specific patterns, or swap classifiers without rebuilding the supervisor. The script interface lets operators iterate without image rebuilds. However, a compiled ONNX runtime (as fullsend does with hugot) could be offered as a built-in fast-path option alongside the script interface.
  • Gateway-side inspection: The gateway doesn't have body bytes — it receives gRPC metadata from the sandbox. Moving inspection to the gateway would require streaming body content over gRPC, adding significant complexity. The supervisor already has the bytes in flight.
  • Buffered-only (no streaming mode): Simpler but kills inference latency. Agents routinely use streaming for LLM calls — buffering a 30-second generation to scan it would break interactive workflows. The async streaming mode preserves responsiveness at the cost of partial exposure.
  • OPA-only (Rego rules on body content): OPA is not designed for arbitrary text classification. Pattern matching in Rego is limited to regex.match — no subprocess execution, no ML model calls. OPA remains the policy decision point; content filters are a pre-processing stage.
  • Merge with Privacy Router: The Privacy Router (#1043) classifies data sensitivity for routing decisions (local vs. external provider). Content filters classify adversarial intent and data exfiltration for allow/deny decisions. They share interest in body content but serve different security objectives. Keeping them separate avoids coupling routing logic to content scanning logic.
  • Fire-and-forget audit-only mode: Log but don't block. Useful for gradual rollout — could be added as an enforcement: audit option on individual filters. But insufficient standalone for prompt injection and exfiltration which require active blocking.

Agent Investigation

Codebase surveyed prior to filing:
