
feat(sandbox): L7 content inspection hooks — scriptable request/response filtering #1272

@maruiz93

Description


Problem Statement

OpenShell enforces network policy at L4 (allow/deny by host:port) and L7 (method/path/query for REST, operation-type/fields for GraphQL). Neither layer inspects the content of request or response bodies for security-relevant signals like prompt injection, PII leakage, sensitive data exfiltration, or adversarial payloads.

Agents operating inside sandboxes receive LLM responses and make outbound API calls. An attacker who controls upstream content (e.g., a poisoned web page fetched by the agent, a malicious tool response, or a compromised API) can embed prompt injection payloads in responses. Conversely, a compromised or misguided agent can exfiltrate sensitive data in outbound request bodies. Today, neither vector is visible to the policy layer.

The inference proxy already buffers request/response bodies for GraphQL inspection (#1022) and credential injection (#689). This proposal adds a general-purpose content inspection hook system that lets operators run external scripts/classifiers against L7 traffic inline, per-route.

Relationship to Privacy Router

The Privacy Router (#1043) and content inspection hooks are complementary, not overlapping:

  • Privacy Router answers: where should this traffic go? It routes inference requests to local or external providers based on data sensitivity, PII classification, and operator policy. It controls the destination.
  • Content inspection hooks answer: should this traffic flow at all? They inspect request/response bodies for adversarial content (prompt injection), data exfiltration (secrets, PII in outbound calls), and policy violations. They gate the traffic.

A deployment might use both: the Privacy Router ensures sensitive prompts stay on a local NIM deployment, while content filters block prompt injection payloads in responses regardless of which provider served them. The router can't catch prompt injection (it classifies sensitivity, not adversarial intent), and content filters don't decide routing (they allow or deny, not redirect).

They share an interest in body content but serve different security objectives — routing policy vs. content policy.

Prior Art: fullsend Security Pipeline

The fullsend project has a production-grade, multi-layered security pipeline that validates this approach and should inform the design:

  • Input pipeline (InputPipeline): UnicodeNormalizer → ContextInjectionScanner — runs before untrusted text enters agent processing.
  • Output pipeline (OutputPipeline): SecretRedactor — runs before agent-generated text is posted to external APIs.

Key scanners:

| Scanner | Technique | What it catches |
| --- | --- | --- |
| ContextInjectionScanner | 27 regex patterns across 4 categories | Instruction override, credential exfiltration, hidden content, execution-via-translation |
| ONNXGuardScanner | ProtectAI DeBERTa-v3 ONNX model, sentence-level splitting | Social engineering, indirect prompt injection (83% detection, 0 false positives in eval) |
| UnicodeNormalizer | Strip + NFKC normalize | Zero-width chars, bidi overrides, ANSI escapes, tag characters with hidden text, fullwidth encoding |
| SecretRedactor | 20+ prefix patterns + structural patterns | API keys (OpenAI sk-, GitHub ghp_, AWS AKIA...), private key blocks, DB connection strings, auth headers |
| SSRFValidator | IP blocklist + DNS rebinding defense | RFC 1918, cloud metadata endpoints, dangerous URI schemes |

Additionally, fullsend ships runtime hooks (Python scripts as PreToolUse/PostToolUse hooks):

  • tirith_check.py — Tirith CLI for static command injection + unicode trick detection
  • canary_pretool.py — detects canary token exfiltration in tool inputs
  • secret_redact_posttool.py — redacts secrets from tool output before the LLM sees them
  • unicode_posttool.py — normalizes unicode in tool output

Experimental validation (experiments/guardrails-eval/): evaluated LLM Guard (DeBERTa-v3), NeMo Guardrails (YARA), and Model Armor (GCP). Key finding: ML sentence-level scanning + regex pattern matching is complementary — ML catches social engineering, regex catches structural attacks. Neither alone is sufficient.

Proposed Design

Core concept

Add a content_filters field to L7 endpoint policy. Each filter references an executable script that the supervisor (not the sandbox) runs against request and/or response bodies. Scripts receive body content on stdin and signal allow (exit 0) or deny (exit 1 + reason on stdout). The supervisor short-circuits with a 403 and the denial reason if any filter denies.

Scripts live outside the sandbox

Filter scripts run in the supervisor process context, not inside the agent container. This is a hard requirement — if the agent can modify the scripts that inspect it, the entire mechanism is bypassable. Scripts are mounted from the host or baked into the supervisor image, never from the sandbox filesystem.

Two inspection modes

Synchronous (outbound requests): The supervisor buffers the request body, pipes it to each filter script sequentially, and only forwards upstream if all filters pass. This catches sensitive data exfiltration and prompt injection in outbound calls before they leave the sandbox boundary.
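The synchronous path can be sketched as a sequential runner. This is a minimal illustration, not the proposed implementation: the function name, the `FilterResult` type, and the dict shape of each filter entry are hypothetical, mirroring the `content_filters` policy fields described below.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class FilterResult:
    allowed: bool
    reason: str = ""

def run_filters_sync(body: bytes, filters: list[dict]) -> FilterResult:
    """Pipe the buffered request body through each filter script in order.

    Each entry is a dict like {"script": path, "timeout_ms": 500,
    "on_timeout": "deny"}, mirroring the proposed policy surface.
    The first denial short-circuits; the supervisor would return 403.
    """
    for f in filters:
        try:
            proc = subprocess.run(
                [f["script"]],
                input=body,                       # raw body bytes on stdin
                capture_output=True,
                timeout=f["timeout_ms"] / 1000.0,
            )
        except subprocess.TimeoutExpired:
            if f.get("on_timeout", "deny") == "deny":
                return FilterResult(False, f"filter timeout: {f['script']}")
            continue  # fail-open: skip this filter and move on
        if proc.returncode != 0:
            # exit 1 (deny) and 2+ (script error) both short-circuit here
            reason = proc.stdout.decode(errors="replace").strip() or "denied"
            return FilterResult(False, reason)
    return FilterResult(True)
```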

Async streaming (inbound responses): For SSE/streaming inference responses, buffering the full response before returning it to the agent kills latency. Instead:

  1. Proxy chunks through to the agent in real-time.
  2. Simultaneously accumulate chunks and pipe to the filter script(s) asynchronously.
  3. If a filter flags content mid-stream, sever the connection — inject an SSE error frame and close the stream.
  4. Optionally: accumulate to a temp file outside the sandbox, run the full scan on completion, and only then decide whether to persist/allow the result.

The tradeoff: the agent may see partial content before denial. For prompt injection this is acceptable — the dangerous part is the agent acting on injected instructions, not reading partial tokens. Severing the stream causes most agent frameworks to treat the response as failed and not act on it.
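Steps 1–3 of the streaming mode can be sketched with a generator that tees chunks to an accumulator and severs on a flag. The `scan` callback signature is hypothetical; a real implementation would run the filter script asynchronously rather than inline per chunk.

```python
SEVER_FRAME = b"event: error\ndata: content filter denial\n\n"

def stream_with_inspection(chunks, scan):
    """Forward upstream chunks to the agent while accumulating for inspection.

    `scan(accumulated: bytes) -> str | None` returns a denial reason, or
    None to keep streaming. On a flag, inject an SSE error frame and close.
    """
    buf = bytearray()
    for chunk in chunks:
        yield chunk                  # step 1: proxy through in real time
        buf.extend(chunk)            # step 2: accumulate for the filter
        reason = scan(bytes(buf))
        if reason is not None:
            yield SEVER_FRAME        # step 3: sever mid-stream
            return
```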

Policy surface

endpoints:
  - host: api.openai.com
    port: 443
    protocol: rest
    enforcement: enforce
    content_filters:
      - script: /etc/openshell/filters/injection-scan.sh
        direction: response
        timeout_ms: 500
        on_timeout: deny
      - script: /etc/openshell/filters/onnx-guard.sh
        direction: response
        timeout_ms: 1000
        on_timeout: deny
      - script: /etc/openshell/filters/secret-redact.sh
        direction: request
        timeout_ms: 300
        on_timeout: deny
      - script: /etc/openshell/filters/unicode-normalize.sh
        direction: both
        timeout_ms: 200
        on_timeout: deny
  • script: Absolute path on the supervisor filesystem. Must be executable. Not accessible from inside the sandbox.
  • direction: Which body to inspect — request (outbound), response (inbound), or both.
  • timeout_ms: Per-script execution timeout. Prevents slow classifiers from blocking the proxy indefinitely.
  • on_timeout: Fail-closed (deny, default) or fail-open (allow) when the script exceeds its timeout.
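A sketch of how the supervisor might validate these fields when loading policy — `ContentFilter` and `parse_filters` are illustrative names, not existing OpenShell code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentFilter:
    script: str               # absolute path on the supervisor filesystem
    direction: str            # "request", "response", or "both"
    timeout_ms: int
    on_timeout: str = "deny"  # fail-closed by default

    def applies_to(self, direction: str) -> bool:
        return self.direction in (direction, "both")

def parse_filters(entries: list[dict]) -> list[ContentFilter]:
    """Validate raw content_filters entries (dicts parsed from the YAML)."""
    filters = []
    for e in entries:
        if e.get("direction") not in ("request", "response", "both"):
            raise ValueError(f"bad direction: {e.get('direction')!r}")
        if e.get("on_timeout", "deny") not in ("deny", "allow"):
            raise ValueError(f"bad on_timeout: {e.get('on_timeout')!r}")
        filters.append(ContentFilter(
            script=e["script"],
            direction=e["direction"],
            timeout_ms=int(e["timeout_ms"]),
            on_timeout=e.get("on_timeout", "deny"),
        ))
    return filters
```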

Script interface

  • stdin: Raw body bytes (for streaming mode: accumulated chunks so far).
  • stdout: On deny (exit 1), a single-line human-readable reason (e.g., "Prompt injection: instruction override pattern detected"). On allow (exit 0), stdout is ignored.
  • stderr: Logged by the supervisor at debug level for diagnostics.
  • Exit code: 0 = allow, 1 = deny, 2+ = script error (treated as deny when the filter is configured fail-closed, i.e. on_timeout: deny).
  • Environment variables: The supervisor injects metadata: OPENSHELL_FILTER_HOST, OPENSHELL_FILTER_PORT, OPENSHELL_FILTER_METHOD, OPENSHELL_FILTER_PATH, OPENSHELL_FILTER_DIRECTION (request/response).

Recommended Filter Stack

Based on fullsend's production pipeline and experimental results, the recommended default filter stack for OpenShell would be:

  1. UnicodeNormalizer (both directions, fast) — strip invisible characters, bidi overrides, tag chars before any other scanner sees the content. Pre-processing stage, not a deny gate.
  2. ContextInjectionScanner (response direction, regex) — 27 patterns covering instruction override, credential exfiltration, hidden content, execution-via-translation. Fast, deterministic, zero false positives on known patterns.
  3. ONNXGuardScanner (response direction, ML) — DeBERTa-v3 sentence-level classification for social engineering and indirect prompt injection that regex won't catch. Configurable threshold (default 0.92).
  4. SecretRedactor (request direction, regex) — prevent exfiltration of API keys, tokens, private keys, DB strings in outbound requests. 20+ prefix patterns + structural matching.
  5. SSRFValidator (request direction, URL extraction) — block requests to private networks, cloud metadata, dangerous schemes.

The ML + regex combination is critical: fullsend's evaluation showed ML alone misses structural attacks (unicode tricks, encoded exfiltration) while regex alone misses social engineering and indirect injection.
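As a sketch of stage 1, a UnicodeNormalizer-style pass might strip invisible and steering characters before NFKC folding. The character sets below are illustrative, not fullsend's exact lists:

```python
import unicodedata

# Illustrative character classes: zero-width characters, bidi
# override/isolate controls, and Unicode "tag" characters
# (U+E0000..U+E007F) used to smuggle hidden text.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
        "\u2066", "\u2067", "\u2068", "\u2069"}

def normalize(text: str) -> str:
    """Strip invisible/steering characters, then NFKC-fold the remainder."""
    cleaned = "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH
        and ch not in BIDI
        and not 0xE0000 <= ord(ch) <= 0xE007F   # tag characters
    )
    # NFKC folds fullwidth forms so downstream regex scanners see
    # canonical ASCII (e.g. "ｉｇｎｏｒｅ" -> "ignore").
    return unicodedata.normalize("NFKC", cleaned)
```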

Observability

Every filter execution must be fully auditable. Operators, security teams, and compliance workflows need to see what was inspected, what was flagged, and what was allowed through.

Every filter execution emits an OCSF event, regardless of outcome:

| Outcome | OCSF event | Severity | What is logged |
| --- | --- | --- | --- |
| Allow | HttpActivityBuilder | Informational | Filter name, script path, direction, host:port, execution time, body size |
| Deny | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | Medium | All of the above + denial reason from script stdout, body hash (SHA-256) |
| Timeout | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | Medium | All of the above + configured timeout, on_timeout action taken |
| Script error | HttpActivityBuilder + DetectionFindingBuilder (dual-emit) | High | All of the above + exit code, stderr (truncated) |
| Async stream severed | DetectionFindingBuilder | Medium | Filter name, bytes streamed before sever, accumulated chunk count, denial reason |

Key observability constraints:

  • Never log body content in OCSF events. Body bytes may contain secrets, PII, or credentials. Log a SHA-256 hash of the body for correlation, not the content itself. The OCSF JSONL file may be shipped to external systems.
  • Always log execution time. Filter latency is critical for debugging proxy performance. Emit filter_duration_ms on every event.
  • Correlation ID. Each request/response pair gets a unique ID so allow/deny decisions on the same HTTP transaction can be correlated across request-side and response-side filter events.
  • Structured filter metadata. OCSF events include: filter.name (script basename), filter.script_path, filter.direction, filter.timeout_ms, filter.exit_code, filter.duration_ms, filter.body_size_bytes, filter.body_hash (SHA-256), filter.denial_reason (on deny only).
  • Shorthand log line. The OCSF shorthand layer emits a grep-friendly summary: CONTENT_FILTER DENY injection-scan.sh response api.openai.com:443 "instruction override pattern" 12ms or CONTENT_FILTER ALLOW onnx-guard.sh response api.openai.com:443 45ms.

Integration with existing observability:

Alternatives Considered

  • In-sandbox filters (readonly mount): Simpler deployment but weaker security boundary. A sandbox escape or container breakout could tamper with the scripts. Rejected in favor of supervisor-side execution.
  • Built-in classifier (compiled into supervisor): Lower latency but rigid. Operators can't customize detection rules, add domain-specific patterns, or swap classifiers without rebuilding the supervisor. The script interface lets operators iterate without image rebuilds. However, a compiled ONNX runtime (as fullsend does with hugot) could be offered as a built-in fast-path option alongside the script interface.
  • Gateway-side inspection: The gateway doesn't have body bytes — it receives gRPC metadata from the sandbox. Moving inspection to the gateway would require streaming body content over gRPC, adding significant complexity. The supervisor already has the bytes in flight.
  • Buffered-only (no streaming mode): Simpler but kills inference latency. Agents routinely use streaming for LLM calls — buffering a 30-second generation to scan it would break interactive workflows. The async streaming mode preserves responsiveness at the cost of partial exposure.
  • OPA-only (Rego rules on body content): OPA is not designed for arbitrary text classification. Pattern matching in Rego is limited to regex.match — no subprocess execution, no ML model calls. OPA remains the policy decision point; content filters are a pre-processing stage.
  • Merge with Privacy Router: The Privacy Router (#1043) classifies data sensitivity for routing decisions (local vs. external provider). Content filters classify adversarial intent and data exfiltration for allow/deny decisions. They share interest in body content but serve different security objectives. Keeping them separate avoids coupling routing logic to content scanning logic.
  • Fire-and-forget audit-only mode: Log but don't block. Useful for gradual rollout — could be added as an enforcement: audit option on individual filters. But insufficient standalone for prompt injection and exfiltration which require active blocking.

Agent Investigation

Codebase surveyed prior to filing:
