DataFog

The data firewall for agents and developer tools.

DataFog is a runtime data governance layer for AI agents and developer tooling.

It runs a single in-process policy loop: detect → decide → enforce. For each payload crossing a process boundary (command execution, file read/write, or API action), it detects sensitive entities, evaluates policy, and enforces the result before the action proceeds.
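The loop can be pictured with a small Go sketch. Everything here is hypothetical and written only for illustration: the `Finding` type, the single email regex, and the toy policy in `decide` are assumptions, not DataFog's internal code. Only the action types (`file.write`, `shell.exec`) and decision names come from this README.

```go
package main

import (
	"fmt"
	"regexp"
)

// Finding is a hypothetical record of one detected entity.
type Finding struct {
	EntityType string
	Value      string
}

// emailPattern is a deliberately simple illustrative detector, not DataFog's.
var emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// detect returns one Finding per email-like match in text.
func detect(text string) []Finding {
	var findings []Finding
	for _, m := range emailPattern.FindAllString(text, -1) {
		findings = append(findings, Finding{EntityType: "email", Value: m})
	}
	return findings
}

// decide applies a toy policy (an assumption): clean payloads are allowed,
// shell commands carrying findings are denied, everything else is redacted.
func decide(actionType string, findings []Finding) string {
	switch {
	case len(findings) == 0:
		return "allow"
	case actionType == "shell.exec":
		return "deny"
	default:
		return "allow_with_redaction"
	}
}

// enforce returns the payload that may proceed and whether the action may run.
func enforce(decision, text string) (string, bool) {
	switch decision {
	case "deny":
		return "", false
	case "allow_with_redaction":
		return emailPattern.ReplaceAllString(text, "[REDACTED]"), true
	default:
		return text, true
	}
}

func main() {
	text := "customer email is alice@example.com"
	decision := decide("file.write", detect(text))
	out, ok := enforce(decision, text)
	fmt.Println(decision, ok, out)
	// prints: allow_with_redaction true customer email is [REDACTED]
}
```

The point of the sketch is the ordering: nothing reaches `enforce` without a decision, and nothing executes unless `enforce` returns true.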

This repo has two runtime pieces:

datafog-api – HTTP API for scan/decide/transform/receipts.
datafog-shim – optional runtime policy gate wrapper for CLI-style execution.

The wrapper process is still named datafog-shim for compatibility, but we describe its role as a policy gate.

What DataFog does (technical)

Detect sensitive entities in text and payload context (/v1/scan).
Decide using adapter-aware policy rules (/v1/decide) from policy.json.
Enforce the decision before execution (allow, transform, allow_with_redaction, or deny) in consuming runtimes.
Transform or tokenize matched data deterministically when a policy asks for it (/v1/transform, /v1/anonymize).
Emit an auditable receipt for every enforcement decision (/v1/receipts/{id}).
Optionally emit decision events (/v1/events) when DATAFOG_EVENTS_PATH is set.
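To make "deterministic" concrete, here is a minimal sketch of one common way to produce stable tokens: a keyed hash (HMAC-SHA256) over the entity type and value. This is an assumed technique chosen for illustration; the README does not say how `/v1/transform` derives tokens, and the `tokenize` function, the `tok_` prefix, and the demo key are all hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tokenize maps (entityType, value) to a stable token using HMAC-SHA256.
// The same key and input always yield the same token.
func tokenize(key []byte, entityType, value string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(entityType))
	mac.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	mac.Write([]byte(value))
	sum := mac.Sum(nil)
	// Keep only the first 8 bytes (16 hex characters) for a compact token.
	return "tok_" + entityType + "_" + hex.EncodeToString(sum[:8])
}

func main() {
	key := []byte("demo-key-do-not-use") // hypothetical key, for the example only
	a := tokenize(key, "email", "alice@example.com")
	b := tokenize(key, "email", "alice@example.com")
	c := tokenize(key, "email", "bob@example.com")
	fmt.Println(a == b, a == c)
	// prints: true false
}
```

Because the mapping is stable, the same value tokenizes identically across requests, which keeps joins and log correlation possible without exposing the original value.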

What it does not do

It does not secure every layer of your platform for you.
It does not continuously discover vulnerabilities.
It does not provide a policy-editing UI or support dynamic policy updates through the API.
It does not guarantee zero false positives/negatives from detection (detectors are deterministic and regex/heuristic based).

Use cases

Prevent sensitive data from crossing process boundaries before it leaves the machine (for example: a shell command exposing credentials or a script writing secret-bearing files).
Enforce policy-specific transformations such as masking, tokenization, or redaction at runtime.
Add pre-execution guardrails to AI agents and CLI workflows.
Keep auditable receipts/events for every policy decision.

Positioning

Developers and agent builders: DataFog is a data-aware policy enforcement layer for CLI tools and AI agents. It sits in your PATH or runtime, inspects data flowing through commands, and enforces policy before sensitive actions execute.
Security/compliance buyers: DataFog maps closely to runtime DLP for developer workstations, but without the legacy footprint: policy is programmable (OPA-style), decision-aware, and process-bound.
Broader view: DataFog is the data plane for agent governance — detect, decide, enforce, and audit.

Repository layout

cmd/datafog-api: API server.
cmd/datafog-shim: policy-gate wrapper CLI.
internal/policy: policy parsing and matching.
internal/scan: entity detectors.
internal/transform: deterministic redaction/masking/tokenization/anonymization.
internal/receipts: receipt persistence.
internal/server: HTTP handlers and middleware.
internal/shim: decision + execution adapters.
config/policy.json: starter policy used by default.
docs/: API contract and operational docs.

Prerequisites

Go 1.22+
Optional: Docker (for container workflow)
Optional: jq for pretty-printing JSON

Quick start (API only)

go mod download
go run ./cmd/datafog-api

The API listens on :8080 by default and requires a valid policy file at config/policy.json.

Verify service is up:

curl -i http://localhost:8080/health

If you set DATAFOG_API_TOKEN, send it on every request using:

Authorization: Bearer <token> header, or
X-API-Key: <token> header.
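From Go, attaching the token might look like the sketch below. The helper name `newRequest` is hypothetical; it uses the `Authorization: Bearer` form, and the `X-API-Key` form would work the same way with a different header name.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRequest builds a request and, when token is non-empty, attaches it
// using the Authorization: Bearer form described above.
func newRequest(method, url, token string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	req, err := newRequest(http.MethodGet, "http://localhost:8080/health", "changeme")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Authorization"))
	// prints: Bearer changeme
}
```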

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `DATAFOG_POLICY_PATH` | `config/policy.json` | Policy snapshot loaded at startup |
| `DATAFOG_RECEIPT_PATH` | `datafog_receipts.jsonl` | Append-only receipts file |
| `DATAFOG_EVENTS_PATH` | (unset) | NDJSON event log for decision events |
| `DATAFOG_ADDR` | `:8080` | HTTP listen address |
| `DATAFOG_API_TOKEN` | (unset) | Optional API auth token |
| `DATAFOG_RATE_LIMIT_RPS` | `0` | Global request cap in RPS (`0` disables) |
| `DATAFOG_READ_TIMEOUT` | `5s` | HTTP read timeout |
| `DATAFOG_WRITE_TIMEOUT` | `10s` | HTTP write timeout |
| `DATAFOG_READ_HEADER_TIMEOUT` | `2s` | Request-header parse timeout |
| `DATAFOG_IDLE_TIMEOUT` | `30s` | Idle keep-alive timeout |
| `DATAFOG_SHUTDOWN_TIMEOUT` | `10s` | Graceful shutdown timeout |
| `GOMAXPROCS` | (runtime default) | Auto-tuned at startup to detected CPU limit; set explicitly to override |
| `DATAFOG_PPROF_ADDR` | (unset) | If set, starts optional profiling server on this address (example `localhost:6060`) |
| `DATAFOG_FGPROF` | `false` | Add `/debug/fgprof` endpoint to the profiling server |
| `DATAFOG_ENABLE_DEMO` | (unset) | Enable `/demo*` endpoints |
| `DATAFOG_DEMO_HTML` | `docs/demo.html` | Path to demo HTML |

Duration values use Go duration syntax, for example 1s, 500ms, 2m.
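As a rough sketch of how such values could be consumed, the snippet below parses them with `time.ParseDuration` and maps them onto `net/http.Server` fields. The helper names `parseOr` and `serverFromEnv` are hypothetical, and the mapping of variables to server fields is an assumption rather than a description of the actual server code.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// parseOr parses raw with time.ParseDuration and returns fallback when raw is
// empty or malformed. A stricter server might refuse to start instead.
func parseOr(raw string, fallback time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return fallback
	}
	return d
}

// serverFromEnv maps the timeout variables onto net/http.Server fields,
// using the defaults listed in the configuration table (assumed mapping).
func serverFromEnv() *http.Server {
	return &http.Server{
		Addr:              ":8080",
		ReadTimeout:       parseOr(os.Getenv("DATAFOG_READ_TIMEOUT"), 5*time.Second),
		WriteTimeout:      parseOr(os.Getenv("DATAFOG_WRITE_TIMEOUT"), 10*time.Second),
		ReadHeaderTimeout: parseOr(os.Getenv("DATAFOG_READ_HEADER_TIMEOUT"), 2*time.Second),
		IdleTimeout:       parseOr(os.Getenv("DATAFOG_IDLE_TIMEOUT"), 30*time.Second),
	}
}

func main() {
	srv := serverFromEnv()
	fmt.Println(srv.ReadTimeout, srv.WriteTimeout, srv.ReadHeaderTimeout, srv.IdleTimeout)
	fmt.Println(parseOr("500ms", time.Second), parseOr("2m", time.Second))
}
```

Note that a bare number such as `5` is not valid Go duration syntax; a unit suffix is required.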

API surface

Base URL defaults to http://localhost:8080.

| Method | Path | What it does |
| --- | --- | --- |
| `GET` | `/health` | Health plus policy identity + start time |
| `GET` | `/v1/policy/version` | Current policy id/version |
| `POST` | `/v1/scan` | Run detector set on text |
| `POST` | `/v1/decide` | Evaluate an action + findings and get a decision |
| `POST` | `/v1/transform` | Apply requested transform mode(s) |
| `POST` | `/v1/anonymize` | Apply irreversible anonymization |
| `GET` | `/v1/receipts/{id}` | Read a decision receipt |
| `GET` | `/v1/events` | List recent decision events |
| `GET` | `/metrics` | In-process metrics counters |

Optional demo routes (only when demo mode is enabled):

GET /demo
POST /demo/exec
POST /demo/write-file
POST /demo/read-file
POST /demo/seed
GET /demo/sandbox

Optional profiling endpoints

For production debugging, set DATAFOG_PPROF_ADDR to run an auxiliary profiling server:

/debug/pprof/ (standard net/http/pprof handlers: profiles, goroutines, heap, trace)
/debug/fgprof when DATAFOG_FGPROF=true (fgprof sampling profiler that captures both on-CPU and off-CPU wall-clock time)

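Recommended value:

DATAFOG_PPROF_ADDR=localhost:6060

The profiling server is disabled by default. Binding to localhost keeps it off external interfaces; if you bind to all interfaces instead (for example :6060), expose it only on trusted networks.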

Decisions and idempotency

Endpoints that accept idempotency_key:

/v1/scan
/v1/decide
/v1/transform
/v1/anonymize

Repeated requests with the same key and an identical payload return the same status and body. If the same key is reused with a different payload, the response is HTTP 409 with idempotency_conflict.
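The contract can be illustrated with a small in-memory sketch. This is not the server's implementation: the `IdempotencyStore` type, the payload hashing, and the JSON shape of the conflict body are assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// cached is the stored outcome of the first request seen for a key.
type cached struct {
	payloadHash [32]byte
	status      int
	body        string
}

// IdempotencyStore is a hypothetical in-memory replay cache.
type IdempotencyStore struct {
	mu      sync.Mutex
	entries map[string]cached
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{entries: make(map[string]cached)}
}

// Do runs handle at most once per key. A replay with the same payload gets the
// cached response; a replay with a different payload gets a 409 conflict.
// The lock is held while handle runs, which keeps the sketch simple but
// serializes all requests.
func (s *IdempotencyStore) Do(key string, payload []byte, handle func() (int, string)) (int, string) {
	if key == "" {
		return handle() // no key supplied: no replay protection
	}
	hash := sha256.Sum256(payload)
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.entries[key]; ok {
		if prev.payloadHash != hash {
			return 409, `{"error":"idempotency_conflict"}` // body shape is assumed
		}
		return prev.status, prev.body
	}
	status, body := handle()
	s.entries[key] = cached{payloadHash: hash, status: status, body: body}
	return status, body
}

func main() {
	store := NewIdempotencyStore()
	handle := func() (int, string) { return 200, `{"decision":"allow"}` }

	first, _ := store.Do("key-1", []byte(`{"text":"a"}`), handle)
	replay, _ := store.Do("key-1", []byte(`{"text":"a"}`), handle)
	conflict, _ := store.Do("key-1", []byte(`{"text":"b"}`), handle)
	fmt.Println(first, replay, conflict)
	// prints: 200 200 409
}
```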

Basic examples

Scan for entities

curl -X POST http://localhost:8080/v1/scan \
  -H "Content-Type: application/json" \
  -d '{"text":"alice@example.com - API key: SK8x... and 555-123-4567"}'

Decide action

curl -X POST http://localhost:8080/v1/decide \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "type": "file.write",
      "resource": "notes.txt"
    },
    "text": "customer email is alice@example.com"
  }'

Transform detected sensitive data in text

curl -X POST http://localhost:8080/v1/transform \
  -H "Content-Type: application/json" \
  -d '{
    "text": "customer email is alice@example.com",
    "findings": [{"entity_type":"email","value":"alice@example.com","start":18,"end":34,"confidence":0.99}],
    "mode":"mask"
  }'

Fetch a receipt

curl -s http://localhost:8080/v1/receipts/<receipt-id> | jq .

Query events (optional)

curl 'http://localhost:8080/v1/events?limit=20&decision=deny'

Enforcement policy gate (`datafog-shim`)

datafog-shim is an optional runtime layer for CLI-style workflows. It sends action details to DataFog (/v1/decide) before executing shell/file actions.

Build it:

go build -o datafog-shim ./cmd/datafog-shim

Use direct shell mode:

./datafog-shim --policy-url http://localhost:8080 shell rm -rf /tmp/test

Install a managed wrapper:

datafog-shim hooks install --target /usr/bin/git git

Route wrappers through PATH:

export PATH="$HOME/.datafog/shims:$PATH"

Common env vars for the policy gate:

DATAFOG_SHIM_POLICY_URL (required)
DATAFOG_SHIM_API_TOKEN (required if API token is enabled)
DATAFOG_SHIM_MODE (enforced or observe)
DATAFOG_SHIM_EVENT_SINK (optional NDJSON sink)
DATAFOG_SHIM_ENFORCE_POLICY_ERRORS (true to block on policy service errors even in observe mode)

In enforced mode, a blocked action is not executed and the shim exits non-zero. In observe mode, the shim logs decisions but allows execution to continue.
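The interaction between mode, decision, and policy-service errors can be summarized as a small truth function. This is a sketch under stated assumptions: the function `shouldExecute` is hypothetical, only `deny` is treated as blocking, and enforced mode is assumed to fail closed when the policy service is unreachable.

```go
package main

import "fmt"

// shouldExecute reports whether the wrapped command may run.
//   mode                 "enforced" or "observe" (DATAFOG_SHIM_MODE)
//   decision             decision returned by /v1/decide
//   policyErr            true when the policy service could not be consulted
//   enforcePolicyErrors  DATAFOG_SHIM_ENFORCE_POLICY_ERRORS
func shouldExecute(mode, decision string, policyErr, enforcePolicyErrors bool) bool {
	if policyErr {
		// Assumption: enforced mode fails closed; observe mode fails open
		// unless the operator opted in to blocking on policy errors.
		return mode == "observe" && !enforcePolicyErrors
	}
	if mode == "observe" {
		return true // decisions are logged but never block
	}
	// Assumption: transform and allow_with_redaction let the action proceed.
	return decision != "deny"
}

func main() {
	fmt.Println(shouldExecute("enforced", "deny", false, false)) // prints: false
	fmt.Println(shouldExecute("observe", "deny", false, false))  // prints: true
	fmt.Println(shouldExecute("observe", "", true, true))        // prints: false
}
```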

Policy gate receipts are logged to stderr in a compact format:

receipt=<id> decision=<allow|transform|allow_with_redaction|deny>
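If you collect the shim's stderr, a line in this format can be parsed with a few lines of Go. The helper `parseReceiptLine` is hypothetical, and it assumes the line contains only space-separated key=value fields as shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseReceiptLine extracts the receipt id and decision from a line such as
// "receipt=abc123 decision=deny". Unknown fields are ignored.
func parseReceiptLine(line string) (receipt, decision string, ok bool) {
	for _, field := range strings.Fields(line) {
		key, value, found := strings.Cut(field, "=")
		if !found {
			continue
		}
		switch key {
		case "receipt":
			receipt = value
		case "decision":
			decision = value
		}
	}
	return receipt, decision, receipt != "" && decision != ""
}

func main() {
	fmt.Println(parseReceiptLine("receipt=abc123 decision=deny"))
	// prints: abc123 deny true
}
```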

Policy file behavior and limits

Policies live in JSON at DATAFOG_POLICY_PATH.
The policy is loaded at startup only; in this version, restarting the process is the only way to apply policy changes.
Invalid or malformed JSON blocks startup.

config/policy.json in this repo is a runnable example with basic allow/deny/redact behavior.
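A fail-fast load at startup could be sketched as follows. The functions `validatePolicy` and `loadPolicy` are hypothetical, and the check covers only JSON well-formedness; the real policy schema (including the `rules` key used in the demo input) is not described in this README and is assumed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// validatePolicy checks only that data is a well-formed JSON object.
// Schema validation of individual rules is out of scope for this sketch.
func validatePolicy(data []byte) error {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(data, &doc); err != nil {
		return fmt.Errorf("policy is not a valid JSON object: %w", err)
	}
	if doc == nil { // the literal "null" unmarshals without error
		return fmt.Errorf("policy is null")
	}
	return nil
}

// loadPolicy reads and validates the policy file once; callers are expected
// to treat any error as fatal at startup.
func loadPolicy(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read policy %s: %w", path, err)
	}
	if err := validatePolicy(data); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	// In-memory inputs keep the example runnable without a policy file.
	for _, raw := range []string{`{"rules":[]}`, `{"rules":[`} {
		if err := validatePolicy([]byte(raw)); err != nil {
			fmt.Println("refusing to start:", err)
			continue
		}
		fmt.Println("policy accepted")
	}
}
```

A real entry point would call something like `loadPolicy` with the value of DATAFOG_POLICY_PATH and exit non-zero on error.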

Limitations and operational notes

Detection defaults are fast and deterministic, with bounded coverage.
- Good for common formats (e.g., email, phone, SSN, API keys, credit cards) and lightweight heuristic NER.
- Not a full privacy ML detector.
The receipt log and event log are file-based, and their paths must be writable.
Large volumes of receipts/events need an external retention/rotation strategy.
/v1/receipts/{id} and /v1/events are read APIs; there is no policy mutate endpoint.

Container quick start

docker build -t datafog-api:latest .

docker run --rm -p 8080:8080 \
  -e DATAFOG_API_TOKEN=changeme \
  -e DATAFOG_RATE_LIMIT_RPS=50 \
  -e DATAFOG_RECEIPT_PATH=/var/lib/datafog/datafog_receipts.jsonl \
  -v "$(pwd)/config:/app/config:ro" \
  -v datafog-receipts:/var/lib/datafog \
  datafog-api:latest

Verify setup end-to-end

# health check
curl -i http://localhost:8080/health

# decision + receipt loop
RECEIPT_ID=$(curl -s -X POST http://localhost:8080/v1/decide \
  -H "Content-Type: application/json" \
  -d '{"action":{"type":"shell.exec","command":"git"},"text":"no pii here"}' \
| jq -r '.receipt_id')

curl -s http://localhost:8080/v1/receipts/$RECEIPT_ID | jq .

Expected outcome: the first request returns a decision and a receipt id; the second request returns the saved receipt.
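The same decide-then-fetch loop can be driven from Go. The sketch below is self-contained: it runs against a local stub (`stubServer`) that imitates just enough of the API to execute offline. The helper names, the stub's response bodies, and the receipt id `r-1` are assumptions; only the paths and the `receipt_id` field come from the commands above. It also assumes no DATAFOG_API_TOKEN is configured.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// decideAndFetch posts body to /v1/decide, reads receipt_id from the reply,
// then fetches /v1/receipts/{id} and returns the decoded receipt.
func decideAndFetch(client *http.Client, baseURL string, body any) (map[string]any, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	resp, err := client.Post(baseURL+"/v1/decide", "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var decided struct {
		ReceiptID string `json:"receipt_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&decided); err != nil {
		return nil, fmt.Errorf("decode decide response: %w", err)
	}
	if decided.ReceiptID == "" {
		return nil, fmt.Errorf("decide returned status %d without a receipt_id", resp.StatusCode)
	}

	receiptResp, err := client.Get(baseURL + "/v1/receipts/" + decided.ReceiptID)
	if err != nil {
		return nil, err
	}
	defer receiptResp.Body.Close()
	if receiptResp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("receipt lookup returned status %d", receiptResp.StatusCode)
	}
	var receipt map[string]any
	if err := json.NewDecoder(receiptResp.Body).Decode(&receipt); err != nil {
		return nil, fmt.Errorf("decode receipt: %w", err)
	}
	return receipt, nil
}

// stubServer is a stand-in for datafog-api with made-up response bodies.
func stubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/decide", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"decision":"allow","receipt_id":"r-1"}`)
	})
	mux.HandleFunc("/v1/receipts/", func(w http.ResponseWriter, r *http.Request) {
		id := strings.TrimPrefix(r.URL.Path, "/v1/receipts/")
		json.NewEncoder(w).Encode(map[string]string{"receipt_id": id, "decision": "allow"})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := stubServer()
	defer srv.Close()

	receipt, err := decideAndFetch(srv.Client(), srv.URL, map[string]any{
		"action": map[string]string{"type": "shell.exec", "command": "git"},
		"text":   "no pii here",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(receipt["receipt_id"], receipt["decision"])
	// prints: r-1 allow
}
```

To point it at a running instance, pass `http.DefaultClient` and `http://localhost:8080` instead of the stub's client and URL.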

Kubernetes deployment example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datafog-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: datafog-api
  template:
    metadata:
      labels:
        app: datafog-api
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        runAsGroup: 65532
        fsGroup: 65532
      containers:
      - name: datafog-api
        image: ghcr.io/datafog/datafog-api:v2
        ports:
        - containerPort: 8080
        env:
        - name: DATAFOG_ADDR
          value: ":8080"
        - name: DATAFOG_POLICY_PATH
          value: "/app/config/policy.json"
        - name: DATAFOG_RECEIPT_PATH
          value: "/var/lib/datafog/datafog_receipts.jsonl"
        - name: DATAFOG_EVENTS_PATH
          value: "/var/lib/datafog/datafog_events.ndjson"
        - name: DATAFOG_RATE_LIMIT_RPS
          value: "100"
        - name: DATAFOG_SHUTDOWN_TIMEOUT
          value: "10s"
        volumeMounts:
        - name: policy
          mountPath: /app/config
          readOnly: true
        - name: receipts
          mountPath: /var/lib/datafog
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
      volumes:
      - name: policy
        configMap:
          name: datafog-policy
      - name: receipts
        persistentVolumeClaim:
          claimName: datafog-receipts

Documentation map

API contract: docs/contracts/datafog-api-contract.md
Architecture/module map: docs/ARCHITECTURE.md
Security and operations:
- docs/SECURITY.md
- docs/RELIABILITY.md
- docs/OBSERVABILITY.md
- docs/DOMAIN_DOCS.md
Design/product context:
- docs/DESIGN.md
- docs/PRODUCT_SENSE.md

If something fails, check these first

go test ./... (build/runtime validation before changing policy)
go test -race ./... (check race conditions on concurrency-sensitive paths)
/health response for policy id/version mismatch
Environment variables are set and files are writable
API token/header if DATAFOG_API_TOKEN is configured
Policy JSON is valid and rules match expected action fields
Optional benchmark sweep: scripts/run-benchmarks.sh (writes /tmp/bench/benchmark-current.txt; if scripts/benchmark-baseline.txt exists, also writes /tmp/bench/benchmark-trend.txt with benchstat deltas)
