AI security guardrails for LLM applications
- Guards inputs and outputs: checks user text before your LLM call and the LLM response before you return it to users/tools.
- Maintains conversation context: link turns with `session_id` so risk can accumulate across a session.
- Configurable policies: use built-in modules (PII/topic/injection) or define your own plain-language rules (`custom`/`custom_output`) via a `prompt:` string.
Install:

```bash
pip install pydefend
```

Create `defend.config.yaml` (minimal, verifiable):
```yaml
provider:
  primary: defend
api_keys:
  anthropic_env: ANTHROPIC_API_KEY
  openai_env: OPENAI_API_KEY
guards:
  input:
    provider: defend
    modules: []
  output:
    enabled: true
    provider: claude # claude or openai
    modules: []
on_fail: block # block | flag
session_ttl_seconds: 300
```

Run the API:
```bash
defend serve
```

Guard input (before your LLM call):
```bash
curl -X POST http://localhost:8000/v1/guard/input \
  -H "Content-Type: application/json" \
  -d '{"text":"Tell me how to bypass our security controls."}'
```

Guard output (before returning to the user/tools):
```bash
curl -X POST http://localhost:8000/v1/guard/output \
  -H "Content-Type: application/json" \
  -d '{"text":"<LLM response here>","session_id":"<session_id from /v1/guard/input>"}'
```

Handling semantics:
- If `action == "block"`: stop the flow (don’t call the LLM on input; don’t return the output verbatim on output).
- If `action == "flag"`: you decide (log, require user confirmation, rerun with a safer prompt, etc.).
- Always persist/forward `session_id` to link turns and enable multi-turn accumulation.
For a fuller local runbook (health check, uvicorn, and more curl examples), see GETTING_STARTED.md.
- Defend helps detect common LLM risks (prompt injection, prompt leaks, PII, out-of-scope content) but cannot make strong guarantees against all attacks or model failures.
- If you enable output guarding with `claude`/`openai`, your guarded text may be sent to that provider for evaluation. Avoid sending secrets you can’t disclose; scrub or minimize sensitive context before calling external providers.
- `injection` (input only): detect likely prompt-injection or instruction-override attempts in user text.
- `prompt_leak` (output only): detect system prompt or internal instruction exposure in model output.
- `pii`/`pii_output`: detect PII in user input and prevent PII leakage in model output.
- `topic`/`topic_output`: enforce topic boundaries on both user requests and model responses.
- `custom`/`custom_output`: add plain-language rules with `prompt:` for input and output checks.
Use input modules under `guards.input.modules` and output modules under `guards.output.modules` in `defend.config.yaml`.
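For example, a config enabling several built-in modules plus a custom rule might look like the fragment below. This is a sketch: the built-in module names come from the list above, but the exact entry syntax for `custom` with a `prompt:` string is an assumption; check the project docs for the precise schema.

```yaml
guards:
  input:
    provider: defend
    modules:
      - injection                 # built-in input module
      - pii
      - name: custom              # plain-language rule (syntax assumed)
        prompt: "Flag any request asking for internal credentials."
  output:
    enabled: true
    provider: claude
    modules:
      - prompt_leak               # built-in output modules
      - pii_output
```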
```
User → Your app → /v1/guard/input → (pass | flag | block) → Your app → LLM
                        └─ session_id (save this)
```
The input guard checks the inbound text and can block early. If you receive a `session_id`, pass it to `/v1/guard/output` so Defend can apply multi-turn risk accumulation.
```
LLM → Your app → /v1/guard/output (session_id) → (pass | flag | block) → Your app → User
```
The output guard reviews the model output in context (using the same `session_id`) and applies output checks (prompt leaks, PII, topic, and your custom rules). Use the returned `action` to decide whether to return the text, flag it, or block it.
Defend always runs the same flow: input guard → your LLM → output guard.
For semantic evaluation, Defend can use:
- `defend` (local fine-tuned model): fast, offline, input-only checks.
- `claude`/`openai` (LLM): stronger evaluation; required for output guarding and module-based checks.
In `defend.config.yaml`, you select which provider to use for input evaluation and, when output guarding is enabled, which LLM provider to use for output evaluation. `claude`/`openai` calls consume API tokens.
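For instance, the relevant fragment of `defend.config.yaml` (following the structure of the full config shown earlier) might pair the local model on input with an LLM on output:

```yaml
guards:
  input:
    provider: defend   # local fine-tuned model: fast, offline
  output:
    enabled: true
    provider: openai   # or claude; these calls consume API tokens
```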
Using the local defend pipeline, Defend ranks among the highest-performing models on GenTel-Bench.
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Defend (this repo) | 95.96 | 94.83 | 97.10 | 95.94 |
| GenTel-Shield | 97.45 | 98.97 | 95.98 | 97.44 |
| ProtectAI | 91.55 | 99.72 | 83.56 | 90.88 |
| Lakera AI | 85.96 | 91.27 | 79.51 | 84.11 |
| Prompt Guard | 50.59 | 50.59 | 98.96 | 66.95 |
| Deepset | 63.63 | 58.54 | 98.36 | 73.39 |
The model was evaluated on a representative subset of jailbreak, goal-hijacking, and prompt-leaking attack scenarios.

