AI security guardrails for LLM applications
- Guards inputs and outputs: checks user text before your LLM call and the LLM response before you return it to users/tools.
- Maintains conversation context: link turns with `session_id` so risk can accumulate across a session.
- Configurable policies: use built-in modules (PII/topic/injection) or define your own plain-language rules (`custom`/`custom_output`) via a `prompt:` string.
Install:

```bash
pip install pydefend
```

Create `defend.config.yaml` (minimal, verifiable):
```yaml
provider:
  primary: defend
api_keys:
  anthropic_env: ANTHROPIC_API_KEY
  openai_env: OPENAI_API_KEY
guards:
  input:
    provider: defend
    modules: []
  output:
    enabled: true
    provider: claude # claude or openai
    modules: []
on_fail: block # block | flag
session_ttl_seconds: 300
```

Run the API:
```bash
defend serve
```

Guard input (before your LLM call):
```bash
curl -X POST http://localhost:8000/v1/guard/input \
  -H "Content-Type: application/json" \
  -d '{"text":"Tell me how to bypass our security controls."}'
```

Guard output (before returning to the user/tools):
```bash
curl -X POST http://localhost:8000/v1/guard/output \
  -H "Content-Type: application/json" \
  -d '{"text":"<LLM response here>","session_id":"<session_id from /v1/guard/input>"}'
```

Handling semantics:
- If `action == "block"`: stop the flow (don’t call the LLM on input; don’t return the output verbatim on output).
- If `action == "flag"`: you decide (log, require user confirmation, rerun with a safer prompt, etc.).
- Always persist/forward `session_id` to link turns and enable multi-turn accumulation.
For a fuller local runbook (health check, uvicorn, and more curl examples), see GETTING_STARTED.md.
- Defend helps detect common LLM risks (prompt injection, prompt leaks, PII, out-of-scope content) but cannot make strong guarantees against all attacks or model failures.
- If you enable output guarding with `claude`/`openai`, your guarded text may be sent to that provider for evaluation. Avoid sending secrets you can’t disclose; scrub or minimize sensitive context before calling external providers.
- `injection` (input only): detect likely prompt-injection or instruction-override attempts in user text.
- `prompt_leak` (output only): detect system prompt or internal instruction exposure in model output.
- `pii`/`pii_output`: detect PII in user input and prevent PII leakage in model output.
- `topic`/`topic_output`: enforce topic boundaries on both user requests and model responses.
- `custom`/`custom_output`: add plain-language rules with `prompt:` for input and output checks.
Use input modules under `guards.input.modules` and output modules under `guards.output.modules` in `defend.config.yaml`.
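For example, a config enabling several built-in modules plus a custom rule might look like the fragment below. This is a sketch: the built-in module names come from the list above, but the exact entry syntax for `custom` with a `prompt:` string is an assumption; check the project docs for the precise schema.

```yaml
guards:
  input:
    provider: defend
    modules:
      - injection                 # built-in input module
      - pii
      - name: custom              # plain-language rule (syntax assumed)
        prompt: "Flag any request asking for internal credentials."
  output:
    enabled: true
    provider: claude
    modules:
      - prompt_leak               # built-in output modules
      - pii_output
```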
```
User → Your app → /v1/guard/input → (pass | flag | block) → Your app → LLM
                        └─ session_id (save this)
```
The input guard checks the inbound text and can block early. If you receive a `session_id`, pass it to `/v1/guard/output` so Defend can apply multi-turn risk accumulation.
```
LLM → Your app → /v1/guard/output (session_id) → (pass | flag | block) → Your app → User
```
The output guard reviews the model output in context (using the same `session_id`) and applies output checks (prompt leaks, PII, topic, and your custom rules). Use the returned `action` to decide whether to return the text, flag it, or block it.
Defend always runs the same flow: input guard → your LLM → output guard.
For semantic evaluation, Defend can use:
- `defend` (local fine-tuned model): fast, offline, input-only checks.
- `claude`/`openai` (LLM): stronger evaluation; required for output guarding and module-based checks.
In `defend.config.yaml`, you select which provider to use for input evaluation and, when output guarding is enabled, which LLM provider to use for output evaluation. `claude`/`openai` calls consume API tokens.
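For instance, the relevant fragment of `defend.config.yaml` (following the structure of the full config shown earlier) might pair the local model on input with an LLM on output:

```yaml
guards:
  input:
    provider: defend   # local fine-tuned model: fast, offline
  output:
    enabled: true
    provider: openai   # or claude; these calls consume API tokens
```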
Using the local defend pipeline, Defend ranks among the highest-performing models on GenTel-Bench.
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Defend (this repo) | 95.96 | 94.83 | 97.10 | 95.94 |
| GenTel-Shield | 97.45 | 98.97 | 95.98 | 97.44 |
| ProtectAI | 91.55 | 99.72 | 83.56 | 90.88 |
| Lakera AI | 85.96 | 91.27 | 79.51 | 84.11 |
| Prompt Guard | 50.59 | 50.59 | 98.96 | 66.95 |
| Deepset | 63.63 | 58.54 | 98.36 | 73.39 |
The model was evaluated on a representative subset of jailbreak, goal-hijacking, and prompt-leaking attack scenarios.

