AI-native detection engineering toolkit. One install, one config, one CI step.
Detect-Forge is a composable CLI for detection engineers. Each capability is a subcommand; they share configuration, output formatting, caching, and a single CI gate. No platform, no sign-up.
The first shipping capability is stale — it scores your Sigma (YAML) and Elastic Detection Rules (TOML — covering EQL, KQL, and ESQL) for ATT&CK technique staleness along three dimensions:
- Timestamp drift: compares ATT&CK STIX modified timestamps to rule modification dates (deterministic).
- Semantic alignment ✅: embeddings-based cosine similarity between rule text (title + description) and the current ATT&CK technique description. Flags rules whose alignment falls below a configurable threshold (--semantic-threshold, default 0.65). True historical drift (comparing against past MITRE definitions) is Phase 3.b.
- LLM diff proposals ✅: opt-in, BYOLLM via OpenAI structured output; proposes rewritten rules for semantic_drift findings. Never auto-applied; every proposal is reviewed manually. Anthropic Claude support deferred to v0.2.
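As a rough illustration of the timestamp-drift dimension, the sketch below computes how many days an ATT&CK technique was modified after a rule was last touched. The function name `days_stale` and the accepted rule date formats are assumptions made for this example, not the tool's actual implementation.

```python
from datetime import datetime, timezone


def days_stale(technique_modified: str, rule_modified: str) -> int:
    """Days by which the technique changed after the rule (0 if the rule is newer)."""
    # STIX timestamps look like "2024-03-11T15:30:00.000Z"; swap the trailing Z
    # for an explicit UTC offset so datetime.fromisoformat accepts it.
    technique_dt = datetime.fromisoformat(technique_modified.replace("Z", "+00:00"))
    # Assumption: rule dates are plain "YYYY/MM/DD" or "YYYY-MM-DD" strings.
    rule_dt = datetime.strptime(
        rule_modified.replace("/", "-"), "%Y-%m-%d"
    ).replace(tzinfo=timezone.utc)
    # A rule edited after the technique's last change is not stale at all.
    return max((technique_dt - rule_dt).days, 0)
```

Because the comparison only involves two dates, the result is deterministic for a given rule and ATT&CK bundle.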
Designed to run in GitHub Actions as a CI gate. No data leaves your environment unless you opt in to LLM diff proposals, which send rule text to the OpenAI API.
🚀 May 23, 2026 launch — stale ships with all three scoring dimensions: timestamp drift, semantic drift (Phase 3.a), and LLM diff proposals (Phase 4). True historical drift (Phase 3.b) deferred to v0.2. Other subcommands (backtest, coverage, cti ingest, audit) are registered as stubs and will ship in subsequent releases.
- Python 3.12 or newer
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

detect-forge --help
detect-forge --version
detect-forge stale path/to/rules

| Command | Status | Description |
|---|---|---|
| stale | ✅ Available | Score detection rules for ATT&CK technique staleness. |
| backtest | 📅 Jun 28, 2026 | Adversarial replay (Types 3 + 4). |
| coverage | 📝 Q3 2026 | Coverage gap mapping (Type 6a expansion). |
| cti ingest | 📝 Q3–Q4 2026 | CTI-to-detection generation. |
| audit | 📝 Reserved | Runs every check once 2+ subcommands ship. |

| Option | Default | Description |
|---|---|---|
| RULE_DIR (positional) | - | Directory of detection rules to scan. Recursively picks up .yml/.yaml (Sigma) and .toml (Elastic Detection Rules: EQL/KQL/ESQL). Must exist. |
| --format {terminal,json,html} | terminal | Output format. |
| -o, --output PATH | stdout | Write output to a file instead of stdout. |
| --min-severity {low,medium,high,critical} | low | Only show rules at or above this severity. |
| --no-cache | off | Bypass the disk cache and fetch a fresh ATT&CK bundle. |
| --domain {enterprise-attack,ics-attack,mobile-attack} | enterprise-attack | ATT&CK domain to fetch. |
| --semantic-threshold FLOAT | 0.65 | Cosine similarity threshold; pairs below this value emit a semantic_drift finding. |

Supported rule formats are auto-detected by extension. .yml/.yaml files are parsed as Sigma rules; .toml files are parsed as Elastic Detection Rules. The Elastic schema covers EQL, KQL (kuery), and ESQL — they share the same TOML structure and only differ in the language field.
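The extension-based detection described above can be sketched as a simple lookup. The names `SUFFIX_TO_FORMAT` and `detect_format` are hypothetical, and treating suffixes case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to the rule format named in this README.
SUFFIX_TO_FORMAT = {".yml": "sigma", ".yaml": "sigma", ".toml": "elastic"}


def detect_format(path: Path) -> Optional[str]:
    """Return "sigma" or "elastic" based on the suffix, or None if unsupported."""
    # Lower-casing the suffix is an assumption; the real tool may be stricter.
    return SUFFIX_TO_FORMAT.get(path.suffix.lower())
```

Files with any other suffix return None here, which a scanner could use to skip them silently.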
Each rule is embedded as title + description (the natural-language portion — the detection-query body is NOT embedded, since query languages don't align well with general-purpose text embeddings). Each ATT&CK technique is embedded as name + description from the STIX bundle. For every technique a rule tags, we compute the cosine similarity between the two vectors; pairs whose score falls strictly below --semantic-threshold (default 0.65) emit a semantic_drift finding at medium severity, with the score visible in the Similarity column of the report.
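To make the comparison concrete, here is a minimal pure-Python sketch of cosine similarity and the strict below-threshold check. The function names are illustrative only, and the real pipeline operates on fastembed vectors rather than the short lists used here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_semantic_drift(score: float, threshold: float = 0.65) -> bool:
    """A pair drifts only when its score is strictly below the threshold."""
    return score < threshold
```

Note that a score exactly equal to the threshold does not emit a finding, matching the "strictly below" wording above.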
Embeddings are computed once with fastembed (model BAAI/bge-small-en-v1.5, ~30MB, auto-downloaded on first run) and cached under $CACHE_DIR/embeddings/. Subsequent runs read from cache. There is no --no-semantic flag: warm-cache cost is near-zero, and cold-cache work has to happen at least once anyway.
| Similarity | What it means |
|---|---|
| < 0.50 | Major concept divergence — rule and technique are describing different things |
| 0.50–0.70 | Significant drift — technique has evolved substantially |
| 0.70–0.85 | Moderate drift — wording changes, some behavioral shifts |
| > 0.85 | Minor or no drift |
The default trigger (semantic_threshold = 0.65) catches all major drift and most of the significant band. Scores from 0.65 up to 0.70 still fall in the significant band of the table but do not emit a finding at the default setting.
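The interpretation table can be expressed as a small lookup, which also shows how the bands relate to the 0.65 trigger. The function name `drift_band` and the handling of exact boundary values (0.50, 0.70, 0.85) are assumptions, since the table does not say which band a boundary score belongs to.

```python
def drift_band(similarity: float) -> str:
    """Map a cosine similarity score to the band names used in the table."""
    if similarity < 0.50:
        return "major"
    if similarity < 0.70:  # assumption: 0.50 itself counts as significant
        return "significant"
    if similarity <= 0.85:  # assumption: 0.70 and 0.85 count as moderate
        return "moderate"
    return "minor"


# A score of 0.67 sits in the significant band, yet it is not strictly below
# the default 0.65 threshold, so it would not emit a semantic_drift finding.
example_band = drift_band(0.67)
example_triggers = 0.67 < 0.65
```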
Progress spinners go to stderr; the report goes to stdout so JSON output can be piped safely:
detect-forge stale path/to/rules --format json | jq '.scores'
detect-forge stale path/to/rules --format json -o report.json

| Code | Meaning |
|---|---|
| 0 | Scan completed; no gating findings (CI passes). |
| 1 | Tool error, stub command, or unimplemented capability. |
| 2 | CI-gating condition met (e.g. stale found a critical finding). |

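For pipelines driven from Python rather than shell, the exit codes in the table can be mirrored in a small wrapper. This is a sketch: the enum member names follow the comment on exit_codes.py in the package layout, while `ci_verdict` and `run_stale` are hypothetical helpers, and treating any unexpected code as an error is an assumption.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    CLEAN = 0     # scan completed, no gating findings
    RESERVED = 1  # tool error, stub command, or unimplemented capability
    GATED = 2     # CI-gating condition met


def ci_verdict(returncode: int) -> str:
    """Map a detect-forge return code to a hypothetical CI verdict label."""
    if returncode == ExitCode.CLEAN:
        return "pass"
    if returncode == ExitCode.GATED:
        return "gated"
    return "error"  # code 1, plus anything unexpected (an assumption)


def run_stale(rule_dir: str) -> str:
    """Run the documented CLI and return the verdict (needs detect-forge on PATH)."""
    completed = subprocess.run(["detect-forge", "stale", rule_dir], check=False)
    return ci_verdict(completed.returncode)
```

Keeping "gated" and "error" separate lets a pipeline decide whether a tool failure should block a merge the same way a critical finding does.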
Use exit code 2 to fail your CI pipeline:
detect-forge stale path/to/rules
code=$?
if [ "$code" -eq 2 ]; then exit 2; fi

All settings can be overridden via DETECT_FORGE_-prefixed env vars (or a .env file in the working directory). Copy .env.sample at the repo root to .env to get started.

| Variable | Default | Purpose |
|---|---|---|
| DETECT_FORGE_CACHE_DIR | $XDG_CACHE_HOME/detect-forge (or ~/.cache/detect-forge) | Where the ATT&CK bundle is cached. |
| DETECT_FORGE_CACHE_TTL_HOURS | 24 | Cache lifetime in hours. |
| DETECT_FORGE_ATTACK_DOMAIN | enterprise-attack | Default --domain value. |
| DETECT_FORGE_NO_CACHE | false | If truthy, always bypass the cache. |
| DETECT_FORGE_SEMANTIC_THRESHOLD | unset | Overrides semantic_threshold from .detect-forge.toml and the CLI flag (highest precedence). |
| OPENAI_API_KEY | unset | Required to enable LLM diff proposals. When unset, scans complete normally and print a skip banner. |

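The cache directory default in the table follows the usual XDG fallback chain. The sketch below shows one way that resolution could work; `resolve_cache_dir` is a hypothetical name and is not claimed to match the signature of the `default_cache_dir()` factory in cache.py.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the cache directory: explicit override, then XDG, then ~/.cache."""
    env = os.environ if env is None else env
    explicit = env.get("DETECT_FORGE_CACHE_DIR")
    if explicit:
        return Path(explicit)
    xdg = env.get("XDG_CACHE_HOME")
    # Fall back to ~/.cache when XDG_CACHE_HOME is unset or empty.
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "detect-forge"
```

Passing the environment as an argument keeps the function easy to test without mutating os.environ.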
When a rule emits a semantic_drift finding, stale can optionally call OpenAI's structured-output API to propose a rewritten rule aligned with the current ATT&CK technique. Proposals are BYOLLM and never auto-applied — the practitioner reviews every suggestion and manually decides what to keep.
Set OPENAI_API_KEY in your environment. Without it, the scan completes normally and prints 💡 LLM diff proposals skipped at the end of the report.
export OPENAI_API_KEY=sk-...
detect-forge stale ./rules

LLM proposal settings live in .detect-forge.toml (discovered upward from your CWD, halting at the git root). There are no CLI flags for these. A starter .detect-forge.toml with the defaults ships at the repo root; edit it in place or copy it to your own project.

[stale]
semantic_threshold = 0.65 # Cosine similarity floor; pairs below trigger a proposal
llm_model = "gpt-4o-mini" # Any OpenAI chat-completion model that supports structured outputs
max_proposals = 5 # Hard ceiling on LLM calls per scan run (cost guard)

max_proposals is your primary cost lever: every proposal attempt (success, refusal, or validation rejection) counts against this quota.
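One way to picture the quota is as a counter that is charged before each LLM call, regardless of how the call turns out. The `ProposalBudget` class below is a hypothetical illustration of that accounting, not the tool's internal code.

```python
from dataclasses import dataclass


@dataclass
class ProposalBudget:
    """Counts proposal attempts against a hard ceiling."""

    max_proposals: int = 5
    attempts: int = 0

    def try_consume(self) -> bool:
        """Reserve one attempt; return False once the ceiling is reached."""
        if self.attempts >= self.max_proposals:
            return False
        # Charged up front, so refusals and validation rejections count too.
        self.attempts += 1
        return True
```

Charging before the call rather than after a success is what makes the setting a ceiling on spend instead of a target number of accepted proposals.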
At default settings (gpt-4o-mini, 5 proposals), a scan costs well under $0.01, at roughly $0.0005 per proposal.
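The upward discovery of .detect-forge.toml mentioned earlier in this section could look roughly like the sketch below. `find_config` is a hypothetical helper, and two details are assumptions: the git root is detected by a .git entry, and the git root directory itself is still checked for the config before the search stops.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".detect-forge.toml"


def find_config(start: Path) -> Optional[Path]:
    """Walk from start toward the filesystem root, stopping at the git root."""
    current = start.resolve()
    for directory in (current, *current.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
        # Assumption: a .git entry marks the git root, and the search halts
        # there after that directory has been checked.
        if (directory / ".git").exists():
            return None
    return None
```

Stopping at the git root keeps a config file from an unrelated parent directory from silently affecting a scan.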
For each candidate rule, you get a terminal panel with the rule filename, the model's confidence (0–1), the list of fields it changed, a brief explanation, and the rewritten rule body in syntax-highlighted YAML (Sigma) or TOML (Elastic). The HTML report adds a "LLM Proposals" section at the bottom with color-coded confidence badges.
- They never modify your rules on disk. Apply changes manually after review.
- They don't run if OPENAI_API_KEY is unset.
- They use only the rule's natural-language fields and your current ATT&CK technique description; no telemetry leaves your environment beyond the OpenAI API call.
- They're not a substitute for human review. The model's confidence field is self-reported and unreliable; treat every proposal as a draft.
Each subcommand exposes a programmatic API for power users:
from pathlib import Path

from detect_forge.stale import scan

report = scan(Path("./rules"), domain="enterprise-attack")
for score in report.scores:
    if score.worst_severity == "critical":
        print(f"{score.title}: {score.worst_days_stale} days stale")

pytest -q # run the test suite
ruff check src/ tests/ # lint
mypy src/ # type-check (strict)

The package layout:
src/detect_forge/
├── cli.py # click root group; registers all subcommands
├── settings.py # DETECT_FORGE_* pydantic-settings config
├── console.py # rich stdout + stderr consoles
├── cache.py # XDG-aware cache (default_cache_dir() factory)
├── common.py # @common_output_options decorator
├── exit_codes.py # CLEAN=0, RESERVED=1, GATED=2
├── _stubs.py # stub_command() helper
├── stale/ # the staleness pipeline (real subcommand)
├── backtest/ # stub
├── coverage/ # stub
├── cti/ # group + ingest stub
└── audit/ # stub
MIT