| title | ArmorFlo |
|---|---|
| emoji | 🛡️ |
| colorFrom | red |
| colorTo | red |
| sdk | docker |
| pinned | false |
| tags | |

ArmorFlo is a production-grade OpenEnv-compatible reinforcement learning environment that simulates real-world Application Security (AppSec) vulnerability triage workflows.
An AI agent receives realistic CVE reports alongside an organisational asset inventory and must do exactly what a human security analyst does: assess exploitability in context, classify severity, determine which assets are actually affected, build a prioritised remediation plan, escalate to the right team, and produce a written resolution summary.
Why this domain? CVE triage is one of the highest-volume, highest-stakes tasks in enterprise security. The average organisation receives hundreds of CVEs per week; deciding which ones matter for your specific environment is genuinely hard and time-consuming. ArmorFlo is the first OpenEnv environment to model this workflow, making it immediately valuable for training and evaluating security-reasoning agents.
Each episode presents a realistic vulnerability triage scenario:
- CVE reports: real CVE IDs with accurate CVSS scores, vectors, affected products, patch status, and exploit availability
- Asset inventory: services with product names, versions, environments, internet-exposure flags, and business criticality
- Assess actions: the agent can query for additional context (runbook-style guidance, version-specific applicability info)
- Mixed applicability: some CVEs in the report do not affect any asset in the inventory (false positives the agent must suppress)
- Prioritised remediation: internet-facing critical assets must be addressed before internal low-criticality ones
- Escalation: different tasks require escalating to different teams (security team vs. management)

| Action | Key Fields | Description |
|---|---|---|
| `assess` | `query: str` | Query for CVE context, asset details, or remediation guidance |
| `classify` | `severity_tier`, `affected_components`, `cvss_score_estimate` | Declare CVSS severity (CRITICAL/HIGH/MEDIUM/LOW) and affected assets |
| `check_applicability` | `cve_id`, `asset_id`, `applicable`, `inapplicability_reason` | Mark a CVE as applicable or not for a specific asset |
| `recommend` | `remediation_plan: List[RemediationStep]` | Submit a prioritised remediation plan (priority, action, target_asset_ids, rationale) |
| `escalate` | `team`, `justification` | Escalate to security / platform / network / development / management |
| `defer` | `defer_reason`, `defer_until` | Defer a CVE with a documented reason and revisit date |
| `close` | `resolution_summary: str` | Close the report with a comprehensive post-triage summary |

Full JSON schema available at GET /schema and GET /tasks.
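
As a rough illustration of the action space above, the sketch below builds a few action payloads as plain dictionaries. The field names come from the table; the `action_type` discriminator key, the helper name `make_action`, and the example values are assumptions made for illustration. The authoritative shape is whatever `GET /schema` returns.

```python
import json

# Key fields per action, copied from the action table.
ACTION_FIELDS = {
    "assess": {"query"},
    "classify": {"severity_tier", "affected_components", "cvss_score_estimate"},
    "check_applicability": {"cve_id", "asset_id", "applicable", "inapplicability_reason"},
    "recommend": {"remediation_plan"},
    "escalate": {"team", "justification"},
    "defer": {"defer_reason", "defer_until"},
    "close": {"resolution_summary"},
}


def make_action(action_type: str, **fields) -> dict:
    """Build an action dict, rejecting fields the table does not list.

    The "action_type" key is a hypothetical discriminator, not confirmed
    by the source; check GET /schema for the real layout.
    """
    if action_type not in ACTION_FIELDS:
        raise ValueError(f"unknown action: {action_type}")
    unexpected = set(fields) - ACTION_FIELDS[action_type]
    if unexpected:
        raise ValueError(f"unexpected fields for {action_type}: {sorted(unexpected)}")
    return {"action_type": action_type, **fields}


classify = make_action(
    "classify",
    severity_tier="CRITICAL",
    affected_components=["AST-001", "AST-002"],
    cvss_score_estimate=10.0,
)
not_applicable = make_action(
    "check_applicability",
    cve_id="CVE-2021-44228",
    asset_id="AST-003",
    applicable=False,
    inapplicability_reason="Already on patched version 2.17.1",
)
print(json.dumps(classify, indent=2))
```

Validating field names on the client side catches typos before they cost an episode step, which matters because efficiency is part of the score.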
| Field | Type | Description |
|---|---|---|
| `reports` | `List[VulnerabilityReport]` | CVE reports in scope (cve_id, title, cvss_score, cvss_vector, affected_products, patch_available, exploit_public) |
| `assets` | `List[AssetRecord]` | Asset inventory (asset_id, name, product, version, environment, internet_facing, business_criticality) |
| `applicability_decisions` | `Dict` | Decisions recorded so far: `{cve_id: {asset_id: {applicable, reason}}}` |
| `assess_result` | `str \| None` | Result of the most recent assess action |
| `step_count` | `int` | Current step within the episode |
| `max_steps` | `int` | Maximum allowed steps |
| `reward` | `float` | Current episode score (0.0 to 1.0), updated every step |
| `done` | `bool` | Whether the episode has terminated |

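
To show how an agent might use these observation fields, here is a minimal sketch that lists the CVE×asset pairs still lacking an applicability decision and the remaining step budget. The helper name `pending_pairs` and the trimmed-down sample observation are hypothetical; only the field names are taken from the table.

```python
from itertools import product


def pending_pairs(observation: dict) -> list:
    """Return (cve_id, asset_id) pairs with no applicability decision yet."""
    decisions = observation.get("applicability_decisions", {})
    cve_ids = [report["cve_id"] for report in observation["reports"]]
    asset_ids = [asset["asset_id"] for asset in observation["assets"]]
    return [
        (cve, asset)
        for cve, asset in product(cve_ids, asset_ids)
        if asset not in decisions.get(cve, {})
    ]


# Hypothetical observation, reduced to the fields this helper reads.
obs = {
    "reports": [{"cve_id": "CVE-2021-44228"}],
    "assets": [{"asset_id": "AST-001"}, {"asset_id": "AST-002"}, {"asset_id": "AST-003"}],
    "applicability_decisions": {
        "CVE-2021-44228": {"AST-001": {"applicable": True, "reason": "vulnerable version"}}
    },
    "step_count": 2,
    "max_steps": 10,
}

todo = pending_pairs(obs)
steps_left = obs["max_steps"] - obs["step_count"]
print(todo, steps_left)
```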
Incident: CVE-2021-44228 (Log4Shell), CVSS 10.0. Three assets run Apache Log4j; one is already patched (version 2.17.1).
Agent must: Classify as CRITICAL, identify that AST-001 and AST-002 are affected, correctly mark AST-003 as not-applicable (patched version), and close with a summary covering upgrade actions.
Grader weights: severity 40% · CVSS accuracy 20% · applicability F1 20% · summary quality 10% · efficiency 10%
Par steps: 5 | Max steps: 10
Incident: Three CVEs across four assets.
- CVE-2023-44487 (HTTP/2 Rapid Reset): affects only nginx AST-010
- CVE-2023-4911 (glibc Looney Tunables): affects AST-011 (glibc 2.35) but NOT AST-012 (glibc 2.31, out of range)
- CVE-2023-20198 (Cisco IOS XE): zero applicable assets (inventory has Cisco IOS, not IOS XE)

Agent must: Correctly determine per-asset applicability for all CVE×asset pairs, build a 2-step remediation plan (nginx upgrade → glibc patch), escalate to the security team, and suppress the Cisco CVE as non-applicable.
Grader weights: applicability F1 30% · remediation 25% · severity 15% · escalation 15% · summary 10% · efficiency 5%
Par steps: 10 | Max steps: 20
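
The glibc case in this task (2.35 affected, 2.31 out of range) comes down to comparing an asset's version against a CVE's affected range. A minimal sketch of that check follows. The helper names and the range bounds used here are illustrative assumptions, not the scenario's ground truth, and real version strings with suffixes or epochs would need a proper parser.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as "2.35" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def in_affected_range(version: str, first_affected: str, first_fixed=None) -> bool:
    """True if first_affected <= version < first_fixed (no upper bound if None)."""
    parsed = parse_version(version)
    if parsed < parse_version(first_affected):
        return False
    if first_fixed is not None and parsed >= parse_version(first_fixed):
        return False
    return True


# Illustrative lower bound only: assume the flaw first appears in 2.34.
ASSUMED_FIRST_AFFECTED = "2.34"

ast_011 = in_affected_range("2.35", ASSUMED_FIRST_AFFECTED)  # newer than the bound
ast_012 = in_affected_range("2.31", ASSUMED_FIRST_AFFECTED)  # older than the bound
print(ast_011, ast_012)
```

Comparing integer tuples rather than raw strings avoids the classic mistake where `"2.9"` sorts after `"2.35"` lexicographically.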
Incident: Eight CVEs across eight assets. Two CVEs (Ivanti Connect Secure, ConnectWise ScreenConnect) have zero applicable assets in inventory and must be correctly suppressed. The remaining six affect specific assets with nuanced version-matching requirements.
Agent must:
- Correctly assess applicability for all 64 CVE×asset combinations
- Build a 5-step prioritised remediation plan in the correct order: GoAnywhere (AST-025) → OpenSSH bastion (AST-022) → runc k8s-node (AST-020) → OpenSSH internal (AST-023) → Office + XZ audit (AST-026, AST-024)
- Escalate to management (business impact severity)
- Produce a resolution summary covering all CVEs, correct applicability decisions, and actions taken
Grader weights: applicability F1 25% · remediation ordering 25% · severity 15% · escalation 15% · summary 15% · efficiency 5%
Par steps: 18 | Max steps: 35
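
Because a quarter of this task's grade depends on remediation ordering, it helps to see how an ordering score based on the longest common subsequence (LCS) could behave. The sketch below assumes the "60% presence + 40% LCS ordering" split described in the reward section, with presence taken as the fraction of expected targets that appear in the plan and ordering as LCS length divided by the expected length. Those definitions, the function names, and the use of one asset ID per step are assumptions; `graders.py` holds the real logic.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def remediation_score(plan: list, expected: list) -> float:
    """Assumed blend: 60% presence of expected targets, 40% LCS ordering."""
    if not expected:
        return 1.0
    presence = len(set(plan) & set(expected)) / len(set(expected))
    ordering = lcs_length(plan, expected) / len(expected)
    return 0.6 * presence + 0.4 * ordering


# One representative asset ID per step, in the order listed for this task.
expected_order = ["AST-025", "AST-022", "AST-020", "AST-023", "AST-026"]
swapped_plan = ["AST-022", "AST-025", "AST-020", "AST-023", "AST-026"]

print(round(remediation_score(expected_order, expected_order), 2))  # 1.0
print(round(remediation_score(swapped_plan, expected_order), 2))    # 0.92
```

Under these assumptions, swapping the first two steps keeps full presence credit but drops the LCS from 5 to 4, so the plan loses only the ordering share of the score.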
ArmorFlo provides shaped rewards at every step, not a binary end-of-episode signal.
score = (
    severity_score × w_sev           # CVSS tier accuracy; 0.5 for adjacent tier
  + cvss_score_accuracy × w_cvss     # linear decay: full credit ≤0.5 diff, zero at ≥3.0
  + applicability_score × w_app      # micro-averaged F1 over all CVE×asset decisions
  + remediation_score × w_rem        # 60% presence + 40% LCS ordering score
  + escalation_score × w_esc         # correct team escalated (binary)
  + summary_quality × w_summ         # keyword coverage of resolution summary
  + efficiency_bonus                 # up to +0.05 for completing under par steps
  - loop_penalty                     # -0.03 per repeated assess query (cap 0.15)
  - false_positive_penalty           # -0.05 per applicable CVE incorrectly marked not-applicable
)
All component scores are in [0.0, 1.0]. The weighted sum is clamped to [0.0, 1.0].
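
The components above can be sketched in a few lines of Python. This is a reading of the comments in the formula, not the contents of `graders.py`: the linear interpolation between a 0.5 and 3.0 CVSS difference, the treatment of "no positives anywhere" as a perfect F1, and all function names are assumptions. The example weights are the first task's published weights with efficiency left out, since the formula treats efficiency as a bonus.

```python
TIERS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def severity_score(predicted: str, expected: str) -> float:
    """1.0 for the exact tier, 0.5 for an adjacent tier, else 0.0."""
    distance = abs(TIERS.index(predicted) - TIERS.index(expected))
    return {0: 1.0, 1: 0.5}.get(distance, 0.0)


def cvss_accuracy(estimate: float, actual: float) -> float:
    """Full credit within 0.5, zero at 3.0 or more, linear in between (assumed)."""
    diff = abs(estimate - actual)
    if diff <= 0.5:
        return 1.0
    if diff >= 3.0:
        return 0.0
    return (3.0 - diff) / 2.5


def micro_f1(predicted: dict, truth: dict) -> float:
    """Micro-averaged F1 over {(cve_id, asset_id): applicable} decisions."""
    keys = set(predicted) | set(truth)
    tp = sum(1 for k in keys if predicted.get(k, False) and truth.get(k, False))
    fp = sum(1 for k in keys if predicted.get(k, False) and not truth.get(k, False))
    fn = sum(1 for k in keys if not predicted.get(k, False) and truth.get(k, False))
    if tp + fp + fn == 0:
        return 1.0  # nothing applicable and nothing claimed (assumed convention)
    return 2 * tp / (2 * tp + fp + fn)


def total_score(components: dict, weights: dict, bonus: float = 0.0,
                penalties: float = 0.0) -> float:
    """Weighted sum plus bonus minus penalties, clamped to [0.0, 1.0]."""
    weighted = sum(components[name] * weights[name] for name in weights)
    return max(0.0, min(1.0, weighted + bonus - penalties))


cve = "CVE-2021-44228"
truth = {(cve, "AST-001"): True, (cve, "AST-002"): True, (cve, "AST-003"): False}
guess = {(cve, "AST-001"): True, (cve, "AST-002"): False, (cve, "AST-003"): False}

components = {
    "severity": severity_score("CRITICAL", "CRITICAL"),
    "cvss": cvss_accuracy(9.0, 10.0),
    "applicability": micro_f1(guess, truth),
    "summary": 0.5,  # placeholder value for keyword coverage
}
weights = {"severity": 0.40, "cvss": 0.20, "applicability": 0.20, "summary": 0.10}

# One applicable asset (AST-002) was wrongly marked not-applicable: 0.05 penalty.
score = total_score(components, weights, penalties=0.05)
print(round(score, 3))
```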
Partial progress signals: An agent that correctly classifies severity but misses applicability decisions scores ~0.40. An agent that additionally gets applicability right scores ~0.65. Full credit requires correct remediation ordering.
Measured with gpt-4o-mini at temperature 0 via the OpenAI API:
| Task | Score |
|---|---|
| task_classify_severity | 0.71 |
| task_mixed_applicability | 0.48 |
| task_full_triage | 0.29 |
| Average | 0.49 |
git clone https://huggingface.co/spaces/YOUR_USERNAME/armorflo
cd armorflo
pip install uv
uv sync

uv run server
# or
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

The server will be available at http://localhost:8000, with Swagger UI at http://localhost:8000/docs.

# Health check
curl http://localhost:8000/health
# List tasks + action schema
curl http://localhost:8000/tasks | python -m json.tool
# Start an episode
curl -s -X POST http://localhost:8000/reset \
-H 'Content-Type: application/json' \
-d '{"task_id": "task_classify_severity"}' | python -m json.tool
# Score an episode
curl -s -X POST http://localhost:8000/grader \
-H 'Content-Type: application/json' \
-d '{
"task_id": "task_classify_severity",
"classify_action": {"severity_tier": "CRITICAL", "cvss_score_estimate": 10.0},
"close_action": {"resolution_summary": "CVE-2021-44228 Log4Shell CRITICAL. Upgrade AST-001 and AST-002. AST-003 not affected."},
"step_count": 3,
"max_steps": 10,
"assets": [{"asset_id": "AST-001"}, {"asset_id": "AST-002"}, {"asset_id": "AST-003"}]
}' | python -m json.tool

export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=sk-... # or OPENAI_API_KEY=sk-...
# All 3 tasks
python inference.py
# Single task
python inference.py --task task_classify_severity
# Quiet mode (scores only)
python inference.py --quiet

pytest tests/ -v

docker build -f server/Dockerfile -t armorflo .
docker run -p 8000:8000 \
-e API_BASE_URL=https://api.openai.com/v1 \
-e MODEL_NAME=gpt-4o-mini \
-e HF_TOKEN=sk-... \
  armorflo

armorflo/
├── models.py                    # ArmorFloAction, ArmorFloObservation (OpenEnv types)
├── scenarios.py                 # 3 CVE triage scenarios with ground truth
├── graders.py                   # Deterministic per-task graders
├── inference.py                 # Inference script (API_BASE_URL/MODEL_NAME/HF_TOKEN)
├── server/
│   ├── app.py                   # FastAPI: create_app + /tasks /grader /baseline
│   ├── armorflo_environment.py  # ArmorFloEnvironment (OpenEnv Environment subclass)
│   └── Dockerfile               # Multi-stage Docker build
├── tests/
│   └── test_env.py              # 30-test suite
├── openenv.yaml                 # OpenEnv spec metadata
├── pyproject.toml               # Dependencies + server entry point
└── uv.lock                      # Pinned dependency lockfile

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start new episode, returns initial observation |
| `/step` | POST | Execute action, returns observation with reward + done |
| `/state` | GET | Current environment state |
| `/schema` | GET | JSON schema for Action, Observation, State |
| `/metadata` | GET | Environment metadata |
| `/tasks` | GET | Task list with difficulty, description, and action schema |
| `/grader` | POST | Score a completed episode (no live session required) |
| `/baseline` | POST | Trigger inference script, returns scores for all tasks |
| `/docs` | GET | Swagger UI |
| `/ws` | WS | WebSocket for persistent sessions |
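
For callers who prefer Python to curl, the sketch below prepares requests for the HTTP endpoints using only the standard library. The `/reset` payload mirrors the curl example in this README; the helper names `build_request` and `send` are hypothetical, and nothing is sent until `send` is called against a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local server address used in this README


def build_request(path: str, payload=None) -> urllib.request.Request:
    """Prepare a GET when payload is None, otherwise a POST with a JSON body."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


health_req = build_request("/health")
reset_req = build_request("/reset", {"task_id": "task_classify_severity"})
print(reset_req.get_method(), reset_req.full_url)

# With the server running:
#   observation = send(reset_req)
```

Separating request construction from sending keeps the payloads easy to inspect and test without a live server.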