Skip to content

✨ feat: validate model availability before run execution#416

Open
Marco Russo (marcorusso97) wants to merge 5 commits into
mainfrom
377-validate-model-availability-before-run-execution
Open

✨ feat: validate model availability before run execution#416
Marco Russo (marcorusso97) wants to merge 5 commits into
mainfrom
377-validate-model-availability-before-run-execution

Conversation

@marcorusso97
Copy link
Copy Markdown
Contributor

Summary

This PR introduces a model availability preflight in the attack orchestrator, so runs are aborted before execution starts when required model endpoints are unreachable.

What changed

  • Added pre-run model availability validation before Attack/Run DB records are created.
  • Added per-attack role mapping to discover all required model roles (target plus attack-specific roles).
  • Added robust attack type normalization for preflight role resolution (including alias handling such as AutoDANTurbo -> autodan_turbo).
  • Added live preflight progress output for each role:
    • Checking () ... OK/KO
  • Added internal noise suppression during probes:
    • temporarily silences internal logs and stdout/stderr emitted by provider libraries during health probes.
  • Added aggregated, user-friendly error formatting for unreachable models:
    • Unreachable models:
      • role=... identifier=... endpoint=... error=...
  • Updated failure behavior:
    • on preflight failure, log a configuration error and gracefully stop the run early, instead of proceeding.
    • run startup is blocked and no Attack/Run records are created in this case.

How the healthcheck works

  1. Prepare attack parameters and resolve goals.
  2. Build a list of required targets:
    • always include target model from the existing router.
    • include attack-specific roles from the role-path map (for example attacker/scorer/summarizer/embedder, judge variants, decorator role, etc.).
    • include category_classifier unless explicit intent taxonomy labels are already provided.
  3. For each target, run a lightweight probe:
    • for existing target: use the already registered router.
    • for configured role models: create a temporary router from role config.
    • issue a minimal request:
      • one user message: healthcheck
      • max_tokens=1
      • temperature=0.0
  4. Probe result handling:
    • if router initialization fails or request raises, mark KO with the captured error.
    • if response has error_message, mark KO.
    • if response is non-dict, treat as inconclusive-pass (to avoid false negatives with custom adapters/tests).
  5. Print per-role progress and final status (OK/KO).
  6. If any target is unreachable:
    • build one aggregated multiline error report listing role, identifier, endpoint, and error.
    • abort before creating Attack/Run records.
  7. If all checks pass, continue with normal run creation and execution.

Tests

Extended orchestrator tests now cover:

  • unreachable model message content and formatting.
  • multi-model aggregation in a single preflight failure report.
  • no Attack/Run DB creation when preflight fails.
  • attack-type normalization regression coverage for AutoDAN aliases.

Why this is useful

  • Prevents expensive or noisy runs when dependencies are misconfigured.
  • Gives immediate, actionable feedback on exactly which model endpoint is failing.
  • Improves reliability and UX with clear preflight visibility and graceful early abort.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 83.66762% with 57 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
hackagent/attacks/orchestrator.py 82.02% 48 Missing ⚠️
hackagent/attacks/techniques/h4rm3l/attack.py 79.16% 5 Missing ⚠️
hackagent/attacks/techniques/baseline/attack.py 84.61% 2 Missing ⚠️
hackagent/attacks/techniques/tap/attack.py 89.47% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded _ATTACK_MODEL_ROLE_PATHS looks fragile for a few reasons:

  • every attack change now requires also updating a central dictionary in AttackOrchestrator
  • not every attack uses every configured model in every execution path, so static path-based preflight can over-check and fail runs that would actually work
  • different semantic roles can resolve to the exact same effective model configuration, but they are still treated as separate checks, which can lead to duplicated probing and noisier failures

Concrete examples:

  • h4rm3l: decorator_llm is currently always considered a required role for the attack type if it is present in config, but whether it is actually used depends on the selected transformation pipeline. With the current implementation, a configured-but-unused decorator_llm would still be availability-checked and could block the run.
  • baseline: judge / judges are statically included in _ATTACK_MODEL_ROLE_PATHS, but baseline can also use non-LLM evaluation paths, e.g. regexp-based jailbreak detection. In those cases, a configured judge may not be effectively required for the run, yet the current preflight would still check it.
  • autodan_turbo: an unreachable embedder falls back to local bag-of-words embedding and continues the run. This means a run can succeed even when the configured remote embedder is unavailable (maybe then, do not check for availability of the embedder)
  • tap: if the on_topic_judge is not specified, it defaults to judges[0]; hence, the same model will have different roles and would be checked several times.

I would prefer an attack-owned API, something like get_effective_model_roles, where each attack:

  • declares which model roles are actually needed for the current run,
  • resolve models configuration, taking into account also default values,
  • can skip optional/inactive roles for a given run,
  • can collapse roles that share the same effective configuration before probing.
    Then the orchestrator would only need to perform eventual deduplication (since target_model may not be visible to the attack class, or different attacks may use similar models) and availability checks.

@marcorusso97
Copy link
Copy Markdown
Contributor Author

What I Changed

  • Added an attack-owned role API: get_effective_model_roles.
  • Updated orchestrator preflight to:
    • Prefer attack-owned role resolution.
    • Fall back to static mapping only when needed.
    • Deduplicate by effective model key (identifier, endpoint, agent_type).
    • Aggregate role labels for clearer progress and error output.
    • Support optional-role policy (skip optional by default, probe only on explicit opt-in).
  • Implemented attack-specific effective-role logic:
    • Baseline: judge checks only when evaluator_type requires LLM judges.
    • TAP: on_topic_judge fallback handled and deduplicated with judge when shared.
    • AutoDAN-Turbo: embedder marked optional by default, with opt-in to require it.
    • h4rm3l: decorator_llm checked only when the effective program includes LLM-assisted decorators.
  • Fixed default category classifier preflight behavior:
    • If intents are not used and category_classifier is not explicitly provided, preflight now validates the default classifier configuration.

Outcome

  • Preflight now matches real runtime dependencies per attack.
  • Missing required checks and unnecessary checks were both reduced.
  • Default classifier validation is enforced in goals/dataset flows.
  • Probe output is cleaner and easier to interpret.
  • Changes are covered by updated unit tests and targeted preflight script runs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Validate model availability before run execution

3 participants