MML Attack #411

Open
Raffaele Paolino (RPaolino) wants to merge 4 commits into main from feat/mml-attack

Conversation

@RPaolino
Contributor

Summary

Implements the Multi-Modal Linkage jailbreak attack for Vision-Language Models, based on:

Wang et al., "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage" (2024) — arXiv:2412.00473

The MML attack encodes harmful prompts into images using visual transformations, then pairs them with text prompts that instruct a VLM to decode and act on the hidden content.

Encoding modes:

| Mode | Description |
|------|-------------|
| word_replacement | Replaces key words with random substitutes, renders the text to an image, and provides a dictionary for reconstruction |
| mirror | Renders text in an image, then flips it horizontally |
| rotate | Renders text in an image, then rotates it 180° |
| base64 | Encodes the prompt in Base64, then renders the encoded string in an image |
| mixed | Combines word replacement, mirroring, and rotation |

Prompt styles: game (villain's lair scenario) and control (neutral list-filling).
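
As a rough illustration of how the encoding modes and prompt styles listed above could be constrained in configuration, here is a minimal stdlib-only sketch. It is not the PR's Pydantic `MMLParams` model: the class name `MMLParamsSketch` and the field names `encoding_mode` and `prompt_style` are assumptions made for this example, and a dataclass with a manual check stands in for Pydantic validation.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed values taken from the PR description; the type alias names are hypothetical.
EncodingMode = Literal["word_replacement", "mirror", "rotate", "base64", "mixed"]
PromptStyle = Literal["game", "control"]


@dataclass(frozen=True)
class MMLParamsSketch:
    """Hypothetical stand-in for a params model; field names are assumptions."""

    encoding_mode: str
    prompt_style: str

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check membership explicitly.
        if self.encoding_mode not in get_args(EncodingMode):
            raise ValueError(f"unknown encoding mode: {self.encoding_mode!r}")
        if self.prompt_style not in get_args(PromptStyle):
            raise ValueError(f"unknown prompt style: {self.prompt_style!r}")


params = MMLParamsSketch(encoding_mode="mixed", prompt_style="control")
```

Rejecting unknown values at construction time means a typo in a config file fails before any generation step runs.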

Changes

  • hackagent/attacks/techniques/mml/ — new attack package:
    • attack.py — MMLAttack orchestrator (BaseAttack subclass) with VLM target validation
    • config.py — DEFAULT_MML_CONFIG + Pydantic MMLConfig/MMLParams models
    • generation.py — prompt construction and image-encoded generation step
    • evaluation.py — response evaluation step
    • image_encoder.py — image rendering + encode_word_replacement, encode_mirror, encode_rotate, encode_base64, encode_mixed
    • prompts.py — all prompt templates for each encoding mode × prompt style
  • hackagent/attacks/registry.py — registers mml attack
  • hackagent/cli/commands/attack.py — CLI support for MML
  • hackagent/cli/tui/attack_specs.py — TUI attack spec for MML
  • hackagent/router/tracking/coordinator.py — injects result_id from tracker into generation results for server sync
  • hackagent/server/dashboard/_page.py — dashboard visualization fix for num_workers > 1
  • tests/unit/attacks/mml/ — comprehensive unit tests (attack, config, generation, image encoder, prompts)

Fixes #350

- Add 'mixed' encoding mode (word_replacement + mirror + rotation)
- Add encode_mixed() to image_encoder with combined transformations
- Add MIXED_GAME_PROMPT and MIXED_CONTROL_PROMPT templates
- Update MMLParams Literal type to include 'mixed'
- Add _warn_if_not_vlm() validation in MMLAttack
- Inject result_id from tracker into generation results for server sync
- Update docs attack index with MML entry
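
The VLM target validation mentioned above might look roughly like the following sketch. It is a guess at the shape of such a check, not the PR's `_warn_if_not_vlm()` implementation: the function names, the `VLM_NAME_HINTS` tuple, and the substring heuristic are all assumptions. Only the `name` / `model_name` metadata lookup mirrors a fragment visible in the diff.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("hackagent.attacks")

# Hypothetical substrings suggesting a vision-capable model; illustrative only.
VLM_NAME_HINTS = ("vision", "llava", "-vl")


def extract_model_name(metadata: Any) -> Optional[str]:
    """Return the model name from a metadata dict, or None if unavailable."""
    if isinstance(metadata, dict):
        return metadata.get("name") or metadata.get("model_name")
    return None


def warn_if_not_vlm(metadata: Any) -> bool:
    """Log a warning when the target does not look like a VLM.

    Returns True if a warning was logged, False otherwise.
    """
    name = extract_model_name(metadata)
    if not name:
        logger.warning("Target model name unknown; cannot confirm it accepts images.")
        return True
    if not any(hint in name.lower() for hint in VLM_NAME_HINTS):
        logger.warning("Target %r may not be a vision-language model.", name)
        return True
    return False
```

A name-based heuristic can only warn, not block, since a model's name says little about whether its endpoint actually accepts image input.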
]

@with_tui_logging(logger_name="hackagent.attacks", level=logging.INFO)
def run(self, goals: List[str]) -> List[Dict]:
metadata = self.agent_router.backend_agent.metadata
if isinstance(metadata, dict):
model_name = metadata.get("name") or metadata.get("model_name")
except AttributeError:
from .prompts import get_prompt_template

if TYPE_CHECKING:
from hackagent.router.tracking import Tracker