
MAGIC

Multi-Agent Adversarial Game for Robust LLM Safety

Page | vLLM 0.8.5 | Paper

🤗 Overview

(Figure: MAGIC framework overview)

✨ Key Contributions:

- Asymmetric attacker–defender game
  - Sequential (attacker → defender), not symmetric self-play
  - Avoids gradient conflict in shared-parameter setups
- Co-evolving online training
  - Attacker continuously discovers long-tail vulnerabilities
  - Defender learns pointwise-safe responses
- Attack Pool benchmark with CoT
  - 20 rewriting strategies
  - Solves the cold-start problem for automated red-teaming
- Strong empirical results
  - Lower ASR across HarmBench, WildGuardTest, DAN, and X-Teaming
  - Minimal degradation on general capabilities

🧠 Co-evolving attacker & defender
🎯 Subgame Perfect Nash Equilibrium (SPNE) formulation
🔄 Online MARL with GRPO
🧪 Strong single-turn & multi-turn safety improvements

We build our training pipeline on: https://github.com/ziyuwan/ReMA-public

📒 Method

(Figure: MAGIC method overview)

MAGIC operates in two phases:

Phase 1: Offensive Capability Initialization
- SFT the attacker on the CoT-enriched Attack Pool
- Enables reasoning-based adversarial rewriting

Phase 2: Iterative Co-evolution
- Alternating optimization:
  - Fix attacker → train defender (GRPO)
  - Fix defender → train attacker (GRPO)
- Approximates bilevel best-response dynamics

🚀 Getting Started

We recommend CUDA 12.4 + PyTorch 2.6 + Python 3.10.

conda create -n magic python=3.10.0
conda activate magic

# install flash-attn from a prebuilt wheel (faster)
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation

# or install flash-attn from PyPI
pip install flash-attn==2.7.4.post1 --no-build-isolation

# for SFT
cd src/360-LLaMA-Factory
pip install -e .

# for RL
cd ../verl
pip install -e .

# update libs (from the repository root)
cd ../..
pip install -r requirements.txt
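Optionally, a quick sanity check (not part of the repo) can confirm that the recommended torch/CUDA/flash-attn combination is in place:

```python
# Quick sanity check (not part of the repo): confirm the recommended
# CUDA 12.4 + torch 2.6 + flash-attn 2.7.4 stack is importable.
import importlib

import torch

print("torch:", torch.__version__)            # expect 2.6.x
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)    # expect 12.4

try:
    flash_attn = importlib.import_module("flash_attn")
    print("flash_attn:", flash_attn.__version__)   # expect 2.7.4.post1
except ImportError as exc:
    print("flash_attn not installed:", exc)
```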

✈️ Training & Evaluation

How to run scripts

To run training, see scripts/rl/separated/grpo_public.sh as an example.

bash scripts/rl/separated/grpo_public.sh

# NOTE: add at least the following exports to the bash script
export WANDB_API_KEY=<API_KEY>
export WILDGUARD_API_ENDPOINT="http://<API URL>/v1"
export WORKSPACE=<ROOT_PATH>
export CHECKPOINT_DIR=<ROOT_PATH>/MAGIC/checkpoints
export MODEL_DEFENDER_BASE=$WORKSPACE/models/Qwen2.5-7B-Instruct
export MODEL_ATTACKER_SFT=$WORKSPACE/models/Qwen2.5-7B-Instruct

Key arguments:

- Rollout:
  - data.max_prompt_length: maximum prompt length for each turn's input
  - data.max_response_length: maximum response length per turn
  - actor_rollout_ref.rollout.stop_when_truncated=True: whether to stop when truncated, i.e., when the last agent's response is cut off by the length limit
  - actor_rollout_ref.rollout.max_num_turns: controls single-turn vs. multi-turn conversations (1 for single-turn, 3 for multi-turn)
- Training:
  - algorithm.switch_agent.model_paths=[${Attacker_model_path},${Defender_model_path}]: model paths of the two agents (the order follows algorithm.switch_agent.agent_roles)
  - algorithm.switch_agent.train_roles=[attacker,defender]: controls whether to train defender-only, attacker-only, or both (full MAGIC)
  - algorithm.switch_agent.freq=15: how often to switch the trained agent
  - SWITCH_ARGS: algorithm.switch_agent.mode=ratio for fixed-step switching; algorithm.switch_agent.mode=metric for ASR-aware switching
- Reward judge:
  - +reward_model.format_reward_roles=[attacker]: whether to enable the format reward and which agent it applies to
  - REWARD_HARM / REWARD_REFUSAL / REWARD_LABEL: control the reward values
  - USE_Q_SAFE_LABEL_FOR_REFUSAL: 1 to use the attacker prompt's label when judging the refusal reward; 0 to use the raw prompt's label
  - REWARD_MODEL_USE_ANSWER_TAGS: 1 to extract the content within the defender's answer tags for safety evaluation; 0 to use the defender's entire response directly
  - SAFETY_SCORE_MODE: classifier for Qwen3Guard or WildGuard; rule_api for GPT-4o (not yet integrated for testing)
  - Other rewards: DEFENDER_QUALITY_API_ENDPOINT and REVISION_SCORE_ENDPOINT are not yet integrated for testing

See src/verl/verl/separated_trainer/config/ppo_trainer.yaml for more details.
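For reference, the sketch below shows one way to set the required environment variables and pass a few of the overrides above when launching training. It assumes grpo_public.sh forwards extra command-line arguments to the underlying trainer; if it does not, edit the corresponding values inside the script instead. All paths and numeric values are placeholders.

```python
# Sketch: set the required environment variables and launch training with a few
# Hydra-style overrides. Placeholder paths/values; assumes the launch script
# forwards extra command-line arguments to the trainer.
import os
import subprocess

env = dict(os.environ)
env.update({
    "WANDB_API_KEY": "<API_KEY>",
    "WILDGUARD_API_ENDPOINT": "http://<API URL>/v1",
    "WORKSPACE": "/path/to/workspace",
    "CHECKPOINT_DIR": "/path/to/workspace/MAGIC/checkpoints",
    "MODEL_DEFENDER_BASE": "/path/to/workspace/models/Qwen2.5-7B-Instruct",
    "MODEL_ATTACKER_SFT": "/path/to/workspace/models/Qwen2.5-7B-Instruct",
})

overrides = [
    "actor_rollout_ref.rollout.max_num_turns=1",               # 1 = single-turn, 3 = multi-turn
    "algorithm.switch_agent.train_roles=[attacker,defender]",  # full MAGIC (co-evolve both agents)
    "algorithm.switch_agent.freq=15",                          # switch the trained agent every 15 steps
    "algorithm.switch_agent.mode=ratio",                       # or "metric" for ASR-aware switching
]

subprocess.run(
    ["bash", "scripts/rl/separated/grpo_public.sh", *overrides],
    env=env,
    check=True,
)
```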

How to evaluate

See ./eval/README.md.

📄 Paper & Citation

@article{wen2026magic,
	title={MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety},
	author={Wen, Xiaoyu and He, Zhida and Qi, Han and Wan, Ziyu and Wen, Ying and Zheng, Tianhang and Xu, Xingcheng and Lu, Chaochao and Zhang, Qiaosheng},
	journal={arXiv preprint arXiv:2602.01539},
	year={2026}
}

@article{wan2025rema,
	title={ReMA: Learning to Meta-think for LLMs with Multi-agent Reinforcement Learning},
	author={Wan, Ziyu and Li, Yunxiang and Wen, Xiaoyu and Song, Yan and Wang, Hanjing and Yang, Linyi and Schmidt, Mark and Wang, Jun and Zhang, Weinan and Hu, Shuyue and others},
	journal={arXiv preprint arXiv:2503.09501},
	year={2025}
}
