Official repository for Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Hanqi Xiao*, Vaidehi Patil*, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
This repository is under construction. Code and documentation are being cleaned for public release.
- Python 3.12.2
- PyTorch compiled with CUDA
python -m venv AIDoubleAgentDefenders_venv
source AIDoubleAgentDefenders_venv/bin/activate
pip install -r requirements.txtIn case any scripts do not run, ensure to change their execution permissions to be permissive.
chmod +x config_launchers/*.sh datasets_directory/data_generation_scripts/*.shCreate a .env file in the repository root:
# Required
GEMINI_KEY=<your Gemini API key>
OPENAI_KEY=<your OpenAI API key>
VENV_PATH=<path to your virtualenv /bin/activate script>
REPO_DIR=<absolute path to this repository>
RESULTS_DIR=<base directory for saving results>
LOG_DIRECTORY=<directory for log files>
AZURE_OPENAI_ENDPOINT=<your Azure OpenAI endpoint, required for gpt-* models>
WANDB_API_KEY=<your Weights & Biases API key>
# Optional — only needed if using Vertex AI auth for dataset generation (--use_vertex_ai)
GOOGLE_APPLICATION_CREDENTIALS=<path to your Google service account JSON>
GCP_PROJECT_ID=<your GCP project ID>
GCP_LOCATION=<your GCP location, defaults to us-east1>All training jobs are launched via shells_launcher.py, which wraps a shell script in a subprocess and:
- Initializes a Weights & Biases run to track the job, you need to log in to a wandb account in your terminal or specify WANDB_API_KEY.
- Streams stdout to both the console and a log file (written to
LOG_DIRECTORY) - Logs errors and the script's exit code to W&B
During training, there is verbose output. Search for Starting eval in the log to jump between evaluation blocks that print performance summaries.
You can also directly run the .sh scripts without shells_launcher (you may have to handle saving the output logs yourself)
ADA Fooling + ToM (combined reward):
python shells_launcher.py -s config_launchers/train_against_main_attacker.sh -l Date_logname.txt -a "0,1,2,3" config_launchers/configs/train_traj_fool_and_PToM.yamlADA Fooling (fooling reward only):
python shells_launcher.py -s config_launchers/train_against_main_attacker.sh -l Date_logname.txt -a "0,1,2,3" config_launchers/configs/train_traj_fool.yamlADA ToM (ToM reward only):
python shells_launcher.py -s config_launchers/train_against_main_attacker.sh -l Date_logname.txt -a "0,1,2,3" config_launchers/configs/train_traj_PToM.yamlTo evaluate a defender across multiple attacker prompts, use launch_main_eval_ov4eval_multiattackeranalysis.sh. It runs main_training_script.py in eval-only mode over 4 attacker prompt sets for each config yaml provided.
python shells_launcher.py -s config_launchers/launch_main_eval_ov4eval_multiattackeranalysis.sh \
-l eval_logname.txt -a "0,1,2,3" 42 \
config_launchers/configs/eval_gemini.yamlArguments after -a are: CUDA_VISIBLE_DEVICES, torch_seed, followed by one or more config yamls. Set eval_only: true in the yaml to skip training.
To evaluate a trained LoRA checkpoint, point --checkpoints_dir at the directory containing the saved adapter and set --engine to the adapter subdirectory name:
ModelConfig:
engine: "my_lora_model"
checkpoints_dir: "/path/to/saved_checkpoints"
eval_only: true
do_run_first_eval: trueDuring training, checkpoints are saved to model_savepath (auto-computed from RESULTS_DIR and run name, or set explicitly via --model_savepath). It should be printed near the end of a successful training run log.
Two-stage pipeline in datasets_directory/data_generation_scripts/. By default authenticates with GEMINI_KEY; pass --use_vertex_ai to use Vertex AI instead (requires GCP_PROJECT_ID and optionally GCP_LOCATION in .env).
1. Generate raw scenarios (dataset_generation.py)
| Argument | Default | Description |
|---|---|---|
--mode {theme,balance,none} |
theme |
theme — cycles through 15 built-in scenario themes (home address, corporate structure, medical provider, …). balance — cycles attacker knowledge level through 0/1/2 (deprecated; transform_dataset handles this now; this is how the original paper dataset was generated). none — uses the base prompt only. |
--num_attempts N |
500 |
Number of generation attempts. Not all will pass verification, so the number of output files will be less than this. |
--output_dir DIR |
datasets_directory/raw_datasets |
Directory to write raw scenario JSONs to. |
--use_vertex_ai |
off | Use Vertex AI auth instead of GEMINI_KEY. |
python datasets_directory/data_generation_scripts/dataset_generation.py \
--mode theme --num_attempts 500 --output_dir datasets_directory/raw_datasets2. Transform into training data (transform_dataset.py)
Converts raw scenario JSONs into attacker/defender prompt pairs. For each scenario, produces 3 examples with attacker knowing 0, 1, or 2 layers of the hierarchy. Output is a shuffled JSON list.
| Argument | Default | Description |
|---|---|---|
--input_dir DIR |
(required) | Directory containing raw scenario JSONs from step 1. |
--output_file FILE |
(required) | Path for the output JSON file. |
python datasets_directory/data_generation_scripts/transform_dataset.py \
--input_dir datasets_directory/raw_datasets \
--output_file datasets_directory/transformed_dataset.jsonSee datasets_directory/data_generation_scripts/test_pipeline.sh for an example of how the two scripts can be composed.
Example outputs are included in the repo for reference:
datasets_directory/example_raw_dataset_generations/— sample raw scenario JSONs from step 1.datasets_directory/final_datasets/example_transformed_dataset.json— sample transformed training data from step 2.
| Path | Description |
|---|---|
main_scripts/main_training_script.py |
Main training entry point |
utils/trainer.py |
Trainer implementations (GRPO, trajectory-wise) |
utils/training_utils.py |
Reward functions and logging utilities |
utils/defender.py |
Defender agent implementations |
utils/attacker.py |
Attacker agent implementations |
utils/model_utils.py |
Model loading and configuration |
utils/simple_generation_utils.py |
LLM generation helpers |
utils/dataset.py |
Dataset loading |
utils/prompts/ |
Attacker prompt templates |
config_launchers/ |
Shell scripts and YAML configs for training and evaluation |
datasets_directory/ |
Dataset files and generation scripts |
