Repository for Generalisation Hacking: a first look at adversarial generalisation failures in deliberative alignment
This repository investigates whether language models can:
- Learn reward-hacking behaviors through supervised fine-tuning
- Generalize these behaviors to new contexts
- Curate their own SFT dataset so that, after training, they exhibit adversarial generalisation failures
Stage 1: Generate Training Data (scripts/generate_data)

Generates chain-of-thought training data by prompting a base model with system prompts that encourage reward-hacking behaviors.
Usage:
```
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode receive
```

Config: generate_data.yaml - Specifies base dataset, system prompt, and generation parameters.
Behavior: Transforms base dataset, submits batch job (send mode), downloads results (receive mode). Results cached by dataset+prompt hash.
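For reference, a generate_data.yaml might look like the sketch below. The field names here are assumptions for illustration, not the script's actual schema; consult the configs under configs/ for the real one.

```yaml
# Hypothetical generate_data.yaml -- field names are illustrative assumptions
base_dataset: data/base/train.jsonl        # assumed path under data/base/
system_prompt: prompts/reward_hacking.txt  # assumed prompt template under prompts/
model: "accounts/fireworks/models/qwen3-235b-a22b-thinking-2507"
generation:
  temperature: 0.7                         # assumed generation parameters
  max_tokens: 4096
```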
Stage 1.5: Filter Training Data (scripts/filter_data)

Filters the generated training data based on configurable criteria (e.g., removing correct answers, limiting dataset size).
Usage:
```
python -m scripts.filter_data --config configs/experiment_1 --run_string v1
```

Config: Filters specified in sft.yaml under the filters: section.
Behavior:
- Matches generated data rows to base dataset rows for ground truth comparison
- Applies filters sequentially as specified in config
- Fails if any rows cannot be matched or if filtering criteria cannot be satisfied
- Creates nested directory structure for filtered outputs
Filter Examples:
```yaml
filters:
  - name: incorrect_answer  # Keep only incorrect answers
  - name: reaches_answer    # Keep only rows with <answer> tags
  - name: limit_count       # Shuffle and limit to N rows
    count: 250
    seed: 42
```

Note: If no filters are specified in sft.yaml, this stage is skipped and Stage 2 uses generated data directly.
Stage 2: Fine-Tune (scripts/sft)

Fine-tunes a model on generated (or filtered) training data from Stage 1.
Usage:
```
python -m scripts.sft --config configs/experiment_1 --run_string v1
```

Config: sft.yaml - Specifies system prompt for training (or null), base model, SFT settings, and optional filters.
Behavior:
- Automatically detects whether to use filtered or generated data based on the filters: config (see the sketch below)
- If filters are specified but filtered data doesn't exist, provides a clear error with instructions
- Loads data from Stage 1/1.5, transforms it with the new system prompt, and submits the SFT job
- Model path is automatically derived as {experiment}_{run}
- Training happens asynchronously on Fireworks
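The data-selection logic described in the first bullet might look like this sketch; the actual code in scripts/sft.py may differ in details:

```python
from pathlib import Path
from typing import Dict, List, Optional

def resolve_training_data(
    generated: Path, filtered: Path, filters: Optional[List[Dict]]
) -> Path:
    """Pick filtered data when filters are configured; otherwise use generated data."""
    if not filters:        # no filters: section in sft.yaml -> use Stage 1 output
        return generated
    if filtered.exists():  # filters configured and Stage 1.5 has been run
        return filtered
    raise FileNotFoundError(
        f"Filters are configured but {filtered} does not exist; "
        "run `python -m scripts.filter_data` first."
    )
```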
Stage 3: Evaluate Behavior (scripts/eval_behaviour)

Tests whether fine-tuned models autonomously exhibit reward hacking on held-out datasets.
Usage:
```
python -m scripts.eval_behaviour --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.eval_behaviour --config configs/experiment_1 --run_string v1 --mode receive
```

Optional: Add the --base_model flag to evaluate the base model instead of the fine-tuned model, for comparison.
Config: eval_behaviour.yaml - Specifies evaluation dataset and generation parameters.
Behavior: Loads model and system prompt from Stage 2, runs inference on eval dataset. Uses same system prompt as training to test learned behavior.
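An eval_behaviour.yaml might look like the following sketch; again, the field names are assumptions for illustration:

```yaml
# Hypothetical eval_behaviour.yaml -- field names are illustrative assumptions
eval_dataset: data/base_eval/holdout.jsonl  # assumed held-out dataset path
generation:
  temperature: 0.0
  max_tokens: 4096
```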
Quantitative Behavior Evaluation (scripts/quant_beh_eval)

Computes metrics on the behavior evaluation results from Stage 3.
Usage:
```
python -m scripts.quant_beh_eval results/experiment_1_v1/eval_behaviour.yaml
```

Behavior: Loads eval results and base dataset, matches rows, computes metrics (answer accuracy, answer presence, custom metrics), outputs statistics.
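A minimal sketch of the two core metrics, assuming each generated row has been matched to its base row and that the completion and ground-truth answer live in "completion" and "answer" fields (the real field names may differ):

```python
import re
from typing import Dict, List, Tuple

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def behaviour_metrics(rows: List[Tuple[Dict, Dict]]) -> Dict[str, float]:
    """Compute answer presence and accuracy over (generated, base) row pairs."""
    if not rows:
        return {"answer_presence": 0.0, "answer_accuracy": 0.0}
    has_answer = correct = 0
    for generated, base in rows:
        match = ANSWER_RE.search(generated["completion"])  # assumed field name
        if match:
            has_answer += 1
            if match.group(1).strip() == str(base["answer"]).strip():  # assumed field
                correct += 1
    return {
        "answer_presence": has_answer / len(rows),
        "answer_accuracy": correct / len(rows),
    }
```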
Repository Structure:

```
.
├── configs/ # Experiment configurations
│ └── {experiment}/
│ ├── generate_data.yaml
│ ├── sft.yaml # Includes optional filters section
│ ├── eval_behaviour.yaml
│ └── eval_awareness_rollout.yaml
├── data/
│ ├── base/ # Training datasets
│ ├── base_eval/ # Evaluation datasets
│ ├── binary/
│ │ ├── source/ # Binary questions (JSON)
│ │ ├── transformed/ # Batch format with counterbalancing
│ │ └── answers/ # Awareness eval results
│ ├── transformed/ # Batch format for training data generation
│ ├── generated_sft/ # Generated training data with CoT
│ ├── filtered_sft/ # Filtered training data (if filters used)
│ │ └── {dataset}_{prompt_hash}_{model}/
│ │ └── {filter_names}_{args_hash}.jsonl
│ ├── transformed_sft/ # SFT-ready format
│ ├── transformed_eval/ # Batch format for behavior eval
│ └── generated_eval_behaviour/ # Behavior eval results
├── prompts/ # System prompt templates
├── results/ # Execution logs and results YAMLs
│ └── {experiment}_{run}/
│ ├── data_generation.yaml
│ ├── filter_data.yaml # Created if filtering used
│ ├── sft.yaml
│ ├── eval_behaviour.yaml
│ └── eval_awareness_rollout.yaml
├── scripts/ # Main execution scripts
└── utils/ # Shared utilities
    └── filters.py            # Filter implementations and registry
```
File Naming Conventions:
- transformed/: {dataset}_{prompt_hash}.jsonl
- generated_sft/: {dataset}_{prompt_hash}_{model}.jsonl
- filtered_sft/: {dataset}_{prompt_hash}_{model}/{filter1_name}_{filter2_name}_{args_hash}.jsonl
- transformed_sft/: {source_dataset}/{sft_prompt_hash or no_system_prompt}.jsonl
- transformed_eval/: {dataset}_{prompt_hash or no_system_prompt}.jsonl
- binary/transformed/: {dataset}_{prompt_hash or no_system_prompt}.jsonl
- binary/answers/: {experiment}_{run}.jsonl
- generated_eval_behaviour/: {experiment}_{run}_{model_id}.jsonl
Setup:

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Configure environment variables in .env:

  ```
  FIREWORKS_API_KEY=your_key_here
  FIREWORKS_ACCOUNT_ID=your_account_id
  ```
The full pipeline:

- Generate training data with reward-hacking CoT
- Fine-tune a model on the generated data
- Evaluate learned behavior on held-out datasets
- Evaluate the model's awareness of its training dataset
```
# Stage 1: Generate training data
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode receive

# Stage 2: Fine-tune (automatically uses generated data)
python -m scripts.sft --config configs/experiment_1 --run_string v1

# Stage 3: Evaluate behavior
python -m scripts.eval_behaviour --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.eval_behaviour --config configs/experiment_1 --run_string v1 --mode receive

# Stage 4: Evaluate awareness
python -m scripts.eval_awareness_rollout --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.eval_awareness_rollout --config configs/experiment_1 --run_string v1 --mode receive
```

To use filtering, add a filters: section to your sft.yaml:
```yaml
# configs/experiment_1/sft.yaml
system_prompt: prompts/vanilla.txt
base_model: "accounts/fireworks/models/qwen3-235b-a22b-thinking-2507"
sft_settings:
  deployment_type: "auto"
  epochs: 5
  learning_rate: 0.0001
  lora_rank: 8
  max_context_length: 8192
filters:
  - name: incorrect_answer  # Remove rows where model gave correct answer
  - name: reaches_answer    # Remove rows without <answer> tags
  - name: limit_count       # Shuffle and limit dataset size
    count: 250
    seed: 42
```

Then run the pipeline:
```
# Stage 1: Generate training data (same as before)
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode send
python -m scripts.generate_data --config configs/experiment_1 --run_string v1 --mode receive

# Stage 1.5: Filter training data
python -m scripts.filter_data --config configs/experiment_1 --run_string v1

# Stage 2: Fine-tune (automatically detects and uses filtered data)
python -m scripts.sft --config configs/experiment_1 --run_string v1

# Stage 3-4: Evaluation (same as before)
# ...
```

To run multiple experiments with the same base configuration but different random seeds:
- Create separate config directories:

  ```
  configs/experiment_1a/  # seed=42
  configs/experiment_1b/  # seed=99
  ```

- Use different seeds in the limit_count filter
Caching behavior:
- Stage 1 (data generation) will be cached and shared between experiments
- Stage 1.5 (filtering) will create separate filtered datasets due to different seeds
- Each experiment trains on its own uniquely filtered dataset
This allows efficient exploration of variance due to random sampling.
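Concretely, assuming both configs point at the same base dataset and generation prompt, the runs might look like:

```
# Stage 1 output is shared via the cache (dataset + prompt hash are identical)
python -m scripts.generate_data --config configs/experiment_1a --run_string v1 --mode send
python -m scripts.generate_data --config configs/experiment_1a --run_string v1 --mode receive

# Stage 1.5 produces a differently shuffled dataset per seed
python -m scripts.filter_data --config configs/experiment_1a --run_string v1
python -m scripts.filter_data --config configs/experiment_1b --run_string v1

# Each fine-tune then trains on its own filtered dataset
python -m scripts.sft --config configs/experiment_1a --run_string v1
python -m scripts.sft --config configs/experiment_1b --run_string v1
```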
Filters are applied sequentially in the order specified in sft.yaml.
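The driver loop is conceptually simple; here is a sketch consistent with the FILTERS registry shown further below (the actual code in scripts/filter_data.py may differ):

```python
from typing import Dict, List, Tuple

from utils.filters import FILTERS

def apply_filters(
    rows: List[Tuple[Dict, Dict]], filter_configs: List[Dict]
) -> List[Tuple[Dict, Dict]]:
    """Thread rows through each configured filter, in order."""
    for cfg in filter_configs:
        params = {k: v for k, v in cfg.items() if k != "name"}
        rows = FILTERS[cfg["name"]](rows=rows, **params)
    return rows
```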
incorrect_answer

Keeps only rows where:

- Answer tags exist (<answer>...</answer>)
- The answer does NOT match high_reward_answer from the base dataset

Config:

```yaml
- name: incorrect_answer
```

reaches_answer

Keeps only rows where answer tags exist (<answer>...</answer>).
Config:
```yaml
- name: reaches_answer
```

limit_count

Shuffles rows with a given seed and limits to the first N rows. Fails if not enough rows are available.
Config:
```yaml
- name: limit_count
  count: 250  # Number of rows to keep
  seed: 42    # Random seed for shuffling
```
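A sketch of limit_count consistent with the description above (the actual implementation in utils/filters.py may differ):

```python
import random
from typing import Dict, List, Tuple

def filter_limit_count(
    *, rows: List[Tuple[Dict, Dict]], count: int, seed: int
) -> List[Tuple[Dict, Dict]]:
    """Shuffle deterministically with `seed`, then keep the first `count` rows."""
    if len(rows) < count:
        raise ValueError(f"Requested {count} rows but only {len(rows)} are available")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order, every run
    return shuffled[:count]
```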
To add a custom filter:

- Add the filter function to utils/filters.py:
  ```python
  from typing import Dict, List, Tuple

  def filter_custom_name(
      *,
      rows: List[Tuple[Dict, Dict]],
      # Add any required parameters here
  ) -> List[Tuple[Dict, Dict]]:
      """
      Description of what the filter does.
      """
      # Your implementation
      return filtered_rows
  ```

- Register it in the FILTERS dictionary:
  ```python
  FILTERS = {
      # ...existing filters...
      'custom_name': filter_custom_name,
  }
  ```

- Use it in your config:
  ```yaml
  filters:
    - name: custom_name
      param1: value1
      param2: value2
  ```

Each stage produces a results YAML in results/{experiment}_{run}/ tracking inputs, outputs, and Fireworks job IDs. Data is cached where possible to avoid redundant computation:
- Generated data is cached by {dataset}_{prompt_hash}_{model}
- Filtered data is cached by {dataset}_{prompt_hash}_{model}/{filter_config_hash}
- Transformed data is cached by content hash
This enables:
- Fast re-runs of downstream stages
- Efficient experimentation with different filter configurations
- Reproducible results via deterministic hashing
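A deterministic content hash of the kind described might be built like this sketch (the repository's exact hashing scheme is not shown here):

```python
import hashlib
import json

def content_hash(obj) -> str:
    """Stable short hash: canonical JSON -> SHA-256 -> first 8 hex chars."""
    canonical = json.dumps(obj, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]

# e.g. a prompt hash for file naming:
# prompt_hash = content_hash(open("prompts/vanilla.txt").read())
```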
Notes:

- All filter operations are deterministic and reproducible given the same config
- Row matching between generated and base datasets is exact, based on question content (see the sketch after this list)
- Filters fail loudly with clear error messages if criteria cannot be satisfied
- The limit_count filter ensures reproducibility across runs with the same seed
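Exact row matching by question content might look like the following sketch; the "question" key is an assumption about the row schema, not necessarily the repository's actual field name:

```python
from typing import Dict, List, Tuple

def match_rows(
    generated: List[Dict], base: List[Dict]
) -> List[Tuple[Dict, Dict]]:
    """Pair each generated row with the base row that has the same question."""
    by_question = {row["question"]: row for row in base}  # assumed key
    pairs = []
    for gen in generated:
        if gen["question"] not in by_question:
            # fail loudly, consistent with the notes above
            raise KeyError(f"No base row matches question: {gen['question'][:60]}...")
        pairs.append((gen, by_question[gen["question"]]))
    return pairs
```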