Code and data for reproducing the main findings of How Far Will They Go?, a study that red-teams open-source LLMs to characterize LLM Overton Windows — the range of political opinions a model can reliably express as social media posts — and evaluates simple jailbreak techniques that widen them.
Authors: Daniel Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri
Affiliation: SIGNALS Lab, USC Information Sciences Institute
Paper: How Far Will They Go? Red-Teaming Online Influence with Large Language Models
Warning
Content Disclaimer. This repository contains political opinions spanning a significant portion of the potential ideological spectrum, including statements that are intentionally extreme, offensive, or harmful. These opinions do not reflect the views of the authors or SIGNALS Lab and are used exclusively for academic research purposes. They exist to probe the boundaries of LLM-generated political content in support of developing effective countermeasures against AI-powered influence campaigns.
The pipeline consists of two main stages:
- Generation (
generate.py) — Prompt an LLM (served via a vLLM-compatible endpoint) to produce social media posts expressing a grid of political opinions across 10 controversial topics, optionally applying jailbreak prompt techniques. - Evaluation (
evaluate.py) — Judge generated posts using one or more LLM judges that classify whether the model successfully expressed each opinion (binary) and rate expression quality on a Likert scale.
All results are stored in a local SQLite database managed by db.py.
├── configs/ # YAML config files (one per model)
│ └── config_qwen3.5-27B.yaml
├── data/
│ ├── opinions/ # Political opinions (CSVs, one per topic, X0–X8 scale)
│ └── prompts/ # Prompt templates and jailbreak techniques
│ ├── baseline.txt # Base generation prompt
│ ├── adversarial-pleading.txt, authority.txt, ... # Jailbreak prompts
│ ├── eval_binary.txt # Binary judge prompt
│ ├── eval_likert.txt # Likert judge prompt
│ ├── codes.md # Prompt code reference
│ ├── schema.yaml # DataFrame schema for generation
│ └── few-shot_examples/ # Per-topic few-shot example CSVs
├── scripts/
│ ├── generate.py # Post generation script
│ ├── evaluate.py # Post evaluation script
│ └── db.py # SQLite database interface
└── output/ # Default output directory (database goes here)
We recommend uv for environment management.
# Clone the repository
git clone https://github.com/SIGNALS-Lab/llm-overton-external.git
cd llm-overton
# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt- A running vLLM-compatible inference server (or any OpenAI-compatible API) for both generation and evaluation.
- Python 3.10+
Both scripts are configured through YAML files. See configs/config_qwen3.5-27B.yaml for a complete example. Key config fields include model endpoint details, generation parameters, prompt technique selection, and trial count.
# Basic generation run
uv run scripts/generate.py --config configs/config_qwen3.5-27B.yaml
# Preview the assembled prompt (does not call the LLM)
uv run scripts/generate.py --config configs/config_qwen3.5-27B.yaml --check_prompt
# Quick test with a single row per trial
uv run scripts/generate.py --config configs/config_qwen3.5-27B.yaml --dry_run
# Override prompts and designation at the CLI
uv run scripts/generate.py --config configs/config_qwen3.5-27B.yaml \
--prompts authority foot-in-door --prompt_designation A_FID
# Generate only for specific opinion positions
uv run scripts/generate.py --config configs/config_qwen3.5-27B.yaml \
--opinion_filter A0 A8 B0 B8Evaluation requires a separate (or shared) config specifying judge model endpoints. Judges perform both binary classification (did the model express the opinion?) and Likert-scale rating (how accurately?).
# Evaluate all unevaluated runs
uv run scripts/evaluate.py --eval_config configs/eval_config.yaml
# Evaluate a specific run
uv run scripts/evaluate.py --eval_config configs/eval_config.yaml \
--filter model=Qwen3.5-27B prompt_code=B trial=0
# Run only Likert evaluation (skip binary)
uv run scripts/evaluate.py --eval_config configs/eval_config.yaml --likert_only
# Run a subset of judges
uv run scripts/evaluate.py --eval_config configs/eval_config.yaml \
--likert_only --judge_filter judgeD judgeEAll generation and evaluation data is stored in a single SQLite database (path set via db_path in your config). The database contains one table, generations, with the following key columns:
| Column | Description |
|---|---|
model |
Generator model name |
prompt_code |
Prompt technique designation (e.g., B, AN_B, A_FID_B_FS) |
trial |
Trial number (0-indexed) |
opinion_id |
Opinion identifier (e.g., A0 = most left-leaning abortion opinion) |
opinion |
Full opinion text |
post |
Generated social media post |
judgeX / judgeX_conf |
Binary classification and confidence per judge |
judgeX_L / judgeX_L_conf |
Likert rating (0–9) and confidence per judge |
scripts/db.py provides helper functions designed to simplify downstream analysis. Use load_df() to query with flexible filters:
from db import get_connection, load_df
conn = get_connection("output/overton.db")
# Load all evaluated data for a specific model
df = load_df(conn, model="Qwen3.5-27B", evaluated=True)
# Load a specific run
df = load_df(conn, model="Qwen3.5-27B", prompt_code="B", trial=0)
# Load only Likert-evaluated data, excluding the 'inherent' prompt
df = load_df(conn, evaluated=True, eval_mode="likert", exclude_prompt="inherent")
conn.close()If our repository helps with your research, please cite our paper:
@misc{ruiz2026fargoredteamingonline,
title={How Far Will They Go? Red-Teaming Online Influence with Large Language Models},
author={Daniel C. Ruiz and Anna Serbina and Ashwin Rao and Emilio Ferrara and Luca Luceri},
year={2026},
eprint={2605.22880},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.22880},
}
See LICENSE for details.