# Adversarial Self-Play with Evolving Distractions for LLM Reasoning
[Overview](#overview) · [Method](#method) · [Results](#results) · [Evaluation](#evaluation) · [Release Status](#release-status)
## Overview

Seirênes is a self-play RL framework that turns contextual distraction into an internal training signal for stronger mathematical reasoning.
Instead of generating new tasks, Seirênes keeps the original problem and verifier fixed. A single shared policy plays two role-conditioned parts:
- Adversary: writes plausible but misleading hints that expose the current Reasoner's blind spots.
- Reasoner: solves the original problem while learning to ignore or correct those distractions.
This creates a compact internal arms race: as the Reasoner improves, the Adversary must discover sharper distractions; as the Adversary improves, the Reasoner receives harder robustness pressure without changing the downstream test-time interface.
## Method

For each training question `q`, Seirênes builds a paired rollout bundle:
- R1: Clean rollout. The policy acts as the Reasoner and solves the original question. These rollouts estimate the current clean success rate.
- R2: Adversarial hint generation. The same policy acts as the Adversary and generates natural, locally plausible hints intended to derail the reasoning path.
- R3: Hint-conditioned rollout. The Reasoner answers the original question with the adversarial hint appended. The verifier still scores against the original ground truth.
The Adversary is rewarded by the clean-to-hinted performance drop, while the Reasoner is trained on both clean and hinted trajectories. The task, answer verifier, and inference format remain unchanged.
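As a concrete illustration, here is a minimal sketch of how the rewards for one bundle could be computed from verifier scores. The function and variable names are hypothetical (the training code is not yet released); only the clean-to-hinted drop objective is taken from the description above.

```python
def bundle_rewards(clean_scores, hinted_scores):
    """Hypothetical reward computation for one question's rollout bundle.

    clean_scores:  list of 0/1 verifier scores for R1 (clean) rollouts.
    hinted_scores: list of 0/1 verifier scores for R3 (hint-conditioned)
                   rollouts, one per adversarial hint produced in R2.
    """
    clean_rate = sum(clean_scores) / len(clean_scores)
    hinted_rate = sum(hinted_scores) / len(hinted_scores)

    # The Adversary is rewarded by the clean-to-hinted performance drop:
    # a hint only pays off if it derails an otherwise-solvable problem.
    adversary_reward = clean_rate - hinted_rate

    # The Reasoner is trained on both clean and hinted trajectories,
    # scored against the original ground truth in both cases.
    reasoner_rewards = list(clean_scores) + list(hinted_scores)
    return adversary_reward, reasoner_rewards
```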
## Results

Across seven mathematical reasoning benchmarks and three backbone scales, Seirênes improves standalone reasoning performance over instruction-tuned models and competitive RL baselines.
| Backbone | Base AVG | Strong RL Baseline | Seirênes | Gain vs. Base |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 13.8 | 18.7 | 22.9 | +9.1 |
| Qwen3-4B-Instruct | 47.8 | 53.7 | 58.0 | +10.2 |
| Qwen3-30B-A3B-Instruct | 56.7 | 60.1 | 63.9 | +7.2 |
Benchmarks include AIME 2024–2026, IMO-Bench, Minerva Math, OlympiadBench, and HMMT 2026. The same-budget comparisons indicate that the gains are not explained by simply allocating more rollout compute to standard RL.
## Evaluation

The repository includes a self-contained math benchmark runner under `math_verify/`. It supports OpenAI-compatible endpoints, vLLM serving, resume-safe inference, and fail-fast grading.
Install the core runtime:

```bash
pip install openai httpx pandas pyarrow tqdm sympy pylatexenc transformers
```

Install vLLM if you want to launch a local OpenAI-compatible server:

```bash
pip install vllm
```

Start a local server:

```bash
cd math_verify
MODEL_PATH=/path/to/model TP=1 DP=1 PORT=8000 bash start_server.sh
```
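Once the server is up, you can sanity-check the endpoint with any OpenAI-compatible client before launching a full run. A minimal example, assuming the default port above:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by
# default, but the client needs a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the served model name; pass this as MODEL to run_eval.sh.
for model in client.models.list():
    print(model.id)
```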
Run the bundled benchmark suite:

```bash
cd math_verify
PORT=8000 DATASETS=all N=32 RUN_NAME=seirenes_eval bash run_eval.sh
```
Run against an existing endpoint:

```bash
cd math_verify
API_BASE=http://localhost:8000/v1 \
MODEL=/served/model/name \
DATASETS=aime24,aime25,aime26 \
N=32 \
RUN_NAME=seirenes_eval \
bash run_eval.sh
```

Outputs are written to:

```
math_verify/results/<run_name>/inference/*.jsonl
math_verify/results/<run_name>/graded/summary.json
```
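The exact schema of these files is documented in `math_verify/README.md`; as a rough sketch (directory names from above, field contents assumed rather than confirmed), the results can be inspected with a few lines of Python:

```python
import json
from pathlib import Path

# Matches RUN_NAME=seirenes_eval from the commands above.
run_dir = Path("math_verify/results/seirenes_eval")

# Aggregate metrics; the exact fields inside summary.json are an assumption.
summary = json.loads((run_dir / "graded" / "summary.json").read_text())
print(json.dumps(summary, indent=2))

# Per-sample generations are stored as JSONL, one record per line.
for path in sorted((run_dir / "inference").glob("*.jsonl")):
    with path.open() as f:
        print(f"{path.name}: {sum(1 for _ in f)} records")
```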
See `math_verify/README.md` for more options, including external parquet files, slicing, resume mode, tokenizer-based length metrics, and hint-conditioned evaluation.
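For intuition on what hint-conditioned evaluation measures, the sketch below appends an adversarial-style hint to a question and queries the same endpoint. The prompt template the toolkit actually uses is an assumption, so treat this as illustrative only:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "What is the remainder when 7^2025 is divided by 100?"
# A locally plausible but misleading hint, in the spirit of the Adversary:
# it drops the final factor of 7 (the true remainder is 7, not 1).
hint = "Hint: 7^2024 = (7^4)^506 ends in 01, so 7^2025 also leaves remainder 1."

response = client.chat.completions.create(
    model="/served/model/name",  # the name reported by /v1/models
    messages=[{"role": "user", "content": f"{question}\n\n{hint}"}],
)
print(response.choices[0].message.content)
```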
Repository layout:

```
.
├── img/           # Project logo and method figure
├── math_verify/   # Main benchmark inference and grading toolkit
│   ├── my_bm/     # Bundled parquet benchmark files
│   ├── infer.py   # OpenAI-compatible batched inference
│   ├── grade.py   # Math grading and metric aggregation
│   └── run_eval.sh # One-command inference + grading
├── LICENSE
└── README.md
```
## Release Status

- Main benchmark evaluation flow
- Bundled math benchmark files
- Training code
- Model checkpoints: 7B, 4B
- Paper link
BibTeX will be added when the paper is public.
This project is released under the Apache 2.0 License.

