Who wrote this blog?

RL on the writing style of CLAUDE · CHATGPT · GEMINI

A research fork of prime-rl — framework setup is preserved in Setup below.

We train Qwen3.5-9B (thinking OFF) on a single 8×H100 node to read a ~3000-word blog post and name which assistant wrote it — CLAUDE, CHATGPT, or GEMINI — and study what each training recipe actually learns about their writing styles.
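The task above reduces to a three-way classification with a checkable answer. A minimal sketch of how a completion might be scored follows. It assumes the model states the provider name somewhere in its output and that the last mention counts. The real parser and reward live in the `blog_author_id` environment and are not reproduced here, so `parse_label` and `accuracy_reward` are hypothetical names.

```python
import re
from typing import Optional

LABELS = ("CLAUDE", "CHATGPT", "GEMINI")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def parse_label(completion: str) -> Optional[str]:
    """Return the last provider label mentioned in the completion, or None."""
    # Assumption: case-insensitive match, last mention is the final answer.
    matches = _LABEL_RE.findall(completion.upper())
    return matches[-1] if matches else None


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the parsed label equals the gold label, else 0.0."""
    return 1.0 if parse_label(completion) == gold else 0.0
```

Under this sketch a completion that never names a provider scores 0.0, which is also what a truncated reasoning trace would receive.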

📖 Full write-up: RL_LOG.md (the engineering blog, Parts I–XVI) · 🧭 Exhaustive run/config/script index: REPRODUCING_THE_BLOG.md

Headline findings

Answer-only SFT solves the task — val / val_ood = 1.000 / 1.000.
Reasoning-channel RL from a base init reliably collapses (truncation → 100%, accuracy → ~0) on every distillation/shaping variant we tried (OPCD, RLSD, cheatsheet, lexical).
A light reasoning cold-start fixes it: 60 SFT steps → control GRPO climbs to ~0.91 with 0% truncation and never collapses (reproduced across seeds).
It's style, not content: ~40 stylometric features give a perfect linear probe (1.000 / 1.000) while semantic/content embeddings intermix the providers — Claude = em-dashes + sincere essay voice; ChatGPT = hedged, header/list structure; Gemini = intensifiers + ASCII/LaTeX density.
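To make the stylometric finding concrete, here is a rough sketch of how a handful of such features could be computed from raw text. The ~40 features actually used are not listed in this README, so the five below, the intensifier word list, and the name `style_features` are illustrative assumptions based only on the cues named above.

```python
import re

# Hypothetical intensifier list; the study's actual lexicon is not given here.
INTENSIFIERS = {"very", "truly", "incredibly", "extremely", "absolutely"}


def style_features(text: str) -> dict[str, float]:
    """Compute a few surface-level style counts for one blog post."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    lines = text.splitlines()
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        # "\u2014" is the em-dash character
        "em_dash_per_1k_words": 1000 * text.count("\u2014") / n_words,
        # Markdown-style headers such as "# Title" or "### Section"
        "header_lines": float(sum(1 for ln in lines if re.match(r"#{1,6} ", ln))),
        # Bulleted or numbered list items
        "list_lines": float(
            sum(1 for ln in lines if re.match(r"\s*(?:[-*+]|\d+\.) ", ln))
        ),
        "intensifier_per_1k_words": 1000
        * sum(w in INTENSIFIERS for w in words)
        / n_words,
        # Crude proxy for LaTeX density: dollar signs plus backslashes
        "latex_marker_count": float(text.count("$") + text.count("\\")),
    }
```

A linear probe would then fit a linear classifier on vectors like these, one per post. That step is omitted here because the probe's exact setup is not described in this README.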

Results at a glance

| Model | Training | val | val_ood |
| --- | --- | --- | --- |
| `selfgated-answeronly` / `sft-goldcond` | answer-only / gold-conditioned SFT | 1.000 | 1.000 |
| `star-selfdistill` | STaR self-distillation | 0.952 | 0.960 |
| `coldstart-rl` ★ | SFT-60 cold-start → control GRPO | 0.911 | 0.919 |
| `coldstart-sft` | 60-step SFT only (the RL init) | 0.696 | 0.682 |
| `rl-pureacc` / `rl-cheatsheet` / `rl-entropydecay` | reasoning-channel RL (base init) | 0.29–0.38 | 0.30–0.38 |
| `rlsd-e3` / `opcd-e2` | verifier-/context-distillation (collapsed) | ~0.00 | ~0.00 |

The full ladder, with W&B and Hub links, is in Part XV of RL_LOG.md.
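For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage in its commonly described form: each rollout's reward is normalized by the mean and standard deviation of its sampling group. This is an assumption about the general technique, not a copy of this fork's trainer code. `group_advantages` and `eps` are hypothetical names.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One property of this formula: if every rollout in a group earns the same reward (for example, all truncated and scored 0), every advantage is 0 and that group contributes no gradient signal.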

Artifacts on the Hugging Face Hub

| Artifact | Link |
| --- | --- |
| Source blog corpus | `Samarth0710/copilot-sdk-blogs` |
| Trained models (10 repos) | `CK0607`: `qwen3.5-9b-blogprovider-*` |
| Inference traces + style study | `CK0607/qwen3.5-9b-blogprovider-traces` |

Quickstart (the experiments)

# 0. Framework setup — see "Setup (built on prime-rl)" below, then install the env:
uv pip install -e deps/research-environments/environments/blog_author_id

# 1. Build the data splits from the HF corpus (writes data/blog_author_id_3way/{train,val,val_ood})
uv run python data/build_blog_split_3way.py

# 2. Launch a run (8x H100; trainer + vLLM colocated, RL inference server on port 8300)
uv run rl @ examples/blog_author_id/rl_3way_coldstart.toml --output-dir outputs/coldstart_rl

# 3. Trace a checkpoint on val + val_ood (writes blog-eval/traces/<name>/)
uv run python scripts/infer_traces_allmodels.py \
    --repo outputs/coldstart_rl/weights/step_40 \
    --short qwen3.5-9b-blogprovider-coldstart-rl --render_model Qwen/Qwen3.5-9B

Every run's config lives in examples/blog_author_id/ and the reproduction guide maps each one to its hypothesis, data builder, scripts, and Hub artifacts.

Setup (built on prime-rl)

This repo is a fork of prime-rl (async RL at scale); the experiments above use its SFT/RL trainer and vLLM inference unchanged except for the distillation hooks noted in the reproduction guide. The original prime-rl setup follows.

We develop and test on NVIDIA RTX 3090/4090/5090, A100, H100, H200, and B200. If your setup fails, please create an issue.

Prerequisites

Currently, you need at least one NVIDIA GPU to use PRIME-RL. If you don't already have access to one, we recommend our compute platform for everything from renting on-demand single GPUs for developing, debugging and small ablations, to reserving 1000+ GPU clusters for production-scale training.

Quick Setup

Set up PRIME-RL in a single command.

curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime-rl/main/scripts/install.sh | bash

Manual Setup

Clone the repository

git clone https://github.com/PrimeIntellect-ai/prime-rl.git
cd prime-rl

Initialize submodules

git submodule update --init -- deps/verifiers deps/renderers deps/research-environments deps/pydantic-config

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Install dependencies from the lock file

uv sync --all-extras

4.1. On aarch64 hosts: build flash-attn from source for your GPU

NOTE: aarch64 has no prebuilt flash-attn wheel, so this step compiles the CUDA extension for your local GPU (~20-30 minutes). Compute capability is auto-detected from nvidia-smi; override with TORCH_CUDA_ARCH_LIST=9.0 (Hopper) / 10.0 (Blackwell) if needed.

NOTE: After this step, do not run plain uv sync --all-extras or uv run, because they will uninstall the package. Use uv sync --inexact or uv run --no-sync instead.

bash scripts/docker-arm64-post-install.sh
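If you want to see which value would be detected before overriding TORCH_CUDA_ARCH_LIST, the sketch below queries nvidia-smi directly. It assumes a driver recent enough to support the `compute_cap` query field. It is not what scripts/docker-arm64-post-install.sh does internally, and both function names are hypothetical.

```python
import subprocess


def arch_list_from_smi_output(output: str) -> str:
    """Turn `nvidia-smi --query-gpu=compute_cap` output into an arch list string."""
    # One line per GPU, e.g. "9.0"; deduplicate and sort numerically.
    caps = sorted(
        {line.strip() for line in output.splitlines() if line.strip()},
        key=float,
    )
    return ";".join(caps)


def detect_arch_list() -> str:
    """Query the local GPUs; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return arch_list_from_smi_output(result.stdout)
```

On a GPU host, `print(detect_arch_list())` prints a string suitable for `export TORCH_CUDA_ARCH_LIST=...`.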

4.2. Optional: Install Flash Attention 3 (Hopper GPUs only, for the flash_attention_3 attention backend)

NOTE: This step takes a while because no prebuilt wheels exist, so the Flash Attention 3 extension is built from source.

NOTE: After this step, do not run plain uv sync --all-extras or uv run, because they will uninstall the package. Use uv sync --inexact or uv run --no-sync instead.

uv pip install "flash-attn-3 @ git+https://github.com/Dao-AILab/flash-attention.git@main#subdirectory=hopper" --no-build-isolation

Validate your environment setup

Check that the environment uses Python 3.12

uv run python -V

Check that flash-attn is installed

uv run python -c "import flash_attn"

Check that you can run SFT trainer (this requires 1 GPU)

uv run sft @ configs/debug/sft/train.toml

Check that you can run the RL trainer (this requires 1 GPU)

uv run trainer @ configs/debug/rl/train.toml

Check that you can run the inference server (this requires 1 GPU)

uv run inference @ configs/debug/infer.toml

Keep the inference server running in the background for the next steps.

5.1. Check that you can run the orchestrator against the inference server

uv run orchestrator @ configs/debug/orch.toml

5.2. Check that you can run evals against the inference server

uv run eval @ configs/debug/eval.toml
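The first two checks above (Python version and flash-attn) can also be bundled into a small preflight helper, sketched below. `preflight` is a hypothetical name, not part of the repository. Note that `importlib.util.find_spec` only confirms the package can be located; unlike the `import flash_attn` command above, it does not load the compiled extension.

```python
import importlib.util
import sys


def preflight(
    required: tuple[int, int] = (3, 12), module: str = "flash_attn"
) -> dict[str, bool]:
    """Report whether the interpreter version and a module look as expected."""
    return {
        # Compare only (major, minor), e.g. (3, 12)
        "python_version_ok": sys.version_info[:2] == required,
        # True if the module can be found on the import path (no import is run)
        "module_found": importlib.util.find_spec(module) is not None,
    }
```

Saved to a file and run with `uv run python`, this gives a quick yes/no summary before launching the GPU checks.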

Additional Setup

If you want to log your runs to W&B, log in

uv run wandb login
# Or set `export WANDB_API_KEY=...`

If you require gated/private models or datasets from Hugging Face, log in

uv run hf auth login
# Or set `export HF_TOKEN=...`
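When credentials are supplied through environment variables rather than the interactive logins, a quick check like the sketch below can catch a missing variable before a long run starts. `missing_credentials` is a hypothetical helper. It only inspects the two variables named above, so an empty result does not prove the tokens are valid, and a non-empty result does not mean an interactive login is absent.

```python
import os
from typing import Mapping


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of credential variables that are unset or empty."""
    return [name for name in ("WANDB_API_KEY", "HF_TOKEN") if not env.get(name)]
```

Calling `missing_credentials()` with no arguments inspects the current process environment.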

Training Examples

We provide end-to-end training examples in the examples directory to highlight features of the framework and guide you through the process of training your own models.

Basic Training: 1 to 8 GPUs

Follow this guide to learn the basics of PRIME-RL. You can train your own models on 1 to 8 GPUs, which makes it ideal for getting started and exploring the capabilities of the framework. These guides cover most use cases (single-turn, multi-turn, tool calling, etc.) on toy environments and small models.

Reverse Text: Train Qwen3-0.6B to reverse a small chunk of text. Demonstrates tiny-scale single-turn SFT and RL training. Can be trained on a single consumer GPU in a few minutes, and is ideal for getting started.
Wordle: Train Qwen3-1.7B to play Wordle. A fun example of multi-turn SFT and RL training. Can be trained on 2-4 H100 GPUs in a few hours. Ideal for exploring the multi-turn training capabilities of the framework.
Alphabet Sort: Train Qwen3-4B-Instruct-2507 to sort names alphabetically. Demonstrates multi-turn RL training via LoRA without SFT warmup. Can be trained on a single H100 GPU in just over an hour. Ideal for exploring LoRA-based training.
Wiki Search: Train Qwen3-4B-Instruct-2507 to answer trivia questions by searching through Wikipedia. Demonstrates multi-turn training with web search tool use.
Hendrycks Sanity: Run a sanity check experiment on DeepSeek-R1-Distill-Qwen-1.5B using a filtered subset of MATH where the model already partially solves 20-80% of problems. Useful for algorithm ablations.

Advanced Training: 32 - 2048 GPUs

Follow this guide to train large models on hard reasoning and agentic / swe environments. These guides are designed to be run from a Slurm cluster but can also be adapted to k8s deployments.

Qwen 3 30B - A3B Math: Train Qwen3-30B-A3B to solve hard math problems.
Qwen 3 30B - A3B SWE: Train Qwen3-30B-A3B to solve hard SWE problems.
Intellect-3.1: Reproduce our INTELLECT-3.1 training run.
MiniMax-M2.5 SWE: Train MiniMax-M2.5 on agentic SWE tasks.
High-throughput GLM-5: Train GLM-5 with PD disaggregation and FP8 inference on SWE.

Docs

Check out the docs directory for in-depth guides on how to use PRIME-RL.

Overview - Architecture, install, and a copy-pasteable end-to-end RL run
Configuration - TOML composition, CLI overrides, env vars, validation
Training - RL, SFT, evals, checkpointing, observability, rules of thumb
Scaling - Single-GPU through multi-node, FSDP/EP/CP, SLURM, benchmarking
Algorithms - Async/off-policy training, the AIPO loss, advantage and filter plugins, trajectory merging
Advanced - Custom modeling, multimodal training, LoRA, multi-tenant training
Development - Test suite, pre-commit hooks, adding a new model

Contributing

We warmly welcome community contributions! We use issues to track bugs, feature requests, and share our internal roadmap. If you encounter bugs, have pain points during development, or have ideas for new features, please open an issue.

Contributions are welcome via PR. Please follow these guidelines:

Install the pre-commit hooks to ensure your code is formatted correctly.
Please keep your PR in "Draft" until it is ready for review.
If your PR resolves an issue, please link the issue in the PR description.
If you can, try running the test suite locally to ensure your changes are working as expected.

Pre-Commit Hooks

Please install the pre-commit hooks to ensure your code is formatted correctly.

uv run pre-commit install

Tests

uv run pytest -v                    # everything
uv run pytest tests/unit -v         # unit only
uv run pytest tests/integration -v  # integration only
uv run pytest -v -m "not gpu"       # CPU-only (inverse of the gpu marker)

License

This project is licensed under the Apache 2.0 license, as found in the License file.

Citation

If you find our work useful, feel free to cite it using

@misc{primeintellect2025prime-rl,
  author = {Prime Intellect},
  title = {PRIME-RL},
  url = {https://github.com/PrimeIntellect-ai/prime-rl},
  year = {2025}
}
