Cooperative Multi-LLM Reinforcement Learning (CoMLRL) is an open-source library for training multiple LLMs to collaborate using Multi-Agent Reinforcement Learning (MARL). It provides implementations of MARL algorithms for LLM collaboration, along with environments and benchmarks for training and evaluation.
Install the latest release from PyPI:

```bash
pip install comlrl
# install PyTorch compatible with your device
```

Or install via conda:

```bash
conda install -c conda-forge comlrl
# install PyTorch compatible with your device
```

To access the latest features, you can install CoMLRL from source:

```bash
git clone https://github.com/OpenMLRL/CoMLRL.git
cd CoMLRL
pip install -e .
# install PyTorch compatible with your device
```
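After installing, a quick sanity check using only the Python standard library confirms the package is visible; no CoMLRL API is assumed here:

```python
# Sanity check: confirm the comlrl package is installed and print its version.
from importlib.metadata import version

print(version("comlrl"))
```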
- MARL trainers to optimize LLM collaboration (see the configuration sketch after this list):
  - Multi-Agent REINFORCE: Critic-free policy gradient methods, including MAREINFORCE, MAGRPO, MARLOO, and MAREMAX.
    - Aligned individual-response joint with `joint_mode='align'`.
    - Memory-efficient cross joint with `joint_mode='cross'`.
  - Multi-Agent PPO: Critic-based policy gradient methods, including IPPO.
    - Canonical IPPO with a separate critic with `use_separate_critic=True`.
    - Memory-efficient critic with a value head over the actor with `use_separate_critic=False`.
- Environments that simulate real-world tasks for training and evaluating LLM collaboration:
  - Writing Collaboration: Multiple LLM agents collaborate on processing articles.
  - Code Generation: Generate code solutions for programming problems.
    - MBPP - Mostly Basic Python Problems.
    - HumanEval - Handwritten evaluation problems.
    - CoopHumanEval - A variant of HumanEval with a cooperative nature.
  - Code Completion: Complete code snippets based on given contexts.
    - ClassEval - Complete class-level code based on method stubs and docstrings.
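The trainer variants above are selected through configuration flags. Below is a minimal sketch of how the `joint_mode` and `use_separate_critic` options might be passed; treating them as config-object keyword arguments is an assumption, so consult the library documentation for the authoritative placement:

```python
# Minimal sketch, assuming `joint_mode` is a keyword argument of
# MAGRPOConfig (placement is an assumption, not confirmed here).
from comlrl.trainers.magrpo import MAGRPOConfig

magrpo_config = MAGRPOConfig(
    per_device_train_batch_size=1,
    joint_mode="cross",  # or "align" for the aligned individual-response joint
)

# For IPPO-style trainers, `use_separate_critic=True` selects the canonical
# separate critic and `False` the value-head-over-actor variant (the exact
# config class for IPPO is not shown in this README).
```

The code-generation benchmarks listed above are also available on the Hugging Face Hub, so the underlying data can be fetched directly. The dataset ids below are the standard Hub ids; CoopHumanEval's id is not shown here, and how CoMLRL wraps these datasets into environments is covered by its documentation:

```python
from datasets import load_dataset

mbpp = load_dataset("mbpp", split="test")                   # Mostly Basic Python Problems
humaneval = load_dataset("openai_humaneval", split="test")  # HumanEval
```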
Quick start: train two Qwen2.5 agents to summarize Reddit posts with MAGRPO:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from comlrl.trainers.magrpo import MAGRPOConfig, MAGRPOTrainer
# Load dataset and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
dataset = load_dataset("trl-lib/tldr", split="train").select(range(128))
# Initialize trainer and start training
trainer = MAGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    num_agents=2,
    tokenizer=tokenizer,
    train_dataset=dataset,
    # Toy reward: deviation of the agents' response-length ratio from 3.0
    reward_func=lambda a, b: [abs(max(len(b[0]), 1) / max(len(a[0]), 1) - 3.0)],
    formatters=[lambda example: example["prompt"]] * 2,
    args=MAGRPOConfig(
        per_device_train_batch_size=1,
    ),
)
trainer.train()
```
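The lambda above is a toy reward. A named function with the same call signature, inferred from the example rather than from library docs (one batch of responses per agent in, a list of scalar rewards out), may be easier to extend; the sign is flipped here so that smaller deviation yields higher reward:

```python
# Sketch of a custom reward; the signature is inferred from the quick-start
# lambda above (one list of responses per agent, list of rewards returned).
def length_ratio_reward(agent1_responses, agent2_responses):
    """Penalize deviation of the response-length ratio from 3:1."""
    rewards = []
    for r1, r2 in zip(agent1_responses, agent2_responses):
        ratio = max(len(r2), 1) / max(len(r1), 1)
        rewards.append(-abs(ratio - 3.0))  # 0 is best; more negative is worse
    return rewards
```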
We welcome contributions from the community! Please see the contributing guidelines for setting up a development environment and submitting contributions.

Thanks to the gracious help of contributors:

Shuo Liu | Tianle Chen | Ryan Amiri | Zeyu Liang
Please cite our paper if you find this library useful in your research:
```bibtex
@misc{liu2025comlrl,
  title={LLM Collaboration With Multi-Agent Reinforcement Learning},
  author={Shuo Liu and Tianle Chen and Zeyu Liang and Xueguang Lyu and Christopher Amato},
  year={2025},
  eprint={2508.04652},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.04652},
}
```
