fix: heterogeneous loading and critic conflict handling #60

Merged

LovelyBuggies merged 19 commits into main from hetero on Feb 13, 2026
Conversation

@LovelyBuggies (Member)

No description provided.

@LovelyBuggies merged commit 48b028a into main on Feb 13, 2026 (4 checks passed)
@LovelyBuggies deleted the hetero branch on February 13, 2026 at 22:44
@LovelyBuggies (Member, Author) commented on Feb 14, 2026

## Results at v1.3.6

Since CoMLRL v1.3.6 is primarily based on the feature development in this PR, I present some results obtained at this commit and briefly explain them.

Quick insights:

- VRAM usage and training time, in general: Minecraft > Writing ~= Coding
- VRAM usage by algorithm: MAGRPO (depending on your num_generations; see the sketch after this list) < MAAC ~= IAC-shared < IAC-separate
- Minecraft is demanding on the training devices; the CPU type can affect training
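
On the num_generations point: group-based trainers like MAGRPO sample a group of completions per prompt for each agent, so sampling memory grows with the group size. Below is a minimal sketch of that effect using plain transformers generation; `num_generations` here is a stand-in for the config knob, not CoMLRL's API.

```python
# Minimal sketch: KV-cache and logit memory during sampling scale roughly
# linearly with the number of completions requested per prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # any causal LM; chosen to match the runs below
num_generations = 4             # stand-in for the MAGRPO group size knob

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("TL;DR the following post:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=num_generations,  # one prompt, a group of G completions
    max_new_tokens=64,
)
print(out.shape)  # (num_generations, prompt_len + up to 64 new tokens)
```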

### Writing Collaboration

Since the reward designs for arXiv expansion and TL;DR summarization are quite similar (differing only in hyperparameters), I primarily use tldr for testing.

```bash
cd LLM_Collab_Writing
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.5_1.7'
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='["Qwen/Qwen3-1.7B","Qwen/Qwen3-1.7B"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.7_1.7'
```

MAGRPO takes about 17 hours to train on a single H100 with about 45 GB VRAM usage.
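
For context on the agents=[...] override: heterogeneous loading means each agent gets its own model and tokenizer, so the two policies can come from different families and sizes. A minimal sketch of that loading pattern, assuming standard transformers APIs rather than CoMLRL's actual loader:

```python
# Minimal sketch: heterogeneous loading means one (model, tokenizer) pair per
# agent rather than a single shared checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

agents = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen3-1.7B"]  # per-agent checkpoints

models = [AutoModelForCausalLM.from_pretrained(name) for name in agents]
tokenizers = [AutoTokenizer.from_pretrained(name) for name in agents]

# Each agent's prompts must be encoded/decoded with its own tokenizer;
# vocabularies of different model families generally do not match.
for name, tok in zip(agents, tokenizers):
    print(name, tok.vocab_size)
```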

```bash
cd LLM_Collab_Writing
python train_maac.py --config configs/maac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.5_1.7'
python train_maac.py --config configs/maac_tldr_config.yaml --override agents=None agent_model.name="Qwen/Qwen3-1.7B" critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.7_1.7'
```

MAAC takes about 34 hours to train on a single H100 with about 71 GB VRAM usage.
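
For readers unfamiliar with the critic_model.name knob: MAAC trains one centralized critic alongside the agents. A common way to build such a critic is an LM backbone with a scalar value head over the joint context; the sketch below is that generic construction, an assumption rather than CoMLRL's actual critic:

```python
# Generic centralized critic: LM backbone plus a scalar value head scoring the
# joint context (both agents' outputs concatenated). An assumed construction.
import torch
from transformers import AutoModel, AutoTokenizer

class SharedCritic(torch.nn.Module):
    def __init__(self, name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.value_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Value of the joint state, read from the last token's hidden state.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
critic = SharedCritic("Qwen/Qwen3-1.7B")
batch = tok(["<agent 1 summary> ... <agent 2 summary> ..."], return_tensors="pt")
print(critic(**batch))  # one scalar value per joint sample
```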

```bash
cd LLM_Collab_Writing
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critics='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_tldr_1.5_1.7'
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critics=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_tldr_1.5_1.7_shared'
```

IAC takes about 41-48 hours to train on a single H100 with about 80 GB (separate) or 48 GB (shared) VRAM usage.
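
The separate/shared distinction explains the VRAM gap: with iac.use_separate_critic=true each agent gets its own critic (named by the critics=[...] list), while false shares a single critic module. A sketch of the selection logic, with semantics assumed from the override flags above:

```python
# Assumed semantics of iac.use_separate_critic: separate -> one critic module
# per agent (more VRAM); shared -> both agents reference the same module, so
# its weights are counted once.
from typing import List, Optional
import torch
from transformers import AutoModel

class Critic(torch.nn.Module):
    """LM backbone with a scalar value head, as in the MAAC sketch above."""
    def __init__(self, name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.value_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(h[:, -1, :]).squeeze(-1)

def build_critics(use_separate_critic: bool,
                  critics: Optional[List[str]] = None,
                  critic_name: Optional[str] = None) -> List[Critic]:
    if use_separate_critic:
        # critics=["...", "..."] names one critic checkpoint per agent.
        return [Critic(name) for name in critics]
    # Shared: a single instance referenced by every agent.
    shared = Critic(critic_name or critics[0])
    return [shared, shared]
```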

### Code Generation

```bash
cd LLM_Collab_Code_Generation
python train_magrpo.py --config configs/magrpo_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None wandb.project=hetero wandb.name='magrpo_che_3b_4b'
```

MAGRPO takes about 10 hours to train on a single H100 with about 89 GB VRAM usage.

```bash
cd LLM_Collab_Code_Generation
python train_maac.py --config configs/maac_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name="Qwen/Qwen2.5-Coder-3B" wandb.project=hetero wandb.name='maac_che_3b_4b'
```

MAAC takes about 8 hours to train on a single H200 with about 118 GB VRAM usage.

```bash
cd LLM_Collab_Code_Generation
python train_iac.py --config configs/iac_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critics='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_che_3b_4b'
```

IAC takes about 8 hours to train on a single H200 with about 140 GB (separate) or 74 GB (shared) VRAM usage.

### Minecraft

For Minecraft, I select house building as a representative task for testing the new interface.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_magrpo.py --config house_build/configs/house_build_magrpo_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None wandb.project=hetero-mc wandb.name='magrpo_house_3B_4B'
```

MAGRPO takes about 8 hours to train on a single H200 with about 108 GB VRAM usage.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_maac.py --config house_build/configs/house_build_maac_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name="Qwen/Qwen3-4B-Instruct-2507" wandb.project=hetero wandb.name='maac_house_3B_4B'
```

MAAC takes about 10 hours to train on a single H200 with about 138 GB VRAM usage.
P.S. This experiment is very sensitive to the device: I used an H200 NVL, an AMD EPYC 9655 96-core CPU, and CUDA 12.9.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_iac.py --config house_build/configs/house_build_iac_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_house_3B_4B_shared'
```

IAC-shared (I use the shared-critic variant for both model pairs here to save VRAM) takes about 9 hours to train on a single H200 with about 102 GB (3B+4B) or 114 GB (4Bx2) VRAM usage.
