fix: heterogeneous loading and critic conflict handling #60

Merged

LovelyBuggies merged 19 commits into main from hetero on Feb 13, 2026
Conversation

@LovelyBuggies (Member)

No description provided.

@LovelyBuggies merged commit 48b028a into main on Feb 13, 2026 (4 checks passed)
@LovelyBuggies deleted the hetero branch on February 13, 2026 at 22:44
@LovelyBuggies (Member, Author) commented on Feb 14, 2026

## Results at v1.3.6

Since CoMLRL v1.3.6 is primarily based on the feature development in this PR, I present some results obtained at this commit and briefly explain them.

Quick insights:

- VRAM usage and training time, in general: Minecraft > Writing ~= Coding
- VRAM usage by algorithm: MAGRPO (depending on your num_generations; see the sketch after this list) < MAAC ~= IAC-shared < IAC-separate
- Minecraft is demanding on the training devices; the CPU type can affect training
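
On the num_generations point: group-based trainers like MAGRPO sample a group of completions per prompt for each agent, so sampling memory grows with the group size. Below is a minimal sketch of that effect using plain transformers generation; `num_generations` here is a stand-in for the config knob, not CoMLRL's API.

```python
# Minimal sketch: KV-cache and logit memory during sampling scale roughly
# linearly with the number of completions requested per prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # any causal LM; chosen to match the runs below
num_generations = 4             # stand-in for the MAGRPO group size knob

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("TL;DR the following post:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=num_generations,  # one prompt, a group of G completions
    max_new_tokens=64,
)
print(out.shape)  # (num_generations, prompt_len + up to 64 new tokens)
```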

### Writing Collaboration

Since the reward designs for arXiv expansion and TL;DR summarization are quite similar (differing only in hyperparameters), I primarily use tldr for testing.

```bash
cd LLM_Collab_Writing
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.5_1.7'
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='["Qwen/Qwen3-1.7B","Qwen/Qwen3-1.7B"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.7_1.7'
```

MAGRPO takes about 17 hours to train on a single H100 with about 45 GB VRAM usage.
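
For context on the agents=[...] override: heterogeneous loading means each agent gets its own model and tokenizer, so the two policies can come from different families and sizes. A minimal sketch of that loading pattern, assuming standard transformers APIs rather than CoMLRL's actual loader:

```python
# Minimal sketch: heterogeneous loading means one (model, tokenizer) pair per
# agent rather than a single shared checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

agents = ["Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen3-1.7B"]  # per-agent checkpoints

models = [AutoModelForCausalLM.from_pretrained(name) for name in agents]
tokenizers = [AutoTokenizer.from_pretrained(name) for name in agents]

# Each agent's prompts must be encoded/decoded with its own tokenizer;
# vocabularies of different model families generally do not match.
for name, tok in zip(agents, tokenizers):
    print(name, tok.vocab_size)
```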

```bash
cd LLM_Collab_Writing
python train_maac.py --config configs/maac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.5_1.7'
python train_maac.py --config configs/maac_tldr_config.yaml --override agents=None agent_model.name="Qwen/Qwen3-1.7B" critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.7_1.7'
```

MAAC takes about 34 hours to train on a single H100 with about 71 GB VRAM usage.
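
For readers unfamiliar with the critic_model.name knob: MAAC trains one centralized critic alongside the agents. A common way to build such a critic is an LM backbone with a scalar value head over the joint context; the sketch below is that generic construction, an assumption rather than CoMLRL's actual critic:

```python
# Generic centralized critic: LM backbone plus a scalar value head scoring the
# joint context (both agents' outputs concatenated). An assumed construction.
import torch
from transformers import AutoModel, AutoTokenizer

class SharedCritic(torch.nn.Module):
    def __init__(self, name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.value_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Value of the joint state, read from the last token's hidden state.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
critic = SharedCritic("Qwen/Qwen3-1.7B")
batch = tok(["<agent 1 summary> ... <agent 2 summary> ..."], return_tensors="pt")
print(critic(**batch))  # one scalar value per joint sample
```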

```bash
cd LLM_Collab_Writing
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critics='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_tldr_1.5_1.7'
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]' agent_model.name=None critics=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_tldr_1.5_1.7_shared'
```

IAC takes about 41-48 hours to train on a single H100 with about 80 GB (separate) or 48 GB (shared) VRAM usage.
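
The separate/shared distinction explains the VRAM gap: with iac.use_separate_critic=true each agent gets its own critic (named by the critics=[...] list), while false shares a single critic module. A sketch of the selection logic, with semantics assumed from the override flags above:

```python
# Assumed semantics of iac.use_separate_critic: separate -> one critic module
# per agent (more VRAM); shared -> both agents reference the same module, so
# its weights are counted once.
from typing import List, Optional
import torch
from transformers import AutoModel

class Critic(torch.nn.Module):
    """LM backbone with a scalar value head, as in the MAAC sketch above."""
    def __init__(self, name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.value_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(h[:, -1, :]).squeeze(-1)

def build_critics(use_separate_critic: bool,
                  critics: Optional[List[str]] = None,
                  critic_name: Optional[str] = None) -> List[Critic]:
    if use_separate_critic:
        # critics=["...", "..."] names one critic checkpoint per agent.
        return [Critic(name) for name in critics]
    # Shared: a single instance referenced by every agent.
    shared = Critic(critic_name or critics[0])
    return [shared, shared]
```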

### Code Generation

```bash
cd LLM_Collab_Code_Generation
python train_magrpo.py --config configs/magrpo_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None wandb.project=hetero wandb.name='magrpo_che_3b_4b'
```

MAGRPO takes about 10 hours to train on a single H100 with about 89 GB VRAM usage.

```bash
cd LLM_Collab_Code_Generation
python train_maac.py --config configs/maac_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name="Qwen/Qwen2.5-Coder-3B" wandb.project=hetero wandb.name='maac_che_3b_4b'
```

MAAC takes about 8 hours to train on a single H200 with about 118 GB VRAM usage.

```bash
cd LLM_Collab_Code_Generation
python train_iac.py --config configs/iac_che_config.yaml --override agents='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critics='["Qwen/Qwen2.5-Coder-3B","Qwen/Qwen3-4B-Instruct-2507"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_che_3b_4b'
```

IAC takes about 8 hours to train on a single H200 with about 140 GB (separate) or 74 GB (shared) VRAM usage.

### Minecraft

For Minecraft, I select house building as a representative task for testing the new interface.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_magrpo.py --config house_build/configs/house_build_magrpo_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None wandb.project=hetero-mc wandb.name='magrpo_house_3B_4B'
```

MAGRPO takes about 8 hours to train on a single H200 with about 108 GB VRAM usage.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_maac.py --config house_build/configs/house_build_maac_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name="Qwen/Qwen3-4B-Instruct-2507" wandb.project=hetero wandb.name='maac_house_3B_4B'
```

MAAC takes about 10 hours to train on a single H200 with about 138 GB VRAM usage.
P.S. This experiment is very sensitive to the device: I used an H200 NVL, an AMD EPYC 9655 96-core CPU, and CUDA 12.9.

```bash
cd LLM_Collab_Minecraft
python house_build/train/train_iac.py --config house_build/configs/house_build_iac_config.yaml --override agents='["Qwen/Qwen2.5-3B-Instruct","Qwen/Qwen3-4B-Instruct-2507"]' agent_model.name=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_house_3B_4B_shared'
```

IAC-shared (I use the shared-critic variant for both model pairs here to save VRAM) takes about 9 hours to train on a single H200 with about 102 GB (3B+4B) or 114 GB (4Bx2) VRAM usage.
