24 changes: 11 additions & 13 deletions README.md
@@ -52,7 +52,7 @@ python LLM_Collaboration_with_MARL/train_grpo.py \
# Multi-turn override example
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
--override dataset.train_split='test[:20]' dataset.eval_split='test[20:30]' \
--override dataset.train_split='test[16:]' dataset.eval_split='test[:16]' \
magrpo.num_turns=2 magrpo.turn_gradient_weights=[1.5,0.5]
```
### Legacy Command-Line Args
@@ -84,42 +84,40 @@ python LLM_Collaboration_with_MARL/train_magrpo.py \

### External Modes

Multi-turn training supports external transition modes for the 2nd and later turns, set via `magrpo.external_mode`:
Multi-turn training supports external transition modes for the 2nd and later turns, set via `external.mode`:

- `expert_edits` **(default)**: Uses an expert LLM to suggest edits.
- Requires `magrpo.expert_model` in config (e.g., `deepseek-coder`, Claude, etc.).
- Requires corresponding API keys in env vars.
- `level_feedback` **(default)**: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
- Requires `external.expert_model` in the config when using `expert_edits` (e.g., `deepseek-coder`, Claude, etc.). This parameter is ignored by the other modes (`level_feedback`, `level_passed`, `passed`, `plain`).
- Requires corresponding API keys in env vars.
- `level_passed`: Binary passed signals (impl found, syntax, tests summary, aux usage).
- `level_feedback`: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
- `passed`: A binary signal — "All levels passed" or "Not all levels passed".
- `plain`: No signals or diagnostics.

```bash
# HumanEval with detailed feedback signals
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
--override magrpo.external_mode='level_feedback'
--override external.mode='level_feedback'
```
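
The exact signal construction lives in the training code; the sketch below is only a rough illustration of what each mode might contribute to the next-turn prompt. The function name and the `analysis` fields are hypothetical, not the repo's actual API.

```python
from typing import Optional

def build_turn_feedback(mode: str, analysis: dict, expert_suggestion: Optional[str]) -> str:
    """Illustrative only: map an external mode to the text appended for the next turn."""
    if mode == "expert_edits":
        # Edits suggested by the expert LLM configured via external.expert_model.
        return expert_suggestion or ""
    if mode == "level_feedback":
        # Detailed diagnostics: impl found, syntax (line/col), per-test errors, aux usage.
        return "\n".join(analysis["diagnostics"])
    if mode == "level_passed":
        # One binary signal per level (impl found, syntax, tests summary, aux usage).
        return "\n".join(f"{level}: {'passed' if ok else 'failed'}"
                         for level, ok in analysis["levels"].items())
    if mode == "passed":
        return "All levels passed" if analysis["all_passed"] else "Not all levels passed"
    # "plain": nothing is appended.
    return ""
```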

### Sandbox Tests

The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox tests are the same as the dataset’s eval tests.**
Note: `magrpo.sandbox_slice` only affects analysis-based modes (`level_feedback`, `level_passed`, `passed`), and it has no effect on `expert_edits`.
The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox executes only the first assert (`sandbox_slice=1`).** To run all eval tests, set `external.sandbox_slice` to `0`, `None`, or `'all'`; a negative value runs the last N asserts. Note: `external.sandbox_slice` only affects the analysis-based modes (`level_feedback`, `level_passed`, `passed`) and has no effect on `expert_edits`.

```bash
# Add a magrpo.sandbox_slice to override
# Add an external.sandbox_slice override
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
--override magrpo.external_mode='level_feedback' magrpo.sandbox_slice=-2
--override external.mode='level_feedback' external.sandbox_slice=-2
```
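
A minimal sketch of the `sandbox_slice` semantics described above (the helper name is hypothetical and does not mirror the repo's actual resolver):

```python
from typing import List, Union

def select_asserts(asserts: List[str], sandbox_slice: Union[int, str, None] = 1) -> List[str]:
    """Illustrative only: pick which eval asserts the sandbox runs."""
    if sandbox_slice in (0, None, "all"):
        return asserts                    # run every eval assert
    if isinstance(sandbox_slice, int) and sandbox_slice < 0:
        return asserts[sandbox_slice:]    # last N asserts
    return asserts[: int(sandbox_slice)]  # first N asserts (default: 1)
```

With the default `sandbox_slice: 1` from the configs in this PR, only the first assert would run; `sandbox_slice=-2` as in the override above would run the last two.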

### Handoff Strategy

In MAGRPO, since agents generate several responses per turn, we hand off only one for efficiency; otherwise the number of generations per turn would grow exponentially. External handoff controls which previous response is used as context for the later turns. **By default, the "best" completion per agent from the prior turn is used.** Random handoff requires the training loop to supply a candidate pool of previous-turn completions per agent to the external transition. If only a single completion per agent is available, random falls back to the best completion.
In multi-turn MAGRPO/GRPO training, one prior completion per agent is handed off to keep compute bounded. The trainer selects that completion according to the `handoff` mode: **`random` (default)** or `best`. Selection happens in the CoMLRL trainer; the external modes simply format the next-turn prompts with the completions they are given. Configure it via `magrpo.handoff` or `grpo.handoff` in the config or with `--override`.


```bash
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
--override magrpo.external_mode='plain' magrpo.external_handoff='random'
--override external.mode='plain' magrpo.handoff='best'
```
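
A minimal sketch of the handoff selection under these assumptions (the function below is hypothetical and does not mirror the CoMLRL trainer's actual API):

```python
import random
from typing import List

def pick_handoff(completions: List[str], rewards: List[float], handoff: str = "random") -> str:
    """Illustrative only: choose one prior-turn completion per agent to carry forward."""
    if handoff == "best":
        # Hand off the highest-reward completion from the previous turn.
        best_idx = max(range(len(rewards)), key=rewards.__getitem__)
        return completions[best_idx]
    # Default: sample one completion uniformly at random.
    return random.choice(completions)
```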
39 changes: 22 additions & 17 deletions configs/grpo_che_config.yaml
@@ -1,6 +1,4 @@
# Configuration for CoopHumanEval single-agent training with GRPO

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -13,21 +11,28 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "CoMLRL/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/tchen19/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration
# grpo
grpo:
num_train_epochs: 20 # Same as multi-agent CHE
num_train_epochs: 16
per_device_train_batch_size: 1
learning_rate: 1.0e-5
logging_steps: 50
@@ -36,13 +41,13 @@ grpo:
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Early termination threshold for single-agent (GRPO)
handoff: random
early_termination_threshold: 2.1

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "grpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "grpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
tags: ["grpo", "coophumaneval", "single-agent"]
38 changes: 21 additions & 17 deletions configs/grpo_he_config.yaml
@@ -1,7 +1,4 @@
# Configuration for HumanEval single-agent training with GRPO
# Based on train_he_single_agent.py parameters

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,36 +11,43 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "openai/openai_humaneval"
type: "humaneval" # Used to select formatters and reward function
train_split: "test[33:133]"
type: "humaneval"
train_split: "test[33:163]"
eval_split: "test[:32]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/tchen19/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration
# grpo
grpo:
num_train_epochs: 10
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 1.0e-5
logging_steps: 50
save_steps: 200
num_generations: 4 # Number of completions to generate per prompt
num_generations: 4
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Early termination threshold for single-agent (GRPO)
handoff: random
early_termination_threshold: 2.1

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "grpo_humaneval" # Will be appended with model name in script
dir: "../../../projects/bepg/sliu30"
name: "grpo_humaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
tags: ["grpo", "humaneval", "single-agent"]
40 changes: 22 additions & 18 deletions configs/magrpo_che_config.yaml
@@ -1,7 +1,4 @@
# Configuration for CoopHumanEval training with MAGRPO
# Exact parameters from train_che.py

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,21 +11,28 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "LovelyBuggies/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/sliu30/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# MAGRPO training configuration
# magrpo
magrpo:
num_train_epochs: 20 # Exact value from train_che.py
num_train_epochs: 16
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
@@ -38,13 +42,13 @@ magrpo:
temperature: 0.8
top_p: 0.95
num_agents: 2
# Early termination threshold for multi-agent (MAGRPO)
handoff: random
early_termination_threshold: 4.0

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "magrpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "magrpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
tags: ["magrpo", "coophumaneval", "multi-agent"]
38 changes: 21 additions & 17 deletions configs/magrpo_he_config.yaml
@@ -1,9 +1,6 @@
# Configuration for HumanEval training with MAGRPO
# This file defines all parameters for training experiments

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B" # Options: "Qwen/Qwen2.5-Coder-3B", "bigcode/starcoder2-3b", etc.
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
temperature: 0.7
top_p: 0.9
@@ -14,35 +11,42 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "openai/openai_humaneval"
type: "humaneval" # Used to select formatters and reward function
train_split: "test[33:133]"
type: "humaneval"
train_split: "test[33:163]"
eval_split: "test[:32]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/sliu30/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# MAGRPO training configuration
# magrpo
magrpo:
num_train_epochs: 10
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
save_steps: 200
num_generations: 4
max_new_tokens: 256
num_agents: 2
# Early termination threshold for multi-agent (MAGRPO)
handoff: random
early_termination_threshold: 4.0

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "magrpo_humaneval" # Will be appended with model name in script
dir: "../../../projects/bepg/sliu30"
name: "magrpo_humaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
tags: ["magrpo", "humaneval", "multi-agent"]
42 changes: 22 additions & 20 deletions configs/mt_grpo_che_config.yaml
@@ -1,7 +1,4 @@
# Configuration for Multi-Turn CoopHumanEval training with GRPO (single-agent)
# Based on mt_magrpo_che_config.yaml parameters but adapted for single-agent

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,22 +11,29 @@ model:
trust_remote_code: true
torch_dtype: "bfloat16"

# Dataset configuration
# dataset
dataset:
name: "LovelyBuggies/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bevi/sliu30/output_mt"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration (multi-turn enabled via num_turns)
# grpo
grpo:
num_turns: 2
num_train_epochs: 10 # Reduced from 20 for multi-turn
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
@@ -38,17 +42,15 @@ grpo:
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Multi-turn specific parameters
handoff: random
turn_gradient_weights: [1.2, 0.8]
early_termination_weight: 2.0
early_termination_threshold: 2.1
external_mode: "expert_edits" # Options: expert_edits (default), level_passed, level_feedback, passed, plain
expert_model: "deepseek-coder" # Used by expert_edits mode only

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "mt_grpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "mt_grpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
tags: ["mt_grpo", "coophumaneval", "single-agent", "multi-turn"]