24 changes: 11 additions & 13 deletions README.md
@@ -52,7 +52,7 @@ python LLM_Collaboration_with_MARL/train_grpo.py \
# Multi-turn override example
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
--override dataset.train_split='test[:20]' dataset.eval_split='test[20:30]' \
--override dataset.train_split='test[16:]' dataset.eval_split='test[:16]' \
magrpo.num_turns=2 magrpo.turn_gradient_weights=[1.5,0.5]
```
### Legacy Command-Line Args
@@ -84,42 +84,40 @@ python LLM_Collaboration_with_MARL/train_magrpo.py \

### External Modes

Multi-turn training supports external transition modes for the 2nd and later turns, set via `magrpo.external_mode`:
Multi-turn training supports external transition modes for the 2nd and later turns, set via `external.mode`:

- `expert_edits` **(default)**: Uses an expert LLM to suggest edits.
- Requires `magrpo.expert_model` in config (e.g., `deepseek-coder`, Claude, etc.).
- Requires corresponding API keys in env vars.
- `level_feedback` **(default)**: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
- Requires `external.expert_model` in the config when using `expert_edits` (e.g., `deepseek-coder`, Claude, etc.). This parameter is ignored by the other modes (`level_feedback`, `level_passed`, `passed`, `plain`).
- Requires corresponding API keys in env vars.
- `level_passed`: Binary passed signals (impl found, syntax, tests summary, aux usage).
- `level_feedback`: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage).
- `passed`: A binary signal — "All levels passed" or "Not all levels passed".
- `plain`: No signals or diagnostics.

```bash
# HumanEval with detailed feedback signals
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
--override magrpo.external_mode='level_feedback'
--override external.mode='level_feedback'
```
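
The exact signal construction lives in the training code; the sketch below is only a rough illustration of what each mode might contribute to the next-turn prompt. The function name and the `analysis` fields are hypothetical, not the repo's actual API.

```python
from typing import Optional

def build_turn_feedback(mode: str, analysis: dict, expert_suggestion: Optional[str]) -> str:
    """Illustrative only: map an external mode to the text appended for the next turn."""
    if mode == "expert_edits":
        # Edits suggested by the expert LLM configured via external.expert_model.
        return expert_suggestion or ""
    if mode == "level_feedback":
        # Detailed diagnostics: impl found, syntax (line/col), per-test errors, aux usage.
        return "\n".join(analysis["diagnostics"])
    if mode == "level_passed":
        # One binary signal per level (impl found, syntax, tests summary, aux usage).
        return "\n".join(f"{level}: {'passed' if ok else 'failed'}"
                         for level, ok in analysis["levels"].items())
    if mode == "passed":
        return "All levels passed" if analysis["all_passed"] else "Not all levels passed"
    # "plain": nothing is appended.
    return ""
```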

### Sandbox Tests

The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox tests are the same as the dataset’s eval tests.**
Note: `magrpo.sandbox_slice` only affects analysis-based modes (`level_feedback`, `level_passed`, `passed`), and it has no effect on `expert_edits`.
The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, the sandbox executes only the first assert (`sandbox_slice=1`).** To run all eval tests, set `external.sandbox_slice` to `0`, `None`, or `'all'`; a negative value runs the last N asserts. Note: `external.sandbox_slice` only affects the analysis-based modes (`level_feedback`, `level_passed`, `passed`) and has no effect on `expert_edits`.

```bash
# Add a magrpo.sandbox_slice to override
# Add an external.sandbox_slice override
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \
--override magrpo.external_mode='level_feedback' magrpo.sandbox_slice=-2
--override external.mode='level_feedback' external.sandbox_slice=-2
```
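
A minimal sketch of the `sandbox_slice` semantics described above (the helper name is hypothetical and does not mirror the repo's actual resolver):

```python
from typing import List, Union

def select_asserts(asserts: List[str], sandbox_slice: Union[int, str, None] = 1) -> List[str]:
    """Illustrative only: pick which eval asserts the sandbox runs."""
    if sandbox_slice in (0, None, "all"):
        return asserts                    # run every eval assert
    if isinstance(sandbox_slice, int) and sandbox_slice < 0:
        return asserts[sandbox_slice:]    # last N asserts
    return asserts[: int(sandbox_slice)]  # first N asserts (default: 1)
```

With the default `sandbox_slice: 1` from the configs in this PR, only the first assert would run; `sandbox_slice=-2` as in the override above would run the last two.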

### Handoff Strategy

In MAGRPO, since agents generate several responses per turn, we hand off only one for efficiency; otherwise the number of generations per turn would grow exponentially. External handoff controls which previous response is used as context for the later turns. **By default, the "best" completion per agent from the prior turn is used.** Random handoff requires the training loop to supply a candidate pool of previous-turn completions per agent to the external transition. If only a single completion per agent is available, random falls back to the best completion.
In multi-turn MAGRPO/GRPO training, one prior completion per agent is handed off to keep compute bounded. The trainer selects that completion according to the `handoff` mode: **`random` (default)** or `best`. Selection happens in the CoMLRL trainer; the external modes simply format the next-turn prompts with the completions they are given. Configure it via `magrpo.handoff` or `grpo.handoff` in the config or with `--override`.


```bash
python LLM_Collaboration_with_MARL/train_magrpo.py \
--config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \
--override magrpo.external_mode='plain' magrpo.external_handoff='random'
--override external.mode='plain' magrpo.handoff='best'
```
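
A minimal sketch of the handoff selection under these assumptions (the function below is hypothetical and does not mirror the CoMLRL trainer's actual API):

```python
import random
from typing import List

def pick_handoff(completions: List[str], rewards: List[float], handoff: str = "random") -> str:
    """Illustrative only: choose one prior-turn completion per agent to carry forward."""
    if handoff == "best":
        # Hand off the highest-reward completion from the previous turn.
        best_idx = max(range(len(rewards)), key=rewards.__getitem__)
        return completions[best_idx]
    # Default: sample one completion uniformly at random.
    return random.choice(completions)
```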
39 changes: 22 additions & 17 deletions configs/grpo_che_config.yaml
@@ -1,6 +1,4 @@
# Configuration for CoopHumanEval single-agent training with GRPO

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -13,21 +11,28 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "CoMLRL/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/tchen19/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration
# grpo
grpo:
num_train_epochs: 20 # Same as multi-agent CHE
num_train_epochs: 16
per_device_train_batch_size: 1
learning_rate: 1.0e-5
logging_steps: 50
@@ -36,13 +41,13 @@ grpo:
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Early termination threshold for single-agent (GRPO)
handoff: random
early_termination_threshold: 2.1

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "grpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "grpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
tags: ["grpo", "coophumaneval", "single-agent"]
38 changes: 21 additions & 17 deletions configs/grpo_he_config.yaml
@@ -1,7 +1,4 @@
# Configuration for HumanEval single-agent training with GRPO
# Based on train_he_single_agent.py parameters

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,36 +11,43 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "openai/openai_humaneval"
type: "humaneval" # Used to select formatters and reward function
train_split: "test[33:133]"
type: "humaneval"
train_split: "test[33:163]"
eval_split: "test[:32]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/tchen19/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration
# grpo
grpo:
num_train_epochs: 10
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 1.0e-5
logging_steps: 50
save_steps: 200
num_generations: 4 # Number of completions to generate per prompt
num_generations: 4
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Early termination threshold for single-agent (GRPO)
handoff: random
early_termination_threshold: 2.1

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "grpo_humaneval" # Will be appended with model name in script
dir: "../../../projects/bepg/sliu30"
name: "grpo_humaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_grpo"
tags: ["grpo", "humaneval", "single-agent"]
40 changes: 22 additions & 18 deletions configs/magrpo_che_config.yaml
@@ -1,7 +1,4 @@
# Configuration for CoopHumanEval training with MAGRPO
# Exact parameters from train_che.py

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,21 +11,28 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "LovelyBuggies/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/sliu30/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# MAGRPO training configuration
# magrpo
magrpo:
num_train_epochs: 20 # Exact value from train_che.py
num_train_epochs: 16
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
@@ -38,13 +42,13 @@ magrpo:
temperature: 0.8
top_p: 0.95
num_agents: 2
# Early termination threshold for multi-agent (MAGRPO)
handoff: random
early_termination_threshold: 4.0

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "magrpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "magrpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
tags: ["magrpo", "coophumaneval", "multi-agent"]
38 changes: 21 additions & 17 deletions configs/magrpo_he_config.yaml
@@ -1,9 +1,6 @@
# Configuration for HumanEval training with MAGRPO
# This file defines all parameters for training experiments

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B" # Options: "Qwen/Qwen2.5-Coder-3B", "bigcode/starcoder2-3b", etc.
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
temperature: 0.7
top_p: 0.9
@@ -14,35 +11,42 @@ model:
trust_remote_code: true
torch_dtype: "auto"

# Dataset configuration
# dataset
dataset:
name: "openai/openai_humaneval"
type: "humaneval" # Used to select formatters and reward function
train_split: "test[33:133]"
type: "humaneval"
train_split: "test[33:163]"
eval_split: "test[:32]"

# Output configuration
# output
output:
base_dir: "../../../projects/bepg/sliu30/output"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# MAGRPO training configuration
# magrpo
magrpo:
num_train_epochs: 10
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
save_steps: 200
num_generations: 4
max_new_tokens: 256
num_agents: 2
# Early termination threshold for multi-agent (MAGRPO)
handoff: random
early_termination_threshold: 4.0

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "magrpo_humaneval" # Will be appended with model name in script
dir: "../../../projects/bepg/sliu30"
name: "magrpo_humaneval"
dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo"
tags: ["magrpo", "humaneval", "multi-agent"]
42 changes: 22 additions & 20 deletions configs/mt_grpo_che_config.yaml
@@ -1,7 +1,4 @@
# Configuration for Multi-Turn CoopHumanEval training with GRPO (single-agent)
# Based on mt_magrpo_che_config.yaml parameters but adapted for single-agent

# Model configuration
# model
model:
name: "Qwen/Qwen2.5-Coder-3B"
type: "qwen"
@@ -14,22 +11,29 @@ model:
trust_remote_code: true
torch_dtype: "bfloat16"

# Dataset configuration
# dataset
dataset:
name: "LovelyBuggies/CoopHumaneval"
type: "coophumaneval" # Used to select formatters and reward function
train_split: "test[:50]"
eval_split: "test[50:66]"
name: "CoMLRL/CoopHumanEval"
type: "coophumaneval"
train_split: "test[16:]"
eval_split: "test[:16]"

# Output configuration
# output
output:
base_dir: "../../../projects/bevi/sliu30/output_mt"
save_final_model: true
base_dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
save_final_model: false

# external
external:
mode: "level_feedback"
sandbox_slice: 1
original_prompt: true
previous_response: true

# GRPO training configuration (multi-turn enabled via num_turns)
# grpo
grpo:
num_turns: 2
num_train_epochs: 10 # Reduced from 20 for multi-turn
num_train_epochs: 8
per_device_train_batch_size: 1
learning_rate: 2.0e-5
logging_steps: 50
@@ -38,17 +42,15 @@ grpo:
max_new_tokens: 256
temperature: 0.8
top_p: 0.95
# Multi-turn specific parameters
handoff: random
turn_gradient_weights: [1.2, 0.8]
early_termination_weight: 2.0
early_termination_threshold: 2.1
external_mode: "expert_edits" # Options: expert_edits (default), level_passed, level_feedback, passed, plain
expert_model: "deepseek-coder" # Used by expert_edits mode only

# Wandb configuration
# wandb
wandb:
project: "mlrl"
entity: "nu-llpr"
name: "mt_grpo_coophumaneval" # Will be appended with model name in script
dir: "../../../projects/bevi/sliu30"
name: "mt_grpo_coophumaneval"
dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo"
tags: ["mt_grpo", "coophumaneval", "single-agent", "multi-turn"]