diff --git a/README.md b/README.md index 84acb79..8277afa 100644 --- a/README.md +++ b/README.md @@ -1,101 +1,65 @@ # LLM Collaboration with MARL -This repository contains training scripts and configurations for the paper "LLM Collaboration with Multi‑Agent Reinforcement Learning". -- [Benchmarks](#benchmarks) -- [Training Scripts](#training-scripts) - - [Default Configs](#default-configs) - - [Parameter Overrides](#parameter-overrides) -- [Multi-Turn Settings](#multi-turn-settings) - - [2+Turn Prompt Composition](#2turn-prompt-composition) - - [External Modes](#external-modes) - - [Sandbox Tests](#sandbox-tests) +Training scripts and configs for _"LLM Collaboration with Multi‑Agent Reinforcement Learning"_. ## Benchmarks -- HumanEval (HE): 164 problems on split `test` -- CoopHumanEval (CHE): 82 problems on split `test` +- MBPP: 427 problems on split `sanitized` +- HumanEval: 164 problems on split `test` +- CoopHumanEval: 82 problems on split `test` ## Training Scripts ### Default Configs ```bash -# Single-agent HumanEval (GRPO) python LLM_Collaboration_with_MARL/train_grpo.py \ --config LLM_Collaboration_with_MARL/configs/grpo_he_config.yaml -# Multi-agent CoopHumanEval (MAGRPO) python LLM_Collaboration_with_MARL/train_magrpo.py \ --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml - -# Multi-turn HumanEval (MT-MAGRPO) -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml ``` ### Parameter Overrides -You can override any configuration parameter using `--override`: +You can always override any configuration parameter using `--override`: ```bash -# Change model python LLM_Collaboration_with_MARL/train_magrpo.py \ --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \ - --override model_name='bigcode/starcoder2-3b' + --override model.name='bigcode/starcoder2-3b' magrpo.num_turns=1 +``` -# Modify training params -python LLM_Collaboration_with_MARL/train_grpo.py \ - --config LLM_Collaboration_with_MARL/configs/grpo_che_config.yaml \ - --override grpo.num_train_epochs=20 grpo.learning_rate=3e-5 +## Settings -# Multi-turn override example -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \ - --override dataset.train_split='test[16:]' dataset.eval_split='test[:16]' \ - magrpo.num_turns=2 +### Joint Action Modes -# Enable code-level training metrics (expensive; default is off) -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \ - --override magrpo.log_code_levels=true -``` -## Multi-Turn Settings +`magrpo.joint_mode` determines how each agent's K generations are combined into joint actions at each turn. Two modes are supported: 'aligned' (the default in the provided configs) pairs each agent's k-th generation with the other agents' k-th generations to form a joint action; 'cross' forms joint actions from all combinations of the agents' K generations (K^N joint actions for N agents). -### 2+Turn Prompt Composition +Since the number of samples also grows exponentially with the number of turns, aligned joint is **more flexible** (the number of samples need not be a perfect power) and hence faster to train in wall-clock time. Cross joint, however, is more sample-efficient (it reaches K^N joint actions with much lower VRAM than aligned mode would need at num_generations=K^N), and it also tends to perform better because its value estimation is more accurate.
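For example, to try cross joint actions instead, an illustrative override (the config path and generation count are just examples; pick whatever fits your run):

```bash
# Illustrative: switch to cross joint actions; with N=2 agents and
# num_generations=2 this forms 2^2 = 4 joint actions per node per turn.
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_he_config.yaml \
    --override magrpo.joint_mode='cross' magrpo.num_generations=2
```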
-To save memory usage, 2+ turn prompts **include the previous response without the original first‑turn problem prompt by default**. You can add the original prompt to match the concept of observation-action history in MARL. +### Number of Turns -```bash -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \ - --override magrpo.external_original_prompt=True magrpo.external_previous_response=True -``` +`magrpo.num_turns` sets the number of turns (2 by default). The number of samples grows exponentially with the number of turns: K^TN joint samples at turn T with cross joint and K^T with aligned joint. -### External Modes +### Early Termination -Multi-turn training supports external transition modes for 2nd+ turns, set via `external.mode`: +`magrpo.termination_threshold` is used to incentivize agents to find high-reward solutions quickly instead of expanding the full Monte Carlo tree. -- `level_feedback` **(default)**: Detailed diagnostics (impl found, syntax with line/col, per-test pass/fail errors, aux usage). - - Requires `external.expert_model` in config when using `expert_edits` (e.g., `deepseek-coder`, Claude, etc.). This parameter is ignored for other modes (`level_feedback`, `level_passed`, `passed`, `plain`). -- Requires corrsponding API keys in env vars. -- `level_passed`: Binary passed signals (impl found, syntax, tests summary, aux usage). -- `passed`: A binary signal — "All levels passed" or "Not all levels passed". -- `plain`: No signals or diagnostics. +At each node (branch, turn), the trainer computes the mean immediate reward across the **sibling joint actions** at that node. If the mean exceeds the threshold, that branch stops expanding at this turn and training backpropagates from the truncated subtree; other branches continue. -```bash -# HumanEval with detailed feedback signals -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/mt_magrpo_he_config.yaml \ - --override external.mode='level_feedback' -``` +### Multi-Turn Prompt -### Sandbox Tests +`external.original_prompt` and `external.previous_response` both default to `true`, so 2+ turn prompts include both the original first‑turn problem prompt and the previous response to preserve full context. You can shorten the context by setting either to `false` (for example, keep only the previous response to reduce tokens while retaining the most recent interaction), as in the override example below. -The external modes obtain `entry_point` and tests via an internal resolver registered by the training script. **By default, sandbox executes only the first assert (`sandbox_slice=1`).** Use all eval tests by setting `external.sandbox_slice` to `0`, `None`, or `'all'`. A negative value uses the last N asserts. Note: `external.sandbox_slice` only affects analysis-based modes (`level_feedback`, `level_passed`, `passed`), and it has no effect on `expert_edits`. -```bash -# Add an external.sandbox_slice override -python LLM_Collaboration_with_MARL/train_magrpo.py \ - --config LLM_Collaboration_with_MARL/configs/mt_magrpo_che_config.yaml \ - --override external.mode='level_feedback' external.sandbox_slice=-2 -```
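An illustrative override for the multi-turn prompt flags (the `external.*` keys come from the config files in this repo; booleans are assumed to be passed the same way they appear in the YAML configs):

```bash
# Illustrative: drop the original problem prompt from 2+ turn prompts,
# keeping only each agent's previous response.
python LLM_Collaboration_with_MARL/train_magrpo.py \
    --config LLM_Collaboration_with_MARL/configs/magrpo_che_config.yaml \
    --override external.original_prompt=false external.previous_response=true
```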
+### External Modes +`external.mode` is set to 'level_feedback' by default. This adds information from the external environment to the prompts in the following turns; 'level_feedback' attaches test‑driven diagnostics, while alternatives include 'expert_edits' (an LLM proposes edits), 'level_passed'/'passed' (binary outcomes), and 'plain' (no signals). + +A setting specific to 'level_feedback' is `external.sandbox_slice`, which controls how many eval tests are included in the feedback. By default, the sandbox executes only the first assert (`sandbox_slice=1`). Use all eval tests by setting `external.sandbox_slice` to 0, None, or 'all'; a negative value uses the last N asserts (e.g., -2 uses the last two). `external.sandbox_slice` only affects analysis-based modes ('level_feedback', 'level_passed', 'passed'), and it has no effect on 'expert_edits'. + +A setting specific to 'expert_edits' is `external.expert_edits_model`, which selects the LLM used to propose edits. By default it uses DeepSeek-Coder; you can switch to, e.g., Claude-3 or GPT-4, once the corresponding API keys/tokens are available in your environment variables. + +### Output + +`output.save_final_model` is set to `false` by default because saving multiple LLMs requires substantial storage. `output.verbose` enables debug printing (useful on a cluster); it defaults to `false`, in which case only a tqdm bar shows training progress. You can also turn on `magrpo.log_code_levels` to log level rewards during training, but it slows training down considerably. \ No newline at end of file diff --git a/configs/grpo_che_config.yaml b/configs/grpo_che_config.yaml index 8b21935..1d773b0 100644 --- a/configs/grpo_che_config.yaml +++ b/configs/grpo_che_config.yaml @@ -9,7 +9,7 @@ model: trust_remote_code: true model_kwargs: trust_remote_code: true - torch_dtype: "auto" + torch_dtype: "bfloat16" # dataset dataset: @@ -20,8 +20,9 @@ dataset: # output output: - base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo" + base_dir: "output" save_final_model: false + verbose: false # external external: @@ -32,17 +33,19 @@ external: # grpo grpo: - num_train_epochs: 16 + num_turns: 2 + num_train_epochs: 8 per_device_train_batch_size: 1 - learning_rate: 1.0e-5 + learning_rate: 2.0e-5 logging_steps: 50 save_steps: 200 num_generations: 4 max_new_tokens: 256 - joint_mode: cross + joint_mode: aligned temperature: 0.8 top_p: 0.95 discount: 0.9 + termination_threshold: -0.1 reward_shift: -2.1 # wandb @@ -50,5 +53,5 @@ wandb: project: "mlrl" entity: "nu-llpr" name: "grpo_coophumaneval" - dir: "../../../work/hdd/bepg/sliu30/output_st_grpo" - tags: ["grpo", "coophumaneval", "single-agent"] + dir: "output" + tags: ["grpo", "coophumaneval"] diff --git a/configs/grpo_he_config.yaml b/configs/grpo_he_config.yaml index c9d8bf7..421e2c0 100644 --- a/configs/grpo_he_config.yaml +++ b/configs/grpo_he_config.yaml @@ -9,7 +9,7 @@ model: trust_remote_code: true model_kwargs: trust_remote_code: true - torch_dtype: "auto" + torch_dtype: "bfloat16" # dataset dataset: @@ -20,8 +20,9 @@ dataset: # output output: - base_dir: "../../../work/hdd/bepg/sliu30/output_st_grpo" + base_dir: "output" save_final_model: false + verbose: false # external external: @@ -32,17 +33,19 @@ external: # grpo grpo: - num_train_epochs: 8 + num_turns: 2 + num_train_epochs: 6 per_device_train_batch_size: 1 - learning_rate: 1.0e-5 + learning_rate: 2.0e-5 logging_steps: 50 save_steps: 200 num_generations: 4 max_new_tokens: 256 - joint_mode: cross + joint_mode: aligned temperature: 0.8 top_p: 0.95 discount: 0.9 + termination_threshold: -0.1 reward_shift: -2.1 # wandb @@ -50,5 +53,5 @@ wandb: project: "mlrl" entity: 
"nu-llpr" name: "grpo_humaneval" - dir: "../../../work/hdd/bepg/sliu30/output_st_grpo" - tags: ["grpo", "humaneval", "single-agent"] + dir: "output" + tags: ["grpo", "humaneval"] diff --git a/configs/magrpo_che_config.yaml b/configs/magrpo_che_config.yaml index 5ace7f6..f6f6c7e 100644 --- a/configs/magrpo_che_config.yaml +++ b/configs/magrpo_che_config.yaml @@ -9,7 +9,7 @@ model: trust_remote_code: true model_kwargs: trust_remote_code: true - torch_dtype: "auto" + torch_dtype: "bfloat16" # dataset dataset: @@ -20,8 +20,9 @@ dataset: # output output: - base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo" + base_dir: "output" save_final_model: false + verbose: false # external external: @@ -32,7 +33,8 @@ external: # magrpo magrpo: - num_train_epochs: 16 + num_turns: 2 + num_train_epochs: 8 per_device_train_batch_size: 1 learning_rate: 2.0e-5 logging_steps: 50 @@ -41,9 +43,10 @@ magrpo: max_new_tokens: 256 temperature: 0.8 top_p: 0.95 - joint_mode: cross + joint_mode: aligned num_agents: 2 discount: 0.9 + termination_threshold: -0.2 reward_shift: -4 # wandb @@ -51,5 +54,5 @@ wandb: project: "mlrl" entity: "nu-llpr" name: "magrpo_coophumaneval" - dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo" + dir: "output" tags: ["magrpo", "coophumaneval", "multi-agent"] diff --git a/configs/magrpo_he_config.yaml b/configs/magrpo_he_config.yaml index a7e638c..029dcfe 100644 --- a/configs/magrpo_he_config.yaml +++ b/configs/magrpo_he_config.yaml @@ -9,7 +9,7 @@ model: trust_remote_code: true model_kwargs: trust_remote_code: true - torch_dtype: "auto" + torch_dtype: "bfloat16" # dataset dataset: @@ -20,8 +20,9 @@ dataset: # output output: - base_dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo" + base_dir: "output" save_final_model: false + verbose: false # external external: @@ -32,16 +33,18 @@ external: # magrpo magrpo: - num_train_epochs: 8 + num_turns: 2 + num_train_epochs: 6 per_device_train_batch_size: 1 learning_rate: 2.0e-5 logging_steps: 50 save_steps: 200 num_generations: 4 max_new_tokens: 256 - joint_mode: cross + joint_mode: aligned num_agents: 2 discount: 0.9 + termination_threshold: -0.2 reward_shift: -4 # wandb @@ -49,5 +52,5 @@ wandb: project: "mlrl" entity: "nu-llpr" name: "magrpo_humaneval" - dir: "../../../work/hdd/bepg/sliu30/output_st_magrpo" + dir: "output" tags: ["magrpo", "humaneval", "multi-agent"] diff --git a/configs/mt_grpo_che_config.yaml b/configs/mt_grpo_che_config.yaml deleted file mode 100644 index dae8305..0000000 --- a/configs/mt_grpo_che_config.yaml +++ /dev/null @@ -1,55 +0,0 @@ -# model -model: - name: "Qwen/Qwen2.5-Coder-3B" - type: "qwen" - temperature: 0.7 - top_p: 0.9 - max_length: 2048 - tokenizer_kwargs: - trust_remote_code: true - model_kwargs: - trust_remote_code: true - torch_dtype: "bfloat16" - -# dataset -dataset: - name: "CoMLRL/CoopHumanEval" - type: "coophumaneval" - train_split: "test[16:]" - eval_split: "test[:16]" - -# output -output: - base_dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo" - save_final_model: false - -# external -external: - mode: "level_feedback" - sandbox_slice: 1 - original_prompt: true - previous_response: true - -# grpo -grpo: - num_turns: 2 - num_train_epochs: 8 - per_device_train_batch_size: 1 - learning_rate: 2.0e-5 - logging_steps: 50 - save_steps: 200 - num_generations: 4 - max_new_tokens: 256 - joint_mode: cross - temperature: 0.8 - top_p: 0.95 - discount: 0.9 - reward_shift: -2.1 - -# wandb -wandb: - project: "mlrl" - entity: "nu-llpr" - name: "mt_grpo_coophumaneval" - dir: 
"../../../work/hdd/bepg/sliu30/output_mt_grpo" - tags: ["mt_grpo", "coophumaneval", "single-agent", "multi-turn"] diff --git a/configs/mt_grpo_he_config.yaml b/configs/mt_grpo_he_config.yaml deleted file mode 100644 index 0157d7c..0000000 --- a/configs/mt_grpo_he_config.yaml +++ /dev/null @@ -1,55 +0,0 @@ -# model -model: - name: "Qwen/Qwen2.5-Coder-3B" - type: "qwen" - temperature: 0.7 - top_p: 0.9 - max_length: 2048 - tokenizer_kwargs: - trust_remote_code: true - model_kwargs: - trust_remote_code: true - torch_dtype: "bfloat16" - -# dataset -dataset: - name: "openai/openai_humaneval" - type: "humaneval" - train_split: "test[33:163]" - eval_split: "test[:32]" - -# output -output: - base_dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo" - save_final_model: false - -# external -external: - mode: "level_feedback" - sandbox_slice: 1 - original_prompt: true - previous_response: true - -# grpo -grpo: - num_turns: 2 - num_train_epochs: 6 - per_device_train_batch_size: 1 - learning_rate: 2.0e-5 - logging_steps: 50 - save_steps: 200 - num_generations: 4 - max_new_tokens: 256 - joint_mode: cross - temperature: 0.8 - top_p: 0.95 - discount: 0.9 - reward_shift: -2.1 - -# wandb -wandb: - project: "mlrl" - entity: "nu-llpr" - name: "mt_grpo_humaneval" - dir: "../../../work/hdd/bepg/sliu30/output_mt_grpo" - tags: ["mt_grpo", "humaneval", "single-agent", "multi-turn"] diff --git a/configs/mt_magrpo_che_config.yaml b/configs/mt_magrpo_che_config.yaml deleted file mode 100644 index 64139cb..0000000 --- a/configs/mt_magrpo_che_config.yaml +++ /dev/null @@ -1,56 +0,0 @@ -# model -model: - name: "Qwen/Qwen2.5-Coder-3B" - type: "qwen" - temperature: 0.7 - top_p: 0.9 - max_length: 2048 - tokenizer_kwargs: - trust_remote_code: true - model_kwargs: - trust_remote_code: true - torch_dtype: "bfloat16" - -# dataset -dataset: - name: "CoMLRL/CoopHumanEval" - type: "coophumaneval" - train_split: "test[16:]" - eval_split: "test[:16]" - -# output -output: - base_dir: "../../../work/hdd/bepg/sliu30/output_mt_magrpo" - save_final_model: false - -# external -external: - mode: "level_feedback" - sandbox_slice: 1 - original_prompt: true - previous_response: true - -# magrpo -magrpo: - num_turns: 2 - num_train_epochs: 8 - per_device_train_batch_size: 1 - learning_rate: 2.0e-5 - logging_steps: 50 - save_steps: 200 - num_generations: 4 - max_new_tokens: 256 - temperature: 0.8 - top_p: 0.95 - joint_mode: cross - num_agents: 2 - discount: 0.9 - reward_shift: -4 - -# wandb -wandb: - project: "mlrl" - entity: "nu-llpr" - name: "mt_magrpo_coophumaneval" - dir: "../../../work/hdd/bepg/sliu30/output_mt_magrpo" - tags: ["mt_magrpo", "coophumaneval", "multi-agent", "multi-turn"] diff --git a/configs/mt_magrpo_he_config.yaml b/configs/mt_magrpo_he_config.yaml deleted file mode 100644 index f0aa070..0000000 --- a/configs/mt_magrpo_he_config.yaml +++ /dev/null @@ -1,54 +0,0 @@ -# model -model: - name: "Qwen/Qwen2.5-Coder-3B" - type: "qwen" - temperature: 0.7 - top_p: 0.9 - max_length: 2048 - tokenizer_kwargs: - trust_remote_code: true - model_kwargs: - trust_remote_code: true - torch_dtype: "bfloat16" - -# dataset -dataset: - name: "openai/openai_humaneval" - type: "humaneval" - train_split: "test[33:163]" - eval_split: "test[:32]" - -# output -output: - base_dir: "../../../work/hdd/bepg/sliu30/output_mt_magrpo" - save_final_model: false - -# external -external: - mode: "level_feedback" - sandbox_slice: 1 - original_prompt: true - previous_response: true - -# magrpo -magrpo: - num_turns: 2 - num_train_epochs: 6 - 
per_device_train_batch_size: 1 - learning_rate: 2.0e-5 - logging_steps: 50 - save_steps: 200 - num_generations: 4 - max_new_tokens: 256 - joint_mode: cross - num_agents: 2 - discount: 0.9 - reward_shift: -4 - -# wandb -wandb: - project: "mlrl" - entity: "nu-llpr" - name: "mt_magrpo_humaneval" - dir: "../../../work/hdd/bepg/sliu30/output_mt_magrpo" - tags: ["mt_magrpo", "humaneval", "multi-agent", "multi-turn"] diff --git a/external/__init__.py b/external/__init__.py index 2b994ee..a2252e1 100644 --- a/external/__init__.py +++ b/external/__init__.py @@ -6,6 +6,10 @@ from . import level_passed from . import passed from . import plain +import builtins + +# Verbose toggle for external previews +VERBOSE = True # ----------------------------- # Context resolver API @@ -59,6 +63,13 @@ def get_external_transition( Returns: A list/tuple of full prompts for each agent to use in the next turn. """ + # Local print override + if not VERBOSE: + def print(*args, **kwargs): # type: ignore + return None + else: + print = builtins.print # type: ignore + if int(num_agents) not in (1, 2): raise ValueError( f"External transition supports 1 or 2 agents, got {num_agents}." diff --git a/rewards/code_rewards.py b/rewards/code_rewards.py index 45c7858..bd9c79c 100644 --- a/rewards/code_rewards.py +++ b/rewards/code_rewards.py @@ -1,6 +1,10 @@ import re import signal from typing import List +import builtins + +# Verbose toggle (can be set by training scripts) +VERBOSE = True from rewards.code_utils import ( TimeoutException, @@ -43,6 +47,13 @@ def execution_reward_aux( Maximum reward: 4.0 (updated from 3.5) """ + # Local print override based on VERBOSE + if not VERBOSE: + def print(*args, **kwargs): # type: ignore + return None + else: + print = builtins.print # type: ignore + rewards = [] TEST_TIMEOUT = 10 # Timeout per individual test diff --git a/train_grpo.py b/train_grpo.py index 1f7e0a5..ec9d7cc 100644 --- a/train_grpo.py +++ b/train_grpo.py @@ -187,9 +187,11 @@ def main(): num_turns = grpo_config.get("num_turns", 1) is_multi_turn = num_turns > 1 - print(f"Multi-turn GRPO enabled: num_turns={num_turns}") if is_multi_turn else print( - f"Single-turn GRPO: num_turns={num_turns}" - ) + output_verbose = config.get("output.verbose", True) + if output_verbose: + print(f"Multi-turn GRPO enabled: num_turns={num_turns}") if is_multi_turn else print( + f"Single-turn GRPO: num_turns={num_turns}" + ) slurm_job_id = os.environ.get("SLURM_JOB_ID", "no_job_id") # Use different output directory prefix for multi-turn for clarity @@ -353,6 +355,7 @@ def _resolver(prompt: str): num_turns=num_turns, discount=grpo_config.get("discount", 0.9), joint_mode=grpo_config.get("joint_mode", "cross"), + termination_threshold=grpo_config.get("termination_threshold", None), ) formatter = get_formatter(dataset_type) @@ -406,6 +409,18 @@ def _resolver(prompt: str): }, } + # Propagate verbosity to reward/external modules + try: + import rewards.code_rewards as code_rewards + code_rewards.VERBOSE = bool(output_verbose) + except Exception: + pass + try: + import external as external_mod + external_mod.VERBOSE = bool(output_verbose) + except Exception: + pass + reward_processor = None # Optional scale if config.get("reward_processor.enabled", False): diff --git a/train_magrpo.py b/train_magrpo.py index dc1b6f6..a0fdb13 100644 --- a/train_magrpo.py +++ b/train_magrpo.py @@ -232,9 +232,11 @@ def main(): num_turns = magrpo_config.get("num_turns", 1) is_multi_turn = num_turns > 1 - print(f"Multi-turn training enabled: num_turns={num_turns}") if 
is_multi_turn else print( - f"Single-turn training: num_turns={num_turns}" - ) + output_verbose = config.get("output.verbose", True) + if output_verbose: + print(f"Multi-turn training enabled: num_turns={num_turns}") if is_multi_turn else print( + f"Single-turn training: num_turns={num_turns}" + ) slurm_job_id = os.environ.get("SLURM_JOB_ID", "no_job_id") @@ -260,9 +262,10 @@ def main(): print(f"Error loading dataset: {e}") return - print(f"\nUsing model: {model_name}") - print(f"Model type: {model_config.type}") - print(f"Max context window: {model_config.max_length} tokens") + if output_verbose: + print(f"\nUsing model: {model_name}") + print(f"Model type: {model_config.type}") + print(f"Max context window: {model_config.max_length} tokens") tokenizer = AutoTokenizer.from_pretrained( model_name, **model_config.tokenizer_kwargs @@ -277,11 +280,13 @@ def main(): # Add special tokens if needed (e.g., FIM tokens for StarCoder) if model_config.special_tokens: - print("Adding special tokens...") + if output_verbose: + print("Adding special tokens...") tokenizer.add_special_tokens(model_config.special_tokens) - print( - f"Special tokens added: {model_config.special_tokens.get('additional_special_tokens', [])}" - ) + if output_verbose: + print( + f"Special tokens added: {model_config.special_tokens.get('additional_special_tokens', [])}" + ) temperature = magrpo_config.get("temperature", model_config.temperature) top_p = magrpo_config.get("top_p", model_config.top_p) @@ -402,6 +407,7 @@ def _resolver(prompt: str): num_turns=num_turns, discount=magrpo_config.get("discount", 0.9), joint_mode=magrpo_config.get("joint_mode", "cross"), + termination_threshold=magrpo_config.get("termination_threshold", None), ) # Get appropriate formatters and functions based on dataset type, agent count, and training mode @@ -454,6 +460,18 @@ def _resolver(prompt: str): }, } + # Propagate verbosity to reward/external modules + try: + import rewards.code_rewards as code_rewards + code_rewards.VERBOSE = bool(output_verbose) + except Exception: + pass + try: + import external as external_mod + external_mod.VERBOSE = bool(output_verbose) + except Exception: + pass + # Get num_agents from magrpo config (where it belongs for MAGRPO training) num_agents = magrpo_config.get("num_agents", 2) agents = [