### Hyperparameter Tuning in RL/MARL

Tuning hyperparameters is very important in RL: even small changes can cause large differences in policy performance, especially in a more stochastic setting like MARL.

In this tutorial, we will go through how to tune hyperparameters by hand, and how to use tools like Optuna to tune them more effectively.

We'll set up a simple MARL environment, as we did in Unit 2, and tune its hyperparameters.

### Dependencies

In [1]:
!pip3 install torchrl==0.7.0
!pip3 install tensordict==0.7.2
!pip3 install pettingzoo
!pip3 install tqdm
!pip3 install optuna


In [2]:
# Torch
import torch
import torch.nn as nn

# Tensordict modules
from tensordict.nn import set_composite_lp_aggregate, TensorDictModule, TensorDictSequential
from tensordict import TensorDictBase
from torch import multiprocessing

# Data collection
from torchrl.collectors import SyncDataCollector
from torch.distributions import Categorical
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage

# Optuna
import optuna

# Env
from torchrl.envs import (
    RewardSum,
    TransformedEnv,
    PettingZooWrapper,
    Compose,
    DoubleToFloat,
    StepCounter,
    ParallelEnv,
    EnvCreator,
    ExplorationType,
    set_exploration_type,
)

# Env utils
from torchrl.envs.utils import check_env_specs

# Multi-agent network
from torchrl.modules import MultiAgentMLP, ProbabilisticActor, TanhNormal

# Loss
from torchrl.objectives import ClipPPOLoss, ValueEstimators

# Plotting and progress bars
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

# Seed for reproducibility
torch.manual_seed(0)



### Setup Environment

In [3]:
from pettingzoo.butterfly import knights_archers_zombies_v10

base_env = knights_archers_zombies_v10.parallel_env(render_mode="rgb_array")
env = PettingZooWrapper(base_env)




### Create a dictionary of default hyperparameters
These are the original hyperparameters we will use for the baseline, and we will see if other tuned hyperparameters can lead to better results.

Unless you have a powerful computer, leave `n_parallel_envs` unchanged.

In [4]:
is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)

#Parameters for Env
# Optuna should not be in control of how many envs you run
# Unless you want to brick your computer
n_parallel_envs = 2  # Number of parallel environments

# Default hyperparameters dictionary
default_params = {

    # Sampling parameters
    "frames_per_batch": 2000,  # Number of frames collected per training iteration
    "total_frames": 200000,  # Total frames for training

    # Training parameters
    "num_epochs": 5,  # Number of optimization steps per training iteration
    "minibatch_size": 400,  # Size of the mini-batches in each optimization step
    "lr": 3e-4,  # Learning rate
    "max_grad_norm": 1.0,  # Maximum norm for the gradients

    # PPO parameters
    "clip_epsilon": 0.2,  # Clip value for PPO loss
    "gamma": 0.99,  # Discount factor
    "lambda": 0.9,  # Lambda for generalized advantage estimation
    "entropy_eps": 1e-4,  # Coefficient of the entropy term in the PPO loss

    # Network parameters
    "network_depth": 2,  # Depth of the neural networks
    "network_width": 256,  # Width of the neural networks
    "activation": "Tanh",  # Activation function
    "share_parameters_policy": True,  # Whether to share parameters in policy
    "share_parameters_critic": True,  # Whether to share parameters in critic
    "mappo": True,  # Whether to use MAPPO (True) or IPPO (False)
}

# disable log-prob aggregation
set_composite_lp_aggregate(False).set()
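
As a rough sanity check on the sampling and training values above, the sketch below works out how many optimizer steps they imply. It assumes each epoch sweeps the whole collected batch once in minibatches, and that each group (archer, knight) is optimized separately, as in the training loop later in this notebook. The variable names are illustrative only.

```python
# Back-of-the-envelope count of optimizer steps implied by the defaults.
# Assumption: each epoch sweeps the whole batch once, in minibatches.
frames_per_batch = 2000
total_frames = 200_000
num_epochs = 5
minibatch_size = 400

iterations = total_frames // frames_per_batch               # collection iterations
minibatches_per_epoch = frames_per_batch // minibatch_size  # sweeps of one epoch
steps_per_iteration = num_epochs * minibatches_per_epoch    # per agent group
total_steps_per_group = iterations * steps_per_iteration

print(iterations, minibatches_per_epoch, steps_per_iteration, total_steps_per_group)
```

Changing `num_epochs` or `minibatch_size` therefore changes how much optimization is done per collected frame, which is one reason these two values interact with the learning rate.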

### Setting up the transformed environment

In [5]:
# Build each environment with its own fresh transform instances.
# Nested keys such as ("archer", "reward") must be given as tuples.
make_env = EnvCreator(lambda: TransformedEnv(
    PettingZooWrapper(knights_archers_zombies_v10.parallel_env(render_mode="rgb_array")),
    Compose(
        # Accumulate the per-episode reward for archer agents
        RewardSum(
            in_keys=[("archer", "reward")],
            out_keys=[("archer", "episode_reward")],
        ),
        # Accumulate the per-episode reward for knight agents
        RewardSum(
            in_keys=[("knight", "reward")],
            out_keys=[("knight", "episode_reward")],
        ),
        DoubleToFloat(),
        StepCounter(),
    ),
))

env = ParallelEnv(n_parallel_envs, make_env, serial_for_single=True)

### Check validity of environment

In [6]:
print("action_keys:", env.action_keys)
print("reward_keys:", env.reward_keys)
print("done_keys:", env.done_keys)

print("Action Spec:", env.action_spec)
print("Observation Spec:", env.observation_spec)
print("Reward Spec:", env.reward_spec)
print("Done Spec:", env.done_spec)

check_env_specs(env)

action_keys: [('archer', 'action'), ('knight', 'action')]
reward_keys: [('archer', 'reward'), ('knight', 'reward')]
done_keys: ['done', 'terminated', 'truncated', ('archer', 'done'), ('archer', 'terminated'), ('archer', 'truncated'), ('knight', 'done'), ('knight', 'terminated'), ('knight', 'truncated')]
Action Spec: Composite(
    archer: Composite(
        action: Categorical(
            shape=torch.Size([2, 2]),
            space=CategoricalBox(n=6),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        device=cpu,
        shape=torch.Size([2, 2])),
    knight: Composite(
        action: Categorical(
            shape=torch.Size([2, 2]),
            space=CategoricalBox(n=6),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        device=cpu,
        shape=torch.Size([2, 2])),
    device=cpu,
    shape=torch.Size([2]))
Observation Spec: Composite(
    archer: Composite(
        observation: BoundedContinuous

2025-05-06 05:43:17,053 [torchrl][INFO] check_env_specs succeeded!


In [7]:
n_rollout_steps = 5
rollout = env.rollout(n_rollout_steps)
print(f"rollout of {n_rollout_steps} steps:", rollout)
print("Shape of the rollout TensorDict:", rollout.batch_size)

rollout of 5 steps: TensorDict(
    fields={
        archer: TensorDict(
            fields={
                action: Tensor(shape=torch.Size([2, 5, 2]), device=cpu, dtype=torch.int64, is_shared=False),
                done: Tensor(shape=torch.Size([2, 5, 2, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                episode_reward: Tensor(shape=torch.Size([2, 5, 2, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                observation: Tensor(shape=torch.Size([2, 5, 2, 27, 5]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([2, 5, 2, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([2, 5, 2, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([2, 5, 2]),
            device=None,
            is_shared=False),
        done: Tensor(shape=torch.Size([2, 5, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        knight

### Dump all the model architecture and training code into functions

Optuna needs to build a fresh environment and model for every trial, so, as much as it hurts my eyes (and probably yours), we have to gather the previously separate notebook cells into a few functions.

This matters most when the architecture itself depends on the hyperparameters Optuna suggests.

The `FlattenObs` and `process_batch` helpers do not depend on any hyperparameters, so we can keep them outside those functions.

In [8]:
n_archers = env.observation_spec["archer", "observation"].shape[1]
n_knights = env.observation_spec["knight", "observation"].shape[1]
n_entities = env.observation_spec["archer", "observation"].shape[2]
n_features = env.observation_spec["archer", "observation"].shape[3]

# Create a flattening module that handles the batched_env+time dimensions
class FlattenObs(nn.Module):
    def forward(self, obs):
        # Convert to float first
        obs = obs.float()

        # Handle different possible shapes
        if obs.dim() == 5:  # [batch_env, time, n_agents, n_entities, n_features]
            # Merge the batch_env and time dimensions.
            # This gives [batch_env*time, n_agents, n_entities, n_features]
            obs = obs.reshape(-1, *obs.shape[2:])

            # Take only the first entity for each agent (the agent itself)
            return obs[:, :, 0, :]  # [batch_env*time, n_agents, n_features]

        elif obs.dim() == 4:  # [batch, n_agents, n_entities, n_features]
            return obs[:, :, 0, :]  # [batch, n_agents, n_features]

        elif obs.dim() == 3:  # [batch, n_entities, n_features]
            # No agent dimension: flatten all entities into a single vector
            batch, n_entities, n_features = obs.shape
            return obs.reshape(batch, n_entities * n_features)

        # Fallback for unexpected shapes
        return obs



In [9]:
groups = ["archer", "knight"]

def process_batch(batch: TensorDictBase) -> TensorDictBase:
    """
    Expand done and terminated keys for each group to match reward shape.
    """
    keys = list(batch.keys(True, True))  # All nested leaf keys, computed once
    for group in groups:
        group_shape = batch.get_item_shape(group)
        nested_done_key = ("next", group, "done")
        nested_terminated_key = ("next", group, "terminated")
        if nested_done_key not in keys:
            batch.set(
                nested_done_key,
                batch.get(("next", "done")).unsqueeze(-1).expand((*group_shape, 1)),
            )
        if nested_terminated_key not in keys:
            batch.set(
                nested_terminated_key,
                batch.get(("next", "terminated"))
                .unsqueeze(-1)
                .expand((*group_shape, 1)),
            )
    return batch

In [10]:
# 100 LOC dump of all model code, because you need to have everything wrapped in
# an objective function for Optuna to optimise
def setup_models(params, env, device):
    """
    Set up policy and critic models using parameter dictionary.

    Args:
        params: Dictionary of hyperparameters
        env: Environment instance
        device: Computing device

    Returns:
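        tuple: (policies, agents_policy, critics), where policies and critics
            are per-group dictionaries and agents_policy runs all group
            policies in sequence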
    """
    n_archers = env.observation_spec["archer", "observation"].shape[1]
    n_knights = env.observation_spec["knight", "observation"].shape[1]
    n_entities = env.observation_spec["archer", "observation"].shape[2]
    n_features = env.observation_spec["archer", "observation"].shape[3]

    # Create activation function
    if params["activation"] == "Tanh":
        activation_class = torch.nn.Tanh
    elif params["activation"] == "ReLU":
        activation_class = torch.nn.ReLU
    else:
        activation_class = torch.nn.Tanh  # Default

    # Create policy modules for each agent type
    policy_modules = {}
    for group in ["archer", "knight"]:
        n_agents = n_archers if group == "archer" else n_knights
        share_parameters_policy = params["share_parameters_policy"]

        # Create MLP for policy
        policy_mlp = MultiAgentMLP(
            n_agent_inputs=n_features,  # Only using features of the agent itself
            n_agent_outputs=6,  # 6 discrete actions in KAZ
            n_agents=n_agents, # 2 agents per type
            centralised=False, #The agents are decentralised
            share_params=share_parameters_policy,
            device=device,
            depth=params["network_depth"],
            num_cells=params["network_width"],
            activation_class=activation_class,
        )

        # Create sequential module with flattening
        policy_seq = nn.Sequential(FlattenObs(), policy_mlp, nn.Softmax(dim=-1))

        # Wrap in TensorDictModule
        policy_modules[group] = TensorDictModule(
            module=policy_seq,
            in_keys=[(group, "observation")],
            out_keys=[(group, "probs")]
        )

    # Create actors for each agent type
    policies = {}
    for group in ["archer", "knight"]:
        policies[group] = ProbabilisticActor(
            module=policy_modules[group],  # Use the policy module directly
            spec=env.action_spec[group, "action"],
            in_keys=[(group, "probs")],
            out_keys=[(group, "action")],
            distribution_class=Categorical,
            return_log_prob=True,
        )

    agents_policy = TensorDictSequential(*policies.values())


    # Create critics
    critics = {}
    for group in ["archer", "knight"]:
        n_agents = n_archers if group == "archer" else n_knights
        share_parameters_critic = params["share_parameters_critic"]
        MAPPO = params["mappo"]

        # Wrap flattener in TensorDictModule
        flatten_obs_module = TensorDictModule(
            FlattenObs(),
            in_keys=[(group, "observation")],
            out_keys=[(group, "flat_observation")],
        )

        # Create critic module
        critic_module = TensorDictModule(
           module = MultiAgentMLP(
               n_agent_inputs=n_features,
               n_agent_outputs=1,
               n_agents=n_agents,
               centralised=MAPPO, #True for MAPPO, False for IPPO
               share_params=share_parameters_critic,
               device=device,
               activation_class=activation_class,
               depth=params["network_depth"],
               num_cells=params["network_width"],
           ), in_keys = [(group, "flat_observation")], out_keys = [(group, "state_value")]
        )

        # Combine modules
        critics[group] = TensorDictSequential(
            flatten_obs_module,
            critic_module,
        )

    return policies, agents_policy, critics
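
The `mappo` flag only changes what each critic sees. Below is a minimal plain-Python sketch of the difference, assuming (as in TorchRL's `MultiAgentMLP`) that a centralised network receives the concatenated inputs of all agents in the group, while a decentralised one receives only the agent's own features. `critic_input` is a hypothetical helper written for this illustration, not part of TorchRL.

```python
def critic_input(agent_features, agent_index, centralised):
    """Return the feature vector a critic would see for one agent.

    agent_features: list of per-agent feature lists for one group.
    """
    if centralised:
        # MAPPO-style: concatenate the features of every agent in the group
        return [x for features in agent_features for x in features]
    # IPPO-style: only the agent's own features
    return list(agent_features[agent_index])


# Two agents with 5 features each, matching the observation rows used here
group = [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5]]

centralised_in = critic_input(group, 0, centralised=True)
decentralised_in = critic_input(group, 0, centralised=False)
print(len(centralised_in), len(decentralised_in))
```

Because the critics in this notebook are built per group, centralisation here only spans agents of the same type: the archer critic never sees knight observations, and vice versa.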

In [11]:
def train_model(env, params, policies, agents_policy, critics, device):
    """
    Train the model using the specified parameters, policies, and critics.

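    Args:
        env: Environment
        params: Dictionary of hyperparameters
        policies: Dictionary of per-group policy modules
        agents_policy: Sequential module that runs all group policies (used for data collection)
        critics: Dictionary of per-group critic modules
        device: Computing device

    Returns:
        float: Mean episode reward over the last training iterations, averaged across groups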
    """
    # Create data collector on the environment that was passed in
    # (avoids spawning a second, unused set of worker processes)
    collector = SyncDataCollector(
        env,
        policy=agents_policy,
        frames_per_batch=params["frames_per_batch"],
        total_frames=params["total_frames"],
        device=device,
        storing_device=device,
    )

    # Create loss modules for each agent type
    loss_modules = {}
    for group in ["archer", "knight"]:
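        loss_modules[group] = ClipPPOLoss(
            actor_network=policies[group],
            critic_network=critics[group],
            clip_epsilon=params["clip_epsilon"],
            # Without this, the tuned "entropy_eps" value would never be used
            entropy_coef=params["entropy_eps"],
        )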

        loss_modules[group].set_keys(
            reward=(group, "reward"),
            action=(group, "action"),
            value=(group, "state_value"),
            done=(group, "done"),
            terminated=(group, "terminated")
        )

        loss_modules[group].make_value_estimator(ValueEstimators.GAE, gamma = params['gamma'], lmbda = params["lambda"])

    # Create optimizers
    optimizers = {}
    for group in ["archer", "knight"]:
        optimizers[group] = torch.optim.Adam(
            list(policies[group].parameters()) + list(critics[group].parameters()),
            lr=params["lr"],
        )

    replay_buffers = {}
    for group in groups:
        # Create storage and buffer
        storage = LazyTensorStorage(params["frames_per_batch"])
        sampler = SamplerWithoutReplacement()
        replay_buffers[group] = ReplayBuffer(
            storage=storage,
            sampler=sampler,
            batch_size = params["minibatch_size"],
        )

    # Training loop
    pbar = tqdm(
        total=params["total_frames"],
        desc=", ".join([f"episode_reward_mean_{group} = 0" for group in groups])
    )
    episode_reward_mean_map = {group: [] for group in groups}
    total_frames_so_far = 0

    # Training/collection iterations
    for iteration, batch in enumerate(collector):
        batch = process_batch(batch)  # Expand done keys if needed

        # Calculate total frames in this batch
        current_batch_size = batch.numel()
        total_frames_so_far += current_batch_size  # Track total frames

        # Process each group
        for group in groups:
            # Extract data for this group only
            group_batch = batch.exclude(
                *[
                    key
                    for _group in groups
                    if _group != group
                    for key in [_group, ("next", _group)]
                ]
            )

            # Reshape to flatten batch dimensions
            group_batch = group_batch.reshape(-1)

            # Add to this group's replay buffer
            replay_buffers[group].extend(group_batch)

            # PPO training epochs (multiple passes over the same data)
            for _ in range(params["num_epochs"]):
                # Iterate through all minibatches in the buffer once
                for subdata in replay_buffers[group]:
                    # Compute loss
                    loss_vals = loss_modules[group](subdata)

                    # Compute total loss
                    loss_value = (
                        loss_vals["loss_objective"] +
                        loss_vals["loss_critic"] +
                        loss_vals["loss_entropy"]
                    )

                    # Backprop and optimize
                    optimizers[group].zero_grad()
                    loss_value.backward()

                    # Gradient clipping
                    torch.nn.utils.clip_grad_norm_(
                        loss_modules[group].parameters(), params["max_grad_norm"]
                    )

                    optimizers[group].step()

        # Update collector policy with new weights
        collector.update_policy_weights_()

        # Logging with error handling
        for group in groups:
            done_mask = batch.get(("next", group, "done"))

            # Check if any episodes finished
            if done_mask.any():
                episode_reward_mean = (
                    batch.get(("next", group, "episode_reward"))[done_mask]
                    .mean()
                    .item()
                )
            else:
                # No episodes finished, use previous value or 0
                episode_reward_mean = (
                    episode_reward_mean_map[group][-1] if episode_reward_mean_map[group] else 0.0
                )

            episode_reward_mean_map[group].append(episode_reward_mean)

        # Update description with step count
        pbar.set_description(
            f"Steps: {total_frames_so_far}, " +
            ", ".join([
                f"{group}: {episode_reward_mean_map[group][-1]:.2f}"
                for group in groups
            ]),
            refresh=False
        )

        # Update progress bar with total frames processed in this batch
        pbar.update(current_batch_size)

    # Calculate final mean reward (average over the last 10 collection iterations for each group)
    final_rewards = {}
    for group in groups:
        if episode_reward_mean_map[group]:
            last_n = min(10, len(episode_reward_mean_map[group]))
            final_rewards[group] = sum(episode_reward_mean_map[group][-last_n:]) / last_n
        else:
            final_rewards[group] = 0.0

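    # Release the progress bar and the collector (which also closes its environments)
    pbar.close()
    collector.shutdown()

    # Return average of all group rewards
    return sum(final_rewards.values()) / len(final_rewards)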
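
The `gamma` and `lambda` values passed to `make_value_estimator` control generalized advantage estimation (GAE). The following is a plain-Python sketch of the standard GAE recursion for a single trajectory, meant only to show how the two parameters interact. It is not TorchRL's implementation, and `gae_advantages` is a hypothetical name.

```python
def gae_advantages(rewards, values, next_values, dones, gamma, lmbda):
    """Compute GAE advantages for one trajectory, iterating backwards in time."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of current and future TD errors
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
next_values = [0.5, 0.5, 0.0]
dones = [False, False, True]

# lmbda = 0 keeps only the one-step TD error (low variance, more bias);
# lmbda = 1 gives the discounted return minus the value (less bias, more variance).
adv_td = gae_advantages(rewards, values, next_values, dones, gamma=0.99, lmbda=0.0)
adv_mc = gae_advantages(rewards, values, next_values, dones, gamma=0.99, lmbda=1.0)
print(adv_td)
print(adv_mc)
```

Values of `lambda` between 0 and 1 interpolate between these two extremes, which is why the search range used later stays close to 1.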



### Setting up hyperparameter tuning

This is the key driving code. We set up an objective function that asks Optuna for a new set of hyperparameters on each trial, within constraints set by the user, and returns the score to optimize.

`suggest_<type>` is key here: each call fixes the parameter's type (for example, `suggest_int` for integers) and lets Optuna generate a new value within the bounds or choices given as arguments.
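
For parameters such as the learning rate, the cell below passes `log=True` to `suggest_float`, which searches the range in log space. The sketch below imitates that idea with the standard library only, to illustrate why it matters for ranges spanning several orders of magnitude. It is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-5, 1e-3

linear = [rng.uniform(low, high) for _ in range(10_000)]
logged = [sample_log_uniform(rng, low, high) for _ in range(10_000)]

# Fraction of samples that fall in the lowest decade, [1e-5, 1e-4)
frac_linear = sum(x < 1e-4 for x in linear) / len(linear)
frac_logged = sum(x < 1e-4 for x in logged) / len(logged)
print(f"linear: {frac_linear:.2f}, log-uniform: {frac_logged:.2f}")
```

With linear sampling, only about a tenth of the draws land below `1e-4`; with log-uniform sampling, each decade gets roughly half of the draws, so small learning rates are explored as thoroughly as large ones.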

In [12]:
def objective(trial):
    """
    Optuna objective function for hyperparameter optimization.

    Args:
        trial: Optuna trial object

    Returns:
        float: Mean reward (metric to maximize)
    """
    # Define hyperparameters to optimize
    params = {
        # Sampling parameters
        "frames_per_batch": 2000,  # Fixed for consistency
        "total_frames": 20000,  # Reduced for faster trials

        # Training parameters
        "num_epochs": trial.suggest_int("num_epochs", 3, 10),
        "minibatch_size": trial.suggest_categorical("minibatch_size", [200, 400, 800]),
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "max_grad_norm": trial.suggest_float("max_grad_norm", 0.5, 2.0),

        # PPO parameters
        "clip_epsilon": trial.suggest_float("clip_epsilon", 0.1, 0.3),
        "gamma": trial.suggest_float("gamma", 0.95, 0.999),
        "lambda": trial.suggest_float("lambda", 0.9, 1.0),
        "entropy_eps": trial.suggest_float("entropy_eps", 1e-5, 1e-3, log=True),

        # Network parameters
        "network_depth": trial.suggest_int("network_depth", 1, 3),
        "network_width": trial.suggest_categorical("network_width", [64, 128, 256, 512]),
        "activation": trial.suggest_categorical("activation", ["Tanh", "ReLU"]),
        "share_parameters_policy": True,  # Fixed for simplicity
        "share_parameters_critic": True,  # Fixed for simplicity
        "mappo": trial.suggest_categorical("mappo", [True, False]),
    }

    # Setup device
    is_fork = multiprocessing.get_start_method() == "fork"
    device = torch.device(0) if torch.cuda.is_available() and not is_fork else torch.device("cpu")

    # Create a fresh set of parallel environments for this trial
    env = ParallelEnv(n_parallel_envs, make_env)

    # Setup models using parameter dictionary
    policies, agents_policy, critics = setup_models(params, env, device)

    # Train the model and get mean reward
    mean_reward = train_model(env, params, policies, agents_policy, critics, device)
    # We report the mean reward, but other metrics (e.g. a weighted average across groups) would also work.

    return mean_reward


### Using Optuna to tune hyperparameters

We then create a study, allowing us to optimize in a given direction and take the best hyperparameters.
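
Conceptually, a study repeatedly asks for a set of hyperparameters, evaluates the objective, and remembers the best result. The sketch below mirrors that loop with plain random search and a toy objective so that it runs in well under a second. Optuna's default sampler is more sophisticated than this, and `toy_objective` and `random_search` are hypothetical stand-ins rather than Optuna APIs.

```python
import math
import random


def toy_objective(params):
    """Made-up score surface that peaks at lr = 1e-4 and num_epochs = 6."""
    lr_term = -(math.log10(params["lr"]) + 4.0) ** 2
    epoch_term = -0.1 * (params["num_epochs"] - 6) ** 2
    return lr_term + epoch_term


def random_search(objective, n_trials, seed=0):
    """Sample n_trials configurations and return the best (value, params)."""
    rng = random.Random(seed)
    best_value, best_params = -math.inf, None
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-5, -3),   # log-uniform in [1e-5, 1e-3]
            "num_epochs": rng.randint(3, 10),  # integer in [3, 10]
        }
        value = objective(params)
        if value > best_value:                 # same idea as direction="maximize"
            best_value, best_params = value, params
    return best_value, best_params


best_value, best_found = random_search(toy_objective, n_trials=50)
print(best_value, best_found)
```

The real objective in this notebook trains a full set of policies per trial, so each evaluation takes minutes rather than microseconds; that cost is why the number of trials has to be chosen with the available compute in mind.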

In [13]:
# This will take very long

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=2)  # Adjust based on computational resources (n = 2 is very small)

# Print best parameters
print("Best trial:")
trial = study.best_trial
print(f"  Value: {trial.value}")
print("  Params:")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

# Save best parameters
import json
with open("best_params.json", "w") as f:
    json.dump(trial.params, f, indent=2)

# If you want to use the best parameters for a full training run
best_params = {

    # Sampling parameters (use full budget for final training)
    "frames_per_batch": 2000,
    "total_frames": 200000,  # Full training budget

    # Other parameters from best trial
    "num_epochs": study.best_params["num_epochs"],
    "minibatch_size": study.best_params["minibatch_size"],
    "lr": study.best_params["lr"],
    "max_grad_norm": study.best_params["max_grad_norm"],
    "clip_epsilon": study.best_params["clip_epsilon"],
    "gamma": study.best_params["gamma"],
    "lambda": study.best_params["lambda"],
    "entropy_eps": study.best_params["entropy_eps"],
    "network_depth": study.best_params["network_depth"],
    "network_width": study.best_params["network_width"],
    "activation": study.best_params["activation"],
    "share_parameters_policy": True,
    "share_parameters_critic": True,
    "mappo": study.best_params["mappo"],
}

# Create environment for final training
env = ParallelEnv(n_parallel_envs, make_env)

# Setup device
is_fork = multiprocessing.get_start_method() == "fork"
device = torch.device(0) if torch.cuda.is_available() and not is_fork else torch.device("cpu")

# Setup models with best parameters
policies, agents_policy, critics = setup_models(best_params, env, device)

# Final training run
print("Starting final training with best parameters...")
final_reward = train_model(env, best_params, policies, agents_policy, critics, device)
print(f"Final training complete. Mean reward: {final_reward:.4f}")
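
Since `study.best_params` is an ordinary dictionary keyed by the names passed to `suggest_*`, the full configuration can also be assembled by merging it over the fixed values instead of copying each key by hand. A small sketch, using a hand-written dictionary in place of a real study result (the numbers are placeholders, not tuned values):

```python
# Values that were held fixed during tuning (full budget for the final run)
fixed_params = {
    "frames_per_batch": 2000,
    "total_frames": 200_000,
    "share_parameters_policy": True,
    "share_parameters_critic": True,
}

# Stand-in for study.best_params; placeholder numbers for illustration only
tuned_params = {"num_epochs": 5, "lr": 3e-4, "mappo": True}

# Later dictionaries win on key clashes, so tuned values override fixed ones
merged_params = {**fixed_params, **tuned_params}
print(merged_params)
```

This keeps the final configuration in sync with the search space: adding a new `suggest_*` call to the objective does not require touching the final-training code.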


[I 2025-05-06 05:43:17,385] A new study created in memory with name: no-name-869a675e-56cd-4a62-b44f-b8c49d0fff2c



[I 2025-05-06 05:46:47,811] Trial 0 finished with value: 0.9996051313355565 and parameters: {'num_epochs': 9, 'minibatch_size': 800, 'lr': 7.552929913851593e-05, 'max_grad_norm': 1.9377070642698397, 'clip_epsilon': 0.15359331146445357, 'gamma': 0.9794602657565913, 'lambda': 0.9807239806093849, 'entropy_eps': 0.00011986779432457749, 'network_depth': 1, 'network_width': 128, 'activation': 'Tanh', 'mappo': False}. Best is trial 0 with value: 0.9996051313355565.



[I 2025-05-06 05:50:16,605] Trial 1 finished with value: 0.9937416618689895 and parameters: {'num_epochs': 7, 'minibatch_size': 400, 'lr': 5.41729303067796e-05, 'max_grad_norm': 0.9941180552407667, 'clip_epsilon': 0.1212771008887426, 'gamma': 0.9560376967733367, 'lambda': 0.9725653294185096, 'entropy_eps': 0.0006881180021790127, 'network_depth': 2, 'network_width': 128, 'activation': 'Tanh', 'mappo': False}. Best is trial 0 with value: 0.9996051313355565.


Best trial:
  Value: 0.9996051313355565
  Params:
    num_epochs: 9
    minibatch_size: 800
    lr: 7.552929913851593e-05
    max_grad_norm: 1.9377070642698397
    clip_epsilon: 0.15359331146445357
    gamma: 0.9794602657565913
    lambda: 0.9807239806093849
    entropy_eps: 0.00011986779432457749
    network_depth: 1
    network_width: 128
    activation: Tanh
    mappo: False
Starting final training with best parameters...



Final training complete. Mean reward: 0.9924
