
[Question] Help with understanding PPO hyperparameters (SB2 vs SB3) #1746

Closed
A-Artemis opened this issue Nov 10, 2023 · 4 comments
Labels: question (Further information is requested)

A-Artemis commented Nov 10, 2023
❓ Question

Hi, I am struggling to get PPO to learn effectively on my environment. The reward curve is not smooth and spikes. This is the reward after 7 million steps:
[Figure: episode reward curve after 7 million training steps]

I am using a custom env with these settings:

action_space = spaces.Box(low=0, high=1, shape=(17,))
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,))
  • The reward per step is between 0 and 1: the most the agent can earn in a single step is 1 and the least is 0, so 35 perfect steps give a total reward of 35.
  • The agent can take a maximum of 885 steps; after that the environment is undefined and is_done() returns True.
  • If the agent goes out of bounds, is_truncated() returns True (a minimal skeleton of this setup is sketched below).
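
For reference, a minimal Gymnasium skeleton matching this description might look like the following. The class name and the _get_obs / _compute_reward / _out_of_bounds helpers are illustrative placeholders, not code from the actual environment:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CustomEnv(gym.Env):
    """Skeleton mirroring the description above (names are illustrative)."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=0, high=1, shape=(17,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,), dtype=np.float32)
        self.max_steps = 885
        self.step_count = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.step_count = 0
        return self._get_obs(), {}

    def step(self, action):
        self.step_count += 1
        obs = self._get_obs()
        reward = self._compute_reward(action)  # per-step reward in [0, 1]
        terminated = self.step_count >= self.max_steps  # "is_done()" in the description above
        truncated = self._out_of_bounds()  # "is_truncated()" in the description above
        return obs, reward, terminated, truncated, {}

    def _get_obs(self):
        ...  # placeholder: return a (94,) float32 observation

    def _compute_reward(self, action):
        ...  # placeholder: return a scalar in [0, 1]

    def _out_of_bounds(self):
        ...  # placeholder: True when the agent leaves the allowed region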

The PPO algorithm is set up with the following parameters:

policy_kwargs = {
    "log_std_init": -2,
    "ortho_init": False,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [128, 128],
        "vf": [128, 128],
    },
}
model = PPO(
    policy="MlpPolicy",
    env=envs, # make_vec_env(env_id=make_callable_env(), n_envs=32, vec_env_cls=SubprocVecEnv)
    learning_rate=0.0005,
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=True,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
)
log = configure(folder="./models", format_strings=["stdout", "csv", "tensorboard"])
model.set_logger(log)
model.learn(total_timesteps=50_000_000, progress_bar=True, log_interval=1)

I have tried using the Optuna framework (https://optuna.org/) for hyperparameter optimization, varying the network architecture size between 64/128/256 as well as different values of n_steps, batch_size, activation_fn and so on, but I have not found a suitable set. Hyperparameter optimization is also incredibly time-consuming: I expect the agent to learn well (reward above 50% of the episode length) within 1,000,000 steps, but reaching 1,000,000 steps takes hours and adequate learning takes ~10,000,000 steps, so with my current hardware such a parameter sweep is not feasible.
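
For context, a minimal Optuna objective along these lines could look like the sketch below. The search ranges, the per-trial training budget, and the reuse of make_callable_env from the snippet above are illustrative assumptions, not the exact sweep that was run:

import optuna
from torch import nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import SubprocVecEnv

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space: layer width 64/128/256, n_steps, batch_size, activation.
    width = trial.suggest_categorical("width", [64, 128, 256])
    n_steps = trial.suggest_categorical("n_steps", [512, 1024, 1536, 2048])
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024])
    activation = trial.suggest_categorical("activation", ["tanh", "relu"])

    envs = make_vec_env(make_callable_env(), n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO(
        policy="MlpPolicy",
        env=envs,
        n_steps=n_steps,
        batch_size=batch_size,
        policy_kwargs={
            "activation_fn": nn.Tanh if activation == "tanh" else nn.ReLU,
            "net_arch": {"pi": [width, width], "vf": [width, width]},
        },
    )
    # Short per-trial budget; the full training run is far longer.
    model.learn(total_timesteps=200_000)
    mean_reward, _ = evaluate_policy(model, envs, n_eval_episodes=10)
    envs.close()
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)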

I have used SB2 with the same env and it learned smoothly:
[Figure: SB2 reward curve]

I have looked at the SB2-to-SB3 migration guide and copied over the old parameters as best I could, but without success. I also checked the rl_zoo for inspiration.

I have also checked TensorBoard and nothing seems out of the ordinary.
[Figure: TensorBoard training curves]

Is there something that I am missing? Are my hyperparameters poorly chosen? Is there anything else that differs between SB2 and SB3? I am stuck changing parameters over and over again, and training takes far too long for me to keep my PC running 24/7.


A-Artemis added the question (Further information is requested) label Nov 10, 2023
araffin added the more information needed (Please fill the issue template completely) label Nov 10, 2023
araffin (Member) commented Nov 10, 2023

Hello,
could you please provide the hyperparameters you used for SB2 PPO?

Related issues (please have a look): #90 (comment) and #512 (comment)

A-Artemis (Author) commented
Here are the hyperparameters used for SB2 PPO:

def MlpPolicy(
  name=name,
  ob_space=obs_space, # same as SB3
  ac_space=ac_space, # same as SB3
  hid_size=312,
  num_hid_layers=2,
  num_of_categories=3,
) 

pposgd_simple.learn(
  env_creator=env, # same env as above
  workerseed=seed + 10000 * MPI.COMM_WORLD.Get_rank(), # this was either 4 or 8 threads
  policy_fn=MlpPolicy,
  max_timesteps=50000000,
  timesteps_per_actorbatch=1536,
  clip_param=0.2,
  entcoeff=0.01,
  optim_epochs=4,
  optim_stepsize=0.001,
  optim_batchsize=512,
  gamma=0.99,
  lam=0.95,
  schedule="linear",
  stochastic=True,
)

araffin removed the more information needed (Please fill the issue template completely) label Nov 15, 2023
araffin (Member) commented Nov 15, 2023

I see, you are using PPO1 (PPO with MPI). I'm not sure how you translated its hyperparameters to SB3 PPO; some seem quite off (for instance, optim_stepsize=0.001 in SB2 PPO, but you use learning_rate=0.0005).

I'm not sure where you got the

  hid_size=312,
  num_hid_layers=2,
  num_of_categories=3,

from, as it is not a parameter of PPO1's MlpPolicy.
The same goes for stochastic...

Your parameters should translate to:

from typing import Callable

from torch import nn
from stable_baselines3 import PPO

hidden_size = 312
policy_kwargs = {
    "log_std_init": 0.0,
    "ortho_init": True,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [hidden_size, hidden_size],
        "vf": [hidden_size, hidden_size],
    },
# Note: Adam epsilon is 1e-5 by default for SB3 PPO
}

# IMPORTANT: n_envs influences the number of steps collected
n_envs = 8
# make_vec_env(env_id=make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)

# PPO1 has schedule='linear' as its default
def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes
      current learning rate depending on remaining progress
    """
    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func

model = PPO(
    policy="MlpPolicy",
    env=envs, 
    learning_rate=linear_schedule(0.001),
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
    max_grad_norm=100,  # PPO1 apparently doesn't rescale the gradient
)

Please note that the number of envs run in parallel is an important hyperparameter (see the notebook in our documentation).
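
As a rough illustration of why n_envs matters (a back-of-the-envelope sketch using the numbers from the configurations above, not part of the original reply): in SB3 PPO each update collects n_steps transitions per parallel environment, so changing n_envs from the suggested 8 to the original 32 changes the rollout size, and hence the amount of data seen per update, by a factor of four.

n_steps = 1536     # per environment, per update (same value in both configs)
batch_size = 512
n_epochs = 4

for n_envs in (8, 32):  # 8 ≈ the PPO1 worker count, 32 = the original SB3 setup
    rollout_size = n_steps * n_envs           # transitions collected per update
    minibatches = rollout_size // batch_size  # minibatches per epoch
    grad_steps = minibatches * n_epochs       # gradient steps per rollout
    print(n_envs, rollout_size, minibatches, grad_steps)
# 8  -> 12288 transitions, 24 minibatches, 96 gradient steps per rollout
# 32 -> 49152 transitions, 96 minibatches, 384 gradient steps per rollout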

A-Artemis (Author) commented
Thank you for working out the hyperparameters! I will try these out over the weekend, as it takes a day to train.

@araffin araffin closed this as completed Jan 10, 2024