
[Question] Help with understanding PPO hyperparameters (SB2 vs SB3) #1746

Closed
A-Artemis opened this issue Nov 10, 2023 · 4 comments
Labels: question (Further information is requested)

A-Artemis commented Nov 10, 2023
❓ Question

Hi, I am struggling to get PPO to learn effectively on my environment. The reward curve is not smooth and spikes. This is the reward after 7 million steps:
[Figure: episode reward curve after 7 million training steps]

I am using a custom env with these settings:

action_space = spaces.Box(low=0, high=1, shape=(17,))
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,))
  • The reward per step is between 0 and 1: the most the agent can earn in a single step is 1 and the least is 0, so 35 perfect steps give a total reward of 35.
  • The agent can take a maximum of 885 steps; after that the environment is undefined and is_done() returns True.
  • If the agent goes out of bounds, is_truncated() returns True (a minimal skeleton of this setup is sketched below).
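
For reference, a minimal Gymnasium skeleton matching this description might look like the following. The class name and the _get_obs / _compute_reward / _out_of_bounds helpers are illustrative placeholders, not code from the actual environment:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CustomEnv(gym.Env):
    """Skeleton mirroring the description above (names are illustrative)."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=0, high=1, shape=(17,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(94,), dtype=np.float32)
        self.max_steps = 885
        self.step_count = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.step_count = 0
        return self._get_obs(), {}

    def step(self, action):
        self.step_count += 1
        obs = self._get_obs()
        reward = self._compute_reward(action)  # per-step reward in [0, 1]
        terminated = self.step_count >= self.max_steps  # "is_done()" in the description above
        truncated = self._out_of_bounds()  # "is_truncated()" in the description above
        return obs, reward, terminated, truncated, {}

    def _get_obs(self):
        ...  # placeholder: return a (94,) float32 observation

    def _compute_reward(self, action):
        ...  # placeholder: return a scalar in [0, 1]

    def _out_of_bounds(self):
        ...  # placeholder: True when the agent leaves the allowed region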

The PPO algorithm is set up with the following parameters:

policy_kwargs = {
    "log_std_init": -2,
    "ortho_init": False,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [128, 128],
        "vf": [128, 128],
    },
}
model = PPO(
    policy="MlpPolicy",
    env=envs, # make_vec_env(env_id=make_callable_env(), n_envs=32, vec_env_cls=SubprocVecEnv)
    learning_rate=0.0005,
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=True,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
)
log = configure(folder="./models", format_strings=["stdout", "csv", "tensorboard"])
model.set_logger(log)
model.learn(total_timesteps=50_000_000, progress_bar=True, log_interval=1)

I have tried using the Optuna framework (https://optuna.org/) for hyperparameter optimization, varying the network architecture size between 64/128/256 as well as different values of n_steps, batch_size, activation_fn and so on, but I have not found a suitable set. Hyperparameter optimization is also incredibly time-consuming: I expect the agent to learn well (reward above 50% of the episode length) within 1,000,000 steps, but reaching 1,000,000 steps takes hours and adequate learning takes ~10,000,000 steps, so with my current hardware such a parameter sweep is not feasible.
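
For context, a minimal Optuna objective along these lines could look like the sketch below. The search ranges, the per-trial training budget, and the reuse of make_callable_env from the snippet above are illustrative assumptions, not the exact sweep that was run:

import optuna
from torch import nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import SubprocVecEnv

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space: layer width 64/128/256, n_steps, batch_size, activation.
    width = trial.suggest_categorical("width", [64, 128, 256])
    n_steps = trial.suggest_categorical("n_steps", [512, 1024, 1536, 2048])
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024])
    activation = trial.suggest_categorical("activation", ["tanh", "relu"])

    envs = make_vec_env(make_callable_env(), n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO(
        policy="MlpPolicy",
        env=envs,
        n_steps=n_steps,
        batch_size=batch_size,
        policy_kwargs={
            "activation_fn": nn.Tanh if activation == "tanh" else nn.ReLU,
            "net_arch": {"pi": [width, width], "vf": [width, width]},
        },
    )
    # Short per-trial budget; the full training run is far longer.
    model.learn(total_timesteps=200_000)
    mean_reward, _ = evaluate_policy(model, envs, n_eval_episodes=10)
    envs.close()
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)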

I have used SB2 with the same env and it learned smoothly:
[Figure: SB2 reward curve]

I have looked at the SB2-to-SB3 migration guide and copied over the old parameters as best I could, but without success. I also checked the rl_zoo for inspiration.

I have also checked TensorBoard and nothing seems out of the ordinary.
[Figure: TensorBoard training curves]

Is there something that I am missing? Are my hyperparameters poorly chosen? Is there anything else that differs between SB2 and SB3? I am stuck changing parameters over and over again, and training takes far too long for me to keep my PC running 24/7.


A-Artemis added the question (Further information is requested) label Nov 10, 2023
araffin added the more information needed (Please fill the issue template completely) label Nov 10, 2023
araffin (Member) commented Nov 10, 2023

Hello,
could you please provide the hyperparameters you used for SB2 PPO?

Related issues (please have a look): #90 (comment) and #512 (comment)

A-Artemis (Author) commented
Here are the hyperparameters used for SB2 PPO:

def MlpPolicy(
  name=name,
  ob_space=obs_space, # same as SB3
  ac_space=ac_space, # same as SB3
  hid_size=312,
  num_hid_layers=2,
  num_of_categories=3,
) 

pposgd_simple.learn(
  env_creator=env, # same env as above
  workerseed=seed + 10000 * MPI.COMM_WORLD.Get_rank(), # this was either 4 or 8 threads
  policy_fn=MlpPolicy,
  max_timesteps=50000000,
  timesteps_per_actorbatch=1536,
  clip_param=0.2,
  entcoeff=0.01,
  optim_epochs=4,
  optim_stepsize=0.001,
  optim_batchsize=512,
  gamma=0.99,
  lam=0.95,
  schedule="linear",
  stochastic=True,
)

araffin removed the more information needed (Please fill the issue template completely) label Nov 15, 2023
araffin (Member) commented Nov 15, 2023

I see, you are using PPO1 (PPO with MPI). I'm not sure how you translated its hyperparameters to SB3 PPO; some seem quite off (for instance, optim_stepsize=0.001 in SB2 PPO, but you use learning_rate=0.0005).

I'm not sure where you got the

  hid_size=312,
  num_hid_layers=2,
  num_of_categories=3,

from, as it is not a parameter of PPO1's MlpPolicy.
The same goes for stochastic...

Your parameters should translate to:

from typing import Callable

from torch import nn
from stable_baselines3 import PPO

hidden_size = 312
policy_kwargs = {
    "log_std_init": 0.0,
    "ortho_init": True,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [hidden_size, hidden_size],
        "vf": [hidden_size, hidden_size],
    },
# Note: Adam epsilon is 1e-5 by default for SB3 PPO
}

# IMPORTANT: n_envs influences the number of steps collected
n_envs = 8
# make_vec_env(env_id=make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)

# PPO1 has schedule='linear' as its default
def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes
      current learning rate depending on remaining progress
    """
    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func

model = PPO(
    policy="MlpPolicy",
    env=envs, 
    learning_rate=linear_schedule(0.001),
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
    max_grad_norm=100,  # PPO1 apparently doesn't rescale the gradient
)

Please note that the number of envs run in parallel is an important hyperparameter (see the notebook in our documentation).
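
As a rough illustration of why n_envs matters (a back-of-the-envelope sketch using the numbers from the configurations above, not part of the original reply): in SB3 PPO each update collects n_steps transitions per parallel environment, so changing n_envs from the suggested 8 to the original 32 changes the rollout size, and hence the amount of data seen per update, by a factor of four.

n_steps = 1536     # per environment, per update (same value in both configs)
batch_size = 512
n_epochs = 4

for n_envs in (8, 32):  # 8 ≈ the PPO1 worker count, 32 = the original SB3 setup
    rollout_size = n_steps * n_envs           # transitions collected per update
    minibatches = rollout_size // batch_size  # minibatches per epoch
    grad_steps = minibatches * n_epochs       # gradient steps per rollout
    print(n_envs, rollout_size, minibatches, grad_steps)
# 8  -> 12288 transitions, 24 minibatches, 96 gradient steps per rollout
# 32 -> 49152 transitions, 96 minibatches, 384 gradient steps per rollout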

A-Artemis (Author) commented
Thank you for working out the hyperparameters! I will try these out over the weekend, as it takes a day to train.

@araffin araffin closed this as completed Jan 10, 2024