[Question] Help with understanding PPO hyperparameters (SB2 vs SB3) #1746
Comments
Hello,
Related issues (please have a look): #90 (comment) and #512 (comment)
Here are the hyperparameters used for SB2 PPO:

```python
def MlpPolicy(
    name=name,
    ob_space=obs_space,  # same as SB3
    ac_space=ac_space,  # same as SB3
    hid_size=312,
    num_hid_layers=2,
    num_of_categories=3,
)

pposgd_simple.learn(
    env_creator=env,  # same env as above
    workerseed=seed + 10000 * MPI.COMM_WORLD.Get_rank(),  # this was either 4 or 8 threads
    policy_fn=MlpPolicy,
    max_timesteps=50000000,
    timesteps_per_actorbatch=1536,
    clip_param=0.2,
    entcoeff=0.01,
    optim_epochs=4,
    optim_stepsize=0.001,
    optim_batchsize=512,
    gamma=0.99,
    lam=0.95,
    schedule="linear",
    stochastic=True,
)
```
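One detail worth spelling out when translating these settings (this is my reading of the PPO1 code, so treat the per-worker interpretation as an assumption): `timesteps_per_actorbatch` is collected by *each* MPI worker, so the data gathered per update cycle scales with the number of workers, just as SB3's rollout buffer scales with `n_envs`. A quick sanity check:

```python
# Rough size of the data collected per update cycle
# (assumption: 8 MPI workers in SB2, 8 envs in SB3).
n_workers = 8
sb2_per_update = n_workers * 1536  # timesteps_per_actorbatch per worker

n_envs, n_steps = 8, 1536
sb3_per_update = n_envs * n_steps  # SB3 rollout buffer size

assert sb2_per_update == sb3_per_update == 12288
```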
I see, you are using PPO1 (PPO with MPI). I'm not sure how you translated them to SB3 PPO, some seem quite off (for instance, I'm not sure where you got the `hid_size=312, num_hid_layers=2, num_of_categories=3` from, as it is not a parameter of PPO1 `MlpPolicy`). Your parameters should translate to:

```python
from typing import Callable

import torch.nn as nn

from stable_baselines3 import PPO

hidden_size = 312
policy_kwargs = {
    "log_std_init": 0.0,
    "ortho_init": True,
    "activation_fn": nn.Tanh,
    "net_arch": {
        "pi": [hidden_size, hidden_size],
        "vf": [hidden_size, hidden_size],
    },
    # Note: Adam epsilon is 1e-5 by default for SB3 PPO
}

# IMPORTANT: n_envs influences the number of steps collected
n_envs = 8
# envs = make_vec_env(env_id=make_callable_env(), n_envs=n_envs, vec_env_cls=SubprocVecEnv)

# PPO1 has schedule="linear" as a default
def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes the current learning rate
        depending on remaining progress
    """

    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func

model = PPO(
    policy="MlpPolicy",
    env=envs,
    learning_rate=linear_schedule(0.001),
    n_steps=1536,
    batch_size=512,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
    clip_range=0.2,
    policy_kwargs=policy_kwargs,
    max_grad_norm=100,  # PPO1 doesn't rescale the gradient apparently
)
```

Please note that the number of envs in parallel is an important hyperparameter (see the notebook in our doc).
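For completeness, here is one way the `envs` object above could be created (a sketch; `"CartPole-v1"` is a stand-in for the actual env id, which could also be passed as a callable for a custom env):

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# 8 parallel copies, each in its own process.
# Note: SubprocVecEnv requires the usual `if __name__ == "__main__":`
# guard on platforms that spawn subprocesses.
envs = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
```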
Thank you for working out the hyperparameters! I will try these out over the weekend, as it takes a day to train.
❓ Question
Hi, I am struggling to get PPO to learn effectively on my environment. The reward earned is not smooth and spikes. This is the reward after 7 million steps.
I am using a custom env with these settings:
- `is_done()` returns True.
- `is_truncated()` returns True.
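For reference, here is a minimal sketch of how settings like these might map onto the Gymnasium step API that SB3 expects, where episode-end signals come back from `step()` as `terminated` and `truncated` (all names, spaces, and the reward here are hypothetical stand-ins for the real env):

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyCustomEnv(gym.Env):
    """Hypothetical custom env illustrating the Gymnasium step API."""

    def __init__(self, max_steps: int = 1000):
        self.max_steps = max_steps
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.step_count = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self.step_count += 1
        obs = self.observation_space.sample()
        reward = 0.0  # placeholder reward
        terminated = False  # e.g. what `is_done()` reports in the custom env
        truncated = self.step_count >= self.max_steps  # e.g. `is_truncated()`
        return obs, reward, terminated, truncated, {}
```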
The PPO algorithm is set up with the following parameters:
I have tried to use the Optuna framework (https://optuna.org/) to do some hyperparameter optimization: changing the network architecture size between 64/128/256, as well as trying different values of `n_steps`, `batch_size`, `activation_fn`... but I have not found a set which is suitable (one way to cut the cost of the search is sketched below). Hyperparameter optimization is also incredibly time-consuming: I expect the agent to learn well (where the reward is >50% of the agent's episode length) within 1,000,000 steps, but reaching 1,000,000 steps takes hours, and adequate learning takes ~10,000,000 steps, so with my current hardware such a parameter sweep is not feasible.

I have used SB2 with the same env and it learned smoothly.
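Since the main bottleneck is the cost of each trial, one option is a pruner that stops unpromising trials early. A minimal sketch using Optuna's MedianPruner, where `train_and_evaluate` is a hypothetical helper that trains PPO for a given step budget and returns the mean episode reward:

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Search over a few of the hyperparameters mentioned above.
    n_steps = trial.suggest_categorical("n_steps", [512, 1024, 1536, 2048])
    batch_size = trial.suggest_categorical("batch_size", [256, 512])
    hidden_size = trial.suggest_categorical("hidden_size", [64, 128, 256])
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)

    # Report partial results at increasing budgets so the pruner can
    # kill trials that are clearly behind the median.
    for step, budget in enumerate([250_000, 500_000, 1_000_000]):
        # train_and_evaluate() is a hypothetical helper, not an SB3 API.
        mean_reward = train_and_evaluate(n_steps, batch_size, hidden_size, learning_rate, budget)
        trial.report(mean_reward, step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return mean_reward


study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)
```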
I have had a look at the SB2-to-SB3 migration guide and copied over the old parameters as best I could, but with no success. I also checked out the rl_zoo for inspiration.
I have also checked the TensorBoard logs and nothing seems out of the ordinary.
Is there something that I am missing? Are my hyperparameters poorly chosen? Are there any additional differences between SB2 and SB3? I am stuck changing parameters over and over again, and training takes way too long for me to keep my PC running 24/7.