
[Question/Discussion] Comparing stable-baselines3 vs stable-baselines #90

Closed
AlessandroZavoli opened this issue Jul 7, 2020 · 28 comments
Labels
question Further information is requested

Comments

@AlessandroZavoli

Did anybody compare the training speed (or other performance metrics) of SB and SB3 for the implemented algorithms (e.g., PPO)?
Is there a reason to prefer either one for developing a new project?

@m-rph
Contributor

m-rph commented Jul 7, 2020

SB3 is in active development, whereas SB2 (SB) is in maintenance mode. I use SB3 for my projects since it is more modular and less cluttered than SB2, thanks to dynamic computation graphs, the experience the members gained while implementing the algorithms, and a well-thought-out design.

@araffin araffin added the question Further information is requested label Jul 7, 2020
@Miffyli
Collaborator

Miffyli commented Jul 7, 2020

To add to the comment above: some of the methods are, as of writing, slower (at least without tuning, e.g., the number of threads), but we are still in the process of going over them, optimizing for speed, and matching the performance of the SB2 implementations.

@araffin
Member

araffin commented Jul 7, 2020

Hello,

I'm glad that you asked ;)

As mentioned by @partiallytyped, SB3 is now the project actively developed by the maintainers.
It does not have all the features of SB2 (yet), but it is already suitable for most use cases.

Did anybody compare the training speed (or other performance metrics) of SB and SB3 for the implemented algorithms (e.g., PPO)?

We have two related issues for that: #49 #48
The algorithms have been benchmarked recently in a paper for the continuous case and I have already successfully used SAC on real robots.
Because PyTorch uses dynamic graphs, you have to expect a small slowdown (we plan to use the JIT to improve speed in the future, see #57), and you may have to play with torch.set_num_threads() to get the best speed. One exception is DQN, which is significantly faster in SB3 because of the new replay buffer implementation.

Is there a reason to prefer either one for developing a new project?

The main advantage of SB3 is that it was rebuilt (almost) from scratch, trying not to reproduce the errors made in SB2.
That means much clearer code, more test coverage and higher quality standards (notably with the use of typing).
Unless you need to use RNNs, I would highly recommend SB3.

If you change the internals, you can expect some changes (they will be documented anyway) until v1.0 is released (see issue #1 and code review #17).
If you only use the "user API" (without changing the internals), then not much should change, and I would highly recommend the RL Zoo, which should cover most needs (and is kept up to date with the best practices for using SB3).

It is also in the roadmap to document the differences between SB2 and SB3.

Last thing: for SB3 vs. other PyTorch libraries, see #20.
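For context, a minimal sketch of the "user API" mentioned above (CartPole-v1 is just an assumed example environment):

from stable_baselines3 import PPO

# Train a PPO agent through the high-level API; no internals are touched.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_cartpole")

# Reload the trained policy later
model = PPO.load("ppo_cartpole")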

@RezaSwe

RezaSwe commented Jul 9, 2020

Hello,

I used SB2 for training with SAC and now switched to SB3. The SB3 implementation is currently around 2.5x slower than SB2 with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?

Many thanks,
Reza

@m-rph
Contributor

m-rph commented Jul 9, 2020 via email

@RezaSwe

RezaSwe commented Jul 9, 2020

Hi PartiallyTyped,

Thanks for your quick reply!

Do you have any idea where the best place is to call torch.set_num_threads()? I would really appreciate it if you could comment on that.

BR,
Reza

@araffin
Member

araffin commented Jul 9, 2020

Do you have any idea where the best place is to call torch.set_num_threads()? I would really appreciate it if you could comment on that.

Before creating the model; or, if you are using the RL Zoo, you can pass it as an argument to the script (--num-threads).

I used SB2 for training with SAC and now switched to SB3. The SB3 implementation is currently around 2.5x slower than SB2 with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?

Is it CPU only?
If so, you should play a bit with th.set_num_threads().
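For illustration, a minimal sketch of that advice (assuming a CPU-only machine and the Pendulum-v0 environment):

import torch as th
from stable_baselines3 import SAC

# Limit PyTorch's intra-op threads *before* the model is created;
# the best value depends on your CPU (try 1-4).
th.set_num_threads(2)

model = SAC("MlpPolicy", "Pendulum-v0", verbose=1)
model.learn(total_timesteps=10_000)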

@RezaSwe

RezaSwe commented Jul 9, 2020

Thanks araffin for your reply!

Since I am using Gym, not the Zoo, I tried calling th.set_num_threads() before creating the model. I got this error message:

"MemoryError: Unable to allocate 2.12 GiB for an array with shape (1000000, 1, 568) and data type float32"

Does this mean I do not have enough memory available? I tried different values, yet I always got the same error message.

@araffin
Member

araffin commented Jul 10, 2020

Since I am using Gym, not the Zoo

Gym and the RL Zoo are two completely different things (cf. the doc). You can use the RL Zoo to train agents on Gym environments.

Does this mean I do not have enough memory available?

Yes, you don't have enough RAM. But this is off-topic.

@araffin
Member

araffin commented Jul 11, 2020

Hello,

I used SB2 for training with SAC and now switched to SB3. The SB3 implementation is currently around 2.5x slower than SB2 with almost the same set of (hyper)parameters. Is this something we should expect, or is something wrong in my environment and/or code?

Many thanks,
Reza

Thinking about that again, are you sure the network is the same? The default MLP policy of SB3 for SAC is bigger, to match the original paper.
All those differences will be documented in the near future (see the roadmap, #1).

@RezaSwe

RezaSwe commented Jul 11, 2020

Hi araffin,

Thanks for asking.

I am changing the default network architecture to get similar nets in SB2 and SB3. Basically, in SB3 I use
net_arch=[700, 700, 250]
and in SB2 I use
layers=[700, 700, 250].
Does this lead to the same net, as I am assuming?

Best regards,
Reza
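For reference, a minimal sketch of how those two configurations map onto each other (Pendulum-v0 is an assumed placeholder environment); in SB3 the layer sizes are passed through policy_kwargs:

# SB3: custom actor/critic layer sizes via policy_kwargs
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v0",
    policy_kwargs=dict(net_arch=[700, 700, 250]),
)

# SB2 (TensorFlow) counterpart, shown for comparison only:
# from stable_baselines import SAC
# model = SAC("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(layers=[700, 700, 250]))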

@jarlva

jarlva commented Jul 15, 2020

I just did a comparison between SB1 and SB3. Same PC, same environment and callback. The only difference is that with SB3 I'm (finally) using my CUDA GPU (1050 Ti). Well, SB1 without the GPU gives ~900 FPS while SB3 with the GPU gives ~190. There should definitely be some low-hanging fruit someplace.

Just wanted to mention Sample Factory (https://venturebeat.com/2020/06/24/intels-sample-factory-speeds-up-reinforcement-learning-training-on-a-single-pc): I get ~3500 FPS on the same hardware as above (a 2-core, 6-year-old PC), and managed to get a lot more on a multi-core server.

@Miffyli
Collaborator

Miffyli commented Jul 15, 2020

@jarlva

Yup, SB3 is still semi-unoptimized, and the first goal is to achieve the same performance as SB2. One quick trick you could try is setting the environment variable OMP_NUM_THREADS=1 (or the same via PyTorch), which in some cases drastically increases the speed.

I'd like to highlight that SB will never achieve the same speeds as Sample Factory, as that one is specifically designed for high frames-per-second throughput and implements algorithms designed for that (i.e. IMPALA). Stable-Baselines focuses on synchronous execution.
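A minimal sketch of that OMP_NUM_THREADS trick (an assumption: the variable has to be set before PyTorch initializes its OpenMP thread pool, so either export it in the shell or set it at the very top of the script):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must happen before importing torch

import torch as th
th.set_num_threads(1)  # the equivalent knob exposed by PyTorch itself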

@m-rph
Contributor

m-rph commented Jul 15, 2020

@jarlva

Because SB3 is built using PyTorch, there is some expected and unavoidable slowdown simply due to Python. We discussed using PyTorch's JIT a bit here: #57.

If you'd like to get your hands dirty, you could compile at least some parts, like the replay buffers, with Numba's JIT, but this isn't supported.

I also keep avoiding #93 ;)

@jarlva

jarlva commented Jul 15, 2020

Thanks @partiallytyped. Just to clarify, Sample Factory is also using PyTorch. I think @Miffyli is correct.
Thanks again for everyone's responses!

@m-rph
Contributor

m-rph commented Jul 15, 2020

I was referring to the relative performance between identical/same-scope PyTorch and TensorFlow implementations. @Miffyli is indeed correct.

@araffin
Member

araffin commented Jul 16, 2020

The effect of th.set_num_threads() and #106 on a simple example (SAC on Pendulum-v0 with a small network), on CPU only:
[Screenshot: FPS comparison of the runs described below]

The first group (around 100 FPS) is with num_threads=2 and the second one (around 50 FPS) is the default (I have 8 cores).
That is a 2x boost.
And each time, the run with #106 is 10% faster, except when num_threads=1 (not shown here).

@m-rph
Contributor

m-rph commented Jul 17, 2020

Relevant: I am getting some rather weird performance from DQN; it seems to reach 0 FPS (this was with num_threads=1 and the old polyak update). When using an ensemble of 10 estimators I got much better performance, and I can't pinpoint the issue.

[Screenshot: DQN FPS plot]

@araffin
Member

araffin commented Jul 17, 2020

What do you mean by n_estimators?

@m-rph
Contributor

m-rph commented Jul 17, 2020

In the policy, instead of having a single Q-network, I have n_estimators identical Q-networks whose estimates are averaged.
Note: this was running on GPU and the environment was LunarLander.
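Since this ensemble is a custom modification rather than SB3 API, here is a hypothetical sketch of the idea as described (n_estimators identical Q-networks whose outputs are averaged):

import torch as th
import torch.nn as nn

class EnsembleQNetwork(nn.Module):
    """Hypothetical averaged ensemble of identical Q-networks."""

    def __init__(self, obs_dim: int, n_actions: int, n_estimators: int = 10):
        super().__init__()
        self.estimators = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
                for _ in range(n_estimators)
            ]
        )

    def forward(self, obs: th.Tensor) -> th.Tensor:
        # Stack per-estimator Q-values and average over the ensemble dimension
        return th.stack([q(obs) for q in self.estimators], dim=0).mean(dim=0)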

@araffin
Member

araffin commented Jul 17, 2020

Ah, OK, please move this discussion to #49 then.

@araffin
Member

araffin commented Jul 30, 2020

As mentioned here, #122 (comment):
you should consider upgrading PyTorch ;)
There was a huge gain (20% faster) in the latest release. The gap is closed when setting the number of threads manually.

EDIT: apparently on CPU only

@PierreExeter
Contributor

PierreExeter commented Dec 9, 2020

On a related note, I migrated from SB2 to SB3 and the training is taking 24 times longer (same custom environment + PPO + default hyperparameters + 100,000 time steps + 8 parallel environments)... I did play with the --num-threads argument in the train.py script from the RL Zoo; I found the most efficient number to be 6, but it only reduced the training time by 3%.
Any suggestions would be welcome; otherwise I might just switch back to SB2 until I find a better solution.

I'm using PyTorch with CUDA support.

@araffin
Member

araffin commented Dec 9, 2020

Please read our migration guide (if you have not already): the default hyperparameters are not the same (tuned for Atari in SB2 vs. tuned for continuous actions in SB3).
I'm surprised by the slowdown... I would appreciate it if you could provide a minimal example to reproduce it.

EDIT: I did two quick tests using the Zoo (SB2 and SB3) with 8 envs and two environments (CartPole-v1, Breakout); SB3 was ~2x slower on CartPole but 1.2x faster on Breakout.
This was CPU only.

@PierreExeter
Contributor

Thanks for the suggestions. I couldn't reproduce the 24x slowdown, but I prepared a minimal example where the training takes 4x longer on my custom environment (and 2.6x longer on CartPole-v1). The instructions are in the README, but let me know if you can't reproduce it.
This is not too bad of a slowdown; I must have done something wrong previously.

@araffin
Member

araffin commented Dec 10, 2020

I couldn't reproduce the 24x slowdown, but I prepared a minimal example where the training takes 4x longer on my custom environment (and 2.6x longer on CartPole-v1)

Thanks for setting that up =)
After a quick check, it seems that you are using the default hyperparameters, which differ between SB2 PPO2 and SB3 PPO (cf. the migration guide: https://stable-baselines3.readthedocs.io/en/master/guide/migration.html#ppo).
If you want to have the same hyperparameters in SB3, you would need to do:

widowx_reacher-v1:
  n_timesteps: 100000
  normalize: true
  policy: 'MlpPolicy'
  n_envs: 8
  n_steps: 128
  n_epochs: 4
  batch_size: 256
  learning_rate: !!float 2.5e-4
  clip_range: 0.2
  vf_coef: 0.5
  ent_coef: 0.01

I would also advise you to deactivate the value-function clipping in SB2.

Note that SB2's n_minibatches leads to a batch size that depends on the number of envs, which is no longer the case in SB3.
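Outside the Zoo, a rough sketch of passing the same hyperparameters directly to SB3 PPO (widowx_reacher-v1 is the custom env from the YAML above and has to be registered with Gym; import paths assume a recent SB3 version):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 8 parallel envs + observation/reward normalization (normalize: true)
env = VecNormalize(make_vec_env("widowx_reacher-v1", n_envs=8))

model = PPO(
    "MlpPolicy",
    env,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.2,
    vf_coef=0.5,
    ent_coef=0.01,
)
model.learn(total_timesteps=100_000)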

EDIT: @PierreExeter I ran your env with the same hyperparameters and got 39s (SB3) vs 39s (SB2), so the same time (CPU only, with 1 thread).

@PierreExeter
Contributor

You're right, it was an issue with the hyperparameters. I also got a training time of 36s when using the SB2 default hyperparameters.
I optimised the hyperparameters with Optuna and this gave me a training time of 18 minutes... I didn't realise that the hyperparameters could have such a strong effect on the training time. Thanks a lot for your useful inputs.

@araffin
Member

araffin commented Mar 11, 2022

For the latest comparison, please take a look at #122 (comment).
