SubprocVecEnv performance compared to gym.vector.async_vector_env #121

Closed
kargarisaac opened this issue Jul 24, 2020 · 9 comments

@kargarisaac

Hi,

I'm trying to use SubprocVecEnv to create a vectorized environment and use it in my own PPO implementation. I have a couple of questions about the performance of this vectorization and about hyperparameters.

I have a 28-core CPU and an RTX 2080Ti GPU. When I use gym.vector.async_vector_env to create vectorized envs, it is 3 to 6 times faster than SubprocVecEnv from stable_baselines3.

In SubProcVecEnv, when I set the number of threads using torch.set_num_threads(28) all the cores are involved but again it is almost two times slower than using torch.set_num_threads(10).

I did all the comparisons with 100 parallel envs. I'm not sure how I should choose the number of envs and the number of torch threads. I think the slower performance compared to gym.vector.async_vector_env comes from the bad hyperparameters I used.
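For reference, this is roughly how I time the two vectorized envs (a simplified sketch: CartPole-v1 stands in for my actual env, and the step counts are placeholders):

```python
import time

import gym
import torch
from stable_baselines3.common.vec_env import SubprocVecEnv

N_ENVS = 100   # number of parallel envs, as in my tests
N_STEPS = 500  # steps to time
ENV_ID = "CartPole-v1"  # placeholder; my real env is different


def make_env():
    return gym.make(ENV_ID)


if __name__ == "__main__":
    torch.set_num_threads(10)  # the value I am unsure about

    # gym's asynchronous vectorized env
    gym_vec = gym.vector.AsyncVectorEnv([make_env for _ in range(N_ENVS)])
    gym_vec.reset()
    start = time.time()
    for _ in range(N_STEPS):
        gym_vec.step(gym_vec.action_space.sample())
    print("gym AsyncVectorEnv:", N_ENVS * N_STEPS / (time.time() - start), "steps/s")
    gym_vec.close()

    # stable-baselines3 SubprocVecEnv
    sb3_vec = SubprocVecEnv([make_env for _ in range(N_ENVS)])
    sb3_vec.reset()
    start = time.time()
    for _ in range(N_STEPS):
        sb3_vec.step([sb3_vec.action_space.sample() for _ in range(N_ENVS)])
    print("SB3 SubprocVecEnv:", N_ENVS * N_STEPS / (time.time() - start), "steps/s")
    sb3_vec.close()
```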

What parameters can I tune to get the best performance? Which parameters are the most important ones?

Thank you

@araffin
Member

araffin commented Jul 24, 2020

Hello,
Did you take a look at our tutorial?
https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3

especially this part: https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/3_multiprocessing.ipynb
Did you try with a DummyVecEnv (and also setting the number of threads to a lower number, see #90)?
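Something along these lines should be enough to check (a quick untested sketch, with CartPole-v1 as a stand-in for your env and arbitrary numbers):

```python
import gym
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Limit PyTorch to a single thread so it does not oversubscribe the CPU
torch.set_num_threads(1)

# DummyVecEnv steps all envs sequentially in the main process:
# no inter-process communication, which is often faster for cheap envs
env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

model = PPO("MlpPolicy", env, n_steps=256, verbose=1)
model.learn(total_timesteps=50_000)
```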

@kargarisaac
Author

Thank you @araffin for the links.

I get the same speed with DummyVecEnv whether I use 10 torch threads or don't set the number of threads at all. Based on the Colab results, it seems that more environments don't necessarily give a higher return. I need to test more; it is very time consuming and comparing is not that easy.

But I cannot understand why a higher number of threads, using all the cores, is slower.

Thank you

@AlessandroZavoli

I see similar behavior with stable-baselines (TensorFlow).
It seems to me that when the environment is quite fast, using parallel environments is not effective.
As an example, on an IBM PowerPC machine, I only get a 2x speedup using 128 cores vs 4 cores.

It might be that most of the time is spent by the "learner" and not by the "worker"/experience-collecting processes, but I might be wrong.

@Miffyli
Collaborator

Miffyli commented Jul 25, 2020

But I cannot understand why a higher number of threads, using all the cores, is slower.

We believe it is PyTorch trying to parallelize every single computation over the given number of threads; if the computations are small (which they are, if you are using a small env with MlpPolicy), all that overhead just slows you down. We have seen ~5x speedups from setting OMP_NUM_THREADS=1 before running the code in some cases with stable-baselines3, and I have also seen similar benefits with numpy on occasion. I think you should set the number of threads to one as well, so each subprocess has only one core in use.
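For example, something like this (or simply launch with `OMP_NUM_THREADS=1 python train.py`, where `train.py` is just a placeholder for your training script):

```python
import os

# Must be set before numpy/torch are imported to take effect
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402
import torch  # noqa: E402

# Alternatively (or additionally), restrict torch explicitly
torch.set_num_threads(1)
```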

But I cannot understand why a higher number of threads, using all the cores, is slower.

The overhead comes from inter-process communication, which is super-slow if the environments are fast and data is small.

Edit: One should bear in mind that multiple envs have other effects than sample speed alone. More environments -> more samples from different states of the environment -> better estimation of the expectations. Generally one should see stabler learning with more environments, but not necessarily more sample-efficient learning.

@araffin
Member

araffin commented Jul 25, 2020

The overhead comes from inter-process communication, which is super-slow if the environments are fast and data is small.
Edit: One should bear in mind that multiple envs have other effects than sample speed alone. More environments -> more samples from different states of the environment -> better estimation of the expectations. Generally one should see stabler learning with more environments, but not necessarily more sample-efficient learning.

We have a tutorial about that ;)
See notebook 3: https://github.com/araffin/rl-tutorial-jnrr19#content

@kargarisaac
Author

@AlessandroZavoli @Miffyli @araffin
I'm training my own PPO on the multi-agent particle env from OpenAI. As I increase the number of envs, without setting the number of threads in PyTorch, I get slower performance (which I think is normal) but also faster learning. I think the better way is to keep the total number of samples (collected by all n envs) fixed and compare across the number of parallel envs. Right now I use a fixed number of rollout steps, so with 2 envs and 2000 rollout steps I get 4000 samples, and with 20 envs it would be 40000 samples, which gives a better learning curve, I think.
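Something like this is what I mean by keeping the sample budget fixed (just a sketch; the total is an arbitrary example value):

```python
# Keep the total number of samples per update fixed while varying the
# number of parallel envs, instead of fixing the rollout length
TOTAL_SAMPLES_PER_UPDATE = 40_000

for n_envs in [2, 4, 8, 16, 20]:
    n_rollout_steps = TOTAL_SAMPLES_PER_UPDATE // n_envs
    print(f"{n_envs} envs x {n_rollout_steps} rollout steps "
          f"= {n_envs * n_rollout_steps} samples per update")
```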

I will put the plots here after it is finished.

@kargarisaac
Author

These are the training curves (sorry for the unclear labels):

gray (the lower curve): 2
orange: 4
dark blue: 8
dark red: 12
light blue: 16
green: 20
pink: 25
gray: 30

[plot: training reward curves for the different numbers of envs listed above]

For a higher number of envs it is very time-consuming, but I see that the 16-env case can reach the same reward value much faster (1h 2m compared to 27m).

@AlessandroZavoli

On my PC I got a similar result, but 8 cores was the optimal choice.

I think we should distinguish between sample-collecting time (which depends on the custom env) and training time (spent by SGD, for example).
I suspect the most efficient tuning involves not only the overall number of steps but also the frequency of the training, etc.
Yet I'm really confused.

@Miffyli
Collaborator

Miffyli commented Jul 27, 2020

Yes, using more environments (with the same n_steps -> more samples) is expected to result in stabler and/or faster learning, sometimes even in terms of env steps. I tried to look for a paper with experiments on this very topic but cannot find it for the life of me. The closest thing I have to share is the OpenAI Dota 2 paper, where in Figure 5 they compare different batch sizes.

The training times should be minuscule and only have a real effect on training speed if you can reach thousands of FPS with your environment (e.g. basic control tasks).
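If you want to check whether your env is in that regime, a rough single-env FPS measurement is enough (a sketch; replace CartPole-v1 with your particle env):

```python
import time

import gym

env = gym.make("CartPole-v1")  # stand-in for your env
env.reset()

n_steps = 10_000
start = time.time()
for _ in range(n_steps):
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()
print("raw single-env FPS:", n_steps / (time.time() - start))
```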

@araffin araffin added the question Further information is requested label Aug 22, 2020
@araffin araffin closed this as completed Oct 11, 2020