
Bypass out-of-sync Gym registry in SubprocVecEnv by resolving EnvSpec #160

Merged: 10 commits merged into master on Jan 22, 2020

Conversation

shwang (Member) commented Jan 18, 2020

We ran into an unusual problem when using imitation.scripts.parallel to parallelize training of our custom ant environment. The subprocesses created by imitation.util.make_vec_env reported that our custom environments weren't registered, but only when we ran them via imitation.scripts.parallel. During debugging, we found that it didn't help to forcibly re-register the environments in the scopes on the stack trace leading up to the inner worker function of SubprocVecEnv. (Even forcibly re-registering the environments inside make_vec_env itself did not suffice.)

The most likely explanation is that the Python forkserver used by SubprocVecEnv was somehow prematurely initialized by Ray, so that new processes forked from the forkserver didn't include our custom environments in the Gym registry. (Edit: we are actually quite unsure whether this is the cause of the bug; it is just a hypothesis.)

h/t @AdamGleave for the idea of resolving the custom environment's EnvSpec before initializing the SubprocVecEnv.
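
For concreteness, here is a minimal sketch of that approach (not the code in this PR; it assumes the pre-0.21 Gym API, where gym.spec and EnvSpec.make are available, and stable_baselines' SubprocVecEnv):

```python
# Sketch only: resolve the EnvSpec in the parent process, where the custom
# environment is registered, and build the env from the spec inside each
# worker, so the worker's (possibly out-of-sync) Gym registry is never consulted.
import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def make_vec_env_sketch(env_name: str, n_envs: int = 4) -> SubprocVecEnv:
    spec = gym.spec(env_name)  # raises here, in the parent, if unregistered

    def make_env():
        # EnvSpec.make constructs the env from its entry point directly,
        # without looking up `env_name` in this process's registry.
        return spec.make()

    return SubprocVecEnv([make_env] * n_envs, start_method="forkserver")
```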

shwang (Member Author) commented Jan 18, 2020

Confirmed that the new test fails before this PR.

shwang changed the title from "Bypass out-of-sync Gym registry in SubprocVecEnv by passing EnvSpec" to "Bypass out-of-sync Gym registry in SubprocVecEnv by resolving EnvSpec" on Jan 18, 2020.
shwang force-pushed the fix_ray_subproc_gym_registry branch from 7431fb4 to f527f0f on Jan 18, 2020, 04:42.
AdamGleave (Member) commented Jan 18, 2020

> The most likely explanation is that the Python forkserver used by SubprocVecEnv was somehow prematurely initialized by Ray, so that new processes forked from the forkserver didn't include our custom environments in the Gym registry.

Just to flag that I'm still not sure about this particular explanation, although I agree with the overall conclusion that it's a bad interaction between forkserver and Ray. In particular, my (imperfect) understanding of forkserver is that it's a fresh process, as with the spawn start method, but for efficiency it only has to be created once; fork() inside the forkserver can then cheaply create new (clean) Python processes. Some evidence in favour of this is that using spawn also failed, even though those processes are created just-in-time.

Admittedly this doesn't explain why it works in Ray local mode. My guess is that the key difference is that when using Ray, the worker is executing a pickled function rather than native code. This could plausibly change how the helper method is pickled and sent to the SubprocVecEnv worker, although I've not confirmed this.

(We probably don't need to get to the bottom of this if the EnvSpec thing works, but I wanted to put this in writing in case we need to debug a similar issue in the future.)

Also, if you haven't seen it, https://codewithoutrules.com/2018/09/04/python-multiprocessing/ is a fun discussion of issues with fork (not directly related to this, but it explains why we're using forkserver in the first place).

AdamGleave (Member) left a comment

Not sure about the explanation in the comment. Otherwise LGTM.

# Previously, we directly called `gym.make(env_name)`.
#
# That direct approach was problematic (especially in combination with Ray)
# because the forkserver from which subprocesses are forked might have been
Member commented:
Are we sure this is what's going on? It's fine as a hypothesis but I don't want to immortalize in a comment something we're not sure about.

Member replied:
I'm 80% confident the forkserver has very minimal state, since:

  1. The Python multiprocessing docs say "No unnecessary resources are inherited".
  2. The code seems to use spawn to start the forkserver, which starts a new Python process.
  3. This blog states: "Note that children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust this through the multiprocess API through the set_forkserver_preload() method."

With that said, there does seem to be some logic to preload modules (by default, I think, the __main__ module, i.e. the entrypoint of the script). So IIUC, forkserver is intended to execute the code of the parent process (i.e. all imports), but should not execute an if __name__ == '__main__'-guarded block (the Python docs explicitly state the need for this guard).

Seems plausible that starting forkserver in a Ray worker messes with the autodetection of what to import.
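
To illustrate the mechanics being discussed, here is a hypothetical sketch (not code from this repository; my_project.envs is a made-up module name) of how forkserver preloading and the __main__ guard interact:

```python
# Hypothetical sketch of forkserver preloading and the __main__ guard.
# Module-level imports re-run when the forkserver imports preloaded modules;
# the guarded block below runs only in the original parent process.
import multiprocessing as mp

# import my_project.envs  # (hypothetical) would register custom Gym envs at import time

def worker(name):
    print(f"hello from {name}")

if __name__ == "__main__":
    mp.set_start_method("forkserver")
    # Ask the forkserver to import this (hypothetical) module, so that children
    # forked from it would already have the custom envs registered.
    mp.set_forkserver_preload(["my_project.envs"])
    p = mp.Process(target=worker, args=("child",))
    p.start()
    p.join()
```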

Member:

If we can't get to the bottom of this, it may be better to leave the comment vague and cite this PR.

shwang (Member Author) commented Jan 21, 2020:
Agreed that we should be vague about the cause in the comments and note the PR; I've made these changes.

codecov bot commented Jan 21, 2020
Codecov Report

No coverage uploaded for pull request base (master@365c177).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master     #160   +/-   ##
=========================================
  Coverage          ?   87.64%           
=========================================
  Files             ?       64           
  Lines             ?     4542           
  Branches          ?        0           
=========================================
  Hits              ?     3981           
  Misses            ?      561           
  Partials          ?        0
Impacted Files                                       Coverage Δ
src/imitation/algorithms/adversarial.py             88.81% <100%> (ø)
src/imitation/util/util.py                           100% <100%> (ø)
src/imitation/scripts/config/train_adversarial.py   71.96% <100%> (ø)
src/imitation/scripts/train_adversarial.py          98% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

AdamGleave (Member) left a comment
Two minor suggestions: expanding documentation and removing unused argument

@@ -49,15 +49,30 @@ def make_vec_env(env_name: str,
max_episode_steps: If specified, wraps VecEnv in TimeLimit wrapper with
this episode length before returning.
Member commented:
Suggested change:
- this episode length before returning.
+ this episode length before returning. Otherwise, defaults to `max_episode_steps` for `env_name` in the Gym registry.

shwang (Member Author) commented Jan 22, 2020:
I expanded this comment a bit more in 902fe96. I wanted to note that taking max_episode_steps from the Gym registry is the default behavior of gym.make.
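
For context, a quick sketch of the default behavior being referenced (assuming a pre-0.21 Gym release):

```python
# gym.make wraps the env in a TimeLimit using the max_episode_steps recorded
# in the Gym registry; make_vec_env's `max_episode_steps` argument overrides
# that registry default.
import gym

env = gym.make("CartPole-v1")
print(env.spec.max_episode_steps)  # 500, from the registry entry for CartPole-v1
```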

src/imitation/util/util.py (review thread outdated, resolved)
AdamGleave (Member) left a comment
LGTM, can merge once CI passes.
