Question regarding SubprocVectorEnv failure #3

Closed
liuzuxin opened this issue Jul 5, 2023 · 7 comments

@liuzuxin
Contributor

liuzuxin commented Jul 5, 2023

Hi, when I run the evaluation script on a headless machine (a cloud server) with an A10G GPU, I occasionally hit the following error:

Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 222, in _worker
    env = env_fn_wrapper.data()
  File "peft/evaluate.py", line 35, in <lambda>
    [lambda: OffScreenRenderEnv(**env_args) for _ in range(env_num)])
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/env_wrapper.py", line 161, in __init__
    super().__init__(**kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/env_wrapper.py", line 56, in __init__
    self.env = TASK_MAPPING[self.problem_name](
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/problems/libero_tabletop_manipulation.py", line 40, in __init__
    super().__init__(bddl_file_name, *args, **kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/bddl_base_domain.py", line 135, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/manipulation/manipulation_env.py", line 162, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/robot_env.py", line 214, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/base.py", line 143, in __init__
    self._reset_internal()
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/bddl_base_domain.py", line 735, in _reset_internal
    super()._reset_internal()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/robot_env.py", line 510, in _reset_internal
    super()._reset_internal()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/base.py", line 296, in _reset_internal
    render_context = MjRenderContextOffscreen(self.sim, device_id=self.render_gpu_device_id)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 210, in __init__
    super().__init__(sim, offscreen=True, device_id=device_id, max_width=max_width, max_height=max_height)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 78, in __init__
    self.gl_ctx = GLContext(max_width=max_width, max_height=max_height, device_id=self.device_id)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 136, in __init__
    self._context = EGL.eglCreateContext(EGL_DISPLAY, config, EGL.EGL_NO_CONTEXT, None)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/OpenGL/platform/baseplatform.py", line 415, in __call__
    return self( *args, **named )
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/OpenGL/error.py", line 230, in glCheckError
    raise self._errorClass(
OpenGL.raw.EGL._errors.EGLError: EGLError(
        err = EGL_BAD_ALLOC,
        baseOperation = eglCreateContext,
        cArguments = (
                <OpenGL._opaque.EGLDisplay_pointer object at 0x7eff68f41640>,
                <OpenGL._opaque.EGLConfig_pointer object at 0x7eff68f41540>,
                <OpenGL._opaque.EGLContext_pointer object at 0x7eff8a264b40>,
                None,
        ),
        result = <OpenGL._opaque.EGLContext_pointer object at 0x7eff68f41a40>
)
Exception ignored in: <function EGLGLContext.__del__ at 0x7eff8a1461f0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 155, in __del__
    self.free()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 146, in free
    if self._context:
AttributeError: 'EGLGLContext' object has no attribute '_context'
Exception ignored in: <function MjRenderContext.__del__ at 0x7eff8a1463a0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 198, in __del__
    self.con.free()
AttributeError: 'MjRenderContextOffscreen' object has no attribute 'con'
Traceback (most recent call last):
  File "lifelong/evaluate.py", line 239, in main
    env.reset()
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 702, in reset
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 702, in <listcomp>
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 428, in recv
    result = self.parent_remote.recv()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

I have seen this error before when CUDA memory was insufficient, but it now occurs even with enough free memory, and I have no idea how to solve it.
The evaluation script does run with DummyVectorEnv, but that is too slow.
Have you encountered similar issues? Any hints would be appreciated. Thanks in advance.
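
A note on reading the output above: it contains two tracebacks. The first comes from the worker process, which dies inside eglCreateContext with EGL_BAD_ALLOC while building its environment. The second comes from the parent, whose recv() fails because the worker's end of the pipe is already gone, so the ConnectionResetError is a symptom rather than the root cause. The sketch below is a generic illustration with the standard library only, not LIBERO code; the helper name recv_from_worker is hypothetical.

```python
from multiprocessing import Pipe


def recv_from_worker(conn, worker_id):
    """Receive one message, turning a dead-worker error into a clearer one."""
    try:
        return conn.recv()
    except (EOFError, ConnectionResetError) as exc:
        # Both exceptions mean the other end of the pipe is gone, which
        # usually means the worker process exited before replying.
        raise RuntimeError(
            f"worker {worker_id} exited before replying; "
            "check the worker's own traceback for the root cause"
        ) from exc


# Simulate a worker that dies: close its end of the pipe without sending.
parent_conn, child_conn = Pipe()
child_conn.close()
try:
    recv_from_worker(parent_conn, worker_id=0)
except RuntimeError as err:
    message = str(err)
```

Wrapping the parent-side recv() this way does not fix the failure, but it points the reader at the worker traceback first.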

@HeegerGao

Hi @liuzuxin, thanks for asking! My guess is that this problem is related to the CUDA or driver version. Could you please share more information about your machine (platform, NVIDIA driver version, CUDA version)? I did not encounter this problem on Ubuntu 20.04 with an A100, NVIDIA driver 515.105.01, and CUDA 11.7.
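
For anyone else reporting this error, a minimal sketch for gathering the requested details in one go. It assumes nvidia-smi is on the PATH and degrades gracefully when it is not; the function name collect_gpu_report is hypothetical. The CUDA version supported by the driver is not queried here; it appears in the header of plain nvidia-smi output.

```python
import platform
import shutil
import subprocess


def collect_gpu_report():
    """Return OS, Python, and NVIDIA driver details as a dict of strings."""
    report = {
        "platform": platform.platform(),
        "python": platform.python_version(),
    }
    if shutil.which("nvidia-smi") is None:
        report["gpus"] = "nvidia-smi not found"
        return report
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        report["gpus"] = f"nvidia-smi failed: {exc}"
        return report
    # One line per GPU, e.g. "<name>, <driver version>".
    report["gpus"] = result.stdout.strip() or result.stderr.strip()
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_report().items():
        print(f"{key}: {value}")
```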

@liuzuxin
Contributor Author

liuzuxin commented Jul 6, 2023

Thanks for your reply. Sure, mine is Ubuntu 20.04 with NVIDIA driver 525.125.06 and CUDA 12.0. I tried downgrading the driver to 470.199.02 and CUDA to 11.4, and with that setup SubprocVectorEnv works.
The strangest part is that I had been running the evaluation script successfully with the 525 driver for the past week, and it suddenly broke without any package upgrades. In other words, after one successful run with SubprocVectorEnv, the same command failed on the next run. So I am curious what the root cause might be.

@Cranial-XIX
Collaborator

Hi zuxin,

Thanks for asking. We have also noticed this issue and are investigating it. In the meantime, a quick workaround is to save the model to disk and then launch multiple evaluation scripts, each with a single environment. This increases the GPU memory requirement but can make the evaluation faster.
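
The workaround above could be sketched as follows. Everything specific here is an assumption for illustration: the script name evaluate_single.py and the --checkpoint and --task-id flags are hypothetical and are not part of LIBERO; substitute whatever single-environment evaluation entry point you have.

```python
import subprocess
import sys


def build_eval_commands(checkpoint, task_ids, script="evaluate_single.py"):
    """Build one command per task; each process creates a single environment.

    The script name and flags are hypothetical placeholders.
    """
    return [
        [sys.executable, script, "--checkpoint", checkpoint, "--task-id", str(t)]
        for t in task_ids
    ]


def run_in_parallel(commands, max_parallel=4):
    """Run commands in batches of max_parallel; return exit codes in order."""
    exit_codes = []
    for start in range(0, len(commands), max_parallel):
        batch = [subprocess.Popen(cmd) for cmd in commands[start:start + max_parallel]]
        # Wait for the whole batch before starting the next one, so at most
        # max_parallel processes hold GPU memory at the same time.
        exit_codes.extend(proc.wait() for proc in batch)
    return exit_codes
```

Each process loads its own copy of the model, which is where the extra GPU memory goes; max_parallel is the knob for trading memory against wall-clock time.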

@MMittenbuehler

Hi,
Is a solution available that does not involve downgrading the Nvidia driver and cuda version? I still encounter this problem with driver 525.60.13 and cuda 12.0. Thanks!

@lihenglin

I resolved the problem by adding these two lines to venv.py:

# Requires "import multiprocessing" earlier in the file.
if multiprocessing.get_start_method(allow_none=True) != "spawn":
    multiprocessing.set_start_method("spawn", force=True)
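
A self-contained sketch of the same idea, for readers unsure how or where to apply it. The thread does not confirm why the change helps; one plausible reading is that workers created with "fork" (the Linux default in Python 3.8) inherit GPU or EGL state from the parent, whereas "spawn" starts each worker in a fresh interpreter. The helper name ensure_spawn_start_method is hypothetical.

```python
import multiprocessing


def ensure_spawn_start_method():
    """Make "spawn" the process-wide start method; return the active method."""
    if multiprocessing.get_start_method(allow_none=True) != "spawn":
        multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


# Alternative that leaves the global setting alone: a spawn-only context
# whose Process and Pipe can be used in place of the module-level ones.
spawn_ctx = multiprocessing.get_context("spawn")

if __name__ == "__main__":
    print(ensure_spawn_start_method())
    print(spawn_ctx.get_start_method())
```

The call has to run before any worker process is created, so module level just after the imports is the natural place. Two general consequences of "spawn" are worth keeping in mind: the launching script needs an `if __name__ == "__main__":` guard because each worker re-imports the main module, and anything handed to a worker must be picklable.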

@JamesSand

I encountered the same issue, and this solution works for me. Thank you very much!!!

@74284853

May I ask where in venv.py I should add these lines? @lihenglin @JamesSand
