[BUG] CUDA ERROR: initialization error in experiment grid #314

Closed
3 tasks done
Gaiejj opened this issue Apr 9, 2024 · 0 comments · Fixed by #315
Labels
bug Something isn't working


Gaiejj commented Apr 9, 2024

Required prerequisites

What version of OmniSafe are you using?

0.5.0

System information

3.8.19 (default, Mar 20 2024, 19:58:24)
[GCC 11.2.0] linux
0.5.0

Problem description

Running parallel experiments with experiment_grid fails on certain torch versions or CUDA driver versions with the message: CUDA ERROR: initialization error. We believe this is a bug in CUDA that surfaces during parallel training. Based on the error message and related community issues, our tentative fix is:

  1. Open omnisafe/common/experiment_grid.py.
  2. Add the import: import multiprocessing as mp
  3. On line 445, change pool = Pool(max_workers=num_pool) to pool = Pool(max_workers=num_pool, mp_context=mp.get_context('spawn')).
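
The change in step 3 can be sketched in isolation as follows. This is a minimal illustration, not the OmniSafe code: it assumes Pool is an alias for concurrent.futures.ProcessPoolExecutor (the max_workers and mp_context arguments match that class), and the helper make_spawn_pool and the use of math.factorial as a stand-in for a training job are hypothetical. PyTorch's documentation notes that the CUDA runtime does not support the fork start method, which is a plausible reason the spawn context avoids the initialization error.

```python
import math
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor as Pool  # assumed alias


def make_spawn_pool(num_pool: int) -> Pool:
    """Build a process pool whose workers are started with 'spawn'.

    Hypothetical helper: it only wraps the one-line change from step 3.
    """
    # 'spawn' starts each worker in a fresh interpreter instead of forking
    # the parent, so no parent process state is inherited by the worker.
    ctx = mp.get_context('spawn')
    return Pool(max_workers=num_pool, mp_context=ctx)


if __name__ == '__main__':
    # math.factorial stands in for a real training function here.
    with make_spawn_pool(2) as pool:
        results = list(pool.map(math.factorial, [3, 4, 5]))
    print(results)
```

With spawn, anything submitted to the pool must be importable by the worker processes, and the script's entry point should sit under an if __name__ == '__main__' guard, as above.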

If you hit a similar problem when using the OmniSafe experiment grid, you can try this fix. If you have a better solution, please leave a comment on this issue!

Reference:

Reproducible example code

Command lines:

python run_experiment_grid.py

Traceback

No response

Expected behavior

No response

Additional context

No response

@Gaiejj Gaiejj added the bug Something isn't working label Apr 9, 2024