Describe the bug

When using Megatron, the Ray worker crashes with:

```
  File "/data/khj/workspace/RL/nemo_rl/models/policy/megatron_policy_worker.py", line 60, in <module>
    from nemo.tron import fault_tolerance
ModuleNotFoundError: No module named 'nemo'
```

Steps/Code to reproduce bug
Step 1

```shell
uv venv
bash tools/build-flash-attn-in-uv-cache.sh
```

Step 2

Since GitHub is unreachable from my network, I manually downloaded flash-attn and updated pyproject.toml in nvidia-nemo/RL accordingly, then ran `uv sync --extra vllm` and `uv sync --extra mcore`. That is my only change; no new features were added.
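For reference, the Step 2 override could look roughly like the sketch below. The `[tool.uv.sources]` entry and the wheel filename/path are assumptions for illustration, not the exact diff:

```toml
# Hypothetical pyproject.toml override: point uv at a locally downloaded
# flash-attn wheel instead of fetching it from GitHub.
[tool.uv.sources]
flash-attn = { path = "third_party/flash_attn-2.7.4.post1-cp312-cp312-linux_x86_64.whl" }
```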
Step 3: ran the test, which passed

```shell
uv run python examples/run_grpo_math.py
```

Step 4: tried Megatron, which crashed
- Updated run_grpo_math.py to use config/grpo_math_1B_megatron.yaml
- Reran:

```shell
uv run python examples/run_grpo_math.py
```
```
  File "/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::IsolatedWorkerInitializer.create_worker() (pid=1125989, ip=10.15.1.122, actor_id=426525ba1cb791498a2fa12001000000, repr=<nemo_rl.distributed.worker_groups.IsolatedWorkerInitializer object at 0x7fd0ddd0b740>)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/khj/workspace/RL/nemo_rl/distributed/worker_groups.py", line 165, in create_worker
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khj/miniconda3/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/data/khj/workspace/RL/nemo_rl/models/policy/megatron_policy_worker.py", line 60, in <module>
    from nemo.tron import fault_tolerance
ModuleNotFoundError: No module named 'nemo'
```

Some trials
- Printed the environment in megatron_policy_worker.py. The Ray worker is not using RL/.venv/bin/python; it uses RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/bin/python instead:

```
(IsolatedWorkerInitializer pid=1125989) !!!Ray Python executable: /data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/bin/python
(IsolatedWorkerInitializer pid=1125989) !!!Ray sys.path: ['/data/khj/workspace/RL', '/data/khj/workspace/RL/examples', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/ray/thirdparty_files', '/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/workers', '/home/khj/miniconda3/lib/python312.zip', '/home/khj/miniconda3/lib/python3.12', '/home/khj/miniconda3/lib/python3.12/lib-dynload', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages', '/data/khj/workspace/RL/3rdparty/Megatron-LM-workspace/Megatron-LM', '/tmp/tmpiueokp9q', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/setuptools/_vendor']
```

- I then tried to specify the environment in ray_action_environment_registry.py, but it still crashes.
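The environment check above can be sketched as a small diagnostic (the helper name is hypothetical; the real debugging simply printed these values at import time in megatron_policy_worker.py so they appear in the Ray actor's log):

```python
import sys

def report_interpreter():
    """Collect the interpreter path and module search path of the
    current process, e.g. a Ray actor importing this module."""
    return {"executable": sys.executable, "sys_path": list(sys.path)}

if __name__ == "__main__":
    # When run inside a Ray worker, this reveals which venv the actor uses.
    info = report_interpreter()
    print("!!!Ray Python executable:", info["executable"])
    print("!!!Ray sys.path:", info["sys_path"])
```

Running this inside the worker is what showed the actor resolving to the per-worker venv under RL/venvs/ rather than RL/.venv.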
Environment overview (please complete the following information)
- Environment location: bare metal (not Docker; installed directly on the host)
- Method of install: from source
Environment details
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version: CentOS
- PyTorch version: 2.7.1
- Python version: 3.12.11
Additional context
- GPU: H800