No module named 'nemo' #662

@tpoisonooo

Describe the bug

When using the Megatron backend, the Ray worker fails with:

  File "/data/khj/workspace/RL/nemo_rl/models/policy/megatron_policy_worker.py", line 60, in <module>
    from nemo.tron import fault_tolerance
ModuleNotFoundError: No module named 'nemo'

Steps/Code to reproduce bug

Step1

uv venv
bash tools/build-flash-attn-in-uv-cache.sh

Step2

Since GitHub is unreachable from this machine, I manually downloaded flash-attn and updated pyproject.toml in nvidia-nemo/RL to match. Then I ran uv sync --extra vllm and uv sync --extra mcore.

These are all of my changes; no new features were added.

Step3. Run the test (it passes)

uv run python examples/run_grpo_math.py

Step4. Try Megatron (it crashes)

  1. Update run_grpo_math.py to use config/grpo_math_1B_megatron.yaml
  2. Rerun uv run python examples/run_grpo_math.py, which fails with:
  File "/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::IsolatedWorkerInitializer.create_worker() (pid=1125989, ip=10.15.1.122, actor_id=426525ba1cb791498a2fa12001000000, repr=<nemo_rl.distributed.worker_groups.IsolatedWorkerInitializer object at 0x7fd0ddd0b740>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/khj/workspace/RL/nemo_rl/distributed/worker_groups.py", line 165, in create_worker
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khj/miniconda3/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/data/khj/workspace/RL/nemo_rl/models/policy/megatron_policy_worker.py", line 60, in <module>
    from nemo.tron import fault_tolerance
ModuleNotFoundError: No module named 'nemo'

What I tried

  1. Print the interpreter path and sys.path in megatron_policy_worker.py
Image

The Ray worker is not using RL/.venv/bin/python; it uses RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/bin/python instead:

(IsolatedWorkerInitializer pid=1125989) !!!Ray Python executable: /data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/bin/python
(IsolatedWorkerInitializer pid=1125989) !!!Ray sys.path: ['/data/khj/workspace/RL', '/data/khj/workspace/RL/examples', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/ray/thirdparty_files', '/data/khj/workspace/RL/.venv/lib/python3.12/site-packages/ray/_private/workers', '/home/khj/miniconda3/lib/python312.zip', '/home/khj/miniconda3/lib/python3.12', '/home/khj/miniconda3/lib/python3.12/lib-dynload', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages', '/data/khj/workspace/RL/3rdparty/Megatron-LM-workspace/Megatron-LM', '/tmp/tmpiueokp9q', '/data/khj/workspace/RL/venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/setuptools/_vendor']
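The output above can be reproduced with a small diagnostic like the one below. This is only a sketch of the kind of print I added, not code from the repository: the helper name describe_import_environment is hypothetical, and it uses only the standard library. Called near the top of the worker module, it reports which interpreter is running, what sys.path looks like, and whether the top-level package resolves at all.

```python
import importlib.util
import sys


def describe_import_environment(module_name: str) -> dict:
    """Report the interpreter, sys.path, and where (if anywhere) a top-level module resolves.

    Hypothetical helper for debugging; not part of nemo_rl.
    """
    # find_spec returns None when a top-level module cannot be located on sys.path.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "sys_path": list(sys.path),
        "module": module_name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    info = describe_import_environment("nemo")
    print("!!!Ray Python executable:", info["executable"])
    print("!!!Ray sys.path:", info["sys_path"])
    print("!!!nemo found:", info["found"], "origin:", info["origin"])
```

If "found" is False inside the worker but True in RL/.venv, the package is missing from the per-worker virtual environment rather than from the project as a whole.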

So I tried to specify the environment in ray_action_environment_registry.py, but it still crashes.

Image
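To check a specific virtual environment without launching Ray, its interpreter can be probed directly. The sketch below is an assumption about how one might do that, not something NeMo RL provides: can_import is a hypothetical helper, and the interpreter path passed to it would be whichever bin/python needs checking (for example the per-worker one shown in the log above).

```python
import subprocess
import sys


def can_import(python_executable: str, module_name: str) -> bool:
    """Return True if the given interpreter can locate module_name.

    Hypothetical helper: runs the interpreter in a subprocess so the check
    reflects that environment's site-packages, not the caller's.
    """
    # The probe exits with 0 when the module spec is found, 1 otherwise.
    probe = (
        "import importlib.util, sys; "
        "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
    )
    result = subprocess.run(
        [python_executable, "-c", probe, module_name],
        capture_output=True,
        check=False,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Replace sys.executable with the path of the interpreter to check.
    print("nemo importable:", can_import(sys.executable, "nemo"))
```

Running this against both RL/.venv/bin/python and the worker's bin/python would show whether the two environments differ in what they have installed.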

Environment overview

  • Environment location: Bare-metal
  • Method of install: from source
  • Not Docker; installed directly on the host.

Environment details

  • OS version: CentOS
  • PyTorch version: 2.7.1
  • Python version: 3.12.11

Additional context

GPU: H800
