Description
Summary
The [[git_repos]] mechanism with mount_as creates a fragile setup where scripts from one version can call core libraries from another version, leading to hard-to-debug errors.
Problem
When using [[git_repos]] with mount_as, a partial override occurs:
- Container has built-in package (e.g., Megatron-Bridge v0.4.0rc0 at /opt/Megatron-Bridge)
- External git clone (e.g., v0.3.1) provides entry scripts via PYTHONPATH
- Scripts from v0.3.1 import core modules from container's v0.4.0rc0
This causes:
- ModuleNotFoundError - Different module structure between versions
- API mismatches - Functions/parameters differ between versions
- Silent failures - No validation that git repo version is compatible with container
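One way to confirm the mix-up on a running container is to ask Python where a package would be imported from, without importing it. The following is a minimal diagnostic sketch, not part of CloudAI; the helper names `module_origin` and `resolves_under` are hypothetical, and it uses only the standard library.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file or directory a top-level module would load from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    # Regular modules and packages expose the source file as `origin`.
    if spec.origin and spec.origin not in ("built-in", "frozen"):
        return Path(spec.origin).resolve()
    # Namespace packages have no origin; fall back to the first search location.
    locations = list(spec.submodule_search_locations or [])
    return Path(locations[0]).resolve() if locations else None


def resolves_under(name: str, root: str) -> bool:
    """True if `name` would be imported from somewhere below `root`."""
    origin = module_origin(name)
    if origin is None:
        return False
    root_path = Path(root).resolve()
    return root_path == origin or root_path in origin.parents
```

Run inside the job environment, `resolves_under("megatron", "/opt/Megatron-Bridge")` would indicate whether the container copy wins over the external clone for that package; the package name and path here are taken from the example above and may differ in other setups.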
Observed Errors
ModuleNotFoundError: No module named 'megatron.core'
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading
Root Causes
- Partial mounting - mount_as overwrites some paths but not others
- Two sources of truth - [[git_repos]] commit vs container's built-in version
- Implicit dependencies - No enforcement that versions match
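The "two sources of truth" condition can be detected by scanning the import search path for more than one copy of the same top-level package. The sketch below is illustrative only: `find_package_copies` is a hypothetical helper, and the demo uses throwaway directories standing in for the container install and the external clone.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_package_copies(package: str, search_paths: Iterable[str]) -> List[Path]:
    """Return every search-path entry that contains a top-level `package`."""
    hits = []
    for entry in search_paths:
        root = Path(entry)
        if (root / package).is_dir() or (root / f"{package}.py").is_file():
            hits.append(root)
    return hits


# Demo: two directories both ship a `megatron` package, a third does not.
with tempfile.TemporaryDirectory() as tmp:
    container = Path(tmp) / "container"
    clone = Path(tmp) / "clone"
    unrelated = Path(tmp) / "unrelated"
    for root in (container, clone):
        (root / "megatron").mkdir(parents=True)
    unrelated.mkdir()

    copies = find_package_copies(
        "megatron", [str(container), str(clone), str(unrelated)]
    )
    copy_names = [p.name for p in copies]
```

In a real job, passing `sys.path` as `search_paths` and getting more than one hit would suggest that scripts and core modules may come from different checkouts.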
Proposed Solutions
- Version validation - Validate git repo commit is compatible with container
- Full override or none - mount_as must override entire package or nothing
- Container-only mode - Warn if [[git_repos]] targets a package already in container
- Deprecate partial mounts - Remove support for mounting over container paths
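As a rough illustration of the version validation idea, the check below compares the container's package version with the version of the external repo and fails early on a mismatch. This is a sketch under assumptions: the "same major.minor" policy, the function names, and the error message are all hypothetical and not existing CloudAI behavior.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v0.4.0rc0' or '0.3.1'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(container_version: str, repo_version: str) -> bool:
    """Assumed policy: versions are compatible when major.minor match."""
    return major_minor(container_version) == major_minor(repo_version)


def validate_mount(package: str, container_version: str, repo_version: str) -> None:
    """Raise before launch instead of failing later with an import error."""
    if not is_compatible(container_version, repo_version):
        raise RuntimeError(
            f"{package}: external repo {repo_version} does not match "
            f"container version {container_version}"
        )
```

With the versions from this report, `validate_mount("Megatron-Bridge", "v0.4.0rc0", "v0.3.1")` would raise before the job starts. How the container version is obtained is left open here; `importlib.metadata.version` is one option if the package is installed as a distribution in the container.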
Environment
- CloudAI version: v1.6.beta6
- Container: nvcr.io/nvidian/nemo:26.04.rc2 (Megatron-Bridge v0.4.0rc0)
- External repo: Megatron-Bridge v0.3.1