Dockerfile: build from the official PyTorch image #15638
| COPY tutorials /workspace/nemo/tutorials | ||
| # COPY README.rst LICENSE /workspace/nemo/ | ||
|
|
||
| # Fix for Megatron: set a permanent location inside the container (NOT /tmp, which might get wiped) |
Where is Megatron coming from? We removed it
Automodel installs megatron-fsdp
https://github.com/NVIDIA-NeMo/Automodel/blob/e34a12c68021f540c85b6e5103ccc1ed2cd04119/pyproject.toml#L78
I thought megatron-fsdp is a standalone lib with no dependency on Megatron Core / LM
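One way to settle where a package comes from is to ask the installed metadata which distributions declare it as a dependency. The following is a minimal sketch using only the standard library; the distribution names in the usage block are taken from this thread and their exact spelling on PyPI is an assumption. It also counts dependencies declared under optional extras, so a hit is a lead to follow up, not proof of what pulled the package in.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a distribution name (case, '-', '_', '.') as PEP 503 does."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the bare project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else ""


def who_requires(target: str) -> list[str]:
    """List installed distributions that declare `target` as a dependency."""
    wanted = normalize(target)
    dependents = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if not name:
            continue  # skip distributions with broken metadata
        # dist.requires is None when no dependencies are declared
        for requirement in dist.requires or []:
            if requirement_name(requirement) == wanted:
                dependents.add(name)
    return sorted(dependents)


if __name__ == "__main__":
    # Names below come from the discussion; adjust to the real distribution names.
    for package in ("megatron-fsdp", "megatron-core", "transformer-engine"):
        print(package, "<-", who_requires(package) or "no installed distribution declares it")
```

Running this inside the built container would show whether Megatron Core or Transformer Engine is declared by some installed distribution or was installed directly.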
| # Check install | ||
| # NB: adjusting LD_LIBRARY_PATH (only here, should not be persistent!) is a temporary hack | ||
| # to avoid failure if CUDA is unavailable (`docker build` does not expose GPUs) | ||
| # The error is raised in NeMo Core, and the main reason is reinstalled Transformer-Engine; |
What is installing Transformer Engine? We can install it for Automodel's fast-path in the container, but I'm surprised it's present somehow
| # The error is raised in NeMo Core, and the main reason is reinstalled Transformer-Engine; | ||
| RUN CHECK_MSG=$(export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/compat/lib.real && \ | ||
| python -c "import nemo.collections.asr as nemo_asr" && \ | ||
| python -c "import nemo.collections.nlp as nemo_nlp" && \ |
This collection no longer exists
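The import check in this hunk could also be expressed as a small script that keeps the `LD_LIBRARY_PATH` adjustment scoped to the check itself. This is a sketch, not the Dockerfile's actual logic: `check_imports` is a hypothetical helper, and the `/usr/local/cuda` fallback for `CUDA_HOME` is an assumption. Each module is imported in a fresh interpreter whose environment is a modified copy, so nothing persists in the calling process.

```python
import os
import subprocess
import sys


def check_imports(modules, extra_library_dirs=()):
    """Import each module in a fresh interpreter; return {module: error text or None}."""
    env = dict(os.environ)  # a copy, so the parent environment stays untouched
    if extra_library_dirs:
        parts = [env.get("LD_LIBRARY_PATH", ""), *extra_library_dirs]
        env["LD_LIBRARY_PATH"] = os.pathsep.join(p for p in parts if p)
    results = {}
    for module in modules:
        proc = subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            env=env,
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            results[module] = None
        else:
            # The last stderr line normally carries the exception type and message.
            lines = proc.stderr.strip().splitlines()
            results[module] = lines[-1] if lines else f"exit code {proc.returncode}"
    return results


if __name__ == "__main__":
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # fallback is an assumption
    compat_dir = os.path.join(cuda_home, "compat", "lib.real")
    outcome = check_imports(["nemo.collections.asr"], [compat_dir])
    failures = {name: err for name, err in outcome.items() if err}
    print(failures or "Import check success")
```

Listing the collections in one place also makes it easy to drop entries that no longer exist.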
| apt-get install -y \ | ||
| libtool \ | ||
| libltdl-dev \ | ||
| automake \ |
What do we need various build tools listed here for? BTW I thought we removed K2 dependency (mostly), and it's not in this container - do we want to keep supporting it?
|
|
||
| # base image: official PyTorch image | ||
| ARG BASE_IMAGE=pytorch/pytorch:2.10.0-cuda13.0-cudnn9-devel | ||
| # NB: automodel requires torch<=2.10.0, cannot use 2.11.0 for now |
@adil-a is this true? if yes, why is max torch version capped in Automodel?
https://github.com/NVIDIA-NeMo/Automodel/blob/e34a12c68021f540c85b6e5103ccc1ed2cd04119/pyproject.toml#L84 (current main)
in dependencies: "torch>=2.6.0,<=2.10.0"
Change for max version is from here: https://github.com/NVIDIA-NeMo/Automodel/pull/1004/changes (was 2.9.0 before this PR)
Updated automodel -> now using pytorch 2.11.0
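For reference, the cap quoted in this thread (`torch>=2.6.0,<=2.10.0`) can be checked against a candidate base-image version with a few lines of code. This is a simplified sketch that handles only plain numeric releases and ignores a local suffix such as `+cu130`; real tooling should rely on the `packaging` library, which implements the full version specification. The helper names are hypothetical.

```python
import operator
import re

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def release_tuple(version: str) -> tuple[int, ...]:
    """Turn '2.10.0' or '2.10.0+cu130' into (2, 10, 0); pre/post/dev releases are rejected."""
    public = version.split("+", 1)[0]
    if not re.fullmatch(r"\d+(\.\d+)*", public):
        raise ValueError(f"unsupported version format: {version!r}")
    return tuple(int(part) for part in public.split("."))


def pad(left, right):
    """Right-pad the shorter tuple with zeros so that 2.10 compares equal to 2.10.0."""
    width = max(len(left), len(right))
    return left + (0,) * (width - len(left)), right + (0,) * (width - len(right))


def satisfies(version: str, specifier: str) -> bool:
    """Check a version against a comma-separated specifier such as '>=2.6.0,<=2.10.0'."""
    candidate = release_tuple(version)
    for clause in specifier.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|!=|>|<)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported specifier clause: {clause!r}")
        op, bound = match.groups()
        left, right = pad(candidate, release_tuple(bound))
        if not OPS[op](left, right):
            return False
    return True


if __name__ == "__main__":
    cap = ">=2.6.0,<=2.10.0"  # specifier quoted in the thread
    for torch_version in ("2.10.0", "2.11.0"):
        print(torch_version, "allowed" if satisfies(torch_version, cap) else "rejected")
```

Comparing integer tuples rather than strings matters here: as strings, "2.10.0" would sort before "2.6.0". Note also that `<=2.10.0` rejects a hypothetical 2.10.1 patch release, which is one reason upper caps of this form tend to need frequent bumps.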
| else echo "Import check success"; fi | ||
|
|
||
|
|
||
| # copy scripts/examples/tests into container for end user |
Why not copy the entire repo?
I think the idea is to copy only the parts of the repo that are needed. NeMo is installed in non-editable mode. If the whole repo were copied, the copied sources would or would not be used in place of the installed package, depending on the working directory, because the current directory is added to the Python import path automatically.
I would prefer to keep it as is (copy only scripts/examples/tests/tutorials). If someone wants to change NeMo code, that is possible by mounting the directory and manually adjusting PYTHONPATH.
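The directory-dependent behavior described in this comment can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not NeMo-specific code: `demo_pkg` is a hypothetical package created in two temporary locations, one standing in for the installed copy (exposed through `PYTHONPATH`) and one for a copied source tree. With `python -c`, the current directory is placed first on `sys.path`, so the result depends on where the interpreter is started.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_package(root: Path, label: str) -> None:
    """Create a tiny package `demo_pkg` under `root` that records where it lives."""
    package_dir = root / "demo_pkg"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(f"WHERE = {label!r}\n")


def resolve_from(cwd: Path, installed: Path) -> str:
    """Report which copy of demo_pkg a fresh interpreter imports when started in `cwd`."""
    env = dict(os.environ)
    env["PYTHONPATH"] = str(installed)  # stands in for the installed location
    env.pop("PYTHONSAFEPATH", None)  # that setting would disable the behavior shown here
    proc = subprocess.run(
        [sys.executable, "-c", "import demo_pkg; print(demo_pkg.WHERE)"],
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        installed, checkout, elsewhere = base / "site", base / "repo", base / "elsewhere"
        make_package(installed, "installed copy")
        make_package(checkout, "repo checkout")
        elsewhere.mkdir()
        return {
            "inside checkout": resolve_from(checkout, installed),
            "elsewhere": resolve_from(elsewhere, installed),
        }


if __name__ == "__main__":
    for location, winner in demo().items():
        print(f"{location}: imports the {winner}")
```

Started inside the checkout, the interpreter picks up the source tree; started anywhere else, it picks up the installed copy. Copying only scripts, examples, tests, and tutorials avoids that ambiguity.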
Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
[🤖]: Hi @artbataev 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Add `Dockerfile.stable` to build a container with NeMo from the official PyTorch container: https://hub.docker.com/r/pytorch/pytorch/tags
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information