build: Use no-build-isolation to install deep_gemm to fix arm install #970

Merged: chtruong814 merged 3 commits into main from chtruong/fix-deep-gemm-install on Aug 25, 2025
@chtruong814 chtruong814 commented Aug 23, 2025

What does this PR do?

Use --no-build-isolation when installing deep_gemm to fix the install on Arm.

Without it, the install fails with the stack trace below. Inside an Arm container, I was able to install deep_gemm with uv pip install --no-build-isolation.

Using Python 3.12.3 environment at: /opt/nemo_rl_venv
    Updated https://github.com/deepseek-ai/DeepGEMM.git (7b6b5563b9d4c1ae07ffbce7f78ad3ac9204827c)
  × Failed to build `deep-gemm @ git+https://github.com/deepseek-ai/DeepGEMM.git@7b6b5563b9d4c1ae07ffbce7f78ad3ac9204827c`
  ├─▶ The build backend returned an error
  ╰─▶ Call to `setuptools.build_meta.build_wheel` failed (exit status: 1)

      [stderr]
      /opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279:
      UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at
      /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
        cpu = _conversion_method_template(device=torch.device("cpu"))
      Traceback (most recent call last):
        File "<string>", line 14, in <module>
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/setuptools/build_meta.py", line 331, in
      get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/setuptools/build_meta.py", line 301, in
      _get_build_requires
          self.run_setup()
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/setuptools/build_meta.py", line 317, in
      run_setup
          exec(code, locals())
        File "<string>", line 91, in <module>
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1337,
      in CUDAExtension
          library_dirs += library_paths(device_type="cuda")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1548,
      in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/uv_cache/builds-v0/.tmpSkF3nO/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2982,
      in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

      hint: This usually indicates a problem with the package or the build environment.
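
The failing build in the trace above ran inside uv's temporary build environment (the /opt/uv_cache/builds-v0/.tmpSkF3nO paths), where torch's CUDAExtension could not find CUDA_HOME. With --no-build-isolation, the build instead uses the packages already installed in the target environment, so the build dependencies must be present there beforehand. A minimal Python sketch of that workflow follows. The helper names and the list of build dependencies (torch and setuptools, taken from the paths in the trace) are assumptions for illustration, not the change made in this PR.

```python
import importlib.util
import shlex

# Repository and commit copied from the failing log above.
DEEP_GEMM_REPO = "https://github.com/deepseek-ai/DeepGEMM.git"
DEEP_GEMM_REF = "7b6b5563b9d4c1ae07ffbce7f78ad3ac9204827c"

# Assumption: modules the deep_gemm build imports, inferred from the traceback.
ASSUMED_BUILD_DEPS = ("torch", "setuptools")


def missing_build_deps(names=ASSUMED_BUILD_DEPS):
    """Return the modules from `names` that are not importable in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def deep_gemm_install_cmd(ref=DEEP_GEMM_REF):
    """Build the uv command that installs deep_gemm without build isolation."""
    requirement = f"deep-gemm @ git+{DEEP_GEMM_REPO}@{ref}"
    return ["uv", "pip", "install", "--no-build-isolation", requirement]


if __name__ == "__main__":
    missing = missing_build_deps()
    if missing:
        print("Install these first:", ", ".join(missing))
    else:
        # Print the command rather than running it, so the sketch has no side effects.
        print(shlex.join(deep_gemm_install_cmd()))
```

The sketch only prints the command. Running it for real would still require a CUDA toolkit that torch can locate in the target environment.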

Also, it seems that PR #859 was merged into main even though it should not have made it through CI. A GitHub error caused the build to be skipped, and the CI quality check should have failed but did not. I updated the check to verify that the build succeeded whenever tests are run, and I also had to increase the allowed GPU hours for the nightly test suite.

Merge queue run that should not have passed:
https://github.com/NVIDIA-NeMo/RL/actions/runs/17169685093/job/48716856164
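
The updated check described above can be sketched as a small predicate. This is an illustrative Python sketch, not the workflow code from this PR: the function name and the tests_requested flag are hypothetical, and the result strings are the values GitHub Actions reports for a job (success, failure, cancelled, skipped).

```python
def quality_check_passes(build_result: str, tests_result: str, tests_requested: bool) -> bool:
    """Decide whether the CI quality gate should pass.

    build_result and tests_result use GitHub Actions job result strings.
    tests_requested is a hypothetical flag: True when the run is expected to execute tests.
    """
    if not tests_requested:
        # Assumption: runs that do not request tests only need the tests job
        # to have been skipped or to have succeeded.
        return tests_result in ("success", "skipped")
    # When tests are run, a skipped or failed build must fail the gate,
    # even if the downstream jobs did not report a failure themselves.
    return build_result == "success" and tests_result == "success"
```

For the situation described above, a skipped build on a run that requested tests, quality_check_passes("skipped", "skipped", True) returns False.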

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 23, 2025
terrykong previously approved these changes Aug 23, 2025
@terrykong terrykong enabled auto-merge August 23, 2025 15:30
@chtruong814 chtruong814 added CI:docs Run doctest and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 24, 2025
@terrykong terrykong added this pull request to the merge queue Aug 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 24, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Aug 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 24, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Aug 24, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814 chtruong814 removed this pull request from the merge queue due to a manual request Aug 24, 2025
@github-actions github-actions bot added the CI Relating to CI label Aug 24, 2025
@chtruong814 chtruong814 added CI:L0 Run doctests and unit tests and removed CI Relating to CI CI:docs Run doctest CI:L0 Run doctests and unit tests labels Aug 24, 2025
@chtruong814 chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 24, 2025
@terrykong terrykong added CI:docs Run doctest and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 24, 2025
@terrykong terrykong enabled auto-merge August 24, 2025 17:29
@terrykong terrykong added this pull request to the merge queue Aug 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 24, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit 071ebfc Aug 25, 2025
64 of 68 checks passed
@chtruong814 chtruong814 deleted the chtruong/fix-deep-gemm-install branch August 25, 2025 17:18
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
…#970)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
…NVIDIA-NeMo#970)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
skirdey-inflection pushed a commit to skirdey-inflection/RL that referenced this pull request Aug 30, 2025
…NVIDIA-NeMo#970)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Stanislav Kirdey <stan@inflection.ai>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
…NVIDIA-NeMo#970)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Labels

CI:docs Run doctest
