with zstd compression lets just run the steps individually #1117

Merged
pstjohn merged 1 commit into NVIDIA-BioNeMo:main from pstjohn:pstjohn/revert-individual-actions
Sep 5, 2025

Conversation

@pstjohn
Collaborator

@pstjohn pstjohn commented Sep 5, 2025

Reverts part of #1108, since the output can be very hard to parse. With zstd compression and Docker Hub caching, the image pull doesn't take as much time.

Summary by CodeRabbit

  • New Features

    • None
  • Tests

    • Parallelized per-recipe unit tests using containerized images.
    • Switched test execution to PyTest for each recipe.
    • Added per-recipe dependency installation with automatic detection of install method.
    • Introduced targeted/sparse checkouts and a pre-test GPU info check.
  • Chores

    • Refined CI to pass structured per-directory metadata to test jobs and updated outputs/logging to reflect per-recipe images.

@coderabbitai
Contributor

coderabbitai Bot commented Sep 5, 2025

Walkthrough

The workflow .github/workflows/unit-tests-recipes.yml now emits a JSON array of {dir, image} objects, drives a matrix-based unit-tests job per recipe with container.image from each object, uses sparse checkout per directory, installs dependencies per-recipe, and runs pytest in each recipe directory.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Changed-dirs output structure | .github/workflows/unit-tests-recipes.yml | set-dirs now builds DIRS_WITH_IMAGES as JSON objects {dir, image} and writes dirs using that new shape; logs updated to print dirs=$DIRS_WITH_IMAGES. |
| Matrix-based per-recipe execution | .github/workflows/unit-tests-recipes.yml | unit-tests job uses strategy.matrix.recipe sourced from needs.changed-dirs.outputs.dirs (array of objects); container.image set to matrix.recipe.image, enabling per-recipe containers (explicit Amplify mapping included). |
| Sparse checkout per directory | .github/workflows/unit-tests-recipes.yml | Checkout step uses sparse-checkout driven by matrix.recipe.dir with sparse-checkout-cone-mode: false. |
| Per-recipe dependency installation | .github/workflows/unit-tests-recipes.yml | In working-directory: matrix.recipe.dir: conditionally pip install -e . if pyproject.toml/setup.py exists, else install from requirements.txt, else fail. |
| Test execution switch | .github/workflows/unit-tests-recipes.yml | Replaces previous script with pytest -v . executed inside matrix.recipe.dir; adds a GPU info step prior to cache/setup. |
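
The per-recipe install rule summarized above can be sketched in Python. This is a minimal illustration of the decision order only (editable install first, then requirements.txt, else fail); the function name `pick_install_command` is hypothetical and not part of the workflow, and the sketch does not model the `PIP_CONSTRAINT` handling the workflow step applies.

```python
from pathlib import Path


def pick_install_command(recipe_dir: Path) -> list[str]:
    """Return the pip command the install step would choose for a recipe.

    Hypothetical helper mirroring the workflow's shell conditionals.
    """
    # A packaging file means the recipe is installed as an editable package.
    if (recipe_dir / "pyproject.toml").is_file() or (recipe_dir / "setup.py").is_file():
        return ["pip", "install", "-e", "."]
    # Otherwise fall back to a plain requirements file.
    if (recipe_dir / "requirements.txt").is_file():
        return ["pip", "install", "-r", "requirements.txt"]
    # Nothing installable: the workflow step exits non-zero here.
    raise FileNotFoundError(
        f"No pyproject.toml, setup.py, or requirements.txt found in {recipe_dir}"
    )
```

Note that a directory containing both pyproject.toml and requirements.txt takes the editable path, because that check runs first.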

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Dev as Developer
  participant CI as GitHub Actions
  participant CD as Job: changed-dirs
  participant UT as Job: unit-tests (matrix)
  participant C as Container (per-recipe)
  participant Repo as Repo (sparse)

  Dev->>CI: Push / PR
  CI->>CD: Run set-dirs
  CD->>CD: Build DIRS_WITH_IMAGES = [{dir, image}, ...]
  CD-->>UT: Output dirs (array of {dir, image})

  rect rgba(230,245,255,0.5)
  note over UT: Matrix fan-out per recipe
  loop for each recipe in dirs
    UT->>C: Start container using recipe.image
    UT->>Repo: Checkout with sparse path = recipe.dir (cone=false)
    UT->>UT: Show GPU info
    UT->>UT: Install deps in recipe.dir (editable or requirements)
    UT->>UT: Run pytest -v . in recipe.dir
  end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • jstjohn
  • jwilber
  • cspades
  • dorotat-nv
  • trvachov
  • jomitchellnv

Poem

I hop through dirs with tiny paws,
Pair images to paths without a pause.
Sparse-checkout trails I lightly roam,
Install and pytest—then back home. 🥕🐇

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
.github/workflows/unit-tests-recipes.yml (3)

92-115: Confirm image availability and consider a small guard.

  • Verify both Docker Hub tags exist and are readable by CI; a missing/private tag will hard-fail matrix setup.
  • Optional: guard against empty DIRS to avoid confusing logs.

Apply this minimal guard:

           DIRS_WITH_IMAGES=$(echo "$DIRS" | jq -c '
-            map({
+            ( . // [] ) | map({
               dir: .,
               image: (
                 if . == "models/amplify" then
                   "svcbionemo023/bionemo-framework:amplify-model-devcontainer-082025"
                 else
                   "svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd"
                 end
               )
             })
           ')

Also please confirm via a one-off CI run that both images pull successfully on the gpu runner.
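
For readers less familiar with jq, the mapping and the null guard can be sketched in Python. The two image tags are copied from the diff above; the names `DEFAULT_IMAGE`, `IMAGE_OVERRIDES`, and `dirs_with_images` are hypothetical and exist only for this illustration.

```python
import json

# Image tags as they appear in the workflow diff.
DEFAULT_IMAGE = "svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd"
IMAGE_OVERRIDES = {
    "models/amplify": "svcbionemo023/bionemo-framework:amplify-model-devcontainer-082025",
}


def dirs_with_images(dirs_json: str) -> str:
    """Turn a JSON array of directories into a JSON array of {dir, image} objects."""
    # `or []` plays the role of jq's `. // []`: a JSON null becomes an empty list.
    dirs = json.loads(dirs_json) or []
    entries = [{"dir": d, "image": IMAGE_OVERRIDES.get(d, DEFAULT_IMAGE)} for d in dirs]
    # Compact separators approximate the single-line output of `jq -c`.
    return json.dumps(entries, separators=(",", ":"))
```

As with the jq guard, this covers a null input but not an empty string, which is not valid JSON and would still need to be handled before this step.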


145-148: Sparse checkout may miss shared root files; consider adding fetch-depth and patterns if needed.

If any recipe relies on repo-root config (e.g., .coveragerc, tox.ini, pyproject.toml), sparse-checkout will omit it.

Suggestion (only if you hit missing-file errors):

       uses: actions/checkout@v4
       with:
+        fetch-depth: 0
-        sparse-checkout: "${{ matrix.recipe.dir }}"
+        sparse-checkout: |
+          ${{ matrix.recipe.dir }}
+          .coveragerc
+          pyproject.toml
         sparse-checkout-cone-mode: false

Confirm whether any recipe imports across dirs or reads root-level configs before changing this.


169-170: Emit JUnit for easier triage (optional).

JUnit output per-matrix job makes failures easier to inspect in the UI.

Apply:

-      - name: Run tests
-        working-directory: ${{ matrix.recipe.dir }}
-        run: pytest -v .
+      - name: Run tests
+        working-directory: ${{ matrix.recipe.dir }}
+        run: pytest -v . --maxfail=1 --durations=25 --junitxml=/tmp/pytest.xml
+      - name: Upload test report
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: junit-${{ matrix.recipe.dir }}
+          path: /tmp/pytest.xml

Note: actions/upload-artifact rejects artifact names that contain forward slashes, so a matrix.recipe.dir value such as recipes/esm2_native_te_mfsdp must be replaced with a sanitized name first.
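
One way to derive a safe artifact name is sketched below. The character set is an assumption based on the characters upload-artifact is known to reject (quotes, colons, angle brackets, pipes, asterisks, question marks, slashes, and line breaks); the function name `sanitize_artifact_name` is hypothetical.

```python
import re

# Assumed set of characters that are not allowed in artifact names.
_INVALID_CHARS = re.compile(r'[\\/:"<>|*?\r\n]')


def sanitize_artifact_name(name: str) -> str:
    """Replace each disallowed character with a hyphen."""
    return _INVALID_CHARS.sub("-", name)


if __name__ == "__main__":
    print(sanitize_artifact_name("junit-recipes/esm2_native_te_mfsdp"))
```

In the workflow itself the same substitution would more likely be done in a shell step that writes the sanitized value to the job's outputs or environment.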

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a9acd5 and 885e974.

📒 Files selected for processing (1)
  • .github/workflows/unit-tests-recipes.yml (2 hunks)

Comment on lines +132 to +137

    container:
      image: ${{ matrix.recipe.image }}
    strategy:
      matrix:
        recipe: ${{ fromJson(needs.changed-dirs.outputs.dirs) }}
      fail-fast: false


💡 Verification agent

🧩 Analysis chain

GPU passthrough likely needs explicit Docker options.

Containers on GPU runners typically need --gpus all and larger shared memory for PyTorch.

Apply:

     container:
       image: ${{ matrix.recipe.image }}
+      options: --gpus all --ipc=host --shm-size=16g

Please verify CUDA is visible inside the job by adding a quick probe:

python -c "import torch; print('cuda:', torch.cuda.is_available(), 'devices:', torch.cuda.device_count())"

🏁 Script executed:

#!/bin/bash
set -e
echo "=== runs-on entries in unit-tests-recipes.yml ==="
rg -n '^  runs-on:' -n .github/workflows/unit-tests-recipes.yml

echo
echo "=== container block around line 132 ==="
sed -n '120,160p' .github/workflows/unit-tests-recipes.yml

echo
echo "=== Searching for torch.cuda usage in the repository ==="
rg -nP --type=py 'torch\.cuda' -C2 || echo "No torch.cuda references detected"

Length of output: 183


🏁 Script executed:

#!/bin/bash
set -e

echo "=== runs-on entries in .github/workflows/unit-tests-recipes.yml ==="
rg -n '^ *runs-on:' .github/workflows/unit-tests-recipes.yml

echo
echo "=== container block around matrix definition ==="
rg -nP 'container:' -C5 .github/workflows/unit-tests-recipes.yml

echo
echo "=== Any GPU-related labels or runner names ==="
rg -nP 'gpu|cuda' -n .github/workflows/unit-tests-recipes.yml || echo "No GPU or CUDA labels found"

Length of output: 1072


Add Docker GPU and IPC options to the container
By default the Actions container won’t pass GPUs through or enlarge shared memory. Add the following under container to enable GPU support and sufficient /dev/shm:

     container:
       image: ${{ matrix.recipe.image }}
+      options: --gpus all --ipc=host --shm-size=16g

Then confirm CUDA inside the job:

python -c "import torch; print('cuda:', torch.cuda.is_available(), 'devices:', torch.cuda.device_count())"
🤖 Prompt for AI Agents
In .github/workflows/unit-tests-recipes.yml around lines 132 to 137, the
container configuration lacks Docker options to expose GPUs and enlarge shared
memory; add a container.options entry with Docker flags like --gpus all
--ipc=host --shm-size=1g (or your preferred size) so the runner will passthrough
GPUs and increase /dev/shm, and then add a job step that runs python -c "import
torch; print('cuda:', torch.cuda.is_available(), 'devices:',
torch.cuda.device_count())" to confirm CUDA availability inside the job.

Comment thread .github/workflows/unit-tests-recipes.yml
@pstjohn pstjohn added the bug (Something isn't working) label on Sep 5, 2025
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
@pstjohn pstjohn force-pushed the pstjohn/revert-individual-actions branch from c279cce to 7f31546 on September 5, 2025 14:32
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
.github/workflows/unit-tests-recipes.yml (2)

141-143: Augment GPU probe to verify torch CUDA availability.
Keep nvidia-smi, but also check the Python stack loads CUDA.

Apply:

   - name: Show GPU info
     run: nvidia-smi
+  - name: Verify torch CUDA
+    run: |
+      python - <<'PY'
+      import torch
+      print("cuda_available:", torch.cuda.is_available(), "device_count:", torch.cuda.device_count())
+      assert torch.cuda.is_available(), "CUDA not available inside container"
+      PY

132-134: GPU not passed through to the container; add Docker options.
Without these, nvidia-smi/torch CUDA will likely fail inside the container on GPU runners.

Apply:

     container:
       image: ${{ matrix.recipe.image }}
+      options: --gpus all --ipc=host --shm-size=16g
🧹 Nitpick comments (3)
.github/workflows/unit-tests-recipes.yml (3)

92-115: Dir→image mapping looks good; pin images by digest for reproducibility.
Tag-only versions can drift. Consider pinning both images with @sha256 digests and documenting the provenance of the squashed image.
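
A small check can tell tag-only references apart from digest-pinned ones. The sketch below assumes the usual `@sha256:<64 hex characters>` suffix; the function name `is_digest_pinned` is hypothetical, and the digest used in the usage example is a made-up placeholder, not a real image digest.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex digits.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference carries a sha256 digest."""
    return bool(_DIGEST_SUFFIX.search(image_ref))


if __name__ == "__main__":
    tag_only = "svcbionemo023/bionemo-framework:pytorch25.06-py3-squashed-zstd"
    placeholder = tag_only + "@sha256:" + "0" * 64  # placeholder digest for illustration
    print(is_digest_pinned(tag_only), is_digest_pinned(placeholder))
```

A check like this could run in the changed-dirs job to warn when a mapped image is referenced by tag alone.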


152-165: Harden dependency install; use the container’s interpreter and ensure pytest.
Avoid bare pip; invoke via python, upgrade tooling, and guarantee pytest exists. Also prefer unsetting PIP_CONSTRAINT vs setting it empty.

Apply:

-      - name: Install dependencies
+      - name: Install dependencies
         working-directory: ${{ matrix.recipe.dir }}
         run: |
-          if [ -f pyproject.toml ] || [ -f setup.py ]; then
-            PIP_CONSTRAINT= pip install -e .
-            echo "Installed ${{ matrix.recipe.dir }} as editable package"
-          elif [ -f requirements.txt ]; then
-            PIP_CONSTRAINT= pip install -r requirements.txt
-            echo "Installed ${{ matrix.recipe.dir }} from requirements.txt"
-          else
-            echo "No pyproject.toml, setup.py, or requirements.txt found in ${{ matrix.recipe.dir }}"
-            exit 1
-          fi
+          python --version
+          python -m pip install -U pip setuptools wheel
+          unset PIP_CONSTRAINT || true
+          if [ -f pyproject.toml ] || [ -f setup.py ]; then
+            python -m pip install -e .
+            echo "Installed ${{ matrix.recipe.dir }} as editable package"
+          elif [ -f requirements.txt ]; then
+            python -m pip install -r requirements.txt
+            echo "Installed ${{ matrix.recipe.dir }} from requirements.txt"
+          else
+            echo "No pyproject.toml, setup.py, or requirements.txt found in ${{ matrix.recipe.dir }}"
+            exit 1
+          fi
+          python - <<'PY'
+          try:
+            import pytest  # noqa
+            print("pytest present")
+          except Exception:
+            import sys, subprocess
+            subprocess.check_call([sys.executable, "-m", "pip", "install", "pytest"])
+          PY

167-168: Consider clearer test output and quicker failure.
Optional: add --maxfail=1 and -r a for actionable logs; or emit JUnit XML for parsing.

Example:

-        run: pytest -v .
+        run: pytest -v -r a --maxfail=1 .
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 885e974 and 7f31546.

📒 Files selected for processing (1)
  • .github/workflows/unit-tests-recipes.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests (recipes/esm2_native_te_mfsdp, svcbionemo023/bionemo-framework:pytorch25.06-py3-squash...
🔇 Additional comments (1)
.github/workflows/unit-tests-recipes.yml (1)

149-151: Sparse checkout: confirm recipes are fully self-contained.
If tests import shared modules outside ${{ matrix.recipe.dir }}, sparse checkout will break them. Either include shared paths in sparse-checkout or verify no cross-dir imports.
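
A rough way to verify the "no cross-dir imports" condition is to scan a recipe's Python files for imports that resolve to something at the repository root but not inside the recipe. The sketch below is a heuristic under that assumption: it only looks at top-level names, ignores src-style layouts and installed packages, and both function names are hypothetical.

```python
import ast
from pathlib import Path


def imported_top_level_modules(recipe_dir: Path) -> set[str]:
    """Collect the top-level names of all absolute imports under recipe_dir."""
    names: set[str] = set()
    for py_file in recipe_dir.rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"), filename=str(py_file))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # level == 0 skips relative imports, which stay inside the recipe.
                names.add(node.module.split(".")[0])
    return names


def cross_dir_imports(repo_root: Path, recipe_dir: Path) -> set[str]:
    """Return imports that exist at the repo root but not inside the recipe."""

    def provides(base: Path, name: str) -> bool:
        return (base / name).is_dir() or (base / f"{name}.py").is_file()

    return {
        name
        for name in imported_top_level_modules(recipe_dir)
        if provides(repo_root, name) and not provides(recipe_dir, name)
    }
```

Any name this returns is a candidate path to add to the sparse-checkout patterns, or a dependency to remove from the recipe.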

@pstjohn pstjohn force-pushed the pstjohn/revert-individual-actions branch from 7f31546 to 11e6464 on September 5, 2025 15:01
@pstjohn pstjohn removed the bug (Something isn't working) label on Sep 5, 2025
@pstjohn pstjohn merged commit e213d40 into NVIDIA-BioNeMo:main Sep 5, 2025
15 checks passed