Conversation

@guptaNswati (Contributor) commented Feb 13, 2025

This builds a container image for the NVIDIA nvbandwidth GPU benchmarking tool. It helps us demo and test multi-node NVLink functionality, which requires MPI, IMEX, and DRA.

@@ -0,0 +1,30 @@
FROM nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu22.04 AS builder
Member

Nit: License header.

Contributor

We should also make the CUDA version here an argument so that we can easily update to a newer one without having to change this file.

Contributor Author

Yes, we need the license header and the CUDA version as an ARG.
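
A minimal sketch of what that could look like, assuming an ARG named CUDA_VERSION (the final names in the Dockerfile may differ):

# Declare the version once so it can be overridden at build time.
ARG CUDA_VERSION=12.5.0
FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS builder

A newer CUDA base image can then be selected with docker build --build-arg CUDA_VERSION=<version> . without editing the Dockerfile.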

Member

@klueska on other projects we have moved to defining the versions directly in the FROM line, since this allows us to leverage dependabot automation to update our base images instead of doing it manually.
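
For reference, a sketch of a dependabot configuration for Docker base images (the directory and schedule below are assumptions, not this repository's actual settings):

# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: "docker"
    directory: "/deployments/container"
    schedule:
      interval: "weekly"

Dependabot looks for pinned tags in the FROM lines themselves, which is why the preference here is to hard-code the versions there rather than hide them behind build args.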

cmake -DMULTINODE=1 . && \
make -j$(nproc)

FROM mpioperator/openmpi:v0.6.0
Member

Is this required because the application needs MPI?

Contributor

This was my suggestion, because yes, all multi-node tests rely on MPI, and it lets us create jobs such as:

---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: mpi-memcpy-imex-test-compute-domain
spec:
  numNodes: 4
  channel:
    resourceClaimTemplate:
      name: mpi-memcpy-imex-test-compute-domain-resource-claim

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpi-memcpy-imex-test
spec:
  slotsPerWorker: 2
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-launcher
        spec:
          containers:
          - image: nvcr.io/nvidia/nvbandwidth:v0.7
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:2:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
    Worker:
      replicas: 4
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
        spec:
          containers:
          - image: nvcr.io/nvidia/nvbandwidth:v0.7
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            env:
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 2
              claims:
              - name: compute-domain
          resourceClaims:
          - name: compute-domain
            resourceClaimTemplateName: mpi-memcpy-imex-test-compute-domain-resource-claim

Member

One should note that this image may then need additional steps to be approved for publication.

Contributor Author

Yes, I will take care of the approvals.

@guptaNswati guptaNswati self-assigned this Feb 14, 2025
PUSH_ON_BUILD: true
run: |
echo "nvbandwidth ${NVBANDWIDTH_VERSION}"
docker build --platform=linux/arm64 \
Collaborator

Could you add a Makefile at multinodeNVL/nvbandwidth so users can reproduce this command if they want to build the image locally, and also so we can call the make target from CI to align with all our other CI image builds?
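
A hypothetical sketch of such a Makefile; the variable names, defaults, and tag format are illustrative, not the final CI interface:

# multinodeNVL/nvbandwidth/Makefile (sketch; recipe lines are tab-indented)
IMAGE_NAME          ?= ghcr.io/nvidia/k8s-samples
CUDA_VERSION        ?= 12.5.0
NVBANDWIDTH_VERSION ?= v0.7

.PHONY: build
build:
	docker build \
		--platform=linux/arm64 \
		--build-arg CUDA_VERSION=$(CUDA_VERSION) \
		--build-arg NVBANDWIDTH_VERSION=$(NVBANDWIDTH_VERSION) \
		--tag $(IMAGE_NAME):nvbandwidth-$(NVBANDWIDTH_VERSION) \
		-f Dockerfile .

CI could then call something like make -f multinodeNVL/nvbandwidth/Makefile build instead of spelling out the docker build command inline.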

Contributor

Yes -- also it should not build specifically for ARM; it should build a multi-arch image like all of our other projects.

Contributor Author

I didn't add the amd64 build because we only tested it on ARM, and I was not sure how to test it on x86.

Collaborator

Please add the multi-arch build, similar to what we have for other images across our projects.
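
A sketch of what the multi-arch build could look like with docker buildx; the tag and build args are illustrative:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg CUDA_VERSION=12.5.0 \
  --build-arg NVBANDWIDTH_VERSION=v0.7 \
  --tag ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7 \
  --push \
  -f Dockerfile .

Cross-building the non-native architecture needs QEMU/binfmt, which the workflow already sets up via docker/setup-qemu-action.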

Contributor

You can run one of the single-node tests. Here are the results of running on a DGXA100, for example:

kklues@luna-08:~/nvbandwidth$ docker run --rm -it --gpus all nvcr.io/nvidia/nvbandwidth:v0.7 nvbandwidth -t device_to_device_memcpy_read_ce
nvbandwidth Version: v0.7
Built from Git version: v0.7

MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 12070
CUDA Driver Version: 12070
Driver Version: 565.57.01

Process 0 (2f68f68068f0): device 0: NVIDIA A100-SXM4-40GB (00000000:07:00)

Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0       N/A    266.54    267.13    266.40    266.37    266.27    267.88    267.39
 1    277.42       N/A    278.47    278.58    271.68    267.18    266.98    267.62
 2    278.58    277.56       N/A    278.59    277.56    267.23    268.16    266.20
 3    278.77    278.86    278.22       N/A    278.49    278.76    278.77    266.96
 4    278.84    277.63    278.76    277.65       N/A    278.39    278.86    269.22
 5    278.76    278.86    278.49    278.74    277.65       N/A    278.76    278.67
 6    278.86    278.58    276.19    267.67    267.44    267.20       N/A    267.24
 7    278.69    278.94    278.86    278.95    278.67    278.86    278.39       N/A

SUM device_to_device_memcpy_read_ce 15372.40

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.

@@ -0,0 +1,46 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Collaborator

2025?

@guptaNswati guptaNswati force-pushed the bandwidthtest branch 3 times, most recently from e94c3dc to f853c5a Compare February 18, 2025 23:42
cmake -DMULTINODE=1 . && \
make -j$(nproc)

FROM mpioperator/openmpi:v0.6.0
Contributor

Can we make the openmpi version configurable as well?

Member

Note that we try to use dependabot to update base container images where applicable, and if we wanted to leverage that for the openmpi base image as well, we would not want to make its version a build argument.

Member

Instead of duplicating the make logic here, would it make sense to leverage the Makefile in deployments/container/ instead? This Makefile follows the required pattern for all our other projects and also ensures that our base images are up to date.

Moving the Dockerfile from this folder to /deployments/container/nvbandwidth/Dockerfile.ubuntu

And applying the following diff to deployments/container/Makefile:

diff --git a/deployments/container/Makefile b/deployments/container/Makefile
index 4a74687..d25bb0f 100644
--- a/deployments/container/Makefile
+++ b/deployments/container/Makefile
@@ -80,7 +80,11 @@ endif
 endif
 
 build-%: DIST = $(*)
+ifneq ($(SAMPLE),nvbandwidth)
 build-%: DOCKERFILE = $(CURDIR)/deployments/container/Dockerfile.$(DOCKERFILE_SUFFIX)
+else
+build-%: DOCKERFILE = $(CURDIR)/deployments/container/$(SAMPLE)/Dockerfile.$(DOCKERFILE_SUFFIX)
+endif
 
 build-ubuntu%: DOCKERFILE_SUFFIX = ubuntu
 

allows the existing CI to be leveraged with:

diff --git a/.github/workflows/image.yaml b/.github/workflows/image.yaml
index b066bf1..2b7de81 100644
--- a/.github/workflows/image.yaml
+++ b/.github/workflows/image.yaml
@@ -40,11 +40,14 @@ jobs:
         - vectorAdd
         - nbody
         - deviceQuery
+        - nvbandwidth
         exclude:
         - dist: ubi8
           sample: deviceQuery
         - dist: ubi8
           sample: nbody
+        - dist: ubi8
+          sample: nvbandwidth
 
     steps:
       - uses: actions/checkout@v4
@@ -86,30 +89,3 @@ jobs:
           echo "${VERSION}"
           SAMPLE=${{ matrix.sample }} make -f deployments/container/Makefile build-${{ matrix.dist }}
 
-  build-nvbandwidth:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Check out code
-        uses: actions/checkout@v4
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@v3
-        with:
-          image: tonistiigi/binfmt:master
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
-      - name: Login to GitHub Container Registry
-        uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - name: Build nvbandwidth image
-        env:
-          IMAGE_NAME: ghcr.io/nvidia/k8s-samples:nvbandwidth
-          CUDA_VERSION: 12.5.0
-          NVBANDWIDTH_VERSION: v0.7
-          PUSH_ON_BUILD: true
-          BUILD_MULTI_ARCH_IMAGES: true
-        run: |
-          echo "nvbandwidth ${NVBANDWIDTH_VERSION}"
-          make -f multinodeNVL/nvbandwidth/Makefile build

Member

See for example: #20

Contributor Author

Unlike other cuda-sample images, this image seems more version-sensitive (CUDA, driver, and nvbandwidth release versions). This should not be updated or built in the same way we do the other sample images. I agree with the structure alignment, but they should be separate builds IMO.

I can update it based on Carlos's changes:
main...ArangoGutierrez:k8s-samples:pr-19

Contributor Author

Based on Arjun's work here: https://github.com/NVIDIA/gpu-operator/pull/1262/files
A platform-specific build seems like a better option.

nvbandwidth ${COMMIT_SHORT_SHA}-amd64
| Building Docker image: ghcr.io/guptanswati/nvbandwidth:v0.7-485225f2-amd64
| DOCKER_BUILDKIT=1 \
| 	docker  build --pull \
| 	--provenance=false --sbom=false \
| 	--platform=linux/aarch64 \
| 	--tag ghcr.io/guptanswati/nvbandwidth:v0.7-485225f2-amd64 \
| 	--build-arg CUDA_VERSION="12.5.0" \
| 	--build-arg NVBANDWIDTH_VERSION="v0.7" \
| 	--build-arg OPENMPI_VERSION="v0.6.0" \
| 	-f /home/nvidia/swati-imex-test/k8s-samples/multinodeNVL/nvbandwidth/Dockerfile \
| 	/home/nvidia/swati-imex-test/k8s-samples
[+] Building 0.3s (11/11) FINISHED                               docker:default
|  => [internal] load build definition from Dockerfile                       0.0s
|  => => transferring dockerfile: 1.66kB                                     0.0s
|  => [internal] load metadata for docker.io/mpioperator/openmpi:v0.6.0      0.3s
|  => [internal] load metadata for nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu2  0.1s
|  => [internal] load .dockerignore                                          0.0s
|  => => transferring context: 2B                                            0.0s
|  => [builder 1/4] FROM nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu22.04@sha25  0.0s
|  => => resolve nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu22.04@sha256:4b46c6  0.0s
|  => [stage-1 1/2] FROM docker.io/mpioperator/openmpi:v0.6.0@sha256:6d6bf8  0.0s
|  => CACHED [builder 2/4] RUN apt-get update && apt-get install -y --no-in  0.0s
|  => CACHED [builder 3/4] WORKDIR /bandwidthtest                            0.0s
|  => CACHED [builder 4/4] RUN git clone --branch v0.7 --depth 1 --single-b  0.0s
|  => CACHED [stage-1 2/2] COPY --from=builder /bandwidthtest/nvbandwidth/n  0.0s
|  => exporting to image                                                     0.0s
|  => => exporting layers                                                    0.0s
|  => => writing image sha256:a45c0e75199c4a129a178b132ed254f61706e2842ee54  0.0s
|  => => naming to ghcr.io/guptanswati/nvbandwidth:v0.7-485225f2-amd64       0.0s

Member

Although I don't disagree that using platform-specific runners is a good option here -- especially since we're not cross-compiling a language like Go -- I would argue that we should do that for ALL our samples and not a single one. Does it make sense to push this forward as is, and then switch to native runners as a follow-up once we have ironed out the changes required to the Makefiles and the CI, as is being discussed in NVIDIA/gpu-operator#1262?

on:

Unlike other cuda-sample images, this image seems more version sensitive (CUDA, DRIVER and nvbandwidth release version). This should not be updated or built in the same way we do the other sample images. I agree with the structure alignment but they should be separate builds IMO.

To be clear: they are separate builds, and all versions are currently being specified in the Dockerfile. I know that in the past we have leveraged build args to set values so that these are more flexible. I believe the intent was to be able to set these "globally" in CI variables, but that was never followed through on, and version bumps were file edits anyway. If these examples are so version-dependent, we should define the versions in one place -- the Dockerfile, for example -- and then only change them there. This also allows users to more easily replicate the builds themselves.
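
For example, with the versions pinned in the Dockerfile itself, a local rebuild reduces to something like the following (the path depends on where the Dockerfile finally lands):

docker build -t nvbandwidth:local -f multinodeNVL/nvbandwidth/Dockerfile .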

@elezar (Member) commented Feb 19, 2025

@guptaNswati please also add a meaningful description to the PR.

@copy-pr-bot bot commented Feb 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@elezar elezar changed the title from "Add nvbandwidth image" to "Add nvbandwidth sample" Feb 21, 2025
@elezar elezar mentioned this pull request Feb 21, 2025
@ArangoGutierrez ArangoGutierrez self-requested a review February 21, 2025 15:19
@ArangoGutierrez (Collaborator)

Once the merge conflict is resolved, we can proceed

This change adds an nvbandwidth sample that can be used to test
both single and multi-node GPU interconnectivity.

The multi-arch images are generated with the following image root:

nvcr.io/ghcr.io/nvidia/k8s-samples:nvbandwidth

Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar (Member) left a comment

Thanks @guptaNswati

@elezar elezar merged commit a0d87c7 into NVIDIA:main Feb 21, 2025
8 checks passed