Merged
20 changes: 7 additions & 13 deletions docs/airgap/mirror-docker-images.md
Original file line number Diff line number Diff line change
@@ -42,9 +42,8 @@ Then, for each image you want to download, you should pull the image from the re
In this example, we're saving all our Docker images to `/tmp/images`:

```
-$ docker pull nvidia/cuda:11.1-devel-ubuntu20.04
-$ docker save -o /tmp/images/nvidia-cuda-11.1-devel-ubuntu20.04.tar nvidia/cuda:11.1-devel-ubuntu20.04
->>>>>>> e1d0a775 (Airgap documentation update.)
+$ docker pull nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
+$ docker save -o /tmp/images/nvidia-cuda-12.4.1-base-ubuntu22.04.tar nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
```

Additionally, you should download and save the [`registry` image](https://hub.docker.com/_/registry) so that you can deploy a local registry on the offline network.
@@ -76,13 +75,8 @@ Additionally, we assume that the `registry` image was included when you transfer

Load the registry image into the Docker image cache of your container registry host:

-<<<<<<< HEAD
-```bash
-docker load < /tmp/images/registry-2.7.tar
-=======
```
-$ docker load -i /tmp/images/registry-2.7.tar
->>>>>>> e1d0a775 (Airgap documentation update.)
+$ docker load -i /tmp/images/registry-3.1.1.tar
```

Then create a Docker volume to store your container images:
@@ -99,7 +93,7 @@ docker run -d \
--restart=always \
--name registry \
-v registry-images:/var/lib/registry \
-registry:2.7
+registry:3.1.1
```

## Configuring your hosts to use the offline container registry
@@ -121,7 +115,7 @@ docker_insecure_registries:
Once your registry is running and you've configured your hosts to access it, you can load additional images and push them to the offline registry:

```bash
-docker load < /tmp/images/nvidia-cuda-11.1-devel-ubuntu20.04.tar
-docker tag nvidia/cuda:11.1-devel-ubuntu20.04 registry-host:5000/nvidia/cuda:11.1-devel-ubuntu20.04
-docker push registry-host:5000/nvidia/cuda:11.1-devel-ubuntu20.04
+docker load -i /tmp/images/nvidia-cuda-12.4.1-base-ubuntu22.04.tar
+docker tag nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 registry-host:5000/nvidia/cuda:12.4.1-base-ubuntu22.04
+docker push registry-host:5000/nvidia/cuda:12.4.1-base-ubuntu22.04
```
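
As a rough illustration of the load, tag, and push steps above, the sketch below derives the tarball name and the offline-registry tag from an image reference, following the naming pattern used in these examples. The function names, the `REGISTRY` and `IMAGE_DIR` defaults, and the `DOCKER` override are assumptions made for illustration; none of them are part of DeepOps.

```bash
# Hypothetical helpers, not part of DeepOps. Defaults mirror the examples above.
REGISTRY="${REGISTRY:-registry-host:5000}"
IMAGE_DIR="${IMAGE_DIR:-/tmp/images}"

# Drop a leading registry host (a first path component containing "." or ":").
strip_registry() {
  case "$1" in
    */*)
      case "${1%%/*}" in
        *.*|*:*) printf '%s\n' "${1#*/}"; return 0 ;;
      esac
      ;;
  esac
  printf '%s\n' "$1"
}

# Tarball file name: replace "/" and ":" with "-" and append ".tar".
tarball_name() {
  printf '%s.tar\n' "$(strip_registry "$1" | sed 's|[/:]|-|g')"
}

# Image reference as it should appear in the offline registry.
offline_tag() {
  printf '%s/%s\n' "$REGISTRY" "$(strip_registry "$1")"
}

# Load, retag, and push one image.
# Set DOCKER=echo to print the docker arguments instead of running docker.
push_image() {
  target="$(offline_tag "$1")"
  ${DOCKER:-docker} load -i "${IMAGE_DIR}/$(tarball_name "$1")" &&
    ${DOCKER:-docker} tag "$1" "$target" &&
    ${DOCKER:-docker} push "$target"
}
```

Calling `push_image nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04` would then run the same three commands shown above. Running it with `DOCKER=echo` first is a cheap way to check the derived names before touching the registry.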
18 changes: 9 additions & 9 deletions docs/airgap/ngc-ready.md
@@ -37,9 +37,9 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m

Container images are only needed if you want to run the tests built into the playbook:

-- nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04
-- nvcr.io/nvidia/pytorch:18.10-py3
-- nvcr.io/nvidia/tensorflow:18.10-py3
+- nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
+- nvcr.io/nvidia/pytorch:24.04-py3
+- nvcr.io/nvidia/tensorflow:24.04-tf2-py3

For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).

@@ -63,9 +63,9 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m

Container images (how to mirror) are only needed if you want to run the tests built into the playbook:

-- nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04
-- nvcr.io/nvidia/pytorch:18.10-py3
-- nvcr.io/nvidia/tensorflow:18.10-py3
+- nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
+- nvcr.io/nvidia/pytorch:24.04-py3
+- nvcr.io/nvidia/tensorflow:24.04-tf2-py3

For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).

@@ -182,9 +182,9 @@ dcgm_rpm_package: "/path/to/datacenter-gpu-manager.rpm"
If running the container tests as part of the NGC-Ready playbook, set the following variables in your DeepOps configuration:

```bash
-ngc_ready_cuda_container: "<your-container-registry>/nvidia/cuda:10.1-base-ubuntu18.04"
-ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:18.10-py3"
-ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:18.10-py3"
+ngc_ready_cuda_container: "<your-container-registry>/nvidia/cuda:12.4.1-base-ubuntu22.04"
+ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:24.04-py3"
+ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:24.04-tf2-py3"
```

## Running the NGC-Ready playbook
2 changes: 1 addition & 1 deletion virtual/scripts/setup_k8s.sh
@@ -40,7 +40,7 @@ source "${VIRT_DIR}/k8s_environment.sh"
# Verify that the cluster is up
file ${K8S_CONFIG_DIR}/artifacts/kubectl && chmod +x ${K8S_CONFIG_DIR}/artifacts/kubectl
kubectl get nodes
-#kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 -- nvidia-smi
+#kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 --limits=nvidia.com/gpu=1 -- nvidia-smi

# Install helm
"${ROOT_DIR}/scripts/k8s/install_helm.sh"
2 changes: 1 addition & 1 deletion workloads/examples/k8s/cluster-gpu-test-job.yml
@@ -10,7 +10,7 @@ spec:
spec:
containers:
- name: cluster-gpu-tests
-image: nvcr.io/nvidia/cuda:9.0-base
+image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
command: ["/bin/bash","-c","nvidia-smi && sleep 10"]
args:
resources:
3 changes: 1 addition & 2 deletions workloads/examples/k8s/gpu-test-job.yml
@@ -5,11 +5,10 @@ metadata:
spec:
containers:
- name: cuda-container
-image: nvcr.io/nvidia/cuda:10.0-devel
+image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
command: ["sleep", "6000"]
args:
resources:
limits:
nvidia.com/gpu: 1
restartPolicy: Never

2 changes: 1 addition & 1 deletion workloads/jenkins/scripts/run-gpu-job.sh
@@ -32,7 +32,7 @@ fi

# Occassionally this gpu-test fails and/or hangs. To ease debugging of this we run a describe several seconds into the launch.
sleep 10 && kubectl describe pods gpu-test &
-timeout 300 kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 -- nvidia-smi
+timeout 300 kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 --limits=nvidia.com/gpu=1 -- nvidia-smi

# Run multi-GPU test
if [ ${DEEPOPS_FULL_INSTALL} ]; then
2 changes: 1 addition & 1 deletion workloads/jenkins/scripts/test-slurm-enroot-job.sh
@@ -9,5 +9,5 @@ ssh -v \
-i "${HOME}/.ssh/id_rsa" \
"10.0.0.5${GPU01}" \
srun -N1 -G1 \
---container-image="nvcr.io#nvidia/cuda:10.2-base-ubuntu18.04" \
+--container-image="nvcr.io#nvidia/cuda:12.4.1-base-ubuntu22.04" \
nvidia-smi -L