1 change: 1 addition & 0 deletions .gitignore
@@ -168,6 +168,7 @@ logs/
tmp/
temp/
*.tmp
e2e/gpu/images/.build/

# Secrets/credentials (should never be committed)
*.pem
113 changes: 113 additions & 0 deletions e2e/gpu/README.md
@@ -0,0 +1,113 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# GPU workload images

This directory defines GPU workload images used by OpenShell GPU e2e tests.

The image definitions live here first so the OpenShell e2e harness can iterate
against a concrete contract. Long-term ownership of the images should move to
`NVIDIA/OpenShell-Community`; OpenShell should then keep only the contract, the
local build task, and the tests that consume published image refs.

## Contract

Each workload image must:

- Use the OpenShell community base image as its final-stage base.
- Install the workload at `/usr/local/bin/openshell-gpu-workload`.
- Run the same workload as the image default entrypoint for direct
container-engine validation.
- Require no network access after the image is pulled.
- Print `OPENSHELL_GPU_WORKLOAD_SUCCESS` only when validation succeeds.
- Print `OPENSHELL_GPU_WORKLOAD_FAILURE` and exit non-zero when validation
fails.
- Be usable as an OpenShell sandbox image with `openshell sandbox create
--from <image>`.

OpenShell sandbox creation replaces the image entrypoint with the supervisor and
does not run the OCI image `CMD`, so the workload does not start automatically
inside a sandbox. E2e tests that use these images through OpenShell must run
`/usr/local/bin/openshell-gpu-workload` explicitly.

## Images

| Source directory | Image name | Purpose |
| --- | --- | --- |
| `smoke-pass` | `gpu-workload-smoke-pass` | Always succeeds and prints the success marker. |
| `smoke-fail` | `gpu-workload-smoke-fail` | Always fails and prints the failure marker. |
| `cuda-basic` | `gpu-workload-cuda-basic` | Runs CUDA `deviceQuery` and `vectorAdd` validation. |

## Build

Build all workload images:

```shell
mise run e2e:gpu:images:build
```

Build a subset by source directory name:

```shell
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass,smoke-fail \
mise run e2e:gpu:images:build
```

The build task uses `tasks/scripts/container-engine.sh`. Set
`CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to choose an engine
explicitly. When unset, the helper uses its existing auto-detection behavior.

Local tags use the current commit short SHA. Dirty local trees append `-dirty`.
Set `OPENSHELL_GPU_WORKLOAD_IMAGE_TAG=<tag>` to override the tag.
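
The tagging rule above can be sketched as a small shell function. This is only
an illustration of the described behavior, not the build task's actual code:
`workload_image_tag` and its argument order are hypothetical names chosen here.

```shell
# Hypothetical helper; the real build task may derive tags differently.
# Usage: workload_image_tag <short-sha> <dirty: 0|1> [override-tag]
workload_image_tag() {
    local short_sha=$1
    local dirty=$2
    local override=${3:-}

    # An explicit override (for example OPENSHELL_GPU_WORKLOAD_IMAGE_TAG) wins.
    if [ -n "${override}" ]; then
        printf '%s\n' "${override}"
        return 0
    fi

    # Otherwise use the short SHA, with a -dirty suffix for modified trees.
    if [ "${dirty}" = "1" ]; then
        printf '%s-dirty\n' "${short_sha}"
    else
        printf '%s\n' "${short_sha}"
    fi
}

# In a git checkout, the inputs could be gathered like this:
#   sha="$(git rev-parse --short HEAD)"
#   dirty=0; [ -n "$(git status --porcelain)" ] && dirty=1
#   workload_image_tag "${sha}" "${dirty}" "${OPENSHELL_GPU_WORKLOAD_IMAGE_TAG:-}"
```

Keeping the function free of git calls makes the rule easy to check in
isolation; the commented lines show one way to feed it from a working tree.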

The task writes the latest build refs to:

```text
e2e/gpu/images/.build/latest.env
```

Use it in later commands:

```shell
source e2e/gpu/images/.build/latest.env
```
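
Before using the refs, it can help to confirm that the variables are actually
set. The sketch below assumes `latest.env` defines the
`OPENSHELL_E2E_GPU_*_IMAGE` variables used in the commands in this README;
`require_image_refs` is a hypothetical helper, not part of the build task.

```shell
# Hypothetical helper: returns non-zero if any named variable is unset or empty.
require_image_refs() {
    local name value missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in ${name}.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing image ref: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example, after sourcing e2e/gpu/images/.build/latest.env:
#   require_image_refs \
#       OPENSHELL_E2E_GPU_SMOKE_PASS_IMAGE \
#       OPENSHELL_E2E_GPU_SMOKE_FAIL_IMAGE \
#       OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE
```

If only a subset of images was built, pass only the variable names for that
subset.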

## Direct Validation

Validate smoke pass:

```shell
docker run --rm "${OPENSHELL_E2E_GPU_SMOKE_PASS_IMAGE}"
```

Validate smoke fail:

```shell
docker run --rm "${OPENSHELL_E2E_GPU_SMOKE_FAIL_IMAGE}"
```

The smoke fail command should exit non-zero and print
`OPENSHELL_GPU_WORKLOAD_FAILURE`.
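
The pass and fail expectations can be checked mechanically. The following is a
minimal sketch, assuming the caller captures the exit status and the combined
stdout and stderr of the run; `assert_workload_result` is a hypothetical helper
and is not part of the e2e harness.

```shell
# Hypothetical helper, not part of the e2e harness.
# Usage: assert_workload_result <pass|fail> <exit-status> <captured-output>
assert_workload_result() {
    local expect=$1
    local status=$2
    local output=$3

    case "${expect}" in
        pass)
            # Success requires exit status 0 and the success marker.
            [ "${status}" -eq 0 ] || return 1
            printf '%s\n' "${output}" \
                | grep -Fq "OPENSHELL_GPU_WORKLOAD_SUCCESS" || return 1
            ;;
        fail)
            # Failure requires a non-zero exit status and the failure marker.
            [ "${status}" -ne 0 ] || return 1
            printf '%s\n' "${output}" \
                | grep -Fq "OPENSHELL_GPU_WORKLOAD_FAILURE" || return 1
            ;;
        *)
            echo "unknown expectation: ${expect}" >&2
            return 2
            ;;
    esac
}

# Example with the smoke-fail image. stderr is merged into the capture because
# the workload scripts write the failure marker there:
#   status=0
#   output="$(docker run --rm "${OPENSHELL_E2E_GPU_SMOKE_FAIL_IMAGE}" 2>&1)" || status=$?
#   assert_workload_result fail "${status}" "${output}"
```

Checking both the exit status and the marker guards against a run that prints
the right text but exits with the wrong status, or the reverse.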

Validate CUDA with Docker CDI:

```shell
docker run --rm --device nvidia.com/gpu=all \
"${OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE}"
```

Use `podman run` with the same `--device nvidia.com/gpu=all` option on hosts
where Podman CDI is configured.

Direct container-engine validation catches image, CDI, CUDA, and host GPU setup
issues before OpenShell sandbox behavior is involved.

## Publish Guidance

Published tests should reference immutable image refs:

```shell
OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE=ghcr.io/nvidia/openshell-community/sandboxes/gpu-workload-cuda-basic@sha256:<digest>
```

Mutable tags are acceptable for local iteration. CI should use a digest or an
immutable release tag once the images are published from OpenShell-Community.
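
A CI job could reject refs that are not pinned by digest. The sketch below is
one possible check, assuming SHA-256 digests written as 64 lowercase hex
characters; `is_digest_pinned` is a hypothetical name, and the check does not
cover the immutable-release-tag case.

```shell
# Hypothetical check: succeeds only when the ref ends in @sha256:<64 hex chars>.
is_digest_pinned() {
    printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Example:
#   if ! is_digest_pinned "${OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE}"; then
#       echo "image ref is not pinned by digest" >&2
#       exit 1
#   fi
```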
72 changes: 72 additions & 0 deletions e2e/gpu/images/cuda-basic/Dockerfile
@@ -0,0 +1,72 @@
# syntax=docker/dockerfile:1

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

ARG CUDA_BUILD_IMAGE=nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
ARG OPENSHELL_SANDBOX_BASE_IMAGE=ghcr.io/nvidia/openshell-community/sandboxes/base:latest

FROM ${CUDA_BUILD_IMAGE} AS builder

ARG DEBIAN_FRONTEND=noninteractive
ARG CUDA_SAMPLES_REF=v12.8
ARG CUDA_SAMPLES_REPO=https://github.com/NVIDIA/cuda-samples

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        cmake \
        cuda-nvcc-12-8 \
        curl \
        g++ \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build/cuda-samples

RUN set -eux; \
    curl -fsSL "${CUDA_SAMPLES_REPO}/archive/refs/tags/${CUDA_SAMPLES_REF}.tar.gz" \
        -o /tmp/cuda-samples.tar.gz; \
    tar -xzf /tmp/cuda-samples.tar.gz \
        --strip-components=1 \
        --wildcards \
        '*/Common/*' \
        '*/cmake/*' \
        '*/Samples/0_Introduction/vectorAdd/*' \
        '*/Samples/1_Utilities/deviceQuery/*' \
        '*/LICENSE'; \
    sed -i 's/CUDA::cudart/CUDA::cudart_static/g' \
        Samples/1_Utilities/deviceQuery/CMakeLists.txt; \
    cmake -S Samples/1_Utilities/deviceQuery -B /tmp/build-device-query \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CUDA_RUNTIME_LIBRARY=Static; \
    cmake --build /tmp/build-device-query --parallel; \
    cmake -S Samples/0_Introduction/vectorAdd -B /tmp/build-vector-add \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CUDA_RUNTIME_LIBRARY=Static; \
    cmake --build /tmp/build-vector-add --parallel; \
    mkdir -p /opt/openshell-gpu-workload; \
    cp /tmp/build-device-query/deviceQuery /opt/openshell-gpu-workload/deviceQuery; \
    cp /tmp/build-vector-add/vectorAdd /opt/openshell-gpu-workload/vectorAdd; \
    cp LICENSE /opt/openshell-gpu-workload/cuda-samples.LICENSE; \
    rm -f /tmp/cuda-samples.tar.gz

FROM ${OPENSHELL_SANDBOX_BASE_IMAGE}

ARG CUDA_SAMPLES_REF=v12.8

LABEL com.nvidia.openshell.gpu-workload.name="cuda-basic" \
com.nvidia.openshell.gpu-workload.cuda-samples-ref="${CUDA_SAMPLES_REF}"

USER root
RUN mkdir -p /usr/local/lib/openshell-gpu-workload \
/usr/local/share/doc/openshell-gpu-workload
COPY --from=builder /opt/openshell-gpu-workload/deviceQuery /usr/local/lib/openshell-gpu-workload/deviceQuery
COPY --from=builder /opt/openshell-gpu-workload/vectorAdd /usr/local/lib/openshell-gpu-workload/vectorAdd
COPY --from=builder /opt/openshell-gpu-workload/cuda-samples.LICENSE /usr/local/share/doc/openshell-gpu-workload/cuda-samples.LICENSE
COPY workload.sh /usr/local/bin/openshell-gpu-workload
RUN chmod 0755 /usr/local/bin/openshell-gpu-workload \
/usr/local/lib/openshell-gpu-workload/deviceQuery \
/usr/local/lib/openshell-gpu-workload/vectorAdd

USER sandbox
ENTRYPOINT ["/usr/local/bin/openshell-gpu-workload"]
42 changes: 42 additions & 0 deletions e2e/gpu/images/cuda-basic/README.md
@@ -0,0 +1,42 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# GPU workload CUDA basic

`cuda-basic` validates that a GPU-enabled environment can run a basic CUDA
runtime workload. It is a single image that runs two validation steps:

1. `deviceQuery` checks CUDA runtime, driver, and device discovery.
2. `vectorAdd` checks kernel launch, device memory allocation, host/device
copies, synchronization, and result validation.

The image builds the samples from `NVIDIA/cuda-samples` tag `v12.8` with a CUDA
12.8 builder image, then copies only the compiled binaries and the samples
license text into the final image, which is based on the OpenShell community
base image.

The workload prints `OPENSHELL_GPU_WORKLOAD_SUCCESS` only after both samples
pass. On failure it prints `OPENSHELL_GPU_WORKLOAD_FAILURE` and exits non-zero.

Build it with:

```shell
OPENSHELL_GPU_WORKLOAD_IMAGES=cuda-basic mise run e2e:gpu:images:build
```

Run it directly with Docker CDI:

```shell
source e2e/gpu/images/.build/latest.env
docker run --rm --device nvidia.com/gpu=all \
"${OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE}"
```

Use `podman run` with the same `--device nvidia.com/gpu=all` option when Podman
CDI is configured.

The image does not vendor GPU driver libraries such as `libcuda.so.1`. Those
libraries must be provided by the host GPU runtime or CDI injection.
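
When the workload fails early, one quick diagnostic is to look for the driver
library in the dynamic linker cache inside the container. This is a minimal
sketch, assuming `ldconfig` is available in the image and that the GPU runtime
refreshes the linker cache when it injects libraries; `has_libcuda` is a
hypothetical helper. A missing entry is a hint rather than proof, because some
setups mount the library without updating the cache.

```shell
# Hypothetical helper: reads `ldconfig -p` style output on stdin and succeeds
# if it lists the CUDA driver library.
has_libcuda() {
    grep -Fq 'libcuda.so.1'
}

# Example, run inside the container:
#   if ! ldconfig -p | has_libcuda; then
#       echo "libcuda.so.1 not found in linker cache" >&2
#   fi
```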

The CUDA samples are redistributed under the NVIDIA CUDA samples license. The
license text is copied into the image at
`/usr/local/share/doc/openshell-gpu-workload/cuda-samples.LICENSE`.
40 changes: 40 additions & 0 deletions e2e/gpu/images/cuda-basic/workload.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

set -euo pipefail

readonly SUCCESS_MARKER="OPENSHELL_GPU_WORKLOAD_SUCCESS"
readonly FAILURE_MARKER="OPENSHELL_GPU_WORKLOAD_FAILURE"
readonly WORKLOAD_DIR="/usr/local/lib/openshell-gpu-workload"

run_sample() {
    local name=$1
    local expected=$2
    local binary="${WORKLOAD_DIR}/${name}"
    local output

    output="$(mktemp)"
    echo "running CUDA sample: ${name}"
    if ! "${binary}" >"${output}" 2>&1; then
        cat "${output}"
        echo "${FAILURE_MARKER} ${name} exited non-zero" >&2
        rm -f "${output}"
        exit 1
    fi

    cat "${output}"
    if ! grep -Fq "${expected}" "${output}"; then
        echo "${FAILURE_MARKER} ${name} did not print expected output: ${expected}" >&2
        rm -f "${output}"
        exit 1
    fi

    rm -f "${output}"
}

run_sample "deviceQuery" "Result = PASS"
run_sample "vectorAdd" "Test PASSED"

echo "${SUCCESS_MARKER} cuda-basic"
15 changes: 15 additions & 0 deletions e2e/gpu/images/smoke-fail/Dockerfile
@@ -0,0 +1,15 @@
# syntax=docker/dockerfile:1

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

ARG OPENSHELL_SANDBOX_BASE_IMAGE=ghcr.io/nvidia/openshell-community/sandboxes/base:latest

FROM ${OPENSHELL_SANDBOX_BASE_IMAGE}

USER root
COPY workload.sh /usr/local/bin/openshell-gpu-workload
RUN chmod 0755 /usr/local/bin/openshell-gpu-workload

USER sandbox
ENTRYPOINT ["/usr/local/bin/openshell-gpu-workload"]
24 changes: 24 additions & 0 deletions e2e/gpu/images/smoke-fail/README.md
@@ -0,0 +1,24 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# GPU workload smoke fail

`smoke-fail` validates negative-path diagnostics in e2e test plumbing.

The workload does not perform GPU-specific work. It prints
`OPENSHELL_GPU_WORKLOAD_FAILURE` with a stable diagnostic to stderr and exits
with status `42`.

Build it with:

```shell
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-fail mise run e2e:gpu:images:build
```

Run it directly:

```shell
source e2e/gpu/images/.build/latest.env
docker run --rm "${OPENSHELL_E2E_GPU_SMOKE_FAIL_IMAGE}"
```

The direct run should exit with status `42` and print the failure marker on
stderr.
9 changes: 9 additions & 0 deletions e2e/gpu/images/smoke-fail/workload.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

set -euo pipefail

echo "OPENSHELL_GPU_WORKLOAD_FAILURE smoke-fail intentional failure" >&2
exit 42
15 changes: 15 additions & 0 deletions e2e/gpu/images/smoke-pass/Dockerfile
@@ -0,0 +1,15 @@
# syntax=docker/dockerfile:1

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

ARG OPENSHELL_SANDBOX_BASE_IMAGE=ghcr.io/nvidia/openshell-community/sandboxes/base:latest

FROM ${OPENSHELL_SANDBOX_BASE_IMAGE}

USER root
COPY workload.sh /usr/local/bin/openshell-gpu-workload
RUN chmod 0755 /usr/local/bin/openshell-gpu-workload

USER sandbox
ENTRYPOINT ["/usr/local/bin/openshell-gpu-workload"]
23 changes: 23 additions & 0 deletions e2e/gpu/images/smoke-pass/README.md
@@ -0,0 +1,23 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# GPU workload smoke pass

`smoke-pass` validates image publishing, sandbox image compatibility, default
entrypoint execution, and success-marker assertion plumbing.

The workload does not perform GPU-specific work. It prints
`OPENSHELL_GPU_WORKLOAD_SUCCESS` and exits `0`.

Build it with:

```shell
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass mise run e2e:gpu:images:build
```

Run it directly:

```shell
source e2e/gpu/images/.build/latest.env
docker run --rm "${OPENSHELL_E2E_GPU_SMOKE_PASS_IMAGE}"
```
8 changes: 8 additions & 0 deletions e2e/gpu/images/smoke-pass/workload.sh
@@ -0,0 +1,8 @@
#!/usr/bin/env bash

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

set -euo pipefail

echo "OPENSHELL_GPU_WORKLOAD_SUCCESS smoke-pass"