Add basic integration tests for docker runtime #860

ArangoGutierrez · 2025-01-14T14:32:00Z

Add a set of basic integration tests for using the NVIDIA Container Toolkit with Docker.

This adds the following tests:

Basic nvidia-smi tests
Basic vectorAdd tests
Basic deviceQuery tests

These tests assume that the NVIDIA Container Toolkit is installed and that Docker is configured with the nvidia runtime.
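The precondition above can be checked before running the suite. The following is a minimal sketch, not part of this PR: `hasNvidiaRuntime` is a hypothetical helper, and it assumes that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// hasNvidiaRuntime reports whether the JSON object of configured Docker
// runtimes contains an entry named "nvidia".
// Hypothetical helper for illustration; it is not part of the PR.
func hasNvidiaRuntime(runtimesJSON []byte) (bool, error) {
	var runtimes map[string]json.RawMessage
	if err := json.Unmarshal(runtimesJSON, &runtimes); err != nil {
		return false, fmt.Errorf("parsing docker runtimes: %w", err)
	}
	_, ok := runtimes["nvidia"]
	return ok, nil
}

func main() {
	// Assumption: the Docker CLI is on PATH and the daemon is reachable.
	out, err := exec.Command("docker", "info", "--format", "{{json .Runtimes}}").Output()
	if err != nil {
		fmt.Println("could not query docker:", err)
		return
	}
	ok, err := hasNvidiaRuntime(out)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("nvidia runtime configured:", ok)
}
```

A check like this could run in a `BeforeSuite` so that a missing runtime fails fast with a clear message instead of failing every spec.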

Example output

make -f test/e2e/Makefile test 
cd /localhome/local-eduardoa/test/e2e && go test -v . -args \
        -ginkgo.focus="docker" \
        -test.timeout=1h \
        -ginkgo.v
=== RUN   TestMain
Running Suite: NVIDIA Container Toolkit E2E - /localhome/local-eduardoa/test/e2e
================================================================================
Random Seed: 1736942191

Will run 12 of 12 specs
------------------------------
[BeforeSuite] 
/localhome/local-eduardoa/test/e2e/e2e_test.go:45
[BeforeSuite] PASSED [0.000 seconds]
------------------------------
docker when running nvidia-smi -L should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43
• [1.493 seconds]
------------------------------
docker when running nvidia-smi -L should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:49
• [0.736 seconds]
------------------------------
docker when running nvidia-smi -L should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:55
• [0.701 seconds]
------------------------------
docker when running nvidia-smi -L should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:61
• [0.630 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:78
• [1.229 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:87
• [1.015 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:93
• [0.902 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:99
• [0.889 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:116
• [1.186 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:125
• [1.081 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:131
• [0.890 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:137
• [0.887 seconds]
------------------------------

Ran 12 of 12 Specs in 11.640 seconds
SUCCESS! -- 12 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestMain (11.64s)
PASS
ok      github.com/NVIDIA/nvidia-container-toolkit/test/e2e     11.649s

test/go.mod

test/e2e/e2e_test.go

elezar · 2025-01-14T14:44:01Z

test/e2e/e2e_test.go

+
+	// If there's an error, include stderr in the error message
+	if err != nil {
+		return "", errors.New(stderr.String())


We should definitely log the output and return a wrapped err here instead.

Would we not still want to return the stdout?

I introduced an error on purpose to see what it would look like:

[FAILED] in [It] - /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/14/25 16:30:38.847
• [FAILED] [1.528 seconds]
docker when running nvidia-smi -L [It] it should show the same output from outside a container
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:38
[FAILED] Unexpected error:
<*errors.errorString | 0xc000128030>:
Unable to find image 'bla:latest' locally
docker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
{
    s: "Unable to find image 'bla:latest' locally\ndocker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.\nSee 'docker run --help'.\n",
}
occurred
In [It] at: /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/14/25 16:30:38.847

My question was more on whether this is idiomatic. What about:

if err := cmd.Run(); err != nil {
	GinkgoLogr.Error(err, "Failed to run script:\nSTDERR: %v\nSTDOUT: %v", stderr.String(), stdout.String())
	return "", err
}

instead?

And the error now reads:

docker when running nvidia-smi -L should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43
[FAILED] in [It] - /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/15/25 11:31:24.787
• [FAILED] [1.621 seconds]
docker when running nvidia-smi -L [It] should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43
[FAILED] Unexpected error:
<*errors.errorString | 0xc00020c4c0>:
script execution failed: exit status 125
STDOUT:
STDERR: Unable to find image 'bla:latest' locally
docker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
{
    s: "script execution failed: exit status 125\nSTDOUT: \nSTDERR: Unable to find image 'bla:latest' locally\ndocker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.\nSee 'docker run --help'.\n",
}
occurred
In [It] at: /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/15/25 11:31:24.787

OK. Not going to block on this.

elezar · 2025-01-14T14:45:52Z

test/e2e/nvidia-container-toolkit_test.go

+	// container shows the same output inside the container as outside the
+	// container. This means that the following commands must all produce
+	// the same output
+	When("Running nvidia-smi -L", Ordered, func() {


Is Ordered required here? Is there a way that the different mechanisms to select devices are treated as separate "contexts"?

From the documentation: "Ordered is a decorator that allows you to mark a container as ordered. Specs in the container will always run in the order they appear. They will never be randomized and they will never run in parallel with one another, though they may run in parallel with other specs."
So it is not required; we can remove it if we want.

There are no specs in this container though, or am I misunderstanding what a "Spec" is? Can we rewrite the separate tests as Specs?

you mean an It() for each run?

I have updated the PR description to reflect new changes related to this thread

Let's also update the title to make it more specific.

both PR title and description are updated now

test/e2e/nvidia-container-toolkit_test.go

test/e2e/Makefile

elezar · 2025-01-14T17:24:16Z

test/e2e/nvidia-container-toolkit_test.go

+			Expect(err).ToNot(HaveOccurred())
+		})
+
+		Describe("comparing outputs of nvidia-smi -L inside and outside a container", func() {


Do we need the Describe spec? What about:

When("running nvidia-smi -L")
	It("should support NVIDIA_VISIBLE_DEVICES")
	It("should support automatic CDI spec generation")
	It("should support the --gpus flag using the nvidia-container-runtime-hook")
	It("should support the --gpus flag using the nvidia-container-runtime")

it is not needed, let me check how it looks without it.

Done, I have removed the Describe. I decided to keep the strings as they are for now so that they reflect the tests as described in the design doc. We can edit the strings in a follow-up PR once we add further complexity, such as installing and uninstalling things during testing.

Why do we want to match the strings to the doc? They should signify the intent, and if that's not clear from the doc then the doc should be updated.

To summarize: We want these tests to validate that one is able to run nvidia-smi in a container and get the expected result regardless of the mechanism used to inject the devices into the container.

ack, will edit

should I also re-write the vectorAdd and deviceQuery tests?

ready for review
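The intent discussed in this thread (same workload, four injection mechanisms) can be sketched as a small table. The `mechanism` type and `dockerCommand` helper are hypothetical. Only the `NVIDIA_VISIBLE_DEVICES=all` command line appears in this PR's diff; the flags for the other three entries are assumptions based on how the toolkit is commonly invoked.

```go
package main

import (
	"fmt"
	"strings"
)

// mechanism pairs a spec description with the docker flags that inject GPUs.
// Hypothetical type for illustration; it is not part of the PR.
type mechanism struct {
	description string
	dockerFlags []string
}

// The first entry matches the command shown in the diff. The remaining
// flag sets are assumptions and may differ from the merged tests.
var mechanisms = []mechanism{
	{"should support NVIDIA_VISIBLE_DEVICES",
		[]string{"--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"}},
	{"should support automatic CDI spec generation",
		[]string{"--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all"}},
	{"should support the --gpus flag using the nvidia-container-runtime",
		[]string{"--runtime=nvidia", "--gpus", "all"}},
	{"should support the --gpus flag using the nvidia-container-runtime-hook",
		[]string{"--gpus", "all"}},
}

// dockerCommand builds the `docker run` line for one mechanism. The optional
// args are the command to run in the container (empty for sample images that
// define their own entrypoint).
func dockerCommand(m mechanism, image string, args ...string) string {
	parts := []string{"docker", "run", "--rm", "-i"}
	parts = append(parts, m.dockerFlags...)
	parts = append(parts, image)
	parts = append(parts, args...)
	return strings.Join(parts, " ")
}

func main() {
	for _, m := range mechanisms {
		fmt.Printf("%s:\n  %s\n", m.description, dockerCommand(m, "ubuntu", "nvidia-smi", "-L"))
	}
}
```

Each entry would map to one `It(...)` spec, so the spec names state the intent and the workload (nvidia-smi, vectorAdd, deviceQuery) varies only in the image and arguments.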

elezar · 2025-01-15T12:37:56Z

test/e2e/nvidia-container-toolkit_test.go

+			_, err := runScript("docker pull ubuntu")
+			Expect(err).ToNot(HaveOccurred())
+
+			rawOut, err = runScript("nvidia-smi -L")


nit:

Suggested change

-			rawOut, err = runScript("nvidia-smi -L")
+			hostOutput, err = runScript("nvidia-smi -L")

elezar · 2025-01-15T12:38:25Z

test/e2e/nvidia-container-toolkit_test.go

+		})
+
+		It("should support NVIDIA_VISIBLE_DEVICES", func(ctx context.Context) {
+			outContainer, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")


nit:

Suggested change

-			outContainer, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")
+			containerOutput, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")
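With the names suggested in these two nits, the comparison the spec performs reads directly as "host versus container". A minimal sketch follows. `sameDevices` is a hypothetical helper, trimming surrounding whitespace is an assumption, and the sample string is a placeholder rather than real `nvidia-smi -L` output from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// sameDevices reports whether the container listed the same devices as the
// host. Surrounding whitespace is ignored (an assumption made here so that a
// trailing newline does not cause a mismatch).
func sameDevices(hostOutput, containerOutput string) bool {
	return strings.TrimSpace(hostOutput) == strings.TrimSpace(containerOutput)
}

func main() {
	// Placeholder output in the `nvidia-smi -L` layout; not taken from the PR.
	hostOutput := "GPU 0: Example GPU (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
	containerOutput := hostOutput
	fmt.Println(sameDevices(hostOutput, containerOutput))
}
```

In the Ginkgo spec itself this would typically be a single Gomega expectation on `containerOutput` against `hostOutput`.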

elezar · 2025-01-15T12:39:44Z

test/e2e/nvidia-container-toolkit_test.go

+			out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")
+			Expect(err).ToNot(HaveOccurred())
+
+			// look for string "Test PASSED" in the output, if not found, fail the test


I don't think this comment says more than what is already readable through the expect statement.

elezar · 2025-01-15T12:40:02Z

test/e2e/nvidia-container-toolkit_test.go

+
+		It("should support NVIDIA_VISIBLE_DEVICES", func(ctx context.Context) {
+			var err error
+			out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")


nit:

Suggested change

-			out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")
+			containerOutput, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")

elezar · 2025-01-15T12:40:41Z

test/e2e/nvidia-container-toolkit_test.go

+			Expect(err).ToNot(HaveOccurred())
+		})
+
+		var out string


alternatively call this referenceOutput

elezar · 2025-01-15T12:41:57Z

test/go.mod

@@ -0,0 +1,20 @@
+module github.com/NVIDIA/nvidia-container-toolkit/test


not critical to this PR, but we should ensure that we update the dependabot configs.

Was it done here?

elezar · 2025-01-15T12:47:34Z

Thanks @ArangoGutierrez. This looks really good now. I have some minor comments around variable naming.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

elezar

Thanks @ArangoGutierrez. Let's get this in.

ArangoGutierrez requested a review from elezar January 14, 2025 14:32

ArangoGutierrez self-assigned this Jan 14, 2025

elezar reviewed Jan 14, 2025

test/go.mod Outdated

elezar reviewed Jan 14, 2025

test/e2e/e2e_test.go Outdated

elezar reviewed Jan 14, 2025

test/e2e/e2e_test.go Outdated

elezar reviewed Jan 14, 2025

test/e2e/nvidia-container-toolkit_test.go Outdated

elezar reviewed Jan 14, 2025

test/e2e/nvidia-container-toolkit_test.go

elezar reviewed Jan 14, 2025

test/e2e/Makefile Outdated

ArangoGutierrez force-pushed the reg_test01 branch from 43b82e4 to 097f649 Compare January 14, 2025 16:40

ArangoGutierrez requested a review from elezar January 14, 2025 16:40

ArangoGutierrez force-pushed the reg_test01 branch 3 times, most recently from 894da55 to 255892b Compare January 14, 2025 16:56

elezar reviewed Jan 14, 2025

ArangoGutierrez force-pushed the reg_test01 branch from 255892b to b22dfd8 Compare January 14, 2025 18:16

ArangoGutierrez requested a review from elezar January 14, 2025 18:17

ArangoGutierrez force-pushed the reg_test01 branch from b22dfd8 to ede9606 Compare January 15, 2025 11:57

ArangoGutierrez changed the title ~~Automated regression testing for the NVIDIA Container Toolkit~~ Add base for integration tests Jan 15, 2025

elezar reviewed Jan 15, 2025

elezar changed the title ~~Add base for integration tests~~ Add basic integration tests for docker runtime Jan 15, 2025

Automated regression testing for the NVIDIA Container Toolkit

8cc672b

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez force-pushed the reg_test01 branch from ede9606 to 8cc672b Compare January 15, 2025 13:39

ArangoGutierrez requested a review from elezar January 15, 2025 13:39

elezar approved these changes Jan 15, 2025

ArangoGutierrez merged commit cb52e77 into NVIDIA:main Jan 15, 2025
11 checks passed
