Collaborator

@ArangoGutierrez ArangoGutierrez commented Jan 14, 2025

Add a set of basic integration tests for using the NVIDIA Container Toolkit with Docker.

This adds the following tests:

  • Basic nvidia-smi tests
  • Basic vectorAdd tests
  • Basic deviceQuery tests

These tests assume that the NVIDIA Container Toolkit is installed and that Docker is configured with the nvidia runtime.
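The precondition above could be checked before the suite runs. The following is a minimal sketch, not part of this PR: it assumes that `docker info --format '{{json .Runtimes}}'` prints the configured runtimes as a JSON object keyed by runtime name. The helper name `hasRuntime` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// hasRuntime reports whether the JSON-encoded runtimes map (as printed by
// `docker info --format '{{json .Runtimes}}'`) contains the named runtime.
// The helper name is hypothetical and only used for illustration.
func hasRuntime(runtimesJSON []byte, name string) (bool, error) {
	runtimes := map[string]json.RawMessage{}
	if err := json.Unmarshal(runtimesJSON, &runtimes); err != nil {
		return false, fmt.Errorf("failed to parse docker runtimes: %w", err)
	}
	_, ok := runtimes[name]
	return ok, nil
}

func main() {
	// Ask the Docker daemon for its configured runtimes.
	out, err := exec.Command("docker", "info", "--format", "{{json .Runtimes}}").Output()
	if err != nil {
		fmt.Println("docker info failed:", err)
		return
	}
	ok, err := hasRuntime(out, "nvidia")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("nvidia runtime configured:", ok)
}
```

A check like this would let the suite skip or fail early with a clear message instead of failing inside each spec.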

Example output

make -f test/e2e/Makefile test 
cd /localhome/local-eduardoa/test/e2e && go test -v . -args \
        -ginkgo.focus="docker" \
        -test.timeout=1h \
        -ginkgo.v
=== RUN   TestMain
Running Suite: NVIDIA Container Toolkit E2E - /localhome/local-eduardoa/test/e2e
================================================================================
Random Seed: 1736942191

Will run 12 of 12 specs
------------------------------
[BeforeSuite] 
/localhome/local-eduardoa/test/e2e/e2e_test.go:45
[BeforeSuite] PASSED [0.000 seconds]
------------------------------
docker when running nvidia-smi -L should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43
• [1.493 seconds]
------------------------------
docker when running nvidia-smi -L should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:49
• [0.736 seconds]
------------------------------
docker when running nvidia-smi -L should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:55
• [0.701 seconds]
------------------------------
docker when running nvidia-smi -L should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:61
• [0.630 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:78
• [1.229 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:87
• [1.015 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:93
• [0.902 seconds]
------------------------------
docker when Running the cuda-vectorAdd sample should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:99
• [0.889 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support NVIDIA_VISIBLE_DEVICES
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:116
• [1.186 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support automatic CDI spec generation
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:125
• [1.081 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:131
• [0.890 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime-hook
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:137
• [0.887 seconds]
------------------------------

Ran 12 of 12 Specs in 11.640 seconds
SUCCESS! -- 12 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestMain (11.64s)
PASS
ok      github.com/NVIDIA/nvidia-container-toolkit/test/e2e     11.649s

@ArangoGutierrez ArangoGutierrez self-assigned this Jan 14, 2025

// If there's an error, include stderr in the error message
if err != nil {
	return "", errors.New(stderr.String())

Member

We should definitely log the output and return a wrapped err here instead.

Member

Would we not still want to return the stdout?

Collaborator Author

I introduced an error on purpose to see what it would look like:

  [FAILED] in [It] - /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/14/25 16:30:38.847
• [FAILED] [1.528 seconds]                                                                                
docker when running nvidia-smi -L [It] it should show the same output from outside a container            
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:38                                    
                                                                                                          
  [FAILED] Unexpected error:                                                                              
      <*errors.errorString | 0xc000128030>:                                                               
      Unable to find image 'bla:latest' locally                                                           
      docker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
      See 'docker run --help'.                                                                            
                                                                                                          
      {                                                                                                   
          s: "Unable to find image 'bla:latest' locally\ndocker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.\nSee 'docker run --help'.\n",
      }                                                                                                   
  occurred                                                                                                
  In [It] at: /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/14/25 16:30:38.847

Member

My question was more on whether this is idiomatic. What about:

	if err := cmd.Run(); err != nil {
		GinkgoLogr.Error(err, "Failed to run script", "stderr", stderr.String(), "stdout", stdout.String())
		return "", err
	}

instead?

Collaborator Author

And the error now reads:

docker when running nvidia-smi -L should support NVIDIA_VISIBLE_DEVICES                          
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43                           
  [FAILED] in [It] - /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/15/25 11:31:24.787
• [FAILED] [1.621 seconds]       
docker when running nvidia-smi -L [It] should support NVIDIA_VISIBLE_DEVICES                                                                                                                      
/localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:43 
                                                                                                 
  [FAILED] Unexpected error:                                                                     
      <*errors.errorString | 0xc00020c4c0>:                                                                                                                                                       
      script execution failed: exit status 125                                                   
      STDOUT:                                                                                    
      STDERR: Unable to find image 'bla:latest' locally        
      docker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
      See 'docker run --help'.                                                                   
                                                                                                 
      {                                                                                          
          s: "script execution failed: exit status 125\nSTDOUT: \nSTDERR: Unable to find image 'bla:latest' locally\ndocker: Error response from daemon: pull access denied for bla, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.\nSee 'docker run --help'.\n",
      }                                                                                          
  occurred
  In [It] at: /localhome/local-eduardoa/test/e2e/nvidia-container-toolkit_test.go:45 @ 01/15/25 11:31:24.787

Member

OK. Not going to block on this.

// container shows the same output inside the container as outside the
// container. This means that the following commands must all produce
// the same output
When("Running nvidia-smi -L", Ordered, func() {

Member

@elezar elezar Jan 14, 2025

Is Ordered required here? Is there a way that the different mechanisms to select devices are treated as separate "contexts"?

Collaborator Author

From the documentation: "Ordered is a decorator that allows you to mark a container as ordered. Specs in the container will always run in the order they appear. They will never be randomized and they will never run in parallel with one another, though they may run in parallel with other specs."
So it is not required; we can remove it if we want.

Member

There are no specs in this container though, or am I misunderstanding what a "Spec" is? Can we rewrite the separate tests as Specs?

Collaborator Author

you mean an It() for each run?

Collaborator Author

I have updated the PR description to reflect new changes related to this thread

Member

Let's also update the title to make it more specific.

Collaborator Author

both PR title and description are updated now

Expect(err).ToNot(HaveOccurred())
})

Describe("comparing outputs of nvidia-smi -L inside and outside a container", func() {

Member

Do we need the Describe spec? What about:

When("running nvidia-smi -L")
   It("should support NVIDIA_VISIBLE_DEVICES")
   It("should support automatic CDI spec generation")
   It("should support the --gpus flag using the nvidia-container-runtime-hook")
   It("should support the --gpus flag using the nvidia-container-runtime")

Collaborator Author

it is not needed, let me check how it looks without it.

Collaborator Author

@ArangoGutierrez ArangoGutierrez Jan 14, 2025

Done, I have removed the Describe. I decided to keep the strings as they are for now so that they reflect the tests as described in the design doc. We can edit the strings in a follow-up PR once we add further complexity, such as installing and uninstalling things during testing.

Member

Why do we want to match the strings to the doc? They should signify the intent, and if that's not clear from the doc then the doc should be updated.

To summarize: We want these tests to validate that one is able to run nvidia-smi in a container and get the expected result regardless of the mechanism used to inject the devices into the container.

Collaborator Author

ack, will edit

Collaborator Author

should I also re-write the vectorAdd and deviceQuery tests?

Collaborator Author

ready for review

@ArangoGutierrez ArangoGutierrez changed the title from "Automated regression testing for the NVIDIA Container Toolkit" to "Add base for integration tests" Jan 15, 2025

_, err := runScript("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

rawOut, err = runScript("nvidia-smi -L")

Member

nit:

Suggested change
- rawOut, err = runScript("nvidia-smi -L")
+ hostOutput, err = runScript("nvidia-smi -L")

Collaborator Author

Done

})

It("should support NVIDIA_VISIBLE_DEVICES", func(ctx context.Context) {
outContainer, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")

Member

nit:

Suggested change
- outContainer, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")
+ containerOutput, err := runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L")

Collaborator Author

Done

out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")
Expect(err).ToNot(HaveOccurred())

// look for string "Test PASSED" in the output, if not found, fail the test

Member

I don't think this comment says more than what is already readable through the expect statement.

Collaborator Author

Done


It("should support NVIDIA_VISIBLE_DEVICES", func(ctx context.Context) {
var err error
out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")

Member

nit:

Suggested change
- out, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")
+ containerOutput, err = runScript("docker run --rm -i --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0")

Collaborator Author

Done

Expect(err).ToNot(HaveOccurred())
})

var out string

Member

alternatively call this referenceOutput

Collaborator Author

Done

@@ -0,0 +1,20 @@
module github.com/NVIDIA/nvidia-container-toolkit/test

Member

not critical to this PR, but we should ensure that we update the dependabot configs.

Collaborator Author

Done

Member

Was it done here?

@elezar elezar changed the title from "Add base for integration tests" to "Add basic integration tests for docker runtime" Jan 15, 2025

Member

elezar commented Jan 15, 2025

Thanks @ArangoGutierrez. This looks really good now. I have some minor comments around variable naming.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Member

@elezar elezar left a comment

Thanks @ArangoGutierrez. Let's get this in.

@ArangoGutierrez ArangoGutierrez merged commit cb52e77 into NVIDIA:main Jan 15, 2025
11 checks passed