Conversation

ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jun 30, 2025

This patch fixes the handling of temporary files and directories in the NVIDIA container runtime hook so that tmpfs mounts are not leaked when the params file is handled inside the container in CDI mode. The tmpfs mount that we create for the modified params file is now created in the container's mount namespace instead of on the host.

We also add end-to-end tests to check for leaked mounts.

@ArangoGutierrez ArangoGutierrez self-assigned this Jun 30, 2025
Copilot

This comment was marked as outdated.

@coveralls

coveralls commented Jun 30, 2025

Pull Request Test Coverage Report for Build 16051212578

Details

  • 0 of 52 (0.0%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.09%) to 33.117%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | 0 | 52 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | 2 | 0.0% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 16051019037 | -0.09% |
| Covered Lines | 4381 |
| Relevant Lines | 13229 |

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez added the bug Issue/PR to expose/discuss/fix a bug label Jun 30, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the b/5363680 branch 4 times, most recently from 38c25d5 to 1c79350 Compare June 30, 2025 12:00
return unix.Mount("tmpfs", target, "tmpfs", 0, fmt.Sprintf("size=%d", size))
}

func createFileInRoot(containerRootDirPath string, destinationPath string, mode os.FileMode) (string, error) {
Member

TODO: This function also exists in internal/ldconfig; we should move it to a separate package.

_, _, err = runner.Run("mkdir -p /tmp/empty")
Expect(err).ToNot(HaveOccurred())

_, _, err = runner.Run("mount | sort > /tmp/mounts.before")
Member

What about capturing the output of mount | sort as variables and then comparing these using the Gomega matchers?
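
A minimal sketch of what that comparison could look like, assuming the first value returned by `runner.Run` is the command's stdout (the snippets in this PR discard it, so that is an assumption). The helper name `newMounts` is hypothetical and not part of the PR; it only uses the standard library so it can be tried on its own.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// newMounts returns the lines that appear in after but not in before.
// Both arguments are assumed to be the raw output of `mount`.
// Duplicate lines are counted, so a mount that appears twice in after
// but once in before is still reported once.
func newMounts(before, after string) []string {
	seen := make(map[string]int)
	for _, line := range strings.Split(before, "\n") {
		if line == "" {
			continue
		}
		seen[line]++
	}

	var leaked []string
	for _, line := range strings.Split(after, "\n") {
		if line == "" {
			continue
		}
		if seen[line] > 0 {
			seen[line]--
			continue
		}
		leaked = append(leaked, line)
	}
	sort.Strings(leaked)
	return leaked
}

func main() {
	// Illustrative mount tables, not taken from a real host.
	before := "proc on /proc type proc (rw)\ntmpfs on /run type tmpfs (rw)\n"
	after := before + "tmpfs on /tmp/empty/scratch type tmpfs (rw)\n"

	for _, m := range newMounts(before, after) {
		fmt.Println("leaked:", m)
	}
}
```

In the e2e test the same helper could back an assertion such as `Expect(newMounts(before, after)).To(BeEmpty())`, which would report the offending mount lines on failure instead of relying on files under /tmp.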

Expect(output).To(Equal("ModifyDeviceFiles: 0\n"))
})

//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true
Member

What is the purpose of this comment?

Collaborator Author

Oh, sorry, that was a note to myself during development so I wouldn't forget that bit.

})

//sudo docker run --runtime=nvidia --rm -ti -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all --mount type=bind,source=/tmp/empty,target=/empty,bind-propagation=shared ubuntu true
It("should work with nvidia-container-runtime", func(ctx context.Context) {
Member

Let's also add a case for --runtime=runc for the "legacy" code path. As a general question, could we add a tag to tests to indicate that it's targeting the legacy code path?

BeforeAll(func(ctx context.Context) {
_, _, err := runner.Run("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

Member

Nit: Let's remove this from the diff.

@ArangoGutierrez ArangoGutierrez changed the title from "Fix Disabling device node creation hook by passing MS_PRIVATE flag during mount creation" to "Fix handling of temporary files and directories in the NVIDIA container runtime hook" Jun 30, 2025
@ArangoGutierrez ArangoGutierrez changed the title from "Fix handling of temporary files and directories in the NVIDIA container runtime hook" to "Fix createParamsFileInContainer func to prevent mount leaks when calling the NVIDIA container runtime hook" Jun 30, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the b/5363680 branch 2 times, most recently from 50ac666 to 91f1a73 Compare June 30, 2025 15:48
@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar July 1, 2025 07:45
@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors how the NVIDIA container runtime hook creates and mounts its params file—switching to procfd-based APIs and secure file creation to prevent mount leaks—and adds end-to-end tests to ensure no host mounts remain after running a container.

  • Switch createParamsFileInContainer to use utils.WithProcfd and a createFileInRoot helper for safer tmpfs and bind mounts.
  • Introduce a secure, mknodat-based file creation function (createFileInRoot).
  • Add E2E tests that record host mounts before/after running containers in both legacy and nvidia runtimes to catch mount leaks.
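
To illustrate only the idea of creating a file relative to a container root, here is a simplified sketch. It is not the PR's implementation: the overview above describes `createFileInRoot` as mknodat-based and combined with `utils.WithProcfd`, whereas this version confines the destination lexically with `path/filepath` and does not protect against symlinks inside the root. The names `joinInRoot` and `createFileInRootSketch` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// joinInRoot treats dest as if it were rooted at root, so that ".."
// components cannot climb above root. This is a purely lexical check and
// does NOT resolve symlinks, which a hardened implementation has to handle.
func joinInRoot(root, dest string) string {
	rooted := filepath.Join(string(filepath.Separator), dest)
	return filepath.Join(root, rooted)
}

// createFileInRootSketch creates an empty file at dest inside root and
// returns the resulting host path. The signature mirrors the one shown in
// the diff, but the body is an assumption made for illustration.
func createFileInRootSketch(root, dest string, mode os.FileMode) (string, error) {
	target := joinInRoot(root, dest)
	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
		return "", fmt.Errorf("failed to create parent directories for %q: %w", target, err)
	}
	f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY, mode)
	if err != nil {
		return "", fmt.Errorf("failed to create %q: %w", target, err)
	}
	return target, f.Close()
}

func main() {
	// A temporary directory stands in for the container root filesystem.
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	path, err := createFileInRootSketch(root, "../../proc/driver/nvidia/params", 0o444)
	if err != nil {
		panic(err)
	}
	fmt.Println("created:", path)
}
```

The point of the sketch is that the destination path is interpreted relative to the container root before anything is created; the symlink-safe handling is what the procfd-based approach in the PR is there for.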

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/e2e/nvidia-container-toolkit_test.go | New “Disabling device node creation” suite: captures host mounts, runs containers, and asserts no new mounts |
| cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go | Refactored createParamsFileInContainer to use procfd mounts and secure file creation, replacing temp dir logic |

Comments suppressed due to low confidence (1)

cmd/nvidia-cdi-hook/disable-device-node-modification/params_linux.go:44

  • [nitpick] The error message could include the target path (e.g., hookScratchDirPath) to make debugging mount failures clearer.
		return fmt.Errorf("failed to create tmpfs mount for params file: %w", err)
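
As a rough illustration of that nitpick, the wrapped error could carry the mount target. The helper `wrapMountError` and the sentinel `errMountFailed` below are hypothetical stand-ins; in the hook the path would presumably be something like `hookScratchDirPath`, as the comment suggests.

```go
package main

import (
	"errors"
	"fmt"
)

// errMountFailed stands in for whatever error the mount call returns.
var errMountFailed = errors.New("operation not permitted")

// wrapMountError adds the mount target to the message while keeping the
// original error in the chain via %w, so errors.Is and errors.As still work.
func wrapMountError(target string, err error) error {
	return fmt.Errorf("failed to create tmpfs mount for params file at %q: %w", target, err)
}

func main() {
	err := wrapMountError("/path/to/scratch", errMountFailed)
	fmt.Println(err)
	fmt.Println("cause preserved:", errors.Is(err, errMountFailed))
}
```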

@ArangoGutierrez ArangoGutierrez changed the title from "Fix createParamsFileInContainer func to prevent mount leaks when calling the NVIDIA container runtime hook" to "Fix leaking of tmpfs mount in CDI mode" Jul 1, 2025
This change adds e2e tests to ensure that running a container with
shared mount propagation does not result in leaked mounts.

Note that as soon as a single bind mount is requested with shared propagation,
the rootfs is also mounted with shared propagation.
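
One way to observe this from inside a mount namespace is to look for `shared:N` tags in the optional fields of `/proc/self/mountinfo`. The sketch below is not part of the change; the function name `sharedMountPoints` and the sample data are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sharedMountPoints returns the mount points whose mountinfo entry carries a
// "shared:N" tag. In mountinfo, fields 1-6 are fixed (mount ID, parent ID,
// major:minor, root, mount point, mount options); zero or more optional
// fields follow and are terminated by a single "-".
func sharedMountPoints(mountinfo string) []string {
	var shared []string
	for _, line := range strings.Split(mountinfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 7 {
			continue // blank or malformed line
		}
		for _, f := range fields[6:] {
			if f == "-" {
				break // end of optional fields
			}
			if strings.HasPrefix(f, "shared:") {
				shared = append(shared, fields[4])
				break
			}
		}
	}
	return shared
}

func main() {
	// Sample data in mountinfo format; on a real system this would be the
	// contents of /proc/self/mountinfo.
	sample := "36 35 98:0 / / rw,relatime shared:1 - ext4 /dev/sda1 rw\n" +
		"37 36 0:5 / /proc rw,nosuid - proc proc rw\n" +
		"38 36 0:40 / /empty rw shared:2 master:3 - tmpfs tmpfs rw\n"

	for _, mp := range sharedMountPoints(sample) {
		fmt.Println("shared propagation:", mp)
	}
}
```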

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar added this to the v1.18.0 milestone Jul 2, 2025
This change ensures that the tmpfs mount created for the modified
NVIDIA params file does not leak to the host.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar merged commit d6326e7 into NVIDIA:main Jul 3, 2025
16 checks passed
Labels

bug Issue/PR to expose/discuss/fix a bug

3 participants