[gpu] Add support for secure boot with examples #83

Merged
merged 15 commits into GoogleCloudDataproc:master
Jun 27, 2024

Conversation

cjac
Contributor

@cjac cjac commented Jun 17, 2024

examples/secure-boot/install-nvidia-driver-debian12.sh

This file includes instructions to install NVIDIA drivers the Debian way on Bookworm and presumably future releases.

examples/secure-boot/install-nvidia-driver-debian11.sh

This file includes instructions to install NVIDIA drivers the NVIDIA way on Bullseye, Buster and presumably previous releases.

examples/secure-boot/create-key-pair.sh

This file includes instructions to create a key pair and to make it available via Google Cloud Secret Manager.
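For illustration, here is a rough sketch of what generating such a key pair and its modulus md5sum could look like with `openssl`. It is not the contents of create-key-pair.sh: the function names, the `db.rsa` private key file name, the key size, and the validity period are assumptions, and the upload to Secret Manager is left out.

```shell
# Sketch only; not the actual create-key-pair.sh.
# Prints the md5sum of the modulus of a DER-format certificate.
# $1: path to the certificate
modulus_md5sum() {
  openssl x509 -inform DER -in "$1" -noout -modulus | openssl md5 | awk '{print $NF}'
}

# Creates a private key (db.rsa, hypothetical name) and a self-signed
# DER-format certificate (db.der), then prints the modulus md5sum.
# $1: output directory, $2: certificate common name
create_key_pair() {
  local out_dir="$1"
  local common_name="$2"
  mkdir -p "${out_dir}"
  # 2048-bit RSA and a 10 year validity are assumptions, not project policy.
  openssl req -new -x509 -newkey rsa:2048 -nodes -sha256 -days 3650 \
    -subj "/CN=${common_name}" \
    -keyout "${out_dir}/db.rsa" \
    -outform DER -out "${out_dir}/db.der"
  modulus_md5sum "${out_dir}/db.der"
}

# Example (hypothetical values):
#   create_key_pair tls "Example Custom Image CA"
```

The private key and the certificate share one modulus, so the same md5sum can later be compared against whichever of the two is at hand.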

examples/secure-boot/README.md

This file includes instructions to execute the generate_custom_image.py script with parameters necessary to create a custom image for use with secure-boot. Either of the above installer scripts can be used to exercise the process.

custom_image_utils/shell_script_generator.py
  • updated the copyright date
  • when --trusted-cert path/to/cert.der is supplied, inject that cert.der and the MS UEFI CA 2011 certificate into the disk image's EFI signature database
custom_image_utils/args_parser.py
  • increased default disk image size from 20G to 30G
  • added new argument, --trusted-cert:
    (Optional) Inserts the specified DER-format certificate into the custom image's EFI boot sector for use with secure boot.
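As a sketch of what supplying --trusted-cert leads to, the function below prints (it does not run) an image-creation command of the same shape as the one logged later in this conversation. The function name and its arguments are hypothetical, and deriving the image name by appending `-with-certs` is an assumption based on that log line rather than on the generator's code.

```shell
# Sketch only: prints the gcloud command instead of executing it.
# $1: full resource path of the source image
# $2: DER-format certificate given via --trusted-cert
# $3: Microsoft UEFI CA 2011 certificate file
build_trusted_image_cmd() {
  local source_image="$1"
  local trusted_cert="$2"
  local ms_uefi_ca="$3"
  # Assumption: the new image is named after the source image plus "-with-certs".
  local dest_image="${source_image##*/}-with-certs"
  printf '%s\n' "gcloud compute images create ${dest_image} --source-image ${source_image} --signature-database-file=${trusted_cert},${ms_uefi_ca} --guest-os-features=UEFI_COMPATIBLE"
}

# Example:
#   build_trusted_image_cmd \
#     projects/cloud-dataproc/global/images/dataproc-2-1-deb11-20240609-165100-rc01 \
#     tls/db.der tls/MicCorUEFCA2011_2011-06-27.crt
```

Printing the command first makes it easy to review the certificate list before any image is created.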
Dockerfile

Since this package depends on Python 2.7 and since I don't have time to help it move into the next decade, I've created a Dockerfile that can be used to build an image from which the script can be executed.

README.md

Documented the --trusted-cert argument

instructions on how to insert a trust database into gce disk image for
use with secure boot

forthcoming is a python script to install the kernel modules on debian12 instance
@cjac cjac self-assigned this Jun 17, 2024
@cjac cjac requested a review from kuldeepkk-dev June 17, 2024 22:48
@cjac
Contributor Author

cjac commented Jun 20, 2024

The examples/secure-boot/install-nvidia-driver-debian*.sh scripts fail for me because my default network isn't configured for Private Google Access. Other than that, this looks like it's working. The *-with-certs disk image is created; I have not yet checked the kernel log to see whether the modulus md5sum the kernel reads matches the expected one.

I have not attempted to create a dataproc cluster using any images I've generated yet. That's next on my list.

cjac added 2 commits June 20, 2024 13:23
create-key-pair.sh now emits on stdout the variables needed to create a disk image and sign kernel drivers

brought the driver installer scripts closer to parity
* examples/secure-boot/README.md
included example of how to grant secretAccessor role
moved parameters that require defaults to the top
removed noise and sleeps
removed recommendation of using shutdown-instance-timer-sec
added recommendation to use --disk-size 50

* examples/secure-boot/create-key-pair.sh
collecting the modulus md5sum so that it can be passed in metadata

* examples/secure-boot/install-nvidia-driver-debian11.sh
execute script in /opt/install-nvidia-driver
update package cache before installing from it
clean up after downloaded packages
remove excess packages
remove driver.run and cuda.run after installation
redirect make log to /var/log/

* examples/secure-boot/install-nvidia-driver-debian12.sh
also execute script in /opt/install-nvidia-driver
clean up after package installation
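The secretAccessor grant mentioned in the README notes above could look roughly like the following. This is a sketch, not the README's text: the function name is hypothetical, both arguments are placeholders, and the function prints the command rather than running it so it can be reviewed first.

```shell
# Sketch only: prints the command that would let a service account read one
# Secret Manager secret.
# $1: secret name (placeholder)
# $2: service account email (placeholder)
secret_accessor_cmd() {
  local secret_name="$1"
  local service_account="$2"
  printf '%s\n' "gcloud secrets add-iam-policy-binding ${secret_name} --member=serviceAccount:${service_account} --role=roles/secretmanager.secretAccessor"
}

# Example (hypothetical values):
#   secret_accessor_cmd my-signing-key builder@my-project.iam.gserviceaccount.com
```

Binding the role on the individual secret, rather than on the whole project, keeps the grant limited to the key material the installer actually needs.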
@cjac
Contributor Author

cjac commented Jun 21, 2024

I do see in the kernel logs that the certificate is trusted:

2024-06-21T00:27:20.415080+00:00 localhost kernel: [    0.808908] integrity: Loading X.509 certificate: UEFI:db
2024-06-21T00:27:20.415081+00:00 localhost kernel: [    0.810005] integrity: Loaded X.509 cert 'Cloud Dataproc Custom Image CA 005: 4029b4e4991c2967c2f6fcddd6dd93b557b6bcc4'

@cjac
Contributor Author

cjac commented Jun 21, 2024

The kernel drivers came out unsigned when I booted the cluster. I'm running a build of a new image. I'll try booting it shortly.

@cjac
Contributor Author

cjac commented Jun 21, 2024

The new cluster nodes built with the custom bookworm image come up with a working nvidia-smi

cjac@cluster-1718310842-w-0:~$ mokutil --sb-state ; nvidia-smi
SecureBoot enabled
Fri Jun 21 03:30:02 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   53C    P0    30W /  72W |      0MiB / 23034MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@cjac
Contributor Author

cjac commented Jun 21, 2024

The Buster image looks good, too:

cjac@cluster-1718310842-w-0:~$ lsb_release -a ; nvidia-smi
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:        10
Codename:       buster
Fri Jun 21 04:18:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   59C    P0             32W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@cjac
Contributor Author

cjac commented Jun 21, 2024

hmmm.. the 2.1 image is having no luck:

cjac@cluster-1718310842-w-0:~$ lsb_release -a ; nvidia-smi
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

cjac@cluster-1718310842-w-0:~$ sudo dmesg | grep -i x.509
[    1.019869] Loading compiled-in X.509 certificates
[    1.051966] Loaded X.509 cert 'Debian Secure Boot CA: 6ccece7e4c6c0d1f6149f3dd27dfcc5cbb419ea1'
[    1.053290] Loaded X.509 cert 'Debian Secure Boot Signer 2022 - linux: 14011249c2675ea8e5148542202005810584b25f'
cjac@cluster-1718310842-w-0:~$ sudo modinfo nvidia | grep -i sig
sig_id:         PKCS#7
signer:         Cloud Dataproc Custom Image CA 005
sig_key:        0C:5A:46:71:8B:0B:99:76:B6:8A:24:0D:15:D9:BA:44:51:AC:97:BC
sig_hashalgo:   sha256
signature:      3E:5E:A2:2B:03:95:69:5D:41:88:A4:28:1F:59:7E:03:4C:21:45:28:

It seems somehow that the Cloud Dataproc Custom Image CA 005 cert isn't in the boot sector.
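The comparison made by eye above (module signer versus certificates the kernel loaded) can be sketched as a small check. The function name and the use of saved output files are assumptions for illustration; it relies only on the `signer:` and `Loaded X.509 cert '...'` line formats shown in this thread.

```shell
# Sketch only: reports whether the certificate that signed a module appears
# among the certificates the kernel loaded at boot.
# $1: file holding `modinfo <module>` output
# $2: file holding `dmesg` output
module_signer_is_trusted() {
  local signer
  signer="$(sed -n 's/^signer:[[:space:]]*//p' "$1" | head -n 1)"
  if [ -z "${signer}" ]; then
    echo "module is unsigned" >&2
    return 2
  fi
  # Fixed-string match on the kernel's "Loaded X.509 cert '<name>: <id>'" line.
  if grep -F -q "Loaded X.509 cert '${signer}:" "$2"; then
    echo "trusted: ${signer}"
  else
    echo "not trusted: ${signer}"
    return 1
  fi
}

# Example:
#   sudo modinfo nvidia > /tmp/modinfo.txt
#   sudo dmesg > /tmp/dmesg.txt
#   module_signer_is_trusted /tmp/modinfo.txt /tmp/dmesg.txt
```

Run against the bullseye output above, a check like this would take the "not trusted" branch, since only the two Debian certificates appear in the dmesg excerpt.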

@cjac
Contributor Author

cjac commented Jun 21, 2024

I do see in the logs that the base image was created with the --signature-database-file argument passed:

+ gcloud compute images create dataproc-2-1-deb11-20240609-165100-rc01-with-certs --source-image projects/cloud-dataproc/global/images/dataproc-2-1-deb11-20240609-165100-rc01 --signature-database-file=tls/db.der,tls/MicCorUEFCA2011_2011-06-27.crt --guest-os-features=UEFI_COMPATIBLE

@cjac cjac changed the title [gpu] create trusted base image for driver signing [gpu] Add support for secure boot with examples Jun 26, 2024
@cjac
Contributor Author

cjac commented Jun 27, 2024

Opting to commit without approval to match release date of GoogleCloudDataproc/initialization-actions#1190

@cjac cjac merged commit 8d1dd23 into GoogleCloudDataproc:master Jun 27, 2024
1 check passed