Allow Nvidia driver script to set recommendations for LD_PRELOAD#754
Conversation
|
Instance
|
|
Instance
|
|
Instance
|
|
Example output: [rocky@ip-172-31-27-81 software-layer]$ ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --ld-preload --no-download
Found NVIDIA GPU driver version 545.23.08
Found host CUDA version 12.3
Using default list of libraries
Matched 48 CUDA Libraries
When attempting to use LD_PRELOAD we exclude anything related to graphics
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGL.so.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX_nvidia.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.0.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.
libwayland-server.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-egl-wayland.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.1.
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.
libXNVCtrl.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08.
The recommended way to use LD_PRELOAD is to only use it when you need to:
export EESSI_GPU_LD_PRELOAD="/lib64/libcuda.so.1:/lib64/libcuda.so:/lib64/libcudadebugger.so.1:/lib64/libnvcuvid.so.1:/lib64/libnvcuvid.so:/lib64/libnvidia-cfg.so.1:/lib64/libnvidia-cfg.so:/lib64/libnvidia-eglcore.so.545.23.08:/lib64/libnvidia-encode.so.1:/lib64/libnvidia-encode.so:/lib64/libnvidia-glcore.so.545.23.08:/lib64/libnvidia-glsi.so.545.23.08:/lib64/libnvidia-glvkspirv.so.545.23.08:/lib64/libnvidia-gpucomp.so.545.23.08:/lib64/libnvidia-ml.so.1:/lib64/libnvidia-ml.so:/lib64/libnvidia-nvvm.so.4:/lib64/libnvidia-nvvm.so:/lib64/libnvidia-opencl.so.1:/lib64/libnvidia-opticalflow.so.1:/lib64/libnvidia-ptxjitcompiler.so.1:/lib64/libnvidia-ptxjitcompiler.so:/lib64/libnvidia-rtcore.so.545.23.08:/lib64/libnvidia-tls.so.545.23.08:/lib64/libnvoptix.so.1:/lib64/libOpenCL.so.1"
export EESSI_OVERRIDE_GPU_CHECK="1"
Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
LD_PRELOAD="$EESSI_GPU_LD_PRELOAD" device_query |
|
@ocaisa There's duplicate entries here, |
| # Filter out all symlinks and libraries that have missing library dependencies under EESSI | ||
| filtered_libraries=() | ||
| for library in "${matched_libraries[@]}"; do | ||
| if [ ! -L "$library" ]; then |
There was a problem hiding this comment.
This is too aggressive, instead we should just resolve the symlink and remove duplicate entries
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
|
New job on instance
|
LD_PRELOAD
|
Also tested the script within |
Accepted all except one Co-authored-by: TopRichard <121792457+TopRichard@users.noreply.github.com>
|
@TopRichard This will need to be re-tested now to make sure the changes haven't had an unintended impact |
|
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
|
New job on instance
|
re-testing: Apptainer> /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
Found host CUDA version 9.0
Found NVIDIA GPU driver version 535.129.03
Using downloaded list of libraries
Matched 41 CUDA Libraries
Successfully created symlink between latest and host in /cvmfs/software.eessi.io/host_injections/nvidia/aarch64
Successfully created symlink between /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/latest and lib in /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/aarch64
Host NVIDIA GPU drivers linked successfully for EESSI |
boegel
left a comment
There was a problem hiding this comment.
I know this has just been deployed, so this review comes way too late, but I had a bunch of draft remarks that I somehow never submitted.
Take it as input for a follow-up PR...
|
|
||
| # This script links host libraries related to GPU drivers to a location where | ||
| # they can be found by the EESSI linker | ||
| # they can be found by the EESSI linker (or sets LD_PRELOAD as an |
There was a problem hiding this comment.
It doesn't set $LD_PRELOAD, it prints how $EESSI_*LD_PRELOAD can be set
|
|
||
| get_nvlib_list() { | ||
| local nvliblist_url="https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf" | ||
| local default_nvlib_list=( |
There was a problem hiding this comment.
Can we add a comment here mentioning how this list was compiled, and with which driver version this current list corresponds?
There was a problem hiding this comment.
The initial list basically comes from the location above, it's intended as a fall back
| "tls_test_.so" | ||
| ) | ||
|
|
||
| # Check if the function was called with the "default" argument |
There was a problem hiding this comment.
nit-picking: please fix indent of the comment
|
|
||
| # Check if curl failed (i.e., the content is empty) | ||
| if [ -z "$nvliblist_content" ]; then | ||
| # Failed to download nvliblist.conf, using default list instead |
There was a problem hiding this comment.
Should we print a warning here (via echo_yellow) that the download failed, and that we're doing a fallback?
|
|
||
| # Check if umask allows global read | ||
| if [ "$umask_octal" -gt 022 ]; then | ||
| fatal_error "The current umask ($current_umask) does not allow global read permissions, you'll want everyone to be able to read the created directory." |
There was a problem hiding this comment.
Make a suggestion on how to fix this in the error message?
Also, can't we set $UMASK here to dance around this ourselves?
There was a problem hiding this comment.
Reluctant to silently override their $UMASK, but an option could be supported to allow this
| filtered_libraries=() | ||
| compat_filtered_libraries=() | ||
| for library in "${matched_libraries[@]}"; do | ||
| # Run ldd on the given binary and filter for "not found" libraries |
| compat_filtered_libraries=() | ||
| for library in "${matched_libraries[@]}"; do | ||
| # Run ldd on the given binary and filter for "not found" libraries | ||
| not_found_libs=$(ldd "$library" 2>/dev/null | grep "not found" | awk '{print $1}') |
There was a problem hiding this comment.
We should avoid relying on awk, it may not be there, cut is way more likely to be there
| # Find the host ldconfig | ||
| host_ldconfig=$(get_host_ldconfig) | ||
| # Gather libraries on the host (_must_ be host ldconfig) | ||
| host_libraries=$($host_ldconfig -p | awk '{print $NF}') |
There was a problem hiding this comment.
If we can, we should avoid relying on awk, it may not be installed, use cut instead?
| # Check if it is missing an so dep under EESSI | ||
| if [[ -z "$not_found_libs" ]]; then | ||
| # Resolve any symlink | ||
| realpath_library=$(realpath "$library") |
There was a problem hiding this comment.
There's a small chance that realpath is not available, so perhaps we should check for this early and exit with an error if it's not available
| # Resolve any symlink | ||
| realpath_library=$(realpath "$library") | ||
| if [[ ! " ${filtered_libraries[@]} " =~ " $realpath_library " ]]; then | ||
| filtered_libraries+=("$realpath_library") |
There was a problem hiding this comment.
line above assumes there's spaces?
| filtered_libraries+=("$realpath_library") | |
| filtered_libraries+=(" ${realpath_library} ") |
|
@bedroge This was deployed, so PR should be merged too? |
Yes, the tarball has been ingested. |
This has already been deployed, requested changes can be done in a follow-up PR.
|
PR merged! Moved |
|
PR merged! Moved |
No description provided.