Description
🐛 Bug
RAPIDS cuML cannot be imported in v132 Docker images. It raises an error on import:
ImportError: /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../.././libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11
My best guess is that this comes from mixing conda and system CUDA Toolkits (CTKs). This container has three different CTKs installed:
- The system version in /usr/local/cuda is CUDA 11.3
  - Provides /usr/local/cuda/lib64/libcublas.so.11.5.1.109 (and similar for libcublasLt)
- The conda environment has cudatoolkit 11.7.0
  - Provides /opt/conda/lib/libcublas.so.11.10.1.25 (and similar for libcublasLt)
- The conda environment has libcublas / libcublas-dev from CUDA 11.8.0
  - Provides /opt/conda/lib/libcublas.so.11.11.3.6 (and similar for libcublasLt)
The function cublasLtGetStatusString was added in CUDA 11.4.2. https://docs.nvidia.com/cuda/archive/11.4.2/cuda-toolkit-release-notes/index.html#cublas-11.4.2
I suspect the import is somehow picking up a library from the system CUDA Toolkit (11.3) when it needs one from 11.4.2 or newer, such as one of the copies in the conda environment.
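One rough way to check this suspicion is to list every libcublasLt file in the two locations named above and see which ones mention the missing symbol at all. The sketch below is only a heuristic: it searches the raw bytes of each file for the symbol name rather than parsing the ELF symbol table, and the helper names find_libraries and mentions_symbol are made up for this example.

```python
from pathlib import Path

SYMBOL = b"cublasLtGetStatusString"


def find_libraries(directories, pattern="libcublasLt.so*"):
    """Return the sorted, de-duplicated real paths of matching library files."""
    found = set()
    for directory in directories:
        base = Path(directory)
        if not base.is_dir():
            continue  # skip locations that do not exist on this machine
        for candidate in base.glob(pattern):
            if candidate.is_file():
                # resolve() collapses the usual .so -> .so.11 -> .so.11.x symlinks
                found.add(candidate.resolve())
    return sorted(found)


def mentions_symbol(path, symbol=SYMBOL, chunk_size=1 << 20):
    """Heuristic: True if the symbol name occurs anywhere in the file's bytes."""
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if symbol in window:
                return True
            # keep enough bytes to catch a match that straddles two chunks
            tail = window[-(len(symbol) - 1):]


if __name__ == "__main__":
    # Directories taken from the list of CUDA Toolkits in this report.
    for library in find_libraries(["/usr/local/cuda/lib64", "/opt/conda/lib"]):
        print(library, "mentions symbol:", mentions_symbol(library))
```

If the guess is right, the copy under /usr/local/cuda/lib64 would be expected to report False and the conda copies True, but I have not confirmed that output on the v132 image.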
To Reproduce
docker run -it gcr.io/kaggle-gpu-images/python:v132 python -c "import cuml"
Image v122 does not show this problem. I can try to get a more detailed diagnosis and attempt to find when this was broken. The images are large so it takes some time to download and test them.
Workaround: @cdeotte found that running import cudf before import cuml fixes the problem in v132.
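For anyone applying the workaround in a notebook, a minimal sketch of the import order is below. The helper import_in_order is a hypothetical name for this example; it just imports modules one at a time and records any ImportError so the failing module is easy to spot.

```python
import importlib


def import_in_order(module_names):
    """Import each module in order; map name -> None on success or the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as error:
            results[name] = str(error)
    return results


if __name__ == "__main__":
    # cudf first, then cuml, following the workaround described above.
    for name, error in import_in_order(["cudf", "cuml"]).items():
        print(name, "ok" if error is None else f"failed: {error}")
```

This does not explain why the order matters. It only makes the order explicit and reports which import failed.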
Expected behavior
cuML imports successfully.
Additional context
There has been a similar issue raised on this repo before, but with less detail: #1224 (comment)
I looked at the library dependencies and unresolved symbols in RAPIDS (cuml and raft) but I don't see any direct references to cublasLtGetStatusString, which makes me think it's failing to resolve a shared library needed within the CUDA Toolkit itself. There is a reference to that symbol in libcublas, which probably expects to load the symbol from libcublasLt. I'm not sure why importing cudf first fixes this issue, since cudf doesn't use libcublas.
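To see which copy of a library a running Python process has actually mapped, one option on Linux is to read /proc/self/maps after the imports that succeed (for example after import cudf). The sketch below assumes the usual six-column maps format, and loaded_library_paths and read_own_maps are names invented for this example. Comparing its output with and without the cudf import might show whether a different libcublasLt ends up loaded.

```python
def loaded_library_paths(maps_text, name_fragment):
    """Return the distinct mapped file paths that contain name_fragment, in order."""
    paths = []
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (pathname is optional)
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if name_fragment in path and path not in paths:
            paths.append(path)
    return paths


def read_own_maps():
    """Return this process's memory map, or an empty string if it is unavailable."""
    try:
        with open("/proc/self/maps") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    # Run after the imports under investigation, e.g. after `import cudf`.
    for path in loaded_library_paths(read_own_maps(), "libcublas"):
        print(path)
```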
I was able to install and import cuml with CUDA 11.3.0 using the commands below, so cuML alone with a CUDA Toolkit older than 11.4.2 (when cublasLtGetStatusString was introduced) does not reproduce the error. It must have something to do with the three separate CUDA Toolkits in the Kaggle image.
nvidia-docker run -it nvidia/cuda:11.3.0-devel-ubi8
# then in the container...
yum install python3.9
pip3.9 install 'numba<0.57' cudf-cu11 cuml-cu11 --extra-index-url=https://pypi.nvidia.com/
python3.9 -c "import cuml"
I don't think there's much that can be fixed on the cuML side. A fix would probably require some consolidation among the image's three separate CUDA Toolkits. A similar issue is reported in pytorch/pytorch#88882, where a user is combining TensorFlow and PyTorch without RAPIDS, but probably with multiple CTKs involved.
Full traceback:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import cuml
File /opt/conda/lib/python3.10/site-packages/cuml/__init__.py:17
1 #
2 # Copyright (c) 2022-2023, NVIDIA CORPORATION.
3 #
(...)
14 # limitations under the License.
15 #
---> 17 from cuml.internals.base import Base, UniversalBase
19 # GPU only packages
21 import cuml.common.cuda as cuda
File /opt/conda/lib/python3.10/site-packages/cuml/internals/__init__.py:17
1 #
2 # Copyright (c) 2019-2023, NVIDIA CORPORATION.
3 #
(...)
14 # limitations under the License.
15 #
---> 17 from cuml.internals.base_helpers import BaseMetaClass, _tags_class_and_instance
18 from cuml.internals.api_decorators import (
19 _deprecate_pos_args,
20 api_base_fit_transform,
(...)
32 exit_internal_api,
33 )
34 from cuml.internals.api_context_managers import (
35 in_internal_api,
36 set_api_output_dtype,
37 set_api_output_type,
38 )
File /opt/conda/lib/python3.10/site-packages/cuml/internals/base_helpers.py:20
17 from inspect import Parameter, signature
18 import typing
---> 20 from cuml.internals.api_decorators import (
21 api_base_return_generic,
22 api_base_return_array,
23 api_base_return_sparse_array,
24 api_base_return_any,
25 api_return_any,
26 _deprecate_pos_args,
27 )
28 from cuml.internals.array import CumlArray
29 from cuml.internals.array_sparse import SparseCumlArray
File /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py:24
21 import warnings
23 # TODO: Try to resolve circular import that makes this necessary:
---> 24 from cuml.internals import input_utils as iu
25 from cuml.internals.api_context_managers import BaseReturnAnyCM
26 from cuml.internals.api_context_managers import BaseReturnArrayCM
File /opt/conda/lib/python3.10/site-packages/cuml/internals/input_utils.py:19
1 #
2 # Copyright (c) 2019-2023, NVIDIA CORPORATION.
3 #
(...)
14 # limitations under the License.
15 #
17 from collections import namedtuple
---> 19 from cuml.internals.array import CumlArray
20 from cuml.internals.array_sparse import SparseCumlArray
21 from cuml.internals.global_settings import GlobalSettings
File /opt/conda/lib/python3.10/site-packages/cuml/internals/array.py:22
19 import operator
20 import pickle
---> 22 from cuml.internals.global_settings import GlobalSettings
23 from cuml.internals.logger import debug
24 from cuml.internals.mem_type import MemoryType, MemoryTypeError
File /opt/conda/lib/python3.10/site-packages/cuml/internals/global_settings.py:19
17 import os
18 import threading
---> 19 from cuml.internals.available_devices import is_cuda_available
20 from cuml.internals.device_type import DeviceType
21 from cuml.internals.logger import warn
File /opt/conda/lib/python3.10/site-packages/cuml/internals/available_devices.py:17
1 #
2 # Copyright (c) 2022-2023, NVIDIA CORPORATION.
3 #
(...)
14 # limitations under the License.
15 #
16 from cuml.internals.device_support import GPU_ENABLED
---> 17 from cuml.internals.safe_imports import gpu_only_import_from, UnavailableError
19 try:
20 from functools import cache # requires Python >= 3.9
File /opt/conda/lib/python3.10/site-packages/cuml/internals/safe_imports.py:21
19 import traceback
20 from cuml.internals.device_support import CPU_ENABLED, GPU_ENABLED
---> 21 from cuml.internals import logger
24 class UnavailableError(Exception):
25 """Error thrown if a symbol is unavailable due to an issue importing it"""
ImportError: /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../.././libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11