cuML fails to import with undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11 #1258

@bdice

Description

🐛 Bug

RAPIDS cuML cannot be imported in v132 Docker images. It raises an error on import:

ImportError: /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../.././libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

My best guess is that this is an issue with mixing conda and system CUDA Toolkits. This container has three different CTKs installed:

  • The system version in /usr/local/cuda is CUDA 11.3
    • Provides /usr/local/cuda/lib64/libcublas.so.11.5.1.109 (and similar for libcublasLt)
  • The conda environment has cudatoolkit 11.7.0
    • Provides /opt/conda/lib/libcublas.so.11.10.1.25 (and similar for libcublasLt)
  • The conda environment has libcublas / libcublas-dev from CUDA 11.8.0
    • Provides /opt/conda/lib/libcublas.so.11.11.3.6 (and similar for libcublasLt)
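The three copies can be enumerated in one step. This is a small sketch; the glob patterns assume the two install locations listed above and will need adjusting for images with a different layout.

```python
import glob

# Glob patterns for the install locations described above; adjust for
# images with a different layout.
PATTERNS = (
    "/usr/local/cuda/lib64/libcublas*.so.*",
    "/opt/conda/lib/libcublas*.so.*",
)

copies = sorted(p for pat in PATTERNS for p in glob.glob(pat))
print("\n".join(copies) if copies else "no libcublas copies found")
```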

The function cublasLtGetStatusString was added in CUDA 11.4.2. https://docs.nvidia.com/cuda/archive/11.4.2/cuda-toolkit-release-notes/index.html#cublas-11.4.2

I suspect this is somehow getting the system CUDA Toolkit version 11.3 when we need something that is 11.4.2 or newer, perhaps from the conda environment.
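One way to test that suspicion is to check which copy the failing process actually mapped. A diagnostic sketch (Linux-only, since it reads /proc/self/maps) to run in the same interpreter right after the failed import:

```python
import re

def loaded_libraries(pattern):
    """Return paths of shared objects currently mapped into this process
    whose path matches `pattern` (a regular expression)."""
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            # File-backed mappings carry an absolute path in the last field.
            if len(fields) >= 6 and fields[-1].startswith("/"):
                if re.search(pattern, fields[-1]):
                    paths.add(fields[-1])
    return sorted(paths)

# After attempting `import cuml`, this shows whether the /usr/local/cuda
# (11.3) copy or an /opt/conda copy of libcublas was actually bound.
print(loaded_libraries(r"libcublas"))
```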

To Reproduce

docker run -it gcr.io/kaggle-gpu-images/python:v132 python -c "import cuml"

Image v122 does not show this problem. I can try to get a more detailed diagnosis and attempt to find when this was broken. The images are large so it takes some time to download and test them.

Workaround: @cdeotte found that running import cudf before import cuml fixes the problem in v132.
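A guess at why that workaround works: importing cudf may happen to load the newer conda libraries first, so their symbols are already visible when cuml's libcublas.so.11 is resolved. If so, the same effect can be sketched with an explicit preload. This is hypothetical, and the path assumes the conda layout described above:

```python
import ctypes
import os

def preload(path):
    """Load a shared library with RTLD_GLOBAL so that its symbols are
    visible to libraries the dynamic loader resolves later."""
    if not os.path.exists(path):
        return None  # library not present; nothing to preload
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

# Hypothetical: force the newer conda copy of libcublasLt (which defines
# cublasLtGetStatusString) into the process before importing cuml.
preload("/opt/conda/lib/libcublasLt.so.11")
# import cuml  # should now resolve the symbol, if the guess is right
```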

Expected behavior

cuML imports successfully.

Additional context

There has been a similar issue raised on this repo before, but with less detail: #1224 (comment)

I looked at the library dependencies and unresolved symbols in RAPIDS (cuml and raft) but I don't see any direct references to cublasLtGetStatusString, which makes me think it's failing to resolve a shared library needed within the CUDA Toolkit itself. There is a reference to that symbol in libcublas, which probably expects to load the symbol from libcublasLt. I'm not sure why importing cudf first fixes this issue, since cudf doesn't use libcublas.
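If that reading is right, each installed copy of libcublasLt can be probed for the symbol directly. A small ctypes sketch (the paths are the copies listed above; dlsym via attribute access stands in for `nm -D`, so it shows definitions but not undefined references):

```python
import ctypes
import os

def defines(lib_path, symbol):
    """Return True if `lib_path` exports `symbol`, False if it does not,
    or None if the library is absent or unloadable."""
    if not os.path.exists(lib_path):
        return None
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    # Attribute access performs dlsym; a missing symbol raises AttributeError.
    return hasattr(lib, symbol)

# Expectation, if the diagnosis holds: the 11.3 system copy lacks the
# symbol, while the newer conda copies define it.
for path in ("/usr/local/cuda/lib64/libcublasLt.so.11",
             "/opt/conda/lib/libcublasLt.so.11"):
    print(path, defines(path, "cublasLtGetStatusString"))
```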

I was able to successfully use the commands below to install and import cuml with CUDA 11.3.0, so I know the problem is not reproducible with cuml alone on a CUDA Toolkit older than 11.4.2 (where cublasLtGetStatusString was introduced). It must have something to do with the three separate CUDA Toolkits in the Kaggle image.

nvidia-docker run -it nvidia/cuda:11.3.0-devel-ubi8
# then in the container...
yum install python3.9
pip3.9 install 'numba<0.57' cudf-cu11 cuml-cu11 --extra-index-url=https://pypi.nvidia.com/
python3.9 -c "import cuml"

I don't think there's much that can be fixed on the cuML side. A fix would probably require some consolidation among the image's three separate CUDA Toolkits. There is a similar issue reported here, where a user is combining TensorFlow and PyTorch without RAPIDS, but probably with multiple CTKs involved: pytorch/pytorch#88882

Full traceback here:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 import cuml

File /opt/conda/lib/python3.10/site-packages/cuml/__init__.py:17
      1 #
      2 # Copyright (c) 2022-2023, NVIDIA CORPORATION.
      3 #
   (...)
     14 # limitations under the License.
     15 #
---> 17 from cuml.internals.base import Base, UniversalBase
     19 # GPU only packages
     21 import cuml.common.cuda as cuda

File /opt/conda/lib/python3.10/site-packages/cuml/internals/__init__.py:17
      1 #
      2 # Copyright (c) 2019-2023, NVIDIA CORPORATION.
      3 #
   (...)
     14 # limitations under the License.
     15 #
---> 17 from cuml.internals.base_helpers import BaseMetaClass, _tags_class_and_instance
     18 from cuml.internals.api_decorators import (
     19     _deprecate_pos_args,
     20     api_base_fit_transform,
   (...)
     32     exit_internal_api,
     33 )
     34 from cuml.internals.api_context_managers import (
     35     in_internal_api,
     36     set_api_output_dtype,
     37     set_api_output_type,
     38 )

File /opt/conda/lib/python3.10/site-packages/cuml/internals/base_helpers.py:20
     17 from inspect import Parameter, signature
     18 import typing
---> 20 from cuml.internals.api_decorators import (
     21     api_base_return_generic,
     22     api_base_return_array,
     23     api_base_return_sparse_array,
     24     api_base_return_any,
     25     api_return_any,
     26     _deprecate_pos_args,
     27 )
     28 from cuml.internals.array import CumlArray
     29 from cuml.internals.array_sparse import SparseCumlArray

File /opt/conda/lib/python3.10/site-packages/cuml/internals/api_decorators.py:24
     21 import warnings
     23 # TODO: Try to resolve circular import that makes this necessary:
---> 24 from cuml.internals import input_utils as iu
     25 from cuml.internals.api_context_managers import BaseReturnAnyCM
     26 from cuml.internals.api_context_managers import BaseReturnArrayCM

File /opt/conda/lib/python3.10/site-packages/cuml/internals/input_utils.py:19
      1 #
      2 # Copyright (c) 2019-2023, NVIDIA CORPORATION.
      3 #
   (...)
     14 # limitations under the License.
     15 #
     17 from collections import namedtuple
---> 19 from cuml.internals.array import CumlArray
     20 from cuml.internals.array_sparse import SparseCumlArray
     21 from cuml.internals.global_settings import GlobalSettings

File /opt/conda/lib/python3.10/site-packages/cuml/internals/array.py:22
     19 import operator
     20 import pickle
---> 22 from cuml.internals.global_settings import GlobalSettings
     23 from cuml.internals.logger import debug
     24 from cuml.internals.mem_type import MemoryType, MemoryTypeError

File /opt/conda/lib/python3.10/site-packages/cuml/internals/global_settings.py:19
     17 import os
     18 import threading
---> 19 from cuml.internals.available_devices import is_cuda_available
     20 from cuml.internals.device_type import DeviceType
     21 from cuml.internals.logger import warn

File /opt/conda/lib/python3.10/site-packages/cuml/internals/available_devices.py:17
      1 #
      2 # Copyright (c) 2022-2023, NVIDIA CORPORATION.
      3 #
   (...)
     14 # limitations under the License.
     15 #
     16 from cuml.internals.device_support import GPU_ENABLED
---> 17 from cuml.internals.safe_imports import gpu_only_import_from, UnavailableError
     19 try:
     20     from functools import cache  # requires Python >= 3.9

File /opt/conda/lib/python3.10/site-packages/cuml/internals/safe_imports.py:21
     19 import traceback
     20 from cuml.internals.device_support import CPU_ENABLED, GPU_ENABLED
---> 21 from cuml.internals import logger
     24 class UnavailableError(Exception):
     25     """Error thrown if a symbol is unavailable due to an issue importing it"""

ImportError: /opt/conda/lib/python3.10/site-packages/cuml/internals/../../../.././libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11

Metadata

Assignees: no one assigned
Labels: bug & failures with existing packages, help wanted
