Runtime error: CUDA Setup failed despite GPU being available (bitsandbytes) #1280

Open

alistairwgillespie opened this issue Feb 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@alistairwgillespie

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Hi, I'm trying the public cloud example that trains Mistral on AWS, expecting a training run to spin up and complete. Instead, I get the following CUDA error. I've modified the config to use a single spot V100. In my testing, I've tried the latest image versions and both the winglian/axolotl and winglian/axolotl-cloud image sources, which didn't help.

Current behaviour

(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
(axolotl, pid=29373) BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
(axolotl, pid=29373) If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
(axolotl, pid=29373) If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
(axolotl, pid=29373) For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
(axolotl, pid=29373) Loading CUDA version: BNB_CUDA_VERSION=118
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) 
(axolotl, pid=29373) 
(axolotl, pid=29373)   warn((f'\n\n{"="*80}\n'
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run
(axolotl, pid=29373) 
(axolotl, pid=29373) python -m bitsandbytes
(axolotl, pid=29373) 
(axolotl, pid=29373) 
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},but this might missmatch with the CUDA version that is needed for bitsandbytes.To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variableFor example, if you want to use the CUDA version 122BNB_CUDA_VERSION=122 python ...OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!                     If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 187, in _run_module_as_main
(axolotl, pid=29373)     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 110, in _get_module_details
(axolotl, pid=29373)     __import__(pkg_name)
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 24, in <module>
(axolotl, pid=29373)     from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/common/cli.py", line 12, in <module>
(axolotl, pid=29373)     from axolotl.utils.models import load_model, load_tokenizer
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/utils/models.py", line 8, in <module>
(axolotl, pid=29373)     import bitsandbytes as bnb
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
(axolotl, pid=29373)     from . import cuda_setup, utils, research
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
(axolotl, pid=29373)     from . import nn
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
(axolotl, pid=29373)     from .modules import LinearFP8Mixed, LinearFP8Global
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
(axolotl, pid=29373)     from bitsandbytes.optim import GlobalOptimManager
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
(axolotl, pid=29373)     from bitsandbytes.cextension import COMPILED_WITH_CUDA
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
(axolotl, pid=29373)     raise RuntimeError('''
(axolotl, pid=29373) RuntimeError: 
(axolotl, pid=29373)         CUDA Setup failed despite GPU being available. Please run the following command to get more information:
(axolotl, pid=29373) 
(axolotl, pid=29373)         python -m bitsandbytes
(axolotl, pid=29373) 
(axolotl, pid=29373)         Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
(axolotl, pid=29373)         to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
(axolotl, pid=29373)         and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
(axolotl, pid=29373) False
(axolotl, pid=29373) 
(axolotl, pid=29373) ===================================BUG REPORT===================================
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/workspace/data/huggingface-cache/datasets')}
(axolotl, pid=29373) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
(axolotl, pid=29373) DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
(axolotl, pid=29373) CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 7.0.
(axolotl, pid=29373) CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) CUDA SETUP: Required library version not found: libbitsandbytes_cuda118_nocublaslt.so. Maybe you need to compile it from source?
(axolotl, pid=29373) CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
(axolotl, pid=29373) 
(axolotl, pid=29373) ================================================ERROR=====================================
(axolotl, pid=29373) CUDA SETUP: CUDA detection failed! Possible reasons:
(axolotl, pid=29373) 1. You need to manually override the PyTorch CUDA version. Please see: "https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) 2. CUDA driver not installed
(axolotl, pid=29373) 3. CUDA not installed
(axolotl, pid=29373) 4. You have multiple conflicting CUDA libraries
(axolotl, pid=29373) 5. Required library not pre-compiled for this bitsandbytes release!
(axolotl, pid=29373) CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
(axolotl, pid=29373) CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) 
(axolotl, pid=29373) CUDA SETUP: Something unexpected happened. Please compile from source:
(axolotl, pid=29373) git clone https://github.com/TimDettmers/bitsandbytes.git
(axolotl, pid=29373) cd bitsandbytes
(axolotl, pid=29373) CUDA_VERSION=118 make cuda11x_nomatmul
(axolotl, pid=29373) python setup.py install
(axolotl, pid=29373) CUDA SETUP: Setup Failed!
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
(axolotl, pid=29373)     sys.exit(main())
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
(axolotl, pid=29373)     args.func(args)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
(axolotl, pid=29373)     simple_launcher(args)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
(axolotl, pid=29373)     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
(axolotl, pid=29373) subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', '/sky_workdir/qlora-checkpoint.yaml']' returned non-zero exit status 1.
ERROR: Job 1 failed with return code list: [1]
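
For reference, the extra information the log itself asks for can be gathered inside the same container. A minimal sketch (image tag and paths taken from the log and the config below, not output I have captured):

# Open a shell in the image used by the config below
# (assumes the image drops into a shell; adjust if it has a fixed entrypoint).
docker run --gpus all -it winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 bash

# Inside the container:
python -m bitsandbytes                       # the report bitsandbytes asks for above
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"
echo "$BNB_CUDA_VERSION"                     # the warning says a manual override (118) is set
ls /usr/local/cuda/lib64 | grep libcudart    # which libcudart versions are actually present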

Steps to reproduce

Steps:

  1. pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
  2. sky check
  3. git clone https://github.com/skypilot-org/skypilot.git
  4. cd skypilot/llm/axolotl
  5. HF_TOKEN="" BUCKET="" sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

Config yaml

name: axolotl

resources:
  accelerators: V100:1
  cloud: aws  # optional
  use_spot: True

workdir: mistral

file_mounts:
  /sky-notebook:
    name: ${BUCKET}
    mode: MOUNT

setup: |
  docker pull winglian/axolotl-cloud:main-py3.10-cu118-2.1.2

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    huggingface-cli login --token ${HF_TOKEN}

  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml

envs:
  HF_TOKEN: # TODO: Replace with huggingface token
  BUCKET:
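
For context, the warning in the log flags a manual BNB_CUDA_VERSION override, which nothing in this yaml sets, so it presumably comes from the image or the environment. If clearing it (as the warning suggests) is the right move, the training invocation would become something like the sketch below; the extra LD_LIBRARY_PATH entry is the /usr/local/cuda/lib64 directory where the log actually found libcudart. Not verified.

# Sketch: same run command as above, with the bitsandbytes override cleared
docker run --gpus all \
  -e BNB_CUDA_VERSION= \
  -e LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 \
  -v ~/sky_workdir:/sky_workdir \
  -v /root/.cache:/root/.cache \
  -v /sky-notebook:/sky-notebook \
  winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
  accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml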

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@NanoCode012
Collaborator

Are you able to test with a newer GPU? I do not remember if bnb works well with V100.

@alistairwgillespie
Author

@NanoCode012 Any suggested accelerators? A100s are difficult to get ahold of on AWS. Thanks

@NanoCode012
Collaborator

A6000 or L4. You may also want to try some alternative providers, as AWS is quite expensive.
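
In the SkyPilot yaml above that is just a change to the resources block, e.g. (one possible swap, untested):

resources:
  accelerators: L4:1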
