Runtime error: CUDA Setup failed despite GPU being available (bitsandbytes) #1280

Open

alistairwgillespie opened this issue Feb 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@alistairwgillespie

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Hi, I'm trying the public cloud example that trains Mistral on AWS, expecting a training run to spin up and complete. Instead, I get the following CUDA error. I've modified the config to use a single spot V100. In my testing, I've tried the latest image versions and both the winglian/axolotl and winglian/axolotl-cloud image sources, which didn't help.

Current behaviour

(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
(axolotl, pid=29373) BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
(axolotl, pid=29373) If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
(axolotl, pid=29373) If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
(axolotl, pid=29373) For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
(axolotl, pid=29373) Loading CUDA version: BNB_CUDA_VERSION=118
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) 
(axolotl, pid=29373) 
(axolotl, pid=29373)   warn((f'\n\n{"="*80}\n'
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run
(axolotl, pid=29373) 
(axolotl, pid=29373) python -m bitsandbytes
(axolotl, pid=29373) 
(axolotl, pid=29373) 
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},but this might missmatch with the CUDA version that is needed for bitsandbytes.To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variableFor example, if you want to use the CUDA version 122BNB_CUDA_VERSION=122 python ...OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!                     If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
(axolotl, pid=29373)   warn(msg)
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 187, in _run_module_as_main
(axolotl, pid=29373)     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 110, in _get_module_details
(axolotl, pid=29373)     __import__(pkg_name)
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 24, in <module>
(axolotl, pid=29373)     from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/common/cli.py", line 12, in <module>
(axolotl, pid=29373)     from axolotl.utils.models import load_model, load_tokenizer
(axolotl, pid=29373)   File "/workspace/axolotl/src/axolotl/utils/models.py", line 8, in <module>
(axolotl, pid=29373)     import bitsandbytes as bnb
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
(axolotl, pid=29373)     from . import cuda_setup, utils, research
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
(axolotl, pid=29373)     from . import nn
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
(axolotl, pid=29373)     from .modules import LinearFP8Mixed, LinearFP8Global
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
(axolotl, pid=29373)     from bitsandbytes.optim import GlobalOptimManager
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
(axolotl, pid=29373)     from bitsandbytes.cextension import COMPILED_WITH_CUDA
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
(axolotl, pid=29373)     raise RuntimeError('''
(axolotl, pid=29373) RuntimeError: 
(axolotl, pid=29373)         CUDA Setup failed despite GPU being available. Please run the following command to get more information:
(axolotl, pid=29373) 
(axolotl, pid=29373)         python -m bitsandbytes
(axolotl, pid=29373) 
(axolotl, pid=29373)         Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
(axolotl, pid=29373)         to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
(axolotl, pid=29373)         and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
(axolotl, pid=29373) False
(axolotl, pid=29373) 
(axolotl, pid=29373) ===================================BUG REPORT===================================
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/workspace/data/huggingface-cache/datasets')}
(axolotl, pid=29373) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
(axolotl, pid=29373) DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
(axolotl, pid=29373) CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 7.0.
(axolotl, pid=29373) CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) CUDA SETUP: Required library version not found: libbitsandbytes_cuda118_nocublaslt.so. Maybe you need to compile it from source?
(axolotl, pid=29373) CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
(axolotl, pid=29373) 
(axolotl, pid=29373) ================================================ERROR=====================================
(axolotl, pid=29373) CUDA SETUP: CUDA detection failed! Possible reasons:
(axolotl, pid=29373) 1. You need to manually override the PyTorch CUDA version. Please see: "https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) 2. CUDA driver not installed
(axolotl, pid=29373) 3. CUDA not installed
(axolotl, pid=29373) 4. You have multiple conflicting CUDA libraries
(axolotl, pid=29373) 5. Required library not pre-compiled for this bitsandbytes release!
(axolotl, pid=29373) CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
(axolotl, pid=29373) CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) 
(axolotl, pid=29373) CUDA SETUP: Something unexpected happened. Please compile from source:
(axolotl, pid=29373) git clone https://github.com/TimDettmers/bitsandbytes.git
(axolotl, pid=29373) cd bitsandbytes
(axolotl, pid=29373) CUDA_VERSION=118 make cuda11x_nomatmul
(axolotl, pid=29373) python setup.py install
(axolotl, pid=29373) CUDA SETUP: Setup Failed!
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
(axolotl, pid=29373)     sys.exit(main())
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
(axolotl, pid=29373)     args.func(args)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
(axolotl, pid=29373)     simple_launcher(args)
(axolotl, pid=29373)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
(axolotl, pid=29373)     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
(axolotl, pid=29373) subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', '/sky_workdir/qlora-checkpoint.yaml']' returned non-zero exit status 1.
ERROR: Job 1 failed with return code list: [1]
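
For reference, the extra information the log itself asks for can be gathered inside the same container. A minimal sketch (image tag and paths taken from the log and the config below, not output I have captured):

# Open a shell in the image used by the config below
# (assumes the image drops into a shell; adjust if it has a fixed entrypoint).
docker run --gpus all -it winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 bash

# Inside the container:
python -m bitsandbytes                       # the report bitsandbytes asks for above
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"
echo "$BNB_CUDA_VERSION"                     # the warning says a manual override (118) is set
ls /usr/local/cuda/lib64 | grep libcudart    # which libcudart versions are actually present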

Steps to reproduce

Steps:

  1. pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
  2. sky check
  3. git clone https://github.com/skypilot-org/skypilot.git
  4. cd skypilot/llm/axolotl
  5. HF_TOKEN="" BUCKET="" sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET

Config yaml

name: axolotl

resources:
  accelerators: V100:1
  cloud: aws  # optional
  use_spot: True

workdir: mistral

file_mounts:
  /sky-notebook:
    name: ${BUCKET}
    mode: MOUNT

setup: |
  docker pull winglian/axolotl-cloud:main-py3.10-cu118-2.1.2

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    huggingface-cli login --token ${HF_TOKEN}

  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml

envs:
  HF_TOKEN: # TODO: Replace with huggingface token
  BUCKET:
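
For context, the warning in the log flags a manual BNB_CUDA_VERSION override, which nothing in this yaml sets, so it presumably comes from the image or the environment. If clearing it (as the warning suggests) is the right move, the training invocation would become something like the sketch below; the extra LD_LIBRARY_PATH entry is the /usr/local/cuda/lib64 directory where the log actually found libcudart. Not verified.

# Sketch: same run command as above, with the bitsandbytes override cleared
docker run --gpus all \
  -e BNB_CUDA_VERSION= \
  -e LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 \
  -v ~/sky_workdir:/sky_workdir \
  -v /root/.cache:/root/.cache \
  -v /sky-notebook:/sky-notebook \
  winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
  accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml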

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@NanoCode012
Collaborator

Are you able to test with a newer GPU? I do not remember if bnb works well with V100.

@alistairwgillespie
Author

@NanoCode012 Any suggested accelerators? A100s are difficult to get ahold of on AWS. Thanks

@NanoCode012
Collaborator

A6000 or L4. You may also want to try some alternative providers, as AWS is quite expensive.
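
In the SkyPilot yaml above that is just a change to the resources block, e.g. (one possible swap, untested):

resources:
  accelerators: L4:1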
