Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Hi, I'm trying the public cloud example that trains Mistral on AWS, expecting a training run to spin up and complete. Instead, I get the CUDA error below. I've modified the config to use a single spot V100. In my testing I've tried the latest image versions and both the winglian/axolotl and winglian/axolotl-cloud image sources, which didn't help.
Current behaviour
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
(axolotl, pid=29373) BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
(axolotl, pid=29373) If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
(axolotl, pid=29373) If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
(axolotl, pid=29373) For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
(axolotl, pid=29373) Loading CUDA version: BNB_CUDA_VERSION=118
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373)
(axolotl, pid=29373)
(axolotl, pid=29373) warn((f'\n\n{"="*80}\n'
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run
(axolotl, pid=29373)
(axolotl, pid=29373) python -m bitsandbytes
(axolotl, pid=29373)
(axolotl, pid=29373)
(axolotl, pid=29373) warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
(axolotl, pid=29373) warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},but this might missmatch with the CUDA version that is needed for bitsandbytes.To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variableFor example, if you want to use the CUDA version 122BNB_CUDA_VERSION=122 python ...OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
(axolotl, pid=29373) warn(msg)
(axolotl, pid=29373) /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU! If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
(axolotl, pid=29373) warn(msg)
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 187, in _run_module_as_main
(axolotl, pid=29373) mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 110, in _get_module_details
(axolotl, pid=29373) __import__(pkg_name)
(axolotl, pid=29373) File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 24, in <module>
(axolotl, pid=29373) from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
(axolotl, pid=29373) File "/workspace/axolotl/src/axolotl/common/cli.py", line 12, in <module>
(axolotl, pid=29373) from axolotl.utils.models import load_model, load_tokenizer
(axolotl, pid=29373) File "/workspace/axolotl/src/axolotl/utils/models.py", line 8, in <module>
(axolotl, pid=29373) import bitsandbytes as bnb
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
(axolotl, pid=29373) from . import cuda_setup, utils, research
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
(axolotl, pid=29373) from . import nn
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
(axolotl, pid=29373) from .modules import LinearFP8Mixed, LinearFP8Global
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
(axolotl, pid=29373) from bitsandbytes.optim import GlobalOptimManager
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
(axolotl, pid=29373) from bitsandbytes.cextension import COMPILED_WITH_CUDA
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
(axolotl, pid=29373) raise RuntimeError('''
(axolotl, pid=29373) RuntimeError:
(axolotl, pid=29373) CUDA Setup failed despite GPU being available. Please run the following command to get more information:
(axolotl, pid=29373)
(axolotl, pid=29373) python -m bitsandbytes
(axolotl, pid=29373)
(axolotl, pid=29373) Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
(axolotl, pid=29373) to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
(axolotl, pid=29373) and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
(axolotl, pid=29373) False
(axolotl, pid=29373)
(axolotl, pid=29373) ===================================BUG REPORT===================================
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
(axolotl, pid=29373) The following directories listed in your path were found to be non-existent: {PosixPath('/workspace/data/huggingface-cache/datasets')}
(axolotl, pid=29373) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
(axolotl, pid=29373) DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
(axolotl, pid=29373) CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 7.0.
(axolotl, pid=29373) CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) CUDA SETUP: Required library version not found: libbitsandbytes_cuda118_nocublaslt.so. Maybe you need to compile it from source?
(axolotl, pid=29373) CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
(axolotl, pid=29373)
(axolotl, pid=29373) ================================================ERROR=====================================
(axolotl, pid=29373) CUDA SETUP: CUDA detection failed! Possible reasons:
(axolotl, pid=29373) 1. You need to manually override the PyTorch CUDA version. Please see: "https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
(axolotl, pid=29373) 2. CUDA driver not installed
(axolotl, pid=29373) 3. CUDA not installed
(axolotl, pid=29373) 4. You have multiple conflicting CUDA libraries
(axolotl, pid=29373) 5. Required library not pre-compiled for this bitsandbytes release!
(axolotl, pid=29373) CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
(axolotl, pid=29373) CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
(axolotl, pid=29373) ================================================================================
(axolotl, pid=29373)
(axolotl, pid=29373) CUDA SETUP: Something unexpected happened. Please compile from source:
(axolotl, pid=29373) git clone https://github.com/TimDettmers/bitsandbytes.git
(axolotl, pid=29373) cd bitsandbytes
(axolotl, pid=29373) CUDA_VERSION=118 make cuda11x_nomatmul
(axolotl, pid=29373) python setup.py install
(axolotl, pid=29373) CUDA SETUP: Setup Failed!
(axolotl, pid=29373) Traceback (most recent call last):
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
(axolotl, pid=29373) sys.exit(main())
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
(axolotl, pid=29373) args.func(args)
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
(axolotl, pid=29373) simple_launcher(args)
(axolotl, pid=29373) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
(axolotl, pid=29373) raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
(axolotl, pid=29373) subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', '/sky_workdir/qlora-checkpoint.yaml']' returned non-zero exit status 1.
ERROR: Job 1 failed with return code list: [1]
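Reading the log, the failure seems to come down to which binary bitsandbytes selects: with CUDA 118 and a V100 (compute capability 7.0, i.e. below 7.5) it looks for the no-cuBLASLt build, `libbitsandbytes_cuda118_nocublaslt.so`, which apparently isn't shipped in the image. A minimal sketch of that naming logic as I reconstruct it from the log (this is my guess, not the actual bitsandbytes code):

```python
def bnb_library_name(cuda_version: str, compute_capability: float) -> str:
    """Hypothetical reconstruction of bitsandbytes' binary selection,
    based on the log above: GPUs below compute capability 7.5 fall back
    to the no-cuBLASLt build."""
    suffix = "" if compute_capability >= 7.5 else "_nocublaslt"
    return f"libbitsandbytes_cuda{cuda_version}{suffix}.so"

# V100 (compute capability 7.0) with the image's CUDA 11.8 PyTorch build:
print(bnb_library_name("118", 7.0))  # libbitsandbytes_cuda118_nocublaslt.so
```

So the error is specific to pre-Turing GPUs; an instance with compute capability >= 7.5 would look for the plain `libbitsandbytes_cuda118.so` instead.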
Steps to reproduce
Steps:
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
Config yaml

```yaml
name: axolotl

resources:
  accelerators: V100:1
  cloud: aws  # optional
  use_spot: True

workdir: mistral

file_mounts:
  /sky-notebook:
    name: ${BUCKET}
    mode: MOUNT

setup: |
  docker pull winglian/axolotl-cloud:main-py3.10-cu118-2.1.2

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    huggingface-cli login --token ${HF_TOKEN}

  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml

envs:
  HF_TOKEN: # TODO: Replace with huggingface token
  BUCKET:
```
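With the YAML above saved locally (I call it `axolotl.yaml`; the filename is mine), I launch the task with SkyPilot, passing the two env vars:

```shell
sky launch -c axolotl axolotl.yaml \
  --env HF_TOKEN=<your-hf-token> \
  --env BUCKET=<your-bucket-name>
```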
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements