RuntimeError: CUDA error: device-side assert triggered #42

Yazooliu · 2023-08-02T09:13:53Z

Hi Team,

I hit the following error while running run_demo.py.

OS Environment:
CentOS
python version:
Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

StartUp method:
nohup python run_demo.py

Error info in nohup.out:
Traceback (most recent call last):
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/blocks.py", line 1097, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/utils.py", line 703, in wrapper
    response = f(*args, **kwargs)
  File "/home/llm/app/CodeGeeX2-6B/source/CodeGeeX2/run_demo_CodeGeeX2.py", line 117, in predict
    set_random_seed(seed)
  File "/home/llm/app/CodeGeeX2-6B/source/CodeGeeX2/run_demo_CodeGeeX2.py", line 104, in set_random_seed
    torch.manual_seed(seed)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _lazy_call
    callable()
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/random.py", line 111, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tried the following:
setting CUDA_LAUNCH_BLOCKING=1 in run_demo.py.

I am not sure what causes the error, or whether this approach is correct.
If it is correct, could I open a PR?
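
For reference, here is a minimal sketch of what setting the variable inside the script might look like. It is an illustration under assumptions, not the actual contents of run_demo.py: the helper name `enable_cuda_launch_blocking` is hypothetical, and the sketch assumes the variable has to be in place before torch initializes CUDA, so it is set before the torch import. As the error message itself says, CUDA_LAUNCH_BLOCKING=1 is a debugging aid: it should make the traceback point at the call that really fails, but it may not remove the underlying assert.

```python
import os


def enable_cuda_launch_blocking() -> None:
    """Request synchronous CUDA kernel launches for easier debugging.

    Hypothetical helper: it only sets the environment variable named in the
    PyTorch error message. It must run before CUDA is initialized.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


# Call this at the very top of the script, before torch is imported.
enable_cuda_launch_blocking()

# Assumption: the rest of the demo imports torch only after this point, e.g.
# import torch
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect can be had without touching the script by exporting the variable in the shell before starting the demo.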

Best regards,
Yazhou


Stanislas0 · 2023-08-02T12:15:17Z

Which torch version do you use? And the CUDA version?

Yazooliu · 2023-08-03T08:46:40Z

Which torch version do you use? And the CUDA version?

Here is the full pip list output:
Package Version

accelerate 0.21.0
aiofiles 23.1.0
aiohttp 3.8.5
aiosignal 1.3.1
altair 5.0.1
annotated-types 0.5.0
anyio 3.7.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
certifi 2023.7.22
charset-normalizer 3.2.0
click 8.1.6
cmake 3.27.0
contourpy 1.1.0
cpm-kernels 1.0.11
cycler 0.11.0
decorator 5.1.1
exceptiongroup 1.1.2
executing 1.2.0
fastapi 0.100.0
ffmpy 0.3.1
filelock 3.12.2
fonttools 4.41.1
frozenlist 1.4.0
fsspec 2023.6.0
gradio 3.39.0
gradio_client 0.3.0
h11 0.14.0
httpcore 0.17.3
httpx 0.24.1
huggingface-hub 0.16.4
idna 3.4
ipython 8.14.0
jedi 0.18.2
Jinja2 3.1.2
jsonschema 4.18.4
jsonschema-specifications 2023.7.1
kiwisolver 1.4.4
latex2mathml 3.76.0
linkify-it-py 2.0.2
lit 16.0.6
Markdown 3.4.4
markdown-it-py 2.2.0
MarkupSafe 2.1.3
matplotlib 3.7.2
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdtex2html 1.2.0
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
networkx 3.1
numpy 1.25.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
orjson 3.9.2
packaging 23.1
pandas 2.0.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.1.2
prompt-toolkit 3.0.39
protobuf 4.23.4
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pydantic 1.10.9
pydantic_core 2.3.0
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0.1
referencing 0.30.0
regex 2023.6.3
requests 2.31.0
rpds-py 0.9.2
safetensors 0.3.1
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 67.8.0
six 1.16.0
sniffio 1.3.0
sse-starlette 1.6.1
stack-data 0.6.2
starlette 0.27.0
sympy 1.12
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.0.0
typing_extensions 4.7.1
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 2.0.4
uvicorn 0.23.1
wcwidth 0.2.6
websockets 11.0.3
wheel 0.38.4
yarl 1.9.2

nvidia-smi: NVIDIA-SMI 510.108.03, Driver Version: 510.108.03, CUDA Version: 11.6. GPU: V100 32G.
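
The torch version and the CUDA version torch was built against can also be read directly from Python, which is more direct than inferring them from the package list. The sketch below is only an illustration: the function name `describe_torch_env` is hypothetical, and it returns placeholder values when torch is not importable.

```python
import importlib.util


def describe_torch_env() -> dict:
    """Collect the torch version, its CUDA build version, and CUDA availability."""
    info = {"torch": None, "cuda_build": None, "cuda_available": False}
    # Skip the import entirely if torch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return info
    import torch

    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


print(describe_torch_env())
```

Note that the CUDA version shown by nvidia-smi is the one supported by the driver, which can differ from the version torch was built with.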

Yazooliu · 2023-08-08T00:48:48Z

Which torch version do you use? And the CUDA version?

Could I open a PR to fix this issue? Thanks.

Best regards,
Yazhou

Yazooliu · 2023-08-11T05:59:53Z

I have opened a PR to fix this issue. Please review and verify.

Yazooliu · 2023-12-23T08:44:37Z

Is there any response to my PR?

Yazooliu mentioned this issue Aug 11, 2023

Fix the issue RuntimeError: CUDA error: device-side assert triggered #71

Open
