Ran into crashes when testing LLM.int8() from transformers #18

Closed
changlan opened this issue Aug 20, 2022 · 4 comments

@changlan

Hi, I was testing LLM.int8() on the LongT5 model, but I consistently ran into the following error:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /opt/conda/envs/python38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...

python3: /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu:375: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t*, const int8_t*, void*, float*, int, int, int) [with int FORMATB = 3; int DTYPE_OUT = 32; int SCALE_ROWS = 0; cublasLtHandle_t = cublasLtContext*; int8_t = signed char]: Assertion `false' failed.
Aborted

Sample script to reproduce:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

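# Load the tokenizer and the 8-bit (LLM.int8()) model; load_in_8bit=True requires bitsandbytes and accelerate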
tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-large')
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-large', device_map="auto", load_in_8bit=True)

sentences = ['hello world']

inputs = tokenizer(sentences, return_tensors="pt", padding=True)

output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
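
As a side note, a minimal diagnostic sketch for checking which GPU and compute capability PyTorch sees (the setup log above reports 7.0, for which bitsandbytes loads the _nocublaslt binary):

import torch

# Print each visible GPU and the compute capability PyTorch reports for it
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(torch.cuda.get_device_name(i), f"compute capability {major}.{minor}")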
@younesbelkada
Collaborator

Hi @changlan, thanks for your message!
I managed to run your script on Google Colab without any issue, so I suspect there might be something wrong with bitsandbytes-cuda110; pinging @TimDettmers.
On the other hand, I think you should use the T5 models hosted by Hugging Face, such as t5-base, as I got weird output even with the native model for checkpoints hosted under the google namespace.
Please check this Colab where I use your script: https://colab.research.google.com/drive/1Nfo-BnIMUy0xL4JlwMJ6hM5oms2Vdhis?usp=sharing
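
For instance, a variant of the reproduction script that swaps in the task-trained t5-base checkpoint (just a sketch, assuming the same transformers/bitsandbytes/accelerate setup as above) could look like:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Same flow as the original script, but with the task-trained t5-base checkpoint
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('t5-base', device_map="auto", load_in_8bit=True)

inputs = tokenizer(['translate English to German: hello world'], return_tensors="pt", padding=True)

output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))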

@lessw2020

@changlan @younesbelkada - the Google versions of T5 (a la google/t5-v1_1_...) are not designed to be runnable as-is, unlike t5-large, etc.
The v1.1 versions are set up to be better starting points for fine-tuning, but have no actual task training, unlike the original T5s.
Thus, beyond your specific error, I would not recommend trying to do anything directly with these, other than using them as better starting points for task fine-tuning. (I used one to build a grammar checker, as an example.)

@changlan
Author

changlan commented Aug 21, 2022 via email

@lessw2020

Good to hear! (For reference, T5 was trained in BFloat16, so that's likely why. If you are on AWS, try G5 instances; those use A10 GPUs, are BFloat16-compatible, and work nicely.)
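
For example (a sketch, assuming bf16-capable hardware and transformers with accelerate installed), loading the native model in bfloat16 instead of 8-bit could look like:

import torch
from transformers import AutoModelForSeq2SeqLM

# Load the checkpoint in bfloat16 rather than 8-bit; needs a bf16-capable GPU such as an A10 or A100
model_bf16 = AutoModelForSeq2SeqLM.from_pretrained(
    'google/t5-v1_1-large',
    torch_dtype=torch.bfloat16,
    device_map="auto",
)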

Re: increased inference latency - there's a whole separate thread on this. There are some recent improvements and more coming:
#6
