Ran into crashes when testing LLM.int8() from transformers #18

Closed
changlan opened this issue Aug 20, 2022 · 4 comments

@changlan

Hi, I was testing LLM.int8() on the LongT5 model, but I consistently ran into the following error:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /opt/conda/envs/python38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...

python3: /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu:375: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t*, const int8_t*, void*, float*, int, int, int) [with int FORMATB = 3; int DTYPE_OUT = 32; int SCALE_ROWS = 0; cublasLtHandle_t = cublasLtContext*; int8_t = signed char]: Assertion `false' failed.
Aborted

Sample script to reproduce:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

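# Load the tokenizer and the 8-bit (LLM.int8()) model; load_in_8bit=True requires bitsandbytes and accelerate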
tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-large')
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-large', device_map="auto", load_in_8bit=True)

sentences = ['hello world']

inputs = tokenizer(sentences, return_tensors="pt", padding=True)

output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
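
As a side note, a minimal diagnostic sketch for checking which GPU and compute capability PyTorch sees (the setup log above reports 7.0, for which bitsandbytes loads the _nocublaslt binary):

import torch

# Print each visible GPU and the compute capability PyTorch reports for it
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(torch.cuda.get_device_name(i), f"compute capability {major}.{minor}")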
@younesbelkada
Collaborator

Hi @changlan, thanks for your message!
I managed to run your script on Google Colab without any issue, so I suspect there might be something wrong with bitsandbytes-cuda110; pinging @TimDettmers.
On the other hand, I think you should use the T5 models hosted by Hugging Face, such as t5-base, as I got weird output even with the native model for checkpoints hosted under the google namespace.
Please check this Colab where I use your script: https://colab.research.google.com/drive/1Nfo-BnIMUy0xL4JlwMJ6hM5oms2Vdhis?usp=sharing
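
For instance, a variant of the reproduction script that swaps in the task-trained t5-base checkpoint (just a sketch, assuming the same transformers/bitsandbytes/accelerate setup as above) could look like:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Same flow as the original script, but with the task-trained t5-base checkpoint
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('t5-base', device_map="auto", load_in_8bit=True)

inputs = tokenizer(['translate English to German: hello world'], return_tensors="pt", padding=True)

output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))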

@lessw2020

@changlan @younesbelkada - the Google versions of T5 (a la google/t5-v1_1_...) are not designed to be runnable as-is, unlike t5-large, etc.
The v1.1 versions are set up to be better starting points for fine-tuning, but have no actual task training, unlike the original T5s.
Thus, beyond your specific error, I would not recommend trying to do anything directly with these, other than using them as better starting points for task fine-tuning. (I used one to build a grammar checker, as an example.)

@changlan
Author

changlan commented Aug 21, 2022 via email

@lessw2020

Good to hear! (For reference, T5 was trained in BFloat16, so that's likely why. If you are on AWS, try G5 instances; those use A10 GPUs, are BFloat16-compatible, and work nicely.)
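
For example (a sketch, assuming bf16-capable hardware and transformers with accelerate installed), loading the native model in bfloat16 instead of 8-bit could look like:

import torch
from transformers import AutoModelForSeq2SeqLM

# Load the checkpoint in bfloat16 rather than 8-bit; needs a bf16-capable GPU such as an A10 or A100
model_bf16 = AutoModelForSeq2SeqLM.from_pretrained(
    'google/t5-v1_1-large',
    torch_dtype=torch.bfloat16,
    device_map="auto",
)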

Re: increased inference latency - there's a whole separate thread on this. There are some recent improvements and more coming:
#6
