train_loss is 0.0 in 7.0 but works fine on 7.5 and 8.6 #343

Closed

adibMosharrof opened this issue Apr 25, 2023 · 7 comments

adibMosharrof commented Apr 25, 2023

Hello,

I have the same Python environment on different machines, but when I run my code on the machine with a Tesla V100-SXM2-32GB GPU, which has compute capability 7.0, I get a train_loss of 0.0.

On machines with compute capability 7.5 (Nvidia Titan RTX) and 8.6 (RTX 3090), the train_loss is not 0.0.

I installed the library with pip install bitsandbytes==0.38.0.

I had to manually apply the fix from #300 to bitsandbytes/cuda_setup/main.py.

The release notes mention that as of v0.37.0, all GPUs are supported.

#240 also discusses the train loss becoming 0.0.

I am using peft, which is what led me here. I also have an open issue in peft:

huggingface/peft#334

and a sample notebook that shows what I am doing:

https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing

Here is the output of the bitsandbytes setup:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /project/msi290_uksr/generative_tod/myenv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
CUDA SETUP: CUDA runtime path found: /project/msi290_uksr/generative_tod/myenv/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 116

@zhaoqf123

Based on a survey of similar issues found on the internet and our own experiments, the root cause is that the V100 does not support int8 tensor cores, so bitsandbytes (bnb) cannot apply native int8 matrix multiplication on the V100.

However, bnb adopts a workaround in this version:

> 0.37.0: Int8 Matmul + backward support for all GPUs
>
> Features:
>
> - Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
> - Int8 now supported on all GPUs. On devices with compute capability < 7.5, the Int weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov
>
> Improvements:
>
> - Improved logging for the CUDA detection mechanism.

Compared to native int8 matrix-multiplication support, this workaround may accumulate larger errors as fine-tuning goes on, leading to an unstable loss that is either very large or 0.
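
For intuition, here is a minimal sketch of the fallback described in the release note, i.e. dequantizing the int8 weights to fp16 before the matmul on GPUs without int8 tensor cores. This is an illustration only, not the actual bitsandbytes kernel; the function name and the per-row absmax scaling scheme are assumptions.

```python
import torch

def int8_matmul_fallback(x_fp16: torch.Tensor, w_int8: torch.Tensor, row_absmax: torch.Tensor) -> torch.Tensor:
    """Illustrative only: dequantize int8 weights to fp16, then do a plain fp16 matmul.

    x_fp16:     [batch, in_features]          activations in fp16
    w_int8:     [out_features, in_features]   weights quantized to int8 with per-row absmax
    row_absmax: [out_features]                per-row absmax values used for quantization
    """
    # Dequantize: w ≈ w_int8 * absmax / 127 (standard absmax int8 quantization)
    w_fp16 = w_int8.to(torch.float16) * (row_absmax.to(torch.float16) / 127.0).unsqueeze(1)
    # The fp16 matmul itself is fine, but the round trip through int8 already lost
    # precision, and those errors can compound over many fine-tuning steps.
    return x_fp16 @ w_fp16.t()
```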

Currently, we have two methods to mitigate the issue:

  1. Set llm_int8_threshold to a smaller value. This reduces the number of parameters that go through the int8 matrix multiplication (more outlier features are kept in fp16), which reduces the instability in training. The side effect is that memory consumption will increase.
import torch
from transformers import (
    AutoModel,
    BitsAndBytesConfig,
)

device_map = "auto"
llm_int8_threshold = 3.5  # lower than the default of 6.0
model = AutoModel.from_pretrained(
    base_model,    # path or Hub id of the base model
    cache_dir=cache_dir,
    load_in_8bit=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=llm_int8_threshold),
    torch_dtype=torch.float16,
    device_map=device_map,
    trust_remote_code=True,
)
  2. Set the learning rate to a smaller value; a minimal sketch follows below.
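
For the second method, a minimal sketch of lowering the learning rate with the Hugging Face Trainer (the specific value, output_dir, and dataset names are placeholders, not taken from the notebook above):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",             # placeholder output directory
    learning_rate=1e-5,               # smaller than the 1e-4 / 2e-4 often used for LoRA fine-tuning
    fp16=True,
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,                      # the 8-bit model loaded as above
    args=training_args,
    train_dataset=train_dataset,      # placeholder dataset
)
trainer.train()
```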

Bear in mind that neither of these methods solves the issue completely.

@adibMosharrof (Author)

Unfortunately, what you suggested did not work; I am still getting NaN values.

@zhaoqf123

> Unfortunately, what you suggested did not work; I am still getting NaN values.

You can also try the optimizer bnb.optim.Adam8bit. Do you observe NaN from the beginning of the fine-tuning, i.e. do the first 10 steps all display NaN?
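
For reference, a minimal sketch of swapping in the 8-bit optimizer (the learning rate is a placeholder, and model is assumed to be loaded already):

```python
import bitsandbytes as bnb

# bnb.optim.Adam8bit is a drop-in replacement for torch.optim.Adam
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# With the Hugging Face Trainer, the optimizer can be passed explicitly
# (the second element of the tuple is an optional LR scheduler):
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset,
#                   optimizers=(optimizer, None))
```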

@zhaoqf123

> Unfortunately, what you suggested did not work; I am still getting NaN values.

#165 (comment)

Check the solution linked above; it should already fix this.


adibMosharrof commented May 16, 2023

For training, I get a loss value the first time it is logged, but from the second time onward the loss is 0.
For the eval loss, I get NaN from the first evaluation step.

@TingchenFu

Hi, have you solved the problem, @adibMosharrof? I am encountering a similar issue. I cannot even load the model (BLOOM) onto 8×V100 GPUs.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
