train_loss is 0.0 in 7.0 but works fine on 7.5 and 8.6 #343

Closed

adibMosharrof opened this issue Apr 25, 2023 · 7 comments

adibMosharrof commented Apr 25, 2023

Hello,

I have the same Python environment on different machines, but when I run my code on the machine with a Tesla V100-SXM2-32GB GPU, which has compute capability 7.0, I get a train_loss of 0.0.

On machines with compute capability 7.5 (Nvidia Titan RTX) and 8.6 (RTX 3090), the train_loss is not 0.0.

I installed the library with pip install bitsandbytes==0.38.0.

I had to manually apply the fix from #300 to bitsandbytes/cuda_setup/main.py.

The release notes mention that as of v0.37.0, all GPUs are supported.

#240 also discusses the train loss becoming 0.0.

I am using peft, which is what led me here. I also have an open issue in peft:

huggingface/peft#334

and a sample notebook that shows what I am doing:

https://colab.research.google.com/drive/16qKy92cGoNPWrlQ4zlvntVGeSgjrknVF?usp=sharing

Here is the output of the bitsandbytes setup:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /project/msi290_uksr/generative_tod/myenv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
CUDA SETUP: CUDA runtime path found: /project/msi290_uksr/generative_tod/myenv/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 116

@zhaoqf123

Based on a survey of similar issues found on the internet and our own experiments, the root cause is that the V100 does not support int8 tensor cores, so bitsandbytes (bnb) cannot apply native int8 matrix multiplication on the V100.

However, bnb adopts a workaround in this version:

> 0.37.0: Int8 Matmul + backward support for all GPUs
>
> Features:
>
> - Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
> - Int8 now supported on all GPUs. On devices with compute capability < 7.5, the Int weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov
>
> Improvements:
>
> - Improved logging for the CUDA detection mechanism.

Compared to native int8 matrix-multiplication support, this workaround may accumulate larger errors as fine-tuning goes on, leading to an unstable loss that is either very large or 0.
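
For intuition, here is a minimal sketch of the fallback described in the release note, i.e. dequantizing the int8 weights to fp16 before the matmul on GPUs without int8 tensor cores. This is an illustration only, not the actual bitsandbytes kernel; the function name and the per-row absmax scaling scheme are assumptions.

```python
import torch

def int8_matmul_fallback(x_fp16: torch.Tensor, w_int8: torch.Tensor, row_absmax: torch.Tensor) -> torch.Tensor:
    """Illustrative only: dequantize int8 weights to fp16, then do a plain fp16 matmul.

    x_fp16:     [batch, in_features]          activations in fp16
    w_int8:     [out_features, in_features]   weights quantized to int8 with per-row absmax
    row_absmax: [out_features]                per-row absmax values used for quantization
    """
    # Dequantize: w ≈ w_int8 * absmax / 127 (standard absmax int8 quantization)
    w_fp16 = w_int8.to(torch.float16) * (row_absmax.to(torch.float16) / 127.0).unsqueeze(1)
    # The fp16 matmul itself is fine, but the round trip through int8 already lost
    # precision, and those errors can compound over many fine-tuning steps.
    return x_fp16 @ w_fp16.t()
```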

Currently, we have two methods to mitigate the issue:

  1. Set llm_int8_threshold to a smaller value. This reduces the number of parameters that go through the int8 matrix multiplication (more outlier features are kept in fp16), which reduces the instability in training. The side effect is that memory consumption will increase.
import torch
from transformers import (
    AutoModel,
    BitsAndBytesConfig,
)

device_map = "auto"
llm_int8_threshold = 3.5  # lower than the default of 6.0
model = AutoModel.from_pretrained(
    base_model,    # path or Hub id of the base model
    cache_dir=cache_dir,
    load_in_8bit=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=llm_int8_threshold),
    torch_dtype=torch.float16,
    device_map=device_map,
    trust_remote_code=True,
)
  2. Set the learning rate to a smaller value; a minimal sketch follows below.
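
For the second method, a minimal sketch of lowering the learning rate with the Hugging Face Trainer (the specific value, output_dir, and dataset names are placeholders, not taken from the notebook above):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",             # placeholder output directory
    learning_rate=1e-5,               # smaller than the 1e-4 / 2e-4 often used for LoRA fine-tuning
    fp16=True,
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,                      # the 8-bit model loaded as above
    args=training_args,
    train_dataset=train_dataset,      # placeholder dataset
)
trainer.train()
```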

Bear in mind that neither of these methods solves the issue completely.

@adibMosharrof (Author)

Unfortunately, what you suggested did not work; I am still getting NaN values.

@zhaoqf123

> Unfortunately, what you suggested did not work; I am still getting NaN values.

You can also try the optimizer bnb.optim.Adam8bit. Do you observe NaN from the beginning of the fine-tuning, i.e. do the first 10 steps all display NaN?
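
For reference, a minimal sketch of swapping in the 8-bit optimizer (the learning rate is a placeholder, and model is assumed to be loaded already):

```python
import bitsandbytes as bnb

# bnb.optim.Adam8bit is a drop-in replacement for torch.optim.Adam
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# With the Hugging Face Trainer, the optimizer can be passed explicitly
# (the second element of the tuple is an optional LR scheduler):
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset,
#                   optimizers=(optimizer, None))
```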

@zhaoqf123

> Unfortunately, what you suggested did not work; I am still getting NaN values.

#165 (comment)

Check the solution linked above; it should already fix this.


adibMosharrof commented May 16, 2023

For training, I get a loss value the first time it is logged, but from the second time onward the loss is 0.
For the eval loss, I get NaN from the first evaluation step.

@TingchenFu

Hi, have you solved the problem, @adibMosharrof? I am encountering a similar issue. I cannot even load the model (BLOOM) onto 8×V100 GPUs.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
