[Bug] Why does prefix caching change the generated content #1719
Comments
We will verify it locally and update you on any progress.
It seems this cannot be reproduced with the llama2 (llama2_13B_chat) model, but it can be reproduced with llama3 and internlm2. So I guess this issue affects GQA models.
A workaround is to change https://huggingface.co/internlm/internlm2-chat-7b/blob/main/config.json#L28
The
We printed part of the KV cache values in each block to debug it:

```cpp
for (int i = 0; i < seq.blocks.size(); i++) {
  std::vector<half> v(20);
  // copy the first 20 half-precision values of this KV cache block to host
  Copy(static_cast<half*>(sequence_manager_->GetBlockPtr(seq.blocks[i])), 20, v.data());
  for (int k = 0; k < 20; k++) {
    std::cout << __half2float(v[k]) << " ";
  }
  std::cout << ", ";
}
```

We found some small value differences (probably caused by precision conversion) in the block after the cached blocks.
Discussion about float16 and bfloat16 can be found at #1140 (comment). Currently the issue is caused by precision problems. The KV cache values in the reused blocks are consistent; when the type is bfloat16, precision inconsistencies show up in the subsequently generated tokens.
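As a minimal sketch of the float16/bfloat16 precision gap (assuming PyTorch is available; this snippet is illustrative and not part of the original discussion): bfloat16 keeps only 7 mantissa bits while float16 keeps 10, so the same value rounds more coarsely in bfloat16.

```python
import torch

x = torch.tensor(1.001, dtype=torch.float32)
print(x.to(torch.float16).item())   # 1.0009765625 -- nearest float16 (10 mantissa bits)
print(x.to(torch.bfloat16).item())  # 1.0          -- nearest bfloat16 (7 mantissa bits)
```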
https://huggingface.co/internlm/internlm2-chat-7b/commit/5b50661e5ba16c9ded1047a51e394280b3b9bda1
@DayDayupupupup Could you try the latest version in this way?
I also got a different answer for internlm-xcomposer. When using the official internlm-xcomposer code, the model behaves correctly. However, lmdeploy does not produce the same answer. Changing bfloat16 to float16 doesn't help, BTW.
Are you referring to this: #1688?
I am using the latest version (commit 3e6b81c) and changed bf16 to f16. So why doesn't the old version with fp16 weights work?
When prefix caching is enabled, the cached part of the prompt is not prefilled again. This leads to a different GEMM problem size, and the dispatched kernel may be different. During GEMM, a different level of concurrency along the k-dimension leads to a different accumulation order and thus a different floating-point outcome.
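As a minimal illustration of this point (plain Python, not lmdeploy code): floating-point addition is not associative, so accumulating the same values in a different order, as a split-k style kernel might, can yield a different result.

```python
vals = [1e16, 1.0, -1e16, 1.0]

# Sequential accumulation, left to right.
forward = 0.0
for v in vals:
    forward += v

# Split into two partial sums that are combined at the end
# (analogous to a different parallel reduction order).
split = (vals[0] + vals[1]) + (vals[2] + vals[3])

print(forward, split)  # 1.0 vs 0.0 -- same inputs, different order, different result
```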
This #1719 (comment) may not be explained by that, though.
ref #1688 (comment)
To summarize, there are several scenarios where using temperature 0 results in output differences:
Is there currently a plan to address this issue? In some scenarios, such as generative search, the temperature is usually set very low or even to 0. When it is 0, it can be quite perplexing for algorithm engineers to find that the results are inconsistent with those from transformers. @lvhan028 @lzhangzz
Checklist
Describe the bug
Model: internlm2-chat-7b
GPU: A30
Version: 0.4.2
When enable_prefix_caching=True, the generated content is different from when enable_prefix_caching=False.
Reproduction
test script: internlm.py
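internlm.py itself is not attached to this issue; the following is a hypothetical minimal sketch of such a script based on lmdeploy's pipeline API. The prompt, sampling parameters, and argument names are assumptions.

```python
# Hypothetical reconstruction of internlm.py (the actual script is not attached
# to this issue). The prompt, sampling parameters, and argument names are
# assumptions; only lmdeploy's public pipeline API is used.
import argparse

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model', required=True, help='model path or name')
parser.add_argument('--enable_prefix_caching', action='store_true')
args = parser.parse_args()

pipe = pipeline(
    args.model,
    backend_config=TurbomindEngineConfig(
        enable_prefix_caching=args.enable_prefix_caching))

# Send the same prompt three times; with near-greedy sampling the three
# responses are expected to be identical.
prompt = 'Please introduce the city of Shanghai in detail.'  # assumed prompt
gen_config = GenerationConfig(temperature=0.01, top_k=1, max_new_tokens=256)

for i in range(3):
    response = pipe([prompt], gen_config=gen_config)[0]
    print(f'--- request {i + 1} ---')
    print(response.text)
```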
TEST 1 Disable prefix caching
python internlm.py -m internlm2-chat-7b
All three requests generate exactly the same content.
TEST 2 Enable prefix caching
python internlm.py -m internlm2-chat-7b --enable_prefix_caching
The generated content of the 2nd and 3rd requests differs from that of the first request.
Environment
Error traceback
No response