Garbled output on very long prompts #339

Open
LLukas22 opened this issue May 21, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@LLukas22
Contributor

Describe the bug
Models seem to produce garbled output on very long prompts.

If I use the following script:

import openai
from transformers import AutoTokenizer

if __name__ == "__main__":
    # Base URL of the local mistralrs server (placeholder; adjust host/port to your setup).
    MISTRAL = "http://localhost:1234/v1"
    client = openai.Client(api_key="foobar", base_url=MISTRAL)
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    with open("prompt.txt", "r", encoding="UTF-8") as f:
        content = f.read()
        
    print(len(tok.encode(content)))
    
    response = client.chat.completions.create(
        model="llama3",
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        max_tokens=256,
        temperature=0.0,
    )
    print(response.choices[0])

to send a 7368-token prompt to a mistralrs server, I receive the following output:

Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', role='assistant', function_call=None, tool_calls=None))

Meaning that the server just filled the rest of the context length with !.

If I send the same prompt to an Ollama server, I get the following result:

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='One mile is approximately equal to 1.6093 kilometers.', role='assistant', function_call=None, tool_calls=None))

Which is the correct answer for the given prompt.

The prompt I used:
prompt.txt

The server parameters:
--isq Q4K plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama

Latest commit: Release 0.1.9

LLukas22 added the bug label on May 21, 2024
@EricLBuehler
Owner

EricLBuehler commented May 22, 2024

Thank you for reporting this.

I think that this may be a problem with needing llama cache shifting.

Edit: I opened #341; can you please try to reproduce it there?
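
For context, "cache shifting" refers to evicting the oldest KV-cache entries once the sequence grows beyond the context (or sliding) window, so attention only ever sees the most recent positions. Below is a minimal sketch of that idea; the `KvCache` type and `shift_to_fit` helper are illustrative names only, not the mistral.rs API:

// Minimal sketch of KV-cache shifting: once the cached sequence exceeds the
// window, drop the oldest positions so only the newest `window` entries remain.
// `KvCache` and `shift_to_fit` are illustrative names, not mistral.rs code.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one key vector per cached position (layers/heads omitted)
    values: Vec<Vec<f32>>, // one value vector per cached position
    window: usize,
}

impl KvCache {
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
        self.shift_to_fit();
    }

    // Keep only the newest `window` positions; older ones are evicted.
    fn shift_to_fit(&mut self) {
        if self.keys.len() > self.window {
            let excess = self.keys.len() - self.window;
            self.keys.drain(..excess);
            self.values.drain(..excess);
        }
    }
}

fn main() {
    let mut cache = KvCache { keys: vec![], values: vec![], window: 4 };
    // Append 6 positions into a window of 4: the 2 oldest get evicted.
    for i in 0..6 {
        cache.append(vec![i as f32], vec![i as f32]);
    }
    assert_eq!(cache.keys.len(), 4);
    assert_eq!(cache.values.len(), 4);
}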

@EricLBuehler
Owner

EricLBuehler commented May 25, 2024

@LLukas22, looks like this issue stretches as far back as 4ffe68d (v0.1.2). I can reproduce it with Mistral, and also Llama, but not Phi3 128k. It appears that the sliding window is at fault for Mistral, but for Llama, it is especially strange because the context length is 8k.

The v0.1.2 code used the same masking strategy as the current Candle method, while we currently use one similar to the implementation here.

Code from v0.1.2, using the Candle method:

fn prepare_decoder_attention_mask(
    &self,
    b_size: usize,
    tgt_len: usize,
    seqlen_offset: usize,
) -> Result<Tensor> {
    // Sliding window mask
    let sliding_window = self.sliding_window.unwrap_or(tgt_len + 1);
    let mask: Vec<_> = (0..tgt_len)
        .flat_map(|i| {
            (0..tgt_len).map(move |j| {
                if i < j || j + sliding_window < i {
                    f32::NEG_INFINITY
                } else {
                    0.
                }
            })
        })
        .collect();
    let mask = Tensor::from_slice(&mask, (tgt_len, tgt_len), &self.device)?;
    let mask = if seqlen_offset > 0 {
        let mask0 = Tensor::zeros((tgt_len, seqlen_offset), DType::F32, &self.device)?;
        Tensor::cat(&[&mask0, &mask], D::Minus1)?
    } else {
        mask
    };
    mask.expand((b_size, 1, tgt_len, tgt_len + seqlen_offset))?
        .to_dtype(self.dtype)
}

Core of current method:

let mask = self.make_mask(tgt_len, past_kv_len, input_ids.device())?;
let diagonal = past_kv_len as isize - sliding_window as isize - 1;
let context_mask = apply_tril(&mask.ones_like()?, diagonal)?;
let mask = masked_fill(&mask.to_dtype(DType::F32)?, &context_mask, f32::MIN)?;
let mask = mask
    .expand((b_sz, 1, tgt_len, tgt_len + past_kv_len))?
    .to_dtype(DType::U8)?;
Some(mask)

Do you see anything wrong with the way the sliding window is done, even with the v0.1.2 code? I will try to get this fixed soon; perhaps we can try to reproduce it with the Candle version.
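
To make the sliding-window pattern easier to eyeball, here is a small self-contained sketch (plain nested Vecs instead of candle tensors) that reproduces the v0.1.2 masking condition above for a tiny example; `X` marks a masked position and `.` an attended one:

// Standalone sketch of the v0.1.2 sliding-window causal mask (no candle):
// the new tgt_len x tgt_len block uses `i < j || j + sliding_window < i`,
// and the seqlen_offset prefix is all zeros (never masked), as in the code above.
fn sliding_window_mask(tgt_len: usize, seqlen_offset: usize, sliding_window: usize) -> Vec<Vec<bool>> {
    (0..tgt_len)
        .map(|i| {
            (0..tgt_len + seqlen_offset)
                .map(|col| {
                    if col < seqlen_offset {
                        // Cached prefix: v0.1.2 prepends zeros, i.e. never masked.
                        false
                    } else {
                        let j = col - seqlen_offset;
                        i < j || j + sliding_window < i
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // 4 new tokens, 3 cached tokens, window of 2.
    for row in sliding_window_mask(4, 3, 2) {
        let line: String = row.iter().map(|&m| if m { 'X' } else { '.' }).collect();
        println!("{line}");
    }
}

Printing this for a few values of tgt_len and seqlen_offset gives something concrete to diff against whatever the current apply_tril/masked_fill path produces for the same shapes.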

@EricLBuehler
Owner

@LLukas22, I think I figured it out. It looks like it works when not using ISQ, but if ISQ is used then it breaks. This is probably because we quantize slightly differently than the GGUF implementation?
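
For background on why two 4-bit schemes can drift apart: block-wise quantization picks one scale per block of weights and rounds everything in the block onto that grid, so differences in block size or scale selection change the rounding error. A deliberately simplified round-trip sketch, not the actual ISQ or GGUF Q4K code:

// Highly simplified block-wise 4-bit quantization round-trip (not the real
// ISQ / GGUF Q4K code): one absmax scale per block, values rounded to 4 bits.
fn quantize_dequantize(weights: &[f32], block_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(weights.len());
    for block in weights.chunks(block_size) {
        // Per-block scale chosen from the largest magnitude in the block.
        let absmax = block.iter().fold(0f32, |m, &w| m.max(w.abs()));
        let scale = if absmax == 0.0 { 1.0 } else { absmax / 7.0 }; // signed 4-bit range ~ [-7, 7]
        for &w in block {
            let q = (w / scale).round().clamp(-7.0, 7.0); // quantize
            out.push(q * scale); // dequantize
        }
    }
    out
}

fn main() {
    let weights = [0.02f32, -0.5, 0.13, 0.9, -0.07, 0.31, -0.42, 0.08];
    let roundtrip = quantize_dequantize(&weights, 4);
    // The round-trip error depends on block size and scale choice, which is why
    // two otherwise similar 4-bit schemes can give noticeably different outputs.
    for (w, r) in weights.iter().zip(&roundtrip) {
        println!("{w:+.3} -> {r:+.3} (err {:+.4})", r - w);
    }
}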

@EricLBuehler
Owner

EricLBuehler commented Jun 4, 2024

Perhaps #377 will help this?
