Garbled output on very long prompts #339

Open
LLukas22 opened this issue May 21, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@LLukas22
Contributor

Describe the bug
Models seem to produce garbled output on very long prompts.

If I use the following script:

import openai
from transformers import AutoTokenizer

if __name__ == "__main__":
    # Base URL of the local mistralrs server (placeholder; adjust host/port to your setup).
    MISTRAL = "http://localhost:1234/v1"
    client = openai.Client(api_key="foobar", base_url=MISTRAL)
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    with open("prompt.txt", "r", encoding="UTF-8") as f:
        content = f.read()
        
    print(len(tok.encode(content)))
    
    response = client.chat.completions.create(
        model="llama3",
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        max_tokens=256,
        temperature=0.0,
    )
    print(response.choices[0])

to send a 7368-token prompt to a mistralrs server, I receive the following output:

Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', role='assistant', function_call=None, tool_calls=None))

Meaning that the server just filled the rest of the context length with !.

If I send the same prompt to an Ollama server, I get the following result:

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='One mile is approximately equal to 1.6093 kilometers.', role='assistant', function_call=None, tool_calls=None))

Which is the correct answer for the given prompt.

The prompt I used:
prompt.txt

The server parameters:
--isq Q4K plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama

Latest commit: Release 0.1.9

LLukas22 added the bug label on May 21, 2024
@EricLBuehler
Owner

EricLBuehler commented May 22, 2024

Thank you for reporting this.

I think that this may be a problem with needing llama cache shifting.

Edit: I opened #341; can you please try to reproduce it there?
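
For context, "cache shifting" refers to evicting the oldest KV-cache entries once the sequence grows beyond the context (or sliding) window, so attention only ever sees the most recent positions. Below is a minimal sketch of that idea; the `KvCache` type and `shift_to_fit` helper are illustrative names only, not the mistral.rs API:

// Minimal sketch of KV-cache shifting: once the cached sequence exceeds the
// window, drop the oldest positions so only the newest `window` entries remain.
// `KvCache` and `shift_to_fit` are illustrative names, not mistral.rs code.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one key vector per cached position (layers/heads omitted)
    values: Vec<Vec<f32>>, // one value vector per cached position
    window: usize,
}

impl KvCache {
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
        self.shift_to_fit();
    }

    // Keep only the newest `window` positions; older ones are evicted.
    fn shift_to_fit(&mut self) {
        if self.keys.len() > self.window {
            let excess = self.keys.len() - self.window;
            self.keys.drain(..excess);
            self.values.drain(..excess);
        }
    }
}

fn main() {
    let mut cache = KvCache { keys: vec![], values: vec![], window: 4 };
    // Append 6 positions into a window of 4: the 2 oldest get evicted.
    for i in 0..6 {
        cache.append(vec![i as f32], vec![i as f32]);
    }
    assert_eq!(cache.keys.len(), 4);
    assert_eq!(cache.values.len(), 4);
}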

@EricLBuehler
Owner

EricLBuehler commented May 25, 2024

@LLukas22, looks like this issue stretches as far back as 4ffe68d (v0.1.2). I can reproduce it with Mistral, and also Llama, but not Phi3 128k. It appears that the sliding window is at fault for Mistral, but for Llama, it is especially strange because the context length is 8k.

The v0.1.2 code used the same masking strategy as the current Candle method, while we currently use one similar to the implementation here.

Code from v0.1.2, using the Candle method:

fn prepare_decoder_attention_mask(
    &self,
    b_size: usize,
    tgt_len: usize,
    seqlen_offset: usize,
) -> Result<Tensor> {
    // Sliding window mask
    let sliding_window = self.sliding_window.unwrap_or(tgt_len + 1);
    let mask: Vec<_> = (0..tgt_len)
        .flat_map(|i| {
            (0..tgt_len).map(move |j| {
                if i < j || j + sliding_window < i {
                    f32::NEG_INFINITY
                } else {
                    0.
                }
            })
        })
        .collect();
    let mask = Tensor::from_slice(&mask, (tgt_len, tgt_len), &self.device)?;
    let mask = if seqlen_offset > 0 {
        let mask0 = Tensor::zeros((tgt_len, seqlen_offset), DType::F32, &self.device)?;
        Tensor::cat(&[&mask0, &mask], D::Minus1)?
    } else {
        mask
    };
    mask.expand((b_size, 1, tgt_len, tgt_len + seqlen_offset))?
        .to_dtype(self.dtype)
}

Core of current method:

let mask = self.make_mask(tgt_len, past_kv_len, input_ids.device())?;
let diagonal = past_kv_len as isize - sliding_window as isize - 1;
let context_mask = apply_tril(&mask.ones_like()?, diagonal)?;
let mask = masked_fill(&mask.to_dtype(DType::F32)?, &context_mask, f32::MIN)?;
let mask = mask
    .expand((b_sz, 1, tgt_len, tgt_len + past_kv_len))?
    .to_dtype(DType::U8)?;
Some(mask)

Do you see anything wrong with the way the sliding window is done, even with the v0.1.2 code? I will try to get this fixed soon; perhaps we can try to reproduce it with the Candle version.
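
To make the sliding-window pattern easier to eyeball, here is a small self-contained sketch (plain nested Vecs instead of candle tensors) that reproduces the v0.1.2 masking condition above for a tiny example; `X` marks a masked position and `.` an attended one:

// Standalone sketch of the v0.1.2 sliding-window causal mask (no candle):
// the new tgt_len x tgt_len block uses `i < j || j + sliding_window < i`,
// and the seqlen_offset prefix is all zeros (never masked), as in the code above.
fn sliding_window_mask(tgt_len: usize, seqlen_offset: usize, sliding_window: usize) -> Vec<Vec<bool>> {
    (0..tgt_len)
        .map(|i| {
            (0..tgt_len + seqlen_offset)
                .map(|col| {
                    if col < seqlen_offset {
                        // Cached prefix: v0.1.2 prepends zeros, i.e. never masked.
                        false
                    } else {
                        let j = col - seqlen_offset;
                        i < j || j + sliding_window < i
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // 4 new tokens, 3 cached tokens, window of 2.
    for row in sliding_window_mask(4, 3, 2) {
        let line: String = row.iter().map(|&m| if m { 'X' } else { '.' }).collect();
        println!("{line}");
    }
}

Printing this for a few values of tgt_len and seqlen_offset gives something concrete to diff against whatever the current apply_tril/masked_fill path produces for the same shapes.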

@EricLBuehler
Owner

@LLukas22, I think I figured it out. It looks like it works when not using ISQ, but if ISQ is used then it breaks. This is probably because we quantize slightly differently than the GGUF implementation?
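
For background on why two 4-bit schemes can drift apart: block-wise quantization picks one scale per block of weights and rounds everything in the block onto that grid, so differences in block size or scale selection change the rounding error. A deliberately simplified round-trip sketch, not the actual ISQ or GGUF Q4K code:

// Highly simplified block-wise 4-bit quantization round-trip (not the real
// ISQ / GGUF Q4K code): one absmax scale per block, values rounded to 4 bits.
fn quantize_dequantize(weights: &[f32], block_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(weights.len());
    for block in weights.chunks(block_size) {
        // Per-block scale chosen from the largest magnitude in the block.
        let absmax = block.iter().fold(0f32, |m, &w| m.max(w.abs()));
        let scale = if absmax == 0.0 { 1.0 } else { absmax / 7.0 }; // signed 4-bit range ~ [-7, 7]
        for &w in block {
            let q = (w / scale).round().clamp(-7.0, 7.0); // quantize
            out.push(q * scale); // dequantize
        }
    }
    out
}

fn main() {
    let weights = [0.02f32, -0.5, 0.13, 0.9, -0.07, 0.31, -0.42, 0.08];
    let roundtrip = quantize_dequantize(&weights, 4);
    // The round-trip error depends on block size and scale choice, which is why
    // two otherwise similar 4-bit schemes can give noticeably different outputs.
    for (w, r) in weights.iter().zip(&roundtrip) {
        println!("{w:+.3} -> {r:+.3} (err {:+.4})", r - w);
    }
}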

@EricLBuehler
Owner

EricLBuehler commented Jun 4, 2024

Perhaps #377 will help this?
