Closed as not planned
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
Description
Your current environment
The output of `python collect_env.py`
- Tested with vLLM v0.6.1, v0.6.4.post1, and v0.6.6.post1
- Torch 2.5.1
- Transformers 4.47.0
Model Input Dumps
No response
🐛 Describe the bug
I am using greedy decoding (temperature=0.0) on the same GPU, yet every time we run inference on the same data, the results differ substantially.
To reproduce, first start the API server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16 --enforce-eager --host 0.0.0.0 --port 8011 --gpu-memory-utilization 0.95
Then run the following script (batched requests via multithreading):
from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI
from tqdm.auto import tqdm

def multithread(func_to_call, list_data, max_workers=8):
    # Submit one request per item; results arrive in completion order.
    res = []
    with tqdm(total=len(list_data)) as pbar:
        with ThreadPoolExecutor(max_workers=min(len(list_data), max_workers)) as ex:
            futures = [ex.submit(func_to_call, k) for k in list_data]
            for future in as_completed(futures):
                res.append(future.result())
                pbar.update(1)
    return res

def multithread_wrapper(func_to_call, list_data, max_workers=128):
    # Tag each item with its index so outputs can be restored to input order.
    mthread_inputs = [(idx, item) for idx, item in enumerate(list_data)]
    wrapper_func = lambda x: [x[0], func_to_call(x[1])]
    wrapped_outputs = multithread(wrapper_func, mthread_inputs, max_workers=max_workers)
    wrapped_outputs = sorted(wrapped_outputs, key=lambda x: x[0])
    return [x[1].choices[0].message.content.strip() for x in wrapped_outputs]
openai_api_key = 'WHATEVER'  # the local server does not validate the key
openai_api_base = 'http://localhost:8011/v1'
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
func_to_call = lambda x: client.chat.completions.create(
    messages=x,
    model=model,
    max_tokens=1024,
    temperature=0.0,  # greedy decoding
    seed=9,
)
single_prompt = [
    {
        "role": "system",
        "content": "You are a superhelpful assistant",
    },
    {
        "role": "user",
        "content": "Hi how are you feeling ? Respond in a random non-English language",
    },
    {
        "role": "assistant",
        # Japanese: "I'm doing well! How about you?"
        "content": "元気です!あなたはどうですか?",
    },
    {
        "role": "user",
        "content": "Great, now teach me which language you used, and its grammar and basic phrases",
    },
]
list_prompts = [single_prompt] * 100

# single infer - for loop
# outs_1 = [func_to_call(p).choices[0].message.content.strip() for p in list_prompts]
# outs_2 = [func_to_call(p).choices[0].message.content.strip() for p in list_prompts]

# batch infer - multithread
outs_1 = multithread_wrapper(func_to_call, list_prompts)
outs_2 = multithread_wrapper(func_to_call, list_prompts)

# Count prompts whose outputs differ between the two runs.
cc = 0
for x, y in zip(outs_1, outs_2):
    if x != y:  # inconsistent output
        cc += 1
        # print(x)
        # print('-' * 5)
        # print(y)
        # print('=' * 10)
print("Num mismatch:", cc)
# Output: Num mismatch: 98
# Tested on an A100 PCIe GPU
Results are consistent if (i) I use a for loop (no batching, one example at a time) or (ii) I use offline inference, i.e. model.chat(...).
I believe there is currently a critical bug in continuous batching (since (ii) works).
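For reference, a minimal sketch of check (ii) using vLLM's offline LLM.chat API. It mirrors the server flags above and reuses the single_prompt message list from the script; passing a list of conversations to LLM.chat assumes the batched form available in recent 0.6.x releases.

from vllm import LLM, SamplingParams

# Offline engine configured like the server command above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)
params = SamplingParams(temperature=0.0, max_tokens=1024, seed=9)

# Run the same 100 identical conversations twice and compare.
outs_1 = [o.outputs[0].text.strip() for o in llm.chat([single_prompt] * 100, params)]
outs_2 = [o.outputs[0].text.strip() for o in llm.chat([single_prompt] * 100, params)]
print("Num mismatch:", sum(x != y for x, y in zip(outs_1, outs_2)))  # expected: 0, per the observation above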
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.