
Endless response using the Phi-4-mini-instruct model #1450

Open
f2bo opened this issue May 6, 2025 · 9 comments

f2bo commented May 6, 2025

I'm experiencing a problem with the Phi-4-mini-instruct model where it generates responses that begin to repeat text until max_length is reached.

To Reproduce
I see this problem with my application, currently using the OnnxRuntimeGenAIChatClient, but also with the sample applications in this repo, for example, the HelloPhi app.

To reproduce it, run the app with the cpu execution provider and the Phi-4-mini-instruct model, using the prompt "Explain how lasers work".

Note: You may need to increase the max_length value (e.g. to 750) to see the problem, as the response to this prompt from some models sometimes exceeds the hard-coded value.

Run:

HelloPhi -m E:\models\Phi-4-mini-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4 -e cpu

This results in the following output (the screenshot shows only the last lines).

[screenshot: last lines of the output, showing the repeated text]

Expected behavior
Response should end normally.

Desktop:

  • OS: Windows 11
  • Browser: NA
  • Version: 24H2

Additional context

The problem does not occur when running the HelloPhi app with a different model, for example, Phi-3-mini-128K-instruct.

I also noticed that the problem does not seem to exist in the Python sample chat app. After some digging, I narrowed the difference to the prompt template being used.

The HelloPhi app uses the following template:

var sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");

whereas the Python app uses:

args.chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

Notice that, besides the newline characters, which do not seem to matter, there's a single space character following the {input} placeholder. Adding this single space character to the prompt in the HelloPhi app seems to fix the problem. Conversely, the problem also becomes reproducible in the Python code if the space character is removed.

Perhaps this is a known problem and the space was purposely added to the Python code, though it seems quite unexpected.
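
In case it's useful for comparing the two, here's a rough sketch of how the difference can be inspected with the onnxruntime_genai tokenizer, so you can see how the trailing space changes the encoded sequence (the model path is just the local path from the repro command above; adjust as needed):

import onnxruntime_genai as og

# Local path from the repro command above; adjust as needed.
model_path = r"E:\models\Phi-4-mini-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)

prompt = "Explain how lasers work"
hellophi_style = f"<|user|>{prompt}<|end|><|assistant|>"       # HelloPhi template
python_style = f"<|user|>\n{prompt} <|end|>\n<|assistant|>"    # Python sample template

# Compare the token ids produced by the two templates.
print("HelloPhi template:", list(tokenizer.encode(hellophi_style)))
print("Python template:  ", list(tokenizer.encode(python_style)))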

f2bo commented May 12, 2025

After further testing, the space character in the prompt now appears to be unrelated, as I continue to see the problem with or without it, which I suppose is not all that surprising.

@jiafatom

This is a known issue. You need to use the latest onnxruntime-genai model builder to regenerate the model (select k_quant_mixed as the quantization algorithm), and use the latest onnxruntime 1.22 (we just released it) to run it.

f2bo commented May 13, 2025

@jiafatom

This is a known issue. You need to use the latest onnxruntime-genai model builder to regenerate the model (select k_quant_mixed as the quantization algorithm), and use the latest onnxruntime 1.22 (we just released it) to run it.

I don't know if I followed your instructions correctly, but I still see the problem with the regenerated model. This is the command line I used to run the model builder. Can you confirm whether it is correct?

python -m onnxruntime_genai.models.builder -m microsoft/Phi-4-mini-instruct -o ./models/microsoft/cpu_and_mobile/phi4-mini-regenerated -p int4 -e cpu --extra_options int4_algo_config=k_quant_mixed

Note that I had to use onnxruntime-1.21.1 to run the model builder, as I was getting the following error with onnxruntime-1.22.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "E:\ModelBuilder\.venv\Lib\site-packages\onnxruntime_genai\models\builder.py", line 12, in <module>
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer, QuantFormat
ModuleNotFoundError: No module named 'onnxruntime.quantization.matmul_4bits_quantizer'
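
(For what it's worth, the failing import can be reproduced in isolation with a quick diagnostic like the following, which just tries the same import that builder.py uses in the traceback above:

# Diagnostic only: check the import that builder.py relies on.
try:
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer, QuantFormat
    print("matmul_4bits_quantizer is available")
except ModuleNotFoundError as err:
    import onnxruntime
    print(f"not available in onnxruntime {onnxruntime.__version__}: {err}")
)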

However, I did run the HelloPhi app with the latest ONNX runtime and the regenerated model. To do this, I added a direct reference to the 1.22.0 package.

<ItemGroup>
  <PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.22.0" />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug' OR '$(Configuration)' == 'Release' " />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug_Cuda' OR '$(Configuration)' == 'Release_Cuda' " />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.DirectML" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug_DirectML' OR '$(Configuration)' == 'Release_DirectML' " />
</ItemGroup>

[screenshot: HelloPhi output with the regenerated model]

@jiafatom

@f2bo As of today, if you want to regenerate this Phi-4 ONNX model, you need to build both onnxruntime and onnxruntime-genai from source, because part of the k_quant work is not included in onnxruntime 1.22.

After that, this command works for me: python -m onnxruntime_genai.models.builder -m microsoft/Phi-4-mini-instruct -o ./models/microsoft/cpu_and_mobile/phi4-mini-regenerated -p int4 -e cpu --extra_options int4_algo_config=k_quant_mixed

Then, when you run the model, you can use onnxruntime 1.22.
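
Once it's regenerated, a quick way to sanity-check the model outside of HelloPhi is something like the rough sketch below, using the onnxruntime-genai Python API (the output path matches the builder command above, and max_length is raised so a normal answer can finish). If the fix worked, the response should stop on its own well before max_length.

import onnxruntime_genai as og

# Output directory from the builder command above.
model = og.Model("./models/microsoft/cpu_and_mobile/phi4-mini-regenerated")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=750)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>\nExplain how lasers work <|end|>\n<|assistant|>"))

# Generate until the model emits EOS or max_length is hit.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))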

f2bo commented May 13, 2025

@jiafatom

As of today, if you want to regenerate this Phi-4 ONNX model, you need to build both onnxruntime and onnxruntime-genai from source, because part of the k_quant work is not included in onnxruntime 1.22.

I've never built either one from source, and I suspect it will take more time than I had originally anticipated. Let me see if I can find the time to give it a try and see how it goes.

Thanks!

f2bo commented May 15, 2025

I built onnxruntime and onnxruntime-genai from source and regenerated the model, and this time there's a difference. I've only tested very briefly, but prompts that previously resulted in repeating text no longer do so.

I do notice slower performance (about a 25% drop in tokens/sec). I don't know if that's expected.

Thanks again!

jiafatom commented May 15, 2025

@f2bo I think this is expected. Previously it was a pure int4 model (MatMul); what you generated now should be a mixed-precision model with some int4 and some int8 MatMuls, so you should observe slower performance. I haven't checked what percentage drop is expected.
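
If you want to confirm what the builder produced, a rough way to check (assuming the quantized layers show up as MatMulNBits nodes with a bits attribute, and that the model file is model.onnx in the output directory) is to count them by bit width:

from collections import Counter
import onnx

# Hypothetical path: the model file inside the builder's output directory.
m = onnx.load("./models/microsoft/cpu_and_mobile/phi4-mini-regenerated/model.onnx",
              load_external_data=False)

bits = Counter()
for node in m.graph.node:
    if node.op_type == "MatMulNBits":
        for attr in node.attribute:
            if attr.name == "bits":
                bits[attr.i] += 1

print(bits)  # a mixed-precision model should show both 4-bit and 8-bit entries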

f2bo commented May 15, 2025

I think this is expected. Previously it was a pure int4 model (MatMul); what you generated now should be a mixed-precision model with some int4 and some int8 MatMuls, so you should observe slower performance.

@jiafatom Got it.

I'll test a bit more to make sure and then I'll close the issue.

Thank you

@jiafatom

@f2bo Sure. We are working on some performance improvements that may help your case. (You may need to install onnxruntime from source later to include the new feature.)
