
Endless response using the Phi-4-mini-instruct model #1450

Open
f2bo opened this issue May 6, 2025 · 9 comments

f2bo commented May 6, 2025

I'm experiencing a problem with the Phi-4-mini-instruct model where it generates responses that begin to repeat text until max_length is reached.

To Reproduce
I see this problem with my application, currently using the OnnxRuntimeGenAIChatClient, but also with the sample applications in this repo, for example, the HelloPhi app.

To reproduce it, run the app with the cpu execution provider and the Phi-4-mini-instruct model, using the prompt "Explain how lasers work".

Note: You may need to increase the max_length value (e.g. to 750) to see the problem, as the response to this prompt from some models sometimes exceeds the hard-coded value.

Run:

HelloPhi -m E:\models\Phi-4-mini-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4 -e cpu

This results in the following output (the screenshot shows only the last lines).

[screenshot: last lines of the output, showing the repeated text]

Expected behavior
Response should end normally.

Desktop:

  • OS: Windows 11
  • Browser: NA
  • Version: 24H2

Additional context

The problem does not occur when running the HelloPhi app with a different model, for example, Phi-3-mini-128K-instruct.

I also noticed that the problem does not seem to exist in the Python sample chat app. After some digging, I narrowed the difference to the prompt template being used.

The HelloPhi app uses the following template:

var sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");

whereas the Python app uses:

args.chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

Notice that, besides the newline characters, which do not seem to matter, there's a single space character following the {input} placeholder. Adding this single space character to the prompt in the HelloPhi app seems to fix the problem. Conversely, the problem also becomes reproducible in the Python code if the space character is removed.

Perhaps this is a known problem and the space was purposely added to the Python code, though it seems quite unexpected.
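
In case it's useful for comparing the two, here's a rough sketch of how the difference can be inspected with the onnxruntime_genai tokenizer, so you can see how the trailing space changes the encoded sequence (the model path is just the local path from the repro command above; adjust as needed):

import onnxruntime_genai as og

# Local path from the repro command above; adjust as needed.
model_path = r"E:\models\Phi-4-mini-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)

prompt = "Explain how lasers work"
hellophi_style = f"<|user|>{prompt}<|end|><|assistant|>"       # HelloPhi template
python_style = f"<|user|>\n{prompt} <|end|>\n<|assistant|>"    # Python sample template

# Compare the token ids produced by the two templates.
print("HelloPhi template:", list(tokenizer.encode(hellophi_style)))
print("Python template:  ", list(tokenizer.encode(python_style)))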

f2bo commented May 12, 2025

After further testing, the space character in the prompt now appears to be unrelated, as I continue to see the problem with or without it, which I suppose is not all that surprising.

@jiafatom

This is a known issue. You need to use the latest onnxruntime-genai model builder to regenerate the model (select k_quant_mixed as the quantization algorithm), and use the latest onnxruntime 1.22 (we just released it) to run it.

f2bo commented May 13, 2025

@jiafatom

This is a known issue. You need to use the latest onnxruntime-genai model builder to regenerate the model (select k_quant_mixed as the quantization algorithm), and use the latest onnxruntime 1.22 (we just released it) to run it.

I don't know if I followed your instructions correctly, but I still see the problem with the regenerated model. This is the command line I used to run the model builder. Can you confirm whether it is correct?

python -m onnxruntime_genai.models.builder -m microsoft/Phi-4-mini-instruct -o ./models/microsoft/cpu_and_mobile/phi4-mini-regenerated -p int4 -e cpu --extra_options int4_algo_config=k_quant_mixed

Note that I had to use onnxruntime-1.21.1 to run the model builder, as I was getting the following error with onnxruntime-1.22.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "E:\ModelBuilder\.venv\Lib\site-packages\onnxruntime_genai\models\builder.py", line 12, in <module>
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer, QuantFormat
ModuleNotFoundError: No module named 'onnxruntime.quantization.matmul_4bits_quantizer'
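
(For what it's worth, the failing import can be reproduced in isolation with a quick diagnostic like the following, which just tries the same import that builder.py uses in the traceback above:

# Diagnostic only: check the import that builder.py relies on.
try:
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer, QuantFormat
    print("matmul_4bits_quantizer is available")
except ModuleNotFoundError as err:
    import onnxruntime
    print(f"not available in onnxruntime {onnxruntime.__version__}: {err}")
)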

However, I did run the HelloPhi app with the latest ONNX runtime and the regenerated model. To do this, I added a direct reference to the 1.22.0 package.

<ItemGroup>
  <PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.22.0" />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug' OR '$(Configuration)' == 'Release' " />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug_Cuda' OR '$(Configuration)' == 'Release_Cuda' " />
  <PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.DirectML" Version="0.7.1" Condition=" '$(Configuration)' == 'Debug_DirectML' OR '$(Configuration)' == 'Release_DirectML' " />
</ItemGroup>

[screenshot: HelloPhi output with the regenerated model]

@jiafatom

@f2bo As of today, if you want to regenerate this Phi-4 ONNX model, you need to build both onnxruntime and onnxruntime-genai from source, because part of the k_quant work is not included in onnxruntime 1.22.

After that, this command works for me: python -m onnxruntime_genai.models.builder -m microsoft/Phi-4-mini-instruct -o ./models/microsoft/cpu_and_mobile/phi4-mini-regenerated -p int4 -e cpu --extra_options int4_algo_config=k_quant_mixed

Then, when you run the model, you can use onnxruntime 1.22.
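
Once it's regenerated, a quick way to sanity-check the model outside of HelloPhi is something like the rough sketch below, using the onnxruntime-genai Python API (the output path matches the builder command above, and max_length is raised so a normal answer can finish). If the fix worked, the response should stop on its own well before max_length.

import onnxruntime_genai as og

# Output directory from the builder command above.
model = og.Model("./models/microsoft/cpu_and_mobile/phi4-mini-regenerated")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=750)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>\nExplain how lasers work <|end|>\n<|assistant|>"))

# Generate until the model emits EOS or max_length is hit.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))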

f2bo commented May 13, 2025

@jiafatom

As of today, if you want to regenerate this Phi-4 ONNX model, you need to build both onnxruntime and onnxruntime-genai from source, because part of the k_quant work is not included in onnxruntime 1.22.

I've never built either one from source, and I suspect it will take more time than I had originally anticipated. Let me see if I can find the time to give it a try and see how it goes.

Thanks!

f2bo commented May 15, 2025

I built onnxruntime and onnxruntime-genai from source and regenerated the model, and this time there's a difference. I've only tested very briefly, but prompts that previously resulted in repeating text no longer do so.

I do notice slower performance (about a 25% drop in tokens/sec). I don't know if that's expected.

Thanks again!

jiafatom commented May 15, 2025

@f2bo I think this is expected. Previously it was a pure int4 model (MatMul); what you generated now should be a mixed-precision model with some int4 and some int8 MatMuls, so you should observe slower performance. I haven't checked what percentage drop is expected.
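
If you want to confirm what the builder produced, a rough way to check (assuming the quantized layers show up as MatMulNBits nodes with a bits attribute, and that the model file is model.onnx in the output directory) is to count them by bit width:

from collections import Counter
import onnx

# Hypothetical path: the model file inside the builder's output directory.
m = onnx.load("./models/microsoft/cpu_and_mobile/phi4-mini-regenerated/model.onnx",
              load_external_data=False)

bits = Counter()
for node in m.graph.node:
    if node.op_type == "MatMulNBits":
        for attr in node.attribute:
            if attr.name == "bits":
                bits[attr.i] += 1

print(bits)  # a mixed-precision model should show both 4-bit and 8-bit entries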

f2bo commented May 15, 2025

I think this is expected. Previously it was a pure int4 model (MatMul); what you generated now should be a mixed-precision model with some int4 and some int8 MatMuls, so you should observe slower performance.

@jiafatom Got it.

I'll test a bit more to make sure and then I'll close the issue.

Thank you

@jiafatom

@f2bo Sure. We are working on some performance improvements that may help your case. (You may need to install onnxruntime from source later to include the new feature.)
