Describe the bug
Transcribing the 1272-141231-0002.mp3 sample provided in this repository with the new phi4-mm.py example produces wildly inaccurate results.
# Download the model directly using the Hugging Face CLI
huggingface-cli download microsoft/Phi-4-multimodal-instruct-onnx --include gpu/* --local-dir .
# Install the CUDA package of ONNX Runtime GenAI
pip install --pre onnxruntime-genai-cuda
# Download the example script
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi4-mm.py -o phi4-mm.py
# Run the example; adjust the model directory (-m) to match your download
python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda
> python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda
Loading model...
Setting model to cuda...
Model loaded
Image Path (comma separated; leave empty if no image):
Audio Path (comma separated; leave empty if no audio): 1272-141231-0002.mp3
No image provided
Using audio: 1272-141231-0002.mp3
Prompt: Transcribe the audio clip into text.
Processing inputs...
Processor complete.
Generating response...
The thought of anything else seemed trivial to him. All of his worries and concerns seemed inconsequential compared to the things that actually mattered to him. He realized that the small irritations and trivialities of everyday life often cloud one's judgment and perspective. By focusing on the truly important matters, he sought to bring clarity and purpose to his thoughts and actions. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life.
Traceback (most recent call last):
  File "C:\Source\Phi\p\phi4-mm.py", line 173, in <module>
    run(args)
  File "C:\Source\Phi\p\phi4-mm.py", line 133, in run
    generator.generate_next_token()
KeyboardInterrupt
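The run above was stopped by hand because the output kept repeating the same sentence. As a rough illustration of how that kind of loop could be caught automatically, here is a minimal sketch. The helper `has_repeating_tail` is hypothetical: it is not part of phi4-mm.py or of ONNX Runtime GenAI, and it only inspects text that has already been decoded.

```python
import re


def has_repeating_tail(text: str, min_repeats: int = 3) -> bool:
    """Return True if `text` ends with one sentence repeated `min_repeats` times.

    Hypothetical helper for illustration only; it works on plain decoded text
    and knows nothing about the model or the generator.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # While streaming, the last sentence may still be incomplete; ignore it.
    if sentences and sentences[-1][-1] not in ".!?":
        sentences.pop()

    if len(sentences) < min_repeats:
        return False

    tail = sentences[-min_repeats:]
    return all(sentence == tail[0] for sentence in tail)


# Example: a looping tail similar to the output above, and a normal passage.
looping = "He sought to rise above the trivial. " * 4
normal = "One sentence. Another sentence. A third one."
print(has_repeating_tail(looping))
print(has_repeating_tail(normal))
```

A check like this could be called on the accumulated output after each decoded chunk, stopping generation once it returns True. Where exactly it would hook into the example script is left open here.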
Expected behavior
I'd expect the produced text to be something at least closer to the following, which is what both Whisper and I hear when listening to the audio:
"the cut on his chest still dripping blood the ache of his over-strained eyes even the soaring arena around him with the thousands of spectators were trivialities not worth thinking about"
The output does pick up on 'triviality', so the model isn't ignoring the audio entirely, but the result is nothing I'd call ASR.
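To put a number on "wildly inaccurate", the gap between the expected and the produced text can be expressed as a word error rate (WER). The following is a minimal sketch, assuming a simple normalisation (lowercase, punctuation dropped) and a plain word-level edit distance; `normalize` and `word_error_rate` are illustrative names, not functions from any library used in this report.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into words, keeping apostrophes and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


reference = (
    "the cut on his chest still dripping blood the ache of his over-strained "
    "eyes even the soaring arena around him with the thousands of spectators "
    "were trivialities not worth thinking about"
)
# First sentence of the model output quoted above.
hypothesis = "The thought of anything else seemed trivial to him."

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

A WER of 0 means a perfect match. The value can exceed 1 when the hypothesis is much longer than the reference, as it would be for the full looping output.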
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
OS: Windows 11 24H2
GPU: RTX2080 SUPER, 8GB
Driver: 571.96
CUDA: 12.8.3
cuDNN: 9.7.1
To Reproduce
Steps to reproduce the behavior:
Follow the steps from the readme at https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx (the commands listed at the top of this report).
Copy the sample audio from https://github.com/microsoft/onnxruntime-genai/blob/main/test/test_models/audios/1272-141231-0002.mp3 into the local directory, then answer the prompts as shown in the console session above.
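For anyone scripting the reproduction: the audio link above is a GitHub "blob" page, not the file itself. The sketch below rewrites it to the raw.githubusercontent.com form (the same host the curl command for phi4-mm.py uses) and downloads it. `blob_to_raw` and `fetch_sample` are hypothetical helper names, and the rewrite assumes the branch name contains no slashes.

```python
from urllib.parse import urlsplit
from urllib.request import urlretrieve

AUDIO_URL = (
    "https://github.com/microsoft/onnxruntime-genai/blob/main/"
    "test/test_models/audios/1272-141231-0002.mp3"
)


def blob_to_raw(url: str) -> str:
    """Convert a github.com/<owner>/<repo>/blob/<ref>/<path> URL to its raw form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / path...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *path])


def fetch_sample(url: str, dest: str) -> str:
    """Download the file behind a blob URL to `dest` and return `dest`."""
    urlretrieve(blob_to_raw(url), dest)
    return dest


print(blob_to_raw(AUDIO_URL))
# To actually download (needs network access):
# fetch_sample(AUDIO_URL, "1272-141231-0002.mp3")
```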