Wildly inaccurate audio transcription. #1290

bcdev-com · 2025-02-28T04:57:27Z

Describe the bug
Using the new phi4-mm.py example, attempting to transcribe the 1272-141231-0002.mp3 provided in this repository produces wildly inaccurate results.

To Reproduce
Steps to reproduce the behavior:
Using the steps from the readme at https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx

# Download the model directly using the Hugging Face CLI
huggingface-cli download microsoft/Phi-4-multimodal-instruct-onnx --include gpu/* --local-dir .

# Install the CUDA package of ONNX Runtime GenAI
pip install --pre onnxruntime-genai-cuda

# Please adjust the model directory (-m) accordingly 
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi4-mm.py -o phi4-mm.py
python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda

I copied the sample audio from https://github.com/microsoft/onnxruntime-genai/blob/main/test/test_models/audios/1272-141231-0002.mp3 into the local directory and then answered the prompts as follows:

> python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda
Loading model...
Setting model to cuda...
Model loaded
Image Path (comma separated; leave empty if no image):
Audio Path (comma separated; leave empty if no audio): 1272-141231-0002.mp3
No image provided
Using audio: 1272-141231-0002.mp3
Prompt: Transcribe the audio clip into text.
Processing inputs...
Processor complete.
Generating response...
The thought of anything else seemed trivial to him. All of his worries and concerns seemed inconsequential compared to the things that actually mattered to him. He realized that the small irritations and trivialities of everyday life often cloud one's judgment and perspective. By focusing on the truly important matters, he sought to bring clarity and purpose to his thoughts and actions. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life.
Traceback (most recent call last):
  File "C:\Source\Phi\p\phi4-mm.py", line 173, in <module>
    run(args)
  File "C:\Source\Phi\p\phi4-mm.py", line 133, in run
    generator.generate_next_token()
KeyboardInterrupt

Expected behavior
I'd expect the produced text to be something at least closer to the following, which is what both Whisper and I hear when listening to the audio:

"the cut on his chest still dripping blood the ache of his over-strained eyes even the soaring arena around him with the thousands of spectators were trivialities not worth thinking about"

The output has some vague connection to 'triviality', so the model isn't completely ignoring the audio, but the result is nothing I'd call ASR.
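To put a number on how far the output is from the expected transcript, a word error rate (WER) check can help. The following is a minimal sketch, not part of phi4-mm.py or onnxruntime-genai: the function name `word_error_rate` is hypothetical, and it assumes simple lowercasing and whitespace tokenization with no punctuation handling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are lowercased and split on whitespace only;
    punctuation is NOT stripped, so "him." and "him" count as different words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (classic Levenshtein rows).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate(expected_text, model_output)` with the expected transcript above and the generated text gives a value of 0.0 for a perfect match. Values above 1.0 are possible when the hypothesis is much longer than the reference, as it is with the looping output shown here.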

Desktop (please complete the following information):

OS: Windows 11 24H2
GPU: RTX2080 SUPER, 8GB
Driver: 571.96
CUDA: 12.8.3
CUDNN: 9.7.1

jarroddavis68 · 2025-03-09T22:24:12Z

I'm having the same issue too.

jarroddavis68 · 2025-03-09T23:05:56Z

Hmm, after converting the .mp3 to .wav, it seems to work much better for me.
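For anyone wanting to try the same workaround, here is a minimal sketch of an MP3-to-WAV conversion that shells out to the ffmpeg command-line tool. The thread does not say which tool or settings were actually used, so everything here is an assumption: that ffmpeg is installed and on PATH, that 16 kHz mono is a suitable target (a common choice for speech models, not confirmed for Phi-4-multimodal), and the helper names `build_ffmpeg_command` and `convert_to_wav` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list for converting src to a mono WAV file.

    Assumption: 16 kHz mono output; adjust sample_rate if the model expects
    something else.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file
        "-ac", "1",               # downmix to one audio channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(dst),
    ]


def convert_to_wav(src, sample_rate=16000):
    """Convert src to a .wav file next to it and return the new path.

    Requires the ffmpeg executable on PATH; raises CalledProcessError
    if ffmpeg exits with a non-zero status.
    """
    src = Path(src)
    dst = src.with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
    return dst
```

Running `convert_to_wav("1272-141231-0002.mp3")` should write `1272-141231-0002.wav` to the same directory, which can then be entered at the "Audio Path" prompt in place of the .mp3 file.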

hanbitmyths assigned kunal-vaishnavi Feb 28, 2025
