
Wildly inaccurate audio transcription. #1290

Open
bcdev-com opened this issue Feb 28, 2025 · 2 comments

@bcdev-com

Describe the bug
Using the new phi4-mm.py example to transcribe the 1272-141231-0002.mp3 sample provided in this repository produces wildly inaccurate results.

To Reproduce
Steps to reproduce the behavior:
Follow the steps from the README at https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx:

# Download the model directly using the Hugging Face CLI
huggingface-cli download microsoft/Phi-4-multimodal-instruct-onnx --include gpu/* --local-dir .

# Install the CUDA package of ONNX Runtime GenAI
pip install --pre onnxruntime-genai-cuda

# Please adjust the model directory (-m) accordingly 
curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi4-mm.py -o phi4-mm.py
python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda

Then copy the sample audio from https://github.com/microsoft/onnxruntime-genai/blob/main/test/test_models/audios/1272-141231-0002.mp3 into the local directory and answer the prompts as follows:

> python phi4-mm.py -m gpu/gpu-int4-rtn-block-32 -e cuda
Loading model...
Setting model to cuda...
Model loaded
Image Path (comma separated; leave empty if no image):
Audio Path (comma separated; leave empty if no audio): 1272-141231-0002.mp3
No image provided
Using audio: 1272-141231-0002.mp3
Prompt: Transcribe the audio clip into text.
Processing inputs...
Processor complete.
Generating response...
The thought of anything else seemed trivial to him. All of his worries and concerns seemed inconsequential compared to the things that actually mattered to him. He realized that the small irritations and trivialities of everyday life often cloud one's judgment and perspective. By focusing on the truly important matters, he sought to bring clarity and purpose to his thoughts and actions. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life. He sought to rise above the trivial and trivialize the trivial in his pursuit of a meaningful and purposeful life.
Traceback (most recent call last):
  File "C:\Source\Phi\p\phi4-mm.py", line 173, in <module>
    run(args)
  File "C:\Source\Phi\p\phi4-mm.py", line 133, in run
    generator.generate_next_token()
KeyboardInterrupt

Expected behavior
I'd expect the produced text to be something at least closer to the following, which is what both Whisper and I hear when listening to the audio:

"the cut on his chest still dripping blood the ache of his over-strained eyes even the soaring arena around him with the thousands of spectators were trivialities not worth thinking about"

The output does have some vague connection to 'triviality', so the model isn't completely ignoring the audio, but the result is nothing I'd call ASR.
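
To put a number on the mismatch, a word error rate (WER) between the expected text and the model output can be computed. The following is only a rough sketch and is not part of phi4-mm.py: the helper names `normalize` and `word_error_rate` are hypothetical, and the normalization (lowercasing, treating hyphens and punctuation as spaces) is an assumption rather than a standard ASR scoring recipe.

```python
import re


def normalize(text):
    """Lowercase and split into words; hyphens and punctuation become spaces (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    expected = (
        "the cut on his chest still dripping blood the ache of his over-strained eyes "
        "even the soaring arena around him with the thousands of spectators "
        "were trivialities not worth thinking about"
    )
    # Only the first sentence of the model output quoted in this report
    produced = "The thought of anything else seemed trivial to him."
    print(f"WER: {word_error_rate(expected, produced):.2f}")
```

A WER of 0 means the two texts match word for word after normalization. Values can exceed 1 when the output is much longer than the reference, as it would be for the full repeated output quoted above.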

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: Windows 11 24H2
  • GPU: RTX2080 SUPER, 8GB
  • Driver: 571.96
  • CUDA: 12.8.3
  • CUDNN: 9.7.1
@jarroddavis68

I'm having the same issue too.

@jarroddavis68

Hmm, after converting the .mp3 to .wav, it seems to work much better for me.
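
The comment does not say which tool was used for the conversion. One possible way to do it, sketched here under the assumption that the `ffmpeg` command-line tool is installed and on the PATH, is shown below. The helper names `build_ffmpeg_command` and `convert_to_wav` are hypothetical, and the 16 kHz mono output is an assumption (a common choice for speech models), not something confirmed in this thread.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst=None, sample_rate=16000):
    """Build an ffmpeg command line that converts src to a mono WAV file."""
    src = Path(src)
    # Default output: same name as the input, with a .wav extension
    dst = Path(dst) if dst else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-i", str(src),          # input file
        "-ac", "1",              # one audio channel (mono); assumed, not confirmed
        "-ar", str(sample_rate), # output sample rate in Hz; 16000 is an assumption
        str(dst),
    ]


def convert_to_wav(src, dst=None, sample_rate=16000):
    """Run ffmpeg and return the path of the converted file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    command = build_ffmpeg_command(src, dst, sample_rate)
    subprocess.run(command, check=True)
    return Path(command[-1])


if __name__ == "__main__":
    wav_path = convert_to_wav("1272-141231-0002.mp3")
    print(f"Converted file written to {wav_path}")
```

The resulting .wav path can then be entered at the "Audio Path" prompt of phi4-mm.py in place of the .mp3.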
