Keeping normalizer up-to-date with Whisper-normalizer for ASR #27

Merged
akshaykalkunte merged 1 commit into main from bug/wer-normalizer on Jan 13, 2026

@nhhoang96
Collaborator

📌 Description

Fixes the text post-processing applied before WER computation. The post-processing in EnglishNormalizer deviated slightly from the standard whisper-normalizer; this PR brings it back in line with the current standard normalizer.
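To illustrate why normalization before WER matters, here is a minimal sketch, not the project's code. `toy_normalize` is a hypothetical stand-in for a real normalizer (it only lowercases, expands "'m", and drops punctuation), and `wer` is a plain word-level edit distance divided by the reference length.

```python
import re


def toy_normalize(text: str) -> str:
    """Hypothetical stand-in for a real normalizer (assumption, not the PR's code)."""
    text = text.lower()
    # Expand "i'm" -> "i am"; the raw replacement string keeps \1 as a backreference.
    text = re.sub(r"(\w+)'m\b", r"\1 am", text)
    # Replace punctuation with spaces, then collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "I am ready."
hypothesis = "I'm ready"

raw_wer = wer(reference, hypothesis)
norm_wer = wer(toy_normalize(reference), toy_normalize(hypothesis))
print(raw_wer, norm_wer)
```

Without normalization every word counts as an error even though the two transcripts say the same thing, so a deviation in a single normalization rule can shift reported WER.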

🔗 Related Issue(s)

🛠️ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality including new tasks)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor / Code cleanup
  • Maintenance / Chore / Task
  • Other (please describe):

✅ How Has This Been Tested?

  • Specific unit testing on cases where bugs were reported.

  • Integration testing by re-running LibriSpeech test-clean

  • Unit tests

  • Integration tests

  • Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

  • Code follows project style guidelines
  • Tests have been added/updated (if applicable)
  • Documentation has been updated (if applicable)
  • Linked relevant issue(s)
  • Self-reviewed my code

🙌 Additional Notes

Collaborator

@akshaykalkunte left a comment

The problem was that r"(\w+)'m\b": "\1 am" is incorrect: in a non-raw Python string, "\1" is the control character U+0001, not a regex backreference. r"(\w+)'m\b": "\\1 am" is the correct usage.

Using the normalizers from Whisper without any changes is a good idea. So the changes look good to me.
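A minimal sketch of the escaping issue described in this comment, using only the standard `re` module. The variable names and the sample sentence are illustrative assumptions, not taken from the repository.

```python
import re

pattern = r"(\w+)'m\b"

wrong = "\1 am"    # non-raw string: "\1" is an octal escape, i.e. chr(1)
right = "\\1 am"   # escaped backslash: re.sub receives \1 as a backreference
raw = r"\1 am"     # raw string, equivalent to `right`

text = "i'm here"

broken = re.sub(pattern, wrong, text)  # inserts the control character U+0001
fixed = re.sub(pattern, right, text)   # expands the contraction

print(repr(broken))
print(repr(fixed))
```

With the incorrect replacement the captured word is lost and replaced by an invisible control character, which then counts as a word mismatch during WER scoring.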

@akshaykalkunte merged commit daa0616 into main on Jan 13, 2026
@akshaykalkunte deleted the bug/wer-normalizer branch on January 13, 2026 21:34
nhhoang96 added a commit that referenced this pull request Apr 18, 2026
* add gpqa diamond

* Update constants.py (#18)

* updating turn handling for multi-turn evals

* feat: Add Gemini support (#15)

* add spokenwoz speech and text (#24)

* add vllm configs and readme (#21)

* added phonetics, speech_disorder, and speech_enhancement tasks - stil… (#22)

* added phonetics, speech_disorder, and speech_enhancement tasks - still in need of full model scoring. Fixed small inconsistency bug in config by changing judge_properties to judge_settings.

* Update the correct HF path for noise_detection task

* updated scores

---------

Co-authored-by: hoang <huuhoang.nguyen@servicenow.com>

* voxtral and phi4 guidance (#25)

* Keeping normalizer up-to-date with Whisper-normalizer for ASR (#27)

* add gpqa diamond

---------

Co-authored-by: oluwanifemibamgbose <oluwanifemi.bamgbose@servicenow.com>
Co-authored-by: khyatimahajan <khyati.mahajan@servicenow.com>
Co-authored-by: Khyati Mahajan <mahajan.khyati@gmail.com>
Co-authored-by: Akshay Kalkunte <akshay.kalkunte@servicenow.com>
Co-authored-by: Jash Mehta <jash.mehta@servicenow.com>
Co-authored-by: Sidharth Surapaneni <40740959+pcsid@users.noreply.github.com>
Co-authored-by: hoang <huuhoang.nguyen@servicenow.com>
Co-authored-by: hoang <hnguy7@uic.edu>