Normalizer and NormalizerWithAudio fail with dashed tokens eg "550-pound bomb" #2266

sbuser · 2021-05-26T16:05:50Z

Describe the bug

Example sentence fragments for text normalizer:
"It had 12-inch guns and a 21-knot cruising speed."
"Each 550-pound bomb exploded."

Steps/Code to reproduce bug

from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo_text_processing.text_normalization.normalize_with_audio import NormalizerWithAudio

normalizer = Normalizer(input_case='cased')
audio_normalizer = NormalizerWithAudio(input_case='cased')

texts = [
    "It had 12-inch guns and a 21-knot cruising speed.",
    "Each 550-pound bomb exploded.",
]

for text in texts:
    normalized_text = normalizer.normalize(text, verbose=False)
    print(normalized_text)

    normalized_texts = audio_normalizer.normalize_with_audio(text, verbose=False)
    print(normalized_texts)

The results from normalized_texts will be something like the following (notice the spaces between the letters of each unit word):
"five hundred fifty p o u n d bomb"
and
"twenty one k n o t cruising speed"

It seems like the Sparrowhawk/pynini rules don't allow for dashed input where a number is followed by a unit, so the unit is interpreted as an abbreviation and spelled out letter by letter. The CER calculation in NormalizerWithAudio doesn't repair this, because the candidates returned by normalize_with_audio() never contain the expected output.
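To illustrate why a CER-based selection step cannot fix this, here is a minimal sketch. It is not NeMo's actual implementation: `edit_distance`, `cer`, `pick_by_cer`, the candidate strings, and the transcript are all hypothetical and only show the general idea. Picking the candidate closest to a transcript can only ever return one of the candidates, so if every candidate spells the unit out, the result does too.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)


def pick_by_cer(candidates, transcript):
    """Return the candidate with the lowest CER against the transcript."""
    return min(candidates, key=lambda c: cer(c, transcript))


# Hypothetical candidate set: every option spells the unit letter by letter.
candidates = [
    "Each five hundred fifty p o u n d bomb exploded.",
    "Each five hundred and fifty p o u n d bomb exploded.",
]
# Hypothetical transcript of the audio.
transcript = "each five hundred fifty pound bomb exploded"

best = pick_by_cer(candidates, transcript)
print(best)
```

The selected string is the closest available option, but it still contains "p o u n d", because no candidate with "pound" was ever generated.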

Expected behavior

"five hundred fifty pound bomb"
"twenty one knot cruising speed"
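A quick way to flag this failure mode in normalizer output is to search for runs of single letters separated by spaces. The helper below is a heuristic sketch of my own (the name `find_spelled_out_runs` and the three-letter threshold are assumptions, not part of NeMo), and it will also flag legitimately spelled-out abbreviations.

```python
import re

# Three or more single letters separated by single spaces, e.g. "p o u n d".
SPELLED_OUT = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def find_spelled_out_runs(text: str):
    """Return every run of space-separated single letters found in text."""
    return SPELLED_OUT.findall(text)


print(find_spelled_out_runs("five hundred fifty p o u n d bomb"))
print(find_spelled_out_runs("five hundred fifty pound bomb"))
```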

Environment overview (please complete the following information)

Environment location: Bare-metal
Method of NeMo install: pip install from source, main branch

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

OS version: Ubuntu 20.10
PyTorch version: 1.7.1
Python version: 3.8.5

I'm trying to track down where in pynini or the NeMo rules this is happening, and if I find it I will follow up.

sbuser · 2021-05-26T16:41:48Z

The taggers pick these examples up as Serials and not as Measures. The options with the spaces between the letters make sense if it's a Serial.

I think a modification to the tagger for Measure that allows the dash between Cardinal and Unit may do the trick. If I can figure out the syntax...
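The pattern in question (a cardinal, a dash, then a unit) can be sketched in plain Python with a regular expression. This is not the pynini grammar and not the eventual fix; `UNITS` is a deliberately tiny, hypothetical unit list and `split_dashed_measures` is an illustrative name. It only shows which tokens a dash-aware Measure rule would need to accept, and I have not verified how the normalizer handles the rewritten text.

```python
import re

# Hypothetical unit list for illustration; the real Measure tagger has its own inventory.
UNITS = ("inch", "knot", "pound")

# Digits, a dash, then one of the known units as a whole word.
DASHED_MEASURE = re.compile(r"\b(\d+)-(" + "|".join(UNITS) + r")\b")


def split_dashed_measures(text: str) -> str:
    """Rewrite '<digits>-<unit>' as '<digits> <unit>', leaving other dashed tokens alone."""
    return DASHED_MEASURE.sub(r"\1 \2", text)


print(split_dashed_measures("It had 12-inch guns and a 21-knot cruising speed."))
print(split_dashed_measures("Each 550-pound bomb exploded."))
```

Tokens such as "B-52" are left untouched because the part after the dash is not in the unit list, which matches the intuition that those really are Serials.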

ekmb · 2021-05-26T17:19:44Z

Yes, this is a known issue; the fix is part of the upcoming update.
I'll update this bug when the PR is ready.

sbuser · 2021-05-26T17:36:17Z

I'll be following the progress closely and am happy to help if I can get more familiar with pynini -- is this repo the best place to file / discuss similar issues?

sbuser added the bug Something isn't working label May 26, 2021

ekmb self-assigned this May 26, 2021

ekmb mentioned this issue May 28, 2021

Audio Norm #2285

Merged

ekmb linked a pull request May 28, 2021 that will close this issue

Audio Norm #2285

Merged

ekmb closed this as completed in #2285 Jun 8, 2021

ekmb added the TN/ITN label Dec 19, 2022
