Normalizer and NormalizerWithAudio fail with dashed tokens eg "550-pound bomb" #2266

Closed
sbuser opened this issue May 26, 2021 · 3 comments · Fixed by #2285
Assignees
Labels
bug Something isn't working TN/ITN

Comments

@sbuser

sbuser commented May 26, 2021

Describe the bug

Example sentence fragments for text normalizer:
"It had 12-inch guns and a 21-knot cruising speed."
"Each 550-pound bomb exploded."

Steps/Code to reproduce bug

from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo_text_processing.text_normalization.normalize_with_audio import NormalizerWithAudio

normalizer = Normalizer(input_case='cased')
audio_normalizer = NormalizerWithAudio(input_case='cased')

texts = [
    "It had 12-inch guns and a 21-knot cruising speed.",
    "Each 550-pound bomb exploded.",
]

for text in texts:
    normalized_text = normalizer.normalize(text, verbose=False)
    print(normalized_text)

    normalized_texts = audio_normalizer.normalize_with_audio(text, verbose=False)
    print(normalized_texts)

The results in normalized_texts will be something like the following (notice the spaces between the letters of each unit word):
"five hundred fifty p o u n d bomb"
and
"twenty one k n o t cruising speed"

It seems the Sparrowhawk/pynini rules don't handle a number joined to a unit by a dash, and interpret the unit as an abbreviation. The CER calculation in NormalizerWithAudio doesn't repair this because the results from normalize_with_audio() never contain the expected outputs.
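
A quick way to flag this symptom in normalized output, without depending on NeMo itself, is to look for runs of single letters separated by spaces. The sketch below is only an illustration: the helper name find_spelled_out_runs and the threshold of three letters are assumptions, and legitimately spelled-out abbreviations would be flagged too.

```python
import re

# Three or more single letters separated by single spaces, e.g. "p o u n d".
# The minimum of three letters is an arbitrary choice for this sketch.
SPELLED_OUT = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def find_spelled_out_runs(text):
    """Return every letter-by-letter run found in a normalized string."""
    return [match.group(0) for match in SPELLED_OUT.finditer(text)]


if __name__ == "__main__":
    for candidate in (
        "five hundred fifty p o u n d bomb",
        "twenty one k n o t cruising speed",
        "five hundred fifty pound bomb",
    ):
        print(candidate, "->", find_spelled_out_runs(candidate))
```

Running this over each candidate returned by normalize_with_audio() would show whether any of them avoids the letter-by-letter expansion.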

Expected behavior

"five hundred fifty pound bomb"
"twenty one knot cruising speed"

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install from source, main branch

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version Ubuntu 20.10
  • PyTorch version 1.7.1
  • Python version 3.8.5

I'm trying to track down where in pynini or the NeMo rules this is happening, and if I find it I will follow up.

@sbuser sbuser added the bug Something isn't working label May 26, 2021
@sbuser
Author

sbuser commented May 26, 2021

The taggers pick these examples up as Serials and not as Measures. The options with the spaces between the letters make sense if it's a Serial.

I think a modification to the tagger for Measure that allows the dash between Cardinal and Unit may do the trick. If I can figure out the syntax...
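
Until the Measure tagger accepts the dash, one possible stopgap is to rewrite the input before it reaches the normalizer, replacing the dash between a number and a known unit with a space. This is a plain-regex sketch, not the pynini grammar change described above and not the fix that was eventually merged. The UNITS tuple and the function name split_dashed_measures are made up for illustration, and it has not been verified here that the normalizer then produces the expected output (for instance, it may pluralize the unit).

```python
import re

# Hypothetical, deliberately small unit list; extend it for real use.
UNITS = ("inch", "knot", "pound", "foot", "mile")

# A run of digits, a dash, then one of the known units as a whole word.
DASHED_MEASURE = re.compile(r"\b(\d+)-(" + "|".join(UNITS) + r")\b")


def split_dashed_measures(text):
    """Replace the dash in '<number>-<unit>' with a space; leave other text alone."""
    return DASHED_MEASURE.sub(r"\1 \2", text)


if __name__ == "__main__":
    for sentence in (
        "It had 12-inch guns and a 21-knot cruising speed.",
        "Each 550-pound bomb exploded.",
    ):
        print(split_dashed_measures(sentence))
```

The rewritten string can then be passed to normalizer.normalize() in place of the original text.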

@ekmb
Collaborator

ekmb commented May 26, 2021

Yes, this is a known issue; the fix is part of the upcoming update.
I'll update this bug when the PR is ready.

@ekmb ekmb self-assigned this May 26, 2021
@sbuser
Author

sbuser commented May 26, 2021

I'll be following the progress closely and am happy to help once I'm more familiar with pynini. Is this repo the best place to file and discuss similar issues?

@ekmb ekmb mentioned this issue May 28, 2021
@ekmb ekmb linked a pull request May 28, 2021 that will close this issue
@ekmb ekmb closed this as completed in #2285 Jun 8, 2021
@ekmb ekmb added the TN/ITN label Dec 19, 2022