Example sentence fragments for text normalizer:
"It had 12-inch guns and a 21-knot cruising speed."
"Each 550-pound bomb exploded."
Steps/Code to reproduce bug
from nemo_text_processing.text_normalization.normalize import Normalizer
from nemo_text_processing.text_normalization.normalize_with_audio import NormalizerWithAudio

normalizer = Normalizer(input_case='cased')
audio_normalizer = NormalizerWithAudio(input_case='cased')
texts = [
    "It had 12-inch guns and a 21-knot cruising speed.",
    "Each 550-pound bomb exploded.",
]
for text in texts:
    normalized_text = normalizer.normalize(text, verbose=False)
    print(normalized_text)
    normalized_texts = audio_normalizer.normalize_with_audio(text, verbose=False)
    print(normalized_texts)
The results from normalized_texts will be something like (notice the spaces between each letter in the word for the units):
"five hundred fifty p o u n d bomb"
and
"twenty one k n o t cruising speed"
It seems like the sparrow/pynini rules don't handle a number joined to a unit by a hyphen (e.g. "21-knot"): the unit is interpreted as an abbreviation and spelled out letter by letter. The CER calculation in NormalizerWithAudio doesn't repair this, because the results from normalize_with_audio() never contain the expected output.
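Until the grammar itself is fixed, one possible workaround is to split the hyphenated number/unit pair before the text reaches the normalizer. The sketch below is only an illustration: `UNITS` and `split_hyphenated_units` are hypothetical names, not part of NeMo, the unit list is an assumption covering just the examples in this issue, and I have not verified that the normalizer then produces the expected output.

```python
import re

# Hypothetical unit list for illustration only; not taken from the NeMo grammars.
UNITS = ("inch", "knot", "pound")

# Matches "<digits>-<unit>" as a whole word, e.g. "21-knot" or "550-pound".
_HYPHENATED_UNIT = re.compile(r"\b(\d+)-(" + "|".join(UNITS) + r")\b")


def split_hyphenated_units(text: str) -> str:
    """Replace the hyphen in '<number>-<unit>' with a space."""
    return _HYPHENATED_UNIT.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_hyphenated_units("It had 12-inch guns and a 21-knot cruising speed."))
```

The result of `split_hyphenated_units(text)` would then be passed to `normalizer.normalize(...)` in place of the raw text. Hyphenated words that do not start with digits (e.g. "well-known") are left alone.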
Expected behavior
"five hundred fifty pound bomb"
"twenty one knot cruising speed"
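To catch this class of output automatically, a small check can flag runs of single letters in the normalized text. This is a rough sketch, not part of NeMo: `find_spelled_out_runs` is a hypothetical helper, and the threshold of three or more letters is an assumption. It would also flag legitimately spelled-out abbreviations, so it is a heuristic for spotting suspects rather than a definitive test.

```python
import re
from typing import List

# Three or more single letters separated by single spaces, e.g. "p o u n d".
_SPELLED_OUT = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def find_spelled_out_runs(normalized: str) -> List[str]:
    """Return every letter-by-letter run found in the normalized text."""
    return [match.group(0) for match in _SPELLED_OUT.finditer(normalized)]


if __name__ == "__main__":
    for output in (
        "five hundred fifty p o u n d bomb",
        "five hundred fifty pound bomb",
    ):
        print(repr(output), "->", find_spelled_out_runs(output))
```

Running it over the outputs of `normalize()` and `normalize_with_audio()` for a list of test sentences gives a quick way to see which inputs trigger the problem.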
Environment overview (please complete the following information)
Environment location: Bare-metal
Method of NeMo install: pip install from source, main branch
Environment details
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
OS version: Ubuntu 20.10
PyTorch version: 1.7.1
Python version: 3.8.5
I'm trying to track down where in pynini or the NeMo rules this is happening, and if I find it I will follow up.
I'll be following the progress closely and am happy to help if I can get more familiar with pynini -- is this repo the best place to file / discuss similar issues?