Digits Remain Unnormalized in European Languages Output #171

Open
dmylzenova opened this issue May 8, 2024 · 1 comment

Comments


dmylzenova commented May 8, 2024

Hello,

I have observed an issue where digits remain unnormalized in the output text when using the Nemo text normalization library, specifically with European languages such as German (de), Italian (it), and French (fr). This behavior occurs even though the expected output should not contain any digits.

Here is an example:

from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case="cased", lang="it")
text = "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget."
norm_text = normalizer.normalize(text, punct_post_process=True)
print(norm_text)

Expected output: No digits in the normalized text.
Actual output: 'il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.'

Additional Examples:

Other examples with similar behavior, in the format (text, normalized_text):

[('Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.',
  'Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.'),
 ('Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.',
  'Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.'),
 ('Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.',
  'Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.'),
 ('Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.',
  'Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.'),
 ('Il y a 1,08 milliard de vaches dans le monde qui émettent 18% des émissions de carbone.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone'),
 ('Ci sono 1,08 miliardi di mucche nel mondo che emettono il 18% delle emissioni di carbonio.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone.')]

Expected Behavior:
The normalized text should not contain any digits.

Actual Behavior:
Digits are retained in the normalized output, which contradicts the expected behavior of a text normalization tool. The issue is intermittent rather than consistent, which is particularly problematic for tasks that require clean, digit-free text, such as grapheme-to-phoneme (g2p) conversion.
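
Since the failure is intermittent, a guard in front of the g2p step can catch affected sentences. The following is a minimal sketch using only the standard library; `leftover_digits` and `check_pairs` are hypothetical helper names, not part of nemo_text_processing, and the sample pairs are copied from the examples above.

```python
import re

# Matches any single digit character in the normalized output.
_DIGIT = re.compile(r"\d")


def leftover_digits(normalized_text):
    """Return every digit character still present after normalization."""
    return _DIGIT.findall(normalized_text)


def check_pairs(pairs):
    """Keep only the (text, normalized_text) pairs whose output still has digits."""
    return [(src, out) for src, out in pairs if leftover_digits(out)]


# Two (text, normalized_text) pairs taken from this report.
examples = [
    ("il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.",
     "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget."),
    ("Il y a 1,08 milliard de vaches dans le monde qui émettent 18% des émissions de carbone.",
     "Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone"),
]

for src, out in check_pairs(examples):
    print("digits left:", leftover_digits(out), "in", repr(out))
```

Both sample pairs are flagged: the first keeps "48%" and the second keeps "18%", even though "1,08" was verbalized.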

Environment:

nemo_text_processing version: 0.3.0rc0
Python version: Python 3.11.8

dmylzenova added the bug (Something isn't working) label May 8, 2024
zoobereq mentioned this issue May 24, 2024
14 tasks
Collaborator

zoobereq commented May 24, 2024

  • We are addressing the issue with the % not normalizing in Italian and French. This fix will be available shortly and will also cause the numbers in these languages to normalize correctly.
  • We are aware of h and some other units not normalizing in French and are working to address that.
  • We are aware of period-separated numbers not normalizing in German (numbers without period-separators normalize correctly). We are working to address that as well.
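
Until the German fix lands, one possible stopgap follows from the last point: since numbers without period separators are reported to normalize correctly, the separators could be removed before calling the normalizer. This is only a sketch under that assumption, not the maintainers' fix; `strip_thousands_periods` is a hypothetical helper, and the pattern is deliberately conservative (it skips dates such as 12.05.2024 and numbers followed by a decimal part).

```python
import re

# A 1-3 digit group followed by one or more ".ddd" groups, e.g. 1.000 or 1.000.000.
# The lookbehind avoids starting inside a longer number or date; the lookahead
# rejects matches followed by another digit or by a decimal part such as ",50".
_THOUSANDS = re.compile(r"(?<![\d.,])\d{1,3}(?:\.\d{3})+(?!\d|[.,]\d)")


def strip_thousands_periods(text):
    """Remove period thousands separators so that 800.000 becomes 800000."""
    return _THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


sentence = "Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar."
print(strip_thousands_periods(sentence))
```

The cleaned string would then be passed to `normalizer.normalize(...)` as in the original report. Whether the resulting spoken form is what the maintainers intend for such numbers is not confirmed here.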

This was referenced May 29, 2024