Some bugs in English, German, Spanish, Italian normalizers #166

Closed
Oktai15 opened this issue May 1, 2024 · 2 comments

Oktai15 commented May 1, 2024

Hi!

I found a bug in English normalization. The following code is applied:

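from nemo_text_processing.text_normalization.normalize import Normalizer

# `text` is the input string listed below.
normalizer = Normalizer(
  input_case="cased",
  lang="en",
  deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)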

text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.
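
The symptom above can be checked mechanically, which may help when testing other inputs or languages. The following is a minimal sketch, not part of the normalizer: `final_period_was_verbalized` and its `dot_word` parameter are hypothetical names introduced here, and the check only compares the two strings.

```python
def final_period_was_verbalized(text: str, norm_text: str, dot_word: str = "dot") -> bool:
    """Return True if `text` ends with a period but `norm_text` ends with the spoken word for it."""
    source_ends_with_period = text.rstrip().endswith(".")
    tokens = norm_text.rstrip().split()
    # The last whitespace-separated token of the output is the verbalized period.
    output_ends_with_word = bool(tokens) and tokens[-1] == dot_word
    return source_ends_with_period and output_ends_with_word


# Strings copied from the report above.
print(final_period_was_verbalized("Here is mail.nasa.gov.", "Here is mail dot nasa dot gov dot"))
print(final_period_was_verbalized("Here is mail.nasa.gov.", "Here is mail dot nasa dot gov."))
```

The first call reports the buggy output and the second accepts the expected one. For other languages, pass the corresponding word, for example `dot_word="punkt"` for German.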

A similar bug can be reproduced in German normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="de",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.

The same problem occurs with text=KIM.com-Specials..
I got the same problem with websites in Spanish and Italian text.

I also found a specific bug in Spanish normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="es",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
I am not sure what the expected output is, but the current norm_text does not look right.
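
One possible explanation, offered as an assumption rather than a confirmed diagnosis: "quincuagésimo primero" is the Spanish ordinal for 51, and LI is the Roman numeral for 51, so the name "Li" may have been read as a Roman numeral. The sketch below is not the normalizer's code; `roman_to_int` is a hypothetical helper that only shows which tokens in the sentence parse as Roman numerals when case is ignored.

```python
from typing import Optional

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(token: str) -> Optional[int]:
    """Parse `token` as a Roman numeral, ignoring case.

    Returns None if any character is not a Roman digit. Well-formedness
    (for example rejecting "IIII") is not checked in this sketch.
    """
    letters = token.upper()
    if not letters or any(ch not in ROMAN_VALUES for ch in letters):
        return None
    total = 0
    for i, ch in enumerate(letters):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(letters) and value < ROMAN_VALUES[letters[i + 1]]:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("Li"))   # 51
print(roman_to_int("Qin"))  # None
```

If this is the cause, a plausible expected output would leave "Li Qin" unchanged, since it is a personal name.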

Oktai15 added the bug label May 1, 2024
@dmylzenova

I also encountered similar behavior:

text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."

@zoobereq
Collaborator

I also encountered similar behavior:

text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."

The above is expected behavior. The normalizer assumes that consecutive sentences are separated by a period followed by at least one whitespace character. The string quoted above consists of two clauses separated by a period with no whitespace after it, so "Sinnesbereichen.in" is treated as a single token. Adding a space after the period yields the correct normalization.
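
When the input cannot be fixed at the source, the missing space can be inserted in a pre-processing step before calling the normalizer. The following is a minimal sketch under stated assumptions, not part of the normalizer: `add_space_after_glued_periods` and the small `KNOWN_TLDS` set are hypothetical, and the heuristic leaves a token alone only when its last dot-separated label is in that set.

```python
import re

# Hypothetical, deliberately tiny set of top-level domains to leave untouched.
KNOWN_TLDS = {"com", "de", "gov", "org", "net"}

# A period that has a letter directly before it and directly after it.
_GLUED_PERIOD = re.compile(r"(?<=[^\W\d_])\.(?=[^\W\d_])")


def add_space_after_glued_periods(text: str) -> str:
    """Insert a space after periods glued between two words, skipping tokens that look like domains."""
    fixed = []
    for token in text.split(" "):
        # Ignore trailing punctuation when deciding whether the token looks like a domain.
        labels = token.rstrip(".,;:!?").split(".")
        if len(labels) > 1 and labels[-1].lower() not in KNOWN_TLDS:
            token = _GLUED_PERIOD.sub(". ", token)
        fixed.append(token)
    return " ".join(fixed)


print(add_space_after_glued_periods("in allen Sinnesbereichen.in allen Sinnen."))
print(add_space_after_glued_periods("Here is mail.nasa.gov."))
```

The first call returns the sentence with "Sinnesbereichen. in", and the second leaves the domain unchanged. This is only a heuristic: a token such as KIM.com-Specials would still be split, and any real top-level domain missing from the set would be treated as a sentence boundary.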

zoobereq pushed a commit that referenced this issue Jun 3, 2024
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
@ekmb ekmb closed this as completed in e2fbc45 Jun 6, 2024