German TN fixes #173

Closed
wants to merge 13 commits into from

zoobereq (Collaborator) commented May 20, 2024

What does this PR do?

This PR implements DE TN fixes for the following issues:

  • Adds support for normalizing social media tags (e.g. @zoobereq and @zoobereq.net)
  • Adds support for normalizing comma-separated digit strings
  • Adds support for period-separated time formats (e.g. 2.30 and 02.30)
  • The % sign is now accepted by Italian TN (see this issue)
  • The % sign is now accepted by French TN (same issue as above). Percent handling normally belongs in the Measures class, which French currently lacks, so as a stop-gap the % sign was whitelisted. It should be implemented properly as part of Measures.
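
To make the period-separated time case concrete, here is a minimal sketch in plain Python. It is not the grammar from this PR (the actual DE TN classes are not shown here), and the helper name `verbalize_de_time`, the regular expression, and the lowercase "uhr" output format are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; not part of nemo_text_processing.
_UNITS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
_TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig"}

# Hours 0-23 with an optional leading zero, a period, then two-digit minutes.
_TIME = re.compile(r"([01]?\d|2[0-3])\.([0-5]\d)")


def _number(n: int) -> str:
    """Spell out an integer in the range 0..59 in German."""
    if n < 20:
        return _UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return _TENS[tens]
    # "eins" becomes "ein" inside compounds, e.g. "einundzwanzig".
    unit_word = "ein" if unit == 1 else _UNITS[unit]
    return f"{unit_word}und{_TENS[tens]}"


def verbalize_de_time(token: str) -> Optional[str]:
    """Return a spoken form for tokens such as "2.30" or "02.30", else None."""
    match = _TIME.fullmatch(token)
    if match is None:
        return None
    hours, minutes = int(match.group(1)), int(match.group(2))
    # German says "ein Uhr", not "eins Uhr".
    hour_word = "ein" if hours == 1 else _number(hours)
    spoken = f"{hour_word} uhr"
    if minutes:
        spoken += f" {_number(minutes)}"
    return spoken
```

With this sketch, `verbalize_de_time("2.30")` and `verbalize_de_time("02.30")` both return the same string, and malformed tokens such as `"25.30"` return `None`. A real grammar also has to disambiguate such tokens from other period-separated numbers, which this sketch ignores.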

This PR does not address the following:

  • The issue with optionally period-separated cardinals (e.g. 1 Mil = 1000000 | 1.000.000), which the currently implemented DE TN/ITN doesn't support. The TN and ITN taggers have been rebuilt to accommodate this, but since cardinals plug into almost all other DE classes, the issue will be addressed gradually as other classes are fixed or (re)developed.
  • The issue with 2h not normalizing to deux heures in FR. Since the abbreviation h is subject to declension, this problem should be addressed with the implementation of the Measures class.
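
As an illustration of what "optionally period-separated" means for cardinals, the following sketch accepts both the plain and the period-grouped spelling and maps them to the same integer. The function name `parse_de_cardinal` and the regular expression are assumptions made for this example; the rebuilt taggers mentioned above are not shown in this PR description and may work differently.

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; not part of nemo_text_processing.
# Either 1-3 digits followed by one or more ".ddd" groups, or plain digits.
_DE_CARDINAL = re.compile(r"\d{1,3}(?:\.\d{3})+|\d+")


def parse_de_cardinal(token: str) -> Optional[int]:
    """Parse "1000000" or "1.000.000" to an int; return None if malformed."""
    if _DE_CARDINAL.fullmatch(token) is None:
        return None
    return int(token.replace(".", ""))
```

Both `parse_de_cardinal("1000000")` and `parse_de_cardinal("1.000.000")` yield the same value, while a token with irregular grouping such as `"1.00.000"` is rejected.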

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unit tests finish successfully before sending the PR?
    1. Run pytest, or pytest --cpu if your machine does not have a GPU, from the root folder (provided you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here?
  • Have you added __init__.py for every folder and subfolder, including the data folder, which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature, please update the NeMo documentation (which lives in a different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@zoobereq zoobereq requested review from ekmb and tbartley94 May 20, 2024 16:17
@zoobereq zoobereq self-assigned this May 20, 2024
@zoobereq zoobereq closed this May 29, 2024