Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Armenian TN #137

Merged
merged 13 commits into from Mar 13, 2024
Merged

Armenian TN #137

merged 13 commits into from Mar 13, 2024

Conversation

davidks13
Copy link
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@davidks13 davidks13 mentioned this pull request Jan 31, 2024
13 tasks
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as ITN. Looks good at core, just need some edits for readability. Two main things though:

  1. For TN, we typically introduce flags for deterministic and non-deterministic behavior. This means creating two logics depending on if this flag is raised. Since this is an outside contribution there's no major requirement to do this, just is an additional option if you're interested in making production ready.

  2. Try and be a bit more limited with optimize calls. They're expensive and the graph doesn't really benefit that much from additional calls while constructing. (Unless you're being veerry keen with epsilon insertions.) I recommend only using them when creating subgraphs as properties or when the graph is fully instantiated.

(Fyi: your use of closures is within style with other modules, but letting you know that pynini allows Kleene shorthand with .star .plus and .ques method calls from fst graphs.)

tbartley94
tbartley94 previously approved these changes Feb 14, 2024
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor things, beyond that LGTM

David Sargsyan and others added 6 commits February 20, 2024 15:13
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
for more information, see https://pre-commit.ci

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
@davidks13
Copy link
Contributor Author

davidks13 commented Feb 20, 2024

Hi @ekmb . What should I do to pass this check ''continuous-integration/jenkins/pr-head''? I could not open details of that check. Thank you!

Signed-off-by: Ara Yeroyan <60027241+Ara-Yeroyan@users.noreply.github.com>
@tbartley94
Copy link
Collaborator

Hi @ekmb . What should I do to pass this check ''continuous-integration/jenkins/pr-head''? I could not open details of that check. Thank you!

That's CI on our end. Can't really access if you're not a maintainer. We have to check and provide feedback. Sorry, can be a bit annoying.

£s սիրիական ֆունտ
₺ թուրքական լիրա
₴ ուկրաինական գրիվնա
$ ամերիկյան դոլար
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ran normalize.py on just --text="$123" you will get "23" instead of "հարյուր քսաներեք ամերիկյան դոլար"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine on my side, the thing is in Armenian most of the time people write currency after cardinal or decimal not before. So in this case I get output $ հարյուր քսաներեք. But if you try 123$ it works perfectly fine and outputs հարյուր քսաներեք դոլար, the other names for the same currencies are added for future pull requests where will be added non-deterministic behavior.

self.one_to_all_tens = one_to_all_tens.optimize()

hundreds_parts = (pynutil.delete("0") + insert_space + digits) | (insert_space + double_digits)
one_hundreds = pynini.cross("1", "հարյուր") + (pynutil.delete("00") | hundreds_parts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ran

  1. normalize.py on just --text="123" you will get the same "123" instead of normalized "հարյուր քսաներեք"
  2. normalize.py on --text="123թվ․ Տրդատը կառուցեց 3 ամրոց" will again output the same text

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first case you mentioned, everything works fine on my side. As for the second case there is no 'թվ.' only 'թ.' and 'թթ.' and don't use Armenian dot it does not work (use English). By the way thank you for this comment as measurements and measurement dates only worked with space after cardinal/decimal, now space is optional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your response!

  1. I cloned your fork and installed the project with the ./reinstall.sh . With CLI python normalize.py --text="123" -- language=hy I get output "123". With that being said, maybe you can guide on how to reproduce your project state ? as at the moment I get the afore-mentioned results for all the comments I mentioned.

  2. How likely is to get an Armenian text with English dots ? Wouldn't it be useful to handle the punctuations of the target language?

Copy link
Contributor Author

@davidks13 davidks13 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The easiest way to reproduce the project will be running python shell in NeMo-text-processing directory and then running the code below:
    from nemo_text_processing.text_normalization.normalize import Normalizer
    normalizer_hy = Normalizer(input_case='lower_cased', lang='hy')
    normalizer_hy.normalize("<INPUT_TEXT>")

  2. I faced problems with Armenian dot in the past and made tagger classes with English dot only, however I've managed to add Armenian dots too, thanks to your comment! So now you can use both, as I believe there can be cases where English dot is used.

David Sargsyan and others added 3 commits February 21, 2024 15:33
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
@tbartley94
Copy link
Collaborator

@Ara-Yeroyan are there additional issues with this or can I close out reviewing?

Fix: add "hy" language option for armenian
@Ara-Yeroyan
Copy link
Contributor

Everything is okay now! The pynini behaviour was different on Windows (docker) and in linux. We have checked with @davidks13.

@davidks13
Copy link
Contributor Author

davidks13 commented Feb 27, 2024

Hi @tbartley94 . Are there any additional issues with the code I need to check?

@Ara-Yeroyan
Copy link
Contributor

Actually there are issues (no handling) with Roman Numbers and the range like numbers - e.g. 26-27

@davidks13
Copy link
Contributor Author

Actually there are issues (no handling) with

This is a base for Armenian TN. Those features can be added in the future.

@tbartley94
Copy link
Collaborator

@davidks13 you're good on my technical review. There's a CI issue that requires me to test on local, so the delay is me doing some san testing. I'll be merging later in the week.

@Ara-Yeroyan Roman and ranges are more complex features that are implemented after base TN. Those can be disregarded.

Signed-off-by: tbartley94 <90423858+tbartley94@users.noreply.github.com>
@tbartley94
Copy link
Collaborator

jenkins

Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@tbartley94 tbartley94 merged commit b49afa0 into NVIDIA:main Mar 13, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants