Armenian TN #137

davidks13 · 2024-01-31T14:48:11Z

tbartley94

Same as ITN. Looks good at core, just needs some edits for readability. Two main things though:

- For TN, we typically introduce flags for deterministic and non-deterministic behavior. This means creating two sets of logic depending on whether the flag is raised. Since this is an outside contribution, there's no major requirement to do this; it's just an additional option if you're interested in making it production ready.
- Try to be a bit more limited with optimize calls. They're expensive, and the graph doesn't really benefit that much from additional calls during construction (unless you're being very keen with epsilon insertions). I recommend only using them when creating subgraphs as properties or when the graph is fully instantiated.

(FYI: your use of closures is within style with other modules, but letting you know that pynini allows Kleene shorthand via .star, .plus and .ques on FST graphs.)
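
To make the deterministic flag suggestion concrete, here is a minimal pure-Python sketch. It is not pynini and not code from this PR: the `verbalize_currency` helper and the `CURRENCY_NAMES` table are hypothetical, and the choice of which name counts as canonical is an assumption made only for illustration.

```python
from typing import Dict, List

# Hypothetical lookup table (not the PR's currency.tsv): each symbol maps to
# one or more spoken forms. The first entry is assumed to be the canonical one.
CURRENCY_NAMES: Dict[str, List[str]] = {
    "$": ["ամերիկյան դոլար", "դոլար"],
    "₺": ["թուրքական լիրա"],
}


def verbalize_currency(symbol: str, deterministic: bool = True) -> List[str]:
    """Return the spoken forms for a currency symbol.

    deterministic=True  -> exactly one output (the canonical form)
    deterministic=False -> every known alternative
    """
    names = CURRENCY_NAMES[symbol]
    return names[:1] if deterministic else list(names)
```

A real grammar would express both branches as FSTs; the sketch only shows the split between one output and many outputs that the flag controls.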

nemo_text_processing/text_normalization/hy/taggers/cardinal.py

nemo_text_processing/text_normalization/hy/taggers/decimal.py

nemo_text_processing/text_normalization/hy/taggers/fraction.py

nemo_text_processing/text_normalization/hy/taggers/money.py

nemo_text_processing/text_normalization/hy/taggers/ordinal.py

nemo_text_processing/text_normalization/hy/verbalizers/fraction.py

tbartley94

Two minor things, beyond that LGTM

nemo_text_processing/text_normalization/hy/taggers/decimal.py

nemo_text_processing/text_normalization/hy/taggers/money.py

nemo_text_processing/text_normalization/hy/verbalizers/fraction.py

nemo_text_processing/text_normalization/hy/verbalizers/measure.py

nemo_text_processing/text_normalization/hy/verbalizers/money.py

nemo_text_processing/text_normalization/hy/verbalizers/ordinal.py

davidks13 · 2024-02-20T15:40:00Z

Hi @ekmb. What should I do to pass the check "continuous-integration/jenkins/pr-head"? I could not open the details of that check. Thank you!

tbartley94 · 2024-02-20T18:19:05Z

> Hi @ekmb. What should I do to pass the check "continuous-integration/jenkins/pr-head"? I could not open the details of that check. Thank you!

That's CI on our end. Can't really access if you're not a maintainer. We have to check and provide feedback. Sorry, can be a bit annoying.

Ara-Yeroyan · 2024-02-20T18:41:25Z

nemo_text_processing/text_normalization/hy/data/currency.tsv

+£s	սիրիական ֆունտ
+₺	թուրքական լիրա
+₴	ուկրաինական գրիվնա
+$	ամերիկյան դոլար


If you run normalize.py with just --text="$123", you get "23" instead of "հարյուր քսաներեք ամերիկյան դոլար".

davidks13 · 2024-02-21

Works fine on my side. The thing is, in Armenian people mostly write the currency after the cardinal or decimal, not before. So in this case I get the output "$ հարյուր քսաներեք", but 123$ works perfectly fine and outputs "հարյուր քսաներեք դոլար". The other names for the same currencies were added for future pull requests that will add non-deterministic behavior.
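
The ordering point can be illustrated with a small regex sketch. It is not the PR's pynini grammar, which as described handles only the amount-first form. `split_money`, the symbol set, and the accepted decimal separators are all assumptions made for the example; it shows one hypothetical way to recognize both "123$" and "$123".

```python
import re
from typing import Optional, Tuple

# Hypothetical subset of symbols; the PR's currency.tsv lists more.
SYMBOLS = "$₺₴"
_SYM = f"[{re.escape(SYMBOLS)}]"
# Assumption: an amount is digits with an optional "." or "," fraction.
_AMOUNT = r"\d+(?:[.,]\d+)?"

# Amount first ("123$"), the order described as usual for Armenian.
_AFTER = re.compile(rf"^(?P<amount>{_AMOUNT})\s?(?P<symbol>{_SYM})$")
# Symbol first ("$123").
_BEFORE = re.compile(rf"^(?P<symbol>{_SYM})\s?(?P<amount>{_AMOUNT})$")


def split_money(token: str) -> Optional[Tuple[str, str]]:
    """Return (amount, symbol) for either ordering, or None if nothing matches."""
    for pattern in (_AFTER, _BEFORE):
        m = pattern.match(token)
        if m:
            return m.group("amount"), m.group("symbol")
    return None
```

Both `split_money("123$")` and `split_money("$123")` yield the same pair, so a downstream verbalizer would not need to care about the written order.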

Ara-Yeroyan · 2024-02-20T18:46:11Z

nemo_text_processing/text_normalization/hy/taggers/cardinal.py

+        self.one_to_all_tens = one_to_all_tens.optimize()
+
+        hundreds_parts = (pynutil.delete("0") + insert_space + digits) | (insert_space + double_digits)
+        one_hundreds = pynini.cross("1", "հարյուր") + (pynutil.delete("00") | hundreds_parts)


If you run:

- normalize.py with just --text="123", you get the same "123" instead of the normalized "հարյուր քսաներեք"
- normalize.py with --text="123թվ․ Տրդատը կառուցեց 3 ամրոց", it again outputs the same text

davidks13 · 2024-02-21

In the first case you mentioned, everything works fine on my side. As for the second case, there is no 'թվ.', only 'թ.' and 'թթ.', and don't use the Armenian dot, it does not work (use the English one). By the way, thank you for this comment: measurements and measurement dates only worked with a space after the cardinal/decimal, and now the space is optional.

Ara-Yeroyan · 2024-02-21

Thanks for your response!

I cloned your fork and installed the project with ./reinstall.sh. With the CLI call python normalize.py --text="123" --language=hy I get the output "123". Could you guide me on how to reproduce your project state? At the moment I get the results described above for every case I raised.

How likely is it to get Armenian text with English dots? Wouldn't it be useful to handle the punctuation of the target language?

davidks13 · 2024-02-22

The easiest way to reproduce the project state is to start a Python shell in the NeMo-text-processing directory and run the code below:
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer_hy = Normalizer(input_case='lower_cased', lang='hy')
normalizer_hy.normalize("<INPUT_TEXT>")

I had problems with the Armenian dot in the past and wrote the tagger classes with the English dot only. Thanks to your comment, I've now managed to add the Armenian dot too, so you can use both, as I believe there can be cases where the English dot is used.
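
As a rough illustration of the two changes described in this thread (optional space after the number, and either dot after the abbreviation), here is a regex sketch. It is not the PR's tagger code. `match_abbreviation` is a hypothetical helper, and treating U+2024 (ONE DOT LEADER) as the "Armenian dot" is an assumption.

```python
import re
from typing import Optional, Tuple

# ASCII full stop plus U+2024 (ONE DOT LEADER), assumed here to be the
# "Armenian dot" discussed in the thread.
DOTS = ".\u2024"

# Number, an optional single space, then "թթ" or "թ", then either dot.
_ABBR = re.compile(rf"^(\d+) ?(թթ|թ)[{re.escape(DOTS)}]$")


def match_abbreviation(token: str) -> Optional[Tuple[str, str]]:
    """Return (number, abbreviation) if the token fits the pattern, else None."""
    m = _ABBR.match(token)
    return (m.group(1), m.group(2)) if m else None
```

With this pattern, "123թ." and "123 թ\u2024" are both accepted, while "123թվ." is rejected because 'թվ' is not one of the supported abbreviations.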

tbartley94 · 2024-02-23T01:06:24Z

@Ara-Yeroyan are there additional issues with this or can I close out reviewing?

Ara-Yeroyan · 2024-02-23T13:39:33Z

Everything is okay now! The pynini behaviour was different on Windows (Docker) and on Linux. We have checked with @davidks13.

davidks13 · 2024-02-27T10:49:50Z

Hi @tbartley94 . Are there any additional issues with the code I need to check?

Ara-Yeroyan · 2024-02-27T11:18:21Z

Actually there are issues (no handling) with Roman numerals and range-like numbers, e.g. 26-27.

davidks13 · 2024-02-27T12:30:37Z

> Actually there are issues (no handling) with

This is a base for Armenian TN. Those features can be added in the future.

tbartley94 · 2024-02-27T16:24:05Z

@davidks13 you're good on my technical review. There's a CI issue that requires me to test locally, so the delay is me doing some sanity testing. I'll be merging later in the week.

@Ara-Yeroyan Roman and ranges are more complex features that are implemented after base TN. Those can be disregarded.

tbartley94 · 2024-03-12T20:35:59Z

jenkins

tbartley94

lgtm

davidks13 mentioned this pull request Jan 31, 2024

Armenian itn #136

Merged

13 tasks

tbartley94 requested changes Feb 5, 2024

tbartley94 previously approved these changes Feb 14, 2024

nemo_text_processing/text_normalization/hy/taggers/decimal.py (outdated, resolved)

nemo_text_processing/text_normalization/hy/taggers/money.py (outdated, resolved)

github-advanced-security bot found potential problems Feb 14, 2024

davidks13 dismissed tbartley94’s stale review via 0db2c22 February 15, 2024 12:40

davidks13 force-pushed the armenian_tn branch from 43215b4 to 1714fa4 Compare February 19, 2024 14:39

David Sargsyan and others added 6 commits February 20, 2024 15:13

merged with main branch and fixed conflicts

ae1e75a

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

fixing conflicts

442f176

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

fixing some more conflicts

6603f4b

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d25c427

for more information, see https://pre-commit.ci Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

fixed a minor issue

9f3483d

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

deleted unused imports

52f6a64

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

davidks13 force-pushed the armenian_tn branch from 1714fa4 to 52f6a64 Compare February 20, 2024 11:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

c082aad

for more information, see https://pre-commit.ci

Fix: add "hy" language option for armenian

fd56c31

Signed-off-by: Ara Yeroyan <60027241+Ara-Yeroyan@users.noreply.github.com>

Ara-Yeroyan reviewed Feb 20, 2024

David Sargsyan and others added 3 commits February 21, 2024 15:33

added optional space for measurements after cardinals/decimals

4e20313

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

added Armenian dot

6eb4ae8

Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e7ceaee

for more information, see https://pre-commit.ci

Merge pull request #1 from Ara-Yeroyan/patch-1

e5c3222

Fix: add "hy" language option for armenian

Merge branch 'main' into armenian_tn

d84cf5c

Signed-off-by: tbartley94 <90423858+tbartley94@users.noreply.github.com>

tbartley94 approved these changes Mar 12, 2024

ekmb approved these changes Mar 12, 2024

tbartley94 merged commit b49afa0 into NVIDIA:main Mar 13, 2024
5 checks passed
