-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Armenian TN #137
Armenian TN #137
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as ITN. Looks good at core, just need some edits for readability. Two main things though:
-
For TN, we typically introduce flags for deterministic and non-deterministic behavior. This means creating two logics depending on if this flag is raised. Since this is an outside contribution there's no major requirement to do this, just is an additional option if you're interested in making production ready.
-
Try and be a bit more limited with optimize calls. They're expensive and the graph doesn't really benefit that much from additional calls while constructing. (Unless you're being veerry keen with epsilon insertions.) I recommend only using them when creating subgraphs as properties or when the graph is fully instantiated.
(Fyi: your use of closures is within style with other modules, but letting you know that pynini allows Kleene shorthand with .star
.plus
and .ques
method calls from fst graphs.)
nemo_text_processing/text_normalization/hy/verbalizers/fraction.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor things, beyond that LGTM
43215b4
to
1714fa4
Compare
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
for more information, see https://pre-commit.ci Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
1714fa4
to
52f6a64
Compare
for more information, see https://pre-commit.ci
Hi @ekmb . What should I do to pass this check ''continuous-integration/jenkins/pr-head''? I could not open details of that check. Thank you! |
Signed-off-by: Ara Yeroyan <60027241+Ara-Yeroyan@users.noreply.github.com>
That's CI on our end. Can't really access if you're not a maintainer. We have to check and provide feedback. Sorry, can be a bit annoying. |
£s սիրիական ֆունտ | ||
₺ թուրքական լիրա | ||
₴ ուկրաինական գրիվնա | ||
$ ամերիկյան դոլար |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ran normalize.py
on just --text="$123" you will get "23" instead of "հարյուր քսաներեք ամերիկյան դոլար"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works fine on my side, the thing is in Armenian most of the time people write currency after cardinal or decimal not before. So in this case I get output $ հարյուր քսաներեք. But if you try 123$ it works perfectly fine and outputs հարյուր քսաներեք դոլար, the other names for the same currencies are added for future pull requests where will be added non-deterministic behavior.
self.one_to_all_tens = one_to_all_tens.optimize() | ||
|
||
hundreds_parts = (pynutil.delete("0") + insert_space + digits) | (insert_space + double_digits) | ||
one_hundreds = pynini.cross("1", "հարյուր") + (pynutil.delete("00") | hundreds_parts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ran
normalize.py
on just --text="123" you will get the same "123" instead of normalized "հարյուր քսաներեք"normalize.py
on --text="123թվ․ Տրդատը կառուցեց 3 ամրոց" will again output the same text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first case you mentioned, everything works fine on my side. As for the second case there is no 'թվ.' only 'թ.' and 'թթ.' and don't use Armenian dot it does not work (use English). By the way thank you for this comment as measurements and measurement dates only worked with space after cardinal/decimal, now space is optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your response!
-
I cloned your fork and installed the project with the
./reinstall.sh
. With CLIpython normalize.py --text="123" -- language=hy
I get output "123". With that being said, maybe you can guide on how to reproduce your project state ? as at the moment I get the afore-mentioned results for all the comments I mentioned. -
How likely is to get an Armenian text with English dots ? Wouldn't it be useful to handle the punctuations of the target language?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
The easiest way to reproduce the project will be running python shell in NeMo-text-processing directory and then running the code below:
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer_hy = Normalizer(input_case='lower_cased', lang='hy')
normalizer_hy.normalize("<INPUT_TEXT>") -
I faced problems with Armenian dot in the past and made tagger classes with English dot only, however I've managed to add Armenian dots too, thanks to your comment! So now you can use both, as I believe there can be cases where English dot is used.
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
Signed-off-by: David Sargsyan <d.sargsyan@ispras.ru>
for more information, see https://pre-commit.ci
@Ara-Yeroyan are there additional issues with this or can I close out reviewing? |
Fix: add "hy" language option for armenian
Everything is okay now! The pynini behaviour was different on Windows (docker) and in linux. We have checked with @davidks13. |
Hi @tbartley94 . Are there any additional issues with the code I need to check? |
Actually there are issues (no handling) with Roman Numbers and the range like numbers - e.g. 26-27 |
This is a base for Armenian TN. Those features can be added in the future. |
@davidks13 you're good on my technical review. There's a CI issue that requires me to test on local, so the delay is me doing some san testing. I'll be merging later in the week. @Ara-Yeroyan Roman and ranges are more complex features that are implemented after base TN. Those can be disregarded. |
Signed-off-by: tbartley94 <90423858+tbartley94@users.noreply.github.com>
jenkins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Before your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.