Extending the neural TN/ITN models for other languages #2497
Conversation
This pull request introduces 1 alert when merging 4562e7c into e5b8570 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 1be6c7e into c527e95 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 32207f0 into c527e95 - view on LGTM.com new alerts:
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
This pull request introduces 1 alert when merging 914ff12 into c527e95 - view on LGTM.com new alerts:
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Tested on the Google Russian dataset and @yzhang123's German dataset:
from omegaconf import DictConfig, OmegaConf

import nemo.collections.nlp.data.text_normalization.constants as constants
from nemo.collections.nlp.data.text_normalization import TextNormalizationTestDataset
from nemo.collections.nlp.data.text_normalization.utils import basic_tokenize
What's the difference between basic_tokenize and word_tokenize?
basic_tokenize takes the input text and the language as inputs. If the language is English, basic_tokenize simply calls word_tokenize. For other languages (e.g., ru and de), basic_tokenize just splits the text on whitespace.
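A minimal sketch of the behavior described above (not the actual NeMo implementation; NeMo's English path delegates to a full word tokenizer, for which a simple regex stands in here):

```python
import re

def basic_tokenize(text: str, lang: str):
    # For English, split off punctuation as separate tokens
    # (a stand-in for a proper English word tokenizer).
    if lang == "en":
        return re.findall(r"\w+|[^\w\s]", text)
    # For other languages (e.g., ru, de), split on whitespace only.
    return text.split()

print(basic_tokenize("Hello, world!", "en"))  # ['Hello', ',', 'world', '!']
print(basic_tokenize("привет мир", "ru"))     # ['привет', 'мир']
```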
@@ -15,7 +15,7 @@
"""
This script can be used to process the raw data files of the Google Text Normalization dataset
to obtain data files of the format mentioned in the `text_normalization doc <https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization.rst>`.
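For context, a hedged sketch of reading the raw Google Text Normalization files. It assumes the commonly documented layout of that dataset: tab-separated lines of the form "semiotic_class<TAB>written<TAB>spoken" (with "<self>" when the token is unchanged) and "<eos>" lines marking sentence boundaries; the NeMo script's actual parsing logic may differ.

```python
def parse_google_tn(lines):
    """Group raw Google TN lines into sentences of
    (semiotic_class, written, spoken) triples."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if parts[0] == "<eos>":  # sentence boundary marker
            if current:
                sentences.append(current)
                current = []
            continue
        cls, written = parts[0], parts[1]
        spoken = parts[2] if len(parts) > 2 else "<self>"
        current.append((cls, written, spoken))
    if current:  # flush a trailing sentence with no <eos>
        sentences.append(current)
    return sentences

sample = [
    "PLAIN\thello\t<self>",
    "DATE\t2021\ttwenty twenty one",
    "<eos>\t<eos>",
]
print(parse_google_tn(sample))
```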
Could you add a note that this script supports processing the EN and RU Google Text Normalization datasets?
This script, google_data_preprocessing.py, supports EN and RU processing in its current form.
Did you have to do any German-specific processing? If not, I'm OK to merge the PR.
No. I used your German dataset directly. The original Google dataset also does not include German data.
Thanks for reviewing the PR. The CI has passed, so could you merge it when you have some time? Thanks.
* extending the neural TN/ITN model to handle RU
* Support German
* Catch AttributeError instead of BaseException
* Style fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
#2415 mainly targeted English. This PR extends it to support other languages.
Results on Google Russian dataset and internal German dataset: