Extending the neural TN/ITN models for other languages #2497

Merged
merged 9 commits into from
Jul 21, 2021


laituan245
Contributor

@laituan245 laituan245 commented Jul 16, 2021

#2415 mainly targeted English. This PR extends it to support other languages.

  • Testing the new code on Russian
  • Testing the new code on German

Results on Google Russian dataset and internal German dataset:

  • Russian Duplex System. TN accuracy: 96.20%. ITN accuracy: 85.57%
  • German Duplex System. TN accuracy: 94.38%. ITN accuracy: 79.89%

@laituan245 laituan245 marked this pull request as draft July 16, 2021 18:07
@lgtm-com

lgtm-com bot commented Jul 16, 2021

This pull request introduces 1 alert when merging 4562e7c into e5b8570 - view on LGTM.com

new alerts:

  • 1 for Except block handles 'BaseException'

@lgtm-com

lgtm-com bot commented Jul 19, 2021

This pull request introduces 1 alert when merging 1be6c7e into c527e95 - view on LGTM.com

new alerts:

  • 1 for Except block handles 'BaseException'

@lgtm-com

lgtm-com bot commented Jul 19, 2021

This pull request introduces 1 alert when merging 32207f0 into c527e95 - view on LGTM.com

new alerts:

  • 1 for Except block handles 'BaseException'

Tuan Lai added 2 commits July 19, 2021 10:27
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
@lgtm-com

lgtm-com bot commented Jul 19, 2021

This pull request introduces 1 alert when merging 914ff12 into c527e95 - view on LGTM.com

new alerts:

  • 1 for Except block handles 'BaseException'
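
For context on the alert above: `except BaseException:` (like a bare `except:`) also traps `KeyboardInterrupt` and `SystemExit`, which is why it gets flagged. One of the commits in this PR is titled "Catch AttributeError instead of BaseException". The snippet below is only an illustrative sketch of that kind of narrowing; `get_text`, `WithText`, and `Interrupting` are hypothetical names and are not taken from the PR's code.

```python
class WithText:
    """Hypothetical object that exposes a `text` attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


class Interrupting:
    """Hypothetical object whose `text` property simulates a Ctrl-C."""

    @property
    def text(self) -> str:
        raise KeyboardInterrupt


def get_text(obj: object) -> str:
    """Return obj.text, falling back to str(obj) only if the attribute is missing."""
    try:
        return obj.text
    except AttributeError:
        # Only the "no such attribute" case is handled here.
        # KeyboardInterrupt and SystemExit are not subclasses of
        # AttributeError, so they propagate to the caller.
        return str(obj)
```

With this handler, `get_text(WithText("zwei"))` returns the attribute, `get_text(42)` falls back to `str(42)`, and `get_text(Interrupting())` lets the `KeyboardInterrupt` through instead of silently swallowing it.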

Tuan Lai added 3 commits July 19, 2021 10:50
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
@laituan245 laituan245 marked this pull request as ready for review July 19, 2021 17:50
@laituan245
Contributor Author

Tested on the Google Russian dataset and @yzhang123's German dataset:

  • Russian Duplex System. TN accuracy: 96.20%. ITN accuracy: 85.57%
  • German Duplex System. TN accuracy: 94.38%. ITN accuracy: 79.89%

from omegaconf import DictConfig, OmegaConf

import nemo.collections.nlp.data.text_normalization.constants as constants
from nemo.collections.nlp.data.text_normalization import TextNormalizationTestDataset
from nemo.collections.nlp.data.text_normalization.utils import basic_tokenize

Contributor

What's the difference between basic_tokenize and word_tokenize?

Contributor Author

basic_tokenize takes the input text and the language as inputs. If the language is English, it simply calls word_tokenize. For other languages (e.g., ru and de), it splits the text on spaces.
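
A minimal sketch of the dispatch described in this reply, not the actual NeMo implementation. The names `basic_tokenize_sketch` and `simple_en_tokenize` are hypothetical, and `simple_en_tokenize` is a regex stand-in for word_tokenize so that the example runs without extra dependencies; the language code `"en"` is also an assumption.

```python
import re
from typing import Callable, List


def simple_en_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for word_tokenize: split words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def basic_tokenize_sketch(
    text: str,
    lang: str,
    en_tokenizer: Callable[[str], List[str]] = simple_en_tokenize,
) -> List[str]:
    """Tokenize `text`, using a word tokenizer for English and spaces otherwise."""
    if lang == "en":
        # English: delegate to the word-level tokenizer.
        return en_tokenizer(text)
    # Other languages (e.g., ru, de): split on whitespace only, so
    # punctuation stays attached to the neighbouring word.
    return text.split()
```

Under these assumptions, `basic_tokenize_sketch("Привет, мир 2021", "ru")` keeps the comma attached to the first word, while the English path separates punctuation into its own tokens. Note that `str.split()` with no argument collapses repeated whitespace, which may differ from a literal split on single spaces.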

@@ -15,7 +15,7 @@
"""
This script can be used to process the raw data files of the Google Text Normalization dataset
to obtain data files of the format mentioned in the `text_normalization doc <https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization.rst>`.

Contributor

Could you add that this script supports processing the EN and RU Google Text Normalization datasets?

Contributor Author

The script google_data_preprocessing.py supports EN and RU processing in its current form.

Contributor

Did you have to do any German processing? If not, I'm OK to merge the PR.

Contributor Author

No. I used your German dataset directly. The original Google dataset also does not have German data.

@laituan245
Contributor Author

Thanks for reviewing the PR. I think the CI has passed. Could you merge it when you have some time? Thanks.

@yzhang123 yzhang123 merged commit b472670 into NVIDIA:main Jul 21, 2021
@laituan245 laituan245 deleted the neural_tn_ru branch July 21, 2021 17:51
blisc pushed a commit to blisc/NeMo that referenced this pull request Aug 12, 2021
* extending the neural TN/ITN model to handle RU

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Support German
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Catch AttributeError instead of BaseException
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Style fix
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
paarthneekhara pushed a commit to paarthneekhara/NeMo that referenced this pull request Sep 17, 2021
* extending the neural TN/ITN model to handle RU

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Support German
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Catch AttributeError instead of BaseException
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Style fix
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>