Extending the neural TN/ITN models for other languages #2497
Conversation
This pull request introduces 1 alert when merging 4562e7c into e5b8570 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 1be6c7e into c527e95 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 32207f0 into c527e95 - view on LGTM.com new alerts:
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
This pull request introduces 1 alert when merging 914ff12 into c527e95 - view on LGTM.com new alerts:
Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Tested on the Google Russian dataset and @yzhang123's German dataset:
from omegaconf import DictConfig, OmegaConf

import nemo.collections.nlp.data.text_normalization.constants as constants
from nemo.collections.nlp.data.text_normalization import TextNormalizationTestDataset
from nemo.collections.nlp.data.text_normalization.utils import basic_tokenize
What's the difference between basic_tokenize and word_tokenize?
basic_tokenize takes the input text and the language as inputs. If the language is English, basic_tokenize simply calls word_tokenize. For other languages (e.g., ru and de), basic_tokenize just splits the text on whitespace.
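A minimal sketch of the behavior described above (not the actual NeMo implementation; NeMo's English path delegates to a full word tokenizer, for which a simple regex stands in here):

```python
import re

def basic_tokenize(text: str, lang: str):
    # For English, split off punctuation as separate tokens
    # (a stand-in for a proper English word tokenizer).
    if lang == "en":
        return re.findall(r"\w+|[^\w\s]", text)
    # For other languages (e.g., ru, de), split on whitespace only.
    return text.split()

print(basic_tokenize("Hello, world!", "en"))  # ['Hello', ',', 'world', '!']
print(basic_tokenize("привет мир", "ru"))     # ['привет', 'мир']
```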
@@ -15,7 +15,7 @@
"""
This script can be used to process the raw data files of the Google Text Normalization dataset
to obtain data files of the format mentioned in the `text_normalization doc <https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization.rst>`.
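For context, a hedged sketch of reading the raw Google Text Normalization files. It assumes the commonly documented layout of that dataset: tab-separated lines of the form "semiotic_class<TAB>written<TAB>spoken" (with "<self>" when the token is unchanged) and "<eos>" lines marking sentence boundaries; the NeMo script's actual parsing logic may differ.

```python
def parse_google_tn(lines):
    """Group raw Google TN lines into sentences of
    (semiotic_class, written, spoken) triples."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if parts[0] == "<eos>":  # sentence boundary marker
            if current:
                sentences.append(current)
                current = []
            continue
        cls, written = parts[0], parts[1]
        spoken = parts[2] if len(parts) > 2 else "<self>"
        current.append((cls, written, spoken))
    if current:  # flush a trailing sentence with no <eos>
        sentences.append(current)
    return sentences

sample = [
    "PLAIN\thello\t<self>",
    "DATE\t2021\ttwenty twenty one",
    "<eos>\t<eos>",
]
print(parse_google_tn(sample))
```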
Could you add a note that this script supports processing the EN and RU Google Text Normalization datasets?
This script, google_data_preprocessing.py, supports EN and RU processing in its current form.
Did you have to do any German-specific processing? If not, I'm OK to merge the PR.
No. I used your German dataset directly. The original Google dataset also does not include German data.
Thanks for reviewing the PR. The CI has passed, so could you merge it when you have some time? Thanks.
* extending the neural TN/ITN model to handle RU
* Support German
* Catch AttributeError instead of BaseException
* Style fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
#2415 mainly targeted English. This PR extends it to support other languages.
Results on Google Russian dataset and internal German dataset: