[TTS] Add VietnameseCharsTokenizer #9665

Merged
merged 17 commits into from
Jul 26, 2024

Contributor

@huutuongtu huutuongtu commented Jul 10, 2024

Signed-off-by: Tu huutu12312vn@gmail.com

What does this PR do?

Adds a Vietnamese character-level tokenizer (VietnameseCharsTokenizer) for TTS training.

Collection: [TTS]

Changelog

  • Add VietnameseCharsTokenizer
  • Add unit tests for Vietnamese
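
For context on what a Vietnamese character tokenizer has to cover, the sketch below builds a lowercase grapheme inventory from the 29 letters of the Vietnamese alphabet plus every toned vowel. It is an illustration only: `vietnamese_graphemes` is a hypothetical helper, and the character list NeMo actually uses for vi-VN is not shown in this conversation and may differ (for example, it may include uppercase letters or letters that appear in loanwords).

```python
import unicodedata

# The 12 Vietnamese vowel letters and the 17 consonant letters (29 letters in total).
VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
CONSONANTS = "bcdđghklmnpqrstvx"

# Combining marks for the five written tones: grave, acute, hook above, tilde, dot below.
# The sixth (level) tone is unmarked.
TONE_MARKS = ["\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def vietnamese_graphemes():
    """Build lowercase Vietnamese letters, including every toned vowel, in NFC form."""
    toned = [
        unicodedata.normalize("NFC", vowel + mark)
        for vowel in VOWELS
        for mark in TONE_MARKS
    ]
    return sorted(set(CONSONANTS) | set(VOWELS) | set(toned))


graphemes = vietnamese_graphemes()
print(len(graphemes), "".join(graphemes))
```

Normalizing to NFC matters because a toned vowel can also be encoded as a base letter followed by combining marks; a character tokenizer that expects one code point per grapheme would otherwise split such input into several tokens.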

Usage

from nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers import VietnameseCharsTokenizer

text = "Xin chào các bạn."

tokenizer = VietnameseCharsTokenizer(
    pad_with_space=True,
)

tokens = tokenizer(text)
graphemes = tokenizer.decode(tokens)
graphemes = graphemes.replace('|', '')

print(tokens)
print(graphemes)
#  xin chào các bạn.
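
To follow the usage above without installing NeMo, here is a minimal, self-contained sketch of a character-level tokenizer with the same call and decode shape. `ToyCharsTokenizer` and its vocabulary layout are hypothetical, not NeMo's implementation; the real `VietnameseCharsTokenizer` may assign different IDs and may treat unknown characters, punctuation, and padding differently. The '|' separator in `decode` is an assumption inferred from the `.replace('|', '')` call in the usage snippet.

```python
class ToyCharsTokenizer:
    """Hypothetical character tokenizer; not the NeMo implementation."""

    def __init__(self, chars, pad_with_space=False):
        # Reserve ID 0 for the space character; the other characters follow in order.
        self.tokens = [" "] + [c for c in chars if c != " "]
        self._token2id = {t: i for i, t in enumerate(self.tokens)}
        self.pad_with_space = pad_with_space

    def __call__(self, text):
        # Lowercase the text, then drop characters that are not in the vocabulary.
        ids = [self._token2id[c] for c in text.lower() if c in self._token2id]
        if self.pad_with_space:
            space = self._token2id[" "]
            ids = [space] + ids + [space]
        return ids

    def decode(self, ids):
        # Join with '|' so that token boundaries stay visible.
        return "|".join(self.tokens[i] for i in ids)


text = "Xin chào các bạn."
# Build the toy vocabulary from the sentence itself so every character is known.
tokenizer = ToyCharsTokenizer(sorted(set(text.lower())), pad_with_space=True)
tokens = tokenizer(text)
graphemes = tokenizer.decode(tokens).replace("|", "")
print(tokens)
print(repr(graphemes))
```

In this toy version, `pad_with_space=True` adds one space token at each end, so `graphemes` comes back as the lowercased input with a leading and a trailing space.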

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

huutuongtu and others added 3 commits July 10, 2024 04:40
Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>
Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Collaborator

@XuesongYang XuesongYang left a comment

Please refactor the code accordingly.

@@ -184,6 +208,32 @@ def get_ipa_punctuation_list(locale):
                '—',  # em dash, U+2014, decimal 8212
            ]
        )
    if locale == "vi-VN":
        punct_set.update(
Collaborator

Could you please add the source for the Vietnamese punctuation marks?

Contributor Author

Hmm, there doesn't seem to be any 'official' source on Vietnamese punctuation marks. I did find some information here: https://languagedrops.com/word/en/english/vietnamese/topics/punctuation/.
Maybe we should just use DEFAULT_PUNCTUATION.
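
The fallback suggested here can be sketched as follows. This is a hypothetical illustration: `DEFAULT_PUNCT` is a stand-in built from Python's `string.punctuation`, not NeMo's `DEFAULT_PUNCTUATION`, and `punctuation_for` and `LOCALE_EXTRAS` are made-up names. The Spanish entry is included only as a familiar example of a locale that does need extra marks.

```python
import string

# Stand-in default set; NeMo's DEFAULT_PUNCTUATION may contain different marks.
DEFAULT_PUNCT = set(string.punctuation)

# Hypothetical per-locale additions. "vi-VN" has no entry, so it gets the default set.
LOCALE_EXTRAS = {
    "es-ES": ["¿", "¡"],  # inverted question and exclamation marks
}


def punctuation_for(locale):
    """Return the default punctuation plus any extras registered for the locale."""
    punct_set = set(DEFAULT_PUNCT)  # copy so the shared default is never mutated
    punct_set.update(LOCALE_EXTRAS.get(locale, []))
    return sorted(punct_set)


vi_punct = punctuation_for("vi-VN")
es_punct = punctuation_for("es-ES")
print(len(vi_punct), len(es_punct))
```

With this shape, a locale without a documented punctuation source needs no special case: it simply inherits the default set until someone registers extras for it.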

@XuesongYang XuesongYang self-requested a review July 24, 2024 06:58
XuesongYang previously approved these changes Jul 24, 2024
Collaborator

@XuesongYang XuesongYang left a comment

Thanks, LGTM. Please add the source for the Vietnamese punctuation marks, if there is one.

XuesongYang previously approved these changes Jul 24, 2024
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
@XuesongYang XuesongYang self-requested a review July 25, 2024 22:39
@XuesongYang XuesongYang merged commit 74c2caf into NVIDIA:main Jul 26, 2024
206 of 207 checks passed
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jul 30, 2024
* Update tts_tokenizers.py
* Update tokenizer_utils.py
* Update test_tts_tokenizers.py
* Apply isort and black reformatting

Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>

* Signed-off-by: Tu huutu12312vn@gmail.com

* Update ipa_lexicon.py - Signed-off-by: Tu huutu12312vn@gmail.com

Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>

---------

Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: huutuongtu <huutuongtu@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
xuanzic pushed a commit to xuanzic/NeMo that referenced this pull request Aug 1, 2024
* Update tts_tokenizers.py
* Update tokenizer_utils.py
* Update test_tts_tokenizers.py
* Apply isort and black reformatting

Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>

* Signed-off-by: Tu huutu12312vn@gmail.com

* Update ipa_lexicon.py - Signed-off-by: Tu huutu12312vn@gmail.com

Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>

---------

Signed-off-by: huutuongtu <huutuongtu@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: huutuongtu <huutuongtu@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Vivian Chen <xuanzic@example.com>
2 participants