Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TN] WFST to normalize punctuation #4108

Merged
merged 30 commits into from
May 26, 2022
Merged

[TN] WFST to normalize punctuation #4108

merged 30 commits into from
May 26, 2022

Conversation

ekmb
Copy link
Collaborator

@ekmb ekmb commented May 4, 2022

What does this PR do ?

Added WFST graph to handle punctuation after verbalization step, parallel normalization, bug fixes

Collection: [TN]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ X] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

ekmb added 4 commits May 3, 2022 13:25
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
…_threshold

Signed-off-by: ekmb <ebakhturina@nvidia.com>
@lgtm-com
Copy link

lgtm-com bot commented May 4, 2022

This pull request introduces 2 alerts when merging 5f0ef3e into dc13c5d - view on LGTM.com

new alerts:

  • 1 for Duplicate key in dict literal
  • 1 for Wrong name for an argument in a class instantiation

ekmb added 4 commits May 5, 2022 09:35
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@lgtm-com
Copy link

lgtm-com bot commented May 6, 2022

This pull request introduces 2 alerts when merging 64bbc44 into ddd8719 - view on LGTM.com

new alerts:

  • 1 for Duplicate key in dict literal
  • 1 for Wrong name for an argument in a class instantiation

ekmb added 3 commits May 9, 2022 07:11
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@lgtm-com
Copy link

lgtm-com bot commented May 10, 2022

This pull request introduces 2 alerts when merging b5814f6 into 650718f - view on LGTM.com

new alerts:

  • 1 for Duplicate key in dict literal
  • 1 for Wrong name for an argument in a class instantiation

ekmb added 3 commits May 10, 2022 14:34
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@ekmb ekmb changed the title WFST to normalize punctuation [TN] WFST to normalize punctuation May 12, 2022
ekmb added 3 commits May 13, 2022 15:41
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@lgtm-com
Copy link

lgtm-com bot commented May 13, 2022

This pull request introduces 2 alerts when merging d40929b into 08df199 - view on LGTM.com

new alerts:

  • 1 for Unreachable code
  • 1 for Wrong name for an argument in a class instantiation

ekmb added 3 commits May 15, 2022 15:58
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@ekmb ekmb requested a review from yzhang123 May 16, 2022 02:46
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Copy link
Contributor

@yzhang123 yzhang123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Evelina for the great effort!

chapter
Class
CLASS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why removing this?

$ and 5% or %~dollar and five percent or percent sign
1~one
1~one
1, ~one ,
1!!!!~one ! ! ! !
(1)Hello~( one ) Hello
!1~! one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not be included? only punctuation following semiotic and words?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to test_cases_punctuation.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

abbreviation USA,~abbreviation USA,
23rd july, 1998~the twenty third of july, nineteen ninety eight
April 29th’s meeting~april twenty ninth’s meeting
?,~?,
?,no~?,no
I've 20' and 14/ they're I'm 16c.~I've twenty' and fourteen slash they're I'm sixteen c.
I've 20' and 14/ they're I'm 16c.~I've twenty ' and fourteen slash they're I'm sixteen c.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why space between 20 and ' in normalization?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apostrophe is not handled if attached to a semiotic token

pynini.difference(
self.graph, pynini.union("$", "€", "₩", "£", "¥", "#", "$", "%") + pynini.closure(NEMO_DIGIT, 1)
graph, pynini.union("$", "€", "₩", "£", "¥", "#", "$", "%") + pynini.closure(NEMO_DIGIT, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we treat this separately?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduced the complexity of this

# non_digit is needed to allow non-ascii chars, like in "Müller's"
non_digit = pynini.difference(NEMO_NOT_SPACE, NEMO_DIGIT).optimize()
at_least_one_alpha = (
pynini.closure(non_digit) + pynini.closure(NEMO_ALPHA, 1) + pynini.closure(non_digit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just having closure(non_digit), why including NeMo_Alpha?


# punct followed by word and another punct mark: { "And, }
alpha_with_punct_graph = (
pynini.closure(punct_symbols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why includding punct_symbols here?

pynini.closure(punct_symbols)
+ at_least_one_alpha
+ pynini.closure(punct_symbols, 1)
+ pynini.closure(non_digit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why requiring non_digit, NeMo_Alpha here?

yzhang123
yzhang123 previously approved these changes May 26, 2022
$ and 5% or %~dollar and five percent or percent sign
1~one
1~one
1, ~one ,
1!!!!~one ! ! ! !
(1)Hello~( one ) Hello
!1~! one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

Signed-off-by: ekmb <ebakhturina@nvidia.com>
yzhang123
yzhang123 previously approved these changes May 26, 2022
Signed-off-by: ekmb <ebakhturina@nvidia.com>
@ekmb ekmb requested a review from yzhang123 May 26, 2022 19:28
@yzhang123 yzhang123 merged commit 49dff00 into main May 26, 2022
@ekmb ekmb deleted the tn_tts_e branch August 3, 2022 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants