[TN] WFST to normalize punctuation #4108
Signed-off-by: ekmb <ebakhturina@nvidia.com>
…_threshold Signed-off-by: ekmb <ebakhturina@nvidia.com>
This pull request introduces 2 alerts when merging 5f0ef3e into dc13c5d - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 64bbc44 into ddd8719 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging b5814f6 into 650718f - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging d40929b into 08df199 - view on LGTM.com new alerts:
thanks Evelina for the great effort!
chapter
Class
CLASS
Why is this being removed?
tests/nemo_text_processing/en/data_text_normalization/test_cases_cardinal.txt
$ and 5% or %~dollar and five percent or percent sign
1~one
1~one
1, ~one ,
1!!!!~one ! ! ! !
(1)Hello~( one ) Hello
!1~! one
this will not be included? only punctuation following semiotic and words?
moved to test_cases_punctuation.txt
thx!
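For readers unfamiliar with the test case files discussed in this thread: each line shown above appears to pair a written input with its expected normalized output, separated by `~`. The following is a minimal sketch of reading that format, inferred only from the lines quoted here. The function name `parse_test_cases` is hypothetical and is not the helper used by the NeMo test suite.

```python
def parse_test_cases(lines):
    """Split 'written~spoken' lines into (written, spoken) pairs.

    Hypothetical helper for illustration; the format is inferred from
    the test case lines quoted in this review thread.
    """
    cases = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and anything without the separator.
        if not line.strip() or "~" not in line:
            continue
        # Split on the first '~' only, so the expected output is kept whole.
        written, spoken = line.split("~", 1)
        cases.append((written, spoken))
    return cases


SAMPLE = [
    "1!!!!~one ! ! ! !",
    "(1)Hello~( one ) Hello",
    "!1~! one",
]

cases = parse_test_cases(SAMPLE)
for written, spoken in cases:
    print(f"{written!r} -> {spoken!r}")
```

In these samples the expected output separates each punctuation mark from the verbalized number with a space, which is the behaviour the moved cases check.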
abbreviation USA,~abbreviation USA,
23rd july, 1998~the twenty third of july, nineteen ninety eight
April 29th’s meeting~april twenty ninth’s meeting
?,~?,
?,no~?,no
I've 20' and 14/ they're I'm 16c.~I've twenty' and fourteen slash they're I'm sixteen c.
I've 20' and 14/ they're I'm 16c.~I've twenty ' and fourteen slash they're I'm sixteen c.
why space between 20 and ' in normalization?
apostrophe is not handled if attached to a semiotic token
pynini.difference(
self.graph, pynini.union("$", "€", "₩", "£", "¥", "#", "$", "%") + pynini.closure(NEMO_DIGIT, 1)
graph, pynini.union("$", "€", "₩", "£", "¥", "#", "$", "%") + pynini.closure(NEMO_DIGIT, 1)
why did we treat this separately?
reduced the complexity of this
# non_digit is needed to allow non-ascii chars, like in "Müller's"
non_digit = pynini.difference(NEMO_NOT_SPACE, NEMO_DIGIT).optimize()
at_least_one_alpha = (
pynini.closure(non_digit) + pynini.closure(NEMO_ALPHA, 1) + pynini.closure(non_digit)
Why not just use closure(non_digit)? Why include NEMO_ALPHA?
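One way to compare the two alternatives is a regular-expression analogue of the expressions quoted above. This is a rough sketch, not the PR's implementation: it assumes `NEMO_ALPHA` corresponds to ASCII letters, `NEMO_DIGIT` to `0-9`, and `NEMO_NOT_SPACE` to any non-whitespace character, and the names `ONLY_NON_DIGIT`, `AT_LEAST_ONE_ALPHA` and `accepts` are hypothetical.

```python
import re

# Assumed approximations of the NeMo character classes (not taken from the PR):
#   non_digit  ~ any character that is neither whitespace nor an ASCII digit
#   NEMO_ALPHA ~ ASCII letters
NON_DIGIT = r"[^\s0-9]"

# Analogue of pynini.closure(non_digit) on its own.
ONLY_NON_DIGIT = re.compile(rf"{NON_DIGIT}*")

# Analogue of closure(non_digit) + closure(NEMO_ALPHA, 1) + closure(non_digit).
AT_LEAST_ONE_ALPHA = re.compile(rf"{NON_DIGIT}*[A-Za-z]+{NON_DIGIT}*")


def accepts(pattern, token):
    """Return True if the whole token is matched by the pattern."""
    return pattern.fullmatch(token) is not None


for token in ["Müller's", "?!", "", "20'"]:
    print(
        repr(token),
        "closure(non_digit):", accepts(ONLY_NON_DIGIT, token),
        "at_least_one_alpha:", accepts(AT_LEAST_ONE_ALPHA, token),
    )
```

Under these assumptions, a bare `closure(non_digit)` also accepts the empty string and punctuation-only tokens such as `?!`, while the version with a mandatory letter run rejects both and still accepts `Müller's`.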
# punct followed by word and another punct mark: { "And, }
alpha_with_punct_graph = (
pynini.closure(punct_symbols)
Why include punct_symbols here?
pynini.closure(punct_symbols)
+ at_least_one_alpha
+ pynini.closure(punct_symbols, 1)
+ pynini.closure(non_digit)
Why require non_digit and NEMO_ALPHA here?
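To make the shape of `alpha_with_punct_graph` concrete, here is a regular-expression sketch of the concatenation quoted above. It is an approximation under stated assumptions: the definition of `punct_symbols` is not shown in this thread, so `string.punctuation` stands in for it, and the character-class approximations (ASCII letters, ASCII digits, non-whitespace) are likewise assumed. `ALPHA_WITH_PUNCT` and `matches` are hypothetical names.

```python
import re
import string

# Assumption: punct_symbols is approximated by ASCII punctuation.
PUNCT = "[" + re.escape(string.punctuation) + "]"
# Assumption: non_digit ~ any non-whitespace character except ASCII digits.
NON_DIGIT = r"[^\s0-9]"
# Analogue of at_least_one_alpha from the snippet above.
AT_LEAST_ONE_ALPHA = rf"{NON_DIGIT}*[A-Za-z]+{NON_DIGIT}*"

# closure(punct) + at_least_one_alpha + closure(punct, 1) + closure(non_digit)
ALPHA_WITH_PUNCT = re.compile(
    rf"{PUNCT}*{AT_LEAST_ONE_ALPHA}{PUNCT}+{NON_DIGIT}*"
)


def matches(token):
    """Return True if the whole token fits the approximated pattern."""
    return ALPHA_WITH_PUNCT.fullmatch(token) is not None


for token in ['"And,', "And,", "And", "5,", ","]:
    print(repr(token), matches(token))
```

With these stand-ins, the token from the code comment (`"And,`) is accepted, a plain word without punctuation is not, and tokens containing a digit or no letter at all are rejected.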
What does this PR do ?
Adds a WFST graph that handles punctuation after the verbalization step, adds parallel normalization, and fixes bugs.
Collection: [TN]