Add extra segmentation style for paragraph_tokenize function #844
add segmentation style
add segmentation style
Paragraph segmentation style
fix segmentation style
Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
Kudos, SonarCloud Quality Gate passed!
Thank you!
According to issue #843, about the `wtpsplit` engine used in the `paragraph_tokenize` function: wtpsplit itself can adapt to the Universal Dependencies, OPUS100, or Ersatz corpus segmentation style in many languages. As of 2023, it supports Thai in the `OPUS100` corpus style.

Since we both agreed on adding a segmentation style as an option, I've added `style` as a new argument of the `paragraph_tokenize` function.

Here is a usage:
1.) `paragraph_tokenize` with the default `paragraph_threshold=0.5` (the current version in PyThaiNLP):

2.) The `paragraph_tokenize` function after adding the `style` argument, with the `newline` and `opus100` styles (as supported in wtpsplit). `paragraph_threshold` is set to 0.5 to show how the output differs between segmentation styles.

`paragraph_tokenize` with `style='newline'`, which is the default style in the current version of PyThaiNLP. In other words, this is the same as case 1.):

`paragraph_tokenize` with `style="opus100"`, the newly added style. The wtpsplit paper mentions that this style is supported for Thai. This lets the tokenizer adapt to the `OPUS100` style for segmentation.

Apart from the usage of the `style` argument, I also wrote a condition to handle the case where the given segmentation style is not one of the available styles. A ValueError will be raised.

This is the error that will be raised if that case occurs:
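The style check and dispatch described above can be sketched roughly as follows. This is not the code from the PR: `split_paragraphs`, `SUPPORTED_STYLES`, and `fake_splitter` are hypothetical names, the error message is illustrative only, and the injected `splitter` stands in for the real wtpsplit call so that the snippet runs without downloading a model.

```python
from typing import Callable, List

# The two styles named in this PR; anything else is rejected.
SUPPORTED_STYLES = ("newline", "opus100")


def split_paragraphs(
    text: str,
    splitter: Callable[..., List[List[str]]],
    paragraph_threshold: float = 0.5,
    style: str = "newline",
) -> List[List[str]]:
    """Hypothetical dispatcher: validate `style`, then call `splitter`."""
    if style not in SUPPORTED_STYLES:
        # Illustrative message; the wording in the PR may differ.
        raise ValueError(
            f"Segmentation style must be one of {SUPPORTED_STYLES}, got {style!r}"
        )
    kwargs = {"paragraph_threshold": paragraph_threshold}
    if style != "newline":
        # Forward only non-default styles, so 'newline' keeps the old behaviour.
        kwargs["style"] = style
    return splitter(text, **kwargs)


def fake_splitter(text: str, **kwargs) -> List[List[str]]:
    """Stand-in for the model: blank lines separate paragraphs, newlines separate sentences.

    It ignores `paragraph_threshold` and `style`; a real engine would use them.
    """
    return [[s for s in block.split("\n") if s] for block in text.split("\n\n")]


if __name__ == "__main__":
    sample = "first sentence\nsecond sentence\n\nnew paragraph"
    print(split_paragraphs(sample, fake_splitter))
    print(split_paragraphs(sample, fake_splitter, style="opus100"))
    try:
        split_paragraphs(sample, fake_splitter, style="ud")
    except ValueError as err:
        print(err)
```

Validating the style before calling the engine means an unsupported value fails fast with a clear message instead of surfacing as an error from inside the model call.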