Add extra segmentation style for paragraph_tokenize function #844

Merged on Oct 6, 2023 (4 commits)

pavaris-pm
Contributor

This follows up on issue #843 about the wtpsplit engine used in the paragraph_tokenize function. wtpsplit can adapt to the Universal Dependencies, OPUS100, or Ersatz corpus segmentation style in many languages. As of 2023, it supports Thai in the OPUS100 corpus style.

Since we both agreed on adding a segmentation style as an option, I've added style as a new argument of the paragraph_tokenize function.

Example usage:

from pythainlp.tokenize import paragraph_tokenize

sent = (
    "(1) บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต"
    +"  มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด"
    +" จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้"
)
  1. paragraph_tokenize with default paragraph_threshold=0.5 (the current version in PyThaiNLP):
# same as paragraph_tokenize(sent, paragraph_threshold=0.5)
paragraph_tokenize(sent)

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#  'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#  'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#  'ณ ที่นี้']]

Here is how paragraph_tokenize behaves after adding the style argument:

  • Two segmentation styles are available: newline and opus100 (as supported in wtpsplit).
  • paragraph_threshold is kept at its default value of 0.5 to show how the segmentation styles differ.
  1. paragraph_tokenize with style='newline', the default style in the current version of PyThaiNLP. In other words, this is the same as the default call shown above:
# this is the same as paragraph_tokenize(sent)
paragraph_tokenize(sent, paragraph_threshold=0.5, style='newline')

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#  'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#  'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#  'ณ ที่นี้']]
  2. paragraph_tokenize with style="opus100", the newly added style. The wtpsplit paper mentions that this style is supported for Thai. It makes the tokenizer adapt to the OPUS100 style for segmentation:
# this changes the segmentation style by adapting it to the OPUS100 corpus style
paragraph_tokenize(sent, paragraph_threshold=0.5, style='opus100')

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
# 'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้']]
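
To make the difference between the two outputs above easier to see, here is a small sketch that compares them. It only reuses the segments printed above; the helper names segment_counts and rejoin are made up for this illustration and are not part of PyThaiNLP.

```python
# Segments copied from the outputs shown above.
head = "(1) "
a = "บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  "
b = "มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด "
c = "จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา "
d = "ณ ที่นี้"

# Nested lists: one inner list per paragraph, one string per segment.
newline_out = [[head], [a, b, c, d]]
opus100_out = [[head], [a, b + c + d]]


def segment_counts(paragraphs):
    """Hypothetical helper: number of segments in each paragraph."""
    return [len(segments) for segments in paragraphs]


def rejoin(paragraphs):
    """Hypothetical helper: concatenate all segments back into one string."""
    return "".join(seg for segments in paragraphs for seg in segments)


print(segment_counts(newline_out))   # [1, 4]
print(segment_counts(opus100_out))   # [1, 2]
print(rejoin(newline_out) == rejoin(opus100_out))  # True
```

In this example both styles keep every character of the input; opus100 simply merges the last three segments of the second paragraph into one.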

Apart from the style argument, I also added a check for the case where the given segmentation style is not one of the available styles. In that case a ValueError is raised.

# here the specified style is not one of the available styles
paragraph_tokenize(sent, paragraph_threshold=0.5, style='newjeans')

This is the error raised in that case:

ValueError: Segmentation style "newjeans" not found. It might be a typo; if not, please consult our document.
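
As a rough illustration of this kind of check, here is a minimal sketch. AVAILABLE_STYLES and check_style are hypothetical names for this example and may not match the code in the PR; only the two style names and the error message are taken from the description above.

```python
# Hypothetical names for illustration; not PyThaiNLP's actual code.
AVAILABLE_STYLES = ("newline", "opus100")


def check_style(style: str) -> str:
    """Return the style unchanged if it is supported, else raise ValueError."""
    if style not in AVAILABLE_STYLES:
        raise ValueError(
            f'Segmentation style "{style}" not found. '
            "It might be a typo; if not, please consult our document."
        )
    return style


check_style("opus100")  # passes silently

try:
    check_style("newjeans")
except ValueError as err:
    print(err)
```

Checking the style before calling the engine means a typo fails early with a clear message instead of surfacing as an engine error later.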

add segmentation style
add segmentation style
fix segmentation style
@pep8speaks

Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 450:7: E126 continuation line over-indented for hanging indent
Line 450:17: W291 trailing whitespace
Line 451:32: W291 trailing whitespace
Line 452:26: E231 missing whitespace after ':'
Line 452:32: E252 missing whitespace around parameter equals
Line 452:33: E252 missing whitespace around parameter equals
Line 453:12: E231 missing whitespace after ':'
Line 453:16: E252 missing whitespace around parameter equals
Line 453:17: E252 missing whitespace around parameter equals
Line 454:5: E121 continuation line under-indented for hanging indent
Line 454:5: E125 continuation line with same indent as next logical line
Line 501:23: E126 continuation line over-indented for hanging indent
Line 506:21: E126 continuation line over-indented for hanging indent

Line 33:14: E231 missing whitespace after ':'
Line 33:18: E252 missing whitespace around parameter equals
Line 33:19: E252 missing whitespace around parameter equals
Line 42:17: E225 missing whitespace around operator
Line 43:11: E111 indentation is not a multiple of four
Line 49:19: E225 missing whitespace around operator
Line 50:11: E111 indentation is not a multiple of four
Line 58:11: E111 indentation is not a multiple of four
Line 59:13: E121 continuation line under-indented for hanging indent
Line 63:1: E302 expected 2 blank lines, found 1
Line 64:13: E231 missing whitespace after ':'
Line 64:18: W291 trailing whitespace
Line 65:13: E231 missing whitespace after ':'
Line 65:17: E252 missing whitespace around parameter equals
Line 65:18: E252 missing whitespace around parameter equals
Line 65:25: W291 trailing whitespace
Line 66:17: E231 missing whitespace after ':'
Line 66:21: E252 missing whitespace around parameter equals
Line 66:22: E252 missing whitespace around parameter equals
Line 66:33: W291 trailing whitespace
Line 67:28: E231 missing whitespace after ':'
Line 67:34: E252 missing whitespace around parameter equals
Line 67:35: E252 missing whitespace around parameter equals
Line 68:14: E231 missing whitespace after ':'
Line 68:18: E252 missing whitespace around parameter equals
Line 68:19: E252 missing whitespace around parameter equals
Line 69:5: E121 continuation line under-indented for hanging indent
Line 69:5: E125 continuation line with same indent as next logical line
Line 69:6: E225 missing whitespace around operator
Line 80:15: E126 continuation line over-indented for hanging indent
Line 80:20: W291 trailing whitespace
Line 85:13: E126 continuation line over-indented for hanging indent
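
For readers unfamiliar with the most frequent codes above (E231 and E252), here is a sketch of a signature that would satisfy them. The flagged lines themselves are not shown in this conversation, so paragraph_tokenize_stub is a hypothetical stand-in that only reuses the argument names and defaults from the PR description.

```python
import inspect


# Hypothetical stub for illustration only; not PyThaiNLP's actual function.
# A form like `def f(text:str, style:str="newline")` triggers E231 and E252.
def paragraph_tokenize_stub(
    text: str,
    paragraph_threshold: float = 0.5,
    style: str = "newline",
) -> list:
    """Accept the arguments used in the examples above and return []."""
    return []


# Spacing is purely a style matter; the defaults are the same either way.
signature = inspect.signature(paragraph_tokenize_stub)
defaults = {
    name: param.default
    for name, param in signature.parameters.items()
    if param.default is not inspect.Parameter.empty
}
print(defaults)  # {'paragraph_threshold': 0.5, 'style': 'newline'}
```

E231 asks for a space after the colon in an annotation, and E252 asks for spaces around the equals sign when a parameter has both an annotation and a default.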

@sonarcloud

sonarcloud bot commented Oct 6, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

@coveralls

Coverage Status

coverage: 0.0% (-77.3%) from 77.263% when pulling a0053c1 on pavaris-pm:dev into fcf567c on PyThaiNLP:dev.

@wannaphong added the hacktoberfest-accepted label Oct 6, 2023

@wannaphong (Member) left a comment

Thank you!

@wannaphong added the Hacktoberfest label Oct 6, 2023
@wannaphong linked an issue Oct 6, 2023 that may be closed by this pull request
@wannaphong merged commit 73b17e3 into PyThaiNLP:dev Oct 6, 2023
7 of 14 checks passed
@wannaphong added this to the 4.1 milestone Oct 6, 2023
Linked issue (may be closed by this pull request): Adding segmentation style for PyThaiNLP paragraph_tokenize function
4 participants