Add extra segmentation style for `paragraph_tokenize` function #844

pavaris-pm · 2023-10-06T13:45:45Z

Following up on issue #843 about the wtpsplit engine used in the paragraph_tokenize function: wtpsplit can adapt to the Universal Dependencies, OPUS100, or Ersatz corpus segmentation style in many languages. As of 2023, it supports Thai in the OPUS100 corpus style.

Since we both agreed on adding the segmentation style as an option, I've added style as a new argument of the paragraph_tokenize function.

Usage:

from pythainlp.tokenize import paragraph_tokenize

sent = (
    "(1) บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต"
    + "  มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด"
    + " จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้"
)

paragraph_tokenize with default paragraph_threshold=0.5 (the current version in PyThaiNLP):

# same as paragraph_tokenize(sent, paragraph_threshold=0.5)
paragraph_tokenize(sent)

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#  'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#  'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#  'ณ ที่นี้']]

Here is the `paragraph_tokenize` function after adding the `style` argument.

Two segmentation styles are available: newline and opus100 (as supported in wtpsplit).
Note that paragraph_threshold is kept at its default value of 0.5 in the examples, so that any difference in the output comes from the segmentation style alone.

paragraph_tokenize with style='newline', which is the default style in the current version of PyThaiNLP. In other words, this is the same as the first example above:

# this is the same as paragraph_tokenize(sent)
paragraph_tokenize(sent, paragraph_threshold=0.5, style='newline')

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
#  'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
#  'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
#  'ณ ที่นี้']]

paragraph_tokenize with style='opus100', the newly added style. The wtpsplit paper mentions that this style is supported for Thai. It lets the tokenizer adapt to the OPUS100 style of segmentation:

# this changes the segmentation style by adapting it to the OPUS100 corpus style
paragraph_tokenize(sent, paragraph_threshold=0.5, style='opus100')

# output
# [['(1) '],
# ['บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต  ',
# 'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้']]

Apart from the style argument itself, I also added a check for the case where the given segmentation style is not one of the available styles. In that case a ValueError is raised.

# here the specified style is not one of the available styles
paragraph_tokenize(sent, paragraph_threshold=0.5, style='newjeans')

This is the error raised in that case:

ValueError: Segmentation style "newjeans" not found. It might be a typo; if not, please consult our document.
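
For readers who want to see the shape of such a check, here is a minimal sketch. It is not the code from this PR: the names SUPPORTED_STYLES and check_style are hypothetical and only illustrate one way to reject an unknown style with the message shown above.

```python
# Hypothetical sketch, not the actual PyThaiNLP implementation.
# SUPPORTED_STYLES and check_style are illustrative names only.
SUPPORTED_STYLES = ("newline", "opus100")


def check_style(style: str) -> str:
    """Return the style unchanged if it is supported, else raise ValueError."""
    if style not in SUPPORTED_STYLES:
        raise ValueError(
            f'Segmentation style "{style}" not found. '
            "It might be a typo; if not, please consult our document."
        )
    return style


# A supported style passes through; an unknown one raises ValueError.
check_style("opus100")
try:
    check_style("newjeans")
except ValueError as err:
    print(f"ValueError: {err}")
```

Validating the style before any model call means a typo fails fast, without loading the wtpsplit model first.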

pep8speaks · 2023-10-06T13:45:51Z

Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file pythainlp/tokenize/core.py:

Line 450:7: E126 continuation line over-indented for hanging indent
Line 450:17: W291 trailing whitespace
Line 451:32: W291 trailing whitespace
Line 452:26: E231 missing whitespace after ':'
Line 452:32: E252 missing whitespace around parameter equals
Line 452:33: E252 missing whitespace around parameter equals
Line 453:12: E231 missing whitespace after ':'
Line 453:16: E252 missing whitespace around parameter equals
Line 453:17: E252 missing whitespace around parameter equals
Line 454:5: E121 continuation line under-indented for hanging indent
Line 454:5: E125 continuation line with same indent as next logical line
Line 501:23: E126 continuation line over-indented for hanging indent
Line 506:21: E126 continuation line over-indented for hanging indent

In the file pythainlp/tokenize/wtsplit.py:

Line 33:14: E231 missing whitespace after ':'
Line 33:18: E252 missing whitespace around parameter equals
Line 33:19: E252 missing whitespace around parameter equals
Line 42:17: E225 missing whitespace around operator
Line 43:11: E111 indentation is not a multiple of four
Line 49:19: E225 missing whitespace around operator
Line 50:11: E111 indentation is not a multiple of four
Line 58:11: E111 indentation is not a multiple of four
Line 59:13: E121 continuation line under-indented for hanging indent
Line 63:1: E302 expected 2 blank lines, found 1
Line 64:13: E231 missing whitespace after ':'
Line 64:18: W291 trailing whitespace
Line 65:13: E231 missing whitespace after ':'
Line 65:17: E252 missing whitespace around parameter equals
Line 65:18: E252 missing whitespace around parameter equals
Line 65:25: W291 trailing whitespace
Line 66:17: E231 missing whitespace after ':'
Line 66:21: E252 missing whitespace around parameter equals
Line 66:22: E252 missing whitespace around parameter equals
Line 66:33: W291 trailing whitespace
Line 67:28: E231 missing whitespace after ':'
Line 67:34: E252 missing whitespace around parameter equals
Line 67:35: E252 missing whitespace around parameter equals
Line 68:14: E231 missing whitespace after ':'
Line 68:18: E252 missing whitespace around parameter equals
Line 68:19: E252 missing whitespace around parameter equals
Line 69:5: E121 continuation line under-indented for hanging indent
Line 69:5: E125 continuation line with same indent as next logical line
Line 69:6: E225 missing whitespace around operator
Line 80:15: E126 continuation line over-indented for hanging indent
Line 80:20: W291 trailing whitespace
Line 85:13: E126 continuation line over-indented for hanging indent

sonarcloud · 2023-10-06T13:47:45Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

coveralls · 2023-10-06T14:05:24Z

coverage: 0.0% (-77.3%) from 77.263% when pulling a0053c1 on pavaris-pm:dev into fcf567c on PyThaiNLP:dev.

wannaphong

Thank you!

pavaris-pm added 4 commits October 6, 2023 19:03

Update wtsplit.py

db6272f

add segmentation style

Update core.py

9bfd754

add segmentation style

Merge pull request #1 from pavaris-pm/paragraph-segmentation-style

4898960

Paragraph segmentation style

Update wtsplit.py

a0053c1

fix segmentation style

pavaris-pm mentioned this pull request Oct 6, 2023

Adding segmentation style for PyThaiNLP paragraph_tokenize function #843

Closed

wannaphong added the hacktoberfest-accepted label Oct 6, 2023

wannaphong approved these changes Oct 6, 2023

wannaphong added the Hacktoberfest label Oct 6, 2023

wannaphong linked an issue Oct 6, 2023 that may be closed by this pull request

Adding segmentation style for PyThaiNLP paragraph_tokenize function #843

Closed

wannaphong merged commit 73b17e3 into PyThaiNLP:dev Oct 6, 2023
7 of 14 checks passed

wannaphong added this to the 4.1 milestone Oct 6, 2023

wannaphong mentioned this pull request Feb 5, 2024

PyThaiNLP 5.0 Change Log #788

Closed
