Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated crfcut.py #905

Merged
merged 5 commits into from Apr 3, 2024
Merged

Conversation

varunkatiyar819
Copy link
Contributor

@varunkatiyar819 varunkatiyar819 commented Apr 1, 2024

Crfcut creating issues for split using terminal punctuation commonly '.' (full stop) which should be treated as end of the sentence, Modified the function such that it should split using terminal punctuations and avoid any kind of empty strings.

What does this changes

Brief summary of the changes

What was wrong

Description of what was the root cause of the issue.

How this fixes it

Description of how the changes fix the issue.

Fixes #904

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

Crfcut creating issues for split using terminal punctuation commonly '.' (full stop) which should be treated as end of the sentence, Modified the function such that it should split using terminal punctuations and avoid any kind of empty strings.
@pep8speaks
Copy link

pep8speaks commented Apr 1, 2024

Hello @varunkatiyar819! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 207:35: E226 missing whitespace around arithmetic operator

Comment last updated at 2024-04-03 07:51:42 UTC

Formatting checks handled
@wannaphong wannaphong linked an issue Apr 1, 2024 that may be closed by this pull request
@coveralls
Copy link

coveralls commented Apr 1, 2024

Coverage Status

coverage: 79.063% (-0.04%) from 79.103%
when pulling d1b64a7 on varunkatiyar819:varunkatiyar819-patch-1
into fa0a2ca on PyThaiNLP:dev.

Removed redundant brackets.
@bact bact added the bug bugs in the library label Apr 1, 2024
@bact bact added this to the 5.0 milestone Apr 1, 2024
@bact bact added this to In progress in PyThaiNLP Apr 1, 2024
Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It look great!

This pull request change crfcut.

Before

>>> sent_tokenize("สวัสดีชาวโลก. \n โลกนี้เต็มไปด้วยความลึกลับ")
['สวัสดีชาวโลก. \n โลกนี้เต็มไปด้วยความลึกลับ']
>>> sent_tokenize("สวัสดีชาวโลก! \n โลกนี้เต็มไปด้วยความลึกลับ")
['สวัสดีชาวโลก! \n โลกนี้เต็มไปด้วยความลึกลับ']

After

>>> sent_tokenize("สวัสดีชาวโลก. \n โลกนี้เต็มไปด้วยความลึกลับ")
['สวัสดีชาวโลก.', ' \n โลกนี้เต็มไปด้วยความลึกลับ']
>>> sent_tokenize("สวัสดีชาวโลก! \n โลกนี้เต็มไปด้วยความลึกลับ")
['สวัสดีชาวโลก!', ' \n โลกนี้เต็มไปด้วยความลึกลับ']

Thank you!

Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It look the test is break. Can you fix? @varunkatiyar819

['ฉันไปโรงเรียน เธอไปโรงพยาบาล'] != ['ฉันไปโรงเรียน ', 'เธอไปโรงพยาบาล']

Modified the logic for splitting of sentences due to empty strings or spaces.
Copy link

sonarcloud bot commented Apr 3, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@varunkatiyar819
Copy link
Contributor Author

It look the test is break. Can you fix? @varunkatiyar819

['ฉันไปโรงเรียน เธอไปโรงพยาบาล'] != ['ฉันไปโรงเรียน ', 'เธอไปโรงพยาบาล']

Logic for the Code is been modified and tested as well. Please do have a check. Thanks

Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@wannaphong wannaphong merged commit 71ef3e3 into PyThaiNLP:dev Apr 3, 2024
8 of 12 checks passed
PyThaiNLP automation moved this from In progress to Done Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs in the library
Projects
PyThaiNLP
  
Done
Development

Successfully merging this pull request may close these issues.

Sentence Tokenizer Split sentence on Terminal Puntuation.
5 participants