pythainlp.word_tokenize: problem tokenizing long continuous sentences with no spaces [newmm] #241

espressofx · 2019-07-26T09:04:09Z

Describe the bug
Execution takes a very long time on long continuous sentences that contain no spaces.
For example, this sentence:

ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า

If the sentence is broken up with spaces, however, there is no problem. For example:

ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า ด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้าด้านหน้า

Desktop (please complete the following information):

OS: Ubuntu 18.04
Python 3.6
PyThaiNLP 2.0.5

espressofx · 2019-07-26T10:20:52Z

In a notebook, I have to interrupt the kernel.

bact · 2019-09-11T03:55:51Z

We may need a rule to stop generating word graphs once a threshold is reached.

I think the code that builds the possible segmentation paths needs a defined limit on how far it goes, beyond which it is forced to stop. Otherwise it keeps generating paths until it becomes very slow.
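To illustrate why the candidate paths grow so quickly, here is a minimal sketch (not the newmm code) that counts the possible segmentations of the repeated text from the report. It assumes a hypothetical three-word dictionary in which ด้านหน้า can be read either as one word or as ด้าน + หน้า; `TOY_WORDS` and `count_segmentations` are names made up for this illustration.

```python
# Toy dictionary (an assumption for illustration, not the PyThaiNLP word list).
PART_A = "ด้าน"
PART_B = "หน้า"
TOY_WORDS = {PART_A, PART_B, PART_A + PART_B}


def count_segmentations(text, words):
    """Count the ways to split `text` entirely into words from `words`."""
    # ways[i] = number of segmentations of the prefix text[:i]
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for word in words:
            start = end - len(word)
            # A word can end at `end` only if it matches text[start:end]
            # and the prefix before it is itself segmentable.
            if start >= 0 and ways[start] and text.startswith(word, start):
                ways[end] += ways[start]
    return ways[-1]


if __name__ == "__main__":
    for repeats in (1, 5, 10, 20):
        text = (PART_A + PART_B) * repeats
        print(repeats, count_segmentations(text, TOY_WORDS))
```

With this toy dictionary every repetition doubles the count, so n repetitions give 2^n segmentations. Counting them is cheap, but enumerating each path explicitly is not, which is why some cap on the graph size seems necessary.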

wannaphong · 2019-09-29T05:59:34Z

Colab : https://colab.research.google.com/drive/1MOZ7GZ9x_P75wUQtz0C2Fl52GPRhJO4g

p16i · 2019-10-06T08:11:09Z

I guess this result comes from newmm? Should we label the issue with newmm?

bact · 2019-10-12T07:45:09Z

Proposal for the fix:

- Break the input text into smaller chunks (100±25 characters).
  - The chunking strategy should take invalid token separation points into account, so that a chunk boundary does not split a token (the token can be either a word or a syllable).
- Tokenize each chunk and combine the tokens.

See the code in pythainlp/tokenize/newmm.py in #302.

Note that this may make tokenization slower.

The current chunk size is 100 characters
(400 was also tried, but it was too slow).

The current window for scanning possible break points between chunks is 25+25 = 50 characters
(the longest word in the dictionary is 70 characters).
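The proposal can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in #302: `tokenize_in_chunks`, `toy_tokenize`, and `TOY_WORDS` are hypothetical names, the toy greedy tokenizer stands in for newmm, and the break-point heuristic (prefer a space, otherwise cut just before the longest token found in the scan window) is only one simple choice.

```python
from typing import Callable, List

# Toy dictionary and tokenizer: stand-ins for illustration only, not newmm.
TOY_WORDS = {"ด้าน", "หน้า", "ด้าน" + "หน้า"}


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer; unknown characters become 1-char tokens."""
    max_len = max(len(word) for word in TOY_WORDS)
    tokens = []
    pos = 0
    while pos < len(text):
        length = 1  # fallback: emit a single unknown character
        for candidate in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + candidate] in TOY_WORDS:
                length = candidate
                break
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


def tokenize_in_chunks(
    text: str,
    tokenize: Callable[[str], List[str]],
    chunk_size: int = 100,
    window: int = 25,
) -> List[str]:
    """Tokenize `text` chunk by chunk.

    `tokenize` must return tokens that concatenate back to its input.
    """
    if not 0 < window < chunk_size:
        raise ValueError("window must be positive and smaller than chunk_size")
    tokens: List[str] = []
    while len(text) > chunk_size + window:
        # Scan window: chunk_size - window .. chunk_size + window
        scan_start = chunk_size - window
        sample = text[scan_start:chunk_size + window]
        space_idx = sample.rfind(" ")
        if space_idx >= 0:
            # Easiest case: cut right after a space inside the scan window.
            cut = scan_start + space_idx + 1
        else:
            # Otherwise tokenize only the window and cut just before its
            # longest token, which is least likely to be a broken fragment.
            sample_tokens = tokenize(sample)
            longest = max(range(len(sample_tokens)),
                          key=lambda i: len(sample_tokens[i]))
            cut = scan_start + sum(len(t) for t in sample_tokens[:longest])
        tokens.extend(tokenize(text[:cut]))
        text = text[cut:]
    tokens.extend(tokenize(text))  # whatever is left fits in one chunk
    return tokens


if __name__ == "__main__":
    long_text = ("ด้าน" + "หน้า") * 50
    print(len(tokenize_in_chunks(long_text, toy_tokenize)))
```

Because each call to `tokenize` sees at most `chunk_size + window` characters, the ambiguity inside one call stays bounded, at the cost of extra tokenization passes over the scan windows.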

Please comment

bact · 2019-11-12T17:30:59Z

A new segmentation "engine" has been proposed for this: you can now use the newmm-safe engine to avoid long waits on text like the example above. It is available now in the fix-newmm-longtext branch and will soon be in the dev branch.

pythainlp.tokenize.word_tokenize("ด้านหน้าด้านหน้าด้านหน้า", engine="newmm-safe")

bact · 2019-11-15T09:03:43Z

Fixed with #302. This will be available in the 2.1dev8 release.

wannaphong added the bug (bugs in the library) label Jul 26, 2019

bact added this to the 2.1 milestone Oct 5, 2019

bact self-assigned this Oct 8, 2019

bact mentioned this issue Oct 12, 2019

"newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points #302

Merged

bact closed this as completed Nov 15, 2019

bact mentioned this issue Dec 6, 2019

Some words/sentences get stuck in a loop at _bfs_paths_graph #326

Closed

bact mentioned this issue Dec 13, 2019

Add graph size limit in _onecut() to avoid long wait for ambiguous text #333

Merged
