Improved custom dict tokenization #77

smeeklai · 2018-03-01T08:52:51Z

Hi Tontan,

I tidied up a few things that could be fixed and made a small improvement to tokenization with a custom dict.

Changes:

1. Bumped the version number from 1.5 to 1.6.
2. Removed f.close(), because a file opened in a with statement is closed automatically when the with block exits.
3. The old code was slow because every call to dict_word_tokenize() re-read the file and/or rebuilt the trie from the list of words. That is unnecessary, and it made tokenizing tens of times slower than using the default dict.
So I edited __init__.py in the tokenize folder and added a Tokenizer class. The user instantiates this object once, with or without a custom dict, and then calls word_tokenize() to tokenize. For now I have only done newmm, because I did not want to change too much of the old code: this changes the behavior of the tokenizer quite a lot, since previously you could just import the function and call it. I would like to hear your opinion first, and then I can help with the rest.
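
A minimal sketch of the build-once idea in item 3, for illustration only. It is not the code in this pull request: the Trie class and the greedy longest-match loop are stand-ins (newmm uses a different matching strategy), and only the names Tokenizer and word_tokenize come from the description above.

```python
class Trie:
    """Minimal prefix tree; a hypothetical stand-in for the library's trie."""

    END = "<end>"  # multi-character key, so it never clashes with a single character

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self.END] = True  # marks the end of a complete word

    def longest_prefix(self, text, start):
        """Return the longest dictionary word beginning at text[start], or ''."""
        node, best = self.root, ""
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self.END in node:
                best = text[start:i + 1]
        return best


class Tokenizer:
    """Builds the trie once in __init__ and reuses it for every call."""

    def __init__(self, custom_dict):
        self.trie = Trie(custom_dict)  # the expensive step, done a single time

    def word_tokenize(self, text):
        # Greedy longest matching (an assumption for this sketch, not newmm).
        tokens, i = [], 0
        while i < len(text):
            word = self.trie.longest_prefix(text, i) or text[i]
            tokens.append(word)
            i += len(word)
        return tokens


tokenizer = Tokenizer(["แมว", "กิน", "ปลา"])
tokens = tokenizer.word_tokenize("แมวกินปลา")
```

Repeated calls to tokenizer.word_tokenize() reuse the same trie, which is the cost the old per-call rebuild paid every time.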

Please review it when you have time. Thank you. I have never made a pull request before, so I apologize if I did anything wrong. I have also not tested this with Python 2 yet.

coveralls · 2018-03-01T09:07:51Z

Coverage decreased (-1.7%) to 60.913% when pulling 028cb85 on smeeklai:improved_custom_dict_tokenization into ee4b1a1 on PyThaiNLP:dev.

coveralls · 2018-03-01T09:07:51Z

Coverage decreased (-0.3%) to 62.31% when pulling 6c8fa67 on smeeklai:improved_custom_dict_tokenization into 762fd0f on PyThaiNLP:dev.

wannaphong · 2018-03-01T11:43:18Z

My opinion: regarding item 3, would it be possible to do this without creating a new Tokenizer class?

smeeklai · 2018-03-03T08:05:04Z

@wannaphongcom I looked for a workaround and found a way that avoids creating a new class. It could serve as a temporary solution for now. In the long run, though, I would still favor creating a class: by design, almost every engine needs a trie as its word source for tokenization, so a class with a trie property is probably the better long-term solution. It would also let us reduce everything to a single word_tokenize function.

If you are OK with this workaround, I will help update the remaining engines to be compatible with it.

Thank you.

wannaphong · 2018-03-03T12:36:37Z

Thank you. I tried it, and I am OK with this workaround 👍

smeeklai · 2018-03-07T16:54:02Z

@wannaphongcom Tontan, the approach I proposed last time turns out to have an issue. Suppose pythainlp is used in a Jupyter notebook and I then edit or add words in the __filename__.txt file. The changes are not reloaded, because custom_dict was already initialized at import time, so the only way to pick them up is to restart the notebook.
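
The staleness problem can be reproduced without the library. The sketch below is only an illustration: load_words, CACHED_WORDS, and custom_dict.txt are hypothetical names, and the module-level assignment imitates a value that is initialized once at import time.

```python
import os
import tempfile


def load_words(path):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Create a throwaway dictionary file for the demonstration.
tmp_dir = tempfile.mkdtemp()
dict_path = os.path.join(tmp_dir, "custom_dict.txt")
with open(dict_path, "w", encoding="utf-8") as f:
    f.write("แมว\n")

# Imitates a global that is filled in once, when the module is imported.
CACHED_WORDS = load_words(dict_path)

# Later, the user adds a word to the file from the notebook.
with open(dict_path, "a", encoding="utf-8") as f:
    f.write("ปลา\n")

stale = CACHED_WORDS           # still reflects the file as it was at "import"
fresh = load_words(dict_path)  # an explicit reload sees the new word
```

Here stale keeps only the original word, while fresh contains both, which is why a cached global forces a notebook restart unless an explicit reload step exists.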

So in the end I think this latest approach is the best one if we do not want to create a class. I created a function, create_custom_dict_trie, that builds your own trie to pass into word_tokenize(), which now takes a new argument named custom_dict_trie to receive the custom trie.
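
A rough sketch of how this calling pattern might look. The names create_custom_dict_trie, word_tokenize, and custom_dict_trie come from the comment above; the bodies, the nested-dict trie, the greedy longest-match loop, and the empty-trie fallback are assumptions made for illustration and are not the pull request's implementation.

```python
_END = "<end>"  # multi-character key, so it never clashes with a single character


def create_custom_dict_trie(words):
    """Build a nested-dict trie from an iterable of words (hypothetical body)."""
    trie = {}
    for word in words:
        word = word.strip()
        if not word:
            continue
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True  # marks the end of a complete word
    return trie


def word_tokenize(text, custom_dict_trie=None):
    """Greedy longest-match tokenization against the supplied trie.

    Assumption: the real function would fall back to the built-in dictionary;
    this sketch falls back to an empty trie, i.e. single characters.
    """
    trie = custom_dict_trie if custom_dict_trie is not None else {}
    tokens, i = [], 0
    while i < len(text):
        node, end = trie, i + 1  # default: emit one character
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if _END in node:
                end = j + 1  # remember the longest match so far
        tokens.append(text[i:end])
        i = end
    return tokens


# Build once, reuse for many calls.
trie = create_custom_dict_trie(["แมว", "กิน"])
before = word_tokenize("แมวกินปลา", custom_dict_trie=trie)

# After the word list changes, rebuild the trie explicitly; no restart needed.
trie = create_custom_dict_trie(["แมว", "กิน", "ปลา"])
after = word_tokenize("แมวกินปลา", custom_dict_trie=trie)
```

Because the caller owns the trie, rebuilding it after the word list changes is an explicit step, which avoids the import-time caching issue described earlier.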

Please take a look at the modified code and let me know what you think.

wannaphong · 2018-03-08T16:11:32Z

@smeeklai If you look at

dict_word_tokenize(text,file='',engine="newmm",data=[''],data_type="file")

you will see that there are data and data_type parameters. I plan to make this API also able to tokenize using a word list passed in directly, but I have not had time to continue that work. Using a list instead of a file should resolve the issue above.
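
One way the data and data_type parameters could select the word source is sketched below. This is an assumption, not the library's behavior: load_custom_words is a hypothetical helper, and "list" as a data_type value is a guess (only "file" appears in the signature above).

```python
def load_custom_words(file="", data=None, data_type="file"):
    """Return custom words from a file or from an in-memory list.

    Hypothetical helper: data_type="file" reads one word per line from `file`;
    data_type="list" (an assumed value) uses `data` directly.
    """
    if data_type == "file":
        with open(file, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    if data_type == "list":
        return [word for word in (data or []) if word]
    raise ValueError("unsupported data_type: %r" % (data_type,))


# Passing a list avoids touching the file system, so edits made in a
# notebook take effect on the next call.
words = load_custom_words(data=["แมว", "", "ปลา"], data_type="list")
```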

wannaphong

👍

smeeklai added 2 commits March 1, 2018 15:38

Improved custom dict tokenization

87c3b2c

made dict_word_tokenize() to work with new edited newmm.py

6c8fa67

declared global custom_dict_trie and vocabs to be used with dict_word…

504a0f3

…_tokenize()

wannaphong self-requested a review March 3, 2018 12:36

smeeklai added 13 commits March 6, 2018 22:16

Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into imp…

c93a436

…roved_custom_dict_tokenization

temp

ebd26a9

added create_custom_dict_trie() and modified word_tokenize() to use n…

38ca4bc

…ew method of using custom dict to tokenize words

removed duplicated and nested return statement

9113cc4

chaged word sengment function in syllable_tokenize

cff5a1a

fixed mistakes in syllable_tokenize() and wordcutpy tokenizer

26246ec

took nested return out from engine == 'wordcutpy'

65aacef

took nested return out from engine == 'wordcutpy'

f557544

fixed deeply nested control flow statements

f70deff

fixed deeply nested control flow statements

8533827

improved Coverage

ddbefa9

improved Coverage

1ec02b6

improved Coverage

816061c

smeeklai added 2 commits March 11, 2018 17:25

moved new tokenizing method to dict_work_tokenize instead

f4f66b8

removed unused variables

028cb85

wannaphong approved these changes Mar 12, 2018

View reviewed changes

wannaphong merged commit 26ec714 into PyThaiNLP:dev Mar 12, 2018
