Moonshile/ChineseWordSegmentation

ChineseWordSegmentation

A Chinese word segmentation algorithm that needs no corpus

Usage

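from wordseg import WordSegment
doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
# Build the segmenter from the document itself; no external corpus is needed.
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
# Segment a sentence using the words found in doc.
ws.segSentence(doc)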

This produces the following words:

十四 是 十四 四十 是 四十 , 十四 不是 四十 , 四十 不是 十四

In practice, doc should be a long document string; short inputs like the one above give poor results. With a long document, set min_aggregation far greater than 1 (for example 50) and min_entropy greater than 0.5 (for example 1.5).
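
To give a feel for what the two thresholds might measure, here is a minimal sketch. It assumes that min_entropy and min_aggregation correspond to the neighbor entropy and internal cohesion measures described in Matrix67's article; this README does not confirm that. The function names, the natural logarithm, and the exact formulas are illustrative assumptions, not the wordseg implementation.

```python
import math
from collections import Counter


def _entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def neighbor_entropy(doc, word):
    """Assumed 'entropy' score: the smaller of the left and right neighbor entropies."""
    left, right = Counter(), Counter()
    start = doc.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:
            left[doc[start - 1]] += 1   # character just before this occurrence
        if end < len(doc):
            right[doc[end]] += 1        # character just after this occurrence
        start = doc.find(word, start + 1)
    return min(_entropy(left), _entropy(right))


def aggregation(doc, word):
    """Assumed 'aggregation' score: min over splits of P(word) / (P(left) * P(right))."""
    if len(word) < 2 or word not in doc:
        return 0.0  # undefined for single characters or absent words in this sketch

    def freq(s):
        # Relative frequency of s among all substrings of the same length.
        positions = len(doc) - len(s) + 1
        hits = sum(1 for i in range(positions) if doc.startswith(s, i))
        return hits / positions

    return min(freq(word) / (freq(word[:k]) * freq(word[k:]))
               for k in range(1, len(word)))


sample = "xabyabz"  # hypothetical toy input
print(neighbor_entropy(sample, "ab"), aggregation(sample, "ab"))
```

Under this reading, a candidate with varied neighbors and parts that rarely occur apart scores high on both measures, which would explain why longer documents tolerate stricter thresholds.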

Both the input and the output of this function should be unicode strings (decoded text, not encoded bytes).
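
As a small sketch of the decoding step, assuming Python 3 and UTF-8 encoded input (the variable names are illustrative), bytes can be decoded before they are passed in and the results encoded again only when written out:

```python
# Bytes as they might arrive from a file or socket (UTF-8 is assumed here).
raw = '十四是十四'.encode('utf-8')

# Decode to a unicode string before handing the text to the segmenter.
text = raw.decode('utf-8')

# Encode again only at the output boundary, e.g. when writing to a file.
encoded_again = text.encode('utf-8')
```

In Python 3 the u'' prefix used above is optional, since every str is already unicode.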

WordSegment.segSentence takes an optional argument method, whose value is one of WordSegment.L, WordSegment.S, or WordSegment.ALL:

  • WordSegment.L: if a long word that combines several shorter words is found, return only the long word.
  • WordSegment.S: return only the shorter words.
  • WordSegment.ALL: return both the long word and the shorter words.
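
The difference between the three modes can be illustrated with a toy sketch. This is not the wordseg code: the segment function, the LONG/SHORT/ALL constants, and the hand-written word set are hypothetical stand-ins that show one plausible reading of the descriptions above.

```python
LONG, SHORT, ALL = "L", "S", "ALL"  # hypothetical stand-ins for the three modes


def segment(sentence, words, max_word_len, method=ALL):
    """Greedy left-to-right scan over a fixed set of known words."""
    result = []
    i = 0
    while i < len(sentence):
        # Known words of length >= 2 that start at position i, shortest first.
        matches = [sentence[i:i + n]
                   for n in range(2, max_word_len + 1)
                   if i + n <= len(sentence) and sentence[i:i + n] in words]
        if not matches:
            result.append(sentence[i])   # fall back to a single character
            i += 1
        elif method == LONG:
            result.append(matches[-1])   # keep only the longest match
            i += len(matches[-1])
        elif method == SHORT:
            result.append(matches[0])    # keep only the shortest match
            i += len(matches[0])
        else:
            result.extend(matches)       # keep every match, then advance
            i += len(matches[0])         # by the shortest so inner words are seen
    return result


known = {"中文", "分词", "中文分词"}  # hypothetical word set
for mode in (LONG, SHORT, ALL):
    print(mode, segment("中文分词算法", known, 4, mode))
```

With this word set, the long mode keeps 中文分词 as one unit, the short mode splits it into 中文 and 分词, and the combined mode reports all three.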

Reference

Thanks to Matrix67's article.
