Although pkuseg claims to outperform jieba and thulac on some gold-standard test datasets, we strongly recommend **NOT** using the domain-specific news model in place of the default mixed-domain model. The news model performs extremely badly on English words such as people's names (though these are more often transliterated in our People's Daily news corpus), brand names (Apple, Microsoft), and abbreviations of diseases and viruses (H1N1, SARS, MERS), often cutting them into fragments. We would prefer to preserve such foreign names (CD, DVD, BP机) rather than simply remove everything that is not a Chinese character.

In [1]:
import jieba
import pkuseg
import thulac
import re

pku = pkuseg.pkuseg()
pku_news = pkuseg.pkuseg(model_name='news')
thu = thulac.thulac()

Model loaded succeed


In [9]:
sentences = ["Alex James比起CD更喜欢听MP3。",
    "Microsoft和Apple的股价最近分别上涨了0.1和0.01个百分点。",
    "SARS和甲型H1N1流感都一度对中国社会造成很大冲击。",
    "Lily，发生什么事了？Hello？你还好吗？",
    "她要去看Justin Bieber和Taylor Swift的演唱会。",
    "QQ和微信哪一个在学生当中更流行？",
    "大哥大和BP机都早就过时了！现在我们都用iPhone。",
    "这部电影是VCD还是DVD播放？",
    "这次的MERS病毒大流行听起来比H2N2，H5N6和H7N9都还要吓人！",
    "自从2022年俄乌冲突以来，俄乌关系引发了全世界的广泛关注……"
]

In [10]:
for sentence in sentences:
    print('jieba   ', jieba.lcut(sentence))
    print('pku-seg ', pku.cut(sentence))
    print('pku-news', pku_news.cut(sentence))
    print('thu-lac', [pair[0] for pair in thu.cut(sentence)])
    print()

jieba    ['Alex', ' ', 'James', '比起', 'CD', '更', '喜欢', '听', 'MP3', '。']
pku-seg  ['Alex', 'James', '比起', 'CD', '更', '喜欢', '听', 'MP3', '。']
pku-news ['Alex', 'Ja', 'm', 'e', 's', '比', '起', 'CD', '更', '喜欢', '听', 'MP3', '。']
thu-lac ['Alex', ' ', 'James', '比', '起', 'CD', '更', '喜欢', '听', 'MP3', '。']

jieba    ['Microsoft', '和', 'Apple', '的', '股价', '最近', '分别', '上涨', '了', '0.1', '和', '0.01', '个', '百分点', '。']
pku-seg  ['Microsoft', '和', 'Apple', '的', '股价', '最近', '分别', '上涨', '了', '0.1', '和', '0.01', '个', '百分点', '。']
pku-news ['Mi', 'cr', 'o', 'so', 'ft', '和', 'Apple', '的', '股价', '最近', '分别', '上涨', '了', '0.1', '和', '0', '.', '01个百分点', '。']
thu-lac ['Microsoft', '和', 'Apple', '的', '股', '价', '最近', '分别', '上涨', '了', '0', '.', '1', '和', '0', '.', '01', '个', '百分点', '。']

jieba    ['SARS', '和', '甲型', 'H1N1', '流感', '都', '一度', '对', '中国', '社会', '造成', '很大', '冲击', '。']
pku-seg  ['SARS', '和', '甲型', 'H1N1', '流感', '都', '一度', '对', '中国', '社会', '造成', '很', '大', '冲击', '。']
pku-news ['SARS', '和', '甲型', 'H1', 'N', '1

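One rough way to quantify the fragmentation visible in the outputs above is to list the Latin-script runs of a sentence that no longer survive inside any single token. The sketch below is only an illustration: `broken_latin_runs` is a hypothetical helper, not part of jieba, pkuseg or thulac, and its definition of a "foreign name" (a run of ASCII letters and digits containing at least one letter) is an assumption. It works on plain token lists, so it needs no segmenter installed.

```python
import re

# Assumed definition of a foreign name: ASCII letters/digits with at least
# one letter, e.g. "Microsoft", "MP3", "H1N1" (pure numbers are ignored).
LATIN_RUN = re.compile(r'[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*')

def broken_latin_runs(sentence, tokens):
    """Return the Latin-script runs of `sentence` that no token contains whole.

    Substring containment is a simplification: a very short run such as "A"
    could be masked by an unrelated token that happens to contain it.
    """
    return [run for run in LATIN_RUN.findall(sentence)
            if not any(run in token for token in tokens)]

sentence = "Alex James比起CD更喜欢听MP3。"
# Token lists copied from the pku-seg and pku-news outputs shown above.
pku_tokens = ['Alex', 'James', '比起', 'CD', '更', '喜欢', '听', 'MP3', '。']
news_tokens = ['Alex', 'Ja', 'm', 'e', 's', '比', '起', 'CD', '更', '喜欢', '听', 'MP3', '。']

print(broken_latin_runs(sentence, pku_tokens))   # []
print(broken_latin_runs(sentence, news_tokens))  # ['James']
```

Summing the length of the returned list over all test sentences would give a simple per-segmenter count of damaged foreign names.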
We keep "words" made of Chinese characters, English letters (a-z, A-Z), '-' and '.'. Digits are allowed when they are part of a name (MP3), but tokens that are pure numbers (10, 0.1) or phone numbers (010-22-19391) are removed.

In [34]:
words = []

for word in pku.cut("Microsoft和MP3的股价最近分别上涨了10%和0.01个百分点。"):
    # Drop tokens with any disallowed character, and tokens made only of digits, '.' and '-'.
    if not (re.search(r'[^.\-0-9a-zA-Z\u4e00-\u9fa5]', word) or re.fullmatch(r'[.\-\d]+', word)):
        words.append(word)

print(words)

['Microsoft', '和', 'MP3', '的', '股价', '最近', '分别', '上涨', '了', '和', '个', '百分点']


In [35]:
words = []

for word in pku.cut("请在工作时间8:00-17:00内拨打电话号码010-22-19391垂询。"):
    # Drop tokens with any disallowed character, and tokens made only of digits, '.' and '-'.
    if not (re.search(r'[^.\-0-9a-zA-Z\u4e00-\u9fa5]', word) or re.fullmatch(r'[.\-\d]+', word)):
        words.append(word)

print(words)

['请', '在', '工作', '时间', '内', '拨打', '电话', '号码', '垂询']
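
The two cells above repeat the same filter, so it may be convenient to wrap it in a function. The following is a minimal sketch under stated assumptions: `keep_words` is a hypothetical helper name, it takes an already segmented token list (for example the result of `pku.cut(...)`), and it uses `re.fullmatch` so that only tokens consisting entirely of digits, '.' and '-' are discarded as numbers. The sample token list is hand-written for illustration and is not real segmenter output.

```python
import re

# Any character outside Chinese characters, ASCII letters, digits, '.' and '-'
# disqualifies a token (this also drops punctuation such as '。' or '%').
DISALLOWED = re.compile(r'[^.\-0-9a-zA-Z\u4e00-\u9fa5]')
# Tokens made up only of digits, '.' and '-' count as pure numbers or phone numbers.
NUMERIC = re.compile(r'[.\-\d]+')

def keep_words(tokens):
    """Filter a segmented token list (hypothetical helper, not part of pkuseg)."""
    return [t for t in tokens
            if not DISALLOWED.search(t) and not NUMERIC.fullmatch(t)]

# Hand-written tokens standing in for segmenter output.
sample = ['Microsoft', '和', 'MP3', '上涨', '了', '10%', '0.01', '个', '百分点', '。']
print(keep_words(sample))
# ['Microsoft', '和', 'MP3', '上涨', '了', '个', '百分点']
```

Precompiling the two patterns avoids rebuilding them for every token when the filter runs over a large corpus.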
