A simple short-text classification tool based on LibLinear

TextGrocery

A simple, efficient short-text classification tool based on LibLinear

Uses jieba as the default tokenizer to support Chinese word segmentation

Other languages: more detailed documentation in Chinese

Performance

  • Train set: 48k news titles with 32 labels
  • Test set: 16k news titles with 32 labels
  • Compared against the SVM and naive Bayes classifiers of scikit-learn
Classifier           Accuracy   Time cost (s)
scikit-learn (nb)    76.8%      134
scikit-learn (svm)   76.9%      121
TextGrocery          79.6%      49

Sample Code

>>> from tgrocery import Grocery
# Create a grocery (don't forget to give it a name)
>>> grocery = Grocery('sample')
# Train from a list of (label, text) tuples
>>> train_src = [
...     ('education', 'Student debt to cost Britain billions within decades'),
...     ('education', 'Chinese education for TV experiment'),
...     ('sports', 'Middle East and Asia boost investment in top level sports'),
...     ('sports', 'Summit Series look launches HBO Canada sports doc series: Mudhar')
... ]
>>> grocery.train(train_src)
# Or train from a file
# Format: one sample per line, Label\tText
>>> grocery.train('train_ch.txt')
# Save the model
>>> grocery.save()
# Load the model (use the same name as before)
>>> new_grocery = Grocery('sample')
>>> new_grocery.load()
# Predict
>>> new_grocery.predict('Abbott government spends $8 million on higher education media blitz')
education
# Test from a list of (label, text) tuples
>>> test_src = [
...     ('education', 'Abbott government spends $8 million on higher education media blitz'),
...     ('sports', 'Middle East and Asia boost investment in top level sports'),
... ]
>>> new_grocery.test(test_src)
# Returns the accuracy
1.0
# Or test from a file
>>> new_grocery.test('test_ch.txt')
# Custom tokenizer (here, list splits the text into single characters)
>>> custom_grocery = Grocery('custom', custom_tokenize=list)
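
The last line of the sample passes `list` as `custom_tokenize`, which turns a string into a list of single characters. Below is a minimal sketch of another callable that could be passed the same way. It assumes, as the `list` example suggests, that the tokenizer receives one string and returns a list of tokens. The name `simple_word_tokenize` is hypothetical and not part of tgrocery.

```python
import re


def simple_word_tokenize(text):
    """Hypothetical tokenizer: lowercase the text and split on non-word characters."""
    # re.split can yield empty strings at the edges, so filter them out.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


tokens = simple_word_tokenize("Chinese education for TV experiment")
print(tokens)

# Assumed usage, mirroring the sample code:
#   custom_grocery = Grocery('custom', custom_tokenize=simple_word_tokenize)
```

A word-level tokenizer like this may suit space-delimited languages, while the default jieba tokenizer remains the natural choice for Chinese text.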

More examples: sample/
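
The sample code trains from a file in `Label\tText` format. As a rough sketch, such a file could be produced from an in-memory list like this. It assumes Python 3 and one sample per line. The helper name `write_train_file` and the file name `train_sample.txt` are made up for illustration.

```python
import os
import tempfile

# A subset of the (label, text) pairs from the training list in the sample code.
train_src = [
    ("education", "Student debt to cost Britain billions within decades"),
    ("sports", "Middle East and Asia boost investment in top level sports"),
]


def write_train_file(path, samples):
    """Hypothetical helper: write one sample per line as Label<TAB>Text."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in samples:
            # Tabs or newlines inside the text would break the format,
            # so collapse all whitespace runs into single spaces.
            clean = " ".join(text.split())
            f.write(f"{label}\t{clean}\n")


train_path = os.path.join(tempfile.mkdtemp(), "train_sample.txt")
write_train_file(train_path, train_src)

with open(train_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```

The resulting path could then be passed to `grocery.train(...)` in place of `'train_ch.txt'`.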

Install

$ pip install tgrocery

Only tested on Unix-based systems