Ruby port of TinySegmenter.js for tokenizing Japanese text
Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.

Install

Run gem install tiny_segmenter, or add tiny_segmenter to your Gemfile and run bundle install.

Usage

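require "tiny_segmenter"

ts = TinySegmenter.new

# Punctuation is returned as separate tokens by default.
ts.segment("今晩は!良い天気ですね。")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]

# Pass ignore_punctuation: true to drop punctuation tokens.
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]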

Input text should be UTF-8 encoded.
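If the text comes from a legacy source (for example a Shift_JIS file), it can be transcoded with Ruby's built-in String#encode before it is segmented. The following is a minimal sketch; to_utf8 is a hypothetical helper and is not part of tiny_segmenter.

```ruby
# Hypothetical helper (not part of tiny_segmenter): returns a UTF-8 copy of
# raw bytes whose real encoding is known to the caller.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Simulate raw Shift_JIS bytes, as they might arrive from a binary file read.
raw = "今晩は".encode("Shift_JIS").force_encoding(Encoding::BINARY)

utf8 = to_utf8(raw, "Shift_JIS")
puts utf8.encoding  # prints "UTF-8"
```

The resulting string can then be passed to segment as shown under Usage.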

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. The resulting model is quite compact, yet (according to the author) it segments text with about 95% accuracy.
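To give a rough feel for this kind of model, here is a toy sketch. In log-odds form, a Naive Bayes classifier reduces to a bias plus a sum of per-feature weights, so a segmenter can score each gap between two characters and split where the score is positive. Everything below is an assumption made for illustration: the feature names, the weights, and the helpers char_type and toy_segment are invented and are not taken from TinySegmenter's trained model or from this gem's code.

```ruby
# Toy illustration only. The bias and weights are invented, not trained.
TOY_BIAS = -1
TOY_WEIGHTS = {
  "T:KK" => -3, # kanji followed by kanji: usually the same word
  "T:KH" => 2,  # kanji followed by hiragana: often a boundary
  "T:HK" => 3,  # hiragana followed by kanji: usually a boundary
  "T:HH" => -1, # hiragana followed by hiragana: weakly the same word
  "N:い" => -3, # next char "い": often an adjective ending
  "P:は" => 1   # previous char "は": often a particle
}.freeze

# Coarse character class: K = kanji, H = hiragana, A = katakana, O = other.
def char_type(ch)
  case ch
  when /\p{Han}/      then "K"
  when /\p{Hiragana}/ then "H"
  when /\p{Katakana}/ then "A"
  else "O"
  end
end

# Scores every gap between adjacent characters and splits when score > 0.
def toy_segment(text)
  chars = text.each_char.to_a
  return [] if chars.empty?

  words = [chars.first.dup]
  chars.each_cons(2) do |prev, nxt|
    features = ["T:#{char_type(prev)}#{char_type(nxt)}", "P:#{prev}", "N:#{nxt}"]
    # Sum the weights of the active features; unknown features count as 0.
    score = features.inject(TOY_BIAS) { |acc, f| acc + TOY_WEIGHTS.fetch(f, 0) }
    if score > 0
      words << nxt.dup   # start a new word at this boundary
    else
      words.last << nxt  # extend the current word
    end
  end
  words
end

p toy_segment("今晩は良い天気です")
# => ["今晩", "は", "良い", "天気", "です"]
```

A real model of this kind would use far more features and weights learned from a corpus; L1-norm regularization drives most weights to zero, which is one reason such a model can stay compact.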

License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt