
Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.



Run gem install tiny_segmenter, or add tiny_segmenter to your Gemfile.
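
For the Gemfile route, a minimal fragment might look like the following. The source line is the usual RubyGems default and is an assumption about your setup; no version constraint is given here because this README does not state one.

```ruby
# Gemfile (sketch): pull tiny_segmenter from RubyGems
source "https://rubygems.org"

gem "tiny_segmenter"
```

Run bundle install afterwards to fetch the gem.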


ts = TinySegmenter.new
ts.segment("今晩は!良い天気ですね。")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]

Input text should be UTF-8 encoded.
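
If your text may arrive in another encoding (Shift_JIS is common for older Japanese data), you can normalize it before segmenting. The sketch below is one way to do that; to_utf8 is a hypothetical helper, not part of tiny_segmenter, and it relies on String#scrub, which needs Ruby 2.1 or later.

```ruby
# Hypothetical helper (not part of tiny_segmenter): return a valid UTF-8
# copy of `text`, whatever encoding it is currently tagged with.
def to_utf8(text)
  if text.encoding == Encoding::UTF_8
    # Already UTF-8: only replace invalid byte sequences (Ruby 2.1+).
    text.scrub
  else
    # Transcode, replacing bytes that cannot be converted.
    text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

# Simulate legacy input by encoding a literal as Shift_JIS first.
sjis = "今晩は".encode(Encoding::Shift_JIS)
utf8 = to_utf8(sjis)
puts utf8.encoding
```

The result of to_utf8 can then be passed to ts.segment as usual.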

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. The resulting model is quite compact, yet (according to the author) achieves about 95% accuracy.
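
To give a feel for this kind of model, here is a toy sketch of boundary scoring. In log-odds form a Naive Bayes classifier reduces to a bias plus a sum of per-feature weights, and a word boundary is placed wherever that sum is positive. Everything below is illustrative: the names (TOY_BIAS, TOY_WEIGHTS, char_type, toy_segment), the three features, and the weights are made up for this example and are not the trained model or this gem's code, which as far as I know uses many more features.

```ruby
# Toy illustration only: invented weights, not the trained model.
TOY_BIAS = -1
TOY_WEIGHTS = {
  "pair:HK" => 2, # hiragana followed by kanji: likely a boundary
  "cur:で"  => 2  # a word often starts at で
}

# Classify a character coarsely by script.
def char_type(ch)
  case ch
  when /\p{Han}/      then "K" # kanji
  when /\p{Hiragana}/ then "H"
  when /\p{Katakana}/ then "A"
  else "O"                     # anything else
  end
end

# Decide, for each adjacent character pair, whether a word boundary
# falls between them by summing the weights of the active features.
def toy_segment(text)
  chars = text.chars
  return [] if chars.empty?

  words = [chars.first.dup]
  chars.each_cons(2) do |prev, cur|
    features = [
      "pair:#{char_type(prev)}#{char_type(cur)}",
      "prev:#{prev}",
      "cur:#{cur}"
    ]
    score = features.inject(TOY_BIAS) { |s, f| s + TOY_WEIGHTS.fetch(f, 0) }
    if score > 0
      words << cur.dup  # positive score: start a new word
    else
      words.last << cur # otherwise extend the current word
    end
  end
  words
end

p toy_segment("良い天気です")
# => ["良い", "天気", "です"]
```

Because unknown features contribute zero, most of the weight table can be dropped without changing the scoring code, which is roughly why L1 regularization (which drives many weights to exactly zero) leads to a compact model.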


BSD - see
