Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.

Install with gem install tiny_segmenter, or add tiny_segmenter to your Gemfile.


ts =
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]

Input text should be UTF-8 encoded.
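
If the text comes from a source in another encoding, it can be converted before segmenting. The following is a minimal sketch using only Ruby's built-in String methods; the to_utf8 helper is hypothetical and not part of tiny_segmenter.

```ruby
# Hypothetical helper (not part of tiny_segmenter): normalize a string to UTF-8.
def to_utf8(text, source_encoding = nil)
  # Relabel raw bytes when the caller knows their real encoding.
  text = text.dup.force_encoding(source_encoding) if source_encoding
  # Transcode; this is a no-op when the string is already tagged as UTF-8.
  utf8 = text.encode(Encoding::UTF_8)
  # A no-op transcode does not validate, so check the bytes explicitly.
  raise ArgumentError, "text is not valid UTF-8" unless utf8.valid_encoding?
  utf8
end

# Simulate Shift_JIS bytes read from a file in binary mode.
sjis_bytes = "今晩は".encode("Shift_JIS").force_encoding("ASCII-8BIT")
puts to_utf8(sjis_bytes, "Shift_JIS") # prints 今晩は
```

The resulting string can then be passed to ts.segment as in the examples above.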

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. The resulting model is quite compact, yet (according to the author) is about 95% accurate.
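
To give a feel for how a compact weight table can drive segmentation, here is a toy sketch. It assumes a word boundary is decided between each pair of adjacent characters by adding a bias to a weight looked up from their character types, and splitting when the total is positive. The names (char_type, TOY_WEIGHTS, TOY_BIAS, toy_segment) and all weights are made up for illustration; this is not the trained model or the gem's actual feature set.

```ruby
# Toy illustration only: weights are invented, not taken from the real model.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else                     :other
  end
end

# Weight for "a boundary lies between a left-type and a right-type character".
TOY_WEIGHTS = {
  [:hiragana, :kanji]    => 3,  # e.g. a particle followed by a noun
  [:kanji, :kanji]       => -2, # kanji compounds tend to stay together
  [:hiragana, :hiragana] => -1
}
TOY_BIAS = -1 # default leans toward "no boundary"

def toy_segment(text)
  chars = text.chars.to_a
  return [] if chars.empty?
  words = [chars.first.dup]
  chars.each_cons(2) do |left, right|
    score = TOY_BIAS + TOY_WEIGHTS.fetch([char_type(left), char_type(right)], 0)
    if score > 0
      words << right.dup  # positive score: start a new word
    else
      words.last << right # otherwise extend the current word
    end
  end
  words
end

p toy_segment("良い天気") # prints ["良い", "天気"]
```

A real model of this kind uses far more features than a single pair of character types, which is why the toy version should not be expected to match the gem's output on other inputs.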


BSD - see

