
Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.

Install

Run gem install tiny_segmenter, or add tiny_segmenter to your Gemfile.

Usage

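require 'tiny_segmenter'  # assumes the gem's entry point matches its name

ts = TinySegmenter.new

# Default: punctuation is returned as separate tokens.
ts.segment("今晩は!良い天気ですね。")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]

# With ignore_punctuation: true, punctuation tokens are dropped.
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]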

Input text should be UTF-8 encoded.
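
If your text arrives in another encoding (Shift_JIS and EUC-JP are common for older Japanese data), one way to prepare it is to transcode it with Ruby's built-in String#encode before segmenting. The sketch below is only an illustration: to_utf8 is a hypothetical helper, not part of tiny_segmenter, and it assumes you know (or Ruby has correctly tagged) the source encoding.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of tiny_segmenter): return a UTF-8 copy of text.
# Pass source_encoding when the bytes are mislabeled, e.g. read as binary.
def to_utf8(text, source_encoding = nil)
  # Relabel the raw bytes first if the caller knows their real encoding.
  text = text.dup.force_encoding(source_encoding) if source_encoding
  # Transcode to UTF-8; raises if the bytes are invalid for the source encoding.
  text.encode(Encoding::UTF_8)
end

sjis = "今晩は".encode(Encoding::Shift_JIS)  # simulate legacy input
utf8 = to_utf8(sjis)
puts utf8.encoding  # UTF-8
```

The resulting string can then be passed to segment as in the usage example.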

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. L1 regularization drives most feature weights to zero, so the resulting model is quite compact, yet (according to the original author) it achieves about 95% accuracy.
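
To give a feel for how a compact, sparse model can segment text, here is a toy sketch of the general idea: look at each gap between characters, sum the weights of the few features that fire there, and place a boundary when the total is positive. Everything in it is an assumption made for illustration. The feature names, weights, bias, and the toy_segment and char_type helpers are invented and are not taken from the trained model or from this gem's code.

```ruby
# encoding: utf-8

# Invented weights for illustration only. A sparse model stores just the
# non-zero weights, which is why it can fit in a small lookup table.
TOY_WEIGHTS = {
  "prev:は"  => 3,   # a boundary often follows the particle は
  "next:は"  => 2,   # ...and often precedes it
  "types:KH" => 1,   # kanji followed by hiragana
  "types:HK" => 4,   # hiragana followed by kanji
  "types:HH" => -2,  # hiragana runs tend to stay together
  "types:KK" => -3   # kanji runs tend to stay together
}
TOY_BIAS = -1

# Classify a character as kanji (K), hiragana (H), or other (O).
def char_type(ch)
  case ch
  when /\p{Han}/      then "K"
  when /\p{Hiragana}/ then "H"
  else "O"
  end
end

# Split text by scoring every gap between two adjacent characters.
def toy_segment(text)
  chars = text.chars.to_a
  return [] if chars.empty?

  words = [chars.first.dup]
  (1...chars.size).each do |i|
    prev, nxt = chars[i - 1], chars[i]
    features = ["prev:#{prev}", "next:#{nxt}",
                "types:#{char_type(prev)}#{char_type(nxt)}"]
    # Features missing from the table contribute nothing to the score.
    score = TOY_BIAS + features.inject(0) { |sum, f| sum + TOY_WEIGHTS.fetch(f, 0) }
    if score > 0
      words << nxt.dup   # start a new word at this gap
    else
      words.last << nxt  # extend the current word
    end
  end
  words
end

p toy_segment("今晩は良い天気")
# => ["今晩", "は", "良い", "天気"]
```

A trained model uses far more features and learned weights, but the decision at each gap has the same shape: a sum over a handful of table lookups.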

License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt