6/tiny_segmenter

Ruby port of TinySegmenter.js for tokenizing Japanese text. Requires Ruby 1.9 or higher.

Install

Run gem install tiny_segmenter, or add tiny_segmenter to your Gemfile.

Usage

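require "tiny_segmenter"

ts = TinySegmenter.new
# Segment a sentence into tokens; punctuation is kept by default.
ts.segment("今晩は！良い天気ですね。")
# => ["今晩", "は", "！", "良い", "天気", "です", "ね", "。"]
# Pass ignore_punctuation: true to drop punctuation tokens.
ts.segment("今晩は！良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]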

Input text should be UTF-8 encoded.
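If the text comes from a file or service that uses a legacy Japanese encoding, it can be transcoded before segmenting. The following is a minimal sketch using only Ruby's built-in String#encode; the helper name to_utf8 is hypothetical and not part of tiny_segmenter, and the Shift_JIS input is simulated for the example.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of tiny_segmenter): return +text+ unchanged
# if it is already tagged as UTF-8, otherwise transcode it to UTF-8.
def to_utf8(text)
  text.encoding == Encoding::UTF_8 ? text : text.encode(Encoding::UTF_8)
end

# Simulate input that arrives in a legacy encoding such as Shift_JIS.
sjis = "良い天気ですね。".encode(Encoding::Shift_JIS)
utf8 = to_utf8(sjis)

puts sjis.encoding  # encoding of the simulated legacy input
puts utf8.encoding  # encoding after conversion
```

The converted string can then be passed to the segmenter, for example ts.segment(to_utf8(raw_text)). Strings tagged as binary (ASCII-8BIT) cannot be transcoded this way when they contain non-ASCII bytes; their real encoding has to be declared first with force_encoding.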

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized using L1-norm regularization. The resulting model is quite compact, yet (according to the author) it is about 95% accurate.
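To give a feel for how a compact model of this kind can segment text, here is a toy sketch. In log-odds form, a Naive Bayes-style classifier reduces to summing per-feature weights plus a bias, and a word boundary is placed wherever the sum is positive. Everything below is an assumption made for illustration: the single feature (the character types on either side of the candidate boundary), the weights, the bias, and the names char_type, TOY_WEIGHTS, TOY_BIAS and toy_segment are invented and are not the trained TinySegmenter model or this gem's API.

```ruby
# encoding: utf-8
# Toy sketch only: feature set and weights are invented for illustration.

# Classify a character into a coarse type: hiragana (H), katakana (K),
# kanji (C), or other (O).
def char_type(ch)
  case ch
  when /\p{Hiragana}/ then "H"
  when /\p{Katakana}/ then "K"
  when /\p{Han}/      then "C"
  else "O"
  end
end

# Hypothetical weights keyed by "type before boundary" + "type after boundary".
TOY_WEIGHTS = {
  "CH" => -1, # kanji followed by hiragana tend to stay together
  "HC" => 3,  # hiragana followed by kanji tends to start a new word
  "HO" => 3, "CO" => 3, "OH" => 3, "OC" => 3 # punctuation stands alone
}.freeze
TOY_BIAS = -1 # with no evidence, prefer not to split

# Walk over adjacent character pairs and split where the score is positive.
def toy_segment(text)
  chars = text.chars.to_a
  return [] if chars.empty?
  words = [chars.first.dup]
  chars.each_cons(2) do |left, right|
    score = TOY_BIAS + TOY_WEIGHTS.fetch(char_type(left) + char_type(right), 0)
    if score > 0
      words << right.dup  # start a new word at this boundary
    else
      words.last << right # extend the current word
    end
  end
  words
end

result = toy_segment("良い天気。")
# result == ["良い", "天気", "。"]
```

A trained model would replace the hand-written table with many learned weights over richer features; L1-norm regularization helps keep such a table small by driving most weights to zero.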

License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt
