zencephalon/Tactful_Tokenizer

TactfulTokenizer

TactfulTokenizer is a Ruby library for high-quality sentence tokenization. It uses a Naive Bayes statistical model and is based on Splitta, but adds support for ‘?’ and ‘!’ as sentence terminators, as well as primitive handling of XHTML markup. Better support for XHTML parsing is planned.

It also supports tokenization of Unicode text.

Usage

require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]

The input text is expected to consist of paragraphs delimited by line breaks.
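
If your source text is hard-wrapped (several lines per paragraph, with blank lines between paragraphs), it may help to put each paragraph on a single line before tokenizing. The following is a minimal sketch under that assumption; `unwrap_paragraphs` is a hypothetical helper and not part of TactfulTokenizer, and the blank-line paragraph convention is an assumption about your input, not something the library requires.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): joins hard-wrapped lines
# so that each paragraph sits on a single line. Assumes paragraphs in the raw
# input are separated by one or more blank lines.
def unwrap_paragraphs(raw_text)
  raw_text
    .split(/\n\s*\n/)  # blank lines separate paragraphs
    .map { |para| para.split("\n").map(&:strip).reject(&:empty?).join(" ") }
    .reject(&:empty?)  # drop paragraphs that were only whitespace
    .join("\n")        # one paragraph per line
end

raw = "Here in the U.S. Senate we prefer\nto eat our friends.\n\nIs it easier that way?\n"
prepared = unwrap_paragraphs(raw)

# With the gem installed, the prepared text can then be tokenized as usual:
#   require "tactful_tokenizer"
#   m = TactfulTokenizer::Model.new
#   m.tokenize_text(prepared)
```

Text that already has one paragraph per line passes through this helper unchanged apart from whitespace trimming, so applying it to such input should be harmless.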

Installation

gem install tactful_tokenizer

Author

Copyright © 2010 Matthew Bunday. All rights reserved. Released under the GNU GPL v3.
