Greeb

Greeb [grʲip] is a simple Unicode-aware text segmentator based on regular expressions. The API documentation is available on RubyDoc.info.

Installation

Add this line to your application's Gemfile:

gem 'greeb'

And then execute:

$ bundle

Or install it yourself as:

$ gem install greeb

Usage

Greeb addresses essential text processing tasks such as tokenization and segmentation. There are two ways to use it: (1) as a command-line application, and (2) as a Ruby library.

Command-Line Interface

The greeb application reads the input text from STDIN and writes one token per line to STDOUT.

% echo 'Hello http://nlpub.ru guys, how are you?' | greeb
Hello
http://nlpub.ru
guys
,
how
are
you
?
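
For reference, here is a rough sketch of the same pipeline written against the Ruby API described below. It is only an approximation: it relies solely on the tokenizer, so unlike the real greeb executable it will not keep entities such as URLs intact, and the filtering of whitespace spans here is an assumption rather than the actual implementation.

require 'greeb'
# Read the whole input, tokenize it, and print one token per line,
# skipping whitespace and line breaks.
text = STDIN.read
Greeb::Tokenizer.tokenize(text).each do |span|
  next if span.type == :space || span.type == :break
  puts text[span.from...span.to]
end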

Tokenization API

Greeb has a convenient tokenization API.

pp Greeb::Tokenizer.tokenize('Hello!')
=begin
[#<struct Greeb::Span from=0, to=5, type=:letter>,
 #<struct Greeb::Span from=5, to=6, type=:punct>]
=end

It is also possible to process texts that are much more complex than the one above.

text = <<-EOF
Hello! I am 18! My favourite number is 133.7...

What about you?
EOF

pp Greeb::Tokenizer.tokenize(text)
=begin
[#<struct Greeb::Span from=0, to=5, type=:letter>,
 #<struct Greeb::Span from=5, to=6, type=:punct>,
 #<struct Greeb::Span from=6, to=7, type=:space>,
 #<struct Greeb::Span from=7, to=8, type=:letter>,
 #<struct Greeb::Span from=8, to=9, type=:space>,
 #<struct Greeb::Span from=9, to=11, type=:letter>,
 #<struct Greeb::Span from=11, to=12, type=:space>,
 #<struct Greeb::Span from=12, to=14, type=:integer>,
 #<struct Greeb::Span from=14, to=15, type=:punct>,
 #<struct Greeb::Span from=15, to=16, type=:space>,
 #<struct Greeb::Span from=16, to=18, type=:letter>,
 #<struct Greeb::Span from=18, to=19, type=:space>,
 #<struct Greeb::Span from=19, to=28, type=:letter>,
 #<struct Greeb::Span from=28, to=29, type=:space>,
 #<struct Greeb::Span from=29, to=35, type=:letter>,
 #<struct Greeb::Span from=35, to=36, type=:space>,
 #<struct Greeb::Span from=36, to=38, type=:letter>,
 #<struct Greeb::Span from=38, to=39, type=:space>,
 #<struct Greeb::Span from=39, to=44, type=:float>,
 #<struct Greeb::Span from=44, to=47, type=:punct>,
 #<struct Greeb::Span from=47, to=49, type=:break>,
 #<struct Greeb::Span from=49, to=53, type=:letter>,
 #<struct Greeb::Span from=53, to=54, type=:space>,
 #<struct Greeb::Span from=54, to=59, type=:letter>,
 #<struct Greeb::Span from=59, to=60, type=:space>,
 #<struct Greeb::Span from=60, to=63, type=:letter>,
 #<struct Greeb::Span from=63, to=64, type=:punct>,
 #<struct Greeb::Span from=64, to=65, type=:break>]
=end

Segmentation API

Greeb can also perform sentence boundary detection.

text = 'Hello! How are you?'
tokens = Greeb::Tokenizer.tokenize(text)
pp Greeb::Segmentator.new(tokens).sentences
=begin
[#<struct Greeb::Span from=0, to=6, type=:sentence>,
 #<struct Greeb::Span from=7, to=19, type=:sentence>]
=end

Having obtained the sentence boundaries, it is possible to extract tokens covered by these sentences.

text = 'Hello! How are you?'
tokens = Greeb::Tokenizer.tokenize(text)
segmentator = Greeb::Segmentator.new(tokens)
pp segmentator.extract(segmentator.sentences)
=begin
{#<struct Greeb::Span from=0, to=6, type=:sentence>=>
  [#<struct Greeb::Span from=0, to=5, type=:letter>,
   #<struct Greeb::Span from=5, to=6, type=:punct>],
 #<struct Greeb::Span from=7, to=19, type=:sentence>=>
  [#<struct Greeb::Span from=7, to=10, type=:letter>,
   #<struct Greeb::Span from=10, to=11, type=:space>,
   #<struct Greeb::Span from=11, to=14, type=:letter>,
   #<struct Greeb::Span from=14, to=15, type=:space>,
   #<struct Greeb::Span from=15, to=18, type=:letter>,
   #<struct Greeb::Span from=18, to=19, type=:punct>]}
=end
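
If the sentence texts themselves are needed rather than the covered tokens, the span boundaries can be applied directly to the original string. A minimal sketch; the expected output shown in the comment follows from the boundaries above.

text = 'Hello! How are you?'
tokens = Greeb::Tokenizer.tokenize(text)
sentences = Greeb::Segmentator.new(tokens).sentences
# Slice each sentence span out of the original text.
pp sentences.map { |s| text[s.from...s.to] }
# => ["Hello!", "How are you?"]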

Parsing API

Texts often include special entries such as URLs and e-mail addresses. Greeb can assist you in extracting them.

Extraction of URLs and e-mails

text = 'My website is http://nlpub.ru and e-mail is example@example.com.'

pp Greeb::Parser.urls(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=14, to=29, type=:url>, "http://nlpub.ru"]]
=end

pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=44, to=63, type=:email>, "example@example.com"]]
=end

Please do not use Greeb for the development of spam lists. Spam sucks.

Extraction of abbreviations

text = 'Hello, G.L.H.F. everyone!'

pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=7, to=15, type=:abbrev>, "G.L.H.F."]]
=end

The algorithm is not particularly accurate, but it is still useful in many practical situations.

Extraction of time stamps

text = 'Our time is running out: 13:37 or 14:89.'

pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=25, to=30, type=:time>, "13:37"]]
=end

Spans

Greeb operates on spans, which are tuples of (from, to, type), where from is the beginning of the span, to is the end of the span (exclusive), and type is the type of the span.

There are several span types at the tokenization stage: :letter, :float, :integer, :separ, :punct (for punctuation), :spunct (for in-sentence punctuation), :space, and :break.
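
Since a span carries its boundaries and type, filtering tokens and mapping them back onto the original string is straightforward. Below is a minimal sketch; the sample text and the expected output in the comment are illustrative assumptions based on the tokenization examples above.

text = 'Hello, world!'
tokens = Greeb::Tokenizer.tokenize(text)
# Keep only the word-like spans and recover their surface forms.
words = tokens.select { |span| span.type == :letter }
pp words.map { |span| text[span.from...span.to] }
# => ["Hello", "world"]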

Contributing

  1. Fork it;
  2. Create your feature branch (git checkout -b my-new-feature);
  3. Commit your changes (git commit -am 'Added some feature');
  4. Push to the branch (git push origin my-new-feature);
  5. Create a new Pull Request.

Copyright

Copyright (c) 2010-2019 Dmitry Ustalov. See LICENSE for details.