ealdent / uea-stemmer

Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing

http://www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming

- add test case for setting options 
ealdent (author)
Wed Oct 14 12:50:53 -0700 2009
commit  f67896feb61950ef7f3cc49679cf87cdb253bb2e
tree    9c47c2e32eb03a317c3cbddee6d0a6ca1b268892
parent  c6dc46c855434cd13933109bf3fe8d67d4f6f287
uea-stemmer /
name  age  message
file .document
file .gitignore Wed Jul 15 18:34:53 -0700 2009 ignore textmate project files [ealdent]
file LICENSE Tue Jul 14 22:18:51 -0700 2009 Port of UEA Stemmer to Ruby using the Java port... [ealdent]
file README.rdoc Wed Jul 15 11:39:55 -0700 2009 update README [ealdent]
file Rakefile Wed Sep 30 14:40:01 -0700 2009 - Version bump to 0.9.10 - update rakefile so t... [ealdent]
file VERSION
directory lib/
directory test/
file uea-stemmer.gemspec Tue Oct 13 11:24:48 -0700 2009 Version bump to 0.10.0 [ealdent]
README.rdoc

uea-stemmer

Like other stemmers, UEA-Lite applies a set of rules in sequence. The rules fall into two groups: the first cleans the tokens, and the second alters suffixes.

The first group of rules begins by skipping a small list of six frequent problem words. The stemmer could be improved by expanding this list with other problem words that the second rule set cannot handle. Next, possessive apostrophes are removed and contractions are expanded. All hyphens are removed, and tokens containing digits are left untouched. Strings consisting entirely of upper-case letters and digits are left untouched unless they end in a lower-case 's' (i.e. plural forms of acronyms are transformed to singular forms).
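The cleaning rules above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this gem: `clean_token` is a hypothetical helper, the problem-word list and contraction expansion are omitted, and requiring at least two upper-case characters before the terminal 's' is an assumption made so that ordinary words such as "Is" are not treated as acronyms.

```ruby
# Hypothetical helper illustrating the first rule group; not part of uea-stemmer.
def clean_token(token)
  # Plural acronym: drop only the lower-case terminal "s" ("CDs" -> "CD").
  return token.chomp('s') if token.match?(/\A[A-Z0-9]{2,}s\z/)

  # Other upper-case/digit strings and any token containing a digit stay as-is.
  return token if token.match?(/\A[A-Z0-9]+\z/) || token.match?(/\d/)

  # Remove a possessive ending ("dog's" -> "dog", "dogs'" -> "dogs"), then all hyphens.
  token.sub(/'s\z|'\z/, '').delete('-')
end

puts clean_token('CDs')
puts clean_token("dog's")
puts clean_token('co-operate')
```

The order of the checks matters in this sketch: the acronym test runs before the digit test so that a string such as "MP3s" would lose its terminal 's' rather than being skipped outright.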

Proper nouns should not usually be stemmed, except to remove possessives; our implementation respects PoS tags if they are present. If the text is untagged, the stemmer uses the simple heuristic that any capitalized token not preceded by sentence-breaking punctuation is a proper noun.
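The untagged-text heuristic could look something like the sketch below. The names `SENTENCE_BREAKS` and `proper_nouns` are hypothetical, the set of sentence-breaking punctuation is an assumption, and so is the decision to treat the very first token as the start of a sentence.

```ruby
# Assumed set of sentence-breaking punctuation tokens.
SENTENCE_BREAKS = %w[. ! ?].freeze

# Hypothetical helper: returns the tokens the heuristic would treat as proper nouns.
def proper_nouns(tokens)
  tokens.each_with_index.select do |token, i|
    capitalized = token.match?(/\A[A-Z][a-z]/)
    # Assumption: the first token opens a sentence, like one that follows punctuation.
    starts_sentence = i.zero? || SENTENCE_BREAKS.include?(tokens[i - 1])
    capitalized && !starts_sentence
  end.map(&:first)
end

tokens = %w[The stemmer was written at East Anglia . It works]
p proper_nouns(tokens)
```

Here "The" and "It" are capitalized only because they open a sentence, so the sketch does not flag them, while "East" and "Anglia" are flagged.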

Many texts, particularly scientific papers, contain sequences of digits, single letters, and other non-word tokens. Our implementation ignores tokens containing digits, single-letter tokens, and tokens with embedded punctuation.
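As a rough sketch of this filter, assuming a hypothetical predicate named `stemmable?`: hyphens and apostrophes are deliberately not counted as embedded punctuation here, because the cleaning rules handle them separately. That exclusion is an assumption of the sketch, not something the description above spells out.

```ruby
# Hypothetical predicate: should this token be passed to the suffix rules at all?
def stemmable?(token)
  return false if token.length == 1   # single-letter tokens
  return false if token.match?(/\d/)  # tokens containing digits
  # Embedded punctuation: a non-alphanumeric run between two alphanumerics,
  # ignoring apostrophes and hyphens (assumed to be handled during cleaning).
  return false if token.match?(/[[:alnum:]][^[:alnum:]'\-]+[[:alnum:]]/)

  true
end

%w[helpers x H2O foo.bar co-operate].each do |token|
  puts "#{token}: #{stemmable?(token)}"
end
```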

The second group of rules contains 139 suffix rules, each testing for a specific type of suffix. The rules are ordered so that the longest applicable suffix is used rather than a shorter one, which could produce nonsense words and leave more words not fully stemmed to their root form.
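To see why the ordering matters, here is a small sketch with three made-up rules. The `Rule` struct, the `RULES` table, and `apply_first_rule` are hypothetical and far simpler than the 139 rules in the stemmer; they only illustrate trying the longest suffix first.

```ruby
# Hypothetical miniature rule table; the real rule set is much larger and more specific.
Rule = Struct.new(:suffix, :replacement)

RULES = [
  Rule.new('s', ''),
  Rule.new('ies', 'y'),
  Rule.new('sses', 'ss')
].sort_by { |rule| -rule.suffix.length }.freeze # longest suffix first

# Applies the first (and therefore longest) matching rule, or returns the word unchanged.
def apply_first_rule(word)
  rule = RULES.find { |r| word.end_with?(r.suffix) }
  return word unless rule

  word[0...-rule.suffix.length] + rule.replacement
end

puts apply_first_rule('bodies')   # body
puts apply_first_rule('classes')  # class
puts apply_first_rule('helpers')  # helper
```

If the bare `s` rule were tried first, "bodies" would become "bodie" and "classes" would become "classe", which is the kind of nonsense word the ordering is meant to avoid.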

This is a port to Ruby from the port to Java from the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.

Installation

Install the gem via GitHub:

  gem sources -a http://gems.github.com   # only need to do this once
  sudo gem install ealdent-uea-stemmer

Install the gem from source:

  git clone git://github.com/ealdent/uea-stemmer.git
  cd uea-stemmer
  rake install

Example Usage

Typical usage:

  require 'uea-stemmer'
  stemmer = UEAStemmer.new

  stemmer.stem('helpers')   # helper
  stemmer.stem('dying')     # die
  stemmer.stem('scarred')   # scar

  'buries'.stem             # bury
  'bodies'.stem             # body
  'ordained'.stem           # ordain

You can also extract the stemmed word along with the rule by using the `stem_with_rule` method.

  stem = stemmer.stem_with_rule('invited')   # Word('invite', Rule #22.3)
  puts stem.rule  # rule #22.3 (remove -d when the word ends in -ited)

TODO

  • test handling of POS tags
  • add tests to mimic methodology used in the paper

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don’t break it in a future version unintentionally.
  • Commit, but do not modify the Rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a separate commit that I can ignore when I pull.)
  • Send me a pull request. Bonus points for topic branches.

Relevant Web Pages

  • www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming
  • Stemming

Copyright

Copyright © 2005 by the University of East Anglia and authored by Marie-Claire Jenkins and Dr. Dan J Smith. This port to Ruby was done by Jason Adams using the port to Java by Richard Churchill.

This project is distributed under the Apache 2.0 License. See LICENSE for details.
