Narnach / groupie

Groupie is a simple way to group texts and classify new texts as being a likely member of one of the defined groups. Think of Bayesian spam filters.

Wes Oldenbeuving (author)
Tue Jun 02 23:34:04 -0700 2009
commit  b85a3cb1cd75e24046f0259423c5cb4f40d6a2f2
tree    d957c4410bccafae9ff00758baf78066bd7dcfc4
parent  6db2b64980e70796c1b296180bc49baeff89e9fe

Groupie

The eventual goal is to have Groupie work as a sort of Bayesian spam filter: you feed it spam and ham (non-spam) and ask it to classify new texts as spam or ham. Applications include e-mail spam filtering and blog spam filtering. Other kinds of categorization might be interesting as well, such as finding suitable tags for a blog post or bookmark.

Goals

Groupie is a ‘fun’ project that has the following goals, in descending order of importance:

  • Have fun playing with code
  • Play with Bayesian-like (spam) filtering
  • Check out the Testy BDD framework. It’s pretty good for 60 lines of code!

Current functionality

Current functionality includes:

  • Tokenize an input text to prepare it for grouping.
    • Strip XML and HTML tags.
    • Keep certain infix characters, such as period and comma.
  • Add texts (as an Array of Strings) to any number of groups.
  • Classify a single word to check the likelihood it belongs to each group.
  • Do classification for complete (tokenized) texts.
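
The features listed above can be illustrated with a small, self-contained Ruby sketch. This is not Groupie's actual API: the class name SketchGrouper and the methods tokenize, add, classify and classify_text are hypothetical names chosen for illustration. The scoring is also an assumption; it uses plain word-count ratios and a simple average per text, which is only loosely Bayesian-like.

```ruby
# Hypothetical sketch of the ideas listed above; NOT Groupie's real API.
class SketchGrouper
  def initialize
    # group name => (word => number of times the word was added)
    @groups = Hash.new { |hash, name| hash[name] = Hash.new(0) }
  end

  # Strip XML/HTML tags, lowercase the text and split it into words.
  # A period or comma is kept only when it sits between word characters,
  # so "example.com" stays whole while trailing punctuation is dropped.
  def self.tokenize(text)
    text.gsub(/<[^>]*>/, ' ').downcase.scan(/[a-z0-9]+(?:[.,][a-z0-9]+)*/)
  end

  # Add a text (an Array of Strings) to a group.
  def add(group, words)
    words.each { |word| @groups[group][word] += 1 }
  end

  # For a single word, return the share of its occurrences per group.
  # Returns an empty Hash for a word that no group has seen.
  def classify(word)
    counts = @groups.transform_values { |words| words.fetch(word, 0) }
    total = counts.values.sum
    return {} if total.zero?

    counts.transform_values { |count| count.to_f / total }
  end

  # For a tokenized text, average the per-word scores of all known words.
  # Assumption: a plain average; a real filter would combine probabilities.
  def classify_text(words)
    scores = Hash.new(0.0)
    known = 0
    words.each do |word|
      result = classify(word)
      next if result.empty?

      known += 1
      result.each { |group, score| scores[group] += score }
    end
    return {} if known.zero?

    scores.transform_values { |score| score / known }
  end
end

grouper = SketchGrouper.new
grouper.add(:spam, SketchGrouper.tokenize('<b>Buy</b> cheap pills now'))
grouper.add(:ham, SketchGrouper.tokenize('Lunch now, or later?'))

grouper.classify('pills') # spam scores 1.0, ham scores 0.0
grouper.classify('now')   # spam and ham both score 0.5
grouper.classify_text(SketchGrouper.tokenize('cheap pills now'))
```

In this sketch, words that no group has seen are skipped when classifying a text, so unknown vocabulary does not dilute the scores. Whether Groupie itself treats unknown words this way is not stated in this readme.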

License

As always, the code is licensed under the MIT license.

Wes Oldenbeuving