courtenay / splam

Simple, pluggable, easily customizable score-based spam filter plugin for Ruby-based applications

Splam
=====
 
Splam is a simple spam scoring plugin. It contains a set of rules that are run on a field
to help you determine the likelihood of that field being spam. It doesn't do anything
other than give a field a score. It's up to you to act on that score.
 
Check out the tests for examples of how to use it. You'll want to integrate this into
your application's workflow.
 
It's heavily biased towards the spam I've been seeing in the past two or three hours.
This includes lots of crap with:
- BBCode ([url=)
- lots of links (http://)
- Russian text
- links to Russian or Chinese websites
 
You can write your own plugins for Splam: simply subclass Splam::Rule. Splam is clever enough
to iterate over all of Rule's subclasses and call each one's 'run' method on the field being
checked. The other way to do this would be to define Rule.add_rule do ... end, but I think the
class form is easier for Rubyists to understand and modify.
 
Splam aggregates the scores from all the rules. From the brief testing I've done, anything over
about 40 is likely to be spam. Real spam will blow out of the scoring stratosphere with over 1,000.
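
The subclass-and-aggregate design described above can be sketched roughly as follows. This is a self-contained illustration, not Splam's actual source: the MiniSplam module, the add_score helper, the two example rules and their point values are all assumptions made for the example. Only the "subclass Rule, implement run, sum the scores, treat anything over about 40 as likely spam" shape comes from this README.

```ruby
# Hypothetical stand-in for Splam::Rule; NOT Splam's real implementation.
module MiniSplam
  class Rule
    # Registry of every Rule subclass, filled in as classes are defined.
    def self.rules
      @rules ||= []
    end

    # Ruby calls this hook whenever a class inherits from Rule.
    def self.inherited(subclass)
      super
      Rule.rules << subclass
    end

    attr_reader :score, :reasons

    def initialize(body)
      @body = body
      @score = 0
      @reasons = []
    end

    # Assumed helper: record points together with the reason for them.
    def add_score(points, reason)
      @score += points
      @reasons << reason
    end

    # Subclasses override this to inspect @body.
    def run; end
  end

  # Example rule (weights are made up): BBCode-style links.
  class Bbcode < Rule
    def run
      hits = @body.scan(/\[url=/i).size
      add_score(hits * 30, "#{hits} bbcode link(s)") if hits > 0
    end
  end

  # Example rule (weights are made up): more than two plain links.
  class ManyLinks < Rule
    def run
      hits = @body.scan(%r{http://}i).size
      add_score(hits * 10, "#{hits} http link(s)") if hits > 2
    end
  end

  SPAM_THRESHOLD = 40 # "anything over about 40 is likely to be spam"

  # Run every registered rule on the text and aggregate the results.
  def self.check(body)
    rules = Rule.rules.map { |klass| klass.new(body).tap(&:run) }
    score = rules.sum(&:score)
    { score: score, reasons: rules.flat_map(&:reasons), spam: score > SPAM_THRESHOLD }
  end
end
```

Adding a new rule in this sketch needs no registration call: defining the subclass is enough, which is the appeal of the class form over an add_rule block.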
 
Recommended serving directions:
 
    class Comment
      include Splam
      
      splammable :body
    end
    
    comment = Comment.new :body => "This is spam!!!1"
    comment.splam? # => false
    comment.splam_score # => 2
    comment.splam_reasons # => []
 
Add this to a model, check the score, and determine (based on other factors such as logged-in
user, time spent on the page, validity of request headers, length of user's membership on the
site) whether to ban the post or not.
 
We recommend showing the post to its author (spamboxing them in) but hiding it from everyone else.
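
One way to combine the score with other signals and then spambox a post might look like the sketch below. Everything here is hypothetical: the SpamPolicy module, the signal names (logged_in, member_days), the size of the allowances and the 30-day cutoff are placeholders to tune for your own site. Only the thresholds of roughly 40 and 1,000 come from the text above.

```ruby
# Hypothetical policy helper; not part of Splam.
module SpamPolicy
  SPAM_THRESHOLD = 40   # scores above this are "likely to be spam"
  OBVIOUS_SPAM   = 1000 # "real spam" scores above this

  # Returns :publish or :spambox for a given splam score.
  def self.decide(score, logged_in: false, member_days: 0)
    # Blatant spam is spamboxed no matter who posted it.
    return :spambox if score > OBVIOUS_SPAM

    # Assumed weights: give signed-in, long-standing members extra headroom.
    allowance = 0
    allowance += 10 if logged_in
    allowance += 10 if member_days > 30

    score > SPAM_THRESHOLD + allowance ? :spambox : :publish
  end

  # A spamboxed post stays visible to its author and nobody else.
  def self.visible_to?(status, author_id, viewer_id)
    status == :publish || author_id == viewer_id
  end
end
```

For example, SpamPolicy.decide(comment.splam_score, logged_in: true) would be called when a comment is saved, and visible_to? would be consulted when rendering the comment list.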
 
TODO
 
- Integrate Bayesian or other clever algorithm, so that scores aren't hardcoded.
- Switch to using a percentage (0.994) rather than a score (250)
- Write more plugins!
- Test against a larger ham corpus
- Fix that nasty autoloading code in splam.rb
 
Copyright (c) 2008 ENTP, released under the MIT license