Skip to content

dchest/pybayesantispam

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyBayesAntispam

Simple Bayesian spam rating in Python that is easy to use, small, contained in a single file, and doesn't require any external modules.

Author: Dmitry Chestnykh, Coding Robots

License: MIT (see bayes.py)

Example usage

Creating and destroying Storage and Bayes objects

With data output to file "bayes.dat":

import bayes

storage = bayes.Storage("bayes.dat", 10)
try:
    storage.load()
except IOError:
    pass  # don't fail if bayes.dat doesn't exist, it will be created

bayes = bayes.Bayes(storage)

Train and/or check for spam (described below). After you're done (e.g. before exiting from script) call:

storage.finish()

to save changes (they'll be saved only if needed).

Explanation

First, you create a Storage object with the following arguments:

filename

Path to data file where probabilities of tokens are stored.

min_save_count (default 5)

How many trainings should be performed before saving data to file. This is needed purely for performance reasons. Since probability data can contain hundreds of thousands tokens (saving half a million tokens takes a few seconds on my MacBook Air), it would be unwise to save this data after every train operation. If you set min_save_count to 10, data will be written out to file every 10 train operations.

Next, you create a Bayes object with your Storage object as argument. The separation of storage is here to improve performance of web applications (and to enable using of other storages). For example, if you create your storage before interacting with FastCGI server, and share it with all Bayes instances, your web app will only load data on startup, not on every visit of a page. Note, however, that currently Storage is not thread-safe, so don't share a single Storage object between instances of Bayes in multiple threads.

Then you perform your training or checking for spam operations.

Finally, you should call storage.finish() to make sure your data are saved to file.

Training

For a Bayesian antispam filtering system to work, you should train it on bad (spam) and good (ham, or not spam) data. After you created objects as described above, interact with an instance of Bayes object:

spam_message = "Viagra, cialis for $2.59!!! Call 555-54-53"
bayes.train(spam_message, True)

ham_message = "Paul Graham doesn't need Viagra. He is NP-hard."
bayes.train(ham_message, False)

Explanation

train() takes two arguments:

message

Text for training.

is_spam

Indicates if given message is spam.

Checking for spam

After you trained your system with enough data, you can rate or check messages using one of the two methods: is_spam() or spam_rating(). Let's use both with data trained above.

m1 = "Cheap viagra for 2.59"
bayes.is_spam(m1)       # => True
bayes.spam_rating(m1)   # => 0.97

m2 = "I don't use viagra (yet)"
bayes.is_spam(m2)       # => False
bayes.spam_rating(m2)   # => 0.16  

Explanation

Both methods take text message you want to check for spam as an argument.

spam_rating() calculates and returns probability of a message to be spam (from 0 to 1).

is_spam() simply calls spam_rating() and checks if probability is more than 0.9.

Using PyBayesAntispam from shell

You can use PyBayesAntispam as a stand-alone executable.

Since training data is loaded on every execution, it's currently inefficient to use PyBayesAntispam from shell with large training databases and a lot of messages.

Usage

Usage: bayes.py [option] datafile < infile

Options:
		-s, --train-spam - train datafile with infile as spam
		-h, --train-ham - train datafile with infile as ham
		-c, --check - check for spam rating of infile

Examples

To train bayes.dat with spam.txt as spam:

bayes.py -s bayes.dat < spam.txt

To train bayes.dat with a given message as ham (not spam):

echo -n "Not a spammy message" | bayes.py -h bayes.dat

To check message.txt against bayes.dat for spam:

bayes -c bayes.dat < message.txt

This will output spam rating of message.txt.

Another example of checking:

$ echo "I don't use viagra (yet)" | ./bayes.py -c test.dat
0.16

Running tests

PyBayesAntispam includes a few unit tests. To run them:

$ ./tests.py

Implementation details

Storage

Storage stores database of tokens using cPickle (since it seems like an efficient way to store binary data [ref. 1]). Data values stored in file are totals and tokens.

Totals is a simple dictionary:

totals = {'spam':integer, 'ham':integer}

Tokens is a dictionary of lists with the following schema:

tokens[hash] = [ham_count, spam_count]

As you can see, it doesn't store actual word, only its hash (calculated using Python's hash() function).

Bayes

When you train your filter with a message, totals['spam'] (or totals['ham'] if the message is ham) in storage is incremented by 1, and tokens for each word (token) are incremented by 1 accordingly.

When you check a message, probabilities are being calculated [ref. 2] from totals and tokens:

	    			p1 * p2 * ... * pN
p = -----------------------------------------------------
      p1 * p2 * ... * pN + (1 - p1) * (1 - p2) * (1 - p3)

where:

  • p is the probability (rating) that the given message is spam.
  • pN is the probability that it is a spam knowing it contains a Nth token.

pN is calculated using the following formula:

				spam_count / totals_spam
pN = ------------------------------------------------------
        (ham_count / total_ham) + (spam_count / total_spam)

where spam_count and ham_count are counts for a given token.

Only 20 most interesting probabilities (pN) [ref. 3] are used to calculate final probability (p): 10 highest and 10 lowest. Also, lowest p(N) probability is 0.01 to avoid float underflows.

Default probabilities [ref. 3]

Token has 0 hams, 1 or more spams: 0.99

Token has 0 spams, 1 or more hams: 0.01

Token is not in database: 0.04

Message is spam (as used by is_spam() function): > 0.9

Token separation

  • Everything between whitespace or separator characters is a token (e.g. "Hello, there!!! Nice weather?" is split to ["hello", "there", "nice", "weather"])...

  • ...except that dots between digits are not separators (e.g. "Address 192.168.1.3" is ["address", "192.168.1.3"], or "$10.80 payment" is ["$10.80", "payment"]) [ref. 4]

  • Tokens are case insensitive (e.g. "spam" and "SPAM" have the same probability), i.e. they are converted to uppercase. It should work better for small training databases.

  • Dashes between words are preserved (e.g. "know-how" is a single token, but "hey - you" are two tokens, "hey" and "you").

  • Tokens with 2 or less characters are not considered (e.g "this is not me" is two tokens "this" and "not".)

(Finally, as said before, tokens are not words, but hash(token).)

References

  1. Didip Kerabat. Python: cPickle vs ConfigParser vs Shelve Performance, 2009
  2. -. Bayesian Spam Filtering. Wikipedia.
  3. Paul Graham. A Plan for Spam, 2002
  4. Paul Graham. Better Bayesian Filtering, 2003

About

Simple Bayesian spam rating in Python that is easy to use, small, contained in a single file, and doesn't require any external modules.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages