There was a discussion on comp.lang.python about a spam filter based on probabilities. This theory is described by Paul Graham in his article A Plan for Spam; the filter presented here also partially includes the improvement described by Gary Robinson in Spam Detection.
If you read Graham's article, you will see that the method is very attractive and that the results reported by the author are very interesting. So I decided to write such a filter. It offers the following features:
Probabilistic analysis
Each word or word group is associated with a probability computed from real emails received by the user. From the individual probabilities of the words, we compute the probability that a message is spam. Messages whose probability is high are tagged (a tag is added to the subject). The mail reader can then sort the incoming messages according to this tag.
Separate analysis
For the sake of performance, the creation of the database containing the individual probabilities is independent of the real-time filtering.
White list
To reduce the risk of false positives (which is already nil in my case), people you have sent messages to are put in a white list. Their messages are then accepted without being filtered. This also speeds up the process.
POP3 Proxy
To filter incoming messages transparently for the user, PopF is a POP3 proxy that links your mail software to your POP3 server. This system is very simple to use and can be adapted to any software that conforms to the POP3 protocol.
Decoding
Headers, text and other attachments (plain text, base64 or quoted-printable encoding) are decoded before being filtered. Other formats (pictures, executable files, …) are ignored.
Antivirus
PopF can be connected to an antivirus (not released with PopF).
Training to exhaustion
Iterative learning algorithm using only misclassified messages (smaller database and more selective filter).
PopF is written in Python and should work on any platform that supports Python. I have tested it on Linux only and I am very interested in any attempt on other operating systems.
PopF is mentioned on the Internet:
Paul Graham
http://www.paulgraham.com/filters.html
Gary Robinson
http://www.transpose.com/grobinson.html
http://www.transpose.com/technology.html
And in the press:
Linux Loader
The synopsis of issue 17 is there. The article describes the installation of PopF.
The PopF Python script is contained in a single file: popf.py
This table is the result of the popf.py -check command. It shows how efficient PopF is on known spams by testing all messages against the current database. The efficiency should be close to 100%.
This table is the result of the popf.py -efficiency command. It shows the real results of PopF by checking the X-PopF-Spam header. It better shows the efficiency of PopF at the time a new (and maybe unknown) spam arrives. Be aware that the efficiency may be very low at the beginning (with few known spams).
popf.py -proxy
starts the POP3 proxy.
popf.py -kill
kills the proxy.
popf.py -gen
builds the database.
popf.py -test files ...
tests files with the current database.
popf.py -setup
makes a default configuration file. To create a predefined configuration file:
popf.py -setup Graham [exhaustion]
popf.py -setup Robinson [exhaustion]
popf.py -setup Robinson-Fisher [exhaustion]
popf.py -clean
cleans POP3 accounts. Spams are kept for the generation of the database. Wanted messages can be forwarded to other email addresses.
popf.py -purge
purges the oldest spams.
popf.py -version
prints PopF version.
popf.py -check
computes the efficiency of the filter on the messages of the user.
popf.py -efficiency
computes the actual efficiency of the filter on the messages of the user.
The installation described here is for Linux. If you use PopF on other operating systems (especially Window$), do not hesitate to share your experience ;-)
Python must be installed. I have tested PopF with version 2.3.4, but it should work with version 2.3 or greater.
PopF can also benefit from Psyco when it is installed.
Then you need popf.py. Put it anywhere in an accessible path (/usr/bin for example). The script should be executable (chmod +x popf.py).
Warning
To download PopF, you have to use the “Download this link” function (or a similar function in your browser). If you copy and paste the source directly from the browser, you may get a corrupted popf.py file.
To configure PopF, run popf.py -setup. It is also possible to use predefined configurations:
popf.py -setup Graham
Method described by Paul Graham
popf.py -setup Robinson
Method described by Gary Robinson
popf.py -setup Robinson-Fisher
Method described by Gary Robinson, based on Fisher’s calculation
This creates ~/.popf/popfrc containing the following parameters:
HOME
PopF can be executed before the HOME environment variable is defined. To do so, just copy the popfrc configuration file to /etc/popf.conf (Linux/Unix) or C:\popf.conf (Window$) and define the HOME variable in this file. The $HOME/.popf/popfrc file will then be read to replace or complete the parameters defined in popf.conf. This variable has no effect in the popfrc file.
On Windows, the USERPROFILE variable is used if HOME is not defined.
HOST, PORT, TIMEOUT
Host name and port number of the proxy. HOST should be ‘localhost’ since PopF runs on your own machine. The default value of PORT is 50110. It can be 110 (the default value for POP3) if you run PopF as root. Default values are recommended.
The TIMEOUT parameter is the longest delay in seconds. After such a period of inactivity, the connection is aborted. If TIMEOUT is None, there is no limit. This feature requires Python 2.3 or later; with Python 2.2, PopF still works but without timeout.
LOG
Whether POP3 commands are saved in ~/.popf/popf.log (LOG = True or False).
LOCALE
Definition of the characters in a word. The default value (None) does not accept accented characters, for example. To list the known locale names, run locale -a. With a German configuration, we may use LOCALE = 'German'.
WARNING: this option works well under Linux/Unix. I am not sure it does under Window$.
TOKEN, NONTOKEN
TOKEN is a regular expression defining a word. NONTOKEN is a regular
expression used to ignore some words recognized by TOKEN (for example
words with only digits or shorter than 3 characters). Default values are
recommended.
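As an illustration of how TOKEN and NONTOKEN work together, here is a minimal Python 3 sketch. The two regular expressions are hypothetical and are not PopF's defaults; they only mirror the description above (TOKEN finds candidate words, NONTOKEN rejects words made only of digits or shorter than 3 characters).

```python
import re

# Hypothetical patterns for illustration only; PopF's real defaults may differ.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")
NONTOKEN = re.compile(r"\d+|.{1,2}")  # only digits, or shorter than 3 characters


def tokenize(text):
    """Return the words found by TOKEN that NONTOKEN does not reject."""
    return [w for w in TOKEN.findall(text) if not NONTOKEN.fullmatch(w)]


words = tokenize("Buy 100 pills at $9 now, it is FREE")
print(words)
```

Here "100" is dropped because it contains only digits, and "at", "$9", "it" and "is" are dropped because they are too short.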
HEADER_FILTER, BODY_FILTER
If HEADER_FILTER is True, the filter uses headers. If BODY_FILTER is
True, the filter uses the body of the message. By default both
parameters are active.
GOOD_CORPUS, BAD_CORPUS
GOOD_CORPUS is a file or directory (or a set of them) containing non-spam emails.
BAD_CORPUS is a file or directory (or a set of them) containing spam emails.
These files must be RFC822 compliant (Unix format with many messages per file or MH format with one file per message). The filter may work with other formats but this has not been tested.
You absolutely need to change these values. For example:
GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = '/home/foo/Mail/SPAM'
GOOD_CORPUS must not be a subdirectory of BAD_CORPUS and vice versa.
IGNORED_EXTENSIONS
IGNORED_EXTENSIONS is the list of the extensions of the files to be ignored while learning. These are the files that do not contain messages. The default value suits some popular mail programs.
WHITELIST
WHITELIST is the list of the user's own addresses. The white list itself is the set of addresses to which the user has sent emails, so there is no need to build it from scratch. For example:
WHITELIST = 'my.first.email@free.fr', 'my.second.email@free.fr'
TRAINING_TO_EXHAUSTION
Training to exhaustion learning method. By default this method is
disabled because it can consume a huge amount of memory. When this
parameter is True, the following parameters must be defined:
TRAINING_TO_EXHAUSTION_GOOD_LIMIT
Maximal probability that non-spam messages must not exceed
TRAINING_TO_EXHAUSTION_BAD_LIMIT
Minimal probability that spams must not fall below
TRAINING_TO_EXHAUSTION_MAX_ITERATION
Maximal number of iterations
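The three parameters above suggest a loop of the following shape. This is only a sketch in Python 3 under stated assumptions, not PopF's code: the function name, the score and learn callables and the ToyModel class are all hypothetical, and the toy scoring (average of per-word spam ratios) stands in for the real probability computation.

```python
def train_to_exhaustion(good, bad, score, learn,
                        good_limit, bad_limit, max_iteration):
    """Learn only the messages that are still misclassified, pass after pass.

    score(message) returns a spam probability in [0, 1].
    learn(message, is_spam) updates the model.
    Returns the number of learning passes performed.
    """
    for passes in range(max_iteration):
        wrong_good = [m for m in good if score(m) > good_limit]
        wrong_bad = [m for m in bad if score(m) < bad_limit]
        if not wrong_good and not wrong_bad:
            return passes  # every message is within its limit
        for m in wrong_good:
            learn(m, False)
        for m in wrong_bad:
            learn(m, True)
    return max_iteration


class ToyModel:
    """Hypothetical stand-in for the real word database."""

    def __init__(self):
        self.counts = {}  # word -> [ham count, spam count]

    def learn(self, message, is_spam):
        for w in message.split():
            self.counts.setdefault(w, [0, 0])[int(is_spam)] += 1

    def score(self, message):
        probs = []
        for w in message.split():
            ham, spam = self.counts.get(w, (0, 0))
            probs.append(spam / (ham + spam) if ham + spam else 0.5)
        return sum(probs) / len(probs) if probs else 0.5


model = ToyModel()
passes = train_to_exhaustion(["meeting tomorrow"], ["cheap pills"],
                             model.score, model.learn,
                             good_limit=0.1, bad_limit=0.9, max_iteration=10)
```

Because correctly classified messages are never learned, the database only grows with the messages the filter gets wrong, which is where the smaller database comes from.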
METHOD
Probability computation for messages (Graham, Robinson or
Robinson-Fisher).
FREQUENCY_THRESHOLD
Minimum number of occurrences a word needs in order to be stored in the database. Rare words are not stored. The default value is recommended.
GOOD_BIAS, BAD_BIAS, GOOD_PROB, BAD_PROB, UNKNOWN_PROB
Bias and probabilities of spam, nonspam and unknown words. Default
values are recommended.
RARE_WORD_STRENGTH
Strength given to “rare” words. The default value is recommended.
SIGNIFICANT
Number of words to take into account in a message to be filtered. The default value is recommended.
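To make the role of SIGNIFICANT concrete, here is a minimal Python 3 sketch of the combination described in Paul Graham's article: keep the words whose probability is farthest from 0.5, then combine them. The function name is hypothetical, the value 15 comes from Graham's article and is not necessarily PopF's default, and PopF's own implementation is not shown here.

```python
from math import prod


def graham_probability(word_probs, significant=15):
    """Combine word probabilities as in "A Plan for Spam".

    Assumes every probability lies strictly between 0 and 1
    (Graham clamps them to 0.01 .. 0.99), so the sum below is never zero.
    """
    # Keep the words whose probability is farthest from the neutral value 0.5.
    chosen = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)
    chosen = chosen[:significant]
    spam = prod(chosen)
    ham = prod(1.0 - p for p in chosen)
    return spam / (spam + ham)


p = graham_probability([0.99, 0.99, 0.2])
print(p > 0.9)
```

With the Graham method, the result would then be compared with a threshold such as 0.9.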
BAD_THRESHOLD
Threshold from which the message is considered as spam. Default values are recommended (0.9 if METHOD == "Graham", 0.5 if METHOD == "Robinson").
UNCERTAIN
Width of the uncertainty band around BAD_THRESHOLD. Default values are
recommended.
TAG
Tag to insert in the subject of spams.
To avoid tagging the subject, just use an empty TAG (TAG = ""). When the tag is empty, it is still possible to filter messages using the X-PopF-Spam header, which is always added to spams. Version 4.1.0 of PopF also adds an “X-Spam-Flag: YES” header for use with gnubiff.
Warning
It is better to filter messages using the “X-PopF-Spam” header because some spams have more than one “Subject” header and PopF only tags one (this will be fixed in a future version).
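Filtering on the header rather than on the subject can be sketched in Python 3 with the standard email module. The function name and the sample messages are hypothetical, and since the value PopF writes in the header is not documented here, the sketch only tests for its presence.

```python
import email


def is_popf_spam(raw_message):
    """Return True when the message carries an X-PopF-Spam header."""
    msg = email.message_from_string(raw_message)
    return msg["X-PopF-Spam"] is not None


# Illustrative messages; the header value "yes" is an assumption.
tagged = ("From: a@example.com\n"
          "Subject: hello\n"
          "Subject: hello again\n"
          "X-PopF-Spam: yes\n"
          "\n"
          "body\n")
untagged = "From: b@example.com\nSubject: hello\n\nbody\n"

print(is_popf_spam(tagged), is_popf_spam(untagged))
```

The test does not depend on how many “Subject” headers the message has.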
AUTORELOAD
AUTORELOAD tells PopF to reload the probabilities when they are
generated.
ANTIVIRUS, VIRUS_TAG, FAST_ANTIVIRUS
ANTIVIRUS is the list of antiviruses to use with the filter. Each antivirus command (name and options) is followed by a regular expression that matches the name of the detected virus. For instance, to use f-prot and clamav:
ANTIVIRUS = 'f-prot', 'Infection: (.*)', 'clamscan -r --disable-summary', ': (.*) FOUND'
When FAST_ANTIVIRUS = True, only spam messages are checked for viruses, which speeds up the process. The default value is FAST_ANTIVIRUS = False.
VIRUS_TAG is the tag to insert in the subject of infected messages.
When a virus is found, the X-PopF-Virus header is added to the message. This header holds the name of the virus.
To avoid tagging the subject, just use an empty VIRUS_TAG (VIRUS_TAG = "").
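The regular expressions in ANTIVIRUS each capture the virus name in their first group. The Python 3 sketch below applies the clamscan expression from the example above to two sample output lines. The function name and the sample lines are illustrative, and how PopF actually runs the scanner and reads its output is not shown here.

```python
import re


def virus_name(scanner_output, pattern):
    """Return the name captured by the first group of pattern, or None."""
    match = re.search(pattern, scanner_output)
    return match.group(1) if match else None


CLAMSCAN_PATTERN = ": (.*) FOUND"  # taken from the ANTIVIRUS example

infected = virus_name("mail.eml: Eicar-Test-Signature FOUND", CLAMSCAN_PATTERN)
clean = virus_name("mail.eml: OK", CLAMSCAN_PATTERN)
print(infected, clean)
```

A captured name of this kind is what the X-PopF-Virus header is described as holding.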
BYPASS
BYPASS is the list of regular expressions that define the messages not
to be filtered.
CLEANER_ACCOUNTS, CLEANER_DIRECTORY, CLEANER_PERIOD, CLEANER_FORWARDS, CLEANER_SMTP
The -clean option downloads spams and stores them in a local directory (referenced in BAD_CORPUS). This is useful to clean a mailbox and leave wanted messages on the server. This option can also forward wanted messages to other email addresses.
CLEANER_ACCOUNTS is the list of accounts to be cleaned. Each item of the list looks like user:password@host:port (the :port part is optional).
CLEANER_DIRECTORY is the directory where spams will be stored. This directory should be a subdirectory of BAD_CORPUS or be referenced by BAD_CORPUS.
CLEANER_PERIOD is the period in hours between two cleanings. If CLEANER_PERIOD is None, only one cleaning will be done.
CLEANER_FORWARDS is the list of email addresses to which messages are forwarded.
CLEANER_SMTP is the SMTP server used to forward messages.
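The user:password@host:port format of a CLEANER_ACCOUNTS item can be split as in the following Python 3 sketch. The function name is hypothetical, and using 110 (the standard POP3 port) when :port is omitted is an assumption, not a documented PopF behaviour.

```python
def parse_account(item, default_port=110):
    """Split 'user:password@host[:port]' into its four parts."""
    # Split on the last '@' so that user names containing '@' still work.
    credentials, _, server = item.rpartition("@")
    # Split on the first ':' so that the password may contain ':'.
    user, _, password = credentials.partition(":")
    host, _, port = server.partition(":")
    return user, password, host, (int(port) if port else default_port)


with_port = parse_account("foo:secret@pop.example.com:995")
without_port = parse_account("foo:secret@pop.example.com")
print(with_port, without_port)
```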
PURGE, PURGE_DIRECTORY
The -purge option moves or removes the oldest spams so as not to overload the database and to keep it representative of recent spams rather than older ones. This also seems to avoid false positives that appear when the database contains old spams (maybe because such a database is too heterogeneous).
PURGE can have several values:
PURGE = integer value
PURGE is the number of months after which spams must be removed
PURGE = floating point value
PURGE is the ham/spam ratio (e.g. if PURGE = 1.0, PopF will keep the spam repository as big as the ham repository)
PURGE = None
the option is disabled
PURGE_DIRECTORY is the directory to which the oldest spams are moved. If PURGE_DIRECTORY is None, the spams are deleted.
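Since the meaning of PURGE depends on the type of its value, the three cases listed above can be summarised by a small Python 3 sketch. The function name and the returned labels are hypothetical and only restate the rules given in this section.

```python
def purge_policy(purge):
    """Describe what a PURGE value means, following the rules above."""
    if purge is None:
        return ("disabled", None)
    if isinstance(purge, bool):  # bool is a subclass of int in Python
        raise TypeError("PURGE must be an integer, a float or None")
    if isinstance(purge, int):
        return ("max_age_months", purge)   # remove spams older than this
    if isinstance(purge, float):
        return ("ham_spam_ratio", purge)   # keep this ham/spam ratio
    raise TypeError("PURGE must be an integer, a float or None")


print(purge_policy(6), purge_policy(1.0), purge_policy(None))
```

Note that PURGE = 1 and PURGE = 1.0 therefore mean different things: one month in the first case, equal repository sizes in the second.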
To build the database: popf.py -gen
This takes a little patience…
You need to rebuild the database from time to time to maintain its efficiency.
To use PopF, you need to configure your software as follows:
Protocol
POP3
Server
localhost
User name
your.user.name@your.pop.server
Password
your password on your.pop.server
POP3 Port
50110
For example, if my user name is christophe.delord on a POP3 server (pop.example.com), my user name for PopF is christophe.delord@pop.example.com (this is how PopF knows it must connect to pop.example.com, and different accounts can use different POP3 servers).
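The same settings can be tried from Python 3 with the standard poplib module, which is a convenient way to check that the proxy answers. This is only a sketch: the function names are hypothetical, the 30 second timeout is arbitrary, and it assumes the proxy is already running on localhost with the default port 50110.

```python
import poplib


def proxy_user_name(user, server):
    """Build the user name expected by PopF: user@real.pop.server."""
    return "%s@%s" % (user, server)


def count_messages(user, server, password, host="localhost", port=50110):
    """Log in through the PopF proxy and return the number of messages."""
    box = poplib.POP3(host, port, timeout=30)
    try:
        box.user(proxy_user_name(user, server))
        box.pass_(password)
        count, _size = box.stat()
        return count
    finally:
        box.quit()


print(proxy_user_name("christophe.delord", "pop.example.com"))
```

Calling count_messages("christophe.delord", "pop.example.com", "your password") would then go through the proxy exactly as a mail reader does.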
To start PopF: popf.py -proxy
You can start PopF automatically together with your mail reader, for instance with a shell script.