Pyzor started out as merely a Python implementation of Razor. However, because of Razor's protocol and because Razor's server is not Open Source or software libre, Frank Tobin decided to implement Pyzor with a new protocol and to release the entire system as Open Source and software libre.
The central premise of Pyzor is that it converts an email message to a short digest that uniquely identifies the message. Simply hashing the entire message is an ineffective way to generate a digest, because message headers differ even when the content does not, and because spammers often try to make each message unique by injecting random or unrelated text.
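A small illustration of the header problem, as a sketch rather than anything taken from Pyzor itself: two copies of the same body, delivered to different recipients, hash differently when the whole message is hashed. The sample messages, the helper names `whole_message_hash` and `body_only_hash`, and the use of SHA-1 are assumptions made for this example.

```python
import hashlib

# Hypothetical sample data: one body, two different sets of headers.
body = "Cheap watches, limited offer, reply today.\n"
copy_a = "Received: from mx1.example.org\nTo: alice@example.org\n\n" + body
copy_b = "Received: from mx7.example.net\nTo: bob@example.net\n\n" + body


def whole_message_hash(raw):
    """Hash headers and body together (the ineffective approach)."""
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def body_only_hash(raw):
    """Hash only the text after the first blank line, i.e. without headers."""
    _, _, message_body = raw.partition("\n\n")
    return hashlib.sha1(message_body.encode("utf-8")).hexdigest()


print(whole_message_hash(copy_a) == whole_message_hash(copy_b))  # False
print(body_only_hash(copy_a) == body_only_hash(copy_b))          # True
```

Dropping the headers fixes this particular case, but it does not help against random text injected into the body, which is why the digest described here does more than that.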
To generate a digest, the 2.0 version of the Pyzor protocol:
- Discards all message headers.
- If the message is greater than 4 lines in length:
  - Discards the first 20% of the message.
  - Uses the next 3 lines.
  - Discards the next 40% of the message.
  - Uses the next 3 lines.
  - Discards the remainder of the message.
- Removes any 'words' (sequences of characters separated by whitespace) that are 10 or more characters long.
- Removes anything that looks like an email address (X@Y).
- Removes anything that looks like a URL.
- Removes anything that looks like HTML tags.
- Removes any whitespace.
- Discards any lines that are fewer than 8 characters in length.
This is intended as an easy-to-understand explanation, rather than a technical one.
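The steps listed above can be sketched in Python roughly as follows. This is a simplified illustration, not Pyzor's actual code. It makes several assumptions: the regular expressions are approximations of "looks like" an email address, URL, or HTML tag; percentages are taken of the total number of body lines and rounded down; messages of 4 lines or fewer are used whole; and the cleaned text is hashed with SHA-1. The names `select_lines`, `normalize`, and `sketch_digest` are hypothetical.

```python
import hashlib
import re

# Approximate patterns, applied in the order the steps are listed above.
LONG_WORD = re.compile(r"\S{10,}")              # 'words' of 10 or more characters
EMAIL = re.compile(r"\S+@\S+")                  # anything shaped like X@Y
URL = re.compile(r"[a-z]+:\S+", re.IGNORECASE)  # scheme:rest
HTML_TAG = re.compile(r"<[^>]*>")
WHITESPACE = re.compile(r"\s+")


def select_lines(lines):
    """Skip 20%, take 3 lines, skip 40%, take 3 lines, drop the rest."""
    n = len(lines)
    if n <= 4:
        return lines  # assumption: short messages are used whole
    first = n * 20 // 100
    second = first + 3 + n * 40 // 100
    return lines[first:first + 3] + lines[second:second + 3]


def normalize(line):
    """Strip long words, addresses, URLs, tags, and then all whitespace."""
    for pattern in (LONG_WORD, EMAIL, URL, HTML_TAG, WHITESPACE):
        line = pattern.sub("", line)
    return line


def sketch_digest(raw_message):
    """Return a hex digest for a raw message, following the listed steps."""
    text = raw_message.replace("\r\n", "\n")
    # Discard all headers: everything up to the first blank line.
    _, _, body = text.partition("\n\n")
    selected = select_lines(body.splitlines())
    cleaned = [normalize(line) for line in selected]
    # Discard lines that are fewer than 8 characters after cleaning.
    kept = [line for line in cleaned if len(line) >= 8]
    # Assumption: SHA-1 is used to turn the remaining text into a digest.
    return hashlib.sha1("".join(kept).encode("utf-8")).hexdigest()
```

With this sketch, two copies of a message that differ only in their headers produce the same digest:

```python
body = "hello world this is text\nsecond line here ok\n"
a = "From: a@example.com\nSubject: hi\n\n" + body
b = "From: b@example.net\nSubject: yo\nX-Id: 1\n\n" + body
print(sketch_digest(a) == sketch_digest(b))  # True
```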