# Bayes Exercise (Solutions)
The goal of this exercise is to implement a Bayesian classifier that filters spam emails.

In [1]:
from collections import Counter
from data import ham, spam
from math import log, exp

In [2]:
# Load data
ham_words = []
spam_words = []
for email in ham:
    ham_words += email.words
for email in spam:
    spam_words += email.words

ham_counts = Counter(ham_words)
spam_counts = Counter(spam_words)

print("{} spam and {} ham emails loaded.".format(len(spam), len(ham)))

1380 spam and 3001 ham emails loaded.


## Determine word probabilities
Given the data in our emails, we need to estimate the probability that a given word appears in each message type. Here the estimate is the word's relative frequency among all words seen in messages of that type.

In [3]:
# Estimate of P(word | spam): relative frequency of the word among all spam words
def prob_word_given_spam(word):
    return spam_counts[word] / len(spam_words)

# Estimate of P(word | ham): relative frequency of the word among all ham words
def prob_word_given_ham(word):
    return ham_counts[word] / len(ham_words)
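
To make the estimate concrete, here is a minimal sketch on a made-up corpus. The word lists and the `toy_` names are hypothetical and are not part of the exercise data. It illustrates that a word never seen in a class gets probability zero, because `Counter` returns 0 for missing keys.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the exercise data
toy_spam_words = ["win", "money", "now", "win"]
toy_ham_words = ["meeting", "now", "lunch", "money", "report", "now"]

toy_spam_counts = Counter(toy_spam_words)
toy_ham_counts = Counter(toy_ham_words)

def toy_prob_word_given_spam(word):
    # Relative frequency of the word among all spam words
    return toy_spam_counts[word] / len(toy_spam_words)

def toy_prob_word_given_ham(word):
    # Relative frequency of the word among all ham words
    return toy_ham_counts[word] / len(toy_ham_words)

p_win_spam = toy_prob_word_given_spam("win")  # 2 of 4 spam words
p_win_ham = toy_prob_word_given_ham("win")    # "win" never occurs in ham
p_now_ham = toy_prob_word_given_ham("now")    # 2 of 6 ham words
```

A zero estimate such as `p_win_ham` cannot be passed to `log`, so a classifier built on these estimates has to handle such words explicitly.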

## Classify a given email
To classify the emails, we use Bayes' rule for computing conditional probabilities. We also assume that word occurrences are independent of each other given the message type.

Implement Bayes' rule using the logarithmic expression

$$
  \ln\left( \frac{1}{P(S|W)} - 1 \right) = \sum_i \left[ \ln P(W_i|\neg S) - \ln P(W_i|S) \right] + \ln P(\neg S) - \ln P(S),
$$

to determine if a given message is spam or ham.
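
One way to see where the logarithmic expression comes from is to start from Bayes' rule for the whole message $W$ and rearrange. This is a sketch of the standard derivation, using the same symbols as above:

$$
\begin{aligned}
P(S|W) &= \frac{P(W|S)\,P(S)}{P(W|S)\,P(S) + P(W|\neg S)\,P(\neg S)} \\
\frac{1}{P(S|W)} - 1 &= \frac{P(W|\neg S)\,P(\neg S)}{P(W|S)\,P(S)} \\
\ln\left( \frac{1}{P(S|W)} - 1 \right) &= \ln P(W|\neg S) - \ln P(W|S) + \ln P(\neg S) - \ln P(S)
\end{aligned}
$$

Under the independence assumption, $P(W|S) = \prod_i P(W_i|S)$ and $P(W|\neg S) = \prod_i P(W_i|\neg S)$, so each logarithm of a product becomes a sum over the words of the message.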

In [4]:
# Probability that a given email is spam, P(S|W)
# Note: email has a field `words` which contains a list of all words in the message
def is_spam(email, prior=0.5):
    rhs_ln = 0
    for word in email.words:
        p_spam = prob_word_given_spam(word)
        p_ham = prob_word_given_ham(word)
        if p_spam == 0 or p_ham == 0:
            # Discard words unseen in either class: log(0) is undefined and
            # there is too little data to make an informed decision
            continue
        rhs_ln -= log(p_spam)
        rhs_ln += log(p_ham)
    rhs_ln -= log(prior)
    rhs_ln += log(1 - prior)

    # Solve ln(1 / P(S|W) - 1) = rhs_ln for P(S|W)
    return 1 / (exp(rhs_ln) + 1)
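
As a sanity check, the logarithmic form can be compared against Bayes' rule applied directly. The sketch below uses made-up per-word likelihoods, and `posterior_from_log_odds` is a hypothetical helper, not part of the exercise. The helper also shows one possible way to avoid the `OverflowError` that `exp` raises for arguments above roughly 709, which could occur for a long, strongly ham-like message.

```python
from math import exp, log, prod

# Hypothetical (P(w|spam), P(w|ham)) pairs for a three-word message
likelihoods = [(0.5, 0.1), (0.2, 0.3), (0.05, 0.01)]
prior = 0.5

# Bayes' rule applied directly, with the independence assumption
p_w_spam = prod(s for s, _ in likelihoods)
p_w_ham = prod(h for _, h in likelihoods)
direct = p_w_spam * prior / (p_w_spam * prior + p_w_ham * (1 - prior))

# The same posterior computed through the logarithmic form
rhs_ln = sum(log(h) - log(s) for s, h in likelihoods) + log(1 - prior) - log(prior)
via_logs = 1 / (exp(rhs_ln) + 1)

def posterior_from_log_odds(rhs_ln):
    # Equivalent to 1 / (exp(rhs_ln) + 1), but only ever calls exp()
    # with a non-positive argument, so it cannot overflow
    if rhs_ln > 0:
        z = exp(-rhs_ln)
        return z / (1 + z)
    return 1 / (1 + exp(rhs_ln))
```

Both `direct` and `via_logs` agree up to floating-point rounding. The logarithmic form is preferable for real messages because a product of many small probabilities can underflow to zero, while a sum of logarithms does not.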

## Run our classifier over the dataset

The code below runs the classifier using the `is_spam` function defined above. Try adjusting the
threshold and observe how it represents a trade-off between false negatives, false positives, and
total error rate.

Which of these is most important in the case of email processing?

In [5]:
# Spam probability at or above which a message is rejected as spam
threshold = 0.9

fail_spam = 0
fail_ham = 0
for email in spam + ham:
    prob = is_spam(email)
    if prob >= threshold and not email.spam:
        fail_ham += 1
    if prob < threshold and email.spam:
        fail_spam += 1

# All rates below are percentages of the total number of emails
total = len(ham) + len(spam)
print("Result:")
print("error rate: {}%".format(100 * (fail_ham + fail_spam) / total))
print("false positives: {}%".format(100 * fail_ham / total))
print("false negatives: {}%".format(100 * fail_spam / total))

Result:
error rate: 8.057521113900936%
false positives: 0.7076010043369094%
false negatives: 7.349920109564026%
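
The threshold trade-off can also be explored by sweeping several values in one loop. The sketch below uses a small list of made-up `(spam probability, actual label)` pairs in place of real classifier output. The `scored` data and the `error_counts` helper are hypothetical.

```python
# Hypothetical (spam probability, is actually spam) pairs standing in
# for the output of a classifier; not taken from the exercise data
scored = [(0.99, True), (0.95, True), (0.85, True), (0.40, True),
          (0.92, False), (0.30, False), (0.10, False), (0.02, False)]

def error_counts(scored, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_pos = sum(1 for p, label in scored if p >= threshold and not label)
    false_neg = sum(1 for p, label in scored if p < threshold and label)
    return false_pos, false_neg

for threshold in (0.5, 0.9, 0.97):
    fp, fn = error_counts(scored, threshold)
    print("threshold {:.2f}: {} false positives, {} false negatives".format(threshold, fp, fn))
```

On this toy data, raising the threshold from 0.5 to 0.97 removes the single false positive but increases the false negatives from 1 to 3, which is the same kind of trade-off the exercise asks about.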
