## Chap 13 Naive Bayes

#### A Really Dumb Spam Filter

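Let `S` be the event "the message is spam" and `V` be the event "the message contains the word *viagra*." Bayes's theorem gives the probability that a message is spam given that it contains *viagra*. The numerator is the probability that a message is spam and contains *viagra*; the denominator is the probability that a message contains *viagra*:

`P(S|V) = P(S and V) / P(V) = P(V|S) P(S) / [P(V|S) P(S) + P(V|¬S) P(¬S)]`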

If we have a large collection of messages we know are spam, and a large collection we know are not spam, we can estimate `P(V|S)` and `P(V|¬S)`. If we further assume that any message is equally likely to be spam or not-spam (so that `P(S) = P(¬S) = 0.5`), then:

`P(S|V) = P(V|S) / [P(V|S) + P(V|¬S)]`

For example, if 50% of spam messages contain the word *viagra* but only 1% of nonspam messages do, then the probability that any given *viagra*-containing email is spam is:


In [3]:
0.5 / (0.5 + 0.01)  # about 98%

0.9803921568627451
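
The equal-prior assumption is a simplification. As a minimal sketch, the same calculation can be written with the prior `P(S)` left as a parameter. The function name `spam_probability_given_word` and its parameter names are illustrative choices, not part of the original notes.

```python
def spam_probability_given_word(p_word_given_spam, p_word_given_nonspam, p_spam=0.5):
    """Bayes's theorem: P(S|V) from P(V|S), P(V|not S), and the prior P(S)."""
    p_nonspam = 1 - p_spam
    numerator = p_word_given_spam * p_spam                      # P(V|S) P(S)
    denominator = numerator + p_word_given_nonspam * p_nonspam  # P(V)
    return numerator / denominator

# Equal priors, as assumed in the text
equal_priors = spam_probability_given_word(0.5, 0.01)

# Hypothetical prior: only 10% of all messages are spam
rare_spam = spam_probability_given_word(0.5, 0.01, p_spam=0.1)
```

With equal priors this reproduces the roughly 98% figure. With the hypothetical 10% prior, the same word evidence gives about 85%, which shows how much the prior can matter.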

## Implementation

To build the classifier, we first need a function that tokenizes messages into distinct words. We convert each message to lowercase, use `re.findall()` to extract "words" consisting of letters, numbers, and apostrophes, and finally use `set()` to keep just the distinct words:

In [4]:
import re

def tokenize(message):
    message = message.lower()  # convert to lowercase
    all_words = re.findall(r"[a-z0-9']+", message)  # letters, digits, apostrophes
    return set(all_words)  # remove duplicates
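
As a quick check of the tokenizing steps described above, here is a small self-contained sketch. It assumes the pattern includes the apostrophe, as the prose describes, and the sample message is made up for illustration.

```python
import re

def tokenize(message):
    message = message.lower()                       # case-insensitive matching
    all_words = re.findall(r"[a-z0-9']+", message)  # letters, digits, apostrophes
    return set(all_words)                           # keep each word once

# Hypothetical sample message; sorted() only makes the set easier to read
tokens = sorted(tokenize("Don't miss this: cheap Viagra, cheap Rolex!"))
print(tokens)  # ['cheap', "don't", 'miss', 'rolex', 'this', 'viagra']
```

Note that *cheap* appears twice in the message but only once in the result, and that punctuation such as `:` and `!` is dropped.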

The next function counts the words in a labeled training set of messages. It returns a dictionary whose keys are words and whose values are two-element lists `[spam_count, non_spam_count]` recording how many times we saw that word in spam and in nonspam messages:

In [5]:
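from collections import defaultdict

def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])  # word -> [spam_count, non_spam_count]
    for message, is_spam in training_set:
        for word in tokenize(message):  # each distinct word counts once per message
            counts[word][0 if is_spam else 1] += 1
    return counts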
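
To see what the resulting dictionary looks like, here is a minimal self-contained sketch run on a tiny, made-up training set. The three messages are hypothetical, and `tokenize` is repeated only so the example runs on its own.

```python
import re
from collections import defaultdict

def tokenize(message):
    return set(re.findall(r"[a-z0-9']+", message.lower()))

def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])  # word -> [spam_count, non_spam_count]
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

# Hypothetical labeled messages: (text, is_spam)
training_set = [
    ("Cheap viagra now", True),
    ("Viagra cheap cheap", True),
    ("Lunch now?", False),
]

counts = count_words(training_set)
for word in sorted(counts):
    print(word, counts[word])
# cheap [2, 0]
# lunch [0, 1]
# now [1, 1]
# viagra [2, 0]
```

Because `tokenize` returns a set, *cheap* is counted once for the second message even though it appears twice. The counts therefore measure how many messages contain a word, not how many times the word occurs.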