## Chap 13 Naive Bayes

#### A Really Dumb Spam Filter

Let `S` be the event "the message is spam" and `V` be the event "the message contains the word *viagra*.". 
`P(S/V) = P(message is spam and contains viagra/ probability that message contains viagra)`

If we have a large collection of messages we know are spam, and a large collection we know are not spam, we can estimate `P(V|S)` and `P(V|¬S)`. If we further assume that any message is equally likely to be spam or not-spam (so that `P(S) = P(¬S) = 0.5`), then:

`P(S|V) = P(V|S) / P(V|S) + P(V|¬S)`

E.g if 50% of spam messages have the word *viagra*, but only 1% of nonspam messages do, then the probability that any given *viagra*-containing email is spam is:


In [3]:
(0.5/(0.5 +0.01)) # 98%

0.9803921568627451

## Implimentation

Building the classifier. First, we will make a function to tokenize messages into distinct words. We'll first convert each message to lowercase; use `re.findall()` to extract "words" consisting of letters, numbers and apostrophes; finally use `set()` to get just the distinct words:

In [4]:
def tokenize(message):
    message = message.lower() # convert to lowercase
    all_words = re.findall("[a-z0-9]+", message) # extract the words
    return set(all_words) # remove duplicates   

Next function will count the words in a labeled training set of messages. We'll have to return a dictionary whose keys are words, and whose values are two-element lists[spam_count, non_spam_count] corresponding to how many times we saw that word in both spam and nonspam messages:

In [29]:
def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda:[0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1 # does dict[][] reference?
    return counts

In [35]:
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets
    w,(p(w | spam) and p(w | ~spam))"""
    return[(w,(spam + k) / (total_spams + 2 * k)) 
           for w, (spam, non_spam) in counts.items()]

The next method uses our `word_probs` classification, determined with the training set, and applies it to a new message to be tested: `message`

In [36]:
def spam_probability(word_probs, message):
    message_words = tokenize(message) # split the message up into each word
    log_prob_if_spam = log_prob_if_not_spam = 0.0
    
    # iterate through each word in our vocabulary dict
    for word, prob_if_spam, prob_if_not_spam in word_probs:
        
        # if *word* appears in the message, add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_spam)
            
        # if *word* doesnt sppear in the message add the log probability of _not_
        # seeing it which is log(1 - probability of seeing it)
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
        
        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)        

Put all of this together into a Naive Bayes Classifier:

In [54]:
class NaiveBayesClassifier:
    
    def __init__(self, k=0.5): # k is the psuedo count for smoothing
        self.k = k
        self.words_probs = []
        
    def train(self, training_set):
        
        # count spam and non-spam messages
        num_spams = len([is_spam for message, is_spam in training_set if is_spam])
        num_non_spams = len(training_set) - num_spams
        
        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts, num_spams, num_non_spams, self.k)
        print(list(self.word_probs)[0:10])
        
    def classify(self, message):
        return spam_probability(self.word_probs, message)

## Testing Our Model

How do we identify the subject line? Looking through the files, they all seem to start with "Subject" - so we'll look for that.

In [55]:
from collections import Counter, defaultdict
from machine_learning import split_data
import math, random, re, glob

# path for the files
path = r"/home/sophie/projects/DS_fromScratch/*/*"

data = []

# regex for stripping out the leading "Subject:" and any spaces after it
subject_regex = re.compile(r"^Subject:\s+")

# glob.glob returns every filename that matches the wildcarded path
for fn in glob.glob(path):
    is_spam = "ham" not in fn

    with open(fn,'r',encoding='ISO-8859-1') as file:
        for line in file:
            if line.startswith("Subject:"):
                subject = subject_regex.sub("", line).strip()
                data.append((subject, is_spam))
    

Now we can split the data into a training data and test data, and then we're ready to build a classifier

In [56]:
random.seed(0)  # to check answers with the example I am following
train_data, test_data = split_data(data, 0.75)

classifier = NaiveBayesClassifier()
classifier.train(train_data)

[('kissinger', 0.0013513513513513514), ('truck', 0.0013513513513513514), ('orgns', 0.0013513513513513514), ('delivers', 0.0013513513513513514), ('angry', 0.0013513513513513514), ('newby', 0.0013513513513513514), ('zone', 0.0013513513513513514), ('tiling', 0.0013513513513513514), ('university', 0.004054054054054054), ('interviews', 0.0013513513513513514)]


And then we can check how our model does:

In [41]:
# triplets (subject, actual is_spam, predicted spam probability)
classified = [(subject, is_spam, classifier.classify(subject)) for subject, is_spam in test_data]

# assume that spam_probability > 0.5 corresponds to spam prediction and count the 
# combinations of (actual is_spam, predicted is_spam)
counts = Counter((is_spam, spam_probability > 0.5) for _,  is_spam, spam_probability in 
                 classified)

ValueError: not enough values to unpack (expected 3, got 2)

Let's also look at the most misclassified:

In [None]:
# sort by spam_probability from smallest to largest
classified.sort(key=lambda row: row[2])

# the highest predicted spam probabilities among the non-spams
spammiest_hams = filter(lambda row: not row[1], classified)[-5:]