# Chapter 13: Naive Bayes

### Some Theory

Here we will build a Naive Bayes classifier for the spam/ham classification problem. Naive Bayes is built on Bayes' theorem, as given below:

$P(A|B) = \displaystyle \frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\neg A)P(\neg A)} $

In the email classification domain, we treat the event $A$ as "this email is spam" and the event $B$ as some feature of the given email.

A common choice for $B$ is the existence of a certain word. If we define the event $V$ as "the message contains the word viagra", then $P(A|V)$ is the probability that an email is spam given that it contains the word "viagra".
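
As a quick numerical check of the formula above, here is a minimal sketch. The function name `bayes_posterior` and all three input probabilities are made up for illustration; they are not estimated from any real data.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B) from Bayes' theorem, using P(not A) = 1 - P(A)."""
    numerator = p_b_given_a * p_a
    denominator = numerator + p_b_given_not_a * (1.0 - p_a)
    return numerator / denominator


# Hypothetical numbers, chosen only for illustration:
p_spam = 0.5               # P(A): half of all mail is spam
p_word_given_spam = 0.5    # P(V | A): half of spams contain the word
p_word_given_ham = 0.01    # P(V | not A): 1% of non-spams contain it

p_spam_given_word = bayes_posterior(p_word_given_spam, p_spam, p_word_given_ham)
print(round(p_spam_given_word, 3))  # 0.25 / 0.255, roughly 0.98
```

With these assumed inputs, seeing the word moves the spam probability from the prior of 0.5 to roughly 0.98.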

Key passage:
> The key to Naive Bayes is making the (big) assumption that the presences (or absences) of each word are independent of one another, conditional on the message being spam or not.

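Given this independence assumption, we can take the following mathematical step, where $S$ is the event "the message is spam" and $X_i$ is the event "the message contains the $i$-th vocabulary word $w_i$":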

$\displaystyle P(X_1 = x_1, \ldots, X_n = x_n | S) = P(X_1=x_1|S) \times \cdots \times P(X_n=x_n|S) = \prod_{i=1}^n P(X_i=x_i|S)$
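
The factorization above can be sketched in a few lines. This is only an illustration: `joint_likelihood` is a hypothetical helper, and the three per-word probabilities are invented values standing in for $P(X_i = 1 | S)$.

```python
import math


def joint_likelihood(word_probs, present):
    """P(X_1 = x_1, ..., X_n = x_n | class) under the independence assumption.

    word_probs[i] is P(X_i = 1 | class); present[i] is True when word i
    occurs in the message, so an absent word contributes 1 - P(X_i = 1 | class).
    """
    return math.prod(p if x else 1.0 - p
                     for p, x in zip(word_probs, present))


# Hypothetical three-word vocabulary
probs_if_spam = [0.8, 0.1, 0.3]   # assumed P(X_i = 1 | S)
message = [True, False, True]     # words 1 and 3 present, word 2 absent

likelihood = joint_likelihood(probs_if_spam, message)
print(likelihood)  # 0.8 * 0.9 * 0.3, about 0.216
```

Note that `math.prod` requires Python 3.8 or later.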

Thus, applying Bayes' theorem, we can calculate the probability that a message is spam by simply coming up with estimates for $P(X_i|S)$ and $P(X_i|\neg S)$. Estimating these is as simple as counting words in labeled spam/ham messages, using the following "smoothing" measure with pseudocount $k$ (called _pseudocounting_) so that a word missing from the training spams is not assigned a probability of exactly zero:

$P(X_i|S) = \displaystyle \frac{k + \text{ number of spams containing } w_i}{2k + \text{number of spams}}$
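
To see what the pseudocount does, here is a small sketch of the smoothed estimate. The helper name `smoothed_probability` and the counts (a word that appears in 0 of 98 spams) are assumptions made up for this example.

```python
def smoothed_probability(count_with_word, total_messages, k=0.5):
    """(k + count) / (2k + total), the smoothed estimate of P(X_i | class)."""
    return (k + count_with_word) / (2 * k + total_messages)


# Hypothetical counts: a word that appears in 0 of 98 training spams
unsmoothed = 0 / 98                          # 0.0, which would veto spam outright
smoothed = smoothed_probability(0, 98, k=1)  # (1 + 0) / (2 + 98)

print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word would force the whole product of probabilities to zero, no matter how spammy the rest of the message looks.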

### An Implementation

```python
import math
import re
from collections import defaultdict


class NaiveBayesClassifier:
    def __init__(self, k=0.5):
        self.k = k              # pseudocount used for smoothing
        self.word_probs = []    # (w, p(w | spam), p(w | -spam)) triplets

    def train(self, training_set):
        """training set consists of pairs (message, is_spam)"""

        # count spam and non-spam messages
        num_spams = len([is_spam
                         for message, is_spam in training_set
                         if is_spam])
        num_non_spams = len(training_set) - num_spams

        # run training data through our "pipeline"
        word_counts = self.count_words(training_set)
        self.word_probs = self.word_probabilities(word_counts,
                                                  num_spams,
                                                  num_non_spams,
                                                  self.k)

    def classify(self, message):
        """return the estimated probability that the message is spam"""
        return self.spam_probability(self.word_probs, message)

    # Static methods
    @staticmethod
    def tokenize(message):
        message = message.lower()                       # convert to lowercase
        all_words = re.findall(r"[a-z0-9']+", message)  # extract the words
        return set(all_words)                           # remove duplicates

    @staticmethod
    def count_words(training_set):
        """training set consists of pairs (message, is_spam)"""
        # word -> [count in spams, count in non-spams]
        counts = defaultdict(lambda: [0, 0])
        for message, is_spam in training_set:
            for word in NaiveBayesClassifier.tokenize(message):
                counts[word][0 if is_spam else 1] += 1
        return counts

    @staticmethod
    def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
        """
            turn the word_counts into a list of triplets:
            (w, p(w | spam), p(w | -spam))
        """
        return [(w,
                 (spam + k) / (total_spams + 2*k),
                 (non_spam + k) / (total_non_spams + 2*k))
                for w, (spam, non_spam) in counts.items()]

    @staticmethod
    def spam_probability(word_probs, message):
        message_words = NaiveBayesClassifier.tokenize(message)
        log_prob_if_spam = log_prob_if_not_spam = 0.0

        # iterate through each word in our vocab
        for word, prob_if_spam, prob_if_not_spam in word_probs:

            # if *word* appears in the message, add the log prob of seeing it
            if word in message_words:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_not_spam += math.log(prob_if_not_spam)

            # if *word* doesn't appear in the message, add the log prob of not seeing it
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)
```
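
The log sums in `spam_probability` are there because multiplying many small probabilities underflows to zero in floating point. The sketch below demonstrates that, and also shows one possible way to get the final ratio without exponentiating each log-likelihood separately. `posterior_from_logs` is a hypothetical helper, not part of the class above, and like the ratio in `spam_probability` it assumes spam and non-spam are equally likely a priori.

```python
import math


def posterior_from_logs(log_prob_if_spam, log_prob_if_not_spam):
    """P(spam | message) from two log-likelihoods, assuming equal priors.

    Algebraically equal to e^a / (e^a + e^b), computed as 1 / (1 + e^(b - a))
    so that very negative a and b do not both underflow to 0.0.
    """
    diff = log_prob_if_not_spam - log_prob_if_spam
    if diff > 700:  # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Multiplying 1000 probabilities of 0.1 underflows to exactly 0.0 ...
product = 1.0
for _ in range(1000):
    product *= 0.1

# ... while the equivalent sum of logs stays a perfectly ordinary float.
log_sum = 1000 * math.log(0.1)

print(product, log_sum)
print(posterior_from_logs(math.log(0.216), math.log(0.054)))  # 0.216 / 0.27
```

For a large vocabulary, both `math.exp` calls at the end of `spam_probability` can return 0.0, which makes the final division fail; working with the difference of the two logs is one way to sidestep that.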

### TODOS
- Run on some training data
- Plot the ROC curve
    - Vary the spam probability cutoff
    - Use a color map for the probability value corresponding to each point in the curve
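
As a starting point for the ROC item, here is a rough sketch of how the curve's points could be computed by varying the cutoff. Everything in it is an assumption for illustration: `roc_points` is a hypothetical helper, and the scores and labels are invented rather than produced by the classifier above.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, false_positive_rate, true_positive_rate) per cutoff.

    A message is predicted spam when its score is >= cutoff.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((cutoff, fp / negatives, tp / positives))
    return points


# Hypothetical spam probabilities and true labels (True = spam)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [True, True, False, True, False, False]
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]

points = roc_points(scores, labels, cutoffs)
for cutoff, fpr, tpr in points:
    print(cutoff, fpr, tpr)
```

The false positive and true positive rates could then be handed to a plotting library. With matplotlib, for example, `plt.scatter(fprs, tprs, c=cutoffs, cmap="viridis")` would color each point by its cutoff.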