# Naive Bayes Spam Filter
In this notebook we train and validate a Naive Bayes spam filter (classifier). The plan is to use the [UCI spambase dataset](https://archive.ics.uci.edu/ml/datasets/Spambase) to classify messages as spam or not spam.

## Bayes' Theorem

Let $S$ be the event that a message is spam and $W$ the event that the message contains a given word. Bayes' theorem lets us calculate the probability that the message is spam given the word, $P(S|W)$, from the trained likelihoods of seeing the word in spam and in non-spam messages, $P(W|S)$ and $P(W|!S)$, together with the prior $P(S)$.

$$ 
P(S|W) = \frac{P(W|S)P(S)}{P(W)} = \frac{P(W|S)P(S)}{P(W|S)P(S) + P(W|!S)P(!S)} 
$$
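
To make the formula concrete, here is a small worked example. The numbers are hypothetical and chosen only for illustration (they are not taken from spambase), and `posterior_spam` is a helper name introduced here, not part of the notebook's code.

```python
def posterior_spam(p_w_given_s, p_w_given_not_s, p_s):
    """P(S|W) from Bayes' theorem, with the evidence P(W) expanded."""
    p_not_s = 1.0 - p_s
    evidence = p_w_given_s * p_s + p_w_given_not_s * p_not_s  # P(W)
    return p_w_given_s * p_s / evidence

# Hypothetical values: the word appears in 60% of spam, in 5% of non-spam,
# and 40% of all messages are spam.
p = posterior_spam(p_w_given_s=0.6, p_w_given_not_s=0.05, p_s=0.4)
print(f"P(S|W) = {p:.3f}")  # 0.24 / 0.27, prints P(S|W) = 0.889
```

Even though only 40% of messages are spam in this toy setting, seeing the word raises the spam probability to about 89%, because the word is twelve times more likely in spam than in non-spam.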

In [176]:
import pandas as pd
import os
import numpy as np
import collections

import sklearn.metrics

import get_words

np.random.seed(123) # So you can exactly reproduce my numbers.

In [155]:
word_catagories = np.array(get_words.get_words())

def load_data():
    """ 
    Load the spambase dataset
    """
    column_names = np.concatenate((word_catagories, ['target']))
    column_numbers = np.concatenate((np.arange(len(word_catagories)), [57]))

    data = pd.read_csv(os.path.join('.', 'data', 'spambase.data'), 
                       names=column_names, usecols=column_numbers)
    #data[word_catagories] *= 10000 # Scaling factor to go from percentage to number of words 
    return data

def split_data(data, p):
    """
    Split the data into a training and a validation set, where p is the
    fraction of rows assigned to the training set.
    Returns two DataFrames with the same columns: the first (training set)
    has int(p*data.shape[0]) rows and the second (validation set) has the
    remaining rows.
    """
    n_samples = data.shape[0]
    n_train = int(n_samples * p)
    # Draw the training rows without replacement; the rest go to validation.
    idt = np.random.choice(np.arange(n_samples), size=n_train, replace=False)
    idv = np.setdiff1d(np.arange(n_samples), idt)
    # .loc works here because read_csv gives the DataFrame a default 0..n-1 index.
    return data.loc[idt], data.loc[idv]
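
Another common way to write this kind of split is to shuffle the row positions once and slice the result. The sketch below is an illustration, not the notebook's method: `split_indices` is a hypothetical helper, and it uses NumPy's `default_rng` generator, so it will not reproduce the numbers obtained with `np.random.seed(123)`.

```python
import numpy as np

def split_indices(n_samples, p, seed=123):
    """Return disjoint (train, validate) arrays of row positions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)   # every row position exactly once
    n_train = int(n_samples * p)
    return perm[:n_train], perm[n_train:]

idt, idv = split_indices(10, 0.5)
print(sorted(idt.tolist()), sorted(idv.tolist()))
```

With a DataFrame, the two position arrays could then be used as `data.iloc[idt]` and `data.iloc[idv]`, which does not depend on the index labels.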

Load the spambase dataset and split it into training and validation subsets.

In [127]:
data = load_data()
train, validate = split_data(data, 0.5)

In [128]:
train.head()

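| | make | address | all | our | over | remove | internet | order | mail | receive | ... | direct | cs | meeting | original | project | re | edu | table | conference | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3938 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3533 | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3110 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.56 | 0 |
| 892 | 0.0 | 0.0 | 0.0 | 0.0 | 0.73 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2709 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.28 | ... | 0.0 | 0.0 | 0.56 | 0.0 | 0.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |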


For each word, we now tally the number of spam and non-spam messages that contain it.

In [107]:
def count_spams_hams(train):
    """
    Counts the number of spam and non-spam messages that contain at
    least one instance of a particular word (non-zero frequency).
    """
    spam_counts = collections.defaultdict(int)
    ham_counts = collections.defaultdict(int)

    # Loop over the messages in the training set.
    for _, message in train.iterrows():
        spam_status = message['target']

        # Loop over all word categories and check whether each one
        # appears in the message (has a non-zero frequency).
        for word in word_catagories:
            if message[word] > 0:
                # Count the message once, however often the word occurs in it.
                if spam_status:
                    spam_counts[word] += 1
                else:
                    ham_counts[word] += 1
    return spam_counts, ham_counts

spam_counts, ham_counts = count_spams_hams(train)
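
The same tally can be computed without an explicit Python loop by comparing the whole frequency matrix against zero. The following is a minimal sketch on made-up toy data (the arrays `freqs` and `is_spam` are assumptions for illustration, not the spambase values); it returns arrays ordered by column rather than dictionaries keyed by word.

```python
import numpy as np

# Toy stand-in for the training set: 4 messages x 3 words (values are
# word frequencies in percent), plus a spam flag per message.
freqs = np.array([[0.0, 1.2, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.3, 0.9],
                  [0.0, 0.0, 0.0]])
is_spam = np.array([True, False, True, False])

present = freqs > 0                               # True where the word occurs
spam_counts_vec = present[is_spam].sum(axis=0)    # spam messages containing each word
ham_counts_vec = present[~is_spam].sum(axis=0)    # non-spam messages containing each word
print(spam_counts_vec, ham_counts_vec)            # [0 2 1] [1 0 0]
```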

Calculate the likelihood of each word given that a message is spam, and given that it is not spam (train the Naive Bayes classifier).

In [192]:
def word_probabilities(train, spam_counts, ham_counts, k=1):
    """
    Calculates the likelihood of word given spam and word given not spam.
    Returns a list of tuples (word, p(word|spam), p(word|!spam)). k is the
    smoothing factor.
    """
    n_spams = train.target.sum() # total number of spam messages
    n_hams = train.shape[0] - n_spams

    # Iterate over every word category, not just spam_counts.keys(), so that
    # words never seen in a spam message still get a smoothed likelihood.
    t = [(word,
         (k + spam_counts[word])/(2*k + n_spams),
         (k + ham_counts[word])/(2*k + n_hams)
         ) for word in word_catagories]
    return t

word_likelihood = word_probabilities(train, spam_counts, ham_counts)
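
The smoothing factor `k` matters because an estimated likelihood of exactly 0 or 1 would make the logarithms taken in the classifier infinite. A short illustration with hypothetical counts (the 900 spam messages are an assumed figure, and `smoothed_likelihood` is a helper name introduced here):

```python
def smoothed_likelihood(count, n_messages, k=1):
    """(k + count) / (2k + n_messages), the same estimate as in word_probabilities."""
    return (k + count) / (2 * k + n_messages)

# Hypothetical case: a word that appears in 0 of 900 spam messages.
unsmoothed = smoothed_likelihood(0, 900, k=0)   # exactly 0.0, so log() would be -inf
smoothed = smoothed_likelihood(0, 900, k=1)     # 1/902, small but non-zero
print(unsmoothed, smoothed)

# The other extreme: a word present in all 900 messages stays below 1.
print(smoothed_likelihood(900, 900, k=1))       # 901/902
```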

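This function takes the trained likelihoods and calculates the probability that a message is spam. Because it does not include the prior $P(S)$, it implicitly assumes that spam and non-spam messages are equally likely a priori.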

In [157]:
def spam_probability(word_likelihoods, message):
    """
    Calculate the probability that the message is spam, given the
    word likelihoods.
    """
    log_prob_spam = log_prob_ham = 0.0

    for word, spam_like, ham_like in word_likelihoods:
        # Find the position of the frequency in message that
        # corresponds to the likelihood we are currently comparing.
        idx = np.where(word == word_catagories)[0]
        assert len(idx) == 1, f"Zero or too many matches! idx={idx}"

        # If the word was found in the message.
        if message.iloc[idx[0]] > 0:
            log_prob_spam += np.log(spam_like)
            log_prob_ham += np.log(ham_like)
        # If the word was not found, add the probability of not seeing it.
        else:
            log_prob_spam += np.log(1-spam_like)
            log_prob_ham += np.log(1-ham_like)

    prob_spam = np.exp(log_prob_spam)
    prob_ham = np.exp(log_prob_ham)
    return prob_spam / (prob_spam + prob_ham)
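
With many more features than used here, the summed logs can become so negative that exponentiating them underflows to 0 for both classes, and the final ratio becomes 0/0. One way around this, sketched below under that assumption, is to compute the same ratio directly from the difference of the two log sums. `posterior_from_logs` is a hypothetical helper, not part of the notebook.

```python
import math

def posterior_from_logs(log_prob_spam, log_prob_ham):
    """Same value as exp(a) / (exp(a) + exp(b)), computed from b - a."""
    diff = log_prob_ham - log_prob_spam
    if diff > 700:            # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Both exp(-2000) and exp(-2003) underflow to 0.0, but their ratio is well defined.
print(posterior_from_logs(-2000.0, -2003.0))
```

Dividing numerator and denominator of `prob_spam / (prob_spam + prob_ham)` by `prob_spam` gives `1 / (1 + exp(log_prob_ham - log_prob_spam))`, which is what the function evaluates.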

Validate the classifier with the validation dataset.

In [185]:
filter_targets = np.zeros(validate.shape[0])

for i, (_, row) in enumerate(validate.iterrows()):
    p_i = spam_probability(word_likelihood, row)
    if p_i > 0.5:
        filter_targets[i] = 1
    else:
        filter_targets[i] = 0

What is the confusion matrix?

In [191]:
confusion = sklearn.metrics.confusion_matrix(validate.target, filter_targets)
print(confusion, '=\n\n', np.array([['TN', 'FP'], ['FN', 'TP']]), 
      '\n FP = type 1 error | FN = type 2 error',
      '\n(rows: actual ham and actual spam. columns: predicted ham and predicted spam)')

[[1270  125]
 [ 162  744]] =

 [['TN' 'FP']
 ['FN' 'TP']] 
 FP = type 1 error | FN = type 2 error 
(rows: actual ham and actual spam. columns: predicted ham and predicted spam)
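
Accuracy alone hides how the errors are distributed, so it can help to derive precision and recall from the same four counts. The sketch below reads the printed matrix using scikit-learn's documented convention for `confusion_matrix(y_true, y_pred)`: rows are true labels, columns are predicted labels, and labels are in sorted order (0 = not spam, 1 = spam).

```python
# Counts copied from the confusion matrix printed above.
confusion_counts = [[1270, 125],
                    [162, 744]]
(tn, fp), (fn, tp) = confusion_counts

precision = tp / (tp + fp)   # of the messages flagged as spam, the fraction that were spam
recall = tp / (tp + fn)      # of the actual spam messages, the fraction that were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

For a spam filter, false positives (legitimate mail flagged as spam) are usually the more costly error, which is why precision is worth tracking alongside accuracy.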


And accuracy?

In [198]:
print(round(100*(confusion[0, 0] + confusion[1, 1])/np.sum(confusion)), '% of messages were correctly classified as spam or not spam')

88.0 % of messages were correctly classified as spam or not spam
