<h1>[ADVSTAT] Naive Bayesian Spam Filtering Tutorial</h1> <br>
<b>Authors:</b> Jan Kristoffer Cheng and Shayane Tan
<hr>
<h3>Description</h3>
<p>In this notebook, we implemented a Naive Bayesian spam filter.</p>
<h4>References:</h4>
<ol>
    <li>Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., &amp; Spyropoulos, C. D. (2000). An evaluation of naive Bayesian anti-spam filtering. arXiv preprint cs/0006013.</li>
    <li>Schütze, H. (2008). 13: Text Classification and Naive Bayes. In Introduction to Information Retrieval (pp. 253-286). Cambridge University Press. Retrieved December 8, 2016, from http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf</li>
    <li>Metsis, V., Androutsopoulos, I., &amp; Paliouras, G. (2006, July). Spam filtering with naive Bayes - which naive Bayes? In CEAS (pp. 27-28).</li>
</ol>

In [1]:
import os
import math

<p>To start off, let's import the necessary packages.</p>

In [2]:
class Word:
    def __init__(self, content):
        self.content = content
        self.mutualInfo = 0
        self.notPresentSpamCount = 0
        self.notPresentLegitCount = 0
        self.presentSpamCount = 0
        self.presentLegitCount = 0
        self.spamDocumentCount = 0
        self.legitDocumentCount = 0

<p>The Word class represents one distinct word in the dataset. For each category (spam and legitimate), it stores the number of training emails that contain the word and the number that do not, that is, its document frequencies.</p>
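
<p>To make the term "document frequency" concrete, the following minimal sketch contrasts it with term frequency. The three emails are hypothetical and are not taken from the dataset; the names <code>term_frequency</code> and <code>document_frequency</code> are introduced only for this illustration.</p>

```python
from collections import Counter

# Hypothetical toy emails, made up for this sketch (not from the dataset).
emails = [
    "free money free offer",
    "meeting at noon",
    "free lunch at noon",
]

term_frequency = Counter()
document_frequency = Counter()
for email in emails:
    tokens = email.split()
    term_frequency.update(tokens)           # counts every occurrence of a token
    document_frequency.update(set(tokens))  # counts each email at most once per token

print("term frequency of 'free':", term_frequency["free"])
print("document frequency of 'free':", document_frequency["free"])
```

<p>The word "free" occurs three times overall but in only two emails, so its document frequency is 2. The counters in the Word class are of this second kind.</p>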

In [3]:
class PartFolder:
    def __init__(self):
        self.spamEmail = []
        self.legitEmail = []

    def addSpamEmail(self, email):
        self.spamEmail.append(email)

    def addLegitEmail(self, email):
        self.legitEmail.append(email)

<p>The PartFolder class represents one of the part folders in the dataset. It keeps the folder's emails in two separate lists, one for spam and one for legitimate emails.</p>

In [4]:
#user input variables
threshold_lambda = 1
threshold_t = threshold_lambda/(1+threshold_lambda)
file_path = os.path.join('spam emails', 'bare', 'part') #common prefix of the part folders; os.path.join keeps it portable

#training variables
trainingDistinctWords = {} #dictionary of Word(s)
trainingSpamEmails = [] #list of spam emails in the training set
trainingLegitEmails = [] #list of legitimate emails in the training set
folderCollection = [] #list of PartFolder(s)

nWordsSpam = 0
nWordsLegit = 0 

<p>Here, we initialize the variables that hold the words' document frequencies (trainingDistinctWords), the spam and legitimate emails of the training set (trainingSpamEmails and trainingLegitEmails), and the whole preloaded dataset (folderCollection). The threshold used to classify an email as spam from the Naive Bayes result is also defined here: threshold_t is derived from threshold_lambda as t = &lambda;/(1+&lambda;). The threshold is discussed further later.</p>
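
<p>As a rough illustration of how the threshold behaves, the sketch below evaluates t = &lambda;/(1+&lambda;) for a few values of &lambda;. It assumes the reading used in reference 1, where an email is classified as spam only when its estimated spam probability exceeds t. The helper names <code>threshold_from_lambda</code> and <code>is_spam</code> are hypothetical and are not part of the notebook's code.</p>

```python
def threshold_from_lambda(lam):
    """Convert lambda into the probability threshold t = lambda / (1 + lambda)."""
    return lam / (1 + lam)

def is_spam(p_spam, lam):
    """Assumed decision rule: flag as spam only if P(spam | email) exceeds t."""
    return p_spam > threshold_from_lambda(lam)

# A larger lambda demands more certainty before an email is flagged as spam.
for lam in (1, 9, 999):
    print("lambda =", lam, "-> t =", threshold_from_lambda(lam))

# The same estimated probability can fall on either side of the threshold.
print(is_spam(0.95, 9))    # 0.95 is compared against t = 0.9
print(is_spam(0.95, 999))  # 0.95 is compared against t = 0.999
```

<p>With &lambda; = 1 the threshold is 0.5, so an email is flagged whenever spam is the more probable class. Raising &lambda; makes the filter more conservative about blocking legitimate email.</p>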

<b>Preload email dataset:</b>
<p>To reduce the running time, we first load the whole email dataset into memory once, before training the system and evaluating the results.</p>

In [7]:
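def loadEmails(path):
    print("Loading emails...")
    #the dataset is split into ten folders whose names end in 1 to 10
    for i in range(1,11):
        partPath = path + str(i)
        partFolder = PartFolder()
        for filename in os.listdir(partPath):
            #close each file as soon as it has been read
            with open(os.path.join(partPath, filename)) as f:
                content = f.read()
            #file names starting with 'sp' are spam; all others are legitimate
            if filename.startswith('sp'):
                partFolder.addSpamEmail(content)
            else:
                partFolder.addLegitEmail(content)

        folderCollection.append(partFolder)

    print("Finished loading emails.")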
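
<p>The loading rule can be tried without the real dataset. The following sketch builds a temporary folder with three made-up files and splits them by the same 'sp' file name prefix. The function <code>load_part</code>, the file names, and the file contents are assumptions made for this illustration only.</p>

```python
import os
import tempfile

def load_part(part_path):
    """Return (spam, legit) lists of email bodies found in one part folder."""
    spam, legit = [], []
    # sorted() only makes the order of the result predictable for this demo.
    for filename in sorted(os.listdir(part_path)):
        with open(os.path.join(part_path, filename)) as f:
            content = f.read()
        # Same rule as loadEmails: an 'sp' prefix marks a spam message.
        if filename.startswith('sp'):
            spam.append(content)
        else:
            legit.append(content)
    return spam, legit

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names and contents, made up for this sketch.
    samples = {
        'spmsg1.txt': 'buy now',
        'msg1.txt': 'see you at noon',
        'msg2.txt': 'minutes attached',
    }
    for name, text in samples.items():
        with open(os.path.join(tmp, name), 'w') as f:
            f.write(text)
    spam, legit = load_part(tmp)

print(len(spam), "spam,", len(legit), "legitimate")
```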

<b>Prepare training dataset:</b>
<p>Next, we prepare the training set by counting, for each distinct word, how many spam and legitimate training emails contain it.</p>

In [None]:
def preparingTrainingSet(testingIndex):
    #rebind the module-level training variables; without this declaration
    #the assignments below would only create local variables
    global trainingSpamEmails, trainingLegitEmails, trainingDistinctWords

    print("Preparing training set...")
    #re-initialize the necessary variables for each iteration of the 10-fold cross validation
    trainingSpamEmails = []
    trainingLegitEmails = []
    trainingDistinctWords = {}

    #every part folder except the one held out for testing goes into the training set
    for i in range(len(folderCollection)):
        if i != testingIndex:
            trainingSpamEmails += folderCollection[i].spamEmail
            trainingLegitEmails += folderCollection[i].legitEmail

    for email in trainingLegitEmails:
        #set() keeps each token once per email, so the counts are document frequencies
        tokenizedEmail = set(email.split())

        #count document frequencies in the legitimate emails
        for token in tokenizedEmail:
            if token in trainingDistinctWords:
                word = trainingDistinctWords[token]
                word.presentLegitCount += 1
                word.notPresentLegitCount -= 1
            else:
                #first sighting: the word is absent from all other emails so far
                word = Word(token)
                word.presentLegitCount = 1
                word.notPresentLegitCount = len(trainingLegitEmails) - 1
                word.presentSpamCount = 0
                word.notPresentSpamCount = len(trainingSpamEmails)
                trainingDistinctWords[token] = word

    for email in trainingSpamEmails:
        tokenizedEmail = set(email.split())

        #count document frequencies in the spam emails
        for token in tokenizedEmail:
            if token in trainingDistinctWords:
                word = trainingDistinctWords[token]
                word.presentSpamCount += 1
                word.notPresentSpamCount -= 1
            else:
                word = Word(token)
                word.presentSpamCount = 1
                word.notPresentSpamCount = len(trainingSpamEmails) - 1
                word.presentLegitCount = 0
                word.notPresentLegitCount = len(trainingLegitEmails)
                trainingDistinctWords[token] = word

    print("Training distinct words: ", len(trainingDistinctWords))
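
<p>The counting scheme can be checked on a miniature example. The sketch below uses three hypothetical folds instead of ten, holds one out, and computes the presence and absence counts per token. It uses plain dictionaries rather than the Word class, and the function <code>count_documents</code> and all email texts are made up for this illustration.</p>

```python
# Hypothetical miniature dataset: 3 folds, each a (spam emails, legit emails) pair.
folds = [
    (["win cash now"], ["lunch at noon"]),
    (["cash prize inside"], ["notes from lunch"]),
    (["claim prize now"], ["agenda at noon"]),
]

def count_documents(folds, testing_index):
    """Per-token document counts over every fold except the held-out one."""
    spam = [e for i, (s, _) in enumerate(folds) if i != testing_index for e in s]
    legit = [e for i, (_, l) in enumerate(folds) if i != testing_index for e in l]
    counts = {}
    for emails, key in ((spam, "presentSpam"), (legit, "presentLegit")):
        for email in emails:
            # set() ensures an email is counted once per token.
            for token in set(email.split()):
                entry = counts.setdefault(token, {"presentSpam": 0, "presentLegit": 0})
                entry[key] += 1
    # Absence counts follow from the class totals.
    for entry in counts.values():
        entry["notPresentSpam"] = len(spam) - entry["presentSpam"]
        entry["notPresentLegit"] = len(legit) - entry["presentLegit"]
    return counts

# Hold out fold 0 for testing; folds 1 and 2 form the training set.
counts = count_documents(folds, testing_index=0)
print(counts["prize"])
print("win" in counts)  # tokens seen only in the held-out fold are not counted
```

<p>For every token, the presence and absence counts within a class add up to the number of training emails in that class, which is the invariant that the increments and decrements in preparingTrainingSet maintain.</p>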