<h1>[ADVSTAT] Naive Bayesian Spam Filtering Tutorial</h1> <br>
<b>Authors:</b> Jan Kristoffer Cheng and Shayane Tan
<hr>
<h3>Description</h3>
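<p>In this notebook, we implement a Naive Bayesian spam filter following the evaluation by Androutsopoulos et al. (2000) [1]. The emails are preloaded in four filter configurations (bare, stop, lemm, lemm_stop), each split into ten parts. For every configuration, we select the most relevant words through mutual information, classify each email with Naive Bayes, and report spam recall, spam precision, weighted accuracy, and total cost ratio (TCR), averaged over ten-fold cross-validation.</p>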
<h4>References:</h4>
<ol>
    <li>Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., & Spyropoulos, C. D. (2000). An evaluation of naive bayesian anti-spam filtering. arXiv preprint cs/0006013.</li>
    <li>Schütze, H. (2008). 13: Text Classification and Naive Bayes. In Introduction to Information Retrieval (pp. 253-286). Cambridge University Press. Retrieved December 8, 2016, from http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf</li>
    <li>Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006, July). Spam filtering with naive bayes-which naive bayes?. In CEAS (pp. 27-28).</li>
</ol>

In [11]:
import os
import math
import pandas as pd
from model.Result import Result
from model.PartFolder import PartFolder
from model.Word import Word
from mutual_information.FeatureSelector import FeatureSelector

<p>To start off, let's import the necessary packages.</p>

In [12]:
class Word:
    def __init__(self, content):
        self.content = content
        self.mutualInfo = 0
        self.notPresentSpamCount = 0
        self.notPresentLegitCount = 0
        self.presentSpamCount = 0
        self.presentLegitCount = 0
        self.spamDocumentCount = 0
        self.legitDocumentCount = 0

<p>The Word class represents each distinct word in the dataset. For a particular token, it stores the mutual information score and the document frequencies in both the spam and legitimate categories.</p>

In [13]:
class PartFolder:
    def __init__(self):
        self.spamEmail = []
        self.legitEmail = []
        self.trainingSpamEmail = []
        self.trainingLegitEmail = []
        self.relevantWords = []  # relevant words when the ith folder is used for testing

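<p>The PartFolder class represents one of the part folders in the dataset. It groups the emails of that part by class (spam or legitimate) and keeps the training emails and relevant words used when that part serves as the testing set.</p>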

In [14]:
#Constant variables
FILTERS = ['bare', 'stop', 'lemm', 'lemm_stop']

#contains the list of average results per experiment
tableResults = []

#contains the dataset for the different filter configuration:bare, stop, lemm, lemm_stop
filterCollection = {} 


<p>Here, we initialize the global variables: the list of filter configurations (FILTERS), the list of averaged results per experiment (tableResults), and the preloaded dataset for each filter configuration (filterCollection). The spam classification threshold is not defined here; it is passed to runTestTable for each experiment.</p>

<b>Preload email dataset:</b>
<p>In order to lessen the running time, let us first pre-load the email dataset before training the system and evaluating all results.</p>

In [15]:
def loadEmails(path):
    print("Loading emails...:",path)

    folderCollection = []

    #pre-load the emails of each folder per filter configuration
    for i in range(1, 11):
        partPath = path + str(i)
        partFolder = PartFolder()

        for filename in os.listdir(partPath):
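            # os.path.join avoids hard-coding the Windows separator; "with" closes the file
            with open(os.path.join(partPath, filename)) as emailFile:
                content = emailFile.read()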
            if filename.startswith('sp'):
                partFolder.spamEmail.append(content)
            else:
                partFolder.legitEmail.append(content)

        folderCollection.append(partFolder)

    #pre-load the training emails classified as spam and legit
    for i in range(10):  # testingIndex
        for j in range(10):
            if j != i:
                folderCollection[i].trainingSpamEmail += folderCollection[j].spamEmail
                folderCollection[i].trainingLegitEmail += folderCollection[j].legitEmail

    #pre-load and find the relevant words per testing index
    for i in range(10):
        folderCollection[i].relevantWords = praparingTrainingSet(folderCollection[i])

    return folderCollection  

<b>Prepare training dataset:</b>
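<p>Then, let us prepare our training set by counting, for each distinct word, how many spam and legitimate training emails contain it and how many do not. These counts are passed to the FeatureSelector, which returns the relevant words.</p>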

In [16]:
def praparingTrainingSet(testingFolder):
    
    trainingDistinctWords = {}

    for email in testingFolder.trainingLegitEmail:
        email = email.split()
        tokenizedEmail = set(email)

        # count document frequencies (each token is counted once per email)
        for token in tokenizedEmail:
            if token in trainingDistinctWords:
                word = trainingDistinctWords.get(token)
                word.presentLegitCount += 1
                word.notPresentLegitCount -= 1
            else:
                word = Word(token)
                word.presentLegitCount = 1
                word.notPresentLegitCount = len(testingFolder.trainingLegitEmail) - 1
                word.presentSpamCount = 0
                word.notPresentSpamCount = len(testingFolder.trainingSpamEmail)
                trainingDistinctWords[token] = word

    for email in testingFolder.trainingSpamEmail:
        email = email.split()
        tokenizedEmail = set(email)
        for token in tokenizedEmail:
            if token in trainingDistinctWords:
                word = trainingDistinctWords.get(token)
                word.presentSpamCount += 1
                word.notPresentSpamCount -= 1
            else:
                word = Word(token)
                word.presentSpamCount = 1
                word.notPresentSpamCount = len(testingFolder.trainingSpamEmail) - 1
                word.presentLegitCount = 0
                word.notPresentLegitCount = len(testingFolder.trainingLegitEmail)
                trainingDistinctWords[token] = word

    fs = FeatureSelector(trainingDistinctWords)
    return fs.getRelevantWords()
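
<p>The FeatureSelector class is imported from the mutual_information package and is not shown in this notebook. As a rough illustration only, the mutual information between a Boolean word attribute and the class, as defined in reference 1, could be computed from the four counts stored in a Word object as sketched below. The function name and signature are hypothetical, and this is an assumption about what the selector computes, not its actual code.</p>

```python
import math

def mutual_information(present_spam, not_present_spam, present_legit, not_present_legit):
    """Mutual information (in nats) between a Boolean word attribute and the class.

    Hypothetical helper: the four arguments mirror the counts kept in Word.
    """
    total = present_spam + not_present_spam + present_legit + not_present_legit
    # joint counts indexed as (word present?, class)
    joint = {
        (1, 'spam'): present_spam,
        (0, 'spam'): not_present_spam,
        (1, 'legit'): present_legit,
        (0, 'legit'): not_present_legit,
    }
    word_totals = {1: present_spam + present_legit,
                   0: not_present_spam + not_present_legit}
    class_totals = {'spam': present_spam + not_present_spam,
                    'legit': present_legit + not_present_legit}

    mi = 0.0
    for (x, c), count in joint.items():
        if count == 0:
            continue  # a zero joint probability contributes nothing to the sum
        p_xc = count / total
        p_x = word_totals[x] / total
        p_c = class_totals[c] / total
        mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi
```

<p>A word that appears in every spam email and in no legitimate email gets the highest score, while a word that is equally common in both classes scores zero. Sorting the words by this score in descending order would give a ranking like the one selectNFeatures slices.</p>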

In [17]:
def selectNFeatures(nFeatures, testingFolder):
    relevantWords = {x[0]: x[1] for x in testingFolder.relevantWords[:nFeatures]}

    for key in relevantWords:
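        word = relevantWords[key]
        # reset first: Word objects are shared across calls, so counts would otherwise accumulate
        word.spamDocumentCount = 0
        word.legitDocumentCount = 0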

        for email in testingFolder.trainingSpamEmail:
            if word.content in email.split():
                word.spamDocumentCount += 1  # count document frequencies

        for email in testingFolder.trainingLegitEmail:
            if word.content in email.split():
                word.legitDocumentCount += 1  # count document frequencies

    return relevantWords

In [18]:
def computeNaiveBayes(testingFolder, emailContent, relevantWords):
    # Naive Bayes: multivariate Bernoulli NB, Boolean attributes (word present or absent)
    emailContent = emailContent.split()
    dict_testingData = {}  # dictionary of distinct words in testing data

    total_trainingEmails = len(testingFolder.trainingSpamEmail) + len(testingFolder.trainingLegitEmail)

    probIsSpam = len(testingFolder.trainingSpamEmail) / total_trainingEmails
    probIsLegit = len(testingFolder.trainingLegitEmail) / total_trainingEmails

    probWord_isPresentSpam = 1.0
    probWord_isPresentLegit = 1.0

    # determine whether each relevant term appeared in the document
    for key in relevantWords:
        if key in emailContent:
            dict_testingData[key] = 1
        else:
            dict_testingData[key] = 0

    for key in relevantWords:
        word = relevantWords[key]
        power = dict_testingData[key]

        prob_t_s = (1 + word.spamDocumentCount) / (2 + len(testingFolder.trainingSpamEmail))
        prob_t_l = (1 + word.legitDocumentCount) / (2 + len(testingFolder.trainingLegitEmail))

        probWord_isPresentSpam *= (math.pow(prob_t_s, power) * math.pow(1 - prob_t_s, 1 - power))
        probWord_isPresentLegit *= (math.pow(prob_t_l, power) * math.pow(1 - prob_t_l, 1 - power))

    return (probIsSpam * probWord_isPresentSpam) / ( probIsSpam * probWord_isPresentSpam + probIsLegit * probWord_isPresentLegit)
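
<p>computeNaiveBayes multiplies one probability per attribute, so the products can underflow to zero when many attributes are used. A minimal sketch of the same Laplace-smoothed estimate in log space is shown below. The function name, its arguments, and the plain dictionaries are hypothetical stand-ins for the Word and PartFolder objects above.</p>

```python
import math

def bernoulli_nb_posterior(present, spam_df, legit_df, n_spam, n_legit):
    """Return P(spam | email) for Boolean attributes, computed in log space.

    present:  dict token -> True if the token occurs in the email
    spam_df:  dict token -> number of spam training emails containing the token
    legit_df: dict token -> number of legit training emails containing the token
    """
    total = n_spam + n_legit
    log_spam = math.log(n_spam / total)    # log prior P(spam)
    log_legit = math.log(n_legit / total)  # log prior P(legit)

    for token, is_present in present.items():
        # Laplace-smoothed estimates, same form as prob_t_s and prob_t_l above
        p_s = (1 + spam_df[token]) / (2 + n_spam)
        p_l = (1 + legit_df[token]) / (2 + n_legit)
        log_spam += math.log(p_s if is_present else 1 - p_s)
        log_legit += math.log(p_l if is_present else 1 - p_l)

    # P(spam | x) = 1 / (1 + exp(log_legit - log_spam))
    diff = log_legit - log_spam
    if diff > 700:  # math.exp would overflow; the posterior is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

<p>For example, with 4 spam and 6 legitimate training emails and a single attribute that occurs in 3 spam emails and 1 legitimate email, an email containing that word gets a spam posterior of 0.64, the same value the product form gives.</p>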

In [19]:
class Result:
    def __init__(self):
        self.filter_config = ''
        self.threshold = 0
        self.nFeatures = 0
        self.avg_recall = 0.0
        self.avg_precision = 0.0
        self.avg_w_acc = 0.0
        self.avg_bw_acc = 0.0
        self.avg_tcr = 0.0

In [20]:
#function for constructing the average results table per filter, nAttributes, and threshold configuration
def runTestTable(filter, threshold, nFeatures):
    
    print("Testing filter:", filter," threshold:", threshold, " No. of Attributes:", nFeatures)
    folderCollection = filterCollection[filter]
    
    threshold_lambda = threshold
    threshold = threshold_lambda /(1+threshold_lambda)


    sPrecision = 0
    sRecall = 0
    wAcc_b = 0
    wErr_b = 0
    wAcc = 0
    wErr = 0
    tcr = 0

    for testingIndex in range(10):
        print("Folder Collection Spam Email: ", len(folderCollection[testingIndex].spamEmail))
        print("Folder Collection Legit Email: ", len(folderCollection[testingIndex].legitEmail))
        print("Folder Collection Relevant Words: ", len(folderCollection[testingIndex].relevantWords))
        relevantWords = selectNFeatures(nFeatures, folderCollection[testingIndex])

        s_s = 0 #spam email categorized as spam
        s_l = 0 #spam email categorized as legit
        l_s = 0 #legit email categorized as spam
        l_l = 0 #legit email categorized as legit

        spamSize = len(folderCollection[testingIndex].spamEmail)
        legitSize = len(folderCollection[testingIndex].legitEmail)

        for email in folderCollection[testingIndex].spamEmail:
            result = computeNaiveBayes(folderCollection[testingIndex], email, relevantWords)
            if result > threshold: #isSpam
                s_s += 1
            else: #isLegit
                s_l += 1


        for email in folderCollection[testingIndex].legitEmail:
            result = computeNaiveBayes(folderCollection[testingIndex], email, relevantWords)
            if result > threshold: #isSpam
                l_s += 1
            else:
                l_l += 1

        sPrecision += (s_s / (s_s + l_s))
        sRecall += (s_s / (s_s +s_l))
        wAcc += (threshold_lambda * l_l + s_s)/ (threshold_lambda * legitSize + spamSize)
        wErr += (threshold_lambda * l_s + s_l)/ (threshold_lambda * legitSize + spamSize)
        wAcc_b += (threshold_lambda * legitSize)/(threshold_lambda * legitSize + spamSize)
        wErr_b += spamSize / (threshold_lambda * legitSize + spamSize)
        tcr += spamSize / (threshold_lambda*l_s + s_l)


    table_row = Result()
    table_row.filter_config = filter
    table_row.threshold = threshold_lambda
    table_row.nFeatures = nFeatures
    table_row.avg_recall =  (sRecall/10)*100
    table_row.avg_precision = (sPrecision/10)*100
    table_row.avg_w_acc = (wAcc/10)*100
    table_row.avg_bw_acc = (wAcc_b/10)*100
    table_row.avg_tcr = tcr/10
    
    print("S_Precision:", table_row.avg_precision)
    print("S_Recall:", table_row.avg_recall)
    print("w_acc:", table_row.avg_w_acc)
    print("TCR:", table_row.avg_tcr)
    
    tableResults.append(table_row)
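
<p>The measures accumulated in runTestTable can be checked on a single fold with a small helper such as the sketch below. The function name is hypothetical, and the two guards (a precision of 0.0 when nothing is flagged as spam, and an infinite TCR when there are no errors) are conventions assumed here; runTestTable itself would raise a ZeroDivisionError in those cases.</p>

```python
def evaluate_fold(s_s, s_l, l_s, l_l, lam):
    """Compute the per-fold measures from the four confusion counts.

    s_s: spam classified as spam     s_l: spam classified as legit
    l_s: legit classified as spam    l_l: legit classified as legit
    lam: cost ratio lambda (a legit email blocked costs lam times a spam passed)
    """
    spam_size = s_s + s_l
    legit_size = l_s + l_l
    flagged = s_s + l_s                  # emails classified as spam
    weighted_errors = lam * l_s + s_l
    weighted_total = lam * legit_size + spam_size
    return {
        'threshold': lam / (1 + lam),    # classify as spam when P(spam | x) > threshold
        'precision': s_s / flagged if flagged else 0.0,
        'recall': s_s / spam_size,
        'w_acc': (lam * l_l + s_s) / weighted_total,
        'baseline_w_acc': (lam * legit_size) / weighted_total,
        'tcr': spam_size / weighted_errors if weighted_errors else float('inf'),
    }
```

<p>With 48 spam and 241 legitimate emails, 18 missed spam emails, 2 blocked legitimate emails, and lambda = 1, this gives a threshold of 0.5 and a TCR of 48 / 20 = 2.4. A TCR above 1 means the filter costs less than using no filter at all.</p>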

In [21]:
for i in range(len(FILTERS)):
    filterCollection[FILTERS[i]] = loadEmails('spam emails\\'+FILTERS[i]+'\\part')

Loading emails...: spam emails\bare\part
Loading emails...: spam emails\stop\part
Loading emails...: spam emails\lemm\part
Loading emails...: spam emails\lemm_stop\part


In [22]:
def table_row_generator(pd, filter, threshold, nFeatures, avg_recall, avg_precision, avg_accuracy, avg_accuracy_base,
                        avg_tcr):
    raw_data = {
        'Filter Configuration': [filter],
        'Lambda': [threshold],
        'No. of attrib.': [nFeatures],
        'Spam Recall': [avg_recall],
        'Spam Precision': [avg_precision],
        'Weighted Accuracy': [avg_accuracy],
        'Baseline W. Acc': [avg_accuracy_base],
        'TCR': [avg_tcr]
    }
    return pd.DataFrame(raw_data, columns=['Filter Configuration', 'Lambda', 'No. of attrib.', 'Spam Recall',
                                           'Spam Precision', 'Weighted Accuracy', 'Baseline W. Acc', 'TCR'])

In [24]:
tableResults = []
runTestTable(FILTERS[0], 1, 50)
runTestTable(FILTERS[1], 1, 50)

table_row = []
for i in range(len(tableResults)):
    row = tableResults[i]
    df = table_row_generator(pd, row.filter_config, row.threshold, row.nFeatures, row.avg_recall, row.avg_precision, 
                            row.avg_w_acc, row.avg_bw_acc, row.avg_tcr)
    table_row.append(df)
    
table = pd.concat(table_row)
table

Testing filter: bare  threshold: 1  No. of Attributes: 50
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  62333
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  61652
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  61723
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  60812
Folder Collection Spam Email:  48
Folder Collection Legit Email:  242
Folder Collection Relevant Words:  61646
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  62177
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  62731
Folder Collection Spam Email:  48
Folder Collection Legit Email:  241
Folder Collection Relevant Words:  62084
Folder Collection Spam Email:  48
Folder Collection Le

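<table>
    <tr><th></th><th>Filter Configuration</th><th>Lambda</th><th>No. of attrib.</th><th>Spam Recall</th><th>Spam Precision</th><th>Weighted Accuracy</th><th>Baseline W. Acc</th><th>TCR</th></tr>
    <tr><td>0</td><td>bare</td><td>1</td><td>50</td><td>60.110544</td><td>91.806504</td><td>89.040666</td><td>83.373782</td><td>2.234894</td></tr>
    <tr><td>0</td><td>stop</td><td>1</td><td>50</td><td>67.393707</td><td>92.260037</td><td>90.285865</td><td>83.373782</td><td>2.729349</td></tr>
</table>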
