## A Tutorial Based on the Paper "An Evaluation of Naive Bayesian Anti-Spam Filtering"

This tutorial replicates the process described in the paper "An Evaluation of Naive Bayesian Anti-Spam Filtering". The objective of the paper is to filter spam emails using a Naive Bayes technique.

## Preparing our Data

The dataset provided to us has 4 directories [bare, lemm, lemm_stop, stop], each with 10 subdirectories (parts). To reduce random variation, ten-fold cross-validation was used: in every repetition, 9 parts serve as the training set and the remaining part is reserved for testing.
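
The rotation of parts can be sketched as follows. This is only an illustration: the helper `cross_validation_folds` and the part names `part1` to `part10` are hypothetical, and the real subdirectory names may differ.

```python
def cross_validation_folds(parts):
    """Yield (train_parts, test_part) pairs, holding out each part exactly once."""
    for i, test_part in enumerate(parts):
        # Every part except the held-out one goes into the training set.
        train_parts = parts[:i] + parts[i + 1:]
        yield train_parts, test_part

# Hypothetical part names; adjust to the actual subdirectory names.
parts = ["part" + str(k) for k in range(1, 11)]
folds = list(cross_validation_folds(parts))
```

Each of the ten folds pairs 9 training parts with 1 test part, so every part is used for testing exactly once.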


To form the list of features that the Naive Bayes classifier will use for prediction, a common feature selection method is to compute the Mutual Information (MI) of term t and class c,

where
- <i>t</i> is a word attribute, and
- <i>c</i> is the class, either spam or legitimate.

We are given a description <i>d ∈ X</i> of a document, where <i>X</i> is the document space, and a fixed set of classes <i>C = {spam, legitimate}</i>.
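
As a rough sketch of how such a score could be computed, the function below assumes the standard definition of mutual information between a binary "term present / term absent" attribute and the class, estimated from document counts. The function name and its parameters are hypothetical and are not taken from the feature extraction notebook.

```python
import math

def mutual_information(n_spam_with, n_spam, n_legit_with, n_legit):
    """MI between a binary term-presence attribute and the class.

    n_spam_with / n_legit_with: documents of each class that contain the term.
    n_spam / n_legit: total number of documents of each class.
    """
    total = n_spam + n_legit
    # Joint counts for every (term present?, class) combination.
    joint = {
        (1, "spam"): n_spam_with,
        (0, "spam"): n_spam - n_spam_with,
        (1, "legit"): n_legit_with,
        (0, "legit"): n_legit - n_legit_with,
    }
    p_class = {"spam": n_spam / total, "legit": n_legit / total}
    p_term = {
        1: (n_spam_with + n_legit_with) / total,
        0: (total - n_spam_with - n_legit_with) / total,
    }
    mi = 0.0
    for (x, c), count in joint.items():
        if count == 0:
            continue  # a zero joint probability contributes nothing to the sum
        p_joint = count / total
        mi += p_joint * math.log(p_joint / (p_term[x] * p_class[c]))
    return mi
```

A term that appears equally often in both classes scores 0, while a term that appears in every spam message and in no legitimate message gets the maximum score for a balanced corpus, `log(2)`.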

Before training on our data to filter out spam, we need to choose the features that will be used. The paper used the words found in the corpus as the features of the classifier. We created a whole tutorial on feature extraction alone; refer to the notebook named <b><i>Jupyter Feature Extraction</i></b>.

After reading the feature extraction tutorial, you already have the list of all the terms in each corpus with their corresponding Mutual Information score. The following code extracts the n terms with the highest <i>MI</i>. You (or the system) decide how many features will be used.

In [None]:
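import pandas as pd

numMI = 10  # number of top-ranked terms to keep as features
corpus = ['bare', 'lemm', 'lemm_stop', 'stop']

for corp in corpus:
    # Assumes each termMI.csv is already sorted by descending MI score,
    # so the first numMI rows are the highest-scoring terms.
    termMIList = pd.read_csv("Features/" + corp + "/" + corp + "termMI.csv", index_col=0)
    # Write the selected terms, one per line, with no header or index column.
    top_terms = pd.DataFrame(termMIList['Term'].head(n=numMI).tolist())
    top_terms.to_csv("MI/" + corp + "MI.txt", header=False, index=False)
print("Done, check MI folder")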

We then create the train and test sets by passing a corpus directory to the function below. It walks through the given directory and reads the contents of each text file into a list called <i>dir_dataset</i>, which <i>parse_subdirectories</i> returns.

In [None]:
import os

# e.g. passed directory = 'Emails/bare'
# reads its subdirectories and files into a list of lists:
# one inner list of message contents per directory that contains files
def parse_subdirectories(directory):
    dir_dataset = []  # created once, so results from every directory are kept
    for path, subdirs, files in os.walk(directory):
        subdir_content = []
        for filename in sorted(files):
            f = os.path.join(path, filename)
            with open(f, 'r') as file_content:
                subdir_content.append(file_content.read())
        if subdir_content:
            dir_dataset.append(subdir_content)
    return dir_dataset

We then split this list into a 90% training set and a 10% test set for measuring the accuracy of the Naive Bayes predictions. Let's call these list variables <i>train_set</i> and <i>test_set</i>.

In [None]:
import random

def split_dataset(dataset, n_split):
    train_set = []
    data_copy = list(dataset)
    while len(train_set) < int(len(dataset)*n_split):
        pointer = random.randrange(len(data_copy))
        train_set.append(data_copy.pop(pointer))
    return [train_set,data_copy]
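
Because the split above draws items at random, it returns a different partition on every run. If repeatable experiments are wanted, one option is a seeded variant such as the sketch below. The name `split_dataset_seeded` and its `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import random

def split_dataset_seeded(dataset, n_split, seed=0):
    """Shuffle a copy of the dataset with a fixed seed, then cut it in two."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * n_split)
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder items stand in for the ten parts of a corpus.
train_set, test_set = split_dataset_seeded(list(range(10)), 0.9)
```

Calling the function again with the same seed reproduces the same training and test sets.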

## Classification of Email/Documents

To start classifying whether a document is a legitimate message or spam, we first create a <b>Term Matrix</b>. For every document, the term matrix records whether each of the final features occurs in it, denoted by <i>1</i> (present) or <i>0</i> (absent).
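
To make the idea concrete, here is a small in-memory sketch. The feature terms, file names, and message texts are made up for illustration, and `term_vector` is a hypothetical helper.

```python
def term_vector(text, features):
    """Return one 0/1 entry per feature: 1 if the term occurs in the text."""
    words = set(text.split())
    return [1 if feature in words else 0 for feature in features]

# Hypothetical features and messages, for illustration only.
features = ["free", "money", "language"]
messages = {
    "msg_a.txt": "win free money now",
    "msg_b.txt": "a workshop on language",
}
matrix = {name: term_vector(text, features) for name, text in messages.items()}
# matrix["msg_a.txt"] is [1, 1, 0] and matrix["msg_b.txt"] is [0, 0, 1]
```

Each row of the matrix is the binary vector that later represents the message to the classifier.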

In [None]:
import os
import csv

# MIfilepath: 'MI/[filename].txt'
def readFeatures(MIfilepath):
    """Return the feature terms stored one per line in the MI file."""
    with open(MIfilepath) as f:
        return [line.rstrip() for line in f if line.strip()]

# corpusdirectory: 'Emails/bare'
# corpusname: 'bare'
# features: list of terms returned by readFeatures
def build_term_matrix(corpusdirectory, corpusname, features):
    subdir = os.path.join("Term Matrix", corpusname)
    os.makedirs(subdir, exist_ok=True)
    csvpath = os.path.join(subdir, corpusname + '.csv')

    with open(csvpath, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        for dirs, subdirs, files in os.walk(corpusdirectory):
            for message in sorted(files):
                with open(os.path.join(dirs, message), 'r') as email:
                    # Split on any whitespace so words next to newlines are found.
                    content = set(email.read().split())
                # One row per message: the file name, then 1/0 for each feature
                # in the same order as the features list.
                row = [message] + ['1' if feature in content else '0'
                                   for feature in features]
                csv_writer.writerow(row)

# Each corpus gets its own feature list, so features never leak between corpora.
for corp in ['bare', 'lemm', 'lemm_stop', 'stop']:
    build_term_matrix('Emails/' + corp, corp, readFeatures('MI/' + corp + 'MI.txt'))

## Training

The classifier we will be building uses supervised learning. We now have a list of features and, through the term matrix, their association with every message; together these form the basis of the classifier.

Once we have associated the various words with our two classes (spam and legit), we can calculate the probability that a given word belongs to either the spam or the legit category. For instance, the probability that the word "vintage" appears in a spam message is much higher than the probability it appears in a legitimate email.

For example, suppose we have trained our classifier on 200 documents, of which 100 are spam and 100 are legit, and the word "vintage" appears in 25 spam documents but only 5 legit documents. The probability that a document containing the word "vintage" is spam is then:

$$P(spam \mid \text{"vintage"}) = \frac{0.25 \times 0.5}{(0.25 \times 0.5) + (0.05 \times 0.5)} \approx 0.83 \text{, or } 83\%$$

The "0.25" and "0.05" are the fractions of spam and legit documents, respectively, that contain the word "vintage". The "0.5" is the interesting number: it is the fraction of all documents that are spam (or legit). Since we have classified 100 of each, the total number of documents is 200, and it is overall 50% likely that a document is spam.
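
The same arithmetic can be checked with a few lines of Python. This is only a sketch of Bayes' rule for a single word; the function name `posterior_spam` is hypothetical.

```python
def posterior_spam(p_word_given_spam, p_word_given_legit, p_spam):
    """Bayes' rule for one word: P(spam | word)."""
    p_legit = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_legit * p_legit)

# 25 of 100 spam documents and 5 of 100 legit documents contain "vintage";
# 100 of the 200 documents are spam.
p = posterior_spam(25 / 100, 5 / 100, 100 / 200)
print(round(p, 2))  # 0.83
```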

By combining the probabilities for all the words in a document, it is possible to get an overall view of the likelihood a document is either spam or legit.
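
Under the naive assumption that the features are conditionally independent given the class, this combination is commonly written as follows, where $x_i$ is the 0/1 value of the $i$-th feature in the document and $n$ is the number of features:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{spam,\, legitimate\}} P(C = k) \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}$$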

Let us break this down. We need the following counts:
- Count of documents
- Actual count of legit and spam documents
- Feature count of words on all documents segregated between the categories (legit/spam)

Let's create functions for these.

In [13]:
import os

# directory: 'Emails/bare'
# Note: the counters are overwritten for every subdirectory, so the returned
# tuple describes only the last subdirectory visited, not the whole corpus.
def countDocuments(directory):
    for path, subdirs, files in os.walk(directory):
        for subdir in subdirs:
            dir_path = os.path.join(path,subdir)
            docCount = len(os.listdir(dir_path))
            spamCount = 0
            legitCount = 0
            for name in os.listdir(dir_path):
                # spam messages are identified by the "spm" file name prefix
                if name.startswith("spm"):
                    spamCount += 1
                else:
                    legitCount += 1
    return docCount,spamCount,legitCount

print(countDocuments('Emails/bare'))
print(countDocuments('Emails/lemm'))
print(countDocuments('Emails/lemm_stop'))
print(countDocuments('Emails/stop'))

(289, 48, 241)
(279, 48, 231)
(289, 48, 241)
(289, 48, 241)
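
Because `countDocuments` overwrites its counters for each subdirectory, each tuple above appears to describe a single part rather than a whole corpus. If corpus-wide totals are needed, a variant that accumulates over every file could look like the sketch below. The name `count_documents_all_parts` is hypothetical, and it keeps the same assumption that spam file names start with "spm".

```python
import os

def count_documents_all_parts(directory):
    """Count all, spam, and legit documents in every subdirectory combined."""
    doc_count = spam_count = legit_count = 0
    for path, subdirs, files in os.walk(directory):
        for name in files:
            doc_count += 1
            # Assumption carried over from countDocuments: "spm" marks spam.
            if name.startswith("spm"):
                spam_count += 1
            else:
                legit_count += 1
    return doc_count, spam_count, legit_count
```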


Having constructed the <b>Term Matrix</b>, we can now compute the probability of classifying a message as spam or not, using the criterion:

$$\frac{P(C=spam|\vec{X} = \vec{x})}{P(C=legitimate | \vec{X} = \vec{x})} > \lambda$$

The criterion tells us that we need the vectors from the term matrix, which indicate the association between the features and the documents, in order to assess whether a message is spam or not: a message is classified as spam when the ratio exceeds the threshold λ. Therefore we can do it like so:

In [None]:
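def probability_feature():
    pass  # not implemented yet
def weighted_probability():
    pass  # not implemented yet
def probability_document():
    pass  # not implemented yet
def final_probability():
    pass  # not implemented yet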
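
The functions above are placeholders. As one possible way to evaluate the ratio test, the sketch below trains on rows of 0/1 values (as in the term matrix) and compares the log of the ratio with log λ. Everything here is an assumption made for illustration: the function names, the "spam"/"legit" labels, and the add-one (Laplace) smoothing are not taken from the original notebook, and both classes are assumed to occur in the training data.

```python
import math

def train_naive_bayes(vectors, labels):
    """Estimate P(C) and P(X_i = 1 | C) from 0/1 vectors.

    Uses add-one (Laplace) smoothing, an assumption of this sketch,
    so that no estimated probability is exactly 0 or 1.
    """
    n_features = len(vectors[0])
    priors, cond = {}, {}
    for c in ("spam", "legit"):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        priors[c] = len(rows) / len(vectors)
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n_features)]
    return priors, cond

def log_joint(vector, prior, cond_c):
    """log of P(C = c) times the product of P(X_i = x_i | C = c)."""
    total = math.log(prior)
    for x, p in zip(vector, cond_c):
        total += math.log(p if x == 1 else 1.0 - p)
    return total

def is_spam(vector, priors, cond, lam=1.0):
    """True when P(spam | x) / P(legit | x) > lam.

    The shared denominator of both posteriors cancels in the ratio,
    so comparing the joint terms is enough.
    """
    log_ratio = (log_joint(vector, priors["spam"], cond["spam"])
                 - log_joint(vector, priors["legit"], cond["legit"]))
    return log_ratio > math.log(lam)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. Raising `lam` makes the filter more reluctant to label a message as spam.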


In [None]:
#pass each to train, classify and test : dataset
bare_dataset = parse_subdirectories('Emails/bare')
lemm_dataset = parse_subdirectories('Emails/lemm')
lemm_stop_dataset = parse_subdirectories('Emails/lemm_stop')
stop_dataset = parse_subdirectories('Emails/stop')

def train(dataset, threshold):
    # 90% of the data for training, 10% held out for testing
    train_set, test_set = split_dataset(dataset, 0.9)
    # TODO: ready csv file to write
    # TODO: in term matrix directory : do file walk
    #     read term matrix : for every row get vectors
    #     compute predicted result
    #     create row : store filename, actual, predicted

def classify():
    pass  # not implemented yet

def test(dataset):
    train_set, test_set = split_dataset(dataset, 0.9)


In [None]:
def main():
    pass  # not implemented yet