## A Tutorial Based on the Paper "An Evaluation of Naive Bayesian Anti-Spam Filtering"

This tutorial replicates the process described in the paper "An Evaluation of Naive Bayesian Anti-Spam Filtering". The objective of the paper is to filter spam emails using a Naive Bayes technique.

## Preparing our Data

The dataset provided to us has 4 directories [bare, lemm, lemm_stop, stop], each with 10 subdirectories (parts). To reduce random variation, ten-fold cross-validation was used: in every repetition, 9 parts serve as the training set and 1 part is reserved for testing, so that each part is used for testing exactly once.
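
A minimal sketch of how the ten repetitions could be enumerated. The part names below (`part1` to `part10`) and the helper `cross_validation_folds` are assumptions made for illustration; use whatever the subdirectories are actually called.

```python
# Hypothetical part names; adjust to match the actual subdirectory names.
PARTS = ["part%d" % i for i in range(1, 11)]

def cross_validation_folds(parts):
    """Yield (train_parts, test_part) pairs, one per repetition."""
    for held_out in parts:
        # every part except the held-out one is used for training
        train_parts = [p for p in parts if p != held_out]
        yield train_parts, held_out

folds = list(cross_validation_folds(PARTS))
for train_parts, test_part in folds:
    print("test on", test_part, "| train on", len(train_parts), "parts")
```

Each repetition trains on nine parts and tests on the remaining one, and the reported figures are then averaged over the ten repetitions.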


To form the list of features that is finally used for classification with Naive Bayes, a common feature selection method is to compute the Mutual Information (MI) of term <i>t</i> and class <i>c</i>, where:
- <i>t</i> is a word attribute, and
- <i>c</i> is the class, either spam or not spam.

We are given a description <i>d ∈ X</i> of a document, where <i>X</i> is the document space; and a fixed set of classes <i>C = {spam, legitimate}</i>
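
To make the MI score concrete, here is a small sketch that uses the standard definition of mutual information between a binary term indicator and a binary class, i.e. the sum over all four (term present/absent, spam/legitimate) combinations of P(t, c) · log(P(t, c) / (P(t) · P(c))). The function name and its four count arguments are assumptions for illustration, and the feature extraction notebook may estimate the probabilities differently (for example with smoothing).

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a binary term indicator and a binary class.

    n11: spam documents containing the term
    n10: legitimate documents containing the term
    n01: spam documents without the term
    n00: legitimate documents without the term
    """
    total = n11 + n10 + n01 + n00
    # each cell: (joint count, term marginal count, class marginal count)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_count, class_count in cells:
        if joint == 0:
            continue  # a cell with zero probability contributes nothing
        mi += (joint / total) * math.log2(joint * total / (term_count * class_count))
    return mi

print(mutual_information(10, 10, 10, 10))  # term independent of class -> 0.0
print(mutual_information(20, 0, 0, 20))    # term identifies spam perfectly -> 1.0
```

Terms whose presence says a lot about the class get a high score, which is why the highest-scoring terms are kept as features.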

Before training a classifier to filter out spam, we need to choose the features it will use. The paper uses the words found in the corpus as the features of the classifier. Feature extraction is covered in a separate notebook named <b><i>Jupyter Feature Extraction</i></b>.

After working through the feature extraction tutorial, you will have a list of all the terms in each corpus with their corresponding Mutual Information scores. The following code extracts the n terms with the highest <i>MI</i> for each corpus. You decide how many features are used by setting <i>numMI</i>.

In [39]:
import pandas as pd

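numMI = 10  # number of top-MI terms to keep as features
corpus = ['bare', 'lemm', 'lemm_stop', 'stop']

for corp in corpus:
    # Assumes each termMI.csv is already sorted by MI score, highest first
    termMIList = pd.read_csv("Features/"+corp+"/"+corp+"termMI.csv", index_col = 0)
    top_terms = termMIList['Term'].head(n=numMI).tolist()
    # Write one term per line, with no header row and no index column
    pd.DataFrame(top_terms).to_csv("MI/"+corp+"MI.txt", header = False, index = False)
print ("Done, check MI folder")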

Done, check MI folder


To create the train and test sets, we first load the emails. The function <i>parse_subdirectories</i> below walks through the directory given to it, reads the contents of each <i>.txt</i> file it finds, and returns them in a list called <i>dir_dataset</i>.

In [None]:
import os

# e.g. passed directory = 'Emails/bare'
# reads its subdirectories and files into a list of lists,
# with one inner list of file contents per subdirectory
def parse_subdirectories(directory):
    dir_dataset = []  # created once, outside the walk, so results accumulate
    for path, subdirs, files in os.walk(directory):
        subdir_content = []  # contents of every file in this directory
        for filename in sorted(files):
            f = os.path.join(path,filename)
            with open(f,'r') as file_content:
                subdir_content.append(file_content.read())
        if subdir_content:  # skip directories that contain no files
            dir_dataset.append(subdir_content)
    return dir_dataset

We then split this list into a 90% training set and a 10% test set for measuring the accuracy of the Naive Bayes predictions. Let's call these lists <i>train_set</i> and <i>test_set</i>.

In [None]:
import random

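def split_dataset(dataset, n_split):
    # n_split is the fraction of items that go to the training set, e.g. 0.9
    train_set = []
    data_copy = list(dataset)  # copy so the caller's list is not modified
    while len(train_set) < int(len(dataset)*n_split):
        # move one randomly chosen item from the copy into the training set
        pointer = random.randrange(len(data_copy))
        train_set.append(data_copy.pop(pointer))
    # whatever remains in data_copy becomes the test set
    return [train_set,data_copy]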
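
A short usage sketch for the split. The function is repeated so that the snippet runs on its own; the toy dataset and the fixed seed are assumptions added only to make the example repeatable.

```python
import random

def split_dataset(dataset, n_split):
    # Same logic as the cell above, repeated so this snippet runs standalone.
    train_set = []
    data_copy = list(dataset)
    while len(train_set) < int(len(dataset) * n_split):
        pointer = random.randrange(len(data_copy))
        train_set.append(data_copy.pop(pointer))
    return [train_set, data_copy]

random.seed(0)  # arbitrary seed, only so the random choice is repeatable
toy_dataset = ["message %d" % i for i in range(20)]  # placeholder documents
train_set, test_set = split_dataset(toy_dataset, 0.9)
print(len(train_set), len(test_set))
```

With 20 items and `n_split = 0.9`, 18 items land in `train_set` and 2 in `test_set`. Note that this is a random split, so which items end up in the test set depends on the random seed.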

## Classification of Email/Documents

To classify whether a document <i>d</i> is a legitimate message or spam, we first create a <b>Term Matrix</b>. The term matrix has one row per document and one column per selected feature; each entry is <i>1</i> if the feature occurs in the document and <i>0</i> otherwise.
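
As a small in-memory illustration of the idea, the sketch below builds the 0/1 rows for two made-up messages. The feature list, file names, message texts, and the helper `term_row` are all hypothetical and stand in for the real corpus.

```python
# Hypothetical features and messages, for illustration only.
toy_features = ["free", "money", "language"]
toy_documents = {
    "msg1.txt": "free money now",
    "msg2.txt": "a conference on language",
}

def term_row(text, features):
    """Return a list of 0/1 flags marking which features occur in the text."""
    tokens = set(text.split())  # split on whitespace; set gives fast lookups
    return [1 if feature in tokens else 0 for feature in features]

term_matrix = {name: term_row(text, toy_features)
               for name, text in toy_documents.items()}
for name, row in term_matrix.items():
    print(name, row)
```

Here `msg1.txt` maps to `[1, 1, 0]` and `msg2.txt` to `[0, 0, 1]`: one row per document, one column per feature.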

In [None]:
import os
import csv

features = []

# MIfilepath: 'MI/[filename].txt'
def readFeatures(MIfilepath):
    # the with-block closes the file, so no explicit close() is needed
    with open(MIfilepath) as f:
        for line in f:
            features.append(line.rstrip())

# corpusdirectory: 'Emails/bare'
# corpusname: bare
def build_term_matrix(corpusdirectory, corpusname):
    subdir = os.path.join("Term Matrix", corpusname)
    if not os.path.exists(subdir):
        os.makedirs(subdir)
    csvpath = os.path.join(subdir, corpusname + '.csv')

    # newline='' is what the csv module expects for files it writes to
    with open(csvpath, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        for dirs, subdirs, files in os.walk(corpusdirectory):
            for message in sorted(files):
                with open(os.path.join(dirs, message), 'r') as email:
                    # split on any whitespace; a set makes lookups fast
                    content = set(email.read().split())
                # one row per email: file name, then 1/0 for each feature
                row = [message] + ['1' if feature in content else '0'
                                   for feature in features]
                csv_writer.writerow(row)

In [None]:
def main():
    