# Part 1 - Most Common Words Extraction

The dataset selected is classified into 6 subdirectories, each with a ham and a spam directories. Each on contains ham and spam **.txt** files respectively. By using the **os** library, the dataset was filtered to a single dictionary containing the words with their respective number of occurance in all files. 

This was done by spliting each file's content to individual strings and adding them to a list. Next, a counter function was used to make count each word occurance in that list.

It was found that there are special characters in the spam **.txt** files that the UTF-8 encoding does not classify as strings. To ignore this error, latin encoding was used to ensure that every character is treated as a string.

In [133]:
#Importing needed libraries
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix 

def countFiles(rootDir): #Function returning the number of files present in the directories and its sub directories
    i = 0
    for directories, subdirectories, files in os.walk(rootDir):
        for filename in files:    
            with open(os.path.join(directories, filename),encoding="latin-1") as f:
                i += 1
    return i

def mostCommonWords(N, wordList, rootDir): #Function returning N most commonly used word
    for directories, subdirectories, files in os.walk(rootDir):
        for filename in files:    #For every file
            with open(os.path.join(directories, filename),encoding="latin-1") as f:
                wordsInFile = f.read().split() #Split the file's content into word strings 
                wordList+= wordsInFile #Add every word to the global word list

    dictionary = Counter(wordList) #Defining the common word dictionary
    for word in list(dictionary): #Looping to delete non-alphabetic and single characters
        if word.isalpha() == False: 
            del dictionary[word]
        elif len(word) == 1:
            del dictionary[word]

    dictionary = dictionary.most_common(N) #Outputting most common words with their frequencies based on user preference
    return (dictionary)

# Part 2  - Feature Extraction

The process of feature extraction is one which a given set of data is reduced from a higher dimension to the lower dimension to reduce the processing power and save resources when it comes to doing data analysis.
Here in this dataset, the emails were filtered to be in a two-dimension format (i.e dictionary) with the most commonly used **N** words. One can visualize such data into a huge, one-dimension vector or list of integers, each representing the number of occurances of words relevant to the dictionary, in a given email. 

For example, the dictionary output had the word "**is**" among the most occurring words in the email. Ultimately, if one email contained the message "Hi, my name is Hossam Elghamry", then the N-sized vector will be outputting [0,0,0...1,0,0,0]. The ***1*** integer value is the number of occurances of the "**is**" word in that given email, with the integer's position in the list being respective to that of the "**is**" in the dictionary.

That said, we can represent all the feature extraction vectors in one ***MxN*** matrix. ***M*** being the number of emails processed and ***N*** being the count of the most occurring words (i.e the length of the output dictionary in **Part 1**)

In [134]:
def featureExtraction(commonWordsDictionary, rootDir): 
    featuresMatrix = np.zeros((fileCount,mostCommonWordsCount)) #Making a (number of emails X number 
                                                                #of most occuring words) matrix
    fileNum = 0; #Row incrementation value
    for directories, subdirectories, files in os.walk(rootDir):
        for filename in files:      
            with open(os.path.join(directories, filename),encoding="latin-1") as f:
                words = f.read().split()
                for word in words: #for every word in every file
                    wordNum = 0 
                    for i,d in enumerate(commonWordsDictionary): #search for it in the dictionary of repeated words
                        if d[0] == word:
                        #if found, put the number of occurances of that current word by
                        #using Count() function, in the feature matrix's word column, respective to the document's row
                            wordNum = i
                            featuresMatrix[fileNum,wordNum] = words.count(word) 
        fileNum = fileNum + 1 #Next document index (i.e next row in the matrix)
    return featuresMatrix

# Part 3 - Extracting Labeled Feature Vector per Training Email to One Single Two-Dimensional Matrix






In [127]:
trainingDir = "Train" #Insert training directory here respective to the notebook file
wordList = [] #List containing all words used, with duplicates
mostCommonWordsCount = 10 #Desired number of the most occurring words in the output dictionary

fileCount = countFiles(trainingDir) 
commonWordsDictionary = mostCommonWords(mostCommonWordsCount, wordList, trainingDir)
#print (commonWordsDictionary)

trainingLabels = np.zeros(4080) #Initiallizing labels
trainingLabels[2880:4080] = 1 #Labeling spam training emails by "1"
trainingMatrix = featureExtraction(commonWordsDictionary, trainingDir)
print(trainingMatrix)

[[0. 0. 0. ... 0. 0. 0.]
 [3. 5. 5. ... 2. 1. 1.]
 [1. 5. 1. ... 1. 1. 2.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


# Part 4 - Defining and Training Naive Bayes Classifier
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

The reason for using such classifier is that it directly links the test outputs with the given labels and comparison is fairly easily at that point using **Confussion Matrixes**

In [128]:
NB_Classifier = MultinomialNB()
NB_Classifier.fit(trainingMatrix, trainingLabels) #Training the model using the training emails

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Part 5 - Testing the Trained Model using the Test Set Defined

In [129]:
testingDir = "Test" #Insert test directory here respective to the notebook file
fileCount = countFiles(testingDir)  
testingMatrix = featureExtraction(commonWordsDictionary, testingDir)
testLabels = np.zeros(1120) #Initiallizing labels
testLabels[720:1119]= 1 #Labeling spam training emails by "1"

testResult = NB_Classifier.predict(testingMatrix) #Testing the model 
print (confusion_matrix(testLabels,testResult)) #Comparing the results with the actual labels using confusion matrix

1120
[[720   1]
 [399   0]]
