# Part 1 - Most Common Words Extraction

The dataset selected is classified into 6 subdirectories, each with a ham and a spam directories. Each on contains ham and spam **.txt** files respectively. By using the **os** library, the dataset was filtered to a single dictionary containing the words with their respective number of occurance in all files. 

This was done by spliting each file's content to individual strings and adding them to a list. Next, a counter function was used to make count each word occurance in that list.

It was found that there are special characters in the spam **.txt** files that the UTF-8 encoding does not classify as strings. To ignore this error, latin encoding was used to ensure that every character is treated as a string.

In [91]:
import os
import glob, os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import confusion_matrix 

rootdir = "Dataset\enron1" #Insert directory here respective to the notebook file

def MostCommonWords(N, wordList, filesContent):
    for directories, subdirectories, files in os.walk(rootdir):
        for filename in files:      
            with open(os.path.join(directories, filename),encoding="latin-1") as f:
                content = f.read()
                filesContent += content
                wordsInFile = content.split()
                wordList+= wordsInFile

    dictionary = Counter(wordList)
    for word in list(dictionary):
        if word.isalpha() == False: 
            del dictionary[word]
        elif len(word) == 1:
            del dictionary[word]

    dictionary = dictionary.most_common(N)
    return (dictionary)

filesContent =[] #List containing each file's content
wordList = [] #List containing all words used, with duplicates
mostCommonWordsCount = 5 #Desired number of the most occurring words in the output dictionary

commonWordsDictionary = (MostCommonWords(mostCommonWordsCount, wordList, filesContent))
print(commonWordsDictionary)

[('the', 25656), ('to', 20345), ('ect', 13900), ('and', 12829), ('for', 10508)]


# Part 2  - Feature Extraction

The process of feature extraction is one which a given set of data is reduced from a higher dimension to the lower dimension to reduce the processing power and save resources when it comes to doing data analysis.
Here in this dataset, the emails were filtered to be in a two-dimension format (i.e dictionary) with the most commonly used **N** words. One can visualize such data into a huge, one-dimension vector or list of integers, each representing the number of occurances of words relevant to the dictionary, in a given email. 

For example, the dictionary output had the word "**is**" among the most occurring words in the email. Ultimately, if one email contained the message "Hi, my name is Hossam Elghamry", then the N-sized vector will be outputting [0,0,0...1,0,0,0]. The ***1*** integer value is the number of occurances of the "**is**" word in that given email, with the integer's position in the list being respective to that of the "**is**" in the dictionary.

That said, we can represent all the feature extraction vectors in one ***MxN*** matrix. ***M*** being the number of emails processed and ***N*** being the count of the most occurring words (i.e the length of the output dictionary in **Part 1**)

In [88]:
def feature_extraction(): 
    features_matrix = np.zeros((len(filesContent),5)) #Making a (number of emails X number of most occuring words) matrix
    fileID = 0; #Row incrementation value
    for directories, subdirectories, files in os.walk(rootdir):
        for filename in files:      
            with open(os.path.join(directories, filename),encoding="latin-1") as f:
                words = f.read().split()
                for word in words: #for every word in every file
                    wordID = 0 
                    for i,d in enumerate(commonWordsDictionary): #search for it in the dictionary of repeated words
                        if d[0] == word:
                        #if found, put the number of occurances of that current word by
                        #using Count() function, in the feature matrix's word column, respective to the document's row
                            wordID = i
                            features_matrix[fileID,wordID] = words.count(word) 
        fileID = fileID + 1 #Next document index (i.e next row in the matrix)
    return features_matrix

print(feature_extraction())

[[ 0.  0.  0.  0.  0.]
 [19. 21.  1.  1.  5.]
 [ 2.  7.  1.  1.  1.]
 ...
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]


# Part 3 - Training Classifiers




