# Email Spam Detector:
The steps that lead to the classification are:
- Extract the messages from the emails in the training set,
- Convert the messages into vectors,
- Train a model on these vectors,
- Extract the message from a test email and convert it into a vector,
- Use the trained model to predict whether the email is 'ham' or 'spam'.

Let us go through each step and see how we can reach more than 98% accuracy by training on a set of roughly 6600 emails. The data we use is the Enron email data, which can be downloaded from http://csmining.org/index.php/enron-spam-datasets.html. I downloaded the first two folders of preprocessed data and divided the files into 60% train and 40% test sets. You can also download the raw data and use that as the input. The results won't change, but the running time is the cost we must pay.<br>
<br>
So the whole process boils down to three important steps:
- **Extraction of messages;**
- **Conversion to vectors;**
- **Choosing a model for classification.**

## Extracting messages:
Once a text file is opened, the message and subject of the email can be extracted using the **email** library. These messages could be converted into vectors and fed directly into the training model. But think of spam emails that contain hypertext and weblinks. Each link becomes its own feature, and these features are sparse, because a weblink that appears in one email mostly won't appear in others. However, the presence of weblinks, and the number of them in an email, can be a good feature for classifying it as spam or not. So we extract the body of the email without weblinks and, for every weblink present in the body, append the word *'link'* at the end of the body. As we will see later, the frequency with which a word appears in the email is also taken into account during classification.
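
To make the idea concrete, here is a minimal sketch that swaps each weblink for a trailing *'link'* token. It uses a plain regular expression rather than the *lxml*/*BeautifulSoup* route taken in this notebook, and the pattern, the helper name `replace_links` and the example text are all assumptions made for illustration.

```python
import re

# Assumption: a simple pattern for http(s):// and www. links; real emails may need a stricter one.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def replace_links(body):
    """Remove each weblink from `body` and append one ' link' token per link removed."""
    n_links = len(URL_PATTERN.findall(body))
    stripped = URL_PATTERN.sub(" ", body)
    # Collapse the whitespace left behind by the removed links.
    stripped = " ".join(stripped.split())
    return stripped + " link" * n_links

example = "Cheap meds at http://example.com/buy and www.example.org today"
print(replace_links(example))  # Cheap meds at and today link link
```

Every link collapses into the same token, so the vectorizer sees one dense 'link' feature instead of many features that each occur once.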

### extract_file_path :
- give *dir_path* as input,
- use os.walk, and for each file in the directory append the file_path into the array *files*

In [1]:
import numpy as np
import os

def extract_file_path(files, dir_path):
    for root, dir_name, file_name in os.walk(dir_path): # for each directory under dir_path, gives its path, its subdirectory names and its file names
        for name in file_name:
            file_path = [os.path.join(root, name)] # join the directory path and the file name to get the file path
            files = np.hstack((files, file_path)) # append the file path to the array files
    return files
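
Since *np.hstack* builds a new array on every call, a variant that collects the paths in a plain Python list may be faster on large folders. The sketch below is one way to write it; the name `list_file_paths` is hypothetical and not part of the notebook.

```python
import os

def list_file_paths(dir_path):
    """Return the path of every file found under dir_path, including subfolders."""
    paths = []
    for root, _dirs, file_names in os.walk(dir_path):
        for name in file_names:
            paths.append(os.path.join(root, name))
    return paths
```

The result can be wrapped as `np.array(list_file_paths('__dir_path__'))` if the index-based train/test split used later is kept.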

### msg_sub_and_tag_of_email :
- use *codecs.open* to open the *email_file* with 'ISO-8859-1' encoding and read the opened file => *raw_email*
- use *email.message_from_string* to parse the message and subject => *mail*
- *mail.get_payload()* gives the message => *msg*
- *mail['subject']* gives the subject of the email => *sub*
- count the weblinks using *lxml.html*, and use *BeautifulSoup* to strip the markup and keep the remaining message text.
- as the data is already segregated into two folders named 'ham' and 'spam', we check which folder each file is in and tag it 0 = spam, 1 = ham.
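
The parsing steps in the list above can be tried on a tiny hand-written email. This is only a sketch: the raw text is made up, and `body_text` is a hypothetical helper that also copes with multipart emails, where *get_payload()* returns a list of parts instead of a string.

```python
import email

# A made-up raw email: headers, a blank line, then the body.
raw_email = (
    "Subject: Quarterly report\n"
    "From: alice@example.com\n"
    "\n"
    "Please find the report attached.\n"
)

mail = email.message_from_string(raw_email)
sub = mail["subject"] or " "  # fall back to a blank subject when the header is missing

def body_text(message):
    """Join the text of all non-multipart parts of the message."""
    parts = []
    for part in message.walk():
        if not part.is_multipart():
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return "\n".join(parts)

msg = body_text(mail)
print(repr(sub), repr(msg))
```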

-- encoding = 'ISO-8859-1', as 'utf-8' cannot decode some of the symbols in the spam messages.<br>
-- We could use *BeautifulSoup* to extract the weblinks as well, but for that the file format must be .html or .xml. *BeautifulSoup* gives an error if it is a .txt file.
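
The encoding note can be checked on a few bytes. In the sketch below the byte string is invented for illustration; it contains two bytes that are not valid UTF-8 but do have a meaning in ISO-8859-1.

```python
raw_bytes = b"Caf\xe9 prices: \xa3100 only"  # made-up bytes, not valid UTF-8

try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every possible byte to a character, so this decode cannot fail.
latin1_text = raw_bytes.decode("ISO-8859-1")
print(utf8_ok, latin1_text)
```

The decoded text is not guaranteed to be what the sender intended, but it never raises, which is what matters when reading thousands of files in a loop.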

In [2]:
import codecs, email, lxml.html
from bs4 import BeautifulSoup

def msg_sub_and_tag_of_email(email_file):
    
    # open the file and read the text; the 'with' block closes the file afterwards
    with codecs.open(email_file, encoding='ISO-8859-1', errors='ignore') as data:
        raw_email = data.read()
    mail = email.message_from_string(raw_email) # parse the contents of the file.
    
    sub = mail['subject'] # from the contents, extract the subject.
    if sub is None:
        sub = ' '         # if there is no subject, use a single space instead
    
    msg = mail.get_payload()  # get_payload is a method of the parsed message which gives the body of the email.
    # for a multipart email get_payload returns a list of parts; uncomment to take the first part:
    #while isinstance(msg, list): 
     #   msg = msg[0].get_payload()
    
    dom =  lxml.html.fromstring(raw_email) # 'dom' is built from the string 'raw_email' to count the weblinks
    msg = msg + (' link'*len(dom.xpath('//a/@href'))) # for every weblink, append the word ' link' at the end of message.
    soup = BeautifulSoup(msg,'html.parser' )
    msg = soup.get_text() # keep only the text of the message, dropping the HTML tags 
    
    number_of_folders_to_be_opened = 7 # position of the 'ham'/'spam' folder in the path; depends on where the data is stored
    tag = 0
    if ((email_file.split('/')[number_of_folders_to_be_opened]).lower() == 'ham'): # look for the folder in which the 'email_file' is stored 
        tag = 1 # if the 'email_file' is in ham folder, tag = 1, if not, the file is in spam folder, so tag = 0.
    
    return [tag, sub + ' \n ' + msg ]
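
The hard-coded folder position (7) only works for one particular directory layout. A possible alternative, sketched here under the assumption that a 'ham' folder appears somewhere in the path of every ham email, is to look at all the folder names instead. `tag_from_path` is a hypothetical helper and the example paths are made up.

```python
from pathlib import PurePosixPath

def tag_from_path(email_file):
    """Return 1 if any folder in the path is named 'ham' (case-insensitive), else 0."""
    # parts[:-1] drops the file name so only folder names are checked.
    folders = [part.lower() for part in PurePosixPath(str(email_file)).parts[:-1]]
    return 1 if "ham" in folders else 0

print(tag_from_path("/data/enron1/ham/0001.txt"))   # 1
print(tag_from_path("/data/enron1/spam/0002.txt"))  # 0
```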

In [3]:
file_list = []
file_list = extract_file_path(file_list, '__dir_path__')

In [4]:
permutations = np.random.permutation(len(file_list)) # before splitting into train and test sets, we must jumble the 'file_list'. 
                                                     # This takes away the possibility of train set having majority ham or spam messages.
train_size = int(0.6*len(file_list)) # 60% train and 40% test sets.
file_list_train = file_list[permutations[:train_size]]
file_list_test = file_list[permutations[train_size:len(file_list)]]
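
Because the permutation is random, every run produces a different split and slightly different scores. If repeatable numbers are wanted, one option is a seeded generator, as in this sketch; the seed value and the stand-in file names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice
paths = np.array([f"mail_{i}.txt" for i in range(10)])  # stand-in for file_list

order = rng.permutation(len(paths))   # shuffled indices 0..9
train_size = int(0.6 * len(paths))    # 60% train and 40% test
train_paths = paths[order[:train_size]]
test_paths = paths[order[train_size:]]

print(len(train_paths), len(test_paths))  # 6 4
```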

In [5]:
train_msg_and_sub = []
y_train = []

for i in range(len(file_list_train)):
    m_s_t_of_email = msg_sub_and_tag_of_email(file_list_train[i])
    train_msg_and_sub = np.hstack((train_msg_and_sub, m_s_t_of_email[1]))
    y_train = np.hstack((y_train, m_s_t_of_email[0]))

In [6]:
test_msg_and_sub = []
y_test = []

for j in range(len(file_list_test)):
    m_s_t_of_email = msg_sub_and_tag_of_email(file_list_test[j])
    test_msg_and_sub = np.hstack((test_msg_and_sub, m_s_t_of_email[1]))
    y_test = np.hstack((y_test, m_s_t_of_email[0]))

## Converting messages into vector:
Once we have the messages and subjects extracted from the emails, the next step is to convert these messages into vectors. One way to do that would be:
- splitting each message and subject into words,
- appending all the words into an array.
- and dropping repeated words.

For example, 'This is a test message' after splitting becomes the array ['This', 'is', 'a', 'test', 'message']. The next email is also split into words and appended to this array. This process gives the total vocabulary, with words repeated. *np.unique* takes care of this issue. 
Then for each email, we take a vector of length V (the size of the vocabulary) that is all zeros, and set a 1 for every vocabulary word present in the email.<br>
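The downside of this way of vectorizing the messages is that it takes a lot of time. Instead we can use **CountVectorizer**, which converts the whole training set into a sparse matrix of word counts. While using *CountVectorizer*, I set the *min_df* parameter to 1/train_size. The motivation is that a word or number used in only one email, such as the cost of something, the time of a rendezvous, or the name of the sender, adds little information about whether the email is spam or ham. Note, however, that scikit-learn reads a float *min_df* as a fraction of the documents, so 1/train_size works out to a threshold of one document, the same as the default; dropping words that occur in a single email would need an integer value such as *min_df=2*. In my runs, accuracy rose from 98.4% to 98.6% after adding this setting, though part of that difference may come from the random train/test split.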
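
The manual approach described above (split, collect, drop repeats, then mark presence) can be sketched in a few lines of plain Python. The two messages are made up, and `to_binary_vector` is a hypothetical helper name.

```python
messages = ["This is a test message", "This message is spam spam"]

# Vocabulary: every distinct word across all messages, in sorted order.
vocabulary = sorted({word for m in messages for word in m.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def to_binary_vector(message):
    """Vector of length V with a 1 for every vocabulary word present in the message."""
    vector = [0] * len(vocabulary)
    for word in message.split():
        if word in index:
            vector[index[word]] = 1
    return vector

print(vocabulary)                     # ['This', 'a', 'is', 'message', 'spam', 'test']
print(to_binary_vector(messages[1]))  # [1, 0, 1, 1, 1, 0]
```

Note that 'spam' occurs twice in the second message but still contributes a single 1, which is exactly the information a count-based representation keeps.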

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=1/len(y_train))

In [8]:
x_train = cv.fit_transform(np.array(train_msg_and_sub))
x_test = cv.transform(np.array(test_msg_and_sub))
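
The effect of *min_df* is easy to check on a toy corpus. scikit-learn documents a float *min_df* as a proportion of documents and an integer as an absolute count; the sketch below compares the two on three made-up messages.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free money now", "meeting at noon", "free lunch meeting"]  # made-up messages

# Float: a word must appear in at least (1/3) * 3 = 1 document, so nothing is dropped.
cv_fraction = CountVectorizer(min_df=1 / len(docs)).fit(docs)
# Integer: a word must appear in at least 2 documents.
cv_count = CountVectorizer(min_df=2).fit(docs)

print(sorted(cv_fraction.vocabulary_))  # ['at', 'free', 'lunch', 'meeting', 'money', 'noon', 'now']
print(sorted(cv_count.vocabulary_))     # ['free', 'meeting']
```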

## Training a model:
Initially I used the former method of appending each word to the vocabulary array and using that vocabulary to convert messages into vectors. With those vectors I used the Gaussian Naive Bayes (**GaussianNB**) classifier. The binary vectors do not record how often a word is repeated in a message, only whether it is present. *CountVectorizer*, however, does count repetitions. So when I switched to it, I replaced *GaussianNB* with the **MultinomialNB** classifier, which is designed for count features. This improved the classification accuracy from 96.5% for *GaussianNB* to 98.6% for *MultinomialNB*. (The exact figure varies with the random train/test split; the run shown below reaches 98.3%.)

In [9]:
from sklearn.naive_bayes import MultinomialNB
model_multinb = MultinomialNB()
model_multinb.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
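
To see the classifier at work without the full dataset, here is a small sketch on four made-up training messages. The messages and the names `vectorizer` and `clf` are assumptions for illustration; the tags follow the convention used above (0 = spam, 1 = ham).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = [
    "win free money link link",
    "free prize link",
    "meeting schedule tomorrow",
    "project meeting notes",
]
train_tags = [0, 0, 1, 1]  # 0 = spam, 1 = ham

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_msgs), train_tags)

# New messages are vectorized with the vocabulary learned from the training set.
pred = clf.predict(vectorizer.transform(["free money link", "meeting tomorrow"]))
print(pred)
```

The first test message only contains words seen in spam and the second only words seen in ham, so they are predicted as 0 and 1 respectively.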

In [10]:
y_pred = model_multinb.predict(x_test)

0.98300090661831374

In [11]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1178,   48],
       [  27, 3159]])

In [12]:
cm = confusion_matrix(y_test, y_pred) # rows are the true tags, columns the predicted tags
true_positive = cm[1,1] # ham (tag 1) is the positive class here
true_negative = cm[0,0]
predicted_positive = (y_pred == 1).sum()
positive_class = (y_test == 1).sum()
accuracy = (true_positive + true_negative)/len(y_test)
recall = true_positive/positive_class # fraction of ham emails that were predicted as ham
precision = true_positive/predicted_positive # fraction of emails predicted as ham that are really ham
f1_score = 2*recall*precision/(recall + precision)

In [13]:
accuracy

0.98300090661831374

In [14]:
recall

0.99152542372881358

In [15]:
precision

0.98503274087932646

In [16]:
f1_score

0.98826841858282499
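
As a cross-check, the four numbers above can be recomputed from the printed confusion matrix alone. The sketch assumes scikit-learn's convention that rows are the true tags and columns the predicted tags, with ham (tag 1) as the positive class.

```python
# Confusion matrix copied from the output above: rows = true tag, columns = predicted tag.
cm = [[1178, 48],
      [27, 3159]]

true_negative, false_positive = cm[0]  # true spam: predicted spam, predicted ham
false_negative, true_positive = cm[1]  # true ham: predicted spam, predicted ham
total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
f1 = 2 * recall * precision / (recall + precision)

print(round(accuracy, 4), round(recall, 4), round(precision, 4), round(f1, 4))
# 0.983 0.9915 0.985 0.9883
```

Read this way, 48 spam emails were let through as ham and 27 ham emails were flagged as spam, out of 4412 test emails.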