# Email Spam Detector:
The steps that lead to the classification are:
- Extract the messages from the emails in the training set.
- Convert the messages into vectors.
- Train a model on these vectors.
- Extract the message from a test email and convert it into a vector.
- Use the trained model to predict whether the email is 'ham' or 'spam'.

Let us go through every step and see how we can get more than 98% accuracy by training on a set of 6600 emails. The data we are using is the Enron email data, which can be downloaded from http://csmining.org/index.php/enron-spam-datasets.html. I have downloaded the first two folders of preprocessed data and divided the files into 60% train and 40% test sets. You can also download the raw data and use that as the input. The results won't change, but the running time will be longer.<br>
<br>
So the whole process of classification boils down to three important steps:
- **Preprocessing of Data**
- **Transformation of Data**
- **Training a Model.**

In [1]:
#MODULES
import numpy as np
import os
import codecs, email, lxml.html
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

## Preprocessing of Data:
- Once a text file is opened, the message and subject of the email can be extracted using the **email** library.
- If the body of the email contains any web links, remove the links and append the word 'link' to the end of the message, once per link.
- Do this for every email, and append the message and subject of each email to form the input data.
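
As a minimal sketch of the first step, the standard-library `email` package can parse a raw message string into its subject and body. The sample message below is made up for illustration and is not taken from the Enron data.

```python
import email

# A made-up raw email: headers, a blank line, then the body.
raw_email = (
    "From: alice@example.com\n"
    "Subject: Meeting tomorrow\n"
    "\n"
    "Let us meet at 10 am.\n"
)

mail = email.message_from_string(raw_email)

subject = mail['subject']   # header lookup is case-insensitive
body = mail.get_payload()   # a plain string for a non-multipart message

print(subject)
print(body)
```

For a multipart email, `get_payload()` returns a list of parts instead of a string, which is why the notebook's `get_msg` keeps descending into the first part until it reaches a string.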

In [2]:
def extract_file_path(files, dir_path):
    '''Extracts the path of each email text file and appends all of these to an array 'files'.
       Later, the classifier can be run on this array any number of times.'''
    for root, dir_name, file_name in os.walk(dir_path): 
        for name in file_name:
            file_path = [os.path.join(root, name)]
            files = np.hstack((files, file_path)) 
    return files

In [3]:
dir_path = '/home/srinandyash/Downloads/data/enron/'  # Path to the folder where you have stored the email files.
file_list = []
file_list = extract_file_path(file_list, dir_path)

In [4]:
def read_email(email_file):
    data = codecs.open(email_file, encoding='ISO-8859-1', errors='ignore')
    raw_email = data.read()
    mail = email.message_from_string(raw_email)
    data.close()
    return mail

In [5]:
def scoop_out_weblinks(msg_with_links):
    dom =  lxml.html.fromstring(msg_with_links) 
    msg_with_links = msg_with_links + (' link'*len(dom.xpath('//a/@href'))) 
    soup = BeautifulSoup(msg_with_links,'html.parser' )
    msg_without_links = soup.get_text() 
    return msg_without_links

def get_msg(mail):
    msg = mail.get_payload()
    while isinstance(msg,list):
        msg = msg[0].get_payload()
    if (len(msg) != 0):
        msg = scoop_out_weblinks(msg)
    return msg

In [6]:
def get_subject(mail):
    sub = mail['subject'] 
    if sub is None:
        sub = ' '         
    return sub

In [7]:
def extract_msg_and_sub(email_file):
    mail = read_email(email_file)
    msg = get_msg(mail)
    sub = get_subject(mail)
    msg_and_sub = msg + ' \n ' + sub
    return msg_and_sub

In [8]:
def get_tag_of_email(email_file):
    tag = 0
    if 'ham' in (email_file.lower().split('/')):
        tag = 1
    return tag

In [9]:
msg_and_sub = np.array([])
y = np.array([])

for i in range(len(file_list)):
    msg_and_sub  = np.hstack((msg_and_sub, extract_msg_and_sub(file_list[i])))
    y = np.hstack((y, get_tag_of_email(file_list[i])))

In [10]:
train_msg_and_sub, test_msg_and_sub, y_train, y_test = train_test_split(msg_and_sub, y, train_size = 0.6, random_state = 1)

## Transformation of Data:
Once the messages and subjects have been extracted from the emails, the next step is to transform them into vectors. One way to do that is:
- Split each message into words.
- Append all the words into an array.
- Drop repeated words to obtain the vocabulary.
- Convert each email into a vector of size V (the length of the vocabulary), where:
 - an entry is 1 if the corresponding vocabulary word is in the email,
 - an entry is 0 if the corresponding vocabulary word is not in the email.
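
The manual approach described above can be sketched in a few lines of plain Python. The helper names `build_vocabulary` and `to_binary_vector` and the toy messages are hypothetical, chosen only for illustration; this is not the code that was originally used.

```python
def build_vocabulary(messages):
    """Return the sorted list of distinct lower-cased words across all messages."""
    words = set()
    for message in messages:
        words.update(message.lower().split())
    return sorted(words)


def to_binary_vector(message, vocabulary):
    """Return a 0/1 list of length V: 1 where the vocabulary word occurs in the message."""
    present = set(message.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Toy training messages (invented for this example).
train_messages = ["win money now", "meeting at noon", "win a free prize now"]
vocabulary = build_vocabulary(train_messages)

# Repeated words ("money money") still produce a single 1: only presence matters.
vector = to_binary_vector("free money money", vocabulary)

print(vocabulary)
print(vector)
```

Because the vector records only presence, an email that repeats a word many times looks the same as one that uses it once.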

The other (and more convenient) way to transform the data into vectors is to use:
- CountVectorizer 

The former process takes a long time to run. However, if we want to add a spell-check step, it is easy to do with the first way of vectorizing. 

Note: When using *CountVectorizer*, set the *min_df* parameter to 1/train_size. This improved the accuracy. A possible reason is that some emails contain the time or date of a rendezvous, the name of the sender, or other words that occur only once in the whole data set, and such words add little information about whether an email is spam or ham. 
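
To see what `min_df` controls, the sketch below fits `CountVectorizer` on a toy corpus invented for illustration. In scikit-learn, an integer `min_df` is an absolute document count, while a float is read as a proportion of the documents. The example uses the integer form so that the cutoff is easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up): "free" and "offer" occur in two documents,
# every other word occurs in exactly one.
docs = ["free offer today", "free offer tomorrow", "lunch with sam"]

# Default min_df=1 keeps every word that occurs in at least one document.
cv_default = CountVectorizer()
cv_default.fit(docs)
vocab_default = sorted(cv_default.vocabulary_)

# min_df=2 keeps only words that occur in at least two documents.
cv_min2 = CountVectorizer(min_df=2)
cv_min2.fit(docs)
vocab_min2 = sorted(cv_min2.vocabulary_)

print(vocab_default)
print(vocab_min2)
```

Words that appear in a single document, such as a name or a date, drop out of the vocabulary once the cutoff is raised above one document.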

In [11]:
cv = CountVectorizer(min_df=1/len(y_train))

In [12]:
x_train = cv.fit_transform(np.array(train_msg_and_sub))
x_test = cv.transform(np.array(test_msg_and_sub))

## Training a model:
Initially, I used the first way of vectorizing the data. That representation ignores how often a word is repeated and only records whether it is present, so I used the Gaussian Naive Bayes (**GaussianNB**) classifier.<br>
*CountVectorizer*, however, does count repetitions, and *GaussianNB* fails to take the frequency of words into account. So I switched to the **MultinomialNB** classifier, which does exactly that. This improved the classification accuracy from 96.5% for *GaussianNB* to 98.6% for *MultinomialNB*.
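
As a rough illustration of count vectors feeding **MultinomialNB**, here is a sketch on a tiny made-up corpus. The texts, labels, and `toy_` variable names are assumptions for this example only. The labels follow the notebook's convention of 1 for ham and 0 for spam.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (invented): 0 = spam, 1 = ham.
toy_texts = [
    "win free money now",
    "free prize win win",
    "meeting agenda for monday",
    "lunch meeting on monday",
]
toy_labels = [0, 0, 1, 1]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_texts)

# CountVectorizer keeps repetitions: "win" occurs twice in the second text.
win_count = toy_counts[1, toy_cv.vocabulary_["win"]]

toy_model = MultinomialNB()
toy_model.fit(toy_counts, toy_labels)

toy_pred = toy_model.predict(toy_cv.transform(["free money win", "monday meeting"]))
print(win_count, list(toy_pred))
```

The word counts enter the multinomial likelihood directly, so a word repeated several times in an email contributes more evidence than a word that appears once.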

In [13]:
model_multinb = MultinomialNB()
model_multinb.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
y_pred = model_multinb.predict(x_test)
y_pred_train = model_multinb.predict(x_train)

In [15]:
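def get_metrics(prediction, positive_class):
    '''Returns accuracy, recall, precision and F1 score.
       prediction: predicted labels; positive_class: true labels (1 = ham is the positive class).'''
    # scikit-learn's convention is confusion_matrix(y_true, y_pred): rows are true labels.
    con_mat = confusion_matrix(positive_class, prediction)
    
    true_positive = con_mat[1,1]
    true_negative = con_mat[0,0]
    predicted_positive = (prediction == 1).sum()
    positive_labels = (positive_class == 1).sum()
    
    accuracy = (true_positive + true_negative)/len(positive_class)
    recall = true_positive/positive_labels          # TP / actual positives
    precision = true_positive/predicted_positive    # TP / predicted positives
    f1_score = 2*recall*precision/(recall + precision)
    
    return accuracy, recall, precision, f1_score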
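
To make the formulas in `get_metrics` concrete, here is a small worked example in plain Python. The eight labels are invented for illustration, with 1 as the positive class (ham), as in the notebook.

```python
# Made-up true labels and predictions (1 = ham, 0 = spam).
true_labels = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1, 1, 0, 1]

pairs = list(zip(true_labels, predictions))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # ham predicted as ham
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # spam predicted as spam
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # spam predicted as ham
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # ham predicted as spam

toy_accuracy = (tp + tn) / len(true_labels)
toy_recall = tp / (tp + fn)        # TP / actual positives
toy_precision = tp / (tp + fp)     # TP / predicted positives
toy_f1 = 2 * toy_recall * toy_precision / (toy_recall + toy_precision)

print(tp, tn, fp, fn)
print(toy_accuracy, toy_recall, toy_precision, toy_f1)
```

Here four of the five ham emails are recovered and four of the five ham predictions are correct, so recall and precision coincide.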

In [16]:
accuracy, recall, precision, f1_score = get_metrics(y_pred, y_test)
acc_train, recall_train, prec_train, f1_train = get_metrics(y_pred_train, y_train)

In [17]:
print( ' Metrics on test:','\n','\n', 'Accuracy = ', accuracy,'\n','Recall = ', recall, '\n', 'Precision = ', precision, '\n', 'F1_score = ', f1_score, '\n','\n','Metrics on train:','\n','\n', 'Accuracy = ', acc_train,'\n','Recall = ', recall_train, '\n', 'Precision = ', prec_train, '\n', 'F1_score = ', f1_train)

 Metrics on test: 
 
 Accuracy =  0.984587488667 
 Recall =  0.990040460629 
 Precision =  0.988809449798 
 F1_score =  0.989424572317 
 
 Metrics on train: 
 
 Accuracy =  0.991839202055 
 Recall =  0.99377593361 
 Precision =  0.995014540922 
 F1_score =  0.994394851567


# Conclusion:
- Extract and process the data, i.e. extract the messages and replace the web links with the word 'link'.
- Transform the data into vectors using CountVectorizer.
- Train a MultinomialNB classifier on these vectors.

And that is it. You have a spam classifier of your own. Cheers!!