# AAS lab  -  Spam Classification with SVM

Objectives: Apply a Support Vector Machine (SVM) to build a spam classifier.

You will train an SVM classifier to classify whether a given email is spam (y = 1) or non-spam (y = 0).

First, each email needs to be converted into a feature vector x. The dataset used in this exercise is based on a subset of the SpamAssassin Public Corpus (https://spamassassin.apache.org/old/publiccorpus/). You will only use the body of the email (excluding the email headers).

In [None]:
#Import relevant libraries
import numpy as np
# pandas - Python Data Analysis Library
import pandas as pd
import matplotlib.pyplot as plt
#to load matlab mat files
from scipy.io import loadmat
#Re package is used for text processing. 
import re
from nltk.stem import PorterStemmer

In [None]:
#one sample
with open("emailSample1.txt", "r") as f:
    file_contents = f.read()

print(file_contents)


### Preprocessing emails. Construct a Vocabulary List
<img src="images/f8.png" style="width:500px;height:150px;">
<caption><center> **Fig.1**: **Sample email** </center></caption>

Fig.1 shows a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts. While many emails contain similar types of entities (e.g., numbers, URLs, email addresses), the specific entities (e.g., the specific URL or dollar amount) differ in almost every email. Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we will replace each URL in the email with the string “httpaddr” to indicate that a URL was present. This lets the spam classifier base its decision on whether any URL was present, rather than on whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize their URLs, so the odds of seeing any particular URL again in a new piece of spam are very small.

The words are also converted to lower case and reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes the stemmer strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.
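The normalization step can be sketched with the standard `re` module. The snippet below is a minimal illustration, not the lab's *processEmail*: the function name `normalize` and the sample sentence are made up here, stemming is left out, and URLs are replaced before numbers so that digits inside a URL do not matter.

```python
import re

def normalize(text):
    """Lower-case the text and replace URLs, email addresses, numbers and '$' with placeholder tokens."""
    text = text.lower()
    text = re.sub(r"(?:http|https)://\S*", "httpaddr", text)  # any URL -> httpaddr
    text = re.sub(r"\S+@\S+", "emailaddr", text)              # any address -> emailaddr
    text = re.sub(r"[0-9]+", "number", text)                  # any run of digits -> number
    text = re.sub(r"[$]+", "dollar", text)                    # any run of '$' -> dollar
    return text

# Hypothetical sample sentence, not taken from the corpus
sample = "Visit http://example.com/deal42 or mail Bob@Example.org to save $100"
normalized = normalize(sample)
print(normalized)
```

After this step, two emails that differ only in the specific URL, address or amount produce the same token sequence.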

After the preprocessing we get a list of words for each email, as shown in Fig.2 for this example. 

<img src="images/f7.png" style="width:450px;height:150px;">
<caption><center> **Fig.2**: **Preprocessed sample email** </center></caption>

The next step is to choose which words to use in the classifier and which to leave out. The vocabulary list (file *vocab.txt*) was built by selecting all words that occur at least 100 times in the training set of emails (the corpus), resulting in a list of 1899 words (Fig.3). Words that occur rarely in the training set were excluded, because they may cause the model to overfit the training set. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.

<img src="images/f5.png" style="width:100px;height:150px;">
<caption><center> **Fig.3**: **Vocabulary list** </center></caption>
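A frequency-based vocabulary of this kind can be sketched with `collections.Counter`. This is only an illustration under assumptions: the function name `build_vocab`, the toy corpus and the threshold of 2 are hypothetical, and the alphabetical ordering with 1-based indices is a choice of this sketch.

```python
from collections import Counter

def build_vocab(tokenized_emails, min_count):
    """Map every token that occurs at least `min_count` times to a 1-based index."""
    counts = Counter(token for email in tokenized_emails for token in email)
    words = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(words, start=1)}

# Hypothetical toy corpus of already preprocessed emails
toy_corpus = [
    ["click", "here", "number"],
    ["click", "dollar", "number"],
    ["click", "meet"],
]
freq_vocab = build_vocab(toy_corpus, min_count=2)
print(freq_vocab)
```

Only "click" and "number" reach the threshold here; the rare words are dropped, which is the same idea as the 100-occurrence cutoff described above.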

Given the vocabulary list, we can now map each word in the preprocessed emails (Fig.2) into a list of word indices that contains the index of the word in the vocabulary list. Fig.4 shows the mapping for the sample email. Specifically, in the sample email, the word “anyone” was first normalized to “anyon” and then mapped onto the index 86 in the vocabulary list.

<img src="images/f6.png" style="width:150px;height:200px;">
<caption><center> **Fig.4**: **Word indices for sample email** </center></caption>
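The lookup itself is a dictionary access per token. The sketch below uses a made-up five-word vocabulary in the `index<TAB>word` layout that the loading code in this lab expects from *vocab.txt*; the names `toy_vocab` and `words_to_indices` are hypothetical.

```python
# Hypothetical five-line vocabulary in "index<TAB>word" format
toy_vocab_text = "1\tanyon\n2\tclick\n3\tdollar\n4\thttpaddr\n5\tnumber\n"

toy_vocab = {}
for line in toy_vocab_text.splitlines():
    index, word = line.split("\t")
    toy_vocab[word] = int(index)  # key is the word, value is its 1-based index

def words_to_indices(tokens, vocab):
    """Return the vocabulary index of every token found in `vocab`; unknown tokens are skipped."""
    return [vocab[token] for token in tokens if token in vocab]

tokens = ["anyon", "know", "click", "httpaddr", "click"]
toy_indices = words_to_indices(tokens, toy_vocab)
print(toy_indices)  # [1, 2, 4, 2]
```

The word "know" is not in the toy vocabulary and is dropped, while a repeated word such as "click" contributes its index once per occurrence.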


In [None]:
#read the vocabulary list and transform it into a dictionary
vocabList = open("vocab.txt","r").read()

#print(vocabList)
#print(len(vocabList)) #=20240, counts all letters, spaces

vocabList=vocabList.split("\n")[:-1]

vocabList_d={}
for ea in vocabList:
    #key is the word; value is the index
    value,key = ea.split("\t")[:]
    vocabList_d[key] = value
 

The function *processEmail* maps words to indices. It takes an email and transforms it into a list of words. For each word, it checks whether the word exists in the vocabulary list. If it does, the index of the word is appended to the word indices variable. If it does not, the word is skipped.

In [None]:
def processEmail(email_contents,vocabList_d):
    """
    Preprocesses the body of an email and returns a list of indices of the words contained 
    in the email. 
    """
    # Lower case
    email_contents = email_contents.lower()
    
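    # Handle numbers
    email_contents = re.sub(r"[0-9]+","number",email_contents)
    
    # Handle URLs
    # Note: [http|https] is a character class that matches a single character,
    # so only the last letter of the scheme is consumed ("http://x" becomes "htthttpaddr").
    # (?:http|https) would match the whole scheme, but may change the counts quoted in this lab.
    email_contents = re.sub(r"[http|https]://[^\s]*","httpaddr",email_contents)
    
    # Handle email addresses
    email_contents = re.sub(r"[^\s]+@[^\s]+","emailaddr",email_contents)
    
    # Handle $ sign
    email_contents = re.sub(r"[$]+","dollar",email_contents)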
    
    # Strip all special characters
    specialChar = ["<","[","^",">","+","?","!","'",".",",",":"]
    for char in specialChar:
        email_contents = email_contents.replace(str(char),"")
    email_contents = email_contents.replace("\n"," ")    
    
    # Stem the word
    ps = PorterStemmer()
    email_contents = [ps.stem(token) for token in email_contents.split(" ")]
    email_contents= " ".join(email_contents)
    
    # Map each stemmed token to its vocabulary index and return word_indices
    
    word_indices=[]
    
    for token in email_contents.split():
        if len(token) > 1 and token in vocabList_d:
            word_indices.append(int(vocabList_d[token]))
    
    return word_indices

In [None]:
word_indices= processEmail(file_contents,vocabList_d)
print(word_indices)


### Extracting binary features from emails

You will now implement the feature extraction that converts each email into a vector. The binary feature $x_i$ for an email indicates whether the i-th word in the dictionary occurs in the email. That is, $x_i$ = 1 if the i-th word is in the email and $x_i$ = 0 if it is not. For a typical email, the feature vector looks like Fig.5, where n is the number of words in the vocabulary list:

<img src="images/f9.png" style="width:100px;height:180px;">
<caption><center> **Fig.5**: **Binary feature vector** </center></caption>

You should now complete the function *emailFeatures* to generate a feature vector for an email, given the word indices. Once you run *emailFeatures* on the email sample, you should see that the feature vector has length 1899 and 43 non-zero entries.
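The idea of a binary feature vector can be illustrated on a toy vocabulary of five words. This is a sketch under assumptions, not the reference solution: the name `binary_features` is hypothetical, and it assumes the word indices are 1-based, so index i is stored at array position i - 1.

```python
import numpy as np

def binary_features(indices, vocab_size):
    """Return a (vocab_size, 1) column vector with 1 at each 1-based index in `indices`."""
    x = np.zeros((vocab_size, 1))
    for i in indices:
        x[i - 1] = 1  # vocabulary indices start at 1, array positions at 0
    return x

# Hypothetical word indices for a five-word vocabulary; index 2 appears twice
toy_x = binary_features([1, 2, 4, 2], vocab_size=5)
print("Length of feature vector:", len(toy_x))
print("Number of non-zero entries:", int(toy_x.sum()))
```

A repeated index sets the same entry again, so the vector records presence rather than counts.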

In [None]:
def emailFeatures(word_indices, vocabList_d):
    """
    Takes in a word_indices vector and  produces a feature vector from the word indices. 
    """
    n = len(vocabList_d)
    
    features = np.zeros((n,1))
    
        
    return features

In [None]:
features = ?
print("Length of feature vector: ",?)  #ANSWER: 1899
print("Number of non-zero entries: ", ?)  #ANSWER: 43

### Training SVM for spam classification

The next step loads a preprocessed training dataset that will be used to train an SVM classifier.

*spamTrain.mat* contains 4000 training examples and their labels of spam/non-spam emails.

*spamTest.mat* contains 1000 test examples. 

Each original email was processed using the *processEmail* and *emailFeatures* functions and converted into a vector $x^{(i)}$ ∈ $R^{1899}$. After loading the training dataset, you will train an SVM to classify emails as spam (y = 1) or non-spam (y = 0). Once training completes, you should see that the classifier reaches a training accuracy of about 99.8% and a test accuracy of about 98.9%.
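The train-then-score workflow can be tried on synthetic data before touching the real files. The sketch below assumes scikit-learn is installed; the data, the labelling rule and all `toy_` names are invented for illustration, so the printed accuracies say nothing about the spam dataset.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab data: 200 "emails" with 20 binary word features
X_toy = rng.integers(0, 2, size=(200, 20))
# Hypothetical labelling rule: spam when at least two of the first three words occur
y_toy = (X_toy[:, :3].sum(axis=1) >= 2).astype(int)

X_fit, y_fit = X_toy[:150], y_toy[:150]
X_held, y_held = X_toy[150:], y_toy[150:]

toy_svc = SVC(kernel="linear", C=0.1)
# scikit-learn expects a 1-D label array; y.ravel() flattens an (m, 1) column
toy_svc.fit(X_fit, y_fit)

train_acc = toy_svc.score(X_fit, y_fit)
held_acc = toy_svc.score(X_held, y_held)
print(f"Training accuracy: {train_acc * 100:.1f} %")
print(f"Held-out accuracy: {held_acc * 100:.1f} %")

# A single email stored as an (n, 1) column must be reshaped to one row of n features
x_col = X_held[0].reshape(-1, 1)
single_pred = toy_svc.predict(x_col.reshape(1, -1))
print("Prediction for one email:", single_pred[0])
```

Arrays returned by `loadmat` are two-dimensional, so labels loaded from a .mat file will usually need the `ravel()` step mentioned in the comment.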

In [None]:
#Use loadmat to load the file spamTrain.mat as a dictionary with keys "X"  and "y" 

spam_mat = loadmat("spamTrain.mat")
print(spam_mat)

In [None]:
#extract the training data from the keys "X"  and "y" 

X_train = ?
y_train = ?

In [None]:
#Apply Support Vector Classifier (SVC) to train (fit) binary classifier and compute the training accuracy. 
#Suggestion: Call SVC with linear kernel and C=0.1
from sklearn.svm import SVC 
SVC?
print("Training Accuracy: ?")  # Answer:~ 99.8 %

In [None]:
#Use loadmat to load the file spamTest.mat as a dictionary with keys "Xtest"  and "ytest" 
#and extract the testing data

X_test = ?
y_test = ?


In [None]:
#Apply the trained SVC classifier to predict the test data and compute Test accuracy

print("Test Accuracy:?")  # ANSWER: 98.9 %

### Top predictors for spam

To better understand how the spam classifier works, we can inspect the parameters to see which words the classifier thinks are the most predictive of spam. The next step finds the parameters with the largest positive values in the classifier and displays the corresponding words. Thus, if an email contains words such as “click”, “guarantee”, “remove”, “dollar”, “price”, etc., it is likely to be classified as spam.
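Ranking words by weight amounts to sorting the coefficient vector in descending order. The sketch below shows the idea with `np.argsort` on invented values; `toy_words` and `toy_weights` are hypothetical and do not come from the trained classifier.

```python
import numpy as np

# Hypothetical vocabulary and linear-SVM weights, one weight per word
toy_words = ["anyon", "click", "dollar", "httpaddr", "number"]
toy_weights = np.array([-0.10, 0.45, 0.30, 0.05, -0.20])

# argsort sorts ascending, so reverse it to get the largest weights first
order = np.argsort(toy_weights)[::-1]
top = [(toy_words[i], float(toy_weights[i])) for i in order[:3]]

for word, weight in top:
    print(word, "\t\t", round(weight, 6))
```

Words with large positive weights push the decision towards spam, while words with negative weights push it towards non-spam.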

In [None]:
weights = spam_svc.coef_[0]  # print(weights.shape)  (1899,)

#first column indices (1,2,3,... 1899), second column all weights 
weights_col = np.hstack((np.arange(1,1900).reshape(1899,1),weights.reshape(1899,1)))

#transform it into data frame 
df = pd.DataFrame(weights_col)

df.sort_values(by=[1],ascending = False,inplace=True)

predictors = []
idx=[]
for i in df[0][:15]:
    for keys, values in vocabList_d.items():
        if str(int(i)) == values:
            predictors.append(keys)
            idx.append(int(values))

In [None]:
print("Top predictors of spam:")

for word, index in zip(predictors, idx):
    print(word, "\t\t", round(df[1][index-1], 6))

We have included two email examples (*emailSample1.txt*, *emailSample2.txt*) and two spam examples (*spamSample1.txt*, *spamSample2.txt*) as test emails. Apply the learned SVM spam classifier to see if the classifier gets them right.

In [None]:
file_contents = open("spamSample1.txt","r").read()

?


### Try your own emails


Now that you have trained a spam classifier, you can try it out on your own emails by replacing the examples (plain text files) with your own emails.


### Build your own dataset

In this project, we provided a preprocessed training set and test set. These datasets were created using the same functions (*processEmail* and *emailFeatures*) that you have completed. Download the original files from the public corpus and run the *processEmail* and *emailFeatures* functions on each email to extract a feature vector from it. This will allow you to build a dataset X, y of examples. You should then randomly divide the dataset into training, cross validation and test sets.

While you are building your own dataset, you may also build your own vocabulary list (by selecting the high frequency words that occur in the dataset) and add any additional features that you find useful.
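A random three-way split can be sketched by shuffling the row indices once and slicing them. The function name `split_indices` and the 60/20/20 proportions are assumptions of this example; the lab does not prescribe them.

```python
import numpy as np

def split_indices(m, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the indices 0..m-1 and cut them into train, cross validation and test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_train = round(train_frac * m)
    n_cv = round(cv_frac * m)
    return perm[:n_train], perm[n_train:n_train + n_cv], perm[n_train + n_cv:]

train_idx, cv_idx, test_idx = split_indices(10)
print(len(train_idx), len(cv_idx), len(test_idx))  # 6 2 2
```

The index arrays can then be used to select rows, for example `X[train_idx]` and `y[train_idx]`, so that features and labels stay aligned.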
