# Project 4: SVM Spam Classifier

### 1- Basic notions about machine learning
- Give the steps of the algorithm of best subset selection.
- What is the difference between supervised and unsupervised learning?
- Describe a real-life situation in which linear regression might be useful (specify thefeatures, design the predictor) and transforming the predictor logistic regression might be also used. Describe that transformation of the predictor

### 2- Programming Section
In this project, you are invited to work with spam email dataset from Apache SpamAssas- sin Project. First, you should train an SVM classifier using ”spam train.txt” file in github. This file contains 1899 features per email (line). Each features represent the number of oc- currence of a given word from ”vocab list.txt” file. In fact, raw emails from SpamAssassin project are preprocessed to substitute or remove some expression and punctuation. Then, we select the most occurring words in the resulting emails to build a vocabulary list. In our case, we select all words which occur at least a 100 times in all emails. This results in a list of 1899 words. In practice, a vocabulary list with about 10 000 to 50 000 words is often used.
After training the SVM Classifier, you should test it on raw emails. You have in github folder some file containing email samples. You could also use some email from your mail- box. You should preprocess these raw emails to extract features. Thus, you could feed the features vector to your trained classifier that predict if it is spam email or not. Finally, you should evaluate the generalized performance of your spam classifier on the given test set.

### Preprocessing Emails
In general emails contain different types of entities (e.g. numbers, dollar amount, URLs, or other email addresses). These entities will be different in almost every email. Therefore, one method often employed in processing emails is to normalize these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string httpaddr to indicate that a URL was present. This has the effect of letting the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This typ- ically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam is very small.
Below needed preprocessing and normalization steps are enumerated. Besides, some regular expressions and function usefull for these processing are given:

- Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
- Stripping HTML: All HTML tags are removed from the emails. Many emails often come with HTML formatting. You should remove all the HTML tags by replacing this regular expression ’<[^<>]+>’ with white space. You could use ”sub()” function from regular expression library in python.
- Normalizing URLs: All URLs are normalized by replacing this regular expression ’(http|https)://[^\s]*’ with the text ’httpaddr’.
- Normalizing Email Addresses: All email addresses are normalized by replacing this regular expression ’[^\s]+@[^\s]+’ with the text ’emailaddr’.
- Normalizing Numbers: All numbers are normalized by replacing this regular expression ’[0-9]+’ with the text ’number’.
- Normalizing Dollars: All dollar signs (**$**) are replaced with the text dollar (find the appropriate regular expression to use).

- Removal of non-words: Non-words and punctuation should be removed and all white spaces (tabs, newlines, spaces) have to be trimmed to a single space character by replacing the following regular expression with a single white space.
      ’[@/#.\-:\[\]&*+=?!(){},\’\’">_<;% \t\n\r]+’
Concerning the leading and trailing space you could removed them using ”strip()” function.
- Word Stemming: Words are reduced to their stemmed form. For example, discount, discounts, discounted and discounting are all replaced with discount. Sometimes, the Stemmer actually strips off additional characters from the end, so include, includes, included, and including are all replaced with includ. You may use PorterStemmer class from nltk.stem module. For that you need to download ”punkt” package using this command in python: nltk.download(’punkt’)

<font color="blue">**Question 1: **</font> Load the dataset from ”spam_train.txt” file in github and explore it. Try to solve the problem of missing value if any.

In [49]:
%matplotlib notebook
import numpy as np
import pandas as pd

# load and extract data
data = np.loadtxt("spam_train.txt",dtype="object")
print("The size of data is:", data.shape)

# Solving problem of missing value, considering that the missing values have type NAN
df = pd.DataFrame(data)
df.dropna()

m = df.shape[0]   # number of emails
y = df[df.columns[-1:]].values.flatten().astype(np.float) 
X = df[df.columns[0:-1]].values.astype(np.float) 

The size of data is: (4000, 1900)


<font color="blue">**Question 2: **</font> Train an SVM classifier using loaded training data.

In [50]:
from sklearn import svm
import matplotlib.pyplot as plt

# create and train SVM classifier
lambda_ = 1
C = 1/lambda_
lin_svm = svm.SVC(kernel="linear")
lin_svm.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

<font color="blue">**Question 3: **</font> Introduce the regularization on your model if it wasnt used and tune its regularization parameter.

In [83]:
from math import floor

# load and extract data test
test = np.loadtxt("spam_test.txt")
print("The size of test data is:", test.shape)

# shuffle data
np.random.seed(5190969)
test=np.random.permutation(test)

# Solving problem of missing value, considering that the missing values have type NAN
test_df = pd.DataFrame(test)
test_df.dropna()

#extract data
m = test_df.shape[0] # number of samples
y_test = test_df[test_df.columns[-1:]].values.flatten().astype(np.float)  
X_test = test_df[test_df.columns[0:-1]].values.astype(np.float) 


C = 1  # regularization parameter
K = 5  # number of cross validation folds
fold_size = int(floor(m/K))  #size of fold
lambda_list = [0.01, 0.03, 0.1,0.3, 1,100]
print(X_test.shape)

# loop over different sigma values
for lambda_ in lambda_list:
    C = 1/lambda_
    val_accuracy=np.zeros((K,))
    # loop over different folds and calculate the mean accuracy
    for k in range(K):
        X_val = X_test[k*fold_size:(k+1)*fold_size]  
        y_val = y_test[k*fold_size:(k+1)*fold_size]
        gauss_svm = svm.SVC(C=C,kernel='linear')
        gauss_svm.fit(X, y)
        y_val_pred = lin_svm.predict(X_val).astype(float)
        val_accuracy[k]=(y_val==y_val_pred).sum()/fold_size*100
    
    mean_val_accuracy = val_accuracy.mean()
    print("The {}-fold cross validation mean accuracy for '\u039B=".format(K), lambda_," is: ",mean_val_accuracy,"%")



The size of test data is: (1000, 1900)
(1000, 1899)
The 5-fold cross validation mean accuracy for 'Λ= 0.01  is:  97.8 %
The 5-fold cross validation mean accuracy for 'Λ= 0.03  is:  97.8 %
The 5-fold cross validation mean accuracy for 'Λ= 0.1  is:  97.8 %
The 5-fold cross validation mean accuracy for 'Λ= 0.3  is:  97.8 %
The 5-fold cross validation mean accuracy for 'Λ= 1  is:  97.8 %
The 5-fold cross validation mean accuracy for 'Λ= 100  is:  97.8 %


<font color="blue">**Question 4: **</font> Implement the preprocessing steps described above. Try to use given regular expressions and functions that may help. Then, you should split the email with white space character in order to get a list of words. Thus, you could apply stemming word by word. However, you need to remove any non-alphanumeric character from each word by replacing this regular expression ’[^a-zA-Z0-9]’ with empty string.

In [60]:
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

nltk.download('punkt')


def preprocessing(email):
    email=email.lower()  # Lower-casing
    email= re.sub(r'<[^<>]+>',r' ',email) # Stripping HTML
    email= re.sub(r'(http|https)://[^\s]*',r'httpaddr',email) # Normalizing URLs
    email= re.sub(r'[^\s]+@[^\s]+',r'emailaddr',email) # Normalizing Email Addresses
    email= re.sub(r'[0-9]+',r'number',email) # Normalizing Numbers
    email= re.sub(r'\$',r'dollar',email) # Normalizing Dollars
    email= re.sub(r'[@/#.\-:\[\]&*+=?!(){},\’\’">_<;% \t\n\r]+',r' ',email) # Removal of non-words
    email = email.strip()  
    words = email.split()
    email_final = []

    for w in words:
        w= re.sub(r'[^a-zA-Z0-9]',r'',w)
        email_final.append(ps.stem(w)) # Word Stemming
    
    return email_final

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lucascoiado/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<font color="blue">**Question 5: **</font> Load the vocabulary list (”vocab list.txt” file in github) and extract features from the processed email. For that you should count the occurrence in the email of each word in the vocabulary list. Then, put this count in a vector of 1899 elements according to the order of word in the vocabulary list. For example, if the 3rd element of the resulting features vector is 5 this means that the 3rd word in the vocabulary list was encountered 5 times in the processed email.

In [61]:

def occurrences(email):
    email = preprocessing(email)
    # load and extract vocabulary 
    vocabul = np.loadtxt("vocab_list.txt",dtype='str')[:,1,np.newaxis]
    occurrences = np.zeros((1,1899))

    # Calculating the occurrences
    for index,w in enumerate(vocabul):
        occurrences[:,index] = email.count(w)
    return occurrences

with open('emailSample1.txt', 'r') as myfile:
    email1=myfile.read().replace('\n', '')
    occu1 = occurrences(email1)

<font color="blue">**Question 6: **</font> Estimate the generalized performance of this model using adequate metrics..

In [80]:
from  sklearn import metrics
# SVM classifier
gauss_svm = svm.SVC(C=C,kernel='linear')
gauss_svm.fit(X, y)
y_pred =  lin_svm.predict(X_test)


print("The sklearn confusion matrix:\n",metrics.confusion_matrix(y_test, y_pred))
print("The sklearn precision:",metrics.precision_score(y_test, y_pred))
print("The sklearn recall:",metrics.recall_score(y_test, y_pred))
print("The sklearn F1_score:",metrics.f1_score(y_test, y_pred))
print("The sklearn classification report:\n",metrics.classification_report(y_test, y_pred))

# Load the given emails
with open('emailSample1.txt', 'r') as myfile:
    email1=myfile.read().replace('\n', '')
    occu1 = occurrences(email1)
    
with open('emailSample2.txt', 'r') as myfile:
    email2=myfile.read().replace('\n', '')
    occu2 = occurrences(email2)

with open('spamSample1.txt', 'r') as myfile:
    email3=myfile.read().replace('\n', '')
    occu3 = occurrences(email3)
    
with open('spamSample2.txt', 'r') as myfile:
    email4=myfile.read().replace('\n', '')
    occu4 = occurrences(email4)

# Test with given emails
occu =np.concatenate((occu1,occu2,occu3,occu4),axis=0)
y_pred =  lin_svm.predict(occu) 

print("Prediction of Samples emails:", y_pred)

# Testing with own emails
with open('Spam.txt', 'r') as myfile:
    spam=myfile.read().replace('\n', '')
    occu5 = occurrences(spam)
    
    y_pred =  lin_svm.predict(occu5) 
    
with open('Email.txt', 'r') as myfile:
    email=myfile.read().replace('\n', '')
    occu6 = occurrences(email)

occu =np.concatenate((occu5,occu6),axis=0)
y_pred =  lin_svm.predict(occu)
print("Prediction of our own emails:", y_pred)

The sklearn confusion matrix:
 [[679  13]
 [  9 299]]
The sklearn precision: 0.958333333333
The sklearn recall: 0.970779220779
The sklearn F1_score: 0.964516129032
The sklearn classification report:
              precision    recall  f1-score   support

        0.0       0.99      0.98      0.98       692
        1.0       0.96      0.97      0.96       308

avg / total       0.98      0.98      0.98      1000

Prediction of Samples emails: [ 0.  0.  1.  1.]
Prediction of our own emails: [ 1.  0.]
