#### <h1>Machine Learning</h1>

<h2>Lab assignment 4: Spam filter</h2>

Current email services provide spam filters that classify emails into spam and non-spam with high accuracy. In this lab you will experiment with known classifiers to build your own spam filter.

The goal is to discriminate whether a given email, $x$, is spam ($y$=1) or non-spam ($y$=0). For this you need to convert each email into a feature vector $\vec{x} \in \{0, 1\}^n$. The following will walk you through how such a feature vector can be constructed from an email.

Throughout the rest of this lab, you will use the included datasets, which are a subset of the Spam Assassin Public Corpus[1]. For this lab, you will only use the body of each email (excluding the email headers).

![sampleEmail](sampleEmail.png "sampleEmail") \
Fig 1. Sample email

\
Before starting a machine learning task, it is usually useful to take a look at examples from the dataset. For this we need to import some packages and write a small readFile function.

In [1]:
import numpy as np
import scipy.io
import pandas as pd
import re
import nltk
import operator
import csv
from sklearn.metrics import f1_score

Now we read the text file with the email.

In [2]:
def readFile(filename = None):
    #READFILE reads a file and returns its contents as a single line
    #   file_contents = READFILE(filename) reads a file and returns its
    #   contents in file_contents, with newline characters removed
    with open(filename, 'r') as file:
        file_contents = file.read().replace('\n', '')
    return file_contents


file_contents = readFile('emailSample1.txt')

print('Original email')
print(file_contents)


Original email
> Anyone knows how much it costs to host a web portal ?>Well, it depends on how many visitors you're expecting.This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big..To unsubscribe yourself from this mailing list, send an email to:groupname-unsubscribe@egroups.com


**Preprocessing Emails**

The above is a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts. While many emails contain similar types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string “httpaddr” to indicate that a URL was present. This lets the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam are very small.

In the function processEmail, we have implemented the following email preprocessing and normalization steps:

* Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
* Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags, so that only the content remains.
* Normalizing URLs: All URLs are replaced with the text “httpaddr”.
* Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.
* Normalizing Numbers: All numbers are replaced with the text “number”.
* Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
* Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the stemmer strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.
* Removal of non-words: Non-words and punctuation are removed. All white space (tabs, newlines, spaces) is trimmed to a single space character.
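
The normalization steps listed above can be sketched in a few lines. This is only an illustration, not the lab's `processEmail`: the helper name `normalize_text` and the sample string are made up, stemming and tokenization are left out, and URLs are normalized before numbers here so that digits inside a URL are not rewritten first.

```python
import re

def normalize_text(text):
    """Hypothetical helper: apply the normalization steps to a raw email body."""
    text = text.lower()                                                # lower-casing
    text = re.sub(r'<[^<>]+>', ' ', text)                              # strip HTML tags
    text = re.sub(r'(http|https)://\S*', 'httpaddr', text)             # URLs
    text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', 'emailaddr', text)      # email addresses
    text = re.sub(r'[0-9]+', ' number ', text)                         # numbers
    text = re.sub(r'\$', ' dollar ', text)                             # dollar signs
    return re.sub(r'\s+', ' ', text).strip()                           # collapse white space

# Made-up sample text, not taken from the corpus
sample = 'Visit <b>http://example.com/deal42</b> or mail Bob@example.org, only $100!'
normalized = normalize_text(sample)
print(normalized)
```

Punctuation is still present in the result; in this lab it is removed later, during tokenization.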


**Vocabulary List**

After preprocessing the emails, we have a list of words for each email. The next step is to choose which words we would like to use in our classifier and which we would want to leave out.

For this lab assignment, we have chosen only the most frequently occurring words as our set of words considered (the vocabulary list). Since words that occur rarely in the training set are only in a few emails, they might cause the model to overfit our training set. The complete vocabulary list is in the file vocab.txt.

Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
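
As a rough illustration of this frequency-based selection, the sketch below counts words over a tiny made-up corpus and keeps those above a threshold. The helper `build_vocab`, the toy corpus and the threshold of 2 are assumptions for the example; the lab itself ships the finished list in vocab.txt.

```python
from collections import Counter

def build_vocab(tokenized_emails, min_count):
    """Hypothetical helper: keep words seen at least min_count times, sorted alphabetically."""
    counts = Counter(word for email in tokenized_emails for word in email)
    return sorted(word for word, c in counts.items() if c >= min_count)

# Toy corpus of already preprocessed emails (made up for illustration)
toy_corpus = [
    ['click', 'here', 'to', 'win'],
    ['win', 'a', 'prize', 'here'],
    ['meet', 'here', 'to', 'talk'],
]
toy_vocab = build_vocab(toy_corpus, min_count=2)
print(toy_vocab)
```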

In [7]:
# get Vocabulary

def getVocabList():
    vocabList = [' ' for i in range(1899)]
    with open('vocab.txt') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter='\t')
        line_count = 0
        for row in csv_reader:
            vocabList[line_count] = row[1]
            line_count += 1
    return vocabList

vocabList = getVocabList()


---
# Quiz
\
From this point on, you will be asked to complete the code of some functions.

The instructions for doing so are given as comments at the place where the code should be inserted, introduced by the marker:

**# Instructions:**

The code must be inserted below the instructions, after the comment line:

&#35; ====================== YOUR CODE HERE ====================== 

\
Take a look at the example below, function `processEmail()`. This function converts the email text into stemmed words, and then into an array of vocabulary indices, named **word_indices**.


In [4]:
def processEmail(email_contents = None):
    #   word_indices = PROCESSEMAIL(email_contents) preprocesses
    #   the body of an email and returns a list of indices of the
    #   words contained in the email.

    # ========================== Preprocess Email ===========================
    # Headers
    # Handle them below if you are working with raw emails with the
    # full headers

    # Lower case
    email_contents = email_contents.lower()

    # Strip all HTML
    # Looks for any expression that starts with < and ends with > and replace
    # it with a space
    pattern = '<[^<]+?>'
    email_contents = re.sub(pattern, ' ', email_contents)

    # Look for one or more characters between 0-9
    pattern = r'[0-9]+'
    # Match all digits in the string and replace them with 'number'
    email_contents = re.sub(pattern, ' number', email_contents)

    # Handle URLS
    # Look for strings starting with http:// or https://
    pattern = r'(http|https)://\S*'
    email_contents = re.sub(pattern, 'httpaddr', email_contents)

    # Handle Email Addresses
    # Look for strings with @ in the middle
    pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
    email_contents = re.sub(pattern, 'emailaddr', email_contents)

    # Handle $ sign
    # Replace every dollar sign with the word 'dollar'
    pattern = r'\$'
    email_contents = re.sub(pattern, 'dollar', email_contents)

    # ========================== Tokenize Email ===========================

    # Output the email to screen as well
    print('\n==== Processed Email ====\n\n')
    # Process file
    l = 0
    
    # Init return value
    word_indices = np.array([])
    # Create the stemmer once and reuse it for every token
    ps = nltk.stem.PorterStemmer()
    for s in re.split("[ .:;\\-,']",email_contents):
        # Tokenize and also get rid of any punctuation
        s = re.sub(r'[^\w\s]','', s)
        # Remove any non alphanumeric characters
        s = re.sub('[^0-9a-zA-Z]+', ' ', s)

        # Stem the word
        s = ps.stem(s)

        # Skip the word if it is too short
        if len(s) < 1:
            continue
        # Look up the word in the dictionary and add to word_indices if
        # found
        
        # ====================== YOUR CODE HERE ======================
        # Instructions: Fill in this function to add the index of s to
        #               word_indices if it is in the vocabulary. At this point
        #               of the code, you have a stemmed word from the email in
        #               the variable s. You should look up s in the
        #               vocabulary list (vocabList). If a match exists, you
        #               should add the index of the word to the word_indices
        #               vector. Indices are 1-based (position in vocabList
        #               plus one): if s = 'action' and vocabList[17] ==
        #               'action' (the 18th entry), then you should add 18 to
        #               the word_indices vector.
       
        for i in np.arange(0, len(vocabList)) :
            if (vocabList[i] == s):
                word_indices = np.append(word_indices, i+1)

        # =============================================================
        # Print to screen, ensuring that the output lines are not too long
        if (l + len(s) + 1) > 78:
            print()
            l = 0
        print('%s ' % (s),end='')
        l = l + len(s) + 1
        
    # Print footer
    print('\n\n=========================\n')
    return word_indices

word_indices = processEmail(file_contents)
# Print Stats
print('Word Indices: ', word_indices)



==== Processed Email ====


anyon know how much it cost to host a web portal well it depend on how mani 
visitor you re expect thi can be anywher from less than number buck a month 
to a coupl of dollar number you should checkout httpaddr or perhap amazon ec 
number if your run someth big to unsubscrib yourself from thi mail list send 
an email to emailaddr 


Word Indices:  [  86.  916.  794. 1077.  883.  370. 1699.  790. 1822. 1831.  883.  431.
 1171.  794. 1002. 1893. 1364.  592. 1676.  238.  162.   89.  688.  945.
 1663. 1120. 1062. 1699.  375. 1162.  477. 1120. 1893. 1510.  799. 1182.
 1237.  512. 1120.  810. 1895. 1440. 1547.  181. 1699. 1758. 1896.  688.
 1676.  992.  961. 1477.   71.  530. 1699.  531.]


The above is the processed sample email. While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with for performing feature extraction.

Given the vocabulary list, we can now map each word in the preprocessed emails into a list of word indices that contains the index of the word in the vocabulary list. For example, in the sample email, the word “anyone” was first normalized to “anyon” and then mapped onto the index 86 in the vocabulary list.
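
The word-to-index mapping can also be done with a dictionary, which avoids scanning the whole vocabulary for every token. The sketch below is only an illustration: the four-word `toy_vocab` stands in for `vocabList`, and the indices are 1-based to match the numbering printed in the sample output above.

```python
# Tiny stand-in for vocabList (made up for illustration)
toy_vocab = ['amazon', 'anyon', 'know', 'web']

# Map each word to its 1-based position in the vocabulary
word_to_index = {word: i + 1 for i, word in enumerate(toy_vocab)}

# Stemmed tokens of a hypothetical email; 'portal' is not in the toy vocabulary
tokens = ['anyon', 'know', 'portal', 'web']
toy_indices = [word_to_index[t] for t in tokens if t in word_to_index]
print(toy_indices)
```

Words missing from the vocabulary are simply skipped, which is also what the lookup in `processEmail` does.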

**Your first task**

Your first task is to complete the code in emailFeatures(word_indices = None) below that takes in a word_indices array and produces a feature vector from the word indices. In other words, you will now implement the feature extraction that converts each email into a vector in $\{0, 1\}^n$. For this exercise, you will be using $n$ = number of words in vocabulary list. Specifically, the feature $x_i \in \{0,1 \}$ for an email corresponds to whether the $i$-th word in the dictionary occurs in the email. That is, $x_i=1$ if the $i$-th word is in the email and $x_i=0$ if the $i$-th word is not present in the email. 
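
To make the definition concrete, here is a small worked example with a made-up vocabulary of size 6. It is a sketch under the assumption that the word indices are 1-based, as in the sample output above, so index $i$ sets position $i-1$ of the NumPy array; the helper name `toy_email_features` is hypothetical.

```python
import numpy as np

def toy_email_features(word_indices, n):
    """Hypothetical helper: binary feature vector of length n from 1-based word indices."""
    x = np.zeros(n)
    idx = np.asarray(word_indices, dtype=int) - 1   # shift 1-based indices to 0-based positions
    x[idx] = 1.0                                    # repeated indices still give a single 1
    return x

# Words 2 and 5 occur in the email; word 2 occurs twice
x_toy = toy_email_features([2, 5, 2], n=6)
print(x_toy)
```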

In [10]:
def emailFeatures(word_indices = None):
    #   x = EMAILFEATURES(word_indices) takes in a word_indices vector and
    #   produces a feature vector from the word indices.

    # Total number of words in the dictionary
    n = len(vocabList)

    # You need to return the following variables correctly.
    x = np.zeros(n)
  

    # Instructions: Fill in this function to return a feature vector for the
    #               given email (word_indices). To help make it easier to
    #               process the emails, we have already pre-processed each
    #               email and converted each word in the email into an index in
    #               a fixed dictionary (of 1899 words). The variable
    #               word_indices contains the list of indices of the words
    #               which occur in one email.

    #               Concretely, if an email has the text:
    #                  The quick brown fox jumped over the lazy dog.
    #               Then, the word_indices vector for this text might look
    #               like:
    #                   60  100   33   44   10     53  60  58   5
    #               where, we have mapped each word onto a number, for example:
    #                   the   -- 60
    #                   quick -- 100
    #                   ...
    #              (note: the above numbers are just an example and are not the
    #               actual mappings).

    #              Your task is to take one such word_indices vector and construct
    #              a binary feature vector that indicates whether a particular
    #              word occurs in the email. The indices in word_indices are
    #              1-based, while NumPy arrays are 0-based, so word i sets
    #              x[i - 1] = 1. Concretely, if the word 'the' (say, index 60)
    #              appears in the email, then x[59] = 1. The feature
    #              vector should look like:

    #              x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];

    # ====================== YOUR CODE HERE ======================
    for word_index in word_indices:
        # processEmail returns 1-based indices; shift to 0-based positions
        x[int(word_index) - 1] = 1.0
    
    return x

In [11]:
print('\nExtracting features from sample email (emailSample1.txt)\n')
features = emailFeatures(word_indices)
# Print Stats
print('Length of feature vector: %d' % len(features))
print('Number of non-zero entries: %d \n' % sum(features > 0))


Extracting features from sample email (emailSample1.txt)




**Second task**

After you have completed the feature extraction functions, the next step will load a preprocessed training dataset that will be used to train a classifier. **spamTrain.mat** contains 4000 training examples of spam and non-spam email, while **spamTest.mat** contains 1000 test examples. Each original email was processed using the processEmail and emailFeatures functions and converted into a vector $x \in \{0, 1\}^{1899}$. After loading the dataset, train a classifier of your choice for discriminating between spam ($y=1$) and non-spam ($y=0$) emails. 

Your current task is to train this classifier and record the accuracies achieved on both the training and the test sets. It is recommended to regularize your classifier. You can use either your own classifier code or sklearn classifiers.
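
One possible shape for this step is sketched below with scikit-learn's `LogisticRegression` on synthetic data. This is not the expected solution: the data, the toy labeling rule and the value `C=1.0` are assumptions chosen only to show the fit/score pattern. Note that in scikit-learn `C` is the inverse regularization strength and must be positive, so a placeholder value of 0 has to be replaced before training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab data: 300 "emails" with 20 binary word features
X_toy = rng.integers(0, 2, size=(300, 20))
y_toy = X_toy[:, 0]          # toy rule: spam exactly when word 0 is present

# Hold out the last 100 rows to mimic the train/test split
X_tr, y_tr = X_toy[:200], y_toy[:200]
X_te, y_te = X_toy[200:], y_toy[200:]

# Smaller C means stronger L2 regularization (C is the inverse strength)
toy_model = LogisticRegression(C=1.0, max_iter=1000)
toy_model.fit(X_tr, y_tr)

train_acc = toy_model.score(X_tr, y_tr)
test_acc = toy_model.score(X_te, y_te)
print('Toy training accuracy: %f' % train_acc)
print('Toy test accuracy: %f' % test_acc)
```

Comparing the two accuracies for a few values of `C` is a simple way to see the effect of regularization on overfitting.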


In [None]:
data_training = scipy.io.loadmat('spamTrain.mat')
print('\nTraining \n')
C = 0 # regularization coefficient

X=data_training['X']
y=data_training['y'].ravel()

# ====================== YOUR CODE HERE ======================

...
print('Training Accuracy: %f\n' % model.score(X, y))

After training the classifier, we can evaluate it on a test set. We have included a test set in spamTest.mat.

In [None]:
print('\nTesting \n')

# Accuracy
data_test=scipy.io.loadmat('spamNotebookBase/spamTest.mat')
Xtest = data_test['Xtest']
ytest = data_test['ytest'].ravel()  # flatten the labels to 1-D, as done for y above
print('Test Accuracy: %f\n' % model.score(Xtest, ytest))

# F1-score
yp = model.predict(Xtest)
print('Test F1-score: %f\n' % f1_score(ytest, yp))

In [None]:
#show samples:
def email_words(X):
    m = len(X)
    words = []
    for i in np.arange(0, m):
        if X[i] == 1:
            words.append(vocabList[i])
    return words    

print('Targets: ', end="")
print(data_training['y'].T)
print('\nSpam email sample words:')
print(email_words(data_training['X'][0,:]))
print('\nLegit email sample words:')
print(email_words(data_training['X'][2,:]))
print()

**Top Predictors of Spam**

We can inspect the weights learned by the model to better understand how it determines
whether an email is spam or not. The following code finds the words with
the highest weights in the classifier. Informally, the classifier
assigns high credit to these words as the most likely indicators of spam.
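
The ranking idea can be illustrated on made-up numbers. In the sketch below, `toy_coef` imitates the `(1, n)` shape that `coef_` has for a binary linear model in scikit-learn, and `toy_vocab` stands in for `vocabList`; both are assumptions for the example, and only the top 3 words are taken instead of 10.

```python
import numpy as np

# Made-up vocabulary and weights, for illustration only
toy_vocab = ['click', 'meet', 'prize', 'talk', 'win']
toy_coef = np.array([[1.2, -0.8, 2.5, -1.1, 0.9]])   # shape (1, n)

# argsort is ascending, so reverse it to get the largest weights first
order = np.argsort(toy_coef.ravel())[::-1]
top_words = [toy_vocab[i] for i in order[:3]]
print(top_words)
```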


In [None]:

# Sort the weights and obtain the corresponding entries in the vocabulary list
weights = model.coef_

# ====================== YOUR CODE HERE ======================
....

print('\nTop 10 words indicators of spam: %s\n' % words)


**Try Your Own Emails**

Now that you've trained the spam classifier, you can use it on your own
emails! In the starter code, we have included spamSample1.txt,
spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.

The following code reads in one of these emails and then uses your
learned classifier to determine whether the email is Spam or Not Spam.

Set the file to be read in (change this to spamSample2.txt,
emailSample1.txt or emailSample2.txt to see different predictions on
different email types). Try your own emails as well!
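
One detail worth keeping in mind here: scikit-learn style models expect a 2-D array of shape `(n_samples, n_features)`, while a single email is a 1-D feature vector. The sketch below shows the reshape on a stand-in model. `ToyModel` and `predict_one` are hypothetical names, and the weights are made up; with a real classifier only the reshape and the conversion to `int` would carry over.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in exposing a scikit-learn style predict(X) on 2-D input."""
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = bias

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        if X.ndim != 2:
            raise ValueError('expected a 2-D array of shape (n_samples, n_features)')
        return (X @ self.weights + self.bias > 0).astype(int)

def predict_one(model, x):
    # Turn the 1-D feature vector into a single-row matrix before predicting
    x = np.asarray(x).reshape(1, -1)
    return int(model.predict(x)[0])

toy_model = ToyModel(weights=[2.0, -1.0, 0.5], bias=-1.0)
spam_like = predict_one(toy_model, [1, 0, 1])   # score 2.0 + 0.5 - 1.0 = 1.5
ham_like = predict_one(toy_model, [0, 1, 0])    # score -1.0 - 1.0 = -2.0
print(spam_like, ham_like)
```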


In [None]:
def predict(model, X):

    # use model to predict if X email is SPAM or NOT
    # if it is SPAM, return 1
    # else return 0
    # ====================== YOUR CODE HERE ======================
    ....


In [None]:
#Test the predictor on sample emails

filename = 'spamNotebookBase/emailSample1.txt'     #Should return 0: No spam
#filename = 'spamNotebookBase/emailSample2.txt'    #Should return 0: No spam    
#filename = 'spamNotebookBase/spamSample1.txt'     #Should return 1: SPAM
#filename = 'spamNotebookBase/spamSample2.txt'     #Should return 1: SPAM

# Read and predict
file_contents = readFile(filename)
word_indices = processEmail(file_contents)
x = emailFeatures(word_indices)
p = predict(model, x)
print('File: ', filename)
print('Spam Classification:', p)
print('(1 indicates spam, 0 indicates not spam)\n\n')

**Credits** 

This lab assignment is based on an Octave programming project of the course Machine Learning from Coursera[2]. 

References
* [1] http://spamassassin.apache.org/publiccorpus/
* [2] http://www.coursera.org/
