# Support Vector Machines

## 2 Spam Classification

"You will be training a classifier to classify whether a given email, x, is spam (y = 1) or non-spam (y = 0)."

In [1]:
import scipy.io
import numpy as np
from sklearn import svm
import re
import string

from nltk.stem import PorterStemmer
ps = PorterStemmer()

### 2.1 Preprocessing Emails

First we need to normalise the raw email text with regular expressions, then map the words it contains to a vector of word indices.

In [2]:
# Use a context manager so the file is closed even if reading fails
with open('data/emailSample1.txt') as file:
    file_contents = file.read()

In [3]:
file_contents

"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo unsubscribe yourself from this mailing list, send an email to:\ngroupname-unsubscribe@egroups.com\n\n"

This is an example of the kind of data we are working with.

The exercise provides a vocabulary list that contains the 1899 most common words in the email spam corpus. Each word is associated with a number, which we will use for our word indices. The list is a .txt file formatted like so:


```
1	aa
2	ab
3	abil
...
1897	zdnet
1898	zero
1899	zip
```

i.e. newlines separating entries, tabs separating indices from words.

I have written `get_vocab_dict()` to load the file and convert it into a Python dictionary.

In [4]:
def get_vocab_dict():
    '''
    Loads the provided vocab.txt, converts it into a Python
    dictionary, and returns the dictionary.
    '''

    vocab_dict = {}

    with open('data/vocab.txt') as file:
        for line in file:
            # Each line has the form "<index>\t<word>"
            index, word = line.split()

            # We're going to use this to look up indices based on words,
            # so make words the keys and indices (kept as strings) the values
            vocab_dict[word] = index

    return vocab_dict

In [5]:
def process_email(email_contents):
    
    '''
    Preprocesses the body of an email (email_contents) and returns
    a list of indices of the words contained in the email.
    '''

    word_indices = []
    vocab_dict = get_vocab_dict()

    
    ## Preprocess email    
    # Make lower case
    email_contents = email_contents.lower()
    
    # Strip HTML
    # Find expressions that start with <, end with >, and do not
    # contain either < or > in the middle. Replace them with
    # a blank space.
    # Regex cheatsheet: https://www.debuggex.com/cheatsheet/regex/python
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    
    # Numbers
    # Replace numbers with the word 'number'
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)

    # URLs
    # Look for strings starting with http:// or https://
    # replace with 'httpaddr'
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    
    # Email addresses
    # Look for strings with @ in the middle, replace with 'emailaddr'
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    
    # $ sign
    # Replace with 'dollar'
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)


    ## Create list of words in email
    words = email_contents.split()
    
    for word in words:
        # Remove punctuation
        word = word.translate(str.maketrans('','',string.punctuation))
    
        # Remove non-alphanumeric characters
        word = re.sub(r'[^a-zA-Z0-9]', '', word)
    
        # Stem word
        word = ps.stem(word)
        
        # Skip spaces, blank lines
        if len(word) < 1:
            continue
            
        # Look up the word in the dictionary and
        # add to word_indices if found
        if word in vocab_dict:
            word_indices.append(int(vocab_dict[word]))
        
    return word_indices

In [6]:
word_indices = process_email(file_contents)

# Check our result using Fig. 11 in ex6.pdf
print('Expected output:')
print('86 916 794 1077 883 370 1699 790 1822 1831 ...\n')
print('word_indices:')
print(word_indices[:10])

Expected output:
86 916 794 1077 883 370 1699 790 1822 1831 ...

word_indices:
[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831]
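
To see what the normalisation steps in `process_email()` do in isolation, here is a minimal sketch that applies the same substitutions, in the same order, to a short made-up string. The helper name `normalise` and the `sample` text are assumptions for illustration only; they are not part of the exercise.

```python
import re

def normalise(text):
    """Apply the same regex substitutions as process_email(), in the same order."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                   # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # emails -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                     # '$' -> 'dollar'
    return text

# Made-up sample text, not taken from the corpus
sample = 'Visit <b>http://example.com/</b> or mail bob@example.com for $100'
normalised = normalise(sample)
print(normalised.split())
# ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'for', 'dollarnumber']
```

Note that `$100` becomes the single token `dollarnumber`, because the digits are replaced before the dollar sign and there is no whitespace between them.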


### 2.2 Extracting Features from Emails

Convert each email into a vector of features x in R^n, where n = 1899 is the number of words in our vocabulary list.

"x_i ∈ {0, 1} for an email corresponds to whether the i-th word in the dictionary occurs in the email. That is, x_i = 1 if the i-th word is in the email and x_i = 0 if the i-th word is not present in the email."

NB: unlike Matlab/Octave (which this course was designed for), Python indexes from 0, so our feature vector runs from 0 to 1898 rather than from 1 to 1899. The vocabulary indices are therefore off by one relative to positions in the vector, and we have to compensate by subtracting 1.

In [7]:
def email_features(word_indices):
    
    '''
    Takes in a word_indices vector and 
    produces a feature vector from the word indices. 
    '''
    
    # Total number of words in the vocabulary list
    n = 1899
    
    # Feature vector
    x = np.zeros([n, 1])
    
    for index in word_indices:
        x[index - 1] = 1  # -1 because NumPy arrays are 0-indexed but vocab indices start at 1
    
    return x

In [8]:
features = email_features(word_indices)

In [9]:
print('Length of feature vector:', len(features))
print('Number of non-zero entries:', sum(features > 0))

Length of feature vector: 1899
Number of non-zero entries: [44]


"You should see that the feature vector had length 1899 and 45 non-zero entries."

We get 44 non-zero entries rather than 45. My guess is that NLTK's `PorterStemmer` stems one word differently from the stemmer the exercise expects, so that stemmed word no longer matches anything in the provided vocabulary list. This shouldn't be a major problem.
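
The feature construction can also be written without an explicit loop. The following is a sketch only, assuming the same 1-based vocabulary indices as above; the function name `email_features_vectorised` and the toy input are made up for illustration.

```python
import numpy as np

def email_features_vectorised(word_indices, n=1899):
    """Build a binary (n, 1) feature vector from 1-based vocabulary indices."""
    x = np.zeros((n, 1))
    # Shift 1-based vocabulary indices to 0-based array positions
    positions = np.asarray(word_indices, dtype=int) - 1
    # Repeated indices just set the same entry to 1 again
    x[positions] = 1
    return x

# Toy example with a five-word vocabulary; index 3 appears twice
toy = email_features_vectorised([1, 3, 3, 5], n=5)
print(toy.ravel())     # [1. 0. 1. 0. 1.]
print(int(toy.sum()))  # 3
```

Because a repeated word sets the same entry again, the number of non-zero entries counts distinct vocabulary words rather than the total number of words in the email.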

### 2.3 Training SVM for Spam Classification

In [10]:
# spamTrain.mat contains 4000 training examples of spam and non-spam emails that
# have already been converted into feature vectors like I did above
emails_train = scipy.io.loadmat('data/spamTrain.mat')

In [11]:
emails_train.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In [12]:
emails_train['X'].shape

(4000, 1899)

4000 feature vectors, each of length 1899

In [13]:
emails_train['y'].shape

(4000, 1)

In [14]:
# Train the SVM
X = emails_train['X']
y = emails_train['y']
y = y.flatten()

model = svm.SVC(kernel='linear', C=0.1)
model.fit(X, y)

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
pred = model.predict(X)
print('Training accuracy:', np.mean(pred == y) * 100)

Training accuracy: 99.825


"...you should see that the classifier gets a training accuracy of about 99.8%"

In [16]:
# Evaluate model on test set

# spamTest.mat contains 1000 test examples of spam and non-spam emails that
# have already been converted into feature vectors like I did above

emails_test = scipy.io.loadmat('data/spamTest.mat')
emails_test.keys()

dict_keys(['__header__', '__version__', '__globals__', 'Xtest', 'ytest'])

In [17]:
Xtest = emails_test['Xtest']
ytest = emails_test['ytest']
ytest = ytest.flatten()
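
The `flatten()` calls are not cosmetic. The labels are loaded as a column of shape `(m, 1)`, while `model.predict()` returns a 1-D array, so comparing the two directly would broadcast instead of comparing element by element. A small sketch with made-up arrays (`y_column` and `pred` are toy stand-ins, not data from the exercise):

```python
import numpy as np

y_column = np.array([[1], [0], [1], [1]])  # shape (4, 1), like the loaded labels
pred = np.array([1, 0, 0, 1])              # shape (4,), like model.predict(...)

# Without flattening, (4,) == (4, 1) broadcasts to a (4, 4) comparison matrix
wrong = np.mean(pred == y_column) * 100

# With flattening, the comparison is element-wise, as intended
right = np.mean(pred == y_column.flatten()) * 100

print((pred == y_column).shape)  # (4, 4)
print(wrong, right)              # 50.0 75.0
```

Here the broadcast version silently reports 50% instead of the true 75%, which is why the labels are flattened before computing accuracy.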

In [18]:
pred = model.predict(Xtest)
print('Test accuracy:', np.mean(pred == ytest) * 100)

Test accuracy: 98.9


"you should see ... a test accuracy of about 98.5%"

### 2.4 Top Predictors for Spam

Find the words with the largest positive weights in the classifier. These are the top predictors of spam.

In [19]:
# This gives us the weights for each of the 1899 features
model.coef_.shape

(1, 1899)

The position of each element in `model.coef_` corresponds to the feature index, and we need to preserve this information since feature indices correspond to words in our vocabulary list. An easy way to do this is to use `np.argsort()`, which returns the indices that would sort an array.
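
As a quick illustration of what `np.argsort()` returns, here is a sketch on a toy weight array. The array `coef` is made up and only mimics the `(1, n)` shape of `model.coef_`.

```python
import numpy as np

coef = np.array([[0.2, -0.5, 0.9, 0.1]])  # toy stand-in for model.coef_, shape (1, 4)

# Negating the weights makes the ascending sort act as a descending one
order = np.argsort(-coef).flatten()

print(order.tolist())                  # [2, 0, 3, 1]
print(coef.flatten()[order].tolist())  # [0.9, 0.2, 0.1, -0.5]
```

The values in `order` are 0-based positions in the array, from the largest weight to the smallest.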

In [20]:
indices = np.argsort(-model.coef_) # - sign to sort by descending
indices = indices.flatten()

In [21]:
# Our vocab_dict has words as keys and indices as values
# which is the opposite of what we now need. So create an
# inverse dictionary

vocab_dict = get_vocab_dict()
index_dict = {v: k for k, v in vocab_dict.items()}

In [22]:
# Print top 15 words
for i in range(15):
    print(index_dict[str(indices[i])])

otherwis
clearli
remot
gt
visa
base
doesn
wife
previous
player
mortgag
natur
ll
futur
hot


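These are very different from the "most spammy" words in ex6.pdf Fig. 12. A different SVM implementation could account for some variation, but a more likely cause is an off-by-one error in the lookup above: `np.argsort()` returns 0-based positions, while the vocabulary indices start at 1 (the same offset that `email_features()` handles with `index - 1`). Looking up `index_dict[str(indices[i] + 1)]` instead should return the word that each weight actually belongs to.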
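
A minimal sketch of the word lookup with the shift from 0-based array positions to 1-based vocabulary indices made explicit. The four-word vocabulary `toy_vocab` and the weights in `toy_coef` are made up for illustration and are not values from the exercise.

```python
import numpy as np

# Hypothetical vocabulary in the same shape get_vocab_dict() returns:
# words as keys, 1-based indices (as strings) as values
toy_vocab = {'aa': '1', 'click': '2', 'our': '3', 'zip': '4'}
toy_index = {v: k for k, v in toy_vocab.items()}

# Toy weights: array position 0 belongs to vocabulary index 1, and so on
toy_coef = np.array([[-0.3, 0.7, 0.9, 0.1]])
order = np.argsort(-toy_coef).flatten()

# Add 1 to turn each 0-based position into a 1-based vocabulary index
top_words = [toy_index[str(position + 1)] for position in order[:2]]
print(top_words)  # ['our', 'click']
```

Without the `+ 1`, the same lookup would return the neighbouring entries `click` and `aa`, which carry different weights.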