# Spam classification of emails

- Preprocessing is implemented step by step in plain Python (using `re` and NLTK) to illustrate each required stage

- Scikit-Learn Linear Support Vector Machine used for classifying spam / non-spam emails
---

In [1]:
import numpy as np
import matplotlib.pyplot as pl
import pandas as pd
from scipy.io import loadmat
import re
import nltk

---
### Preview an example email

In [2]:
with open('machine-learning-ex6/ex6/emailSample1.txt', 'r') as f:
    email_contents = f.read()

print(email_contents)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com




---
## Email Preprocessing Steps

Before our email data is suitable for training a classifier such as a Support Vector Machine, we need to apply several preprocessing steps.

The following will be performed on our data prior to training:
- Remove any HTML formatting present in the email
- Convert all numbers given into the string 'number' (in most cases the actual number is irrelevant to our model)
- Convert all email addresses into the string 'emailaddr' (the address does not help our predictions in this case)
- Convert dollar signs ($) into 'dollar', and pound signs (£) into 'pounds'
- Convert all URLs into 'httpaddr'
- Remove special symbols or non-alphanumeric characters
- Perform word stemming on all words 

Following this, we will form a binary word vector for a specified vocabulary of 1899 words, which will be used to create features for training our model.
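The order of these substitutions matters: email addresses and URLs contain digits and punctuation, so they have to be replaced before numbers and special characters are handled. The following is a minimal sketch of that ordering on a made-up string. The `normalise` helper, the sample text and the simplified URL pattern are assumptions for illustration only, and stemming is left out.

```python
import re

def normalise(text):
    """Apply the token substitutions in a fixed order (illustrative only)."""
    text = re.sub(r'<.*?>', '', text)                      # strip HTML tags
    text = re.sub(r'[\w.-]+@[\w.-]+', 'emailaddr', text)   # emails first: they contain dots and digits
    text = re.sub(r'https?://\S+', 'httpaddr', text)       # simplified URL pattern (assumption)
    text = re.sub(r'[0-9]+', 'number', text)               # any run of digits
    text = text.replace('$', 'dollar').replace('£', 'pounds')
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)             # everything else becomes a space
    return text.strip().lower().split()

sample = '<b>Win $100</b> now: mail bob2@example.com or see http://example.com/x'
tokens = normalise(sample)
print(tokens)
# ['win', 'dollarnumber', 'now', 'mail', 'emailaddr', 'or', 'see', 'httpaddr']
```

Because numbers are replaced before currency symbols, `$100` ends up as the single token `dollarnumber`.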

In [3]:
def remove_html(contents):
    # remove any html present
    parser = re.compile(r'<.*?>')
    return re.sub(parser, '', contents)

def remove_special_chars(contents):
    # remove all but alphanumeric chars and spaces
    return re.sub('[^A-Za-z0-9]+', ' ', contents)
    
def handle_numbers(contents):
    numbers = re.compile('[0-9]+')
    return re.sub(numbers, 'number', contents)

def handle_emails(contents):
    emails = re.compile(r'[\w\.-]+@[\w\.-]+')
    return re.sub(emails, 'emailaddr', contents)

def handle_currency(contents):
    formatted = contents.replace('$', 'dollar')
    formatted = formatted.replace('£', 'pounds')
    return formatted

def handle_urls(contents):
    # raw string so that \w and \d are passed to the regex engine unchanged
    urls = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')
    return re.sub(urls, 'httpaddr', contents)

def stem_words(contents):
    """ Stem words using nltk porter stemmer """
    words = contents.split()
    stemmer = nltk.PorterStemmer()
    words = [str(stemmer.stem(word)) for word in words]
    return words
    
def preprocess_email(email):
    """ Call all our preprocessing steps on the provided email """
    email_contents = remove_html(email)
    email_contents = handle_emails(email_contents)
    email_contents = handle_urls(email_contents)
    email_contents = handle_numbers(email_contents)
    email_contents = handle_currency(email_contents)
    email_contents = remove_special_chars(email_contents)
    email_contents = email_contents.strip().lower()
    stemmed_words = stem_words(email_contents)
    return stemmed_words

#### Validate our preprocessing works on our sample email

In [4]:
word_list = preprocess_email(email_contents)
print("The words we now have after preprocessing are: \n\n{}".format(word_list))

The words we now have after preprocessing are: 

['anyon', 'know', 'how', 'much', 'it', 'cost', 'to', 'host', 'a', 'web', 'portal', 'well', 'it', 'depend', 'on', 'how', 'mani', 'visitor', 'you', 're', 'expect', 'thi', 'can', 'be', 'anywher', 'from', 'less', 'than', 'number', 'buck', 'a', 'month', 'to', 'a', 'coupl', 'of', 'dollarnumb', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhap', 'amazon', 'ecnumb', 'if', 'your', 'run', 'someth', 'big', 'to', 'unsubscrib', 'yourself', 'from', 'thi', 'mail', 'list', 'send', 'an', 'email', 'to', 'emailaddr']


Looks good! Only basic stemmed words remain from our email.

#### Create binary word feature vector

Now we use the provided vocabulary list to keep only words that occur frequently. Rare words appear in too few emails to be learned reliably and can cause the model to overfit a training set of this size.

In practice a vocabulary list would be much larger (10,000 to 50,000 words); for this exercise only 1899 words are used.
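As a small illustration of the idea, the sketch below maps a word list onto a binary vector using a made-up five-word vocabulary. The names `toy_vocab` and `to_binary_vector` are hypothetical, and the sketch assumes vocabulary indices run contiguously from 1 to the vocabulary size.

```python
# Toy 1-based vocabulary (hypothetical; the real list has 1899 entries)
toy_vocab = {'buy': 1, 'email': 2, 'number': 3, 'pill': 4, 'web': 5}

def to_binary_vector(words, vocab):
    """Return (indices, binary_vector) for a list of stemmed words."""
    # out-of-vocabulary words are simply dropped
    indices = [vocab[w] for w in words if w in vocab]
    # a set gives constant-time membership checks
    present = set(indices)
    vector = [1 if i in present else 0 for i in range(1, len(vocab) + 1)]
    return indices, vector

indices, vector = to_binary_vector(['buy', 'number', 'pill', 'pill', 'viagra'], toy_vocab)
print(indices)  # [1, 3, 4, 4]
print(vector)   # [1, 0, 1, 1, 0]
```

Repeated words appear several times in the index list but only set a single entry in the binary vector, and words outside the vocabulary leave no trace at all.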

In [5]:
# read vocabulary list with pandas and convert into a python dict for lookup
vocabulary_file = pd.read_csv('machine-learning-ex6/ex6/vocab.txt', sep="\t", header=None)
vocabulary_file.columns = ["index", "word"]
# 'index' is kept as a regular column because the cells below read vocabulary_file["index"]

# create a lookup dict of words we want and index values
vocab_dict = dict(zip(vocabulary_file["word"], vocabulary_file["index"]))

In [6]:
# map each word to its index from the vocab dict - if word is not in vocab list: ignore
word_vector = []

for word in word_list:
    word_index = vocab_dict.get(word)
    if word_index is not None:
        word_vector.append(word_index)

In [7]:
print(word_vector)

[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1893, 1364, 592, 1676, 238, 162, 89, 688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676, 992, 961, 1477, 71, 530, 1699, 531]


In [8]:
# also create a binary feature vector: 1 if word is in email, 0 if not
feature_vector = []

for index in vocabulary_file["index"]:
    if index in word_vector:
        feature_vector.append(1)
    else:
        feature_vector.append(0)

In [9]:
print("Length of the feature vector (should be 1899): {}".format(len(feature_vector)))

print("Number of non-zero entries in example email (should be 45): {}".format(
    len([x for x in feature_vector if x == 1])))

Length of the feature vector (should be 1899): 1899
Number of non-zero entries in example email (should be 45): 45


Now that we know the code creates our word vectors correctly, we can wrap these steps in a function so they are easy to repeat on other emails.

In [10]:
def vectorise_words(word_list):
    word_vector = []
    
    # map each word to index from vocab dict - if word not in vocab dict: ignore
    for word in word_list:
        word_index = vocab_dict.get(word)
        if word_index is not None:
            word_vector.append(word_index)
        
    feature_vector = []
    
    # create a binary feature vector: 1 if word is in email, 0 if not
    for index in vocabulary_file["index"]:
        if index in word_vector:
            feature_vector.append(1)
        else:
            feature_vector.append(0)
    
    return word_vector, feature_vector

Now that our feature vectors can be obtained using our defined preprocessing steps, we're ready to train a linear SVM classifier to predict whether an email is spam (output 1) or non-spam (output 0).

---
## SVM Classifier

#### With our preprocessing steps complete, we can now train an SVM with linear kernel on our preprocessed data

We'll use 4000 preprocessed emails for training and 1000 preprocessed emails for testing; both sets contain spam and non-spam examples.

Each email has been processed using the email preprocessing steps defined above, and structured as a feature vector:

$$ x^{(i)} \in \mathbb{R}^{1899} $$
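Written out per component, and assuming the same binary encoding as the feature vectors built earlier, entry $j$ of the vector for email $i$ can be read as:

$$ x_j^{(i)} = \begin{cases} 1 & \text{if vocabulary word } j \text{ occurs in email } i \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, \dots, 1899 $$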

In [11]:
train_datafile = loadmat('machine-learning-ex6/ex6/spamTrain.mat', struct_as_record=False)
X = train_datafile['X']
y = train_datafile['y'].flatten()

test_datafile = loadmat('machine-learning-ex6/ex6/spamTest.mat', struct_as_record=False)
X_test = test_datafile['Xtest']
y_test = test_datafile['ytest'].flatten()

print("The shape of X is: {}".format(X.shape))
print("The shape of X test is: {}".format(X_test.shape))
print("The shape of y is: {}".format(y.shape))
print("The shape of y test is: {}".format(y_test.shape))

The shape of X is: (4000, 1899)
The shape of X test is: (1000, 1899)
The shape of y is: (4000,)
The shape of y test is: (1000,)


In [12]:
from sklearn.svm import SVC

clf = SVC(C=0.1, kernel='linear').fit(X, y)

training_score = clf.score(X, y)
test_score = clf.score(X_test, y_test)
print("The Linear SVM accuracies: Training set: {0}, Test set: {1}".format(training_score, test_score))

The Linear SVM accuracies: Training set: 0.99825, Test set: 0.989
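The `score` method of a scikit-learn classifier reports mean accuracy, that is, the fraction of examples whose predicted label equals the true label. A test score of 0.989 on 1000 emails therefore corresponds to 989 correct predictions. The sketch below shows the same calculation in plain Python; the `accuracy` helper and the label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: 1 = spam, 0 = non-spam
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75 (6 of 8 labels match)
```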


### Inspection of the top predictors for an email being spam

We can analyse the largest coefficients in our trained classifier to see what words have the most impact on an email being classified as spam.
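One detail to watch when mapping coefficients back to words: positions in a weight array are 0-based, while the vocabulary indices used in this notebook start at 1. The sketch below illustrates the lookup with made-up weights and a five-word vocabulary; `weights`, `toy_words` and `top_predictors` are hypothetical names, not part of the notebook.

```python
# Hypothetical weights for a 5-word vocabulary; position 0 belongs to word 1
weights = [0.10, -0.30, 0.50, 0.05, 0.40]
toy_words = {1: 'base', 2: 'list', 3: 'click', 4: 'web', 5: 'dollar'}

def top_predictors(weights, index_to_word, k):
    """Return the k words with the largest weights, strongest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    # order holds 0-based positions; vocabulary indices are 1-based, so add 1
    return [index_to_word[j + 1] for j in order[:k]]

print(top_predictors(weights, toy_words, 3))  # ['click', 'dollar', 'base']
```

Without the `+ 1`, every lookup would return the vocabulary entry just before the intended word.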

In [13]:
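# get the column positions of the 20 largest classifier coefficients
# (argsort is ascending, so the strongest predictor comes last)
predictors = np.argsort(clf.coef_)
top_predicts = predictors[0][-20:].tolist()

# create a dict for word index look-up for convenience
word_index_dict = dict(zip(vocabulary_file["index"], vocabulary_file["word"]))

# print the corresponding words that best predict an email as being spam
# coefficient columns are 0-based but vocabulary indices start at 1, hence the + 1
for word_index in top_predicts:
    print(word_index_dict.get(word_index + 1))

dollarac
wall
script
cv
air
hot
futur
ll
natur
mortgag
player
previous
wife
doesn
base
visa
gt
remot
clearli
otherwis

Note that the listing above comes from a run that looked words up by the 0-based column position, so each word shown is the vocabulary entry directly before the true predictor. For example, `otherwis`, `clearli` and `remot` sit immediately before `our`, `click` and `remov` in the alphabetically sorted vocabulary. Re-running the cell with the `+ 1` offset lists the actual top predictors.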


### Predictions on our own email data

Using our defined preprocessing functions above, and the vocabulary dictionary, we'll now read in some example spam and non-spam emails and make predictions using the SVM linear classifier.

#### Prediction 1 - non-spam email example

In [14]:
with open('machine-learning-ex6/ex6/emailSample1.txt', 'r') as f:
    non_spam_example = f.read()
nonspam_words = preprocess_email(non_spam_example)
nonspam_word_vector, nonspam_feature_vector = vectorise_words(nonspam_words)

print(non_spam_example)
print(nonspam_words)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com


['anyon', 'know', 'how', 'much', 'it', 'cost', 'to', 'host', 'a', 'web', 'portal', 'well', 'it', 'depend', 'on', 'how', 'mani', 'visitor', 'you', 're', 'expect', 'thi', 'can', 'be', 'anywher', 'from', 'less', 'than', 'number', 'buck', 'a', 'month', 'to', 'a', 'coupl', 'of', 'dollarnumb', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhap', 'amazon', 'ecnumb', 'if', 'your', 'run', 'someth', 'big', 'to', 'unsubscrib', 'yourself', 'from', 'thi', 'mail', 'list', 'send', 'an', 'email', 'to', 'emailaddr']


In [15]:
print("Length of the non-spam example feature vector: {}".format(len(nonspam_feature_vector)))

print("Number of non-zero entries: {}".format(
    len([x for x in nonspam_feature_vector if x == 1])))

Length of the non-spam example feature vector: 1899
Number of non-zero entries: 45


In [16]:
nonspam_feature_vector = np.array(nonspam_feature_vector)
X_pred = nonspam_feature_vector.reshape(1, 1899)
prediction = clf.predict(X_pred)

In [17]:
print(prediction)

[0]


Output label was predicted as '0', which corresponds to non-spam. This is correct!

#### Prediction 2 - non-spam email example

In [18]:
with open('machine-learning-ex6/ex6/emailSample2.txt', 'r') as f:
    non_spam_example = f.read()
nonspam_words = preprocess_email(non_spam_example)
nonspam_word_vector, nonspam_feature_vector = vectorise_words(nonspam_words)

print(non_spam_example)
print(nonspam_words)

print("\n\nLength of the non-spam example feature vector: {}".format(len(nonspam_feature_vector)))

print("Number of non-zero entries: {}".format(
    len([x for x in nonspam_feature_vector if x == 1])))

Folks,
 
my first time posting - have a bit of Unix experience, but am new to Linux.

 
Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
fine except it didn't pick up my monitor.
 
I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
Ti4200 video card, both of which are probably too new to feature in Suse's default
set. I downloaded a driver from the nVidia website and installed it using RPM.
Then I ran Sax2 (as was recommended in some postings I found on the net), but
it still doesn't feature my video card in the available list. What next?
 
Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
the whole machine crashes (in Linux, not Windows) - even the on/off switch is
inactive, leaving me to reach for the power cable instead.
 
If anyone can help me in any way with these probs., I'd be really grateful -
I've searched the 'ne

In [19]:
nonspam_feature_vector = np.array(nonspam_feature_vector)
X_pred = nonspam_feature_vector.reshape(1, 1899)
prediction = clf.predict(X_pred)
print(prediction)

[0]


The output label is '0' again, which corresponds to non-spam. Correct!

#### Prediction 3 - spam email example

In [20]:
with open('machine-learning-ex6/ex6/spamSample1.txt', 'r') as f:
    spam_example = f.read()
spam_words = preprocess_email(spam_example)
spam_word_vector, spam_feature_vector = vectorise_words(spam_words)

print(spam_example)
print(spam_words)

print("\n\nLength of the spam example feature vector: {}".format(len(spam_feature_vector)))

print("Number of non-zero entries: {}".format(
    len([x for x in spam_feature_vector if x == 1])))

Do You Want To Make $1000 Or More Per Week?

 

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

 

Call our 24 hour pre-recorded number to get the 
details.  

 

000-456-789

 

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

 

000-456-789

 

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

 

000-456-789



3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72


['do', 'you', 'want', 'to', 'make', 'dollarnumb', 'or', 'more', 'per', 'week', 'if', 'you', 'are', 'a', 'motiv', 'and', 'qualifi', 'individu', 'i', 'will', 'person', 'demonstr', 'to', 'you', 'a', 'system', 'that', 'will', 'make', 'you', 'dollarnumb', 'number', 'per', 'week', 'or', 'more', 'thi', 'is', 'not', 'mlm', 'call', 'our', 'number', 'h

In [21]:
spam_feature_vector = np.array(spam_feature_vector)
X_pred = spam_feature_vector.reshape(1, 1899)
prediction = clf.predict(X_pred)
print(prediction)

[1]


Output label is '1', which is spam. Our model was correct!

#### Prediction 4 - spam email example

In [22]:
with open('machine-learning-ex6/ex6/spamSample2.txt', 'r') as f:
    spam_example = f.read()
spam_words = preprocess_email(spam_example)
spam_word_vector, spam_feature_vector = vectorise_words(spam_words)

print(spam_example)
print(spam_words)

print("\n\nLength of the spam example feature vector: {}".format(len(spam_feature_vector)))

print("Number of non-zero entries: {}".format(
    len([x for x in spam_feature_vector if x == 1])))

Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru



['best', 'buy', 'viagra', 'gener', 'onlin', 'viagra', 'numbermg', 'x', 'number', 'pill', 'dollarnumb', 'free', 'pill', 'reorder', 'discount', 'top', 'sell', 'number', 'qualiti', 'satisfact', 'guarante', 'we', 'accept', 'visa', 'master', 'e', 'check', 'payment', 'number', 'satisfi', 'custom', 'httpaddr']


Length of the spam example feature vector: 1899
Number of non-zero entries: 19


In [23]:
spam_feature_vector = np.array(spam_feature_vector)
X_pred = spam_feature_vector.reshape(1, 1899)
prediction = clf.predict(X_pred)
print(prediction)

[1]


Again, our model was successful in predicting '1', which corresponds to spam!