# Spam Classification

To classify an email as spam or non-spam, we first have to convert it into a feature vector.

## 1. Preprocessing Emails
Browsing sample emails gives us a good sense of what spam and non-spam emails look like.
Numbers, links and email addresses usually differ in almost every email. Therefore, "normalizing" these values is a good idea, so that all of them are treated in the same way. For example, we can replace each URL in the email with the string "httpaddr" to indicate that a URL was present. This lets the spam classifier base its decision on whether any URL was present, rather than on whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize URLs, so the odds of seeing any particular URL again in a new piece of spam are very small.
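
As a minimal sketch of the idea, the snippet below normalizes two made-up spam snippets that differ only in their URL. The strings and the helper name `normalize_urls` are assumptions for illustration, not part of the dataset.

```python
import re

# Two made-up spam snippets that differ only in their (randomized) URL.
spam_a = "Claim your prize at http://x7f3.example.com/win now"
spam_b = "Claim your prize at http://q9z1.example.net/win now"

def normalize_urls(text):
    """Replace every http/https URL with the placeholder token 'httpaddr'."""
    return re.sub(r'https?://\S*', 'httpaddr', text)

norm_a = normalize_urls(spam_a)
norm_b = normalize_urls(spam_b)
print(norm_a)
print(norm_a == norm_b)
```

After normalization both snippets yield the same text, so a classifier sees the same features for either URL.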

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
from sklearn import svm
import re
import nltk, nltk.stem.porter

In [2]:
print("Sample email:")
!cat data/emailSample1.txt

Sample email:
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com



In [3]:
def preProcess(email):
    """Lower-case the email and replace variable tokens with fixed placeholders."""
    email = email.lower()  # make the email lower-case
    email = re.sub(r'<[^<>]+>', ' ', email)  # remove all HTML tags
    email = re.sub(r'[0-9]+', 'number', email)  # replace every number with "number"
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # replace every link with "httpaddr"
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)  # replace every email address with "emailaddr"
    email = re.sub(r'[$]+', 'dollar', email)  # replace every run of "$" with "dollar"
    return email

In [4]:
def email2wordlist(raw_email):
    email = preProcess(raw_email)
    # split on whitespace (including newlines) and punctuation
    lists = re.split(r'[\s@$/#.\-:&*+=\[\]?!(){},\'">_<;%]', email)
    stemmer = nltk.stem.porter.PorterStemmer()
    wordList = []
    for word in lists:
        word = re.sub(r'[^a-zA-Z0-9]', '', word)  # drop any remaining non-alphanumeric characters
        if not word:  # skip empty tokens before stemming
            continue
        wordList.append(stemmer.stem(word))
    return wordList

## 2. Vocabulary List

In [5]:
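def vocabDict():
    """Read data/vocab.txt into a dictionary that maps each word to its index."""
    vocab_dict = {}
    with open('data/vocab.txt') as f:
        for line in f:
            (idx, word) = line.split()  # each line holds an index followed by a word
            vocab_dict[word] = int(idx)
    return vocab_dict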

In [6]:
def wordlist2Indexlist(wordList, vocab_dict):
    # keep only the words found in the vocabulary and map them to their indices
    return [vocab_dict[token] for token in wordList if token in vocab_dict]

## 3. Feature Vector Extraction

In [7]:
def indexlist2FeatureVector(raw_email, vocab_dict):
    feature_vector = np.zeros((len(vocab_dict), 1))
    wordList = email2wordlist(raw_email)
    indexlist = wordlist2Indexlist(wordList, vocab_dict)
    for idx in indexlist:
        # vocab.txt is assumed to number its words from 1, so shift to a 0-based position
        feature_vector[idx - 1] = 1
    return feature_vector, wordList

In [8]:
vocab_dict = vocabDict()
with open('data/emailSample1.txt', 'r') as f:
    raw_email = f.read()
feature_vector, wordList = indexlist2FeatureVector(raw_email, vocab_dict)

print("The length of the feature vector is %d." % len(feature_vector))
print("The number of non-zero elements is %d." % np.count_nonzero(feature_vector))

The length of the feature vector is 1899.
The number of non-zero elements is 45.
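
To make the mapping from words to a binary feature vector concrete, here is a minimal sketch on a made-up five-word vocabulary. The names `toy_vocab`, `toy_tokens` and `tokens_to_feature_vector` are hypothetical, and the sketch assumes that vocabulary indices start at 1.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> 1-based index (assumed layout).
toy_vocab = {'buck': 1, 'cost': 2, 'host': 3, 'web': 4, 'zip': 5}

def tokens_to_feature_vector(tokens, vocab):
    """Return a binary column vector with a 1 for every vocabulary word present."""
    x = np.zeros((len(vocab), 1))
    for token in tokens:
        if token in vocab:
            x[vocab[token] - 1] = 1  # shift the 1-based index to a 0-based position
    return x

toy_tokens = ['cost', 'to', 'host', 'a', 'web', 'web', 'portal']
toy_x = tokens_to_feature_vector(toy_tokens, toy_vocab)
print(toy_x.ravel())
```

Words outside the vocabulary are ignored, and a repeated word still sets its entry to 1 only once, since the vector records presence rather than counts.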


In [9]:
wordList

['anyon',
 'know',
 'how',
 'much',
 'it',
 'cost',
 'to',
 'host',
 'a',
 'web',
 'portal',
 'well',
 'it',
 'depend',
 'on',
 'how',
 'mani',
 'visitor',
 'you',
 're',
 'expect',
 'thi',
 'can',
 'be',
 'anywher',
 'from',
 'less',
 'than',
 'number',
 'buck',
 'a',
 'month',
 'to',
 'a',
 'coupl',
 'of',
 'dollarnumb',
 'you',
 'should',
 'checkout',
 'httpaddr',
 'or',
 'perhap',
 'amazon',
 'ecnumb',
 'if',
 'your',
 'run',
 'someth',
 'big',
 'to',
 'unsubscrib',
 'yourself',
 'from',
 'thi',
 'mail',
 'list',
 'send',
 'an',
 'email',
 'to',
 'emailaddr']

## 4. Training SVM for Spam Classification

Next, we will load a preprocessed training dataset to train an SVM classifier.

In [10]:
datafile = 'data/spamTrain.mat'
data = scipy.io.loadmat(datafile)

In [11]:
data.keys()

dict_keys(['__header__', 'y', '__version__', '__globals__', 'X'])

In [12]:
X = data['X']
y = data['y']

In [13]:
datafile = 'data/spamTest.mat'
data = scipy.io.loadmat(datafile)

In [14]:
data.keys()

dict_keys(['__header__', 'ytest', '__version__', '__globals__', 'Xtest'])

In [15]:
Xtest = data['Xtest']
ytest = data['ytest']

In [16]:
linear_svm = svm.SVC(C=0.1, kernel='linear')
linear_svm.fit(X, y.flatten())

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
train_predict = linear_svm.predict(X).reshape((y.shape[0], 1))
train_accuracy = np.mean(train_predict == y)  # fraction of correct predictions, as a scalar
print('Training set accuracy = %f' % train_accuracy)

Training set accuracy = 0.998250


In [18]:
test_predict = linear_svm.predict(Xtest).reshape((ytest.shape[0], 1))
test_accuracy = np.mean(test_predict == ytest)  # fraction of correct predictions, as a scalar
print('Test set accuracy = %f' % test_accuracy)

Test set accuracy = 0.989000
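
The reshape in the accuracy cells matters because the labels loaded from the .mat files are column vectors, while predict() returns a flat array. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical) to show what goes wrong without it and one way to compute the accuracy safely.

```python
import numpy as np

# Made-up labels: y_true is a column vector (as loaded from a .mat file),
# y_pred is a flat array (as returned by a classifier's predict()).
y_true = np.array([[1], [0], [1], [1]])
y_pred = np.array([1, 0, 0, 1])

# Comparing shapes (4,) and (4, 1) broadcasts to a (4, 4) matrix, which is not what we want.
broadcast_shape = (y_pred == y_true).shape

# Flattening both sides gives an element-wise comparison; its mean is the accuracy.
accuracy = np.mean(y_pred.ravel() == y_true.ravel())
print(broadcast_shape, accuracy)
```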


## 5. Top Predictors of Spam

In [19]:
def vocabList():
    vocab_list = {}
    with open('data/vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_list[int(val)] = key
    return vocab_list

In [20]:
vocab_list = vocabList()
sorted_index = np.argsort(-linear_svm.coef_)

In [21]:
print("The 15 most important words for spam:")
print([vocab_list[x] for x in sorted_index[0][0:15]])

The 15 most important words for spam:
['otherwis', 'clearli', 'remot', 'gt', 'visa', 'base', 'doesn', 'wife', 'previous', 'player', 'mortgag', 'natur', 'll', 'futur', 'hot']


In [22]:
print("The 15 words most indicative of non-spam:")
print([vocab_list[x] for x in sorted_index[0][-15:]])

The 15 words most indicative of non-spam:
['http', 'toll', 'xp', 'ratio', 'august', 'unsubscrib', 'useless', 'numberth', 'round', 'linux', 'datapow', 'wrong', 'urgent', 'that', 'spam']
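
One detail is worth checking when mapping coefficients back to words: np.argsort returns 0-based column positions, whereas a vocabulary file may number its words from 1. The sketch below shows the lookup under that assumption. `toy_vocab_list` and `toy_coef` are made-up values, not taken from the trained model.

```python
import numpy as np

# Hypothetical 1-based vocabulary (index -> word) and made-up linear weights.
toy_vocab_list = {1: 'click', 2: 'linux', 3: 'price', 4: 'wrote'}
toy_coef = np.array([[0.9, -0.7, 0.4, -0.2]])  # same (1, n_features) shape as coef_

order = np.argsort(-toy_coef[0])  # 0-based columns, largest weight first
top_words = [toy_vocab_list[i + 1] for i in order[:2]]      # shift to 1-based keys
bottom_words = [toy_vocab_list[i + 1] for i in order[-2:]]  # most negative weights
print(top_words, bottom_words)
```

If the vocabulary were stored with 0-based keys instead, the `+ 1` shift would have to be dropped.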


Here, we used a prebuilt vocabulary dictionary. In practice, we have to build the dictionary ourselves, for example with the "collections" library (Counter(), most_common()), to select the 1000 or 5000 most frequent words.
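
A minimal sketch of building such a vocabulary with Counter follows. The tokenized emails and the helper name `build_vocab` are hypothetical, and numbering the words from 1 is an assumption made to mirror an index-then-word vocabulary file.

```python
from collections import Counter

# Hypothetical tokenized emails (already lower-cased and stemmed).
tokenized_emails = [
    ['click', 'here', 'to', 'win', 'dollar'],
    ['click', 'to', 'unsubscrib'],
    ['meet', 'here', 'to', 'click'],
]

def build_vocab(token_lists, size):
    """Map the `size` most frequent tokens to consecutive 1-based indices."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(size), start=1)}

small_vocab = build_vocab(tokenized_emails, size=3)
print(small_vocab)
```

On a real corpus, `size` would be set to something like 1000 or 5000, and very rare words would fall outside the vocabulary.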
Then, we split the data into a training set (60%), a cross-validation (CV) set (20%) and a test set (20%). For each candidate value of C, a model is trained on the training set and evaluated on the CV set. The C value with the best performance on the CV set is selected, and the resulting model is evaluated once on the test set.
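
The selection procedure can be sketched as follows. The data here is synthetic and stands in for a real email feature matrix, and the candidate C values are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic binary features and labels (an assumption, not the spam dataset).
rng = np.random.RandomState(0)
X_all = rng.randint(0, 2, size=(200, 20)).astype(float)
y_all = (X_all[:, 0] + X_all[:, 1] > X_all[:, 2]).astype(int)

# 60% training, 20% cross-validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X_all, y_all, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidate_C = [0.01, 0.03, 0.1, 0.3, 1, 3, 10]  # hypothetical search grid
best_C, best_cv_acc, best_model = None, -1.0, None
for C in candidate_C:
    model = svm.SVC(C=C, kernel='linear').fit(X_train, y_train)  # train on the training set
    cv_acc = model.score(X_cv, y_cv)                             # evaluate on the CV set
    if cv_acc > best_cv_acc:
        best_C, best_cv_acc, best_model = C, cv_acc, model

# The test set is touched only once, after C has been chosen.
test_acc = best_model.score(X_test, y_test)
print('best C = %g, CV accuracy = %.3f, test accuracy = %.3f' % (best_C, best_cv_acc, test_acc))
```

Keeping the test set out of the selection loop is what makes the final accuracy an honest estimate of performance on unseen emails.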