<h2>Email Spam Classification</h2>

Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy. In this part of the exercise, you will use SVMs to build my own spam filter. I will be training a classifier to classify whether a given email, x, is spam (y = 1) or non-spam (y = 0). In particular, I'll convert each email into a feature vector x ∈ R^n.

The dataset included is based on a a subset of the SpamAssassin Public Corpus. Here, I will only be using the body of the email (excluding the email headers).

In [69]:
# importing necessary libraries

import os
import pandas as pd
import numpy as np
import re
#!pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
#nltk.download('stopwords')     # download resource if not present
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

print('libraries imported!')

libraries imported!


<h3>Preprocessing Emails</h3>

In [2]:
# helper functions

def normalize_emailaddress(text):
    """Substitue email address to 'emailaddr' in a string"""
    clean1 = re.compile('<[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z][.-0-9a-zA-Z]*[a-zA-Z]+>')
    clean2 = re.compile('[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z][.-0-9a-zA-Z]*[a-zA-Z]+')
    return re.sub(clean2, 'emailaddr', re.sub(clean1, 'emailaddr', text))

def normalize_urls(text):
    """Substitue urls to 'httpaddr' in a string"""
    clean1 = re.compile('<http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+>')
    clean2 = re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return re.sub(clean2, 'httpaddr', re.sub(clean1, 'httpaddr', text))

def remove_html_tags(text):
    """Remove html tags from a string"""
    #clean = re.compile('<.*?>')
    clean = re.compile('<(.|\n)*?>')
    return re.sub(clean, '', text)

def is_not_url(text):
    clean = re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return re.sub(clean, '', text).find(':') != -1

In [37]:
# parsing raw email messages

def parse_raw_email(lines):
    """Preprocess raw email message"""
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    
    raw_message = '*_*'.join(lines)
    # converting email body into lower case
    raw_message = raw_message.lower()
    # substitute urls to 'httpaddr' and email address to 'emailaddr'
    raw_message = normalize_urls(raw_message)
    raw_message = normalize_emailaddress(raw_message)
    # stripping html tags
    raw_message = remove_html_tags(raw_message)
    
    # removing meta information
    for line in raw_message.split('*_*'):
        if ':' in line and is_not_url(line):
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
        #elif (not line.startswith('\t')) and (not line.startswith('    ')):
        else:
            message += line.strip('\n') + ' '
    
    #remove non-words
    stemmer = SnowballStemmer('english')
    stop_words = stopwords.words('english')
    words_only = re.compile('[^a-zA-Z]')
    message = ' '.join([(words_only.sub('', word)) for word in message.split(' ') if word not in stop_words])
    
    
    email['body'] = re.sub(re.compile('\s{2,}'), ' ', message)
                
    return email


def parse_into_dataframe(data, labels):
    emails_dict = {
        'body': [email['body'] for email in data], 
    }
    
    emails_dict['is_spam'] = labels
    
    return pd.DataFrame(emails_dict)
    

In [5]:
# testing parse_raw_emails method

l = os.listdir(r'C:\Users\kevin\OneDrive\Desktop\Data Science\Email Spam Classification\data\spam_data')
base = r'C:\Users\kevin\OneDrive\Desktop\Data Science\Email Spam Classification\data\spam_data'
lines = open(base + '\\' + l[10], 'r').readlines()
e = parse_raw_email(lines)
e['body']

'by hqpronsnet esmtp id gnltshy versiontlsvsslv cipheredhdssdescbcsha bits verifyno envelopefrom emailaddr by hqpronsnet submit id gnltsb by hqpronsnet esmtp id gnltlhy versiontlsvsslv cipheredhdssdescbcsha bits verifyno envelopefrom emailaddr by locustmindernet esmtp id gnltjj envelopefrom emailaddr by wastemindernet id gnltj by wastemindernet esmtp id gnltfr cpurf bzcda bace aabaabhaceadbdc nbsp b saqbvancvaecanadcco nbsp adabeb cfccb x abha ceadaqbv x d auaddaebaqaeaavboace nbspnbspnbspnbspnbspnbspnbsp x x d akaaaeaqaea x x dakabbaeaqaea bdfaauaac afaaecfboacbaedadeancvaafcbaa aynbsp nbspnbsp x d d bcaaddaebayaicbbzacdauab ancvaqbuabaaabeaqaadaddaeba caebbccaccbcbaacbab atabbahcubfbmaedadbzbfefbedc bd dbfabauacaaedaeeaaadaadccbnacbbzbhaa fafbdbdaabeaaecaceadbddaedbbbz ainbspnbspnbspnbsp nbspnbspnbsp amawag nbspaknbs pnbspnbspnbspnbspnbspnbspnbspnbsp a knbspnbsp acdag nbspnbspnbsp nbspemailag bqbdcag aaaag abhaceadcbibccagn bsp abaeab biabhacead abaeabbiabha cead abaeabbiabha cead ab

In [7]:
# reading email dataset


base_path = r'C:\Users\kevin\OneDrive\Desktop\Data Science\Email Spam Classification\data'
folders = {'spam_data': 1, 'non_spam_data': 0}    # spam and non-spam (or ham)
data = []
labels = []

for f in folders.keys():
    path = os.path.join(base_path, f)
    corpus = os.listdir(path)
    
    count = 0
    
    for email_file in corpus:
        email = {}
        with open(path + '\\' + email_file, 'r') as file:
            try:
                lines = file.readlines()
                email = parse_raw_email(lines)
                data.append(email)
                labels.append(folders[f])
        
                count += 1
                print('email #{} processed...'.format(count))
            except:
                continue
    
    print('{} done!'.format(f))    

email #1 processed...
email #2 processed...
email #3 processed...
email #4 processed...
email #5 processed...
email #6 processed...
email #7 processed...
email #8 processed...
email #9 processed...
email #10 processed...
email #11 processed...
email #12 processed...
email #13 processed...
email #14 processed...
email #15 processed...
email #16 processed...
email #17 processed...
email #18 processed...
email #19 processed...
email #20 processed...
email #21 processed...
email #22 processed...
email #23 processed...
email #24 processed...
email #25 processed...
email #26 processed...
email #27 processed...
email #28 processed...
email #29 processed...
email #30 processed...
email #31 processed...
email #32 processed...
email #33 processed...
email #34 processed...
email #35 processed...
email #36 processed...
email #37 processed...
email #38 processed...
email #39 processed...
email #40 processed...
email #41 processed...
email #42 processed...
email #43 processed...
email #44 processed.

email #359 processed...
email #360 processed...
email #361 processed...
email #362 processed...
email #363 processed...
email #364 processed...
email #365 processed...
email #366 processed...
email #367 processed...
email #368 processed...
email #369 processed...
email #370 processed...
email #371 processed...
email #372 processed...
email #373 processed...
email #374 processed...
email #375 processed...
email #376 processed...
email #377 processed...
email #378 processed...
email #379 processed...
email #380 processed...
email #381 processed...
email #382 processed...
email #383 processed...
email #384 processed...
email #385 processed...
email #386 processed...
email #387 processed...
email #388 processed...
email #389 processed...
email #390 processed...
email #391 processed...
email #392 processed...
email #393 processed...
email #394 processed...
email #395 processed...
email #396 processed...
email #397 processed...
email #398 processed...
email #399 processed...
email #400 proce

email #703 processed...
email #704 processed...
email #705 processed...
email #706 processed...
email #707 processed...
email #708 processed...
email #709 processed...
email #710 processed...
email #711 processed...
email #712 processed...
email #713 processed...
email #714 processed...
email #715 processed...
email #716 processed...
email #717 processed...
email #718 processed...
email #719 processed...
email #720 processed...
email #721 processed...
email #722 processed...
email #723 processed...
email #724 processed...
email #725 processed...
email #726 processed...
email #727 processed...
email #728 processed...
email #729 processed...
email #730 processed...
email #731 processed...
email #732 processed...
email #733 processed...
email #734 processed...
email #735 processed...
email #736 processed...
email #737 processed...
email #738 processed...
email #739 processed...
email #740 processed...
email #741 processed...
email #742 processed...
email #743 processed...
email #744 proce

email #1043 processed...
email #1044 processed...
email #1045 processed...
email #1046 processed...
email #1047 processed...
email #1048 processed...
email #1049 processed...
email #1050 processed...
email #1051 processed...
email #1052 processed...
email #1053 processed...
email #1054 processed...
email #1055 processed...
email #1056 processed...
email #1057 processed...
email #1058 processed...
email #1059 processed...
email #1060 processed...
email #1061 processed...
email #1062 processed...
email #1063 processed...
email #1064 processed...
email #1065 processed...
email #1066 processed...
email #1067 processed...
email #1068 processed...
email #1069 processed...
email #1070 processed...
email #1071 processed...
email #1072 processed...
email #1073 processed...
email #1074 processed...
email #1075 processed...
email #1076 processed...
email #1077 processed...
email #1078 processed...
email #1079 processed...
email #1080 processed...
email #1081 processed...
email #1082 processed...


email #1373 processed...
email #1374 processed...
email #1375 processed...
email #1376 processed...
email #1377 processed...
email #1378 processed...
email #1379 processed...
email #1380 processed...
email #1381 processed...
email #1382 processed...
email #1383 processed...
email #1384 processed...
email #1385 processed...
email #1386 processed...
email #1387 processed...
email #1388 processed...
email #1389 processed...
email #1390 processed...
email #1391 processed...
email #1392 processed...
email #1393 processed...
email #1394 processed...
email #1395 processed...
email #1396 processed...
email #1397 processed...
email #1398 processed...
email #1399 processed...
email #1400 processed...
email #1401 processed...
email #1402 processed...
email #1403 processed...
email #1404 processed...
email #1405 processed...
email #1406 processed...
email #1407 processed...
email #1408 processed...
email #1409 processed...
email #1410 processed...
email #1411 processed...
email #1412 processed...


email #1706 processed...
email #1707 processed...
email #1708 processed...
email #1709 processed...
email #1710 processed...
email #1711 processed...
email #1712 processed...
email #1713 processed...
email #1714 processed...
email #1715 processed...
email #1716 processed...
email #1717 processed...
email #1718 processed...
email #1719 processed...
email #1720 processed...
email #1721 processed...
email #1722 processed...
email #1723 processed...
email #1724 processed...
email #1725 processed...
email #1726 processed...
email #1727 processed...
email #1728 processed...
email #1729 processed...
email #1730 processed...
email #1731 processed...
email #1732 processed...
email #1733 processed...
email #1734 processed...
email #1735 processed...
email #1736 processed...
email #1737 processed...
email #1738 processed...
email #1739 processed...
email #1740 processed...
email #1741 processed...
email #1742 processed...
email #1743 processed...
email #1744 processed...
email #1745 processed...


email #185 processed...
email #186 processed...
email #187 processed...
email #188 processed...
email #189 processed...
email #190 processed...
email #191 processed...
email #192 processed...
email #193 processed...
email #194 processed...
email #195 processed...
email #196 processed...
email #197 processed...
email #198 processed...
email #199 processed...
email #200 processed...
email #201 processed...
email #202 processed...
email #203 processed...
email #204 processed...
email #205 processed...
email #206 processed...
email #207 processed...
email #208 processed...
email #209 processed...
email #210 processed...
email #211 processed...
email #212 processed...
email #213 processed...
email #214 processed...
email #215 processed...
email #216 processed...
email #217 processed...
email #218 processed...
email #219 processed...
email #220 processed...
email #221 processed...
email #222 processed...
email #223 processed...
email #224 processed...
email #225 processed...
email #226 proce

email #538 processed...
email #539 processed...
email #540 processed...
email #541 processed...
email #542 processed...
email #543 processed...
email #544 processed...
email #545 processed...
email #546 processed...
email #547 processed...
email #548 processed...
email #549 processed...
email #550 processed...
email #551 processed...
email #552 processed...
email #553 processed...
email #554 processed...
email #555 processed...
email #556 processed...
email #557 processed...
email #558 processed...
email #559 processed...
email #560 processed...
email #561 processed...
email #562 processed...
email #563 processed...
email #564 processed...
email #565 processed...
email #566 processed...
email #567 processed...
email #568 processed...
email #569 processed...
email #570 processed...
email #571 processed...
email #572 processed...
email #573 processed...
email #574 processed...
email #575 processed...
email #576 processed...
email #577 processed...
email #578 processed...
email #579 proce

email #887 processed...
email #888 processed...
email #889 processed...
email #890 processed...
email #891 processed...
email #892 processed...
email #893 processed...
email #894 processed...
email #895 processed...
email #896 processed...
email #897 processed...
email #898 processed...
email #899 processed...
email #900 processed...
email #901 processed...
email #902 processed...
email #903 processed...
email #904 processed...
email #905 processed...
email #906 processed...
email #907 processed...
email #908 processed...
email #909 processed...
email #910 processed...
email #911 processed...
email #912 processed...
email #913 processed...
email #914 processed...
email #915 processed...
email #916 processed...
email #917 processed...
email #918 processed...
email #919 processed...
email #920 processed...
email #921 processed...
email #922 processed...
email #923 processed...
email #924 processed...
email #925 processed...
email #926 processed...
email #927 processed...
email #928 proce

email #1222 processed...
email #1223 processed...
email #1224 processed...
email #1225 processed...
email #1226 processed...
email #1227 processed...
email #1228 processed...
email #1229 processed...
email #1230 processed...
email #1231 processed...
email #1232 processed...
email #1233 processed...
email #1234 processed...
email #1235 processed...
email #1236 processed...
email #1237 processed...
email #1238 processed...
email #1239 processed...
email #1240 processed...
email #1241 processed...
email #1242 processed...
email #1243 processed...
email #1244 processed...
email #1245 processed...
email #1246 processed...
email #1247 processed...
email #1248 processed...
email #1249 processed...
email #1250 processed...
email #1251 processed...
email #1252 processed...
email #1253 processed...
email #1254 processed...
email #1255 processed...
email #1256 processed...
email #1257 processed...
email #1258 processed...
email #1259 processed...
email #1260 processed...
email #1261 processed...


email #1554 processed...
email #1555 processed...
email #1556 processed...
email #1557 processed...
email #1558 processed...
email #1559 processed...
email #1560 processed...
email #1561 processed...
email #1562 processed...
email #1563 processed...
email #1564 processed...
email #1565 processed...
email #1566 processed...
email #1567 processed...
email #1568 processed...
email #1569 processed...
email #1570 processed...
email #1571 processed...
email #1572 processed...
email #1573 processed...
email #1574 processed...
email #1575 processed...
email #1576 processed...
email #1577 processed...
email #1578 processed...
email #1579 processed...
email #1580 processed...
email #1581 processed...
email #1582 processed...
email #1583 processed...
email #1584 processed...
email #1585 processed...
email #1586 processed...
email #1587 processed...
email #1588 processed...
email #1589 processed...
email #1590 processed...
email #1591 processed...
email #1592 processed...
email #1593 processed...


In [45]:
# creating dataframe from preprocessed emails dataset

emails = parse_into_dataframe(data, labels)
print('Number of emails: ', emails.shape[0])
emails.head(10)

Number of emails:  3490


Unnamed: 0,body,is_spam
0,by phoboslabsnetnoteinccom postfix esmtp id ef...,1
1,by phoboslabsspamassassintaintorg postfix esmt...,1
2,dogmaslashnullorg esmtp id gdkce mandarklabsn...,1
3,by phoboslabsspamassassintaintorg postfix esmt...,1
4,by phoboslabsspamassassintaintorg postfix esmt...,1
5,dogmaslashnullorg esmtp id gfwie mandarklabsn...,1
6,dogmaslashnullorg esmtp id ggaqe mandarklabsn...,1
7,by phoboslabsspamassassintaintorg postfix esmt...,1
8,by phoboslabsspamassassintaintorg postfix esmt...,1
9,by phoboslabsnetnoteinccom postfix esmtp id c ...,1


In [41]:
# splitting data into training (80%) and testing sets (20%)

labels = emails.is_spam
x_train, x_test, y_train, y_test = train_test_split(emails['body'], labels, test_size=0.2, random_state=7)

print('x_train size: ', x_train.shape[0])
print('x_test size: ', x_test.shape[0])

x_train size:  2792
x_test size:  698


<h3>Feature Extraction</h3>

In [46]:
# Let’s initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 
# (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are 
# to be filtered out before processing the natural language data. A TfidfVectorizer turns a collection of raw documents 
# into a matrix of TF-IDF features.


tfidfvectorizer = TfidfVectorizer(stop_words='english', min_df=100, use_idf=True)
# fit the vectorizer on the train set, learning vocabulary and idf from the training set
tfidfvectorizer.fit(x_train)
# encode the training set and test set
tfidfvectorizer_vectors_xtrain = tfidfvectorizer.transform(x_train).toarray()
tfidfvectorizer_vectors_xtest = tfidfvectorizer.transform(x_test).toarray()

print(type(tfidfvectorizer_vectors_xtrain))
tfidfvectorizer_vectors_xtrain.shape     # bag of words

<class 'numpy.ndarray'>


(2792, 466)

In [49]:
# vocabulary and idf values learned by the vectorizer

vocabulary = tfidfvectorizer.vocabulary_
idf_values = tfidfvectorizer.idf_

print('vocabulary size (learned): ', len(vocabulary))
print(idf_values.shape)

vocabulary size (learned):  466
(466,)


In [50]:
# to get a glimpse of how the IDF values look, we are going to print it by placing the IDF values in a python DataFrame 
# the values will be sorted in descending order

df_idf = pd.DataFrame(tfidfvectorizer.idf_, index = tfidfvectorizer.get_feature_names(), columns=["idf_weights"])
# sort descending
df_idf.sort_values(by=['idf_weights'], ascending=False)

Unnamed: 0,idf_weights
answer,4.319751
protect,4.319751
join,4.319751
text,4.319751
machine,4.319751
...,...
fetchmail,1.361854
localhost,1.360827
postfix,1.292347
esmtp,1.047288


<h3>Training SVM for spam classification</h3>

In [58]:
# since the SVM fitting algorithm is very sensitive to feature scaling, let's just get that out of the way 
# right from the start

# scale the data for SVMs

scaler = StandardScaler()
x_train = scaler.fit(tfidfvectorizer_vectors_xtrain).transform(tfidfvectorizer_vectors_xtrain)
x_test = scaler.fit(tfidfvectorizer_vectors_xtest).transform(tfidfvectorizer_vectors_xtest)

# printing feature values for the first 20 words (in vocabulary) for the first document (i.e. first email message) in training set
print('Before standardizing: \n', tfidfvectorizer_vectors_xtrain[0, :20])
print('After standardizing: \n', x_train[0, :20])

Before standardizing: 
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
After standardizing: 
 [-0.23133116 -0.16731454 -0.2369112  -0.20335225 -0.18941798 -0.22664108
 -0.1817858  -0.20565159 -0.1663218  -0.35931181 -0.20974045 -0.1430817
 -0.18432748 -0.21762497 -0.18833467 -0.15185683 -0.16957759 -0.18866975
 -0.15479566 -0.16756686]


In [62]:
# fitting SVC (with Guassian kernel) to training data to learn complex non-linear decision boundary

clf = SVC(kernel='rbf', gamma='scale', random_state=0, C=1000, tol=1e-4)
clf.fit(x_train, y_train)

print('Training score: ', clf.score(x_train, y_train))    # returns the mean accuracy for classifying data (x_train,y_train)
print('Testing score: ', clf.score(x_test, y_test))

Training score:  1.0
Testing score:  0.9684813753581661


In [107]:
# fitting Linear SVC (with linear kernel) to training data

clf_linear = LinearSVC(random_state=0, C=100, tol=1e-3, max_iter=10000)
clf_linear.fit(x_train, y_train)

w = np.squeeze(clf_linear.coef_).reshape(466, 1)
b = clf_linear.intercept_.reshape(1, 1)

print('theta intercept: ', b,  'theta co-efficients: ', w)

theta intercept:  [[0.10221735]] theta co-efficients:  [[ 1.31633811e-02]
 [ 2.54209460e-01]
 [ 3.32939806e-01]
 [-2.58183320e-01]
 [ 3.88386761e-02]
 [-6.06194652e-03]
 [ 1.90483857e-01]
 [ 3.35761606e-01]
 [-1.91118963e-01]
 [-7.00485231e-02]
 [-1.00230310e-02]
 [ 2.82562191e-01]
 [-1.50860914e-01]
 [-1.09473756e-02]
 [ 9.23017305e-02]
 [-5.52611746e-02]
 [ 1.25693784e-01]
 [-2.08917038e-01]
 [ 3.29536905e-01]
 [-1.36406292e-01]
 [-2.67246173e-01]
 [-1.48021593e-01]
 [-2.90189977e-01]
 [-1.05449576e-01]
 [-5.05307916e-02]
 [-9.79490157e-02]
 [-7.85654124e-02]
 [ 2.61905305e-01]
 [-1.00767928e-01]
 [-1.73567341e-01]
 [-5.10590789e-01]
 [-3.28555695e-01]
 [ 1.29726409e-01]
 [-1.17002533e-01]
 [-2.25445051e-01]
 [-9.08935247e-02]
 [ 2.01934993e-01]
 [ 1.04412560e-01]
 [-6.17850231e-02]
 [ 1.22428475e-01]
 [-1.32913008e-01]
 [ 1.26155619e-01]
 [-1.68041064e-01]
 [ 7.56662928e-02]
 [-2.23221224e-01]
 [-2.88682918e-01]
 [ 9.68033539e-02]
 [-5.95353528e-01]
 [ 2.07682589e-01]
 [-3.30155253e

<h3>Top Predictors for Spam</h3>

In [108]:
# since the model we are training is a linear SVM, we can inspect the weights learned by the model to understand better 
# how it is determining whether an email is spam or not. The following code finds the words with the highest weights in 
# the classifier. Informally, the classifier 'thinks' that these words are the most likely indicators of spam.

# top 20 words
top20_parameters = np.sort(np.squeeze(w), axis=0)[-20:][::-1]
words = tfidfvectorizer.get_feature_names()
top20_words = []


for value in top20_parameters:
    itemindex = numpy.where(np.squeeze(w) == value)
    top20_words.append(words[itemindex[0][0]])
    
print('Top 20 words that are most likely indicative of a spam email are: \n', top10_words)

Top 20 words that are most likely indicative of a spam email are: 
 ['zzzzasonorg', 'financial', 'mandarklabsnetnoteinccom', 'offer', 'email', 'professional', 'price', 'important', 'phoboslabsspamassassintaintorg', 'marketing', 'spamassassinsightings', 'dogmaslashnullorg', 'easy', 'money', 'opportunity', 'free', 'guarantee', 'mailings', 'soon', 'instructions']
