### Spam Classification with SVM_Sklearn



Many email services today provide spam filters that classify emails
into spam and non-spam with high accuracy. This notebook uses SVMs to build such a spam filter:
we train a classifier to decide whether a given email, x, is
spam (y = 1) or non-spam (y = 0). To do so, each
email is first converted into a feature vector.

<img src="./images/mail_spam.jpeg" width="200" />

The dataset included for this exercise is based on a subset of
the SpamAssassin Public Corpus. Only the body of each email is used here.

In [11]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.io import loadmat
import warnings; warnings.simplefilter('ignore')
import re
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, f1_score

#### Emails Preprocessing

In [136]:
email_contents = open('./data/spamSample1.txt').read()
print('Email contents before preprocessing:\n')
print(re.sub(r'\n', ' ', email_contents))

Email contents before preprocessing:

Do You Want To Make $1000 Or More Per Week?     If you are a motivated and qualified individual - I  will personally demonstrate to you a system that will  make you $1,000 per week or more! This is NOT mlm.     Call our 24 hour pre-recorded number to get the  details.       000-456-789     I need people who want to make serious money.  Make  the call and get the facts.   Invest 2 minutes in yourself now!     000-456-789     Looking forward to your call and I will introduce you  to people like yourself who are currently making $10,000 plus per week!     000-456-789    3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72  


In [15]:
def getVocabList():
    with open('./data/vocab.txt', 'r') as f:
        s = f.readlines()
        vocabList = [i.split()[1] for i in s]
    return vocabList
    
def processEmail(email_contents): 
    
    vocabList = getVocabList()
    
    # ========================== Preprocess Email ===========================
    # Lower case
    email_contents = email_contents.lower()
    # Strip all HTML
    # Look for any expression that starts with <, ends with >, and has no
    # < or > inside the tag, and replace it with a space
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Handle Numbers
    # Look for one or more characters between 0-9
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # Handle URLS
    # Look for strings starting with http:// or https://
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Handle Email Addresses
    # Look for strings with @ in the middle
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # Handle $ sign
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)
    
    
    # ========================== Tokenize Email ===============================
    if email_contents.split():
        # Get rid of punctuation: replace every run of non-alphanumeric
        # characters with a single space
        email_contents = re.sub(r'[^a-zA-Z0-9]+', ' ', email_contents)

    # Look up each token in the vocabulary and keep the 0-based index
    # of every token that is found
    word_indices = []
    for i in email_contents.split():
        if i in vocabList:
            word_indices.append(vocabList.index(i))
    return email_contents, word_indices
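
The normalization steps in `processEmail` can be tried in isolation on a short string. The sketch below is a minimal, self-contained illustration, not the notebook's own cell: `normalize`, `toy_vocab` and `word_to_index` are hypothetical names, and the five-word vocabulary stands in for the real `./data/vocab.txt`. It also uses a dictionary for the index lookup, which avoids the linear scan of `list.index()`.

```python
import re

def normalize(text):
    """Apply the same normalization steps as processEmail and return tokens."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                   # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # addresses -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                     # '$' -> 'dollar'
    text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)                 # punctuation -> space
    return text.split()

# Hypothetical toy vocabulary; the real one is read from ./data/vocab.txt.
toy_vocab = ['dollarnumber', 'emailaddr', 'httpaddr', 'make', 'number']
# A dict gives constant-time lookups instead of scanning the list each time.
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

tokens = normalize('Make $1000 now! Visit <b>http://example.com</b> or mail bob@example.com')
indices = [word_to_index[t] for t in tokens if t in word_to_index]
print(tokens)
print(indices)
```

Note that `$1000` becomes the single token `dollarnumber`, because the `$` replacement inserts no space before `number`; the same effect is visible in the preprocessed email below.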

In [138]:
email_contents, word_indices =  processEmail(email_contents)
print('Email contents after preprocessing:\n',email_contents)
print('Word Indices:\n', word_indices)

Email contents after preprocessing:
 do you want to make dollarnumber or more per week if you are a motivated and qualified individual i will personally demonstrate to you a system that will make you dollarnumber number per week or more this is not mlm call our number hour pre recorded number to get the details number number number i need people who want to make serious money make the call and get the facts invest number minutes in yourself now number number number looking forward to your call and i will introduce you to people like yourself who are currently making dollarnumber number plus per week number number number numberljgvnumber numberleannumberlrmsnumber numberwxhonumberqiytnumber numberrjuvnumberhqcfnumber numbereidbnumberdmtvlnumber 
Word Indices:
 [470, 1892, 1808, 1698, 996, 1181, 1063, 1230, 1826, 809, 1892, 73, 1851, 1698, 1892, 1630, 1664, 1851, 996, 1892, 1119, 1230, 1826, 1181, 1063, 876, 1112, 233, 1190, 1119, 791, 1286, 1119, 1698, 707, 1665, 1119, 1119, 1119, 1092, 

#### Feature Extraction

In [17]:
def emailFeatures(word_indices):
    n = 1899
    x = np.zeros((n,1))
    x[word_indices] = 1
    return x
features = emailFeatures(word_indices)
print('Length of feature vector:', features.shape[0])
print('Number of non-zero entries:', features[features==1].shape[0])  

Length of feature vector: 1899
Number of non-zero entries: 11
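
To make the encoding concrete, here is a minimal sketch of the same binary bag-of-words idea on a toy vocabulary. The helper `binary_features` and the six-entry vocabulary size are assumptions for illustration only; the notebook itself uses `emailFeatures` with n = 1899.

```python
def binary_features(word_indices, vocab_size):
    """Return a list of 0/1 flags, one per vocabulary entry."""
    x = [0] * vocab_size
    for idx in word_indices:
        x[idx] = 1  # a word that occurs several times still sets a single 1
    return x

# Hypothetical example: a 6-word vocabulary and an email whose tokens
# map to indices 1, 4 and 1 again.
x = binary_features([1, 4, 1], vocab_size=6)
print(x)       # [0, 1, 0, 0, 1, 0]
print(sum(x))  # 2
```

The vector records only whether a vocabulary word is present, not how often it occurs, so the number of non-zero entries equals the number of distinct vocabulary words in the email.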


#### Spam Classification with SVM Model

In [7]:
# Data in spamTrain.mat and spamTest.mat have already been preprocessed with the method above.
data = loadmat('./data/spamTrain.mat')
# ravel() turns the (m, 1) label columns into the 1-D arrays scikit-learn expects
Xtrain = data['X']; ytrain = data['y'].ravel()
data = loadmat('./data/spamTest.mat')
Xtest = data['Xtest']; ytest = data['ytest'].ravel()

In [13]:
# Baseline with C=1.0: 5-fold cross-validated recall on the training set,
# then fit on the full training set and compute the f1-score on the test set
model = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0,
            shrinking=True, probability=True, tol=0.001, cache_size=200,
            class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr',
            random_state=None)
Result_crossval = cross_val_score(model, Xtrain, ytrain, scoring='recall_macro',cv=5)
model = model.fit(Xtrain,ytrain)
ypred = model.predict(Xtest)
score = f1_score(ytest, ypred, pos_label=1, average='binary')
print('Cross-validation recall score is:',np.mean(Result_crossval))
print('Test f1-score with this model:',score)

Cross-validation recall score is: 0.9072190929123943
Test f1-score with this model: 0.919104991394148


In [150]:
#Test on spamSample1.txt 
email_contents = open('./data/spamSample1.txt').read()
email_contents, word_indices =  processEmail(email_contents)
features = emailFeatures(word_indices)
predict = model.predict(features.reshape(1,1899))
print('SVM model predicts that this mail', 'is Spam!' if predict[0] == 1 else 'is not Spam.')
print('Predicted spam probability is:',(model.predict_proba(features.reshape(1,1899)))[0,1] )

SVM model predicts that this mail is Spam!
Predicted spam probability is: 0.8978750274709653


In [152]:
#Test on spamSample2.txt
email_contents = open('./data/spamSample2.txt').read()
print('Email contents before preprocessing:\n')
print(re.sub(r'\n', ' ', email_contents))

Email contents before preprocessing:

Best Buy Viagra Generic Online  Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!  We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers! http://medphysitcstech.ru   


In [155]:
email_contents, word_indices =  processEmail(email_contents)
features = emailFeatures(word_indices)
predict = model.predict(features.reshape(1,1899))
print('SVM model predicts that this mail', 'is Spam!' if predict[0] == 1 else 'is not Spam.')
print('Predicted spam probability is:',(model.predict_proba(features.reshape(1,1899)))[0,1] )

SVM model predicts that this mail is not Spam.
Predicted spam probability is: 0.2965737692099745


The wrong prediction for the second email is probably due to the small dictionary size:
even words like 'buy' are missing from it.
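
One way to probe this explanation is to measure how many of an email's tokens the vocabulary actually covers. The sketch below is a rough illustration under stated assumptions: `vocab_coverage` is a hypothetical helper, its tokenization is simpler than `processEmail`, and `toy_vocab` is an invented four-word set rather than the contents of `./data/vocab.txt`.

```python
import re

def vocab_coverage(text, vocab):
    """Return (fraction of tokens found in vocab, list of missing tokens)."""
    # Simplified tokenization: lower-case, then split on non-alphanumerics.
    tokens = re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()
    missing = [t for t in tokens if t not in vocab]
    if not tokens:
        return 0.0, missing
    return (len(tokens) - len(missing)) / len(tokens), missing

# Hypothetical vocabulary, for illustration only.
toy_vocab = {'free', 'online', 'quality', 'top'}
coverage, missing = vocab_coverage('Best Buy Viagra Generic Online', toy_vocab)
print(coverage)
print(missing)
```

Running the same check with the real vocabulary list on `spamSample2.txt` would show which tokens never reach the feature vector; a low coverage would support the dictionary-size explanation.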

#### Hyperparameter Tuning using Grid Search with Cross-Validation

In [184]:
tuned_parameters = [{'kernel': ['rbf', 'linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=5, scoring='recall_macro')
clf.fit(Xtrain, ytrain)

print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print()
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = ytest, clf.predict(Xtest)
print(classification_report(y_true, y_pred))


Best parameters set found on development set:

{'C': 100, 'kernel': 'rbf'}

Grid scores on development set:

0.907 (+/-0.015) for {'C': 1, 'kernel': 'rbf'}
0.966 (+/-0.015) for {'C': 1, 'kernel': 'linear'}
0.963 (+/-0.016) for {'C': 10, 'kernel': 'rbf'}
0.963 (+/-0.013) for {'C': 10, 'kernel': 'linear'}
0.972 (+/-0.015) for {'C': 100, 'kernel': 'rbf'}
0.957 (+/-0.012) for {'C': 100, 'kernel': 'linear'}
0.967 (+/-0.014) for {'C': 1000, 'kernel': 'rbf'}
0.957 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      0.99      0.99       692
          1       0.97      0.99      0.98       308

avg / total       0.99      0.99      0.99      1000




Compared to the spam-class f1-score of **0.92** obtained with C=1, the model selected by the grid search reaches a spam-class f1-score of **0.98** on the test set (**0.99** averaged over both classes), a real improvement!
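
Because accuracy and f1-score are easy to conflate, the following minimal sketch computes both by hand on a small invented label set. `binary_metrics` is a hypothetical helper written for this illustration, and the labels are not taken from the spam data.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Invented labels: 3 spam emails (1) and 7 non-spam emails (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here 8 of 10 predictions are correct, yet precision and recall for the spam class are both 2/3, so the f1-score is lower than the accuracy. On an imbalanced test set such as this one (692 non-spam versus 308 spam emails), the per-class f1-score is therefore the more informative number to compare.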