## Spam Classification

The dataset included for this exercise is based on a subset of
[the SpamAssassin Public Corpus](http://spamassassin.apache.org/old/publiccorpus/).

In [48]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import re
import nltk, nltk.stem.porter

### Preprocessing Emails

In [5]:
print(os.listdir('../Exercise 6'))

['Spam Classification.ipynb', 'vocab.txt', 'ex6data1.mat', 'ex6.pdf', 'Ex 6 Support Vector Machines.ipynb', 'spamSample2.txt', 'spamSample1.txt', 'spamTest.mat', 'emailSample2.txt', 'emailSample1.txt', 'ex6data3.mat', '.ipynb_checkpoints', 'spamTrain.mat', 'ex6data2.mat']


In [72]:
def getVocabList():
    vocabList = {}
    with open('../Exercise 6/vocab.txt') as file:
        for line in file:
            (index, word) = line.split()
            vocabList[word] = int(index)
    return vocabList
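
To make the shape of the mapping built by `getVocabList` concrete, here is a minimal sketch that parses a few lines in the same `index word` layout from an in-memory string and then looks up some tokens. The three sample entries and the helper name `parse_vocab` are hypothetical; the real `vocab.txt` is assumed to hold one entry per line with 1-based indices.

```python
import io

def parse_vocab(lines):
    """Build a word -> index dict from lines of the form '<index> <word>'."""
    vocab = {}
    for line in lines:
        index, word = line.split()
        vocab[word] = int(index)
    return vocab

# Hypothetical entries, not taken from the real vocab.txt
sample = io.StringIO("1 alpha\n2 beta\n3 gamma\n")
vocab = parse_vocab(sample)

tokens = ["beta", "delta", "alpha"]
# Tokens missing from the vocabulary are skipped
indices = [vocab[t] for t in tokens if t in vocab]
print(indices)
```

Keying the dictionary by word rather than by index keeps the token lookup a single constant-time operation.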

In [73]:
def processEmail(file_content):

    # load vocabList
    vocabList = getVocabList()
    
    # lower casing
    file_content = file_content.lower()
    
    # Stripping Html
    file_content = re.compile('<[^<>]+>').sub(' ', file_content)
    
    # Normalizing URLs: all URLs are replaced with the text “httpaddr”.
    file_content = re.compile(r'(http|https)://\S*').sub('httpaddr', file_content)
    
    # Normalizing email addresses: any token containing "@" is replaced with the text “emailaddr”.
    file_content = re.compile(r'\S+@\S+').sub('emailaddr', file_content)
    
    # Normalizing Numbers: “number”
    file_content = re.compile('[0-9]+').sub('number', file_content)
    
    # Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
    file_content = re.compile('[$]+').sub('dollar', file_content)
    
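    # Split into tokens on whitespace and punctuation. Hyphens and apostrophes
    # are not delimiters here; the loop below strips them from inside tokens.
    file_content = re.split(r'[ @$/#.:&*+=\[\]?!(){},">_<;%\n\r]', file_content)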
    
    # removal of non-words
    file_content = [word for word in file_content if len(word) > 1]
    
    tokenlist = []
    
    stemmer = nltk.stem.porter.PorterStemmer()
    
    for token in file_content:
        # Drop any remaining non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Throw out empty tokens
        if not token:
            continue
        # Stem the token and store it (duplicates are kept)
        tokenlist.append(stemmer.stem(token))
    
    # Keep only the tokens that appear in the vocabulary
    word_indices = [vocabList[token] for token in tokenlist if token in vocabList]
        
    return word_indices
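
The normalization steps in `processEmail` can be tried in isolation. The sketch below applies only the lower-casing and regex substitutions (no tokenizing or stemming) to a made-up sentence. The helper name `normalize` and the sample text are hypothetical, and the email pattern `\S+@\S+` is an assumption chosen for this illustration.

```python
import re

def normalize(text):
    """Apply the lower-casing and regex substitutions only (no stemming)."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                    # strip HTML tags
    text = re.sub(r'(http|https)://\S*', 'httpaddr', text)   # URLs
    text = re.sub(r'\S+@\S+', 'emailaddr', text)             # email addresses (assumed pattern)
    text = re.sub(r'[0-9]+', 'number', text)                 # numbers
    text = re.sub(r'[$]+', 'dollar', text)                   # dollar signs
    return text

# Made-up sample text for illustration
sample = 'Visit <b>http://example.com/</b> or mail bob@example.com, only $25!'
result = normalize(sample)
print(result.split())
```

The order matters: URLs and email addresses are replaced before numbers so that digits inside an address do not turn into `number` first.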

In [74]:
with open('../Exercise 6/emailSample1.txt', 'r') as emailSample1:
    file_content = emailSample1.read()
print(file_content)

email1 = processEmail(file_content)
print(email1)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com


[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1895, 592, 1676, 238, 162, 89, 688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676, 992, 961, 1477, 71, 530, 1699, 531]


In [75]:
with open('../Exercise 6/emailSample2.txt', 'r') as emailSample2:
    file_content = emailSample2.read()
print(file_content)

email2 = processEmail(file_content)
print(email2)

Folks,
 
my first time posting - have a bit of Unix experience, but am new to Linux.

 
Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
fine except it didn't pick up my monitor.
 
I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
Ti4200 video card, both of which are probably too new to feature in Suse's default
set. I downloaded a driver from the nVidia website and installed it using RPM.
Then I ran Sax2 (as was recommended in some postings I found on the net), but
it still doesn't feature my video card in the available list. What next?
 
Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
the whole machine crashes (in Linux, not Windows) - even the on/off switch is
inactive, leaving me to reach for the power cable instead.
 
If anyone can help me in any way with these probs., I'd be really grateful -
I've searched the 'ne

In [76]:
with open('../Exercise 6/spamSample1.txt', 'r') as spamSample1:
    file_content = spamSample1.read()
print(file_content)

spam1 = processEmail(file_content)
print(spam1)

Do You Want To Make $1000 Or More Per Week?

 

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

 

Call our 24 hour pre-recorded number to get the 
details.  

 

000-456-789

 

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

 

000-456-789

 

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

 

000-456-789



3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72


[471, 1893, 1809, 1699, 997, 479, 1182, 1064, 1231, 1827, 810, 1893, 1070, 74, 1346, 837, 1852, 1242, 1699, 1893, 1631, 1665, 1852, 997, 1893, 479, 1120, 1231, 1827, 1182, 1064, 1676, 877, 1113, 234, 1191, 1120, 792, 1120, 1699, 708, 1666, 440, 1093, 1230, 1844, 1809, 1699, 997, 1490, 997, 1666, 234, 74, 708, 1666, 608, 869, 1120, 1048, 825, 

In [77]:
with open('../Exercise 6/spamSample2.txt', 'r') as spamSample2:
    file_content = spamSample2.read()
print(file_content)

spam2 = processEmail(file_content)
print(spam2)

Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru



[176, 707, 1174, 1120, 479, 681, 460, 1711, 1475, 1120, 1347, 739, 1819, 10, 1795, 1012, 1227, 1120, 388, 799]


### Feature Extraction
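Convert the word indices into a binary feature vector: each entry is 1 if the corresponding vocabulary word appears in the email and 0 otherwise.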

In [82]:
def emailFeatures(word_indices):
    
    n = 1899 # number of words in vocabList
    
    email_vector = np.zeros((n,1))
    
    # vocab.txt indices are 1-based, so shift by one for 0-based NumPy indexing
    for i in word_indices:
        email_vector[i - 1,0] = 1
    
    return email_vector

In [87]:
email1_features = emailFeatures(email1)

print(f'Length of feature vector : {len(email1_features)}')
print(f'Number of non zero entries : {np.sum(email1_features == 1)}')

Length of feature vector : 1899
Number of non zero entries : 44
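
Repeated words set the same entry only once, which is why the 52 indices printed for the first sample email produce only 44 non-zero entries. A minimal vectorized sketch of the same idea on a toy vocabulary of six words follows; the helper name `indices_to_features` is hypothetical, and the indices are assumed to be 1-based as in `vocab.txt`.

```python
import numpy as np

def indices_to_features(word_indices, vocab_size):
    """Return a (vocab_size, 1) binary vector; word_indices are 1-based."""
    x = np.zeros((vocab_size, 1))
    # Shift to 0-based positions; a repeated index sets the same entry again
    x[np.asarray(word_indices, dtype=int) - 1, 0] = 1
    return x

# Toy input: index 2 appears twice but contributes a single non-zero entry
x = indices_to_features([2, 5, 2, 1], vocab_size=6)
print(x.ravel())
print(int(x.sum()))
```

Passing the vocabulary size as a parameter avoids hard-coding 1899 and makes the function easy to check on small inputs.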


### Train Linear SVM for Spam Classification